On most UNIX systems, root has omnipotent power. This promotes insecurity.
If an attacker were to gain root on a system, he would have every function
at his fingertips. FreeBSD has sysctls which dilute the power of
root in order to minimize the damage an attacker can cause. One of
these mechanisms is called secure levels. Another mechanism, present
from FreeBSD 4.0 onward, is a utility called jail. Jail
chroots an environment and sets certain restrictions on processes which are
forked from within. For example, a jailed process cannot affect processes
outside of the jail, utilize certain system calls, or inflict any damage
on the host system. Jail is becoming the new security model. People are
running potentially vulnerable servers such as Apache, BIND, and sendmail
within jails, so that an attacker who gains root within the jail is only
an annoyance rather than a disaster. This article focuses on the internals
(source code) of Jail and Jail NG. It will also suggest improvements upon
the jail code base which are already being worked on. If you are looking
for a how-to on setting up a Jail, I suggest you look at my other article
in Sys Admin Magazine, May 2001, entitled "Securing FreeBSD using Jail."
1.1 Architecture
Jail consists of two realms: the user-space program, jail, and the
code implemented within the kernel: the jail() system call and associated
restrictions. I will be discussing the user-space program and then how jail
is implemented within the kernel.
2 User-land program
The source for the user-land jail is located in /usr/src/usr.sbin/jail,
consisting of one file, jail.c. The program takes these arguments:
the path of the jail, hostname, IP address, and the command to be executed.
2.1 Jail structure
In jail.c, the first thing I would note is the declaration of an
important structure, struct jail j;, which was included from
/usr/include/sys/jail.h. The definition of the jail structure is:

/usr/include/sys/jail.h:

        struct jail {
                u_int32_t       version;
                char            *path;
                char            *hostname;
                u_int32_t       ip_number;
        };

As you can see, there is an entry for each of the arguments passed to the
jail program, and indeed, they are set during its execution.
One of the arguments passed to the jail program is an IP address with which the
jail can be accessed over the network. Jail parses the IP address given,
converts it into host byte order, and then stores it in j (the jail structure).
The inet_aton() function "interprets the specified character string
as an Internet address, placing the address into the structure provided."
inet_aton() leaves the address in network byte order in the in structure.
The ip_number member of the jail structure is set only after that value
has been converted to host byte order by ntohl().
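The conversion described above can be sketched in portable user-space C. This is only an illustration: struct jail_args and fill_jail_args() are hypothetical names that mirror the four members of the jail structure, and setting version to 0 is an assumption rather than something stated in this article.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the four-member jail structure. */
struct jail_args {
    uint32_t  version;
    char     *path;
    char     *hostname;
    uint32_t  ip_number;    /* stored in host byte order */
};

/*
 * Fill *j from a path, a hostname and a dotted-quad address string.
 * Returns 0 on success, -1 if the address cannot be parsed.
 */
static int
fill_jail_args(struct jail_args *j, char *path, char *hostname,
    const char *addr)
{
    struct in_addr in;

    /* inet_aton() returns 0 on failure and stores network byte order. */
    if (inet_aton(addr, &in) == 0)
        return (-1);
    j->version = 0;                     /* assumed API version */
    j->path = path;
    j->hostname = hostname;
    j->ip_number = ntohl(in.s_addr);    /* network -> host byte order */
    return (0);
}
```

With this sketch, the address 192.168.0.1 is stored as the host-order value 0xC0A80001 regardless of the machine's endianness.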
2.3 Jailing the Process
Finally, the userland program jails the process and executes the command
specified. Jail first becomes an imprisoned process itself by calling
jail(), and then replaces itself with the given command using execv().

/usr/src/usr.sbin/jail/jail.c:

        i = jail(&j);
        ...
        i = execv(argv[4], argv + 4);

As you can see, the jail function is being called, and its argument is the
jail structure which has been filled with the arguments given to the program.
Finally, the program you specify is executed. I will now discuss how Jail
is implemented within the kernel.
3 Kernel Space
3.1 kern_jail.c
We will now be looking at the file /usr/src/sys/kern/kern_jail.c.
This is the file where the jail system call, appropriate sysctls, and networking
functions are defined.
3.2 sysctls
In kern_jail.c, the following sysctls are defined:

/usr/src/sys/kern/kern_jail.c:

        int     jail_set_hostname_allowed = 1;
        SYSCTL_INT(_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW,
            &jail_set_hostname_allowed, 0,
            "Processes in jail can set their hostnames");

        int     jail_socket_unixiproute_only = 1;
        SYSCTL_INT(_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW,
            &jail_socket_unixiproute_only, 0,
            "Processes in jail are limited to creating UNIX/IPv4/route sockets only");

        int     jail_sysvipc_allowed = 0;
        SYSCTL_INT(_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW,
            &jail_sysvipc_allowed, 0,
            "Processes in jail can use System V IPC primitives");

Each of these sysctls can be accessed by the user through the sysctl program.
Throughout the kernel, these specific sysctls are recognized by their name.
For example, the name of the first sysctl is jail.set_hostname_allowed.
3.3 jail() system call
Like all system calls, the jail system call takes two arguments, p
and uap. p is a pointer to a proc structure which describes
the calling process. In this context, uap is a pointer to a structure
which specifies the arguments given to jail() from the userland
program jail.c. When I described the userland program before, you
saw that the jail system call was given a jail structure as its own argument.
Therefore, uap->jail would access the jail structure which was passed
to the system call. Next, the system call copies the jail structure into
kernel space using the copyin() function. copyin() takes
three arguments: the data which is to be copied in (uap->jail), where
to store it (j), and the size of the storage. The jail structure
uap->jail is copied into kernel space and stored in another jail structure,
j.

/usr/src/sys/kern/kern_jail.c:

        error = copyin(uap->jail, &j, sizeof j);

There is another important structure defined in jail.h.
It is the prison structure (pr). The prison structure is used exclusively
within kernel space. The jail() system call copies everything from
the jail structure onto the prison structure.
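To make the copy step concrete, here is a minimal user-space sketch. It is not the kernel's definition: struct prison_sketch and prison_from_jail() are hypothetical, only the member names pr_ref and pr_ip appear elsewhere in this article, and the pr_host member, its buffer size, and the initial reference count of 1 are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define HOSTNAME_MAX_LEN 256    /* assumed buffer size for this sketch */

/* Hypothetical mirror of the user-supplied jail structure. */
struct jail_args {
    uint32_t  version;
    char     *path;
    char     *hostname;
    uint32_t  ip_number;
};

/* Hypothetical stand-in for the kernel-side prison structure. */
struct prison_sketch {
    int       pr_ref;                       /* processes using this prison */
    char      pr_host[HOSTNAME_MAX_LEN];    /* jail hostname (assumed member) */
    uint32_t  pr_ip;                        /* jail IP address */
};

/* Copy the fields of the jail structure into the prison structure. */
static void
prison_from_jail(struct prison_sketch *pr, const struct jail_args *j)
{
    /* snprintf() truncates safely and always NUL-terminates. */
    snprintf(pr->pr_host, sizeof(pr->pr_host), "%s", j->hostname);
    pr->pr_ip = j->ip_number;
    pr->pr_ref = 1;     /* assumed: the caller holds the first reference */
}
```

The hostname is copied by value rather than by pointer because the original string lives in user space, while the prison must remain valid for as long as any jailed process refers to it.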
Finally, the jail system call chroots the path specified. The
chroot function is given two arguments. The first is p, which
represents the calling process; the second is a pointer to the structure
chroot_args. The structure chroot_args contains the path
which is to be chrooted. The path specified in the jail structure
is copied to the chroot_args structure and used.
These next three lines in the source are very important, as they specify
how the kernel recognizes a process as jailed. Each process on a Unix system
is described by its own proc structure. You can see the whole proc structure
in /usr/include/sys/proc.h. For example, the p argument
in any system call is actually a pointer to that process's proc structure,
as stated before. The proc structure contains members which describe the
owner's identity (p_cred), the process resource limits (p_limit),
and so on. In the definition of the proc structure, there is a pointer
to a prison structure (p_prison).
In kern_jail.c, the function then assigns the pr structure,
which is filled with all the information from the original jail structure,
to p->p_prison. It then does a bitwise OR of
p->p_flag with P_JAILED, meaning that the calling process is
now recognized as jailed. Every process inside the jail descends from
the program jail itself, as it is the one which calls the jail() system
call. When the program is executed through execve, it retains the
jailed state recorded in the proc structure: P_JAILED is set in
p->p_flag, and p->p_prison is filled in.

/usr/src/sys/kern/kern_jail.c:

        p->p_prison = pr;
        p->p_flag |= P_JAILED;

When a process is forked from a parent process, the fork() system
call deals differently with imprisoned processes. In the fork system call,
there are two pointers to a proc structure, p1 and p2.
p1 points to the parent's proc structure and p2 points to the
child's unfilled proc structure. After copying all relevant data between
the structures, fork() checks whether p2->p_prison
is set. If it is, it increments pr_ref by
one and sets the P_JAILED flag in p_flag on the child process.

/usr/src/sys/kern/kern_fork.c:

        if (p2->p_prison) {
                p2->p_prison->pr_ref++;
                p2->p_flag |= P_JAILED;
        }

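The reference counting performed at fork time can be tried out in user space with a small model. The types proc_sketch and prison_sketch, the function fork_inherit_prison(), and the numeric value of P_JAILED_SKETCH are all hypothetical stand-ins, not the kernel's own definitions.

```c
#include <stddef.h>

#define P_JAILED_SKETCH 0x1000000   /* assumed flag value, illustration only */

struct prison_sketch {
    int pr_ref;             /* number of processes referencing this prison */
};

struct proc_sketch {
    int                   p_flag;
    struct prison_sketch *p_prison;     /* NULL when not jailed */
};

/* Model the jail-related part of fork(): p1 is the parent, p2 the child. */
static void
fork_inherit_prison(const struct proc_sketch *p1, struct proc_sketch *p2)
{
    p2->p_flag = 0;
    p2->p_prison = p1->p_prison;    /* copied along with the other data */
    if (p2->p_prison) {
        p2->p_prison->pr_ref++;     /* one more process in this jail */
        p2->p_flag |= P_JAILED_SKETCH;
    }
}
```

Counting references this way lets the kernel know when the last process of a jail has exited, which is the point at which the prison structure can be released.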
4 Restrictions
Throughout the kernel there are access restrictions relating to jailed processes.
Usually, these restrictions simply check whether the process is jailed and, if so,
return an error.
System V IPC is based on messages. Processes can send each other these
messages which tell them how to act. The functions which deal with messages
are: msgsys, msgctl, msgget, msgsnd and msgrcv.
Earlier, I mentioned that there were certain sysctls you could turn on or
off in order to affect the behavior of Jail. One of these sysctls was
jail.sysvipc_allowed. On most systems, this sysctl is set to 0. If it
were set to 1, it would defeat the whole purpose of having a jail; privileged
users from within the jail would be able to affect processes outside of the
environment. The difference between a message and a signal is that a signal
consists only of its signal number, whereas a message carries a type and data.

/usr/src/sys/kern/sysv_msg.c:

* msgget(): msgget returns (and possibly creates) a message descriptor
  that designates a message queue for use in other system calls.
* msgctl(): Using this function, a process can query the status of a
  message descriptor.
* msgsnd(): msgsnd sends a message to a process.
* msgrcv(): a process receives messages using this function.

In each of these system calls, there is this conditional:

/usr/src/sys/kern/sysv_msg.c:

        if (!jail_sysvipc_allowed && p->p_prison != NULL)
                return (ENOSYS);

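The effect of this conditional can be observed with a small user-space model. The names proc_sketch, prison_sketch and sysvipc_check() are hypothetical; only the condition itself and the ENOSYS return value are taken from the excerpt above.

```c
#include <errno.h>
#include <stddef.h>

struct prison_sketch { int pr_ref; };
struct proc_sketch { struct prison_sketch *p_prison; };  /* NULL if unjailed */

/* Models the jail.sysvipc_allowed sysctl; 0 is the default described above. */
static int jail_sysvipc_allowed = 0;

/* Returns ENOSYS when a jailed process may not use System V IPC, else 0. */
static int
sysvipc_check(const struct proc_sketch *p)
{
    if (!jail_sysvipc_allowed && p->p_prison != NULL)
        return (ENOSYS);
    return (0);
}
```

Returning ENOSYS makes the call look as if it were not implemented at all, so a jailed program sees the same failure it would see on a kernel built without System V IPC.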
Semaphore system calls allow processes to synchronize execution by doing
a set of operations atomically on a set of semaphores. Basically, semaphores
provide another way for processes to lock resources. However, a process waiting
on a semaphore that is in use will sleep until the resources are relinquished.
The following semaphore system calls are blocked inside a jail: semsys,
semget, semctl and semop.

/usr/src/sys/kern/sysv_sem.c:

* semctl(id, num, cmd, arg): semctl does the specified cmd on
  the semaphore queue indicated by id.
* semget(key, nsems, flag): semget creates an array of semaphores,
  corresponding to key. Key and flag take on the same meaning as they
  do in msgget.
* semop(id, ops, num): semop does the set of semaphore operations
  in the array of structures ops, to the set of semaphores identified by id.

System V IPC allows for processes to share memory. Processes can communicate
directly with each other by sharing parts of their virtual address space
and then reading and writing data stored in the shared memory. These system
calls are blocked within a jailed environment: shmdt, shmat, oshmctl,
shmctl, shmget, and shmsys.

/usr/src/sys/kern/sysv_shm.c:

* shmctl(id, cmd, buf): shmctl does various control operations
  on the shared memory region identified by id.
* shmget(key, size, flag): shmget accesses or creates
  a shared memory region of size bytes.
* shmat(id, addr, flag): shmat attaches a shared memory
  region identified by id to the address space of a process.
* shmdt(addr): shmdt detaches the shared memory region
  previously attached at addr.

4.2 socket()
Jail treats the socket() system call and related lower-level socket
functions in a special manner. In order to determine whether a certain socket
is allowed to be created, it first checks to see if the sysctl
jail.socket_unixiproute_only is set. If set, sockets are only allowed to be
created if the family specified is either PF_LOCAL, PF_INET or PF_ROUTE.
Otherwise, it returns an error.
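The family check just described might look roughly like the following user-space sketch. The function jail_socket_check() and the type proc_sketch are hypothetical, and the choice of EPROTONOSUPPORT as the error is an assumption, since the text only says that an error is returned.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

struct proc_sketch { void *p_prison; };     /* NULL when not jailed */

/* Models the jail.socket_unixiproute_only sysctl (default 1). */
static int jail_socket_unixiproute_only = 1;

/* Returns 0 if the process may create a socket of this family. */
static int
jail_socket_check(const struct proc_sketch *p, int family)
{
    if (p->p_prison != NULL && jail_socket_unixiproute_only &&
        family != PF_LOCAL && family != PF_INET && family != PF_ROUTE)
        return (EPROTONOSUPPORT);   /* assumed error code */
    return (0);
}
```

Under these assumptions a jailed process can still open PF_INET and PF_LOCAL sockets, while a request for, say, PF_INET6 is refused; an unjailed process is never affected.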
The Berkeley Packet Filter provides a raw interface to data link layers in
a protocol independent fashion. The function bpfopen() opens an
Ethernet device. There is a conditional which disallows any jailed processes
from accessing this function.

/usr/src/sys/net/bpf.c:

        static int
        bpfopen(dev, flags, fmt, p)
                ...
        {
                if (p->p_prison)
                        return (EPERM);
                ...
        }

4.4 Protocols
There are certain protocols which are very common, such as TCP, UDP, IP and
ICMP. IP and ICMP are on the same level: the network layer.
Certain precautions are taken in order to prevent a jailed
process from binding a protocol to a certain port; they apply only if the nam
parameter is set. nam is a pointer to a sockaddr structure, which
describes the address on which to bind the service. A more exact definition
is that sockaddr "may be used as a template for referring to
the identifying tag and length of each address" [2].
In the function in_pcbbind, sin is a pointer to a
sockaddr_in structure, which contains the port, address, length and
domain family of the socket which is to be bound. Basically, this prevents
any process inside a jail from specifying the domain family.

/usr/src/sys/netinet/in_pcb.c:

        int
        in_pcbbind(inp, nam, p)
                ...
                struct sockaddr *nam;
                struct proc *p;
        {
                ...
                struct sockaddr_in *sin;
                ...
                if (nam) {
                        sin = (struct sockaddr_in *)nam;
                        ...
                        if (sin->sin_addr.s_addr != INADDR_ANY)
                                if (prison_ip(p, 0, &sin->sin_addr.s_addr))
                                        return (EINVAL);
                        ...
                }
        }

You might be wondering what the function prison_ip() does. prison_ip()
is given three arguments: the current process (represented by p),
any flags, and a pointer to an IP address. It returns 1 if the process is
jailed and the IP address does not belong to its jail, and 0 otherwise. If
the address is INADDR_ANY, it is replaced with the jail's own address. As
you can see from the code, if the address does not belong to the jail, the
protocol is not allowed to bind to it.

/usr/src/sys/kern/kern_jail.c:

        int
        prison_ip(struct proc *p, int flag, u_int32_t *ip)
        {
                u_int32_t tmp;

                if (!p->p_prison)
                        return (0);
                if (flag)
                        tmp = *ip;
                else
                        tmp = ntohl(*ip);
                if (tmp == INADDR_ANY) {
                        if (flag)
                                *ip = p->p_prison->pr_ip;
                        else
                                *ip = htonl(p->p_prison->pr_ip);
                        return (0);
                }
                if (p->p_prison->pr_ip != tmp)
                        return (1);
                return (0);
        }

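To see the return values in action, the same logic can be exercised in user space. The types proc_sketch and prison_sketch and the name prison_ip_sketch() are hypothetical stand-ins for the kernel structures; the control flow follows the prison_ip() excerpt.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

struct prison_sketch { uint32_t pr_ip; };   /* jail address, host byte order */
struct proc_sketch { struct prison_sketch *p_prison; };  /* NULL if unjailed */

/*
 * Returns 1 if a jailed process names an address that is not its jail's
 * address, 0 otherwise. flag != 0 means *ip is already in host byte order.
 */
static int
prison_ip_sketch(const struct proc_sketch *p, int flag, uint32_t *ip)
{
    uint32_t tmp;

    if (p->p_prison == NULL)
        return (0);
    tmp = flag ? *ip : ntohl(*ip);
    if (tmp == INADDR_ANY) {
        /* A wildcard bind is silently narrowed to the jail's address. */
        *ip = flag ? p->p_prison->pr_ip : htonl(p->p_prison->pr_ip);
        return (0);
    }
    return (p->p_prison->pr_ip != tmp);
}
```

For a jail whose address is 192.168.0.1, a bind to 192.168.0.1 yields 0, a bind to 10.0.0.1 yields 1, and a bind to INADDR_ANY yields 0 after the address has been rewritten to 192.168.0.1.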
Jailed users are not allowed to bind services to an IP which does not belong
to the jail. The restriction is also written within the function in_pcbbind:

/usr/src/sys/netinet/in_pcb.c:

        if (nam) {
                ...
                lport = sin->sin_port;
                ...
                if (lport) {
                        ...
                        if (p && p->p_prison)
                                prison = 1;
                        ...
                        if (prison &&
                            prison_ip(p, 0, &sin->sin_addr.s_addr))
                                return (EADDRNOTAVAIL);

4.5 Filesystem
Even root users within the jail are not allowed to set any file flags, such
as the immutable, append-only, and no-unlink flags, if the securelevel is
greater than 0.
It would be convenient for users to be able to log in to a jail from the
console. An easy way to do this requires that the user enter his information
twice. Basically, all users with the GID 1000 will log in to a jailed environment.
This is how it is implemented:
1. Open up login.c in the FreeBSD source tree. It is located in
/usr/src/usr.bin/login.
2. Before the execlp, enter these lines:
Of course, you have to define JAILDIR and JAILIPADDR. If
your GID (group ID) is 1000, login executes login from within the jail.
5.2 Jail NG
Jail NG is a "from-scratch re-implementation of Jail" by Robert Watson, a
FreeBSD committer. Some of the new features include the ability to add processes
to a jail, an improved management tool, and per-jail sysctls. For example,
one jail could have sysvipc_permitted set, allowing it to use System V IPC,
while another jail has it cleared. You can download the kernel patches and
utilities for Jail NG from his website at:
http://www.watson.org/~robert/jailng/
6 Conclusion
It seems that Jail will become the new security model. Instead of one computer
per server, one computer could have multiple jails, each one providing a
service. With the release of Jail NG, this method of security seems to be
becoming the norm.
References
[1] M. J. Bach, The Design of the UNIX Operating System (Prentice Hall,
Englewood Cliffs, New Jersey 07632, 1990).
[2] M. K. McKusick et al., The Design and Implementation of the 4.4BSD
Operating System (Addison-Wesley, Reading, Massachusetts, 1996).