Sandboxing and isolation
My notes on software sandboxing and isolation mechanisms. For information on hardware features, have a look at Intel CPU security features and ARM CPU security features.
Isolation mechanisms may be present in several layers of a computer system:
Layer | Technology |
---|---|
Application | Userland CPU emulators, ptrace() based emulation, browser sandboxes |
OS | cgroups, namespaces, jails, chroot() , jobs |
ISA | VT-x, AMD-V, SGX |
Hardware | TPM, TrustZone |
Make sure the product of RLIMIT_NPROC times RLIMIT_NOFILE is lower than the system's maximum allowed number of file descriptors, so that a user cannot conduct file descriptor exhaustion attacks (yes, they can lead to elevation of privilege).
On Linux check the values in the following files:
- /proc/sys/fs/file-max
- /proc/sys/net/ipv4/ip_local_port_range
and correlate with the output of ulimit -a.
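The check described above can be sketched as a small bash helper. This is only an illustration: the function name `fd_budget_ok` is hypothetical, and the script assumes that bash's `ulimit -u` and `ulimit -n` report the soft values of RLIMIT_NPROC and RLIMIT_NOFILE for the current shell.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if NPROC * NOFILE stays below FILE_MAX.
# NPROC and NOFILE are integers or the string "unlimited"; FILE_MAX is an
# integer such as the contents of /proc/sys/fs/file-max.
fd_budget_ok()
{
    local nproc=$1 nofile=$2 file_max=$3

    # An unlimited per-user limit can never satisfy the inequality.
    if [ "$nproc" = "unlimited" ] || [ "$nofile" = "unlimited" ]; then
        return 1
    fi

    [ $((nproc * nofile)) -lt "$file_max" ]
}

# Compare the soft limits of the current shell against the system-wide cap.
if fd_budget_ok "$(ulimit -u)" "$(ulimit -n)" "$(cat /proc/sys/fs/file-max)"; then
    echo "OK: RLIMIT_NPROC * RLIMIT_NOFILE is below fs.file-max"
else
    echo "WARNING: a single user may be able to exhaust file descriptors"
fi
```

Limits set through PAM or systemd may differ from those of an interactive shell, so the check is best run in the context of the user being audited.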
Although not intended for isolation purposes, some mount options like ro and noexec are usually used to restrict chroot and other forms of jails. Keep in mind that, unlike Linux, OpenBSD does allow loading shared libraries from noexec partitions.
On most systems, only certain directories holding system binaries (e.g. /bin, /usr/bin and so on) and shell scripts (e.g. /etc/rc.d, /etc/init.d) need to be executable. It would be ideal to be able to create separate partitions for these special directories and install the rest of the system in noexec partitions. However, having a separate partition for /bin, for example, makes a system unbootable, unless initrd is used, and even then it introduces certain complications.
The following is an experimental shell script for Linux systems, which allows for achieving this result by (ab)using certain mount features. It can be used as a boot script on a Debian installation or as a standalone utility.
#!/bin/bash
#
# Mounts the whole filesystem as "noexec" apart from a set of directories that
# are expected to hold executable files (like "/bin", "/usr/bin" and so on).
# The underlying partitioning scheme and any existing mounts are not affected.
#
# Should be used along with either the Linux support for BSD secure levels [1],
# or grsecurity's "romount_protect" and "audit_mount" [2].
#
# [1] https://lwn.net/Articles/566169/
# [2] http://en.wikibooks.org/wiki/Grsecurity/Appendix/ \
#     Grsecurity_and_PaX_Configuration_Options
#
# To install, run the following commands as root:
#
#     install -g root -o root -m 0755 -T noexec /etc/init.d/noexec
#     update-rc.d noexec defaults
#
# huku <[email protected]>

### BEGIN INIT INFO
# Provides: noexec
# Required-Start: $local_fs
# Required-Stop: umountfs
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# X-Stop-After: networking
# X-Start-Before: $all
# X-Interactive: true
### END INIT INFO

. /lib/lsb/init-functions

# Return a list of all paths in a system that should be executable.
make_exec_paths()
{
    local prefix

    exec_paths=
    for prefix in "" "/usr" "/usr/local"; do
        exec_paths="$exec_paths $prefix/bin $prefix/sbin $prefix/libexec"
        exec_paths="$exec_paths $prefix/lib $prefix/lib64"
    done

    # "/etc/" should be executable; `init' will call several shell scripts in
    # "/etc/init.d", "/etc/rc.d" etc.
    exec_paths="$exec_paths /etc"
}

# Unfortunately, `mountpoint(1)' won't work for bind mounts. We have to check
# by hand. The mount table is read on every call, so that "restart" sees the
# mounts removed by `umount_all()'.
is_mounted()
{
    local r=1
    local path=$1
    local mountpoint

    for mountpoint in $(cut -d " " -f 2 /etc/mtab); do
        if [ "$mountpoint" = "$path" ]; then
            r=0
            break
        fi
    done

    return $r
}

mount_all()
{
    make_exec_paths

    local exec_path
    for exec_path in $exec_paths; do
        # Skip candidate directories that do not exist on this system.
        [ -d "$exec_path" ] || continue

        # If not mounted, bind it, make it executable and mark it as private.
        if ! is_mounted "$exec_path"; then
            mount --bind "$exec_path" "$exec_path"
            mount -o remount,exec "$exec_path"
            mount --make-private "$exec_path"
        fi
    done

    # Mark the root filesystem as private and make it non executable. This way
    # sub-mounts are not affected.
    mount --make-private /
    mount -o remount,noexec /
}

umount_all()
{
    make_exec_paths

    # If remounting of the root filesystem fails, defer for later.
    if ! mount -o remount,exec / &>/dev/null; then
        mount -l -o remount,exec /
    fi

    local exec_path
    for exec_path in $exec_paths; do
        # Check if path is a mountpoint; if not, do nothing.
        if is_mounted "$exec_path"; then
            # If unmounting fails, defer for later.
            if ! umount "$exec_path" &>/dev/null; then
                umount -l "$exec_path"
            fi
        fi
    done
}

main()
{
    case "$1" in
        start)
            log_action_begin_msg "Setting up no-exec restrictions"
            mount_all
            log_action_end_msg 0
            ;;
        stop)
            log_action_begin_msg "Removing no-exec restrictions"
            umount_all
            log_action_end_msg 0
            ;;
        restart)
            # Remove the restrictions first, then set them up again.
            log_action_begin_msg "Resetting no-exec restrictions"
            umount_all
            mount_all
            log_action_end_msg 0
            ;;
        reload|force-reload)
            ;;
        *)
            echo "Usage: $0 {start|stop|restart|reload|force-reload}" >&2
            ;;
    esac
}

main "$@"

# EOF
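One way to check the effect of the script above is to list the mounts that still permit execution. The following is a minimal sketch, not part of the original script: the function names `opts_have_noexec` and `list_exec_mounts` are hypothetical, and the mount table is assumed to be readable at /proc/self/mounts in the usual "device mountpoint fstype options ..." layout.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the comma-separated mount option list
# given as $1 contains "noexec".
opts_have_noexec()
{
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print every mountpoint that is NOT mounted "noexec".
# $1 is the mount table to parse and defaults to /proc/self/mounts.
list_exec_mounts()
{
    local mounts_file=${1:-/proc/self/mounts}
    local dev mountpoint fstype opts rest

    while read -r dev mountpoint fstype opts rest; do
        if ! opts_have_noexec "$opts"; then
            echo "$mountpoint"
        fi
    done < "$mounts_file"
}

# Audit the running system, if a mount table is available.
if [ -r /proc/self/mounts ]; then
    list_exec_mounts
fi
```

After the script has run, the output would be expected to contain the bind-mounted directories such as /bin and /usr/bin, plus any other filesystems that were never mounted with noexec.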
Probably one of the oldest isolation mechanisms.
Is debootstrap a good choice for building jails?
This section contains notes on various Linux kernel features used by LXC. A brief introduction by Stephane Graber can be found here.
Complete documentation on control groups v1 (cgroup-v1) may be found here.
Write to cgroup.procs to have all threads of an application transferred to the target control group; write to tasks to move a single thread!
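The difference between the two files can be sketched with a pair of bash helpers. The function names are hypothetical, and the example path assumes a cgroup-v1 hierarchy mounted under /sys/fs/cgroup with an already created group called "sandbox".

```shell
#!/bin/bash
# Sketch only: the helper names are hypothetical, and the cgroup directory is
# assumed to be an existing cgroup-v1 group such as
# /sys/fs/cgroup/freezer/sandbox (writing there normally requires root).

# Move a whole process: the kernel migrates every thread that belongs to the
# thread group identified by PID.
cgroup_move_process()
{
    local cgroup_dir=$1 pid=$2
    echo "$pid" > "$cgroup_dir/cgroup.procs"
}

# Move a single thread, identified by its TID, leaving its siblings behind.
cgroup_move_thread()
{
    local cgroup_dir=$1 tid=$2
    echo "$tid" > "$cgroup_dir/tasks"
}

# Example (as root): move the current shell and all of its threads.
#   cgroup_move_process /sys/fs/cgroup/freezer/sandbox $$
```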
- May be used to filter mknod() and open() to device nodes.
- Can be used to limit and monitor user and kernel memory usage. May be used to protect against a bunch of attacks involving physical memory exhaustion that may result in local root exploits (I've actually seen that in a friend's exploit).

  Probably useful for detecting abrupt increases of memory usage which may be the result of heap spraying attacks.

  RSS pages are accounted at page_fault unless they've already been accounted for earlier.

  Does it protect against VM space exhaustion via MAP_NORESERVE?
- Allows for classifying network traffic coming from specific control groups. Classified traffic can be handled by tc and iptables.

  For the iptables cgroup match to work:
  - Install a recent kernel.
  - Add support for CONFIG_NETFILTER_XT_MATCH_CGROUP.
  - Install a recent snapshot of iptables.
- Allows for pausing the tasks of a control group. May be useful for monitoring daemons which can use this subsystem to take snapshots of tasks that violate resource limits.
- Its ability to limit both fork() and clone() makes it superior to RLIMIT_NPROC.
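Several of the subsystems listed above are driven by writing values to control files. The sketch below assumes the cgroup-v1 file names pids.max, memory.limit_in_bytes and freezer.state; the helper names and the example directories are hypothetical.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical; each directory argument is
# assumed to be an existing cgroup-v1 group under the matching controller,
# e.g. /sys/fs/cgroup/pids/sandbox. Writing to real groups requires root.

# Write VALUE to the control file FILE of the group at DIR.
cgroup_set()
{
    local dir=$1 file=$2 value=$3
    printf '%s\n' "$value" > "$dir/$file"
}

# Cap the number of tasks in the group; covers fork() and clone() alike.
limit_tasks()  { cgroup_set "$1" pids.max "$2"; }

# Cap user memory usage of the group, in bytes.
limit_memory() { cgroup_set "$1" memory.limit_in_bytes "$2"; }

# Pause and resume every task in the group.
freeze_tasks() { cgroup_set "$1" freezer.state FROZEN; }
thaw_tasks()   { cgroup_set "$1" freezer.state THAWED; }

# Example (as root):
#   limit_tasks /sys/fs/cgroup/pids/sandbox 64
#   freeze_tasks /sys/fs/cgroup/freezer/sandbox
```

A monitoring daemon could, for instance, call `freeze_tasks` on a group that exceeds its limits, inspect the paused tasks, and then call `thaw_tasks`.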
Secure Computing Mode (a.k.a. seccomp) can be configured either via prctl() or via the seccomp() system call.
Code samples can be found in the kernel source tree here and here.
A minimal sample that blocks getpid() is shown below:
/* seccomp_example.c
 * huku <[email protected]>
 */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <string.h>
#include <signal.h>

#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <linux/filter.h>
#include <linux/seccomp.h>


/* SIGSYS handler; invoked when the filter returns `SECCOMP_RET_TRAP'. Calling
 * `printf()' from a signal handler is not async-signal-safe, but it is good
 * enough for a demo.
 */
void handle_bad_syscall(int signum, siginfo_t *si, void *ctx)
{
    printf("Syscall %d (called at %p) denied!\n",
        si->si_syscall, si->si_call_addr);
}


int block_getpid(void)
{
    int ret = -1;

    struct sock_filter filter[] =
    {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),

        /* If it's `getpid()' fall through to the trap, otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog fprog =
    {
        sizeof(filter) / sizeof(filter[0]), &filter[0]
    };

    if(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) != 0)
    {
        perror("filter: prctl");
        goto _err;
    }

    ret = 0;

_err:
    return ret;
}


int main(int argc, char *argv[])
{
    int ret = EXIT_FAILURE;

    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_sigaction = handle_bad_syscall;
    act.sa_flags = SA_SIGINFO;
    sigaction(SIGSYS, &act, NULL);

    /* We need to set `PR_SET_NO_NEW_PRIVS' before using seccomp. */
    if(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
    {
        perror("prctl");
        goto _err;
    }

    if(block_getpid() != 0)
        goto _err;

    /* Denied! */
    getpid();

    ret = EXIT_SUCCESS;

_err:
    return ret;
}
See https://lwn.net/Articles/280279/ for a good overview.
SELinux is a MAC system for Linux based on LSM.
On desktop- and server-based Linux distributions (e.g. Red Hat, CentOS, etc.) policy sources may be available. However, on Android systems, each vendor may have made its own modifications to the upstream policy provided by Google. Dumping the policy directly from the mobile device may allow for attack surface enumeration.
To do that, make sure you clone the Android repository and use the sesearch utility found under prebuilts/python/linux-x86/2.7.5/lib/python2.7/site-packages/setoolsgui/sesearch.
$ adb pull /sys/fs/selinux/policy
$ python sesearch -A policy > policy.txt
SEAL is also worth looking at.
According to this article:
The dynamic heap provides serialization to avoid conflict among multiple threads accessing the same heap.
However, multiple heaps can be used for isolation purposes too:
- Isolated Heap & Friends - Object Allocation Hardening in Web Browsers
- Significant Flash exploit mitigations are live in v18.0.0.209
No write-up à la Biba?
No write-down à la Bell–LaPadula?
See this.
See this.
- Application specific sandboxes
  - Browsers
    - Chromium sandbox design principles
    - Security/Sandbox at MozillaWiki
  - Microsoft Office
    - Browsers
- Implementations
  - F-Secure's see
- Other
  - Sandboxing at kernelthread.com
Thanks fly to argp, fotisl and dcbz for their ideas and contributions.