Sandboxing and isolation

My notes on software sandboxing and isolation mechanisms. For information on hardware features, have a look at Intel CPU security features and ARM CPU security features.

1. General notes

Isolation mechanisms may be present in several layers of a computer system:

Layer	Technology
Application	Userland CPU emulators, `ptrace()` based emulation, browser sandboxes
OS	cgroups, namespaces, jails, `chroot()`, jobs
ISA	VT-x, AMD-V, SGX
Hardware	TPM, TrustZone

2. Unixoids

2.1 Common techniques

2.1.1 Resource limits

Make sure that RLIMIT_NPROC multiplied by RLIMIT_NOFILE is lower than the system-wide maximum number of open file descriptors, so that a single user cannot mount a file descriptor exhaustion attack (yes, such attacks can lead to elevation of privilege).

On Linux check the values in the following files:

/proc/sys/fs/file-max
/proc/sys/net/ipv4/ip_local_port_range

and correlate with the output of ulimit -a.
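
The check above can be sketched in C as follows. This is only a rough illustration: the function names are made up for these notes, and only the soft limits of the calling process are compared against /proc/sys/fs/file-max.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Read one unsigned integer from a /proc file; returns 0 on success. */
int read_proc_ull(const char *path, unsigned long long *out)
{
    FILE *fp;
    int ok;

    assert(path != NULL && out != NULL);

    if((fp = fopen(path, "r")) == NULL)
        return -1;

    ok = fscanf(fp, "%llu", out) == 1;
    fclose(fp);

    return ok ? 0 : -1;
}

/* Returns 1 if nproc * nofile is strictly lower than file_max, 0 otherwise.
 * An unlimited value is never considered safe. */
int fd_budget_ok(rlim_t nproc, rlim_t nofile, unsigned long long file_max)
{
    if(nproc == RLIM_INFINITY || nofile == RLIM_INFINITY || file_max == 0)
        return 0;

    if(nproc == 0 || nofile == 0)
        return 1;

    /* Divide instead of multiplying, so that the product cannot overflow. */
    return (unsigned long long)nofile <=
        (file_max - 1) / (unsigned long long)nproc;
}

/* Check the soft limits of the calling process; returns -1 on error. */
int check_fd_budget(void)
{
    struct rlimit nproc, nofile;
    unsigned long long file_max;

    if(getrlimit(RLIMIT_NPROC, &nproc) != 0 ||
            getrlimit(RLIMIT_NOFILE, &nofile) != 0)
        return -1;

    if(read_proc_ull("/proc/sys/fs/file-max", &file_max) != 0)
        return -1;

    return fd_budget_ok(nproc.rlim_cur, nofile.rlim_cur, file_max);
}
```

Calling check_fd_budget() from a login shell wrapper or a monitoring daemon returns 1 when the budget holds, 0 when it does not and -1 when the limits could not be read.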

2.1.2 Mount options

Although not intended for isolation, mount options like ro and noexec are often used to restrict chroot and other kinds of jails. Keep in mind that, unlike Linux, OpenBSD does allow loading shared libraries from noexec partitions.
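
To verify from inside a jail whether a path really lives on a noexec mount, statvfs() can be used. A minimal sketch for Linux with glibc follows; is_noexec() is a name made up for these notes.

```c
/* `ST_NOEXEC' is only exposed by glibc when `_GNU_SOURCE' is defined. */
#define _GNU_SOURCE

#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Returns 1 if the filesystem holding `path' is mounted noexec, 0 if it is
 * not and -1 on error (e.g. when the path does not exist). */
int is_noexec(const char *path)
{
    struct statvfs sv;

    assert(path != NULL);

    if(statvfs(path, &sv) != 0)
        return -1;

    return (sv.f_flag & ST_NOEXEC) != 0;
}
```

For example, is_noexec("/tmp") can be called at jail startup to refuse running when the expected restriction is missing.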

On most systems, only certain directories holding system binaries (e.g. /bin, /usr/bin and so on) and shell scripts (e.g. /etc/rc.d, /etc/init.d) need to be executable. It would be ideal to be able to create separate partitions for these special directories and install the rest of the system in noexec partitions. However, having a separate partition for /bin, for example, makes a system unbootable, unless initrd is used, and even then it introduces certain complications.

The following is an experimental shell script for Linux systems, which allows for achieving this result by (ab)using certain mount features. It can be used as a boot script on a Debian installation or as a standalone utility.

#!/bin/bash
#
# Mounts the whole filesystem as "noexec" apart from a set of directories that
# are expected to hold executable files (like "/bin", "/usr/bin" and so on).
# The underlying partitioning scheme and any existing mounts are not affected.
#
# Should be used along with either the Linux support for BSD secure levels [1],
# or grsecurity's "romount_protect" and "audit_mount" [2].
#
# [1] https://lwn.net/Articles/566169/
# [2] http://en.wikibooks.org/wiki/Grsecurity/Appendix/ \
#         Grsecurity_and_PaX_Configuration_Options
#
# To install, run the following commands as root:
#
#     install -g root -o root -m 0755 -T noexec /etc/init.d/noexec
#     update-rc.d noexec defaults
#
# huku <[email protected]>

### BEGIN INIT INFO
# Provides:          noexec
# Required-Start:    $local_fs
# Required-Stop:     umountfs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# X-Stop-After:      networking
# X-Start-Before:    $all
# X-Interactive:     true
### END INIT INFO


. /lib/lsb/init-functions


# Return a list of all paths in a system that should be executable.
make_exec_paths()
{
    local prefix

    exec_paths=
    for prefix in "" "/usr" "/usr/local"; do
        exec_paths="$exec_paths $prefix/bin $prefix/sbin $prefix/libexec"
        exec_paths="$exec_paths $prefix/lib $prefix/lib64"
    done

    # "/etc/" should be executable; `init' will call several shell scripts in
    # "/etc/init.d", "/etc/rc.d" etc.
    exec_paths="$exec_paths /etc"
}


# Unfortunately, `mountpoint(1)' won't work for bind mounts. We have to check
# by hand. The mount table is re-read on every call, so that mounts added or
# removed by this script are taken into account.
is_mounted()
{
    local path=$1
    local mountpoint

    while read -r _ mountpoint _; do
        if [ "$mountpoint" = "$path" ]; then
            return 0
        fi
    done < /etc/mtab

    return 1
}


mount_all()
{
    make_exec_paths

    local exec_path
    for exec_path in $exec_paths; do
        # Skip paths that don't exist on this system (e.g. "/lib64").
        [ -d "$exec_path" ] || continue

        # If not mounted, bind it, make it executable and mark it as private.
        if ! is_mounted "$exec_path"; then
            mount --bind "$exec_path" "$exec_path"
            mount -o remount,exec "$exec_path"
            mount --make-private "$exec_path"
        fi
    done

    # Mark the root filesystem as private and make it non executable. This way
    # sub-mounts are not affected.
    mount --make-private /
    mount -o remount,noexec /
}

umount_all()
{
    make_exec_paths

    # Restore "exec" on the root filesystem. Note that `mount(8)' has no lazy
    # mode (`-l' only adds labels to listings), so just report a failure.
    if ! mount -o remount,exec / &>/dev/null; then
        log_warning_msg "Could not remount / as exec"
    fi

    local exec_path
    for exec_path in $exec_paths; do
        # Check if path is a mountpoint; if not, do nothing.
        if is_mounted "$exec_path"; then
            # If unmounting fails, defer for later.
            if ! umount "$exec_path" &>/dev/null; then
                umount -l "$exec_path"
            fi
        fi
    done
}


main()
{
    case "$1" in
        start)
            log_action_begin_msg "Setting up no-exec restrictions"
            mount_all
            log_action_end_msg 0
            ;;
        stop)
            log_action_begin_msg "Removing no-exec restrictions"
            umount_all
            log_action_end_msg 0
            ;;
        restart)
            # Remove the restrictions first and then set them up again.
            log_action_begin_msg "Resetting no-exec restrictions"
            umount_all
            mount_all
            log_action_end_msg 0
            ;;
        reload|force-reload)
            ;;
        *)
            echo "Usage: $0 {start|stop|restart|reload|force-reload}" >&2
            exit 1
            ;;
    esac
}


main "$@"

# EOF

2.1.3 Chroot jails

Probably one of the oldest isolation mechanisms.
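
For reference, a minimal sketch of how a daemon usually enters a chroot jail is shown below. It assumes the process starts as root and that the jail directory is already populated; enter_jail() and its parameters are made up for these notes.

```c
#include <assert.h>
#include <grp.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Enter the jail at `root' and drop privileges to `uid'/`gid'.
 * Returns 0 on success and -1 on failure. */
int enter_jail(const char *root, uid_t uid, gid_t gid)
{
    assert(root != NULL);

    /* Change directory first, so that the working directory cannot be left
     * pointing outside the new root. */
    if(chdir(root) != 0)
        return -1;

    if(chroot(".") != 0)
        return -1;

    if(chdir("/") != 0)
        return -1;

    /* Order matters: supplementary groups, then gid, then uid. */
    if(setgroups(0, NULL) != 0)
        return -1;

    if(setgid(gid) != 0)
        return -1;

    if(setuid(uid) != 0)
        return -1;

    /* Make sure root privileges cannot be regained. */
    if(uid != 0 && setuid(0) == 0)
        return -1;

    return 0;
}
```

Keep in mind that a process that stays root inside the jail can usually break out of it, which is why the privilege drop is part of the same function.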

Is debootstrap a good choice for building jails?

2.2 Linux specific mechanisms

This section contains notes on various Linux kernel features used by LXC. A brief introduction by Stephane Graber can be found here.

2.2.1 Control groups v1

Complete documentation on control groups v1 (cgroup-v1) may be found here.

Write to cgroup.procs to have all threads of an application transferred to the target control group; write to tasks to move a single thread only!
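
A small sketch of both cases follows. The cgroup_attach() helper is made up for these notes, and the control group directory passed to it (e.g. somewhere under /sys/fs/cgroup) is assumed to exist already.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Write `id' to <cgroup_dir>/<file>. Use "cgroup.procs" with a process id to
 * move all of its threads, or "tasks" with a thread id to move one thread. */
int cgroup_attach(const char *cgroup_dir, const char *file, pid_t id)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL && file != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "w")) == NULL)
        return -1;

    ok = fprintf(fp, "%ld\n", (long)id) > 0;

    /* The kernel may reject the write; stdio reports that on flush. */
    if(fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call like cgroup_attach(dir, "cgroup.procs", getpid()) moves the whole calling process, while passing "tasks" along with a thread id moves just that thread.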

devices

May be used to filter mknod() and open() to device nodes.
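
Rules are written to devices.allow and devices.deny as "type major:minor access" strings. The following sketch builds and applies such a rule; both helpers are made up for these notes and the control group directory is assumed to exist.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a rule like "c 1:3 rwm". `type' is 'a', 'b' or 'c'; a negative
 * major or minor number stands for the "*" wildcard. */
int format_device_rule(char *buf, size_t len, char type, int dev_major,
        int dev_minor, const char *access)
{
    char maj[16] = "*";
    char min[16] = "*";
    int n;

    assert(buf != NULL && access != NULL);

    if(type != 'a' && type != 'b' && type != 'c')
        return -1;

    /* Only "r" (read), "w" (write) and "m" (mknod) are meaningful. */
    if(access[0] == '\0' || strspn(access, "rwm") != strlen(access))
        return -1;

    if(dev_major >= 0)
        snprintf(maj, sizeof(maj), "%d", dev_major);

    if(dev_minor >= 0)
        snprintf(min, sizeof(min), "%d", dev_minor);

    n = snprintf(buf, len, "%c %s:%s %s", type, maj, min, access);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write `rule' to <cgroup_dir>/<file>, where `file' is expected to be
 * "devices.allow" or "devices.deny". */
int apply_device_rule(const char *cgroup_dir, const char *file,
        const char *rule)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL && file != NULL && rule != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "w")) == NULL)
        return -1;

    ok = fputs(rule, fp) >= 0;
    if(fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A typical whitelist setup would first deny everything and then allow single devices, for example "c 1:3 rwm" for /dev/null.
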

memory

Can be used to limit and monitor user and kernel memory usage. May be used to protect against a bunch of attacks involving physical memory exhaustion that may result in local root exploits (I've actually seen that in a friend's exploit).
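
As a rough illustration, the sketch below sets a limit through memory.limit_in_bytes and computes how far the current usage (memory.usage_in_bytes) is from it. The helper names are made up for these notes and the control group directory is assumed to exist.

```c
#include <assert.h>
#include <stdio.h>

/* Write an unsigned integer to <cgroup_dir>/<file>. */
int cgroup_write_ull(const char *cgroup_dir, const char *file,
        unsigned long long value)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL && file != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "w")) == NULL)
        return -1;

    ok = fprintf(fp, "%llu\n", value) > 0;
    if(fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Read an unsigned integer from <cgroup_dir>/<file>. */
int cgroup_read_ull(const char *cgroup_dir, const char *file,
        unsigned long long *out)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL && file != NULL && out != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "r")) == NULL)
        return -1;

    ok = fscanf(fp, "%llu", out) == 1;
    fclose(fp);

    return ok ? 0 : -1;
}

/* Store the number of bytes left before the limit is hit in `out'. */
int memory_headroom(const char *cgroup_dir, unsigned long long *out)
{
    unsigned long long limit, usage;

    assert(out != NULL);

    if(cgroup_read_ull(cgroup_dir, "memory.limit_in_bytes", &limit) != 0 ||
            cgroup_read_ull(cgroup_dir, "memory.usage_in_bytes", &usage) != 0)
        return -1;

    *out = limit > usage ? limit - usage : 0;
    return 0;
}
```

A monitoring daemon could poll memory_headroom() and treat a sudden drop as a hint that something like a heap spray is going on.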

Probably useful for detecting abrupt increases of memory usage which may be the result of heap spraying attacks.

RSS pages are accounted at page_fault unless they've already been accounted for earlier.

Does it protect against VM space exhaustion via MAP_NORESERVE?

net_cls & net_prio

Allow for classifying network traffic coming from specific control groups. Classified traffic can be handled by tc and iptables.

If iptables cannot match on control groups:
- Install a recent kernel.
- Add support for CONFIG_NETFILTER_XT_MATCH_CGROUP.
- Install a recent snapshot of iptables.
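
The class identifier written to net_cls.classid packs a tc handle as 0xAAAABBBB, where AAAA is the major and BBBB the minor number. A tiny sketch of that encoding follows; the function names are made up for these notes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pack a tc handle (major:minor) into a net_cls class identifier. */
uint32_t net_cls_classid(uint16_t handle_major, uint16_t handle_minor)
{
    return ((uint32_t)handle_major << 16) | handle_minor;
}

/* Format the class identifier as a hexadecimal string, which is the form
 * that is written to net_cls.classid. */
int net_cls_format(char *buf, size_t len, uint16_t handle_major,
        uint16_t handle_minor)
{
    int n;

    assert(buf != NULL);

    n = snprintf(buf, len, "0x%08x",
        (unsigned int)net_cls_classid(handle_major, handle_minor));

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For instance, the tc handle 10:1 (both numbers are hexadecimal in tc syntax) becomes 0x00100001.
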

freezer

Allows for pausing the tasks of a control group. May be useful for monitoring daemons which can use this subsystem to take snapshots of tasks that violate resource limits.
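
In cgroup-v1 this is driven through the freezer.state file, which accepts FROZEN and THAWED and may read back FREEZING while the operation is in progress. A minimal sketch follows; the helper names are made up for these notes and the control group directory is assumed to exist.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "FROZEN" or "THAWED" to <cgroup_dir>/freezer.state. */
int freezer_set(const char *cgroup_dir, int frozen)
{
    char path[4096];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/freezer.state", cgroup_dir);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "w")) == NULL)
        return -1;

    ok = fputs(frozen ? "FROZEN" : "THAWED", fp) >= 0;
    if(fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Returns 1 if all tasks are frozen, 0 if the group is "THAWED" or still
 * "FREEZING" and -1 on error. */
int freezer_is_frozen(const char *cgroup_dir)
{
    char path[4096];
    char state[16];
    FILE *fp;
    int n, ok;

    assert(cgroup_dir != NULL);

    n = snprintf(path, sizeof(path), "%s/freezer.state", cgroup_dir);
    if(n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if((fp = fopen(path, "r")) == NULL)
        return -1;

    ok = fscanf(fp, "%15s", state) == 1;
    fclose(fp);

    if(!ok)
        return -1;

    return strcmp(state, "FROZEN") == 0;
}
```

A monitoring daemon would call freezer_set(dir, 1), poll freezer_is_frozen() until it returns 1, take its snapshot and finally thaw the group with freezer_set(dir, 0).
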

pids

Its ability to limit both fork() and clone() makes it superior to RLIMIT_NPROC.

2.2.2 Control groups v2

2.2.3 Secure Computing Mode (a.k.a. seccomp)

Seccomp can be configured either via prctl() or via the seccomp() system call.

Code samples can be found in the kernel source tree here and here.

A minimal sample that blocks getpid() is shown below:

/* seccomp_example.c
 * huku <[email protected]>
 */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <string.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <linux/filter.h>
#include <linux/seccomp.h>


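/* Called on `SIGSYS'. Note that `printf()' is not async-signal-safe; it is
 * only used here to keep the sample short. */
void handle_bad_syscall(int signum, siginfo_t *si, void *ctx)
{
    (void)signum;
    (void)ctx;

    printf("Syscall %d (called at %p) denied!\n",
        si->si_syscall, si->si_call_addr);
}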

int block_getpid(void)
{
    int ret = -1;

    struct sock_filter filter[] =
    {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog fprog =
    {
        sizeof(filter) / sizeof(filter[0]), &filter[0]
    };

    if(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) != 0)
    {
        perror("filter: prctl");
        goto _err;
    }

    ret = 0;

_err:
    return ret;
}

int main(int argc, char *argv[])
{
    int ret = EXIT_FAILURE;

    struct sigaction act;

    (void)argc;
    (void)argv;

    memset(&act, 0, sizeof(act));
    act.sa_sigaction = handle_bad_syscall;
    act.sa_flags = SA_SIGINFO;

    if(sigaction(SIGSYS, &act, NULL) != 0)
    {
        perror("sigaction");
        goto _err;
    }

    /* We need to set `PR_SET_NO_NEW_PRIVS' before using seccomp (unless we
     * have `CAP_SYS_ADMIN'). */
    if(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
    {
        perror("prctl");
        goto _err;
    }

    if(block_getpid() != 0)
        goto _err;

    /* Denied! The kernel delivers `SIGSYS' instead of running the call. */
    getpid();

    ret = EXIT_SUCCESS;

_err:
    return ret;
}
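
For comparison, here is a sketch of the same idea using the seccomp() system call instead of prctl(). glibc provides no wrapper for it, so it is invoked through syscall(). The filter returns SECCOMP_RET_ERRNO, so the blocked call fails with EPERM instead of raising SIGSYS. The function names are made up for these notes and, like the sample above, the filter does not check the architecture field, which a real filter should do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <linux/filter.h>
#include <linux/seccomp.h>

/* Install a filter that makes system call `nr' fail with `EPERM'. */
int deny_syscall(int nr)
{
    struct sock_filter filter[] =
    {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
            SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog fprog =
    {
        sizeof(filter) / sizeof(filter[0]), &filter[0]
    };

    assert(nr >= 0);

    if(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);
}

/* Fork a child, deny getppid() in it and try to call it. Returns 1 if the
 * call was denied, 0 if it went through and -1 if the setup failed. The
 * filter is confined to the child, so the caller is not affected. */
int getppid_denied_in_child(void)
{
    int status, code;
    long r;
    pid_t pid;

    if((pid = fork()) < 0)
        return -1;

    if(pid == 0)
    {
        if(deny_syscall(SYS_getppid) != 0)
            _exit(2);

        errno = 0;
        r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 1 : 0);
    }

    if(waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

Installing the filter in a forked child is just a convenience for experimenting; seccomp filters cannot be removed once installed.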

2.2.4 Per-process securebits

See https://lwn.net/Articles/280279/ for a good overview.
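
The current flags of a process can be read with prctl(PR_GET_SECUREBITS). A minimal sketch that queries and prints them follows; the function names are made up for these notes. Changing the flags goes through PR_SET_SECUREBITS and requires CAP_SETPCAP.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/prctl.h>

#include <linux/securebits.h>

/* Returns the securebits of the calling thread, or -1 on error. */
int get_securebits(void)
{
    return prctl(PR_GET_SECUREBITS, 0, 0, 0, 0);
}

/* Print each flag along with its corresponding lock bit. */
void print_securebits(int bits)
{
    assert(bits >= 0);

    printf("noroot=%d (locked=%d)\n",
        !!(bits & SECBIT_NOROOT), !!(bits & SECBIT_NOROOT_LOCKED));
    printf("no_setuid_fixup=%d (locked=%d)\n",
        !!(bits & SECBIT_NO_SETUID_FIXUP),
        !!(bits & SECBIT_NO_SETUID_FIXUP_LOCKED));
    printf("keep_caps=%d (locked=%d)\n",
        !!(bits & SECBIT_KEEP_CAPS), !!(bits & SECBIT_KEEP_CAPS_LOCKED));
}
```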

2.2.5 POSIX capabilities

2.2.6 SELinux & SEAndroid

SELinux is a mandatory access control (MAC) system for Linux, built on the Linux Security Modules (LSM) framework.

On desktop and server Linux distributions (e.g. Red Hat, CentOS, etc.) policy sources may be available. However, on Android systems, each vendor may have made its own modifications to the upstream policy provided by Google. Dumping the policy directly from the mobile device may allow for attack surface enumeration.

To do that, make sure you clone the Android repository and use the sesearch utility found under prebuilts/python/linux-x86/2.7.5/lib/python2.7/site-packages/setoolsgui/sesearch.

$ adb pull /sys/fs/selinux/policy
$ python sesearch -A policy > policy.txt

SEAL is also worth looking at.

2.3 Apple MacOS X specific mechanisms

2.4 FreeBSD specific mechanisms

2.4.1 FreeBSD jails

2.4.2 Capsicum

3. Microsoft Windows

3.1 Multiple heaps

According to this article:

The dynamic heap provides serialization to avoid conflict among multiple threads accessing the same heap.

However, multiple heaps can be used for isolation purposes too.

3.2 Job objects

3.3 Integrity levels

No write-up, à la Biba?

3.4 Protected processes (heavy & light)

No write-down, à la Bell–LaPadula?

3.5 Window stations

See this.

3.6 Desktops