Building a Minimal Container Runtime in Go

Why

This is Part 1 of my attempt at building a Code Judge, similar to LeetCode. Since it’ll need to run user-submitted code, a secure sandboxed environment is required, which is where containers come in.

What are Containers?

A container is a collection of the packages required to run a piece of software (dependencies, libraries, etc.), including the software itself. More broadly, it is a group of processes isolated from the rest of the system, with their own filesystem, networking and process tree.

A Simple Container

Instead of a full-fledged OS, let’s make a simple container that can run ls and cat, with a few directories and files.

Create a directory structure for our container

mkdir -p container/rootfs
cd container/rootfs
mkdir -p bin etc home
echo "Hello, World!" > home/hello.txt

We can now use the chroot command to change the root directory to container/rootfs and run commands inside it.

chroot changes the root directory of the calling process and its children, so nothing outside the new root can be reached by path anymore. This gives our container basic filesystem isolation.

chroot into the container and run ls and cat

sudo chroot . /bin/bash

You’ll get the following error

chroot: failed to run command ‘/bin/bash’: No such file or directory

This is because chroot looks for the bash binary inside the container’s own filesystem, which doesn’t contain any binaries yet.

We’ll need to copy the bash binary and its dependencies into the container’s filesystem.

To check the dependencies of bash, we can use the ldd command

ldd /bin/bash

On my system, this outputs

[himanshu@archbox] ~/personal/projects/test/container ❯ ldd /bin/bash
	linux-vdso.so.1 (0x00007fc84474c000)
	libreadline.so.8 => /usr/lib/libreadline.so.8 (0x00007fc84458f000)
	libc.so.6 => /usr/lib/libc.so.6 (0x00007fc84439e000)
	libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007fc84432d000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fc84474e000)

So we’ll need to copy over the following files into the container’s filesystem

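# Run from the directory that contains container/
mkdir -p container/rootfs/usr/lib container/rootfs/lib64
cp /bin/bash container/rootfs/bin/
cp /usr/lib/libreadline.so.8 container/rootfs/usr/lib/
cp /usr/lib/libc.so.6 container/rootfs/usr/lib/
cp /usr/lib/libncursesw.so.6 container/rootfs/usr/lib/
# The loader must sit at the exact path bash asks for: /lib64/ld-linux-x86-64.so.2
cp /usr/lib64/ld-linux-x86-64.so.2 container/rootfs/lib64/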

Trying to chroot again, we can now successfully run bash inside the container

[himanshu@archbox] ~/personal/projects/test/container ❯ sudo chroot rootfs /bin/bash
bash-5.3# ls
bash: ls: command not found
bash-5.3# pwd
/
bash-5.3# whoami
bash: whoami: command not found
bash-5.3# cd home/
bash-5.3# pwd
/home
bash-5.3# cat hello.txt
bash: cat: command not found
bash-5.3#

As you can see, some commands like ls and cat do not work. This is because, just like bash, they are separate binaries with their own dependencies, all of which need to be copied into the container’s filesystem.

Translating this to Go

In Go, we can use the syscall package to call the chroot syscall and then run commands inside the container. Like the chroot command, the program needs to run as root.

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    err := syscall.Chroot("container/rootfs")
    if err != nil {
        fmt.Println("Error chrooting:", err)
        return
    }

    err = os.Chdir("/")
    if err != nil {
        fmt.Println("Error changing directory:", err)
        return
    }

    cmd := exec.Command("/bin/bash", "-c", "ls /home && cat /home/hello.txt")
    output, err := cmd.CombinedOutput()
    if err != nil {
        fmt.Println("Error running command:", err)
        return
    }

    fmt.Println(string(output))
}

Once ls and cat have been copied in along with their libraries, this works as expected. But let’s automate the tedious process of copying over binaries and their dependencies into the container.

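// Needs "fmt", "os/exec" and "strings" in the import block.
var binaries = []string{"/bin/bash", "/bin/ls", "/bin/cat", "/usr/bin/whoami"}

// dependencies maps each binary to the library paths reported by ldd.
var dependencies = map[string][]string{}

func collectDependencies() {
    for _, binary := range binaries {
        output, err := exec.Command("ldd", binary).CombinedOutput()
        if err != nil {
            fmt.Printf("Error running ldd on %s: %v\n", binary, err)
            continue
        }

        parseLdd(string(output), binary)
    }
}

func parseLdd(output string, binary string) {
    lines := strings.Split(output, "\n")
    for _, line := range lines {
        parts := strings.Fields(line)
        if len(parts) == 0 {
            continue
        }
        // The dynamic loader is requested by absolute path
        // (e.g. /lib64/ld-linux-x86-64.so.2), so it must exist at that exact path.
        if strings.HasPrefix(parts[0], "/") {
            dependencies[binary] = append(dependencies[binary], parts[0])
        }
        // "libc.so.6 => /usr/lib/libc.so.6 (0x...)": the resolved path is the third field.
        if len(parts) >= 3 && parts[1] == "=>" && strings.HasPrefix(parts[2], "/") {
            dependencies[binary] = append(dependencies[binary], parts[2])
        }
        // linux-vdso.so.1 is mapped in by the kernel and has no file to copy, so it is skipped.
    }
}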

With this, we only need to specify the binaries we want to run inside the container. The program works out their dependencies, and can then copy the binaries and those dependencies into the container’s filesystem.
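The copy step itself is not shown above. As a rough sketch, assuming each file should land at the same absolute path inside the rootfs, it could look something like this. destPath and copyIntoRootfs are hypothetical helper names, not taken from the original program.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// destPath maps an absolute host path to the same location under rootfs.
// (Hypothetical helper, introduced for this sketch.)
func destPath(rootfs, src string) string {
	return filepath.Join(rootfs, src)
}

// copyIntoRootfs copies src to the matching path under rootfs, creating
// parent directories and keeping the permission bits so binaries stay executable.
func copyIntoRootfs(rootfs, src string) error {
	in, err := os.Open(src) // follows symlinks on the host
	if err != nil {
		return err
	}
	defer in.Close()

	info, err := in.Stat()
	if err != nil {
		return err
	}

	dst := destPath(rootfs, src)
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return err
	}

	out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, info.Mode().Perm())
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: copyfiles <rootfs> <file>...")
		return
	}
	rootfs := os.Args[1]
	for _, src := range os.Args[2:] {
		if err := copyIntoRootfs(rootfs, src); err != nil {
			fmt.Printf("Error copying %s: %v\n", src, err)
		}
	}
}
```

In the program above, the same helper would be called once for every entry in binaries and once for every path collected in dependencies.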

Adding Safeguards

While this is a functional container, it is still nowhere near secure. A user can easily break out of the container and access the host system.

There are 3 main things a container requires:

Filesystem Isolation: The container should have its own filesystem, and should not be able to access the host’s filesystem. This has already been achieved using chroot.
Process Isolation: The container should have its own process tree, with process IDs starting from 1. To achieve this, we’ll make use of PID namespaces.
Network Isolation: The container should have its own network stack, with its own IP address and network interfaces. This can be achieved using network namespaces.

Namespaces are a Linux kernel feature that isolates system resources between groups of processes. There are several types (mount, UTS, IPC, user, cgroup and more), but we’re only concerned with PID and network namespaces for our container.

For PID namespaces, we can use the clone syscall with the CLONE_NEWPID flag. This will create a new process that is isolated from the host’s process tree.

Similarly, for network namespaces, we can use the clone syscall with the CLONE_NEWNET flag. This will create a new process that is isolated from the host’s network stack.

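func main() {
    // Start an interactive shell in new PID and network namespaces (requires root).
    cmd := exec.Command("/bin/bash")
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNET,
    }
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        fmt.Println("Error running command:", err)
    }
}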

This pops us into a shell, where if we run:

bash-5.3# ps

PID   USER     TIME  COMMAND
    1 root      0:00 /proc/self/exe child /home/himanshu/personal/projects/capsule/testdir2
    9 root      0:00 /bin/sh
   10 root      0:00 ps

Since PIDs start from 1, this confirms that we are in a new PID namespace.
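PID 1 in the listing above is /proc/self/exe child ..., which suggests the program re-executes itself as a child inside the new namespaces. A minimal sketch of that pattern, combined with the chroot from earlier, might look like the following. The run and child argument names, the isChild helper, the extra mount namespace and the choice to mount /proc in the child are all assumptions made for illustration, and may differ from the project’s actual code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// isChild reports whether this process was started as the re-executed child.
// (Hypothetical helper, introduced for this sketch.)
func isChild(args []string) bool {
	return len(args) > 1 && args[1] == "child"
}

// parent re-executes the current binary inside new PID and network
// namespaces, plus a private mount namespace. It needs root to succeed.
func parent(rootfs string) error {
	cmd := exec.Command("/proc/self/exe", "child", rootfs)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWPID | syscall.CLONE_NEWNET,
		Unshareflags: syscall.CLONE_NEWNS, // keeps the /proc mount below off the host
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd.Run()
}

// child runs as PID 1 of the new PID namespace: it enters the rootfs,
// mounts /proc so that ps only sees the container's processes, and
// then starts a shell.
func child(rootfs string) error {
	if err := syscall.Chroot(rootfs); err != nil {
		return fmt.Errorf("chroot: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir: %w", err)
	}
	if err := os.MkdirAll("/proc", 0o555); err != nil {
		return fmt.Errorf("mkdir /proc: %w", err)
	}
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		return fmt.Errorf("mount /proc: %w", err)
	}
	defer syscall.Unmount("/proc", 0)

	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: sudo ./runtime run <rootfs>")
		return
	}
	var err error
	switch {
	case isChild(os.Args):
		err = child(os.Args[2])
	case os.Args[1] == "run":
		err = parent(os.Args[2])
	default:
		err = fmt.Errorf("unknown command %q", os.Args[1])
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
}
```

Mounting /proc matters here because ps reads /proc. Without a fresh mount inside the new PID namespace, it would still list the host’s processes.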

Similarly, if we run ip addr, we can see that we have a different network interface than the host, confirming that we’re in a new network namespace.
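Another way to check both claims, sketched here as an assumption rather than something the project does, is to compare namespace identifiers. On Linux, each /proc/self/ns/* entry is a symlink whose target looks like pid:[4026531836]. Two processes are in the same namespace exactly when those inode numbers match. parseNamespaceLink is a hypothetical helper name, and the sketch needs /proc to be mounted wherever it runs.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNamespaceLink splits a /proc/<pid>/ns/* link target such as
// "pid:[4026531836]" into its namespace type and inode number.
func parseNamespaceLink(link string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(link, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	// Print the PID and network namespace IDs of the current process.
	for _, name := range []string{"pid", "net"} {
		link, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Printf("Error reading %s namespace: %v\n", name, err)
			continue
		}
		kind, inode, err := parseNamespaceLink(link)
		if err != nil {
			fmt.Println("Error:", err)
			continue
		}
		fmt.Printf("%s namespace inode: %d\n", kind, inode)
	}
}
```

Running it once on the host and once inside the container should print different inode numbers for both pid and net if the namespaces were created as intended.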

Best of all, we can run Vim inside a container!

References

Containers From Scratch

The code for this project can be found here