How Linux containers work

Posted by Kyle Olivo on August 15, 2016

I’ve had the opportunity to use a few different container technologies at work, so I thought I’d write a few blog posts explaining what containers are and going into some depth on how LXC works in particular.

What are containers?

It is sometimes useful to partition computing resources on a single bare-metal host. Perhaps you want to isolate processes for security, or just want to make the most efficient use of the computing resources available on your machine. Unix systems have had isolation capabilities from as far back as 1979 with chroot, and 2005 marked a resurgence of interest in virtualization. Chroot environments allowed a system administrator to limit a process’s access to a particular subdirectory tree on the filesystem, but provided no support for limiting access to other resources (CPU, memory, network, etc.). Virtualization introduced a hypervisor layer on top of the primary operating system that allowed for the creation of a new, virtual machine on top of the existing system.

While virtualization provided isolation, it also introduced overhead by requiring another copy of the operating system to be installed on the guest virtual machine. In comparison to these two approaches, you can think of a container as an extension of the chroot concept. It limits access to a particular part of the host filesystem, but also provides mechanisms to limit access to other computing resources on the host. In contrast to virtual machines, a container uses the host operating system and kernel to provide the isolation and limits, so you avoid the overhead of the hypervisor and the guest operating system layers entirely. The result is that more computing resources are available for the processes you actually care about, and the setup of the container environment is much simpler than the creation of a similar virtual machine environment. The folks at Docker have created a really nice graphic comparing the differences between virtual machines and containers.

How do containers work?

Containers work through four main components: namespaces, cgroups, images, and userspace tools like LXC or Docker.

In a traditional Linux system, the init process is started at boot, and each subsequent process is fork-execed from its parent process (with init at the root of the process tree). Every running process exists in one common environment and can see all of the resources on that machine. Namespaces allow you to group resources together in a common collection. Processes can then be associated with that namespace, thereby giving them a more limited view of the resources available on that machine. The kernel has long provided six kinds of namespaces, which can be roughly categorized as mount, process, network, interprocess communication, hostname, and user (newer kernels add a few more, such as a cgroup namespace). The value of namespacing should be clear: if one process needs to make changes to the network, for instance, it is free to do so without affecting processes in other namespaces.

How does this relate to containers? Recall that all processes on a Linux system descend from the init process. One main component of a Linux container is the creation of a new init process under a new namespace. Without namespaces, it would not be possible to have multiple init processes running on the same host. So with namespaces alone, we have the ability to spawn a process tree and manipulate some underlying system resources without impacting the host system.

If one of the primary benefits of containers is isolation, what prevents a newly spawned container from overwhelming the host’s resources by consuming too much disk, memory, or processing power? This is where cgroups come in.
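Before moving on to cgroups, it can help to see that namespace membership is something you can inspect directly. The following is a minimal sketch, assuming a Linux host with /proc mounted: each entry under /proc/self/ns is a symlink whose target identifies the namespace the current process belongs to. The function name list_namespaces is my own, and the exact set of entries you see depends on your kernel version.

```python
import os

NS_DIR = "/proc/self/ns"


def list_namespaces(ns_dir=NS_DIR):
    """Return {namespace_name: identifier} for the calling process.

    Each entry in /proc/<pid>/ns is a symlink whose target looks like
    'net:[4026531992]'. Two processes share a namespace exactly when
    their link targets for that entry match.
    """
    namespaces = {}
    if not os.path.isdir(ns_dir):
        # Not Linux, or /proc is not mounted: nothing to report.
        return namespaces
    for name in sorted(os.listdir(ns_dir)):
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries we are not permitted to read.
            continue
    return namespaces


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20s} {ident}")
```

Running this once on the host and once inside a container should show different identifiers for the namespaces the container does not share with the host.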

Cgroups (or “control groups”) are responsible for placing limits on, and recording the usage of, system resources. This functionality became available in the Linux kernel in 2008. With cgroups, we can limit CPU usage, memory, disk I/O, and much more. It’s beyond the scope of this overview to describe how to configure cgroups, but you can look at the kernel documentation if you want to learn more. With the combination of namespacing and cgroups, we’re able to spawn a separate process hierarchy with a limited view of system resources, and we can constrain how much access this process hierarchy has to the resources that it can see. We’re still missing one major component of the container ecosystem though, and that is the contents of the filesystem that the container has access to.
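Configuring cgroups is out of scope here, but checking which cgroups a process belongs to is simple. This is a rough sketch, assuming a Linux host: each line of /proc/self/cgroup has the form hierarchy-ID:controller-list:cgroup-path. The function names and the /lxc/web01 path in the usage note are made up for illustration.

```python
from pathlib import Path


def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (hierarchy_id, controllers, path)."""
    # Split on the first two colons only, so a path containing ':' survives.
    hierarchy_id, controllers, path = line.rstrip("\n").split(":", 2)
    # The controller list is comma-separated and may be empty.
    return int(hierarchy_id), [c for c in controllers.split(",") if c], path


def current_cgroups(proc_file="/proc/self/cgroup"):
    """Return the parsed cgroup memberships of the calling process."""
    try:
        text = Path(proc_file).read_text()
    except OSError:
        # Not Linux, or the file is unreadable.
        return []
    return [parse_cgroup_line(l) for l in text.splitlines() if l.strip()]


if __name__ == "__main__":
    for hierarchy_id, controllers, path in current_cgroups():
        print(hierarchy_id, ",".join(controllers) or "(none)", path)
```

For example, parse_cgroup_line("4:cpu,cpuacct:/lxc/web01") returns (4, ["cpu", "cpuacct"], "/lxc/web01"), meaning the process is subject to whatever CPU limits are set on that (hypothetical) group.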

One of the most common misconceptions about containers is that they are “just like a mini virtual machine”. The confusion is understandable. When you are inside of a Linux container, it looks just like any other Linux distribution. You see the familiar filesystem layout, devices, and system software that you are used to. The contents of the container’s filesystem, however, are not a full operating system but a slimmed-down representation of the target operating system. The underlying resources and kernel are provided by the host, and the system software and devices are provided by the image. A host Linux distribution is therefore able to run a container that appears to be an entirely different Linux distribution. The images themselves are created by taking a slim version of the target distribution and manipulating it in a way that reduces its size and makes it runnable in a container. It’s beyond the scope of this article to describe exactly how these images are built, but you can see how the LXC images are created by looking at the LXC GitHub repository.

Namespaces, cgroups, and images thus provide the three major building blocks that make containers work. The last remaining technology is the set of user tools that allow you to seamlessly manipulate these concepts and create your own environments inside of a container. In the next post, I’ll go into some detail on how you can use LXC to accomplish this.
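As a small illustration of the split between image and host described above, here is a sketch that reports the distribution name and the kernel release side by side. It assumes the image ships an /etc/os-release file (most current distributions do), and the function names are my own. Inside a container, the distribution comes from the image’s filesystem while the kernel release is the host’s.

```python
import platform


def parse_os_release(text):
    """Parse os-release style KEY=value text into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        info[key] = value.strip().strip('"').strip("'")
    return info


def describe_environment(os_release_path="/etc/os-release"):
    """Report the distribution (from the filesystem) and the kernel (from the host)."""
    try:
        with open(os_release_path) as f:
            distro = parse_os_release(f.read()).get("PRETTY_NAME", "unknown")
    except OSError:
        distro = "unknown"
    return {"distribution": distro, "kernel": platform.release()}


if __name__ == "__main__":
    env = describe_environment()
    print("distribution:", env["distribution"])
    print("kernel:      ", env["kernel"])
```

If you run this on the host and again inside a container built from a different distribution’s image, the first line should differ while the second stays the same.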