In this chapter, we will learn how security is implemented for containers in general and how QoS policies ensure that resources such as CPU and I/O are shared as intended. Most of the discussion focuses on how these topics apply to Docker.
We will cover the following in this chapter:
In this section, we study the filesystem restrictions with which Docker containers are started. It covers the read-only mount points, including the one used to represent kernel objects, and the copy-on-write filesystems that serve as the base for Docker containers.
Docker needs access to filesystems such as sysfs and proc for processes to function. But it doesn't necessarily need to modify these mount points.
Two primary mount points loaded in read-only mode are:
/sys
/proc
The sysfs filesystem is loaded into the mount point /sys. sysfs is a mechanism for representing kernel objects, their attributes, and their relationships with each other. It provides two components:
The following code shows the mount points being mounted:
{
    Source:      "sysfs",
    Destination: "/sys",
    Device:      "sysfs",
    Flags:       defaultMountFlags | syscall.MS_RDONLY,
},
A reference link for the preceding code is at https://github.com/docker/docker/blob/ecc3717cb17313186ee711e624b960b096a9334f/daemon/execdriver/native/template/default_template_linux.go.
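One way to confirm from inside a container that /sys really is mounted read-only is to look for the ro option in /proc/mounts. The following is a minimal sketch of that check, not part of Docker itself; the function name isReadOnlyMount is hypothetical, and the sketch assumes the usual six-field /proc/mounts layout on Linux:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isReadOnlyMount scans text in /proc/mounts format and reports whether the
// entry for mountPoint carries the "ro" option. found is false when the
// mount point does not appear at all. If a mount point is listed more than
// once, the last entry wins, because later mounts are stacked on top.
func isReadOnlyMount(mounts, mountPoint string) (readOnly, found bool) {
	scanner := bufio.NewScanner(strings.NewReader(mounts))
	for scanner.Scan() {
		// Fields: device, mount point, filesystem type, options, dump, pass.
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[1] != mountPoint {
			continue
		}
		found = true
		readOnly = false
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				readOnly = true
			}
		}
	}
	return readOnly, found
}

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		fmt.Println("cannot read /proc/mounts:", err)
		return
	}
	ro, found := isReadOnlyMount(string(data), "/sys")
	fmt.Printf("/sys found=%v read-only=%v\n", found, ro)
}
```

Run inside a container started with the default settings described here, this should report /sys as read-only; on the host it will normally report read-write.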
The proc filesystem (procfs) is a special filesystem in Unix-like operating systems that presents information about processes and other system information in a hierarchical, file-like structure. It provides a more convenient and standardized method for dynamically accessing process data held in the kernel than traditional tracing methods or direct access to kernel memory. It is mapped to a mount point named /proc at boot time:
{
    Source:      "proc",
    Destination: "/proc",
    Device:      "proc",
    Flags:       defaultMountFlags,
},

ReadonlyPaths: []string{
    "/proc/asound",
    "/proc/bus",
    "/proc/fs",
    "/proc/irq",
    "/proc/sys",
    "/proc/sysrq-trigger",
}
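Unlike /sys, the /proc mount itself does not carry the read-only flag; only the subtrees listed in ReadonlyPaths are restricted. The sketch below illustrates which paths that list covers. It is an illustration only: the helper coveredByReadonlyPath is hypothetical, and the actual enforcement is done by the kernel on the mounts Docker sets up, not by string matching.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// readonlyPaths mirrors the ReadonlyPaths list from the template above.
var readonlyPaths = []string{
	"/proc/asound",
	"/proc/bus",
	"/proc/fs",
	"/proc/irq",
	"/proc/sys",
	"/proc/sysrq-trigger",
}

// coveredByReadonlyPath reports whether target is one of the read-only
// paths or lies beneath one of them. Matching is done on whole path
// components, so "/proc/sysfoo" is not covered by "/proc/sys".
func coveredByReadonlyPath(target string, readonly []string) bool {
	clean := path.Clean(target)
	for _, ro := range readonly {
		if clean == ro || strings.HasPrefix(clean, ro+"/") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"/proc/sys/kernel/hostname",
		"/proc/sysrq-trigger",
		"/proc/self/status",
	} {
		fmt.Printf("%-28s read-only=%v\n", p, coveredByReadonlyPath(p, readonlyPaths))
	}
}
```

The practical effect is that a process in the container can still read and write its own entries under /proc/self, while kernel tunables under /proc/sys cannot be changed.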
/dev/pts is another mount point, and it is mounted read-write for the container during creation. It lives purely in memory and nothing is stored on disk, hence it is safe to load it in read-write mode.
Entries in /dev/pts are pseudo-terminals (pty for short). Unix kernels have a generic notion of terminals. A terminal provides a way for applications to display output and to receive input through a terminal device. A process may have a controlling terminal. For a text mode application, this is how it interacts with the user:
{
    Source:      "devpts",
    Destination: "/dev/pts",
    Device:      "devpts",
    Flags:       syscall.MS_NOSUID | syscall.MS_NOEXEC,
    Data:        "newinstance,ptmxmode=0666,mode=0620,gid=5",
},
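The Data field in the preceding entry is a comma-separated option string passed to the devpts filesystem. To make its contents easier to read, the following sketch splits it into key/value pairs and prints the two mode options as permission strings. The helper parseMountData is hypothetical and is not how Docker processes the string; the option meanings in the comments follow the kernel's devpts documentation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMountData splits a comma-separated mount option string into a map.
// Options without "=" (such as "newinstance") are stored with an empty value.
func parseMountData(data string) map[string]string {
	opts := make(map[string]string)
	for _, field := range strings.Split(data, ",") {
		if field == "" {
			continue
		}
		parts := strings.SplitN(field, "=", 2)
		value := ""
		if len(parts) == 2 {
			value = parts[1]
		}
		opts[parts[0]] = value
	}
	return opts
}

func main() {
	opts := parseMountData("newinstance,ptmxmode=0666,mode=0620,gid=5")

	// ptmxmode: permissions of the ptmx node; mode: permissions of new ptys.
	for _, key := range []string{"ptmxmode", "mode"} {
		bits, err := strconv.ParseUint(opts[key], 8, 32)
		if err != nil {
			fmt.Println("bad mode:", err)
			continue
		}
		fmt.Printf("%-8s = %s (%s)\n", key, opts[key], os.FileMode(bits))
	}

	// newinstance: request a devpts instance private to this mount.
	_, private := opts["newinstance"]
	fmt.Println("newinstance set:", private)

	// gid: group that owns newly created ptys.
	fmt.Println("gid:", opts["gid"])
}
```

The newinstance option is the one that matters most for isolation: it asks for a private set of pseudo-terminals, so the container does not see the ptys of the host or of other containers.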
Docker uses union filesystems, which are copy-on-write filesystems. This means containers can share the same filesystem image as their base. When a container writes content to the image, the content is written to a container-specific filesystem layer instead. This prevents one container from accessing the changes made by another container, even if both were created from the same filesystem image. One container cannot change the image content to affect the processes in another container. The following figure explains this process:
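The copy-on-write behaviour can also be modelled in a few lines of code. The following is a toy sketch under simplifying assumptions: files are map entries, the image is a single read-only layer, and each container has one writable layer. The types image and container are hypothetical and do not correspond to Docker's storage drivers.

```go
package main

import "fmt"

// image is a base layer shared by every container (path -> content).
// The container code below never writes to it.
type image map[string]string

// container pairs the shared base with a private writable layer.
type container struct {
	base  image
	upper map[string]string
}

func newContainer(base image) *container {
	return &container{base: base, upper: make(map[string]string)}
}

// write stores content in the container's own layer only.
func (c *container) write(path, content string) {
	c.upper[path] = content
}

// read prefers the container's own layer and falls back to the image.
func (c *container) read(path string) (string, bool) {
	if v, ok := c.upper[path]; ok {
		return v, true
	}
	v, ok := c.base[path]
	return v, ok
}

func main() {
	base := image{"/etc/motd": "hello from the image"}
	a, b := newContainer(base), newContainer(base)

	a.write("/etc/motd", "changed by container a")

	av, _ := a.read("/etc/motd")
	bv, _ := b.read("/etc/motd")
	fmt.Println("a sees:", av)
	fmt.Println("b sees:", bv)
	fmt.Println("image :", base["/etc/motd"])
}
```

Container a sees its own change, while container b and the image itself still hold the original content, which is the isolation property described in the text.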