In this chapter, we'll look at how Kubernetes manages storage. Storage is very different from compute, but at a high level they are both resources. Kubernetes, as a generic platform, takes the approach of abstracting storage behind a programming model and a set of plugins for storage providers. First, we'll go into detail about the storage conceptual model and how storage is made available to containers in the cluster. Then, we'll cover the common cloud platform storage providers, such as AWS, GCE, and Azure. Next, we'll look at a prominent open source storage provider (GlusterFS from Red Hat), which provides a distributed filesystem. We'll also look into an alternative solution, Flocker, that manages your data in containers as part of the Kubernetes cluster. Finally, we'll see how Kubernetes supports integration of existing enterprise storage solutions.
At the end of this chapter, you'll have a solid understanding of how storage is represented in Kubernetes, the various storage options in each deployment environment (local testing, public cloud, enterprise), and how to choose the best option for your use case.
In this section, we will look at the Kubernetes storage conceptual model and see how to map persistent storage into containers so they can read and write it. Let's start by understanding the problem of storage. Containers and pods are ephemeral. Anything a container writes to its own filesystem gets wiped out when the container dies. Containers can also mount directories from their host node and read from or write to them. That data will survive container restarts, but the nodes themselves are not immortal.
There are other problems, such as ownership of mounted host directories when the container dies. Just imagine a bunch of containers writing important data to various directories on their hosts and then going away, leaving all that data scattered across the nodes with no direct way to tell which container wrote which data. You can try to record this information, but where would you record it? It's pretty clear that for a large-scale system, you need persistent storage accessible from any node to reliably manage the data.
The basic Kubernetes storage abstraction is the volume. Containers mount volumes that bind to their pod and they access the storage wherever it may be as if it's in their local filesystem. This is nothing new, and it is great because, as a developer who writes applications that need access to data, you don't have to worry about where and how the data is stored.
It is very simple to share data between containers in the same pod using a shared volume. Container 1 and container 2 simply mount the same volume and can communicate by reading and writing to this shared space. The most basic volume is the emptyDir. An emptyDir volume is an empty directory on the host. Note that it is not persistent, because when the pod is removed from the node, the contents are erased. If a container just crashes, the pod will stick around and you can access the data later. Another very interesting option is to use a RAM disk by specifying the medium as Memory. Now, your containers communicate through shared memory, which is much faster, but more volatile of course. If the node is restarted, the emptyDir volume's contents are lost.
Here is a pod configuration file that has two containers that mount the same volume, called shared-volume. The containers mount it at different paths, but when the hue-global-listener container writes a file to /notifications, the hue-job-scheduler will see that file under /incoming:
apiVersion: v1
kind: Pod
metadata:
  name: hue-scheduler
spec:
  containers:
  - image: the_g1g1/hue-global-listener
    name: hue-global-listener
    volumeMounts:
    - mountPath: /notifications
      name: shared-volume
  - image: the_g1g1/hue-job-scheduler
    name: hue-job-scheduler
    volumeMounts:
    - mountPath: /incoming
      name: shared-volume
  volumes:
  - name: shared-volume
    emptyDir: {}
To use the shared memory option, we just need to add medium: Memory to the emptyDir section:
volumes:
- name: shared-volume
  emptyDir:
    medium: Memory
Sometimes you want your pods to get access to some host information (for example, the Docker daemon) or you want pods on the same node to communicate with each other. This is useful if the pods know they are on the same host. Since Kubernetes schedules pods based on available resources, pods usually don't know what other pods they share the node with. There are two cases where a pod can rely on other pods being scheduled with it on the same node: in a single-node cluster, all pods obviously share the same node, and a DaemonSet pod always shares a node with any pod that matches its selector.
For example, in Chapter 6, Using Critical Kubernetes Resources, we discussed a DaemonSet pod that serves as an aggregating proxy to other pods. Another way to implement this behavior is for the pods to simply write their data to a mounted volume that is bound to a host directory, which the DaemonSet pod can directly read and act on.
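This pattern can be sketched roughly as follows. The image name and paths here are hypothetical, not from the chapter; the DaemonSet pod mounts the same host directory that the other pods on the node write to:

```yaml
# Sketch: a DaemonSet whose pod on every node mounts a host directory
# that other pods on that node write their data to (names hypothetical).
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: hue-collector
spec:
  template:
    metadata:
      labels:
        app: hue-collector
    spec:
      containers:
      - name: collector
        image: the_g1g1/hue-collector   # hypothetical image
        volumeMounts:
        - mountPath: /collected         # where the collector reads
          name: node-data
      volumes:
      - name: node-data
        hostPath:
          path: /var/lib/hue/data       # directory the other pods write to
```

Because a DaemonSet schedules exactly one such pod per node, the collector is guaranteed to share a node with every pod writing into that directory.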
Before you decide to use a HostPath volume, make sure you understand the limitations: pods with identical configurations may behave differently on different nodes because the host directory contents differ, and the files or directories created on the host are writable only by root. That means your container needs to run with privileged set to true or, on the host side, you need to change the permissions to allow writing.

Here is a configuration file that mounts the /coupons directory into the hue-coupon-hunter container, which is mapped to the host's /etc/hue/data/coupons directory:
apiVersion: v1
kind: Pod
metadata:
  name: hue-coupon-hunter
spec:
  containers:
  - image: the_g1g1/hue-coupon-hunter
    name: hue-coupon-hunter
    volumeMounts:
    - mountPath: /coupons
      name: coupons-volume
  volumes:
  - name: coupons-volume
    hostPath:
      path: /etc/hue/data/coupons
Since the pod doesn't have a privileged security context, it will not be able to write to the host directory. Let's change the container spec to enable it by adding a security context:
- image: the_g1g1/hue-coupon-hunter
  name: hue-coupon-hunter
  volumeMounts:
  - mountPath: /coupons
    name: coupons-volume
  securityContext:
    privileged: true
In the following diagram, you can see that each container has its own local storage area, inaccessible to other containers or pods, and the host's /data directory is mounted as a volume into both container 1 and container 2:
While emptyDir volumes can be mounted and used by containers, they are not persistent and don't require any special provisioning because they use existing storage on the node. HostPath volumes persist on the original node, but if a pod is restarted on a different node, it can't access the HostPath volume from its previous node. Real persistent volumes use storage provisioned ahead of time by administrators. In cloud environments, the provisioning may be very streamlined, but it is still required, and as a Kubernetes cluster administrator you have to at least make sure your storage quota is adequate and monitor usage versus quota diligently.
Remember that persistent volumes are resources that the Kubernetes cluster uses, similar to nodes. As such, they are cluster-level resources that don't belong to any namespace, unlike the pods that consume them.
You can provision resources statically or dynamically.
Static provisioning is straightforward. The cluster administrator creates persistent volumes backed up by some storage media ahead of time, and these persistent volumes can be claimed by containers.
Dynamic provisioning may happen when a persistent volume claim doesn't match any of the statically provisioned persistent volumes. If the claim specified a storage class and the administrator configured that class for dynamic provisioning, then a persistent volume may be provisioned on the fly. We will see examples later when we discuss persistent volume claims and storage classes.
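As a quick sketch of what triggers dynamic provisioning (the "fast" class name is an assumption, not from the chapter), a claim that requests a storage class with no matching static volume looks like this:

```yaml
# Sketch: a claim requesting the (assumed) "fast" storage class.
# If no static volume matches and "fast" is configured for dynamic
# provisioning, a new persistent volume is created on the fly.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: dynamic-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: "fast"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```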
Here is the configuration file for an NFS persistent volume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-1
  annotations:
    volume.beta.kubernetes.io/storage-class: "normal"
  labels:
    release: stable
    capacity: 100Gi
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /tmp
    server: 172.17.0.8
A persistent volume has a spec and metadata that includes the name and possibly an annotation of a storage class. The storage class annotation will become an attribute when storage classes get out of beta. Note that persistent volumes are at v1, but storage classes are still in beta. More on storage classes later. Let's focus on the spec here. There are four sections: capacity, access mode, reclaim policy, and the volume type (nfs in the example).
Each volume has a designated amount of storage. Storage claims may be satisfied by persistent volumes that have at least that amount of storage. In the example, the persistent volume has a capacity of 100 GiB (a gibibyte is 2³⁰ bytes). It is important when allocating static persistent volumes to understand the storage request patterns. For example, if you provision 20 persistent volumes with 100 GiB capacity and a container claims a persistent volume with 150 GiB, then this claim will not be satisfied even though there is enough capacity overall:
capacity:
  storage: 100Gi
There are three access modes:
ReadWriteOnce: the volume can be mounted as read-write by a single node
ReadOnlyMany: the volume can be mounted as read-only by many nodes
ReadWriteMany: the volume can be mounted as read-write by many nodes
The storage is mounted to nodes, so even with ReadWriteOnce, multiple containers on the same node can mount the volume and write to it. If that causes a problem, you need to handle it through some other mechanism (for example, claim the volume only in DaemonSet pods that you know will have just one pod per node).
Different storage providers support some subset of these modes. When you provision a persistent volume, you can specify which modes it will support. For example, NFS supports all modes, but in the example, only these modes were enabled:
accessModes:
- ReadWriteOnce
- ReadOnlyMany
The reclaim policy determines what happens when a persistent volume claim is deleted. There are three different policies:
Retain: the volume must be reclaimed manually by the administrator
Delete: the associated storage asset (such as an AWS EBS, GCE PD, Azure disk, or Cinder volume) is deleted
Recycle: the volume's contents are deleted (a basic scrub, equivalent to rm -rf /volume/*)
The Retain and Delete policies mean the persistent volume is not available anymore for future claims. The Recycle policy allows the volume to be claimed again.
Currently, only NFS and HostPath support recycling. AWS EBS, GCE PD, Azure disk, and Cinder volumes support deletion. Dynamically provisioned volumes are always deleted.
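For instance, a cloud-backed volume would typically use the Delete policy. The following is a hedged sketch (the disk name is hypothetical) rather than an example from the chapter:

```yaml
# Sketch: a GCE persistent disk volume with the Delete reclaim policy.
# When its claim is released, the underlying disk is deleted as well.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-gce
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  gcePersistentDisk:
    pdName: my-data-disk    # hypothetical pre-provisioned disk
    fsType: ext4
```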
When containers want access to some persistent storage they make a claim (or rather, the developer and cluster administrator coordinate on necessary storage resources to claim). Here is a sample claim that matches the persistent volume from the previous section:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: storage-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: "normal"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
  selector:
    matchLabels:
      release: "stable"
    matchExpressions:
    - {key: capacity, operator: In, values: [80Gi, 100Gi]}
In the metadata, you can see the storage class annotation. The name storage-claim will be important later when mounting the claim into a container.
The access mode in the spec is ReadWriteOnce, which means that if the claim is satisfied, no other claim with the ReadWriteOnce access mode can be satisfied, but claims for ReadOnlyMany can still be satisfied.
The resources section requests 80 GiB. This can be satisfied by our persistent volume, which has a capacity of 100 GiB. But this is a little wasteful, because the remaining 20 GiB will go unused.
The selector section allows you to filter available volumes further. For example, here the volume must match the label release: stable and also have a label with either capacity: 80Gi or capacity: 100Gi. Imagine that we have several other volumes provisioned with capacities of 200 GiB and 500 GiB. We don't want to claim a 500 GiB volume when we only need 80 GiB.
Kubernetes always tries to match the smallest volume that can satisfy a claim, but if there are no 80 GiB or 100 GiB volumes, the labels will prevent assigning a 200 GiB or 500 GiB volume, and dynamic provisioning will be used instead.
It's important to realize that claims don't mention volumes by name. The matching is done by Kubernetes based on storage class, capacity, and labels.
Finally, persistent volume claims belong to a namespace. Binding a persistent volume to a claim is exclusive. That means that the persistent volume will effectively be bound to a namespace. Even if the access mode is ReadOnlyMany or ReadWriteMany, all the pods that mount the persistent volume claim must be from that claim's namespace.
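To make the namespace explicit, a claim can declare it in its metadata. This is a small sketch (the "staging" namespace is an assumption, not from the chapter):

```yaml
# Sketch: a claim created in a specific (hypothetical) namespace.
# Only pods in the "staging" namespace can mount this claim.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: storage-claim
  namespace: staging
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```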
OK. We have provisioned a volume and claimed it. It's time to use the claimed storage in a container. This turns out to be pretty simple. First, the persistent volume claim must be used as a volume in the pod, and then the containers in the pod can mount it, just like any other volume. Here is a pod configuration file that specifies the persistent volume claim we created earlier (bound to the NFS persistent volume we provisioned):
kind: Pod
apiVersion: v1
metadata:
  name: the-pod
spec:
  containers:
  - name: the-container
    image: some-image
    volumeMounts:
    - mountPath: "/mnt/data"
      name: persistent-volume
  volumes:
  - name: persistent-volume
    persistentVolumeClaim:
      claimName: storage-claim
The key is in the persistentVolumeClaim section under volumes. The claim name (storage-claim here) uniquely identifies the specific claim within the current namespace and makes it available as a volume named persistent-volume here. Then, the container can refer to it by name and mount it to /mnt/data.
Storage classes let an administrator configure your cluster with custom persistent storage (as long as there is a proper plugin to support it). A storage class has a name in the metadata (which must be specified in the annotation of a claim that uses it), a provisioner, and parameters.
The storage class is still in beta as of Kubernetes 1.5. Here is a sample storage class:
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
You may create multiple storage classes for the same provisioner with different parameters. Each provisioner has its own parameters.
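For example, here is a hedged sketch of two classes backed by the same AWS EBS provisioner, one for general-purpose disks and one for provisioned-IOPS disks (the class names are assumptions):

```yaml
# Sketch: two storage classes for the same provisioner with
# different parameters (gp2 vs. provisioned-IOPS io1 volumes).
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: standard-disks    # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast-disks        # hypothetical name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "10"
```

A claim then selects between them simply by naming a different class in its storage class annotation.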
The currently supported volume types are as follows:
emptyDir
hostPath
gcePersistentDisk
awsElasticBlockStore
nfs
iscsi
flocker
glusterfs
rbd
cephfs
gitRepo
secret
persistentVolumeClaim
downwardAPI
azureFileVolume
azureDisk
vsphereVolume
Quobyte
This list contains both persistent volumes and other volume types, such as gitRepo or secret, that are not backed by your typical network storage. This area of Kubernetes is still in flux and, in the future, it will be decoupled further and the design will be cleaner, with plugins that are not part of Kubernetes itself. Utilizing volume types intelligently is a major part of architecting and managing your cluster.
The cluster administrator can also assign a default storage class. When a default storage class is assigned and the DefaultStorageClass
admission plugin is turned on, then claims with no storage class will be dynamically provisioned using the default storage class. If the default storage class is not defined or the admission plugin is not turned on, then claims with no storage class can only match volumes with no storage class.
To illustrate all the concepts, let's do a mini demonstration where we create a HostPath volume, claim it, mount it, and have containers write to it.
Let's start by creating a hostPath volume. Save the following in persistent-volume.yaml:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: persistent-volume-1
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteMany
  hostPath:
    path: "/tmp/data"

> kubectl create -f persistent-volume.yaml
persistentvolume "persistent-volume-1" created
To check out the available volumes, you can use the resource type persistentvolumes, or pv for short:
> kubectl get pv
NAME                  CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM     REASON    AGE
persistent-volume-1   1Gi        RWX           Retain          Available                       6m
The capacity is 1 GiB as requested. The reclaim policy is Retain because host path volumes are retained. The status is Available because the volume has not been claimed yet. The access mode is specified as RWX, which means ReadWriteMany. All access modes have a shorthand version:
RWO: ReadWriteOnce
ROX: ReadOnlyMany
RWX: ReadWriteMany
We have a persistent volume. Let's create a claim. Save the following to persistent-volume-claim.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: persistent-volume-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Then, run the following command:
> kubectl create -f persistent-volume-claim.yaml
persistentvolumeclaim "persistent-volume-claim" created
Let's check the claim and the volume:
> kubectl get pvc
NAME                      STATUS    VOLUME                CAPACITY   ACCESSMODES   AGE
persistent-volume-claim   Bound     persistent-volume-1   1Gi        RWX           27s

> kubectl get pv
NAME                  CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                             REASON    AGE
persistent-volume-1   1Gi        RWX           Retain          Bound     default/persistent-volume-claim             40m
As you can see, the claim and the volume are bound to each other. The final step is to create a pod and assign the claim as a volume. Save the following to shell-pod.yaml:
kind: Pod
apiVersion: v1
metadata:
  name: just-a-shell
  labels:
    name: just-a-shell
spec:
  containers:
  - name: a-shell
    image: ubuntu
    command: ["/bin/bash", "-c", "while true ; do sleep 10 ; done"]
    volumeMounts:
    - mountPath: "/data"
      name: pv
  - name: another-shell
    image: ubuntu
    command: ["/bin/bash", "-c", "while true ; do sleep 10 ; done"]
    volumeMounts:
    - mountPath: "/data"
      name: pv
  volumes:
  - name: pv
    persistentVolumeClaim:
      claimName: persistent-volume-claim
This pod has two containers that use the Ubuntu image, and both run a shell command that just sleeps in an infinite loop. The idea is that the containers will keep running, so we can connect to them later and check their filesystems. The pod mounts our persistent volume claim with a volume name of pv. Both containers mount it into their /data directory.
Let's create the pod and verify that both containers are running:
> kubectl create -f shell-pod.yaml
pod "just-a-shell" created

> kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
just-a-shell   2/2       Running   0          1h
Then, ssh to the node. This is the host whose /tmp/data directory is the pod's volume, mounted as /data into each of the running containers:
> minikube ssh
[boot2docker ASCII-art banner]
Boot2Docker version 1.11.1, build master : 901340f - Fri Jul 1 22:52:19 UTC 2016
Docker version 1.11.1, build 5604cbe
docker@minikube:~$
Inside the node, we can communicate with the containers using Docker commands. Let's look at the last two running containers:
docker@minikube:~$ docker ps -n=2
CONTAINER ID   IMAGE    COMMAND                  CREATED             STATUS             PORTS   NAMES
3c91a46b834a   ubuntu   "/bin/bash -c 'while "   About an hour ago   Up About an hour           k8s_another-shell.b64b3aab_just-a-shell_default_ebf12a22-cee9-11e6-a2ae-4ae3ce72fe94_8c7a8408
f1f9de10fdfd   ubuntu   "/bin/bash -c 'while "   About an hour ago   Up About an hour           k8s_a-shell.1a38381b_just-a-shell_default_ebf12a22-cee9-11e6-a2ae-4ae3ce72fe94_451fa9ec
Then, let's create a file in the /tmp/data directory on the host. It should be visible to both containers via the mounted volume:
docker@minikube:~$ sudo touch /tmp/data/1.txt
Let's execute a shell on one of the containers, verify that the file 1.txt is indeed visible, and create another file, 2.txt:
docker@minikube:~$ docker exec -it 3c91a46b834a /bin/bash
root@just-a-shell:/# ls /data
1.txt
root@just-a-shell:/# touch /data/2.txt
root@just-a-shell:/# exit

Finally, we can run a shell on the other container and verify that both 1.txt and 2.txt are visible:

docker@minikube:~$ docker exec -it f1f9de10fdfd /bin/bash
root@just-a-shell:/# ls /data
1.txt  2.txt