Let's go methodically over this example stateful set configuration file that declares a three-node Cassandra cluster.
Here is the basic metadata. Note that the apiVersion string is apps/v1 (StatefulSet became generally available in Kubernetes 1.9):
apiVersion: "apps/v1"
kind: StatefulSet
metadata:
  name: cassandra
The stateful set spec defines the headless service name, the number of pods in the stateful set (via the replicas field), and the pod template (explained later):
spec:
  serviceName: cassandra
  replicas: 3
  template:
    ...
The term replicas for the pods is an unfortunate choice because the pods are not replicas of each other. They share the same pod template, but they have a unique identity, and in general each is responsible for a different subset of the state. This is even more confusing in the case of Cassandra, which uses the same term, replicas, to refer to groups of nodes that redundantly duplicate some subset of the state (but are not identical, because each can manage additional state too). I opened a GitHub issue with the Kubernetes project to change the term from replicas to members:
https://github.com/kubernetes/kubernetes.github.io/issues/2103
The pod template contains a single container based on the custom Cassandra image. Here is the pod template with the app: cassandra label:
template:
  metadata:
    labels:
      app: cassandra
  spec:
    containers:
      ...
The container spec has multiple important parts. It starts with a name and the image we looked at earlier:
containers:
  - name: cassandra
    image: gcr.io/google-samples/cassandra:v12
    imagePullPolicy: Always
Then, it defines multiple container ports needed for external and internal communication by Cassandra nodes:
ports:
  - containerPort: 7000
    name: intra-node
  - containerPort: 7001
    name: tls-intra-node
  - containerPort: 7199
    name: jmx
  - containerPort: 9042
    name: cql
The resources section specifies the CPU and memory needed by the container. This is critical because the storage management layer should never be a performance bottleneck due to CPU or memory.
resources:
  limits:
    cpu: "500m"
    memory: 1Gi
  requests:
    cpu: "500m"
    memory: 1Gi
Cassandra needs access to IPC, which the container requests through the security context's capabilities:
securityContext:
  capabilities:
    add:
      - IPC_LOCK
The env section specifies environment variables that will be available inside the container. The following is a partial list of the necessary variables. The CASSANDRA_SEEDS variable is set to the headless service, so a Cassandra node can talk to seeds on startup and discover the whole cluster. Note that in this configuration we don't use the special Kubernetes seed provider. POD_IP is interesting because it utilizes the Downward API to populate its value via the field reference to status.podIP:
env:
  - name: MAX_HEAP_SIZE
    value: 512M
  - name: CASSANDRA_SEEDS
    value: "cassandra-0.cassandra.default.svc.cluster.local"
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
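To see why POD_IP matters, consider what a container entrypoint typically does with it: Cassandra must know which address to bind and advertise, and that address is only known at runtime. The following Python sketch is hypothetical (the actual image uses a shell entrypoint, and `render_cassandra_overrides` is a made-up name); it only illustrates how a value injected via the Downward API would be consumed:

```python
import os

def render_cassandra_overrides(env=os.environ):
    """Derive the address settings an entrypoint script would compute
    from the Downward API. POD_IP is injected by Kubernetes via the
    status.podIP field reference; the option names below follow
    cassandra.yaml, but the rendering itself is a hypothetical sketch."""
    pod_ip = env["POD_IP"]
    return {
        "listen_address": pod_ip,           # address other nodes connect to
        "broadcast_rpc_address": pod_ip,    # address advertised to clients
    }

# Simulating the environment Kubernetes would inject:
print(render_cassandra_overrides({"POD_IP": "10.244.1.7"}))
# {'listen_address': '10.244.1.7', 'broadcast_rpc_address': '10.244.1.7'}
```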
The container has a readiness probe, too, to ensure the Cassandra node doesn't receive requests before it's fully online:
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - /ready-probe.sh
  initialDelaySeconds: 15
  timeoutSeconds: 5
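The actual /ready-probe.sh is a shell script baked into the image, but the core check it needs to make can be sketched as follows: ask nodetool status for the ring state and report ready only if this node's line shows UN (Up/Normal). The Python below is an illustration of that logic, not the script itself:

```python
def node_is_ready(nodetool_status: str, pod_ip: str) -> bool:
    """Return True if the line for pod_ip in `nodetool status` output
    reports state UN (Up/Normal). This mirrors the kind of check a
    readiness probe performs; the real probe is a shell script."""
    for line in nodetool_status.splitlines():
        fields = line.split()
        if pod_ip in fields:
            return fields[0] == "UN"
    return False

sample = """Datacenter: dc1
Status=Up/Down | State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns   Host ID  Rack
UN  10.244.1.7  103 KiB  32      66.7%  abc123   rack1
UJ  10.244.2.9  85 KiB   32      ?      def456   rack1
"""
print(node_is_ready(sample, "10.244.1.7"))  # True: node is Up/Normal
print(node_is_ready(sample, "10.244.2.9"))  # False: node is still joining
```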
Cassandra needs to read and write the data, of course. The cassandra-data volume mount is where it's at:
volumeMounts:
  - name: cassandra-data
    mountPath: /cassandra_data
That's it for the container spec. The last part is the volume claim template. In this case, dynamic provisioning is used. It's highly recommended to use SSD drives for Cassandra storage, and especially for its journal. The requested storage in this example is just 1 Gi; I discovered through experimentation that 1-2 TB is ideal for a single Cassandra node.

The reason is that Cassandra does a lot of data shuffling under the covers, compacting and rebalancing the data. If a node leaves the cluster or a new one joins, you have to wait until the data is properly rebalanced: the data from a departed node must be redistributed, and a new node must be populated. Cassandra needs a lot of free disk space to do all this shuffling. It is recommended to keep 50% of your disk space free. When you consider that you also need replication (typically 3x), the required storage space can be 6x your data size. You can get by with 30% free space if you're adventurous, and maybe use just 2x replication depending on your use case. But don't go below 10% free disk space, even on a single node. I learned the hard way that Cassandra will simply get stuck and be unable to compact and rebalance such nodes without extreme measures.
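The sizing rule of thumb above can be made concrete with a little arithmetic. This is a minimal sketch, assuming the numbers from the text (replication factor and free-space fraction are the knobs you'd tune for your own use case):

```python
def required_cluster_storage(data_size_tb: float,
                             replication_factor: int = 3,
                             free_fraction: float = 0.5) -> float:
    """Raw disk capacity (TB) needed across the cluster so that every
    replica's disk still keeps `free_fraction` of its space free for
    compaction and rebalancing. Defaults follow the rule of thumb in
    the text: 3x replication, 50% free space."""
    return data_size_tb * replication_factor / (1 - free_fraction)

# 1 TB of logical data, 3x replication, 50% free space -> 6x, i.e. 6 TB raw
print(required_cluster_storage(1.0))  # 6.0
# The adventurous variant: 2x replication, 30% free space
print(round(required_cluster_storage(1.0, 2, 0.3), 2))
```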
The access mode is, of course, ReadWriteOnce:
volumeClaimTemplates:
  - metadata:
      name: cassandra-data
      annotations:
        volume.beta.kubernetes.io/storage-class: fast
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
When deploying a stateful set, Kubernetes creates the pods in order, by their index numbers. When scaling up or down, it also does so in order. For Cassandra, this ordering is not important because it can handle nodes joining or leaving the cluster in any order. When a Cassandra pod is destroyed, its persistent volume remains. If a pod with the same index is created later, the original persistent volume is mounted into it. This stable connection between a particular pod and its storage enables Cassandra to manage the state properly.
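That stable identity is also visible in DNS: each pod gets an ordinal-based hostname under the headless service, which is exactly where the CASSANDRA_SEEDS value above comes from. A small sketch, assuming the default namespace and the default cluster.local domain:

```python
def stateful_set_dns_names(name: str, service: str, replicas: int,
                           namespace: str = "default",
                           cluster_domain: str = "cluster.local"):
    """Stable per-pod DNS names provided by a StatefulSet's headless
    service: <name>-<ordinal>.<service>.<namespace>.svc.<domain>."""
    return [f"{name}-{i}.{service}.{namespace}.svc.{cluster_domain}"
            for i in range(replicas)]

for host in stateful_set_dns_names("cassandra", "cassandra", 3):
    print(host)
# cassandra-0.cassandra.default.svc.cluster.local
# cassandra-1.cassandra.default.svc.cluster.local
# cassandra-2.cassandra.default.svc.cluster.local
```

Note that the first entry matches the CASSANDRA_SEEDS value in the env section: the pod with index 0 always exists first, so it makes a reliable seed address.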