Running a Cassandra cluster in Kubernetes

In this section, we will explore in detail a very large example of configuring a Cassandra cluster to run on a Kubernetes cluster. The full example can be accessed here:

https://github.com/kubernetes/kubernetes/tree/master/examples/storage/cassandra.

First, we'll learn a little bit about Cassandra itself and its idiosyncrasies and then follow a step-by-step procedure to get it running using several of the techniques and strategies we've covered in the previous section.

Quick introduction to Cassandra

Cassandra is a distributed columnar data store. It was designed from the get-go for big data. Cassandra is fast, robust (no single point of failure), highly available, and linearly scalable. It also has multi-data center support. It achieves all this by having a laser focus and carefully crafting both the features it supports and, just as importantly, the features it doesn't support. In a previous company, I ran a Kubernetes cluster that used Cassandra as the main data store for sensor data (about 100 TB). Cassandra allocates the data to a set of nodes (the node ring) based on a distributed hash table (DHT) algorithm. The cluster nodes talk to each other via a gossip protocol and learn quickly about the overall state of the cluster (which nodes joined and which nodes left or are unavailable). Cassandra constantly compacts the data and balances the cluster. The data is typically replicated multiple times for redundancy, robustness, and high availability. From a developer's point of view, Cassandra is very good for time-series data and provides a flexible model where you can specify the consistency level in each query. Writes are also idempotent (a very important feature for a distributed database), which means repeated inserts or updates are safe.

Here is a diagram that shows how a Cassandra cluster is organized; a client can access any node, and the request will be forwarded automatically to the nodes that have the requested data:

[Figure: a Cassandra node ring; client requests are forwarded to the nodes that hold the requested data]

The Cassandra Docker image

Deploying Cassandra on Kubernetes, as opposed to a standalone Cassandra cluster deployment, requires a special Docker image. This is an important step because it means we can use Kubernetes to keep track of our Cassandra pods. The image is available here:

https://github.com/kubernetes/kubernetes/tree/master/examples/storage/cassandra/image.

Here are the essential parts of the Dockerfile.

The image is based on Debian Jessie:

FROM google/debian:jessie

Add and copy the necessary files (the Cassandra JAR, various configuration files, the run script, and the readiness probe script), create a data directory for Cassandra to store its SSTables, and declare it as a volume:

ADD files /
RUN mv /java.list /etc/apt/sources.list.d/java.list \
  && mv /cassandra.list /etc/apt/sources.list.d/cassandra.list \
  && chmod a+rx /run.sh /sbin/dumb-init /ready-probe.sh \
  && mkdir -p /cassandra_data/data \
  && mv /logback.xml /cassandra.yaml /jvm.options /etc/cassandra/

VOLUME ["/cassandra_data"]

Expose important ports for accessing Cassandra and to let Cassandra nodes gossip with each other:

# 7000: intra-node communication
# 7001: TLS intra-node communication
# 7199: JMX
# 9042: CQL
EXPOSE 7000 7001 7199 9042

Finally, here is the command, which uses dumb-init, a simple container init system from Yelp, to eventually run the run.sh script:

CMD ["/sbin/dumb-init", "/bin/bash", "/run.sh"]

Exploring the run.sh script

Following the run.sh script requires some shell skills, but it's worth the effort. Since a Docker container runs only a single command, it is very common for non-trivial applications to use a launcher script that sets up the environment and then starts the actual application. In this case, the image supports several deployment options (stateful set, replication controller, DaemonSet) that we'll cover later, and the run script accommodates all of them by being highly configurable via environment variables.

First, some local variables are set for the Cassandra configuration file at /etc/cassandra/cassandra.yaml. The CASSANDRA_CFG variable will be used in the rest of the script:

set -e
CASSANDRA_CONF_DIR=/etc/cassandra
CASSANDRA_CFG=$CASSANDRA_CONF_DIR/cassandra.yaml

If no CASSANDRA_SEEDS were specified, then set the HOSTNAME, which is used in the stateful set solution:

# we are doing StatefulSet or just setting our seeds
if [ -z "$CASSANDRA_SEEDS" ]; then
  HOSTNAME=$(hostname -f)
fi

Then comes a long list of environment variables with defaults. The syntax ${VAR_NAME:-<default>} uses the environment variable VAR_NAME if it's defined, or the default value otherwise.

A similar syntax, ${VAR_NAME:=<default>}, does the same thing, but also assigns the default value to the environment variable if it is not defined.
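
Here is a quick illustration of the difference in an interactive shell session:

unset FOO
echo "${FOO:-bar}"   # prints bar, but FOO remains unset
echo "${FOO}"        # prints an empty line
echo "${FOO:=bar}"   # prints bar and also assigns bar to FOO
echo "${FOO}"        # prints bar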

Both variations are used here:

CASSANDRA_RPC_ADDRESS="${CASSANDRA_RPC_ADDRESS:-0.0.0.0}"
CASSANDRA_NUM_TOKENS="${CASSANDRA_NUM_TOKENS:-32}"
CASSANDRA_CLUSTER_NAME="${CASSANDRA_CLUSTER_NAME:='Test Cluster'}"
CASSANDRA_LISTEN_ADDRESS=${POD_IP:-$HOSTNAME}
CASSANDRA_BROADCAST_ADDRESS=${POD_IP:-$HOSTNAME}
CASSANDRA_BROADCAST_RPC_ADDRESS=${POD_IP:-$HOSTNAME}
CASSANDRA_DISK_OPTIMIZATION_STRATEGY="${CASSANDRA_DISK_OPTIMIZATION_STRATEGY:-ssd}"
CASSANDRA_MIGRATION_WAIT="${CASSANDRA_MIGRATION_WAIT:-1}"
CASSANDRA_ENDPOINT_SNITCH="${CASSANDRA_ENDPOINT_SNITCH:-SimpleSnitch}"
CASSANDRA_DC="${CASSANDRA_DC}"
CASSANDRA_RACK="${CASSANDRA_RACK}"
CASSANDRA_RING_DELAY="${CASSANDRA_RING_DELAY:-30000}"
CASSANDRA_AUTO_BOOTSTRAP="${CASSANDRA_AUTO_BOOTSTRAP:-true}"
CASSANDRA_SEEDS="${CASSANDRA_SEEDS:false}"
CASSANDRA_SEED_PROVIDER="${CASSANDRA_SEED_PROVIDER:-org.apache.cassandra.locator.SimpleSeedProvider}"
CASSANDRA_AUTO_BOOTSTRAP="${CASSANDRA_AUTO_BOOTSTRAP:false}"

# Turn off JMX auth
CASSANDRA_OPEN_JMX="${CASSANDRA_OPEN_JMX:-false}"
# send GC to STDOUT
CASSANDRA_GC_STDOUT="${CASSANDRA_GC_STDOUT:-false}"

Then comes a section where all the variables are printed to the screen. Let's skip most of it:

echo Starting Cassandra on ${CASSANDRA_LISTEN_ADDRESS}
echo CASSANDRA_CONF_DIR ${CASSANDRA_CONF_DIR}

The next section is very important. By default, Cassandra uses a simple snitch, which is unaware of racks and data centers. This is not optimal when the cluster spans multiple data centers and racks. Cassandra is rack and data center aware and can optimize both for redundancy and high-availability while limiting communication across data centers appropriately:

# if DC and RACK are set, use GossipingPropertyFileSnitch
if [[ $CASSANDRA_DC && $CASSANDRA_RACK ]]; then
  echo "dc=$CASSANDRA_DC" > $CASSANDRA_CONF_DIR/cassandra-rackdc.properties
  echo "rack=$CASSANDRA_RACK" >> $CASSANDRA_CONF_DIR/cassandra-rackdc.properties
  CASSANDRA_ENDPOINT_SNITCH="GossipingPropertyFileSnitch"
fi

Memory management is important and you can control the maximum heap size to ensure Cassandra doesn't start thrashing and swapping to disk:

if [ -n "$CASSANDRA_MAX_HEAP" ]; then
  sed -ri "s/^(#)?-Xmx[0-9]+.*/-Xmx$CASSANDRA_MAX_HEAP/" "$CASSANDRA_CONF_DIR/jvm.options"
  sed -ri "s/^(#)?-Xms[0-9]+.*/-Xms$CASSANDRA_MAX_HEAP/" "$CASSANDRA_CONF_DIR/jvm.options"
fi

if [ -n "$CASSANDRA_REPLACE_NODE" ]; then
   echo "-Dcassandra.replace_address=$CASSANDRA_REPLACE_NODE/" >> "$CASSANDRA_CONF_DIR/jvm.options"
fi

The rack and data center information is stored in a simple Java properties file:

for rackdc in dc rack; do
  var="CASSANDRA_${rackdc^^}"
  val="${!var}"
  if [ "$val" ]; then
    sed -ri 's/^('"$rackdc"'=).*/\1 '"$val"'/' "$CASSANDRA_CONF_DIR/cassandra-rackdc.properties"
  fi
done
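
For example, assuming CASSANDRA_DC=dc1 and CASSANDRA_RACK=rack1 were set, the resulting properties file would look roughly like this:

cat /etc/cassandra/cassandra-rackdc.properties
# dc=dc1
# rack=rack1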

The next section loops over the variables defined earlier, finds the corresponding keys in the cassandra.yaml configuration file, and overwrites them. This ensures that the configuration file is customized on the fly just before Cassandra itself is launched:

for yaml in \
  broadcast_address \
  broadcast_rpc_address \
  cluster_name \
  disk_optimization_strategy \
  endpoint_snitch \
  listen_address \
  num_tokens \
  rpc_address \
  start_rpc \
  key_cache_size_in_mb \
  concurrent_reads \
  concurrent_writes \
  memtable_cleanup_threshold \
  memtable_allocation_type \
  memtable_flush_writers \
  concurrent_compactors \
  compaction_throughput_mb_per_sec \
  counter_cache_size_in_mb \
  internode_compression \
  endpoint_snitch \
  gc_warn_threshold_in_ms \
  listen_interface \
  rpc_interface \
  ; do
  var="CASSANDRA_${yaml^^}"
  val="${!var}"
  if [ "$val" ]; then
    sed -ri 's/^(# )?('"$yaml"':).*/\2 '"$val"'/' "$CASSANDRA_CFG"
  fi
done

echo "auto_bootstrap: ${CASSANDRA_AUTO_BOOTSTRAP}" >> $CASSANDRA_CFG

The next section is all about setting the seeds or seed provider depending on the deployment solution (stateful set or not). There is a little trick for the first pod to bootstrap as its own seed:

# set the seed to itself.  This is only for the first pod, otherwise
# it will be able to get seeds from the seed provider
if [[ $CASSANDRA_SEEDS == 'false' ]]; then
  sed -ri 's/- seeds:.*/- seeds: "'"$POD_IP"'"/' $CASSANDRA_CFG
else # if we have seeds set them.  Probably StatefulSet
  sed -ri 's/- seeds:.*/- seeds: "'"$CASSANDRA_SEEDS"'"/' $CASSANDRA_CFG
fi

sed -ri 's/- class_name: SEED_PROVIDER/- class_name: '"$CASSANDRA_SEED_PROVIDER"'/' $CASSANDRA_CFG

The following section sets up various options for remote management and JMX monitoring. It's critical in complicated distributed systems to have proper administration tools. Cassandra has deep support for the ubiquitous Java Management Extensions (JMX) standard:

# send gc to stdout
if [[ $CASSANDRA_GC_STDOUT == 'true' ]]; then
  sed -ri 's/ -Xloggc:\/var\/log\/cassandra\/gc\.log//' $CASSANDRA_CONF_DIR/cassandra-env.sh
fi

# enable RMI and JMX to work on one port
echo "JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=$POD_IP"" >> $CASSANDRA_CONF_DIR/cassandra-env.sh

# getting WARNING messages with Migration Service
echo "-Dcassandra.migration_task_wait_in_seconds=${CASSANDRA_MIGRATION_WAIT}" >> $CASSANDRA_CONF_DIR/jvm.options
echo "-Dcassandra.ring_delay_ms=${CASSANDRA_RING_DELAY}" >> $CASSANDRA_CONF_DIR/jvm.options

if [[ $CASSANDRA_OPEN_JMX == 'true' ]]; then
  export LOCAL_JMX=no
  sed -ri 's/ -Dcom.sun.management.jmxremote.authenticate=true/ -Dcom.sun.management.jmxremote.authenticate=false/' $CASSANDRA_CONF_DIR/cassandra-env.sh
  sed -ri 's/ -Dcom\.sun\.management\.jmxremote\.password\.file=\/etc\/cassandra\/jmxremote\.password//' $CASSANDRA_CONF_DIR/cassandra-env.sh
fi
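
Once the cluster is up, a quick way to exercise the JMX-based tooling without exposing port 7199 outside the cluster is to run nodetool inside one of the pods (the pod name cassandra-0 assumes the stateful set deployment covered later):

kubectl exec cassandra-0 -- nodetool info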

Finally, the classpath is set to the Cassandra JAR file, and Cassandra itself is launched in the foreground (not daemonized):

export CLASSPATH=/kubernetes-cassandra.jar
cassandra -R -f
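
Before wiring the image into Kubernetes, you can smoke-test it locally with Docker. This is only a rough sketch (a single standalone node; the seed-related variables discussed above may need to be set explicitly for a clean bootstrap):

docker run -d --name cassandra-test \
  -e CASSANDRA_CLUSTER_NAME="Smoke Test" \
  -p 9042:9042 \
  gcr.io/google-samples/cassandra:v11
docker logs -f cassandra-test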

Hooking up Kubernetes and Cassandra

Connecting Kubernetes and Cassandra takes some work because Cassandra was designed to be very self-sufficient, but we want to let it hook into Kubernetes at the right time to provide capabilities such as automatically restarting failed nodes, monitoring, allocating Cassandra pods, and providing a unified view of the Cassandra pods side by side with other pods. Cassandra is a complicated beast and has many knobs to control it. It comes with a cassandra.yaml configuration file, and you can override all the options with environment variables.

Digging into the Cassandra configuration

There are two settings that are particularly relevant: the seed provider and the snitch. The seed provider is responsible for publishing a list of IP addresses (seeds) of nodes in the cluster. Every node that starts running connects to the seeds (there are usually at least three), and if it successfully reaches one of them, they immediately exchange information about all the nodes in the cluster. This information is updated constantly for each node as the nodes gossip with each other.

The default seed provider configured in cassandra.yaml is just a static list of IP addresses; in this case, just the loopback address:

seed_provider:
    - class_name: SEED_PROVIDER
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "127.0.0.1" 

The other important setting is the snitch. It has two roles:

  • It teaches Cassandra enough about your network topology to route requests efficiently.
  • It allows Cassandra to spread replicas around your cluster to avoid correlated failures. It does this by grouping machines into data centers and racks. Cassandra will do its best not to have more than one replica on the same rack (which may not actually be a physical location).

Cassandra comes preloaded with several snitch classes, but none of them are Kubernetes-aware. The default is SimpleSnitch, but it can be overridden:

# You can use a custom Snitch by setting this to the full class
# name of the snitch, which will be assumed to be on your classpath.
endpoint_snitch: SimpleSnitch

The custom seed provider

When running Cassandra nodes as pods in Kubernetes, Kubernetes may move pods around, including pods that act as seeds. To accommodate that, a Cassandra seed provider needs to interact with the Kubernetes API server.

Here is a short snippet from the custom KubernetesSeedProvider Java class that implements the Cassandra SeedProvider API:

public class KubernetesSeedProvider implements SeedProvider {
   ...
    /**
     * Call kubernetes API to collect a list of seed providers
     * @return list of seed providers
     */
    public List<InetAddress> getSeeds() {
        String host = getEnvOrDefault("KUBERNETES_PORT_443_TCP_ADDR", "kubernetes.default.svc.cluster.local");
        String port = getEnvOrDefault("KUBERNETES_PORT_443_TCP_PORT", "443");
        String serviceName = getEnvOrDefault("CASSANDRA_SERVICE", "cassandra");
        String podNamespace = getEnvOrDefault("POD_NAMESPACE", "default");
        String path = String.format("/api/v1/namespaces/%s/endpoints/", podNamespace);
        String seedSizeVar = getEnvOrDefault("CASSANDRA_SERVICE_NUM_SEEDS", "8");
        Integer seedSize = Integer.valueOf(seedSizeVar);
        String accountToken = getEnvOrDefault("K8S_ACCOUNT_TOKEN", "/var/run/secrets/kubernetes.io/serviceaccount/token");
        
        List<InetAddress> seeds = new ArrayList<InetAddress>();
        try {
            String token = getServiceAccountToken(accountToken);

            SSLContext ctx = SSLContext.getInstance("SSL");
            ctx.init(null, trustAll, new SecureRandom());

            String PROTO = "https://";
            URL url = new URL(PROTO + host + ":" + port + path + serviceName);
            logger.info("Getting endpoints from " + url);
            HttpsURLConnection conn = (HttpsURLConnection)url.openConnection();

            conn.setSSLSocketFactory(ctx.getSocketFactory());
            conn.addRequestProperty("Authorization", "Bearer " + token);
            ObjectMapper mapper = new ObjectMapper();
            Endpoints endpoints = mapper.readValue(conn.getInputStream(), Endpoints.class);
            ...
        }
        ...

        return Collections.unmodifiableList(seeds);
    }
}
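
You can see the same endpoint data that the seed provider consumes by querying the API server through kubectl (assuming the headless service described next is named cassandra and lives in the default namespace):

kubectl get endpoints cassandra -n default -o yaml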

Creating a Cassandra headless service

The role of the headless service is to allow clients in the Kubernetes cluster to connect to the Cassandra cluster through a standard Kubernetes service instead of keeping track of the network identities of the nodes or putting a dedicated load balancer in front of all the nodes. Kubernetes provides all that out of the box through its services.

Here is the configuration file:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
    - port: 9042
  selector:
    app: cassandra

The app: cassandra label groups all the pods that participate in the service. Kubernetes will create endpoint records, and the DNS will return a record for discovery. The clusterIP is None, which means the service is headless and Kubernetes will not do any load balancing or proxying. This is important because Cassandra nodes communicate with each other directly.

Port 9042 is used by Cassandra to serve CQL requests. Those can be queries, inserts/updates (it's always an upsert with Cassandra), or deletes.
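
To create the headless service and confirm that it has no cluster IP, something like the following should work (assuming the definition above is saved as cassandra-service.yaml):

kubectl create -f cassandra-service.yaml
kubectl get svc cassandra
# The CLUSTER-IP column should show None because the service is headless.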

Using StatefulSet to create the Cassandra cluster

Declaring a stateful set is not trivial. It is arguably the most complex Kubernetes resource. It has a lot of moving parts: standard metadata, the stateful set spec, the pod template (which is often pretty complex itself), and volume claim templates.

Dissecting the stateful set configuration file

Let's go methodically over this example stateful set configuration file that declares a three-node Cassandra cluster.

Here is the basic metadata. Note the apiVersion string starting with apps/:

apiVersion: "apps/v1beta1"
kind: StatefulSet
metadata:
  name: cassandra

The stateful set spec defines the headless service name, the number of pods in the stateful set (the replicas field), and the pod template (explained later):

spec:
  serviceName: cassandra
  replicas: 3 
  template: …

The term replicas for the pods is an unfortunate choice because the pods are NOT replicas of each other. They share the same pod template, but they have a unique identity and they are responsible for different subsets of the state in general. This is even more confusing in the case of Cassandra, which uses the same term, replicas, to refer to groups of nodes that redundantly duplicate some subset of the state (but are not identical, because each can manage additional state too). I opened a GitHub issue with the Kubernetes project to change the term from replicas to members:

https://github.com/kubernetes/kubernetes.github.io/issues/2103.

The pod template contains a single container based on the custom Cassandra image. Here is the pod template, with the app: cassandra label:

template:
  metadata:
    labels:
      app: cassandra
  spec:
    containers: …  

The container spec has multiple important parts. It starts with a name and the image we looked at earlier:

containers:
  - name: cassandra
    image: gcr.io/google-samples/cassandra:v11
    imagePullPolicy: Always

Then it defines multiple container ports needed for external and internal communication by Cassandra nodes:

    ports:
    - containerPort: 7000
      name: intra-node
    - containerPort: 7001
      name: tls-intra-node
    - containerPort: 7199
      name: jmx
    - containerPort: 9042
      name: cql

The resources section specifies the CPU and memory needed by the container. This is critical because the storage management layer should never become a performance bottleneck due to CPU or memory starvation:

resources:
  limits:
    cpu: "500m"
    memory: 1Gi
  requests:
    cpu: "500m"
    memory: 1Gi

Cassandra needs access to IPC, which the container requests through the security context's capabilities:

securityContext:
  capabilities:
    add:
      - IPC_LOCK

The env section specifies environment variables that will be available inside the container. The following is a partial list of the necessary variables. The CASSANDRA_SEEDS variable is set to the headless service, so a Cassandra node can talk to seeds on startup and discover the whole cluster. Note that in this configuration, we don't use the special Kubernetes seed provider. The POD_IP is interesting because it utilizes the Downward API to populate its value via the field reference to status.podIP:

env:
  - name: MAX_HEAP_SIZE
    value: 512M
  - name: CASSANDRA_SEEDS
    value: "cassandra-0.cassandra.default.svc.cluster.local"
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP

The container also has a readiness probe to ensure the Cassandra node doesn't receive requests before it's fully online:

readinessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    - /ready-probe.sh
  initialDelaySeconds: 15
  timeoutSeconds: 5
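
The image ships with the /ready-probe.sh script referenced here. A minimal sketch of what such a probe can look like (illustrative only, not the exact script) is to report ready only when nodetool shows the node itself as Up/Normal:

#!/bin/bash
# Illustrative readiness check: succeed only if this node appears as UN (Up/Normal)
# in nodetool status; POD_IP is the pod's own IP, injected via the Downward API.
if nodetool status | grep -E "^UN\s+${POD_IP}" > /dev/null; then
  exit 0
else
  exit 1
fi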

Cassandra needs to read and write the data of course. The cassandra-data volume mount is where it's at:

volumeMounts:
- name: cassandra-data
  mountPath: /cassandra_data

That's it for the container spec. The last part is the volume claim template. In this case, dynamic provisioning is used. It's highly recommended to use SSD drives for Cassandra storage, and especially for its journal. The requested storage in this example is 1 Gi. I discovered through experimentation that 1-2 TB is ideal for a single Cassandra node. The reason is that Cassandra does a lot of data shuffling under the covers, compacting and rebalancing the data. If a node leaves the cluster or a new one joins, you have to wait until the data is rebalanced before the data from the node that left is re-distributed or the new node is populated. Note that Cassandra needs a lot of disk space to do all this shuffling. It is recommended to have 50% free disk space. When you consider that you also need replication (typically 3x), the required storage space can be 6x your data size. You can get by with 30% free space if you're adventurous, and maybe use just 2x replication depending on your use case. But don't go below 10% free disk space, even on a single node. I learned the hard way that Cassandra will simply get stuck, unable to compact and rebalance such nodes without extreme measures.
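
A quick back-of-the-envelope sizing sketch based on this guidance (the numbers are examples only):

DATA_SIZE_TB=10          # logical data size
REPLICATION_FACTOR=3     # typical replication
HEADROOM_FACTOR=2        # keep ~50% of each disk free for compaction and rebalancing
echo "Raw storage needed: $((DATA_SIZE_TB * REPLICATION_FACTOR * HEADROOM_FACTOR)) TB"
# Raw storage needed: 60 TB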

The access mode is of course ReadWriteOnce:

volumeClaimTemplates:
  - metadata:
      name: cassandra-data
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

When deploying a stateful set, Kubernetes creates the pods in order, according to their index numbers. When scaling up or down, it also does so in order. For Cassandra, this is not important because it can handle nodes joining or leaving the cluster in any order. When a Cassandra pod is destroyed, its persistent volume remains. If a pod with the same index is created later, the original persistent volume will be mounted into it. This stable connection between a particular pod and its storage enables Cassandra to manage the state properly.
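
To try it out, something along these lines should work (assuming the configuration is saved as cassandra-statefulset.yaml):

# Create the stateful set and watch the pods come up in order (cassandra-0, cassandra-1, ...):
kubectl create -f cassandra-statefulset.yaml
kubectl get pods -l app=cassandra -w

# Once the pods are ready, verify that the Cassandra ring formed correctly:
kubectl exec cassandra-0 -- nodetool status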

Using a replication controller to distribute Cassandra

A stateful set is great, but as mentioned earlier, Cassandra is already a sophisticated distributed database. It has a lot of mechanisms for automatically distributing, balancing, and replicating the data around the cluster. These mechanisms are not optimized for working with network persistent storage. Cassandra was designed to work with data stored directly on the nodes. When a node dies, Cassandra can recover because redundant copies of the data are stored on other nodes. Let's look at a different way to deploy Cassandra on a Kubernetes cluster, which is more aligned with Cassandra's semantics. Another benefit of this approach is that if you have an existing Kubernetes cluster, you don't have to upgrade it to the latest and greatest just to use a stateful set.

We will still use the headless service, but instead of a stateful set we'll use a regular replication controller. There are some important differences:

  • Replication controller instead of a stateful set
  • Storage on the node the pod is scheduled to
  • The custom Kubernetes seed provider class is used

Dissecting the replication controller configuration file

The metadata is pretty minimal, with just a name (labels are not required):

apiVersion: v1
kind: ReplicationController
metadata:
  name: cassandra
  # The labels will be applied automatically
  # from the labels in the pod template, if not set
  # labels:
    # app: Cassandra

The spec specifies the number of replicas:

spec:
  replicas: 3
  # The selector will be applied automatically
  # from the labels in the pod template, if not set.
  # selector:
      # app: Cassandra
  

The pod template's metadata is where the app: cassandra label is specified. The replication controller will keep track of the pods and make sure that there are exactly three pods with that label:

  template:
    metadata:
      labels:
        app: cassandra

The pod template's spec describes the list of containers. In this case, there is just one container. It uses the same Cassandra Docker image; the container is named cassandra and runs the run.sh script:

    spec:
      containers:

        - command:
            - /run.sh
          image: gcr.io/google-samples/cassandra:v11
          name: cassandra

The resources section just specifies a CPU limit of 0.5 units in this example:

          resources:
            limits:
              cpu: 0.5

The environment section is a little different. CASSANDRA_SEED_PROVIDER specifies the custom Kubernetes seed provider class we examined earlier. Another new addition here is POD_NAMESPACE, which uses the Downward API again to fetch its value from the metadata:

          env:
            - name: MAX_HEAP_SIZE
              value: 512M
            - name: HEAP_NEWSIZE
              value: 100M
            - name: CASSANDRA_SEED_PROVIDER
              value: "io.k8s.cassandra.KubernetesSeedProvider"
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP

The ports section is identical, exposing the intra-node communication ports (7000 and 7001), the JMX port 7199 used by external tools such as Cassandra OpsCenter to communicate with the Cassandra cluster, and of course the CQL port 9042, through which clients communicate with the cluster:

          ports:
            - containerPort: 7000
              name: intra-node
            - containerPort: 7001
              name: tls-intra-node
            - containerPort: 7199
              name: jmx
            - containerPort: 9042
              name: cql

Once again, the volume is mounted into /cassandra_data. This is important because the Cassandra image, configured as shown earlier, expects its data directory to be at that exact path. Cassandra doesn't care about the backing storage (although you, as the cluster administrator, should). Cassandra will just read and write using filesystem calls:

          volumeMounts:
            - mountPath: /cassandra_data
              name: data

The volumes section is the biggest difference from the stateful set solution. A stateful set uses persistent storage claims to connect a particular pod with stable identity to a particular persistent volume. The replication controller solution just uses an emptyDir on the hosting node:

      volumes:
        - name: data
          emptyDir: {}

This has many ramifications. You have to provision enough storage on each node. If a Cassandra pod dies, its storage goes away. Even if the pod is restarted on the same physical (or virtual) machine, the data on disk will be lost, because emptyDir is deleted once its pod is removed. Note that container restarts are OK, because emptyDir survives container crashes. So, what happens when the pod dies? The replication controller will start a new pod with empty data. Cassandra will detect that a new node was added to the cluster, assign it some portion of the data, and start rebalancing automatically by moving data from other nodes. This is where Cassandra shines. It constantly compacts, rebalances, and distributes the data evenly across the cluster. It will just figure out what to do on your behalf.
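
To deploy this variant and see where the pods land, something like this should work (the file name is an assumption):

kubectl create -f cassandra-rc.yaml
# The -o wide output shows which Kubernetes node each Cassandra pod was scheduled to:
kubectl get pods -l app=cassandra -o wide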

Assigning pods to nodes

The main problem with the replication controller approach is that multiple pods can get scheduled on the same Kubernetes node. What if you have a replication factor of three and all three pods that are responsible for some range of the keyspace are scheduled to the same Kubernetes node? First, all read or write requests for that range of keys will go to the same node, creating more pressure. But even worse, we just lost our redundancy. We have a single point of failure (SPOF). If that node dies, the replication controller will happily start three new pods on some other Kubernetes node, but none of them will have the data, and no other Cassandra node in the cluster (that is, the other pods) will have the data to copy from.

This can be solved using a Kubernetes 1.4 alpha feature called anti-affinity. When assigning pods to nodes, a pod can be annotated so that the scheduler will not schedule it onto a node that already has a pod with a particular set of labels. Here is how to ensure that at most a single Cassandra pod will be assigned to each node:

annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "app",
                    "operator": "NotIn",
                    "values": ["cassandra"]
                  }
                ]
              }
            ]
          }
        }
      }

Using DaemonSet to distribute Cassandra

A better solution to the problem of assigning Cassandra pods to different nodes is to use a DaemonSet. A DaemonSet has a pod template, like a replication controller, but it also has a node selector that determines on which nodes to schedule its pods. It doesn't have a set number of replicas; it just schedules a pod on each node that matches its selector. The simplest case is to schedule a pod on each node in the Kubernetes cluster, but the node selector can also use match expressions against labels to deploy to a particular subset of nodes. Let's create a DaemonSet for deploying our Cassandra cluster onto the Kubernetes cluster.

DaemonSet is still a beta resource:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: cassandra-daemonset

The spec of the DaemonSet contains a regular pod template. The nodeSelector section is where the magic happens; it ensures that exactly one pod will always be scheduled to each node with a label of app: cassandra:

spec:
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      # Filter only nodes with the label "app: cassandra":
      nodeSelector:
        app: cassandra
      containers:

The rest is identical to the replication controller. Note that nodeSelector is expected to be deprecated in favor of affinity, but it's not clear when that will happen.
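
To roll it out, you would label the nodes that should host Cassandra and then create the DaemonSet; for example (the node names and the file name are assumptions):

# Label the nodes that should each run one Cassandra pod:
kubectl label nodes node-1 node-2 node-3 app=cassandra
# Create the DaemonSet (assuming the manifest above is saved as cassandra-daemonset.yaml):
kubectl create -f cassandra-daemonset.yaml
kubectl get pods -l app=cassandra -o wide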
