© Raul Estrada and Isaac Ruiz 2016

Raul Estrada and Isaac Ruiz, Big Data SMACK, 10.1007/978-1-4842-2175-4_5

5. Storage: Apache Cassandra

Raul Estrada and Isaac Ruiz1

(1)Mexico City, Mexico

Congratulations! You are almost halfway through this journey. You are at the point where it is necessary to meet the component responsible for information persistence; the sometimes neglected “data layer” will take on a new dimension when you have finished this chapter. It’s time to meet Apache Cassandra, a NoSQL database that provides high availability and scalability without compromising performance.

Note

We suggest that you have your favorite terminal ready to follow the exercises. This will help you become familiar with the tools faster.

Once Upon a Time...

Before you start, let’s do a little time traveling to ancient Greece to meet the other Cassandra. In Greek mythology, there was a priestess who was punished for her treachery toward the god Apollo. She asked for the gift of prophecy in exchange for a carnal encounter; however, she failed to fulfill her part of the deal. For this, she received her punishment: she would have the gift of prophecy, but no one would ever believe her prophecies. A real tragedy. This priestess’s name was Cassandra.

Perhaps the modern Cassandra, the Apache project, has come to vindicate the ancient one. With the modern Cassandra, it is probably best to believe what she tells you, and do not be afraid to ask.

Modern Cassandra

Modern Cassandra represents the persistence layer in our reference implementation.

First, let’s have a short overview of NoSQL, and then continue to the installation and learn how to integrate Cassandra on the map.

NoSQL Everywhere

Fifteen years ago, nobody imagined the amount of information that a modern application would have to manage; the Web was only beginning to take the shape it has today. Computer systems were becoming more powerful, defying Moore’s law,1 not only in large data centers but also in desktop computers, warning us that the free lunch is over.2

In this scenario, those who drove the change had to be innovative in the way that they looked for alternatives to a relational database management system (RDBMS). Google, Facebook, and Twitter had to experiment with creating their own data models—each with different architectures—gradually building what is known today as NoSQL.

The diversity of NoSQL tools is so broad that it is difficult to classify them. But there was one audacious guy who did, and he proposed that a NoSQL tool must meet the following characteristics:

  • Non-relational

  • Open source

  • Cluster-friendly

  • Twenty-first-century web

  • Schemaless

That guy was Martin Fowler, and he presents this classification in the book he wrote with Pramod J. Sadalage, NoSQL Distilled (Addison-Wesley Professional, 2012).3

At the GOTO conference in 2013,4 Fowler gave the presentation “Introduction to NoSQL,” a very educational talk that is well worth checking out.

But how is it that NoSQL improves data access performance over a traditional RDBMS? It has much to do with the way NoSQL handles and abstracts data; that is, how it defines its data model.

Following Martin Fowler’s comments, if you use this criterion, you can classify NoSQL into distinct types according to the data model used (as shown in Figure 5-1): document, column-family, graph, and key-value.

A420086_1_En_5_Fig1_HTML.jpg
Figure 5-1. NoSQL classification according to the data model used

Another NoSQL-specific feature is that the data model does not require a data schema, which allows a greater degree of freedom and faster data access.

As seen in Figure 5-2, the data models can also be grouped as aggregate-oriented and schemaless.

A420086_1_En_5_Fig2_HTML.jpg
Figure 5-2. Another data model classification

The amount of data to be handled and the need for a mechanism to ease development are indicators of when to use NoSQL. Martin Fowler recommends these two major criteria for deciding when to start using NoSQL, as shown in Figure 5-3.

A420086_1_En_5_Fig3_HTML.jpg
Figure 5-3. A simple way to determine when to use NoSQL

Finally, you must remember that there is no silver bullet; despite the rise of NoSQL, you must be cautious in choosing when to use it.

The Memory Value

Many of the advantages of NoSQL are based on the fact that a lot of data management is performed in memory, which gives excellent performance to data access.

Note

The processing performance of main memory is 800 times faster than an HDD, 40 times faster than a common SSD, and seven times faster than the fastest SSD.5

Surely, you already know that memory access is faster than disk access, but with this speed, you want to do everything in memory. Fortunately, all of these details are abstracted by Cassandra, and you just have to worry about what and how to store.

Key-Value and Column

There are two particular data models to discuss: key-value and column-family. It is common practice for a NoSQL database to use several data models to increase its performance; Cassandra makes use of both the key-value and column-family data models.

Key-Value

The simplest data model is key-value. You have probably already used this paradigm within a programming language. In a nutshell, it is a hash table.

Given a key, you can access the content (value), as demonstrated in Figure 5-4.

A420086_1_En_5_Fig4_HTML.jpg
Figure 5-4. You can imagine this data model as a big hash. The key allows access to certain <content>. This <content> can be different types of data, which makes it a much more flexible structure.
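The hash-table analogy can be made concrete with a few lines of Scala; the standard library’s mutable Map stands in for the key-value store here (plain in-memory code, not Cassandra itself, and the keys are invented for illustration):

```scala
import scala.collection.mutable

// A key-value store behaves like a big hash table:
// given a key, you get back the stored value.
val store = mutable.Map[String, String]()

store("user:1745") = "john smith"    // write: key -> value
store("user:1744") = "john doe"

println(store("user:1745"))          // prints: john smith
println(store.get("user:9999"))      // prints: None (missing key)
```

The flexibility of the model comes from the fact that the value can be any type of content, as the figure suggests.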

Column-Family

An important part of this data model is that storage and fetch operations are made from columns and not from rows. Also, much of the handling of this data model is done in memory, and you already know how important that is.

What’s a column? It is a tuple containing key-value pairs. In several NoSQL databases, this tuple has three parts: a name/key, a value, and a timestamp.

In this model, several columns (the family) are grouped by a key called a row-key. Figure 5-5 shows this relationship.

A420086_1_En_5_Fig5_HTML.jpg
Figure 5-5. A key (row-key) can access the column family. This group exemplifies the data model column-family

The main advantage of this model is that it substantially improves write operations, which improves their performance in distributed environments.6
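The column-family structure described above can be sketched with plain Scala types; this is only a conceptual model (name, value, timestamp tuples grouped under a row-key, with illustrative data), not Cassandra’s actual storage format:

```scala
// A column is a tuple: a name/key, a value, and a timestamp.
case class Column(name: String, value: String, timestamp: Long)

// A column family groups several columns under a row-key.
val columnFamily = Map(
  "1745" -> Seq(Column("fname", "john", 1458590400L),
                Column("lname", "smith", 1458590400L)),
  "1744" -> Seq(Column("fname", "john", 1458590401L),
                Column("lname", "doe", 1458590401L))
)

// The row-key gives access to its whole family of columns.
val row = columnFamily("1745")
println(row.map(c => s"${c.name}=${c.value}").mkString(", "))
// prints: fname=john, lname=smith
```

Because the unit of storage is the column rather than the row, a write only needs to append columns under a row-key, which is what makes the model fast for write-heavy, distributed workloads.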

Why Cassandra?

Cassandra implements “no single points of failure,” which is achieved with redundant nodes and data. Unlike legacy systems based on master-slave architectures, Cassandra implements a masterless “ring” architecture (see Figure 5-6).

A420086_1_En_5_Fig6_HTML.jpg
Figure 5-6. When all nodes have the same role, it is much easier to maintain data redundancy and replication, which helps maintain the availability of the data

With this architecture, all nodes have an identical role: there is no master node. All nodes communicate with each other using a scalable and distributed protocol called gossip.7

This architecture, together with this protocol, has no single point of failure. It offers true continuous availability.
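To build intuition for gossip-style dissemination, here is a toy simulation (not Cassandra’s actual protocol implementation): each node that knows a piece of state pushes it to one randomly chosen peer per round, and with no master coordinating, the information reaches every node:

```scala
import scala.util.Random

// Toy gossip: every node that already knows the rumor
// pushes it to one randomly chosen peer per round.
val nodes = (0 until 8).toSet
var informed = Set(0)          // only node 0 knows the rumor at first
val rng = new Random(42)       // fixed seed for reproducibility
var rounds = 0

while (informed != nodes) {
  informed = informed ++
    informed.map(_ => nodes.toVector(rng.nextInt(nodes.size)))
  rounds += 1
}
println(s"All ${nodes.size} nodes informed after $rounds rounds")
```

In a real gossip protocol, nodes exchange versioned state about themselves and about other nodes, but the key property is the same: dissemination spreads epidemically, so no single node is indispensable.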

The Data Model

At this point, you can say that the Cassandra data model is based primarily on managing columns. As mentioned earlier, some NoSQL combine multiple data models, as is the case with Cassandra (see Figure 5-7).

A420086_1_En_5_Fig7_HTML.jpg
Figure 5-7. Cassandra uses a combined data model; the key-value model is used to store and retrieve the column families

Cassandra has some similarity to an RDBMS; these similarities facilitate use and adoption, although you must remember that they do not work the same way.

Table 5-1 provides some comparisons that help us better understand Cassandra’s concepts.

Table 5-1. Cassandra Data Model and RDBMS Equivalences

Cassandra            Definition                        RDBMS Equivalent
Schema/Keyspace      A collection of column families.  Schema/database
Table/Column-Family  A set of rows.                    Table
Row                  An ordered set of columns.        Row
Column               A key/value pair and timestamp.   Column (name, value)

Figure 5-8 illustrates the relationships among these concepts.

A420086_1_En_5_Fig8_HTML.jpg
Figure 5-8. Relationships among column, row, column-family, and keyspace

Cassandra 101

Installation

This section explains how to install Apache Cassandra on a local machine. The following steps were performed on a Linux machine. At the time of this writing, the stable version is 3.4, released on March 8, 2016.

Prerequisites

Apache Cassandra requires Java version 7 or 8, preferably the Oracle/Sun distribution. The documentation indicates that it is also compatible with OpenJDK, Zing, and IBM distributions.

File Download

The first step is to download the compressed distribution file from http://cassandra.apache.org/download/ .

On the Apache Cassandra project home page (see Figure 5-9 and http://cassandra.apache.org ), the project logo reminds us of the seeing ability of the mythological character.

A420086_1_En_5_Fig9_HTML.jpg
Figure 5-9. Cassandra project home page

Locate the following file:

apache-cassandra-3.4-bin.tar.gz

It’s important to validate the file’s integrity. You do not want it to fail while it’s running. This particular file has the following values for validating its integrity:

[MD5] e9f490211812b7db782fed09f20c5bb0
[SHA1]7d010b8cc92d5354f384b646b302407ab90be1f0

This validation is easy to make. Any flavor of Linux provides the md5sum and sha1sum commands, as shown in the following:

%> md5sum apache-cassandra-3.4-bin.tar.gz
e9f490211812b7db782fed09f20c5bb0  apache-cassandra-3.4-bin.tar.gz


%> sha1sum apache-cassandra-3.4-bin.tar.gz
7d010b8cc92d5354f384b646b302407ab90be1f0  apache-cassandra-3.4-bin.tar.gz


%> ls -lrt apache-cassandra-3.4-bin.tar.gz
-rw-r--r--. 1 rugi rugi 34083682 Mar  7 22:06 apache-cassandra-3.4-bin.tar.gz

Once you have validated the file’s integrity, you can unzip it and continue.

Start

Starting Apache Cassandra is easy; you only execute the following:

./cassandra -f
INFO  18:06:58 Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
INFO  18:06:58 Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO  18:07:07 Scheduling approximate time-check task with a precision of 10 milliseconds
INFO  18:07:07 Created default superuser role 'cassandra'

With this command, your Apache Cassandra server is ready to receive requests. You must always keep in mind that Apache Cassandra runs with a client-server approach: you have launched the server; the server is responsible for receiving requests from clients and giving them answers. So now you need to validate that clients can send requests.

The next step is to use the CLI validation tool, an Apache Cassandra client.

Validation

Now, let’s execute the CLI tool.

%> ./cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh>

The first step is to create a keyspace, analogous to defining the database itself in a relational system.

%>CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Once the keyspace is defined, indicate that you will use it.

USE mykeyspace;

This is a familiar statement, isn’t it?

Apache Cassandra, like other NoSQL frameworks, tries to use analogies with the SQL statements that you already know.

And if you have already defined the database, do you remember what is next? Now you create a test table.

%> CREATE TABLE users (  user_id int PRIMARY KEY,  fname text,  lname text);

And, having the table, inserting records is simple, as you can see in the following:

%>INSERT INTO users (user_id,  fname, lname)  VALUES (1745, 'john', 'smith');
%>INSERT INTO users (user_id,  fname, lname)  VALUES (1744, 'john', 'doe');
%>INSERT INTO users (user_id,  fname, lname)  VALUES (1746, 'john', 'smith');

You make a simple query, like this:

%>SELECT * FROM users;

And, you should have the following results (or similar, if you already modified the data with the inserts):

 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith

With Apache Cassandra, you can create indexes on the fly:

CREATE INDEX ON users (lname);

This facilitates searches on specific fields:

SELECT * FROM users WHERE lname = 'smith';

This is the result:

 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1746 |  john | smith

And that’s it. This is enough to validate that communication between your CLI and your Cassandra server is working properly.

Here is the complete output from the previous commands:

{16-03-21 14:05}localhost:∼/opt/apache/cassandra/apache-cassandra-3.4/bin rugi% cd /home/rugi/opt/apache/cassandra/apache-cassandra-3.4/bin
{16-03-21 14:05}localhost:∼/opt/apache/cassandra/apache-cassandra-3.4/bin rugi% ./cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> USE mykeyspace;
cqlsh:mykeyspace> CREATE TABLE users (  user_id int PRIMARY KEY,  fname text,  lname text);
cqlsh:mykeyspace> INSERT INTO users  (user_id,  fname, lname)  VALUES (1745, 'john', 'smith');
cqlsh:mykeyspace> INSERT INTO users  (user_id,  fname, lname)  VALUES (1744, 'john', 'doe');
cqlsh:mykeyspace> INSERT INTO users  (user_id,  fname, lname)  VALUES (1746, 'john', 'smith');


cqlsh:mykeyspace> SELECT * FROM users;
user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith
(3 rows)


cqlsh:mykeyspace> CREATE INDEX ON users (lname);
cqlsh:mykeyspace> SELECT * FROM users WHERE lname = 'smith';
user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1746 |  john | smith
(2 rows)


cqlsh:mykeyspace> exit
{16-03-21 15:24}localhost:∼/opt/apache/cassandra/apache-cassandra-3.4/bin rugi%

You should have two terminals open: one with the server running and the other one with CLI running. If you check the first one, you will see how CLI is processing the requests. Figure 5-10 shows the server running.

A420086_1_En_5_Fig10_HTML.jpg
Figure 5-10. Cassandra server running

Figure 5-11 is a screenshot of running the test commands described earlier.

A420086_1_En_5_Fig11_HTML.jpg
Figure 5-11. CQL running on CQLs

CQL

CQL (Cassandra Query Language) is a language similar to SQL. The queries on a keyspace are made in CQL.

CQL Shell

There are several ways to interact with a keyspace; in the previous section, you saw how to do it using a shell called CQL shell (CQLs). Later you will see other ways to interact with the keyspace.

CQL shell is the primary way to interact with Cassandra; Table 5-2 lists the main commands.

Table 5-2. Shell Command Summary

Command       Description
cqlsh         Starts the CQL interactive terminal.
CAPTURE       Captures the command output and appends it to a file.
CONSISTENCY   Shows the current consistency level; or given a level, sets it.
COPY          Imports and exports CSV (comma-separated values) data to and from Cassandra.
DESCRIBE      Provides information about the connected Cassandra cluster or about the data objects stored in the cluster.
EXPAND        Formats the output of a query vertically.
EXIT          Terminates cqlsh.
PAGING        Enables or disables query paging.
SHOW          Shows the Cassandra version, host, or tracing information for the current cqlsh client session.
SOURCE        Executes a file containing CQL statements.
TRACING       Enables or disables request tracing.

For more detailed information on shell commands, you should visit the following web page:

http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlshCommandsTOC.html

Let’s try some of these commands. First, activate the shell, as follows:

{16-04-15 23:54}localhost:∼/opt/apache/cassandra/apache-cassandra-3.4/bin rugi% ./cqlsh        
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.

The describe command can work on specific tables, on all keyspaces, or on one specific keyspace:

cqlsh> describe keyspaces

system_schema  system      system_distributed
system_auth    mykeyspace  system_traces    


cqlsh> describe mykeyspace

CREATE KEYSPACE mykeyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE mykeyspace.users (
    user_id int PRIMARY KEY,
    fname text,
    lname text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX users_lname_idx ON mykeyspace.users (lname);

The show command is also simple to test; for example, to see the version number:

cqlsh> show version
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
cqlsh>

As you can see, these commands are very easy to use.

CQL Commands

CQL is very similar to SQL, as you have already seen in the first part of this chapter. You have created a keyspace, made inserts, and created a filter.

CQL, like SQL, is based on sentences/statements. These statements manipulate data and work with their logical container, the keyspace. As with SQL statements, they must end with a semicolon (;).

Table 5-3 lists all the language commands.

Table 5-3. CQL Command Summary

Command           Description
ALTER KEYSPACE    Changes the property values of a keyspace.
ALTER TABLE       Modifies the column metadata of a table.
ALTER TYPE        Modifies a user-defined type. Cassandra 2.1 and later.
ALTER USER        Alters existing user options.
BATCH             Writes multiple DML statements.
CREATE INDEX      Defines a new index on a single column of a table.
CREATE KEYSPACE   Defines a new keyspace and its replica placement strategy.
CREATE TABLE      Defines a new table.
CREATE TRIGGER    Registers a trigger on a table.
CREATE TYPE       Creates a user-defined type. Cassandra 2.1 and later.
CREATE USER       Creates a new user.
DELETE            Removes entire rows or one or more columns from one or more rows.
DESCRIBE          Provides information about the connected Cassandra cluster or about the data objects stored in the cluster.
DROP INDEX        Drops the named index.
DROP KEYSPACE     Removes the keyspace.
DROP TABLE        Removes the named table.
DROP TRIGGER      Removes registration of a trigger.
DROP TYPE         Drops a user-defined type. Cassandra 2.1 and later.
DROP USER         Removes a user.
GRANT             Provides access to database objects.
INSERT            Adds or updates columns.
LIST PERMISSIONS  Lists permissions granted to a user.
LIST USERS        Lists existing users and their superuser status.
REVOKE            Revokes user permissions.
SELECT            Retrieves data from a Cassandra table.
TRUNCATE          Removes all data from a table.
UPDATE            Updates columns in a row.
USE               Connects the client session to a keyspace.

For more detailed information on CQL commands, you can visit the following web page:

http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlCommandsTOC.html

Let’s play with some of these commands.

{16-04-16 6:19}localhost:∼/opt/apache/cassandra/apache-cassandra-3.4/bin rugi% ./cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.

Use the keyspace created at the beginning, as follows:

cqlsh> use mykeyspace;

The DESCRIBE command can be applied to almost any object; for example, to discover the keyspace tables:

cqlsh:mykeyspace> describe tables

users

Or to a specific table:

cqlsh:mykeyspace> describe users

CREATE TABLE mykeyspace.users (
    user_id int PRIMARY KEY,
    fname text,
    lname text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX users_lname_idx ON mykeyspace.users (lname);
cqlsh:mykeyspace> exit

Beyond the Basics

You already know that Apache Cassandra runs on a client-server architecture. The client-server architecture is used by nearly everyone every day; it is the base of what you know as the Internet.

Client-Server

By definition, the client-server architecture allows distributed applications, since the tasks are divided into two main parts:

  • The service providers: the servers

  • The service requesters: the clients

In this architecture, several clients are allowed to access the server. The server is responsible for meeting requests, and it handles each one according to its own rules. So far, you have only used one client, managed from the same machine—that is, from the same data network.

Figure 5-12 shows our current client-server architecture in Cassandra.

A420086_1_En_5_Fig12_HTML.jpg
Figure 5-12. The native way to connect to a Cassandra server is via CQLs.

CQL shell allows you to connect to Cassandra, access a keyspace, and send CQL statements to the Cassandra server. This is the most immediate method, but in daily practice, it is common to access the keyspaces from different execution contexts (other systems and other programming languages).

Other Clients

To do this in the Apache Cassandra context, you require clients other than CQLs: you require connection drivers.

Drivers

A driver is a software component that allows access to a keyspace to run CQL statements.

Figure 5-13 illustrates clients gaining access through the use of a driver. A driver can access a keyspace and also allows the execution of CQL statements.

A420086_1_En_5_Fig13_HTML.jpg
Figure 5-13. To access a Cassandra server, a driver is required

Fortunately, there are a lot of these drivers for Cassandra in almost any modern programming language. You can see an extensive list at http://wiki.apache.org/cassandra/ClientOptions .

Typically, in a client-server architecture, clients access the server from different machines, distributed across different networks; therefore, Figure 5-13 may now look like what’s shown in Figure 5-14.

A420086_1_En_5_Fig14_HTML.jpg
Figure 5-14. Different clients connecting to a Cassandra server through the cloud

Figure 5-14 illustrates that, given the distributed characteristics that modern systems require, clients actually sit at different points and access the Cassandra server through public and private networks.

Your implementation needs will dictate the required clients.

All languages offer a similar API through the driver. Consider the following code snippets in Java, Ruby, and Node.

Java

The following snippet was tested with JDK 1.8.x.

Get the Dependency

With Java, the easiest way is to use Maven. You can get the driver using the following Maven artifact:

<dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>3.0.2</version>
</dependency>
Snippet

The following is the Java snippet:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Iterator;


...

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");
        ResultSet results = session.execute("SELECT * FROM users");
        StringBuilder line = new StringBuilder();


        for (Iterator<Row> iterator = results.iterator(); iterator.hasNext();) {
            Row row = iterator.next();
            line.delete(0, line.length());
            line.append("FirstName = ").                  
                    append(row.getString("fname")).
                    append(",").append(" ").
                    append("LastName = ").
                    append(row.getString("lname"));
            System.out.println(line.toString());
        }
    }
Ruby

The snippet was tested with Ruby 2.0.x.

Get the Dependency

In Ruby, obtaining the driver is as simple as installing a gem.

%>gem install cassandra-driver
Snippet

The following is the Ruby snippet:

require 'cassandra'

node = '127.0.0.1'
cluster = Cassandra.cluster(hosts: node)
keyspace = 'mykeyspace'
session  = cluster.connect(keyspace)
session.execute("SELECT fname, lname FROM users").each do |row|
            p "FirstName = #{row['fname']}, LastName = #{row['lname']}"
end
Node

The snippet was tested with Node v5.0.0.

Get the Dependency

With Node, it could not be otherwise; the driver is obtained with npm.

%>npm install cassandra-driver
%>npm install async
Snippet

The following is the Node snippet:

var cassandra = require('cassandra-driver');
var async = require('async');


var client = new cassandra.Client({contactPoints: ['127.0.0.1'], keyspace: 'mykeyspace'});
client.stream('SELECT fname, lname FROM users', [])
  .on('readable', function () {
    var row;
    while (row = this.read()) {
      console.log('FirstName = %s, LastName = %s', row.fname, row.lname);
    }
  })
  .on('end', function () {
    // todo
  })
  .on('error', function (err) {
    // todo
  });

These three snippets did the same thing: made a connection to the Cassandra server, got a reference to the keyspace, made a single query, and displayed the results. In conclusion, the three snippets generated the same result:

"FirstName = john, LastName = smith"
"FirstName = john, LastName = doe"
"FirstName = john, LastName = smith"

You can see more examples that use other languages on the following web page:

 http://www.planetcassandra.org/apache-cassandra-client-drivers/

Apache Spark-Cassandra Connector

Now that you have a clear understanding of how connecting to a Cassandra server works, let’s talk about a very special client. Everything that you have seen previously has been building to this point. You can now see what Spark can do, since you know Cassandra and know that you can use it as a storage layer to improve Spark’s performance.

What do you need to achieve this connection? A client. This client is special because it is designed specifically for Spark, not for a specific language. This special client is called the Spark-Cassandra Connector (see Figure 5-15).

A420086_1_En_5_Fig15_HTML.jpg
Figure 5-15. The Spark-Cassandra Connector is a special type of client that allows access to keyspaces from a Spark context

Installing the Connector

The Spark-Cassandra Connector has its own GitHub repository. The master branch holds the latest stable version, but you can access a specific version through a particular branch.

Figure 5-16 shows the Spark-Cassandra Connector project home page, which is located at https://github.com/datastax/spark-cassandra-connector .

A420086_1_En_5_Fig16_HTML.jpg
Figure 5-16. The Spark-Cassandra Connector on GitHub

At the time of this writing, the most stable connector version is 1.6.0. The connector is basically a .jar file loaded when Spark starts. If you prefer to directly access the .jar file and avoid the build process, you can do so by downloading it from the official Maven repository. A widely used repository is located at http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_ .10/1.6.0-M2.

Generating the .jar file directly from the Git repository has one main advantage: all the necessary dependencies of the connector are generated with it. If you choose to download the .jar from the official repository, you must also download all of these dependencies yourself.

Fortunately, there is a third way to run the connector, which is by telling the spark-shell that you require certain packages for the session to start. This is done by adding the following flag:

./spark-shell --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10

The nomenclature of the package is the same as that used with Gradle, Buildr, or SBT:

GroupID: datastax
ArtifactID: spark-cassandra-connector
Version: 1.6.0-M2-s_2.10

In the preceding line, you are telling the shell that you require that artifact, and the shell will handle all the dependencies. Now let’s see how it works.

Establishing the Connection

The connector version used in this section is 1.6.0 because it matches the latest stable version of Apache Spark as of this writing.

First, validate that the versions are compatible. Access the Spark shell to see if you have the correct version.

{16-04-18 1:10}localhost:∼/opt/apache/spark/spark-1.6.0-bin-hadoop2.6/bin rugi% ./spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/


Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
....
scala>

Next, try a simple task:

scala> sc.parallelize( 1 to 50 ).sum()
res0: Double = 1275.0
scala>

Stop the shell (with the exit command). Now, at start time, indicate the package that you require (the connector). The first time, the shell downloads the dependencies:

{16-06-08 23:18}localhost:∼/opt/apache/spark/spark-1.6.0-bin-hadoop2.6/bin rugi% >./spark-shell --packages datastax:spark-cassandra-connector:1.6.0-M2-s_2.10
Ivy Default Cache set to: /home/rugi/.ivy2/cache
The jars for the packages stored in: /home/rugi/.ivy2/jars
:: loading settings :: url = jar:file:/home/rugi/opt/apache/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found datastax#spark-cassandra-connector;1.6.0-M2-s_2.10 in spark-packages
        found joda-time#joda-time;2.3 in local-m2-cache
        found com.twitter#jsr166e;1.1.0 in central
        found org.scala-lang#scala-reflect;2.10.5 in central
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   16  |   2   |   2   |   0   ||   16  |   2   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        2 artifacts copied, 14 already retrieved (5621kB/32ms)
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark’s repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/


Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
16/06/08 23:18:59 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.1.6 instead (on interface wlp7s0)
16/06/08 23:18:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context available as sc.
16/06/08 23:19:07 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/06/08 23:19:07 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.


scala>

The connector is loaded and ready for use.

First, stop the default Spark context from the shell:

sc.stop

Next, import the required classes for communication:

import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf

Then, set a variable with the required configuration to connect:

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")

Finally, connect to the well-known keyspace and table that were created at the beginning of this chapter:

val sc = new SparkContext(conf)
val test_spark_rdd = sc.cassandraTable("mykeyspace", "users")

Given the context and keyspace, it is possible to consult the values with the following statement:

test_spark_rdd.foreach(println)

Here is the complete sequence of the five lines of code:

scala> sc.stop

scala> import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf


scala> val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@68b5a37d


scala> val sc = new SparkContext(conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@3d872a12


scala> val test_spark_rdd = sc.cassandraTable("mykeyspace", "users")

The connection is established, and the table within our keyspace is now accessible through test_spark_rdd for operations; for example, to display its values:

scala> test_spark_rdd.foreach(println)
CassandraRow{user_id: 1745, fname: john, lname: smith}
CassandraRow{user_id: 1744, fname: john, lname: doe}
CassandraRow{user_id: 1746, fname: john, lname: smith}
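The RDD returned by cassandraTable supports the usual Spark transformations, such as filter and map. As a sketch (runnable without a cluster), the rows above can be modeled with a plain Scala case class; UserRow and its fields are hypothetical stand-ins mirroring the users table:

```scala
// Hypothetical local stand-in for the rows returned by sc.cassandraTable;
// in a live session you would filter test_spark_rdd directly.
case class UserRow(userId: Int, fname: String, lname: String)

val rows = Seq(
  UserRow(1745, "john", "smith"),
  UserRow(1744, "john", "doe"),
  UserRow(1746, "john", "smith")
)

// Equivalent in spirit to filtering the CassandraRow objects by lname
val smiths = rows.filter(_.lname == "smith")
println(smiths.map(_.userId).sorted)   // List(1745, 1746)
```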

More Than One Is Better

Up to this moment, perhaps unknowingly, you have been working with a Cassandra cluster. A cluster with a single node, but a cluster nonetheless. Let’s check it. ;)

To check, use the nodetool utility, the command-line tool used to administer a Cassandra cluster.

You can run the following to see the full list of commands:

CASSANDRA_HOME/bin>./nodetool

Among the list, you see the status command.

status         Print cluster information (state, load, IDs, ...)

You can run nodetool with the status command.

CASSANDRA_HOME/bin>./nodetool  status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  122.62 KiB  256          100.0%            3e7ccbd4-8ffb-4b77-bd06-110d27536cb2  rack1

You can see that you are running a cluster with a single node.

cassandra.yaml

When you have a cluster of more than one node, you modify the cassandra.yaml file; the necessary settings for each node within a cluster are made in this file. When you have only one node, there is nothing to change. The file is located in the CASSANDRA_HOME/conf folder.

The file has several options; you can see each option in detail in the documentation.8 For a basic configuration, however, there are few options that are required.

Table 5-4 describes the fields needed to create our cluster. The descriptions were taken from the aforementioned documentation.

Table 5-4. Minimum Configuration Options for Each Node in the Cluster

cluster_name: The name of the cluster.

seed_provider: The addresses of the hosts deemed as contact points. Cassandra nodes use the -seeds list to find each other and learn the topology of the ring.

seed_provider - class_name: The class within Cassandra that handles the seed logic. It can be customized, but this is typically not required.

seed_provider - parameters - seeds: A comma-delimited list of IP addresses used by gossip for bootstrapping new nodes joining a cluster.

listen_address: The IP address or hostname that Cassandra binds to in order to connect to other Cassandra nodes.

rpc_address: The listen address for client connections (Thrift RPC service and native transport).

broadcast_rpc_address: The RPC address to broadcast to drivers and other Cassandra nodes.

endpoint_snitch: Set to a class that implements the IEndpointSnitch interface.

Setting the Cluster

In this example, assume that Cassandra is installed on the following machines:

107.170.38.238 (seed)
107.170.112.81
107.170.115.161

The documentation recommends having more than one seed, but because you have only three nodes in this exercise, leave only a single seed. All machines have Ubuntu 14.04 and JDK 1.8 (HotSpot) ready.

The following steps assume that you are starting a clean installation in each machine, so, if a machine is running Cassandra, you must stop and delete all data. We recommend that you start with clean installations. If there is a firewall between the machines, it is important to open specific ports.9

Machine01

Our first machine has the address 107.170.38.238 and is the seed. It is started first, once you finish setting up all three machines.

Locate the CASSANDRA_HOME/conf/cassandra.yaml file and make the following modifications. All nodes in the cluster must have the same cluster_name.

cluster_name: 'BedxheCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "107.170.38.238"
listen_address: 107.170.38.238
rpc_address: 0.0.0.0
broadcast_rpc_address: 1.2.3.4
endpoint_snitch: RackInferringSnitch

Machine02

Our second machine has the address 107.170.112.81. Its setting only changes the value of listen_address.

cluster_name: 'BedxheCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "107.170.38.238"
listen_address: 107.170.112.81
rpc_address: 0.0.0.0
broadcast_rpc_address: 1.2.3.4
endpoint_snitch: RackInferringSnitch

Machine03

Our third machine has the address 107.170.115.161. Its setting also only changes the value of listen_address.

cluster_name: 'BedxheCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "107.170.38.238"
listen_address: 107.170.115.161
rpc_address: 0.0.0.0
broadcast_rpc_address: 1.2.3.4
endpoint_snitch: RackInferringSnitch

You have now finished the configuration of the nodes.

Note

This configuration is simple and was used for illustrative purposes. Configuring a cluster for a production environment requires studying several factors and a lot of experimentation. We recommend this setting because it is a simple exercise to begin learning the options.

Booting the Cluster

First, start Cassandra on the seed node (by removing the -f flag, Cassandra starts the process and sends it to the background).

MACHINE01/CASSANDRA_HOME/bin%>./cassandra
Then start Cassandra on the other two nodes.
MACHINE02/CASSANDRA_HOME/bin%>./cassandra
MACHINE03/CASSANDRA_HOME/bin%>./cassandra

Now, if you execute nodetool in any of the machines, you see something like the following.

CASSANDRA_HOME/bin>./nodetool status
Datacenter: 170
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load        Tokens  Owns (effective)  Host ID                               Rack
UN  107.170.38.238   107.95 KiB  256     68.4%             23e16126-8c7f-4eb8-9ea0-40ae488127e8  38
UN  107.170.115.161  15.3 KiB    256     63.7%             b3a9970a-ff77-43b2-ad4e-594deb04e7f7  115
UN  107.170.112.81   102.49 KiB  256     67.9%             ece8b83f-d51d-43ce-b9f2-89b79a0a2097  112
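In the status column, U means up and N means normal, so a healthy cluster shows every node as UN. As a small sketch, a few lines of Scala can scan captured nodetool status output (the three node rows above, pasted as a string) and confirm that all nodes are Up/Normal:

```scala
// Sketch: count the nodes that nodetool status reports as Up/Normal ("UN").
// The sample text is the captured output from the three-node cluster above.
val statusOutput =
  """UN  107.170.38.238   107.95 KiB  256  68.4%  23e16126-8c7f-4eb8-9ea0-40ae488127e8  38
    |UN  107.170.115.161  15.3 KiB    256  63.7%  b3a9970a-ff77-43b2-ad4e-594deb04e7f7  115
    |UN  107.170.112.81   102.49 KiB  256  67.9%  ece8b83f-d51d-43ce-b9f2-89b79a0a2097  112""".stripMargin

val nodeLines = statusOutput.split("\n").map(_.trim).filter(_.nonEmpty)
val upNormal  = nodeLines.count(_.startsWith("UN"))
println(s"$upNormal of ${nodeLines.length} nodes are Up/Normal")
```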

Now, if you repeat the creation of the keyspace example in the seed node, you will see how the keyspace is available in the other nodes. And conversely, if you apply a change to the keyspace in any node, it is immediately reflected in the others.

Execute the following in machine01 (the seed machine):

MACHINE01_CASSANDRA_HOME/bin%> ./cqlsh

cqlsh>CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh>USE mykeyspace;
cqlsh>CREATE TABLE users (user_id int PRIMARY KEY,  fname text,  lname text);
cqlsh>INSERT INTO users (user_id,  fname, lname)  VALUES (1745, 'john', 'smith');
cqlsh>INSERT INTO users (user_id,  fname, lname)  VALUES (1744, 'john', 'doe');
cqlsh>INSERT INTO users (user_id,  fname, lname)  VALUES (1746, 'john', 'smith');

Execute the following in machine02 or machine03:

CASSANDRA_HOME/bin%>./cqlsh
cqlsh> use mykeyspace;
cqlsh:mykeyspace> select * from users;


 user_id | fname | lname
---------+-------+-------
    1745 |  john | smith
    1744 |  john |   doe
    1746 |  john | smith


(3 rows)

That’s it. You have a cluster of three nodes working properly.

Putting It All Together

The best way to assimilate all of these concepts is through examples, so in later chapters, we show concrete examples of the use of this architecture.

As you can see, beginning to use Cassandra is very simple; the similarity to SQL in making queries helps to manipulate data from the start. Perhaps now that you know the advantages of Cassandra, you want to know who is using it. There are three companies in particular that have helped increase the popularity of Cassandra: SoundCloud,10 Spotify,11 and Netflix.12

A lot of the material that exists online about Cassandra refers to these companies, but they are not the only ones. The following two web pages offer more extensive lists of companies that are committed to Cassandra and use it for some part of their data management in interesting use cases.

Beyond Cassandra's own advantages, such as the ring model and distributed data management within the cluster, its main advantage is its level of integration with Spark and, in general, with the rest of the technologies in this book.

Surely, you’ll use Cassandra in an upcoming project.
