The Apache Cassandra project was started at Facebook in 2007 to offer users a better experience when searching their inbox. The challenges that Facebook engineers had to face were mostly related to massive amounts of data, very high throughput, and the need to scale at a mind-blowing rate.
Cassandra is a distributed column-oriented database designed to manage humongous amounts of structured data in a decentralized, highly scalable way. The absence of a single point of failure makes Cassandra highly available and fault tolerant.
While Cassandra resembles a traditional database and shares some design strategies, it does not support a full relational data model. Instead, Cassandra's data model is flexible, because each row can contain a variable number of columns.
In this recipe, we will go through different aspects of connecting and querying a Cassandra database.
For this recipe, we assume that the reader already has some familiarity with the Cassandra core concepts and data model (columns, super columns, column families, and keyspaces). Installing Cassandra is straightforward.
The only requirement for running Cassandra is a Java 1.6 JVM. Just download the distribution from the product website (http://cassandra.apache.org/download/), unzip it, and run the bin/cassandra or bin/cassandra.bat executable to start a single node.
Before we start, we have to create a couple of entities in Cassandra, a Keyspace and a Column Family. Fire up the CQLSH console located in the bin folder and type:
    create keyspace hr
      with strategy_class='SimpleStrategy'
      and strategy_options:replication_factor=1;
    use hr;
    create columnfamily employee (empid int primary key);
These commands create a Keyspace named hr and a column family named employee. The column family has one field only, which is also a primary key (of type int).
There are several client strategies to interact with Cassandra. In this recipe, we are going to use the open source library Hector, which is a well-established, high-level Java client.
The Hector APIs are not very fluent, and it takes a lot of boilerplate code to insert or manipulate data with it.
Why not create a simpler, more fluent wrapper on top of the Hector API using Groovy?
    @Grab('org.hectorclient:hector-core:1.1-2')
    @GrabExclude('org.apache.httpcomponents:httpcore')
    import me.prettyprint.hector.api.Cluster
    import me.prettyprint.hector.api.factory.HFactory
    import me.prettyprint.hector.api.Keyspace
    import me.prettyprint.cassandra.serializers.*
    import me.prettyprint.hector.api.Serializer
    import me.prettyprint.hector.api.mutation.Mutator
    import me.prettyprint.hector.api.ddl.*
    import me.prettyprint.hector.api.beans.ColumnSlice

    class Gassandra {

      def cluster
      def keyspace
      def colFamily
      Serializer serializer
      def stringSerializer = StringSerializer.get()

      private Gassandra(Keyspace keyspace) {
        this.keyspace = keyspace
      }

      Gassandra() {}

      void connect(clusterName, host, port) {
        cluster = HFactory.getOrCreateCluster(
          clusterName,
          "$host:$port"
        )
      }

      List<KeyspaceDefinition> getKeyspaces() {
        cluster.describeKeyspaces()
      }

      Gassandra withKeyspace(keyspaceName) {
        keyspace = HFactory.createKeyspace(
          keyspaceName,
          cluster
        )
        new Gassandra(keyspace)
      }

      Gassandra withColumnFamily(columnFamily, Serializer c) {
        colFamily = columnFamily
        serializer = c
        this
      }

      Gassandra insert(key, columnName, value) {
        def mutator = HFactory.createMutator(
          keyspace,
          serializer
        )
        def column = HFactory.createStringColumn(
          columnName,
          value
        )
        mutator.insert(key, colFamily, column)
        this
      }

      Gassandra insert(key, Map args) {
        def mutator = HFactory.createMutator(
          keyspace,
          serializer
        )
        args.each {
          mutator.insert(
            key,
            colFamily,
            HFactory.createStringColumn(
              it.key,
              it.value
            )
          )
        }
        this
      }

      ColumnSlice findByKey(key) {
        def sliceQuery = HFactory.createSliceQuery(
          keyspace,
          serializer,
          stringSerializer,
          stringSerializer
        )
        sliceQuery.
          setColumnFamily(colFamily).
          setKey(key).
          setRange('', '', false, 100).
          execute().
          get()
      }

    }
The Gassandra class exposes a very simple, fluent interface that leverages the dynamic nature of Groovy. The class imports the Hector API and allows writing code as follows:
    def g = new Gassandra()
    g.connect('test', 'localhost', '9160')
    def employee = g
      .withKeyspace('hr')
      .withColumnFamily('employee', IntegerSerializer.get())

    employee.insert(5005, 'name', 'Zoe')
    employee.insert(5005, 'lastName', 'Ross')
    employee.insert(5005, 'age', '31')
The withKeyspace and withColumnFamily methods are written in a fluent style, so that we can pass the relevant information to Hector. Note that withColumnFamily requires a Serializer to specify the type of the primary key.
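To see why the calls can be chained, here is a minimal, self-contained sketch of the same pattern. The QueryTarget class and its fields are hypothetical stand-ins and are not part of Hector or Gassandra; no Cassandra connection is involved. The only point is that each with* method returns an object on which the next call can be made.

```groovy
// Hypothetical stand-in for Gassandra (illustration only).
class QueryTarget {
    String keyspace
    String colFamily

    // Returns a fresh object bound to the keyspace, as withKeyspace does.
    QueryTarget withKeyspace(String name) {
        new QueryTarget(keyspace: name)
    }

    // Returns this, so that further calls can be chained.
    QueryTarget withColumnFamily(String name) {
        colFamily = name
        this
    }

    String describe() {
        "${keyspace}.${colFamily}"
    }
}

def target = new QueryTarget().withKeyspace('hr').withColumnFamily('employee')

assert target.describe() == 'hr.employee'
```

Because the insert methods of Gassandra also return this, several inserts can be chained in the same way.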
The insert method accepts a Map as well, so that the previous code can be rewritten as:
    employee.insert(5005,
      ['name': 'Zoe',
       'lastName': 'Ross',
       'age': '31'
      ])
To find a row by primary key, there is a findByKey method that returns a me.prettyprint.hector.api.beans.ColumnSlice object.

    println employee.findByKey(5005)

The previous statement will output:

    ColumnSlice([HColumn(age=31), HColumn(lastName=Ross), HColumn(name=Zoe)])
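Printing the slice is handy for debugging, but calling code usually wants the values by column name. The following is a minimal sketch and not part of the recipe's class: the helper name toMap is hypothetical, and it relies only on Hector's ColumnSlice.getColumns(), HColumn.getName(), and HColumn.getValue() accessors, which Groovy exposes as properties. So that the snippet runs without a Cassandra node, a plain map holding a list of maps stands in for the real slice.

```groovy
// Hypothetical helper: turns anything shaped like a ColumnSlice into a Map.
// With Hector, slice.columns calls getColumns(), and it.name / it.value
// call HColumn.getName() / HColumn.getValue().
Map toMap(slice) {
    slice.columns.collectEntries { [(it.name): it.value] }
}

// Stand-in data shaped like the slice printed above (assumption: a real
// ColumnSlice returned by findByKey would be passed instead).
def fakeSlice = [columns: [
    [name: 'age',      value: '31'],
    [name: 'lastName', value: 'Ross'],
    [name: 'name',     value: 'Zoe']
]]

def row = toMap(fakeSlice)
assert row.name == 'Zoe'
assert row.age == '31'
```

With the recipe's class, the call would presumably look like toMap(employee.findByKey(5005)).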
The Gassandra class lacks many basic methods, such as updating or deleting rows, as well as more advanced query features. We leave them to the reader as an exercise.