How to do it...

Perform the following steps to load data from Cassandra:

Create a keyspace named people in Cassandra using the CQL shell:

cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };

Create a column family named person (from CQL 3.0 onwards, a column family is also called a table):

cqlsh> use people;
cqlsh> create table person(id int primary key,first_name varchar,last_name varchar,age int);

Insert a few records in the column family:

cqlsh> insert into person(id, first_name, last_name, age) values(1,'Barack','Obama',55);
cqlsh> insert into person(id, first_name, last_name, age) values(2,'Joe','Smith',14);
cqlsh> insert into person(id, first_name, last_name, age) values(3,'Billy','Kid',18);

Now start the Spark shell with the Cassandra connector added as a dependency:

$ spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=localhost

Load the person table as a DataFrame:

scala> val person = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" ->"people", "table" -> "person")).load
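Outside the shell, the same read can be wrapped in a small helper. The following is a minimal sketch written script-style (it can be pasted into spark-shell with :paste), assuming the connector is on the classpath. The name loadCassandraTable is a hypothetical helper, not part of the connector, and the host setting simply mirrors the --conf flag passed to spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: the Cassandra host is set on the session builder instead of
// on the spark-shell command line. In spark-shell, getOrCreate returns the
// session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cassandra-load-sketch")
  .config("spark.cassandra.connection.host", "localhost")
  .getOrCreate()

// Hypothetical helper: reads one Cassandra table as a DataFrame using the
// same format name and options as the recipe.
def loadCassandraTable(spark: SparkSession, keyspace: String, table: String): DataFrame =
  spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> keyspace, "table" -> table))
    .load()
```

With Cassandra running and the connector loaded, loadCassandraTable(spark, "people", "person") should return the same DataFrame as the step above.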

Print the schema:

scala> person.printSchema
root
 |-- id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)

Count the number of records in the DataFrame:

scala> person.count

Print the person DataFrame:

scala> person.show
+---+---+----------+---------+
| id|age|first_name|last_name|
+---+---+----------+---------+
|  1| 55|    Barack|    Obama|
|  2| 14|       Joe|    Smith|
|  3| 18|     Billy|      Kid|
+---+---+----------+---------+

Retrieve the first row:

scala> val firstRow = person.first
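The returned value is a Row, whose fields can be read by column name. The sketch below illustrates this under one loud assumption: the person DataFrame is replaced by an in-memory stand-in with the same columns and rows, so the snippet runs without Cassandra. Against a real Cassandra table, the first row is not guaranteed to be the one with id 1, because rows come back in partition token order rather than insertion order.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// In spark-shell, getOrCreate returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("row-access-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: in-memory stand-in for the Cassandra-backed DataFrame.
val person = Seq(
  (1, "Barack", "Obama", 55),
  (2, "Joe", "Smith", 14),
  (3, "Billy", "Kid", 18)
).toDF("id", "first_name", "last_name", "age")

val firstRow: Row = person.first

// Look fields up by column name rather than by position.
val firstName = firstRow.getAs[String]("first_name")
val age = firstRow.getAs[Int]("age")
println(s"$firstName is $age years old")
```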

Create a Person case class:

scala> case class Person( first_name:String, last_name:String, age:Int)

Create a temporary view from the person DataFrame:

scala> person.createOrReplaceTempView("persons")

Create a DataFrame of teenagers:

scala> val teens = spark.sql("select * from persons where age > 12 and age < 20")

Print the teens DataFrame:

scala> teens.show
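The same selection can also be written with the DataFrame API instead of SQL. This is a minimal sketch rather than part of the original recipe. It assumes an in-memory stand-in for the person DataFrame, with the three rows inserted earlier, so that it runs without Cassandra. The name teensByApi is illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell, getOrCreate returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("teens-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: in-memory stand-in for the Cassandra-backed DataFrame.
val person = Seq(
  (1, "Barack", "Obama", 55),
  (2, "Joe", "Smith", 14),
  (3, "Billy", "Kid", 18)
).toDF("id", "first_name", "last_name", "age")

// Same predicate as the SQL query: age strictly between 12 and 20.
val teensByApi = person.filter(col("age") > 12 && col("age") < 20)
teensByApi.show()
```

Both forms go through the same query planner, so the choice between them is largely a matter of style.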

Convert it into a dataset:

scala> val teensDS = teens.as[Person]

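There are different ways to look at the difference between a DataFrame and a dataset. Readers who come from the Hive and Pig days can think of a DataFrame as the equivalent of Hive, where data is queried declaratively by column name, and a dataset as the equivalent of Pig, where data is transformed step by step.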
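One concrete way to see the difference is to extract names from both. The following sketch assumes an in-memory stand-in for the teens DataFrame so that it runs without Cassandra. The names namesDF and namesDS are illustrative. With the DataFrame, columns are referenced by string and checked only when the query is analyzed; with the dataset, fields of the Person case class are checked by the Scala compiler.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(first_name: String, last_name: String, age: Int)

// In spark-shell, getOrCreate returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-vs-ds-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: in-memory stand-in for the teens DataFrame.
val teens: DataFrame = Seq(
  (2, "Joe", "Smith", 14),
  (3, "Billy", "Kid", 18)
).toDF("id", "first_name", "last_name", "age")

// DataFrame: untyped rows, columns looked up by name at analysis time.
val namesDF: DataFrame = teens.select("first_name", "last_name")

// Dataset: typed objects; a misspelled field here would not compile.
// The extra id column is ignored because Person does not declare it.
val teensDS: Dataset[Person] = teens.as[Person]
val namesDS: Dataset[String] = teensDS.map(p => s"${p.first_name} ${p.last_name}")

namesDF.show()
namesDS.show()
```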
