Perform the following steps to load data from Cassandra:
- Create a keyspace named people in Cassandra using the CQL shell:
cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
- Create a column family (from CQL 3.0 onwards, it can also be called a table) named person:
cqlsh> use people;
cqlsh> create table person(id int primary key,first_name varchar,last_name varchar,age int);
- Insert a few records in the column family:
cqlsh> insert into person(id, first_name, last_name, age) values(1,'Barack','Obama',55);
cqlsh> insert into person(id, first_name, last_name, age) values(2,'Joe','Smith',14);
cqlsh> insert into person(id, first_name, last_name, age) values(3,'Billy','Kid',18);
- Now start the Spark shell with Cassandra connector dependency added:
$ spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=localhost
- Load the person table as a DataFrame:
scala> val person = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "people", "table" -> "person")).load
- Print the schema:
scala> person.printSchema
root
|-- id: integer (nullable = true)
|-- age: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
- Count the number of records in the DataFrame:
scala> person.count
- Print the person DataFrame:
scala> person.show
+---+---+----------+---------+
| id|age|first_name|last_name|
+---+---+----------+---------+
|  1| 55|    Barack|    Obama|
|  2| 14|       Joe|    Smith|
|  3| 18|     Billy|      Kid|
+---+---+----------+---------+
- Retrieve the first row:
scala> val firstRow = person.first
- Create a Person case class:
scala> case class Person(first_name: String, last_name: String, age: Int)
- Create a temporary view out of person DataFrame:
scala> person.createOrReplaceTempView("persons")
- Create a DataFrame of teenagers:
scala> val teens = spark.sql("select * from persons where age > 12 and age < 20")
- Print the teens DataFrame:
scala> teens.show
- Convert it into a Dataset:
scala> val teensDS = teens.as[Person]
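There are different ways to look at the difference between a DataFrame and a Dataset. Folks who have a hangover from the Hive and Pig days can think of a DataFrame as the equivalent of Hive and a Dataset as the equivalent of Pig. A DataFrame is queried declaratively, by column name or SQL, much like Hive; a Dataset is transformed step by step with typed functions over objects, which is closer to Pig's data-flow style.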
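One concrete way to see the difference is to write the same teenager filter against each abstraction. The following is a minimal sketch, not part of the recipe itself: it assumes a local SparkSession and builds the same three records in memory instead of reading them from Cassandra, and the names DataFrameVsDataset and teenFirstNames are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the Person case class created in the recipe.
case class Person(first_name: String, last_name: String, age: Int)

// Hypothetical helper object, used only for this illustration.
object DataFrameVsDataset {

  // Returns the teenagers' first names, computed once through the
  // DataFrame API and once through the Dataset API.
  def teenFirstNames(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    // Assumption: in-memory data standing in for the Cassandra table.
    val personDS = Seq(
      Person("Barack", "Obama", 55),
      Person("Joe", "Smith", 14),
      Person("Billy", "Kid", 18)
    ).toDS()
    val personDF = personDS.toDF()

    // DataFrame: columns are referenced by name, so a misspelled
    // column is only detected when Spark analyzes the query at runtime.
    val teensDF = personDF.filter($"age" > 12 && $"age" < 20)
    val fromDF = teensDF.select("first_name").as[String].collect().toSeq.sorted

    // Dataset: fields are accessed on Person objects, so a misspelled
    // field is rejected by the Scala compiler.
    val teensDS = personDS.filter(p => p.age > 12 && p.age < 20)
    val fromDS = teensDS.map(_.first_name).collect().toSeq.sorted

    (fromDF, fromDS)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-vs-dataset")
      .getOrCreate()
    val (fromDF, fromDS) = teenFirstNames(spark)
    println(s"DataFrame result: $fromDF")
    println(s"Dataset result:   $fromDS")
    spark.stop()
  }
}
```

Both queries select the same rows; what changes is when mistakes are caught. The DataFrame version would fail at runtime on a column name such as "agee", whereas the Dataset version would not compile if p.agee were written.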