How to do it...

Perform the following steps to load data from Cassandra:

  1. Create a keyspace named people in Cassandra using the CQL shell:
cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
  1. Create a column family named person (from CQL 3.0 onwards, a column family can also be called a table):
cqlsh> use people;
cqlsh> create table person(id int primary key,first_name varchar,last_name varchar,age int);
  1. Insert a few records in the column family:
cqlsh> insert into person(id, first_name, last_name, age) values(1,'Barack','Obama',55);
cqlsh> insert into person(id, first_name, last_name, age) values(2,'Joe','Smith',14);
cqlsh> insert into person(id, first_name, last_name, age) values(3,'Billy','Kid',18);
  1. Now start the Spark shell with Cassandra connector dependency added:
$ spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf 
  1. Load the person table as a DataFrame:
scala> val person = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" -> "people", "table" -> "person")).load
  1. Print the schema:
scala> person.printSchema
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
  1. Count the number of records in the DataFrame:
scala> person.count
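  1. Print the person DataFrame:
scala> person.show
+---+---+----------+---------+
| id|age|first_name|last_name|
+---+---+----------+---------+
|  1| 55|    Barack|    Obama|
|  2| 14|       Joe|    Smith|
|  3| 18|     Billy|      Kid|
+---+---+----------+---------+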

  1. Retrieve the first row:
scala> val firstRow = person.first
  1. Create a Person case class:
scala> case class Person( first_name:String, last_name:String, age:Int)
  1. Create a temporary view from the person DataFrame:
scala> person.createOrReplaceTempView("persons")
  1. Create a DataFrame of teenagers:
scala> val teens = spark.sql("select * from persons where age > 12 and age < 20")
  1. Print the teens DataFrame:
scala> teens.show
  1. Convert it into a Dataset:
scala> val teensDS = teens.as[Person]
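The teenager query in the steps above can also be written with DataFrame column expressions instead of SQL. The following is a minimal sketch, not the recipe's own code: the `TeenFilters` object and its `samplePeople` helper are hypothetical, and the helper builds the three rows in memory so the example runs without a Cassandra cluster. With a real cluster, the DataFrame loaded from the person table would take its place.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TeenFilters {
  // Hypothetical stand-in for the Cassandra table: the same three rows
  // that were inserted through cqlsh, built in memory.
  def samplePeople(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "Barack", "Obama", 55),
      (2, "Joe", "Smith", 14),
      (3, "Billy", "Kid", 18)
    ).toDF("id", "first_name", "last_name", "age")
  }

  // The query from the recipe, run against a temporary view.
  def viaSql(spark: SparkSession, person: DataFrame): DataFrame = {
    person.createOrReplaceTempView("persons")
    spark.sql("select * from persons where age > 12 and age < 20")
  }

  // The same predicate as column expressions; no view is required.
  def viaColumns(person: DataFrame): DataFrame =
    person.filter(col("age") > 12 && col("age") < 20)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("teen-filters")
      .getOrCreate()
    val person = samplePeople(spark)
    viaSql(spark, person).show()
    viaColumns(person).show()
    spark.stop()
  }
}
```

Both functions return the same rows; the column-expression form avoids registering a view when the query is only needed once.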
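There are different ways to look at the difference between a DataFrame and a Dataset. Readers who come from the Hive and Pig days can think of a DataFrame as the equivalent of Hive and a Dataset as the equivalent of Pig: a DataFrame is queried declaratively by column name, much like a Hive table, while a Dataset is transformed as a collection of typed objects.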
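One way to make the distinction concrete is to run the same filter through both APIs. The sketch below is illustrative only: the `TypedVsUntyped` object and its helper names are hypothetical, the rows are built in memory rather than read from Cassandra, and the encoders are passed explicitly so the code does not depend on the implicits that spark-shell imports automatically. The `id` column is dropped before the conversion so that the columns match the fields of `Person` exactly.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Same fields as the Person case class created in the recipe.
final case class Person(first_name: String, last_name: String, age: Int)

object TypedVsUntyped {
  // Hypothetical stand-in for the person table loaded from Cassandra.
  def samplePeople(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "Barack", "Obama", 55),
      (2, "Joe", "Smith", 14),
      (3, "Billy", "Kid", 18)
    ).toDF("id", "first_name", "last_name", "age")
  }

  // Untyped: columns are looked up by name when the query is analyzed,
  // so a misspelt column name is only reported at runtime.
  def teenNamesUntyped(df: DataFrame): Array[String] =
    df.filter(col("age") > 12 && col("age") < 20)
      .select("first_name")
      .collect()
      .map(_.getString(0))

  // Convert the DataFrame into a Dataset[Person].
  def toPeople(df: DataFrame): Dataset[Person] =
    df.select("first_name", "last_name", "age")
      .as[Person](Encoders.product[Person])

  // Typed: the lambda works on Person objects, so a misspelt field name
  // is rejected by the Scala compiler.
  def teenNamesTyped(ds: Dataset[Person]): Array[String] =
    ds.filter((p: Person) => p.age > 12 && p.age < 20)
      .map((p: Person) => p.first_name)(Encoders.STRING)
      .collect()

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs on a single machine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-vs-untyped")
      .getOrCreate()
    val df = samplePeople(spark)
    println(teenNamesUntyped(df).mkString(", "))
    println(teenNamesTyped(toPeople(df)).mkString(", "))
    spark.stop()
  }
}
```

The untyped version lets Spark see the whole predicate as an expression, while the typed version trades some of that visibility for compile-time checking of field names and types.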
