How to do it...

Perform the following steps to load data from Cassandra:

  1. Create a keyspace named people in Cassandra using the CQL shell:
cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
  1. Create a column family named person (from CQL 3.0 onwards, a column family can also be called a table):
cqlsh> use people;
cqlsh> create table person(id int primary key,first_name varchar,last_name varchar,age int);
  1. Insert a few records in the column family:
cqlsh> insert into person(id, first_name, last_name, age) values(1,'Barack','Obama',55);
cqlsh> insert into person(id, first_name, last_name, age) values(2,'Joe','Smith',14);
cqlsh> insert into person(id, first_name, last_name, age) values(3,'Billy','Kid',18);
  1. Now start the Spark shell with the Cassandra connector dependency added:
$ spark-shell --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=localhost 
  1. Load the person table as a DataFrame:
scala> val person = spark.read.format("org.apache.spark.sql.cassandra").options(Map("keyspace" ->"people", "table" -> "person")).load
  1. Print the schema:
scala> person.printSchema
root
|-- id: integer (nullable = true)
|-- age: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
  1. Count the number of records in the DataFrame:
scala> person.count
  1. Print the person DataFrame:
scala> person.show
+---+---+----------+---------+
| id|age|first_name|last_name|
+---+---+----------+---------+
|  1| 55|    Barack|    Obama|
|  2| 14|       Joe|    Smith|
|  3| 18|     Billy|      Kid|
+---+---+----------+---------+

  1. Retrieve the first row:
scala> val firstRow = person.first
  1. Create a Person case class:
scala> case class Person(first_name: String, last_name: String, age: Int)
  1. Create a temporary view out of the person DataFrame:
scala> person.createOrReplaceTempView("persons")
  1. Create a DataFrame of teenagers:
scala> val teens = spark.sql("select * from persons where age > 12 and age < 20")
  1. Print the teens DataFrame:
scala> teens.show
  1. Convert it into a Dataset:
scala> val teensDS = teens.as[Person]
There are different ways to look at the difference between a DataFrame and a Dataset. Folks who still have a hangover from the Hive and Pig days can think of a DataFrame as the equivalent of Hive and a Dataset as the equivalent of Pig.
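One way to make the distinction concrete is to run the same teenager query against both abstractions. The following is a minimal sketch, not part of the recipe itself: it assumes a local Spark 2.x session and uses an in-memory stand-in for the Cassandra person table (same rows as inserted above), so it runs without a Cassandra cluster. The object name DataFrameVsDataset and the helper teenNames are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the Person case class in the recipe.
case class Person(first_name: String, last_name: String, age: Int)

object DataFrameVsDataset {

  // Hypothetical helper: returns the full names of the teenagers.
  def teenNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Assumption: in-memory stand-in for the Cassandra person table.
    val person = Seq(
      (1, "Barack", "Obama", 55),
      (2, "Joe", "Smith", 14),
      (3, "Billy", "Kid", 18)
    ).toDF("id", "first_name", "last_name", "age")

    // DataFrame: columns are referenced by name and resolved at run time.
    val teens = person.filter($"age" > 12 && $"age" < 20)

    // Dataset: each row is a Person, so field access is checked at compile time.
    // The extra id column is simply not mapped into the case class.
    val teensDS = teens.as[Person]
    teensDS.map(p => s"${p.first_name} ${p.last_name}").collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameVsDataset")
      .master("local[*]")
      .getOrCreate()
    teenNames(spark).foreach(println)
    spark.stop()
  }
}
```

A misspelled column name in the DataFrame filter would only fail when the query is analyzed at run time, whereas a misspelled field such as p.firstname in the Dataset map would be rejected by the Scala compiler.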