In this recipe, we'll see how to create a new DataFrame from Scala case classes.
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameFromCaseClasses.scala.
Let's first create a case class named Employee with the id and name fields, like this:

case class Employee(id: Int, name: String)
Similar to the previous recipe, we create a SparkContext and an SQLContext:

//Initialize Spark context with Spark configuration. This is the core entry point to do anything with Spark
val conf = new SparkConf().setAppName("colRowDataFrame").setMaster("local[2]")
val sc = new SparkContext(conf)

//The easiest way to query data in Spark is to use SQL queries
val sqlContext = new SQLContext(sc)
val listOfEmployees = List(Employee(1, "Arun"), Employee(2, "Jason"), Employee(3, "Abhi"))
The next step is to pass the listOfEmployees to the createDataFrame function of SQLContext. That's it! We now have a DataFrame. When we print its schema using the printSchema method, we see that the DataFrame has two columns, named id and name, as defined in the case class:

//Pass the employees into the `createDataFrame` function
val empFrame = sqlContext.createDataFrame(listOfEmployees)
empFrame.printSchema
Output:

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
As you might have guessed, the schema of the DataFrame is inferred from the case class using reflection.
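As a quick illustration (not part of the original recipe), reflection also maps Scala's Option types to nullable columns. The Contact case class below is a hypothetical example; a plain Int field is inferred as non-nullable, while an Option[String] field is inferred as a nullable string:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

//Hypothetical case class, not from the recipe: the optional
//email field is inferred as a nullable string column
case class Contact(id: Int, email: Option[String])

object NullableSchemaSketch extends App {
  val conf = new SparkConf().setAppName("nullableSketch").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val contacts = sqlContext.createDataFrame(
    Seq(Contact(1, Some("arun@example.com")), Contact(2, None)))

  //id comes out as integer (nullable = false); email as string (nullable = true)
  contacts.printSchema
}
```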
We can also override the column names derived from the case class using the withColumnRenamed function, as shown here:

val empFrameWithRenamedColumns = sqlContext.createDataFrame(listOfEmployees).withColumnRenamed("id", "empId")
empFrameWithRenamedColumns.printSchema
Output:

root
 |-- empId: integer (nullable = false)
 |-- name: string (nullable = true)
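If more than one column needs a new name, withColumnRenamed calls can simply be chained. This is a small sketch that reuses the sqlContext and listOfEmployees defined earlier in the recipe:

```scala
//Sketch: chaining withColumnRenamed to rename both columns
//(assumes sqlContext and listOfEmployees from earlier in this recipe)
val renamedBoth = sqlContext.createDataFrame(listOfEmployees)
  .withColumnRenamed("id", "empId")
  .withColumnRenamed("name", "empName")
renamedBoth.printSchema
```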
Let's now query the DataFrame using SQL. The registerTempTable method, as we saw in the previous recipe, helps us achieve this. With the following command, we register the DataFrame as a table named "employeeTable":

empFrameWithRenamedColumns.registerTempTable("employeeTable")
We can now run SQL queries against this table; here, we sort the employees by name in descending order:

val sortedByNameEmployees = sqlContext.sql("select * from employeeTable order by name desc")
sortedByNameEmployees.show()
Output:

+-----+-----+
|empId| name|
+-----+-----+
|    2|Jason|
|    1| Arun|
|    3| Abhi|
+-----+-----+
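Any SQL supported by Spark's parser can be run against the registered table, not just sorting. As a hedged sketch, a hypothetical filter on empId against the same employeeTable would look like this:

```scala
//Sketch: filtering via SQL against the temp table registered above
val firstTwoEmployees = sqlContext.sql("select empId, name from employeeTable where empId <= 2")
firstTwoEmployees.show()
```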
The createDataFrame function accepts a sequence of scala.Product. Scala case classes extend Product, and therefore fit the bill. That said, we can actually use a sequence of tuples to create a DataFrame, since tuples implement Product too:
val mobiles = sqlContext.createDataFrame(Seq((1, "Android"), (2, "iPhone")))
mobiles.printSchema
mobiles.show()
Output:

//Schema
root
 |-- _1: integer (nullable = false)
 |-- _2: string (nullable = true)

//Data
+--+-------+
|_1|     _2|
+--+-------+
| 1|Android|
| 2| iPhone|
+--+-------+
Of course, you can rename the columns using withColumnRenamed.
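As an alternative to renaming after the fact (this alternative is my addition, not part of the recipe), SQLContext's implicits also let you name tuple columns up front with toDF:

```scala
//Sketch: naming the tuple columns at creation time instead of
//renaming them later with withColumnRenamed
import sqlContext.implicits._

val mobilesNamed = Seq((1, "Android"), (2, "iPhone")).toDF("id", "name")
mobilesNamed.printSchema
```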