- Start the Spark shell, or a Scala notebook in Databricks Cloud (the shell pre-creates a SparkSession named spark):
$ spark-shell
- Create a case class for Person:
scala> case class Person(firstName: String, lastName: String, age: Int)
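To see how a raw comma-separated input line maps onto this class, here is a quick local sketch with a made-up record (the real data lives at the s3a path used in the next step):
scala> val fields = "Jane,Doe,42".split(",")   // hypothetical sample line
scala> val jane = Person(fields(0), fields(1), fields(2).toInt)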
- Load the person data as a Dataset and map it to the Person case class:
scala> val personDS = spark.read.textFile("s3a://sparkcookbook/person")
         .map(line => line.split(","))
         .map(p => Person(p(0), p(1), p(2).toInt))
- Register the person Dataset as a temp view so that SQL queries can be run against it. Note that the Dataset name does not have to be the same as the view name:
scala> personDS.createOrReplaceTempView("person")
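If you want to confirm that the view is registered, the catalog API offers a quick check (not part of the original recipe):
scala> spark.catalog.listTables.show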
- Select all persons whose age is over 60 years:
scala> val sixtyPlus = spark.sql("select * from person where age > 60")
- Print the values:
scala> sixtyPlus.show
- Save this sixtyPlus DataFrame in the Parquet format:
scala> sixtyPlus.write.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
What if you would like to write the data in a compressed format, for example, snappy? Set the compression option on the writer (snappy is, in fact, the default Parquet codec in recent Spark releases):
scala> sixtyPlus.write.option("compression", "snappy")
         .parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
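Other codecs follow the same pattern, for example gzip; the sp_gzip.parquet output path here is hypothetical:
scala> sixtyPlus.write.option("compression", "gzip")
         .parquet("hdfs://localhost:9000/user/hduser/sp_gzip.parquet")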
- The previous step created a directory called sp.parquet under /user/hduser in HDFS (not the HDFS root). You can run the hdfs dfs -ls command in another shell to make sure it was created; the relative path below resolves against the current user's HDFS home directory:
$ hdfs dfs -ls sp.parquet
- Load the contents of the Parquet files in the Spark shell:
scala> val parquetDF = spark.read.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
- Register the loaded Parquet DataFrame as a temp view:
scala> parquetDF.createOrReplaceTempView("sixty_plus")
- Run a query against the preceding temp view and display the results:
scala> spark.sql("select * from sixty_plus").show