- Start the Spark shell, or a Scala notebook in Databricks Cloud (the shell pre-creates a SparkSession named spark):
$ spark-shell
- Create a case class for Person:
scala> case class Person(firstName: String, lastName: String, age: Int)
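To see how a raw comma-separated input line maps onto this class, here is a quick local sketch with a made-up record (the real data lives at the s3a path used in the next step):
scala> val fields = "Jane,Doe,42".split(",")   // hypothetical sample line
scala> val jane = Person(fields(0), fields(1), fields(2).toInt)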
- Load the person data as a Dataset and map it to the Person case class:
scala> val personDS = spark.read.textFile("s3a://sparkcookbook/person")
         .map(line => line.split(","))
         .map(p => Person(p(0), p(1), p(2).toInt))
- Register the person Dataset as a temp view so that SQL queries can be run against it. Note that the Dataset name does not have to be the same as the view name:
scala> personDS.createOrReplaceTempView("person")
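If you want to confirm that the view is registered, the catalog API offers a quick check (not part of the original recipe):
scala> spark.catalog.listTables.show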
- Select all persons whose age is over 60 years:
scala> val sixtyPlus = spark.sql("select * from person where age > 60")
- Print the values:
scala> sixtyPlus.show
- Save this sixtyPlus DataFrame in the Parquet format:
scala> sixtyPlus.write.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
What if you would like to write the data in a compressed format, for example, snappy? Set the compression option on the writer (snappy is, in fact, the default Parquet codec in recent Spark releases):
scala> sixtyPlus.write.option("compression", "snappy")
         .parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
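Other codecs follow the same pattern, for example gzip; the sp_gzip.parquet output path here is hypothetical:
scala> sixtyPlus.write.option("compression", "gzip")
         .parquet("hdfs://localhost:9000/user/hduser/sp_gzip.parquet")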
- The previous step created a directory called sp.parquet under /user/hduser in HDFS (not the HDFS root). You can run the hdfs dfs -ls command in another shell to make sure it was created; the relative path below resolves against the current user's HDFS home directory:
$ hdfs dfs -ls sp.parquet
- Load the contents of the Parquet files in the Spark shell:
scala> val parquetDF = spark.read.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
- Register the loaded Parquet DataFrame as a temp view:
scala> parquetDF.createOrReplaceTempView("sixty_plus")
- Run a query against the preceding temp view and display the results:
scala> spark.sql("select * from sixty_plus").show