How to do it...

  1. Start the Spark shell, or a Scala notebook in Databricks Cloud:
        $ spark-shell
  1. Create a case class for Person:
        scala> case class Person(firstName: String, lastName: String, age: Int)
  1. Load the person data as a Dataset and map it to the Person case class:
        scala> val personDS = spark.read.textFile("s3a://sparkcookbook/person").
                 map(line => line.split(",")).
                 map(p => Person(p(0), p(1), p(2).toInt))
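To sanity-check the mapping, you can inspect the inferred schema and peek at a few rows; printSchema and show are standard Dataset operations, and the row contents will depend on the data at the preceding path:
        scala> personDS.printSchema
        scala> personDS.show(5)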
  1. Register the person Dataset as a temp view so that SQL queries can be run against it. Note that the Dataset name does not have to be the same as the view name:
        scala> personDS.createOrReplaceTempView("person")
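As a quick check that the view is in place, a simple count works; this is just an illustrative query, not part of the recipe:
        scala> spark.sql("select count(*) from person").show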
  1. Select all persons whose age is over 60 years:
        scala> val sixtyPlus = spark.sql("select * from person where age > 60")
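If you prefer the typed Dataset API over SQL, the same filter can be written directly against personDS; a minimal equivalent sketch (sixtyPlusDS is an illustrative name):
        scala> val sixtyPlusDS = personDS.filter(_.age > 60)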
  1. Print the values:
        scala> sixtyPlus.show
  1. Save this sixtyPlus DataFrame in the Parquet format:
        scala> sixtyPlus.write.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")

What if you would like to write the data in a compressed format, for example, snappy? Use the following code:

        scala> sixtyPlus.write.option("compression", "snappy").parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
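Other codecs can be selected the same way, for example, gzip. Note that recent Spark versions already use snappy as the default Parquet codec, and rewriting an existing path needs an explicit save mode; a minimal sketch:

        scala> sixtyPlus.write.mode("overwrite").option("compression", "gzip").parquet("hdfs://localhost:9000/user/hduser/sp.parquet")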
  1. The previous step created a directory called sp.parquet under /user/hduser in HDFS. You can run the hdfs dfs -ls command in another shell to make sure that it's created:
        $ hdfs dfs -ls sp.parquet
  1. Load the contents of the Parquet files in the Spark shell:
        scala> val parquetDF = spark.read.parquet("hdfs://localhost:9000/user/hduser/sp.parquet")
  1. Register the loaded Parquet DataFrame as a temp view:
        scala> parquetDF.createOrReplaceTempView("sixty_plus")
  1. Run a query against the preceding temp view and display the result:
        scala> spark.sql("select * from sixty_plus").show