- Start the Spark shell:
$ spark-shell
- Load the data from Parquet; since Parquet is the default data source, you do not have to specify it:
scala> val people = spark.read.load("hdfs://localhost:
9000/user/hduser/people.parquet")
- Load the data from Parquet by manually specifying the format:
scala> val people = spark.read.format("parquet").load("hdfs://localhost:9000/user/hduser/people.parquet")
- For built-in data sources, you do not have to specify the fully qualified format name; the short names "parquet", "json", or "jdbc" are enough.
scala> val people = spark.read.format("parquet").load
("hdfs://localhost:9000/user/hduser/people.parquet")
When writing data, there are four save modes: append, overwrite, errorIfExists, and ignore. The append mode adds data to an existing data source, overwrite replaces it, errorIfExists throws an exception if data already exists at the target, and ignore does nothing when data already exists.
- Save people as JSON in the append mode (note that write returns a DataFrameWriter, so the result is not assigned to a val):
scala> people.write.format("json").mode("append").save("hdfs://localhost:9000/user/hduser/people.json")