- Start the Spark shell:
$ spark-shell
- Load the data from Parquet; since Parquet is the default data source, you do not have to specify it:
scala> val people = spark.read.load("hdfs://localhost:
9000/user/hduser/people.parquet")
- Load the data from Parquet by manually specifying the format:
scala> val people = spark.read.format("parquet").load("hdfs://localhost:9000/user/hduser/people.parquet")
- For built-in data sources, you do not have to specify the fully qualified format name; the short names "parquet", "json", or "jdbc" are enough.
scala> val people = spark.read.format("parquet").load
("hdfs://localhost:9000/user/hduser/people.parquet")
When writing data, there are four save modes: append, overwrite, errorIfExists, and ignore. The append mode adds data to an existing data source, overwrite replaces it, errorIfExists throws an exception if data already exists at the target, and ignore does nothing when data already exists.
- Save people as JSON in the append mode (note that write returns a DataFrameWriter, so the result is not assigned to a val):
scala> people.write.format("json").mode("append").save("hdfs://localhost:9000/user/hduser/people.json")