In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameCSV.scala.
This recipe involves four steps:
1. Adding the spark-csv support to our project.
2. Creating a SparkConf that describes the environment in which we intend to run Spark.
3. Creating a Spark context, and then an SQLContext from the Spark context.
4. Loading the CSV into a DataFrame.

Adding the spark-csv support is easy: we just declare it as a dependency in our build.sbt. After adding the spark-csv dependency, our complete build.sbt looks like this:
organization := "com.packt" name := "chapter1-spark-csv" scalaVersion := "2.10.4" val sparkVersion="1.4.1" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion, "org.apache.spark" %% "spark-sql" % sparkVersion, "com.databricks" %% "spark-csv" % "1.0.3" )
SparkConf holds all of the information required to run this Spark "cluster." For this recipe, we are running locally, and we intend to use only two cores in the machine (local[2]). More details about this can be found in the There's more… section of this recipe:

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

Next, we initialize the Spark context with this configuration:

import org.apache.spark.SparkContext

val sc = new SparkContext(conf)
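Although the recipe does not show it, it is good practice to stop the context at the very end of the program, once all of the processing is done, so that the cores and other resources are released:

// At the very end of the program, after all processing is complete,
// release the resources held by this application.
sc.stop()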
The easiest way to query data in Spark is by using SQL queries, for which we create an SQLContext:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
Now, let's load our pipe-separated file into a DataFrame. The resulting value, students, is of type org.apache.spark.sql.DataFrame:

import com.databricks.spark.csv._

val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag reads the first row as column names. The delimiter flag defaults to a comma, but you can override the character as needed.
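Tying this back to the SQL queries mentioned earlier, here is a minimal sketch of how the loaded DataFrame could be inspected and queried. The table name and the studentName column are assumptions about the sample data, not something this recipe specifies:

// Print the schema that spark-csv derived from the header row.
students.printSchema()

// Register the DataFrame as a temporary table so that it can be queried with SQL.
students.registerTempTable("students")

// The column name below is assumed; check printSchema() for the actual columns.
sqlContext.sql("SELECT studentName FROM students").show()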
Instead of using the csvFile function, we could also use the load function available in SQLContext. The load function accepts the format of the file (in our case, it is CSV) and the options as a Map. We can specify the same parameters that we specified earlier using a Map, like this:
val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")

val newStudents = sqlContext.load("com.databricks.spark.csv", options)
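Since no delimiter entry appears in the options, ModifiedStudent.csv is presumably comma-separated (the default). To load the pipe-separated StudentData.csv through load instead, a delimiter option could be added; a sketch using the option keys of the spark-csv package:

// The "header" and "delimiter" keys are options understood by spark-csv.
val pipeOptions = Map(
  "header" -> "true",
  "path" -> "StudentData.csv",
  "delimiter" -> "|"
)

val pipeStudents = sqlContext.load("com.databricks.spark.csv", pipeOptions)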
As we saw earlier, we ran the Spark program in standalone mode. In standalone mode, the Driver program (the brain) and the Worker nodes all get crammed into a single JVM. In our example, we set master to local[2], which means that we intend to run Spark in standalone mode and request it to use only two cores in the machine.
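local[2] is just one of the master URL settings that Spark accepts. A few common alternatives are sketched below; the host names and ports are placeholders, not values from this recipe:

// Common master URL settings (hosts and ports are placeholders):
//   "local"              - run locally with a single core
//   "local[n]"           - run locally with n cores
//   "local[*]"           - run locally with as many cores as are available
//   "spark://host:7077"  - connect to a Spark standalone cluster
//   "mesos://host:5050"  - connect to a Mesos cluster
val clusterConf = new SparkConf().setAppName("csvDataFrame").setMaster("spark://host:7077")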
Spark can be run in three different modes:
- Standalone mode, where everything runs inside a single JVM, as in this recipe
- The inbuilt (standalone) cluster mode that ships with Spark
- Using a cluster manager, such as Mesos or YARN

In Chapter 6, Scaling Up, we have dedicated explanations and recipes for how to run Spark on its inbuilt cluster mode as well as on Mesos and YARN. In a clustered environment, Spark runs a Driver program along with a number of Worker nodes. As the name indicates, the Driver program houses the brain of the program, which is our main program. The Worker nodes hold the data and perform various transformations on it.
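In practice, the only code-level difference between these modes is the master URL. One way to keep the recipe's application mode-agnostic is to set a local master only as a fallback, so that a cluster master can be supplied externally (for example, through spark-submit's --master flag); this pattern is our own illustration, not something the recipe prescribes:

import org.apache.spark.SparkConf

// Set a local master only if one hasn't been supplied externally
// (for example, via spark-submit's --master flag).
val modeAgnosticConf = new SparkConf().setAppName("csvDataFrame")
if (!modeAgnosticConf.contains("spark.master")) {
  modeAgnosticConf.setMaster("local[2]")
}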