How to do it...

Create a table named person in MySQL using the following DDL:

        CREATE TABLE 'person' ( 
          'person_id' int(11) NOT NULL AUTO_INCREMENT, 
          'first_name' varchar(30) DEFAULT NULL, 
          'last_name' varchar(30) DEFAULT NULL, 
          'gender' char(1) DEFAULT NULL, 
          'age' tinyint(4) DEFAULT NULL, 
          PRIMARY KEY ('person_id') 
        )

Insert some data:

        Insert into person values('Barack','Obama','M',55); 
        Insert into person values('Bill','Clinton','M',70); 
        Insert into person values('Hillary','Clinton','F',69); 
        Insert into person values('Bill','Gates','M',61); 
        Insert into person values('Michelle','Obama','F',52);

Download mysql-connector-java-x.x.xx-bin.jar from http://dev.mysql.com/downloads/connector/j/.
Make the MySQL driver available to the spark-shell and launch it:

        $ spark-shell --driver-class-path/path-to-mysql-jar/mysql-
          connector-java-5.1.34-bin.jar

Note that path-to-mysql-jar is not the actual path name. You need to use your pathname. This is just a syntax.

Construct a JDBC URL:

        scala> val url="jdbc:mysql://localhost:3306/hadoopdb"

Create a connection properties object with a username and password:

        scala> val prop = new java.util.Properties
scala> prop.setProperty("user","hduser")
scala> prop.setProperty("password","********")

Load the DataFrame with the JDBC data source (url, table name, and properties):

        scala> val people = sqlContext.read.jdbc(url,"person",prop)

Show the results in a nice tabular format by executing the following command:

        scala> people.show

This has loaded the whole table. What if I would like to only load males (url, table name, predicates, and properties)? To do this, run the following command:

        scala> val males = spark.read.jdbc
          (url,"person",Array("gender='M'"),prop)
scala> males.show

Show only the first names by executing the following command:

        scala> val first_names = people.select("first_name")
scala> first_names.show

Show only people below 60 years of age by executing the following command:

        scala> val below60 = people.filter(people("age") < 60)
scala> below60.show

Group people by gender as follows:

        scala> val grouped = people.groupBy("gender")

Find the number of males and females by executing the following command:

        scala> val gender_count = grouped.count
scala> gender_count.show

Find the average age of males and females by executing the following command:

        scala> val avg_age = grouped.avg("age")
scala> avg_age.show

Now if you'd like to save this avg_age data to a new table, run the following command:

        scala> gender_count.write.jdbc(url,"gender_count",prop)

Save the people DataFrame in the Parquet format:

        scala> people.write.parquet("people.parquet")

Save the people DataFrame in the JSON format:

        scala> people.write.json("people.json")

Table of Contents for How to do it...

Create new playlist

Sign In

Sign Up

Table of Contents for
How to do it...