- Create a table named person in MySQL using the following DDL:
CREATE TABLE 'person' (
'person_id' int(11) NOT NULL AUTO_INCREMENT,
'first_name' varchar(30) DEFAULT NULL,
'last_name' varchar(30) DEFAULT NULL,
'gender' char(1) DEFAULT NULL,
'age' tinyint(4) DEFAULT NULL,
PRIMARY KEY ('person_id')
)
- Insert some data:
Insert into person values('Barack','Obama','M',55);
Insert into person values('Bill','Clinton','M',70);
Insert into person values('Hillary','Clinton','F',69);
Insert into person values('Bill','Gates','M',61);
Insert into person values('Michelle','Obama','F',52);
- Download mysql-connector-java-x.x.xx-bin.jar from http://dev.mysql.com/downloads/connector/j/.
- Make the MySQL driver available to the spark-shell and launch it:
$ spark-shell --driver-class-path/path-to-mysql-jar/mysql-
connector-java-5.1.34-bin.jar
Note that path-to-mysql-jar is not the actual path name. You need to use your pathname. This is just a syntax.
- Construct a JDBC URL:
scala> val url="jdbc:mysql://localhost:3306/hadoopdb"
- Create a connection properties object with a username and password:
scala> val prop = new java.util.Properties
scala> prop.setProperty("user","hduser")
scala> prop.setProperty("password","********")
- Load the DataFrame with the JDBC data source (url, table name, and properties):
scala> val people = sqlContext.read.jdbc(url,"person",prop)
- Show the results in a nice tabular format by executing the following command:
scala> people.show
- This has loaded the whole table. What if I would like to only load males (url, table name, predicates, and properties)? To do this, run the following command:
scala> val males = spark.read.jdbc
(url,"person",Array("gender='M'"),prop)
scala> males.show
- Show only the first names by executing the following command:
scala> val first_names = people.select("first_name")
scala> first_names.show
- Show only people below 60 years of age by executing the following command:
scala> val below60 = people.filter(people("age") < 60)
scala> below60.show
- Group people by gender as follows:
scala> val grouped = people.groupBy("gender")
- Find the number of males and females by executing the following command:
scala> val gender_count = grouped.count
scala> gender_count.show
- Find the average age of males and females by executing the following command:
scala> val avg_age = grouped.avg("age")
scala> avg_age.show
- Now if you'd like to save this avg_age data to a new table, run the following command:
scala> gender_count.write.jdbc(url,"gender_count",prop)
- Save the people DataFrame in the Parquet format:
scala> people.write.parquet("people.parquet")
- Save the people DataFrame in the JSON format:
scala> people.write.json("people.json")