Spark SQL is generally used in two ways. The first is as a library: you write SQL, HiveQL, or DSL queries from languages such as Java, Scala, Python, or R. The second is as a distributed SQL engine: clients connect to a Thrift server and submit SQL or HiveQL queries over JDBC or ODBC interfaces. The second mode is especially useful for data warehousing users who want to write and execute queries interactively from Business Intelligence (BI) tools. So, Spark SQL serves both as a data warehousing solution and as a distributed SQL query engine.
Spark SQL's Thrift server provides JDBC access to Spark SQL. The Thrift JDBC server corresponds to HiveServer2 in Hive. You can test the JDBC server with the beeline client or any SQL client. As of Spark 1.6, the Thrift server runs in multi-session mode by default.
For a complete list of options for starting the Thrift server, use the following command. Note that the options are similar to those of spark-submit, spark-shell, or the pyspark shell:
./start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]
To start the JDBC/ODBC server, run the following command in Spark's sbin directory:
./start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.bind.host=localhost --hiveconf hive.server2.thrift.port=10001
Check the Thrift server log in the logs directory, which shows the server connecting to the Hive metastore and the Thrift server listening on port 10001. The JDBC/ODBC Server tab in Spark's UI shows the number of connected users, their details, and SQL statistics. The easiest way to open this UI is to go to YARN's ResourceManager UI and then click on the application's Application Master tracking UI.
Apache Spark needs to be built with Hive support by adding the -Phive and -Phive-thriftserver profiles to the build options.
After starting the Thrift server, it can be tested using beeline:
bin/beeline -u "jdbc:hive2://localhost:10001/default;hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"
beeline will prompt you for a username and password. In non-secure mode, a blank username and password will work.
Once you are connected through the beeline client, issue HiveQL or SQL commands:
0: jdbc:hive2://localhost:10001/default> show tables;
+------------+--------------+--+
| tableName  | isTemporary  |
+------------+--------------+--+
| customers  | false        |
| log        | false        |
| sample_07  | false        |
| sample_08  | false        |
| web_logs   | false        |
+------------+--------------+--+

0: jdbc:hive2://localhost:10001/default> CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "/user/cloudera/people.json"
);

0: jdbc:hive2://localhost:10001/default> show tables;
+------------+--------------+--+
| tableName  | isTemporary  |
+------------+--------------+--+
| customers  | false        |
| log        | false        |
| sample_07  | false        |
| sample_08  | false        |
| web_logs   | false        |
| jsontable  | true         |
+------------+--------------+--+

0: jdbc:hive2://localhost:10001/default> select * from jsontable;
+-------+----------+--+
| age   | name     |
+-------+----------+--+
| NULL  | Michael  |
| 30    | Andy     |
| 19    | Justin   |
+-------+----------+--+
3 rows selected (1.13 seconds)

0: jdbc:hive2://localhost:10001/default> select * from jsontable where age > 19;
+------+-------+--+
| age  | name  |
+------+-------+--+
| 30   | Andy  |
+------+-------+--+
1 row selected (0.624 seconds)

0: jdbc:hive2://localhost:10001/default> REFRESH TABLE jsonTable;

0: jdbc:hive2://localhost:10001/default> CREATE TEMPORARY TABLE jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql:dbserver",
  dbtable "schema.tablename"
);
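Any JDBC client can connect with the same kind of connection string that was passed to beeline: a HiveServer2 URL of the form jdbc:hive2://host:port/database, followed by semicolon-separated session parameters. As a rough sketch of how such URLs are assembled (the helper function below is ours for illustration, not part of any Spark or Hive API):

```python
def hive2_jdbc_url(host, port, database="default", **session_conf):
    """Assemble a HiveServer2-style JDBC URL like the one passed to beeline.

    Extra keyword arguments become semicolon-separated session parameters,
    e.g. hive.server2.transport.mode=http.
    """
    url = "jdbc:hive2://{0}:{1}/{2}".format(host, port, database)
    for key, value in session_conf.items():
        url += ";{0}={1}".format(key, value)
    return url

# Reproduces the URL used with beeline above. The parameter names contain
# dots, so they are passed via a dict rather than as plain keywords:
params = {
    "hive.server2.transport.mode": "http",
    "hive.server2.thrift.http.path": "cliservice",
}
print(hive2_jdbc_url("localhost", 10001, **params))
```

A client library such as a Java JDBC driver would then open a connection against this URL; the host and port must match the --hiveconf values used when starting the Thrift server.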
The SQL tab on Spark's UI shows all the finished jobs, with the logical plan, physical plan, and a visual DAG for each job. This is useful for debugging and performance tuning.
The Spark SQL CLI is a command-line tool that runs the Hive metastore service in local mode and executes queries from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.
To start the Spark SQL CLI, use the following command:
./bin/spark-sql
spark-sql> show tables;
spark-sql> select count(*) from tab1;
You may run ./bin/spark-sql --help for a complete list of all available options.
To integrate BI tools such as Tableau or QlikView, install the ODBC driver for Spark from http://databricks.com/spark-odbc-driver-download.
Configure and start the Thrift server as described in the previous section. Start Tableau and select the option to connect to Spark SQL. Enter the hostname, port number 10001, and the necessary authentication details, then click OK. This enables you to access tables from the Hive metastore and execute queries on the Spark execution engine.
Note that you can only query tables that are registered with the Hive metastore.