Hive is a Hadoop-based data warehousing framework developed by Facebook. It lets users write queries in HiveQL, a SQL-like language that Hive compiles into Hadoop MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse, and makes it easier to integrate Hadoop with business intelligence and visualization tools.
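To get a feel for this, the sketch below writes a small HiveQL script; the o_account table and its email column appear later in this chapter, but this particular query is illustrative, not taken from the book's examples:

```shell
# Write an illustrative HiveQL script. HiveQL reads like ordinary SQL,
# even though Hive executes it as MapReduce jobs under the hood.
cat > sample.hql <<'EOF'
SELECT email, COUNT(*) AS accounts
FROM o_account
GROUP BY email;
EOF
# The script can then be run with: hive -f sample.hql
```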
The following are the features of Hive:
- HiveQL, a SQL-like query language that is compiled into MapReduce jobs
- A metastore that keeps table schemas and other metadata in a relational database
- Support for user-defined functions (UDFs)
- Table partitioning and bucketing for faster access to large datasets
Prerequisites for RHive are as follows:
- Hadoop
- Hive
- R, with the rJava and Rserve packages installed
We assume here that our readers have already configured Hadoop; if not, they can learn Hadoop installation from Chapter 1, Getting Ready to Use R and Hadoop. As Hive is required for running RHive, we will first see how Hive can be installed.
The commands to install Hive are as follows:
# Download the Hive source from an Apache mirror
wget http://www.motorlogy.com/apache/hive/hive-0.11.0/hive-0.11.0.tar.gz
# Extract the Hive source
tar xzvf hive-0.11.0.tar.gz
To set up the Hive configuration, we need to update the hive-site.xml file with a few additions:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- Adjust the host and database name to match your MySQL metastore -->
  <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
Update the hive-log4j.properties file by adding the following line:

log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
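This edit can be scripted so that the line is appended only once; a minimal sketch (the HIVE_LOG4J default below is a scratch file so the sketch is self-contained; point it at your real conf file):

```shell
# Append the EventCounter appender only if it is not already present.
# HIVE_LOG4J defaults to a temporary file here for demonstration; set it to
# $HIVE_HOME/conf/hive-log4j.properties in a real installation.
HIVE_LOG4J="${HIVE_LOG4J:-$(mktemp)}"
LINE='log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter'
grep -qxF "$LINE" "$HIVE_LOG4J" || printf '%s\n' "$LINE" >> "$HIVE_LOG4J"
```

Because of the grep guard, re-running the snippet will not duplicate the property.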
Set the HIVE_HOME environment variable:

export HIVE_HOME=/usr/local/hive-0.11.0
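It is also convenient to put Hive's launcher script on the PATH; the install location below matches the path used above (adjust it if you extracted Hive elsewhere):

```shell
# Export HIVE_HOME and add Hive's bin directory to the PATH so the
# `hive` command can be run from any directory.
export HIVE_HOME=/usr/local/hive-0.11.0
export PATH="$PATH:$HIVE_HOME/bin"
```

Adding these two lines to ~/.bashrc makes the setting persist across sessions.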
Before running Hive, create the HDFS directories it needs and make them group writable:

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
We will now see how to load and operate on Hive datasets in R using the RHive library:
# Initialize the RHive library
rhive.init()

# Connect to the Hive server (replace the IP address with your server's)
rhive.connect("192.168.1.210")

# List the tables available in the Hive warehouse
rhive.list.tables()
             tab_name
1 hive_algo_t_account
2           o_account
3         r_t_account

# Describe the structure of the o_account table
rhive.desc.table('o_account')
     col_name data_type comment
1          id       int
2       email    string
3 create_date    string

# Run a HiveQL query over the table
rhive.query("select * from o_account")

# Close the connection to the Hive server
rhive.close()