You will often want to create tables over existing data that does not live within the managed Hive warehouse in HDFS. Creating a Hive external table is one of the easiest ways to handle this scenario. Queries from the Hive client will execute as they normally do over internally managed tables.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account. This recipe depends on having the weblog_entries
dataset loaded into an HDFS directory at the absolute path /input/weblog/weblog_records.txt
.
Carry out the following steps to map an external table in HDFS:
CREATE TABLE
syntax, as follows:DROP TABLE IF EXISTS weblog_entries; CREATE EXTERNAL TABLE weblog_entries ( md5 STRING, url STRING, request_date STRING, request_time STRING, ip STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY ' ' LOCATION '/input/weblog/';
weblog_create_external_table.hsql
in the working directory.–f
option to the Hive client, as follows:hive –f weblog_create_external_table.hql
OK Time taken: 2.654 seconds OK Time taken: 0.473 seconds
The existing definition of the table weblog_entries
is deleted if it already exists. Following this, the script issues a
CREATE
command with the EXTERNAL
keyword, which tells the Hive Metastore that the data is not managed by the Hive warehouse in HDFS.
The table is defined as having five fields per entry. The MD5 of the URL, the URL itself, the date of the request, the exact time of the request, and the IP address that the request was associated with.
ROW FORMAT DELIMITED
uses the native Hive
SerDe,
which is Hive's extensible and internal serialization/deserialization mechanism for reading and writing raw data. We explicitly tell the SerDe that a tab character separates each field and a newline character separates each record. The LOCATION
keyword is required by Hive when creating an external table. It points to the HDFS directory that contains the table data using an absolute path.
There are a few handy tips that you need to know when working with external tables.