This recipe will outline a shorthand technique for inline table creation when the query is executed. Having to create every table definition up front is impractical and does not scale for large ETL. Being able to dynamically define intermediate tables is tremendously useful for complex analytics with multiple staging points.
In this recipe, we will create a new table that contains three fields from the weblog entry dataset, namely request_date
, request_time
, and url
. In addition to this, we will define a new field called url_length
.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the weblog_entries
dataset loaded into Hive table named weblog_entries
with the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe weblog_entries
You should see the following response:
OK md5 string url string request_date string request_time string ip string
Carry out the following steps to create an inline table definition using an alias:
CREATE TABLE weblog_entries_with_url_length AS SELECT url, request_date, request_time, length(url) as url_length FROM weblog_entries;
weblog_entries_create_table_as.hql
in the active directory.–f
option to the Hive client, as follows:hive –f weblog_create_table_as.hql
–e
option:hive –e "describe weblog_entries_with_url_length"
string
fields and a fourth int
field holding the URL length:OK url string request_date string request_time string url_length int
The following statement initially defines a new table by the name weblog_entries_with_url_length
:
CREATE TABLE weblog_entries_with_url_length AS
We then define the body of this table as an alias to the result set of a nested SELECT
statement. In this case, our SELECT
statement simply grabs the url
, request_date
, and request_time
fields from each entry in the weblog_entries
table. The field names are copied as field names to our new table weblog_entires_with_url_length
. We also defined an additional field aliased as url_length
to be calculated for each selected record. It stores an int value that represents the number of characters in the record's url
field.
SELECT url, request_date, request_time, length(url) as url_length FROM weblog_entries;
In one simple statement, we created a table with a subset of fields from our starting table, as well as a new derived field.
The following are a few reminders for when using external tables:
As of Apache Hive 0.7.1, you cannot create external tables using aliases with SELECT
statements.
The ease of the CREATE TABLE AS
syntax lets Hive users create new tables very quickly, but don't forget to DROP
any temporary tables. If you are scripting the CREATE ALIAS
for repeated use, the next execution, especially, will fail if there are table name conflicts. Moreover, such intermediate tables will create a warehouse namespace that will quickly become unmanageable.