String concatenation is a very common operation in any development task. It frequently comes up when using Hive for report generation and even simple ETL tasks. This recipe will show a very basic and useful example using one of the Hive string concatenation UDFs.
In this recipe, we will take the separate
request_date
and
request_time
fields from the weblog_entries
and print a single concatenated column to the console for every record, containing both the request_date
and request_time
fields separated by an underscore (_
).
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster, with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the
weblog_entries
dataset loaded into a Hive table named weblog_entries
with the following fields
mapped to the respective datatypes.
Issue the following command to the Hive client:
describe weblog_entries
You should see the following response:
OK md5 string url string request_date string request_time string ip string
Carry out the following steps to perform string concatenation in HiveQL:
SELECT concat_ws('_', request_date, request_time) FROM weblog_entries;
weblog_concat_date_time.hql
in the active directory.–f
option to the Hive client. You should see the results of the SELECT
statement printed out to the console. The following snippet is an example that contains only two sample rows. The full printout will contain all 3000 rows.2012-05-10_21:33:26 2012-05-10_21:13:10
The script relies on the Hive built-in UDF to concatenate two strings together with a supplied separator token. For each row, it supplies the function with the respective request_date
and request_time
values that correspond to that row. The output of the function is a single string containing both the fields separated by an underscore (_
). Since the SELECT
statement consists of only that function, and the function outputs just a single string, we see a single column for all 3000 rows, one printed
per line.
The following are a few additional notes to help with the concat_ws()
function:
If you pass non-string datatypes as parameters to
concat_ws()
, you will be greeted with a very descriptive error message:
FAILED: Error in semantic analysis: Line 1:21 Argument type mismatch field1: Argument 2 of function CONCAT_WS must be "string", but "int" was found.
If you wish to encapsulate the auto-casting of your parameters to string, use the regular concat()
function.
Like most Hive UDFs, you can alias the output of concat_ws()
. This comes in handy if you are persisting the results of the concatenation and want a very descriptive column header.
When using concat_ws()
, you must, at the very least, supply the separator character first and one input string parameter to be printed out. However, you are not limited in the number of input string parameters that you can supply to be concatenated and separated.
The following usage is valid:
concat_ws('_','test')
The following output will be printed to the console:
test
The following usage of the concat_ws()
function is also valid:
concat_ws('_','hi','there','my','name','is')
The following output will be printed to the console:
hi_there_my_name_is