There are many operations you will want to repeat across various data sources and tables in Hive. For this scenario, it makes sense to write your own user-defined function (UDF). You can write your own subroutine in Java that operates on Writable input fields and invoke the function from Hive scripts whenever necessary. This recipe will walk you through the process of creating a very simple UDF that takes a source and returns yes or no for whether that source is reliable.
Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.
This recipe depends on having the Nigera_ACLED_cleaned.tsv dataset loaded into a Hive table with the name acled_nigeria_cleaned, with the following fields mapped to the respective datatypes.
Issue the following command to the Hive client:
describe acled_nigeria_cleaned;
You should see the following response:
OK
loc          string
event_date   string
event_type   string
actor        string
latitude     double
longitude    double
source       string
fatalities   int
Additionally, you will need to place the following recipe's code into a source package for bundling within a JAR file of your choice. This recipe will use <myUDFs.jar> as a reference point for your custom JAR file and <fully_qualified_path_to_TrustSourceUDF> as a reference point for the fully qualified name of the class within its Java package. For example, the fully qualified path of the Pattern class is java.util.regex.Pattern.
In addition to the core Hadoop libraries, your project will need the hive-exec and hive-common JAR dependencies on the classpath for this to compile.
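If you manage the build with Maven, the dependency block might look like the following sketch. The coordinates and version numbers are assumptions based on the Hive 0.7.1 release named above; verify them against your cluster:

```xml
<!-- Hypothetical coordinates; adjust versions to match your environment. -->
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.205.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.7.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>0.7.1</version>
  </dependency>
</dependencies>
```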
Perform the following steps to implement a custom Hive UDF:
1. Create TrustSourceUDF.java in the desired source package. Your class should exist at some package <fully_qualified_path>.TrustSourceUDF.class.
2. Enter the following source as the implementation for the TrustSourceUDF class:

   import org.apache.hadoop.hive.ql.exec.UDF;
   import org.apache.hadoop.io.Text;

   import java.util.HashSet;
   import java.util.Set;

   public class TrustSourceUDF extends UDF {

       private static Set<String> untrustworthySources = new HashSet<String>();

       private Text result = new Text();

       static {
           untrustworthySources.add("");
           untrustworthySources.add("\"\" http://www.afriquenligne.fr/3-soldiers\"");
           untrustworthySources.add("Africa News Service");
           untrustworthySources.add("Asharq Alawsat");
           untrustworthySources.add("News Agency of Nigeria (NAN)");
           untrustworthySources.add("This Day (Nigeria)");
       }

       public Text evaluate(Text source) {
           if (untrustworthySources.contains(source.toString())) {
               result.set("no");
           } else {
               result.set("yes");
           }
           return result;
       }
   }
3. Build the JAR file <myUDFs.jar> and test your UDF through the Hive client. Open a Hive client session through the command shell. Hive should already be on the local user environment path. Invoke the Hive shell with the following command:

   hive

4. Add the JAR to the session classpath:

   add jar /path/to/<myUDFs.jar>;
You will know that the preceding operation succeeded if you see the following messages indicating that the JAR has been added to the classpath and the distributed cache:
Added /path/to/<myUDFs.jar> to class path
Added resource: /path/to/<myUDFs.jar>
5. Create the temporary function trust_source as an alias to TrustSourceUDF at whatever source package you specified in your JAR:

   create temporary function trust_source as '<fully_qualified_path_to_TrustSourceUDF>';
You should see the shell prompt you that the command executed successfully. If you see the following error, it usually indicates your class was not found on the classpath:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
6. Test the function with the following query. You should see yes or no printed on each line of the result set, depending on whether each row's source is considered reliable:

   select trust_source(source) from acled_nigeria_cleaned;
The class TrustSourceUDF extends UDF. The UDF base class does not require any methods to be implemented; however, for the class to function as a UDF at Hive runtime, your subclass must supply an evaluate() method, which Hive resolves by reflection. You can have one or more overloaded evaluate() methods with different arguments. Ours only needs to take in a single source value to check.
During class initialization, we set up a static java.util.Set instance named untrustworthySources, backed by a HashSet. Within a static initialization block, we add the names of a few sources to be flagged as unreliable.
We flag an empty source as unreliable.
When the function is invoked, it expects a single Text instance to be checked against the sources we've flagged as unreliable. It returns yes or no depending on whether the given source appears in the set of unreliable sources. The private Text instance holding the result is re-used every time the function is called, avoiding a new allocation per row.
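The lookup logic itself is plain Java and can be sketched without the Hadoop types. The class and method below (TrustCheck, a static evaluate()) are hypothetical stand-ins that use String where the real UDF uses Text:

```java
import java.util.HashSet;
import java.util.Set;

// Hadoop-free sketch of the UDF's lookup logic; TrustCheck is a
// hypothetical stand-in, with String replacing org.apache.hadoop.io.Text.
public class TrustCheck {
    private static final Set<String> untrustworthySources = new HashSet<String>();

    static {
        // Same idea as the UDF: flag known-unreliable sources, including
        // the empty string for rows that carry no source at all.
        untrustworthySources.add("");
        untrustworthySources.add("Africa News Service");
        untrustworthySources.add("Asharq Alawsat");
    }

    // Membership test against the flagged set, mirroring evaluate(Text)
    public static String evaluate(String source) {
        return untrustworthySources.contains(source) ? "no" : "yes";
    }

    public static void main(String[] args) {
        System.out.println(evaluate("Asharq Alawsat")); // prints "no"
        System.out.println(evaluate("Reuters"));        // prints "yes"
    }
}
```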
Once the JAR file containing the class is added to the classpath and the temporary function definition is in place, we can use the UDF across many different queries.
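For example, once registered, trust_source can appear anywhere a built-in function can, such as in a WHERE clause. This hypothetical query counts the occurrences of each flagged source:

```sql
-- Hypothetical query; assumes the trust_source alias created earlier.
SELECT source, COUNT(1) AS occurrences
FROM acled_nigeria_cleaned
WHERE trust_source(source) = 'no'
GROUP BY source;
```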
User-defined functions are a very powerful feature within Hive. The following sections list a bit more information regarding them:
The Hive documentation has a great explanation of the built-in UDFs bundled with the language; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-BuiltinAggregateFunctions%28UDAF%29.
To see which functions are available in your specific version of Hive, issue the following command in the Hive shell:
show functions;
Once you pinpoint a function that looks interesting, learn more information about it from the Hive wiki or directly from the Hive shell by executing the following command:
describe function <func>;
Hive UDFs do not need to have a one-to-one interaction for input and output. The API allows the generation of many outputs from one input (GenericUDTF) as well as custom aggregate functions that take a list of input rows and output a single value (UDAF).
Adding JAR files dynamically to the classpath is useful for testing and debugging, but can be cumbersome if you have many libraries you repeatedly wish to use. The Hive command line interpreter will automatically look for the existence of HIVE_AUX_JARS_PATH
in the executing user's environment. Use this environment variable to set additional JAR paths that will always get loaded in the classpath of new Hive sessions for that client machine.
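For example, you might add a line like the following to the Hive user's shell profile; the path shown is a placeholder for your own JAR:

```shell
# Hypothetical profile entry: JARs listed here are added to the classpath
# of every new Hive session started on this client machine.
export HIVE_AUX_JARS_PATH=/path/to/myUDFs.jar
```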