In the previous chapters, you learned about Drill’s powerful analytic capabilities. There are many situations in which you might want to perform a transformation on some data and Drill simply does not have the capability readily at hand. However, it is quite possible to extend Drill’s capabilities by writing your own user-defined functions (UDFs).
Drill supports two different types of UDFs: simple and aggregate. A simple UDF accepts a column or expression as input and returns a single result. The result can be a complex type such as an array or map. An aggregate UDF is different in that it accepts as input all the values for a group as defined in a `GROUP BY` or similar clause and returns a single result. The `SUM()` function is a good example: it accepts a column or expression, adds up all the values, and returns a single result. You can use an aggregate UDF in conjunction with the `GROUP BY` clause as well, and it will perform aggregate operations on each section of the data.
Suppose you are conducting security research and you find a large list of what appear to be credit card numbers. You want to determine whether these are valid credit card numbers and, if so, notify the appropriate banks.
A credit card number is not simply a random sequence of digits. Indeed, these numbers are quite specific and can be validated by a checksum known as the Luhn algorithm. Drill does not have the Luhn algorithm built in, but if you are analyzing log files that contain credit card numbers, you can write a UDF that filters the data so that you can quickly determine whether the numbers are valid.
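To make the algorithm concrete before we wrap it in a UDF, here is a standalone sketch of the Luhn check in plain Java. This is illustrative code, not part of the Drill UDF (the class and method names are our own); the UDF later in the chapter delegates this work to a library instead.

```java
public class LuhnCheck {

    // Returns true if the digit string passes the Luhn checksum.
    public static boolean isValid(String number) {
        int sum = 0;
        boolean doubleIt = false; // double every second digit, starting from the right
        for (int i = number.length() - 1; i >= 0; i--) {
            char c = number.charAt(i);
            if (!Character.isDigit(c)) {
                return false; // non-digit characters can never validate
            }
            int digit = c - '0';
            if (doubleIt) {
                digit *= 2;
                if (digit > 9) {
                    digit -= 9; // equivalent to summing the two digits of the product
                }
            }
            sum += digit;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("4532015112830366")); // true: passes the checksum
        System.out.println(isValid("4532015112830367")); // false: last digit altered
    }
}
```

Note that the checksum only proves the number is well formed; it says nothing about whether the account actually exists.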
Additionally, you can extract several other pieces of metadata from a credit card number, including the issuing bank and what kind of credit card it is (e.g., MasterCard or Visa). Using Drill, you can quickly write some UDFs to extract these pieces of metadata from the numbers. The query might look something like this:
```sql
SELECT `card_number`,
    issuing_bank_code(`card_number`) AS issuing_bank_code,
    card_type(`card_number`) AS card_type
FROM <data>
WHERE is_valid_credit_card(`card_number`)
```
Table 11-1 presents the results.
| card_number | issuing_bank_code | card_type |
|---|---|---|
| 4xxxxxxxxxxxxx | 012345 | Visa |
| 5xxxxxxxxxxxxx | 234567 | MasterCard |
When you have this data, you can write more advanced queries to aggregate and summarize the data, such as the following:
```sql
SELECT card_type(`card_number`) AS card_type,
    COUNT(*) AS card_count
FROM <data>
WHERE is_valid_credit_card(`card_number`)
GROUP BY card_type(`card_number`)
ORDER BY card_count DESC
```
This query counts the number of valid credit cards of each type. It will return something similar to that shown in Table 11-2.
| card_type | card_count |
|---|---|
| MasterCard | 315 |
| Visa | 212 |
| Discover | 200 |
| American Express | |
However, to perform this kind of analysis, you must write a few UDFs.
Nearly all modern programming languages employ the concept of a function—a section of code that takes input, performs some operation on it, and returns a value, as demonstrated here:
```
function addTwo(n) {
    result = n + 2
    return result
}

//Call the function
x = addTwo(4)
print(x)
>> 6
```
This pseudocode is an example of a simple function that takes a number as input, adds two to it, and then returns the result. The biggest conceptual difference between a regular function and a Drill UDF is that a Drill UDF is always executed against a column of data instead of a single value.1 Additionally, the code for Drill UDFs is generated inline in Drill’s own generated code.
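As a loose illustration (this is not Drill code, and the names are our own), the pseudocode above can be written in Java together with a helper that applies the function across a whole column of values, which is roughly how you can picture Drill invoking a simple UDF:

```java
import java.util.Arrays;

public class ColumnApply {

    // The scalar function from the pseudocode above.
    public static int addTwo(int n) {
        int result = n + 2;
        return result;
    }

    // Conceptually, Drill invokes a simple UDF once per row of the input
    // column and assembles the results into an output column.
    public static int[] applyToColumn(int[] column) {
        int[] result = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = addTwo(column[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(applyToColumn(new int[]{4, 10, -1}))); // [6, 12, 1]
    }
}
```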
Drill UDFs are implemented in Java, and if you are not a Java developer, some aspects of writing a Drill UDF can seem a little complex. However, once you get past the boilerplate code, you will find it is not much more difficult than writing a function in any other language.
Let’s take a look at the structure of a Drill UDF. In this section, we’ll walk through the process of creating a few simple UDFs.
In the GitHub repository for this book you will find a folder called udf_templates/simple, which contains all the files you will need to create your own simple Drill UDF. Our first UDF will validate credit card numbers using the Luhn algorithm described earlier.
In the root directory of udf_templates/simple you will find pom.xml, which contains instructions that Maven uses to build the UDF. There is a section at the top of the pom.xml file that looks like this:
```xml
<groupId>org.apache.drill.contrib.function</groupId>
<artifactId>udf_template</artifactId>
<version>1.0</version>
<name>Drill Function Template</name>
```
You can change the name of the project and the version here, although it does not affect the functionality of the function.
Although it is typically not needed in other projects, for Drill UDFs Maven must be configured to produce a source JAR file as well as the compiled JAR. To accomplish this, you must include the following in your pom.xml file:
```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-source-plugin</artifactId>
      <version>2.4</version>
      <executions>
        <execution>
          <id>attach-sources</id>
          <phase>package</phase>
          <goals>
            <goal>jar-no-fork</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.0</version>
      <configuration>
        <verbose>true</verbose>
        <compilerVersion>1.7</compilerVersion>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```
When you start building Drill UDFs, you will find that it can be very useful to include other libraries from Maven Central. In this example, there is a pre-existing library in Apache Commons (`commons-validator`) that performs all kinds of credit card validations, including the Luhn algorithm. To use this library in your Drill UDF, you need to add the dependency to the pom.xml file. Make sure that the version of Drill in the pom.xml file matches the version of Drill that you are using:
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-java-exec</artifactId>
    <version>1.14.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.6</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```
The actual functionality of the UDF is implemented in a file called UDFTemplate.java, which you can find in /src/main/org/apache/drill/contrib/function. This file contains a template for a simple UDF with a lot of the boilerplate code, but it is important to understand that Drill doesn’t directly execute this code. This code is a template that Drill uses to implement the method directly in Drill’s generated code—that’s why Drill also needs your function’s source code.
The first part of the UDF is the function template, which looks like the following snippet:
```java
@FunctionTemplate(
    name = "is_valid_credit_card",   //Your function name here
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public static class ValidCreditCardFunction implements DrillSimpleFunc {

    @Param
    VarCharHolder creditCardNumber;

    @Output
    BitHolder isValid;

    @Override
    public void setup() {
    }

    @Override
    public void eval() {
        //Your function logic here
    }
}
```
In the class that implements `DrillSimpleFunc`, there are three elements in your function template annotation that must be set for your UDF to work properly. The first, `name`, is the name that will be used to call your function in a Drill query. Drill function names may not be Drill reserved words, nor may they start with a digit. If you have multiple versions of the same function that accept different inputs, you may use the same name for multiple functions. For instance, suppose that you have a Drill function that converts pounds to kilograms. You might want one version of your function to accept an `INT` as an input, another to accept a `FLOAT`, and so on. (One note about data types: in Drill, JSON always generates `BIGINT` and `DOUBLE` types, so UDFs that work with numbers should at least provide these data types.)
If you would like to define aliases for your function, you can also do this, as shown here:

```java
names = {"name_1", "name_2"}
```
The `scope` variable specifies to Drill whether the function is a simple or aggregate function. For this example, it needs to be set to `SIMPLE`.
The `nulls` variable defines how the function will deal with null values. Setting this to `NULL_IF_NULL` configures Drill to skip null rows instead of having to handle them in the function body; in this case, query results will contain `NULL` for null rows.
After you’ve defined these variables, the next step is to actually begin writing the function body.
In most programming languages, the input parameters for a function are defined in the function declaration. For instance, in Java, you would define the input parameters to a function as follows:
```java
public int foo(int x, int y)
```
In this example, both the input parameters—`x` and `y`—as well as their data types are defined in the function definition.
Because of Drill's UDF templating system, Drill requires you to use holder objects to define input parameters. Table 11-3 lists some common holder objects.
| Data type | Drill holder object |
|---|---|
| INT | IntHolder |
| FLOAT | Float4Holder |
| DECIMAL9 | Decimal9Holder |
| DOUBLE | Float8Holder |
| BIGINT | BigIntHolder |
| VARCHAR | VarCharHolder |
| DATE | DateHolder |
| BOOLEAN | BitHolder |
| TIMESTAMP | TimeStampHolder |
| VARBINARY | VarBinaryHolder |
| TIME | TimeHolder |
For every input parameter you would like your function to accept, you must define a `@Param` with the appropriate holder object, in the order in which the parameters will appear when you call the function, as shown in the code snippet that follows. Drill defines both nullable and non-nullable holders; Table 11-3 shows the non-nullable holders. Use these if you have Drill do `NULL` handling:

```java
nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
```

If your function handles nulls itself, use the nullable holder, which has the same name but with "Nullable" prepended (for example, `NullableIntHolder`).
```java
@Param
VarCharHolder rawCardNumber;
```
In this example, the function would accept a single `VARCHAR` as an input parameter. This example has only one parameter, but if you wanted to have multiple parameters, you could just define them as follows:
```java
@Param
VarCharHolder inputTextA;

@Param
IntHolder integer_parameter;

@Param
Float8Holder float_parameter;
```
After you've defined the input parameters, you also must define the output value. In a conventional Java function, the return value is specified in the function body with the `return` keyword, and the data type is defined in the function declaration. For instance, if you wanted to define a function that returned an integer, the function might look something like this:
```java
public int foo(int x) {
    int result;
    // Do something
    return result;
}
```
In this example, the function accepts an integer as input and returns an integer as output. In a Drill UDF, you do not have a `return` statement; rather, you define a member variable, annotated with `@Output`, defined as a holder object of the appropriate data type. For example:
```java
@Output
BitHolder isValid;
```
Before you actually write the function, you need to know how to access the data that is contained in the holder objects, both for reading and writing. Numeric holder objects are the easiest; they have a property called `value`. To read or set the value of a numeric holder object, simply access or set the `value` property, as shown in this snippet:
```java
//Read integer holder
int myInt = intParam1.value;

//Set integer output
outputInteger.value = myInt;
```
`VARCHAR` and complex types (`MAP`, arrays) are more complicated to access and write. Drill provides a helper function to access the contents of a `VarCharHolder` object. The following code snippet demonstrates how to access text contained in a `VarCharHolder` named `rawCardNumber`:
```java
String creditCardNumber = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
    .toStringFromUTF8(rawCardNumber.start, rawCardNumber.end, rawCardNumber.buffer);
```
As shown in the previous example, if you reference any class other than those in `java.lang` or `java.util`, Drill requires you to include an absolute reference to the class. When Drill copies your UDF code into its own generated code, it does not copy imports; for the code to compile in Drill's generated code, you must instead use fully qualified class references.
Writing to a `VarCharHolder` for output is more complex as well, and requires you to convert the `String` you are placing in the holder into a byte array. You then need to assign a buffer to the `VarCharHolder`, define the start and end points for the `String`, and, finally, set the bytes in the buffer. The following snippet demonstrates how to assign a string named `outputValue` to the output `VarCharHolder` called `output`:
```java
byte[] outBytes = outputValue.getBytes(com.google.common.base.Charsets.UTF_8);
outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
outputBuf.setBytes(0, outBytes);
output.buffer = outputBuf;
output.start = 0;
output.end = outBytes.length;
```
This code demonstrates how to write a string as a UDF output. The first line converts the `String` `outputValue` into a byte array called `outBytes`. Next, we reallocate the buffer for the output `VarCharHolder` by calling the `reallocIfNeeded()` method. In the next line we call the `setBytes()` method, which writes the byte array to the output buffer. Finally, we set the start and end positions in the output buffer.
Now that you have defined your input parameters, the next step is to implement the `DrillSimpleFunc` interface, which consists of two void functions: `setup()` and `eval()`. In some instances, you might have variables or objects that you want to initialize at the beginning of the function's execution and reuse in every function call. If that is the case, you can initialize these variables in the `setup()` function. If you do not have any such variables, you can simply leave the `setup()` function blank, but you must include it because it is part of the `DrillSimpleFunc` interface.
The `eval()` function is where you actually implement the function's logic and where you assign a value to the holder for your output. The `eval()` function itself is a void function, meaning that it does not return anything.
The pseudocode for our example function might look something like this:
```
function is_valid_credit_card(credit_card_number) {
    return luhn(credit_card_number);
}
```
To put this all together, the following code demonstrates how to convert the simple function into a Drill UDF:
```java
@FunctionTemplate(
    name = "is_valid_credit_card",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public class IsValidCreditCardFunction implements DrillSimpleFunc {

    @Param
    VarCharHolder rawCardNumber;

    @Workspace
    org.apache.commons.validator.routines.CreditCardValidator ccv;

    @Output
    BitHolder out;

    public void setup() {
        ccv = new org.apache.commons.validator.routines.CreditCardValidator();
    }

    public void eval() {
        String creditCardNumber = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
            .toStringFromUTF8(rawCardNumber.start, rawCardNumber.end, rawCardNumber.buffer);

        if (ccv.isValid(creditCardNumber)) {
            out.value = 1;
        } else {
            out.value = 0;
        }
    }
}
```
In the preceding example, the variable `ccv` is the validator object that is used in every function iteration. To maximize performance, the UDF stores the validator object in the `@Workspace` and initializes it in the `setup()` function.
The previous UDF uses a `BitHolder` as the output value, where `1` is true and `0` is false. This UDF also has a dependency on the Apache Commons `commons-validator` package, which you need to include in the `dependencies` section of the pom.xml file, as shown in the following snippet:
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-java-exec</artifactId>
    <version>1.14.0</version>
  </dependency>
  <dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.6</version>
  </dependency>
</dependencies>
```
There is one additional file that you will need in order for Drill to recognize your UDF. In your function package, in the /src/main/resources folder, you will need to create a file called drill-module.conf, which contains the following:
```
drill.classpath.scanning.packages += "org.apache.drill.contrib.function"
```
You might need to adjust the package name to reflect your UDF.
After you have implemented your Drill UDF, the next step is to build and install it. Open a command-line prompt and navigate to the root folder of your UDF. You can then build your UDF with Maven by entering `mvn clean package -DskipTests`. If all goes according to plan, Maven will generate a JAR file of your UDF in the target/ folder of the UDF, as well as the JAR file containing the source code.
You can either install the UDF statically or dynamically. Regardless of which approach you take, there are three components that you need in order to install the UDF:
- The JAR file for the UDF. You can find this in the target/ folder after you build the UDF.
- The JAR file containing the source code of the UDF. Again, you can find this in the target/ folder after you build the UDF.
- The JAR files for any dependencies your UDF has. You can download these from Maven Central, or configure Maven via your pom.xml file to copy them into the target/ folder.
Let's look at static installation first. If you are using Drill in embedded mode in a nonshared environment, simply copy the aforementioned files into your Drill installation in the `$DRILL_HOME`/jars/3rdparty/ folder. If you are using Drill in a cluster, copy the JAR files into the jars folder in your site directory.2
As of Drill version 1.9, you can also register and unregister UDFs dynamically, without having to restart Drillbits. To do this, you need to first configure a few directories. Because these directories must be visible to all Drillbits, they must be in a distributed filesystem such as HDFS (the names below follow Drill's `drill.exec.udf.directory` configuration options):

- Staging: this is where you will copy the JAR files before they are registered.
- Tmp: this is a temporary directory for dynamic UDF JAR files. This directory is cleaned out when Drill is closed.
- Registry: this is where the JAR files are copied after they are registered.
- Local: this contains the backup JAR files.
You can set the locations of all of these directories in `$DRILL_HOME`/conf/drill-override.conf; however, usually you just set the base directory and let Drill define the UDF directories as subdirectories under it:
```
drill.exec.udf: {
  retry-attempts: 5,
  directory: {
    # Override this property if a custom file system should be used to create
    # remote directories instead of the default taken from the Hadoop configuration
    fs: "hdfs:///",
    # Set this property if a custom absolute root should be used for remote
    # directories
    root: "/app/drill"
  }
}
```
After configuring the filesystem and base directory, you are ready to dynamically register a UDF. First copy the JAR files to the staging directory. After that, you can register the UDF by running the following command:
```sql
CREATE FUNCTION USING JAR 'jar_file';
```
If you want to unregister a UDF, you can do so with the following command:
```sql
DROP FUNCTION USING JAR 'jar_file';
```
After you have installed your UDF, you can use it in queries just as you would any other Drill built-in function.
As you have seen in previous chapters, Drill supports two complex data types: `ARRAY` and `MAP`. These are analogous to arrays and key–value pairs in other languages and enable Drill to analyze and query complex, nested data. It can be useful to write Drill UDFs that return these complex data types.
Often, you will encounter data artifacts from which you can extract many different pieces of metadata. For instance, user agent strings are strings of text that are sent to a web server when a user requests a web page. These strings are very useful for web servers because they can determine what version of a website a user receives. They’re also useful for analyzing the type of traffic that is hitting a website. Although user agent strings are extremely useful, they are difficult to parse, and it is not really possible to simply split them by some delimiter. The example user agent string that follows illustrates why this is the case:
Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0
From this user agent string, it is possible to extract the following metadata:
```json
{
  "DeviceClass": "Desktop",
  "DeviceName": "Desktop",
  "DeviceBrand": "Unknown",
  "DeviceCpuBits": "32",
  "OperatingSystemClass": "Desktop",
  "OperatingSystemName": "Windows NT",
  "OperatingSystemVersion": "Windows XP",
  "OperatingSystemNameVersion": "Windows XP",
  "LayoutEngineClass": "Browser",
  "LayoutEngineName": "Gecko",
  "LayoutEngineVersion": "35.0",
  "LayoutEngineVersionMajor": "35",
  "LayoutEngineNameVersion": "Gecko 35.0",
  "LayoutEngineNameVersionMajor": "Gecko 35",
  "LayoutEngineBuild": "20100101",
  "AgentClass": "Browser",
  "AgentName": "Firefox",
  "AgentVersion": "35.0",
  "AgentVersionMajor": "35",
  "AgentNameVersion": "Firefox 35.0",
  "AgentNameVersionMajor": "Firefox 35"
}
```
Instead of writing a function to extract each piece of metadata individually, you might find it convenient to write a function that extracts all of this metadata at once and returns a map of these fields, from which you can then extract the individual fields of interest. To do that, you need to write a UDF that returns a map of all the fields.
The `ComplexWriter` object enables you to write maps and arrays and return them as output from your Drill UDF. The first step in writing a UDF that returns a complex result is setting the `ComplexWriter` as the output parameter, as shown in the snippet that follows:
```java
@Output
BaseWriter.ComplexWriter outWriter;
```
Fundamentally, functions that return complex results are no different from the simple UDFs that you have already written, with the exception of the output parameter. In the example here, we are setting the output parameter of the UDF to be a `ComplexWriter` called `outWriter`. The next step is to access either a `MapWriter` or `ListWriter` object—depending on whether you want to create a map or an array—from the `ComplexWriter` object by calling the `rootAsMap()` or `rootAsList()` function:
```java
public void eval() {
    MapWriter queryMapWriter = outWriter.rootAsMap();
    ...
}
```
In regular Drill UDFs, the output is contained in holder variables. In a UDF that returns a complex data type, the `MapWriter` created in the previous snippet3 will contain holder objects for the various fields that will populate the result. You can create a holder separately, or you can use the various writer functions to indirectly create a field, as shown in the following snippets:
```java
//Write a floating-point field
queryMapWriter.float4("field_name").writeFloat4(floating_point_value);
queryMapWriter.float8("field_name").writeFloat8(floating_point_value);

//Write an integer field
queryMapWriter.bigInt("field_name").writeBigInt(int_value);
queryMapWriter.integer("field_name").writeInt(int_value);

//Write a date field
queryMapWriter.date("field_name").writeDate(long_date);
```
Adding a string to a map is a little more complicated, but the essential process is the same. It is a little clearer if you create the holder object outside of the `MapWriter` object and then pass the `MapWriter` the holder. This code snippet demonstrates how to write a string to a `MAP`:
```java
VarCharHolder rowHolder = new VarCharHolder();
String field = <text>;

byte[] rowStringBytes = field.getBytes(com.google.common.base.Charsets.UTF_8);

outBuffer.reallocIfNeeded(rowStringBytes.length);
outBuffer.setBytes(0, rowStringBytes);

rowHolder.start = 0;
rowHolder.end = rowStringBytes.length;
rowHolder.buffer = outBuffer;

queryMapWriter.varChar("field_name").write(rowHolder);
```
Here is the complete code for the function to parse user agent strings:
```java
public void eval() {
    org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter queryMapWriter =
        outWriter.rootAsMap();

    String userAgentString = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
        .toStringFromUTF8(input.start, input.end, input.buffer);

    if (userAgentString.isEmpty() || userAgentString.equals("null")) {
        userAgentString = "";
    }

    nl.basjes.parse.useragent.UserAgent agent = uaa.parse(userAgentString);

    for (String fieldName : agent.getAvailableFieldNamesSorted()) {
        org.apache.drill.exec.expr.holders.VarCharHolder rowHolder =
            new org.apache.drill.exec.expr.holders.VarCharHolder();
        String field = agent.getValue(fieldName);

        byte[] rowStringBytes = field.getBytes(com.google.common.base.Charsets.UTF_8);

        outBuffer.reallocIfNeeded(rowStringBytes.length);
        outBuffer.setBytes(0, rowStringBytes);

        rowHolder.start = 0;
        rowHolder.end = rowStringBytes.length;
        rowHolder.buffer = outBuffer;

        queryMapWriter.varChar(fieldName).write(rowHolder);
    }
}
```
Now that you have written a simple UDF, you will find that writing an aggregate UDF is not much more difficult; however, before you begin to write one, you should understand that there are several significant limitations for custom aggregate functions:

- Drill aggregate functions cannot have complex variables in the `@Workspace`.
- You cannot use any kind of variable-length or complex variable in the `@Output` or `@Workspace`.
- All `@Workspace` variables must be contained in holders.
These are significant limitations and effectively restrict aggregate functions to numeric calculations. If you want to use any kind of nonnumeric holder in the `@Workspace`, you can use the `ObjectHolder`; however, this is a bit of a hack in that, at the time of writing, the `ObjectHolder` class has been deprecated. If your UDF uses the `ObjectHolder`, it will still work, but your queries will fail if your datasets are large enough that Drill's hash aggregate operator must spill to disk. One workaround is to disable the hash aggregate operator using the following query:4
```sql
ALTER SESSION SET `planner.enable_hashagg` = false
```
Much like simple Drill functions, aggregate functions perform operations on collections of data; however, instead of returning a result for every input row, aggregate functions iterate through all the data but return only a single result. You can use aggregate functions with the `GROUP BY` clause to segment the data into smaller groups, with one result per group.
The function declaration is slightly different from a simple Drill function in that the scope must be set to `POINT_AGGREGATE`. Additionally, aggregate functions must handle null values internally, and therefore the `nulls` variable must be set to `FunctionTemplate.NullHandling.INTERNAL`. Following is the complete function template:
```java
@FunctionTemplate(
    name = "kendall_correlation",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL
)
```
Writing an aggregate function is not significantly different from writing a simple Drill UDF; however, there are four additional functions that you must implement:

`setup()`
This function is executed once when the function is first called and is where you initialize any variables that will be used throughout the function iterations.

`add()`
This function processes every record and is roughly equivalent to the `eval()` function in a simple Drill UDF. This is where the actual logic of the function will be implemented.

`output()`
This function is where the final output will be calculated and output. This is executed at the end of every segment of data that the function encounters.

`reset()`
This function is similar to the `setup()` function and is called at the beginning of each new section of data or when a column changes from one data type to another. You should reset your counters or totals in this method.
To demonstrate an aggregate UDF, let's write a UDF that implements Kendall's rank correlation coefficient, otherwise known as Kendall's tau. Kendall's tau is a measurement of correlation between two vectors that ranges from 1, which indicates perfect correlation, to –1, which indicates perfect negative correlation.

Kendall's tau is calculated by comparing pairs of numbers in the two vectors: if both pairs of numbers share the same ordering relationship, they are considered a concordant pair, and if not, they are a discordant pair. The coefficient is then the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs, n(n – 1)/2.
To implement this as a Drill function, you need to first initialize a counter of concordant and discordant pairs, as well as a counter of the number of items. You initialize these variables in the `reset()` function. The `add()` function will compare the rows, make the determination as to whether the pair is concordant or discordant, and increment the appropriate counter as well as the row counter. Finally, the `output()` function will perform the calculation and output the result.
The code snippet that follows demonstrates how to write an aggregate UDF in Drill:
```java
@Param
Float8Holder xInput;

@Param
Float8Holder yInput;

@Workspace
Float8Holder prevXValue;

@Workspace
Float8Holder prevYValue;

@Workspace
IntHolder concordantPairs;

@Workspace
IntHolder discordantPairs;

@Workspace
IntHolder n;

@Output
Float8Holder tau;
```
In the first part of this UDF, we define the input parameters, the workspace variables, and the output variable. This function needs two input variables, `xInput` and `yInput`, which are both floating-point numbers; for these, the UDF uses `Float8Holder`s. Next, the UDF defines several `@Workspace` variables. The `prevXValue` and `prevYValue` variables hold the values from the previous iteration, and the `concordantPairs` and `discordantPairs` holders contain the running totals of pairs. Finally, `n` contains the number of pairs in the grouping. Notice that the pair holders and `n` are integer holders and the others are all double holders. The last parameter is the output parameter `tau`, which also uses a `Float8Holder`.
Now that the inputs and outputs are all defined, as well as several workspace variables, the next step is to implement the four functions of the aggregate function API. The first is the `add()` function, which performs the bulk of the calculations. This function is called on every row; for each row, it compares the current pair with the previous pair that it saw and increments the appropriate counter based on whether the pair is concordant or discordant:
```java
@Override
public void add() {
    double xValue = xInput.value;
    double yValue = yInput.value;

    if (n.value > 0) {
        if ((xValue > prevXValue.value && yValue > prevYValue.value) ||
            (xValue < prevXValue.value && yValue < prevYValue.value)) {
            concordantPairs.value = concordantPairs.value + 1;
            prevXValue.value = xInput.value;
            prevYValue.value = yInput.value;
            n.value = n.value + 1;
        } else if ((xValue > prevXValue.value && yValue < prevYValue.value) ||
                   (xValue < prevXValue.value && yValue > prevYValue.value)) {
            discordantPairs.value = discordantPairs.value + 1;
            prevXValue.value = xInput.value;
            prevYValue.value = yInput.value;
            n.value = n.value + 1;
        } else {
            //Tie...
            prevXValue.value = xInput.value;
            prevYValue.value = yInput.value;
        }
    } else if (n.value == 0) {
        prevXValue.value = xValue;
        prevYValue.value = yValue;
        n.value = 1;
    }
}
```
The next function in the API is the `setup()` function, which in this case is not needed and is left blank:
```java
@Override
public void setup() {
}
```
The `reset()` function is called before each group is processed. In this snippet, we use the `reset()` function to initialize all the variables and counters to zero:
```java
@Override
public void reset() {
    prevXValue.value = 0;
    prevYValue.value = 0;
    concordantPairs.value = 0;
    discordantPairs.value = 0;
    n.value = 0;
}
```
The last function in the API that we must implement is the `output()` function, which takes the difference between the pair counts, divides it by n(n – 1)/2, and sets the value of the output holder, `tau`:
```java
@Override
public void output() {
    double result = 0.0;
    result = (concordantPairs.value - discordantPairs.value) /
        (0.5 * n.value * (n.value - 1));
    tau.value = result;
}
```
One of Drill's greatest strengths is its extensibility and ease of customization. In this chapter, you learned how to perform more sophisticated analyses by extending Drill's functionality with your own simple and aggregate UDFs. In Chapter 12, you will learn how to further extend Drill by writing format plug-ins that enable Drill to read additional file types.
1 OK, this isn’t technically true. Drill functions are executed against one value at a time, but from a high level, it is helpful to think of a Drill UDF as operating against a column of data.
2 See Chapter 9 for instructions on setting up and using the site directory.
3 For spacing, the snippet omits the full package name; however, when writing UDFs you must include full package names (in this case, `org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter`).
4 For more information, check out the GitHub page.