Chapter 11. Writing Drill User-Defined Functions

In the previous chapters, you learned about Drill’s powerful analytic capabilities. There are many situations in which you might want to perform a transformation on some data and Drill simply does not have the capability readily at hand. However, it is quite possible to extend Drill’s capabilities by writing your own user-defined functions (UDFs).

Drill supports two different types of UDFs: simple and aggregate. A simple UDF accepts a column or expression as input and returns a single result for each row; the result can be a complex type such as an array or map. An aggregate UDF instead accepts all the values for a group, as defined in a GROUP BY or similar clause, and returns a single result per group. The SUM() function is a good example: it accepts a column or expression, adds up all the values, and returns a single result. Used together with a GROUP BY clause, an aggregate UDF performs its operation on each group of the data separately.

Use Case: Finding and Filtering Valid Credit Card Numbers

Suppose you are conducting security research and you find a large list of what appear to be credit card numbers. You want to determine whether these are valid credit card numbers and, if so, notify the appropriate banks.

A credit card number is not simply a random sequence of digits. Indeed, these numbers are quite specific and can be validated by a checksum known as the Luhn algorithm. Drill does not have the Luhn algorithm built in, but if you are analyzing log files that contain credit card numbers, it would be very useful to have; you can write a UDF that filters the data so that you can quickly determine whether the numbers are valid.
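To make the check concrete, here is a minimal, self-contained sketch of the Luhn algorithm in plain Java. This is for illustration only; the UDF developed later in this chapter delegates the check to the Apache commons-validator library rather than implementing it by hand:

public class LuhnCheck {

  public static boolean isValidLuhn(String number) {
    if (number == null || number.isEmpty()) {
      return false;
    }
    int sum = 0;
    boolean doubleDigit = false;

    // Walk the digits from right to left, doubling every second digit.
    for (int i = number.length() - 1; i >= 0; i--) {
      char c = number.charAt(i);
      if (!Character.isDigit(c)) {
        return false;
      }
      int digit = c - '0';
      if (doubleDigit) {
        digit *= 2;
        if (digit > 9) {
          digit -= 9;
        }
      }
      sum += digit;
      doubleDigit = !doubleDigit;
    }
    // A valid number's checksum is divisible by 10.
    return sum % 10 == 0;
  }

  public static void main(String[] args) {
    System.out.println(isValidLuhn("4111111111111111")); // true: a well-known test number
    System.out.println(isValidLuhn("4111111111111112")); // false: the checksum fails
  }
}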

Additionally, you can extract several other pieces of metadata from a credit card number, including the issuing bank and what kind of credit card it is (e.g., MasterCard or Visa). Using Drill, you can quickly write some UDFs to extract these pieces of metadata from the numbers. The query might look something like this:

SELECT `card_number`,
    issuing_bank_code(`card_number`) AS issuing_bank_code,
    card_type(`card_number`) AS card_type
FROM <data>
WHERE is_valid_credit_card(`card_number`) 

Table 11-1 presents the results.

Table 11-1. Query results with UDF
card_number      issuing_bank_code  card_type
4xxxxxxxxxxxxx   012345             Visa
5xxxxxxxxxxxxx   234567             MasterCard

When you have this data, you can write more advanced queries to aggregate and summarize the data, such as the following:

SELECT card_type(`card_number`) AS card_type,
    COUNT(*) AS card_count
FROM <data>
WHERE is_valid_credit_card(`card_number`) 
GROUP BY card_type(`card_number`)
ORDER BY card_count DESC

This query counts the number of valid credit cards of each type. It will return something similar to that shown in Table 11-2.

Table 11-2. Counts of valid credit cards by type
card_type          card_count
MasterCard         315
Visa               212
Discover           200
American Express

However, to perform this kind of analysis, you must write a few UDFs.

How User-Defined Functions Work in Drill

Nearly all modern programming languages employ the concept of a function—a section of code that takes input, performs some operation on it, and returns a value, as demonstrated here:

function addTwo(n) {
      result = n + 2
      return result
}

//Call the function
x = addTwo(4)
print(x)

>> 6

This pseudocode is an example of a simple function that takes a number as input, adds two to it, and then returns the result. The biggest conceptual difference between a regular function and a Drill UDF is that a Drill UDF is always executed against a column of data instead of a single value.1 Additionally, the code for a Drill UDF is copied inline into Drill’s own generated code rather than being called as an external function.

Structure of a Simple Drill UDF

Drill UDFs are implemented in Java, and if you are not a Java developer, some aspects of writing one can seem a little complex. However, once you get past the boilerplate code, you will find it is not much more difficult than writing a function in any other language.

Let’s take a look at the structure of a Drill UDF. In this section, we’ll walk through the process of creating a few simple UDFs.

In the GitHub repository for this book you will find a folder called udf_templates/simple, which contains all the files you will need to create your own simple Drill UDF. Our first UDF will validate credit card numbers using the Luhn algorithm described earlier.

The pom.xml File

In the root directory of udf_templates/simple you will find pom.xml, which contains instructions that Maven uses to build the UDF. There is a section at the top of the pom.xml file that looks like this:

<groupId>org.apache.drill.contrib.function</groupId>
<artifactId>udf_template</artifactId>
<version>1.0</version>
<name>Drill Function Template</name>

You can change the name of the project and the version here, although it does not affect the functionality of the function.

Don’t Forget the Source JAR!

Though typically not needed in other projects, for Drill UDFs Maven must be configured to produce a source JAR file in addition to the compiled JAR. To accomplish this, you must include the following in your pom.xml file:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-source-plugin</artifactId>
      <version>2.4</version>
      <executions>
        <execution>
          <id>attach-sources</id>
          <phase>package</phase>
          <goals>
            <goal>jar-no-fork</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>3.0</version>
      <configuration>
        <verbose>true</verbose>
        <compilerVersion>1.7</compilerVersion>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>

Including dependencies

When you start building Drill UDFs, you will find that it can be very useful to include other libraries from Maven Central. In this example, there is a pre-existing library in Apache Commons (commons-validator) that performs all kinds of credit card validations, including the Luhn algorithm. To use this library in your Drill UDF, you need to add the dependency to the pom.xml file. Make sure that the version of Drill in the pom.xml file matches the version of Drill that you are using:

<dependencies>
  <dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-java-exec</artifactId>
    <version>1.14.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.6</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

The Function File

The actual functionality of the UDF is implemented in a file called UDFTemplate.java, which you can find in /src/main/java/org/apache/drill/contrib/function. This file contains a template for a simple UDF with much of the boilerplate code already in place, but it is important to understand that Drill does not execute this code directly. Instead, Drill copies the code from this template into its own generated code, which is why Drill also needs your function’s source code.

The first part of the UDF is the function template, which looks like the following snippet:

@FunctionTemplate(
  name = "is_valid_credit_card",  //Your function name here 
  scope = FunctionTemplate.FunctionScope.SIMPLE, 
  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public static class ValidCreditCardFunction implements DrillSimpleFunc {

  @Param
  VarCharHolder creditCardNumber;

  @Output
  BitHolder isValid;

  @Override
  public void setup() {
  }

  @Override
  public void eval() {
	//Your function logic here
  }
}

In the @FunctionTemplate annotation on the class that implements DrillSimpleFunc, there are three elements that must be set for your UDF to work properly. The first, name, is the name that will be used to call your function in a Drill query. Drill function names may not be Drill reserved words, nor may they start with a digit. If you have multiple versions of the same function that accept different inputs, you may use the same name for all of them. For instance, suppose that you have a Drill function that converts pounds to kilograms; you might want one version of your function to accept an INT as input, another to accept a FLOAT, and so on. (One note about data types: Drill always reads JSON numbers as BIGINT and DOUBLE, so UDFs that work with numbers should at least provide versions for these two data types.)
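As a sketch of what such overloading looks like, the following two classes share the same function name but accept different input types. The function name pounds_to_kg and the class names are illustrative; the structure follows the template shown earlier in this chapter:

@FunctionTemplate(
  name = "pounds_to_kg",
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public static class PoundsToKgBigInt implements DrillSimpleFunc {

  @Param
  BigIntHolder pounds;     // integer input (JSON integers arrive as BIGINT)

  @Output
  Float8Holder kilograms;

  public void setup() {
  }

  public void eval() {
    kilograms.value = pounds.value * 0.45359237;
  }
}

@FunctionTemplate(
  name = "pounds_to_kg",   // same name, different parameter type
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public static class PoundsToKgFloat8 implements DrillSimpleFunc {

  @Param
  Float8Holder pounds;     // floating-point input (JSON decimals arrive as DOUBLE)

  @Output
  Float8Holder kilograms;

  public void setup() {
  }

  public void eval() {
    kilograms.value = pounds.value * 0.45359237;
  }
}

At query time, Drill picks the overload whose parameter types best match the arguments you pass in.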

If you would like to define aliases for your function, you can also do this as shown here:

names = {"name_1", "name_2"}

The scope variable specifies to Drill whether the function is a simple or aggregate function. For this example, this needs to be set to SIMPLE.

The nulls variable defines how the function will deal with null values. Setting this to NULL_IF_NULL tells Drill to return NULL for any row whose input is NULL without ever calling your function for that row, so you do not have to handle nulls in the function body.

After you’ve defined these variables, the next step is to actually begin writing the function body.

Defining input parameters

In most programming languages, the input parameters for a function are defined in the function declaration. For instance, in Java, you would define the input parameters to a function as follows:

public int foo(int x, int y)

In this example, both the input parameters—x and y—as well as their data types are defined in the function definition.

Due to Drill’s UDF templating system, Drill requires you to use holder objects to define input parameters. Table 11-3 lists the most common holders.

Table 11-3. Drill data types and their corresponding holder objects
Data type   Drill holder object
INT         IntHolder
FLOAT       Float4Holder
DECIMAL9    Decimal9Holder
DOUBLE      Float8Holder
BIGINT      BigIntHolder
VARCHAR     VarCharHolder
DATE        DateHolder
BOOLEAN     BitHolder
TIMESTAMP   TimeStampHolder
VARBINARY   VarBinaryHolder
TIME        TimeHolder

For every input parameter you would like your function to accept, you must define a @Param with the appropriate holder object, in the order in which the parameters will appear when the function is called, as shown in the following snippet:

@Param
VarCharHolder rawCardNumber;

In this example, the function accepts a single VARCHAR as an input parameter. Drill defines both nullable and non-nullable holders; Table 11-3 lists the non-nullable holders, which you should use when you let Drill handle NULL values for you:

nulls = FunctionTemplate.NullHandling.NULL_IF_NULL

If your function handles nulls itself, use the nullable holders, which have the same names with "Nullable" prepended (for example, NullableIntHolder).

This example has only one parameter, but if you wanted to have multiple parameters, you could define them as follows:

@Param
VarCharHolder inputTextA; 

@Param
IntHolder integer_parameter; 

@Param
Float8Holder float_parameter; 
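If you do handle nulls internally (nulls = FunctionTemplate.NullHandling.INTERNAL), the nullable holders expose an isSet field alongside value that you can test in eval(). Here is a minimal sketch with illustrative parameter and output names:

@Param
NullableIntHolder maybeNumber;

@Output
IntHolder out;

public void eval() {
  if (maybeNumber.isSet == 0) {
    // The input value was NULL; decide here what the function should return.
    out.value = 0;
  } else {
    out.value = maybeNumber.value + 2;
  }
}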

Setting the output value

After you’ve defined the input parameters, you also must define the output value. In a conventional Java function, the return value is specified in the function body with the return keyword and the data type is defined in the function declaration. For instance, if you wanted to define a function that returned an integer, the function might look something like this:

public int foo(int x) {
      int result;
      // Do something
      return result;
}

In this example, the function accepts an integer as input and returns an integer as output. In a Drill UDF, you do not have a return statement; rather, you define a member variable annotated with @Output, using a holder object of the appropriate data type. For example:

@Output
BitHolder isValid;

Accessing data in holder objects

Before you actually write the function, you need to know how to access the data that is contained in the holder objects, both for reading and writing. Numeric holder objects are the easiest; they have a property called value. To read or set the value of a numeric holder object, simply access or set the value property, as shown in this snippet:

//Read integer holder
int myInt = intParam1.value;

//Set integer output 
outputInteger.value = myInt;

VARCHAR and complex types (MAP and arrays) are more complicated to read and write. Drill provides a helper function to access the contents of a VarCharHolder object. The following code snippet demonstrates how to access text contained in a VarCharHolder named rawCardNumber:

String creditCardNumber = org.apache.drill.exec.expr.fn.impl.
    StringFunctionHelpers.toStringFromUTF8(rawCardNumber.start,
    rawCardNumber.end, rawCardNumber.buffer);

Drill Requires Absolute Class References in UDFs

As shown in the previous example, if you reference any class other than those in java.lang or java.util, Drill requires you to include an absolute (fully qualified) reference to the class. When Drill copies your UDF code into its own generated code, it does not copy import statements, so your code will compile only if every such class reference is fully qualified.

Writing to a VarCharHolder for output is also more involved: you must convert the String you are placing in the holder into a byte array, copy those bytes into a buffer, assign that buffer to the VarCharHolder, and set the holder’s start and end positions. The following snippet demonstrates how to assign a string named outputValue to the output VarCharHolder called output:

byte outBytes[] = outputValue.getBytes(com.google.common.base.Charsets.UTF_8);
outputBuf = outputBuf.reallocIfNeeded(outBytes.length); 
outputBuf.setBytes(0, outBytes); 
output.buffer = outputBuf; 
output.start = 0;
output.end = outBytes.length; 

This code demonstrates how to write a string as a UDF output. The first line converts the String outputValue into a byte array called outBytes. Next, we reallocate the buffer by calling the reallocIfNeeded() method and then call setBytes(), which copies the byte array into the buffer. Finally, we assign the buffer to the holder and set its start and end positions.
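One detail the preceding snippet takes for granted is where outputBuf comes from: it is not created inside the function. Drill UDFs that produce VARCHAR (or other variable-length) output obtain their buffer by injection. A minimal sketch of the declaration, using the same field name as the snippet above, looks like this:

@Inject
io.netty.buffer.DrillBuf outputBuf;  // buffer supplied by Drill at runtime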

The Simple Function API

Now that you have defined your input parameters, the next step is to implement the DrillSimpleFunc interface, which consists of two void functions: setup() and eval(). In some instances, you might have variables or objects that you want to initialize at the beginning of the function’s execution and reuse in every function call. If that is the case, you can initialize these variables in the setup() function. If you do not have any variables like this, you can simply leave the setup() function blank, but you must include it because it is part of the DrillSimpleFunc interface.

The eval() function is where you actually implement the function’s logic and where you assign a value to the holder for your output. The eval() function itself is a void function, meaning that it does not return anything.

The pseudocode for our example function might look something like this:

function is_valid_credit_card (credit_card_number) {
   return luhn(credit_card_number);
}

Putting It All Together

To put this all together, the following code demonstrates how to convert the simple function into a Drill UDF:

@FunctionTemplate(
  name = "is_valid_credit_card", 
  scope = FunctionTemplate.FunctionScope.SIMPLE, 
  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)

public class IsValidCreditCardFunction implements DrillSimpleFunc {

  @Param
  VarCharHolder rawCardNumber;

  @Workspace
  org.apache.commons.validator.routines.CreditCardValidator ccv;

  @Output
  BitHolder out;

  public void setup() {
    ccv = new org.apache.commons.validator.routines.CreditCardValidator();
  }
  
  public void eval() {
    String creditCardNumber = org.apache.drill.exec.expr.fn.impl 
         .StringFunctionHelpers 
         .toStringFromUTF8(rawCardNumber.start, 
                           rawCardNumber.end,  
                           rawCardNumber.buffer);
    if (ccv.isValid(creditCardNumber)) {
      out.value = 1;
    } else {
      out.value = 0;
    }
  }
}

Use the Workspace for Variables Carried Over Between Iterations

In the preceding example, the variable ccv is the validator object that is used in every function iteration. To maximize performance, the UDF stores the validator object in the @Workspace and initializes it in the setup() function.

The previous UDF uses a BitHolder as the output value, where 1 is true and 0 is false. This UDF also has a dependency on the Apache commons-validator package, which you need to include in the dependencies section of the pom.xml file, as shown in the following snippet:

<dependencies>
  <dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-java-exec</artifactId>
    <version>1.14.0</version>
  </dependency>
  <dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.6</version>
  </dependency>
</dependencies>

There is one additional file that you will need in order for Drill to recognize your UDF. In your UDF project’s /src/main/resources folder, create a file called drill-module.conf containing the following:

drill.classpath.scanning.packages += "org.apache.drill.contrib.function"

You might need to adjust the package name to reflect your UDF.

Building and Installing Your UDF

After you have implemented your Drill UDF, the next step is to build and install it. Open a command-line prompt, navigate to the root folder of your UDF, and build it with Maven by entering mvn clean package -DskipTests. If all goes according to plan, Maven will generate the UDF’s JAR file in the target/ folder, along with the JAR file containing the source code.

You can either install the UDF statically or dynamically. Regardless of which approach you take, there are three components that you need in order to install the UDF:

  • The JAR file for the UDF. You can find this in the target/ folder after you build the UDF.

  • The JAR file containing the source code of the UDF. Again, you can find this in the target/ folder after you build the UDF.

  • The JAR files for any dependencies your UDF has. You can download these from Maven Central or configure Maven via your pom.xml file to copy them into the target/ folder.

Statically Installing a UDF

Let’s look at static installation first. If you are using Drill in embedded mode in a nonshared environment, simply copy the aforementioned files into your Drill installation in the $DRILL_HOME/jars/3rdparty/ folder. If you are using Drill in a cluster, copy the JAR files into the jars folder in your site directory.2

Dynamically Installing a UDF

As of Drill version 1.9, you can also register and unregister UDFs dynamically without having to restart Drillbits. To do this, you need to first configure a few directories. Because these directories must be visible to all Drillbits, they must be in a distributed filesystem such as HDFS:

Staging directory

This is where you will copy the JAR files before they are registered.

Local directory

This is a temporary directory for dynamic UDF JAR files. This directory is cleaned out when Drill is closed.

Registry directory

This is where the JAR files are copied after they are registered.

tmp directory

This contains the backup JAR files.

You can set the locations of all of these directories in $DRILL_HOME/conf/drill-override.conf; however, you usually just set the base directory and let Drill create the UDF directories as subdirectories beneath it:

drill.exec.udf: {
  retry-attempts: 5,
  directory: {
    # Override this property if a custom file system should be used to create
    # remote directories instead of the default taken from the Hadoop configuration
    fs: "hdfs:///",
    # Set this property if a custom absolute root should be used for remote
    # directories
    root: "/app/drill"
  }
}

After configuring the filesystem and base directory, you are ready to dynamically register a UDF. First copy the JAR files to the staging directory. After that, you can register the UDF by running the following command:

CREATE FUNCTION USING JAR 'jar_file';

If you want to unregister a UDF, you can do so with the following command:

DROP FUNCTION USING JAR 'jar_file';

After you have installed your UDF, you can use it in queries just as you would any other Drill built-in function.

Complex Functions: UDFs That Return Maps or Arrays

As you have seen in previous chapters, Drill supports two complex data types: arrays and maps. These are analogous to arrays and key–value pairs in other languages and enable Drill to analyze and query complex, nested data. It can also be useful to write Drill UDFs that return these complex data types.

Example: Extracting User Agent Metadata

Often, you will encounter data artifacts from which you can extract many different pieces of metadata. For instance, user agent strings are strings of text that are sent to a web server when a user requests a web page. These strings are very useful because web servers use them to determine which version of a website a user receives, and they are also useful for analyzing the type of traffic that is hitting a website. Although user agent strings are extremely useful, they are difficult to parse, and it is not really possible to simply split them by some delimiter. The example user agent string that follows illustrates why this is the case:

Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0

From this user agent string, it is possible to extract the following metadata:

{
"DeviceClass":"Desktop",
"DeviceName":"Desktop",
"DeviceBrand":"Unknown",
"DeviceCpuBits":"32",
"OperatingSystemClass":"Desktop",
"OperatingSystemName":"Windows NT",
"OperatingSystemVersion":"Windows XP",
"OperatingSystemNameVersion":"Windows XP",
"LayoutEngineClass":"Browser",
"LayoutEngineName":"Gecko",
"LayoutEngineVersion":"35.0",
"LayoutEngineVersionMajor":"35",
"LayoutEngineNameVersion":"Gecko 35.0",
"LayoutEngineNameVersionMajor":"Gecko 35",
"LayoutEngineBuild":"20100101",
"AgentClass":"Browser",
"AgentName":"Firefox",
"AgentVersion":"35.0",
"AgentVersionMajor":"35",
"AgentNameVersion":"Firefox 35.0",
"AgentNameVersionMajor":"Firefox 35"
}

Instead of writing a function to extract each piece of metadata individually, you might find it convenient to write a function that extracts all of this metadata at once and returns it as a map of fields, from which you can then pull the individual fields of interest. To do that, you need to write a UDF that returns a map of all the fields.

The ComplexWriter

The ComplexWriter object enables you to write maps and arrays and return them as output from your Drill UDF. The first step in writing a UDF that returns a complex result is setting the ComplexWriter as the output parameter, as shown in the snippet that follows:

@Output
BaseWriter.ComplexWriter outWriter;

Fundamentally, functions that return complex results are no different from the simple UDFs that you have already written, with the exception of the output parameter. In the example here, we are setting the output parameter of the UDF to be a ComplexWriter called outWriter. The next step is to access either a MapWriter or ListWriter object—depending on whether you want to create a map or an array—from the ComplexWriter object by calling the rootAsMap() or rootAsList() function:

public void eval() {
  MapWriter queryMapWriter = outWriter.rootAsMap();
  ...
}

In regular Drill UDFs, the output is contained in holder variables. In a UDF that returns a complex data type, the MapWriter created in the previous snippet3 will contain holder objects for the various fields that will populate the result. You can create a holder separately or you can use the various writer functions to indirectly create a field, as shown in the following snippets:

//Write a floating-point field
queryMapWriter.float4("field_name")
            .writeFloat4(floating_point_value);
queryMapWriter.float8("field_name")
            .writeFloat8(floating_point_value);

//Write an integer field
queryMapWriter.bigInt("field_name").writeBigInt(long_value);
queryMapWriter.integer("field_name").writeInt(int_value);

//Write a date field
queryMapWriter.date("field_name").writeDate(long_date);

Adding a string to a map is a little more complicated, but the essential process is the same. It is clearer if you create the holder object outside of the MapWriter, populate it, and then pass the holder to the MapWriter. This code snippet demonstrates how to write a string to a MAP:

VarCharHolder rowHolder = new VarCharHolder();
String field = <text>;

byte[] rowStringBytes = field.getBytes(com.google.common.base.Charsets.UTF_8);
outBuffer.reallocIfNeeded(rowStringBytes.length);
outBuffer.setBytes(0, rowStringBytes);

rowHolder.start = 0;
rowHolder.end = rowStringBytes.length;
rowHolder.buffer = outBuffer;

queryMapWriter.varChar("field_name").write(rowHolder);

Here is the complete code for the function to parse user agent strings:

public void eval() {
  org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter
      queryMapWriter = outWriter.rootAsMap();

  String userAgentString = org.apache.drill.exec.expr.fn.impl
      .StringFunctionHelpers
      .toStringFromUTF8(input.start, input.end, input.buffer);

  if (userAgentString.isEmpty() || userAgentString.equals("null")) {
    userAgentString = "";
  }

  nl.basjes.parse.useragent.UserAgent agent = uaa.parse(userAgentString);

  for (String fieldName : agent.getAvailableFieldNamesSorted()) {
    org.apache.drill.exec.expr.holders.VarCharHolder rowHolder =
        new org.apache.drill.exec.expr.holders.VarCharHolder();
    String field = agent.getValue(fieldName);

    byte[] rowStringBytes = field.getBytes(com.google.common.base.Charsets.UTF_8);
    outBuffer.reallocIfNeeded(rowStringBytes.length);
    outBuffer.setBytes(0, rowStringBytes);

    rowHolder.start = 0;
    rowHolder.end = rowStringBytes.length;
    rowHolder.buffer = outBuffer;

    queryMapWriter.varChar(fieldName).write(rowHolder);
  }
}
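The eval() method above refers to three members that are declared elsewhere in the class: the input holder (input), the injected output buffer (outBuffer), and the analyzer stored in the workspace (uaa). The following is a sketch of those declarations and of setup(); the exact way the analyzer is constructed depends on the version of the yauaa library you use:

@Param
VarCharHolder input;

@Workspace
nl.basjes.parse.useragent.UserAgentAnalyzer uaa;

@Output
BaseWriter.ComplexWriter outWriter;

@Inject
io.netty.buffer.DrillBuf outBuffer;

public void setup() {
  // Build the analyzer once in setup() and reuse it for every row
  // (construction is expensive; builder details vary by yauaa version).
  uaa = nl.basjes.parse.useragent.UserAgentAnalyzer.newBuilder().build();
}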

Writing Aggregate User-Defined Functions

Now that you have written a simple UDF, you will find that writing an aggregate UDF is not much more difficult; however, before you begin to write one, you should understand that there are several significant limitations for custom aggregate functions:

  • Drill aggregate functions cannot have complex variables in the @Workspace.

  • You cannot use any kind of variable-length or complex variable in the @Output or @Workspace.

  • All @Workspace variables must be contained in holders.

These are significant limitations and effectively restrict aggregate functions to numeric calculations. If you want to use any kind of nonnumeric holder in the @Workspace, you can use the ObjectHolder; however, this is a bit of a hack in that at the time of writing, the ObjectHolder class has been deprecated. If your UDF uses the ObjectHolder, it will still work, but your queries will fail if your datasets are large enough that Drill’s hash aggregate operator must spill to disk. One workaround is to disable the hash aggregate operator using the following query:4

ALTER SESSION SET `planner.enable_hashagg` = false

Much like simple Drill functions, aggregate functions operate on columns of data; however, instead of returning a result for every row, an aggregate function iterates through all the rows and returns a single result. You can combine aggregate functions with the GROUP BY clause to return one result per group.

The function declaration is slightly different from a simple Drill function in that the scope must be set to POINT_AGGREGATE. Additionally, aggregate functions must handle null values internally and therefore the nulls variable must be set to FunctionTemplate.NullHandling.INTERNAL. Following is the complete function template:

@FunctionTemplate(
    name = "kendall_correlation",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL
)

The Aggregate Function API

Writing an aggregate function is not significantly different from writing a simple Drill UDF; however, there are four functions that you must implement:

setup()

This function is executed once when the function is first called and is where you initialize any variables that will be used throughout the function iterations.

add()

This function processes every record and is roughly equivalent to the eval() function in a simple Drill UDF. This is where the actual logic of the function will be implemented.

output()

This function is where the final output will be calculated and output. This is executed at the end of every segment of data that the function encounters.

reset()

This function is similar to the setup() function and is called at the beginning of each new section of data or when a column changes from one data type to another. You should reset your counters or totals in this method.
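Putting these together, the skeleton of an aggregate UDF class looks roughly like the following sketch. The class name is illustrative, and the @FunctionTemplate annotation shown in the previous section goes directly above the class declaration; the full Kendall's tau example appears later in this section:

public static class KendallCorrelation implements DrillAggFunc {

  @Override
  public void setup() {
    // One-time initialization, before any rows are processed.
  }

  @Override
  public void add() {
    // Called once per row: update the workspace counters here.
  }

  @Override
  public void output() {
    // Called once per group: compute the final value and set the output holder.
  }

  @Override
  public void reset() {
    // Called before each new group: reinitialize the workspace counters.
  }
}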

Example Aggregate UDF: Kendall’s Rank Correlation Coefficient

To demonstrate an aggregate UDF, let’s write a UDF that implements the Kendall’s rank correlation coefficient, otherwise known as Kendall’s tau. Kendall’s tau is a measurement of correlation between two vectors that ranges from 1, which indicates perfect correlation, to –1, which indicates perfect negative correlation.

Kendall’s tau is calculated by comparing pairs of observations in the two vectors: if the ordering of the two values agrees in both vectors (both increase or both decrease), the pair is concordant; if the orderings disagree, the pair is discordant:

τ = ((number of concordant pairs) − (number of discordant pairs)) / (n(n − 1) / 2)
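For example, if x = (1, 2, 3) and y = (1, 3, 2), two of the three possible pairs are concordant and one is discordant, so τ = (2 − 1) / (3 × 2 / 2) = 1/3.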

To implement this as a Drill aggregate function, you first need counters for concordant pairs, discordant pairs, and the number of items; you initialize these variables in the reset() function. The add() function compares each row with the previous one, determines whether the pair is concordant or discordant, and increments the appropriate counter as well as the row counter. Finally, the output() function performs the calculation and outputs the result.

The code snippet that follows demonstrates how to write an aggregate UDF in Drill:

@Param
Float8Holder xInput;

@Param
Float8Holder yInput;

@Workspace
Float8Holder prevXValue;

@Workspace
Float8Holder prevYValue;

@Workspace
IntHolder concordantPairs;

@Workspace
IntHolder discordantPairs;

@Workspace
IntHolder n;

@Output
Float8Holder tau;

In the first part of this UDF, we define the input parameters, the workspace variables, and the output variable. This function needs two input variables, xInput and yInput, which are both floating-point numbers. For these, the UDF uses Float8Holders. Next, the UDF defines several @Workspace variables. The prevXValue and prevYValue variables hold the values from the previous iteration, and the concordantPairs and discordantPairs holders contain the running totals of pairs. Finally, n contains the number of pairs in the grouping. Notice that the pair holders and n are integers and the others are all double holders. The last parameter is the output parameter tau, which also uses a Float8Holder.

Now that the inputs, outputs, and workspace variables are all defined, the next step is to implement the four functions of the aggregate function API. The first is the add() function, which performs the bulk of the calculations. It is called on every row; it compares the current pair with the previous pair it saw and increments the appropriate counter based on whether the pair is concordant or discordant:

@Override
public void add() {
  double xValue = xInput.value;
  double yValue = yInput.value;

  if (n.value > 0) {
    if ((xValue > prevXValue.value 
         && yValue > prevYValue.value) 
       || (xValue < prevXValue.value 
         && yValue < prevYValue.value)) {
      concordantPairs.value = concordantPairs.value + 1;
      prevXValue.value = xInput.value;
      prevYValue.value = yInput.value;
      n.value = n.value + 1;
    } else if ((xValue > prevXValue.value 
               && yValue < prevYValue.value) 
              || (xValue < prevXValue.value 
                 && yValue > prevYValue.value)) {
      discordantPairs.value = discordantPairs.value + 1;
      prevXValue.value = xInput.value;
      prevYValue.value = yInput.value;
      n.value = n.value + 1;
    } else {
      //Tie...
      prevXValue.value = xInput.value;
      prevYValue.value = yInput.value;
    }

  } else if (n.value == 0) {
    prevXValue.value = xValue;
    prevYValue.value = yValue;
    n.value = 1;
  }
}

The next function in the API is the setup() function, which in this case is not needed and is left blank:

@Override
public void setup() {
}

The reset() function is called before each group’s execution. In this snippet, we use the reset() function to initialize all the variables and counters to zero:

@Override
public void reset() {
  prevXValue.value = 0;
  prevYValue.value = 0;
  concordantPairs.value = 0;
  discordantPairs.value = 0;
  n.value = 0;
}

The last function in the API that we must implement is the output() function, which calculates the difference between the pair counts, divides it by n * (n - 1) / 2, and sets the value of the output holder, tau:

@Override
public void output() {
  double result = 0.0;
  result = (concordantPairs.value - discordantPairs.value) / 
      (0.5 * n.value * (n.value - 1));
  tau.value = result;
}

Conclusion

One of Drill’s greatest strengths is its extensibility and ease of customization. In this chapter, you learned how to perform more sophisticated analyses by extending Drill’s functionality with your own simple and aggregate UDFs. In Chapter 12, you will learn how to further extend Drill’s functionality by writing format plug-ins that enable Drill to read additional file types.

1 OK, this isn’t technically true. Drill functions are executed against one value at a time, but from a high level, it is helpful to think of a Drill UDF as operating against a column of data.

2 See Chapter 9 for instructions on setting up and using the site directory.

3 For spacing, the snippet omits the full package name; however, when writing UDFs you must include the full package names (in this case, org.apache.drill.exec.vector.complex.writer.BaseWriter.MapWriter).

4 For more information, check out the GitHub page.
