As described in Chapter 8, Apache Drill uses storage and format plug-ins to read data. The storage plug-in connects to a storage system such as Kafka, a database, or a distributed filesystem. The DFS interface is based on the HDFS client libraries and can obtain data from HDFS, Amazon S3, MapR, and so on.
A distributed filesystem contains a wide variety of files (Parquet, CSV, JSON, and so on). The dfs storage plug-in uses format plug-ins to read data from these files. In this chapter, we explore how to create custom format plug-ins for file formats that Drill does not yet support.
Format plug-ins integrate tightly with Drill’s internal mechanisms for configuration, memory allocation, column projection, filename resolution, and data representation. Writing plug-ins is therefore an “advanced” task that requires Java experience, patience, frequent consultation of existing code, and posting questions on the “dev” mailing list.
Drill provides two ways to structure your plug-in. Here we focus on the “Easy” format plug-in, useful for most file formats, that handles much of the boilerplate for you. It is also possible to write a plug-in without the Easy framework, but it is unlikely you will need to do so.
As an example, we’re going to create a format plug-in for any text file format that can be described as a regular expression, or regex. The regex defines how to parse columns from an input record and is defined as part of the format configuration. The query can then select all or a subset of the columns. (This plug-in is a simplified version of the one that was added to Drill 1.14.)
You can find the full code for this plug-in in the GitHub repository for this book, in the format-plugin directory.
The plug-in configuration defines the file format using three pieces of information:
The file extension used to identify the file
The regex pattern used to match each line
The list of column names that correspond to the patterns
For example, consider Drill’s own log file, drillbit.log, which contains entries like this:
2017-12-21 10:52:42,045 [main] ERROR o.apache.drill.exec.server.Drillbit - Failure during initial startup of Drillbit.
(In the actual file, all the above text is on one line.) This format does not match any of Drill’s built-in readers, but is typical of the ad hoc format produced by applications. Rather than build a plug-in for each such format, we will build a generalized plug-in using regexes. (Although regular expressions might be slow in production, they save a huge amount of development time for occasional work.)
For Drill’s format we might define our regex as follows:
(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d),\d+ \[([^]]*)] (\w+)\s+(\S+) - (.*)
In the printed book, this example may be wrapped across two lines due to physical size restrictions; however, it must be on a single line in the configuration.
We define our fields as follows:
year, month, day, hour, minute, second, thread, level, module, message
Drill logs often contain multiline messages. To keep things simple, we’ll simply ignore input lines that do not match the pattern.
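Before wiring this into Drill, it helps to see the parsing logic in isolation. The following is a minimal, self-contained sketch (the class name is ours) that applies the regex above to the sample log line, maps each capture group to its configured field name, and skips lines that do not match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParseSketch {
  // The same pattern as in the format configuration, with Java escaping.
  static final Pattern LOG_PATTERN = Pattern.compile(
      "(\\d{4})-(\\d\\d)-(\\d\\d) (\\d\\d):(\\d\\d):(\\d\\d),\\d+ " +
      "\\[([^]]*)] (\\w+)\\s+(\\S+) - (.*)");

  public static void main(String[] args) {
    String line = "2017-12-21 10:52:42,045 [main] ERROR "
        + "o.apache.drill.exec.server.Drillbit "
        + "- Failure during initial startup of Drillbit.";
    Matcher m = LOG_PATTERN.matcher(line);
    if (!m.matches()) {
      // Non-matching lines (such as multiline continuations) are skipped.
      return;
    }
    // Capture group i corresponds to field i in the configured field list.
    String[] fields = {"year", "month", "day", "hour", "minute",
        "second", "thread", "level", "module", "message"};
    for (int i = 0; i < fields.length; i++) {
      System.out.println(fields[i] + "=" + m.group(i + 1));
    }
  }
}
```

Running this prints one `name=value` pair per field; the real plug-in will do the same mapping, but into value vectors rather than to the console.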
To create a file format plug-in using the Easy framework, you must implement a number of code features:
Plan-time hints
Hints for the filename resolution mechanism
Scan operator creation
Projection resolution, including creating null columns
Writing to value vectors
Managing input files
Managing file chunks
It can be tricky to get all the parts to work correctly. The secret is to find a good existing example and then adapt that for your use. This example is designed to provide most of the basic functionality you will need. We point to other good examples as we proceed.
Drill format plug-ins are difficult to write outside of the Drill source tree. Instead, a format plug-in is generally created as a subproject within Drill itself. Doing so allows rapid development and debugging. Later, we’ll move the plug-in to its own project.
You begin by creating a local copy of the Drill source project, as described in Chapter 10. Then, create a new Git branch for your code:
cd /path/to/drill/source
git checkout -b regex-plugin
The example code on GitHub contains just the regex plug-in code; if you want to follow along without typing in the code, you’ll need to drop this into an existing Drill project to build it.
Iterative development is vital in a system as complex as Drill. Let’s begin with a simple single-field regex pattern so that we can focus on the required plumbing. Our goal is to create a starter plug-in that does the following:
Reads only text columns
Handles both wildcard (SELECT *) and explicit queries (SELECT a, b, c)
Loads data into vectors
This provides a foundation on which to build more advanced features.
Chapter 11 explained how to create a Maven pom.xml file for a UDF. The process for the format plug-in is similar.
Custom plug-ins should reside in the contrib module. Examples include Hive, Kafka, Kudu, and others. Format plug-in names typically start with "format-". This example is called format-regex.
Create a new directory called drill/contrib/format-regex to hold your code. Then, add a pom.xml file in this directory:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.drill.contrib</groupId>
    <artifactId>drill-contrib-parent</artifactId>
    <version>1.15.0-SNAPSHOT</version>
  </parent>
  <artifactId>drill-format-regex</artifactId>
  <name>contrib/regex-format-plugin</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- Test dependencies -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
Adjust the Drill version to match the version you are using.
The first two dependencies provide the code you need to implement your format plug-in. The second two provide test classes you will need for unit tests.
Next you must tell Drill to include this plug-in in the Drill build. Add the following to the <modules> section of the pom.xml file in the parent drill/contrib directory:

<module>format-regex</module>
Each Drill module, including this plug-in, creates a Java JAR file. You must tell Drill to copy the JAR file into the final product directory. In drill/distribution/src/assemble/bin.xml, find the following block of code and add the last <include> line shown here:

<dependencySet>
  <includes>
    <include>org.apache.drill.exec:drill-jdbc:jar</include>
    <include>org.apache.drill:drill-protocol:jar</include>
    ...
    <include>org.apache.drill.contrib:format-regex</include>
  </includes>
Change format-regex to the name of your plug-in.
Finally, you must add a dependency in the distribution pom.xml file to force Maven to build your plug-in before the distribution module.
Add the following just before the </dependencies> tag in the pom.xml file:
<dependency>
  <groupId>org.apache.drill.contrib</groupId>
  <artifactId>drill-format-regex</artifactId>
  <version>${project.version}</version>
</dependency>
If you create an Easy file format extension, as shown here, your code must reside in the core java-exec package. (The Easy plug-ins rely on dependencies that are visible only when your code resides within Drill's java-exec package.) For an Easy format plug-in, your code should live in the org.apache.drill.exec.store.easy package. We'll use org.apache.drill.exec.store.easy.regex. You can create that package in your IDE.
Drill uses the Lightbend (originally Typesafe) HOCON configuration system, via the Typesafe library, to express design-time configuration. The configuration is in the src/main/resources/drill-module.conf file in your plug-in directory; in this case, contrib/format-regex/src/main/resources/drill-module.conf:
drill: {
  classpath.scanning: {
    packages += "org.apache.drill.exec.store.easy.regex"
  }
}
Drill terminology can be a bit complex given that the same terms are used for multiple things. So, let's begin by defining three common items:

Format plug-in class
The actual class that implements your format plug-in. (This name is often used for the JSON configuration, as you'll see shortly.)

Format plug-in configuration class
A Jackson serialized class that stores the JSON configuration for a plug-in. (Often just called the "format config.")

Format plug-in configuration
The JSON configuration for a plug-in as well as the Java class that stores that configuration. (Abbreviated to "format configuration" elsewhere in this book.)

The confusion is that "format plug-in" is used for both the class and the JSON configuration. Similarly, the term "format config" is used for both the JSON configuration and the Java class that stores that configuration. To avoid confusion, we will use the terms we just listed.
Before we begin making changes, let’s make sure we don’t break anything important. When you edit storage or format configurations using the Drill Web Console, Drill stores those configurations in ZooKeeper. If you make a mistake while creating the configuration (such as a bug in the code), your existing dfs
configuration will be permanently lost. If you have configurations of value, make a copy of your existing configuration before proceeding!
The easiest way to proceed is to work on your laptop, assuming that anything stored in ZooKeeper is disposable.
Also, it turns out that Drill provides no versioning of format plug-in configurations. Each time you make a breaking change, Drill will fail to start because it will be unable to read any configurations stored in ZooKeeper using the old schema. (We discuss later how to recover.) So, a general rule is that you can change the format configuration schema during development, but after the plug-in is in production, the configuration schema is pretty much frozen for all time.
Let's begin by creating the format plug-in configuration class. A good reference example is TextFormatPlugin.TextFormatConfig.
You must implement a number of items:
This is a Jackson serialized class, so give it a (unique) name using the @JsonTypeName annotation.
Implement the equals() and hashCode() methods.
You need three properties:
The regular expression
A list of properties
The file extension that this plug-in configuration supports
Jackson handles all the boilerplate for you if you choose good names, make the members public, and follow JavaBeans naming conventions. So, go ahead and do so.
The resulting format plug-in configuration class looks like this:
package org.apache.drill.exec.store.easy.regex;

import java.util.Arrays;
import java.util.Objects;

import org.apache.drill.common.logical.FormatPluginConfig;

import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonInclude.Include;
import com.fasterxml.jackson.annotation.JsonTypeName;

@JsonTypeName("regex")
@JsonInclude(Include.NON_DEFAULT)
public class RegexFormatConfig implements FormatPluginConfig {

  public String regex;
  public String fields;
  public String extension;

  public String getRegex() { return regex; }
  public String getFields() { return fields; }
  public String getExtension() { return extension; }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    final RegexFormatConfig other = (RegexFormatConfig) obj;
    return Objects.equals(regex, other.regex)
        && Objects.equals(fields, other.fields)
        && Objects.equals(extension, other.extension);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] {regex, fields, extension});
  }
}
This class might strike you as a bit odd. We would prefer to store the list of fields as a Java list. We'd also prefer to make the class immutable, with private final fields and a constructor that takes all of the field values. (Jackson provides the @JsonProperty annotation to help.)
This regex plug-in is a perfect candidate to use Drill’s table functions, as you’ll see in the tests. However, as of this writing, the table function code has a number of limitations that influence this class:
DRILL-6169 specifies that we can't use Java lists.
DRILL-6672 specifies that we can't use setters (for example, setExtension()).
DRILL-6673 specifies that we can't use a nondefault constructor.
Until these issues are fixed, making your fields public is the only available solution.
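Because the fields property must therefore be a plain comma-delimited String, the reader will later need to split it into individual column names. A minimal sketch of that split (the class and method names here are ours, not Drill's):

```java
import java.util.Arrays;
import java.util.List;

public class FieldListSketch {
  // Split the comma-delimited "fields" property into column names,
  // trimming whitespace around each name.
  public static List<String> parseFields(String fields) {
    return Arrays.asList(fields.trim().split("\\s*,\\s*"));
  }

  public static void main(String[] args) {
    System.out.println(parseFields("year, month, day"));
    // → [year, month, day]
  }
}
```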
To compile your code with Drill, you must include the Apache license agreement at the top of your file: simply copy one from another file. If your plug-in is proprietary, you can modify the Maven pom.xml file to remove the “rat” check for your files.
The code format is the preferred Drill formatting described in Chapter 10.
You now have enough that you can test the configuration, so go ahead and build Drill (assuming that it is located in ~/drill):
cd ~/drill
mvn clean install
For convenience, create an environment variable for the distribution directory:
export DRILL_HOME=~/drill/distribution/target/apache-drill-1.xx.0-SNAPSHOT/apache-drill-1.xx.0-SNAPSHOT
Replace xx with the current release number, such as 14.
Start Drill:
cd $DRILL_HOME/bin
Then:
./drillbit.sh start
Connect to the Web Console using your browser:
http://localhost:8047
Click the Storage tab and edit the dfs plug-in. You need to add a sample configuration for the plug-in that includes just the date fields, to keep things simple to start. Add the following to the end of the file, just before the closing bracket:
"sample": {
  "type": "regex",
  "regex": "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*",
  "fields": "year, month, day",
  "extension": "samp"
}
The quoted name is yours to choose. In this example, we define a format called "sample" with a pattern, three columns, and a .samp file extension. The value of the type key must match that defined in the @JsonTypeName annotation of our format class. The other three keys provide values compatible with our format properties.
Save the configuration. If Drill reports “success” and displays the configuration with your changes, congratulations, everything works so far!
We noted earlier that bugs in your code can corrupt Drill's state in ZooKeeper. If you experience this problem, you can stop Drill and clear the ZooKeeper state, assuming that you have ZooKeeper installed in $ZK_HOME:

$ $DRILL_HOME/bin/drillbit.sh stop
$ $ZK_HOME/bin/zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper, drill, hbase]
[zk: localhost:2181(CONNECTED) 1] ls /drill
[running, sys.storage_plugins, drillbits1]
[zk: localhost:2181(CONNECTED) 2] rmr /drill
[zk: localhost:2181(CONNECTED) 3] ls /
[zookeeper, hbase]
[zk: localhost:2181(CONNECTED) 6] quit
$ $DRILL_HOME/bin/drillbit.sh start
Note that if you’re running Drill 1.12 and it won’t start, see DRILL-6064.
Drill plug-in configurations are not versioned. If you change the configuration class and ship your code to a user that has a JSON configuration for an earlier version, Drill might fail to start, displaying a very cryptic error. To fix the error, you must manually remove the old configuration from ZooKeeper before restarting Drill. (On a secure system, this can be quite difficult, so it is best to do your testing on your laptop without enabling Drill security.)
As noted earlier, the general rule for production systems is to never change the plug-in configuration class after you use it in production. This is an area where we want to get it right the first time!
If things don't work, look at the $DRILL_HOME/log/drillbit.log file. For example, suppose that you mistakenly named your get method for extensions as getFileSuffix():

org.apache.drill.common.exceptions.DrillRuntimeException:
  unable to deserialize value at dfs
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
  Unrecognized field "fileSuffix"
  (class org.apache.drill.exec.store.easy.regex.RegexFormatConfig),
  not marked as ignorable
  (2 known properties: "columnCount", "extension")
With the configuration class working, you can now create the format plug-in class itself. First you create the basic shell and then you add the required methods, one by one.
To begin, create a class that extends EasyFormatPlugin, using your configuration class as a type parameter:

package org.apache.drill.exec.store.easy.regex;
...
public class RegexFormatPlugin extends EasyFormatPlugin<RegexFormatConfig> {
  ...
}
Add a default name and a field to store the plug-in configuration, and then add a constructor to pass configuration information to the base class:
public static final String DEFAULT_NAME = "regex";

private final RegexFormatConfig formatConfig;

public RegexFormatPlugin(String name, DrillbitContext context,
    Configuration fsConf, StoragePluginConfig storageConfig,
    RegexFormatConfig formatConfig) {
  super(name, context, fsConf, storageConfig, formatConfig,
      true,  // readable
      false, // writable
      false, // blockSplittable
      true,  // compressible
      Lists.newArrayList(formatConfig.extension),
      DEFAULT_NAME);
  this.formatConfig = formatConfig;
}
The constructor accomplishes a number of tasks:
Accepts the plug-in configuration name that you set previously in the Drill Web Console.
Accepts a number of configuration objects that give you access to Drill internals and to the filesystem.
Accepts an instance of your deserialized format plug-in configuration.
Passes to the parent class constructor a number of properties that define the behavior of your plug-in. (It is the ability to specify these options in the constructor that, in large part, makes this an Easy plug-in.)
Defines the file extensions to be associated with this plug-in. In this case, you get the extension from the format plug-in configuration so the user can change it.
The gist of the code is that your plug-in can read but not write files. The files are not block-splittable. (Adding this ability would be simple: more on this later.) You can compress the file as a .zip or tar.gz file. You take the extension from the configuration, and the default name from a constant.
Next, you can provide default implementations for several methods. The first says that your plug-in will support (projection) push-down:
@Override
public boolean supportsPushDown() { return true; }
A reader need not support projection push-down. But without such support, the reader will load all of the file’s data into memory, only to have the Drill Project operator throw much of that data away. Since doing so is inefficient, this example will show how to implement projection push-down.
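Conceptually, projection push-down means the reader itself keeps only the requested columns instead of materializing everything. A plain-Java sketch of the idea (not Drill's API; the names here are ours):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProjectionSketch {
  // Given all columns parsed from one record and the projected column
  // names, keep only the projected columns. Columns the file lacks
  // come back as null, mirroring Drill's null-column behavior.
  public static Map<String, String> project(Map<String, String> row,
      List<String> projected) {
    Map<String, String> out = new LinkedHashMap<>();
    for (String col : projected) {
      out.put(col, row.get(col)); // missing columns become null
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> row = new LinkedHashMap<>();
    row.put("year", "2017");
    row.put("month", "12");
    row.put("day", "21");
    System.out.println(project(row, Arrays.asList("year", "day")));
    // → {year=2017, day=21}
  }
}
```

The real reader does this mapping once at setup time (from input positions to output vectors), not per row, but the selection logic is the same.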
The next two are stubs because the plug-in does not support writing:
@Override
public RecordWriter getRecordWriter(FragmentContext context,
    EasyWriter writer) throws IOException {
  return null;
}

@Override
public int getWriterOperatorType() { return 0; }
Eventually, you can register your plug-in with the Drill operator registry so that it will appear nicely in query profiles. For now, just leave it as a stub:
@Override
public int getReaderOperatorType() { return 0; }
Finally, we need to create the actual record reader. Leave this as a stub for now:
@Override
public RecordReader getRecordReader(FragmentContext context,
    DrillFileSystem dfs, FileWork fileWork,
    List<SchemaPath> columns, String userName)
    throws ExecutionSetupException {
  // TODO Write me!
  return null;
}
Before we dive into the reader, let’s use a unit test and the debugger to verify that your configuration is, in fact, being passed to the plug-in constructor. You’ll use this test many times as you continue development.
First, you need to create a sample input file that you’ll place in Drill’s resource folder, /drill-java-exec/src/test/resources/regex/simple.samp. Getting sample data is easy: just grab a few lines from your drillbit.log file:
2017-12-19 10:52:41,820 [main] INFO  o.a.d.e.e.f.FunctionImplementationRegist...
2017-12-19 10:52:37,652 [main] INFO  o.a.drill.common.config.DrillConfig - Con...
Base Configuration:
  - jar:file:/Users/paulrogers/git/drill/distribution/target/apache-drill...
2017-12-19 11:12:27,278 [main] ERROR o.apache.drill.exec.server.Drillbit - ...
We've included two different message types (INFO and ERROR), along with a multiline message. (Our code will ignore the nonmessage lines.)
We’ve added a new file type to our project, .samp. Drill uses Apache RAT to check that every file has a copyright header. Because we’ve added a data file, we don’t want to include the header (for some files, the header cannot be provided). Configure RAT to ignore this file type by adding the following lines to the contrib/format-regex/pom.xml file:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.rat</groupId>
      <artifactId>apache-rat-plugin</artifactId>
      <inherited>true</inherited>
      <configuration>
        <excludes>
          <exclude>**/*.samp</exclude>
        </excludes>
      </configuration>
    </plugin>
  </plugins>
</build>
The first test you did earlier required that you run a full five-minute Drill build, then start the Drill server, and then connect using the Web Console. For that first test, this was fine because it is the most convenient way to test the format configuration. But for the remaining tasks, you’ll want to have much faster edit-compile-debug cycles.
You can do this by running Drill in your IDE using Drill's test fixtures. See the developer documentation for details. You can also find more information about the test fixtures in the class org.apache.drill.test.ExampleTest.
Here is how to create an ad hoc test program that starts the server, including the web server, and listens for connections:
In the java-exec/src/test directory, create the org.apache.drill.exec.store.easy.regex package.
Create an “ad hoc” test program that launches the server with selected options.
Run your test program from your IDE.
The following section shows you how to build the most basic test program.
You want your test to be fast and self-contained. This is such a common pattern that a class exists to help you out: ClusterTest.

Start by creating a test that derives from ClusterTest:
public class TestRegexReader extends ClusterTest {

  @ClassRule
  public static final BaseDirTestWatcher dirTestWatcher =
      new BaseDirTestWatcher();

  @BeforeClass
  public static void setup() throws Exception {
    ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher));

    // Define a regex format config for testing.
    defineRegexPlugin();
  }
}
This is mostly boilerplate except for the defineRegexPlugin() method. As it turns out, annoyingly, Drill provides no SQL syntax or API for defining plug-in configuration at runtime. Either you use the Web Console, or you must create your own, which is what we show here:
@SuppressWarnings("resource")
private static void defineRegexPlugin() throws ExecutionSetupException {

  // Create an instance of the regex config.
  RegexFormatConfig config = new RegexFormatConfig();
  config.extension = "samp";
  config.regex = "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*";
  config.fields = "year, month, day";

  // Define a temporary format plug-in for the "cp" storage plug-in.
  Drillbit drillbit = cluster.drillbit();
  final StoragePluginRegistry pluginRegistry =
      drillbit.getContext().getStorage();
  final FileSystemPlugin plugin =
      (FileSystemPlugin) pluginRegistry.getPlugin("cp");
  final FileSystemConfig pluginConfig =
      (FileSystemConfig) plugin.getConfig();
  pluginConfig.getFormats().put("sample", config);
  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
}
The first block of code simply uses the format configuration class to define a simple test format: just the first three fields of a Drill log.
The second block is "black magic": it retrieves the existing classpath (cp) plug-in, retrieves the configuration object for that plug-in, adds your format plug-in to that storage configuration, and then redefines the cp storage plug-in with the new configuration. (The Drill test framework has methods to set up a plug-in, but only in the context of a workspace, and workspaces are not supported for the classpath plug-in.) In any event, with the preceding code, you can configure a test version of your plug-in without having to use the Web Console.
Next, define the simplest possible test—just run a query:
@Test
public void testWildcard() {
  String sql = "SELECT * FROM cp.`regex/simple.samp`";
  client.queryBuilder().sql(sql).printCsv();
}
Of course, this test won’t work yet: you haven’t implemented the reader. But you can at least test that Drill is able to find and instantiate your plug-in.
Set a breakpoint in the constructor of the plug-in and then run the test in the debugger. When the debugger hits the breakpoint, inspect the contents of the format plug-in provided to the constructor. If everything looks good, congratulations, another step completed! Go ahead and stop the debugger because we have no reader implemented.
If something goes wrong, it helps to know where to look for problems. Drill uses the following mappings to find your plug-in:
The name "sample" was used to register the plug-in configuration with the storage plug-in in defineRegexPlugin().
The Drill class FormatCreator scans the classpath for all classes that derive from FormatPlugin (which yours does via EasyFormatPlugin).
FormatCreator scans the constructors for each plug-in class looking for those of the expected format. (If you had added or removed arguments, yours would not be found and the plug-in would be ignored.)
To find the plug-in class for a format configuration, Drill searches the constructors of the format plug-ins to find one in which the type of the fifth argument matches the class of the format configuration registered in step 1.
Drill invokes the matching constructor to create an instance of your format plug-in class.
The format plug-in creates the reader that reads data for your format plug-in.
If things go wrong, step through the code in FormatCreator to determine where it went off the rails.
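The constructor-matching step above can be illustrated in plain Java. This is only a sketch of the mechanism (the classes and method here are ours, not FormatCreator's actual code): find a public constructor whose fifth parameter type equals the configuration class.

```java
import java.lang.reflect.Constructor;

public class ConstructorMatchSketch {
  static class MyConfig { }

  static class MyPlugin {
    // Fifth argument is the format config type, as the matcher expects.
    public MyPlugin(String name, Object context, Object fsConf,
        Object storageConfig, MyConfig formatConfig) { }
  }

  // Search a plug-in class for a five-argument constructor whose fifth
  // parameter matches the given config class; null if none matches.
  public static Constructor<?> findMatch(Class<?> pluginClass,
      Class<?> configClass) {
    for (Constructor<?> ctor : pluginClass.getConstructors()) {
      Class<?>[] params = ctor.getParameterTypes();
      if (params.length == 5 && params[4] == configClass) {
        return ctor;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(findMatch(MyPlugin.class, MyConfig.class) != null);
    // → true
  }
}
```

This also shows why adding or removing a constructor argument silently breaks discovery: the lookup fails and returns nothing rather than raising an error.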
With the boilerplate taken care of and with a working test environment, we are now ready to tackle the heart of the problem: the record reader. In Drill, the record reader is responsible for a number of tasks:
Defining value vectors for each column
Populating value vectors for each batch of records
Performing projection push-down (mapping from input to query columns)
Translating errors into Drill format
Each of these tasks can be complex when you consider all the nuances and use cases. Let’s work on them step by step, discussing the theory as we go.
In Drill, a single query can scan multiple files (or multiple blocks of a single large file). As we’ve discussed earlier in this book, Drill divides queries into major fragments, one of which will perform the scan operation. The scan’s major fragment is distributed across a set of minor fragments, typically running multiple instances on each node in the Drill cluster. If the query scans a few files, each scan operator might read just one file. But if the query touches many files relative to the number of scan minor fragments, each scan operator will read multiple files.
To read about fragments in more detail, see Chapter 3.
The scan operator orchestrates the scan operation, but delegates actual reading to a record reader. There is one scan operator per minor fragment, but possibly many record readers for each scan operator. In particular, Drill creates one record reader for each file. Many files are splittable. In this case, Drill creates one record reader per file block.
Because Drill reads “big data,” files are often large. Blocks are often 256 MB or 512 MB. Drill further divides each block into a series of batches: collections of records that fit comfortably into memory. Each batch is composed of a collection of value vectors, one per column.
So, your job in creating a reader is to create a class that reads data from a single file (or block) in the context of a scan operator that might read multiple files, and to read the data as one or more batches, filling the value vectors as you read each record.
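The batch structure can be pictured with a small stand-alone sketch (illustrative only; Drill batches are collections of value vectors, not Java lists): partition a stream of records into fixed-size batches, the way each next() call fills one batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSketch {
  // Partition records into fixed-size batches, mimicking how each
  // next() call on a reader fills one batch of rows at a time.
  public static List<List<String>> toBatches(List<String> records,
      int batchSize) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < records.size(); i += batchSize) {
      batches.add(records.subList(i,
          Math.min(i + batchSize, records.size())));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
    System.out.println(toBatches(records, 2).size()); // → 3
  }
}
```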
Let's begin by creating the RegexRecordReader class:
public class RegexRecordReader extends AbstractRecordReader {

  private final DrillFileSystem dfs;
  private final FileWork fileWork;
  private final RegexFormatConfig formatConfig;

  public RegexRecordReader(FragmentContext context,
      DrillFileSystem dfs, FileWork fileWork,
      List<SchemaPath> columns, String userName,
      RegexFormatConfig formatConfig) {
    this.dfs = dfs;
    this.fileWork = fileWork;
    this.formatConfig = formatConfig;

    // Ask the superclass to parse the projection list.
    setColumns(columns);
  }

  @Override
  public void setup(OperatorContext context, OutputMutator output)
      throws ExecutionSetupException { }

  @Override
  public int next() { return 0; } // Stub for now

  @Override
  public void close() throws Exception { }
}
The RegexRecordReader interface is pretty simple:

Constructor
Provides the five parameters that describe the file scan, plus the configuration of our regex format

setup()
Called to open the underlying file and start reading

next()
Called to read each batch of rows from the data source

close()
Called to release resources

The AbstractRecordReader class provides a few helper functions that you'll use later.
If you create a format plug-in without the Easy framework, or if you create a storage plug-in, you are responsible for creating the scan operator, the readers, and quite a bit of other boilerplate.
However, we are using the Easy framework, which does all of this for you (hence the name). The Easy framework creates the record readers at the time that the scan operator is created. Because of this, you want to avoid doing operations in the constructor that consume resources (such as memory allocation or opening a file). Save those for the setup() call.
Let’s consider the constructor parameters:
FragmentContext
Provides information about the (minor) fragment along with information about the Drill server itself. You most likely don't need this information.

DrillFileSystem
Drill's version of the HDFS FileSystem class. Use this to open the input file.

FileWork
Provides the filename along with the block offset and length (for block-aware files).

List<SchemaPath>
The set of columns that the query requested from the data source.

String userName
The name of the user running the query, for use in some security models.

RegexFormatConfig
The regex format configuration that we created earlier.
You must also instruct Drill how to instantiate your reader by implementing the following method in your RegexFormatPlugin class:
@Override
public RecordReader getRecordReader(FragmentContext context,
    DrillFileSystem dfs, FileWork fileWork,
    List<SchemaPath> columns, String userName)
    throws ExecutionSetupException {
  return new RegexRecordReader(context, dfs, fileWork,
      columns, userName, formatConfig);
}
As it turns out, you've now created enough structure that you can successfully run a query; however, it will produce no results. Rerun the previous SELECT * query using the test created earlier. Set breakpoints in the getRecordReader() method and examine the parameters to become familiar with their structure. (Not all Drill classes provide Javadoc comments, so this is a good alternative.) Then, step into your constructor to see how the columns are parsed.
If all goes well, the console should display something like this:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader#testWildcard
Total rows returned: 0. Returned in 1804ms.
You are now ready to build the actual reader.
Drill’s error handling rules are a bit vague, but the following should stand you in good stead:
If a failure occurs during construction or setup, log the error and throw an ExecutionSetupException.
If a failure occurs elsewhere, throw a UserException, which will automatically log the error.
In both cases, the error message you provide will be sent back to the user running a query, so try to provide a clear message.
The SQLLine program will show your error message to the user. However, at present, the Web Console displays only a cryptic message, losing the message text. Because of this, you should use SQLLine if you want to test error messages. You can also check the query profile in the Web Console.
Drill’s UserException class wraps exceptions that are meant to be returned to the user. The class uses a fluent builder syntax and requires you to identify the type of error. Here are two of the most common:
dataReadError()
Indicates that something went wrong with opening or reading from a file.
validationError()
Indicates that something went wrong when validating a query. Because Drill does column validation at runtime, you can throw this if the projected columns are not valid for your data source.
See the UserException
class for others. You can also search the source code to see how each error is used in practice.
The UserException
class allows you to provide additional context information such as the file or column that triggered the error and other information to help the user understand the issue.
One odd aspect of UserException is that it is unchecked: you do not have to declare it in the method signature, and so you can throw it from anywhere. A good rule of thumb is that if the error might be due to a user action, the environment, or faulty configuration, throw a UserException. Be sure to include the filename and line number, if applicable. If the error is likely due to a code flaw (some invariant is invalid, for example), throw an unchecked Java exception such as IllegalStateException, IllegalArgumentException, or UnsupportedOperationException. Use good unit tests to ensure that these “something is wrong in the code” exceptions are never thrown when the code is in production.
Because the UserException
allows us to provide additional context, we use it in this example, even in the setup stage.
Because the purpose of this exercise is to illustrate Drill, the example uses the simplest possible regex parsing algorithm: just the Java Pattern
and Matcher
classes. (Production Drill readers tend to go to extremes to optimize the per-record path—which we leave as an exercise for you—and so the simple approach here would need optimization for production code.) Remember that we decided to simply ignore lines that don’t match the pattern:
private Pattern pattern;

private void setupPattern() {
  try {
    pattern = Pattern.compile(formatConfig.getRegex());
  } catch (PatternSyntaxException e) {
    throw UserException
        .validationError(e)
        .message("Failed to parse regex: \"%s\"", formatConfig.getRegex())
        .build(logger);
  }
}
As is typical in a system as complex as Drill, error handling can consume a large fraction of your code. Because the user supplies the regex (as part of the plug-in configuration), we raise a UserException
if the regex has errors, passing the original exception and an error message. The build()
method automatically logs the error into Drill’s log file to help with problem diagnosis.
Here is the simplest possible code to parse the list of columns:
private List<String> columnNames;

private void setupColumns() {
  String fieldStr = formatConfig.getFields();
  columnNames = Splitter.on(Pattern.compile("\\s*,\\s*"))
      .splitToList(fieldStr);
}
See the complete code in GitHub for the full set of error checks required. They are omitted here because they don’t shed much light on Drill itself.
Projection is the process of picking some columns from the input, but not others. (In SQL, the projection list confusingly follows the SELECT
keyword.) As explained earlier, the simplest reader just reads all columns, after which Drill will discard those that are not projected. Clearly this is inefficient; instead, each reader should do the projection itself. This is called projection push-down (the projection is pushed down into the reader).
We instructed Drill that our format plug-in supports projection push-down with the following method in the RegexFormatPlugin
class:
@Override
public boolean supportsPushDown() {
  return true;
}
To implement projection, you must handle three cases:
An empty project list: SELECT COUNT(*)
A wildcard query: SELECT *
An explicit list: SELECT a, b, c
The base AbstractRecordReader
class does some of the work for you when you call setColumns()
from the constructor:
// Ask the superclass to parse the projection list.
setColumns(columns);
Here is the top-level method that identifies the cases:
private void setupProjection() {
  if (isSkipQuery()) {
    projectNone();
  } else if (isStarQuery()) {
    projectAll();
  } else {
    projectSubset();
  }
}
The isSkipQuery() and isStarQuery() methods are provided by the superclass as a result of calling setColumns() in the example prior to this one.
Because the regex plug-in parses text files, we can assume that all of the columns will be nullable VARCHAR
. You will need the column name later to create the value vector, which means that you need a data structure to keep track of this information:
private static class ColumnDefn {
  private final String name;
  private final int index;
  private NullableVarCharVector.Mutator mutator;

  public ColumnDefn(String name, int index) {
    this.name = name;
    this.index = index;
  }
}
The Mutator
class is the mechanism Drill provides to write values into a value vector.
When given a wildcard (SELECT *) query, SQL semantics specify that you should do the following:
Include all columns from the data source.
Use the names defined in the data source.
Use them in the order in which they appear in the data source.
In this case, the “data source” is the set of columns defined in the plug-in configuration:
private void projectAll() {
  columns = new ColumnDefn[groupCount];
  for (int i = 0; i < columns.length; i++) {
    columns[i] = new ColumnDefn(columnNames.get(i), i);
  }
}
The final case occurs when the user requests a specific set of columns; for example, SELECT a, b, c
. Because Drill is schema-free, it cannot check at plan time which projected columns exist in the data source. Instead, that work is done at read time. The result is that Drill allows the user to project any column, even one that does not exist:
A requested column matches one defined by the configuration, so we project that column to the output batch.
A requested column does not match one defined by the configuration, so we must project a null column.
By convention, Drill creates a nullable (OPTIONAL
) INT
column for missing columns. In our plug-in, we only ever create VARCHAR
values, so nullable INT
can never be right. Instead, we create nullable VARCHAR
columns for missing columns.
To keep the code simple, we will use nullable VARCHAR
columns even for those columns that are available. (You might want to modify the example to use non-nullable [REQUIRED
] columns, instead.)
As an aside, the typical Drill way to handle other data types is to read a text file (such as CSV, or, here, a log file) as VARCHAR
, then perform conversions (CAST
s) to other types within the query itself (or in a view). You could add functionality to allow the format configuration to specify the type, then do the conversion in your format plug-in code. In fact, this is exactly what the regex plug-in in Drill does. Review that code for the details.
Following are the rules for creating the output projection list:
Include in the output rows only the columns from the project list (including missing columns).
Use the names provided in the project list (which might differ in case from those in the data source).
Include columns in the order in which they are defined in the project list.
Drill is case-insensitive, so you must use case-insensitive name mapping. But Drill labels columns using the case provided in the projection list, so for the project-some case, you want to use the name as specified in the project list.
For each projected name, you look up the name in the list of columns and record either the column (pattern) index, or –1 if the column is not found. (For simplicity we use a linear search; production code might use a hash map.)
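As noted, production code might replace the linear search with a map built once at setup time. A minimal sketch of such a case-insensitive lookup, using only plain Java collections (the ColumnLookupSketch class and its method names are illustrative, not part of Drill's API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnLookupSketch {
  private final Map<String, Integer> patternIndexes = new HashMap<>();

  // Build the map once, keyed on the lowercased configured names.
  public ColumnLookupSketch(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      patternIndexes.put(columnNames.get(i).toLowerCase(), i);
    }
  }

  // Case-insensitive lookup; -1 means "not found," the same cue
  // used in the reader to fill the column with nulls.
  public int patternIndex(String projectedName) {
    Integer index = patternIndexes.get(projectedName.toLowerCase());
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    ColumnLookupSketch lookup =
        new ColumnLookupSketch(List.of("year", "month", "day"));
    System.out.println(lookup.patternIndex("Year"));    // prints 0
    System.out.println(lookup.patternIndex("missing")); // prints -1
  }
}
```

A lookup for "Year" returns the same index as "year", while a name that was never configured returns -1.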
The implementation handles all these details:
private void projectSubset() {

  // Ensure the projected columns are only simple columns;
  // no maps, no arrays.

  Collection<SchemaPath> project = this.getColumns();
  assert !project.isEmpty();
  columns = new ColumnDefn[project.size()];
  int colIndex = 0;
  for (SchemaPath column : project) {
    if (column.getAsNamePart().hasChild()) {
      throw UserException
          .validationError()
          .message("The regex format plugin supports only"
              + " simple columns")
          .addContext("Projected column", column.toString())
          .build(logger);
    }

    // Find a matching defined column, case-insensitive match.

    String name = column.getAsNamePart().getName();
    int patternIndex = -1;
    for (int i = 0; i < columnNames.size(); i++) {
      if (columnNames.get(i).equalsIgnoreCase(name)) {
        patternIndex = i;
        break;
      }
    }

    // Create the column. Index of -1 means column will be null.

    columns[colIndex++] = new ColumnDefn(name, patternIndex);
  }
}
The cryptic check for hasChild()
catches subtle errors. Drill allows two special kinds of columns to appear in the project list: arrays (columns[0]
) and maps (a.b
). Because our plug-in handles only simple columns, we reject requests for nonsimple columns.
Note what happens if a requested column does not match a column from that provided by the plug-in configuration: the patternIndex
ends up as -1
. We use that as our cue later to fill that column with null values.
Drill uses the DrillFileSystem
class, which is a wrapper around the HDFS FileSystem
class, to work with files. Here, we are concerned only with opening a file as an input stream, which is then, for convenience, wrapped in a BufferedReader
:
private void openFile() {
  InputStream in;
  try {
    in = dfs.open(new Path(fileWork.getPath()));
  } catch (Exception e) {
    throw UserException
        .dataReadError(e)
        .message("Failed to open input file: %s", fileWork.getPath())
        .addContext("User name", userName)
        .build(logger);
  }
  reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8));
}
Here we see the use of the dataReadError
form of the UserException
, along with a method to add the username as context (in case the problem is related to permissions).
Notice that, after this call, we are holding a resource that we must free in close()
, even if something fails.
Many query systems (such as MapReduce and Hive) are row-based: the record reader reads one row at a time. Drill, being columnar, works with groups of records called record batches. Each reader provides a batch of records on each call to next()
. Drill has no standard for the number of records: some readers return 1,000, some 4,096; others return 65,536 (the maximum). For our plug-in, let’s go with the standard size set in Drill’s internals (4,096):
private static final int BATCH_SIZE = BaseValueVector.INITIAL_VALUE_ALLOCATION;
The best batch size depends strongly on the size of each record. At present, Drill simply assumes records are the proper size. For various reasons, it is best to choose a number that keeps the memory required per batch below a few megabytes in size.
Let’s see how our size of 4,096 works out. We are scanning Drill log lines, which tend to be less than 100 characters long. 4,096 records * 100 characters/record = 409,600 bytes, which is fine.
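The same back-of-the-envelope arithmetic can be expressed as a sketch (the 100-byte average record size is the assumption made above for Drill log lines, not a measured value):

```java
public class BatchSizeEstimate {

  // Rough memory needed for the VARCHAR data in one batch:
  // record count times average bytes per record.
  static long estimateBatchBytes(int batchSize, int avgRecordBytes) {
    return (long) batchSize * avgRecordBytes;
  }

  public static void main(String[] args) {
    // 4,096 records * 100 bytes/record = 409,600 bytes,
    // well under the few-megabyte guideline.
    System.out.println(estimateBatchBytes(4096, 100)); // prints 409600
  }
}
```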
As a columnar execution engine, Drill stores data per column in a structure called a value vector. Each vector stores values for a single column, one after another. To hold the complete set of columns that make up a row, we create multiple value vectors.
Drill defines a separate value vector class for each data type. Within each data type, there are also separate classes for each cardinality: non-nullable (called REQUIRED
in Drill code), nullable (called OPTIONAL
) and repeated.
We do not write to vectors directly. Instead, we write through a helper class called a (vector) Mutator
. (There is also an Accessor
to read values.) Just as each combination of (type, cardinality) has a separate class, each also has a separate Mutator
and Accessor
class (defined as part of the vector class).
To keep the example code simple, we use a single type and single cardinality: OPTIONAL VARCHAR
. This makes sense: our regex pattern can only pull out strings, it does not have sufficient information to determine column types. However, if you are reading from a system that maps to other Drill types (INT
, DOUBLE
, etc.), you must deal with multiple vector classes, each needing type-specific code.
Drill 1.14 contains a new RowSet
mechanism to greatly simplify the kind of work we explain here. At the time of writing, the code was not quite ready for full use, so we explain the current techniques. Watch the “dev” list for when the new mechanism becomes available.
With that background, it is now time to define value vectors. Drill provides the OutputMutator
class to handle many of the details. Because we need only the mutator, and we need to use it for each column, let’s store it in our column definition class for easy access:
private static class ColumnDefn {
  ...
  private NullableVarCharVector.Mutator mutator;
}
To create a vector, we just need to create metadata for each column (in the form of a MaterializedField
) and then ask the output mutator to create the vector:
private void defineVectors(OutputMutator output) {
  for (int i = 0; i < columns.length; i++) {
    MaterializedField field = MaterializedField.create(columns[i].name,
        Types.optional(MinorType.VARCHAR));
    try {
      columns[i].mutator = output.addField(field,
          NullableVarCharVector.class).getMutator();
    } catch (SchemaChangeException e) {
      throw UserException
          .systemError(e)
          .message("Vector creation failed")
          .build(logger);
    }
  }
}
The code uses two convenience methods to define metadata: MaterializedField.create() and Types.optional().
The SchemaChangeException
thrown by addField()
seems odd: we are creating a vector, how could the schema change? As it turns out, Drill requires that the reader provide exactly the same value vector in each batch. In fact, if a scan operator reads multiple files, all readers must share the same vectors. If reader #2 asks to create column c
as a nullable VARCHAR
, but reader #1 has already created c
as a REQUIRED INT
, for example, the exception will be thrown. In this case, all readers use the same type, so the error should never actually occur.
With the setup completed, we are ready to actually read some data:
@Override
public int next() {
  rowIndex = 0;
  while (nextLine()) {
  }
  return rowIndex;
}
Here we read rows until we fill the batch. Because we skip some rows, we can’t simply use a for
loop: we want to count only matching lines (because those are the only ones loaded into Drill).
The work to do the pattern matching is straightforward:
private boolean nextLine() {
  String line;
  try {
    line = reader.readLine();
  } catch (IOException e) {
    throw UserException
        .dataReadError(e)
        .addContext("File", fileWork.getPath())
        .build(logger);
  }
  if (line == null) {
    return false;
  }
  Matcher m = pattern.matcher(line);
  if (m.matches()) {
    loadVectors(m);
  }
  return rowIndex < BATCH_SIZE;
}
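Stripped of Drill's vector machinery, the combined logic of next() and nextLine() amounts to the following self-contained sketch (the BatchFillSketch class and its names are illustrative only; a List of String arrays stands in for the value vectors):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BatchFillSketch {
  static final int BATCH_SIZE = 4096;

  // Fill one "batch" from the input lines: only matching lines
  // consume a row; non-matching lines are read and discarded.
  static int fillBatch(List<String> lines, Pattern pattern,
      List<String[]> batch) {
    int rowIndex = 0;
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      if (!m.matches()) {
        continue; // skipped lines do not count toward the batch
      }
      String[] row = new String[m.groupCount()];
      for (int g = 0; g < m.groupCount(); g++) {
        row[g] = m.group(g + 1); // regex groups are 1-based
      }
      batch.add(row);
      if (++rowIndex >= BATCH_SIZE) {
        break; // batch is full
      }
    }
    return rowIndex;
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*");
    List<String[]> batch = new ArrayList<>();
    int n = fillBatch(
        List.of("2017-12-17 10:52:41 ok", "garbage", "2017-12-18 10:52:37 ok"),
        p, batch);
    System.out.println(n); // prints 2: the "garbage" line was skipped
  }
}
```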
The next task is to load data into the vectors. The nullable VARCHAR
Mutator
class lets you set a value to null, or set a non-null value. You must tell it the row to write—the Mutator
, oddly, does not keep track of this itself. In our case, if the column is “missing,” we set it to null. If the pattern itself is null (the pattern is optional and was not found), we also set the value to null. Only if we have an actual match do we copy the value (as a string) into the vector (as an array of bytes). Recall that regex groups are 1-based:
private void loadVectors(Matcher m) {

  // Core work: write values into vectors for the current
  // row. If projected by name, some columns may be null.

  for (int i = 0; i < columns.length; i++) {
    NullableVarCharVector.Mutator mutator = columns[i].mutator;
    if (columns[i].index == -1) {
      // Not necessary; included just for clarity
      mutator.setNull(rowIndex);
    } else {
      String value = m.group(columns[i].index + 1);
      if (value == null) {
        // Not necessary; included just for clarity
        mutator.setNull(rowIndex);
      } else {
        mutator.set(rowIndex, value.getBytes());
      }
    }
  }
  rowIndex++;
}
In practice, we did not have to set values to null, because null is the default; it was done here just for clarity.
The only remaining task is to implement close()
to release resources. In our case, the only resource is the file reader. Here we use logging to report and then ignore any errors that occur on close:
@Override
public void close() {
  if (reader != null) {
    try {
      reader.close();
    } catch (IOException e) {
      logger.warn("Error when closing file: " + fileWork.getPath(), e);
    }
    reader = null;
  }
}
The final step is to test the reader. Although it can be tempting to build Drill, fire up SQLLine, and throw some queries at your code, you need to resist that temptation and instead do detailed unit testing. Doing so is easy with Drill’s testing tools. Plus, you can use the test cases as a way to quickly rerun code in the unlikely event that the tests uncover some bugs in your code.
You can use your existing test to check the wildcard (SELECT *
) case:
@Test
public void testWildcard() {
  String sql = "SELECT * FROM cp.`regex/simple.samp`";
  client.queryBuilder().sql(sql).printCsv();
}
When you run the test, you should see something like the following:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader#testWildcard
3 row(s):
year<VARCHAR(OPTIONAL)>,month<VARCHAR(OPTIONAL)>,day<VARCHAR(OPTIONAL)>
2017,12,17
2017,12,18
2017,12,19
Total rows returned : 3. Returned in 10221ms.
Congratulations, you have a working format plug-in! But we’re not done yet. This is not much of a test given that it requires that a human review the output. Let’s follow the examples in ExampleTest
and actually validate the output:
@Test
public void testWildcard() throws RpcException {
  String sql = "SELECT * FROM cp.`regex/simple.log1`";
  RowSet results = client.queryBuilder().sql(sql).rowSet();

  BatchSchema expectedSchema = new SchemaBuilder()
      .addNullable("year", MinorType.VARCHAR)
      .addNullable("month", MinorType.VARCHAR)
      .addNullable("day", MinorType.VARCHAR)
      .build();

  RowSet expected = client.rowSetBuilder(expectedSchema)
      .addRow("2017", "12", "17")
      .addRow("2017", "12", "18")
      .addRow("2017", "12", "19")
      .build();

  RowSetUtilities.verify(expected, results);
}
We use three test tools. SchemaBuilder lets us define the schema we expect. The row set builder lets us build a row set (really, just a collection of vectors) that holds the expected values. Finally, the RowSetUtilities.verify() function compares the two row sets: both the schemas and the values. The result is that we can easily create many tests for our plug-in.
We have three kinds of project, but we’ve tested only one. Let’s test explicit projection using this query:
SELECT `day`, `missing`, `month`
FROM cp.`regex/simple.samp`
As an exercise, use the techniques described earlier to test this query. Begin by printing the results as CSV, and then create a validation test. Hint: to specify a null value when building the expected rows, simply pass a Java null. Check the code in GitHub for the answer.
Finally, let’s verify that projection works for a COUNT(*)
query:
SELECT COUNT(*) FROM cp.`regex/simple.log1`
We know that the query returns exactly one row with a single BIGINT
column. We can use a shortcut to validate the results:
@Test
public void testCount() throws RpcException {
  String sql = "SELECT COUNT(*) FROM cp.`regex/simple.log1`";
  long result = client.queryBuilder().sql(sql).singletonLong();
  assertEquals(3, result);
}
We’ve tested with only a simple pattern thus far. Let’s modify the test to add the full set of Drill log columns, using the pattern we identified earlier. Because each file extension is associated with a single format plug-in, we cannot use the "samp"
extension we’ve been using. Instead, let’s use the actual "log"
extension:
private static void defineRegexPlugin() throws ExecutionSetupException {

  // Create an instance of the regex config.
  // Note: we can't use the ".log" extension; the Drill .gitignore
  // file ignores such files, so they'll never get committed.
  // Instead, make up a fake suffix.

  RegexFormatConfig sampleConfig = new RegexFormatConfig();
  sampleConfig.extension = "log1";
  sampleConfig.regex = DATE_ONLY_PATTERN;
  sampleConfig.fields = "year, month, day";

  // Full Drill log parser definition.

  RegexFormatConfig logConfig = new RegexFormatConfig();
  logConfig.extension = "log2";
  logConfig.regex = "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) "
      + "(\\d\\d):(\\d\\d):(\\d\\d),\\d+ "
      + "\\[([^]]*)] (\\w+)\\s+(\\S+) - (.*)";
  logConfig.fields = "year, month, day, hour, "
      + "minute, second, thread, level, module, message";

  // Define a temporary format plug-in for the "cp" storage plug-in.

  Drillbit drillbit = cluster.drillbit();
  final StoragePluginRegistry pluginRegistry =
      drillbit.getContext().getStorage();
  final FileSystemPlugin plugin =
      (FileSystemPlugin) pluginRegistry.getPlugin("cp");
  final FileSystemConfig pluginConfig =
      (FileSystemConfig) plugin.getConfig();
  pluginConfig.getFormats().put("sample", sampleConfig);
  pluginConfig.getFormats().put("drill-log", logConfig);
  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
}
}
@Test
public void testFull() throws RpcException {
  String sql = "SELECT * FROM cp.`regex/simple.log2`";
  client.queryBuilder().sql(sql).printCsv();
}
The output should be like the following:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader...
3 row(s):
year<VARCHAR(OPTIONAL)>,month<VARCHAR(OPTIONAL)>,...
2017,12,17,10,52,41,main,INFO,o.a.d.e.e.f.Function...
2017,12,18,10,52,37,main,INFO,o.a.drill.common.config....
2017,12,19,11,12,27,main,ERROR,o.apache.drill.exec.server....
Total rows returned : 3. Returned in 1751ms.
Success!
You’ve now fully tested the plug-in, and it is ready for actual use. As noted, we took some shortcuts for convenience so that we could go back and do some performance tuning. We’ll leave the performance tuning to you and instead focus on how to put the plug-in into production as well as how to offer the plug-in to the Drill community.
By following the pattern described here, you should be able to create your own simple format plug-in. There are a few additional topics that can be helpful in advanced cases.
The Hadoop model is to store large amounts of data in each file and then read file “chunks” in parallel. For simplicity, our plug-in assumes that files are not “splittable”: that chunks of the file cannot be read in parallel. Depending on the file format, you can add this kind of support. Drill logs are not splittable because a single message can span multiple lines. However, simpler formats (such as CSV) can be splittable if each record corresponds to a single line. To read a split, you scan the file looking for the first record separator (such as a newline in CSV) and then begin parsing with the next line. The text reader (the so-called CompliantTextReader) offers an example.
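To make the idea concrete, here is a self-contained sketch of split reading for a newline-delimited format. This is plain Java, not Drill’s actual CompliantTextReader code; the convention shown (a record belongs to the split in which it starts, and any reader not starting at byte 0 discards the partial record at the front of its range) follows the usual Hadoop practice:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {

  // Read the records for the byte range [start, start + length).
  static List<String> readSplit(byte[] file, long start, long length)
      throws IOException {
    long end = start + length;
    InputStream in = new ByteArrayInputStream(file);
    in.skip(start); // ByteArrayInputStream skips the full amount
    long pos = start;
    List<String> records = new ArrayList<>();

    // A reader that does not start at byte 0 cannot know whether it
    // is mid-record, so it scans forward to the first separator; the
    // previous split's reader consumes that record instead.
    if (start != 0) {
      int c;
      while ((c = in.read()) != -1) {
        pos++;
        if (c == '\n') break;
      }
    }

    // Read whole records while the record *starts* at or before the
    // split end; the last record may extend past the boundary.
    while (pos <= end) {
      StringBuilder record = new StringBuilder();
      boolean sawAny = false;
      int c;
      while ((c = in.read()) != -1) {
        pos++;
        sawAny = true;
        if (c == '\n') break;
        record.append((char) c);
      }
      if (!sawAny) break; // end of file
      records.add(record.toString());
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    byte[] file = "aaa\nbbb\nccc\nddd\n".getBytes();
    // Two 8-byte splits cover the file; together they read every
    // record exactly once, with no overlap.
    System.out.println(readSplit(file, 0, 8)); // prints [aaa, bbb, ccc]
    System.out.println(readSplit(file, 8, 8)); // prints [ddd]
  }
}
```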
Prior sections showed two ways to create a configuration for your plug-in. You used the Drill Web Console to create it interactively. You also wrote code to set it up for tests. This is fine for development, but impractical for production users. If you’re going to share your plug-in with others, you probably want it to be configured by default. (For example, for our plug-in, we might want to offer a configuration for Drill logs “out of the box.”)
When Drill first starts on a brand-new installation, it initializes the initial set of format configurations from the following file: /drill-java-exec/src/main/resources/bootstrap-storage-plugins.json.
Ideally, we’d add this configuration in our regex project. Format plug-in configuration is, however, a detail of the storage plug-in configuration. The configuration for the default dfs
storage plug-in resides in Drill’s java-exec
module, so we must add our configuration there. The ability to add configuration as a file within our format plug-in project would be a nice enhancement, but is not available today.
Drill reads the bootstrap file only once: when Drill connects to ZooKeeper and finds that no configuration information has yet been stored. If you add (or modify) a format plug-in afterward, you must manually create the configuration using Drill’s web console. You can also use the REST API.
First, let’s add our bootstrap configuration. Again, if we were defining a new storage plug-in, we’d define a bootstrap file in our module. But because we are creating an Easy format plug-in, we modify the existing Drill file to add the following section just after the last existing format entry for the dfs
storage configuration:
"drill-log": {
  type: "regex",
  extension: "log1",
  regex: "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)
      (\\d\\d):(\\d\\d):(\\d\\d),\\d+ \\[([^]]*)]
      (\\w+)\\s+(\\S+) - (.*)",
  fields: "year, month, day, hour, minute, second,
      thread, level, module, message"
}
(Note that this code is formatted for this book; your code must be on a single line for the regex
and fields
items.)
Be sure to add a comma after the "csvh"
section to separate it from the new section.
To test this, delete the ZooKeeper state as described previously. Build Drill with these changes in place, then start Drill and visit the Web Console. Go to the Storage tab and click Update next to the dfs
plug-in. Your format configuration should appear in the formats
section.
With a working format plug-in, you next must choose how to maintain the code. You have three choices:
Contribute the code to the Drill project via a pull request.
Maintain your own private branch of the Drill code that includes your plug-in code.
Move your plug-in code to a separate Maven project which you maintain outside of the Drill project.
Drill is an open source project and encourages contributions. If your plug-in might be useful to others, consider contributing your plug-in code to Drill. That way, your plug-in will be part of future Drill releases. If, however, the plug-in works with a format unique to your project or company, then you may want to keep the code separate from Drill.
The preceding steps showed you how to create your plug-in within the Drill project itself by creating a new module within the contrib module. In this case, it is very easy to move your plug-in into production. Just build all of Drill and replace your current Drill distribution with the new one in the distribution/target/apache-drill-version/apache-drill-version directory. (Yes, apache-drill-version appears twice.) Your plug-in will appear in the jars directory where Drill will pick it up.
If you plan to offer your plug-in to the Drill project, your plug-in code should reside in the contrib project, as in the preceding example. Your code is simply an additional set of Java classes and resource files added to the Drill JARs and Drill distribution. So, to deploy your plug-in, just do a full build of Drill and replace your existing Drill installation with your custom-built one.
If you want to contribute your code to Drill, the first step is to ensure your files contain the Apache copyright header and conform to the Drill coding standards. (If you used the Drill Maven file, it already enforced these rules.)
The next step is to submit a pull request against the Apache Drill GitHub project. The details are beyond the scope of this book, but they are pretty standard for Apache. Ask on the Drill “dev” mailing list for details.
If your plug-in will remain private, you can keep it in Drill by maintaining your own private fork of the Drill repo. You’ll want to “rebase” your branch from time to time with the latest Drill revisions.
You should have created your plug-in in a branch from the master branch within the Drill Git repository. This example used the branch regex-plugin
. Assuming that your only changes are the plug-in code that you’ve maintained in your own branch, you can keep your branch up to date as follows:
Use git checkout master to check out the master branch.
Use git pull upstream master to pull down the latest changes to master (assuming that you’ve used the default name “upstream” for the Apache Drill repo).
Use git checkout regex-plugin to switch back to your branch.
Use git rebase master to rebase your code on top of the latest master.
Rebuild Drill.
Another, perhaps simpler, option for a private plug-in is to move the code to its own project.
Create a standard Maven project, as described earlier.
The pom.xml file contains the bare minimum (this project has no external dependencies):
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache.drill.contrib</groupId>
    <artifactId>drill-contrib-parent</artifactId>
    <version>1.15.0-SNAPSHOT</version>
  </parent>

  <artifactId>drill-format-regex</artifactId>
  <name>contrib/regex-format-plugin</name>

  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
Replace the version number in this code example with the version of Drill that you want to use.
Then, you need to instruct Drill that this is a plug-in by adding an (empty) drill-module.conf file. If you wanted to add any boot-time options, you would add them here, but this plug-in has none. See the other format plug-in projects for more sophisticated examples.
This chapter discussed how to create a format plug-in using the Easy framework. Practice creating a plug-in of your own. As you dig deeper, you will find that Drill’s own source code is the best source of examples: review how other format plug-ins work for ideas. (Just be aware that Drill is constantly evolving; some of the older code uses patterns that are now a bit antiquated.)
Drill has a second kind of plug-in: the storage plug-in. Storage plug-ins access data sources beyond the distributed filesystem: HBase, Kafka, and JDBC, for example. Storage plug-ins introduce another large set of mechanisms to implement. Again, look at existing storage plug-ins and ask questions on the dev list to discover what has to be done. The use of the test framework shown here will greatly reduce your edit-compile-debug cycle times so that you can try things incrementally.