As described in Chapter 8, Apache Drill uses storage and format plug-ins to read data. The storage plug-in connects to a storage system such as Kafka, a database, or a distributed filesystem. The DFS interface is based on the HDFS client libraries and can obtain data from HDFS, Amazon S3, MapR, and so on.
A distributed filesystem contains a wide variety of files (Parquet, CSV, JSON, and so on). The dfs storage plug-in uses format plug-ins to read data from these files. In this chapter, we explore how to create custom format plug-ins for file formats that Drill does not yet support.
Format plug-ins integrate tightly with Drill’s internal mechanisms for configuration, memory allocation, column projection, filename resolution, and data representation. Writing plug-ins is therefore an “advanced” task that requires Java experience, patience, frequent consultation of existing code, and posting questions on the “dev” mailing list.
Drill provides two ways to structure your plug-in. Here we focus on the “Easy” format plug-in, useful for most file formats, that handles much of the boilerplate for you. It is also possible to write a plug-in without the Easy framework, but it is unlikely you will need to do so.
As an example, we’re going to create a format plug-in for any text file format that can be described as a regular expression, or regex. The regex defines how to parse columns from an input record and is defined as part of the format configuration. The query can then select all or a subset of the columns. (This plug-in is a simplified version of the one that was added to Drill 1.14.)
You can find the full code for this plug-in in the GitHub repository for this book, in the format-plugin directory.
The plug-in configuration defines the file format using three pieces of information:
The file extension used to identify the file
The regex pattern used to match each line
The list of column names that correspond to the patterns
For example, consider Drill’s own log file, drillbit.log, which contains entries like this:
2017-12-21 10:52:42,045 [main] ERROR o.apache.drill.exec.server.Drillbit - Failure during initial startup of Drillbit.
(In the actual file, all the above text is on one line.) This format does not match any of Drill’s built-in readers, but is typical of the ad hoc format produced by applications. Rather than build a plug-in for each such format, we will build a generalized plug-in using regexes. (Although regular expressions might be slow in production, they save a huge amount of development time for occasional work.)
For Drill’s format we might define our regex as follows:
(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d),\d+ \[([^]]*)] (\w+)\s+(\S+) - (.*)
In the printed book, this example may be wrapped across two lines due to physical size restrictions; however, it must be on a single line in the configuration.
We define our fields as follows:
year, month, day, hour, minute, second, thread, level, module, message
Drill logs often contain multiline messages. To keep things simple, we’ll simply ignore input lines that do not match the pattern.
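Before wiring this into Drill, it helps to see the parsing logic in isolation. The following is a minimal, self-contained sketch (the class name is ours) that applies the regex above to the sample log line, maps each capture group to its configured field name, and skips lines that do not match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexParseSketch {
  // The same pattern as in the format configuration, with Java escaping.
  static final Pattern LOG_PATTERN = Pattern.compile(
      "(\\d{4})-(\\d\\d)-(\\d\\d) (\\d\\d):(\\d\\d):(\\d\\d),\\d+ " +
      "\\[([^]]*)] (\\w+)\\s+(\\S+) - (.*)");

  public static void main(String[] args) {
    String line = "2017-12-21 10:52:42,045 [main] ERROR "
        + "o.apache.drill.exec.server.Drillbit "
        + "- Failure during initial startup of Drillbit.";
    Matcher m = LOG_PATTERN.matcher(line);
    if (!m.matches()) {
      // Non-matching lines (such as multiline continuations) are skipped.
      return;
    }
    // Capture group i corresponds to field i in the configured field list.
    String[] fields = {"year", "month", "day", "hour", "minute",
        "second", "thread", "level", "module", "message"};
    for (int i = 0; i < fields.length; i++) {
      System.out.println(fields[i] + "=" + m.group(i + 1));
    }
  }
}
```

Running this prints one `name=value` pair per field; the real plug-in will do the same mapping, but into value vectors rather than to the console.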
To create a file format plug-in using the Easy framework, you must implement a number of code features:
Plan-time hints
Hints for the filename resolution mechanism
Scan operator creation
Projection resolution, including creating null columns
Writing to value vectors
Managing input files
Managing file chunks
It can be tricky to get all the parts to work correctly. The secret is to find a good existing example and then adapt that for your use. This example is designed to provide most of the basic functionality you will need. We point to other good examples as we proceed.
Drill format plug-ins are difficult to write outside of the Drill source tree. Instead, a format plug-in is generally created as a subproject within Drill itself. Doing so allows rapid development and debugging. Later, we’ll move the plug-in to its own project.
You begin by creating a local copy of the Drill source project, as described in Chapter 10. Then, create a new Git branch for your code:
cd /path/to/drill/source
git checkout -b regex-plugin
The example code on GitHub contains just the regex plug-in code; if you want to follow along without typing in the code, you’ll need to drop this into an existing Drill project to build it.
Iterative development is vital in a system as complex as Drill. Let’s begin with a simple single-field regex pattern so that we can focus on the required plumbing. Our goal is to create a starter plug-in that does the following:
Reads only text columns
Handles both wildcard (SELECT *) and explicit queries (SELECT a, b, c)
Loads data into vectors
This provides a foundation on which to build more advanced features.
Chapter 11 explained how to create a Maven pom.xml file for a UDF. The process for the format plug-in is similar.
Custom plug-ins should reside in the contrib module. Examples include Hive, Kafka, Kudu, and others. Format plug-in names typically start with "format-". This example is called format-regex.
Create a new directory called drill/contrib/format-regex to hold your code. Then, add a pom.xml file in this directory:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.drill.contrib</groupId>
    <artifactId>drill-contrib-parent</artifactId>
    <version>1.15.0-SNAPSHOT</version>
  </parent>
  <artifactId>drill-format-regex</artifactId>
  <name>contrib/regex-format-plugin</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- Test dependencies -->
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
Adjust the Drill version to match the version you are using.
The first two dependencies provide the code you need to implement your format plug-in. The second two provide test classes you will need for unit tests.
Next you must tell Drill to include this plug-in in the Drill build. Add the following to the <modules> section of the pom.xml file in the parent drill/contrib directory:

<module>format-regex</module>
Each Drill module, including this plug-in, creates a Java JAR file. You must tell Drill to copy the JAR file into the final product directory. In drill/distribution/src/assemble/bin.xml, find the following block of code and add the last <include> line shown here:

<dependencySet>
  <includes>
    <include>org.apache.drill.exec:drill-jdbc:jar</include>
    <include>org.apache.drill:drill-protocol:jar</include>
    ...
    <include>org.apache.drill.contrib:format-regex</include>
  </includes>
Change format-regex to the name of your plug-in.
Finally, you must add a dependency in the distribution pom.xml file to force Maven to build your plug-in before the distribution module.
Add the following just before the </dependencies> tag in the pom.xml file:
<dependency>
  <groupId>org.apache.drill.contrib</groupId>
  <artifactId>drill-format-regex</artifactId>
  <version>${project.version}</version>
</dependency>
If you create an Easy file format extension, as shown here, your code must reside in the core java-exec package. (The Easy plug-ins rely on dependencies that are visible only when your code resides within Drill's java-exec package.) For an Easy format plug-in, your code should live in the org.apache.drill.exec.store.easy package. We'll use org.apache.drill.exec.store.easy.regex. You can create that package in your IDE.
Drill uses the Lightbend (originally Typesafe) HOCON configuration system, via the Typesafe library, to express design-time configuration. The configuration is in the src/main/resources/drill-module.conf file in your plug-in directory; in this case, contrib/format-regex/src/main/resources/drill-module.conf:
drill: {
  classpath.scanning: {
    packages += "org.apache.drill.exec.store.easy.regex"
  }
}
Drill terminology can be a bit complex given that the same terms are used for multiple things. So, let's begin by defining three common items:

Format plug-in class
The actual class that implements your format plug-in. (This name is often used for the JSON configuration, as you'll see shortly.)

Format plug-in configuration class
A Jackson serialized class that stores the JSON configuration for a plug-in. (Often just called the "format config.")

Format plug-in configuration
The JSON configuration for a plug-in as well as the Java class that stores that configuration. (Abbreviated to "format configuration" elsewhere in this book.)

The confusion is that "format plug-in" is used for both the class and the JSON configuration. Similarly, the term "format config" is used for both the JSON configuration and the Java class that stores that configuration. To avoid confusion, we will use the terms we just listed.
Before we begin making changes, let’s make sure we don’t break anything important. When you edit storage or format configurations using the Drill Web Console, Drill stores those configurations in ZooKeeper. If you make a mistake while creating the configuration (such as a bug in the code), your existing dfs
configuration will be permanently lost. If you have configurations of value, make a copy of your existing configuration before proceeding!
The easiest way to proceed is to work on your laptop, assuming that anything stored in ZooKeeper is disposable.
Also, it turns out that Drill provides no versioning of format plug-in configurations. Each time you make a breaking change, Drill will fail to start because it will be unable to read any configurations stored in ZooKeeper using the old schema. (We discuss later how to recover.) So, a general rule is that you can change the format configuration schema during development, but after the plug-in is in production, the configuration schema is pretty much frozen for all time.
Let's begin by creating the format plug-in configuration class. A good reference example is TextFormatPlugin.TextFormatConfig.
You must implement a number of items:
This is a Jackson serialized class, so give it a (unique) name using the @JsonTypeName annotation.
Implement the equals() and hashCode() methods.
You need three properties:
The regular expression
A list of properties
The file extension that this plug-in configuration supports
Jackson handles all the boilerplate for you if you choose good names, make the members public, and follow JavaBeans naming conventions. So, go ahead and do so.
The resulting format plug-in configuration class looks like this:
package org.apache.drill.exec.store.easy.regex;

import java.util.Arrays;
import java.util.Objects;

import org.apache.drill.common.logical.FormatPluginConfig;

import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.annotation.JsonInclude.Include;
import com.fasterxml.jackson.annotation.JsonTypeName;

@JsonTypeName("regex")
@JsonInclude(Include.NON_DEFAULT)
public class RegexFormatConfig implements FormatPluginConfig {

  public String regex;
  public String fields;
  public String extension;

  public String getRegex() { return regex; }
  public String getFields() { return fields; }
  public String getExtension() { return extension; }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    final RegexFormatConfig other = (RegexFormatConfig) obj;
    return Objects.equals(regex, other.regex)
        && Objects.equals(fields, other.fields)
        && Objects.equals(extension, other.extension);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(new Object[] {regex, fields, extension});
  }
}
This class might strike you as a bit odd. We would prefer to store the list of fields as a Java list. We'd also prefer to make the class immutable, with private final fields and a constructor that takes all of the field values. (Jackson provides the @JsonProperty annotation to help.)
This regex plug-in is a perfect candidate to use Drill’s table functions, as you’ll see in the tests. However, as of this writing, the table function code has a number of limitations that influence this class:
DRILL-6169 specifies that we can't use Java lists.
DRILL-6672 specifies that we can't use setters (for example, setExtension()).
DRILL-6673 specifies that we can't use a nondefault constructor.
Until these issues are fixed, making your fields public is the only available solution.
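Because the fields property must therefore be a plain comma-delimited String, the reader will later need to split it into individual column names. A minimal sketch of that split (the class and method names here are ours, not Drill's):

```java
import java.util.Arrays;
import java.util.List;

public class FieldListSketch {
  // Split the comma-delimited "fields" property into column names,
  // trimming whitespace around each name.
  public static List<String> parseFields(String fields) {
    return Arrays.asList(fields.trim().split("\\s*,\\s*"));
  }

  public static void main(String[] args) {
    System.out.println(parseFields("year, month, day"));
    // → [year, month, day]
  }
}
```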
To compile your code with Drill, you must include the Apache license agreement at the top of your file: simply copy one from another file. If your plug-in is proprietary, you can modify the Maven pom.xml file to remove the “rat” check for your files.
The code format is the preferred Drill formatting described in Chapter 10.
You now have enough that you can test the configuration, so go ahead and build Drill (assuming that it is located in ~/drill):
cd ~/drill
mvn clean install
For convenience, create an environment variable for the distribution directory:
export DRILL_HOME=~/drill/distribution/target/apache-drill-1.xx.0-SNAPSHOT/apache-drill-1.xx.0-SNAPSHOT
Replace xx with the current release number, such as 14.
Start Drill:
cd $DRILL_HOME/bin
Then:
./drillbit.sh start
Connect to the Web Console using your browser:
http://localhost:8047
Click the Storage tab and edit the dfs plug-in. You need to add a sample configuration for the plug-in that includes just the date fields, to keep things simple to start. Add the following to the end of the file, just before the closing bracket:
"sample": {
  "type": "regex",
  "regex": "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*",
  "fields": "year, month, day",
  "extension": "samp"
}
The quoted name is yours to choose. In this example, we define a format called "sample" with a pattern, three columns, and a .samp file extension. The value of the type key must match that defined in the @JsonTypeName annotation of our format class. The other three keys provide values compatible with our format properties.
Save the configuration. If Drill reports “success” and displays the configuration with your changes, congratulations, everything works so far!
We noted earlier that bugs in your code can corrupt Drill's state in ZooKeeper. If you experience this problem, you can stop Drill and clear the ZooKeeper state, assuming that you have ZooKeeper installed in $ZK_HOME:

$ $DRILL_HOME/bin/drillbit.sh stop
$ $ZK_HOME/bin/zkCli.sh -server localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper, drill, hbase]
[zk: localhost:2181(CONNECTED) 1] ls /drill
[running, sys.storage_plugins, drillbits1]
[zk: localhost:2181(CONNECTED) 2] rmr /drill
[zk: localhost:2181(CONNECTED) 3] ls /
[zookeeper, hbase]
[zk: localhost:2181(CONNECTED) 6] quit
$ $DRILL_HOME/bin/drillbit.sh start
Note that if you’re running Drill 1.12 and it won’t start, see DRILL-6064.
Drill plug-in configurations are not versioned. If you change the configuration class and ship your code to a user that has a JSON configuration for an earlier version, Drill might fail to start, displaying a very cryptic error. To fix the error, you must manually remove the old configuration from ZooKeeper before restarting Drill. (On a secure system, this can be quite difficult, so it is best to do your testing on your laptop without enabling Drill security.)
As noted earlier, the general rule for production systems is to never change the plug-in configuration class after you use it in production. This is an area where we want to get it right the first time!
If things don't work, look at the $DRILL_HOME/log/drillbit.log file. For example, suppose that you mistakenly named your get method for extensions as getFileSuffix():

org.apache.drill.common.exceptions.DrillRuntimeException:
  unable to deserialize value at dfs
...
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
  Unrecognized field "fileSuffix"
  (class org.apache.drill.exec.store.easy.regex.RegexFormatConfig),
  not marked as ignorable
  (2 known properties: "columnCount", "extension")
With the configuration class working, you can now create the format plug-in class itself. First you create the basic shell and then you add the required methods, one by one.
To begin, create a class that extends EasyFormatPlugin, using your configuration class as a type parameter:

package org.apache.drill.exec.store.easy.regex;
...
public class RegexFormatPlugin extends EasyFormatPlugin<RegexFormatConfig> {
  ...
}
Add a default name and a field to store the plug-in configuration, and then add a constructor to pass configuration information to the base class:
public static final String DEFAULT_NAME = "regex";

private final RegexFormatConfig formatConfig;

public RegexFormatPlugin(String name, DrillbitContext context,
    Configuration fsConf, StoragePluginConfig storageConfig,
    RegexFormatConfig formatConfig) {
  super(name, context, fsConf, storageConfig, formatConfig,
      true,  // readable
      false, // writable
      false, // blockSplittable
      true,  // compressible
      Lists.newArrayList(formatConfig.extension),
      DEFAULT_NAME);
  this.formatConfig = formatConfig;
}
The constructor accomplishes a number of tasks:
Accepts the plug-in configuration name that you set previously in the Drill Web Console.
Accepts a number of configuration objects that give you access to Drill internals and to the filesystem.
Accepts an instance of your deserialized format plug-in configuration.
Passes to the parent class constructor a number of properties that define the behavior of your plug-in. (It is the ability to specify these options in the constructor that, in large part, makes this an Easy plug-in.)
Defines the file extensions to be associated with this plug-in. In this case, you get the extension from the format plug-in configuration so the user can change it.
The gist of the code is that your plug-in can read but not write files. The files are not block-splittable. (Adding this ability would be simple: more on this later.) You can compress the file as a .zip or tar.gz file. You take the extension from the configuration, and the default name from a constant.
Next, you can provide default implementations for several methods. The first says that your plug-in will support (projection) push-down:
@Override
public boolean supportsPushDown() { return true; }
A reader need not support projection push-down. But without such support, the reader will load all of the file’s data into memory, only to have the Drill Project operator throw much of that data away. Since doing so is inefficient, this example will show how to implement projection push-down.
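Conceptually, projection push-down means the reader itself keeps only the requested columns instead of materializing everything. A plain-Java sketch of the idea (not Drill's API; the names here are ours):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProjectionSketch {
  // Given all columns parsed from one record and the projected column
  // names, keep only the projected columns. Columns the file lacks
  // come back as null, mirroring Drill's null-column behavior.
  public static Map<String, String> project(Map<String, String> row,
      List<String> projected) {
    Map<String, String> out = new LinkedHashMap<>();
    for (String col : projected) {
      out.put(col, row.get(col)); // missing columns become null
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> row = new LinkedHashMap<>();
    row.put("year", "2017");
    row.put("month", "12");
    row.put("day", "21");
    System.out.println(project(row, Arrays.asList("year", "day")));
    // → {year=2017, day=21}
  }
}
```

The real reader does this mapping once at setup time (from input positions to output vectors), not per row, but the selection logic is the same.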
The next two are stubs because the plug-in does not support writing:
@Override
public RecordWriter getRecordWriter(FragmentContext context,
    EasyWriter writer) throws IOException {
  return null;
}

@Override
public int getWriterOperatorType() { return 0; }
Eventually, you can register your plug-in with the Drill operator registry so that it will appear nicely in query profiles. For now, just leave it as a stub:
@Override
public int getReaderOperatorType() { return 0; }
Finally, we need to create the actual record reader. Leave this as a stub for now:
@Override
public RecordReader getRecordReader(FragmentContext context,
    DrillFileSystem dfs, FileWork fileWork,
    List<SchemaPath> columns, String userName)
    throws ExecutionSetupException {
  // TODO Write me!
  return null;
}
Before we dive into the reader, let’s use a unit test and the debugger to verify that your configuration is, in fact, being passed to the plug-in constructor. You’ll use this test many times as you continue development.
First, you need to create a sample input file that you’ll place in Drill’s resource folder, /drill-java-exec/src/test/resources/regex/simple.samp. Getting sample data is easy: just grab a few lines from your drillbit.log file:
2017-12-19 10:52:41,820 [main] INFO  o.a.d.e.e.f.FunctionImplementationRegist...
2017-12-19 10:52:37,652 [main] INFO  o.a.drill.common.config.DrillConfig - Con...
Base Configuration:
  - jar:file:/Users/paulrogers/git/drill/distribution/target/apache-drill...
2017-12-19 11:12:27,278 [main] ERROR o.apache.drill.exec.server.Drillbit - ...
We've included two different message types (INFO and ERROR), along with a multiline message. (Our code will ignore the nonmessage lines.)
We’ve added a new file type to our project, .samp. Drill uses Apache RAT to check that every file has a copyright header. Because we’ve added a data file, we don’t want to include the header (for some files, the header cannot be provided). Configure RAT to ignore this file type by adding the following lines to the contrib/format-regex/pom.xml file:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.rat</groupId>
      <artifactId>apache-rat-plugin</artifactId>
      <inherited>true</inherited>
      <configuration>
        <excludes>
          <exclude>**/*.samp</exclude>
        </excludes>
      </configuration>
    </plugin>
  </plugins>
</build>
The first test you did earlier required that you run a full five-minute Drill build, then start the Drill server, and then connect using the Web Console. For that first test, this was fine because it is the most convenient way to test the format configuration. But for the remaining tasks, you’ll want to have much faster edit-compile-debug cycles.
You can do this by running Drill in your IDE using Drill's test fixtures. See the developer documentation for details. You can also find more information about the test fixtures in the class org.apache.drill.test.ExampleTest.
Here is how to create an ad hoc test program that starts the server, including the web server, and listens for connections:
In the java-exec/src/test directory, create the org.apache.drill.exec.store.easy.regex package.
Create an “ad hoc” test program that launches the server with selected options.
Run your test program from your IDE.
The following section shows you how to build the most basic test program.
You want your test to be fast and self-contained. This is such a common pattern that a class exists to help you out: ClusterTest.

Start by creating a test that derives from ClusterTest:
public class TestRegexReader extends ClusterTest {

  @ClassRule
  public static final BaseDirTestWatcher dirTestWatcher =
      new BaseDirTestWatcher();

  @BeforeClass
  public static void setup() throws Exception {
    ClusterTest.startCluster(ClusterFixture.builder(dirTestWatcher));

    // Define a regex format config for testing.
    defineRegexPlugin();
  }
}
This is mostly boilerplate except for the defineRegexPlugin() method. As it turns out, annoyingly, Drill provides no SQL syntax or API for defining plug-in configuration at runtime. Either you use the Web Console, or you must create your own, which is what we show here:
@SuppressWarnings("resource")
private static void defineRegexPlugin() throws ExecutionSetupException {

  // Create an instance of the regex config.
  RegexFormatConfig config = new RegexFormatConfig();
  config.extension = "samp";
  config.regex = "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*";
  config.fields = "year, month, day";

  // Define a temporary format plug-in for the "cp" storage plug-in.
  Drillbit drillbit = cluster.drillbit();
  final StoragePluginRegistry pluginRegistry =
      drillbit.getContext().getStorage();
  final FileSystemPlugin plugin =
      (FileSystemPlugin) pluginRegistry.getPlugin("cp");
  final FileSystemConfig pluginConfig =
      (FileSystemConfig) plugin.getConfig();
  pluginConfig.getFormats().put("sample", config);
  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
}
The first block of code simply uses the format configuration class to define a simple test format: just the first three fields of a Drill log.
The second block is "black magic": it retrieves the existing classpath (cp) plug-in, retrieves the configuration object for that plug-in, adds your format plug-in to that storage configuration, and then redefines the cp storage plug-in with the new configuration. (The Drill test framework has methods to set up a plug-in, but only in the context of a workspace, and workspaces are not supported for the classpath plug-in.) In any event, with the preceding code, you can configure a test version of your plug-in without having to use the Web Console.
Next, define the simplest possible test—just run a query:
@Test
public void testWildcard() {
  String sql = "SELECT * FROM cp.`regex/simple.samp`";
  client.queryBuilder().sql(sql).printCsv();
}
Of course, this test won’t work yet: you haven’t implemented the reader. But you can at least test that Drill is able to find and instantiate your plug-in.
Set a breakpoint in the constructor of the plug-in and then run the test in the debugger. When the debugger hits the breakpoint, inspect the contents of the format plug-in provided to the constructor. If everything looks good, congratulations, another step completed! Go ahead and stop the debugger because we have no reader implemented.
If something goes wrong, it helps to know where to look for problems. Drill uses the following mappings to find your plug-in:
The name "sample" was used to register the plug-in configuration with the storage plug-in in defineRegexPlugin().
The Drill class FormatCreator scans the classpath for all classes that derive from FormatPlugin (which yours does via EasyFormatPlugin).
FormatCreator scans the constructors for each plug-in class looking for those of the expected format. (If you had added or removed arguments, yours would not be found and the plug-in would be ignored.)
To find the plug-in class for a format configuration, Drill searches the constructors of the format plug-ins to find one in which the type of the fifth argument matches the class of the format configuration registered in step 1.
Drill invokes the matching constructor to create an instance of your format plug-in class.
The format plug-in creates the reader that reads data for your format plug-in.
If things go wrong, step through the code in FormatCreator to determine where it went off the rails.
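The constructor-matching step above can be illustrated in plain Java. This is only a sketch of the mechanism (the classes and method here are ours, not FormatCreator's actual code): find a public constructor whose fifth parameter type equals the configuration class.

```java
import java.lang.reflect.Constructor;

public class ConstructorMatchSketch {
  static class MyConfig { }

  static class MyPlugin {
    // Fifth argument is the format config type, as the matcher expects.
    public MyPlugin(String name, Object context, Object fsConf,
        Object storageConfig, MyConfig formatConfig) { }
  }

  // Search a plug-in class for a five-argument constructor whose fifth
  // parameter matches the given config class; null if none matches.
  public static Constructor<?> findMatch(Class<?> pluginClass,
      Class<?> configClass) {
    for (Constructor<?> ctor : pluginClass.getConstructors()) {
      Class<?>[] params = ctor.getParameterTypes();
      if (params.length == 5 && params[4] == configClass) {
        return ctor;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(findMatch(MyPlugin.class, MyConfig.class) != null);
    // → true
  }
}
```

This also shows why adding or removing a constructor argument silently breaks discovery: the lookup fails and returns nothing rather than raising an error.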
With the boilerplate taken care of and with a working test environment, we are now ready to tackle the heart of the problem: the record reader. In Drill, the record reader is responsible for a number of tasks:
Defining value vectors for each column
Populating value vectors for each batch of records
Performing projection push-down (mapping from input to query columns)
Translating errors into Drill format
Each of these tasks can be complex when you consider all the nuances and use cases. Let’s work on them step by step, discussing the theory as we go.
In Drill, a single query can scan multiple files (or multiple blocks of a single large file). As we’ve discussed earlier in this book, Drill divides queries into major fragments, one of which will perform the scan operation. The scan’s major fragment is distributed across a set of minor fragments, typically running multiple instances on each node in the Drill cluster. If the query scans a few files, each scan operator might read just one file. But if the query touches many files relative to the number of scan minor fragments, each scan operator will read multiple files.
To read about fragments in more detail, see Chapter 3.
The scan operator orchestrates the scan operation, but delegates actual reading to a record reader. There is one scan operator per minor fragment, but possibly many record readers for each scan operator. In particular, Drill creates one record reader for each file. Many files are splittable. In this case, Drill creates one record reader per file block.
Because Drill reads “big data,” files are often large. Blocks are often 256 MB or 512 MB. Drill further divides each block into a series of batches: collections of records that fit comfortably into memory. Each batch is composed of a collection of value vectors, one per column.
So, your job in creating a reader is to create a class that reads data from a single file (or block) in the context of a scan operator that might read multiple files, and to read the data as one or more batches, filling the value vectors as you read each record.
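The batch structure can be pictured with a small stand-alone sketch (illustrative only; Drill batches are collections of value vectors, not Java lists): partition a stream of records into fixed-size batches, the way each next() call fills one batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSketch {
  // Partition records into fixed-size batches, mimicking how each
  // next() call on a reader fills one batch of rows at a time.
  public static List<List<String>> toBatches(List<String> records,
      int batchSize) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < records.size(); i += batchSize) {
      batches.add(records.subList(i,
          Math.min(i + batchSize, records.size())));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
    System.out.println(toBatches(records, 2).size()); // → 3
  }
}
```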
Let's begin by creating the RegexRecordReader class:
public class RegexRecordReader extends AbstractRecordReader {

  private final DrillFileSystem dfs;
  private final FileWork fileWork;
  private final RegexFormatConfig formatConfig;

  public RegexRecordReader(FragmentContext context,
      DrillFileSystem dfs, FileWork fileWork,
      List<SchemaPath> columns, String userName,
      RegexFormatConfig formatConfig) {
    this.dfs = dfs;
    this.fileWork = fileWork;
    this.formatConfig = formatConfig;

    // Ask the superclass to parse the projection list.
    setColumns(columns);
  }

  @Override
  public void setup(OperatorContext context, OutputMutator output)
      throws ExecutionSetupException { }

  @Override
  public int next() { return 0; } // Stub for now

  @Override
  public void close() throws Exception { }
}
The RegexRecordReader interface is pretty simple:

Constructor
Provides the five parameters that describe the file scan, plus the configuration of our regex format

setup()
Called to open the underlying file and start reading

next()
Called to read each batch of rows from the data source

close()
Called to release resources

The AbstractRecordReader class provides a few helper functions that you'll use later.
If you create a format plug-in without the Easy framework, or if you create a storage plug-in, you are responsible for creating the scan operator, the readers, and quite a bit of other boilerplate.
However, we are using the Easy framework, which does all of this for you (hence the name). The Easy framework creates the record readers at the time that the scan operator is created. Because of this, you want to avoid doing operations in the constructor that consume resources (such as memory allocation or opening a file). Save those for the setup() call.
Let’s consider the constructor parameters:
FragmentContext
Provides information about the (minor) fragment along with information about the Drill server itself. You most likely don't need this information.

DrillFileSystem
Drill's version of the HDFS FileSystem class. Use this to open the input file.

FileWork
Provides the filename along with the block offset and length (for block-aware files).

List<SchemaPath>
The set of columns that the query requested from the data source.

String userName
The name of the user running the query, for use in some security models.

RegexFormatConfig
The regex format configuration that we created earlier.
You must also instruct Drill how to instantiate your reader by implementing the following method in your RegexFormatPlugin class:
@Override
public RecordReader getRecordReader(FragmentContext context,
    DrillFileSystem dfs, FileWork fileWork,
    List<SchemaPath> columns, String userName)
    throws ExecutionSetupException {
  return new RegexRecordReader(context, dfs, fileWork,
      columns, userName, formatConfig);
}
As it turns out, you've now created enough structure that you can successfully run a query; however, it will produce no results. Rerun the previous SELECT * query using the test created earlier. Set breakpoints in the getRecordReader() method and examine the parameters to become familiar with their structure. (Not all Drill classes provide Javadoc comments, so this is a good alternative.) Then, step into your constructor to see how the columns are parsed.
If all goes well, the console should display something like this:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader#testWildcard
Total rows returned: 0. Returned in 1804ms.
You are now ready to build the actual reader.
Drill’s error handling rules are a bit vague, but the following should stand you in good stead:
If a failure occurs during construction or setup, log the error and throw an ExecutionSetupException.
If a failure occurs elsewhere, throw a UserException, which will automatically log the error.
In both cases, the error message you provide will be sent back to the user running a query, so try to provide a clear message.
The SQLLine program will show your error message to the user. However, at present, the Web Console displays only a cryptic message, losing the message text. Because of this, you should use SQLLine if you want to test error messages. You can also check the query profile in the Web Console.
Drill’s UserException class wraps exceptions that are meant to be returned to the user. The class uses a fluent builder syntax and requires you to identify the type of error. Here are two of the most common:
dataReadError()
Indicates that something went wrong with opening or reading from a file.
validationError()
Indicates that something went wrong when validating a query. Because Drill does column validation at runtime, you can throw this if the projected columns are not valid for your data source.
See the UserException
class for others. You can also search the source code to see how each error is used in practice.
The UserException
class allows you to provide additional context information such as the file or column that triggered the error and other information to help the user understand the issue.
One odd aspect of UserException is that it is unchecked: you do not have to declare it in the method signature, and so you can throw it from anywhere. A good rule of thumb is that if the error might be due to a user action, the environment, or faulty configuration, throw a UserException. Be sure to include the filename and line number, if applicable. If the error is likely due to a code flaw (some invariant is invalid, for example), throw an unchecked Java exception such as IllegalStateException, IllegalArgumentException, or UnsupportedOperationException. Use good unit tests to ensure that these “something is wrong in the code” exceptions are never thrown when the code is in production.
Because the UserException
allows us to provide additional context, we use it in this example, even in the setup stage.
Because the purpose of this exercise is to illustrate Drill, the example uses the simplest possible regex parsing algorithm: just the Java Pattern
and Matcher
classes. (Production Drill readers tend to go to extremes to optimize the per-record path—which we leave as an exercise for you—and so the simple approach here would need optimization for production code.) Remember that we decided to simply ignore lines that don’t match the pattern:
private Pattern pattern;

private void setupPattern() {
  try {
    pattern = Pattern.compile(formatConfig.getRegex());
  } catch (PatternSyntaxException e) {
    throw UserException
        .validationError(e)
        .message("Failed to parse regex: \"%s\"", formatConfig.getRegex())
        .build(logger);
  }
}
As is typical in a system as complex as Drill, error handling can consume a large fraction of your code. Because the user supplies the regex (as part of the plug-in configuration), we raise a UserException
if the regex has errors, passing the original exception and an error message. The build()
method automatically logs the error into Drill’s log file to help with problem diagnosis.
Here is the simplest possible code to parse the list of columns:
private List<String> columnNames;

private void setupColumns() {
  String fieldStr = formatConfig.getFields();
  columnNames = Splitter.on(Pattern.compile("\\s*,\\s*"))
      .splitToList(fieldStr);
}
See the complete code in GitHub for the full set of error checks required. They are omitted here because they don’t shed much light on Drill itself.
Projection is the process of picking some columns from the input, but not others. (In SQL, the projection list confusingly follows the SELECT
keyword.) As explained earlier, the simplest reader just reads all columns, after which Drill will discard those that are not projected. Clearly this is inefficient; instead, each reader should do the projection itself. This is called projection push-down (the projection is pushed down into the reader).
We instructed Drill that our format plug-in supports projection push-down with the following method in the RegexFormatPlugin
class:
@Override
public boolean supportsPushDown() {
  return true;
}
To implement projection, you must handle three cases:
An empty project list: SELECT COUNT(*)
A wildcard query: SELECT *
An explicit list: SELECT a, b, c
The base AbstractRecordReader
class does some of the work for you when you call setColumns()
from the constructor:
// Ask the superclass to parse the projection list.
setColumns(columns);
Here is the top-level method that identifies the cases:
private void setupProjection() {
  if (isSkipQuery()) {
    projectNone();
  } else if (isStarQuery()) {
    projectAll();
  } else {
    projectSubset();
  }
}
The isSkipQuery() and isStarQuery() methods are provided by the superclass as a result of calling setColumns() in the example prior to this one.
Because the regex plug-in parses text files, we can assume that all of the columns will be nullable VARCHAR
. You will need the column name later to create the value vector, which means that you need a data structure to keep track of this information:
private static class ColumnDefn {
  private final String name;
  private final int index;
  private NullableVarCharVector.Mutator mutator;

  public ColumnDefn(String name, int index) {
    this.name = name;
    this.index = index;
  }
}
The Mutator
class is the mechanism Drill provides to write values into a value vector.
When given a wildcard (SELECT *) query, SQL semantics specify that you should do the following:
Include all columns from the data source.
Use the names defined in the data source.
Use them in the order in which they appear in the data source.
In this case, the “data source” is the set of columns defined in the plug-in configuration:
private void projectAll() {
  columns = new ColumnDefn[groupCount];
  for (int i = 0; i < columns.length; i++) {
    columns[i] = new ColumnDefn(columnNames.get(i), i);
  }
}
The final case occurs when the user requests a specific set of columns; for example, SELECT a, b, c
. Because Drill is schema-free, it cannot check at plan time which projected columns exist in the data source. Instead, that work is done at read time. The result is that Drill allows the user to project any column, even one that does not exist:
A requested column matches one defined by the configuration, so we project that column to the output batch.
A requested column does not match one defined by the configuration, so we must project a null column.
By convention, Drill creates a nullable (OPTIONAL
) INT
column for missing columns. In our plug-in, we only ever create VARCHAR
values, so nullable INT
can never be right. Instead, we create nullable VARCHAR
columns for missing columns.
To keep the code simple, we will use nullable VARCHAR
columns even for those columns that are available. (You might want to modify the example to use non-nullable [REQUIRED
] columns, instead.)
As an aside, the typical Drill way to handle other data types is to read a text file (such as CSV, or, here, a log file) as VARCHAR
, then perform conversions (CAST
s) to other types within the query itself (or in a view). You could add functionality to allow the format configuration to specify the type, then do the conversion in your format plug-in code. In fact, this is exactly what the regex plug-in in Drill does. Review that code for the details.
Following are the rules for creating the output projection list:
Include in the output rows only the columns from the project list (including missing columns).
Use the names provided in the project list (which might differ in case from those in the data source).
Include columns in the order in which they are defined in the project list.
Drill is case-insensitive, so you must use case-insensitive name mapping. But Drill labels columns using the case provided in the projection list, so for the project-some case, you want to use the name as specified in the project list.
For each projected name, you look up the name in the list of columns and record either the column (pattern) index, or –1 if the column is not found. (For simplicity we use a linear search; production code might use a hash map.)
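As noted, production code might replace the linear search with a map built once at setup time. A minimal sketch of such a case-insensitive lookup, using only plain Java collections (the ColumnLookupSketch class and its method names are illustrative, not part of Drill's API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnLookupSketch {
  private final Map<String, Integer> patternIndexes = new HashMap<>();

  // Build the map once, keyed on the lowercased configured names.
  public ColumnLookupSketch(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      patternIndexes.put(columnNames.get(i).toLowerCase(), i);
    }
  }

  // Case-insensitive lookup; -1 means "not found," the same cue
  // used in the reader to fill the column with nulls.
  public int patternIndex(String projectedName) {
    Integer index = patternIndexes.get(projectedName.toLowerCase());
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    ColumnLookupSketch lookup =
        new ColumnLookupSketch(List.of("year", "month", "day"));
    System.out.println(lookup.patternIndex("Year"));    // prints 0
    System.out.println(lookup.patternIndex("missing")); // prints -1
  }
}
```

A lookup for "Year" returns the same index as "year", while a name that was never configured returns -1.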
The implementation handles all these details:
private void projectSubset() {

  // Ensure the projected columns are only simple columns;
  // no maps, no arrays.

  Collection<SchemaPath> project = this.getColumns();
  assert !project.isEmpty();
  columns = new ColumnDefn[project.size()];
  int colIndex = 0;
  for (SchemaPath column : project) {
    if (column.getAsNamePart().hasChild()) {
      throw UserException
          .validationError()
          .message("The regex format plugin supports only"
              + " simple columns")
          .addContext("Projected column", column.toString())
          .build(logger);
    }

    // Find a matching defined column, case-insensitive match.

    String name = column.getAsNamePart().getName();
    int patternIndex = -1;
    for (int i = 0; i < columnNames.size(); i++) {
      if (columnNames.get(i).equalsIgnoreCase(name)) {
        patternIndex = i;
        break;
      }
    }

    // Create the column. Index of -1 means column will be null.

    columns[colIndex++] = new ColumnDefn(name, patternIndex);
  }
}
The cryptic check for hasChild()
catches subtle errors. Drill allows two special kinds of columns to appear in the project list: arrays (columns[0]
) and maps (a.b
). Because our plug-in handles only simple columns, we reject requests for nonsimple columns.
Note what happens if a requested column does not match a column from that provided by the plug-in configuration: the patternIndex
ends up as -1
. We use that as our cue later to fill that column with null values.
Drill uses the DrillFileSystem
class, which is a wrapper around the HDFS FileSystem
class, to work with files. Here, we are concerned only with opening a file as an input stream, which is then, for convenience, wrapped in a BufferedReader
:
private void openFile() {
  InputStream in;
  try {
    in = dfs.open(new Path(fileWork.getPath()));
  } catch (Exception e) {
    throw UserException
        .dataReadError(e)
        .message("Failed to open input file: %s", fileWork.getPath())
        .addContext("User name", userName)
        .build(logger);
  }
  reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8));
}
Here we see the use of the dataReadError
form of the UserException
, along with a method to add the username as context (in case the problem is related to permissions).
Notice that, after this call, we are holding a resource that we must free in close()
, even if something fails.
Many query systems (such as MapReduce and Hive) are row-based: the record reader reads one row at a time. Drill, being columnar, works with groups of records called record batches. Each reader provides a batch of records on each call to next()
. Drill has no standard for the number of records: some readers return 1,000, some 4,096; others return 65,536 (the maximum). For our plug-in, let’s go with the standard size set in Drill’s internals (4,096):
private static final int BATCH_SIZE = BaseValueVector.INITIAL_VALUE_ALLOCATION;
The best batch size depends strongly on the size of each record. At present, Drill simply assumes records are the proper size. For various reasons, it is best to choose a number that keeps the memory required per batch below a few megabytes in size.
Let’s see how our size of 4,096 works out. We are scanning Drill log lines, which tend to be less than 100 characters long. 4,096 records * 100 characters/record = 409,600 bytes, which is fine.
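The same back-of-the-envelope arithmetic can be expressed as a sketch (the 100-byte average record size is the assumption made above for Drill log lines, not a measured value):

```java
public class BatchSizeEstimate {

  // Rough memory needed for the VARCHAR data in one batch:
  // record count times average bytes per record.
  static long estimateBatchBytes(int batchSize, int avgRecordBytes) {
    return (long) batchSize * avgRecordBytes;
  }

  public static void main(String[] args) {
    // 4,096 records * 100 bytes/record = 409,600 bytes,
    // well under the few-megabyte guideline.
    System.out.println(estimateBatchBytes(4096, 100)); // prints 409600
  }
}
```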
As a columnar execution engine, Drill stores data per column in a structure called a value vector. Each vector stores values for a single column, one after another. To hold the complete set of columns that make up a row, we create multiple value vectors.
Drill defines a separate value vector class for each data type. Within each data type, there are also separate classes for each cardinality: non-nullable (called REQUIRED
in Drill code), nullable (called OPTIONAL
) and repeated.
We do not write to vectors directly. Instead, we write through a helper class called a (vector) Mutator
. (There is also an Accessor
to read values.) Just as each combination of (type, cardinality) has a separate class, each also has a separate Mutator
and Accessor
class (defined as part of the vector class).
To keep the example code simple, we use a single type and single cardinality: OPTIONAL VARCHAR
. This makes sense: our regex pattern can only pull out strings, it does not have sufficient information to determine column types. However, if you are reading from a system that maps to other Drill types (INT
, DOUBLE
, etc.), you must deal with multiple vector classes, each needing type-specific code.
Drill 1.14 contains a new RowSet
mechanism to greatly simplify the kind of work we explain here. At the time of writing, the code was not quite ready for full use, so we explain the current techniques. Watch the “dev” list for when the new mechanism becomes available.
With that background, it is now time to define value vectors. Drill provides the OutputMutator
class to handle many of the details. Because we need only the mutator, and we need to use it for each column, let’s store it in our column definition class for easy access:
private static class ColumnDefn {
  ...
  private NullableVarCharVector.Mutator mutator;
}
To create a vector, we just need to create metadata for each column (in the form of a MaterializedField
) and then ask the output mutator to create the vector:
private void defineVectors(OutputMutator output) {
  for (int i = 0; i < columns.length; i++) {
    MaterializedField field = MaterializedField.create(columns[i].name,
        Types.optional(MinorType.VARCHAR));
    try {
      columns[i].mutator = output.addField(field,
          NullableVarCharVector.class).getMutator();
    } catch (SchemaChangeException e) {
      throw UserException
          .systemError(e)
          .message("Vector creation failed")
          .build(logger);
    }
  }
}
The code uses two convenience methods to define metadata: MaterializedField.create() and Types.optional().
The SchemaChangeException
thrown by addField()
seems odd: we are creating a vector, how could the schema change? As it turns out, Drill requires that the reader provide exactly the same value vector in each batch. In fact, if a scan operator reads multiple files, all readers must share the same vectors. If reader #2 asks to create column c
as a nullable VARCHAR
, but reader #1 has already created c
as a REQUIRED INT
, for example, the exception will be thrown. In this case, all readers use the same type, so the error should never actually occur.
With the setup completed, we are ready to actually read some data:
@Override
public int next() {
  rowIndex = 0;
  while (nextLine()) {
  }
  return rowIndex;
}
Here we read rows until we fill the batch. Because we skip some rows, we can’t simply use a for
loop: we want to count only matching lines (because those are the only ones loaded into Drill).
The work to do the pattern matching is straightforward:
private boolean nextLine() {
  String line;
  try {
    line = reader.readLine();
  } catch (IOException e) {
    throw UserException
        .dataReadError(e)
        .addContext("File", fileWork.getPath())
        .build(logger);
  }
  if (line == null) {
    return false;
  }
  Matcher m = pattern.matcher(line);
  if (m.matches()) {
    loadVectors(m);
  }
  return rowIndex < BATCH_SIZE;
}
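Stripped of Drill's vector machinery, the combined logic of next() and nextLine() amounts to the following self-contained sketch (the BatchFillSketch class and its names are illustrative only; a List of String arrays stands in for the value vectors):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BatchFillSketch {
  static final int BATCH_SIZE = 4096;

  // Fill one "batch" from the input lines: only matching lines
  // consume a row; non-matching lines are read and discarded.
  static int fillBatch(List<String> lines, Pattern pattern,
      List<String[]> batch) {
    int rowIndex = 0;
    for (String line : lines) {
      Matcher m = pattern.matcher(line);
      if (!m.matches()) {
        continue; // skipped lines do not count toward the batch
      }
      String[] row = new String[m.groupCount()];
      for (int g = 0; g < m.groupCount(); g++) {
        row[g] = m.group(g + 1); // regex groups are 1-based
      }
      batch.add(row);
      if (++rowIndex >= BATCH_SIZE) {
        break; // batch is full
      }
    }
    return rowIndex;
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*");
    List<String[]> batch = new ArrayList<>();
    int n = fillBatch(
        List.of("2017-12-17 10:52:41 ok", "garbage", "2017-12-18 10:52:37 ok"),
        p, batch);
    System.out.println(n); // prints 2: the "garbage" line was skipped
  }
}
```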
The next task is to load data into the vectors. The nullable VARCHAR
Mutator
class lets you set a value to null, or set a non-null value. You must tell it the row to write—the Mutator
, oddly, does not keep track of this itself. In our case, if the column is “missing,” we set it to null. If the pattern itself is null (the pattern is optional and was not found), we also set the value to null. Only if we have an actual match do we copy the value (as a string) into the vector (as an array of bytes). Recall that regex groups are 1-based:
private void loadVectors(Matcher m) {

  // Core work: write values into vectors for the current
  // row. If projected by name, some columns may be null.

  for (int i = 0; i < columns.length; i++) {
    NullableVarCharVector.Mutator mutator = columns[i].mutator;
    if (columns[i].index == -1) {
      // Not necessary; included just for clarity
      mutator.setNull(rowIndex);
    } else {
      String value = m.group(columns[i].index + 1);
      if (value == null) {
        // Not necessary; included just for clarity
        mutator.setNull(rowIndex);
      } else {
        mutator.set(rowIndex, value.getBytes());
      }
    }
  }
  rowIndex++;
}
In practice, we did not have to set values to null, because null is the default; it was done here just for clarity.
The only remaining task is to implement close()
to release resources. In our case, the only resource is the file reader. Here we use logging to report and then ignore any errors that occur on close:
@Override
public void close() {
  if (reader != null) {
    try {
      reader.close();
    } catch (IOException e) {
      logger.warn("Error when closing file: " + fileWork.getPath(), e);
    }
    reader = null;
  }
}
The final step is to test the reader. Although it can be tempting to build Drill, fire up SQLLine, and throw some queries at your code, you need to resist that temptation and instead do detailed unit testing. Doing so is easy with Drill’s testing tools. Plus, you can use the test cases as a way to quickly rerun code in the unlikely event that the tests uncover some bugs in your code.
You can use your existing test to check the wildcard (SELECT *
) case:
@Test
public void testWildcard() {
  String sql = "SELECT * FROM cp.`regex/simple.samp`";
  client.queryBuilder().sql(sql).printCsv();
}
When you run the test, you should see something like the following:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader#testWildcard
3 row(s):
year<VARCHAR(OPTIONAL)>,month<VARCHAR(OPTIONAL)>,day<VARCHAR(OPTIONAL)>
2017,12,17
2017,12,18
2017,12,19
Total rows returned : 3. Returned in 10221ms.
Congratulations, you have a working format plug-in! But we’re not done yet. This is not much of a test given that it requires that a human review the output. Let’s follow the examples in ExampleTest
and actually validate the output:
@Test
public void testWildcard() throws RpcException {
  String sql = "SELECT * FROM cp.`regex/simple.log1`";
  RowSet results = client.queryBuilder().sql(sql).rowSet();

  BatchSchema expectedSchema = new SchemaBuilder()
      .addNullable("year", MinorType.VARCHAR)
      .addNullable("month", MinorType.VARCHAR)
      .addNullable("day", MinorType.VARCHAR)
      .build();

  RowSet expected = client.rowSetBuilder(expectedSchema)
      .addRow("2017", "12", "17")
      .addRow("2017", "12", "18")
      .addRow("2017", "12", "19")
      .build();

  RowSetUtilities.verify(expected, results);
}
We use three test tools. SchemaBuilder lets us define the schema we expect. The row set builder lets us build a row set (really, just a collection of vectors) that holds the expected values. Finally, the RowSetUtilities.verify() function compares the two row sets: both the schemas and the values. The result is that we can easily create many tests for our plug-in.
We have three kinds of project, but we’ve tested only one. Let’s test explicit projection using this query:
SELECT `day`, `missing`, `month`
FROM cp.`regex/simple.samp`
As an exercise, use the techniques described earlier to test this query. Begin by printing the results as CSV, and then create a validation test. Hint: to specify a null value when building the expected rows, simply pass a Java null. Check the code in GitHub for the answer.
Finally, let’s verify that projection works for a COUNT(*)
query:
SELECT COUNT(*) FROM cp.`regex/simple.log1`
We know that the query returns exactly one row with a single BIGINT
column. We can use a shortcut to validate the results:
@Test
public void testCount() throws RpcException {
  String sql = "SELECT COUNT(*) FROM cp.`regex/simple.log1`";
  long result = client.queryBuilder().sql(sql).singletonLong();
  assertEquals(3, result);
}
We’ve tested with only a simple pattern thus far. Let’s modify the test to add the full set of Drill log columns, using the pattern we identified earlier. Because each file extension is associated with a single format plug-in, we cannot use the "samp"
extension we’ve been using. Instead, let’s use the actual "log"
extension:
private static void defineRegexPlugin() throws ExecutionSetupException {

  // Create an instance of the regex config.
  // Note: we can't use the ".log" extension; the Drill .gitignore
  // file ignores such files, so they'll never get committed.
  // Instead, make up a fake suffix.

  RegexFormatConfig sampleConfig = new RegexFormatConfig();
  sampleConfig.extension = "log1";
  sampleConfig.regex = DATE_ONLY_PATTERN;
  sampleConfig.fields = "year, month, day";

  // Full Drill log parser definition.

  RegexFormatConfig logConfig = new RegexFormatConfig();
  logConfig.extension = "log2";
  logConfig.regex = "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) "
      + "(\\d\\d):(\\d\\d):(\\d\\d),\\d+ "
      + "\\[([^]]*)] (\\w+)\\s+(\\S+) - (.*)";
  logConfig.fields = "year, month, day, hour, "
      + "minute, second, thread, level, module, message";

  // Define a temporary format plug-in for the "cp" storage plug-in.

  Drillbit drillbit = cluster.drillbit();
  final StoragePluginRegistry pluginRegistry =
      drillbit.getContext().getStorage();
  final FileSystemPlugin plugin =
      (FileSystemPlugin) pluginRegistry.getPlugin("cp");
  final FileSystemConfig pluginConfig =
      (FileSystemConfig) plugin.getConfig();
  pluginConfig.getFormats().put("sample", sampleConfig);
  pluginConfig.getFormats().put("drill-log", logConfig);
  pluginRegistry.createOrUpdate("cp", pluginConfig, false);
}
}
@Test
public void testFull() throws RpcException {
  String sql = "SELECT * FROM cp.`regex/simple.log2`";
  client.queryBuilder().sql(sql).printCsv();
}
The output should be like the following:
Running org.apache.drill.exec.store.easy.regex.TestRegexReader...
3 row(s):
year<VARCHAR(OPTIONAL)>,month<VARCHAR(OPTIONAL)>,...
2017,12,17,10,52,41,main,INFO,o.a.d.e.e.f.Function...
2017,12,18,10,52,37,main,INFO,o.a.drill.common.config....
2017,12,19,11,12,27,main,ERROR,o.apache.drill.exec.server....
Total rows returned : 3. Returned in 1751ms.
Success!
You’ve now fully tested the plug-in, and it is ready for actual use. As noted, we took some shortcuts for convenience so that we could go back and do some performance tuning. We’ll leave the performance tuning to you and instead focus on how to put the plug-in into production as well as how to offer the plug-in to the Drill community.
By following the pattern described here, you should be able to create your own simple format plug-in. There are a few additional topics that can be helpful in advanced cases.
The Hadoop model is to store large amounts of data in each file and then read file “chunks” in parallel. For simplicity, our plug-in assumes that files are not “splittable”: that chunks of the file cannot be read in parallel. Depending on the file format, you can add this kind of support. Drill logs are not splittable because a single message can span multiple lines. However, simpler formats (such as CSV) can be splittable if each record corresponds to a single line. To read a split, you scan the file looking for the first record separator (such as a newline in CSV) and then begin parsing with the next line. The text reader (the so-called CompliantTextReader) offers an example.
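To make the idea concrete, here is a self-contained sketch of split reading for a newline-delimited format. This is plain Java, not Drill’s actual CompliantTextReader code; the convention shown (a record belongs to the split in which it starts, and any reader not starting at byte 0 discards the partial record at the front of its range) follows the usual Hadoop practice:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {

  // Read the records for the byte range [start, start + length).
  static List<String> readSplit(byte[] file, long start, long length)
      throws IOException {
    long end = start + length;
    InputStream in = new ByteArrayInputStream(file);
    in.skip(start); // ByteArrayInputStream skips the full amount
    long pos = start;
    List<String> records = new ArrayList<>();

    // A reader that does not start at byte 0 cannot know whether it
    // is mid-record, so it scans forward to the first separator; the
    // previous split's reader consumes that record instead.
    if (start != 0) {
      int c;
      while ((c = in.read()) != -1) {
        pos++;
        if (c == '\n') break;
      }
    }

    // Read whole records while the record *starts* at or before the
    // split end; the last record may extend past the boundary.
    while (pos <= end) {
      StringBuilder record = new StringBuilder();
      boolean sawAny = false;
      int c;
      while ((c = in.read()) != -1) {
        pos++;
        sawAny = true;
        if (c == '\n') break;
        record.append((char) c);
      }
      if (!sawAny) break; // end of file
      records.add(record.toString());
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    byte[] file = "aaa\nbbb\nccc\nddd\n".getBytes();
    // Two 8-byte splits cover the file; together they read every
    // record exactly once, with no overlap.
    System.out.println(readSplit(file, 0, 8)); // prints [aaa, bbb, ccc]
    System.out.println(readSplit(file, 8, 8)); // prints [ddd]
  }
}
```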
Prior sections showed two ways to create a configuration for your plug-in. You used the Drill Web Console to create it interactively. You also wrote code to set it up for tests. This is fine for development, but impractical for production users. If you’re going to share your plug-in with others, you probably want it to be configured by default. (For example, for our plug-in, we might want to offer a configuration for Drill logs “out of the box.”)
When Drill first starts on a brand-new installation, it initializes the initial set of format configurations from the following file: /drill-java-exec/src/main/resources/bootstrap-storage-plugins.json.
Ideally, we’d add this configuration in our regex project. Format plug-in configuration is, however, a detail of the storage plug-in configuration. The configuration for the default dfs
storage plug-in resides in Drill’s java-exec
module, so we must add our configuration there. The ability to add configuration as a file within our format plug-in project would be a nice enhancement, but is not available today.
Drill reads the bootstrap file only once: when Drill connects to ZooKeeper and finds that no configuration information has yet been stored. If you add (or modify) a format plug-in afterward, you must manually create the configuration using Drill’s web console. You can also use the REST API.
First, let’s add our bootstrap configuration. Again, if we were defining a new storage plug-in, we’d define a bootstrap file in our module. But because we are creating an Easy format plug-in, we modify the existing Drill file to add the following section just after the last existing format entry for the dfs
storage configuration:
"drill-log": {
  type: "regex",
  extension: "log1",
  regex: "(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)
      (\\d\\d):(\\d\\d):(\\d\\d),\\d+ \\[([^]]*)]
      (\\w+)\\s+(\\S+) - (.*)",
  fields: "year, month, day, hour, minute, second,
      thread, level, module, message"
}
(Note that this code is formatted for this book; your code must be on a single line for the regex
and fields
items.)
Be sure to add a comma after the "csvh"
section to separate it from the new section.
To test this, delete the ZooKeeper state as described previously. Build Drill with these changes in place, then start Drill and visit the Web Console. Go to the Storage tab and click Update next to the dfs
plug-in. Your format configuration should appear in the formats
section.
With a working format plug-in, you next must choose how to maintain the code. You have three choices:
Contribute the code to the Drill project via a pull request.
Maintain your own private branch of the Drill code that includes your plug-in code.
Move your plug-in code to a separate Maven project which you maintain outside of the Drill project.
Drill is an open source project and encourages contributions. If your plug-in might be useful to others, consider contributing your plug-in code to Drill. That way, your plug-in will be part of future Drill releases. If, however, the plug-in works with a format unique to your project or company, then you may want to keep the code separate from Drill.
The preceding steps showed you how to create your plug-in within the Drill project itself by creating a new module within the contrib module. In this case, it is very easy to move your plug-in into production. Just build all of Drill and replace your current Drill distribution with the new one in the distribution/target/apache-drill-version/apache-drill-version directory. (Yes, apache-drill-version appears twice.) Your plug-in will appear in the jars directory where Drill will pick it up.
If you plan to offer your plug-in to the Drill project, your plug-in code should reside in the contrib project, as in the preceding example. Your code is simply an additional set of Java classes and resource files added to the Drill JARs and Drill distribution. So, to deploy your plug-in, just do a full build of Drill and replace your existing Drill installation with your custom-built one.
If you want to contribute your code to Drill, the first step is to ensure your files contain the Apache copyright header and conform to the Drill coding standards. (If you used the Drill Maven file, it already enforced these rules.)
The next step is to submit a pull request against the Apache Drill GitHub project. The details are beyond the scope of this book, but they are pretty standard for Apache. Ask on the Drill “dev” mailing list for details.
If your plug-in will remain private, you can keep it in Drill by maintaining your own private fork of the Drill repo. You’ll want to “rebase” your branch from time to time with the latest Drill revisions.
You should have created your plug-in in a branch from the master branch within the Drill Git repository. This example used the branch regex-plugin
. Assuming that your only changes are the plug-in code that you’ve maintained in your own branch, you can keep your branch up to date as follows:
Use git checkout master to check out the master branch.
Use git pull upstream master to pull down the latest changes to master (assuming that you’ve used the default name “upstream” for the Apache Drill repo).
Use git checkout regex-plugin to switch back to your branch.
Use git rebase master to rebase your code on top of the latest master.
Rebuild Drill.
Another, perhaps simpler, option for a private plug-in is to move the code to its own project.
Create a standard Maven project, as described earlier.
The pom.xml file contains the bare minimum (this project has no external dependencies):
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache.drill.contrib</groupId>
    <artifactId>drill-contrib-parent</artifactId>
    <version>1.15.0-SNAPSHOT</version>
  </parent>

  <artifactId>drill-format-regex</artifactId>
  <name>contrib/regex-format-plugin</name>

  <dependencies>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.drill.exec</groupId>
      <artifactId>drill-java-exec</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.drill</groupId>
      <artifactId>drill-common</artifactId>
      <version>${project.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
Replace the version number in this code example with the version of Drill that you want to use.
Then, you need to instruct Drill that this is a plug-in by adding an (empty) drill-module.conf file. If you wanted to add any boot-time options, you would add them here, but this plug-in has none. See the other format plug-in projects for more sophisticated examples.
This chapter discussed how to create a format plug-in using the Easy framework. Practice creating a plug-in of your own. As you dig deeper, you will find that Drill’s own source code is the best source of examples: review how other format plug-ins work for ideas. (Just be aware that Drill is constantly evolving; some of the older code uses patterns that are now a bit antiquated.)
Drill has a second kind of plug-in: the storage plug-in. Storage plug-ins access data sources beyond the distributed filesystem: HBase, Kafka, and JDBC, for example. Storage plug-ins introduce another large set of mechanisms to implement. Again, look at existing storage plug-ins and ask questions on the dev list to discover what has to be done. The use of the test framework shown here will greatly reduce your edit-compile-debug cycle times so that you can try things incrementally.