Chapter 5. Analyzing Complex and Nested Data

As Chapter 4 demonstrated, Apache Drill is a very powerful tool for analyzing data contained in delimited files. In this chapter, you will learn how to apply that power to complex and nested datasets and formats such as JavaScript Object Notation (JSON) and Parquet. Data stored in NoSQL systems such as MongoDB often contains nested structures that are difficult to query in a traditional SQL context. These data formats often require specialized tools to analyze, but with Drill you can query them just as you would any other dataset, albeit with some additional complexity. Before you dive into these datasets, however, you must understand how Drill deals with complex data objects.

A Word About Parquet Format

Parquet is a self-describing, compressed columnar format that supports nested data. Many big data systems, such as Hadoop, Hive, and Spark, support reading and writing Parquet files. Drill performs best when reading Parquet files, so if you plan to query large, complex datasets, we recommend converting the data to Parquet format.
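If you need to convert existing data, you can use Drill itself to do so with a CREATE TABLE AS (CTAS) statement; the output format is controlled by the store.format option, which defaults to Parquet. Here is a minimal sketch, assuming a writable workspace named dfs.tmp and one of the sample files from this book's GitHub repository; adjust the workspace and file names for your environment:

CREATE TABLE dfs.tmp.`customers_parquet` AS
SELECT *
FROM dfs.drill_tutorial.`records.json`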

Arrays and Maps

In Chapter 4 you learned about all the different data types that exist in Drill, such as INTEGER, DOUBLE, and VARCHAR. These data types are common in most databases and programming languages, but unlike most databases, Drill also features two complex data types, array and map, that you’ll need to understand in order to analyze complex datasets.1 Both of these data types differ from the standard data types in that they hold more than one value and can even hold other collections.

Arrays in Drill

In the previous chapter, you saw one of Drill’s complex types: arrays. In Drill and many programming languages, an array is an ordered collection of data points that is indexed by number. Like many programming languages, Drill begins its indexing at 0, so the item at position 4 is actually the fifth item in the list, as demonstrated in Table 5-1.

Table 5-1. Array values
Index  Value
0      Steve
1      Stacy
2      Mary
3      Jose
4      Bob

In the array presented in Table 5-1, you can see that the value at index 2 is Mary, but that is actually the third item in the array.

Drill's array syntax is similar to that of most programming languages: you use square brackets to reference individual items within an array. The array in Table 5-1 would appear like this if you used SQLLine to display the column:

["Steve","Stacy","Mary","Jose","Bob"]

In practice, you will encounter arrays when querying CSV files or other delimited formats that do not include the field names in the header. In this book’s GitHub repository, there is a file called customer_data.csv that contains the first names, last names, and birthdates of fictitious customers. If you were to execute a SELECT * query on this file, the result would look similar to that shown in Figure 5-1.

Figure 5-1. Arrays in Drill

You can see that each row contains an array with fields in it. If you wanted to access the birthday column, you could use the following query:

SELECT columns[2] AS birthday
FROM dfs.drill_tutorial.`customer_data.csv`

This would return the correct results. Likewise, if you wanted to break out all the elements of these arrays and make them more readable, you could execute the following query, which would return the results shown in Figure 5-2:

SELECT 
    columns[0] AS first_name, 
    columns[1] AS last_name, 
    columns[2] AS birthday
FROM dfs.drill_tutorial.`customer_data.csv`
Figure 5-2. Array query result

The most common use of arrays is to store the fields in a delimited file when that file has no header row with field names. Arrays are also used when reading JSON data.

Accessing Maps (Key–Value Pairs) in Drill

Maps are similar to arrays in that they are a complex data type that can hold more than one value; however, unlike arrays, maps are indexed by a key instead of a number.2 JSON represents maps (called objects in JSON) using curly braces, as follows:

{
  "key1":"value1",
  "key2":"value2"
}

You can access the individual fields within the map by referencing table.map.field in your query. For example, if the preceding map were called data and was found in a table called customers, you could access the individual fields by using the following query:

SELECT customers.data.key1, customers.data.key2
FROM...

As a more practical example, the book’s GitHub repository contains a file called user_agents.csv, which contains a bunch of user agent strings3 from a web server log. If you were to query this file, the results would contain an array with the user agent string as the only array item. Here is a sample user agent string:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36  
    (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36

Although you can see that this string contains a lot of information, it is difficult to map that information to fields. However, there is a Drill function called parse_user_agent() that returns a map of artifacts from the user agent string, as shown here:4

{
 "DeviceClass":"Desktop",
 "DeviceName":"Macintosh",
 "DeviceBrand":"Apple",
 "OperatingSystemClass":"Desktop",
 "OperatingSystemName":"Mac OS X",
 "OperatingSystemVersion":"10.10.1",
 "OperatingSystemNameVersion":"Mac OS X 10.10.1",
 "LayoutEngineClass":"Browser",
 "LayoutEngineName":"Blink",
 "LayoutEngineVersion":"39.0",
 "LayoutEngineVersionMajor":"39",
 "LayoutEngineNameVersion":"Blink 39.0",
 "LayoutEngineNameVersionMajor":"Blink 39",
 "AgentClass":"Browser",
 "AgentName":"Chrome",
 "AgentVersion":"39.0.2171.99",
 "AgentVersionMajor":"39",
 "AgentNameVersion":"Chrome 39.0.2171.99",
 "AgentNameVersionMajor":"Chrome 39",
 "DeviceCpu":"Intel"
}

If you want to access a few specific keys from this map, the best way to accomplish this is to write a query that retrieves the map itself. Then, using the table.map.field technique, you can access the individual fields as needed:

SELECT 
    uadata.ua.AgentNameVersion AS Browser,
    COUNT(*) AS BrowserCount
FROM (
   SELECT parse_user_agent(columns[0]) AS ua
   FROM dfs.drill_tutorial.`user_agents.csv`
) AS uadata
GROUP BY uadata.ua.AgentNameVersion
ORDER BY BrowserCount DESC

This query extracts the AgentNameVersion field from the map, then aggregates and counts the unique values. The resulting dataset indicates which browsers and versions were most popular on this website.

Querying Nested Data

Now that you have an understanding of how Drill handles complex data types, you are ready to start querying nested data formats such as JSON and Parquet. Drill can natively query data in these formats; however, that data can be encoded in many different ways. In this section you will learn how to use Drill's various functions to access nested data files quickly and easily.

Data types in JSON files

In Chapter 4, you learned that one of the big limitations of Drill is that it cannot infer data types from delimited data. When querying JSON or other nested formats, Drill is able to infer data types if the data is formatted correctly.5 By default, Drill interprets any number with a decimal point as a DOUBLE and any number without a decimal point as a BIGINT. Drill also interprets the unquoted words true and false as a BOOLEAN data type; anything else is treated as a VARCHAR.
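If you want to check which type Drill inferred for a field, recent versions of Drill include a typeof() function that reports Drill's internal type name for each value (for instance, FLOAT8 for DOUBLE and BIT for BOOLEAN). The following is a quick sketch assuming a hypothetical accounts.json file containing records such as {"name":"Bob","balance":104.5,"visits":7,"active":true}:

SELECT 
    typeof(name) AS name_type,
    typeof(balance) AS balance_type,
    typeof(visits) AS visits_type,
    typeof(active) AS active_type
FROM dfs.drill_tutorial.`accounts.json`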

Drill does not support mixed-type fields within a map, so if your data is inconsistent in this regard, you might encounter schema errors when querying JSON files. Drill has two configuration options that can work around this problem. Setting store.json.read_numbers_as_double to true will alleviate number formatting errors; you can do this in the configuration files or with the following query:

ALTER SESSION SET `store.json.read_numbers_as_double` = true

If the JSON data you are reading did not correctly encode the various data types (for example, by putting quotation marks around numbers or capitalizing True and False), you might encounter additional errors when reading JSON files. If that is the case, the last resort is to activate Drill's all-text mode, which causes Drill to interpret all fields as VARCHARs. You can accomplish this by using the following query:

ALTER SESSION SET `store.json.all_text_mode` = true
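With all-text mode enabled, every field is returned as a VARCHAR, so you will usually cast values back to the types you need within your queries. A quick sketch, assuming a hypothetical prices.json file whose numeric price field is now read as text:

SELECT 
    name,
    CAST(price AS DOUBLE) AS price
FROM dfs.drill_tutorial.`prices.json`
WHERE CAST(price AS DOUBLE) > 10.0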

JSON with Heterogeneous Data

If you have JSON data with mixed data types, you will not be able to directly query this in Drill. However, a new, experimental feature, the Union type, allows you to query JSON data with different types in the same field. Because this new feature is still considered experimental, you must explicitly enable it by setting the exec.enable_union_type option to true, as shown in this query:

ALTER SESSION SET `exec.enable_union_type` = true

You can use a field with a Union type inside of functions; however, you will need to consider how the function will react to different data types. If the data requires special handling for the different types, you can do this in a CASE statement using the new type functions:

SELECT 1 + 
CASE 
   WHEN is_list(a) 
      THEN a[0] 
    ELSE a 
END 
FROM table;

Formats of nested data

There are several different methods by which you can encode data in JSON format: column-oriented, record-oriented, and split. In this section, you’ll learn how to convert all these formats into a basic table that you can further analyze.

This book’s GitHub repository contains three JSON files containing the data from the file customer_data.csv but encoded in different formats: record-oriented, split, and column-oriented. You can try all the techniques with these sample datasets.

Querying record-oriented files

The first example we will look at is records.json, which is encoded as shown in the code that follows. Note that in the JSON format, whitespace is ignored, but we have indented the fields to make the format easier for you to read:

[
   {
      "first_name":"Robert",
      "last_name":"Hernandez",
      "birthday":"5/3/67"
   },{
      "first_name":"Steve",
      "last_name":"Smith",
      "birthday":"8/4/84"
   }
]

You can see that this data is encoded as an array of maps. This format is perhaps the easiest for Drill to query because it does not require the use of any specialized functions. Simply executing the query that follows results in Drill mapping the fields to the correct columns, as demonstrated in Figure 5-3:

SELECT *
FROM dfs.drill_tutorial.`records.json` 
Figure 5-3. JSON query results

In practice, the array-of-objects format is uncommon. More common (and more efficient) is the non-standard list-of-objects format: {...}{...}.
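In the list-of-objects layout, each record is a complete JSON object and the records simply follow one another, with no enclosing array and no commas between them. The two records shown earlier would look roughly like this, and Drill reads this layout with the same SELECT * query:

{"first_name":"Robert","last_name":"Hernandez","birthday":"5/3/67"}
{"first_name":"Steve","last_name":"Smith","birthday":"8/4/84"}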

Using the FLATTEN() function to query split JSON files

Querying split JSON files introduces a bit of complexity into the equation. The following example demonstrates JSON data in the split format:

{
"columns":
  [
    "first_name",
    "last_name",
    "birthday"
  ],
"index":
  [0,1,2,3],
"data":
[
  [
    "Robert",
    "Hernandez",
    "5/3/67"
  ],[
    "Steve",
    "Smith",
    "8/4/84" 
  ],[
    "Anne",
    "Raps",
    "9/13/91"
  ],[
    "Alice",
    "Muller",
    "4/15/75"
  ]
]
} 

You can see in this format that the actual data is buried in an array called data. However, if you just execute the following query, it still will not produce the result we are looking for:

SELECT data
FROM dfs.drill_tutorial.`split.json`

Figure 5-4 shows the results.

Figure 5-4. Split JSON query results

In this case, you can see that we get one row with many arrays. To get the fields into a usable format, you must use the FLATTEN(x) function to break apart the array. This function breaks apart repeated elements in a column into rows. In this case, calling FLATTEN() on the data column will get you a dataset with each array as its own row (see Figure 5-5):

SELECT FLATTEN(data) AS row_data
FROM dfs.drill_tutorial.`split.json`
Figure 5-5. Query results after flattening

Note that you now have each array in its own row. From this point you can access all of the fields by using this as a subquery, as demonstrated here, with the results shown in Figure 5-6:

SELECT 
    row_data[0] AS first_name, 
    row_data[1] AS last_name,
    row_data[2] AS birthday
FROM
(
  SELECT FLATTEN(data) AS row_data
  FROM dfs.drill_tutorial.`split.json`
) AS split_data
Figure 5-6. Query results after extracting fields

Because each row produced by FLATTEN(nested_field) here is itself an array, you can access individual items within that array; however, to create the table shown in Figure 5-6, you need to use a subquery.

Querying column-oriented JSON files with KVGEN()

The following example shows another way in which you can encode data in JSON:

{
  "first_name":
  {
    "0":"Robert",
    "1":"Steve",
    "2":"Anne",
    "3":"Alice"
  },
  "last_name":
  {
    "0":"Hernandez",
    "1":"Smith",
    "2":"Raps",
    "3":"Muller"
  },
  "birthday":
  {
    "0":"5/3/67",
    "1":"8/4/84",
    "2":"9/13/91",
    "3":"4/15/75"
  }
}

In this example, the data is a map of key–value pairs. To access these columns, you need to use a new function: KVGEN(nested_field). As the name implies, the KVGEN(nested_field) function generates key–value pairs from a complex column. This next query generates key–value pairs for every entry in the first_name column:

SELECT KVGEN(first_name) AS kvgen_firstname
FROM dfs.drill_tutorial.`columns.json`

Figure 5-7 presents the results.

Figure 5-7. Unflattened query after KVGEN()

At this point, the formatting should look similar to the previous example in that we have an array of maps contained in a single row entry. And much like in the last example, to get the values, you should use the FLATTEN(nested_field) function on this column. You can see the results of this query in Figure 5-8:

SELECT FLATTEN(KVGEN(first_name)) AS kvgen_firstname
FROM dfs.drill_tutorial.`columns.json`
Figure 5-8. Query data after flattening

You still don’t have the actual values; however, retrieving them is as simple as appending ['value'] after the function calls, as shown in the following query:

SELECT FLATTEN(KVGEN(first_name))['value'] AS firstname
FROM dfs.drill_tutorial.`columns.json`

Figure 5-9 displays the results.

Figure 5-9. Query data with field extracted

Although this query successfully gets you one column, you’ll find that if you try to do this with more than one column, the FLATTEN() function will cause a lot of undesired duplicate rows. Therefore, to produce a single table with all the columns, you need to extract each column and join the results together in a series of subqueries.

However, you will run into another problem: there is no key with which to join the records. To join them together, you need to create a key using ROW_NUMBER() (one of the window functions that you learned about in Chapter 4). The following query extracts all of the fields as columns, which you can analyze (see Figure 5-10):

SELECT first_name, last_name, birthday
FROM 
(
  SELECT ROW_NUMBER() OVER (ORDER BY '1') AS rownum, 
  FLATTEN(KVGEN(first_name))['value'] AS first_name
  FROM dfs.drill_tutorial.`columns.json`
) AS tbl1
JOIN 
(
  SELECT ROW_NUMBER() OVER (ORDER BY '1') AS rownum, 
  FLATTEN(KVGEN(last_name))['value'] AS last_name
  FROM dfs.drill_tutorial.`columns.json`
) AS tbl2 
ON tbl1.rownum = tbl2.rownum
JOIN 
(
  SELECT ROW_NUMBER() OVER (ORDER BY '1') AS rownum, 
  FLATTEN(KVGEN(birthday))['value'] AS birthday
  FROM dfs.drill_tutorial.`columns.json`
) AS tbl3 
ON tbl1.rownum = tbl3.rownum
Figure 5-10. Query data with all fields

There are other ways in which you can arrange data in JSON format, and although there might be occasional structures that Drill is not capable of reading, you will find that by using your knowledge of arrays and maps along with the FLATTEN() and KVGEN() functions you will be able to access most complex data sources.

Analyzing Log Files with Drill

One of Drill’s most powerful features is its ability to read and query log files in addition to standard file formats: Drill has been able to read and query Apache web server logs since version 1.9, and support for other types of log files was added in version 1.14. You can aggregate and join these files with any other data source that Drill can read; however, getting the most out of this data requires an understanding of the nested data structures you have been learning about.

Configuring Drill to Read HTTPD Web Server Logs

To read web server logs, you must first configure the storage plug-in through which you will query them. Server administrators can customize the web server output to record whichever fields they are interested in. On your web server, these settings are typically located in /etc/apache2/apache2.conf. In this file, there is a configuration option called LogFormat, which will look something like this:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\""

Once you’ve located the log format string, copy it to a location where you can readily access it because you will need this formatting string to configure Drill. The next step is to open the Storage tab in the Web Console and click the storage plug-in that you are configuring—probably dfs—and add the following code to the formats section:

"httpd": {
   "type": "httpd",
   "logFormat":"%h %l %u %t "%r" %>s %O "{Referer}i" "%{User-Agent}i"",
   "timestampFormat": null
}

Be sure to use the actual format string from your server. Table 5-2 shows some of the available fields.6

Table 5-2. Selected format string values
Format string Description

%h

Remote hostname. Will log the IP address if HostnameLookups is set to Off, which is the default. If it logs the hostname for only a few hosts, you probably have access control directives mentioning them by name.

%{VARNAME}i

Contents of VARNAME: header line(s) in the request sent to the server. Changes made by other modules (e.g., mod_headers) affect this. If you’re interested in what the request header was prior to when most modules would have modified it, use mod_setenvif to copy the header into an internal environment variable and log that value with %{VARNAME}e.

%l

Remote log name (from identd, if supplied). This returns a dash unless mod_ident is present and IdentityCheck is set to On.

%r

First line of request.

%s

Status. For requests that have been internally redirected, this is the status of the original request. Use %>s for the final status.

%t

Time the request was received, in the format [18/Sep/2011:19:18:28 -0400]. The last number indicates the time zone offset from GMT.

%u

Remote user if the request was authenticated. Might be bogus if return status (%s) is 401 (unauthorized).

%O

Bytes sent, including headers. Might be zero in rare cases such as when a request is aborted before a response is sent. You need to enable mod_logio to use this.

Querying Web Server Logs

After you have configured the plug-in, you are ready to begin querying web server logs. This book’s GitHub repository contains a few sample web server logs for practice. Let’s take a look at the file hackers-access.httpd (note that the following entries are wrapped across lines for book formatting but appear as single lines in the real data):

195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] 
"GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 
"http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) 
Gecko/20100101 Firefox/35.0"
23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] 
"GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" 
"Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"

By executing the following query, you will see all of the available fields:

SELECT *
FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
LIMIT 10

Note that this query returns many more fields than appear in the format string. Drill’s HTTPD parser breaks many of the fields into their constituent parts; for example, the request_firstline field is broken out into request_firstline_uri_userinfo and several other subfields.

Analyzing user agent strings

There are a number of built-in functions that are very useful for analysis of web server logs and other network-related data. As we saw earlier, the web server logs contain user agent strings that hold a wealth of valuable data.

You can view the user agent strings by using the following query:

SELECT `request_user-agent`
FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`

These sample results show that this request was made using Firefox 34 on a Windows machine:

Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0

However, parsing these details out of user agent strings is quite complicated; as a result, there is a function called parse_user_agent() that returns a map similar to the following:

{
  "DeviceClass":"Desktop",
  "DeviceName":"Macintosh",
  "DeviceBrand":"Apple",
  "OperatingSystemClass":"Desktop",
  "OperatingSystemName":"Mac OS X",
  "OperatingSystemVersion":"10.10.1",
  "OperatingSystemNameVersion":"Mac OS X 10.10.1",
  "LayoutEngineClass":"Browser",
  "LayoutEngineName":"Blink",
  "LayoutEngineVersion":"39.0",
  "LayoutEngineVersionMajor":"39",
  "LayoutEngineNameVersion":"Blink 39.0",
  "LayoutEngineNameVersionMajor":"Blink 39",
  "AgentClass":"Browser",
  "AgentName":"Chrome",
  "AgentVersion":"39.0.2171.99",
  "AgentVersionMajor":"39",
  "AgentNameVersion":"Chrome 39.0.2171.99",
  "AgentNameVersionMajor":"Chrome 39",
  "DeviceCpu":"Intel"
}

Installing the parse_user_agent() Function

Because user agents are constantly being updated, this function does not ship with Drill. You can view the complete installation instructions on GitHub.
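As a rough sketch of the process: after building or downloading the function’s source and binary JAR files and copying them to Drill’s UDF staging directory, you can register the function with Drill’s Dynamic UDF feature. The JAR name below is hypothetical; use the name of the JAR you actually built or downloaded:

CREATE FUNCTION USING JAR 'drill-useragent-function.jar'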

Although not all user agents will contain all of these fields, you can access the individual fields by including them in a subquery, as shown here:

SELECT 
    uadata.ua.AgentNameVersion AS Browser,
    COUNT( * ) AS BrowserCount
FROM (
      SELECT parse_user_agent( `request_user-agent` ) AS ua
      FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
      ) AS uadata
GROUP BY uadata.ua.AgentNameVersion
ORDER BY BrowserCount DESC

This query gets you a count of all the browsers and versions that accessed this server, as illustrated in Figure 5-11. You could of course aggregate any of the other fields in the same manner.

Figure 5-11. Browser count query result
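For example, a minimal variation of the preceding query counts operating systems rather than browsers by grouping on another field from the map returned by parse_user_agent():

SELECT 
    uadata.ua.OperatingSystemNameVersion AS OS,
    COUNT(*) AS OSCount
FROM (
    SELECT parse_user_agent(`request_user-agent`) AS ua
    FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
) AS uadata
GROUP BY uadata.ua.OperatingSystemNameVersion
ORDER BY OSCount DESC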

Analyzing URLs and query strings

Web server logs contain many URLs, and just as with user agent strings, these URLs contain many useful artifacts that Drill can easily extract.

If you aren’t familiar with URL structure, consider the following example URL:

http://somesite.com/login.php?username=bob&password=pass1234

Drill has a function called parse_url() that returns the following map when given the preceding URL:

{
   "protocol":"http",
   "authority":"somesite.com",
   "host":"somesite.com",
   "path":"/login.php?username=bob&password=pass1234"
}

In a web server log, the field request_referer contains URLs, and you can access these artifacts in the URLs by using the following query:

SELECT `request_referer`, parse_url(`request_referer`)
FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
WHERE `request_referer` IS NOT NULL

Figure 5-12 presents the results.

Figure 5-12. Query results with URL metadata

You can then access the newly extracted data artifacts just as you would any other map.
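For instance, here is a sketch of a query that counts referring hosts by grouping on the host key from the map that parse_url() returns:

SELECT 
    urldata.url_map.host AS referer_host,
    COUNT(*) AS referer_count
FROM (
    SELECT parse_url(`request_referer`) AS url_map
    FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
    WHERE `request_referer` IS NOT NULL
) AS urldata
GROUP BY urldata.url_map.host
ORDER BY referer_count DESC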

The query string can also be very useful for analysis. Recall from our fictitious URL that it contained username=bob&password=pass1234, which is clearly a key–value pairing of variables. However, to analyze this data, you need to first break it up into a map. As you might imagine, Drill has the capability to do this by using the parse_query() function. This function takes a query string and returns a mapping of the key–value pairs from the query string. Note that this function will throw an error if you attempt to pass it a null string, so you need to check for empty query strings before you pass it anything. The following example demonstrates how to use the function in a query:

SELECT `request_referer_query`, 
    parse_query(`request_referer_query`)
FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
WHERE LENGTH(`request_referer_query`) > 0

This query produces the result shown in Figure 5-13.

Figure 5-13. Query with key–value pairs extracted

This function can be very useful in identifying various attacks against a web server. Suppose that you wanted to extract all the values that were being passed to a specific key. You could execute the following query:

SELECT q.query_map.`came_from` AS came_from
FROM (
   SELECT parse_query(`request_referer_query`) AS query_map
   FROM dfs.drill_tutorial.`log_files/hackers-access.httpd`
   WHERE LENGTH(`request_referer_query`) > 0
) AS q
WHERE q.query_map.`came_from` IS NOT NULL

This could be very useful if you were looking at form submissions for malicious activity. Figure 5-14 displays the results of this particular query.

Figure 5-14. Query results showing URL referrer field

In this section, you’ve learned how to use Drill to analyze web server logs and how to use functions to extract valuable artifacts from this data. In the next section, we look at how to use Drill to analyze other types of log files. Much like web server log data, these log files will also contain nested data structures; thus, understanding nested data structures will be useful for getting the most out of your log files.

Other Log Analysis with Drill

In addition to web server logs, as of version 1.14, Drill can read virtually any other kind of log file, provided that you can describe its structure with a regular expression. Consider the following log file, which is a sample of a MySQL log:

070823 21:00:32       1 Connect     root@localhost on test1
070823 21:00:48       1 Query       show tables
070823 21:00:56       1 Query       select * from category
070917 16:29:01      21 Query       select * from location
070917 16:29:12      21 Query       select * from location where id = 1 LIMIT 1

Even though it might be possible to get Drill to parse this by splitting the fields by tabs, it would probably result in some needlessly complex queries.

Analyzing such log files with Drill is similar to what you just learned for parsing HTTPD server logs; however, instead of using the LogFormat string to define the file format, you must define a regular expression.7 You will also need to use grouping parentheses to mark the fields that you want to extract. Note that the log format plug-in uses Java-style regular expressions, and because backslashes must be escaped within the configuration's string constants, you must double each backslash: the regex \d{4} would become \\d{4}. For the previous log file, a regular expression to match the lines and extract the fields would be as follows:

(\d{6})\s(\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)

To configure Drill to read log files natively, you must add the following section to the storage plug-in configuration you will use to query your log files—probably hdfs or dfs. In the file formats section, add the following:

"log" : {
      "type" : "logRegex",
      "extension" : "log",
      "regex" : "(\d{6})\s(\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)",
      "maxErrors": 10,
      "schema": [
        {
          "fieldName": "eventDate",
          "fieldType": "DATE",
          "format": "yyMMdd"
        },
        {
          "fieldName": "eventTime",
          "fieldType": "TIME",
          "format": "HH:mm:ss"
        },
        {
          "fieldName": "PID",
          "fieldType": "INT"
        },
        {
          "fieldName": "action"
        },
        {
          "fieldName": "query"
        }
      ]
   }

In addition to the regular expression, there are a few other fields that you must define in the storage plug-in configuration:

regex

This is the regular expression that the logs should match. It must have grouping parentheses to extract the fields.

fields

This is an array of the field names. If the number of field names is less than the number of groups from the regex, Drill will assign the name of field_n to any unnamed fields.

extension

This is the file extension to be used with this configuration.

maxErrors

This defines a limit for how many errors Drill will ignore before it halts execution of a query. Because logs can be inconsistent, you should tune this parameter to a tolerance that is appropriate for your data.

schema

This is where you define the schema of your log file. The schema is based on the number of capturing groups you defined in the regex variable. If you do not define the fields, they will be assigned a name of field_n, where n is the position index of the field, with the VARCHAR data type. The schema option is an array of objects that can contain the following fields:

fieldName

This is the name of your field.

fieldType

This allows you to define what data type Drill should apply to the column. This defaults to VARCHAR. At the time of writing, the log reader plug-in supports INT, FLOAT, VARCHAR, DATE, TIME, and TIMESTAMP.

format

This field is mandatory if the field is a date or time data type. Uses Joda format strings.

Additional Implicit Columns

In addition to the columns extracted by the regex, the log format plug-in has two implicit columns: the _raw column, which displays the complete line from the log file; and the _unmatched_rows column, which displays rows that do not match the supplied regular expression.
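These columns are handy for checking a new regular expression against a real log file: rows that the pattern fails to match will appear in _unmatched_rows. A quick sketch using the MySQL log from this section:

SELECT _raw, _unmatched_rows
FROM dfs.drill_tutorial.`mysql.log`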

Using the previous configuration, you can now analyze log files as you would any other data source in Drill. Figure 5-15 illustrates the query results from SELECT * FROM dfs.drill_tutorial.`mysql.log`.

Figure 5-15. Results of log query

You can select individual fields, filter, aggregate, or perform any other type of analysis on this data.

Conclusion

In this chapter you learned how to query numerous complex data formats, such as log files, JSON, and Parquet. You also learned about Drill’s complex data types, maps and arrays, and how to combine all of this to analyze nested datasets.

In Chapter 6, you will learn how to expand your ability to access data even further by learning how to connect Drill to other systems including relational databases, cloud storage, NoSQL data stores, and much more.

1 From a user’s perspective, Drill supports arrays and maps that work as described in this chapter. As you will see when you develop user-defined functions or plug-ins for Drill, they are handled internally quite differently than in programming languages.

2 Technically, maps in Drill are more analogous to structs in C, which are collections of nested tuples such that every record will have the same fields. Therefore, when reading {a: {foo: 10}} {a: {bar: "fred"}}, Drill will produce {a: {foo: 10, bar: null}} {a: {foo: null, bar: "fred"}}.

3 User agent strings are text sent to a web server as part of an HTTP request that contain information about the software and equipment that the user has. User agent strings can be very difficult to parse, but they contain a wealth of information and can be used for traffic analysis, or simply to send the appropriate version of a website to a client.

4 The parse_user_agent() function is not included with Drill because it needs to be regularly updated. You can download it from this GitHub repository.

5 A very common encoding error is when data has 0 instead of 0.0 in a column containing floating-point numbers.

6 Complete documentation is available on the Apache website.

7 If you are unfamiliar with regular expressions, we recommend reading Mastering Regular Expressions by Jeffrey Friedl (O’Reilly).
