Learning Apache Drill
by Paul Rogers, Charles Givre
Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Online Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Special Thanks from Charles
Special Thanks from Paul
1. Introduction to Apache Drill
What Is Apache Drill?
Drill Is Versatile
Drill Is Easy to Use
A Word About Drill’s Performance
A Very Brief History of Big Data
Drill in the Big Data Ecosystem
Comparing Drill with Similar Tools
2. Installing and Running Drill
Preparing Your Machine for Drill
Special Configuration Instructions for Windows Installations
Installing Drill on Windows
Starting Drill on a Windows Machine
Installing Drill in Embedded Mode on macOS or Linux
Starting Drill on macOS or Linux in Embedded Mode
Installing Drill in Distributed Mode on macOS or Linux
Preparing Your Cluster for Drill
Starting Drill in Distributed Mode
Connecting to the Cluster
Conclusion
3. Overview of Apache Drill
The Apache Hadoop Ecosystem
Drill Is a Low-Latency Query Engine
Distributed Processing with HDFS
Elements of a Drill System
Drill Operation: The 30,000-Foot View
Drill Is a Query Engine, Not a Database
Drill Operation Overview
Drill Components
SQL Session State
Statement Preparation
Statement Execution
Low-Latency Features
Conclusion
4. Querying Delimited Data
Ways of Querying Data with Drill
Other Interfaces
Drill SQL Query Format
Choosing a Data Source
Defining a Workspace
Specifying a Default Data Source
Accessing Columns in a Query
Delimited Data with Column Headers
Table Functions
Querying Directories
Understanding Drill Data Types
Cleaning and Preparing Data Using String Manipulation Functions
Complex Data Conversion Functions
Working with Dates and Times in Drill
Converting Strings to Dates
Reformatting Dates
Date Arithmetic and Manipulation
Date and Time Functions in Drill
Creating Views
Data Analysis Using Drill
Summarizing Data with Aggregate Functions
Common Problems in Querying Delimited Data
Spaces in Column Names
Illegal Characters in Column Headers
Reserved Words in Column Names
Conclusion
5. Analyzing Complex and Nested Data
Arrays and Maps
Arrays in Drill
Accessing Maps (Key–Value Pairs) in Drill
Querying Nested Data
Analyzing Log Files with Drill
Configuring Drill to Read HTTPD Web Server Logs
Querying Web Server Logs
Other Log Analysis with Drill
Conclusion
6. Connecting Drill to Data Sources
Querying Multiple Data Sources
Configuring a New Storage Plug-in
Connecting Drill to a Relational Database
Querying Data in Hadoop from Drill
Connecting to and Querying HBase from Drill
Querying Hive Data from Drill
Connecting to and Querying Streaming Data with Drill and Kafka
Connecting to and Querying Kudu
Connecting to and Querying MongoDB from Drill
Connecting Drill to Cloud Storage
Querying Time Series Data from Drill and OpenTSDB
Conclusion
7. Connecting to Drill
Understanding Drill’s Interfaces
JDBC and Drill
ODBC and Drill
Drill’s REST Interface
Connecting to Drill with Python
Using drillpy to Query Drill
Connecting to Drill Using pydrill
Other Ways of Connecting to Drill from Python
Connecting to Drill Using R
Querying Drill from R Using sergeant
Connecting to Drill Using Java
Querying Drill with PHP
Using the Connector
Querying Drill from PHP
Interacting with Drill from PHP
Querying Drill Using Node.js
Using Drill as a Data Source in BI Tools
Exploring Data with Apache Zeppelin and Drill
Exploring Data with Apache Superset
Conclusion
8. Data Engineering with Drill
Schema-on-Read
The SQL Relational Model
Data Life Cycle: Data Exploration to Production
Schema Inference
Data Source Inference
Storage Plug-ins
Storage Configurations
Workspaces
Querying Directories
Default Schema
File Type Inference
Format Plug-ins and Format Configuration
Format Inference
File Format Variations
Schema Inference Overview
Distributed File Scans
Schema Inference for Delimited Data
CSV Summary
Schema Inference for JSON
Ambiguous Numeric Schemas
Aligning Schemas Across Files
JSON Objects
JSON Lists in Drill
JSON Summary
Using Drill with the Parquet File Format
Schema Evolution in Parquet
Partitioning Data Directories
Defining a Table Workspace
Working with Queries in Production
Capturing Schema Mapping in Views
Running Challenging Queries in Scripts
Conclusion
9. Deploying Drill in Production
Installing Drill
Prerequisites
Production Installation
Configuring ZooKeeper
Configuring Memory
Configuring Logging
Testing the Installation
Distributing Drill Binaries and Configuration
Starting the Drill Cluster
Configuring Storage
Working with Apache Hadoop HDFS
Working with Amazon S3
Admission Control
Additional Configuration
User-Defined Functions and Custom Plug-ins
Security
Logging Levels
Controlling CPU Usage
Monitoring
Monitoring the Drill Process
Monitoring JMX Metrics
Monitoring Queries
Other Deployment Options
MapR Installer
Drill-on-YARN
Docker
Conclusion
10. Setting Up Your Development Environment
Installing Maven
Creating the Drill Build Environment
Setting Up Git and Getting the Source Code
Building Drill from Source
Installing the IDE
Conclusion
11. Writing Drill User-Defined Functions
Use Case: Finding and Filtering Valid Credit Card Numbers
How User-Defined Functions Work in Drill
Structure of a Simple Drill UDF
The pom.xml File
The Function File
The Simple Function API
Putting It All Together
Building and Installing Your UDF
Statically Installing a UDF
Dynamically Installing a UDF
Complex Functions: UDFs That Return Maps or Arrays
Example: Extracting User Agent Metadata
The ComplexWriter
Writing Aggregate User-Defined Functions
The Aggregate Function API
Example Aggregate UDF: Kendall’s Rank Correlation Coefficient
Conclusion
12. Writing a Format Plug-in
The Example Regex Format Plug-in
Creating the “Easy” Format Plug-in
Creating the Maven pom.xml File
Creating the Plug-in Package
Drill Module Configuration
Format Plug-in Configuration
Cautions Before Getting Started
Creating the Regex Plug-in Configuration Class
Copyright Headers and Code Format
Testing the Configuration
Fixing Configuration Problems
Troubleshooting
Creating the Format Plug-in Class
Creating a Test File
Configuring RAT
Efficient Debugging
Creating the Unit Test
How Drill Finds Your Plug-in
The Record Reader
Testing the Reader Shell
Logging
Error Handling
Setup
Regex Parsing
Defining Column Names
Projection
Column Projection Accounting
Project None
Project All
Project Some
Opening the File
Record Batches
Drill’s Columnar Structure
Defining Vectors
Reading Data
Loading Data into Vectors
Releasing Resources
Testing the Reader
Testing the Wildcard Case
Testing Explicit Projection
Testing Empty Projection
Scaling Up
Additional Details
File Chunks
Default Format Configuration
Next Steps
Production Build
Contributing to Drill: The Pull Request
Maintaining Your Branch
Create a Plug-In Project
Conclusion
13. Unique Uses of Drill
Finding Photos Taken Within a Geographic Region
Drilling Excel Files
The pom.xml File
The Excel Custom Record Reader
Using the Excel Format Plug-in
Network Packet Analysis (PCAP) with Drill
Examples of Queries Using PCAP Data Files
Analyzing Twitter Data with Drill
Using Drill in a Machine Learning Pipeline
Making Predictions Within Drill
Building and Serializing a Model
Writing the UDF Wrapper
Making Predictions Using the UDF
Conclusion
A. List of Drill Functions
Aggregate and Window Functions
Window Functions
Cryptological and Hashing Functions
Data Conversion Functions
Geospatial Functions
Math and Trigonometric Functions
Networking Functions
Null Handling Functions
String Manipulation Functions
Approximate String Matching Functions
Phonetic Functions
String Distance Functions
B. Drill Formatting Strings
Index
Learning Apache Drill
Query and Analyze Distributed Data Sources with SQL
Charles Givre and Paul Rogers