© Thomas W. Dinsmore 2016

Thomas W. Dinsmore, Disruptive Analytics, 10.1007/978-1-4842-1311-7_4

4. The Hadoop Ecosystem

Disrupting from Below

Thomas W. Dinsmore

(1) Newton, Massachusetts, USA

In 2003, Doug Cutting and Mike Cafarella struggled to build a web crawler to search and index the entire Internet. They needed a way to distribute the data over multiple machines, because there was too much data for a single machine.

To keep costs low, they wanted to use inexpensive commodity hardware. That meant they would need fault-tolerant software, so if any one machine failed, the system could continue to operate.

Early in their work, they ruled out using a relational database. Their data included diverse data structures and data types, without a predefined data model. Mapping that data into a relational data structure would take too much time—if it were possible at all.

Drawing from two papers published by Google engineers, Cutting and Cafarella developed a distributed file system and programming framework. The file system, now called the Hadoop Distributed File System, or HDFS, included built-in redundancy, so that if any machine in the cluster failed, the system could continue to work with copies on other machines.

The programming framework, MapReduce, provided developers with a way to distribute workloads to nodes in a cluster and collect the results. With MapReduce, the developer did not have to explicitly parallelize the task, which saved time and simplified programming. MapReduce also abstracted the code from cluster management, so that the cluster configuration could change without changes to the program: a MapReduce program written for one cluster could run unmodified on another cluster with a different configuration.

Cutting and Cafarella designed HDFS and MapReduce to work together as a system. In 2006, they contributed the code to the Apache Software Foundation as the core of Apache Hadoop.

Early versions of Hadoop were difficult to use, with few tools available to the analyst. For most enterprises, Hadoop was little more than a curiosity. Internet companies, however, adopted the technology quickly. In 2008, Yahoo announced1 that it had set a record for a terabyte sort with Hadoop; soon thereafter, Facebook revealed2 that it ingested 15 terabytes a day into its 2.5 petabyte data warehouse on Hadoop.

Hadoop’s suitability for general use took a step forward in 2009, when Cloudera and MapR delivered commercially supported distributions. In 2013, the Hadoop team introduced YARN, a resource manager. This proved to be so significant that the community coined the term “Hadoop 2.0” to characterize the new phase of the project.

Since then, enterprise adoption has exploded:

  • Wikibon estimates that one third of all organizations with Big Data have Hadoop clusters in production.3

  • Ovum analyst Tony Baer estimates4 that there were 1,000 Hadoop clusters implemented by early 2014.

  • Leading Hadoop distributors report sales of 50 to 75 new customers per quarter.

  • The most sophisticated and mature users have clusters of more than 1,000 nodes.5

  • A Gartner survey6 of its CIO Panel conducted in 2015 indicates a likely doubling of the Hadoop user base in 2016.

  • Industry analyst Forrester predicts7 that 100% of large enterprises will adopt Hadoop by 2017.

In this chapter, we cover basic principles of Hadoop and its ecosystem; the economics of Hadoop; an introduction to NoSQL datastores; and a review of analytics in Hadoop.

Hadoop and its Ecosystem

A modern relational database management system is a complex bundle of technologies: a file system, query engine, backup and recovery tools, scripting language, bulk load facilities, security systems, and administration and governance tools. Hadoop unbundles these capabilities into many separate software components, each of which can operate independently of the others.

There are three ways to define Hadoop:

  • Apache Hadoop: A project of the Apache Software Foundation for distributed computing and storage.

  • Core Hadoop: A set of widely used open source components that complement the functionality of Apache Hadoop.

  • The Hadoop ecosystem: A network of open source projects, commercial software vendors, and service providers that has developed around Apache Hadoop.

Each Apache project operates independently; this enables rapid innovation and development. However, each project supports a narrow function, so that users typically combine software from multiple projects into a working system.

Commercial vendors sell bundles of open source and proprietary components, including Apache Hadoop. These are commonly called Hadoop distributions, although Apache license agreements prevent vendors from using that terminology.8

Apache Hadoop

Apache Hadoop is a top-level project of the Apache Software Foundation, which owns the code and distributes it under a free license to use. Hadoop operates under Apache’s standard governance model. The project team consists of a Project Management Committee (PMC) and a number of committers, who are volunteer software developers. The PMC sets priorities for software enhancements and bug fixes. The team publishes a new release roughly every three months, based on a voting consensus of the committers.

Hadoop supports large-scale fault-tolerant distributed computing on commodity hardware. It includes three major components:

  • Hadoop Distributed File System (HDFS): A distributed file system implemented in Java.

  • MapReduce: A processing framework for working with distributed data sets.

  • YARN: A resource management and scheduling application.

Under Hadoop 1.0, HDFS includes a Name Node and one or more Data Nodes. The Name Node serves as a catalogue and keeps track of files stored in the system; the Data Nodes hold the data itself. Under HDFS, a file is distributed across many Data Nodes.

The MapReduce 1.0 engine consists of a Job Tracker node and one or more Task Tracker nodes. In a standard configuration, one Task Tracker and one Data Node are installed on each server in a cluster, with one or more servers designated to support the Name Node and Job Tracker.

Users query data held in HDFS by writing MapReduce programs, consisting of mappers and reducers. A mapper distributes operations such as select, filter, and sort; a reducer performs a summary operation on the results of the mapper. Users can leverage these fundamental operations to support a wide range of business applications.
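To make the mapper and reducer roles concrete, here is a minimal word-count sketch written in the style used by Hadoop Streaming, which lets any executable serve as a mapper or reducer. It is an illustrative sketch only; the script name and the way it is invoked are assumptions, not part of Apache Hadoop itself.

#!/usr/bin/env python
# wordcount.py -- a minimal Hadoop Streaming sketch (illustrative only).
# Run as "wordcount.py map" or "wordcount.py reduce"; Hadoop sorts the
# mapper output by key before it reaches the reducer.
import sys

def mapper():
    # Emit one (word, 1) pair per word as tab-separated key/value text.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reducer():
    # Input arrives sorted by key, so counts for each word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

A job like this is typically submitted with the hadoop jar command and the Hadoop Streaming JAR, passing -input, -output, -mapper, and -reducer options; the exact JAR location varies by distribution.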

Prospective users can procure Apache Hadoop directly from the Apache Software Foundation.

Hadoop 2.0

In Hadoop’s first generation, workload management was hardwired into MapReduce, which made it difficult to co-locate software with other processing models in the cluster. The introduction of YARN in late 2013 marks Hadoop’s second generation, or Hadoop 2.0. YARN is the result of a code rebuild that split the role of the JobTracker and TaskTracker into three separate entities:

  • ResourceManager: A scheduler that allocates computing resources in the cluster to applications.

  • NodeManager: Deployed on each node in the Hadoop cluster, this component manages resources on its node under the direction of ResourceManager.

  • ApplicationMaster: Runs a specific job on the cluster, procures computing resources from ResourceManager, and works with NodeManager to manage resources available for the job.

YARN and Hadoop 2.0 had an enormous impact on business analytics. Under Hadoop 1.0, analytical applications were deployed beside Hadoop; users either physically extracted the data from Hadoop and moved it to the analytical application, or they converted their commands to MapReduce for direct execution in Hadoop.

Moving very large data sets is impractical (and sometimes impossible); moreover, the MapReduce learning curve is very steep even for trained analysts. A few software vendors attempted to build tools that automated the conversion to MapReduce behind the scenes, with limited success.

YARN permits deployment of analytic applications directly on the Hadoop cluster, co-located with MapReduce. The analytic applications can consume data stored in HDFS directly, without passing commands through the MapReduce engine. YARN serves as a “traffic cop” for resources, managing conflict between MapReduce jobs and the co-located software.
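In practical terms, a co-located application can read HDFS files directly, much as a local program reads files on disk. The following Python fragment is a rough sketch of that pattern, assuming the pyarrow package with HDFS bindings is installed on the node; the host, port, and file path are invented for the example.

# A minimal sketch of an analytic process reading HDFS directly, without
# translating its work into MapReduce. Assumes pyarrow with libhdfs available;
# the NameNode host, port, and path are illustrative.
import pandas as pd
import pyarrow as pa

fs = pa.hdfs.connect(host="namenode.example.com", port=8020)

# Stream a file out of HDFS straight into a local DataFrame for analysis.
with fs.open("/data/events/2016-07-01.csv", "rb") as f:
    events = pd.read_csv(f)

print(events.describe())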

Powered by Hadoop

In 2009, Cloudera, a startup, launched the first commercially supported software powered by Hadoop. Cloudera bundled Apache Hadoop together with several other components to make a complete package for data management. They branded the bundle as Cloudera Data Hub (CDH), offering technical support and consulting services.

MapR, also founded in 2009, offered a competing Hadoop bundle with a number of modifications designed for high availability. MapR also included an option to substitute a proprietary file system (MapR-FS) for better performance and ease of use.

Yahoo, an early Hadoop adopter, published its own bundle for several years, but provided no technical support. In 2011, Yahoo spun off its Hadoop assets to a new company, Hortonworks, which stressed a “pure” open source approach to Hadoop. Hortonworks includes no proprietary components in its software, relying purely on technical support and consulting services for revenue.

Competition among the leading Hadoop companies is intense; each seeks to improve the value of its bundle through services, support, and product enhancements. As a result, the core Hadoop project is increasingly stable and feature-rich.

Outside of lab environments, few enterprises use pure Apache Hadoop, relying instead on one of the commercial bundles powered by Hadoop. As of July 2016, these are:

  • Amazon Elastic MapReduce

  • Cloudera Data Hub (CDH)

  • Hortonworks Data Platform

  • IBM Infosphere BigInsights

  • MapR

Enterprises rely on commercial bundles for several reasons. First, commercial bundles include additional components that support capabilities needed in a data platform: implementation, maintenance, provisioning, security, and so forth.

The commercial vendors test all components in the bundle for interoperability, then provide technical support and consulting services to customers. These services provide significant value, as they simplify deployment and reduce time to value.

Commercial vendors also provide a more stable release cadence than Apache Hadoop, which releases new versions frequently and irregularly.9 Enterprises generally prefer to avoid the cost and risk of frequent software upgrades.

Since the commercial vendors compete with one another and seek to differentiate their products, each vendor complements Apache Hadoop with a different bundle of software. The nine components listed here are included in every distribution.

  • Apache Flume: Distributed service for collecting, aggregating, and moving log data

  • Apache HBase: Distributed columnar database

  • Apache Hive: SQL-like query engine

  • Apache Oozie: Workflow scheduler

  • Apache Parquet: Columnar storage format

  • Apache Pig: High-level MapReduce scripting language

  • Apache Spark: Distributed in-memory computing framework, with libraries for SQL, streaming, machine learning, and graph analytics

  • Apache Sqoop: Tool for data transfer between relational databases and Hadoop

  • Apache Zookeeper: Distributed cluster configuration service

Beyond these universally distributed components, there are many other Apache open source projects that support functions needed for data management. The projects listed next are included in some, but not all, commercial Hadoop distributions:

  • Accumulo: Wide-column datastore

  • Atlas: Metadata repository for data governance

  • Avro: Remote procedure call and data serialization framework

  • Falcon: Data governance engine that defines, schedules, and monitors data management policies

  • Knox: System that provides authentication and secure access

  • Ranger: Comprehensive security administration

  • Sentry: Provides role-based authorization

  • Slider: Tool to deploy and monitor applications running under YARN

  • Storm: Distributed real-time computation system

  • Tez: Software that accelerates MapReduce operations

Commercial vendors mix Apache open source projects, non-Apache open source software, and proprietary software in their products. Hortonworks bundles open source software only.

Performance Improvements

Compared to high-performance data warehouse appliances, Hadoop is slow, but improving quickly. The performance issue is attributable in part to the immaturity of the platform; data warehouse appliances have sophisticated query optimizers that accelerate runtime for each operation. Optimizers for Hadoop are still in an early stage of development.

MapReduce breaks high-level tasks, such as a SQL operation, into multiple steps. It saves intermediate results to disk after each pass through the data; as a result, complex analysis runs significantly slower in MapReduce than it does in a data warehouse appliance. This is especially true for iterative algorithms, such as k-means clustering or stochastic gradient descent.

The leading commercial vendors have divergent approaches to this performance problem. Cloudera embraces Apache Spark, while Hortonworks promotes Apache Tez.

Apache Spark is a framework for distributed in-memory processing. Comparable tasks run much faster in Spark than they do in MapReduce because Spark retains intermediate results in memory rather than persisting them to disk. We cover Spark in detail in Chapter Five, which covers in-memory analytics.

Apache Tez takes a different approach. Working behind the scenes, Tez models program logic as a Directed Acyclic Graph (DAG), dynamically reconfiguring and simplifying the code. As such, Tez works in a manner similar to query optimizers in relational databases. While Tez reduces the number of MapReduce steps needed to implement a complex analysis, it does not change the need for MapReduce to persist intermediate results to disk.

Cloudera endorsed Spark in 2013 and announced that it would include Spark in its next release. The other vendors followed suit, so that by June 2014 all of the commercial Hadoop vendors had endorsed Spark and included it in their products.

In September, 2015, Cloudera announced what it calls the One Platform Initiative, under which it plans to make Spark the primary computing platform in its Hadoop release, relegating MapReduce to secondary status for new applications. This does not mean that MapReduce is dead; there is a large inventory of existing programs that will continue to run in MapReduce.

Using Tez, Hortonworks’ Stinger project has improved the performance of Hive. However, planned extensions for machine learning haven’t been realized, and Tez appears to be a technological cul-de-sac.10

The Economics of Hadoop

Two key factors drive Hadoop adoption. The first of these is flexibility—the ability to capture and retain data without first defining its structure enables enterprises to act more quickly. The second is cost: Hadoop and NoSQL databases are much less expensive per unit of data than conventional data warehouses.

Hadoop and relational databases are like apples and oranges: they do very different things. Hadoop loads quickly and does not require a predefined structure; hence it is very well suited to support the flood of unstructured data produced by the new digital economy. However, Hadoop is relatively hard to query, and its runtime performance on requests is relatively slow.

Relational databases —and data warehouses based on them—require predefined data models, which can be time consuming and expensive to build. But they are relatively easy to query, and they can be tuned for extremely fast runtime performance.

Hadoop provides very low-cost storage; according to one analyst, Hadoop costs11 $1,000 to $2,000 per terabyte. In contrast, a Teradata data warehouse can cost12 up to $69,000 per terabyte. At those rates, a 500-terabyte store costs roughly $500,000 to $1 million on Hadoop, versus as much as $34.5 million on a warehouse appliance.

Apache Hadoop itself is free and open source; commercial bundles must be licensed, but every distributor provides a free version, enabling users to get started for little or no cost. The cost model for Hadoop scales smoothly with user needs. In other words, it is relatively easy to add another node to a Hadoop cluster, and the incremental cost of doing so is minimal.

Data warehouses from leading vendors are expensive to build, and costs are “front-end loaded”—the organization pays a steep price just to get started. Moreover, data warehouse costs tend to be “lumpy”— units of expansion are large and costly.

Hadoop’s disruptive power stems from the business model of the Hadoop ecosystem. For organizations with rapidly expanding volumes of unstructured data, Hadoop’s low cost, agility, and scalability are very attractive relative to conventional data warehousing.

Data warehouses built on appliances, columnar databases, or in-memory databases remain attractive for high-value and heavily used data. For high-concurrency access to high-value data, where the organization is willing to pay a premium, high-performance data warehouses remain the best choice.

Just one of the major Hadoop distributors, Hortonworks, is a public company. Hortonworks’ financial disclosures show rapid growth. In the second quarter of 2015, the company added 119 customers, a 27% increase over the previous quarter; revenue increased 35% over the first quarter of 2015 and 154% over the second quarter of 2014.

NoSQL Datastores

NoSQL means “not only SQL.” NoSQL databases do not require the user to define a logical structure prior to loading the data; instead, the user defines structure when analyzing the data. That makes them well-suited to handling images, audio, video, machine-generated logs, documents, social media text, and other data with diverse formats.

Technically, Hadoop is a type of NoSQL datastore, and the Hadoop ecosystem includes popular NoSQL datastores like HBase. In addition, some NoSQL datastores can leverage HDFS or other components of the Hadoop ecosystem.

There are four major types of NoSQL datastore:

  • Key-value datastore: An approach to data storage where records in a table can have varying field structures, in contrast to relational databases where each record has the same field structure. Each record has a unique key. Examples include Redis, Memcached, DynamoDB, and Riak.

  • Wide column datastore: A hybrid of the key-value datastore and columnar datastore; highly scalable. Examples include Apache Cassandra and Apache HBase.

  • Document datastore: Database designed for storage and retrieval of documents, such as news articles, SEC filings, or research papers. Examples include MongoDB, CouchDB, and Couchbase.

  • Graph database: Datastore designed to represent data in the form of a mathematical graph, with nodes, edges, and properties. Neo4j is the leading example in this category.

There are many NoSQL databases currently in use. We briefly describe the five most popular13 here:

  • MongoDB: The most popular document database, widely used as the backend for production web sites and services. Developed in 2007 by MongoDB Inc., which provides commercial support; the software is free and open source under the GNU AGPL and Apache licenses.

  • Apache Cassandra: A distributed wide column datastore, developed by Facebook and donated to open source in 2008. The Apache Software Foundation accepted Cassandra as an incubator project in 2009 and promoted it to top level status in 2010.

  • Redis: A high-performance in-memory key-value datastore. Developed and supported by Redis Labs, the database is free and open source under a BSD license. Redis Labs offers Redis Cloud and Memcached Cloud as database services.

  • Apache HBase: A high-performance NoSQL columnar datastore based on Google BigTable. In 2006, developers at Powerset used Google’s BigTable design as the model for their own massively scalable columnar datastore to support a natural language search engine. Microsoft acquired Powerset in 2008 and donated the datastore assets to the Apache Software Foundation. At first, development proceeded as a Hadoop subproject; in 2010, Apache promoted the project to top-level status as Apache HBase.

  • Neo4j: An open source graph database, developed and supported by Neo Technology. Neo licenses the software under GPL and AGPL open source licenses and an enhanced version under a commercial license. The first release of the product was in 2010.

At present, NoSQL datastores are used most heavily in operational applications, with limited use as analytic datastores. This is primarily due to the lack of a standard interface equivalent to SQL. Analytic programming languages, such as Python and R, can interact with the most popular databases, but use of these tools is limited to power analysts and developers.
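As a simple illustration of how an analyst can reach a NoSQL datastore from Python, the sketch below writes and reads documents in MongoDB through the pymongo package. The server location, database, and field names are assumptions made for the example; the point is that documents in one collection need not share a structure, and structure is imposed at query time.

# A minimal sketch of working with a document datastore from Python.
# Assumes a MongoDB instance on localhost; the database, collection, and
# field names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
articles = client["newsroom"]["articles"]

# Documents in the same collection need not share the same fields.
articles.insert_one({"title": "Q2 earnings", "tags": ["finance"], "words": 842})
articles.insert_one({"title": "Product launch", "video_url": "http://example.com/v1"})

# Structure is imposed at query time, not at load time.
for doc in articles.find({"tags": "finance"}, {"title": 1, "words": 1}):
    print(doc)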

Industry analysts expect a shakeout in the category, arguing14 that even the most popular NoSQL databases lack a sustainable business model.

Analytics in Hadoop

Within the Hadoop ecosystem, there are a number of open source projects that support analytics: Hive, Impala, Spark, Drill, Presto, and Phoenix for SQL; Kylin for OLAP; Mahout, Spark, Flink, and H2O for machine learning; Giraph and Spark GraphX for graph analytics; Zeppelin for interactive notebooks; and Lucene/Solr for search and text analytics. In Hadoop 2.0, there are also a growing number of commercial software packages for analytics.

We divide this section into two parts. In the first, we cover analytics available in Hadoop 1.0, which were rudimentary compared to what was available concurrently outside of Hadoop. In the second part, we cover analytics in Hadoop 2.0. We limit the scope of that review to SQL engines, deferring coverage of machine learning engines to Chapter Eight, and self-service BI to Chapter Nine.

Hadoop 1.0

Until the Apache Hadoop team introduced YARN in 2013, analytic workloads ran in MapReduce, because it was not possible to run alternative programming frameworks concurrently in Hadoop. Consequently, analytic tools served as brokers, translating higher-level tasks into MapReduce commands, submitting them for execution and handling the result set.

We discuss four open source projects and one commercial offering in this section. We also note that three open source analytics projects discussed elsewhere in this book—Jaspersoft15, Pentaho,16 and Talend17—pioneered integration with Hadoop.

Two additional projects are of historical interest. The rhadoop18 project, sponsored by Revolution Analytics, supports connectivity to HDFS, HBase, and Avro, and enables dplyr-like operations in MapReduce for R users. GitHub statistics show19 minimal activity since 2013, and virtually no activity since Microsoft acquired Revolution Analytics in 2015. The RHIPE20 project offers comparable capability for R users. Code contributions are sporadic.

The limitations of MapReduce discussed earlier in this chapter impaired the performance and utility of these early efforts to bring analytics to Hadoop. Subsequently, developers have re-engineered Apache Hive to support Spark and Tez, effectively upgrading it for use under Hadoop 2.0.

Apache Hive

Hive is an open source data warehouse environment designed to support SQL-like queries in Hadoop. Originally designed to run exclusively in MapReduce, Hive is now able to execute queries in Apache Tez. Hive is the most mature and most widely distributed SQL-on-Hadoop project.

Developers at Facebook started working on Hive in 2007 and donated the code to the Apache Software Foundation in 2008 as a Hadoop contributed subproject21. In September 2010, Hive graduated22 to top-level Apache project status. Hive’s committers include developers from technology leaders such as Cloudera, Dropbox, Facebook, Hortonworks, InMobi, Intel, LinkedIn, Microsoft, NexR, Nutanix, Qubole, and Yahoo.23

Prior to 2014, Hive executed queries through MapReduce . As a result, Hive’s runtime performance was relatively slow, which made it best suited for Batch SQL. Under the Stinger and Stinger.next projects, Hortonworks has invested in Hive improvements, with the following primary goals:

  • Improve performance to sub-second response time.

  • Expand scalability to petabyte data volume.

  • Enhance SQL support to full ANSI standard.

Additional enhancements planned under Stinger.next include:

  • Streaming data ingestion

  • Cross-geo queries, which is the ability to query data sets distributed across geographic areas

  • Materialized views, which are multiple views of the data held in memory

  • Ease-of-use enhancements

  • Simplified deployment

For performance improvements, the Stinger team rebuilt Hive to leverage Apache Tez , an application framework that creates a more efficient execution plan than generic MapReduce. Tez models processing logic as a Directed Acyclic Graph (DAG) , then dynamically reconfigures the graph for more efficient logic. According to Hortonworks, Hive on Tez runs on average 52 times faster than conventional Hive for the TPC-DS benchmark.24

Concurrent with the Stinger Initiative, a team at Cloudera Labs ported Hive to run on Apache Spark . The team released25 Hive-on-Spark to General Availability in April, 2016.

Hive is distributed and supported in every commercial product powered by Hadoop, including Cloudera CDH, MapR, Hortonworks HDP, and IBM Infosphere BigInsights. A modified version of Hive is available as a cloud service in Amazon Web Services’ Elastic MapReduce (EMR). This version includes the ability to import data and write back to Amazon S3 and specify an external metadata library.

Hive Server is an interface that enables users to submit queries to Hive for execution and retrieve results. (The most current version, HiveServer2, has largely displaced the original HiveServer1.) HiveServer2 supports authentication with popular security protocols, including Kerberos, SASL, LDAP, Pluggable Custom Authentication, and Pluggable Authentication Modules.

Hive supports a SQL-like language called HiveQL. HiveQL includes a subset of ANSI SQL features, plus additional support for MapReduce, JSON, and Thrift. Users can extend Hive functionality through Java User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and User-Defined Table Functions (UDTFs).

Hive works with data in HDFS, HBase, and compatible file systems, including Amazon S3.
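As a minimal sketch of how a query reaches Hive through HiveServer2, the Python fragment below uses the PyHive package. The host, user, and table are illustrative, and a production cluster would typically require one of the authentication mechanisms listed above.

# A minimal sketch of submitting HiveQL to HiveServer2 from Python with the
# PyHive package. Host, port, user, and table names are illustrative.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# An ordinary aggregate query; Hive compiles it into MapReduce, Tez, or Spark
# jobs depending on the execution engine configured for the cluster.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    WHERE dt = '2016-07-01'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

for page, hits in cursor.fetchall():
    print(page, hits)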

According to statistics in OpenHub26, Hive is a very active project, with 140 contributors and more than a million lines of code. The code base has expanded steadily, and the pace of development has accelerated markedly since 2013.

The OpenHub database, a project of Black Duck Software, crawls open source code repositories to develop statistics on contributors and code commits.

Apache Pig

Apache Pig is a top-level Apache project. It includes a high-level SQL-like analytic language (“Pig Latin”) coupled to a compiler that converts this language into MapReduce.

Developers at Yahoo started27 work on Pig as a research project in the summer of 2006. Yahoo donated the software to the Apache Software Foundation, which accepted28 the project for incubation in 2007. A year later, Pig graduated to top-level status.

Within the Pig team, there are ongoing projects to port Pig to Tez and to Spark.

According to OpenHub, Pig is much less active than Hive, with 28 contributors and 376 thousand lines of code. The code base peaked in 2011, with very little new activity since then.

Apache Mahout

Apache Mahout is an open source project for machine learning started29 in 2008 by several developers from the Apache Lucene team. Initially inspired by a paper30 authored by researchers at Stanford, the project evolved—some would say devolved—to include a mix of approaches. Instead of fostering innovation, this loss of discipline produced a loose collection of developed and contributed algorithms.

Some of Mahout’s algorithms used MapReduce, others did not; some were distributed, while others ran on a single node. Many of the contributed algorithms were subsequently deprecated and removed from the project for lack of interest, use, and support.

Most of the remaining algorithms use MapReduce, but since 2014 all new algorithms must use Spark.

Beginning with version 0.11.1, which the team released in November 2015, the project includes Samsara, a math environment for linear algebra that runs on Spark or H2O.

Mahout’s code base grew31 slowly until 2012, but has not grown at all since then. It is virtually a dead project, with no code contributions in the 12 months prior to May 2016.

Apache Giraph

Graph engines perform calculations at scale on data represented as a mathematical graph. Apache Giraph is an open source iterative graph processing engine inspired by the Google Pregel graph engine described32 in a 2010 paper; it uses a parallelization approach called Bulk Synchronous Parallel (BSP) processing. BSP provides a framework for managing processes and communications within a distributed processing system.

Giraph implements the Pregel architecture in MapReduce and works with HDFS files or Hive tables. Giraph extends the basic Pregel model with additional functionality for better performance, scalability, utility, and fault tolerance.
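The Pregel model is easiest to see in a small sketch. The Python code below is a conceptual illustration only (Giraph itself exposes a Java API), but it shows the vertex-centric BSP pattern: each vertex runs a compute step once per superstep, reads the messages sent to it in the previous superstep, and sends messages along its edges until no messages remain in flight.

# A conceptual sketch of Pregel-style Bulk Synchronous Parallel computation,
# the model Giraph implements. This is not Giraph's API; it only illustrates
# the vertex-centric superstep pattern.
INF = float("inf")

def compute(vertex, messages, graph, source):
    # Single-source shortest path: adopt the best distance seen so far and,
    # if it improved, propagate new candidate distances to the neighbors.
    base = 0 if vertex["id"] == source else INF
    candidate = min(messages + [base])
    if candidate < vertex["value"]:
        vertex["value"] = candidate
        return {nbr: candidate + w for nbr, w in graph[vertex["id"]].items()}
    return {}  # nothing to send: this vertex effectively votes to halt

def run(graph, source):
    vertices = {v: {"id": v, "value": INF} for v in graph}
    inbox = {v: [] for v in graph}
    for _ in range(len(graph)):  # one iteration per superstep
        outbox = {v: [] for v in graph}
        messages_sent = False
        for v in graph:  # in Giraph, this loop runs in parallel across workers
            for nbr, dist in compute(vertices[v], inbox[v], graph, source).items():
                outbox[nbr].append(dist)
                messages_sent = True
        inbox = outbox  # the synchronization barrier between supersteps
        if not messages_sent:
            break
    return {v: vertices[v]["value"] for v in graph}

# A tiny weighted graph, keyed by vertex id; edge weights are distances.
graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}}
print(run(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3}

The computation here finds single-source shortest paths, one of the examples in the original Pregel paper; the same superstep structure supports PageRank, connected components, and similar graph algorithms.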

Facebook uses33 Giraph to analyze the social graph of its users; the graph has more than a trillion edges. To meet requirements, the Facebook developer team modified the software and contributed the enhancements back to the open source project. The enhanced version of Giraph can:

  • Read edge and node data from different data sources

  • Work with any number of data sources

  • Support multi-threading on Hadoop worker machines

  • Make optimal use of memory on each machine

  • Balance aggregation workload across machines in the Hadoop cluster

There was a surge of interest in Giraph in 2013, when Facebook first published34 results of its assessment. Giraph’s code base has grown35 slowly but steadily since then.

Datameer

Datameer is a commercial software venture that pioneered business intelligence on Hadoop. Ajay Anand and Stefan Groschupf co-founded36 Datameer in 2009. At the time, there were few options for analyzing data managed in Hadoop; one could write MapReduce expressions, program in Pig, or perform rudimentary SQL in an early edition of Hive. Any of these options required programming and technical skills not widely available among business analysts. Datameer set out to make Hadoop data accessible.

Datameer’s product has evolved significantly since 2009, and it is now in Release 6.0 (as of May 2016). The software consists of an application server and database server that reside on an edge node of a Hadoop cluster.

The Datameer application server accepts requests from end users, who work from a browser-based interface, and translates the user requests into Hadoop operations. The application evaluates the user request and submits it using one of four execution frameworks:

  • MapReduce: The default computing framework for Hadoop, to be used if alternatives are unavailable.

  • Optimized MapReduce: When available, Datameer uses MapReduce on Apache Tez. (For more on Tez, see the previous section on Apache Hive.)

  • Spark: Used when Spark is available.

  • Single-Node Execution: For small jobs, Datameer runs the request on a single node of the Hadoop cluster.

Datameer comprehensively supports major Hadoop distributions. It can also import data from a wide variety of data sources, including relational databases, NoSQL databases, Hive, Microsoft Office formats, social media files, HTML, JSON, and many others. Data export capabilities are more limited. It is also important to note that Datameer must physically extract and move data to Hadoop for processing, as it lacks a facility for SQL pass-through to external databases.

The browser-based Datameer user interface displays results in a spreadsheet-like display, to which the user can add functions and charts.

Datameer has raised $76.5 million in five rounds of venture funding. Its most recent funding was a $40 million “D” round led by ST Telemedia, an investment firm based in Singapore. At the time of the funding announcement, Datameer claimed37 to have 200 customers.

Hadoop 2.0

As stated previously, we limit this review to SQL engines and defer discussion of other analytic tools to later chapters.

SQL is a foundation tool in analytics, required for almost every project; many projects require nothing but SQL. SQL engines are an organic part of relational databases; organizations map data into the SQL framework when they load it into the database.

In contrast, SQL is not built into Apache Hadoop; instead, separate components called SQL engines deliver SQL capabilities. A SQL engine accepts SQL commands from the user, generates one or more requests for data, submits the requests to one or more datastores, and returns the result set to the user.

While these engines are frequently called SQL-on-Hadoop engines, the term is imprecise. Some engines, like Hive and Impala, run only in Hadoop with Hadoop data sources. Others, like Spark, Drill, and Presto, can run in Hadoop or outside of Hadoop, and can work with many different data sources.

Unbundling the SQL engine from the datastore offers a number of potential benefits. The first of these is specialization: SQL engine developers can focus development on the SQL interface and query parser, while other developers enhance the datastore itself.

Separating SQL engines from the file system also enables the user to choose among multiple options. Since modern SQL engines are not commercially tied to a particular datastore, organizations can implement a “best-in-class” architecture, mixing and matching SQL engines to end user needs. Operating independently of the datastore also offers the potential to “federate” queries across multiple datastores.

Hadoop stores data without first mapping it into a SQL framework. Consequently, in Hadoop a SQL user must map data into a table structure before running queries.
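As a small illustration of that schema-on-read step, the sketch below overlays a table definition on files that already sit in HDFS, using a Hive external table submitted through the PyHive package. The host, directory, and column names are assumptions made for the example.

# A minimal schema-on-read sketch: the files already sit in HDFS, and the DDL
# below merely overlays a table structure on them. The same HiveQL works from
# the Hive shell or any JDBC client; host, path, and columns are illustrative.
from pyhive import hive

cursor = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

cursor.execute(r"""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
        event_time STRING,
        user_id    STRING,
        url        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/clickstream'
""")

# The raw files can now be queried like any other Hive table.
cursor.execute("SELECT COUNT(*) FROM clickstream")
print(cursor.fetchone())

Because the table is external, dropping it later removes only the definition; the files in HDFS are untouched.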

Modern SQL engines support one or more of the following three modes of SQL processing :

  • Batch: SQL scripts run without human supervision or attention on static data. Batch mode is suitable for long-running queries or queries that are scheduled for repeated execution. Typically used for Extract, Transform, and Load (ETL) processing, scoring, or scheduled reports.

  • Interactive: SQL scripts run on static data while the user awaits a response. This mode generally requires lower latency: up to 20 minutes. Typically used for ad hoc queries and discovery, where questions may be nested.

  • Streaming: SQL scripts run continuously on dynamic data over a sliding time window. Streaming mode requires very high performance. Typically used for algorithmic trading, real-time ad targeting, and similar applications.

Hive, Spark SQL, and Impala support HiveQL; Presto and Drill support ANSI SQL. All engines support User Defined Functions (UDFs) . Hive and Spark offer query fault tolerance; the rest do not. Hive, Spark SQL, and Impala are the most mature and feature-rich; Presto and Drill are the least mature.

User Defined Functions (UDFs) are expressions, functions, or code snippets provided by a database user to supplement built-in functions. UDF support varies with each platform and can include the ability to run programs written in languages such as C, Java, Python, and R.

There are now far too many SQL engines for Hadoop to cover all of them in detail. We cover Impala, Drill, and Presto in this chapter and Spark SQL in Chapter Five. Several additional projects merit brief coverage:

  • Apache Tajo: Tajo is an open source high-performance SQL engine. The Apache Software Foundation accepted Tajo for incubation in March 2013 and promoted it to top-level status in March, 2014. Gruter, a startup based in South Korea, leads Tajo development and offers commercial support.

  • Apache Kylin: Kylin offers fast interactive ANSI SQL and MOLAP cube capabilities together with integration with BI tools and enterprise security. Originally developed by eBay, Kylin was accepted as an Apache Incubator project in November 2014 and graduated to top-level status in November 2015.

  • Apache Phoenix: Phoenix is a relational database framework designed to integrate with Apache HBase; it includes a query engine, metadata repository, and JDBC driver. Phoenix converts user requests into native HBase calls, bypassing MapReduce; this enables it to run much faster than early versions of Hive. Engineers at Salesforce.com developed Phoenix for internal applications; the company donated38 the project to open source in 2013. The Apache Software Foundation promoted39 Phoenix to top-level status in May 2014. Hortonworks includes Phoenix in its product.

  • Apache Trafodion: Trafodion is a SQL-on-HBase engine that offers ANSI SQL support and ODBC/JDBC connectivity for BI. HP Labs launched Trafodion as an open source project in June 2014 and released it to production in January 2015. The Apache Software Foundation accepted Trafodion as an incubator project in May 2015.

  • Apache HAWQ: Currently in Apache Incubator status, HAWQ is a SQL on Hadoop engine evolved from Pivotal Greenplum Database. Pivotal Software donated the project to open source in June 2015.

  • IBM Big SQL: Big SQL is IBM’s SQL interface to its Hadoop distribution, InfoSphere BigInsights; it can query HDFS, HBase, and special tables created by Big SQL itself. The product includes a facility to copy data from relational databases into Big SQL tables.

  • Oracle Big Data SQL: Big Data SQL is an Oracle product available only on the Oracle Big Data Appliance. It federates queries across Oracle Database, Hadoop, and Oracle NoSQL Database in a single query. The product operates through external table extensions to Oracle Database and offers a capability called Query Franchising, through which agents on the data subsystems execute the query using equivalent operators.

Many different factors influence runtime performance of the engines. These factors include the volume and type of data, storage formats, system infrastructure, deployment characteristics, and the nature of queries tested. Not surprisingly, published benchmarks produce conflicting results due to differences in the factors cited here.

Apache Spark

Apache Spark is a distributed in-memory computing framework, with libraries for SQL, streaming, machine learning, and graph analytics. With more than 1,000 contributors, it is the most active Apache project, the most active project in the Hadoop ecosystem, and the most active project anywhere in Big Data. Given its significance, we treat Spark separately in Chapter Five, which covers in-memory analytics.

Apache Impala

Impala is a massively parallel processing (MPP) SQL platform for Hadoop. Developed and maintained by Cloudera, the software is free and open source under an Apache license. Cloudera announced40 the project in October 2012 and released41 it to general availability in May 2013. In 2015, Cloudera donated Impala to the Apache Software Foundation.

Apache Impala runs fast interactive SQL queries on data stored in popular Apache Hadoop file formats. Impala integrates with the Apache Hive metastore to share databases and tables between the components. This enables users to freely choose to work with Impala or Hive without moving or duplicating data.

Impala supports HiveQL, with built-in functions for mathematics, data type conversion, date and time operations, conditional expressions, and string functions. For more advanced operations, Impala supports aggregate functions with statistics such as count, sum, mean, and median; and window functions for ordered and grouped statistics. Finally, Impala supports basic user-defined functions (UDFs) and user-defined aggregate functions (UDAFs) for custom operations.

Users working with Impala submit commands through the Impala Shell or through any ODBC- or JDBC-compliant client.

Impala operates directly with the stored data, and does not use an intermediate computing layer (such as MapReduce or Spark). It works with HBase and HDFS, including common formats: text, SequenceFiles, RCFiles, Apache Avro, and Apache Parquet. Impala can also query data stored in Amazon Web Services’ S3.
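A minimal sketch of an interactive Impala query from Python follows, using the impyla package; the host, port, and table name are illustrative. Because Impala shares the Hive metastore, a table defined through Hive (such as the external table sketched earlier) is visible here without moving or copying data.

# A minimal sketch of an interactive query against Impala with the impyla
# package. Host, port, and table name are illustrative; a secured cluster
# would need additional connection options.
from impala.dbapi import connect

conn = connect(host="impalad.example.com", port=21050)
cursor = conn.cursor()

# Impala executes the query with its own MPP engine; no MapReduce jobs run.
cursor.execute("""
    SELECT url, COUNT(*) AS hits
    FROM clickstream
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
print(cursor.fetchall())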

In early 2014, IBM Research published42 the results of a performance test comparing Hive, Hive on Tez, and Impala with different file formats and compression codecs. The tests used standard benchmarks published by the Transaction Processing Performance Council.

In IBM’s testing, Impala ran 3-4 times faster than Hive on MapReduce and 2-3 times faster than Hive on Tez for TPC-H benchmarks. For TPC-DS benchmarks, Impala ran 8-10 times faster than Hive on MapReduce and about 4 times faster than Hive on Tez.

TPC-H and TPC-DS are standard decision support benchmarks published by the Transaction Processing Performance Council. The benchmarks consist of a series of ad hoc queries that simulate the work of a typical user.

Later in 2014, Cloudera published43 results of its own benchmark testing comparing performance of Impala, Spark SQL, Facebook Presto, and Hive on Tez. For single-user queries, Impala outperformed all alternatives, running on average seven times faster. For multi-user queries, Impala widened the performance gap, running on average 13 times faster.

Cloudera, MapR, Oracle, and Amazon Web Services distribute Impala; Cloudera, MapR, and Oracle provide commercial build and installation support.

Apache Drill

Apache Drill is an open source distributed software framework for interactive analysis.

In 2010, a group of Google engineers published44 a paper describing a distributed system for interactive ad hoc query analysis designed to aggregate trillion-row tables in seconds. They called the system Dremel. Dremel inspired Google’s BigQuery, an interactive ad hoc query system hosted in the Google cloud.

A group of contributors led by Ted Dunning of MapR proposed to develop an open source version of Dremel, renamed Drill. In September, 2012, the Apache Software Foundation accepted Drill as an incubator project. Drill graduated to top-level project status in December 2014. In June 2015, the team published Release 1.0 of the software. Drill’s developer community includes employees of MapR, Intuit, Hortonworks, Elastic, LinkedIn, Pentaho, Cisco, and the University of Wisconsin, among others.

Drill offers the user the ability to query data with or without predefined schemas, and it can query unstructured or nested data. It offers full ANSI-standard SQL, and it integrates with widely used BI tools. Drill can federate queries across multiple data sources.

Prospective users can download and install Drill directly from the project web site; MapR also distributes Drill as an add-on to its Hadoop distribution.

Users can deploy Drill as a standalone application, or on any Hadoop cluster.

End users can query Drill with BI tools (such as Tableau, MicroStrategy, or Microsoft Excel) through ODBC and JDBC drivers. Drill also supports a REST API for custom applications as well as Java and C applications.
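As a rough sketch of the REST route, the fragment below posts a SQL statement to a Drill node with the Python requests package. The port and endpoint follow Drill’s documented defaults, but the host and the file being queried are invented for the example. Note that Drill reads the raw JSON file directly, with no table definition declared in advance.

# A minimal sketch of Drill's REST interface: SQL is posted as JSON over HTTP
# and results come back as JSON rows. Port 8047 and the /query.json path are
# Drill's documented defaults; the host and file path are illustrative.
import requests

payload = {
    "queryType": "SQL",
    "query": "SELECT t.user_id, t.event FROM dfs.`/data/raw/events.json` t LIMIT 5",
}

resp = requests.post("http://drillbit.example.com:8047/query.json", json=payload)
resp.raise_for_status()

for row in resp.json()["rows"]:
    print(row)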

Drill currently supports the following data sources:

  • Hadoop: Apache Hadoop, MapR, Cloudera CDH, and Amazon EMR

  • NoSQL: MongoDB and HBase

  • Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, and Swift

While the Apache Drill project team claims trillion-row scalability, there are no published benchmarks or reference users as of June 2015. Drill has high potential as an interactive query tool, but limited commercial adoption at present.

Drill has an active contributor base, and its code base has steadily expanded45 since mid-2014.

Presto

Facebook has one of the largest active data warehouses, with more than 300 petabytes of data stored in a few large Hadoop clusters. Addressing a need for SQL connectivity to this asset, Facebook engineers created Hive in 2007 to query this massive datastore.

As we noted previously, in its original architecture Hive used MapReduce as a compute engine, which made it suitable for batch queries only. In 2012, Facebook engineers developed a better tool for fast interactive SQL, called Presto. After extensive internal testing and rollout to the internal user community, Facebook shared the code with some other organizations for testing and feedback. In late 2013, Facebook donated the Presto code to open source and made it generally available to any user under an Apache 2.0 license.

Facebook reports that it has successfully scaled a Presto cluster to 1,000 nodes. The company also reports46 that more than 1,000 employees run queries on Presto, and they run more than 30,000 queries per day on more than a petabyte of data. In May 2015, developers reported significant speedup in new releases through continuous improvement.47

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases, and proprietary data stores (such as Amazon Web Services’ S3). Presto queries can federate data from multiple sources. Users can submit queries from C, Java, Node.js, PHP, Python, R, and Ruby. Airpal, a web-based query execution tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser.
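As a brief sketch of a Python client, the fragment below submits a query to a Presto coordinator through the PyHive package’s presto module. The host, port, catalog, and table names are assumptions; the catalog.schema.table addressing is what lets a single Presto query reach Hive, Cassandra, or a relational source.

# A minimal sketch of querying Presto from Python via PyHive's presto module.
# Host, port, user, catalog, and table names are illustrative.
from pyhive import presto

cursor = presto.connect(host="presto.example.com", port=8080, username="analyst").cursor()

# Presto addresses tables as catalog.schema.table; here "hive" is the Hive
# catalog configured on the coordinator.
cursor.execute("""
    SELECT dt, COUNT(*) AS sessions
    FROM hive.web.clickstream
    GROUP BY dt
    ORDER BY dt
""")
for row in cursor.fetchall():
    print(row)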

Presto’s user base currently includes Facebook, Airbnb, and Dropbox.

Organizations can deploy Presto on-premises or in the cloud through Qubole. In June, 2015, Teradata announced48 plans to develop and support the project. Under an announced three-phase program, Teradata proposes to integrate Presto into the Hadoop ecosystem, enable operation under YARN, and enhance connectivity49 through ODBC and JDBC.

OpenHub statistics show50 that Presto is a very active project, with a steadily expanding code base.

Summary

While there is some debate among analysts about Hadoop’s growth rate and enterprise penetration, there is no doubt that the Hadoop ecosystem is growing rapidly. We know this from disclosures by Cloudera, Hortonworks, MapR, Amazon Web Services, and other companies that play a key role in the ecosystem.

There are two reasons for this growth. The first is the tsunami of text, images, audio, video, and other data that is unsuited to relational databases, as noted in Chapter Two. Economics is the second reason. Hadoop’s cost per terabyte is well below the cost of alternative data management tools.

Under Hadoop 1.0 prior to 2014, Hadoop’s ability to handle non-relational data was its primary justification. It was not a credible competitor to data warehouse appliances for analytics, because it was too immature, unstable, rudimentary, slow, and hard to use.

Since the implementation of Hadoop 2.0, the number of commercial and open source software packages for Hadoop has exploded. The ability to run alternative programming frameworks alongside MapReduce makes it possible to run even the most complex analyses.

Concurrently, the platform itself has matured. Thanks to the contributions of commercial vendors engaged with enterprise customers, Hadoop is increasingly stable and secure. Projects to improve performance, such as Tez and Spark, make Hadoop a credible platform for interactive analytics. We will cover Spark in detail in Chapter Five.

For SQL analytics, Hadoop is increasingly competitive with relational databases. Under Hadoop 1.0, Hive and Pig were suitable for high-latency batch programs; semantically, they supported a subset of ANSI SQL. Thanks to the Stinger project, Hive is now competitive with other tools for interactive analysis, and so are Impala, Spark SQL, Drill, and Presto. And, they are rapidly approaching full ANSI SQL compliance.

Moreover, Hadoop is increasingly accessible to the business user. Thanks to a number of innovations in self-service analytics, it is now possible for business users to work directly with Hadoop using tools like Excel and Tableau. We cover these innovations in Chapter Nine.

As Hadoop matures, it competes with conventional data warehouses, and its attractive economics matter. Enterprises with existing warehouses have a powerful economic incentive to substitute Hadoop for existing platforms as those platforms approach end of life.

Hadoop’s economics make it attractive for use cases that cannot support the costs of conventional data warehouses. More importantly, Hadoop’s minimal sunk costs make it ideal for use cases where the returns are simply unknown. Thus, Hadoop becomes the platform of choice for labs, sandboxes, skunk works, and innovations of all kinds.

Footnotes

9 Apache Hadoop distributed 6 releases in 2011; 13 in 2012; 15 in 2013; 8 in 2014; and 5 in 2015.
