Exploring Hadoop features

Apache Hadoop has two main features:

  • HDFS (Hadoop Distributed File System)
  • MapReduce
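The MapReduce model is easiest to grasp through the classic word-count example. The following is a minimal, self-contained Python sketch that mimics the map, shuffle, and reduce phases in memory; on a real cluster, Hadoop distributes these phases across many nodes (for instance via the Hadoop Streaming API), but the logical flow is the same. The function names here are illustrative, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.strip().split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum the counts collected for a single word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle phase: sort intermediate pairs so identical keys are
    # adjacent, then group them and hand each group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

if __name__ == "__main__":
    text = ["Hadoop stores data in HDFS",
            "MapReduce processes data in parallel"]
    print(map_reduce(text))
```

Because the mapper and reducer only ever see one line or one key group at a time, the same logic scales from this toy script to terabytes of input when Hadoop handles the distribution.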

Studying Hadoop components

Hadoop includes an ecosystem of other products built over the core HDFS and MapReduce layer to enable various types of operations on the platform. A few popular Hadoop components are as follows:

  • Mahout: This is an extensive library of machine learning algorithms.
  • Pig: Pig is a high-level platform for analyzing large datasets. It provides its own scripting language (comparable to a scripting language such as Perl) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in HDFS. It has its own SQL-like query language called Hive Query Language (HQL), which is used to issue query commands to Hadoop.
  • HBase: HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and atomic queries (random reads).
  • Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured relational databases. The name Sqoop is a contraction of (SQ)L to Had(oop).
  • ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services, all of which are very useful for a variety of distributed systems.
  • Ambari: Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Understanding the reason for using R and Hadoop together

In many cases, the data already resides in HDFS (in various formats). Since a lot of data analysts are very productive in R, it is natural to use R to compute with the data stored through Hadoop-related tools.

As mentioned earlier, the strength of R lies in its ability to analyze data using a rich library of packages, but it falls short when it comes to working on very large datasets. The strength of Hadoop, on the other hand, is to store and process very large amounts of data, in the TB and even PB range. Such vast datasets cannot be processed in memory because the RAM of each machine cannot hold them. One option would be to run the analysis on limited chunks of the data, also known as sampling; the other is to combine the analytical power of R with the storage and processing power of Hadoop, which yields an ideal solution. Such solutions can also be achieved in the cloud using platforms such as Amazon EMR.
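The division of labor described above rests on a simple idea: each machine reduces its own chunk of data to a small partial summary, and only those summaries are combined, so no single machine ever needs the full dataset in memory. A minimal Python sketch of this pattern (simulating the distributed chunks sequentially; all names are illustrative, not part of any R or Hadoop API) looks like this:

```python
def partial_stats(chunk):
    # Each "node" reduces its chunk to a tiny summary: (sum, count).
    return (sum(chunk), len(chunk))

def combine(partials):
    # Combining the partial summaries gives the exact global mean
    # without ever materializing the whole dataset on one machine.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

if __name__ == "__main__":
    # Simulate a dataset split into chunks stored on different machines.
    chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(combine(partial_stats(c) for c in chunks))
```

Unlike sampling, this style of aggregation is exact: the combined result is identical to what a single machine with unlimited RAM would compute, which is precisely what makes pairing R's analytics with Hadoop's distributed processing attractive.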
