Tools and techniques

Let's take a look at different tools and techniques used in Hadoop and Spark for Big Data analytics.

While the Hadoop platform provides both storage (HDFS) and processing (MapReduce), Spark is a processing engine only: it has no storage layer of its own and processes data by reading it into memory from an underlying store such as HDFS.
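To make the distinction concrete, here is a conceptual sketch in plain Python (not the real Hadoop or Spark APIs): MapReduce materializes the full output of each stage before the next one runs, while Spark chains transformations over an in-memory collection.

```python
from collections import Counter

lines = ["big data", "big analytics", "data data"]

# MapReduce style: the map stage fully materializes its output
# before the reduce stage consumes it.
mapped = [(word, 1) for line in lines for word in line.split()]  # map stage
counts = {}
for word, one in mapped:                                         # reduce stage
    counts[word] = counts.get(word, 0) + one

# Spark style: one chained pass over the in-memory data.
in_memory_counts = Counter(word for line in lines for word in line.split())

assert counts == dict(in_memory_counts)  # same result, different execution model
```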

The following is a summary of the tools and techniques used in typical Big Data analytics projects, organized by activity:

 

Data collection

Tools used:

- Apache Flume for real-time data collection and aggregation
- Apache Sqoop for data import and export from relational data stores and NoSQL databases
- Apache Kafka for publish-subscribe messaging
- General-purpose tools such as FTP/Copy

Techniques used:

- Real-time data capture
- Export
- Import
- Message publishing
- Data APIs
- Screen scraping
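The publish-subscribe pattern that Kafka implements at scale can be sketched in a few lines of plain Python: messages are appended to a topic's log, and each consumer tracks its own read offset independently. The class and method names below are illustrative, not Kafka's actual API.

```python
class Topic:
    """A toy in-memory topic: an append-only log plus per-consumer offsets."""

    def __init__(self):
        self.log = []        # append-only message log
        self.offsets = {}    # consumer name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def consume(self, consumer):
        start = self.offsets.get(consumer, 0)
        messages = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return messages

clicks = Topic()
clicks.publish({"user": "a", "page": "/home"})
clicks.publish({"user": "b", "page": "/cart"})

first = clicks.consume("analytics")   # both messages
second = clicks.consume("analytics")  # [] -- nothing new since last read
audit = clicks.consume("audit")       # independent consumer sees both
```

Because consumers only advance their own offsets, many independent consumers can read the same stream at their own pace, which is the core idea behind Kafka's decoupling of producers and consumers.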

Data storage and formats

Tools used:

- HDFS: Primary storage of Hadoop
- HBase: NoSQL database
- Parquet: Columnar format
- Avro: Serialization system on Hadoop
- Sequence File: Binary key-value pairs
- RC File: First columnar format in Hadoop
- ORC File: Optimized RC File
- XML and JSON: Standard data interchange formats
- Compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others
- Unstructured data: text, images, videos, and so on

Techniques used:

- Data storage
- Data archival
- Data compression
- Data serialization
- Schema evolution
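The reason columnar formats such as Parquet, RC, and ORC matter for analytics can be sketched in plain Python: a query that touches one column reads only that column's contiguous values instead of scanning every field of every row. The sample records below are invented for illustration.

```python
# Row layout: whole records stored together (as in a Sequence File or CSV).
rows = [
    {"id": 1, "city": "NYC",    "amount": 120.0},
    {"id": 2, "city": "London", "amount": 80.0},
    {"id": 3, "city": "NYC",    "amount": 50.0},
]
total_from_rows = sum(r["amount"] for r in rows)      # touches every full row

# Columnar layout: each column stored contiguously.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_from_column = sum(columns["amount"])            # touches one column only

assert total_from_rows == total_from_column == 250.0
```

Contiguous columns also compress better (similar values sit next to each other), which is why columnar formats pair so well with the compression codecs listed above.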

Data transformation and enrichment

Tools used:

- MapReduce: Hadoop's processing framework
- Spark: Compute engine
- Hive: Data warehouse and querying
- Pig: Data flow language
- Python: General-purpose programming language
- Crunch, Cascading, Scalding, and Cascalog: Higher-level abstractions over MapReduce

Techniques used:

- Data munging
- Filtering
- Joining
- ETL
- File format conversion
- Anonymization
- Re-identification
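Three of the techniques above — filtering, joining, and anonymization — can be sketched in plain Python. In practice these steps would run as MapReduce, Spark, Hive, or Pig jobs; the datasets and field names here are invented for illustration.

```python
import hashlib

orders = [
    {"order_id": 1, "user": "alice", "total": 30},
    {"order_id": 2, "user": "bob",   "total": 0},
    {"order_id": 3, "user": "alice", "total": 45},
]
users = {"alice": "US", "bob": "UK"}   # lookup table for enrichment

# Filtering: drop empty orders.
valid = [o for o in orders if o["total"] > 0]

# Joining (enrichment): attach the user's country from a second dataset.
enriched = [{**o, "country": users[o["user"]]} for o in valid]

# Anonymization: replace the identifying field with a one-way hash.
for o in enriched:
    o["user"] = hashlib.sha256(o["user"].encode()).hexdigest()[:12]
```

Note that truncated hashes of a small identifier space can still be re-identified by hashing candidate values — which is exactly why re-identification appears alongside anonymization in the techniques list.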

Data analytics

Tools used:

- Hive: Data warehouse and querying
- Pig: Data flow language
- Tez: DAG-based execution engine, an alternative to MapReduce
- Impala: Interactive SQL query engine, an alternative to MapReduce
- Drill: Schema-free SQL query engine, an alternative to MapReduce
- Apache Storm: Real-time compute engine
- Spark Core: Spark's core compute engine
- Spark Streaming: Real-time compute engine
- Spark SQL: For SQL analytics
- Solr: Search platform
- Apache Zeppelin: Web-based notebook
- Jupyter Notebooks
- Databricks cloud
- Apache NiFi: Data flow
- Spark-on-HBase connector
- Programming languages: Java, Scala, and Python

Techniques used:

- Online Analytical Processing (OLAP)
- Data mining
- Data visualization
- Complex event processing
- Real-time stream processing
- Full-text search
- Interactive data analytics
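A typical analytics operation across these tools is the grouped aggregation — what Hive, Impala, or Spark SQL would express as SELECT city, SUM(amount) FROM sales GROUP BY city. Here is the underlying operation sketched in plain Python on made-up data:

```python
from collections import defaultdict

sales = [("NYC", 120.0), ("London", 80.0), ("NYC", 50.0)]

# Equivalent of: SELECT city, SUM(amount) FROM sales GROUP BY city
totals = defaultdict(float)
for city, amount in sales:
    totals[city] += amount

assert dict(totals) == {"NYC": 170.0, "London": 80.0}
```

The SQL engines listed above differ mainly in how they distribute and execute this same logical plan: MapReduce shuffles to disk between stages, Tez and Spark pipeline stages in memory, and Impala and Drill execute it with long-running daemons for interactive latency.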

Data science

Tools used:

- Python: General-purpose programming language
- R: Statistical computing language
- Mahout: Hadoop's machine learning library
- MLlib: Spark's machine learning library
- GraphX and GraphFrames: Spark's graph processing framework and its DataFrame-based counterpart

Techniques used:

- Predictive analytics
- Sentiment analytics
- Text and Natural Language Processing
- Network analytics
- Cluster analytics
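As a toy illustration of the sentiment analytics technique listed above, here is a lexicon-based scorer in plain Python. Real pipelines would use MLlib, Mahout, or NLP libraries with trained models; the word lists here are invented.

```python
POSITIVE = {"great", "good", "love"}   # toy lexicon, not a real resource
NEGATIVE = {"bad", "slow", "hate"}

def sentiment(text):
    """Label text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

assert sentiment("I love this great product") == "positive"
assert sentiment("bad and slow service") == "negative"
assert sentiment("it arrived on time") == "neutral"
```

At Big Data scale, the same per-document function would simply be applied as a map step over a distributed dataset — which is why the data science tools above sit naturally on top of the processing engines from the earlier rows.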
