by Thilina Gunarathne, Srinath Perera
Hadoop MapReduce Cookbook
Table of Contents
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Hadoop Up and Running in a Cluster
Introduction
Setting up Hadoop on your machine
Getting ready
How to do it...
How it works...
Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
Getting ready
How to do it...
How it works...
There's more...
Adding the combiner step to the WordCount MapReduce program
How to do it...
How it works...
There's more...
Setting up HDFS
Getting ready
How to do it...
How it works...
Using HDFS monitoring UI
Getting ready
How to do it...
HDFS basic command-line file operations
Getting ready
How to do it...
How it works...
There's more...
Setting Hadoop in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Running the WordCount program in a distributed cluster environment
Getting ready
How to do it...
How it works...
There's more...
Using MapReduce monitoring UI
How to do it...
How it works...
2. Advanced HDFS
Introduction
Benchmarking HDFS
Getting ready
How to do it...
How it works...
There's more...
See also
Adding a new DataNode
Getting ready
How to do it...
There's more...
Rebalancing HDFS
See also
Decommissioning DataNodes
How to do it...
How it works...
See also
Using multiple disks/volumes and limiting HDFS disk usage
How to do it...
Setting HDFS block size
How to do it...
There's more...
See also
Setting the file replication factor
How to do it...
How it works...
There's more...
See also
Using HDFS Java API
Getting ready
How to do it...
How it works...
There's more...
Configuring the FileSystem object
Retrieving the list of data blocks of a file
See also
Using HDFS C API (libhdfs)
Getting ready
How to do it...
How it works...
There's more...
Configuring using HDFS configuration files
See also
Mounting HDFS (Fuse-DFS)
Getting ready
How to do it...
How it works...
There's more...
Building libhdfs
See also
Merging files in HDFS
How to do it...
How it works...
3. Advanced Hadoop MapReduce Administration
Introduction
Tuning Hadoop configurations for cluster deployments
Getting ready
How to do it...
How it works...
There's more...
Running benchmarks to verify the Hadoop installation
Getting ready
How to do it...
How it works...
There's more...
Reusing Java VMs to improve the performance
How to do it...
How it works...
Fault tolerance and speculative execution
How to do it...
How it works...
Debug scripts – analyzing task failures
Getting ready
How to do it...
How it works...
Setting failure percentages and skipping bad records
Getting ready
How to do it...
How it works...
There's more...
Shared-user Hadoop clusters – using fair and other schedulers
Getting ready
How to do it...
How it works...
There's more...
Hadoop security – integrating with Kerberos
Getting ready
How to do it...
How it works...
Using the Hadoop Tool interface
How to do it...
How it works...
4. Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
How to do it...
There's more...
See also
Implementing a custom Hadoop Writable data type
How to do it...
How it works...
There's more...
See also
Implementing a custom Hadoop key type
How to do it...
How it works...
See also
Emitting data of different value types from a mapper
How to do it...
How it works...
There's more...
See also
Choosing a suitable Hadoop InputFormat for your input data format
How to do it...
How it works...
There's more...
Using multiple input data types and multiple mapper implementations in a single MapReduce application
See also
Adding support for new input data formats – implementing a custom InputFormat
How to do it...
How it works...
There's more...
See also
Formatting the results of MapReduce computations – using Hadoop OutputFormats
How to do it...
How it works...
There's more...
Hadoop intermediate (map to reduce) data partitioning
How to do it...
How it works...
There's more...
TotalOrderPartitioner
KeyFieldBasedPartitioner
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
How to do it...
How it works...
There's more...
Distributing archives using the DistributedCache
Adding resources to the DistributedCache from the command line
Adding resources to the classpath using DistributedCache
See also
Using Hadoop with legacy applications – Hadoop Streaming
How to do it...
How it works...
There's more...
See also
Adding dependencies between MapReduce jobs
How to do it...
How it works...
There's more...
Hadoop counters for reporting custom metrics
How to do it...
How it works...
5. Hadoop Ecosystem
Introduction
Installing HBase
How to do it...
How it works...
There's more...
Data random access using Java client APIs
Getting ready
How to do it...
How it works...
Running MapReduce jobs on HBase (table input/output)
Getting ready
How to do it...
How it works...
Installing Pig
How to do it...
How it works...
There's more...
Running your first Pig command
How to do it...
How it works...
Set operations (join, union) and sorting with Pig
Getting ready
How to do it...
How it works...
There's more...
Installing Hive
Getting ready
How to do it...
How it works...
Running a SQL-style query with Hive
Getting ready
How to do it...
How it works...
Performing a join with Hive
Getting ready
How to do it...
How it works...
There's more...
Installing Mahout
How to do it...
How it works...
Running K-means with Mahout
Getting ready
How to do it...
How it works...
Visualizing K-means results
Getting ready
How to do it...
How it works...
6. Analytics
Introduction
Simple analytics using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Performing Group-By using MapReduce
Getting ready
How to do it...
How it works...
Calculating frequency distributions and sorting using MapReduce
Getting ready
How to do it...
How it works...
Plotting the Hadoop results using GNU Plot
Getting ready
How to do it...
How it works...
There's more...
Calculating histograms using MapReduce
Getting ready
How to do it...
How it works...
Calculating scatter plots using MapReduce
Getting ready
How to do it...
How it works...
Parsing a complex dataset with Hadoop
Getting ready
How to do it...
How it works...
Joining two datasets using MapReduce
Getting ready
How to do it...
How it works...
7. Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Getting ready
How to do it...
How it works...
There's more...
See also
Intra-domain web crawling using Apache Nutch
Getting ready
How to do it...
See also
Indexing and searching web documents using Apache Solr
Getting ready
How to do it...
How it works...
See also
Configuring Apache HBase as the backend data store for Apache Nutch
Getting ready
How to do it...
How it works...
See also
Deploying Apache HBase on a Hadoop cluster
Getting ready
How to do it...
How it works...
See also
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Getting ready
How to do it...
How it works...
See also
ElasticSearch for indexing and searching
Getting ready
How to do it...
How it works...
See also
Generating the in-links graph for crawled web pages
Getting ready
How to do it...
How it works...
See also
8. Classifications, Recommendations, and Finding Relationships
Introduction
Content-based recommendations
Getting ready
How to do it...
How it works...
There's more...
Hierarchical clustering
Getting ready
How to do it...
How it works...
There's more...
Clustering an Amazon sales dataset
Getting ready
How to do it...
How it works...
There's more...
Collaborative filtering-based recommendations
Getting ready
How to do it...
How it works...
Classification using Naive Bayes Classifier
Getting ready
How to do it...
How it works...
Assigning advertisements to keywords using the Adwords balance algorithm
Getting ready
How to do it...
How it works...
There's more...
9. Mass Text Data Processing
Introduction
Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
Getting ready
How to do it...
How it works...
There's more...
See also
Data de-duplication using Hadoop Streaming
Getting ready
How to do it...
How it works...
See also
Loading large datasets to an Apache HBase data store using importtsv and bulkload tools
Getting ready
How to do it...
How it works...
There's more...
Data de-duplication using HBase
See also
Creating TF and TF-IDF vectors for the text data
Getting ready
How to do it...
How it works...
See also
Clustering the text data
Getting ready
How to do it...
How it works...
See also
Topic discovery using Latent Dirichlet Allocation (LDA)
Getting ready
How to do it...
How it works...
See also
Document classification using Mahout Naive Bayes classifier
Getting ready
How to do it...
How it works...
See also
10. Cloud Deployments: Using Hadoop on Clouds
Introduction
Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
Getting ready
How to do it...
See also
Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
How to do it...
There's more...
See also
Executing a Pig script using EMR
How to do it...
There's more...
Starting a Pig interactive session
See also
Executing a Hive script using EMR
How to do it...
There's more...
Starting a Hive interactive session
See also
Creating an Amazon EMR job flow using the Command Line Interface
How to do it...
There's more...
See also
Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR
Getting ready
How to do it...
See also
Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs
How to do it...
There's more...
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
How to do it...
How it works...
See also
Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment
Getting ready
How to do it...
How it works...
See also
Index