by Jon Lentz, Brian Femiano, Jonathan R. Owens
Hadoop Real-World Solutions Cookbook
Table of Contents
Hadoop Real-World Solutions Cookbook
Credits
About the Authors
About the Reviewers
www.packtpub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Hadoop Distributed File System – Importing and Exporting Data
Introduction
Importing and exporting data into HDFS using Hadoop shell commands
Getting ready
How to do it...
How it works...
There's more...
See also
Moving data efficiently between clusters using Distributed Copy
Getting ready
How to do it...
How it works...
There's more...
Importing data from MySQL into HDFS using Sqoop
Getting ready
How to do it...
How it works...
There's more...
See also
Exporting data from HDFS into MySQL using Sqoop
Getting ready
How to do it...
How it works...
See also
Configuring Sqoop for Microsoft SQL Server
Getting ready
How to do it...
How it works...
Exporting data from HDFS into MongoDB
Getting ready
How to do it...
How it works...
Importing data from MongoDB into HDFS
Getting ready
How to do it...
How it works...
Exporting data from HDFS into MongoDB using Pig
Getting ready
How to do it...
How it works...
Using HDFS in a Greenplum external table
Getting ready
How to do it...
How it works...
There's more...
Using Flume to load data into HDFS
Getting ready
How to do it...
How it works...
There's more...
2. HDFS
Introduction
Reading and writing data to HDFS
Getting ready
How to do it...
How it works...
There's more...
Compressing data using LZO
Getting ready
How to do it...
How it works...
There's more...
See also
Reading and writing data to SequenceFiles
Getting ready
How to do it...
How it works...
There's more...
See also
Using Apache Avro to serialize data
Getting ready
How to do it...
How it works...
There's more...
See also
Using Apache Thrift to serialize data
Getting ready
How to do it...
How it works...
See also
Using Protocol Buffers to serialize data
Getting ready
How to do it...
How it works...
Setting the replication factor for HDFS
Getting ready
How to do it...
How it works...
There's more...
See also
Setting the block size for HDFS
Getting ready
How to do it...
How it works...
3. Extracting and Transforming Data
Introduction
Transforming Apache logs into TSV format using MapReduce
Getting ready
How to do it...
How it works...
There's more...
See also
Using Apache Pig to filter bot traffic from web server logs
Getting ready
How to do it...
How it works...
There's more...
See also
Using Apache Pig to sort web server log data by timestamp
Getting ready
How to do it...
How it works...
There's more...
See also
Using Apache Pig to sessionize web server log data
Getting ready
How to do it...
How it works...
See also
Using Python to extend Apache Pig functionality
Getting ready
How to do it...
How it works...
Using MapReduce and secondary sort to calculate page views
Getting ready
How to do it...
How it works...
See also
Using Hive and Python to clean and transform geographical event data
Getting ready
How to do it...
How it works...
There's more...
Making every column type String
Type casting values using the AS keyword
Testing the script locally
Using Python and Hadoop Streaming to perform a time series analytic
Getting ready
How to do it...
How it works...
There's more...
Using Hadoop Streaming with any language that can read from stdin and write to stdout
Using the -file parameter to pass additional required files for MapReduce jobs
Using MultipleOutputs in MapReduce to name output files
Getting ready
How to do it...
How it works...
Creating custom Hadoop Writable and InputFormat to read geographical event data
Getting ready
How to do it...
How it works...
4. Performing Common Tasks Using Hive, Pig, and MapReduce
Introduction
Using Hive to map an external table over weblog data in HDFS
Getting ready
How to do it...
How it works...
There's more...
LOCATION must point to a directory, not a file
Dropping an external table does not delete the data stored in the table
You can add data to the path specified by LOCATION
Using Hive to dynamically create tables from the results of a weblog query
Getting ready
How to do it...
How it works...
There's more...
CREATE TABLE AS cannot be used to create external tables
DROP temporary tables
Using the Hive string UDFs to concatenate fields in weblog data
Getting ready
How to do it...
How it works...
There's more...
The UDF concat_ws() function will not automatically cast parameters to String
Alias your concatenated field
The concat_ws() function supports variable length parameter arguments
See also
Using Hive to intersect weblog IPs and determine the country
Getting ready
How to do it...
How it works...
There's more...
Hive supports multitable joins
The ON operator for inner joins does not support inequality conditions
See also
Generating n-grams over news archives using MapReduce
Getting ready
How to do it...
How it works...
There's more...
Use caution when invoking FileSystem.delete()
Use NullWritable to avoid unnecessary serialization overhead
Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives
Getting ready
How to do it...
How it works...
There's more...
Use the distributed cache to pass JAR dependencies to map/reduce task JVMs
Distributed cache does not work in local jobrunner mode
Using Pig to load a table and perform a SELECT operation with GROUP BY
Getting ready
How to do it...
How it works...
See also
5. Advanced Joins
Introduction
Joining data in the Mapper using MapReduce
Getting ready
How to do it...
How it works...
There's more...
See also
Joining data using Apache Pig replicated join
Getting ready
How to do it...
How it works...
There's more...
See also
Joining sorted data using Apache Pig merge join
Getting ready
How to do it...
How it works...
There's more...
See also
Joining skewed data using Apache Pig skewed join
Getting ready
How to do it...
How it works...
Using a map-side join in Apache Hive to analyze geographical events
Getting ready
How to do it...
How it works...
There's more...
Auto-convert to map-side join whenever possible
Map-join behavior
See also
Using optimized full outer joins in Apache Hive to analyze geographical events
Getting ready
How to do it...
How it works...
There's more...
Common join versus map-side join
STREAMTABLE hint
Table ordering in the query matters
Joining data using an external key-value store (Redis)
Getting ready
How to do it...
How it works...
There's more...
6. Big Data Analysis
Introduction
Counting distinct IPs in weblog data using MapReduce and Combiners
Getting ready
How to do it...
How it works...
There's more...
The Combiner does not always have to be the same class as your Reducer
Combiners are not guaranteed to run
Using Hive date UDFs to transform and sort event dates from geographic event data
Getting ready
How to do it...
How it works...
There's more...
Date format strings follow Java SimpleDateFormat guidelines
Default date and time formats
See also
Using Hive to build a per-month report of fatalities over geographic event data
Getting ready
How to do it...
How it works...
There's more...
The coalesce() method can take variable-length arguments
Date reformatting code template
See also
Implementing a custom UDF in Hive to help validate source reliability over geographic event data
Getting ready
How to do it...
How it works...
There's more...
Check out the existing UDFs
User-defined table and aggregate functions
Export HIVE_AUX_JARS_PATH in your environment
See also
Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python
Getting ready
How to do it...
How it works...
There's more...
SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
MAP and REDUCE keywords are shorthand for SELECT TRANSFORM
See also
Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
Getting ready
How to do it...
How it works...
Trim outliers from the Audioscrobbler dataset using Pig and DataFu
Getting ready
How to do it...
How it works...
There's more...
7. Advanced Big Data Analysis
Introduction
PageRank with Apache Giraph
Getting ready
How to do it...
How it works...
There's more...
Keep up with the Apache Giraph community
Read and understand the Google Pregel paper
See also
Single-source shortest-path with Apache Giraph
Getting ready
How to do it...
How it works...
First superstep (S0)
Second superstep (S1)
See also
Using Apache Giraph to perform a distributed breadth-first search
Getting ready
How to do it...
How it works...
There's more...
Apache Giraph jobs often require scalability tuning
Collaborative filtering with Apache Mahout
Getting ready
How to do it...
How it works...
See also
Clustering with Apache Mahout
Getting ready
How to do it...
How it works...
See also
Sentiment classification with Apache Mahout
Getting ready
How to do it...
How it works...
There's more...
8. Debugging
Introduction
Using Counters in a MapReduce job to track bad records
Getting ready
How to do it...
How it works...
Developing and testing MapReduce jobs with MRUnit
Getting ready
How to do it...
How it works...
There's more...
See also
Developing and testing MapReduce jobs running in local mode
Getting ready
How to do it...
How it works...
There's more...
See also
Enabling MapReduce jobs to skip bad records
How to do it...
How it works...
There's more...
Using Counters in a streaming job
Getting ready
How to do it...
How it works...
There's more...
See also
Updating task status messages to display debugging information
Getting ready
How to do it...
How it works...
Using illustrate to debug Pig jobs
Getting ready
How to do it...
How it works...
See also
9. System Administration
Introduction
Starting Hadoop in pseudo-distributed mode
Getting ready
How to do it...
How it works...
There's more...
See also
Starting Hadoop in distributed mode
Getting ready
How to do it...
How it works...
There's more...
See also
Adding new nodes to an existing cluster
Getting ready
How to do it...
How it works...
There's more...
See also
Safely decommissioning nodes
Getting ready
How to do it...
How it works...
Recovering from a NameNode failure
Getting ready
How to do it...
How it works...
There's more...
Monitoring cluster health using Ganglia
Getting ready
How to do it...
How it works...
Tuning MapReduce job parameters
Getting ready
How to do it...
How it works...
10. Persistence Using Apache Accumulo
Introduction
Designing a row key to store geographic events in Accumulo
Getting ready
How to do it...
How it works...
There's more...
Lexicographic sorting of keys
Z-order curve
See also
Using MapReduce to bulk import geographic event data into Accumulo
Getting ready
How to do it...
How it works...
There's more...
AccumuloTableAssistant.java
Split points
AccumuloOutputFormat versus AccumuloFileOutputFormat
See also
Setting a custom field constraint for inputting geographic event data in Accumulo
Getting ready
How to do it...
How it works...
There's more...
Bundled Constraint classes
Installing a constraint on each TabletServer
See also
Limiting query results using the regex filtering iterator
Getting ready
How to do it...
How it works...
See also
Counting fatalities for different versions of the same key using SumCombiner
Getting ready
How to do it...
How it works...
There's more...
Combiners are on a per-key basis, not across all keys
Combiners can be applied at scan time or applied to the table configuration for incoming mutations
See also
Enforcing cell-level security on scans using Accumulo
Getting ready
How to do it...
How it works...
There's more...
Writing mutations for unauthorized scanning
ColumnVisibility is part of the key
Supporting more complex Boolean expressions
See also
Aggregating sources in Accumulo using MapReduce
Getting ready
How to do it...
How it works...
Index