Transforming Apache logs into TSV format using MapReduce
Using Apache Pig to filter bot traffic from web server logs
Using Apache Pig to sort web server log data by timestamp
Using Apache Pig to sessionize web server log data
Using Python to extend Apache Pig functionality
Using MapReduce and secondary sort to calculate page views
Using Hive and Python to clean and transform geographical event data
Using Python and Hadoop Streaming to perform a time series analytic
Using MultipleOutputs in MapReduce to name output files
Creating custom Hadoop Writable and InputFormat to read geographical event data
Introduction
Parsing and formatting large amounts of data to meet business requirements is a challenging task. The software and the architecture must meet strict scalability, reliability, and run-time constraints. Hadoop provides a scalable, reliable, and distributed processing environment that is well suited to this kind of large-scale extraction and transformation. This chapter demonstrates methods to extract and transform data using MapReduce, Apache Pig, Apache Hive, and Python.
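As a taste of the kind of transformation covered in the first recipe, the following is a minimal sketch of a Hadoop Streaming mapper in Python that converts Apache log lines to TSV. It assumes the Apache "common" log format; real deployments often use the "combined" format with extra referrer and user-agent fields, which would need a longer pattern.

```python
import re
import sys

# Pattern for the Apache "common" log format (an assumption: adjust for
# the "combined" format if your logs include referrer/user-agent fields).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)


def log_to_tsv(line):
    """Convert one Apache common-format log line to a TSV record.

    Returns None for lines that do not match, so malformed input is
    skipped rather than emitted.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return '\t'.join(match.groups())


if __name__ == '__main__':
    # Run as a Hadoop Streaming mapper: read raw log lines from stdin,
    # write one TSV record per valid line to stdout.
    for raw in sys.stdin:
        record = log_to_tsv(raw.rstrip('\n'))
        if record is not None:
            print(record)
```

Submitted via Hadoop Streaming (for example with `-mapper log_to_tsv.py` and no reducer), this turns a directory of raw logs into TSV splits that downstream Pig and Hive recipes can load directly.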