Transforming Apache logs into TSV format using MapReduce
Using Apache Pig to filter bot traffic from web server logs
Using Apache Pig to sort web server log data by timestamp
Using Apache Pig to sessionize web server log data
Using Python to extend Apache Pig functionality
Using MapReduce and secondary sort to calculate page views
Using Hive and Python to clean and transform geographical event data
Using Python and Hadoop Streaming to perform a time series analytic
Using MultipleOutputs in MapReduce to name output files
Creating custom Hadoop Writable and InputFormat to read geographical event data
Introduction
Parsing and formatting large amounts of data to meet business requirements is a challenging task. The software and the architecture must meet strict scalability, reliability, and run-time constraints. Hadoop provides a scalable, reliable, and distributed processing environment that is well suited to this kind of large-scale extraction and transformation. This chapter demonstrates methods to extract and transform data using MapReduce, Apache Pig, Apache Hive, and Python.
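As a taste of the kind of transformation covered in the first recipe, the following is a minimal sketch of a Hadoop Streaming mapper in Python that converts Apache log lines to TSV. It assumes the Apache "common" log format; real deployments often use the "combined" format with extra referrer and user-agent fields, which would need a longer pattern.

```python
import re
import sys

# Pattern for the Apache "common" log format (an assumption: adjust for
# the "combined" format if your logs include referrer/user-agent fields).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)


def log_to_tsv(line):
    """Convert one Apache common-format log line to a TSV record.

    Returns None for lines that do not match, so malformed input is
    skipped rather than emitted.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return '\t'.join(match.groups())


if __name__ == '__main__':
    # Run as a Hadoop Streaming mapper: read raw log lines from stdin,
    # write one TSV record per valid line to stdout.
    for raw in sys.stdin:
        record = log_to_tsv(raw.rstrip('\n'))
        if record is not None:
            print(record)
```

Submitted via Hadoop Streaming (for example with `-mapper log_to_tsv.py` and no reducer), this turns a directory of raw logs into TSV splits that downstream Pig and Hive recipes can load directly.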