
Book Description

Hadoop is mostly written in Java, but that doesn't exclude the use of other programming languages with this distributed storage and processing framework, particularly Python. With this concise book, you'll learn how to use Python with the Hadoop Distributed File System (HDFS), MapReduce, the Apache Pig platform and the Pig Latin scripting language, and the Apache Spark cluster-computing framework.

Authors Zachary Radtka and Donald Miner from the data science firm Miner & Kasch take you through the basic concepts behind Hadoop, MapReduce, Pig, and Spark. Then, through multiple examples and use cases, you'll learn how to work with these technologies by applying various Python tools.

  • Use the Python library Snakebite to access HDFS programmatically from within Python applications
  • Write MapReduce jobs in Python with mrjob, the Python MapReduce library
  • Extend Pig Latin with user-defined functions (UDFs) in Python
  • Use the Spark Python API (PySpark) to write Spark programs with Python
  • Use the Luigi Python workflow scheduler to manage MapReduce jobs and Pig scripts (minimal sketches of each of these tools follow this list)
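
To give a flavor of the Snakebite item above, here is a minimal sketch that lists an HDFS directory from Python. The NameNode host, RPC port, and the /user path are assumptions that depend on your cluster's configuration, and note that Snakebite itself runs under Python 2.

    from snakebite.client import Client

    # Connect to the NameNode; the host and RPC port shown here are
    # assumptions and must match the cluster's fs.defaultFS setting.
    client = Client('localhost', 9000)

    # ls() takes a list of paths and yields a dictionary of metadata
    # for each entry it finds.
    for entry in client.ls(['/user']):
        print(entry['path'])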
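
The mrjob item is easiest to picture with the canonical WordCount. The following is a minimal sketch; the job and file names are illustrative.

    from mrjob.job import MRJob


    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every whitespace-delimited word.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Sum the counts emitted for each word.
            yield word, sum(counts)


    if __name__ == '__main__':
        MRWordCount.run()

Saved as word_count.py, this runs locally with python word_count.py input.txt, and adding -r hadoop submits the same job to a Hadoop cluster through Hadoop Streaming.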
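
For the Pig item, a Python UDF is an ordinary function plus a schema decorator. The sketch below assumes Pig's streaming_python UDF support, in which the pig_util helper module is supplied by Pig; the file and alias names are made up, and the script is registered from Pig Latin rather than run on its own.

    # my_udfs.py -- registered from a Pig script, for example:
    #   REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;
    from pig_util import outputSchema


    @outputSchema('word:chararray')
    def reverse(word):
        # Return the input string reversed; Pig handles moving data
        # between Pig Latin and Python.
        if word is None:
            return None
        return word[::-1]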
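
The PySpark item maps onto the same WordCount pattern expressed as RDD transformations. This sketch uses the RDD API and placeholder HDFS paths.

    from pyspark import SparkContext

    sc = SparkContext(appName='WordCount')

    # The input and output paths below are placeholders.
    lines = sc.textFile('hdfs:///user/hduser/input.txt')

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile('hdfs:///user/hduser/output')
    sc.stop()

The same calls can be explored interactively in the pyspark shell or submitted as a self-contained application with spark-submit.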
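
The Luigi item comes down to tasks that declare requires(), output(), and run(). Here is a minimal local WordCount workflow as a sketch; the task and file names are illustrative, and a real Hadoop workflow would use Luigi's Hadoop and Pig task classes instead.

    import luigi


    class InputText(luigi.ExternalTask):
        """An input file that is assumed to already exist."""
        path = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(self.path)


    class WordCount(luigi.Task):
        path = luigi.Parameter()

        def requires(self):
            return InputText(self.path)

        def output(self):
            return luigi.LocalTarget('{}.counts'.format(self.path))

        def run(self):
            # Count words in the input target and write the totals
            # to the output target.
            counts = {}
            with self.input().open('r') as infile:
                for line in infile:
                    for word in line.split():
                        counts[word] = counts.get(word, 0) + 1
            with self.output().open('w') as outfile:
                for word, count in counts.items():
                    outfile.write('{}\t{}\n'.format(word, count))


    if __name__ == '__main__':
        luigi.run()

Saved as word_count_workflow.py, it can be run with python word_count_workflow.py WordCount --path input.txt --local-scheduler.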

Zachary Radtka, a platform engineer at Miner & Kasch, has extensive experience creating custom analytics that run on petabyte-scale data sets.

Table of Contents

  1. Source Code
  2. 1. Hadoop Distributed File System (HDFS)
    1. Overview of HDFS
    2. Interacting with HDFS
      1. Common File Operations
      2. HDFS Command Reference
    3. Snakebite
      1. Installation
      2. Client Library
      3. CLI Client
    4. Chapter Summary
  3. 2. MapReduce with Python
    1. Data Flow
      1. Map
      2. Shuffle and Sort
      3. Reduce
    2. Hadoop Streaming
      1. How It Works
      2. A Python Example
    3. mrjob
      1. Installation
      2. WordCount in mrjob
      3. What Is Happening
      4. Executing mrjob
      5. Top Salaries
    4. Chapter Summary
  4. 3. Pig and Python
    1. WordCount in Pig
      1. WordCount in Detail
    2. Running Pig
      1. Execution Modes
      2. Interactive Mode
      3. Batch Mode
    3. Pig Latin
      1. Statements
      2. Loading Data
      3. Transforming Data
      4. Storing Data
    4. Extending Pig with Python
      1. Registering a UDF
      2. A Simple Python UDF
      3. String Manipulation
      4. Most Recent Movies
    5. Chapter Summary
  5. 4. Spark with Python
    1. WordCount in PySpark
      1. WordCount Described
    2. PySpark
      1. Interactive Shell
      2. Self-Contained Applications
    3. Resilient Distributed Datasets (RDDs)
      1. Creating RDDs from Collections
      2. Creating RDDs from External Sources
      3. RDD Operations
    4. Text Search with PySpark
    5. Chapter Summary
  6. 5. Workflow Management with Python
    1. Installation
    2. Workflows
      1. Tasks
      2. Target
      3. Parameters
    3. An Example Workflow
      1. Task.requires
      2. Task.output
      3. Task.run
      4. Parameters
      5. Execution
    4. Hadoop Workflows
      1. Configuration File
      2. MapReduce in Luigi
      3. Pig in Luigi
    5. Chapter Summary