Chapter 5. Introducing MLlib

In the previous chapter, we learned how to prepare data for modeling. In this chapter, we will put that knowledge to use and build classification models with the MLlib package of PySpark.

MLlib stands for Machine Learning Library. Even though MLlib is now in maintenance mode, that is, it is not being actively developed (and will most likely be deprecated later), it is still worth covering at least some of its features. In addition, MLlib is currently the only Spark library that supports training models on streaming data.

Note

Starting with Spark 2.0, ML is the main machine learning library; it operates on DataFrames, whereas MLlib operates on RDDs.

The documentation for MLlib can be found here: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html.
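
To make the distinction concrete, here is a minimal sketch of how the two packages differ in how they represent data; the toy feature values are made up:

    # RDD-based API (pyspark.mllib) -- used in this chapter: data is
    # represented as LabeledPoint(label, features) objects inside an RDD
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.linalg import Vectors

    point = LabeledPoint(1.0, Vectors.dense([0.0, 2.1, 0.5]))

    # DataFrame-based API (pyspark.ml): models are Estimators that are
    # fitted on DataFrames with a vector-typed features column
    from pyspark.ml.classification import LogisticRegression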

In this chapter, you will learn how to do the following:

  • Prepare the data for modeling with MLlib
  • Perform statistical testing
  • Predict survival chances of infants using logistic regression
  • Select the most predictive features and train a random forest model

Overview of the package

At a high level, MLlib exposes three core machine learning functionalities:

  • Data preparation: Feature extraction, transformation, selection, hashing of categorical features, and some natural language processing methods
  • Machine learning algorithms: Some popular and advanced regression, classification, and clustering algorithms are implemented
  • Utilities: Statistical methods such as descriptive statistics, chi-square testing, linear algebra (sparse and dense matrices and vectors), and model evaluation methods

As you can see, the palette of available functionalities allows you to perform almost all of the fundamental data science tasks.
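
For example, the following minimal sketch exercises the utilities part of the package: it computes column statistics and runs a chi-square goodness-of-fit test. It assumes a SparkContext is available as sc (as it is in the PySpark shell) and uses made-up numbers:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    # descriptive statistics over an RDD of dense vectors (toy data)
    rows = sc.parallelize([
        Vectors.dense([1.0, 10.0, 100.0]),
        Vectors.dense([2.0, 20.0, 200.0]),
        Vectors.dense([3.0, 30.0, 300.0]),
    ])
    summary = Statistics.colStats(rows)
    print(summary.mean())      # per-column means
    print(summary.variance())  # per-column variances

    # chi-square goodness-of-fit test against a uniform distribution
    observed = Vectors.dense([13.0, 47.0, 40.0])
    test = Statistics.chiSqTest(observed)
    print(test.pValue)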

In this chapter, we will build two classification models: a logistic regression and a random forest. We will use a portion of the US 2014 and 2015 birth data downloaded from http://www.cdc.gov/nchs/data_access/vitalstatsonline.htm; out of the 300 variables available, we selected 85 features to build our models with. Also, out of the total of almost 7.99 million records, we selected a balanced sample of 45,429 records: 22,080 records where the infant died and 23,349 records where the infant survived.
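
As a preview of the modeling APIs we will use later in this chapter, the following minimal sketch trains both kinds of models on a tiny, made-up RDD of LabeledPoint objects (again assuming a SparkContext sc); the actual models will, of course, be trained on the births data:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.tree import RandomForest

    # toy training data: a label followed by a feature vector
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(0.0, [0.5, 1.5]),
        LabeledPoint(1.0, [1.5, 0.5]),
    ])

    # logistic regression
    lr_model = LogisticRegressionWithLBFGS.train(data, iterations=10)
    print(lr_model.predict([1.0, 0.0]))

    # random forest with a handful of trees
    rf_model = RandomForest.trainClassifier(
        data, numClasses=2, categoricalFeaturesInfo={},
        numTrees=3, seed=42)
    print(rf_model.predict([1.0, 0.0]))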

Tip

The dataset we will use in this chapter can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_train.csv.gz.
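
Once downloaded, the file can be read directly as gzipped text. A minimal sketch, assuming the archive sits in the current working directory (adjust the path to wherever you saved it) and that a SparkContext is available as sc:

    # Spark decompresses .gz text files transparently; the path is an
    # assumption -- point it at your copy of the file
    births = sc.textFile('births_train.csv.gz')

    # drop the header line (assuming the first line is a header)
    # and split each record into its fields
    header = births.first()
    births = births \
        .filter(lambda row: row != header) \
        .map(lambda row: row.split(','))

    print(births.count())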
