Machine learning on Spark and Hadoop

MLlib is a machine learning library on top of Spark that provides major machine learning algorithms and utilities. It is divided into two separate packages:

  • spark.mllib: This is the original machine learning API, built on top of Resilient Distributed Datasets (RDDs). As of Spark 2.0, this RDD-based API is in maintenance mode and is expected to be deprecated and removed in upcoming releases of Spark.
  • spark.ml: This is the primary machine learning API, built on top of DataFrames. It is used to construct machine learning pipelines and benefits from the optimizations of the DataFrames API.

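spark.ml is preferred over spark.mllib because the DataFrames API it is based on provides higher performance and flexibility than RDDs: DataFrame operations benefit from Spark SQL's Catalyst optimizer and the Tungsten execution engine, and the same API is available across Scala, Java, Python, and R.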

Apache Mahout began as a general machine learning library on top of Hadoop, primarily as a Java MapReduce package for running machine learning algorithms. Because machine learning algorithms are iterative in nature, and MapReduce writes intermediate results to disk between jobs, the MapReduce implementations had major performance and scalability issues. So, Mahout stopped the development of MapReduce-based algorithms and started supporting new platforms such as Spark, H2O, and Flink with a new package called Samsara. The Apache Mahout integration with Spark is explained in Chapter 8, Building Recommendation Systems with Spark and Mahout.

The Sparkling Water project allows the H2O project's machine learning algorithms to be used on a Spark cluster. It is an open source system for developing machine learning applications in Java, Scala, Python, and R. It can also interface with HDFS, Amazon S3, and SQL and NoSQL databases. Sparkling Water is explained in detail in the Getting started with Sparkling Water section of this chapter.
