Machine learning with H2O and Spark

H2O is an open source machine learning system. It provides a rich set of machine learning algorithms and a web-based data processing user interface, lets you develop machine learning applications in Java, Scala, Python, and R, and can interface with Spark, HDFS, Amazon S3, SQL, and NoSQL databases. H2O also provides H2O Flow, an IPython-like notebook that combines code execution, text, mathematics, plots, and rich media in a single document. Sparkling Water is H2O's product for running H2O on Spark.

Why Sparkling Water?

Sparkling Water combines the best of both worlds of Spark and H2O:

  • Spark provides rich APIs, RDDs, and multitenant contexts
  • H2O provides speed, columnar compression, and machine learning and deep learning algorithms
  • The Spark and H2O contexts reside in the same executor JVM, and data is shared between Spark RDDs and H2O RDDs (see the sketch after this list)
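
This shared-JVM, shared-data design is easiest to see in code. The following is a minimal Scala sketch, assuming the Sparkling Water Scala API of the 1.6.x releases (H2OContext.getOrCreate, asH2OFrame, and asDataFrame) and the sc and sqlContext objects provided by the Spark shells; treat it as illustrative rather than a complete application:

import org.apache.spark.h2o._

// Creating the H2OContext starts (or attaches to) an H2O instance inside each
// Spark executor JVM, so Spark and H2O share the same processes.
val h2oContext = H2OContext.getOrCreate(sc)

// A small Spark DataFrame standing in for real data.
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "value")

// Publish the Spark data to H2O as an H2OFrame (H2O's compressed columnar format) ...
val hf = h2oContext.asH2OFrame(df)

// ... and pull the H2O data back into Spark as a DataFrame.
val backToSpark = h2oContext.asDataFrame(hf)(sqlContext)
backToSpark.show()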

An application flow on YARN

The steps involved in a Sparkling Water application submitted on YARN are as follows:

  1. When the Sparkling Water application is submitted with spark-submit, the YARN resource manager allocates a container to launch the application master. In yarn-client mode, the Spark driver runs in the client process that submitted the job; in yarn-cluster mode, it runs in the same container as the application master.
  2. The application master negotiates resources with the resource manager to spawn Spark executor JVMs.
  3. The Spark executor starts an H2O instance within the JVM.
  4. Once the Sparkling Water cluster is ready, HDFS data can be read by H2O or Spark and the H2O Flow interface can be accessed. Data is shared across Spark RDDs and H2O RDDs.
  5. All jobs can be monitored and visualized in both the resource manager UI and the Spark UI.

Once the job is finished, YARN will free up resources for other jobs. Figure 7.5 illustrates a Sparkling Water application flow on YARN:

Figure 7.5: A Sparkling Water application on YARN
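
To make steps 3 to 5 concrete, here is a short, hedged Scala sketch of what might run once the application is up, for example from the Sparkling Water Scala shell described in the next section started with MASTER=yarn-client; the HDFS path is a hypothetical placeholder and the API names follow the 1.6.x releases:

import org.apache.spark.h2o._

// Step 3: creating the H2OContext starts an H2O instance inside each Spark executor JVM.
val h2oContext = H2OContext.getOrCreate(sc)

// Step 4: with the Sparkling Water cluster ready, HDFS data can be read by Spark
// and shared with H2O. The path below is only a placeholder.
val lines = sc.textFile("hdfs:///data/airlines/2008.csv")
println(s"Records read from HDFS: ${lines.count()}")

// The H2OContext status includes the address of the H2O Flow interface (step 4); the jobs
// triggered above are also visible in the YARN resource manager UI and the Spark UI (step 5).
println(h2oContext)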

Getting started with Sparkling Water

The recommended system versions for H2O are listed at http://h2o.ai/product/recommended-systems-for-h2o/. To get started, download the Sparkling Water binaries, set up the environment variables, and start using it.

Go to http://www.h2o.ai/download/sparkling-water/choose and pick the Sparkling Water version that you want to use with Spark. Download the chosen version (1.6.5 in this case) as follows:

wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.6/5/sparkling-water-1.6.5.zip
unzip sparkling-water-1.6.5.zip
cd sparkling-water-1.6.5
export SPARK_HOME=/usr/lib/spark
export MASTER=yarn-client

For the Python shell, use the following command:

bin/pysparkling

For the Scala shell, the command is as follows:

bin/sparkling-shell

The bin directory of the Sparkling Water installation contains several tools. launch-spark-cloud.sh starts a standalone Spark master and three worker nodes, pysparkling starts PySpark with Sparkling Water, sparkling-shell starts the Scala shell with Sparkling Water, and run-pynotebook.sh starts an IPython Notebook. You can execute Sparkling Water programs interactively in pysparkling or sparkling-shell.

Step-by-step deep learning examples are available at https://github.com/h2oai/sparkling-water/tree/master/examples.
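
In the style of those examples, the following condensed sketch trains an H2O deep learning model from the Scala shell; it assumes the hex.deeplearning classes and H2OContext conversion methods of the 1.6.x API, and uses a tiny in-memory DataFrame with a made-up label column in place of a real dataset:

import org.apache.spark.h2o._
import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

val h2oContext = H2OContext.getOrCreate(sc)

// A tiny in-memory training set; a real job would load data from HDFS or S3.
val training = sqlContext.createDataFrame(Seq(
  (0.0, 1.2, 0.7), (1.0, 0.3, 2.1), (0.0, 1.5, 0.4), (1.0, 0.1, 1.9)
)).toDF("label", "f1", "f2")

// Share the Spark DataFrame with H2O as an H2OFrame.
val trainFrame = h2oContext.asH2OFrame(training)

// Configure and train a deep learning model (field names as in the 1.6.x examples).
val dlParams = new DeepLearningParameters()
dlParams._train = trainFrame._key
dlParams._response_column = "label"
dlParams._epochs = 10.0
val dlModel = new DeepLearning(dlParams).trainModel().get()

// Score the training frame and bring the predictions back into Spark.
val predictions = h2oContext.asDataFrame(dlModel.score(trainFrame))(sqlContext)
predictions.show()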
