H2O is an open source machine learning system. It offers a rich set of machine learning algorithms and a web-based data processing user interface, and it lets you develop machine learning applications in Java, Scala, Python, and R. It can also interface with Spark, HDFS, Amazon S3, SQL, and NoSQL databases. H2O also provides H2O Flow, an IPython-like notebook that lets you combine code execution, text, mathematics, plots, and rich media in a single document. Sparkling Water is H2O running on Spark.
Sparkling Water combines the best of both worlds of Spark and H2O:
The steps involved in a Sparkling Water application submitted on YARN are as follows:
yarn-client mode. The Spark driver runs in the same container as the application master when submitted in yarn-cluster mode. Once the job is finished, YARN will free up resources for other jobs. Figure 7.5 illustrates a Sparkling Water application flow on YARN:
The recommended system versions that should be used with H2O are available at http://h2o.ai/product/recommended-systems-for-h2o/. To get started, download the Sparkling Water binaries, set up environment variables, and start using it.
Go to http://www.h2o.ai/download/sparkling-water/choose and pick the Sparkling Water version that matches your Spark version. Download and unpack the chosen version (1.6.5 in this case), then set the environment variables, as follows:
wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.6/5/sparkling-water-1.6.5.zip
unzip sparkling-water-1.6.5.zip
cd sparkling-water-1.6.5
export SPARK_HOME=/usr/lib/spark
export MASTER=yarn-client
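If you script the download for more than one release, it may help to derive the URL from the version number. The following is a minimal Python sketch. The helper name sparkling_water_url is hypothetical, and the URL layout is an assumption taken from the single 1.6.5 example above, so verify the result against the download page before relying on it for another version.

```python
def sparkling_water_url(version: str) -> str:
    """Build a download URL that mirrors the layout of the 1.6.5 example.

    Assumption: other releases follow the same rel-<major>.<minor>/<patch>/
    layout as the one URL shown in the text. This is not guaranteed.
    """
    major, minor, patch = version.split(".")
    base = "http://h2o-release.s3.amazonaws.com/sparkling-water"
    return f"{base}/rel-{major}.{minor}/{patch}/sparkling-water-{version}.zip"


# Reproduces the URL passed to wget above.
print(sparkling_water_url("1.6.5"))
```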
For Python shell, use the following command:
bin/pysparkling
For Scala shell, the command is as follows:
bin/sparkling-shell
The bin directory of the Sparkling Water installation has many tools available. You will see launch-spark-cloud.sh, which is used to start a standalone master and three worker nodes. The pysparkling command is used for PySpark with Sparkling Water, and sparkling-shell is used for the Scala shell with Sparkling Water. You can interactively execute Sparkling Water programs in pysparkling or sparkling-shell. The run-pynotebook.sh command is used to start an IPython Notebook.
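To launch one of these shells from a script rather than by hand, the environment variables and the launcher path can be assembled in one place. The following Python sketch is illustrative only. The build_launch helper and the LAUNCHERS table are hypothetical names, and the default SPARK_HOME and MASTER values are simply the ones used in the setup commands earlier; adjust them for your cluster.

```python
import os

# Launcher scripts named in the text, relative to the Sparkling Water directory.
LAUNCHERS = {
    "python": "bin/pysparkling",     # PySpark with Sparkling Water
    "scala": "bin/sparkling-shell",  # Scala shell with Sparkling Water
}


def build_launch(install_dir, shell,
                 spark_home="/usr/lib/spark", master="yarn-client"):
    """Return (argv, env) for starting the requested interactive shell.

    Nothing is executed here; the caller decides when to start the process.
    """
    if shell not in LAUNCHERS:
        raise ValueError(f"unknown shell {shell!r}; expected one of {sorted(LAUNCHERS)}")
    # Copy the current environment and add the two variables from the setup step.
    env = dict(os.environ, SPARK_HOME=spark_home, MASTER=master)
    argv = [f"{install_dir}/{LAUNCHERS[shell]}"]
    return argv, env


argv, env = build_launch("sparkling-water-1.6.5", "python")
print(argv, env["SPARK_HOME"], env["MASTER"])
# To actually start the shell, pass these to subprocess.call(argv, env=env).
```

Keeping the command construction separate from execution makes it easy to check the path and variables before anything is submitted to YARN.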
Step-by-step deep learning examples are available at https://github.com/h2oai/sparkling-water/tree/master/examples.