Chapter 3: Fundamental Workflow – Data to Deployable Model

In this chapter, we will walk through a minimal model-building workflow for H2O at scale. We will refer to this as the fundamental workflow because, while it touches on the main steps, it omits the wide range of functionality and user choices needed to build accurate, trusted models.

The fundamental workflow will serve as a basis to build your understanding of H2O technology and coding steps so that in the next part of the book you can dive fully into advanced techniques to build state-of-the-art models.

To develop the fundamental workflow, we will cover the following main topics in this chapter:

  • Use case and data overview
  • The fundamental workflow
  • Variation points – alternatives and extensions to the fundamental workflow

Technical requirements

For this chapter, we will focus on using Enterprise Steam to launch H2O clusters on an enterprise server cluster. Enterprise Steam is not technically required to launch H2O clusters, but enterprise stakeholders typically view Enterprise Steam as a security, governance, and administrator requirement for implementing H2O in enterprise environments.

Enterprise Steam requires a license purchased from H2O.ai. If your organization does not have an instance of Enterprise Steam installed, you can access Enterprise Steam and an enterprise server cluster through a temporary trial license of the larger H2O platform. Alternatively, for ease of conducting the exercises in this book, you may wish to launch H2O clusters as a sandbox in your local environment (for example, on your laptop or desktop workstation) and bypass the use of Enterprise Steam.

See Appendix – Alternative Methods to Launch H2O Clusters for this Book to help you decide on how you wish to launch H2O clusters for the exercises in this book and how to set up your environment to do so.  

Enterprise Steam: Enterprise Environment versus Coding Exercises in the Book

Enterprise stakeholders typically view Enterprise Steam as a security, governance, and administrator requirement for implementing H2O in enterprise environments. This chapter shows how data scientists use Enterprise Steam in this enterprise context. Enterprise Steam, however, requires an H2O.ai license to implement and will not be available to all readers of this book.

A simple sandbox (non-enterprise) experience is to use H2O exclusively on your local environment (laptop or workstation) and this does not require Enterprise Steam. Coding exercises in subsequent chapters will leverage the local sandbox environment but also can be performed using Enterprise Steam as demonstrated in this chapter.

Note that the distinction between the data scientist workflow with and without Enterprise Steam is isolated to the first step of the workflow (launching the H2O cluster) and will be made clearer later in this chapter. See also Appendix – Alternative Methods to Launch H2O Clusters.

Use case and data overview

To demonstrate the fundamental workflow, we will implement a binary classification problem where we predict the likelihood that a loan will default or not. The dataset we use in this chapter can be found at https://github.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/blob/main/chapt3/loans-lite.csv. (This is a simplified version of the Kaggle Lending Club Loan dataset: https://www.kaggle.com/imsparsh/lending-club-loan-dataset-2007-2011.)  

We are using a simplified version of the dataset to streamline the workflow in this chapter. In Part 2, Building State-of-the-Art Models at Scale, we will develop this use case using advanced H2O model-building capabilities on the original loan dataset.

The fundamental workflow

Our fundamental workflow will proceed through the following steps:

  1. Launching the H2O cluster (Enterprise Steam UI)
  2. Connecting to the H2O cluster (your IDE from this point onward)
  3. Building the model
  4. Evaluating and explaining the model
  5. Exporting the model for production deployment
  6. Shutting down the H2O cluster

Step 1 – launching the H2O cluster

This step is done from the Enterprise Steam UI. You will select whether you want an H2O-3 or Sparkling Water cluster and then you will configure the H2O cluster behavior, such as the duration of idle time before it times out and terminates and whether you want to save the state at termination so you can restart the cluster and pick up where you left off (this must be enabled by the administrator). Nicely, Enterprise Steam will auto-size the H2O cluster (number of nodes, memory per node, CPUs) based on your data size.

Logging in to Steam

Open a web browser and go to https://steam-url:9555/login and log in to Enterprise Steam, where steam-url is the URL of your specific Steam instance. (Your administrator may have changed the port number, but typically it is 9555 as shown in the URL.)

Selecting an H2O-3 (versus Sparkling Water) cluster

Here, we will launch an H2O-3 cluster (and not Sparkling Water, which we will do in the next part of the book), so click on the H2O link in the left panel and then click LAUNCH NEW CLUSTER.

Configuring the H2O-3 cluster

This brings us to the following form, which you will configure:

Figure 3.1 – UI to launch an H2O-3 cluster on Kubernetes

For now, we will ignore most configurations. These will be covered more fully in Chapter 11, The Administrator and Operations Views, where Enterprise Steam is overviewed in detail. Note that the configuration page uses the term H2O cluster to represent an H2O-3 cluster specifically, whereas in this book we use the term H2O cluster to represent either an H2O-3 or Sparkling Water cluster.

Note on the "Configuring the H2O-3 cluster" Screenshot

Details on the screen shown in Figure 3.1 will vary depending on whether the H2O cluster is launched on a Kubernetes environment or on a YARN-based Hadoop or Spark environment. Details will also vary based on whether the H2O cluster is an H2O-3 cluster or a Sparkling Water cluster. In all cases, however, the fundamental concepts of H2O cluster size (number of nodes, CPU/GPU per node, and memory per node) and maximum idle/uptime are common throughout.

Give your cluster a name and for DATASET PARAMETERS, click Set parameters to arrive at the following popup:

Figure 3.2 – Popup to automatically size the H2O-3 cluster

The inputs here are used by Enterprise Steam to auto-size your H2O cluster (that is, to determine the number of H2O nodes and memory allocated for each node and CPU allocations for each node). Recall the key concepts of an H2O cluster as presented in the previous chapter.

Waiting briefly for the cluster to start

The STATUS field in the UI will state Starting, signifying that the H2O cluster is being launched on the enterprise server cluster. This will take a minute or two. When the status changes to Running, your H2O cluster is ready to use.

Viewing details of the cluster

Let's first learn a few things about the cluster by clicking on Actions and then Detail. This generates a popup describing the cluster.

Notice in this case that Number of nodes is 6 and Memory per node is 48 GB, as auto-sized by Enterprise Steam for the 50 GB dataset size entered in the popup shown in Figure 3.2. Recall from the H2O key concepts section in the previous chapter that our dataset is partitioned and distributed in memory across this number of H2O cluster nodes on the enterprise server cluster and that compute is done in parallel on these H2O nodes.

Note on H2O Cluster Sizing

In general, an H2O cluster is sized so the total memory allocated to the cluster (that is, the product of N H2O nodes and X GB memory per node) is roughly 5 times the size of the uncompressed dataset that will be used for model building. The calculation minimizes the number of nodes (that is, fewer nodes with more memory per node is better).

Enterprise Steam will calculate this sizing based on your description of the dataset, but alternatively, you can size the cluster yourself through the Enterprise Steam UI. The total memory allocated to the H2O cluster will be released when the H2O cluster is terminated.

Note that the Enterprise Steam administrator sets the minimum and maximum configuration values a user may have when launching an H2O cluster (see Figure 3.1) and thus the maximum H2O cluster size a user may launch. These boundaries set by the administrator can be configured differently for different users.
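To make the rule of thumb concrete, the following is a small, purely illustrative calculation of the kind of sizing Enterprise Steam performs for you (the numbers mirror the example above):

import math

uncompressed_gb = 50                    # approximate uncompressed dataset size
total_memory_gb = 5 * uncompressed_gb   # rule of thumb: ~5x the dataset size
memory_per_node_gb = 48                 # illustrative per-node memory allocation
nodes = math.ceil(total_memory_gb / memory_per_node_gb)
print(nodes, "nodes x", memory_per_node_gb, "GB =",
      nodes * memory_per_node_gb, "GB total cluster memory")
# 6 nodes x 48 GB = 288 GB, roughly 5x the 50 GB dataset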

Step 2 – connecting to the H2O cluster

This and all subsequent steps are from your IDE. We will use a Jupyter notebook and write code in Python (though other options include writing H2O in R, Java, or Scala using your preferred IDE).

Open the notebook and connect to the H2O cluster you launched in Enterprise Steam by writing the following code:

import h2o
import h2osteam
from h2osteam.clients import H2oKubernetesClient

# Log in to the Enterprise Steam server (same credentials as the Steam UI)
conn = h2osteam.login(url="https://steam-url:9555",
                      username="my-steam-username",
                      password="my-steam-password")

# Retrieve the cluster you launched from the Enterprise Steam UI ...
cluster = H2oKubernetesClient().get_cluster("cluster-name")

# ... and connect the h2o library to it
cluster.connect()

You have now connected to the H2O cluster and can start building models. Note that after you connect, you will see H2O cluster details similar to those viewed from the Enterprise Steam UI when you configured the cluster before launching.

Let's understand what the code is doing:

  1. You referenced the h2osteam and h2o Python libraries, which were downloaded from H2O.ai and installed in your IDE environment. (The h2o library is not used by the code shown here but will be used by the model-building steps that follow.)
  2. Then you logged in to the Enterprise Steam server via the h2osteam API (library). You used the same URL, username, and password that were used to log in to the Enterprise Steam UI.
  3. You then retrieved your H2O cluster information from Enterprise Steam via the h2osteam API.
  4. Note that you are using H2oKubernetesClient here because you are connecting to an H2O cluster launched on a Kubernetes environment. If, alternatively, your enterprise environment is Hadoop or Spark, you use H2oClient or SparklingClient, respectively.
  5. You connected to your H2O cluster using cluster.connect() and passed the cluster information to the h2o API. Note that you did not have to specify any URL to the H2O cluster because Steam returned this behind the scenes with H2oKubernetesClient().get_cluster("cluster-name").

    Creating an H2O Sandbox Environment

    If you want to create a small H2O sandbox on your local machine instead of using Enterprise Steam and your enterprise server cluster, simply implement the following two lines of code from your IDE:

    import h2o

    h2o.init()

    The result is identical to performing steps 1–2 using Enterprise Steam, except that it launches an H2O cluster with one node on your local machine and connects to it.

    Whether connecting to an H2O cluster in your enterprise environment or on your local machine, you can now write model-building steps identically from your IDE against the respective cluster. For the sandbox, you will be constrained, of course, to much smaller data volumes because of its small cluster size of one node with low memory.

Step 3 – building the model

Now that we have connected to our H2O cluster, it is time to build the model. From this point onward, you will be using the h2o API to communicate with the H2O cluster that you launched and connected to.

Here in our fundamental workflow, we will take a minimal approach to import data, clean it, engineer features from it, and then train the model.

Importing the data

The loans dataset is loaded from the source into the H2O-3 cluster memory using the h2o.import_file command as follows:

input_csv = "https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/main/chapt3/loans-lite.csv"
loans = h2o.import_file(input_csv)
loans.dim
loans.head()

The loans.dim line gives us the number of rows and columns and loans.head() displays the first 10 rows. Quite simple data exploration for now.

Note that the dataset is now partitioned and distributed in memory across the H2O cluster. From our coding standpoint in the IDE, it is treated as a single two-dimensional data structure of columns and rows called an H2OFrame.

Cleaning the data

Let's perform one simple data cleaning step. The target or response column is called bad_loan and it holds values of either 0 or 1 for good and bad loans respectively. We need to transform the integers in this column to categorical values, as shown next:

loans["bad_loan"] = loans["bad_loan"].asfactor()

Engineering new features from the original data

Feature engineering is often considered the secret sauce in building a superior predictive model. For our purposes now, we will do basic feature engineering by extracting year and month as separate features from the issue_d column, which holds day, month, and year as a single value:

loans["issue_d_year"] = loans["issue_d"].year().asfactor()
loans["issue_d_month"] = loans["issue_d"].month().asfactor()

We have just created two new categorical columns in our loans dataset: issue_d_year and issue_d_month.

Model training

We will next train a model to predict bad loans. We first split our data into training, validation, and test sets:

train, validate, test = loans.split_frame(seed=1, ratios=[0.7, 0.15])

We now need to identify which columns we will use to predict whether a loan is bad or not. We will do this by taking all column names from the current loans H2OFrame (which holds the cleaned and engineered data) and removing two of them:

predictors = list(loans.col_names)
predictors.remove("bad_loan")
predictors.remove("issue_d")

Note that we removed bad_loan from the columns used as features because this is what we are predicting. We also removed issue_d because we engineered new features from this and do not want it as a predictor.

Next, let's create an XGBoost model to predict loan default:

from h2o.estimators import H2OXGBoostEstimator
param = {
         "ntrees" : 20,
         "nfolds" : 5,
         "seed": 12345
}
model = H2OXGBoostEstimator(**param)
model.train(x = predictors,
            y = "bad_loan",
            training_frame = train,
            validation_frame = validate)

Step 4 – evaluating and explaining the model

Let's evaluate the performance of the model that we just trained:

perf = model.model_performance(test)
perf

The output of perf shows details on model performance, including model metrics such as MSE, Logloss, AUC, and others, as well as a confusion matrix, maximum metrics thresholds, and a gains/lift table.
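If you want to pull individual metrics programmatically rather than reading the printed report, the performance object exposes accessor methods. For example:

perf.auc()                  # area under the ROC curve
perf.logloss()              # logarithmic loss
perf.confusion_matrix()     # confusion matrix at the default threshold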

Now let's look at one simple view of model explainability by generating variable importance from the model result:

explain = model.explain(test, include_explanations="varimp")
explain

The output of explain shows the variable importance of the trained model run against the test dataset. This is a table listing how strongly each feature contributed to the model.
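You can also pull variable importance directly from the trained model, either as a table or as a quick plot:

model.varimp(use_pandas=True)   # variable importances as a pandas DataFrame
model.varimp_plot()             # bar chart of the most important variables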

H2O's model explainability capabilities go much further than variable importance, as we shall see later in the book.

Step 5 – exporting the model's scoring artifact

Now let's generate and export the model as a scoring artifact that can be deployed to a production environment by the DevOps group:

model.download_mojo("download-destination-path")
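If you also need the Java runtime library used to score the MOJO outside of H2O, the same call can fetch it alongside the artifact (the destination path here is a placeholder):

model.download_mojo(path="download-destination-path",
                    get_genmodel_jar=True)   # also downloads h2o-genmodel.jar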

In the real world, of course, we would train many models and compare their performance and explainability to evaluate which (if any) should make it to production.

Step 6 – shutting down the cluster

When your work is complete, shut down the H2O-3 cluster to free up the resources that were reserved by it:

h2o.cluster().shutdown()

Variation points – alternatives and extensions to the fundamental workflow

The fundamental workflow we developed here is a simple example. For each step we performed, there are multiple alternatives and extensions to what has been shown. All of Part 2, Building State-of-the-Art Models at Scale, is dedicated to understanding these alternatives and elaborations and to putting them together to build superior models at scale.

Let's first touch on some key variation points here.

Launching an H2O cluster using the Enterprise Steam API versus the UI (step 1)

In our example, we used the convenience of the Enterprise Steam UI to configure and launch an H2O cluster. Alternatively, we could have used the Steam API from our IDE to do so. See the full H2O Enterprise Steam API documentation at https://docs.h2o.ai/enterprise-steam/latest-stable/docs/python-docs/index.html for the Python API and https://docs.h2o.ai/enterprise-steam/latest-stable/docs/r-docs/index.html for the R API.

By launching the H2O cluster from our IDE, we therefore could have completed all of steps 1–6 of our workflow exclusively from the IDE.
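As a rough sketch of what an API-driven launch might look like (the method and parameter names here, such as launch_cluster, profile_name, and dataset_size_gb, are illustrative assumptions that vary by Steam version and backend, so consult the API documentation linked above):

import h2osteam
from h2osteam.clients import H2oKubernetesClient

h2osteam.login(url="https://steam-url:9555",
               username="my-steam-username",
               password="my-steam-password")

# Illustrative call: check your Steam version's API docs for the exact signature
cluster = H2oKubernetesClient.launch_cluster(name="loans-cluster",
                                             profile_name="default-h2o-kubernetes",
                                             dataset_size_gb=50)
cluster.connect()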

Launching an H2O-3 versus Sparkling Water cluster (step 1)

In our example, we launched an H2O-3 cluster. We could alternatively launch an H2O Sparkling Water cluster. As we will see, Sparkling Water clusters have the same capability set as H2O-3 clusters but with the additional ability to integrate Spark code and Spark DataFrames with H2O code and H2O DataFrames. This is particularly powerful when leveraging Spark for advanced data exploration and data munging before building models in H2O.
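For a flavor of that integration, the following minimal pysparkling sketch assumes a running SparkSession named spark and may vary slightly by Sparkling Water version:

from pysparkling import H2OContext

hc = H2OContext.getOrCreate()    # start or attach H2O nodes inside the Spark cluster
loans_spark = spark.read.csv("loans-lite.csv", header=True, inferSchema=True)
loans_h2o = hc.asH2OFrame(loans_spark)   # convert a Spark DataFrame to an H2OFrame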

Implementing Enterprise Steam or not (steps 1–2)

Know that Enterprise Steam is not strictly required for launching H2O clusters on the enterprise server cluster and connecting to them: it is possible for a data scientist to use only the h2o (and not h2osteam) API in the IDE to configure, launch, and connect to an H2O cluster there, but this involves low-level coding and configuration and requires detailed integration information. Importantly, this approach lacks sound enterprise security, governance, and integration practices.

In the enterprise setting, Enterprise Steam is viewed as essential to centralize, manage, and govern H2O technology and H2O users in the enterprise server cluster environment. These capabilities are elaborated on in Chapter 11, The Administrator and Operations Views.

Using a personal access token to log in to Enterprise Steam (step 2)

For Step 2 – connecting to the H2O cluster, we authenticated to Enterprise Steam from our IDE using the Enterprise Steam API. In the example code, we used a clear text password (which was the same password used to log into the Enterprise Steam UI). This is not secure if, for example, you shared the notebook.

Alternatively, and more securely, you can use a Personal Access Token (PAT) as the API login password to Enterprise Steam. A PAT can be generated as often as you wish, and each newly generated PAT revokes the previous one. Thus, if you shared a Jupyter notebook that uses a PAT as its login password, the recipient would not learn your Enterprise Steam UI password, and as soon as you generate a new PAT, the token in the shared notebook is revoked and can no longer be used to authenticate via the API. You can take the PAT one step further and implement it as an environment variable outside the IDE.

Enterprise Steam lets you generate a PAT from the UI. To generate a PAT, log in to Enterprise Steam UI, click Configurations, and follow the brief token workflow. Copy the result (a long string) for use in your current notebook or script or to set it as an environment variable.
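For example, a minimal sketch of logging in with a PAT read from an environment variable (the variable name STEAM_PAT is illustrative) looks like this:

import os
import h2osteam

conn = h2osteam.login(url="https://steam-url:9555",
                      username="my-steam-username",
                      password=os.environ["STEAM_PAT"])  # PAT comes from the environment, not the notebook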

Building the model (step 3)

H2O offers a much more powerful model-building experience than what was shown in our fundamental workflow. This larger experience is touched on here and explored fully in Part 2, Building State-of-the-Art Models at Scale.

Language and IDE

We are writing H2O code in Python in a Jupyter notebook. You can also write your H2O and Enterprise Steam code in R, and use the Python or R IDE of your choice. Additionally, you can use H2O's UI-rich IDE, called H2O Flow, to perform the full workflow or to quickly inspect an H2O cluster workflow that is progressing from your own IDE.

Importing data

Data can be imported from many sources into H2O clusters, including cloud object storage (for example, S3 or Azure Data Lake Storage), database tables (via JDBC), HDFS, and more. Additionally, source files can come in many formats, including Parquet, ORC, ARFF, and more.
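The following sketches illustrate the pattern; the paths, credentials, and the presence of a JDBC driver on the H2O cluster are assumptions here:

# From cloud object storage
loans_s3 = h2o.import_file("s3://my-bucket/loans/loans-lite.csv")

# From a database table over JDBC (the driver must be available to the H2O cluster)
loans_db = h2o.import_sql_table(connection_url="jdbc:postgresql://db-host:5432/loans_db",
                                table="loans",
                                username="my-db-user",
                                password="my-db-password")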

Cleaning data and engineering features

H2O-3 has capabilities for basic data manipulation (for example, changing column types, combining or slicing rows or columns, group by, impute, and so on).
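A couple of quick sketches of such operations on the loans H2OFrame follow; the column names are illustrative and may not all appear in the simplified dataset:

# Impute missing values in a numeric column with its mean
loans.impute("annual_inc", method="mean")

# Count rows by a categorical column
loans_by_purpose = loans.group_by("purpose").count().get_frame()
loans_by_purpose.head()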

Recall that launching a Sparkling Water cluster gives us full H2O-3 capabilities with the addition of Spark's more powerful data exploration and engineering capabilities.

Model training

In our fundamental workflow, we explored only one type of model (XGBoost) while changing only a few default parameters. H2O-3 (and its Sparkling Water extension) has an extensive list of both supervised and unsupervised learning algorithms and a wide range of parameters and hyperparameters to set to your specification. In addition, these algorithms can be combined powerfully into an AutoML workflow that explores multiple models and hyperparameter space and arranges the resulting best models on a leaderboard. You also have control over cross-validation techniques, checkpointing, retraining, and reproducibility.
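As a brief preview, a minimal AutoML run on the same data might look like the following sketch (the number of models is arbitrary here):

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=10, seed=12345)
aml.train(x=predictors, y="bad_loan", training_frame=train)
aml.leaderboard.head()    # best models ranked on a leaderboard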

Evaluating and explaining the model (step 4)

H2O has numerous explainability methods and visualizations for both local (individual) and global (model-level) explainability, including residual analysis, variable importance heatmaps, Shapley summaries, Partial Dependence Plots (PDPs), and Individual Conditional Expectation (ICE).
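As a sketch of a few of these (method names follow the h2o Python API; the chosen column is illustrative, and the availability of some plots depends on the algorithm and H2O version):

model.shap_summary_plot(test)            # Shapley summary for tree-based models
model.pd_plot(test, column="int_rate")   # partial dependence for one feature
model.ice_plot(test, column="int_rate")  # individual conditional expectation for one feature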

Exporting the model's scoring artifact (step 5)

Once you export the model's scoring artifact (called an H2O MOJO), it is ready for DevOps to deploy and monitor in live scoring environments. It likely will enter the organization's CI/CD process. We will pick it up at this point in Part 3, Deploying Your Models to Production Environments.

Shutting down the cluster (step 6)

You can shut down your cluster from your IDE as shown in our example workflow. As you may have noticed when configuring your cluster in Enterprise Steam, however, there are two settings that automate the shutdown process: MAXIMUM IDLE TIME and MAXIMUM UPTIME. The first shuts down the cluster after it has not been used for the configured amount of time. The second shuts down the cluster after it has been up for the configured amount of time. Shutting down clusters (manually or automatically) frees resources for others using the enterprise server cluster.

The administrator assigns minimum and maximum values for these auto-terminate configurations. Note that when enabled by administrators, Enterprise Steam saves all models and DataFrames when the H2O cluster has been auto-terminated. You can restart the cluster later and pick up where the cluster terminated.

Summary

In this chapter, you learned how to launch an H2O cluster and build a model on it from your IDE. This fundamental workflow is a bare skeleton that you will flesh out much more fully with a deep set of advanced H2O model-building techniques that we will now learn in Part 2, Building State-of-the-Art Models at Scale, of the book.

We will start this advanced journey in the next chapter by overviewing these capabilities before using them.
