Chapter 9. Machine Learning Classifier Using TensorFlow

In Chapter 7, we built a machine learning model but ran into problems when trying to scale it out and make it operational. The first problem was how to prevent training–serving skew when using a time-windowed aggregate feature. We solved this in Chapter 8 by using the same code to compute the aggregates on historical data as will be used on real-time data. The Cloud Dataflow pipeline that we implemented in Chapter 8 was used to create two sets of files: trainFlights*.csv, which will serve as our training dataset for machine learning, and testFlights*.csv, which we will use to evaluate the model. Both of these files contain augmented datasets—the purpose of the pipeline was to add the computed time aggregates to the raw data received from the airlines. We want to predict the first column in those files (whether the flight is on time) based on the other columns (departure delay, taxi-out time, distance, average departure and arrival delays, and a few other fields).

While we solved the problem of dataset augmentation with time aggregates, the other three problems identified at the end of Chapter 7 remain:

  • One-hot encoding categorical columns caused an explosion in the size of the dataset

  • Embeddings would involve special bookkeeping

  • Putting the model into production requires the machine learning library to be portable to environments beyond the cluster on which the model is trained

The solution to these three problems requires a portable machine learning library that is (1) powerful enough to carry out distributed training (i.e., training on a cluster of machines so that we can deal with very large datasets), (2) flexible enough to support the latest machine learning research such as wide-and-deep networks, and (3) portable enough to support both massively parallel prediction on custom application-specific integrated circuits (ASICs) and prediction carried out on handheld devices. TensorFlow, the open source machine learning library developed at Google, meets all these objectives.

If you skipped ahead to this chapter without reading Chapter 7, please go back and read it. That chapter looks at logistic regression using Spark and using BigQuery ML, and I introduce a number of machine learning concepts that are essential to understanding this one. In particular, understanding the limitations of the approach presented in Chapter 7 will help you to understand the architecture of the distributed TensorFlow model that we develop here.

Toward More Complex Models

Normally, when you want a computer to do something for you, you need to program the computer to do it by using an explicit set of rules. For example, if you want a computer to look at an image of a screw on a manufacturing line and figure out whether the screw is faulty or not, you need to code up a set of rules: Is the screw bent? Is the screw head broken? Is the screw discolored? With machine learning, you turn the problem around on its head. Instead of coming up with all kinds of logical rules for why a screw might be bad, you show the computer a whole bunch of data. Maybe you show it 5,000 images of good screws and 5,000 images of faulty screws that your (human) operators discarded for one reason or the other. Then, you let the computer learn how to identify a bad screw from a good one. The computer is the “machine” and it’s “learning” to make decisions based on data. In this particular case, the “machine” is learning a discriminant function from the manually labeled training data, which separates good screws from bad screws.

When we did logistic regression with Spark, or Bayes classification with Pig, we were doing machine learning. We took all of the data, chose a model (logistic regression or Bayes classifier), and asked the computer to figure out the free parameters in the model (the weights in logistic regression, the empirical probabilities in Bayes). We then could use the “trained” model to make predictions on new data points.

Even plain old linear regression, in this view, can be thought of as machine learning—that is, if the model is effective at capturing the nuances of the data. Many real-world problems are much more complex than can be adequately captured by linear regression or similarly simple models. When people talk of machine learning, they are usually thinking of more complex models with many more free parameters.

Tell a statistician about complex models with lots of free parameters, and you’ll get a lecture back on the dangers of overfitting, of building a model that (instead of capturing the nuances of the problem) is simply fitting observation noise in the data. So, another aspect of machine learning is that you need to counteract the dangers of overfitting when using very complex models by training1 the model on extremely large and highly representative datasets. Additionally, even though these complex models may be more accurate, the trade-off is that you cannot readily analyze them to retroactively derive logical rules and reasoning. When people think of machine learning, they think of random forests, support vector machines, and neural networks.

For our problem, we could use random forests, support vector machines, or neural networks, and I suspect that we will get very similar results. This is true of many real-world problems—the biggest return for your effort is going to be in terms of finding additional data to provide to the training model (and the resulting increase in free parameters in your model) or in devising better input features using the available data. In contrast, changing the machine learning model doesn’t provide as much benefit. However, for a specific class of problems—those with extremely dense2 and highly correlated inputs, such as audio and images—deep neural networks begin to shine. In general, you should try to use a linear model if you can and reserve the use of more complex models (deep neural networks, convolutional layers, recurrent neural networks, etc.) for cases where the particular problem warrants it. For the flight delay use case, I will use a “wide-and-deep” model that consists of two parts: a wide or linear part for input features that are sparse and a part consisting of deep layers for input features that are continuous.

To train the model, we will use TensorFlow, an open source software library developed at Google to carry out numerical computation for machine learning research. The guts of the library are written in C++ to permit you to deploy computation to one or more Central Processing Units (CPUs) or Graphical Processing Units (GPUs) in a desktop or the cloud. Come prediction time, the trained model can be run on CPUs, GPUs, a server that uses Google’s custom ASIC chips for machine learning (called Tensor Processing Units or TPUs3), or even a mobile device. However, it is not necessary to program in C++ to use TensorFlow, because the programming paradigm is to build a data flow graph and then stream data into that graph. It is possible to control the graph creation and streaming from Python without losing the efficiency of C++, or the ability to do GPU and ASIC computations. Nodes in the graph represent mathematical operations (such as the summation and sigmoid function that we used in logistic regression), whereas the graph edges represent the multidimensional data arrays (tensors) communicated between these nodes.

In fact, we could have expressed logistic regression as a simple neural network consisting of a single node and done the training using TensorFlow rather than Spark, as illustrated in Figure 9-1.

Figure 9-1. Logistic regression can be expressed as a simple neural network with only one node

For comparison purposes, the first neural network that we will build in this chapter will be precisely this. We will then be able to examine the impact of the additional input features while keeping the model (logistic regression) the same as what was used in Chapter 7.

Having done the comparison, though, we will move on to building a neural network that will have many more nodes and will be distributed in more layers. We’ll keep the output node a sigmoid so that the output is restricted to lie in [0,1] but add in intermediate layers and nodes with other activation functions. The number of nodes and layers is something that we must determine via experimentation. At some point, increasing the number of nodes and layers will begin to result in overfitting, and the exact point depends on the size of your dataset (both the number of labeled examples and the number of predictor variables), on how well the predictor variables actually predict the label, and on how independent the predictors are. This problem is hairy enough that there is no real way to know beforehand how big you can afford your neural network to be. If your neural network is too small, it won’t fit all the nuances of the problem adequately and your training error will be large. Even then, you won’t know that your neural network is too small unless you try a slightly larger neural network. The relationship is not going to be nice and smooth because there are random seeds involved in all the optimization methods that you will use to find the weights and biases. Because of that, machine learning is going to have to involve many, many runs. The best advice is to try out different numbers of nodes and layers and different activation functions (different ones work better for different problems) and see what works well for your problem. Having a cloud platform that allows this sort of experimentation to be carried out on your complete dataset in a timely manner is very important. When it’s time to run our experiment on the full dataset, we will use Cloud AI Platform.

For the intermediate layers, we will use Rectified Linear Units (ReLUs) as the neural network nodes. The ReLU has a linear activation function that is clamped to non-negative values. Essentially the input of the neuron is passed through to the output after thresholding it at 0—so if the weighted sum of the input neurons is 3, the output is 3, but if the weighted sum of the inputs is –3, the output is 0.

Figure 9-2. A typical neural network node in the intermediate (hidden) layers of a neural network consists of the weighted sum of its inputs transformed by a nonlinear function

Using ReLU rather than sigmoidal or tanh activation functions is a trade-off—the sigmoid activation function saturates between 0 and 1 (see the graphs in Chapter 7), and therefore the output won’t blow up. However, the gradient near the saturation points of 0 and 1 is so small that training takes a long time. Also, the impact of a neuron is never actually zero when you use sigmoidal activation functions—this makes it very important to choose the right number of nodes/layers in a neural network for fear of overfitting. Because the gradient of a ReLU is constant for positive inputs, ReLU networks are faster to train. Also, because the ReLU function can reach 0, it can lead to sparse models; while we will still search for the right model architecture, getting it precisely right is not as much of a concern. However—and this is where the trade-off comes in—the outputs of neurons with ReLU activation functions can reach really large, positive magnitudes. Some of the theoretical advances in machine learning over the past few years have been on how to initialize and train ReLUs without having the intermediate outputs of the neural network go flying off the handle.
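
As a quick sanity check of this clamping behavior, we can call TensorFlow’s built-in ReLU op directly (this snippet is just an illustration and is not part of the model code):

import tensorflow as tf

print(tf.nn.relu([-3.0, 0.0, 3.0]).numpy())  # prints [0. 0. 3.]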

Reading Data into TensorFlow

Let’s begin by writing TensorFlow code to read in our data and do the training. We’ll train the neural network on the dataset that we created using Cloud Dataflow. Recall that we took the raw flights data and used Cloud Dataflow to compute new features that we could use as inputs to our machine learning model—the average departure delay at the airport, at the hour we are flying out, and the average arrival delay currently at the airport we are flying into. The target (or label) that we are trying to predict is also in the dataset—it is 1 if the flight is “on time” (that is, delayed on arrival by less than 15 minutes) and 0 otherwise.

Ultimately, we want to submit our TensorFlow program to Cloud AI Platform so that it can be run in the cloud. For that to happen, our program needs to be a Python module. Python’s packaging mechanism is essentially built off the filesystem, and so we begin by creating a directory structure:4

flights
flights/trainer
flights/trainer/__init__.py

The __init__.py, though empty, is required to be present for trainer to function as a Python module.

At this point, we are ready to write our actual code. The code will reside in two files: task.py will contain the main() function and model.py will contain the machine learning model.

For easy development, let’s develop the code in a notebook flights_model_tf2.ipynb and then we can move it over to model.py. In the Jupyter notebook, which we develop using Cloud AI Platform Notebooks (see Chapter 5), let’s verify that the data from the Dataflow pipeline we wrote in the previous chapter is present:

DATA_BUCKET = "gs://{}/flights/chapter8/output/".format(BUCKET)
TRAIN_DATA_PATTERN = DATA_BUCKET + "train*"
VALID_DATA_PATTERN = DATA_BUCKET + "test*"
!gsutil ls $DATA_BUCKET

While developing this code, we want to use a small dataset so that the notebook remains responsive. To do that, we will read in only a small number of examples from the data:

NUM_EXAMPLES = 1000*1000 # assume 1 million examples

Let’s begin by writing a function to read in the data. We begin by importing the tensorflow package and then defining the header of the comma-separated value file (CSV) file we are about to read:

import tensorflow as tf

CSV_COLUMNS = (
    'ontime,dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay,'
    'carrier,dep_lat,dep_lon,arr_lat,arr_lon,origin,dest').split(',')
LABEL_COLUMN = 'ontime'

The TensorFlow CSV reader asks us to specify default values for the columns just in case the column value is empty. It also uses the default value to infer the data type of the column.

DEFAULTS     = [[0.0],[0.0],[0.0],[0.0],[0.0],[0.0],
                ['na'],[0.0],[0.0],[0.0],[0.0],['na'],['na']]

If we specify the default for a column as 0, it will be a tf.int32, but if we specify it as 0.0, it will be a tf.float32. Any columns whose default value is a string will be taken to be tf.string.

Let’s now read in the first 3 lines and print them out:

dataset = tf.data.experimental.make_csv_dataset(
    TRAIN_DATA_PATTERN, batch_size=1,
    column_names=CSV_COLUMNS, column_defaults=DEFAULTS)
for n, data in enumerate(dataset):
  numpy_data = {k: v.numpy() for k, v in data.items()}
  print(numpy_data)
  if n >= 2: break

The make_csv_dataset() function returns a TensorFlow dataset that can be iterated through. Each iteration returns a batch of rows, and because we specified batch_size=1, we will get one row at a time. Note that we break once n reaches 2, so we end up reading only three rows. Each row is retrieved as a dictionary of key-value pairs. The key is the name of the column (e.g., origin) and the value is a tensor whose length is the batch_size. We convert each tensor to a NumPy array through its numpy() method. Thus, the first line of the output is:

{'origin': array([b'ORD'], dtype=object), 'dep_delay': array([21.],
  dtype=float32), 'arr_lon': array([-97.60083], dtype=float32),
  'distance': array([693.], dtype=float32), 'arr_lat': array([35.393055],
  dtype=float32), 'avg_dep_delay': array([39.491936], dtype=float32),
  'avg_arr_delay': array([0.], dtype=float32), 'taxiout': array([12.],
  dtype=float32), 'dest': array([b'OKC'], dtype=object), 'ontime': 
  array([1.], dtype=float32), 'dep_lat': array([41.979443], 
  dtype=float32), 'carrier': array([b'MQ'], dtype=object), 
  'dep_lon': array([-87.9075], dtype=float32)}

Having verified that we can read the data, let’s write a read_dataset() function that reads the training data, yielding batch_size examples each time, and allows us to stop iterating once a certain number of examples have been read. This is the function that we want:

def read_dataset(pattern, batch_size, 
                 mode=tf.estimator.ModeKeys.TRAIN, truncate=None):

The reason for the mode parameter is that we want to reuse the function for reading both the training and the evaluation data. During evaluation, we need to read the entire dataset only once. During training, though, we need to read the dataset and pass it through the model several times. In addition, if we are training on multiple workers, we want the workers to see different examples. We can achieve this by calling shuffle() with a large enough buffer. Putting these concepts together, we have:

  if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(batch_size*10)
    dataset = dataset.repeat()
  if truncate is not None:
    dataset = dataset.take(truncate)

Shuffling the order in which the sharded input data is read each time is important for distributed training. The way distributed training is carried out is that each of the workers is assigned a batch of data to process. The workers compute the gradient on their batch and send it to “parameter servers”5 that maintain shared state of the training run. For reasons of fault tolerance, the results from very slow workers might be discarded. Therefore, it is important that the same batch of data not be assigned to the same slow worker in each run. Shuffling the data helps mitigate this possibility.

The example data consists of both features and the label. It’s better to separate them to make the later code easier to read. Hence, we’ll apply a map() function to the dictionary and return a (features, label) tuple.6

def features_and_labels(features):
  label = features.pop('ontime') # this is what we will train for
  return features, label

dataset = dataset.map(features_and_labels)

In this example, we are reading CSV files using TensorFlow’s native ops. This is a trade-off between human readability and performance. The fastest way to read data into TensorFlow programs is to store the data as TFRecord files (with each example stored in tf.Example or tf.SequenceExample protocol buffers), but (as of this writing) there are no visualization or debugging tools that can read TFRecord files. The most convenient way to feed directly from Python is to construct tf.constant tensors directly from NumPy arrays, but this doesn’t scale to out-of-memory datasets. Storing and reading CSV files is a middle ground that provides us access to visualization and debugging tools (e.g., the seaborn visualization package) while also providing reasonably fast reading speeds from TensorFlow.
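
Putting all of these pieces together, the complete read_dataset() function might look something like the following sketch (the prefetch() call is an optional optimization that overlaps reading with training and is not strictly required):

def read_dataset(pattern, batch_size,
                 mode=tf.estimator.ModeKeys.TRAIN, truncate=None):
  dataset = tf.data.experimental.make_csv_dataset(
      pattern, batch_size=batch_size,
      column_names=CSV_COLUMNS, column_defaults=DEFAULTS)
  dataset = dataset.map(features_and_labels)  # (features dict, label) tuples
  if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(batch_size*10)  # important for distributed training
    dataset = dataset.repeat()  # loop over the training data indefinitely
  dataset = dataset.prefetch(1)  # overlap reading with computation
  if truncate is not None:
    dataset = dataset.take(truncate)
  return dataset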

Now that we have the code to read the data, let’s copy it over to model.py and write a main() in task.py to invoke this method. Let’s use Python’s argparse library to be able to pass in the name of the bucket as a command-line argument:

import argparse

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--bucket',
      help='Training data will be in gs://BUCKET/flights/chapter8/output/',
      required=True
  )
  # parse args
  args = parser.parse_args()
  arguments = args.__dict__

In model.py, we can use the provided bucket information to obtain the training and evaluation file patterns:

    BUCKET = args['bucket']
    DATA_BUCKET = "gs://{}/flights/chapter8/output/".format(BUCKET)
    TRAIN_DATA_PATTERN = DATA_BUCKET + "train*"
    EVAL_DATA_PATTERN = DATA_BUCKET + "test*"

We can then read the first few lines using:

def read_lines():
    logging.info("Checking input pipeline batch_size={}".format(TRAIN_BATCH_SIZE))
    one_item = read_dataset(TRAIN_DATA_PATTERN, TRAIN_BATCH_SIZE, truncate=1)
    print(list(one_item))  # should print one batch of items

We can now invoke the task module and get it to read the first 3 lines as follows:

%%bash
export PYTHONPATH="$PWD/flights"
python3 -m trainer.task --bucket $BUCKET \
        --train_batch_size=3 --func=read_lines

Note that we are setting the PYTHONPATH (within which modules will be searched) to the flights directory and then invoking the module trainer.task. The module parses the command-line flags, sets TRAIN_BATCH_SIZE to 3, and then invokes the read_lines() function named by the --func argument.
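
For completeness, here is one simple way task.py might dispatch on the --func flag; the exact mechanism in the book’s repository may differ, and the flag defaults shown here are illustrative:

  parser.add_argument('--func', default='read_lines',
                      help='Name of the function in model.py to run')
  parser.add_argument('--train_batch_size', type=int, default=64)
  # parse args
  args = parser.parse_args()
  arguments = args.__dict__

  import trainer.model as model
  model.TRAIN_BATCH_SIZE = arguments['train_batch_size']  # push settings into model.py
  getattr(model, arguments['func'])()  # e.g., read_lines or find_average_label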

If you run into errors about missing Python packages, install them by using pip7 (install pip first, if necessary). Alternatively, run the code inside a Cloud AI Platform Notebook that already has TensorFlow installed. 

To do something interesting, let’s find the average of all the labels in the dataset by computing the sum and count as we go along:

def find_average_label():
    dataset = read_dataset(TRAIN_DATA_PATTERN, 1, truncate=NUM_EXAMPLES)
    labels = dataset.map(lambda x, y: y)
    count, sum = labels.reduce((0.0, 0.0),
                               lambda state, y: (state[0] + 1.0, state[1] + y))
    print(sum / count)  # average of the whole lot

We can find the average label in the first 100 lines using:

python3 -m trainer.task --bucket $BUCKET \
        --num_examples=100 --func=find_average_label

However, we don’t want to have to program at such a low level. Instead, we will use Keras to write our machine learning model.

Training and Evaluation in Keras

Keras is an open-source library that simplifies the writing of machine learning models and can work with a variety of backends, including TensorFlow. We can create a Keras model, specifying the model function, the types of feature engineering to be performed, the evaluation metrics to compute, and how to export a serving endpoint (see Figure 9-3). Then, we call methods such as fit(), evaluate(), and predict() on the model. The distribution strategy will take care of calling the optimizer for the model in a distributed way (i.e., across several machines) to adjust the weights of the model every time a batch of training examples is read.8

Figure 9-3. The training and evaluation loop involves calls on the Keras model.

The framework shown in Figure 9-3 asks us to provide everything needed to train, evaluate, and predict with the model. Specifically, we need to provide:

  • The machine learning model, including the feature columns.

  • A training input function. This is the function that will be called to read the training data. As with our read_dataset() function, this function will need to return a batch of features and their corresponding labels.

  • An evaluation input function, which is like the training input function, except that it will be called to read the test data.

  • An export strategy9 along with a serving input function. The export strategy specifies when a model should be saved—typically we just save the final iteration. The serving input function is the input function that will be used to read the inputs at prediction time.

Model Function

In Chapter 7, we built a logistic regression model based on three continuous variables: departure delay, taxi-out time, and distance. We then tried to add one more variable—the origin airport—and because the origin is a categorical variable, it needed to be one-hot encoded. One-hot encoding the origin airport ended up creating more than a hundred new columns, making the model two orders of magnitude more complex. Thus, the addition of this fourth variable caused the Spark ML model to collapse (although BigQuery ML was able to handle this just fine).

Here, let’s build a logistic regression model in Keras, but because we do have many more columns now, let’s use them all. As discussed earlier in this chapter, logistic regression is simply a linear model with a sigmoidal output node.

output = tf.keras.layers.Dense(1, 
              activation='sigmoid', name='pred')(inputs)
model = tf.keras.Model(inputs, output)

The model contains a single layer that is fully connected (“dense”) to its inputs, with one output and a sigmoid activation function.

We cannot, however, pass the input values as-is into the neural network. As in Chapter 7, we will have to convert all the inputs into a single vector of floating-point values. The process of converting the raw inputs into floating-point values that are amenable to being input into a machine learning model is called feature engineering. In Keras, the raw inputs are Input layers, and the conversion is carried out by feature columns. Let’s look at those next.10

Input and Features

When creating the model, we specify a FeatureColumn for each input feature. Features that are continuous numbers correspond to a numeric_column—fields like departure delay, taxi-out time, distance, average delays, latitude, and longitude are all real valued:

def get_inputs_and_features():
    real = {
        colname : tf.feature_column.numeric_column(colname) 
          for colname in 
            ('dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay' +
             ',dep_lat,dep_lon,arr_lat,arr_lon').split(',')
    }
    sparse = {
      'carrier': tf.feature_column.categorical_column_with_vocabulary_list(
          'carrier',
          vocabulary_list='AS,VX,F9,UA,US,WN,HA,EV,MQ,DL,OO,B6,NK,AA'.split(',')),
      'origin' : tf.feature_column.categorical_column_with_hash_bucket(
          'origin', hash_bucket_size=1000), # FIXME
      'dest'   : tf.feature_column.categorical_column_with_hash_bucket(
          'dest', hash_bucket_size=1000)
    }
    sparse = {
        colname : tf.feature_column.indicator_column(col)
          for colname, col in sparse.items()
    }

Features that are discrete (and have to be one-hot encoded [see Chapter 7]) are represented by categorical_column. The airline carrier can be one of the following strings:

AS,VX,F9,UA,US,WN,HA,EV,MQ,DL,OO,B6,NK,AA

Thus, it is represented by a sparse column with those specific keys. This is called the vocabulary of the column; to find the vocabulary of the carrier codes, I used BigQuery:

SELECT
  DISTINCT UNIQUE_CARRIER
FROM
  flights.tzcorr
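
If you would rather fetch this vocabulary programmatically than paste the query result into the code, the BigQuery client library can do it; a sketch, assuming the google-cloud-bigquery package is installed and the environment is authenticated:

from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    'SELECT DISTINCT UNIQUE_CARRIER FROM flights.tzcorr').result()
carriers = [row['UNIQUE_CARRIER'] for row in rows]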

Although I could have done the same thing for the origin and destination codes (most likely by saving the airport codes from the BigQuery result set to a file and reading that file from Python), I decided to use a shortcut by mapping the airport codes to hashed buckets; rather than find all the origin airports in the dataset, I ask TensorFlow to compute a deterministic hash of the airport code and then discretize the hash into 1,000 buckets (a number larger than the number of unique airports). Provided the hash works as intended, the airports will be spread roughly uniformly across the 1,000 bins. For any bucket with only one airport in it, this is equivalent to one-hot encoding. However, there is likely to be some small amount of collision, and so using the hash rather than explicitly specifying the keys will be somewhat worse. This, however, is the kind of thing we can fix after we have an initial version up and running—hence the FIXME in the code.

Once we have the categorical columns, we one-hot encode them. The one-hot encoded columns are called indicator columns.

The Input layers map 1:1 to the input features and their types, so rather than repeat the column names, I can quite easily do:

    inputs = {
        colname : tf.keras.layers.Input(
            name=colname, shape=(), dtype='float32') 
          for colname in real.keys()
    }
    inputs.update({
        colname : tf.keras.layers.Input(
            name=colname, shape=(), dtype='string') 
          for colname in sparse.keys()
    })

Now that we have the model in place, let’s move on to implementing the training and evaluation input functions.

Training and Evaluating Input Functions

Once we have the input functions and model function, we can put it all together to train the model:

    features = tf.keras.layers.DenseFeatures(
        list(sparse.values()) + list(real.values()), name='features')(inputs)
    output = tf.keras.layers.Dense(
        1, activation='sigmoid', name='pred')(features)
    model = tf.keras.Model(inputs, output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    train_dataset = read_dataset(TRAIN_DATA_PATTERN, 
                                 train_batch_size)
    eval_dataset = read_dataset(EVAL_DATA_PATTERN, 
                                eval_batch_size,
                                tf.estimator.ModeKeys.EVAL,
                                num_eval_examples)

    history = model.fit(train_dataset, 
                        validation_data=eval_dataset,
                        epochs=epochs, 
                        steps_per_epoch=steps_per_epoch,
                        validation_steps=10)

Saving and Exporting

When predicting, the input values will come through a REST call. Thus, the application invoking our trained model will supply us all the input parameters (dep_delay, taxiout, etc.) as a JSON string. In the JSON that comes in, all real-valued columns will be supplied as floating-point numbers and all sparse columns will be supplied as strings. In order for this to happen, we need to save the trained model in a format that can be deployed:

export_dir = os.path.join(OUTPUT_DIR,
                          'export/flights_{}'.format(
                          time.strftime("%Y%m%d-%H%M%S")))
tf.saved_model.save(model, export_dir)

With all the components in place, we are now ready to run the code.

Performing a Training Run

We can now run the training from the directory that contains task.py:

python task.py \
       --bucket $BUCKET --num_examples=1000

However, we want to run it as a Python module. To do that, we add the path to the module to the Python search path and then invoke it by using python -m:

export PYTHONPATH=${PYTHONPATH}:${PWD}/flights
python -m trainer.task \
   --bucket $BUCKET --num_examples=1000

In Chapter 7, we discussed the need for a metric that is independent of threshold and captures the full spectrum of probabilities. For comparison purposes, therefore, it would be good to also compute the RMSE. We can do this by adding an evaluation metric to the model definition:11

metrics=['accuracy', rmse]

The rmse() function is defined as follows:

def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true))) 

When we run the code, we get the RMSE:

loss: 2.4929 - accuracy: 0.3854 - rmse: 0.7060 - val_loss: 3.7618 -
    val_accuracy: 0.5490 - val_rmse: 0.6543

Of course, this is after training on just 1,000 examples and evaluating on a very small number. We do need to train and evaluate on the larger datasets before we can draw any conclusions. 

Training in the Cloud

Training on the complete dataset is easy after we have the Python module as described in the previous sections. We simply need to submit the training job to Cloud AI Platform using the gcloud command.

JOBNAME=flights_$(date -u +%y%m%d_%H%M%S)
gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/flights/trainer \
  --job-dir=$OUTPUT_DIR \
  --runtime-version 2.0 \
  --staging-bucket=gs://$BUCKET \
  --master-machine-type=n1-standard-4 --scale-tier=CUSTOM \
  -- \
  --bucket=$BUCKET --num_examples=10000000

The parameters are mostly self-evident. We provide the package path and the module name similar to what we provided when we executed it locally. 

The training took 35 minutes and I got an RMSE of 0.21, which is pretty much the same as what we obtained when we used logistic regression in BigQuery ML in Chapter 7. This should not come as a surprise—whether in Keras or in BigQuery, we are training the same model, and so should expect pretty similar results.

What happens if we change our model from a linear model to a deep neural network? In Keras, if we want two hidden layers with 64 and 8 nodes, we would insert a couple of Dense layers that have a relu activation function:

    features = tf.keras.layers.DenseFeatures(
        list(sparse.values()) + list(real.values()), name='features')(inputs)
    h1 = tf.keras.layers.Dense(
        64, activation='relu', name='h1')(features)
    h2 = tf.keras.layers.Dense(
        8, activation='relu', name='h2')(h1)
    output = tf.keras.layers.Dense(
        1, activation='sigmoid', name='pred')(h2)
    model = tf.keras.Model(inputs, output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

In BigQuery ML, we could achieve this by changing the model type:

CREATE OR REPLACE MODEL flights.arr_delay_airports_dnn
OPTIONS(input_label_cols=['ontime'], 
        model_type='dnn_classifier',
        hidden_units=[64, 8], 
        data_split_method='custom',
        data_split_col='is_eval_day')
AS

The result with a deep neural network is an RMSE of 0.205, which is not a meaningful difference. But let’s not give up just yet!

Now that we have more data, TensorFlow/Keras in our toolchest, and the ability to train machine learning models on the larger dataset, why not also improve our machine learning modeling?

Wide-and-Deep Model

A recent paper suggests using a hybrid model that the authors call a wide-and-deep model. In the wide-and-deep model, there are two parts. One part directly connects the inputs to the outputs; in other words, it is a linear model. The other part connects the inputs to the outputs via a deep neural network. The modeler places the sparse columns in the linear part of the model, and the real-valued columns in the deep part of the model.

In addition, real-valued columns whose precision is overkill (thus, likely to cause overfitting) are discretized and made into categorical columns. For example, if we have a column for the age of the aircraft, we might discretize into just three bins—less than 5 years old, 5 to 20 years old, and more than 20 years old.
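
Our flights dataset does not have an aircraft-age column, but if it did, this discretization could be expressed with a bucketized column; the column name and boundaries below are purely illustrative:

aircraft_age = tf.feature_column.numeric_column('aircraft_age')  # hypothetical column
age_buckets = tf.feature_column.bucketized_column(
    aircraft_age, boundaries=[5, 20])  # <5 years, 5-20 years, >20 years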

Finally, a process called feature crossing is applied to categorical features that work well in combination. Think of a feature cross as being an AND condition. If you have a column for colors and another column for sizes, the feature cross of colors and sizes will result in sparse columns for color-size combinations such as red-medium.

Let’s apply these techniques to our model function. Recall that our get_inputs_and_features() returns two Python dictionaries of features: one is a dict of real-valued columns, and the other is a dict of sparse columns.

Among the real-valued columns are the latitude and longitude of the departure and arrival airports. The latitudes themselves should not have much of an impact on a flight being early or late, but the location of the airport and the flight path between pairs of cities do play a part. For example, flights along the West Coast of the United States are rarely delayed, whereas flights that pass through the high-traffic area between Chicago and New York tend to experience a lot of delays. This is true even if the flight in question does not originate in Chicago or New York.

Indeed, the Federal Aviation Administration in the United States manages airplanes in flight in terms of air traffic corridors or areas (see Figure 9-4). We can make the machine learning problem easier for the model if there were a way to provide this human insight directly, instead of expecting it to be learned directly from the raw latitude and longitude data.

Figure 9-4. Air traffic corridors

Even though we could explicitly program in the air traffic corridors, let’s use a shortcut: we can discretize the latitudes and longitudes (the blue and orange arrows in Figure 9-5) and cross the buckets—this will result in breaking up the country into grids and yield the grid point into which a specific latitude and longitude falls.

Figure 9-5. Bucketizing latitude and longitude essentially separates out the space into grid boxes

The following code takes the real-valued latitude and longitude columns and discretizes each of them into NBUCKETS buckets:

import numpy as np

latbuckets = np.linspace(20.0, 50.0, NBUCKETS).tolist()    # USA
lonbuckets = np.linspace(-120.0, -70.0, NBUCKETS).tolist() # USA
disc = {}
disc.update({
    'd_{}'.format(key): tf.feature_column.bucketized_column(real[key], latbuckets)
    for key in ['dep_lat', 'arr_lat']
})
disc.update({
    'd_{}'.format(key): tf.feature_column.bucketized_column(real[key], lonbuckets)
    for key in ['dep_lon', 'arr_lon']
})

The dictionary disc at this point contains four discretized columns: d_dep_lat, d_arr_lat, d_dep_lon, and d_arr_lon. We can take these discretized columns and cross them to create two sparse columns: one for the box within which the departure lat-lon falls, and another for the box within which the arrival lat-lon falls:

sparse['dep_loc'] = tf.feature_column.crossed_column(
     [disc['d_dep_lat'], disc['d_dep_lon']], NBUCKETS*NBUCKETS)
sparse['arr_loc'] = tf.feature_column.crossed_column(
     [disc['d_arr_lat'], disc['d_arr_lon']], NBUCKETS*NBUCKETS)

We can also create a feature cross of the pair of departure and arrival grid cells, essentially capturing flights between two boxes. In addition, we also feature cross the departure and arrival airport codes (e.g., ORD–JFK for flights that leave Chicago’s O’Hare airport and arrive at New York’s John F. Kennedy airport):

sparse['dep_arr'] = tf.feature_column.crossed_column(
      [sparse['dep_loc'], sparse['arr_loc']], NBUCKETS ** 4)
sparse['ori_dest'] = tf.feature_column.crossed_column(
      ['origin', 'dest'], hash_bucket_size=1000)

Even though we want to use the sparse columns directly in the linear part of the model, we would also like to perform dimensionality reduction on them and use them in the deep part of the model (note that embedding_column operates on the categorical columns themselves, before they are wrapped in indicator columns):

embed = {
    'embed_{}'.format(colname):
        tf.feature_column.embedding_column(col, 10)
    for colname, col in sparse.items()
}
real.update(embed)

With the sparse and real feature columns thus enhanced beyond the raw inputs, we can create a wide_and_deep_classifier passing in the linear and deep feature columns separately:

def wide_and_deep_classifier(inputs, 
  linear_feature_columns, dnn_feature_columns, dnn_hidden_units):
    deep = tf.keras.layers.DenseFeatures(
               dnn_feature_columns, name='deep_inputs')(inputs)
    for layerno, numnodes in enumerate(dnn_hidden_units):
        deep = tf.keras.layers.Dense(numnodes, 
          activation='relu', name='dnn_{}'.format(layerno+1))(deep)        
    wide = tf.keras.layers.DenseFeatures(
          linear_feature_columns, name='wide_inputs')(inputs)
    both = tf.keras.layers.concatenate([deep, wide], name='both')
    output = tf.keras.layers.Dense(
          1, activation='sigmoid', name='pred')(both)
    model = tf.keras.Model(inputs, output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
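
To build the model, we then pass the wide and deep feature columns into this function. A sketch of the call, assuming the sparse dictionary (with its crossed columns wrapped in indicator columns, just as we did for the other sparse columns) and the real dictionary (with the embedding columns added) built earlier; the hidden-layer sizes are simply a starting point that we will tune shortly:

model = wide_and_deep_classifier(
    inputs,
    linear_feature_columns=sparse.values(),  # wide part: one-hot and crossed columns
    dnn_feature_columns=real.values(),       # deep part: real-valued and embedding columns
    dnn_hidden_units=[64, 8])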

Hyperparameter Tuning

In our model, we made a number of arbitrary choices. For example, the number of layers and the number of hidden nodes was essentially arbitrary. As discussed earlier, more layers help the model learn more complex input spaces, but it is difficult to have an intuition about how difficult this particular problem (predicting flight delays) is. However, the choice of model architecture does matter—choosing too few layers will result in a suboptimal classifier, whereas choosing too many layers might result in overfitting. We need to select an appropriate number of layers and nodes.

The optimizer uses gradient descent, but computes the gradients on small batches. We used a batch size of 64, but that choice was arbitrary. The larger the batch size, the quicker the training run will complete because the network overhead scales with the number of batches—with larger batches, we have fewer batches to complete an epoch, and so the training will complete faster. However, if the batch size is too large, the sensitivity of the optimizer to specific data points reduces, and this hurts the ability of the optimizer to learn the nuances of the problem. Even in terms of efficiency, too large a batch will cause matrix multiplications to spill over from more efficient memory to less efficient memory (such as from GPU memory to CPU memory). Thus, the choice of batch size matters.

There are other arbitrary choices that are specific to our model. For example, we discretized the latitude and longitude into five buckets each. What should this number of buckets actually be? Too low a number, and we will lose the discrimination ability; too high a number, and we will begin to overfit.

As a final step in improving the model, we’ll carry out an experiment with different choices for these three parameters: number of hidden units, batch size, and number of buckets. Even though we could laboriously carry out these experiments one-by-one, we will use a feature of Cloud AI Platform that allows for a nonlinear hyperparameter tuning approach. We’ll specify ranges for these three parameters, specify a maximum number of trials we want to try out, and have Cloud AI Platform carry out a search in hyperparameter space for the best set of parameters. The search itself is carried out by using an optimization technique that avoids laboriously trying out every possible set of parameters.

Model changes

The first thing we do is to make three key changes to our training program:

  • Add a hyperparameter evaluation metric

  • Change the output directory so that different runs don’t clobber one another

  • Add command-line parameters for each of the hyperparameters

Because you might have many evaluation metrics (accuracy, recall, precision, RMSE, AUC, etc.), we need to instruct Cloud AI Platform which evaluation metric it should use for tuning. We do this by adding a new evaluation metric with the particular name that Cloud AI Platform will look for:

    import hypertune  # from the cloudml-hypertune package

    final_rmse = history.history['val_rmse'][-1]

    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='rmse',
        metric_value=final_rmse,
        global_step=1)

In our case, we use the RMSE as the evaluation metric when it comes to finding the optimal set of parameters and obtain the RMSE from the history that Keras returns from fit().

Hyperparameter tuning involves many runs of the training program with different sets of parameters. We need to make sure that the different runs don’t checkpoint into the same directory. If they do, fault tolerance will be difficult—when a failed worker is restarted, from which of the many checkpoints should it resume? To keep the different training runs from clobbering one another, we change the output directory slightly in task.py:

output_dir = os.path.join(
    output_dir,
    json.loads(
        os.environ.get('TF_CONFIG', '{}')
    ).get('task', {}).get('trial', '')
)

What we have done is to find the trial number from the environment variable TF_CONFIG that Cloud AI Platform sets in the worker’s environment so that we can append it to the user-specified output path. For example, checkpoints from trial number 7 will now go to gs://cloud-training-demos-ml/flights/chapter9/output/7/model.ckpt, thus keeping the checkpoints separate from those of other trials.

Finally, we add command-line parameters for each of the variables that we want to optimize, making sure that the training code uses the value from the command line. For example, we begin by making the batch_size a command-line parameter in task.py:

parser.add_argument(
      '--train_batch_size',
      help='Number of examples to compute gradient on',
      type=int,
      default=64
  )

Then, we make sure to parse the train_batch_size and set the model parameter: 

TRAIN_BATCH_SIZE = int(args['train_batch_size'])

This is repeated for the nbuckets and dnn_hidden_units command-line parameters, except that the hidden units need to be converted from a string input at the command line to a list of numbers before they can be passed to the model.
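
A minimal sketch of that conversion (the flag defaults and the NBUCKETS/DNN_HIDDEN_UNITS names are illustrative; they should match whatever the model code expects):

parser.add_argument(
      '--nbuckets',
      help='Number of bins for discretizing latitude and longitude',
      type=int,
      default=5
  )
parser.add_argument(
      '--dnn_hidden_units',
      help='Comma-separated hidden-layer sizes, for example "64,16,4"',
      default='64,8'
  )

# after args = parser.parse_args().__dict__ :
NBUCKETS = int(args['nbuckets'])
DNN_HIDDEN_UNITS = [int(x) for x in args['dnn_hidden_units'].split(',')]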

Hyperparameter configuration file

The second thing we do is to write a configuration file that specifies the search space for the hyperparameters:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highmem-2
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: rmse
    maxTrials: 50
    maxParallelTrials: 5
    params:
    - parameterName: train_batch_size
      type: INTEGER
      minValue: 16
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nbuckets
      type: INTEGER
      minValue: 5
      maxValue: 10
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: dnn_hidden_units
      type: CATEGORICAL
      categoricalValues: ["64,16", "64,16,4", "64,64,64,8", "256,64,16"]   

You can use the configuration file, in YAML format, to specify any of the command-line parameters. Note, for example, that we use it to specify the scaleTier here, running the model on a single high-memory machine. In addition, we are specifying the hyperparameters, telling the tuner that our evaluation metric (rmse) is to be minimized over 50 trials12 with 5 of them happening in parallel.

Then, we specify the parameters to be optimized. The parameter train_batch_size is an integer; we ask for it to look for values in the interval [16, 512]—the logarithmic scale instructs the tuner that we would like to try more values at the smaller end of the range rather than the larger end of the range. This is because long-standing experience suggests that smaller batch sizes yield more accurate models.

The nbuckets parameter is also an integer, but linearly distributed between 5 and 10. The FAA seems to have about 36 grid boxes into which it divides up the airspace (see Figure 9-4). This argues for nbuckets=6 (since 6 × 6 = 36), but the corridors are significantly narrower in the Northeast part of the United States, and so perhaps we need more fine-grained grid cells. By specifying nbuckets in the range 5 to 10, we are asking the tuner to explore having between 25 and 100 grid cells into which to divide up the United States.

As for dnn_hidden_units, we explicitly specify a few candidates—a two-layer network ("64,16"), a three-layer network ("64,16,4"), a four-layer network ("64,64,64,8"), and a three-layer network with many more nodes ("256,64,16"). If it turns out that the optimal parameter is near the extrema, we will repeat the hyperparameter tuning with a different range. For example, if it turns out that nbuckets = 10 is best, we should repeat the tuning, but trying out nbuckets in the range 10 to 15 the next time. Similarly, if a four-layer network turns out to be best, we will need to also try a five-layer and a six-layer network.

Running hyperparameter tuning

Submitting a hyperparameter tuning job is just like submitting a training job—you can accomplish it by using gcloud. The only difference is that there is now an extra parameter that points to the configuration file described in the previous section:

gcloud ai-platform jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=$(pwd)/flights/trainer \
  --job-dir=$OUTPUT_DIR \
  --staging-bucket=gs://$BUCKET \
  --config=hyperparam.yaml \
  --runtime-version 2.0 \
  -- \
  --bucket=$BUCKET --num_examples=100000

A few hours later, the output directory as well as the GCP console is populated with the results of each of the trials as shown in Figure 9-6 (due to random seeds, your results might be different):

Figure 9-6. The results of hyperparameter tuning of the model in the Google Cloud web console.

In other words, with a batch size of 381, a two-layer network, and five buckets, the RMSE is 0.176. As a comparison, in Chapter 7, using BigQuery ML and a linear model with no feature engineering, we were able to obtain an RMSE of 0.21. Using deep neural networks alone didn’t do much better.

This should underscore the importance and impact of better ML models, feature engineering, and hyperparameter tuning in conjunction. Feature engineering and the wide-and-deep model alone might have left us with an RMSE of 0.205 (see Trial ID no. 4 in Figure 9-6)!

Deploying the Model

Now that we have a trained model, let’s use it to make predictions. The saved model has all the pieces needed to accept inputs and make predictions. All that we need to do is to deploy the model with a REST endpoint.

Deploying a model involves giving the model a name and a version—the version is useful in case we want to do A/B testing of the current model in parallel with a previous version before promoting the model from staging to production.

Within the export directory, each training run saves its model in a folder named with a timestamp. In my case, trial no. 15 was the one that gave me the best results. So, I pick the latest model saved in that trial’s output folder:

BEST_MODEL="15"
EXPORT_PATH=$(gsutil ls \
    gs://$BUCKET/flights/trained_model/${BEST_MODEL}/export | tail -1)

Then, because this is the first time, we will create a model (flights) and a version (tf2) of that model:

gcloud ai-platform models create flights --regions us-central1
gcloud ai-platform versions create tf2 --model flights \
       --origin ${EXPORT_PATH} --framework=tensorflow --python-version=3.5 \
       --runtime-version=2.0 --staging-bucket=gs://$BUCKET

Predicting with the Model

Now that a model version has been created, we can send it REST requests from any language. Let’s do it from Python. The first step is to authenticate and get the credentials to access the service deployed by Cloud AI Platform:13

#!/usr/bin/env python
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json
credentials = GoogleCredentials.get_application_default()

The next step is to use the Google APIs Discovery Service14 to form a client:

api = discovery.build('ml', 'v1', credentials=credentials,
      discoveryServiceUrl=
      'https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

PROJECT = 'cloud-training-demos'
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'flights', 'tf2')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))

This code refers to request_data, which is where we provide a dictionary of input variables whose names and types match the inputs of our exported model:

request_data = {'instances':
  [
    {
      "dep_delay": 14.0, 
      "taxiout": 13.0, 
      "distance": 319.0, 
      "avg_dep_delay": 25.863039, 
      "avg_arr_delay": 27.0, 
      "carrier": "WN", 
      "dep_lat": 32.84722, 
      "dep_lon": -96.85167, 
      "arr_lat": 31.9425, 
      "arr_lon": -102.20194, 
      "origin": "DAL", 
      "dest": "MAF"
    }
  ]
}

The result for the preceding request is this response:

{'predictions': [{'pred': [0.7853261828422546]}]}

From this JSON, we see that the probability of the flight being on time is 0.78. The name of the field (pred) is based on the name that we gave the output node in Keras. As you can see, making a trained model operational is quite straightforward in Cloud AI Platform. In this section, we looked at a specific way of running a TensorFlow model—as a web service. However, it is also possible to embed the model directly into our code. We explore both of these options in Chapter 10.

Explaining the Model

Why might the model believe that a flight will be late with a probability of 0.83 (as in the example we will work through shortly)? An active area of research in machine learning is to provide the reasoning that underlies a specific model prediction in a form that humans can understand.

One way to diagnose why the model behaves in a certain way is to use the What-If tool that is integrated with Cloud AI Platform Notebooks. For more details on this approach, see https://cloud.google.com/blog/products/ai-machine-learning/introducing-the-what-if-tool-for-cloud-ai-platform-models.

Another rather simple approach is to replace key predictors by average values (averages computed from the training dataset) to see the impact of that change. We can use this to provide some text explaining why the model thinks the flight will be late. Essentially, then, instead of sending in just one request, we send it several variants of that request:

request_data = {'instances':
  [
      {
        'dep_delay': dep_delay,
        'taxiout': taxiout,
        'distance': 160.0,
        'avg_dep_delay': 13.34,
        'avg_arr_delay': avg_arr_delay,
        'carrier': 'AS',
        'dep_lat': 61.17,
        'dep_lon': -150.00,
        'arr_lat': 60.49,
        'arr_lon': -145.48,
        'origin': 'ANC',
        'dest': 'CDV'
      }
      for dep_delay, taxiout, avg_arr_delay in
        [[16.0, 13.0, 67.0],
         [13.3, 13.0, 67.0], # if dep_delay was the airport mean
         [16.0, 16.0, 67.0], # if taxiout was the global mean
         [16.0, 13.0, 4] # if avg_arr_delay was the global mean
        ]
  ]
}

We are passing in four instances for prediction. The first instance consists of the actual observed values. The next three involve variants. The second instance consists of the average departure delay at this airport along with observed values for taxi-out time and average arrival delay. The third instance is a variant where the taxi-out time has been changed to the average taxi-out time for all flights in the training dataset.15 Similarly, the fourth instance involves changing the arrival delay to the mean arrival delay in the training dataset. Of course, we could use the average corresponding to the actual airport they are flying to, but I’m trying to avoid having a bunch of lookup tables that need to be known by the client code.

Why these three variables and not the others? It doesn’t help the user to inform her that her flight would not have been delayed if she were flying to Columbus, Ohio, rather than to Cincinnati (CVG)—she is sitting on a flight that is flying to Cincinnati! Thus, we treat some of the variables as “given,” and try out only variants of the variables that are unique about the user’s current experience. Devising what variants to use in this manner requires some thought and customer centricity.

The resulting responses from the predictions service can be parsed as

probs = [pred[u'pred'][0] 
              for pred in response[u'predictions']]

to yield the following array of probabilities (rounded off):

[0.17, 0.27, 0.07, 0.51]

From this, we can surmise that the average arrival delay is the feature that had the most impact. Because the arrival delay at CDV is 67 minutes, and not 4 minutes, the likelihood of the flight being on time has decreased from 0.51 to 0.17. The departure delay of 16 minutes versus the average of 13 minutes also contributes to the overall delay likelihood, but its impact is only about 30% (0.10/0.34) of the impact of the arrival delay. On the other hand, the reduced taxi-out time has helped; had it been 16.0 minutes, the on-time arrival likelihood would have been even lower.
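
A quick way to see these impacts is to subtract the baseline probability from each variant; the labels in this snippet are just for readability:

base = probs[0]  # 0.17, the prediction for the actual flight
variants = ['dep_delay at airport mean', 'taxiout at global mean',
            'avg_arr_delay at global mean']
for name, p in zip(variants, probs[1:]):
    print('{}: {:+.2f}'.format(name, p - base))
# dep_delay at airport mean: +0.10
# taxiout at global mean: -0.10
# avg_arr_delay at global mean: +0.34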

Although not very detailed, it is very helpful to accompany a model prediction with a simple reason such as: “There is an 83% chance that you will arrive late. This is mostly because the average arrival delay at CDV is 67 minutes now. Also, your flight left the gate 16 minutes late; 13 minutes is more typical.” Go the extra step and provide a bit of understandability to your model.

Summary

In this chapter, we extended the machine learning approach that we started in Chapter 7, using the TensorFlow library instead of Spark MLlib. Realizing that one-hot encoding categorical columns results in an explosion of the dataset size, we used TensorFlow to carry out distributed training. Another advantage that TensorFlow provides is that its design allows a computer scientist to go as low-level as they need to, and so many machine learning research innovations are implemented in TensorFlow. As machine learning practitioners, therefore, using TensorFlow allows us to take advantage of innovative machine learning research soon after it is published rather than wait for a reimplementation in some other framework. Finally, using TensorFlow allows us to deploy the model rather easily into our data pipelines regardless of where they are run, because TensorFlow is portable across a wide variety of hardware platforms.

We trained a logistic regression model on all of the input values and realized that using the extra features resulted in a reduction in RMSE, with a significant chunk of the improvement being provided by the time averages that were added by the Cloud Dataflow pipeline that we built in Chapter 8.

We discussed that, intuitively, the nodes in a deep neural network help provide decision hyperplanes, and that successive layers help to combine individual hyperplanes into more complex decision surfaces. Using a deep neural network instead of logistic regression didn’t provide any benefit with our inputs, though. However, bringing in human insight in the form of additional features that bucketed some of the continuous features, creating feature crosses, and using a wide-and-deep model yielded a further reduction in the RMSE.

After we had a viable machine learning model and features, we carried out hyperparameter tuning to find optimal values of batch size, number of buckets, and neural network architecture. We discovered that a larger batch size, a relatively small two-layer network, and five buckets worked best, bringing the RMSE down further.

For speed of experimentation, we had trained and hyperparameter-tuned the model on a sample of the full dataset. So, next, we trained the model with the chosen features and hyperparameters on the full dataset.

Finally, we deployed the model and invoked it using REST APIs to do online prediction. We also employed a simple method of providing rationales to accompany the machine learning predictions.

1 If you come from a statistics background, training a machine learning model is the same thing as fitting a statistical model or function to data.

2 In this context, dense inputs are those where small differences in numeric values are meaningful—that is, where the inputs are continuous numbers.

3 See https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html. It is a common misconception that the primary advantage of custom machine learning chips is to reduce training time—it is in prediction that TPUs offer the largest advantage. For example, https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html solely talks about the performance advantages that TPUs bring to inference. After this chapter was written, Google announced the second generation of the TPU, which sped up both training and inference: https://blog.google/topics/google-cloud/google-cloud-offer-tpus-machine-learning/. The order in which these chips were developed and released implies the ordering of the relative benefits that a custom chip can bring to inference and to training.

4 See 09_cloudml/flights in the GitHub repository for this book.

5 For more details, see https://research.google.com/pubs/pub44634.html.

6 For full context, look at the complete code in https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/master/09_cloudml/flights_model_tf2.ipynb.

7 For example, pip install tensorflow.

8 The alternative option—of writing low-level TensorFlow code and managing device placement and distribution yourself—is not one that I recommend.

9 Technically, the export strategy and serving input function are not required. But I fail to see the point of training a machine learning model that you will not use for prediction.

10 The complete code may be found in the GitHub repository of this book at https://github.com/GoogleCloudPlatform/data-science-on-gcp/blob/master/09_cloudml/flights/trainer/model.py.

11 See the full context on GitHub at https://github.com/GoogleCloudPlatform/data-science-on-gcp/blob/master/09_cloudml/flights/trainer/model.py.

12 Note that hyperparameter tuning in Cloud ML Engine is not a grid search. The number of combinations of parameters can be many more than 50. The tuner chooses the exploration strategy to figure out the best 50 sets of parameters to try.

13 See https://github.com/GoogleCloudPlatform/data-science-on-gcp/blob/master/09_cloudml/call_predict.py for full context.

14 For details, see https://developers.google.com/discovery.

15 I found this using BigQuery: SELECT AVG(taxi_out) FROM ...
