8 Experimentation in action: Finalizing an MVP with MLflow and runtime optimization

This chapter covers

  • Approaches, tools, and methods to version-control ML code, models, and experiment results
  • Scalable solutions for model training and inference

In the preceding chapter, we arrived at a solution to one of the most time-consuming and monotonous tasks that we face as ML practitioners: fine-tuning models. By having techniques to solve the tedious act of tuning, we can greatly reduce the risk of producing ML-backed solutions that are inaccurate to the point of being worthless. In the process of applying those techniques, however, we quietly welcomed an enormous elephant into the room of our project: tracking.

Throughout the last several chapters, we have been required to retrain our time-series models each time that we do inference. For the vast majority of other supervised learning tasks, this won’t be the case. Those other applications of modeling, both supervised and unsupervised, will have periodic retraining events, between which each model will be called for inference (prediction) many times.

Regardless of whether we’ll have to retrain daily, weekly, or monthly (you really shouldn’t be letting a model go stale for longer than that), we will have versions of not only the final production model that will generate scoring metrics, but also the optimization history of automated tuning. Add to this volume of modeling information a wealth of statistical validation tests, metadata, artifacts, and run-specific data that is valuable for historical reference, and you have yourself a veritable mountain of critical data that needs to be recorded.

In this chapter, we’ll go through logging our tuning run data to MLflow’s tracking server, enabling us to have historical references to everything that we deem important to store about our project’s solution. Having this data available is valuable not merely for tuning and experimentation; it’s also critical for monitoring the long-term health of your solution. Having referenceable metrics and parameter search history over time helps inform ways to potentially make the solution better, and also gives insight into when the performance degrades to the point that you need to rebuild the solution.

Note A companion Spark notebook provides examples of the points discussed in this chapter. See the accompanying GitHub repository for further details, if interested.

8.1 Logging: Code, metrics, and results

Chapters 2 and 3 covered the critical importance of communicating about modeling activities, both to the business and among a team of fellow data scientists. Being able to not only show our project solutions but also keep a provenance history for reference is just as important to the project's success as the algorithms used to solve it, if not more so.

For the forecasting project that we’ve been covering through the last few chapters, the ML aspect of the solution isn’t particularly complex, but the magnitude of the problem is. With thousands of airports to model (which, in turn, means thousands of models to tune and keep track of), handling communication and having a reference for historical data for each execution of the project code is a daunting task.

What happens when, after our forecasting project has been running in production, a member of the business unit team asks why a particular forecast was so far off from the data that was eventually collected? This is a common question at companies that rely on ML predictions to inform the actions the business should take. If a black swan event occurs and the business is asking why the forecasting solution didn't foresee it, the very last thing you want to be doing is trying to regenerate what the model would have forecasted at that point in time just to explain that unpredictable events cannot be modeled.

Note A black swan event is an unforeseeable and often catastrophic event that changes the nature of acquired data. While rare, these events can have disastrous effects on models, businesses, and entire industries. Some recent black swan events include the September 11th terrorist attacks, the financial collapse of 2008, and the Covid-19 pandemic. Due to the far-reaching and entirely unpredictable nature of these events, the impact on models can be absolutely devastating. The term “black swan” was coined and popularized in reference to data and business in the book The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb (Random House, 2007).

To solve these intractable issues that ML practitioners have had to deal with historically, MLflow was created. The aspect of MLflow that we’re going to look at in this section is the Tracking API, giving us a place to record all of our tuning iterations, our metrics from each model’s tuning runs, and pre-generated visualizations that can be easily retrieved and referenced from a unified graphical user interface (GUI).

8.1.1 MLflow tracking

Let’s look at what is going on with the two Spark-based implementations from chapter 7 (section 7.2) as they pertain to MLflow logging. In the code examples shown in that chapter, the initialization of the context for MLflow was instantiated in two distinct places.

In the first approach, using SparkTrials as the state-management object (running on the driver), the MLflow context was placed as a wrapper around the entire tuning run within the function run_tuning(). This is the preferred method of orchestrating run tracking when using SparkTrials, because a parent run's individual child runs can then be associated easily for querying, both from the tracking server's GUI and from REST API requests that involve filter predicates.
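As a rough sketch of that wrapper pattern (not a reproduction of the chapter 7 listing), the function below opens a single parent run before handing a SparkTrials object to Hyperopt's fmin. The search space, the run_tuning() signature, and the train_and_score() helper are illustrative stand-ins for the project's real code, and the per-trial child runs are assumed to be logged by the SparkTrials integration available in managed environments such as Databricks.

import mlflow
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

# Illustrative Holt-Winters-style search space; the real project's space differs.
search_space = {
    "smoothing_level": hp.uniform("smoothing_level", 0.01, 0.99),
    "smoothing_seasonal": hp.uniform("smoothing_seasonal", 0.01, 0.99),
}

def objective(params):
    # train_and_score() is a hypothetical helper that fits a model with these
    # hyperparameters and returns a validation loss.
    loss = train_and_score(params)
    return {"loss": loss, "status": STATUS_OK}

def run_tuning(experiment_name, max_evals=200, parallelism=8):
    mlflow.set_experiment(experiment_name)
    trials = SparkTrials(parallelism=parallelism)
    # The parent run wraps the whole tuning session; each trial executed on
    # the workers is then associated with it as a child run for querying.
    with mlflow.start_run(run_name="tuning_parent"):
        best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                    max_evals=max_evals, trials=trials)
        mlflow.log_params(best)
    return best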

Figure 8.1 shows a graphical representation of this code interacting with MLflow's tracking server. The code records not only the metadata of the encapsulating parent run, but also the per-iteration logging that occurs on the workers as each hyperparameter evaluation happens.


Figure 8.1 MLflow tracking server logging using distributed hyperparameter optimization

When looking at the actual code manifestation within the MLflow tracking server’s GUI, we can see the results of this parent-child relationship, shown in figure 8.2.


Figure 8.2 Example of the MLflow tracking UI

Conversely, the approach used for the pandas_udf implementation is slightly different. In chapter 7's listing 7.10, each individual iteration that Hyperopt executes requires the creation of a new experiment. Since there is no parent-child relationship to group the data together, custom naming and tagging are required to make the runs searchable within the GUI and, more importantly for production-capable code, through the REST API. The overview of the logging mechanics for this alternative (and, for this use case of thousands of models, more scalable) implementation is shown in figure 8.3.


Figure 8.3 MLflow logging logical execution for the pandas_udf distributed model approach.
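A rough sketch of that per-group pattern follows, using Spark's applyInPandas grouped-map API rather than reproducing listing 7.10. The tune_single_series() helper, the column names, and the tag values are assumptions for illustration, and the tracking URI must be resolvable from the worker nodes, since each group's run is opened inside an executor process.

import mlflow
import pandas as pd
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

result_schema = StructType([
    StructField("airport", StringType()),
    StructField("best_loss", DoubleType()),
])

def tune_airport(pdf: pd.DataFrame) -> pd.DataFrame:
    airport = pdf["airport"].iloc[0]
    # No parent run exists here, so naming and tagging carry the burden of
    # making each airport's run discoverable via the GUI and the REST API.
    with mlflow.start_run(run_name=f"tuning_{airport}"):
        mlflow.set_tags({
            "airport": airport,
            "project": "airport_forecasting",
            "run_type": "per_airport_tuning",
        })
        best_loss = tune_single_series(pdf)  # hypothetical per-series tuner
        mlflow.log_metric("best_loss", best_loss)
    return pd.DataFrame({"airport": [airport], "best_loss": [best_loss]})

# raw_df is assumed to be a Spark DataFrame with one row per airport per date.
results = raw_df.groupBy("airport").applyInPandas(tune_airport, schema=result_schema)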

Regardless of which methodology is chosen, the important point of this discussion is that we've solved a large problem that frequently causes projects to fail. (Each methodology has its own merits: for a single-model project, SparkTrials is by far the better option, while for the forecasting scenario shown here, with thousands of models, the pandas_udf approach is far superior.) We've solved the historical tracking and organization woes that have hamstrung ML project work for a long time. Being able to readily access the results of not only our testing, but also the state of a model currently running in production as of the point of its training and scoring, is simply an essential aspect of creating successful ML projects.

8.1.2 Please stop printing and log your information

Now that we’ve seen a tool that we can use to keep track of our experiments, tuning runs, and pre-production training for each prediction job that is run, let’s take a moment to discuss another best-practice aspect of using a tracking service when building ML-backed projects: logging.

The number of times that I've seen print statements in production ML code is truly astonishing. Most of the time, they are forgotten (or intentionally left in for future debugging) lines meant to let the developer know that code is being executed (and whether it's safe to go get a coffee while it runs). At no point outside of coffee breaks during solution development will these print statements ever be seen by human eyes again. The top of figure 8.4 shows the irrelevance of these print statements within a code base.

Figure 8.4 compares methodologies that are frequent patterns in ML project code, particularly in the top two areas. While the top portion (printing to stdout in notebooks that get executed on some periodicity) is definitely not recommended, it is, unfortunately, the most frequent habit seen in industry. For more sophisticated teams that are writing packaged code for their ML projects (or using languages that can be compiled, like Java, Scala, or a C-based language), the historical recourse has been to log information about the run to a logging daemon. While this does maintain a historical reference for the data record, it also involves a great deal of either ETL or, more commonly, ELT in order to extract information in the event that something goes wrong. The final block in figure 8.4 demonstrates how utilizing MLflow solves these accessibility concerns, as well as the historical provenance needs for any ML solution.


Figure 8.4 Comparison of information storage paradigms for ML experimentation

To be explicit, I'm not saying to never use print or log statements. They have remarkable utility when debugging particularly complex code bases and are incredibly useful while developing solutions. That utility begins to fade as you transition to production development: the print statements are no longer looked at, and parsing logs to retrieve status information becomes far less palatable when you're busy with other projects.

If critical information needs to be recorded for a project's code execution, it should be logged and recorded for future reference at all times. Before tools like MLflow solved this problem, many DS teams would record this critical production information in a table in an RDBMS. Larger-scale groups with dozens of solutions in production may have utilized a NoSQL solution to handle scalability. The truly masochistic would write ELT jobs that parse system logs to retrieve critical data about their models. MLflow simplifies all of these situations by creating a cogent, unified framework for metric, attribute, and artifact logging, eliminating the time-consuming work of ML logging.
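As a minimal illustration of that split (with a made-up airport code, parameter, and metric), transient status messages go to a standard logger, while anything you might be asked about months later goes to the tracking server:

import logging

import mlflow

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger("forecasting")

def record_run(airport: str, params: dict, metrics: dict) -> None:
    # Transient status: useful while watching the job, worthless in six months.
    logger.info("Starting training for airport=%s", airport)
    # Durable record: parameters and metrics live on the tracking server and
    # remain queryable long after the job's stdout has vanished.
    with mlflow.start_run(run_name=f"train_{airport}"):
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)

record_run("KJFK", {"smoothing_level": 0.2}, {"mae": 1037.4})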

As we saw in the earlier examples running on Spark, we were recording additional information to these runs beyond what would typically be associated with a tuning execution. We logged the per-airport metrics and parameters for historical searchability, as well as charts of our forecasts. If we had additional data to record, we could simply add a tag through the API in the form of mlflow.set_tag(<key>, <value>) for run information logging or, for more complex information (visualizations, data, models, or highly structured data), log it as an artifact with mlflow.log_artifact(<local path to the file>).
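A sketch of those two calls might look like the following, assuming that history and forecast are pandas DataFrames already in scope and that a forecast chart is the artifact worth keeping; the tag values, column names, and file path are purely illustrative.

import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run(run_name="forecast_KJFK"):
    # Simple key-value information attaches directly to the run as tags.
    mlflow.set_tag("airport", "KJFK")
    mlflow.set_tag("model_family", "exponential_smoothing")

    # Complex information (here, a forecast chart) is written to the local
    # filesystem first and then attached to the run as an artifact.
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(history["date"], history["passengers"], label="observed")
    ax.plot(forecast["date"], forecast["prediction"], label="forecast")
    ax.legend()
    fig.savefig("/tmp/kjfk_forecast.png")
    mlflow.log_artifact("/tmp/kjfk_forecast.png")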

Keeping a history of all the information surrounding a particular model tuning and training event in a single place, external to the system used to execute the run, can save countless hours of frustrating work when you're asked to explain what happened with a particular build and need to re-create the exact conditions the model saw when it was trained. Being able to answer the business's questions quickly preserves its faith in your model's performance, dramatically reducing the chances of project abandonment, and it saves a great deal of time when improving an underperforming model.

8.1.3 Version control, branch strategies, and working with others

One of the biggest aspects of development work that affects a timely and organized delivery of a project to the MVP phase is the way a team (or an individual) interacts with a repository. In our example scenario, with a relatively sizable ML team working on individual components of the forecasting model, the ability for everyone to contribute to pieces of the code base in a structured and controlled manner is absolutely critical for eliminating frustrating rework, broken code, and large-scale refactoring. While we haven't been delving into what the production version of this code would look like (it wouldn't be developed in a notebook, that's for certain), the general design would look something like the module layout in figure 8.5.


Figure 8.5 An initial repository structure for the forecasting project

As the project progresses, different team members of the project will be contributing to different modules within the code base at any given time. Some, within the sprint, may be tackling tasks and stories surrounding the visualizations. Others on that sprint may be working on the core modeling classes, while the common utility functions will be added to and refined by nearly everyone on the team.

Without both a strong version-control system and a foundational process for committing code to that repository, the chances of the code base becoming significantly degraded or broken are high. While most aspects of ML development differ significantly from traditional software engineering, the one aspect that is completely identical between the two fields is version control and branched development practices.

To prevent issues arising from incompatible changes being merged to the master branch, each story or task that a DS takes from a sprint should have its own branch cut from the current build of the repo's master branch. Within this branch, the new features should be built, updates to common functionality made, and new unit tests added to assure the team that the modifications are not going to break anything. When it comes time to close out the story (or task), the DS who developed the code will need to ensure that the entire project's code passes both the unit tests (especially for modules and functionality that they did not modify) and a full-run integration test before submitting a peer review request to merge the code into master.

Figure 8.6 shows the standard approach for ML project work when dealing with a repository, regardless of the repository technology or service used. Each has its own nuances, functionality, and commands, which we won’t get into here; what’s important is the way the repository is used, rather than how to use a particular one.


Figure 8.6 Repository management process during feature development for an ML team

Following a code-merging paradigm like this one avoids a great deal of frustration and wasted time, leaving the DS team members more time to solve the actual problem of the project rather than untangling merge hell and fixing code broken by a bad merge. Effective testing of merge candidates brings a higher level of project velocity and can dramatically reduce the chances of project abandonment by creating a more reliable, stable, and bug-free code base for the project.

8.2 Scalability and concurrency

Throughout this project that we've been working on, the weightiest and most complex aspect of the solution has been scalability. When we talk about scalability here, we're really talking about cost. The longer that VMs are running and executing our project code, the higher the silent ticker of our bill climbs. Anything that we can do to maximize the utilization of that hardware as a function of time will keep the bill manageable, reducing the business's concern about the total cost of the solution.

Throughout the second half of chapter 7, we evaluated two strategies for scaling our problem to support modeling many airports. The first, parallelizing the hyperparameter evaluation over a cluster, reduced the per-model training time significantly compared to the serial approach. The second, parallelizing the actual per-model training across a cluster, scaled the solution in a slightly different way (one that favors the many-models, reasonable-training-iterations scenario), reducing the solution's cost footprint far more.

As mentioned in chapter 7, these are but two ways of scaling this problem, both involving parallel implementations that distribute portions of the modeling process across multiple machines. However, we can add a layer of additional processing to speed these operations up even more. Figure 8.7 shows an overview of our options for increasing the throughput for ML tasks to reduce the wall-clock time involved in building a solution.


Figure 8.7 Comparison of execution paradigms

Moving down the scale in figure 8.7 brings a trade-off between simplicity and performance. For problems that require the scale that distributed computing can offer, it is important to understand the level of complexity that will be introduced into the code base. The challenges with these implementations are no longer confined to the DS part of the solution; they require increasingly sophisticated engineering skills to build.

Gaining the knowledge and ability to build large-scale ML projects that leverage systems capable of handling distributed computation (for example, Spark, Kubernetes, or Dask) will help ensure that you are capable of implementing solutions requiring scale. In my own experience, my time has been well spent learning how to leverage concurrency and the use of distributed systems to accelerate the performance and reduce the cost of projects by monopolizing available hardware resources as much as I can.

For the purposes of brevity, we won’t go into examples of implementing the last two sections of figure 8.7 within this chapter. However, we will touch on examples of concurrent operations later in this book.

8.2.1 What is concurrency?

In figure 8.7, you can see the term concurrency listed in the bottom two solutions. For most data scientists who don't come from a software engineering background, this term may easily be misconstrued as parallelism. It is, after all, effectively doing a bunch of things at the same time.

Concurrency is the ability to make progress on many tasks during the same period of time. It implies neither a particular ordering of those tasks nor that they execute simultaneously; it merely requires that the system, and the code instructions being sent to it, be capable of having more than one task in flight at the same time.

Parallelism, on the other hand, works by dividing a task into subtasks that are executed simultaneously on discrete threads or cores of a CPU or GPU. Spark, for instance, executes tasks in parallel across a distributed system of discrete cores within its executors.

These two concepts can be combined on a system that supports them both: one made up of multiple machines, each of which has multiple cores available to it. This architecture is shown in the bottom section of figure 8.7. Figure 8.8 illustrates the differences between parallel execution, concurrent execution, and the hybrid parallel-concurrent system.


Figure 8.8 Comparison of execution strategies
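As a compact, single-machine illustration of the distinction (with invented airport keys and a dummy CPU-bound task standing in for tuning), the same function can be mapped over tasks concurrently with a thread pool or in parallel with a process pool:

import concurrent.futures as cf
import math

AIRPORTS = ["KJFK", "KLGA", "KEWR", "KBOS"]  # illustrative airport keys

def tune_one(airport: str) -> float:
    # Stand-in for a CPU-bound, per-airport tuning task.
    return sum(math.sqrt(i) for i in range(2_000_000))

if __name__ == "__main__":
    # Concurrency: threads interleave progress on many tasks; in CPython this
    # helps most when the tasks spend their time waiting on I/O.
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(tune_one, AIRPORTS))

    # Parallelism: separate processes execute subtasks simultaneously on
    # different cores, which is what CPU-bound model training benefits from.
    with cf.ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(tune_one, AIRPORTS))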

Leveraging the right execution strategy for the type of problem being solved can dramatically reduce the cost of a project. While it may seem tempting to use the most complex approach for every problem (parallel concurrent processing in a distributed system), it simply isn't worth it. If the problem that you're trying to solve can be implemented on a single machine, it's always best to reduce the infrastructure complexity by going with that approach. Move down the path of greater infrastructure complexity only when you need to: when the data, the algorithm, or the sheer number of tasks is so large that a simpler approach is not possible.

8.2.2 What you can (and can’t) run asynchronously

For a final note on improving runtime performance, it is important to mention that not every problem in ML can be solved through the use of parallel execution or on a distributed system. Many algorithms require maintaining state to function correctly, and as such, cannot be split into subtasks to execute on a pool of cores.

The scenario that we’ve gone through in the past few chapters with univariate time series could certainly benefit from parallelizing. We can parallelize both the Hyperopt tuning and the model training. The isolation that we can achieve within the data itself (each airport’s data is self-contained and has no dependency on any other’s) and the tuning actions means that we can dramatically reduce the total runtime of our job by appropriately leveraging both distributed processing and asynchronous concurrency.

When selecting opportunities for improving performance of a modeling solution, you should be thinking about the dependencies within the tasks being executed. If there is an opportunity to isolate tasks from one another, such as separating model evaluation, training, or inference based on filters that can be applied to a dataset, it could be worthwhile to leverage a framework that can handle this processing for you.

However, many tasks in ML cannot be distributed (or, at least, cannot be distributed easily). Models that require access to the entirety of a feature training set are poor candidates for distributed training. Other models may be capable of being distributed but simply have not been, because of either a lack of demand or the technological complexity involved in building a distributed solution. The best bet, when wondering whether an algorithm or approach can leverage concurrency or parallelism through distributed processing, is to read the library documentation for popular frameworks. If an algorithm hasn't been implemented on a distributed processing framework, there's likely a good reason: either simpler approaches are available that fulfill the same requirements as the model you're looking into (highly likely), or the development and runtime costs for building a distributed version of the algorithm are astronomically high.

Summary

  • Utilizing an experimentation-tracking service such as MLflow throughout a solution's life cycle can dramatically increase auditability and historical monitoring for projects. Additionally, utilizing version control and logging in production code bases reduces troubleshooting time and allows for diagnostic reporting of the project's health while in production.
  • Learning to use and implement solutions in a scalable infrastructure is incredibly important for many large-scale ML projects. While not appropriate for all implementations, understanding distributed systems, concurrency, and the frameworks that enable these paradigms is crucial for an ML engineer.