13. MLOps—DevOps for machine learning

In the previous chapter, we covered machine learning (ML) deployments in Azure using automated Azure Machine Learning deployments for real-time scoring services, Azure Pipelines for batch prediction services, and ONNX, FPGAs, and Azure IoT Edge for alternative deployment targets. If you have read all of the chapters preceding this one, you will have seen and implemented a complete end-to-end ML pipeline with data cleansing, preprocessing, labeling, experimentation, model development, training, optimization, and deployment.

Congratulations on making it this far! You now possess all the skills needed to connect the bits and pieces together for MLOps and to create DevOps pipelines for your ML models.

Throughout this book, we have emphasized how every step of the ML training and deployment process can be scripted through Bash, PowerShell, the Python SDK, or any other library wrapping the Azure Machine Learning REST service. This is true for creating environments, starting and scaling clusters, submitting experiments, performing parameter optimization, and deploying fully fledged scoring services on Kubernetes. In this chapter, we will reuse all of these concepts to build a version-controlled, reproducible, automated ML training and deployment process as a continuous integration/continuous deployment (CI/CD) pipeline in Azure.

First, we will take a look at how to ensure reproducible builds, environments, and deployments with Azure DevOps. We will look at this from a code and artifact perspective and decide what to do with both to ensure that the same model is trained each time a build is started. We will then apply the same approach to registering and versioning data, which will allow you to audit your training and know, at all times, which data was used to train a specific model.

Next, we will take a look at how to validate your code, and its quality, automatically. You are probably already familiar with some testing techniques for application development. However, we will take these techniques to the next level to test the quality of datasets and the responses of ML deployments.

In this chapter, we will cover the following topics:

  • Ensuring reproducible builds and deployments
  • Validating your code, data, and models

We'll begin by exploring a number of methods to ensure the reproducibility of your builds and deployments.

Ensuring reproducible builds and deployments

DevOps has many different meanings, but it is usually oriented toward enabling rapid and high-quality deployments when source code changes. One way of achieving high-quality operational code is to guarantee reproducible and predictable builds, which is also crucial for creating reproducible ML pipelines. While it seems obvious for application development that the compiled binary will look and behave in a similar manner, with only a few minor configuration changes, the same is not true for the development of ML pipelines.

There are four main problems that ML engineers and data scientists face that make building reproducible deployments very difficult:

  • The development process is often performed in notebooks, so it is not always linear.
  • There are mismatching library versions and drivers.
  • Source data can be changed or modified.
  • Non-deterministic optimization techniques can lead to completely different outputs.

We have discussed these issues in the first few chapters of this book, and you have probably seen them in a lot of places when implementing ML models and data pipelines, particularly in interactive notebooks such as Jupyter, JupyterLab, Databricks, Zeppelin, and Azure Notebooks. While interactive notebooks have the great advantage of letting you execute individual cells to validate blocks of code iteratively, they also often encourage a user to run cells in a non-linear order. The very benefit of using a notebook environment becomes a pain when trying to productionize or automate a pipeline.

The second issue that is quite common in ML is ensuring that the correct drivers, libraries, and runtimes are installed. While it is easy to run a linear regression model based on scikit-learn in either Python 2 or 3, it makes a huge difference whether your CUDA, cuDNN, libgpu, OpenMPI, Horovod, and PyTorch versions match in deployment and behave as they did during development. Using Docker helps a lot in providing reproducible environments, but using it consistently throughout the experimentation, training, optimization, and deployment processes is not straightforward.

Another big problem faced by many data scientists is that often, data changes over time. Either a new batch of data is added during development, or data is cleaned, written back to the disk, and reused as input for a new experiment. Data, due to its variability in format, scale, and quality, can be one of the biggest issues when producing reproducible models. Thinking about data versions and checkpoints similarly to how you would think about version-controlling source code is absolutely essential, not only for reproducible builds but also for auditing purposes.

The last problem that makes ML deployments very difficult is that they often contain an optimization step, as discussed in Chapter 9, Hyperparameter tuning and Automated Machine Learning. While this optimization, either for model selection, training, hyperparameter tuning, or stacking, is essential to the ML life cycle, it adds a layer of uncertainty to your automatic deployment if non-deterministic processes are used. Let's find out how we can fight these problems step by step.

Azure DevOps gives you a great set of functionalities to automate everything in your CI/CD process. In general, it lets you run pieces of functionality, called tasks, grouped together in pipelines on a compute infrastructure that you define. You can either run pipelines that are triggered automatically through a new commit in your version control system or trigger them manually through a button; for example, for semi-automated deployments. Build pipelines run statelessly and don't output anything, whereas release pipelines are stateful pipelines that are supposed to generate artifacts and use them for releases and deployments. Making your ML pipelines reproducible ensures that all the stages you go through to train your model, such as data preparation, hyperparameter tuning, and model evaluation, feed into each other predictably without you having to reinvent the wheel each time.

Version-controlling your code

This is not optional; using version control for your source code, data transformations, experiments, training scripts, and so on is essential. While many people and organizations might not be comfortable with storing code in private GitHub, GitLab, or Bitbucket repositories, you can also create a private repository in Azure DevOps. Creating a new project in Azure DevOps automatically creates a new Git repository for you.

Using version control for your code at all is more important than which version control system you use. Git works well, but so does Mercurial, and some people work with Subversion (SVN). However, making yourself familiar with the basic workflows of the version control system that you choose is essential. In Git, you should be able to create branches and commits, submit pull requests (PRs), comment on and review requests, and merge and squash changes.

This is also where the power lies: documenting changes. Changing your code should trigger an automatic pipeline that validates and tests your changes and, when successful and merged, trains your model and rolls it out to production. Your commit and PR history will not only become a great source of documenting changes, but is also useful when it comes to triggering, running, and documenting whether these changes are ready for production.

In order to work effectively with version control, it is essential that you try to move business logic out of your interactive notebooks as soon as possible. I would recommend using a hybrid approach, where you first test your code experiments in a notebook and gradually move the code to a module that is imported at the beginning of each notebook. Using auto-reload plugins, you can make sure that these modules get automatically reloaded whenever you change them, without needing to restart your kernel.
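If you work in Jupyter or JupyterLab, the IPython autoreload extension handles this for you. A minimal setup in the first cell of a notebook could look like the following, where the ml_utils module name is just a placeholder for your own package:

%load_ext autoreload
%autoreload 2

# Edits to ml_utils are picked up automatically on the next cell execution
from ml_utils import preprocessing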

Moving code from notebooks to modules will not only make your code more reusable in your own experiments—there will be no need to copy utility functions from notebook to notebook—but it will also make your commit log much more readable. When multiple people change a few lines of code in a massive JSON file (that's how your notebook environment stores the code and output of every cell), then the changes made to the file will be almost impossible to review and merge. However, if those changes are made in a module—a separate file containing only executable code—then these changes will be a lot easier to read, review, reason about, and merge.

Figure 13.1 shows the Azure DevOps repository view, which is a good starting point for all subsequent MLOps tasks. Please note that your source code doesn't have to be stored in Azure DevOps Git repositories; you can use many other popular code hosting services, such as GitHub, Bitbucket, or SVN, or you can even use your own custom Git server:

Figure 13.1: The Azure DevOps repository view

So, if you haven't already, brush up on your Git skills, create a (private) repository, and get started with version control; we will need it in the following sections.

Registering snapshots of your data

Building a versioning process around your training data is probably the hardest step that we will cover in this section. It is fairly straightforward to check small, readable (non-binary, non-compressed) data files into the version control system together with your source code. However, most data sources are binary, compressed, or simply too large to store in Git. This is what makes this step so complicated and is the reason why many ML engineers prefer to skip it rather than do it properly from the beginning.

So then, how is it done properly? You can think of it like this: whenever you execute the same code, it should always pull and use the same data predictably—regardless of whether you execute the script today or in a year from now. A second constraint is that when you change your data or the input source of the training data, then you want to make sure the change is reflected in the version control system. Sounds simple, right?

In general, we need to differentiate operational data (transactional, stateful, or mutable) from historical data (analytical, partitioned, or immutable). When working with operational data—for example, an operational database storing customer data—we need to always create snapshots before pulling in the data for training. When using efficient data formats, such as Parquet or Arrow, and scalable storage systems, such as Azure Blob storage, this should never be an issue—even if you have multiple TBs of data. Snapshots could, and should, be incremental, such that only new data is added in new partitions.

Another common example is that your data might change when you replace a sensor, or seasonal effects might appear in your data, both of which show up as data drift in your datasets. Suddenly, your model no longer performs as expected and its performance degrades. Once you have set up the pipelines described in this chapter, you can retrain the model without having to change any of the steps involved, because data preprocessing has become an automated and reproducible process.

When dealing with historical, immutable data, we usually don't need to create extra snapshots if the data is partitioned—that is, organized in directories. This will make it easier to modify your input data source to point to a specific range of partitions instead of pointing to a set of files directly.

Once you have the data in place, it is strongly recommended that you use Azure Machine Learning to create snapshots of your datasets before you get started. This will create and track a reference to the original data, and provide you with a pandas or PySpark interface to read the data. This data will define the input of your pipeline.
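As a sketch of what this could look like with the Azure Machine Learning Python SDK (the datastore path and dataset name below are placeholders), you register a versioned dataset once and then load exactly that version in your training code:

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Reference the partitioned training data on the datastore
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'training-data/2021/**/*.csv'))

# Register a named, versioned snapshot of this reference
dataset = dataset.register(workspace=ws,
                           name='customer-churn-training',
                           create_new_version=True)

# In the training script, load exactly this version again
df = Dataset.get_by_name(ws, 'customer-churn-training',
                         version=dataset.version).to_pandas_dataframe()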

Whenever you process data, it is helpful to parameterize your pipeline using a predictable placeholder. Looking up the current date in your program to determine which folder to write to is not very useful, as you will most likely have to execute the pipeline with the same parameters on multiple days when you run into errors. You should always parameterize pipelines from the calling script, such that you can always rerun failed pipelines and it will create the same outputs every time.

When using Azure DevOps pipelines to wrap your data preprocessing, cleaning, and feature engineering steps, your pipelines should always create—and eventually overwrite—the same output folder when called with the same arguments. This ensures that your pipeline stays reproducible, even when executed multiple days in a row for the same input data.

So, make sure that your input data is registered and versioned and that your output data is registered and parameterized. This takes a bit of configuring to set up properly, but it is worth it for the whole project life cycle.

Tracking your model metadata and artifacts

Moving your code to modules, checking it into version control, and versioning your data will help to create reproducible models. Whether you are building an ML model for an enterprise or for your own start-up, knowing which model and which version is deployed and used in your service is essential. This is relevant for auditing, debugging, and resolving customer inquiries regarding service predictions.

We have covered this in previous chapters, and hopefully you are convinced by now that it's not only beneficial but absolutely essential to track and version your models in a model registry. The model consists of artifacts, files that are generated while training (for example, the model architecture and model weights), and metadata (for example, the dataset snapshot and version used for training, validation, and testing, the commit hash to know which code has produced the model, and the experiment and run IDs to know which other parameter configurations were tested before the model was selected).
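A hedged sketch of how this metadata could be attached when registering a model with the Azure Machine Learning SDK (all names, paths, and tag values below are placeholders) might look like this:

from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path='outputs/model.pkl',              # artifact produced by the training run
    model_name='customer-churn-classifier',
    tags={
        'dataset': 'customer-churn-training:1',  # dataset name and version used for training
        'git_commit': 'abc1234',                 # commit hash that produced this model
        'run_id': 'churn-experiment_run_1'       # experiment run that selected the model
    })

print(model.name, model.version)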

Another important consideration is to specify and version-control the seed for your random number generators. During most training and optimization steps, algorithms will use pseudo-random numbers based on a random seed to shuffle data and choices. So, in order to produce the same model after running your code multiple times, you need to ensure that you set a fixed random number seed for every operation that is built on randomized behaviors.
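A minimal helper that you might call at the start of every training script could look like this; extend it with the seeding calls of whichever frameworks you actually use:

import os
import random

import numpy as np

def set_global_seed(seed=42):
    # Seed every source of pseudo-randomness used in the pipeline
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_global_seed(42)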

The good thing about tracking your model artifacts in a model registry—for example, in Azure Machine Learning—is that you automatically trigger release pipelines in Azure DevOps when the artifacts change. Figure 13.2 shows an Azure DevOps release pipeline, where you can select one or more ML models as artifacts for the pipeline, so updating a model in the registry can now trigger a release or deployment pipeline:

Figure 13.2: The Azure DevOps release pipeline

Once you understand the benefits of source code version control to your application code, you will understand that it makes a lot of sense for your trained models as well. However, instead of readable code, you now store the model artifacts—binaries that contain the model weights and architecture—and metadata for each model.

Enabling MLflow Tracking for your Azure Machine Learning workspace is another option for tracking and logging experiment metrics and artifacts. The MLflow integration lets you point your MLflow experiments at the workspace, so training metrics and models are stored centrally alongside your other runs. And because models tracked this way can also be deployed as web services through Azure Machine Learning, you keep all of the workspace's functionality, such as monitoring and data drift detection, for those models as well.
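A minimal sketch of this setup, assuming the azureml-mlflow package is installed and using placeholder experiment names and values, could look like this:

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

# Point MLflow at the workspace so runs, metrics, and models are stored centrally
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment('churn-mlflow-experiment')

with mlflow.start_run():
    mlflow.log_param('learning_rate', 0.01)
    mlflow.log_metric('accuracy', 0.93)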

Scripting your environments and deployments

Automating everything that you do more than once will ultimately save you a lot of time during development, testing, and deployment. The good thing with cloud infrastructure and services such as Azure Machine Learning and Azure DevOps is that the services provide you with all the necessary tools to automate every step easily. Sometimes, you will get an SDK, and sometimes, a specific automation will be built into the SDK directly—we have seen this for ML deployments where we could simply spin up an AKS cluster using Azure Machine Learning.

First of all, if you haven't done so already, you should start organizing your Python environments into requirements, pyenv, or conda files, and always start your projects with a clean standard environment. Whenever you add a package, add it to your requirements file and re-initialize your environment from the requirements file. This way, you'll ensure that you always have the libraries from your requirements file installed and nothing else.
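If you keep such a file in version control anyway, you can also turn it into a registered Azure Machine Learning environment. A minimal sketch, assuming a checked-in environment.yml, could look like this:

from azureml.core import Environment, Workspace

ws = Workspace.from_config()

# Recreate the training/scoring environment from the checked-in conda file
env = Environment.from_conda_specification(name='training-env',
                                           file_path='environment.yml')

# Version the environment alongside your code and models
env.register(workspace=ws)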

Azure DevOps can help you with this by running integration tests on clean images, where all of your used tools need to be installed automatically during the test. This is usually one of the first tasks to implement on an Azure DevOps pipeline. Then, whenever you check in new code and tests to your version control system, the Azure DevOps pipeline is executed and also tests the installation of your environment automatically. Therefore, it's good practice to add integration tests to all of your used modules, such that you can never miss a package definition in your environment.

Figure 13.3 shows you how to add a simple Python task to a release pipeline:

Figure 13.3: Adding a Python task to a release pipeline

If you have followed the previous chapters in this book, you might have figured out by now why we did all the infrastructure automation and deployments through an authoring environment in Python. If you have scripted these things, you can simply run and parameterize these scripts in the Azure DevOps pipelines.

The next step, which is usually a bit more difficult to achieve, is to script, configure, and automate the infrastructure. If you run a release pipeline that generates a model, you most likely want to spin up a fresh Azure Machine Learning cluster for this job so you don't interfere with other release or build pipelines or experimentation. While this level of automation is very hard to achieve on on-premises infrastructures, you can do this easily in the cloud. Many services, such as ARM templates in Azure or Terraform from HashiCorp, provide full control over your infrastructure and configuration.

The last part is to always automate deployments, especially with Azure Machine Learning. Deployments can be done through the UI, and we know it's easy to click and configure the right model, compute target, and scoring file from there. However, doing the same in code doesn't take much longer and gives you the benefit of a repeatable and reproducible deployment. If you have ever wondered whether you could simply deploy a new scoring endpoint to an AKS cluster (or even use a no-code deployment) whenever the model definition changes, then let me tell you that this is exactly what you are supposed to do.

You will often be confronted with many different ways of doing the same thing; for example, deploying an ML model from Azure Machine Learning via the CLI, Python, the UI, or a plugin in Azure DevOps. Figure 13.4 shows the package for deploying ML models directly through a task in Azure DevOps:

Figure 13.4: Deploying ML models in Azure DevOps

However, I recommend you stick to one way of doing things and then do all the automation and deployments in the same way. Having said this, using Python as the scripting language for deployments and checking your deployment code in version control is a good approach to take.

Reproducible builds and release pipelines are key and they have to begin at the infrastructure and environment level. Within the cloud, especially in Azure, this should be very easy, as most tools and services can be automated through the SDK.

Note

You can find an up-to-date example of an MLOps pipeline in the Microsoft GitHub repository: https://github.com/microsoft/MLOps.

The Azure Machine Learning team put a lot of work into the SDK so that you can automate each piece, from ingestion to deployment, from within Python. Therefore, I strongly recommend you use this functionality.

Validating your code, data, and models

When implementing a CI/CD pipeline, you need to make sure you have all the necessary tests in place to deploy your newly created code with ease and confidence. Once you are running a CI or a CI/CD pipeline, the power of automated tests will become immediately evident. It not only protects certain pieces of code from failing while you are developing them, but it also protects your entire process—including the environment, data requirements, model initialization, optimization, resource requirements, and deployment—for the future.

When implementing a validation pipeline for our ML process, we align ourselves with the classical application development principles:

  • Unit testing
  • Integration testing
  • End-to-end testing

We can translate these testing techniques directly to input data, models, and the application code of the scoring service.

Rethinking unit testing for data quality

Unit tests are essential to writing good-quality code. A unit test aims to test the smallest unit of code—a function—independently of all other code. Each test should only test one thing at a time and should run and finish quickly. Many application developers run unit tests either every time they change the code, or at least every time they submit a new commit to version control.

Here is a simple example of a unit test written in Python using the unittest module provided by the standard library in Python 3:

import unittest

class TestStringMethods(unittest.TestCase):

    def test_upper(self):
        self.assertEqual('foo'.upper(), 'FOO')

As you can see, we call a single function and assert that its result matches an expected value.

In Python, and many other languages, we differentiate between frameworks and libraries that help us to write and organize tests, and libraries to execute tests and create reports. pytest is a great library to execute tests, and so is tox. unittest and mock help you to set up and organize your tests in classes and mock out dependencies on other functions.

When you write code for your ML model, you will also find units of code that can, and should, be unit tested and should be tested on every commit. However, ML engineers, data engineers, and data scientists now deal with another source of errors in their development cycle: data. Therefore, it is a good idea to rethink what unit tests could mean in terms of data processing.

Once you get the hang of it, many doors open. Suddenly, you can treat each feature dimension of your input data as a unit that needs to be tested to ensure it fulfills your requirements. This is especially important as we are always planning to collect new data and retrain the model at some point—if not retrain it continuously as new training data is collected. Therefore, we always want to make sure that the data is clean.

So, when dealing with changing data over time and implementing CI/CD pipelines, you should always test your data to match the expected criteria. Good things to test in relation to each dimension include the following:

  • Unique/distinct values
  • Correlation
  • Skewness
  • Minimum/maximum values
  • The most common value
  • Values containing zero

Your unit test could look like the following example, and you can test all the individual requirements in separate tests:

import unittest

import pandas as pd

class TestDataFrameStats(unittest.TestCase):

    def setUp(self):
        # initialize and load df
        self.df = pd.DataFrame(data={'data': [0, 1, 2, 3]})

    def test_min(self):
        self.assertEqual(self.df.min().values[0], 0)

In the preceding code, we used unittest to organize the unit test in multiple functions within the same class. Each class could correspond to a specific data source, where we have wrappers testing each feature dimension. Once set up, we can install pytest and simply execute pytest from the command line to run the test.

In Azure DevOps, we can set up pytest or tox as a simple step in our build pipeline. For a build pipeline step, we can simply add the following block to the azure-pipelines.yml file:

- script: |
    pip install pytest
    pip install pytest-cov
    pytest tests --doctest-modules
  displayName: 'Test with pytest'

In the preceding code, we first installed pytest and pytest-cov to create a pytest coverage report. In the next line, we executed pytest against the tests directory, which will now load the dataset and compute all the statistical requirements. If the requirements are not met according to the tests, the tests will fail and we will see these errors in the UI for this build. This adds great protection to your ML pipeline, as you can now ensure that no unforeseen problems with the training data make it into a release without you noticing.

Unit testing is essential, and so is unit testing for data. As with testing in general, it takes some initial effort to implement, and the value isn't immediately obvious. However, you will soon see that having these tests in place will give you peace of mind when deploying new models faster, as errors with the training data are caught at build time and not when the model is already deployed.

Integration testing for ML

In application development, integration testing tests the combinations of multiple smaller units as individual components. You normally use a test driver to run the test suite and mock or stub other components in your tests that you don't want to test. In graphical applications, you could test a simple visual component while mocking the modules the component is interacting with. In the back-end code, you test your business logic module while mocking all dependent persistence, configuration, and UI components.

Integration tests, therefore, help you to detect critical errors when combining multiple units together, without the expense of scaffolding the entire application infrastructure. They sit between unit testing and end-to-end testing and are typically run per commit, branch, or PR on the CI runtime.

In ML, we can use the concept of integration testing to test the training process of an ML pipeline. This can help you find potential bugs and errors in your training run during the build phase. Integration testing allows you to test whether your model, pretrained weights, a piece of test data, and the optimizer together yield a successful output. However, different algorithms require different integration tests to check whether something is wrong in the training process.

When training a deep neural network model, you can test a lot of interesting aspects with integration tests. Here is a non-exhaustive list:

  • Verify correct weight initialization
  • Verify default loss
  • Verify zero input
  • Verify single-batch fitting
  • Verify activations
  • Verify gradients

Using a similar list, you can easily catch cases where all activations are capped at the maximum value (for example, 1) in a forward pass, or when all gradients are 0 during a backward pass. Any experiment, test, or check you would perform manually before working with a fresh dataset and your model can, in theory, be run continuously in your CI runtime. So, any time your model gets retrained or fine-tuned, these checks run automatically in the background.

A more general check is that, when training a regression model, the mean of your model's predictions should be close to the mean of the target values. When training a classifier, you could test the distribution of the output classes. In both cases, you can detect issues caused by modeling, data, or initialization errors sooner rather than later, and before embarking on the costly training and optimization process.
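As an illustration, a few of these checks for a small PyTorch classifier could be written as ordinary unittest cases. The model architecture, batch size, and thresholds below are purely illustrative assumptions:

import math
import unittest

import torch
import torch.nn as nn

class TestClassifierTraining(unittest.TestCase):

    def setUp(self):
        torch.manual_seed(42)
        self.num_classes = 10
        self.model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                                   nn.Linear(64, self.num_classes))
        self.loss_fn = nn.CrossEntropyLoss()
        self.x = torch.randn(32, 20)
        self.y = torch.randint(0, self.num_classes, (32,))

    def test_default_loss(self):
        # An untrained classifier should start near the uniform-prediction loss
        loss = self.loss_fn(self.model(self.x), self.y)
        self.assertAlmostEqual(loss.item(), math.log(self.num_classes), delta=0.5)

    def test_single_batch_fitting(self):
        # The model should be able to (almost) memorize a single batch
        optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-2)
        for _ in range(500):
            optimizer.zero_grad()
            loss = self.loss_fn(self.model(self.x), self.y)
            loss.backward()
            optimizer.step()
        self.assertLess(loss.item(), 0.1)

    def test_gradients_flow(self):
        # No parameter should receive an all-zero gradient in a backward pass
        loss = self.loss_fn(self.model(self.x), self.y)
        loss.backward()
        for name, param in self.model.named_parameters():
            self.assertGreater(param.grad.abs().sum().item(), 0.0, name)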

In terms of the runner and framework, you can choose the same libraries as used for unit testing because, in this case, integration testing differs only in the components that are tested and the way they are combined. Therefore, unittest, mock, and pytest are popular choices for scaffolding your integration testing pipeline.

Integration testing is essential for application development and for running end-to-end ML pipelines. It will save you a lot of worry, trouble, and expense if you can detect and avoid these problems automatically.

End-to-end testing using Azure Machine Learning

In end-to-end testing, we want to make a request to a deployed service in a staging environment and check the result of the service. To do so, we need to deploy the complete service. End-to-end testing is critical for catching errors that are created when connecting all the components together and running the service in a staging or testing environment without mocking any of the other components.

In ML deployments, there are multiple steps where a lot of things can go wrong if not tested properly. Let's discard the more straightforward ones, where we need to make sure that the environment is correctly installed and configured. A more critical aspect of the deployment in Azure Machine Learning is the code for the application logic itself: the scoring file. There is no easy way to test the scoring file, the format of the request, and the output together, without a proper end-to-end test.

As you might imagine, end-to-end tests are usually quite expensive to build and to operate. First, you need to write code and deploy applications simply to test the code, which requires extra work, effort, and costs. However, this is the only way to truly test the scoring endpoint in a production-like environment from end to end.

The good thing is that by using Azure Machine Learning deployments, end-to-end testing becomes so easy that it should be part of everyone's pipeline. If the model allows it, we could even do a no-code deployment where we don't specify the deployment target. If this is not possible, we can specify Azure Container Instances (ACI) as the compute target and deploy the model independently. This means taking the code from the previous chapter, wrapping it in a Python script, and including it as a step in the build process.
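A hedged sketch of such a build step, assuming a registered model, a score.py entry script, and an environment.yml conda file (all names and the expected response key are placeholders), could look like this:

import json

from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name='customer-churn-classifier')

inference_config = InferenceConfig(
    entry_script='score.py',
    environment=Environment.from_conda_specification('scoring-env', 'environment.yml'))

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Deploy a short-lived staging service for the end-to-end test
service = Model.deploy(ws, 'e2e-test-service', [model], inference_config, aci_config)
service.wait_for_deployment(show_output=True)

# Send a sample request and validate the response format
sample = json.dumps({'data': [[42.0, 1.0, 0.5]]})
response = json.loads(service.run(input_data=sample))
assert 'predictions' in response, 'unexpected response format'

# Clean up the staging deployment after the test
service.delete()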

End-to-end testing is usually complicated and expensive. However, with Azure Machine Learning and automated deployments, a model deployment and sample request could just be part of the build pipeline.

Continuous profiling of your model

Model profiling is an important step during your experimentation and training phase. This will give you a good understanding of the amount of resources your model will require when used as a scoring service. This is critical information for designing and choosing a properly sized inference environment.

Whenever your training and optimization processes run continuously, your model requirements and profile might evolve. If you use optimization for model stacking or automated ML, your resulting models could grow bigger to fit the new data. So, it is good to keep an eye on your model requirements to account for deviations from your initial resource choices.

Luckily, Azure Machine Learning provides a model profiling interface, which you can feed with a model, scoring function, and test data. It will instantiate an inferencing environment for you, start the scoring service, run the test data through the service, and track resource utilization.
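A sketch of how this can be called from the SDK might look like the following; note that parameter names have changed between SDK versions, and the model, environment, and dataset names here are placeholders:

from azureml.core import Dataset, Environment, Workspace
from azureml.core.model import InferenceConfig, Model

ws = Workspace.from_config()
model = Model(ws, name='customer-churn-classifier')

inference_config = InferenceConfig(
    entry_script='score.py',
    environment=Environment.get(ws, name='training-env'))

# A registered dataset containing sample requests to replay against the service
input_dataset = Dataset.get_by_name(ws, 'profiling-requests')

profile = Model.profile(ws,
                        profile_name='churn-model-profile',
                        models=[model],
                        inference_config=inference_config,
                        input_dataset=input_dataset)
profile.wait_for_completion(show_output=True)

# Contains the recommended CPU and memory for the scoring service
print(profile.get_details())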

Summary

In this chapter, we introduced MLOps, a DevOps-like workflow for developing, deploying, and operating ML services. DevOps aims to provide a quick and high-quality way of making changes to code and deploying these changes to production.

We first learned that Azure DevOps gives us all the features to run powerful CI/CD pipelines. We can run either build pipelines, where steps are coded in YAML, or release pipelines, which are configured in the UI. Release pipelines can have a manual trigger or multiple automatic triggers—for example, a commit in the version control repository or an updated model artifact in the model registry—and create an output artifact for release or deployment.

Version-controlling your code is necessary, but it's not enough to run proper CI/CD pipelines. In order to create reproducible builds, we need to make sure that the dataset is also versioned and that pseudo-random generators are seeded with a specified parameter.

Environments and infrastructure should also be automated, and deployments can be scripted from the authoring environment.

In order to keep the code quality high, you need to add tests to the ML pipeline. In application development, we differentiate between unit, integration, and end-to-end tests, where they test different parts of the code, either independently or together with other services. For data pipelines with changing or increasing data, unit tests should test the data quality as well as units of code in the application. Integration tests are great for loading a model or performing a forward or backward pass through a model independently from other components. With Azure Machine Learning, writing end-to-end tests becomes a real joy, as they can be completely automated with very little effort and expense.

Now you have learned how to set up continuous pipelines that can retrain and optimize your models and then automatically build and redeploy the models to production. In the final chapter, we will look at what's next for you, your company, and your ML services in Azure.
