Chapter 1. Introduction to Kubeflow

Kubeflow1 is an open source Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable machine learning workloads. It is a cloud native platform based on Google’s internal machine learning pipelines. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

In this book we take a look at the evolution of machine learning in the enterprise, how infrastructure has changed, and how Kubeflow meets the new needs of the modern enterprise.

Operating Kubeflow in an increasingly multi-cloud and hybrid-cloud world will be a key topic as the market and Kubernetes adoption continue to grow. A single workflow may have a lifecycle that starts on-premise but quickly requires resources that are only available in the cloud. Building out machine learning tooling on the emergent Kubernetes platform is where life began for Kubeflow, so let’s start there.

Machine Learning on Kubernetes

Kubeflow began life as a basic way to get rudimentary machine learning infrastructure running on Kubernetes. The two driving forces in its development and adoption are the evolution of machine learning in the enterprise and the emergence of Kubernetes as the de facto infrastructure management layer.

Let’s start out with a quick tour of the recent history of machine learning in the enterprise to better understand how we got here.

The Evolution of Machine Learning in the Enterprise

The past decade has seen the popularity of and interest in machine learning rise considerably. This can be attributed to developments in the computer industry such as:

  • advances in self-driving cars
  • widespread adoption of computer vision applications
  • integration of tools such as Amazon Echo into daily life

Deep learning tends to get a lot of the credit for many of these tools, but fundamentally applied machine learning underpins all of these developments.

Machine learning can be defined2 as:

In everyday parlance, when we say learning, we mean something like “gaining knowledge by studying, experience, or being taught.” Sharpening our focus a bit, we can think of machine learning as using algorithms for acquiring structural descriptions from data examples. A computer learns something about the structures that represent the information in the raw data. 

Examples of machine learning algorithms are linear regression, decision trees, and neural networks. The reader should consider machine learning to be a subset of the broader domain of artificial intelligence.

The mid-2000s saw the rise of deep learning research, fueled by the availability of GPUs, better labeled datasets, and better tooling. This rise reignited interest in applied machine learning across enterprises in the early 2010s as they sought to leverage data and machine intelligence to be “more like Google and Amazon.”

Many early efforts in applied machine learning focused on the Apache Hadoop platform, which had seen a meteoric rise in enterprise adoption in the first half of the 2010s. However, where Apache Hadoop focused on distributed processing across commodity CPU hardware, deep learning practitioners focused on single machines with one or more GPUs and the Python programming environment. Apache Spark on Apache Hadoop gave robust options for applied machine learning on the JVM, yet graduate-level practitioners tended to be trained in Python.

As enterprises hired more machine learning practitioners out of graduate programs, this created a large influx of enterprise users demanding Python support for their applied machine learning workflows.

In earlier years Spark had been a key technology for running machine learning jobs, but more recently we see enterprises shifting toward GPU-based training in containers that may not live on a Hadoop cluster. The DevOps, IT, and platform teams working with these data scientists wanted to set those teams up for success while also keeping a manageable infrastructure strategy.

The folks in charge of these platforms were concerned with using the cloud in a way that did not cost too much while still taking advantage of the transient nature of cloud-based workloads.

There is a growing need to make these worlds work together on data lakes, whether they are on-premise, in the cloud, or somewhere in between. Today, a few of the major challenges in the enterprise machine learning practitioner space are:

  • data scientists tend to prefer their own unique set of libraries and tools
  • these tools are generally heterogeneous, both inside teams and across organizations
  • data scientists need to interoperate with the rest of the enterprise for resources, data, and model deployment
  • many of their tools are Python-based, and Python scripts prove difficult to manage across a large organization
  • many times data scientists want to build containers and then just “move them around” to take advantage of more powerful resources beyond their laptop, either in the on-premise data center or in the cloud

This makes operating a platform to support the different constituents using the system harder than ever.

It’s Harder Than Ever to Run Enterprise Infrastructure

We increasingly live in an age where developers and data scientists feel empowered to spin up their own infrastructure in the cloud, so it has become difficult for many organizations to enforce strict infrastructure rules for machine learning practitioners. We’ve seen customers end up with three or four different machine learning pipeline systems because groups cannot agree (and simply run off to AWS or GCP to get what they want when they are told “no”).

Hadoop and Spark are still key tools for data storage and processing, but Kubernetes has increasingly come into the picture to manage on-premise, cloud, and hybrid workloads. In the past two years we’ve seen Kubernetes adoption increase rapidly. Many enterprises have made infrastructure investments in previous technology cycles such as:

  • RDBMSs
  • Parallel RDBMSs
  • Hadoop
  • Key-value stores
  • Spark

Just like the systems before them, these systems carry a certain level of legacy inertia in the organizations that adopted them. Therefore, it’s easy to predict that integration between Hadoop workloads and specialized Kubernetes-based machine learning workloads is on the horizon, as we see Cloudera adopting Kubernetes (rather than YARN) for its new data science tools and Google swapping out YARN for Kubernetes for Spark.4

Other factors that make running infrastructure more complex than ever include:

  • evolution of open source acceptance in enterprise
  • using both on-premise and cloud infrastructure
  • security requirements involved in having multiple platforms
  • complexity of multi-tenancy support
  • demands for specialized hardware such as GPUs

Some overall infrastructure trends of interest as we go forward:

  • Mixtures of on-premise, cloud, and hybrid deployments are becoming more popular with enterprises.
  • The three major “big cloud” vendors continue to capture a number of the workloads that move to the cloud; some of these workloads oscillate between cloud and on-premise.
  • Docker has become largely synonymous with the term “container.”
  • Many enterprises are either using Kubernetes or strongly considering it as their container orchestration platform.

Beyond an on-premise cluster, Kubernetes clusters can be joined together with cluster federation. In a truly hybrid world, we can potentially have a job scheduling policy that places certain workloads in the cloud when on-premise resources are over-subscribed, or conversely restricts certain workloads to execute on-premise only and never in the cloud.
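
As a minimal sketch of how the on-premise-only half of such a policy might be expressed in plain Kubernetes terms, the Pod below uses a nodeSelector keyed on a hypothetical location label that our operators would apply to on-premise nodes (the label and image names are illustrative, not Kubernetes built-ins):

apiVersion: v1
kind: Pod
metadata:
  name: train-on-premise-only
spec:
  # "location: on-premise" is a label convention we define ourselves;
  # the scheduler will only place this Pod on nodes carrying that label.
  nodeSelector:
    location: on-premise
  containers:
  - name: trainer
    image: my-registry/training-job:latest    # hypothetical training image
  restartPolicy: Never

Cloud-bursting policies tend to be richer than a single label, but the same labeling and scheduling primitives are the building blocks.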

There are many factors to consider when dealing with cloud infrastructure or a hybrid federated cluster infrastructure, including:

  • Active Directory integration
  • security considerations
  • cost considerations for cloud instances
  • which users should have access to the cloud instances

We look at all of these topics in more depth in Chapter 2 and in the rest of this book.

Identifying Next Generation Infrastructure Core Principles

In the age of cloud infrastructure, what “big cloud” (AWS, Azure, GCP) does tends to significantly influence how enterprises build their own infrastructure. In 2018 we saw all three major cloud vendors offer managed Kubernetes as part of their infrastructure.

In late 2017, after initially offering its own container service, Amazon joined the other two members of big cloud and offered Amazon EKS as a first-class supported citizen on AWS.

All three major cloud vendors are pushing Kubernetes, and these tailwinds should accelerate interest in Kubernetes and Kubeflow in 2019. Beyond that, we’ve seen that containers are a key part of how data scientists want to operate, moving containers from local execution, to on-premise infrastructure, to running in the cloud (easily and cleanly).

Teams need to balance multi-tenancy with access to storage and compute options (NAS, HDFS, FlashBlade) and high-end GPUs (NVIDIA DGX-1, etc.). Teams also need a flexible way to customize their full applied machine learning applications, from ETL to modeling to model deployment, in a manner that stays within the above tenets.

We see that cloud vendors have an increasing influence on how enterprises build their infrastructure, and the managed services they advocate for are likely to see accelerated adoption. However, most organizations are separated from getting results out of their data science efforts by the obstacles listed in the previous section. We can break these obstacles down into the key components of:

  • composability
  • portability
  • scalability

Composability involves breaking machine learning workflows down into a number of components (or stages) in a sequence and allowing us to put them together in different formations. Many times these stages are each their own systems, and we need to move the output of one system to another system as input.

Portability involves having a consistent way to run and test code across environments:

  • development
  • staging
  • production

When we have deltas between those environments, we introduce opportunities for outages in production. Having a machine learning stack that runs on your laptop the same way it runs on a multi-GPU DGX-1 and then on a public cloud would be considered desirable by most practitioners today.

Most organizations also desire scalability and are constrained by:

  • access to machine-specific hardware (e.g., GPUs, TPUs)
  • limited compute
  • network
  • storage
  • heterogeneous hardware

Unfortunately, scale is not just about adding more hardware; in many cases it’s about the underlying distributed systems architecture involved. Kubernetes helps us with these constraints.

The Building Blocks of Containers and Kubernetes

A major reason we see widespread container (e.g., Docker) and Kubernetes adoption today is that together they are an exceptional infrastructure solution for the inherent issues of composability, portability, and scalability.

What Are Containers?

A container image is a lightweight, standalone, executable package of software that includes the code, runtime, system tools, system libraries, and settings needed to run the software.

Containers and container platforms provide many advantages over traditional virtualization:

  • Isolation is done at the kernel level without the need for a guest operating system.
  • Containers are much more efficient, fast, and lightweight.

This allows applications to become encapsulated in self-contained environments and comes with a slew of advantages such as quicker deployments, scalability, and closer parity between development environments. In recent years the Docker platform has largely become synonymous with the term “containers”.

Kubernetes is also a great fit for machine learning workloads because it already supports GPUs5 and TPUs6 as resources, with FPGAs7 in development. In the diagram below we can see how having key abstractions with containers and other layers is significant for running a machine learning training job on GPUs in a similar fashion to how you might run the same job on FPGAs.

Figure 1-2. Issues in Portability

We highlight the above abstractions because they are all components in the challenges involved with workflow portability. An example of this is how our code may work locally on our own laptop yet fail to run remotely on a Google Cloud virtual machine with GPUs.
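
Kubernetes narrows this gap by treating accelerators as schedulable resources. As a minimal sketch (assuming the NVIDIA device plugin is installed on the cluster and using a hypothetical training image), a Pod can request a GPU through its resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: my-registry/gpu-training:latest    # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1    # ask the scheduler for one GPU on the node
  restartPolicy: Never

The scheduler will only place this Pod on a node with an unallocated GPU, which is exactly the kind of bookkeeping that otherwise gets done by hand.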

Kubernetes also already has controllers for tasks such as batch jobs and the deployment of long-lived services for model hosting, giving us many foundational components for our machine learning infrastructure. Further, as a resource manager Kubernetes provides three primary benefits:

  1. Unified management
  2. Job isolation
  3. Resilient infrastructure

Unified management allows us to use a single cluster management interface for multiple Kubernetes clusters. Job isolation in Kubernetes gives us the ability to move models and ETL pipelines from development to production with fewer dependency issues. Resilient infrastructure means Kubernetes manages the nodes in the cluster for us and makes sure we have enough machines and resources to accomplish our tasks.

Kubernetes as a platform has been scaled in practice8 to thousands of nodes. While many organizations will not hit that order of magnitude, it highlights the fact that Kubernetes itself will not have scale issues as a distributed system.

Kubernetes is Not Magic

Just because the Kubernetes platform has been shown to scale in practice to thousands of nodes does not mean applications on the platform are immune to scale issues.

Every distributed application requires thoughtful design. Don’t fall prey to the notion that Kubernetes will magically scale any application just because someone threw the application in a container and sent it to a cluster.

Even if we solve the issues around moving containers and machine learning workflow stacks between environments, we’re not entirely home free. We still have other considerations, such as:

  • cost of using the cloud versus on-premise resources
  • organizational rules around where data can live
  • security considerations such as Active Directory and Kerberos integration

just to name a few. Modern data center infrastructure has a lot of variables in play, and it can make life hard for an IT or DevOps team. In this book we seek to arm the reader with a plan of attack for how to best plan for and meet the needs of their data science constituents, all while keeping in line with organizational policies.

Kubernetes solves a lot of infrastructure problems in an increasingly complex infrastructure world, but it wasn’t designed specifically for machine learning workflows. Let’s now take a look at how Kubeflow enters the picture to help enable machine learning workflows.

Enter: Kubeflow

Kubeflow was conceived as a way to run machine learning workflows on Kubernetes and is typically used for the following reasons:

  • You want to train and serve machine learning models in different environments (e.g., local, on-premise, and cloud)
  • You want to use Jupyter Notebooks to manage machine learning training jobs
  • You want to launch training jobs that use resources (such as additional CPUs or GPUs) that aren’t available on your personal computer
  • You want to combine machine learning code (e.g., TensorFlow code) with other processes (for example, you may want to use TensorFlow/agents to run simulations to generate data for training reinforcement learning models)

With our machine learning platform, we’d typically like an operational pattern similar to the “lab and factory” pattern: we want our data scientists to be able to explore new data ideas in “the lab,” and, once they find a setup they like, move it to “the factory” so that the modeling workflow can be run consistently and in a reproducible manner.

Kubeflow is a great option for Fortune 500 infrastructure because it allows a notebook-based modeling system to be easily integrated with data lake ETL operations, whether the data lake is local or in the cloud. It also supports multi-tenancy across different hardware by managing the container orchestration aspect of the infrastructure. Kubeflow gives us great options for a multi-tenant “lab” environment while also providing the infrastructure to schedule and maintain consistent machine learning production workflows in “the factory.”

While teams today can potentially share a single server, once we try to manage 5 or 10 servers, each with multiple GPUs, multi-tenancy can quickly become messy. Users end up having to wait for resources and may also be fighting dependencies or libraries that other users left behind from previous jobs. There is also the matter of isolating users from one another so that users cannot see each other’s data.

If we have any hope of sustaining an enterprise data science practice across a cluster of heterogeneous hardware, we will need a system like Kubeflow on Kubernetes. Beyond the obstacles we mention here, machine learning workflows tend to carry a great deal of technical debt just under the visible surface.

Origin of Kubeflow

In 2016 at Google Next, Google announced10 that Cloud ML would use GKE, a precursor to what we know as Kubeflow today. In December 2017 at KubeCon, the initial version of Kubeflow was announced11 and demoed by David Aronchick and Jeremy Lewi. This version included:

  • JupyterHub
  • TFJob v1alpha1
  • TFServing
  • GPUs working on Kubernetes

In January 2018 the Kubeflow Governance Proposal12 was published to help direct how the community around the project would work. In June 2018 the 0.1 version of Kubeflow was introduced13, containing an expanded set of components including:

  • JupyterHub
  • TFJob with distributed training support
  • TFServing
  • Argo
  • SeldonCore
  • Ambassador
  • Better Kubernetes support

Kubeflow version 0.2 was announced14 only two months later, in August 2018, with advancements such as:

  • TFJob v1alpha2
  • PyTorch and Caffe operators
  • Central UI

The project is still growing (as of the writing of this book) and Kubeflow continues to evolve. Today the project has expanded to over 100 engineers working on it (up from the three at the beginning), with 22 member organizations15 participating.

Who Uses Kubeflow?

Some of the key enterprise roles with the most interest in Kubeflow include:

  • DevOps engineers
  • Platform architects
  • Data scientists
  • Data engineers

DevOps engineers and platform architects need to support everyone with the right infrastructure in the right places, cloud or on-premise, supporting the ingest, ETL, data warehouse, and modeling efforts of other teams. Data engineers use this infrastructure to set up data scientists with the best denormalized datasets for vectorization and machine learning modeling. All of these teams need to work together to operate in the modern data warehouse, and Kubernetes gives them a lot of flexibility in how that happens.

Team Alignment for the Line of Business, DevOps, Data Engineering, and Data Science

Another challenge for organizations in the machine learning infrastructure space is how multifaceted the efforts to build and support these pipelines are. As we saw in the diagram earlier in this chapter, there is a lot of “hidden technical debt” in most machine learning workflows, and the components in these workflows are owned by multiple teams. These teams include:

  • Line of business: the part of the company that intends to use the results of the machine learning model to produce revenue for the organization
  • DevOps: the group that is responsible for making sure the platform is operational and secure
  • Data Engineering: the group that is responsible for getting data from the system of record (or data warehouse) and converting it into a form for the data science team to use
  • Data Science: the team responsible for building and testing the machine learning model

Each of these teams has to work in concert or the business will likely find no value from their individual efforts. Let’s now take a look at some specific scenarios where we might see Kubeflow used by the above teams.

Common Kubeflow Use Cases

Specific scenarios for how we’d use a technology are always critical, because otherwise it’s just “more infrastructure to manage.” In this section we’ll look at some specific use cases for Kubeflow and how each of the teams from the previous section might be involved.

Running Notebooks on GPUs

Users commonly start out on local platforms such as Anaconda and design initial use cases. As their use cases need more data, more powerful processing, or data that cannot be copied onto their local laptop, they tend to hit roadblocks in advancing their modeling activities.

Advantages of Notebooks on GPUs

Given that many notebook users start locally on their laptops, a natural question is: why use Kubeflow for hosted notebook infrastructure? Machine learning training algorithms, and deep learning in particular, are notoriously hungry for the processing power needed by their linear algebra routines. GPUs have become the industry standard for speeding up linear algebra processing in machine learning. Most laptops do not have GPUs onboard, so users go looking for other places to eventually run their modeling code.

A common platform stack that users have converged on is:

  • Jupyter Notebook
  • Python code
  • dependency management with containers, typically Docker

These developers are good at getting this stack just how they want it on their local laptop. However, operating this way inside a Fortune 500 company has side effects in areas such as:

  • security
  • data access
  • driver management
  • model integration

The IT organizations in most Fortune 500 companies take topics such as security and data access seriously. The idea of PhDs experimenting with sensitive corporate data on their local laptops conflicts directly with most IT information security policies, and this creates natural contention in an organization. The contention revolves around the line of business’s need for more value from its data versus the IT information security mandate to keep key information safe.

Given that companies will not give up on security requirements, we need to find ways to better serve how data scientists want to operate while keeping security policies intact. Kubeflow is a great option for this scenario because it allows data scientists to keep their preferred working environment, built in a container, and execute it in an environment blessed by IT corporate security.

This internal infrastructure can be secured with Active Directory and Kerberos while providing GPUs (e.g., NVIDIA’s DGX-1) and large storage arrays with object stores, proprietary storage (e.g., Pure Storage’s FlashBlade, NetApp’s storage), or HDFS.

Team Alignment for Notebooks on GPUs

In this scenario the DevOps team can enable the data scientists to build models faster with Kubeflow. This allows the data scientists to explore more concepts for the line of business and eliminate poor candidate use cases faster.

If the line of business can validate use cases faster with the data science team, they can find the best fits for the business to make the most money from its data.

Shared Multi-Tenant Machine Learning Environment

Many times an organization will have either multiple data scientists who need to share a cluster of high-value resources (e.g., GPUs) or they will have multiple teams of data scientists who need the same access to shared resources.

In this case an organization needs to build a multi-tenant machine learning platform, and Kubeflow is a solid candidate for this scenario.

Often we’ll see organizations buy machines with one or more GPUs attached, up to specialized hardware such as an NVIDIA DGX-1 or DGX-2 (e.g., eight GPUs per machine). This hardware is considerably more costly than a traditional server, so we want as many data scientists leveraging it for model training as possible.

Advantages of an On-Premise Multi-Tenant Environment

Each data scientist will have their own model workflow and code dependencies, as described earlier in this chapter. We need a system such as Kubeflow that can execute each user’s workflow while keeping that workflow’s dependencies and data separate from other users’ work on the same set of resources (i.e., isolation).

Kubeflow and Kubernetes handle these requirements for us with their scheduling and container management functionality. A great example is how we may have three different data scientists each needing to run their own notebook on a single GPU. Kubernetes with Kubeflow keeps track of who is running what code on which machine and which GPUs are currently in use. Kubernetes also keeps track of which jobs are waiting in the job queue and will schedule the waiting jobs once an in-process job finishes.

Team Alignment

Multi-tenant systems make DevOps teams’ lives much simpler because they handle a lot of infrastructure complexity for us. The DevOps team can focus on keeping the Kubernetes cluster and the Kubeflow application operational, which lets us leverage all the benefits of job scheduling, container management, and resource management in Kubernetes.

When data scientists have more flexible access to the resources they need (e.g., GPUs), they can build models faster and this allows the line of business to evaluate a data product faster to see if it can be effective enough to be viable for the organization.

Example: Building a Transfer Learning Pipeline

Let’s use a computer vision example scenario to better understand how Kubeflow might be deployed to solve a real problem. A realistic example would be a team wanting to get a computer vision transfer learning pipeline working so they can build a custom computer vision model for detecting specific items in a retail store.

The team’s basic plan consists of:

  1. Get a basic computer vision model working from the TensorFlow model zoo16
  2. Update the model with an example dataset to understand transfer learning
  3. Move the model training code to Kubeflow to take advantage of on-premise GPUs

The team starts off by experimenting with the object detection example notebook17 provided in the TensorFlow GitHub repository. Once they have this notebook running locally, they know they can produce inferences for objects in an image with a prebuilt TensorFlow computer vision model.

Next the team wants to customize a computer vision model from the model zoo with their own dataset, but first they need to get a feel for how transfer learning works in practice. The team checks out the “pet detector” example18 in the TensorFlow object detection GitHub repository and learns how to further train an existing COCO19-pretrained model with transfer learning to build a custom computer vision model.

Once they get the transfer learning example running with the pets dataset, it should not be hard to label some of their own product data to build the annotations needed to further train their own custom computer vision model.

The original pets example shows how to use GPUs on Google Cloud, but the team wants to leverage a cluster of GPUs they have in their own data center. At this point the team sets up Kubeflow on their internal on-premise Kubernetes cluster and runs the pets tutorial specifically built20 for Kubeflow. The team had previously built their own custom annotated dataset, which they can now use for building models on their internal Kubeflow cluster.

Advantages of Running Computer Vision Pipeline on Kubeflow

The major reasons the team moves their pipeline to Kubeflow are:

  1. secure access to sensitive data for multiple users (data that may not be allowed to live on users’ laptops)
  2. cost savings by using on-premise GPUs
  3. the ability for multiple users to share the same cluster of GPUs

The team reasons that in some training scenarios Kubeflow on-premise can be more cost-effective per GPU hour than cloud GPUs. They also want to securely control where the core training dataset lives, and they want the ability to allow multiple data scientists to share the same consistent training dataset while trying different variations of models.

Team Alignment for Computer Vision Pipeline

This transfer learning scenario on Kubeflow allows the DevOps team to more tightly control who has a copy of the sensitive training data. The line of business has tasked the data science team with building a model whose mAP score meets the minimum level needed to be economically viable to the business unit.

To accomplish this modeling goal, the data science team needs to try many variations of hyperparameters and feature selection passes (in concert with the data engineering team) to drive up their model’s effectiveness. The faster the data science team can train models, the more variations of training runs the team can try. In the case of deep learning and computer vision, GPUs make training runs take significantly less time, so these are a key resource for the data science team.

The business unit wants to hit its target minimum model effectiveness while staying under budget. The business unit forecasts that using Kubeflow on-premise with GPUs is the cheaper way for the data science team to build models, because the team expects to train models many times a week over a long period of time.

GPUs in the Cloud

GPUs in the cloud give us more flexibility than GPUs on-premise because we can spin them up on demand, making it more convenient to try new ideas.

However, this convenience may cost more than if we bought a GPU and used it locally all of the time.

The cost versus flexibility trade-off is something we should always keep in mind when deciding where to run our jobs.

Using Kubeflow on-premise with GPUs also allows the data science team to model faster by running multiple computer vision jobs on the cluster at the same time, thanks to the multi-tenant nature of the system.

Deploying Models to Production for Application Integration

Once a model is trained, it typically exists as a single file on a laptop or server host. We then need to either:

  1. copy the file to the machine running the application that will integrate it, or
  2. load the model into a server process that can accept network requests for model inference

If we only need to copy the model file to a single application host, this is manageable. However, when many applications need inference output from the model, it becomes more complex.

One major issue involves updating the model once it has been deployed to production. This becomes more work as we need to track the model version on more machines and remember which ones need to be updated.

Another issue is rolling back a model once it has been deployed. If we have deployed a model and then realize that we need to roll back to the previous version, this is a lot of work when many copies of the model are floating around on different hosts.

Let’s now take a look at the advantages of using a model-hosting pattern for deploying a model to production with Kubeflow.

Advantages of Deploying Models to Production on Kubeflow

A major advantage of having a machine learning model loaded in a model server on Kubeflow is that we can do updates and rollbacks from a single point (i.e., the server process). This allows us to treat a model more like a database table, in that we can apply operations to the model and all of the consuming client applications get the updates as soon as the update transaction is complete and they make their next inference request.

Team Alignment for Model Deployment

The model server pattern makes life for the DevOps team considerably easier, as they don’t have to manually track many model copies. The application engineering team can focus on consuming the model as an internal REST resource on the network and spend less time writing specific machine learning API code to integrate a model.

Once the data science team has developed models that the line of business wishes to put into production, the model server pattern allows them to hand off the model to the DevOps team to put into production on the model server. This allows the data science team to get out of having to support individual models in production and focus on building the next round of models with the lines of business.

Components of Kubeflow

The logical components of Kubeflow are:

  • Jupyter Notebooks
  • Hyperparameter tuning with Katib
  • Pipelines for workflow management
  • Model serving, for serving inferences from deployed machine learning models
  • Training systems for model training

These components work together to provide a scalable and secure system for running machine learning jobs (notebook-based jobs and also jobs outside of notebooks). The relationships between the components can be seen in the diagram below.

Figure 1-4. Kubeflow Component Diagram

Given the rise of Kubernetes as an enterprise platform management system, it makes a lot of sense to have a way to manage our machine learning workloads in a similar manner. In the rest of this section we take a look at each component and how it is used within the Kubeflow platform.

Jupyter Notebooks

Jupyter Notebooks are included with the Kubeflow platform as a core component. Jupyter Notebooks are popular21 due to their ease of use and are commonly associated (in machine learning especially) with the Python programming language. Notebooks typically contain code (e.g., Python, Java) and other visual rich-text elements that mimic a web page or textbook.

A novel aspect of notebooks is that they combine the concept of a computer program with the notes we typically associate with complex logic in our programs. This inline documentation gives notebook users a better way not only to document what they’re doing but also to communicate it to other users who may want to run their code. Given the complexity of machine learning code, this property is part of the reason we’ve seen notebooks’ explosive popularity in the machine learning space.

Jupyter Notebook Integration with Kubeflow

The Kubeflow application deployment includes support for spawning and operating Jupyter Notebooks. Advantages of having Jupyter Notebooks integrated into the Kubeflow platform include:

  • good integration with the rest of the Kubeflow infrastructure from the point of view of authentication and access control
  • the ability to share notebooks between users

Jupyter Notebooks in Kubeflow also have access to the Fairing library, enabling notebooks to submit training jobs to Kubernetes directly from the notebook. This is compelling because code in a Jupyter Notebook otherwise only uses the default single-process execution mode. The ability to submit jobs to Kubernetes from the notebook allows us to leverage TFJob (covered later in this chapter) and run distributed training jobs.

Sometimes we want separate notebook servers for each user or for a team. Kubeflow allows us to set up multiple notebook servers for a given Kubeflow deployment. Each notebook server belongs to a namespace and can serve and execute multiple notebooks. Namespaces in Kubeflow also give us multi-user isolation. This means we can have multiple sets of users who can’t see one another’s resources, so they don’t clutter one another’s workspaces on the shared multi-tenant infrastructure.

Multi-User and Malicious Users

Currently, as of v0.7, Kubeflow has no hard security guarantees against malicious attempts by users to access other users’ resources.

Kubeflow will launch a container image on Kubernetes for each notebook server. The notebook image contains dependencies such as the ML libraries and CPU or GPU support for the notebook.

Machine Learning Model Training Components

Many machine learning frameworks are supported by Kubeflow. In theory, a user could simply containerize an arbitrary framework and submit it to a Kubernetes cluster to execute. However, Kubeflow makes Kubernetes clusters aware of the execution nuances each machine learning library expects or needs for things such as:

  • parallel training
  • custom resources
  • workflows

The current training frameworks supported by Kubeflow are:

  • TensorFlow
  • Keras
  • PyTorch
  • MXNet
  • MPI
  • Chainer

Below we list some details about a few of these frameworks.

 

TensorFlow Training and TFJob

TensorFlow is supported by Kubeflow and is the most popular machine learning library in the world today. Given that TensorFlow, Kubernetes, and Kubeflow were all originally created at Google, it makes sense that TensorFlow was the original library supported by Kubeflow.

As mentioned previously in this chapter, TFJob is a custom component for Kubeflow that consists of a Kubernetes custom resource definition (CRD) and an associated controller (tf-operator). The TFJob CRD is what enables Kubernetes to execute distributed TensorFlow jobs.
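
To make this concrete, below is a minimal sketch of a TFJob manifest for a small distributed training job. The container image is hypothetical, and the exact apiVersion varies with the Kubeflow release you have installed:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-dist-training-example
spec:
  tfReplicaSpecs:
    PS:                     # parameter server replicas
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/tf-training:latest    # hypothetical image
    Worker:                 # worker replicas
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/tf-training:latest    # hypothetical image

The tf-operator watches for TFJob objects like this one and creates the underlying Pods and services the workers need to find one another.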

Check out the Kubeflow documentation page on TensorFlow at:

https://www.kubeflow.org/docs/components/tftraining/

Keras

Keras is supported in the Kubeflow project and can be used in several ways:

  • single-process job run as a CRD
  • single-process GPU job run as a CRD
  • TFJob as a single-worker job
  • TFJob as a distributed-worker job (via the Estimator API)
  • Jupyter Notebook (CPU or GPU)

Below we show the simplest way to run a Keras job on Kubernetes as a Job:

Example 1-1. Job Example YAML to Run a Keras Python Script
apiVersion: batch/v1
kind: Job
metadata:
  name: keras-job
spec:
  template:
    spec:
      containers:
      - name: tf-keras-gpu-job
        # Container image holding the Keras training script and its dependencies
        image: emsixteeen/basic_python_job:v1.0-gpu
        imagePullPolicy: Always
      volumes:
      # Persistent volume claim providing a shared working directory
      - name: home
        persistentVolumeClaim:
          claimName: working-directory
      restartPolicy: Never
  backoffLimit: 1

We would launch this Keras job from the command line with kubectl (for example, kubectl apply -f keras-job.yaml, assuming the manifest above is saved to that file).

PyTorch Training

Initial PyTorch (alpha) support with the PyTorch operator was included in Kubeflow 0.2.0. This alpha version was supported up to the 0.3.5 version of Kubeflow. The next (beta) version of the PyTorch operator was introduced with Kubeflow 0.4.0.

We run a PyTorch job similarly to how we’d run a TensorFlow job:

  • Using PyTorch code inside a Jupyter Notebook
  • Or writing a YAML file to configure a PyTorchJob and running it with kubectl, as sketched below
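
A minimal PyTorchJob manifest might look like the sketch below; again, the image name is hypothetical and the apiVersion depends on the operator version installed with your Kubeflow release:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-training-example
spec:
  pytorchReplicaSpecs:
    Master:                 # coordinates the distributed training run
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: my-registry/pytorch-training:latest    # hypothetical image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: my-registry/pytorch-training:latest    # hypothetical image

As with the Keras example earlier, we would submit this manifest with kubectl and let the PyTorch operator create and monitor the underlying Pods.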

Hyperparameter Tuning

Hyperparameter tuning involves exploring the hyperparameter search space to find an optimal (or near-optimal) set of hyperparameters for training a machine learning model. Data scientists spend a nontrivial amount of time trying combinations of hyperparameters and seeing how much more accurate their model can become as a result.

The hyperparameter search application included with Kubeflow is called Katib22. Katib was originally inspired by an internal Google system called Vizier and is focused on being machine learning framework agnostic. Katib lets us create a number of experiments, each running trials for hyperparameter evaluation. Currently, Katib supports exploration algorithms such as random search, grid search, and more.
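
To give a feel for what this looks like in practice, below is a heavily abbreviated sketch of a Katib Experiment resource. Field names shift between Katib versions, so treat this as illustrative rather than something to copy verbatim:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy     # metric the training code must report
  algorithm:
    algorithmName: random             # random search over the space below
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
  - name: --learning-rate
    parameterType: double
    feasibleSpace:
      min: "0.01"
      max: "0.1"
  # trialTemplate (the job template Katib launches for each trial) omitted

Katib then launches trials with sampled hyperparameter values, records each trial’s objective metric, and reports the best-performing combination.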

Pipelines

Kubeflow Pipelines23 allow us to build machine learning workflows and deploy them as a logical unit on Kubeflow, running them as containers on Kubernetes. While machine learning examples are often presented as a single Jupyter Notebook or Python script, real machine learning workflows are typically not a single script or job. Previously in this chapter we introduced the diagram below, from the paper “Hidden Technical Debt in Machine Learning Systems”:

Figure 1-5. “Hidden Technical Debt in Machine Learning Systems”24

Many of the boxes in this diagram end up as their own sub-workflows in practice in a production machine learning system.

Examples of this would be how data collection and feature engineering each may be their own workflows built by teams separately from the data science team. Model evaluation is another component we often see performed as its own workflow after the training phase is complete.

Kubeflow Pipelines simplify the orchestration of these pipelines as containers on Kubernetes infrastructure. This gives our teams the ability to reconfigure and deploy complex machine learning pipelines in a modular fashion to speed up model deployment and time to production.

Basic Kubeflow Pipeline Concepts

A Kubeflow Pipeline is a graph defined by Python code representing all of the components in the workflow. Pipelines define the input parameter slots required to run the pipeline and how each component’s output is wired as input to the next stage in the graph.

A pipeline component is defined as a Docker image containing the user code and dependencies required to run on Kubernetes. Below we can see a pipeline definition example25 in Python.

Example 1-2. Kubeflow Pipeline Defined in Python
@dsl.pipeline(
  name='XGBoost Trainer',
  description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output,
    project,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
  delete_cluster_op = DeleteClusterOp('delete-cluster', project, region).apply(gcp.use_gcp_secret('user-gcp-sa'))
  with dsl.ExitHandler(exit_op=delete_cluster_op):
    create_cluster_op = CreateClusterOp('create-cluster', project, region, output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    analyze_op = AnalyzeOp('analyze', project, region, create_cluster_op.output, schema,
                           train_data, '%s/{{workflow.name}}/analysis' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    transform_op = TransformOp('transform', project, region, create_cluster_op.output,
                               train_data, eval_data, target, analyze_op.output,
                               '%s/{{workflow.name}}/transform' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    train_op = TrainerOp('train', project, region, create_cluster_op.output, transform_op.outputs['train'],
                         transform_op.outputs['eval'], target, analyze_op.output, workers,
                         rounds, '%s/{{workflow.name}}/model' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))
...

If we look at this graph in the Kubeflow Pipelines user interface, it would look similar to the image below.

Figure 1-6. Rendered Pipeline Graph in the Kubeflow Pipeline UI

The Kubeflow Pipelines UI further allows us to define input parameters for each run and then launch the job. The saved outputs from the pipeline include graphs such as confusion matrices and receiver operating characteristic (ROC) curves.

Kubeflow Pipelines produce and store both metadata and artifacts from each run. Metadata from pipeline runs is stored in a MySQL database, and artifacts are stored in an artifact store such as a MinIO26 server or a cloud storage system. Both MinIO and MySQL are backed by PersistentVolumes27 in Kubernetes.

The saved metadata allows Kubeflow to track specific experiments and jobs run on the cluster. The saved artifacts allow us to investigate an individual job’s run performance.

Machine Learning Model Inference Serving

Once we’ve constructed a model, we need to integrate the saved model with our applications. Kubeflow offers multiple ways to load saved models into a process for serving live inference to external applications. These options include:

  • Istio Integration (for TF Serving)
  • Seldon Serving
  • NVIDIA TensorRT Inference Server
  • TensorFlow Serving
  • TensorFlow Batch Predict

Each of these options supports different machine learning libraries and has its own set of specific features. Typically each provides a Docker image that you can run as a Kubernetes resource, loading a saved model from a repository of models.
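
As an example of what that looks like, TensorFlow Serving is commonly run as an ordinary Kubernetes Deployment fronted by a Service. The sketch below assumes a model has already been exported to a model repository path that the container can reach; the model name and path are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-model-server
  template:
    metadata:
      labels:
        app: my-model-server
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        args:
        - --model_name=my_model                  # hypothetical model name
        - --model_base_path=/models/my_model     # path to the exported model
        ports:
        - containerPort: 8501                    # TensorFlow Serving REST port
---
apiVersion: v1
kind: Service
metadata:
  name: my-model-server
spec:
  selector:
    app: my-model-server
  ports:
  - port: 8501
    targetPort: 8501

Client applications call the Service endpoint for inference, and updating or rolling back a model becomes a Deployment update rather than copying model files from host to host, which is the single-point update advantage described earlier in this chapter.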

An Overview of Kubernetes

Planning and deploying infrastructure on-premise, in the cloud, and in hybrid situations is a key topic in the 2019 infrastructure landscape. Kubernetes is a container orchestration system meant to coordinate clusters of nodes at scale, in production, in an efficient manner. Kubernetes works around the idea of pods, the scheduling units in the Kubernetes ecosystem (each pod containing one or more containers), which are distributed across hosts in a cluster to provide high availability. Kubernetes itself is not a complete solution and is intended to integrate with other tools such as Docker, a popular container technology.

A container image is a lightweight, standalone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings. Colloquially containers are referred to as “Docker,” but what the speaker typically means is containers in the general sense. Docker containers make it significantly easier for developers to enjoy parity between their local, testing, staging, and production environments. Containers allow teams to run the same artifact or image across all of those environments, including a developer’s laptop and the cloud. This property of containers, especially when combined with container orchestration via a system like Kubernetes, is driving the concept of “hybrid cloud” in practice.

Kubernetes shines when we have a lot of containers that need to be managed across a lot of machines (on-premise, in the cloud, or a mixture of both).

Docker and Kubernetes are not direct competitors but complementary technologies. One way to think of it is that Docker is for managing images and individual containers, while Kubernetes is for managing pods of containers. Docker provided an open standard for packaging and distributing containerized applications, but it did not solve the container orchestration problem (inter-container communication, scaling, and so on). Competitors to Kubernetes in this space are Mesos and Docker Swarm, but the industry is converging on Kubernetes as the standard for container orchestration (as is evident later in this chapter).

With Kubernetes we also have the ability to dynamically scale applications on a cluster. With installable options such as the Cluster Autoscaler, we have the ability to scale the clusters themselves.

Let’s now dig into a few core principles from Kubernetes that are relevant throughout this book.

 

Core Kubernetes Concepts

Not every reader will be a battle-hardened DevOps engineer, so for the newer practitioner we cover a few basic Kubernetes concepts in this section.

Pod

A pod (as in “a pod of whales”) is defined as the smallest unit of computing that can be created and managed in Kubernetes. More specifically, a pod is a group of one or more containers with shared storage and network, combined with a specification for how to run the containers. The contents of a pod are always co-located and co-scheduled, and run in a shared context.

The official reference for Kubernetes Pods is at:

https://kubernetes.io/docs/concepts/workloads/pods/pod/

A Pod is the smallest deployable unit in Kubernetes and it holds one or more containers. Multiple containers within a Pod share:

1. The network stack
2. Storage volumes
3. The Pod lifecycle

Some notes on “Pods versus Jobs”: while a Pod is the smallest deployable unit in Kubernetes and can be scheduled to execute on its own, Pods are typically created by higher-level objects within Kubernetes, such as a Deployment, a StatefulSet, or a Job. A Job is the first higher-level deployable unit above a Pod. Some of the key differences between a Pod and a Job include:

  • When a “bare Pod” is running on a node that reboots or fails, it will not be rescheduled
  • A Job is similar to a “bare Pod,” but Kubernetes will attempt to reschedule a Job’s terminated Pods
  • A Job provides additional semantics compared to “bare Pods”

Submitting Containers to Kubernetes

Containers are submitted as part of a Pod, not standalone. Pods are defined using YAML syntax, as we can see in the example at the end of this section. Kubernetes doesn’t actually accept a container as a first-class object; rather, it accepts a Pod definition. It then attempts to schedule the Pod and pulls the container images defined within the Pod.

Typically, Pods are submitted via the kubectl command-line interface (for example, kubectl apply -f mypod.yaml). An example YAML definition for a Pod is below:

Example 1-3. A sample YAML file definition for a Kubernetes Pod
apiVersion: v1
kind: Pod
metadata: 
  name: myapp-pod
  labels: 
    app: myapp
spec: 
  containers: 
  - name: myapp-container 
    image: busybox 
    command: ['sh', '-c', 'echo Hello Kubernetes! && sleep 3600']

Custom Resources, Controllers, and Operators

A Kubernetes resource28 is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind. A custom resource is an extension of the Kubernetes API that is specific to the cluster it is installed on, customizing that cluster. Custom resources can be created, updated, and deleted on a running cluster through dynamic registration. Once we have installed a custom resource, we can control it with the kubectl tool just like we would built-in resources (e.g., Pods).

By themselves, custom resources simply allow us to store and retrieve structured data. When we combine a custom resource with a controller, it becomes a true declarative API. The controller interprets the structured data as a record of the user’s desired state and continually takes action to achieve that state. Custom controllers are especially effective when combined with custom resources (but they can work with any kind of resource). One example of this combination (custom controller plus custom resource) is the “operator pattern.” The operator pattern allows us to encode domain knowledge for specific applications into an extension of the Kubernetes API.

Custom Controllers

Custom resources let us store and retrieve structured data. They need to be combined with a controller to be considered a declarative API. Being a “declarative API” means they allow us to declare (specify) the desired state of our resource.

Kubernetes then attempts to match the actual state to this desired state. The controller interprets the structured data as the user’s intended state and attempts to achieve and maintain that state inside the Kubernetes cluster.

A custom controller is a controller that can be deployed by a user and updated independently of the cluster’s own lifecycle. Custom controllers can work with any kind of resource and are most effective when paired with a custom resource. This pairing is called the “operator” pattern in Kubernetes.

Custom Resource Definitions (CRDs)

When we want to define our own custom resource, we write a custom resource definition (CRD). Once we have the CRD set up in Kubernetes, we can use it like any other native Kubernetes object (leveraging features such as the CLI, security, API services, and RBAC).
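
As a minimal sketch, the CRD below registers a hypothetical Foo resource; once it is applied, kubectl can list and manage Foo objects just like built-in resources. The group and field names here are purely illustrative:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com        # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Foo
    plural: foos
    singular: foo
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer

A controller (or operator) would then watch for Foo objects and work to make the cluster match each object’s spec, which is the same pattern components such as TFJob and PyTorchJob use.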

Summary

In this introductory chapter we covered the impetus for Kubeflow as a machine learning platform. As we move forward through this book, we’ll come to understand how to plan, install, maintain, and develop on the Kubeflow platform as a cornerstone of our machine learning infrastructure.

1 https://github.com/kubeflow

2 http://shop.oreilly.com/product/0636920035343.do

3 http://shop.oreilly.com/product/0636920035343.do

4 https://www.youtube.com/watch?v=8W88qAFdAUU

5 https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

6 https://cloud.google.com/kubernetes-engine/docs/concepts/tpus

7 https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

8 https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

9 https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

10 https://cloud.google.com/blog/products/gcp/google-takes-cloud-machine-learning-service-mainstream

11 https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/

12 https://docs.google.com/document/d/1vgvZC93I5N-6qm9xrtINs1JJCbDqZEyJA175nAwBX8g/edit?ts=5a711f78#heading=h.rhx0jou170p8

13 https://kubernetes.io/blog/2018/05/04/announcing-kubeflow-0.1/

14 https://medium.com/kubeflow/kubeflow-0-2-offers-new-components-and-simplified-setup-735e4c56988d

15 https://github.com/kubeflow/community/blob/master/member_organizations.yaml

16 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

17 https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb

18 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

19 http://cocodataset.org/

20 https://github.com/kubeflow/examples/tree/master/object_detection

21 http://shop.oreilly.com/product/0636920034919.do

22 https://github.com/kubeflow/katib

23 https://v0-6.kubeflow.org/docs/components/pipelines/pipelines/

24 https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

25 https://github.com/kubeflow/pipelines/tree/master/samples/core/xgboost_training_cm

26 https://docs.min.io/

27 https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes

28 https://kubectl.docs.kubernetes.io/pages/kubectl_book/resources_and_controllers.html
