Chapter 1. Introduction to Kubeflow

Kubeflow1 is an open source Kubernetes-native platform for developing, orchestrating, deploying, and running scalable and portable machine learning workloads. It is a cloud native platform based on Google’s internal machine learning pipelines. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

In this book we take a look at the evolution of machine learning in the enterprise, how infrastructure has changed, and how Kubeflow meets the new needs of the modern enterprise.

Operating Kubeflow in an increasingly multi-cloud and hybrid-cloud world will be a key topic as the market and Kubernetes adoption continue to grow. A single workflow may have a lifecycle that starts on-premise but quickly requires resources that are only available in the cloud. Building out machine learning tooling on the emergent Kubernetes platform is where life began for Kubeflow, so let’s start there.

Machine Learning on Kubernetes

Kubeflow began life as a basic way to get rudimentary machine learning infrastructure running on Kubernetes. The two driving forces in its development and adoption are the evolution of machine learning in the enterprise and the emergence of Kubernetes as the de facto infrastructure management layer.

Let’s start out with a quick tour of the recent history of machine learning in the enterprise to better understand how we got here.

The Evolution of Machine Learning in the Enterprise

The past decade has seen the popularity of and interest in machine learning rise considerably. This can be attributed to developments in the computer industry such as:

  • advances in self-driving cars
  • widespread adoption of computer vision applications
  • integration of tools such as Amazon Echo into daily life

Deep learning tends to get a lot of the credit for many of these tools, but fundamentally applied machine learning underpins all of these developments.

Machine learning can be defined2 as:

In everyday parlance, when we say learning, we mean something like “gaining knowledge by studying, experience, or being taught.” Sharpening our focus a bit, we can think of machine learning as using algorithms for acquiring structural descriptions from data examples. A computer learns something about the structures that represent the information in the raw data. 

Examples of machine learning algorithms are linear regression, decision trees, and neural networks. The reader should consider machine learning to be a subset of the broader domain of artificial intelligence.

The mid-2000s saw the rise of deep learning research, fueled by the availability of GPUs, better labeled datasets, and better tooling. This rise reignited interest in applied machine learning across enterprises in the early 2010s as they sought to leverage data and machine intelligence to be “more like Google and Amazon.”

Many early efforts in applied machine learning focused on the Apache Hadoop platform, which had seen a meteoric rise in enterprise adoption in the first half of the 2010s. However, where Apache Hadoop focused on distributed processing across commodity CPU hardware, deep learning practitioners focused on single machines with one or more GPUs and the Python programming environment. Apache Spark on Apache Hadoop gave robust options for applied machine learning on the JVM, yet graduate-level practitioners tended to be trained in Python.

As enterprises hired more machine learning practitioners out of graduate programs, this created a large influx of enterprise users demanding Python support for their applied machine learning workflows.

In earlier years Spark had been a key technology for running machine learning jobs, but more recently we see enterprises shifting toward GPU-based training in containers that may not live on a Hadoop cluster. The DevOps, IT, and platform teams working with these data scientists wanted to set those teams up for success while also keeping a manageable infrastructure strategy.

The folks in charge of these platforms were concerned with using the cloud in a way that did not cost too much while still taking advantage of the transient nature of cloud-based workloads.

There is a growing need to make these worlds work together on data lakes, whether they are on-premise, in the cloud, or somewhere in between. Today, a few of the major challenges in the enterprise machine learning practitioner space are:

  • data scientists tend to prefer their own unique set of libraries and tools
  • these tools are generally heterogeneous, both inside teams and across organizations
  • data scientists need to interoperate with the rest of the enterprise for resources, data, and model deployment
  • many of their tools are Python-based, and Python scripts prove difficult to manage across a large organization
  • many times data scientists want to build containers and then just “move them around” to take advantage of more powerful resources beyond their laptop, either in the on-premise data center or in the cloud

This makes operating a platform to support the different constituents using the system harder than ever.

It’s Harder Than Ever to Run Enterprise Infrastructure

We increasingly live in an age where developers and data scientists feel empowered to spin up their own infrastructure in the cloud, so it has become difficult for many organizations to enforce strict infrastructure rules for machine learning practitioners. We’ve seen customers end up with three or four different machine learning pipeline systems because groups cannot agree (and simply run off to AWS or GCP to get what they want when they are told “no”).

Hadoop and Spark are still key tools for data storage and processing, but Kubernetes has increasingly come into the picture to manage on-premise, cloud, and hybrid workloads. In the past two years we’ve seen Kubernetes adoption increase rapidly. Many enterprises have made infrastructure investments in previous technology cycles such as:

  • RDBMSs
  • Parallel RDBMSs
  • Hadoop
  • Key-value stores
  • Spark

Just like the systems before them, these systems carry a certain level of legacy inertia in the organizations that adopted them. Therefore, it’s easy to predict that integration between Hadoop workloads and specialized Kubernetes-based machine learning workloads is on the horizon, as we see Cloudera adopting Kubernetes (rather than YARN) for its new data science tools and Google swapping out YARN for Kubernetes for Spark.4

Other factors that make running infrastructure more complex than ever include:

  • evolution of open source acceptance in enterprise
  • using both on-premise and cloud infrastructure
  • security requirements involved in having multiple platforms
  • complexity of multi-tenancy support
  • demands for specialized hardware such as GPUs

Some overall infrastructure trends of interest as we go forward:

  • Mixtures of on-premise, cloud, and hybrid deployments are becoming more popular with enterprises.
  • The three major “big cloud” vendors continue to capture a number of the workloads that move to the cloud; some of these workloads oscillate between cloud and on-premise.
  • Docker has become largely synonymous with the term “container.”
  • Many enterprises are either using Kubernetes or strongly considering it as their container orchestration platform.

Beyond an on-premise cluster, Kubernetes clusters can be joined together with cluster federation. In a truly hybrid world, we can potentially have a job scheduling policy that places certain workloads in the cloud when on-premise resources are over-subscribed, or conversely restricts certain workloads to execute on-premise only and never in the cloud.
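
As a minimal sketch of how the on-premise-only half of such a policy might be expressed in plain Kubernetes terms, the Pod below uses a nodeSelector keyed on a hypothetical location label that our operators would apply to on-premise nodes (the label and image names are illustrative, not Kubernetes built-ins):

apiVersion: v1
kind: Pod
metadata:
  name: train-on-premise-only
spec:
  # "location: on-premise" is a label convention we define ourselves;
  # the scheduler will only place this Pod on nodes carrying that label.
  nodeSelector:
    location: on-premise
  containers:
  - name: trainer
    image: my-registry/training-job:latest    # hypothetical training image
  restartPolicy: Never

Cloud-bursting policies tend to be richer than a single label, but the same labeling and scheduling primitives are the building blocks.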

There are many factors to consider when dealing with cloud infrastructure or a hybrid federated cluster infrastructure, including:

  • Active Directory integration
  • security considerations
  • cost considerations for cloud instances
  • which users should have access to the cloud instances

We look at all of these topics in more depth in Chapter 2 and in the rest of this book.

Identifying Next Generation Infrastructure Core Principles

In the age of cloud infrastructure, what “big cloud” (AWS, Azure, GCP) does tends to significantly influence how enterprises build their own infrastructure. In 2018 we saw all three major cloud vendors offer managed Kubernetes as part of their infrastructure.

In late 2017, after initially offering its own container service, Amazon joined the other two members of big cloud and offered Amazon EKS as a first-class supported citizen on AWS.

All three major cloud vendors are pushing Kubernetes, and these tailwinds should accelerate interest in Kubernetes and Kubeflow in 2019. Beyond that, we’ve seen that containers are a key part of how data scientists want to operate, moving containers from local execution, to on-premise infrastructure, to running in the cloud (easily and cleanly).

Teams need to balance multi-tenancy with access to storage and compute options (NAS, HDFS, FlashBlade) and high-end GPUs (NVIDIA DGX-1, etc.). Teams also need a flexible way to customize their full applied machine learning applications, from ETL to modeling to model deployment, in a manner that stays within the above tenets.

We see that cloud vendors have an increasing influence on how enterprises build their infrastructure, and the managed services they advocate for are likely to see accelerated adoption. However, most organizations are separated from getting results out of their data science efforts by the obstacles listed in the previous section. We can break these obstacles down into the key components of:

  • composability
  • portability
  • scalability

Composability involves breaking machine learning workflows down into a number of components (or stages) in a sequence and allowing us to put them together in different formations. Many times these stages are each their own systems, and we need to move the output of one system to another system as input.

Portability involves having a consistent way to run and test code across environments:

  • development
  • staging
  • production

When we have deltas between those environments, we introduce opportunities for outages in production. Having a machine learning stack that runs on your laptop the same way it runs on a multi-GPU DGX-1 and then on a public cloud would be considered desirable by most practitioners today.

Most organizations also desire scalability and are constrained by:

  • access to machine-specific hardware (e.g., GPUs, TPUs)
  • limited compute
  • network
  • storage
  • heterogeneous hardware

Unfortunately, scale is not just about adding more hardware; in many cases it’s about the underlying distributed systems architecture involved. Kubernetes helps us with these constraints.

The Building Blocks of Containers and Kubernetes

A major reason we see widespread container (e.g., Docker) and Kubernetes adoption today is that together they are an exceptional infrastructure solution for the inherent issues of composability, portability, and scalability.

What Are Containers?

A container image is a lightweight, standalone, executable package of software that includes the code, runtime, system tools, system libraries, and settings needed to run the software.

Containers and container platforms provide many advantages over traditional virtualization:

  • Isolation is done at the kernel level without the need for a guest operating system.
  • Containers are much more efficient, fast, and lightweight.

This allows applications to become encapsulated in self-contained environments and comes with a slew of advantages such as quicker deployments, scalability, and closer parity between development environments. In recent years the Docker platform has largely become synonymous with the term “containers”.

Kubernetes is also a great fit for machine learning workloads because it already supports GPUs5 and TPUs6 as resources, with FPGAs7 in development. In the diagram below we can see how having key abstractions with containers and other layers is significant for running a machine learning training job on GPUs in a similar fashion to how you might run the same job on FPGAs.

Figure 1-2. Issues in Portability

We highlight the above abstractions because they are all components in the challenges involved with workflow portability. An example of this is how our code may work locally on our own laptop yet fail to run remotely on a Google Cloud virtual machine with GPUs.
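
Kubernetes narrows this gap by treating accelerators as schedulable resources. As a minimal sketch (assuming the NVIDIA device plugin is installed on the cluster and using a hypothetical training image), a Pod can request a GPU through its resource limits:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: my-registry/gpu-training:latest    # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1    # ask the scheduler for one GPU on the node
  restartPolicy: Never

The scheduler will only place this Pod on a node with an unallocated GPU, which is exactly the kind of bookkeeping that otherwise gets done by hand.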

Kubernetes also already has controllers for tasks such as batch jobs and the deployment of long-lived services for model hosting, giving us many foundational components for our machine learning infrastructure. Further, as a resource manager Kubernetes provides three primary benefits:

  1. Unified management
  2. Job isolation
  3. Resilient infrastructure

Unified management allows us to use a single cluster management interface for multiple Kubernetes clusters. Job isolation in Kubernetes gives us the ability to move models and ETL pipelines from development to production with fewer dependency issues. Resilient infrastructure means Kubernetes manages the nodes in the cluster for us and makes sure we have enough machines and resources to accomplish our tasks.

Kubernetes as a platform has been scaled in practice8 to thousands of nodes. While many organizations will not hit that order of magnitude, it highlights the fact that Kubernetes itself will not have scale issues as a distributed system.

Kubernetes is Not Magic

Just because the Kubernetes platform has been shown to scale in practice to thousands of nodes does not mean applications on the platform are immune to scale issues.

Every distributed application requires thoughtful design. Don’t fall prey to the notion that Kubernetes will magically scale any application just because someone threw the application in a container and sent it to a cluster.

Even if we solve the issues around moving containers and machine learning workflow stacks between environments, we’re not entirely home free. We still have other considerations, such as:

  • cost of using the cloud versus on-premise resources
  • organizational rules around where data can live
  • security considerations such as Active Directory and Kerberos integration

just to name a few. Modern data center infrastructure has a lot of variables in play, and it can make life hard for an IT or DevOps team. In this book we seek to arm the reader with a plan of attack for how to best plan for and meet the needs of their data science constituents, all while keeping in line with organizational policies.

Kubernetes solves a lot of infrastructure problems in an increasingly complex infrastructure world, but it wasn’t designed specifically for machine learning workflows. Let’s now take a look at how Kubeflow enters the picture to help enable machine learning workflows.

Enter: Kubeflow

Kubeflow was conceived as a way to run machine learning workflows on Kubernetes and is typically used for the following reasons:

  • You want to train and serve machine learning models in different environments (e.g., local, on-premise, and cloud)
  • You want to use Jupyter Notebooks to manage machine learning training jobs
  • You want to launch training jobs that use resources (such as additional CPUs or GPUs) that aren’t available on your personal computer
  • You want to combine machine learning code (e.g., TensorFlow code) with other processes (for example, you may want to use TensorFlow/agents to run simulations to generate data for training reinforcement learning models)

With our machine learning platform, we’d typically like an operational pattern similar to the “lab and factory” pattern: we want our data scientists to be able to explore new data ideas in “the lab,” and, once they find a setup they like, move it to “the factory” so that the modeling workflow can be run consistently and in a reproducible manner.

Kubeflow is a great option for Fortune 500 infrastructure because it allows a notebook-based modeling system to be easily integrated with data lake ETL operations, whether the data lake is local or in the cloud. It also supports multi-tenancy across different hardware by managing the container orchestration aspect of the infrastructure. Kubeflow gives us great options for a multi-tenant “lab” environment while also providing the infrastructure to schedule and maintain consistent machine learning production workflows in “the factory.”

While teams today can potentially share a single server, once we try to manage 5 or 10 servers, each with multiple GPUs, multi-tenancy can quickly become messy. Users end up having to wait for resources and may also be fighting dependencies or libraries that other users left behind from previous jobs. There is also the matter of isolating users from one another so that users cannot see each other’s data.

If we have any hope of sustaining an enterprise data science practice across a cluster of heterogeneous hardware, we will need a system like Kubeflow on Kubernetes. Beyond the obstacles we mention here, machine learning workflows tend to carry a great deal of technical debt just under the visible surface.

Origin of Kubeflow

In 2016 at Google Next, Google announced10 that Cloud ML would use GKE, a precursor to what we know as Kubeflow today. In December 2017 at KubeCon, the initial version of Kubeflow was announced11 and demoed by David Aronchick and Jeremy Lewi. This version included:

  • JupyterHub
  • TFJob v1alpha1
  • TFServing
  • GPUs working on Kubernetes

In January 2018 the Kubeflow Governance Proposal12 was published to help direct how the community around the project would work. In June 2018 the 0.1 version of Kubeflow was introduced13, containing an expanded set of components including:

  • JupyterHub
  • TFJob with distributed training support
  • TFServing
  • Argo
  • SeldonCore
  • Ambassador
  • Better Kubernetes support

Kubeflow version 0.2 was announced14 only two months later, in August 2018, with advancements such as:

  • TFJob v1alpha2
  • PyTorch and Caffe operators
  • Central UI

The project is still growing (as of the writing of this book) and Kubeflow continues to evolve. Today the project has expanded to over 100 engineers working on it (up from the three at the beginning), with 22 member organizations15 participating.

Who Uses Kubeflow?

Some of the key enterprise roles with the most interest in Kubeflow include:

  • DevOps engineers
  • Platform architects
  • Data scientists
  • Data engineers

DevOps engineers and platform architects need to support everyone with the right infrastructure in the right places, cloud or on-premise, supporting the ingest, ETL, data warehouse, and modeling efforts of other teams. Data engineers use this infrastructure to set up data scientists with the best denormalized datasets for vectorization and machine learning modeling. All of these teams need to work together to operate in the modern data warehouse, and Kubernetes gives them a lot of flexibility in how that happens.

Team Alignment for the Line of Business, DevOps, Data Engineering, and Data Science

Another challenge for organizations in the machine learning infrastructure space is how multifaceted the efforts to build and support these pipelines are. As we saw in the diagram earlier in this chapter, there is a lot of “hidden technical debt” in most machine learning workflows, and the components in these workflows are owned by multiple teams. These teams include:

  • Line of business: the part of the company that intends to use the results of the machine learning model to produce revenue for the organization
  • DevOps: the group that is responsible for making sure the platform is operational and secure
  • Data Engineering: the group that is responsible for getting data from the system of record (or data warehouse) and converting it into a form for the data science team to use
  • Data Science: the team responsible for building and testing the machine learning model

Each of these teams has to work in concert or the business will likely find no value from their individual efforts. Let’s now take a look at some specific scenarios where we might see Kubeflow used by the above teams.

Common Kubeflow Use Cases

Specific scenarios for how we’d use a technology are always critical, because otherwise it’s just “more infrastructure to manage.” In this section we’ll look at some specific use cases for Kubeflow and how each of the teams from the previous section might be involved.

Running Notebooks on GPUs

Users commonly start out on local platforms such as Anaconda and design initial use cases. As their use cases need more data, more powerful processing, or data that cannot be copied onto their local laptop, they tend to hit roadblocks in advancing their modeling activities.

Advantages of Notebooks on GPUs

Given that many notebook users start locally on their laptops, a natural question is: why use Kubeflow for hosted notebook infrastructure? Machine learning training algorithms, and deep learning in particular, are notoriously hungry for the processing power needed by their linear algebra routines. GPUs have become the industry standard for speeding up linear algebra processing in machine learning. Most laptops do not have GPUs onboard, so users go looking for other places to eventually run their modeling code.

A common platform stack that users have converged on is:

  • Jupyter Notebook
  • Python code
  • dependency management with containers, typically Docker

These developers are good at getting this stack just how they want it on their local laptop. However, operating this way inside a Fortune 500 company has side effects in areas such as:

  • security
  • data access
  • driver management
  • model integration

The IT organizations in most Fortune 500 companies take topics such as security and data access seriously. The idea of PhDs experimenting with sensitive corporate data on their local laptops conflicts directly with most IT information security policies, and this creates natural contention in an organization. The contention revolves around the line of business’s need for more value from its data versus the IT information security mandate to keep key information safe.

Given that companies will not give up on security requirements, we need to find ways to better serve how data scientists want to operate while keeping security policies intact. Kubeflow is a great option for this scenario because it allows data scientists to keep their preferred working environment, built in a container, and execute it in an environment blessed by IT corporate security.

This internal infrastructure can be secured with Active Directory and Kerberos while providing GPUs (e.g., NVIDIA’s DGX-1) and large storage arrays with object stores, proprietary storage (e.g., Pure Storage’s FlashBlade, NetApp’s storage), or HDFS.

Team Alignment for Notebooks on GPUs

In this scenario the DevOps team can enable the data scientists to build models faster with Kubeflow. This allows the data scientists to explore more concepts for the line of business and eliminate poor candidate use cases faster.

If the line of business can validate use cases faster with the data science team, they can find the best fits for the business to make the most money from its data.

Shared Multi-Tenant Machine Learning Environment

Many times an organization will have either multiple data scientists who need to share a cluster of high-value resources (e.g., GPUs) or they will have multiple teams of data scientists who need the same access to shared resources.

In this case an organization needs to build a multi-tenant machine learning platform, and Kubeflow is a solid candidate for this scenario.

Often we’ll see organizations buy machines with one or more GPUs attached, up to specialized hardware such as an NVIDIA DGX-1 or DGX-2 (e.g., eight GPUs per machine). This hardware is considerably more costly than a traditional server, so we want as many data scientists leveraging it for model training as possible.

Advantages of an On-Premise Multi-Tenant Environment

Each data scientist will have their own model workflow and code dependencies, as described earlier in this chapter. We need a system such as Kubeflow that can execute each user’s workflow while keeping that workflow’s dependencies and data separate from other users’ work on the same set of resources (i.e., isolation).

Kubeflow and Kubernetes handle these requirements for us with their scheduling and container management functionality. A great example is how we may have three different data scientists each needing to run their own notebook on a single GPU. Kubernetes with Kubeflow keeps track of who is running what code on which machine and which GPUs are currently in use. Kubernetes also keeps track of which jobs are waiting in the job queue and will schedule the waiting jobs once an in-process job finishes.

Team Alignment

Multi-tenant systems make DevOps teams’ lives much simpler because they handle a lot of infrastructure complexity for us. The DevOps team can focus on keeping the Kubernetes cluster and the Kubeflow application operational, which lets us leverage all the benefits of job scheduling, container management, and resource management in Kubernetes.

When data scientists have more flexible access to the resources they need (e.g., GPUs), they can build models faster and this allows the line of business to evaluate a data product faster to see if it can be effective enough to be viable for the organization.

Example: Building a Transfer Learning Pipeline

Let’s use a computer vision example scenario to better understand how Kubeflow might be deployed to solve a real problem. A realistic example would be a team wanting to get a computer vision transfer learning pipeline working so they can build a custom computer vision model for detecting specific items in a retail store.

The team’s basic plan consists of:

  1. Get a basic computer vision model working from the TensorFlow model zoo16
  2. Update the model with an example dataset to understand transfer learning
  3. Move the model training code to Kubeflow to take advantage of on-premise GPUs

The team starts off by experimenting with the object detection example notebook17 provided in the TensorFlow GitHub repository. Once they have this notebook running locally, they know they can produce inferences for objects in an image with a prebuilt TensorFlow computer vision model.

Next the team wants to customize a computer vision model from the model zoo with their own dataset, but first they need to get a feel for how transfer learning works in practice. The team checks out the “pet detector” example18 in the TensorFlow object detection GitHub repository and learns how to further train an existing COCO19-pretrained model with transfer learning to build a custom computer vision model.

Once they get the transfer learning example running with the pets dataset, it should not be hard to label some of their own product data to build the annotations needed to further train their own custom computer vision model.

The original pets example shows how to use GPUs on Google Cloud, but the team wants to leverage a cluster of GPUs they have in their own data center. At this point the team sets up Kubeflow on their internal on-premise Kubernetes cluster and runs the pets tutorial specifically built20 for Kubeflow. The team had previously built their own custom annotated dataset, which they can now use for building models on their internal Kubeflow cluster.

Advantages of Running Computer Vision Pipeline on Kubeflow

The major reasons the team moves their pipeline to Kubeflow are:

  1. secure access to sensitive data for multiple users (data that may not be allowed to live on users’ laptops)
  2. cost savings by using on-premise GPUs
  3. the ability for multiple users to share the same cluster of GPUs

The team reasons that in some training scenarios Kubeflow on-premise can be more cost-effective per GPU hour than cloud GPUs. They also want to securely control where the core training dataset lives, and they want the ability to allow multiple data scientists to share the same consistent training dataset while trying different variations of models.

Team Alignment for Computer Vision Pipeline

This transfer learning scenario on Kubeflow allows the DevOps team to more tightly control who has a copy of the sensitive training data. The line of business has tasked the data science team with building a model whose mAP score meets the minimum level needed to be economically viable to the business unit.

To accomplish this modeling goal, the data science team needs to try many variations of hyperparameters and feature selection passes (in concert with the data engineering team) to drive up their model’s effectiveness. The faster the data science team can train models, the more variations of training runs the team can try. In the case of deep learning and computer vision, GPUs make training runs take significantly less time, so these are a key resource for the data science team.

The business unit wants to hit its target minimum model effectiveness while staying under budget. The business unit forecasts that using Kubeflow on-premise with GPUs is the cheaper way for the data science team to build models, because the team expects to train models many times a week over a long period of time.

GPUs in the Cloud

GPUs in the cloud give us more flexibility than GPUs on-premise because we can spin them up on demand, making it more convenient to try new ideas.

However, this convenience may cost more than if we bought a GPU and used it locally all of the time.

The cost versus flexibility trade-off is something we should always keep in mind when deciding where to run our jobs.

Using Kubeflow on-premise with GPUs also allows the data science team to model faster by running multiple computer vision jobs on the cluster at the same time, thanks to the multi-tenant nature of the system.

Deploying Models to Production for Application Integration

Once a model is trained, it typically exists as a single file on a laptop or server host. We then need to either:

  1. copy the file to the machine running the application that will integrate it, or
  2. load the model into a server process that can accept network requests for model inference

If we only need to copy the model file to a single application host, this is manageable. However, when many applications need inference output from the model, it becomes more complex.

One major issue involves updating the model once it has been deployed to production. This becomes more work as we need to track the model version on more machines and remember which ones need to be updated.

Another issue is rolling back a model once it has been deployed. If we have deployed a model and then realize that we need to roll back to the previous version, this is a lot of work when many copies of the model are floating around on different hosts.

Let’s now take a look at the advantages of using a model-hosting pattern for deploying a model to production with Kubeflow.

Advantages of Deploying Models to Production on Kubeflow

A major advantage of having a machine learning model loaded in a model server on Kubeflow is that we can do updates and rollbacks from a single point (i.e., the server process). This allows us to treat a model more like a database table, in that we can apply operations to the model and all of the consuming client applications get the updates as soon as the update transaction is complete and they make their next inference request.

Team Alignment for Model Deployment

The model server pattern makes life for the DevOps team considerably easier, as they don’t have to manually track many model copies. The application engineering team can focus on consuming the model as an internal REST resource on the network and spend less time writing specific machine learning API code to integrate a model.

Once the data science team has developed models that the line of business wishes to put into production, the model server pattern allows them to hand off the model to the DevOps team to put into production on the model server. This allows the data science team to get out of having to support individual models in production and focus on building the next round of models with the lines of business.

Components of Kubeflow

The logical components of Kubeflow are:

  • Jupyter Notebooks
  • Hyperparameter tuning with Katib
  • Pipelines for workflow management
  • Model serving, for serving inferences from deployed machine learning models
  • Training systems for model training

These components work together to provide a scalable and secure system for running machine learning jobs (notebook-based jobs and also jobs outside of notebooks). The relationships between the components can be seen in the diagram below.

Figure 1-4. Kubeflow Component Diagram

Given the rise of Kubernetes as an enterprise platform management system, it makes a lot of sense to have a way to manage our machine learning workloads in a similar manner. In the rest of this section we take a look at each component and how it is used within the Kubeflow platform.

Jupyter Notebooks

Jupyter Notebooks are included with the Kubeflow platform as a core component. Jupyter Notebooks are popular21 due to their ease of use and are commonly associated (in machine learning especially) with the Python programming language. Notebooks typically contain code (e.g., Python, Java) and other visual rich-text elements that mimic a web page or textbook.

A novel aspect of notebooks is that they combine the concept of a computer program with the notes we typically associate with complex logic in our programs. This inline documentation gives notebook users a better way not only to document what they’re doing but also to communicate it to other users who may want to run their code. Given the complexity of machine learning code, this property is part of the reason we’ve seen notebooks’ explosive popularity in the machine learning space.

Jupyter Notebook Integration with Kubeflow

The Kubeflow application deployment includes support for spawning and operating Jupyter Notebooks. Advantages of having Jupyter Notebooks integrated into the Kubeflow platform include:

  • good integration with the rest of the Kubeflow infrastructure from the point of view of authentication and access control
  • the ability to share notebooks between users

Jupyter Notebooks in Kubeflow also have access to the Fairing library, enabling notebooks to submit training jobs to Kubernetes directly from the notebook. This is compelling because code in a Jupyter Notebook otherwise only uses the default single-process execution mode. The ability to submit jobs to Kubernetes from the notebook allows us to leverage TFJob (covered later in this chapter) and run distributed training jobs.

Sometimes we want separate notebook servers for each user or for a team. Kubeflow allows us to set up multiple notebook servers for a given Kubeflow deployment. Each notebook server belongs to a namespace and can serve and execute multiple notebooks. Namespaces in Kubeflow also give us multi-user isolation. This means we can have multiple sets of users who can’t see one another’s resources, so they don’t clutter one another’s workspaces on the shared multi-tenant infrastructure.

Multi-User and Malicious Users

Currently, as of v0.7, Kubeflow has no hard security guarantees against malicious attempts by users to access other users’ resources.

Kubeflow will launch a container image on Kubernetes for each notebook server. The notebook image contains dependencies such as the ML libraries and CPU or GPU support for the notebook.

Machine Learning Model Training Components

Many machine learning frameworks are supported by Kubeflow. In theory, a user could simply containerize an arbitrary framework and submit it to a Kubernetes cluster to execute. However, Kubeflow makes Kubernetes clusters aware of the execution nuances each machine learning library expects or needs for things such as:

  • parallel training
  • custom resources
  • workflows

The current training frameworks supported by Kubeflow are:

  • TensorFlow
  • Keras
  • PyTorch
  • MXNet
  • MPI
  • Chainer

Below we list some details about a few of these frameworks.

 

TensorFlow Training and TFJob

TensorFlow is supported by Kubeflow and is the most popular machine learning library in the world today. Given that TensorFlow, Kubernetes, and Kubeflow were all originally created at Google, it makes sense that TensorFlow was the original library supported by Kubeflow.

As mentioned previously in this chapter, TFJob is a custom component for Kubeflow that consists of a Kubernetes custom resource definition (CRD) and an associated controller (tf-operator). The TFJob CRD is what enables Kubernetes to execute distributed TensorFlow jobs.
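
To make this concrete, below is a minimal sketch of a TFJob manifest for a small distributed training job. The container image is hypothetical, and the exact apiVersion varies with the Kubeflow release you have installed:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-dist-training-example
spec:
  tfReplicaSpecs:
    PS:                     # parameter server replicas
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/tf-training:latest    # hypothetical image
    Worker:                 # worker replicas
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: my-registry/tf-training:latest    # hypothetical image

The tf-operator watches for TFJob objects like this one and creates the underlying Pods and services the workers need to find one another.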

Check out the Kubeflow documentation page on TensorFlow at:

https://www.kubeflow.org/docs/components/tftraining/

Keras

Keras is supported in the Kubeflow project and can be used in several ways:

  • single-process job run as a CRD
  • single-process GPU job run as a CRD
  • TFJob as a single-worker job
  • TFJob as a distributed-worker job (via the Estimator API)
  • Jupyter Notebook (CPU or GPU)

Below we show the simplest way to run a Keras job on Kubernetes as a Job:

Example 1-1. Job Example YAML to Run a Keras Python Script
apiVersion: batch/v1
kind: Job
metadata:
  name: keras-job
spec:
  template:
    spec:
      containers:
      - name: tf-keras-gpu-job
        # Container image holding the Keras training script and its dependencies
        image: emsixteeen/basic_python_job:v1.0-gpu
        imagePullPolicy: Always
      volumes:
      # Persistent volume claim providing a shared working directory
      - name: home
        persistentVolumeClaim:
          claimName: working-directory
      restartPolicy: Never
  backoffLimit: 1

We would launch this Keras job from the command line with kubectl (for example, kubectl apply -f keras-job.yaml, assuming the manifest above is saved to that file).

PyTorch Training

Initial PyTorch (alpha) support with the PyTorch operator was included in Kubeflow 0.2.0. This alpha version was supported up to the 0.3.5 version of Kubeflow. The next (beta) version of the PyTorch operator was introduced with Kubeflow 0.4.0.

We run a PyTorch job similarly to how we’d run a TensorFlow job:

  • Using PyTorch code inside a Jupyter Notebook
  • Or writing a YAML file to configure a PyTorchJob and running it with kubectl, as sketched below
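
A minimal PyTorchJob manifest might look like the sketch below; again, the image name is hypothetical and the apiVersion depends on the operator version installed with your Kubeflow release:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-training-example
spec:
  pytorchReplicaSpecs:
    Master:                 # coordinates the distributed training run
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: my-registry/pytorch-training:latest    # hypothetical image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: my-registry/pytorch-training:latest    # hypothetical image

As with the Keras example earlier, we would submit this manifest with kubectl and let the PyTorch operator create and monitor the underlying Pods.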

Hyperparameter Tuning

Hyperparameter tuning involves exploring the hyperparameter search space to find an optimal (or near-optimal) set of hyperparameters for training a machine learning model. Data scientists spend a nontrivial amount of time trying combinations of hyperparameters and seeing how much more accurate their model can become as a result.

The hyperparameter search application included with Kubeflow is called Katib22. Katib was originally inspired by an internal Google system called Vizier and is focused on being machine learning framework agnostic. Katib lets us create a number of experiments, each running trials for hyperparameter evaluation. Currently, Katib supports exploration algorithms such as random search, grid search, and more.
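
To give a feel for what this looks like in practice, below is a heavily abbreviated sketch of a Katib Experiment resource. Field names shift between Katib versions, so treat this as illustrative rather than something to copy verbatim:

apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy     # metric the training code must report
  algorithm:
    algorithmName: random             # random search over the space below
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
  - name: --learning-rate
    parameterType: double
    feasibleSpace:
      min: "0.01"
      max: "0.1"
  # trialTemplate (the job template Katib launches for each trial) omitted

Katib then launches trials with sampled hyperparameter values, records each trial’s objective metric, and reports the best-performing combination.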

Pipelines

Kubeflow Pipelines23 allow us to build machine learning workflows and deploy them as a logical unit on Kubeflow, running them as containers on Kubernetes. While machine learning examples are often presented as a single Jupyter Notebook or Python script, real machine learning workflows are typically not a single script or job. Previously in this chapter we introduced the diagram below, from the paper “Hidden Technical Debt in Machine Learning Systems”:

Figure 1-5. “Hidden Technical Debt in Machine Learning Systems”24

Many of the boxes in this diagram end up as their own sub-workflows in practice in a production machine learning system.

Examples of this would be how data collection and feature engineering each may be their own workflows built by teams separately from the data science team. Model evaluation is another component we often see performed as its own workflow after the training phase is complete.

Kubeflow Pipelines simplify the orchestration of these pipelines as containers on Kubernetes infrastructure. This gives our teams the ability to reconfigure and deploy complex machine learning pipelines in a modular fashion to speed up model deployment and time to production.

Basic Kubeflow Pipeline Concepts

A Kubeflow Pipeline is a graph defined by Python code representing all of the components in the workflow. Pipelines define the input parameter slots required to run the pipeline and how each component’s output is wired as input to the next stage in the graph.

A pipeline component is defined as a Docker image containing the user code and dependencies required to run on Kubernetes. Below we can see a pipeline definition example25 in Python.

Example 1-2. Kubeflow Pipeline Defined in Python
@dsl.pipeline(
  name='XGBoost Trainer',
  description='A trainer that does end-to-end distributed training for XGBoost models.'
)
def xgb_train_pipeline(
    output,
    project,
    region='us-central1',
    train_data='gs://ml-pipeline-playground/sfpd/train.csv',
    eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
    schema='gs://ml-pipeline-playground/sfpd/schema.json',
    target='resolution',
    rounds=200,
    workers=2,
    true_label='ACTION',
):
  delete_cluster_op = DeleteClusterOp('delete-cluster', project, region).apply(gcp.use_gcp_secret('user-gcp-sa'))
  with dsl.ExitHandler(exit_op=delete_cluster_op):
    create_cluster_op = CreateClusterOp('create-cluster', project, region, output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    analyze_op = AnalyzeOp('analyze', project, region, create_cluster_op.output, schema,
                           train_data, '%s/{{workflow.name}}/analysis' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    transform_op = TransformOp('transform', project, region, create_cluster_op.output,
                               train_data, eval_data, target, analyze_op.output,
                               '%s/{{workflow.name}}/transform' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))

    train_op = TrainerOp('train', project, region, create_cluster_op.output, transform_op.outputs['train'],
                         transform_op.outputs['eval'], target, analyze_op.output, workers,
                         rounds, '%s/{{workflow.name}}/model' % output).apply(gcp.use_gcp_secret('user-gcp-sa'))
...

If we look at this graph in the Kubeflow Pipelines user interface, it would look similar to the image below.

Figure 1-6. Rendered Pipeline Graph in the Kubeflow Pipeline UI

The Kubeflow Pipelines UI further allows us to define input parameters for each run and then launch the job. The saved outputs from the pipeline include graphs such as confusion matrices and receiver operating characteristic (ROC) curves.

Kubeflow Pipelines produce and store both metadata and artifacts from each run. Metadata from pipeline runs is stored in a MySQL database, and artifacts are stored in an artifact store such as a MinIO26 server or a cloud storage system. Both MinIO and MySQL are backed by PersistentVolumes27 in Kubernetes.

The saved metadata allows Kubeflow to track specific experiments and jobs run on the cluster. The saved artifacts allow us to investigate an individual job’s run performance.

Machine Learning Model Inference Serving

Once we’ve constructed a model, we need to integrate the saved model with our applications. Kubeflow offers multiple ways to load saved models into a process for serving live inference to external applications. These options include:

  • Istio Integration (for TF Serving)
  • Seldon Serving
  • NVIDIA TensorRT Inference Server
  • TensorFlow Serving
  • TensorFlow Batch Predict

Each of these options supports different machine learning libraries and has its own set of specific features. Typically each provides a Docker image that you can run as a Kubernetes resource, loading a saved model from a repository of models.
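
As an example of what that looks like, TensorFlow Serving is commonly run as an ordinary Kubernetes Deployment fronted by a Service. The sketch below assumes a model has already been exported to a model repository path that the container can reach; the model name and path are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-model-server
  template:
    metadata:
      labels:
        app: my-model-server
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving:latest
        args:
        - --model_name=my_model                  # hypothetical model name
        - --model_base_path=/models/my_model     # path to the exported model
        ports:
        - containerPort: 8501                    # TensorFlow Serving REST port
---
apiVersion: v1
kind: Service
metadata:
  name: my-model-server
spec:
  selector:
    app: my-model-server
  ports:
  - port: 8501
    targetPort: 8501

Client applications call the Service endpoint for inference, and updating or rolling back a model becomes a Deployment update rather than copying model files from host to host, which is the single-point update advantage described earlier in this chapter.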

An Overview of Kubernetes

Planning and deploying infrastructure on-premise, in the cloud, and in hybrid situations is a key topic in the 2019 infrastructure landscape. Kubernetes is a container orchestration system meant to coordinate clusters of nodes at scale, in production, in an efficient manner. Kubernetes works around the idea of pods, the scheduling units in the Kubernetes ecosystem (each pod containing one or more containers), which are distributed across hosts in a cluster to provide high availability. Kubernetes itself is not a complete solution and is intended to integrate with other tools such as Docker, a popular container technology.

A container image is a lightweight, standalone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, and settings. Colloquially containers are referred to as “Docker,” but what the speaker typically means is containers in the general sense. Docker containers make it significantly easier for developers to enjoy parity between their local, testing, staging, and production environments. Containers allow teams to run the same artifact or image across all of those environments, including a developer’s laptop and the cloud. This property of containers, especially when combined with container orchestration via a system like Kubernetes, is driving the concept of “hybrid cloud” in practice.

Kubernetes shines when we have a lot of containers that need to be managed across a lot of machines (on-premise, in the cloud, or a mixture of both).

Docker and Kubernetes are not direct competitors but complementary technologies. One way to think of it is that Docker is for managing images and individual containers, while Kubernetes is for managing pods of containers. Docker provided an open standard for packaging and distributing containerized applications, but it did not solve the container orchestration problem (inter-container communication, scaling, and so on). Competitors to Kubernetes in this space are Mesos and Docker Swarm, but the industry is converging on Kubernetes as the standard for container orchestration (as is evident later in this chapter).

With Kubernetes we also have the ability to dynamically scale applications on a cluster. With installable options such as the Cluster Autoscaler, we have the ability to scale the clusters themselves.

Let’s now dig into a few core principles from Kubernetes that are relevant throughout this book.

 

Core Kubernetes Concepts

Not every reader will be a battle-hardened DevOps engineer, so for the newer practitioner we cover a few basic Kubernetes concepts in this section.

Pod

A pod (as in “a pod of whales”) is defined as the smallest unit of computing that can be created and managed in Kubernetes. More specifically, a pod is a group of one or more containers with shared storage and network, combined with a specification for how to run the containers. The contents of a pod are always co-located and co-scheduled, and run in a shared context.

The official reference for Kubernetes Pods is at:

https://kubernetes.io/docs/concepts/workloads/pods/pod/

A Pod is the smallest deployable unit in Kubernetes and it holds one or more containers. Multiple containers within a Pod share:

1. The network stack
2. Storage volumes
3. The Pod lifecycle

Some notes on “Pods versus Jobs”: while a Pod is the smallest deployable unit in Kubernetes and can be scheduled to execute on its own, Pods are typically created by higher-level objects within Kubernetes, such as a Deployment, a StatefulSet, or a Job. A Job is the first higher-level deployable unit above a Pod. Some of the key differences between a Pod and a Job include:

  • When a “bare Pod” is running on a node that reboots or fails, it will not be rescheduled
  • A Job is similar to a “bare Pod,” but Kubernetes will attempt to reschedule a Job’s terminated Pods
  • A Job provides additional semantics compared to “bare Pods”

Submitting Containers to Kubernetes

Containers are submitted as part of a Pod, not standalone. Pods are defined using YAML syntax, as we can see in the example at the end of this section. Kubernetes doesn’t actually accept a container as a first-class object; rather, it accepts a Pod definition. It then attempts to schedule the Pod and pulls the container images defined within the Pod.

Typically, Pods are submitted via the kubectl command-line interface (for example, kubectl apply -f mypod.yaml). An example YAML definition for a Pod is below:

Example 1-3. A sample YAML file definition for a Kubernetes Pod
apiVersion: v1
kind: Pod
metadata: 
  name: myapp-pod
  labels: 
    app: myapp
spec: 
  containers: 
  - name: myapp-container 
    image: busybox 
    command: ['sh', '-c', 'echo Hello Kubernetes! && sleep 3600']

Custom Resources, Controllers, and Operators

A Kubernetes resource28 is an endpoint in the Kubernetes API that stores a collection of API objects of a certain kind. A custom resource is an extension of the Kubernetes API that is specific to the cluster it is installed on, customizing that cluster. Custom resources can be created, updated, and deleted on a running cluster through dynamic registration. Once we have installed a custom resource, we can control it with the kubectl tool just like we would built-in resources (e.g., Pods).

By themselves, custom resources simply allow us to store and retrieve structured data. When we combine a custom resource with a controller, it becomes a true declarative API. The controller interprets the structured data as a record of the user’s desired state and continually takes action to achieve that state. Custom controllers are especially effective when combined with custom resources (but they can work with any kind of resource). One example of this combination (custom controller plus custom resource) is the “operator pattern.” The operator pattern allows us to encode domain knowledge for specific applications into an extension of the Kubernetes API.

Custom Controllers

Custom resources let us store and retrieve structured data. They need to be combined with a controller to be considered a declarative API. Being a “declarative API” means they allow us to declare (specify) the desired state of our resource.

Kubernetes then attempts to match the actual state to this desired state. The controller interprets the structured data as the user’s intended state and attempts to achieve and maintain that state inside the Kubernetes cluster.

A custom controller is a controller that can be deployed by a user and updated independently of the cluster’s own lifecycle. Custom controllers can work with any kind of resource and are most effective when paired with a custom resource. This pairing is called the “operator” pattern in Kubernetes.

Custom Resource Definitions (CRDs)

When we want to define our own custom resource, we write a custom resource definition (CRD). Once we have the CRD set up in Kubernetes, we can use it like any other native Kubernetes object (leveraging features such as the CLI, security, API services, and RBAC).
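
As a minimal sketch, the CRD below registers a hypothetical Foo resource; once it is applied, kubectl can list and manage Foo objects just like built-in resources. The group and field names here are purely illustrative:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com        # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Foo
    plural: foos
    singular: foo
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              replicas:
                type: integer

A controller (or operator) would then watch for Foo objects and work to make the cluster match each object’s spec, which is the same pattern components such as TFJob and PyTorchJob use.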

Summary

In this introductory chapter we covered the impetus for Kubeflow as a machine learning platform. As we move forward through this book, we’ll come to understand how to plan, install, maintain, and develop on the Kubeflow platform as a cornerstone of our machine learning infrastructure.

1 https://github.com/kubeflow

2 http://shop.oreilly.com/product/0636920035343.do

3 http://shop.oreilly.com/product/0636920035343.do

4 https://www.youtube.com/watch?v=8W88qAFdAUU

5 https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

6 https://cloud.google.com/kubernetes-engine/docs/concepts/tpus

7 https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

8 https://openai.com/blog/scaling-kubernetes-to-2500-nodes/

9 https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

10 https://cloud.google.com/blog/products/gcp/google-takes-cloud-machine-learning-service-mainstream

11 https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/

12 https://docs.google.com/document/d/1vgvZC93I5N-6qm9xrtINs1JJCbDqZEyJA175nAwBX8g/edit?ts=5a711f78#heading=h.rhx0jou170p8

13 https://kubernetes.io/blog/2018/05/04/announcing-kubeflow-0.1/

14 https://medium.com/kubeflow/kubeflow-0-2-offers-new-components-and-simplified-setup-735e4c56988d

15 https://github.com/kubeflow/community/blob/master/member_organizations.yaml

16 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

17 https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb

18 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

19 http://cocodataset.org/

20 https://github.com/kubeflow/examples/tree/master/object_detection

21 http://shop.oreilly.com/product/0636920034919.do

22 https://github.com/kubeflow/katib

23 https://v0-6.kubeflow.org/docs/components/pipelines/pipelines/

24 https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

25 https://github.com/kubeflow/pipelines/tree/master/samples/core/xgboost_training_cm

26 https://docs.min.io/

27 https://kubernetes.io/docs/concepts/storage/persistent-volumes/#types-of-persistent-volumes

28 https://kubectl.docs.kubernetes.io/pages/kubectl_book/resources_and_controllers.html
