Deploy and Scale Machine Learning Models with Kubernetes

Data Engineering
Data Science
Machine Learning

29. July 2021

Team statworx

Management Summary

Kubernetes is a technology that in many ways greatly simplifies the deployment and maintenance of applications and compute loads, especially the training and hosting of machine learning models. At the same time, it allows us to adapt the required hardware resources, providing a scalable and cost-transparent solution.

This article first discusses the transition from a server to management and orchestration of containers: isolated applications or models that are packaged once with all their requirements and can subsequently be run almost anywhere. Regardless of the server, these can be replicated at will with Kubernetes, allowing effortless and almost seamless continuous accessibility of their services even under intense demand. Likewise, their number can be reduced to a minimum level when the demand temporarily or periodically dwindles in order to use computing resources elsewhere or avoid unnecessary costs.

From the capabilities of this infrastructure emerges a useful architectural paradigm called microservices. Formerly centralized applications are thus broken down into their functionalities, which provide a high degree of reusability. These can be accessed and used by different services and scale individually according to internal needs. An example of this is large and complex language models in Natural Language Processing, which can capture the context of a text regardless of its further use and thus underlie many downstream purposes. Other microservices (models), such as for text classification or summarization, can invoke them and further process the partial results.

After a brief introduction of the general terminology and functionality of Kubernetes, as well as possible use cases, the focus turns to the most common way to use Kubernetes: with cloud providers such as Google GCP, Amazon AWS, or Microsoft Azure. These allow so-called Kubernetes clusters to dynamically consume more or fewer resources, though the costs incurred remain foreseeable on a pay-per-use basis. Other common services such as data storage, versioning, and networking can also be easily integrated by the providers. Finally, the article gives an outlook on tools and further developments, which either make using Kubernetes even more efficient or further abstract and simplify the process towards serverless architectures.

Introduction

Over the last 20 years, vast amounts of new technologies have surfaced in software development and deployment, which have not only multiplied and diversified the choice of services, programming languages, and libraries but have even led to a paradigm shift in many use cases or domains.

If we also look at the way software solutions, models, or work and computing loads have been deployed over the years, we can see how innovations in this area have also led to greater flexibility, scalability, and resource efficiency, among other things.

In the beginning, these were run as local processes directly on a server (shared by several applications), which posed some limitations and problems: on the one hand, one is bound to the configuration of the server and its operating system when selecting the technical tools, and on the other hand, all applications hosted on the server are limited by its memory and processor capacities. Thus, they share not only resources in total but also a possible cross-process error-proneness.

As a first further development, virtual machines can then offer a further level of abstraction: by emulating (“virtualizing”) an independent machine on the server, modularity and thus greater freedom is created for development and deployment. For example, in the choice of operating system or the programming languages and libraries used. From the point of view of the “real” server, the resources to which the application is entitled can be better limited or guaranteed. However, their requirements are also significantly higher since the virtual machine must also maintain the virtual operating system.

Ultimately, this principle has been significantly streamlined and simplified by the proliferation of containers, especially Docker. Put simply, one builds/configures a separate virtual, isolated server for an application or machine learning model. Thus, each container has its own file system and certain system libraries, but not operating system. This technically turns it into a sandbox whose other configuration, code dependencies or errors do not affect the host server, but at the same time can run as relatively “lightweight” processes directly on it.

So there is the possibility to copy, install, etc., everything for the desired application and provide this in a packaged container everywhere in a consistent format. This is not only extremely useful for the production environment, but we at STATWORX also like to use it in the development of more complicated projects or the proof-of-concept phase. Intermediate steps or results, such as extracting text from images, can be used as a container like a small web server by those interested in further processing of the text, such as extracting certain key information, or determining its mood or intent.

This subdivision into so-called “microservices“ with the help of containers helps immensely in the reusability of the individual modules, in the planning and development of the architecture of complex systems; at the same time, it frees the individual work steps from technical dependencies on each other and facilitates maintenance and update procedures.

After this brief overview of the powerful and versatile possibilities of deploying software, the following text will deal with how to reliably and scalably deploy these containers (i.e., applications or models) for customers, other applications, internal services or computations with Kubernetes.

Kubernetes – 8 Essential Components

Kubernetes was introduced by Google in 2014 as open-source container management software (also called container orchestration). Internally, the company had already been using tools developed in-house for years to manage workloads and applications, and regarded the development of Kubernetes not only as a convergence of best practices and lessons learned, but also as an opportunity to open up a new business segment in cloud computing.

The name Kubernetes (Greek for helmsman) was supposedly chosen in reference to a symbolic container ship, for whose optimal operation he was responsible.

1. Nodes

When speaking of a Kubernetes instance, it is referred to as a (Kubernetes) cluster: it consists of several servers, called nodes. One of them, called the master node, is solely responsible for administrative operations, and is the interface that is addressed by the developer. All other nodes, called worker nodes, are initially unoccupied and thus flexible. While nodes are actually physical instances, mostly in data centers, the following terms are digital concepts of Kubernetes.

2. Pods

If an application is to be deployed on the cluster, in the simplest case the desired container is specified and then (automatically) a so-called pod is created and assigned to a node. The pod simply resembles a running container. If several instances of the same application are to run in parallel, for example to provide better availability, the number of replicas can be specified. In this case, the specified number of pods, each with the same application, is distributed to the nodes. If the demand for the application exceeds the capacities despite replicas, even more pods can be created automatically with the Horizontal Autoscaler. Especially for Deep Learning models with relatively long inference times, metrics such as CPU or GPU utilization can be monitored here, and the number of pods can be increased or decreased automatically to optimize both capacity and cost.

To avoid confusion, ultimately every running container, i.e., every workload, is a pod. In the case of deploying an application, this is technically done via a deployment, whereas temporal compute loads are jobs. Persistent stores such as databases are managed with StatefulSets. The following figure provides an overview of the terms:

3. Jobs

Kubernetes jobs can be used to execute both one-time and recurring jobs (so-called CronJobs) in the form of a container deployment on the cluster.

In the simplest case, these can be seen as a script, which can be used for maintenance or data preparation work of databases, for example. Furthermore, they are also used for batch processing, for example when deep learning models are to be applied to larger data volumes and it is not worthwhile to keep the model continuously on the cluster. In this case, the model container is started up, gets access to the desired dataset, performs its inference on it, saves the results and shuts down. There is also flexibility here for the origin and subsequent storage of the data, so own or cloud databases, bucket/object storage or even local data and logging frameworks can be connected.

For recurring CronJobs, a simple time scheme can be specified so that, for example, certain customer data, transactions or the like are processed at night. Natural Language Processing can be used to automatically create press reviews at night, for example, which can then be evaluated the following morning: News about a company, its industry, business locations, customers, etc. can be aggregated or sourced, evaluated with NLP, summarized, and presented with sentiment or sorted by topic/content.

Even labor-intensive ETL (Extract Transform Load) processes can be performed or prepared outside business hours.

4. Rolling Updates

If a deployment needs to be brought up to the latest version or a rollback to an older version needs to be completed, rolling updates can be triggered in Kubernetes. These guarantee continuous accessibility of the applications and models within a Continuous Integration/Continuous Deployment pipeline.

Such a rollout can be initiated and monitored smoothly in one or a few steps. By means of a rollout history it is also possible not only to jump back to a previous container version, but also to restore the previous deployment parameters, i.e. minimum and maximum number of nodes, which resource group (GPU nodes, CPU nodes with little/much RAM,…), health checks, etc.

If a rolling update is triggered, the respective existing pods are kept running and accessible until the same number of new pods are up and accessible. Here there are methods to guarantee that no requests are lost, as well as parameters that regulate a minimum accessibility or a maximum surplus of pods for the change.

Figure 5 illustrates the rolling update.

1) The current version of an application is located on the Kubernetes cluster with 2 replicas and can be accessed as usual.

2) A rolling update to version V2 is started, the same number of pods as for V1 are created.

3) As soon as the new pods have the state “Running” and, if applicable, health checks have been completed, thus being functional, the containers of the older version are shut down.

4) The older pods are removed and the resources are released again.

The DevOps and time involved here is marginal, internally no hostnames or the like change, while from the consumer’s point of view the service can be accessed as before in the usual way (same IP, URL, …) and has merely been updated to the latest version.

5. Platform/Infrastructure as a Service

Of course, a Kubernetes cluster can also be deployed locally on your on-premises hardware as well as on partially pre-built solutions, such as DGX Workbenches.

Some of our customers have strict policies or requirements regarding (data) compliance or information security, and do not want potentially sensitive data to leave the company. Furthermore, it can be avoided that data traffic flows through non-European nodes or generally ends up in foreign data centers.

Experience shows, however, that this is only the case in a very small proportion of projects. Through encryption, rights management and SLAs of the operators, we consider the use of cloud services and data centers to be generally secure and also use them for larger projects. In this regard, deployment, maintenance, CI/CD pipelines are also largely identical and easy to use thanks to methods of containerization (Docker) and abstraction (Kubernetes).

All major cloud operators like Google (GCP), Amazon (AWS) and Microsoft (Azure), but also smaller providers and soon even exciting new German projects, offer services very similar to Kubernetes. This makes it even easier to deploy and, most importantly, scale a project or model, as auto-scaling allows the cluster to expand or shrink depending on resource needs. From a technical perspective, this largely frees us from having to estimate the demand of a service while keeping the profitability and cost structure the same. Furthermore, the services can also be hosted and operated in different (geographical) zones to guarantee fastest reachability and redundancy.

6. Node-Variety

The cloud operators offer a large number of different node types to satisfy all resource requirements for all use cases from the simpler web service to high performance computing. Especially in the application field of Deep Learning, the ever growing models can thus always be trained and served on the required latest hardware.

For example, while we use nodes with an average CPU and low memory for smaller NLP purposes, large Transformer models can be deployed on GPU nodes in the same cluster, which effectively enables their use in the first place and at the same time can speed up inference (application of the model) by a factor of over 20. As of late, the importance of dedicated hardware for neural networks has been steadily increasing, Google also provides access to the custom TPUs optimized for Tensorflow.

The organization and grouping of all these different nodes is done in Kubernetes in so-called node pools. These can be selected or specified in the deployment so that the right resources are allocated to the pods of the models.

7. Cluster Autoscaling

The extent to which models or services are used, internally or by customers, is often unpredictable or fluctuates greatly over time. With a cluster autoscaler, new nodes can be created automatically, or unneeded “empty” nodes can be removed. Here, too, a minimum number of nodes can be specified, which should always be available, as well as, if desired, a maximum number, which cannot be exceeded, to cap the costs, if necessary.

8. Interfacing with Other Services

In principle, cloud services from different providers can be combined, but it is more convenient and easier to use one provider (e.g. Google GCP). This means that services such as data buckets, container registry, Lambda functions can be integrated and used internally in the cloud without major authentication processes. Furthermore, especially in a microservice architecture, network communication among the individual hosts (applications, models) is important and facilitated within a provider. Access control/RBAC can also be implemented here, and several clusters or projects can be bridged with a virtual network to better separate the areas of responsibility and competence.

Environment and Future Developments

The growing use and spread of Kubernetes has brought with it a whole environment of useful tools, as well as advancements and further abstractions that further facilitate its use.

Tools and Pipelines based on Kubernetes

For example, Kubeflow can be used to trigger the training of machine learning models as a TensorFlow training job and deploy completed models with TensorFlow Serving.

The whole process can also be packaged into a pipeline that then performs training of different models with reference to training, validation and test data in memory buckets, and also monitors or logs their metrics and compares model performance. The workflow also includes the preparation of input data, so that after the initial pipeline setup, experiments can be easily performed to explore model architectures and hyperparameter tuning.

Serverless

Serverless deployment methods such as Cloud Run or Amazon Fargate take another abstraction step away from the technical requirements. With this, containers can be deployed within seconds and scale like pods on a Kubernetes cluster without even having to create or maintain it. So the same infrastructure has once again been simplified in its use. According to the pay-per-use principle, only the time in which the code is actually called and executed is charged.

Conclusion

Kubernetes has become a central pillar in machine learning deployment today. The path from data and model exploration to the prototype and finally to production has been enormously streamlined and simplified by libraries such as PyTorch, TensorFlow and Keras. At the same time, these frameworks can also be applied in enormous detail, if required, to develop customized components or to integrate and adapt existing models using transfer learning. Container technologies such as Docker subsequently allow the result to be bundled with all its requirements and dependencies and executed almost anywhere without drawbacks in speed. In the final step, their deployment, maintenance, and scaling has also become immensely simplified and powerful with Kubernetes.

All of this allows us to develop our own products as well as solutions for customers in a structured way:

The components and the framework infrastructure have a high degree of reusability
A first milestone or proof-of-concept can be achieved in relatively little time and cost expenditure
Further development work expands on this process in a natural way by increasing complexity
Ready deployments scale without additional effort, with costs proportional to demand
This results in a reliable platform with a predictable cost structure

If you would like to read further about some key components following this article, we have some more interesting articles about: