
My colleague An previously published a chart documenting his journey from Data Science to Data Engineering at statworx. His post showed which skills data engineers require for their daily work. If you are unfamiliar with data engineering, it is a field that revolves around storing, processing, and transferring data in a secure and efficient way.

In this post, I will discuss these skill requirements in more depth. Since there are quite a few topics to learn about, I propose the following order:

  1. A programming language
  2. The basics of git & version control
  3. The UNIX command line
  4. REST APIs & networking basics
  5. Database systems
  6. Containerization
  7. The cloud

While this may differ from your personal learning experience, I have found this order to be more manageable for beginners. If you’re interested in a brief rundown of key data engineering technologies, you may also appreciate this post by my colleague Andre on this topic.

Learning how to program – which languages do I need?

As in other data-related roles, coding is a mandatory skill for data engineers. Besides SQL, data engineers use other programming languages to solve their problems. There are many programming languages that can be used in data engineering, but Python is certainly one of the best options. It has become the lingua franca of data-driven jobs, and it is perfect for executing ETL jobs and writing data pipelines. Not only is the language relatively easy to learn and syntactically elegant, but it also integrates with tools and frameworks that are critical in data engineering, such as Apache Airflow, Apache Spark, REST APIs, and relational database systems like PostgreSQL.
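
To make this more concrete, here is a minimal sketch of what a small ETL job in Python could look like; the file name, table name, and column names are hypothetical and only serve to illustrate the extract-transform-load pattern.

import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Read raw order data from a CSV export (hypothetical file).
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop incomplete rows and derive a revenue column.
    df = df.dropna(subset=["order_id", "price", "quantity"])
    df["revenue"] = df["price"] * df["quantity"]
    return df


def load(df: pd.DataFrame, db_path: str) -> None:
    # Write the cleaned data into a relational table.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders_clean", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("orders.csv")), "warehouse.db")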

Alongside the programming language, you will probably end up choosing an IDE (Integrated Development Environment). Popular choices for Python are PyCharm and VSCode. Regardless of your choice, your IDE will probably introduce you to the basics of version control, as most IDEs provide a graphical interface for git. Once you are comfortable with the basics, you can learn more about git and version control.

git & version control tools – tracking source code

In an agile team several data engineers typically work on a project. It is therefore important to ensure that all changes to data pipelines and other parts of the code base can be tracked, reviewed, and integrated. This usually means versioning source code in a remote source code management system such as GitHub and ensuring all changes are fully tested prior to production deployments.

I strongly recommend that you learn git on the command line to utilize its full power. Although most IDEs provide interfaces to git, certain features may not be fully available. Furthermore, learning git on the command line provides a good entry point to learn more about shell commands.

The UNIX command line – a fundamental skill

Many jobs that run in the cloud, on on-premises servers, or in other frameworks are executed using shell commands and scripts. In these situations, there is no graphical user interface, which is why data engineers must be familiar with the command line to edit files, run commands, and navigate the system. Whether it's bash, zsh, or another shell, being able to write scripts to automate tasks without reaching for a programming language such as Python is often indispensable, especially on deployment servers. And since command line programs show up in so many different scenarios, they are also relevant when working with REST APIs and database systems.

REST APIs & Networks – how services talk to each other

Modern applications are usually not designed as monoliths. Instead, functionalities are often contained in separate modules that run as microservices. This makes the overall architecture more flexible, and the design may evolve more easily, without requiring developers to pull out code from a large application.

How, then, are such modules able to talk to each other? The answer lies in Representational State Transfer (REST) over a network. The most common protocol, HTTP, is used by services to send and receive data. It is crucial to learn how HTTP requests are structured, which HTTP verbs are typically used to accomplish which tasks, and how to implement such functionality in the programming language of your choice. Python offers frameworks such as FastAPI and Flask. Check out this article for a concrete example of how to build a REST API with Flask.
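
As an illustration, a minimal Flask sketch could look like the following; the endpoints, resource names, and in-memory "database" are made up for the example.

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory "database" for the sake of the example.
ITEMS = {1: {"name": "sensor-a", "status": "active"}}


@app.route("/items/<int:item_id>", methods=["GET"])
def get_item(item_id):
    # HTTP GET: read a resource.
    item = ITEMS.get(item_id)
    return (jsonify(item), 200) if item else (jsonify({"error": "not found"}), 404)


@app.route("/items", methods=["POST"])
def create_item():
    # HTTP POST: create a resource from the JSON request body.
    new_id = max(ITEMS) + 1
    ITEMS[new_id] = request.get_json()
    return jsonify({"id": new_id}), 201


if __name__ == "__main__":
    app.run(port=5000)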

Networks also play a key role here since they enable isolation of important systems like databases and REST APIs. Configuring networks can sometimes be necessary, which is why you should know the basics. Once you are familiar with REST APIs, it makes sense to move on to database systems because REST APIs often don’t store data themselves, but instead function as standardized interfaces to access data from a database.

Database systems – organizing data

As a data engineer, you will spend a considerable amount of time operating databases, either to collect, store, transfer, clean, or just consult data. Hence, data engineers must have a good knowledge of database management. This entails being fluent in SQL (Structured Query Language), the basic language for interacting with databases, and having experience with some of the most popular SQL dialects, including MySQL, SQL Server, and PostgreSQL. In addition to relational databases, data engineers need to be familiar with NoSQL (“Not only SQL”) databases, which are rapidly becoming the go-to systems for Big Data and real-time applications. Although the number of NoSQL engines is on the rise, data engineers should at least understand the differences between NoSQL database types and the use cases for each of them. With databases and REST APIs under your belt, you will need to deploy them somehow. Enter containers.
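
To get a first feel for SQL from Python, the standard library's sqlite3 module is enough; the table and columns below are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create a table and insert a few rows.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
cur.executemany(
    "INSERT INTO users (country) VALUES (?)",
    [("DE",), ("US",), ("DE",)],
)

# A typical aggregation query: users per country.
cur.execute("SELECT country, COUNT(*) FROM users GROUP BY country")
print(cur.fetchall())  # e.g. [('DE', 2), ('US', 1)]

conn.close()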

Containerization – packaging your software

Containerization is the packaging of software code with the operating system (OS) libraries and dependencies required to run that code, creating a single lightweight executable, called a container, that runs consistently on any infrastructure. More portable and resource-efficient than virtual machines (VMs), containers have become the de facto compute units of modern cloud-native applications. To get a better understanding of how containers make AI solutions scalable, read our whitepaper on containers.

To containerize applications, most engineers use Docker, an open-source tool for building images and running containers. Packaging code almost always involves command line tools, such as the Docker command line interface. But it is not only applications and REST APIs that get containerized: data engineers frequently run data processing tasks in containers to stabilize the runtime environment. Such tasks must be ordered and scheduled, which is where orchestration tools come in.

Orchestration – automating data processing

One of the main roles of data engineers is to create data pipelines with ETL technologies and orchestration frameworks. In this section, we could list many technologies, since the number of frameworks is ever increasing.

Data engineers should know, or at least be comfortable with, some of the best-known ones, such as Apache Airflow, a popular orchestration framework for authoring, scheduling, and monitoring data pipelines.
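
For orientation, here is a minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.x; the DAG name and task functions are placeholders, not a real pipeline.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling raw data ...")  # placeholder for a real extract step


def transform():
    print("cleaning and aggregating ...")  # placeholder


with DAG(
    dag_id="daily_sales_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # run transform after extract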

Maintaining an instance of such an orchestration framework yourself can be quite cumbersome. As the technology stack grows, maintenance often becomes a full-time job. To alleviate this burden, cloud providers offer ready-made solutions.

 

The cloud – into production without too much maintenance

Among the many cloud providers, it makes sense to pick one of the big three: Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. All of them offer different services to solve standard data engineering tasks such as ingesting data, scheduling and orchestrating data processing jobs, securely storing data and making it available to business users and data scientists. Due to the plethora of offerings, it makes sense for data engineers to familiarize themselves with the pricing when choosing a solution.

If you have a good understanding of, say, database systems, then grasping database systems in the cloud should not be too difficult. However, certain technologies such as Apache Spark on Databricks may be difficult to explore without access to the cloud. In this case, I would recommend setting up an account on the cloud platform of your choice and starting to experiment.

High effort, high reward

Let us recap: to become a data engineer, you will need to learn:

  1. A programming language
  2. The basics of git & version control
  3. The UNIX command line
  4. REST APIs & networking basics
  5. Database systems
  6. Containerization
  7. The cloud

While this seems like a lot to learn, I would urge you not to be discouraged. Virtually all of the skills listed above are transferable to other roles, so learning them will help you almost regardless of your exact career trajectory. If you have a data science background like me, some of these topics will already be familiar to you. Personally, I find networking the most challenging topic to grasp, as it is often handled by IT professionals on the client side.

You are probably wondering how to get started in practice. Working on individual projects will help you learn the basics of most of these steps. Common data engineering projects include setting up database systems and orchestrating jobs to regularly update the database. There are many publicly available datasets on Kaggle and APIs, such as the Coinbase API, from which to pull data for your personal project. You may complete the first steps locally and eventually migrate your project to the cloud.

Thomas Alcock

Monitoring Data Science workflows

Long gone are the days when a data science project consisted only of loosely coupled notebooks for data processing and model training that data scientists ran occasionally on their laptops. With maturity, such projects have grown into big software projects with multiple contributors, dependencies, and modules containing numerous classes and functions. The data science workflow, usually starting with data pre-processing, feature engineering, model training, tuning, and evaluation, and ending with inference, referred to as an ML pipeline, is being modularized. This modularization makes the process more scalable and automatable, and therefore suitable to run in container orchestration systems or on cloud infrastructure. Extracting valuable model- or data-related KPIs by hand can be a labor-intensive task, all the more so with frequent or automated re-runs. This information is important for comparing different models and observing trends like distribution shifts in the training data. It can also be used for detecting unusual values, from imbalanced classes to inflated outlier deletions, whatever might be deemed necessary to ensure a model's robustness. Libraries like MLflow can be used to store all these sorts of metrics. Besides, operationalizing ML pipelines heavily relies on tracking run-related information for efficient troubleshooting, as well as for maintaining or improving the pipeline's resource consumption.
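
As a brief illustration, logging such KPIs with MLflow can be as small as the following sketch; the run name, parameters, and metric values are illustrative only.

import mlflow

# Track one (hypothetical) training run; parameter and metric names are examples.
with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)

    mlflow.log_metric("train_rmse", 0.42)
    mlflow.log_metric("test_rmse", 0.57)

    # Data-related KPIs can be logged the same way, e.g. class balance.
    mlflow.log_metric("positive_class_share", 0.13)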

This does not only hold in the world of data science. Today's microservice-based architectures also mean that maintaining and managing deployed code requires unified supervision. It can drastically cut the operation and maintenance hours a DevOps team needs, thanks to a more holistic understanding of the processes involved, while simultaneously reducing error-related downtimes.

It is important to understand how different forms of monitoring aim to tackle the implications stated above and how model- and data-related metrics can fit this objective too. In fact, while MLflow has established itself as the industry standard for supervising ML-related metrics, tracking these alongside all the operational information can be appealing as well.

Logs vs Metrics

Logs provide an event-based snapshot – Metrics give a bird’s eye overview

A log is a point-in-time, written-out record (e.g., to stdout/stderr) of an event that occurs discontinuously, at no pre-defined intervals. Depending on the application, logs carry information such as the timestamp, trigger, name, description, and/or result of the event. Events can be anything from simple requests to user logins that the developer of the underlying code deemed important. Following best practices here can save a lot of hassle and time when setting up downstream monitoring tools: using dedicated log libraries and writing meaningful log messages fits the bill.

INFO[2021-01-06T17:44:13.368024402-08:00] starting *secrets.YamlSecrets               
INFO[2021-01-06T17:44:13.368679356-08:00] starting *config.YamlConfig                 
INFO[2021-01-06T17:44:13.369046236-08:00] starting *s3.DefaultService                 
INFO[2021-01-06T17:44:13.369518352-08:00] starting *lambda.DefaultService             
ERROR[2021-01-06T17:44:13.369694698-08:00] http server error   error="listen tcp 127.0.0.1:6060: bind: address already in use"

Fig. 1: textual event logs

Although the data footprint of a single log is negligible, log streams can grow exponentially. As a result, storing every single log does not scale well, especially in the shape of semi-structured text data. For debugging or auditing, however, storing logs as-is might be unavoidable. Archive storage solutions or retention periods can help.

In other cases, parsing and extracting logs on the fly into other formats, like key-value pairs, further addresses these limitations. It preserves much of the event's information while having a much lower footprint.

{
  time: "2021-01-06T17:44:13-08:00",
  mode: "reader",
  debug_http_error: "listen tcp 127.0.0.1:6061: bind: address already in use",
  servicePort: 8089,
  duration_ms: 180262
}

Fig. 2: structured event logs
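
For context, a minimal sketch of how an application can emit such key-value logs from Python, using only the standard library; the logger name and field values mirror the figure but are otherwise made up.

import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (key-value pairs)."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via `extra=...`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start = time.time()
# ... handle a request ...
logger.info(
    "request served",
    extra={"fields": {"servicePort": 8089,
                      "duration_ms": int((time.time() - start) * 1000)}},
)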

Another way to reduce this footprint is sampling, with metrics being the most prominent representative.

A metric is a numeric measure of a particular target (a specific event) sampled at even intervals of time. Mathematical aggregations like sums or averages are common transformations that keep such metrics relatively small data-wise.

{
   time: "2022-01-06T17:44:13-08:00",
   Duration_ms: 60,
   sum_requests: 500,
   sum_hits_endpoint_1: 250,
   sum_hits_endpoint_2: 117,
   avg_duration: 113,
 }

Fig. 3: metrics

Thus, metrics are well suited for gradually reducing the data resolution to wider frequencies like daily, weekly, or even longer periods of analysis. Additionally, metrics tend to be easier to unify across multiple applications, as they carry highly structured data compared to raw log messages. While this reduces the issues mentioned before, it does come at the cost of granularity. This makes metrics perfect for high-frequency events where a single event's information is less important, monitoring compute resources being a prime example. Both approaches have their place in any monitoring setup, as different use cases fit different objectives. Consider the following, more tangible example to showcase their main differences:

The total balance of a bank account may fluctuate over time due to withdrawals and deposits (which can occur at any point in time). If one is only concerned that there is money in the account, periodically tracking an aggregated metric should be sufficient. If one is interested in the total inflow linked to a specific client, though, logging every transaction is inevitable.
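
To make the difference concrete, here is a small sketch in Python; the transaction records and client names are invented, and the point is only to contrast the full event log with a periodic aggregate.

from collections import defaultdict
from datetime import datetime

# Hypothetical event log: every single transaction is recorded (log-style).
transactions = [
    {"time": "2022-01-06T09:12:00", "client": "acme", "amount": -120.0},
    {"time": "2022-01-06T11:40:00", "client": "acme", "amount": 300.0},
    {"time": "2022-01-07T08:05:00", "client": "beta", "amount": 50.0},
]

# Metric-style view: aggregate per day; the individual events are lost.
daily_balance_change = defaultdict(float)
for tx in transactions:
    day = datetime.fromisoformat(tx["time"]).date()
    daily_balance_change[day] += tx["amount"]

print(dict(daily_balance_change))
# {datetime.date(2022, 1, 6): 180.0, datetime.date(2022, 1, 7): 50.0}

# Client-specific inflow still requires the full event log:
acme_inflow = sum(tx["amount"] for tx in transactions
                  if tx["client"] == "acme" and tx["amount"] > 0)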

Architecture and tool stack

In most modern cloud stacks, such as Azure App Service, most logging on the infrastructure and request side ships with the service itself. This can become costly with increasing volume, however. Defining the use cases, understanding the deployment environment, and matching it with the logging architecture is part of the job description of DevOps teams.

From a developer's perspective, there are plenty of open-source tools that can deliver valuable monitoring solutions and only need some orchestration effort. Leaner setups can consist of just a backend server, such as a time series database, and a tool for visualization. More complex systems can incorporate multiple logging systems with multiple dedicated log shippers, alert managers, and other intermediate components (see Fig. 4). Some of these tools might be necessary for making logs accessible in the first place or for unifying different log streams. Understanding the workings and service area of each component is therefore pivotal.

Fig. 4: Monitoring flow of applications deployed in a Kubernetes Cluster (altered, from https://logz.io/blog/fluentd-vs-fluent-bit/)

Database & Design

Logs, at least when they follow the best practice of including a timestamp, and metrics are usually time series data that can be stored in a time series database. In cases where textual logs are stored as-is, other architectures use document-oriented storage with a powerful query engine on top (like Elasticsearch). Besides storage-related differences, the backend infrastructure is split into two different paradigms: push and pull. These paradigms address the question of who is responsible (client or backend) for ingesting the data in the first place.

Choosing one over the other depends on the use case or the type of information that should be persisted. For instance, push services are well suited for event logging, where the information of a single event is important. However, this also makes them more prone to being overwhelmed by too many requests, which lowers robustness. Pull systems, on the other hand, are a perfect fit for scraping periodic information, which is in line with how metrics are composed.

Dashboard & Alerting

To better comprehend the data and spot any irregularities, dashboards come in quite handy. Monitoring systems are largely geared toward simple, less complex querying, as performance matters. These tools are specialized for the problems being tackled and offer a more limited inventory than prominent software like Power BI. This does not make them less powerful in their area of use, however. Tools like Grafana, which excels at handling log-based metrics data, can connect to various database backends and build customized solutions drawing from multiple sources. Tools like Kibana, which have their edge in text-based log analysis, provide users with a large querying toolkit for root cause analysis and diagnostics. It is worth mentioning that both tools are expanding their scope to support both worlds.

Fig. 5 Grafana example dashboard (https://grafana.com/grafana/)

While monitoring is great at spotting irregularities (proactive) and at targeted analysis of faulty systems (reactive), being informed about application failures right when they occur allows DevOps teams to take instant action. Alert managers provide the capability of watching for events and triggering alerts on all sorts of communication channels, such as messaging, incident management programs, or plain email.

Scrapers, Aggregators, and Shippers

Given that not every microservice exposes an endpoint where logs and log-based metrics can be accessed or extracted – remember the differences between push and pull – intermediaries must chip in. Services like scrapers extract and format logs from different sources, aggregators perform some sort of combining action (generating metrics), and shippers can act as a push service for push-based backends. Fluentd is a perfect candidate that incorporates all the mentioned capabilities while still maintaining a small footprint.

End-to-end monitoring

There are paid-tier services that attempt to provide a holistic one-size-fits-all system for any sort of application and architecture, independent of the cloud vendor, which can be a game changer for DevOps teams. However, leaner setups can also do a cost-effective and reliable job.

When the need to collect full-text logs is ruled out, many standard use cases can be realized with a time series database as the backend. InfluxDB is well suited and easy to spin up, with mature integration into Grafana. Grafana, as a dashboard tool, pairs well with Prometheus' Alertmanager service. As an intermediary, Fluentd is a good fit for extracting the textual logs and performing the necessary transformations. Since InfluxDB is push-based, Fluentd also takes care of getting the data into InfluxDB.

Building on these tools, the example infrastructure covers everything from the data science pipeline to the later deployed model APIs, with dashboards dedicated to each use case. Before a new training run is approved for production, the ML metrics mentioned at the beginning provide a good entry point for assessing the model's legitimacy. Simple user statistics, like total and unique requests, give a fair overview of its usage once the model is deployed. By tracking response times, e.g. of an API call, bottlenecks can be revealed easily.

At the resource level, the APIs along with each pipeline step are monitored to observe any irregularities, like sudden spikes in memory consumption. Tracking the resources over time can also determine whether the VM types in use are over- or underutilized. Optimizing these metrics can cut unnecessary costs. Lastly, pre-defined failure events, such as an unreachable API or failed training runs, should trigger an alert with an email being sent out.

 

Fig. 6: Deployed infrastructure with logging streams and monitoring stack.

The entire architecture, consisting of the monitoring infrastructure, the data science pipeline, and the deployed APIs, can run in a (managed) Kubernetes cluster. From a DevOps perspective, knowing Kubernetes is already half the battle. This open-source stack can be scaled up and down and is not bound to any paid-tier subscription model, which provides great flexibility and cost-efficiency. Plus, onboarding new log streams, deployed apps, or additional pipelines can be done painlessly. Even single frameworks can be swapped out: if Grafana is no longer suitable, for instance, just use another visualization tool that can integrate with the backend and matches the use case requirements.

Conclusion

Logging and monitoring have been pivotal parts of modern infrastructures since long before applications were modularized and shipped into the cloud, yet setting them up poorly surely exacerbates existing struggles. With the increasing operationalization of the ML workflow, the need for organizations to establish well-thought-out monitoring solutions to keep track of models, data, and everything around them is also growing steadily.

While there are dedicated platforms designed to address these challenges, the charming idea behind the presented infrastructure is that it offers a single entry point for data science, MLOps, and DevOps teams and is highly extensible.

Benedikt Müller

Create more value from your projects with the cloud.

Data science and data-driven decision-making have become a crucial part of many companies’ daily business and will become even more important in the upcoming years. Many organizations will have a cloud strategy in place by the end of 2022:

“70% of organizations will have a formal cloud strategy by 2022, and the ones that fail to adopt it will struggle”
– Gartner Research

By becoming a standard building block in all kinds of organizations, cloud technologies are getting more easily available, lowering the entry barrier for developing cloud-native applications.

In this blog entry, we will have a look at why the cloud is a good idea for data science projects. I will provide a high-level overview of the steps needed to onboard a data science project to the cloud and share some best practices from my experience to avoid common pitfalls.

I will not discuss solution patterns specific to a single cloud provider, compare them or go into detail about machine learning operations and DevOps best practices.

Data Science projects benefit from using public cloud services

One common approach to data science projects is to start by coding on local machines: crunching data, training models, and evaluating them on snapshots. This helps keep up the pace at an early stage, when it is not yet clear whether machine learning can solve the problem identified by the business. After having created a first version that satisfies the business needs, the question arises of how to deploy that model to generate value.

Running a machine learning model in production can usually be achieved in one of two ways: 1) run the model on on-premises infrastructure, or 2) run the model in a cloud environment with a cloud provider of your choice. Deploying the model on-premises might sound appealing at first, and there are cases where it is a viable option. However, the cost of building and maintaining data-science-specific infrastructure can be quite high. This results from diverse requirements, ranging from specific hardware to managing peak loads during training phases to additional interdependent software components.

Different cloud set-ups offer varying degrees of freedom

When using the cloud, you can choose the most suitable service level between Infrastructure as a Service (IaaS), Container as a Service (CaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), usually trading off flexibility for ease of maintenance. The following picture visualizes the responsibilities in each of the service levels.

  • «On-Premises» you must take care of everything yourself: ordering and setting up the necessary hardware, setting up your data pipeline and developing, running, and monitoring your applications.
  • In «IaaS» the provider takes care of the hardware components and delivers a virtual machine with a fixed version of an operating system (OS).
  • With «CaaS» the provider offers a container platform and orchestration solution. You can use container images from a public registry, customize them or build your own container.
  • With «PaaS» services what is usually left to do is bring your data and start developing your application. Depending on whether the solution is serverless you might not even have to provide information on the sizing.
  • «SaaS» solutions, as the highest service level, are tailored to a specific purpose and require very little effort for setup and maintenance, but offer quite limited flexibility, as new features usually have to be requested from the provider.

Public cloud services are already tailored to the needs of data science projects

The benefits of public cloud services include scalability, decoupling of resources, and pay-as-you-go models. These benefits are already a plus for data science applications, e.g., for scaling resources for a training run. On top of that, all three major cloud providers have a part of their service catalog designed specifically for data science applications, each with its own strengths and weaknesses.

Not only does this include special hardware like GPUs, but also integrated solutions for ML operations such as automated deployments, model registries, and monitoring of model performance and data drift. Many new features are constantly being developed and made available. To keep up with those innovations and functionalities on-premises, you would have to spend a substantial amount of resources without generating direct business impact.
If you are interested in an in-depth discussion of the importance of the cloud for the success of AI projects, be sure to take a look at the white paper published on our content hub.

Onboarding your project to the cloud takes only 5 simple steps

If you are looking to get started with the cloud for data science projects, there are a few key decisions to make and steps to take in advance. We will take a closer look at each of them.

1. Choosing the cloud service level

When choosing the service level, the most common patterns for data science applications are CaaS or PaaS. The reason is that Infrastructure as a Service can create high costs resulting from maintaining virtual machines or building up scalability across VMs. SaaS services, on the other hand, are already tailored to a specific business problem and are used instead of building your own model and application.

CaaS comes with the main advantage of containers, namely that containers can be deployed to any container platform of any provider. Also, when the application does not only consist of the machine learning model but needs additional microservices or front-end components, they can all be hosted with CaaS. The downside is that, similar to an on-premises roll-out, container images for MLOps tools like a model registry, pipelines, and model performance monitoring are not available out of the box and need to be built and integrated with the application. The larger the number of tools and libraries used, the higher the likelihood that at some point future versions will be incompatible or not match at all.

PaaS services like Azure Machine Learning, Google Vertex AI or Amazon SageMaker on the other hand have all those functionalities built in. The downside of these services is that they all come with complex cost structures and are specific to the respective cloud provider. Depending on the project requirements the PaaS services may in some special cases feel too restrictive.

When comparing CaaS and PaaS, it mostly comes down to the tradeoff between flexibility and vendor lock-in. Higher vendor lock-in means paying an extra premium for the sake of included features, increased compatibility, and faster development. Higher flexibility, on the other hand, comes at the cost of increased integration and maintenance effort.

2.    Making your data available in the cloud

Usually, the first step to making your data available is to upload a snapshot of the data to cloud object storage. Object stores are well integrated with other services and can later be replaced by a more suitable data storage solution with little effort. Once the results from the machine learning model are suitable from a business perspective, data engineers should set up a process to automatically keep your data up to date.
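
As an illustration with AWS (the other providers offer analogous SDKs), such a snapshot upload can be as small as the following sketch; the local path, bucket, and key names are hypothetical.

import boto3

# Upload a local snapshot to an S3 bucket (bucket and key names are made up).
s3 = boto3.client("s3")
s3.upload_file(
    Filename="data/customers_snapshot.csv",
    Bucket="my-project-raw-data",
    Key="snapshots/2024-01-01/customers.csv",
)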

3.    Building a pipeline for preprocessing

In any data science project, one crucial step is building a robust pipeline for data preprocessing. This ensures your data is clean and ready for modeling, which will save you time and effort in the long run. A best practice is to set up a continuous integration and continuous delivery (CI/CD) pipeline to automate the deployment and testing of your preprocessing code and make it part of your DevOps cycle. The cloud helps you automatically scale your pipelines to deal with any amount of data needed for the training of your model.
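
For a sense of what such a preprocessing pipeline can look like in code, here is a minimal scikit-learn sketch; the column names are hypothetical, and the same pipeline object can later be extended with a model step and deployed as one unit.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the same pipeline runs locally and in the cloud.
numeric_features = ["age", "income"]
categorical_features = ["country"]

preprocessing = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# `preprocessing.fit_transform(df)` can later be chained with a model step.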

4.    Training and evaluating trained models

In this stage, the preprocessing pipeline is extended with modeling components. This includes hyperparameter tuning, which cloud services once again support by scaling resources and storing the results of each training experiment for easier comparison. All cloud providers offer an automated machine learning service. This can be used to quickly generate a first version of a model and compare its performance across multiple model types. That way you can quickly assess whether the data and preprocessing suffice to tackle the business problem. Besides that, the result can serve as a benchmark for the data scientists. The best model should be stored in a model registry for deployment and transparency.

In case a model has already been trained locally or on-premises, it is possible to skip the training and just load the model into the model registry.

5.   Serving models to business users

The final and likely most important step is serving the model to your business unit so it can create value. All cloud providers offer solutions to deploy the model in a scalable manner with little effort. This is where all the pieces created in the earlier steps come together: the most recent data is provisioned automatically, the preprocessing is applied, and the data is fed into the deployed model.

Now we have gone through the steps of onboarding your data science project. With these 5 steps you are well on your way to moving your data science workflow to the cloud. To help you avoid some of the common pitfalls, here are some learnings from my personal experience that can positively impact your project's success.

Make your move to the cloud even easier with these useful tips

Start using the cloud early in the process.

By starting early, the team can familiarize themselves with the platform's features. This will help you make the most of its capabilities and avoid potential problems and heavy refactoring down the road.

Make sure your data is accessible.

This may seem like a no-brainer, but it is important to make sure your data is easily accessible when you move to the cloud. This is especially true in a setup where your data is generated on-premises and needs to be transferred to the cloud.

Consider using serverless computing.

Serverless computing is a great option for data science projects because it allows you to scale your resources up or down as needed without having to provision or manage any servers.

Don’t forget about security.

While all cloud providers offer some of the most up-to-date IT-security setups, some of them are easy to miss during configuration and can expose your project to needless risk.

Monitor your cloud expenses.

Coming from on-premises, optimization is often about peak resource usage, because hardware or licenses are limited. With scalability and pay-as-you-go, this paradigm shifts more strongly towards optimizing costs. Optimizing costs is usually not the first thing to do when starting a project, but keeping an eye on them can prevent unpleasant surprises and can be used at a later stage to make a cloud application even more cost-effective.

Take your data science projects to new heights with the cloud

If you’re starting your next data science project, doing so in the cloud is a great option. The cloud is scalable, flexible, and offers a variety of services that can help you get the most out of your project. Cloud-based architectures are a modern way of developing applications and are expected to become even more important in the future.

Following the steps presented will help you on that journey and support you in keeping up with the newest trends and developments. Plus, with the tips provided, you can avoid many of the common pitfalls that may occur along the way. So, if you’re looking for a way to get the most out of your data science project, the cloud is definitely worth considering.

Alexander Broska, Alexander Veicht

Be Safe!

In the age of open-source software projects, attacks on vulnerable software are ever present. Python is the most popular language for Data Science and Engineering and is thus increasingly becoming a target for attacks through malicious libraries. Additionally, public facing applications can be exploited by attacking vulnerabilities in the source code.

For this reason, it’s crucial that your code does not contain any CVEs (common vulnerabilities and exposures) or use libraries that might be malicious. This is especially true for public-facing software, e.g. a web application. At statworx we look for ways to increase the quality of our code by using automated scanning tools. Hence, we’ll discuss the value of two code and package scanners for Python.

Automatic screening

There are numerous tools for scanning code and its dependencies; here I will provide an overview of the most popular tools designed with Python in mind. Such tools fall into one of two categories:

  • Static Application Security Testing (SAST): look for weaknesses in code and vulnerable packages
  • Dynamic Application Security Testing (DAST): look for vulnerabilities that occur at runtime

In what follows I will compare bandit and safety using a small streamlit application I’ve developed. Both tools fall into the category of SAST, since they don’t need the application to run in order to perform their checks. Dynamic application testing is more involved and may be the subject of a future post.

The application

For the sake of context, here’s a brief description of the application: it was designed to visualize the convergence (or lack thereof) in the sampling distributions of random variables drawn from different theoretical probability distributions. Users can choose the distribution (e.g. Log-Normal), set the maximum number of samples and pick different sampling statistics (e.g. mean, standard deviation, etc.).

Bandit

Bandit is an open-source Python code scanner that checks for vulnerabilities in code, and only in your code. It decomposes the code into its abstract syntax tree and runs plugins against it to check for known weaknesses. Among other tests, it checks for plain SQL code that could provide an opening for SQL injection, for passwords stored in code, and for common attack vectors such as use of the pickle library. Bandit is designed for use with CI/CD and throws an exit status of 1 whenever it encounters any issues, thus terminating the pipeline. A report is generated, which includes information about the number of issues separated by confidence and severity according to three levels: low, medium, and high. In this case, bandit finds no obvious security flaws in our code.

Run started:2022-06-10 07:07:25.344619

Test results:
        No issues identified.

Code scanned:
        Total lines of code: 0
        Total lines skipped (#nosec): 0

Run metrics:
        Total issues (by severity):
                Undefined: 0
                Low: 0
                Medium: 0
                High: 0
        Total issues (by confidence):
                Undefined: 0
                Low: 0
                Medium: 0
                High: 0
Files skipped (0):

This is all the more reason to configure Bandit carefully for your project. Sometimes it may raise a flag even though you already know that the finding would not be a problem at runtime. If, for example, you have a series of unit tests that use pytest and run as part of your CI/CD pipeline, Bandit will normally throw an error, since this code uses the assert statement, which is not recommended for production code because assertions are removed when Python runs with the -O flag.

To avoid this behaviour you could:

  1. run scans against all files but exclude the test directory via the command line interface
  2. create a yaml configuration file that skips the assert check

Here’s an example:

# bandit_cfg.yml
skips: ["B101"] # skips the assert check

Then we can run bandit as follows: bandit -c bandit_cfg.yml /path/to/python/files and the unnecessary warnings will not crop up.

Safety

Developed by the team at pyup.io, this package scanner runs against a curated database which consists of manually reviewed records based on publicly available CVEs and changelogs. The package is available for Python >= 3.5 and can be installed for free. By default it uses Safety DB which is freely accessible. Pyup.io also offers paid access to a more frequently updated database.

Running safety check --full-report -r requirements.txt in the package root directory gives us the following output (truncated for the sake of readability):

+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 110 packages, using free DB (updated once a month)                   |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| urllib3                    | 1.26.4    | <1.26.5                  | 43975    |
+==============================================================================+
| Urllib3 1.26.5 includes a fix for CVE-2021-33503: An issue was discovered in |
| urllib3 before 1.26.5. When provided with a URL containing many @ characters |
| in the authority component, the authority regular expression exhibits        |
| catastrophic backtracking, causing a denial of service if a URL were passed  |
| as a parameter or redirected to via an HTTP redirect.                        |
| https://github.com/advisories/GHSA-q2q7-5pp4-w6pg                            |
+==============================================================================+

The report includes the number of packages that were checked, the type of database used for reference, and information on each vulnerability that was found. In this example, an older version of the package urllib3 is affected by a vulnerability which technically could be used by an attacker to perform a denial-of-service attack.

Integration into your workflow

Both bandit and safety are available as GitHub Actions. The stable release of safety also provides integrations for TravisCI and GitLab CI/CD.

Of course, you can always manually install both packages from PyPI on your runner if no ready-made integration like a GitHub action is available. Since both programs can be used from the command line, you could also integrate them into a pre-commit hook locally if using them on your CI/CD platform is not an option.

The CI/CD pipeline for the application above was built with GitHub Actions. After installing the application’s required packages, it runs bandit first and then safety to scan all packages. With all the packages updated, the vulnerability scans pass and the docker image is built.


Conclusion

I would strongly recommend using both bandit and safety in your CI/CD pipeline, as they provide security checks for your code and your dependencies. For modern applications, manually reviewing every single package your application depends on is simply not feasible, not to mention all of the dependencies these packages have! Thus, automated scanning is indispensable if you want to have some level of awareness about how unsafe your code is.

While bandit scans your code for known exploits, it does not check any of the libraries used in your project. For this, you need safety, as it informs you about known security flaws in the libraries your application depends on. While neither tool is completely foolproof, it’s still better to be notified about some CVEs than none at all. This way, you’ll be able to either fix your vulnerable code or upgrade a vulnerable package dependency to a more secure version.

Keeping your code safe and your dependencies trustworthy can ward off potentially devastating attacks on your application.

Thomas Alcock

Given the hype around and the recent successes of AI, it is surprising that most companies still lack successful AI integration. This is quite evident in many industries, especially in manufacturing (McKinsey).

In a study about the implementation of AI in companies published by Accenture in 2019, the authors came to the conclusion that around 80% of all proofs of concept (PoCs) do not make it into production. Furthermore, only 5% of all interviewed companies stated that they currently have a company-wide AI strategy in place.

These findings are thought-provoking: What exactly is going wrong and why does artificial intelligence apparently not yet make the holistic transition from successful academic studies to the real world?

1. What Is Data-Centric AI?

„Data-centric AI is the discipline of systematically engineering the data used to build an AI system.“
Andrew Ng, data-centric AI pioneer

The data-centric approach focuses more on the data going into the AI (data-first) and less on the models (model-first) in order to overcome the difficulties AI has with “reality”. Usually, the training data that serves as the starting point of an AI project at a company has relatively little in common with the meticulously curated and widely used benchmark datasets such as MNIST or ImageNet.

In this article, we want to consolidate different data-centric theories and frameworks in the context of an AI (project) workflow. In addition, we want to show how we approach a data-first AI implementation at statworx.

2. What´s Behind a Data-Centric Way of Thinking?

In the simplest terms, AI systems consist of two critical components: data and model (code). Data-centric thereby leans its focus more towards data, model-centric on model infrastructure – duh!

An AI with a strong model-centric leaning regards the data only as an extrinsic, static parameter. The iterative process of a data science project only starts after the dataset has been “delivered”, with model-related tasks such as training and fine-tuning various model architectures. These tasks occupy the vast portion of time in a data science project, and data pre-processing steps are only an ad-hoc duty at the beginning of each project.

model centric graphic

In contrast, data-centric understands (automated) data processes as an integral part of any ML project. This incorporates any steps necessary to get from raw data to a final dataset. Internalizing these processes aims to enhance the quality and the methodical observability of the data.

data centric graphic

Data-centric approaches can be consolidated into three broader categories that loosely outline the scope of the data-centric concept. In the following, we will assign buzzwords (frameworks) that are often used in the data-centric context to a specific category.

2.1. Integration of SMEs Into the Development Process as a Major Link Between Data and Model Knowledge

The integration of domain knowledge is an integral part of data-centric. It should help project teams to grow together and thus integrate the knowledge of Subject Matter Experts (SMEs) in the best way possible.

  • Data Profiling:
    Data scientists should not act as a one-man show that only shares its findings with the SMEs. With their statistical and programming abilities, they should rather act as mediators and empower SMEs to dig through the data on their own.
  • Human-in-the-loop Data & Model Monitoring:
    Similar to profiling, this should be a central starting point to ensure that SMEs have access to the relevant components of the AI system. From this central checkpoint, not only data but also model-relevant metrics can be monitored, and examples can be visualized and checked. Sophisticated monitoring decreases the response time drastically, since errors can be investigated (and mitigated) directly – not only by data scientists.

2.2. Data Quality Management as an Agile, Automated and Iterative Process Over (Training) Data

Continuously improving the data preparation process is key to every data science project. The model itself should be an extrinsic part at first.

  • Data Catalogue, Lineage & Validation:
    The documentation of data should not be an extrinsic task either, one that often only arises ad-hoc towards the end of a project and becomes obsolete again with every change, e.g., of a model feature. Changes should be reflected dynamically, thus automating the documentation. Data catalogue frameworks provide the capabilities to store data with associated metadata (and other necessary information).
    Data lineage, as a subsequent step in the data process, then keeps track of all the data wrangling steps that occur during the transition from raw data to the final dataset. The more complex a data model, the more a lineage graph helps track how a final column was created (see graph below), for example through joins, filters, or other processing steps. Finally, validating the data during the input and transformation process ensures a consistent data foundation. The knowledge gained from data profiling helps here to develop validation rules and integrate them into the process.

lineage graph

  • Data & Label Cleaning:
    The necessity of data processing is ubiquitous and well-established among AI practitioners. However, label cleaning is a rather disregarded step (that, of course, only applies to classification problems). Wrongly classified datapoints can make it hard for some algorithms to reveal the right patterns in the data.
  • Data Drifts in Production:
    A well-known weak spot of (all) AI systems is drift in the data. This happens when the training data and the actual live inference data do not have the same distribution. To preserve the model's prediction validity in the long run, identifying these irregularities and retraining the model accordingly is a crucial part of any ML pipeline.
  • Data Versioning:
    Git and platforms like GitHub have long been the go-to standard for versioning a project's codebase. However, for AI projects it is important to track not only code changes but also data changes, and to tie both together. This can produce a more holistic depiction of an ML workflow with increased visibility and observability.

data/code versioning graphic

2.3. Generating Training Data as a Programmatic Task

Producing new (labeled) training examples is often a huge roadblock for AI projects, particularly if the underlying problem is complex and thus requires vast datasets. This can lead to an unbearable overhead of manual labor.

  • Data Augmentation:
    In many data-intensive deep learning models, this technique has long been used to create artificial data from existing data. It works perfectly on image data, for instance, with easy operations such as rotating, tilting, or altering the color filters. Even in NLP and tabular (Excel and co.) use cases, there are ways to enlarge the dataset.

augmentation graphic

  • Automated Data Labeling:
    As already stated, labeling data is a rather labor-intensive task in which people assign data points to predefined categories. On the one hand, this makes the initial effort (cost) very high; on the other hand, it is error-prone and difficult to monitor. That's where ML can chip in: concepts like semi- and weak supervision can automate the manual task almost entirely.
  • Data Selection:
    Working with large chunks of data in a local setup is often not possible, especially once the dataset no longer fits into memory. And even if it fits, training runs can take forever. Data selection tries to reduce the size by active subsampling (whether labeled or unlabeled). The “best” examples with the highest diversity and representativeness are actively selected to ensure the best possible characterization of the input – and this is done automatically.

Needless to say, not every presented method is necessarily a good fit for every project. It is part of the work of the development team to analyze how a certain framework could benefit the final product, which also should take business considerations (e.g. cost versus benefit) into account.

3. Integration of Data-Centric at statworx

Data-centric is a crucial part of our projects, especially during the transition from PoC to production-ready model. We have also had cases in which, during this transition, we faced mainly data-related issues due to inadequate documentation, validation, or poor integration of SMEs into the data process.

We therefore generally try to show our customers the importance of data management for the longevity and robustness of AI products in production and how helpful components are linked within an AI pipeline.

As part of these learnings, our data onboarding framework, a mix of profiling, catalogue, and validation, aims to mitigate the aforementioned issues. Additionally, this framework helps the entire company make previously unused, undocumented data sources available for various use cases (not just AI).

A strong interaction with the SMEs on the client’s side is integral to establish trustworthy, robust and well-understood quality checks. This also helps to empower our clients to debug errors and do the first-level support themselves, which also helps with the service’s longevity.

data & ai pipeline by statworx graphic

In a stripped-down, custom data onboarding integration, we used a variety of open- and closed-source tools to create a platform that is easily scalable and understandable for the customer. We installed validation checks with Great Expectations (GE), a Python-based tool with reporting capabilities for creating a shareable status report of the data.

data validation graphic
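
To give an impression of what such checks look like, here is a minimal sketch using the classic pandas-based Great Expectations API (newer releases have moved to a different interface, so details may vary by version); the file path and column names are invented for the example.

import great_expectations as ge

# Load a (hypothetical) data delivery as a Great Expectations dataset.
df = ge.read_csv("deliveries/customers_2024-01.csv")

# A few illustrative expectations; each call returns a result with a success flag.
checks = [
    df.expect_column_values_to_not_be_null("customer_id"),
    df.expect_column_values_to_be_unique("customer_id"),
    df.expect_column_values_to_be_between("age", min_value=0, max_value=120),
]

if not all(check.success for check in checks):
    raise ValueError("Data onboarding failed validation, see expectation results.")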

This architecture can then run in different environments, whether cloud-native (Azure Data Factory) or orchestrated with Airflow (open source), and can easily be complemented.

4. Data-Centric in Relation to AI’s Status Quo

Both data- and model-centric describe attempts on how to approach an AI project.

On the one hand, there are already well-established best practices around model-centric AI, with various production-proven frameworks.

One reason for this maturity is certainly the strong focus on model architectures and their advancement among academic researchers and leading AI companies. With computer vision and NLP leading the way, commercialized meta models, trained on enormous datasets, have opened the door for successful AI use cases. With relatively limited data, those models can be fine-tuned for downstream end-use applications – known as transfer learning.

However, this trend helps only some of the failed projects, because especially in industrial projects, a lack of compatibility or the non-rigidity of use cases makes applying meta models difficult. Non-rigidity is often found in machine-heavy manufacturing industries, where the environment in which data is produced is constantly changing and even the replacement of a single machine can have a large impact on a productive AI model. If this has not been properly considered in the AI process, it creates a difficult-to-calculate risk, also known as technical debt [source: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf].

Lastly, edge cases – rare and unusual data points – are generally a burden for any AI. An application that monitors anomalies of a machine component will almost certainly see only a fraction of the faulty units.

5. Conclusion – Paradigm Shift in Sight?

Overcoming these problems is part of the promise of data-centric AI, but the approach is still rather immature at the moment.

The limited availability and immaturity of open-source frameworks reflect this, especially since there is no unified tool stack end users can choose from. This inevitably leads to longer, more involved, and more complex AI projects, which is a significant hurdle for many companies. In addition, there are few data metrics available to give companies feedback on what exactly they are “improving”. Moreover, many of the tools (e.g., data catalogues) have more indirect, distributed benefits.

Some startups that aim to address these issues have emerged in recent years. However, because they market their software (exclusively) in paid tiers, it is rather unclear to what extent these products can really cover the broad mass of problems arising from different use cases.

Although the above shows that companies in general are still far away from a holistic integration of data-centric AI, robust data strategies have become more and more important lately (as we at statworx have seen in our projects).

With increased academic research into data products, this trend will certainly intensify. Not only because new, more robust frameworks will emerge, but also because university graduates will bring more knowledge in this area to companies.

Image Sources

Model-centric arch: own
Data-centric arch: own
Data lineage: https://www.researchgate.net/figure/Data-lineage-visualization-example-in-DW-environment-using-Sankey-diagram_fig7_329364764
Versioning Code/Data: https://ardigen.com/7155/
Data Augmentation: https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722
Data & AI pipeline: own
Validation with GE: https://greatexpectations.io/blog/ge-data-warehouse/

Benedikt Müller

What you can expect:

The aim of the fair is to connect students of business informatics, business mathematics and data science with companies.

We will be there with our own booth and will additionally introduce statworx in a short presentation.

Given good weather, the fair will take place outside this year. Participation is free of charge, and pre-registration is not necessary. Just come by and talk to us.

What you can expect:

The konaktiva fair at Darmstadt University of Technology is one of the oldest and largest student-organized company career fairs in Germany. In line with its motto “Students meet companies”, it brings together prospective graduates and companies every year.

This year, we are again taking part with our own booth and several colleagues, and we are looking forward to exchanging ideas with interested students. We will be happy to present the various entry-level opportunities at statworx – from internships to permanent positions – and share insights into our day-to-day work.

In addition, there will be the opportunity to get to know us better and to discuss individual questions and cases during pre-scheduled one-on-one meetings away from the hustle and bustle of the trade fair.

Participation in the fair is free of charge for visitors.

In the field of Data Science – as the name suggests – the topic of data, from data cleaning to feature engineering, is one of the cornerstones. Having and evaluating data is one thing, but how do you actually get data for new problems?

If you are lucky, the data you need is already available. Either by downloading a whole dataset or by using an API. Often, however, you have to gather information from websites yourself – this is called web scraping. Depending on how often you want to scrape data, it is advantageous to automate this step.

This post is about exactly this automation. Using web scraping and GitHub Actions as an example, I will show how you can build your own datasets over an extended period. The focus will be on the experience I have gathered over the last few months.

The code I used and the data I collected can be found in this GitHub repository.

Search for data – the initial situation

During my research for the blog post about gasoline prices, I also came across data on the utilization of parking garages in Frankfurt am Main. Obtaining this data laid the foundation for this post. After some thought and additional research, other thematically appropriate data sources came to mind:

  • Road utilization
  • S-Bahn and subway delays
  • Events nearby
  • Weather data

However, it quickly became apparent that I could not get all of this data, as it is either not freely available or not allowed to be stored. Since I planned to store the collected data on GitHub and make it available, this was a crucial factor in deciding which data could be used. For these reasons, the railway data was ruled out completely. For road usage, I only found data for Cologne, and I wanted to avoid using the Google API, as that definitely brings its own challenges. So, I was left with event and weather data.

For the weather data of the German Weather Service, the rdwd package can be used. Since this data is already historized, it is not relevant for this blog post. To get the remaining event and parking data, GitHub Actions have proven to be very useful, even if they are not entirely trivial to use. Especially the fact that they can be used free of charge makes them a recommendable tool for such projects.

Scraping the data

Since this post will not deal with the details of web scraping, I refer you here to the post by my colleague David.

The parking data is available here in XML format and is updated every 5 minutes. Once you understand the structure of the XML, it’s a simple matter of accessing the right index to get the data you want. In the function get_parking_data(), I have summarized everything I need. It creates one dataset for the areas and one for the individual parking garages.
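
The original functions are written in R; purely to illustrate the approach, here is a hedged Python sketch of what such a scraper could look like (the feed URL, XML element names, and fields are placeholders and would have to be matched to the real XML structure):

```python
# Hedged sketch of the parking scraper idea (the original is an R function).
# URL, element names, and fields below are placeholders.
import csv
import datetime as dt
import requests
import xml.etree.ElementTree as ET

FEED_URL = "https://example.org/frankfurt-parking.xml"  # placeholder

def get_parking_data(path: str = "parking_facilities.csv") -> None:
    root = ET.fromstring(requests.get(FEED_URL, timeout=30).content)
    scraped_at = dt.datetime.utcnow().isoformat() + "Z"
    rows = [
        {
            "id": fac.findtext("id"),
            "occupancy": fac.findtext("parkingFacilityOccupancy"),
            "vacant": fac.findtext("totalNumberOfVacantParkingSpaces"),
            "TIME": scraped_at,
        }
        for fac in root.iter("parkingFacilityStatus")  # assumed element name
    ]
    if not rows:
        return
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys(), delimiter=";")
        if f.tell() == 0:           # new file: write header first
            writer.writeheader()
        writer.writerows(rows)
```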

Example data extract area

parkingAreaOccupancy;parkingAreaStatusTime;parkingAreaTotalNumberOfVacantParkingSpaces;
totalParkingCapacityLongTermOverride;totalParkingCapacityShortTermOverride;id;TIME
0.08401977;2021-12-01T01:07:00Z;556;150;607;1[Anlagenring];2021-12-01T01:07:02.720Z
0.31417114;2021-12-01T01:07:00Z;513;0;748;4[Bahnhofsviertel];2021-12-01T01:07:02.720Z
0.351417;2021-12-01T01:07:00Z;801;0;1235;5[Dom / Römer];2021-12-01T01:07:02.720Z
0.21266666;2021-12-01T01:07:00Z;1181;70;1500;2[Zeil];2021-12-01T01:07:02.720Z

Example data extract facility

parkingFacilityOccupancy;parkingFacilityStatus;parkingFacilityStatusTime;
totalNumberOfOccupiedParkingSpaces;totalNumberOfVacantParkingSpaces;
totalParkingCapacityLongTermOverride;totalParkingCapacityOverride;
totalParkingCapacityShortTermOverride;id;TIME
0.02;open;2021-12-01T01:02:00Z;4;196;150;350;200;24276[Turmcenter];2021-12-01T01:07:02.720Z
0.11547912;open;2021-12-01T01:02:00Z;47;360;0;407;407;18944[Alte Oper];2021-12-01T01:07:02.720Z
0.0027472528;open;2021-12-01T01:02:00Z;1;363;0;364;364;24281[Hauptbahnhof Süd];2021-12-01T01:07:02.720Z
0.609375;open;2021-12-01T01:02:00Z;234;150;0;384;384;105479[Baseler Platz];2021-12-01T01:07:02.720Z

For the event data, I scrape the page stadtleben.de. Since the HTML is quite well structured, I can access the tabular event overview via the tag “kalenderListe”. The result is created by the function get_event_data().
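
Again only as a hedged Python illustration (get_event_data() itself is an R function): the selector “kalenderListe” is taken from the text, while the per-event selectors are assumptions about the page structure.

```python
# Hedged sketch of the event scraper idea; selectors other than
# "kalenderListe" are assumptions about the page structure.
import requests
from bs4 import BeautifulSoup

def get_event_data(url: str) -> list[dict]:
    """url: the event overview page on stadtleben.de."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    calendar = soup.find(class_="kalenderListe")   # assumed to be a CSS class
    if calendar is None:
        return []
    return [
        {"title": link.get_text(strip=True), "link": link.get("href")}
        for link in calendar.find_all("a")          # assumed: one link per event
    ]
```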

Example data extract event

eventtitle;views;place;address;eventday;eventdate;request
Magical Sing Along - Das lustigste Mitsing-Event;12576;Bürgerhaus;64546 Mörfelden-Walldorf, Westendstraße 60;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Velvet-Bar-Night;1460;Velvet Club;60311 Frankfurt, Weißfrauenstraße 12-16;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Basta A-cappella-Band;465;Zeltpalast am Deutsche Bank Park;60528 Frankfurt am Main, Mörfelder Landstraße 362;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
BeThrifty Vintage Kilo Sale | Frankfurt | 04. & 05. …;1302;Batschkapp;60388 Frankfurt am Main, Gwinnerstraße 5;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z

Automation of workflows – GitHub Actions

The basic framework is in place: I have a function that writes the parking and event data to a .csv file when executed. Since I want to query the parking data every 5 minutes and the event data three times a day (to be on the safe side), GitHub Actions come into play.

With this GitHub feature, workflows can be scheduled and executed, in addition to actions triggered on merges or commits. For this purpose, a .yml file is created in the folder /.github/workflows.

The main components of my workflow are:

  • The schedule – Every ten minutes, the functions should be executed
  • The OS – Since I develop locally on a Mac, I use the macOS-latest here.
  • Environment variables – This contains my GitHub token and the path for the package management renv.
  • The individual steps in the workflow itself.

The workflow goes through the following steps:

  • Setup R
  • Load packages with renv
  • Run script to scrape data
  • Run script to update the README
  • Pushing the new data back into git

Each of these steps is very small and clear in itself; however, as is often the case, the devil is in the details.

Limitation and challenges

Over the last few months, I’ve been tweaking and optimizing my workflow to deal with the bugs and issues. In the following, you will find an overview of my condensed experiences with GitHub Actions from the last months.

Schedule problems

If you want to perform time-critical actions, you should use other services. GitHub Actions does not guarantee that the jobs will be timed exactly (or, in some cases, that they will be executed at all).

Time span in minutes    <= 5    <= 10    <= 20    <= 60    > 60
Number of queries       1720    2049     5509     3023     194

You can see that the planned five-minute intervals were not always adhered to. I should plan a larger margin here in the future.

Merge conflicts

In the beginning, I had two workflows, one for the parking data and one for the events. If they overlapped in time, there were merge conflicts because both processes updated the README with a timestamp. Over time, I switched to a single workflow with error handling.
Even if one run took longer and the next one had already started, there were merge conflicts in the .csv data when pushing. Long runs were often caused by the R setup and the loading of the packages. Consequently, I extended the schedule interval from five to ten minutes.

Format adjustments

There were a few situations where the paths or structure of the scraped data changed, so I had to adjust my functions. Here the setting to get an email if a process failed was very helpful.

Lack of testing capabilities

There is no way to test a workflow script other than to run it. So, after a typo in the evening, you can wake up to a flood of emails from the runs spawned overnight. That, however, is no reason to skip a local test run of the scripts beforehand.

No data update

Since the end of December, the parking data has no longer been updated or made available. This shows that even with an automated process, you should still continue to monitor it. I only noticed this later, which meant that my queries at the end of December always came up empty.

Conclusion

Despite all these complications, I still consider the whole thing a great success. Over the last few months, I have engaged with the topic repeatedly, and the tricks described above will also help me solve other problems in the future. I hope that all readers of this blog post can take away some valuable tips and learn from my mistakes.

Since I have now collected a good half-year of data, I can start with the evaluation. But this will be the subject of another blog post.

Jakob Gepp

Management Summary

Kubernetes is a technology that in many ways greatly simplifies the deployment and maintenance of applications and compute loads, especially the training and hosting of machine learning models. At the same time, it allows us to adapt the required hardware resources, providing a scalable and cost-transparent solution.

This article first discusses the transition from a server to management and orchestration of containers: isolated applications or models that are packaged once with all their requirements and can subsequently be run almost anywhere. Regardless of the server, these can be replicated at will with Kubernetes, allowing effortless and almost seamless continuous accessibility of their services even under intense demand. Likewise, their number can be reduced to a minimum level when the demand temporarily or periodically dwindles in order to use computing resources elsewhere or avoid unnecessary costs.

From the capabilities of this infrastructure emerges a useful architectural paradigm called microservices. Formerly centralized applications are thus broken down into their functionalities, which provide a high degree of reusability. These can be accessed and used by different services and scale individually according to internal needs. An example of this is large and complex language models in Natural Language Processing, which can capture the context of a text regardless of its further use and thus underlie many downstream purposes. Other microservices (models), such as for text classification or summarization, can invoke them and further process the partial results.

After a brief introduction of the general terminology and functionality of Kubernetes, as well as possible use cases, the focus turns to the most common way to use Kubernetes: with cloud providers such as Google GCP, Amazon AWS, or Microsoft Azure. These allow so-called Kubernetes clusters to dynamically consume more or fewer resources, though the costs incurred remain foreseeable on a pay-per-use basis. Other common services such as data storage, versioning, and networking can also be easily integrated by the providers. Finally, the article gives an outlook on tools and further developments, which either make using Kubernetes even more efficient or further abstract and simplify the process towards serverless architectures.

Introduction

Over the last 20 years, vast amounts of new technologies have surfaced in software development and deployment, which have not only multiplied and diversified the choice of services, programming languages, and libraries but have even led to a paradigm shift in many use cases or domains.

Fig. 1: Google Trends chart showing the above-mentioned paradigm shift

If we also look at the way software solutions, models, or work and computing loads have been deployed over the years, we can see how innovations in this area have also led to greater flexibility, scalability, and resource efficiency, among other things.

In the beginning, these were run as local processes directly on a server (shared by several applications), which posed some limitations and problems: on the one hand, one is bound to the configuration of the server and its operating system when selecting the technical tools, and on the other hand, all applications hosted on the server are limited by its memory and processor capacities. Thus, they not only share the server’s total resources but also a potential susceptibility to cross-process errors.

As a first further development, virtual machines offer an additional level of abstraction: by emulating (“virtualizing”) an independent machine on the server, modularity and thus greater freedom is created for development and deployment – for example, in the choice of operating system or the programming languages and libraries used. From the point of view of the “real” server, the resources to which the application is entitled can be better limited or guaranteed. However, the resource requirements are also significantly higher, since the virtual machine must also run the virtualized operating system.

Ultimately, this principle has been significantly streamlined and simplified by the proliferation of containers, especially Docker. Put simply, one builds/configures a separate virtual, isolated server for an application or machine learning model. Thus, each container has its own file system and certain system libraries, but no operating system of its own. This technically turns it into a sandbox whose configuration, code dependencies, or errors do not affect the host server, while at the same time it can run as a relatively “lightweight” process directly on that server.

Fig. 2: Comparison between virtual machine and Docker container system architecture; Source: https://i1.wp.com/www.docker.com/blog/wp-content/uploads/Blog.-Are-containers-..VM-Image-1-1024×435.png?ssl=1

This makes it possible to copy and install everything needed for the desired application and to provide it anywhere in a consistently packaged container format. This is not only extremely useful for the production environment, but we at STATWORX also like to use it in the development of more complicated projects or in the proof-of-concept phase. Intermediate steps or results, such as extracting text from images, can be exposed as a container – like a small web server – to anyone interested in further processing of that text, for example to extract certain key information or to determine its mood or intent.

This subdivision into so-called “microservices“ with the help of containers helps immensely with the reusability of individual modules and with planning and developing the architecture of complex systems; at the same time, it frees the individual work steps from technical dependencies on each other and facilitates maintenance and update procedures.

After this brief overview of the powerful and versatile possibilities of deploying software, the following text will deal with how to reliably and scalably deploy these containers (i.e., applications or models) for customers, other applications, internal services or computations with Kubernetes.

Kubernetes – 8 Essential Components

Kubernetes was introduced by Google in 2014 as open-source container management software (also called container orchestration). Internally, the company had already been using tools developed in-house for years to manage workloads and applications, and regarded the development of Kubernetes not only as a convergence of best practices and lessons learned, but also as an opportunity to open up a new business segment in cloud computing.

The name Kubernetes (Greek for helmsman) was supposedly chosen in reference to a symbolic container ship, for whose optimal operation the helmsman is responsible.

1.    Nodes

A Kubernetes instance is referred to as a (Kubernetes) cluster: it consists of several servers, called nodes. One of them, the master node, is solely responsible for administrative operations and is the interface addressed by the developer. All other nodes, called worker nodes, are initially unoccupied and thus flexible. While nodes are actual physical instances, mostly in data centers, the following terms are logical concepts of Kubernetes.

2.    Pods

If an application is to be deployed on the cluster, in the simplest case the desired container is specified and then (automatically) a so-called pod is created and assigned to a node. The pod simply resembles a running container. If several instances of the same application are to run in parallel, for example to provide better availability, the number of replicas can be specified. In this case, the specified number of pods, each with the same application, is distributed to the nodes. If the demand for the application exceeds the capacities despite replicas, even more pods can be created automatically with the Horizontal Autoscaler. Especially for Deep Learning models with relatively long inference times, metrics such as CPU or GPU utilization can be monitored here, and the number of pods can be increased or decreased automatically to optimize both capacity and cost.
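
Replica counts can also be read and changed programmatically. The following is only a hedged sketch with the official Kubernetes Python client (deployment and namespace names are placeholders; in practice, the autoscaler described above usually handles this automatically):

```python
# Hedged sketch: inspect and scale a deployment with the official
# Kubernetes Python client. "my-model" / "default" are placeholders.
from kubernetes import client, config

config.load_kube_config()            # uses the local kubeconfig / cluster access
apps = client.AppsV1Api()

scale = apps.read_namespaced_deployment_scale("my-model", "default")
print("current replicas:", scale.spec.replicas)

# Request three replicas of the model container
apps.patch_namespaced_deployment_scale(
    name="my-model",
    namespace="default",
    body={"spec": {"replicas": 3}},
)
```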

Fig. 3: Illustration of autoscaling and the occupancy of the nodes. The width of the bars corresponds to the resource requirements of the pods or the capacity of the nodes.

To avoid confusion: ultimately, every running container, i.e., every workload, is a pod. Deploying an application is technically done via a Deployment, whereas temporary compute loads are Jobs. Persistent stores such as databases are managed with StatefulSets. The following figure provides an overview of the terms:

Fig. 4: What is what in Kubernetes? A Deployment specifies what is desired; the deployment controller takes care of creating, maintaining, and scaling the model containers, which then run as individual pods on the nodes. Jobs and StatefulSets work analogously with their own controllers.

3.    Jobs

Kubernetes jobs can be used to execute both one-time and recurring jobs (so-called CronJobs) in the form of a container deployment on the cluster.

In the simplest case, these can be seen as a script, for example for maintenance or data preparation work on databases. Furthermore, they are also used for batch processing, for example when deep learning models are to be applied to larger data volumes and it is not worthwhile to keep the model continuously on the cluster. In this case, the model container is started up, gets access to the desired dataset, performs its inference on it, saves the results, and shuts down. There is also flexibility for where the data comes from and where results are stored: custom or cloud databases, bucket/object storage, or even local data and logging frameworks can be connected.

For recurring CronJobs, a simple time scheme can be specified so that, for example, certain customer data, transactions or the like are processed at night. Natural Language Processing can be used to automatically create press reviews at night, for example, which can then be evaluated the following morning: News about a company, its industry, business locations, customers, etc. can be aggregated or sourced, evaluated with NLP, summarized, and presented with sentiment or sorted by topic/content.

Even labor-intensive ETL (Extract Transform Load) processes can be performed or prepared outside business hours.

4.    Rolling Updates

If a deployment needs to be brought up to the latest version or a rollback to an older version needs to be completed, rolling updates can be triggered in Kubernetes. These guarantee continuous accessibility of the applications and models within a Continuous Integration/Continuous Deployment pipeline.

Such a rollout can be initiated and monitored smoothly in one or a few steps. By means of a rollout history, it is not only possible to jump back to a previous container version but also to restore the previous deployment parameters, i.e., the minimum and maximum number of nodes, the resource group (GPU nodes, CPU nodes with little/much RAM, …), health checks, etc.

If a rolling update is triggered, the respective existing pods are kept running and accessible until the same number of new pods are up and accessible. Here there are methods to guarantee that no requests are lost, as well as parameters that regulate a minimum accessibility or a maximum surplus of pods for the change.

Fig. 5: Illustration of a rolling update.

Figure 5 illustrates the rolling update.

1) The current version of an application is located on the Kubernetes cluster with 2 replicas and can be accessed as usual.

2) A rolling update to version V2 is started, the same number of pods as for V1 are created.

3) As soon as the new pods have the state “Running” and, if applicable, health checks have been completed, thus being functional, the containers of the older version are shut down.

4) The older pods are removed and the resources are released again.

The DevOps effort and time involved here are marginal; internally, no hostnames or the like change, while from the consumer’s point of view the service can be accessed as before in the usual way (same IP, URL, …) and has merely been updated to the latest version.

5.    Platform/Infrastructure as a Service

Of course, a Kubernetes cluster can also be deployed locally on your on-premises hardware as well as on partially pre-built solutions, such as DGX Workbenches.

Some of our customers have strict policies or requirements regarding (data) compliance or information security, and do not want potentially sensitive data to leave the company. Furthermore, it can be avoided that data traffic flows through non-European nodes or generally ends up in foreign data centers.

Experience shows, however, that this is only the case in a very small proportion of projects. Through encryption, rights management, and the SLAs of the operators, we consider the use of cloud services and data centers to be generally secure and also use them for larger projects. In this regard, deployment, maintenance, and CI/CD pipelines are also largely identical and easy to use thanks to containerization (Docker) and abstraction (Kubernetes).

All major cloud operators like Google (GCP), Amazon (AWS) and Microsoft (Azure), but also smaller providers and soon even exciting new German projects, offer services very similar to Kubernetes. This makes it even easier to deploy and, most importantly, scale a project or model, as auto-scaling allows the cluster to expand or shrink depending on resource needs. From a technical perspective, this largely frees us from having to estimate the demand of a service while keeping the profitability and cost structure the same. Furthermore, the services can also be hosted and operated in different (geographical) zones to guarantee fastest reachability and redundancy.

6.    Node-Variety

The cloud operators offer a large number of different node types to satisfy all resource requirements, from simple web services to high-performance computing. Especially in the field of deep learning, the ever-growing models can thus always be trained and served on the latest required hardware.

For example, while we use nodes with an average CPU and low memory for smaller NLP purposes, large Transformer models can be deployed on GPU nodes in the same cluster, which effectively enables their use in the first place and at the same time can speed up inference (application of the model) by a factor of over 20. As the importance of dedicated hardware for neural networks has been steadily increasing, Google also provides access to its custom TPUs optimized for TensorFlow.

The organization and grouping of all these different nodes is done in Kubernetes in so-called node pools. These can be selected or specified in the deployment so that the right resources are allocated to the pods of the models.

7.    Cluster Autoscaling

The extent to which models or services are used, internally or by customers, is often unpredictable or fluctuates greatly over time. With a cluster autoscaler, new nodes can be created automatically, or unneeded “empty” nodes can be removed. Here, too, a minimum number of nodes can be specified, which should always be available, as well as, if desired, a maximum number, which cannot be exceeded, to cap the costs, if necessary.

8.    Interfacing with Other Services

In principle, cloud services from different providers can be combined, but it is more convenient and easier to use one provider (e.g. Google GCP). This means that services such as data buckets, container registry, Lambda functions can be integrated and used internally in the cloud without major authentication processes. Furthermore, especially in a microservice architecture, network communication among the individual hosts (applications, models) is important and facilitated within a provider. Access control/RBAC can also be implemented here, and several clusters or projects can be bridged with a virtual network to better separate the areas of responsibility and competence.

Environment and Future Developments

The growing use and spread of Kubernetes has brought with it a whole environment of useful tools, as well as advancements and further abstractions that further facilitate its use.

Tools and Pipelines based on Kubernetes

For example, Kubeflow can be used to trigger the training of machine learning models as a TensorFlow training job and deploy completed models with TensorFlow Serving.

The whole process can also be packaged into a pipeline that then performs training of different models with reference to training, validation and test data in memory buckets, and also monitors or logs their metrics and compares model performance. The workflow also includes the preparation of input data, so that after the initial pipeline setup, experiments can be easily performed to explore model architectures and hyperparameter tuning.

Serverless

Serverless deployment methods such as Cloud Run or Amazon Fargate take another abstraction step away from the technical requirements. With this, containers can be deployed within seconds and scale like pods on a Kubernetes cluster without even having to create or maintain it. So the same infrastructure has once again been simplified in its use. According to the pay-per-use principle, only the time in which the code is actually called and executed is charged.

Conclusion

Kubernetes has become a central pillar in machine learning deployment today. The path from data and model exploration to the prototype and finally to production has been enormously streamlined and simplified by libraries such as PyTorch, TensorFlow and Keras. At the same time, these frameworks can also be applied in enormous detail, if required, to develop customized components or to integrate and adapt existing models using transfer learning. Container technologies such as Docker subsequently allow the result to be bundled with all its requirements and dependencies and executed almost anywhere without drawbacks in speed. In the final step, their deployment, maintenance, and scaling has also become immensely simplified and powerful with Kubernetes.

All of this allows us to develop our own products as well as solutions for customers in a structured way:

  • The components and the framework infrastructure have a high degree of reusability
  • A first milestone or proof-of-concept can be achieved in relatively little time and cost expenditure
  • Further development work expands on this process in a natural way by increasing complexity
  • Ready deployments scale without additional effort, with costs proportional to demand
  • This results in a reliable platform with a predictable cost structure

If you would like to read further about some key components following this article, we have some more interesting articles about:

Sources

Jonas Braun

Management Summary

Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:

  1. Entry-level Variant: Model parameters and performance metrics are logged via an R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
  2. Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
  3. Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.

Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.

Challenges Along the Machine Learning Workflow

Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.

The added value for companies comes primarily from the systematic interaction of model training – in the form of model identification, hyperparameter tuning, and fitting on the training data – and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of machine learning initiative development. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is not implemented correctly, complex dependencies and costly long-term maintenance work can arise during the operational start-up of the model [2]. The following risks are particularly noteworthy in this regard.

1. Ensuring Synchronicity

Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments, or even the use of different technologies, make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without an appropriate version control logic, synchronization afterward can only be achieved with great effort.

2. Documentation Effort

To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.

3. Heterogeneity of Model Formats

In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.

4. Recovery of Prior Results

In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.

Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLflow. The latter, in particular, is currently one of the most widely used solutions.

MLflow is an open-source project with the goal of combining the best of existing ML platforms to make the integration with existing ML libraries, algorithms, and deployment tools as straightforward as possible [3]. In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.

MLflow Services

MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be isolated.

With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.
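
As a minimal sketch of what such a run could look like with the Python API (experiment name, parameters, and values are placeholders):

```python
# Minimal sketch of MLflow Tracking; names and values are placeholders.
import mlflow

mlflow.set_experiment("fraud_classification")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
    # ... train the model here ...
    mlflow.log_metric("accuracy", 0.93)
    mlflow.log_artifact("confusion_matrix.png")  # e.g. a previously saved chart
    # mlflow.sklearn.log_model(clf, "model") would additionally store the model itself
```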

The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. One speaks here of the so-called model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon Sagemaker.

MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.

MLflow Registry provides centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the productive environment. The entire process can be controlled via a GUI or a REST API.

Examples of Machine Learning Pipelines Using MLflow

In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.

1. Scenario – Entry-Level Variant

Scenario 1 – Simple Metrics Tracking

Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.

Figure: MLflow Tool – Screenshot 01

In the screenshot above, the MLflow Tracking UI is shown as an example after executing a model training. The server is hosted locally in this example, but it is of course also possible to host it remotely – for example, in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, as well as the user and the name of the underlying script, are also logged. Clicking on a specific run also displays additional information, such as the RunID and the model training duration.

Figure: MLflow Tool – Screenshot 02

If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.

Figure: MLflow Tool – Screenshot 03

The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.
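
As a hedged sketch of this “best run” deployment step, the newer mlflow.search_runs API is used here instead of mlflow.list_run_infos (experiment ID, metric name, and artifact path are assumptions):

```python
# Hedged sketch: pick the run with the best accuracy and load its model.
import mlflow

runs = mlflow.search_runs(
    experiment_ids=["1"],                    # placeholder experiment ID
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)
best_run_id = runs.loc[0, "run_id"]

# Load the sklearn model logged under the artifact path "model"
model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")
```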

2. Scenario – Advanced Model Management

Scenario 2 – Advanced Model Management

Scenario 2 includes, in addition to the modules used in scenario 1, the MLflow Model Registry as a model management component. Here, models logged in specific runs can be registered and processed further. These steps can be controlled via the API or the GUI. A basic requirement for using the Model Registry is deploying the MLflow Tracking Server backend store as a database backend store. To register a model via the GUI, select a specific run and scroll to the artifact overview.

Figure: MLflow Tool – Screenshot 04

Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.

Figure: MLflow Tool – Screenshot 05

Each model includes an overview page that shows all past versions. This is useful, for example, to track which models were in production when.

Figure: MLflow Tool – Screenshot 06

If you now select a model version, you will get to an overview where, for example, a model description can be added. The Source Run link also takes you to the run from which the model was registered. Here you will also find the associated artifact, which can be used later for deployment.

Figure: MLflow Tool – Screenshot 07

In addition, individual model versions can be categorized into defined phases in the Stage area. This feature can be used, for example, to determine which model is currently being used in production or is to be transferred there. For deployment, in contrast to scenario 1, versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter the desired model and its associated RunID. Similar to scenario 1, deployment can then be completed to, for example, AWS Sagemaker or AzureML via the respective Python APIs.
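
A hedged sketch of this registry-based selection (the registered model name is a placeholder):

```python
# Hedged sketch: list registered versions and load the "Production" model.
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
for mv in client.search_model_versions("name='fraud_classifier'"):
    print(mv.version, mv.current_stage, mv.run_id)

# Load whichever version is currently staged as "Production"
model = mlflow.pyfunc.load_model("models:/fraud_classifier/Production")
```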

3. Scenario – Collaborative Workflow Management

Scenario 3 – Full Workflow Management

In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local environment can act as a project and be controlled by an MLproject file. Here, package dependencies can be recorded in a conda.yaml, and the MLproject file can be accessed when starting the project. Then the corresponding conda environment is created with all dependencies before training and logging the model. This avoids the need for manual alignment of the development environments of all developers involved and also guarantees standardized and comparable results of all runs. Especially the latter is necessary for the deployment context since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile. This offers the advantage that package dependencies independent of Python can also be defined. Likewise, MLflow Projects allow the use of different commit hashes or branch names to use other project states, provided a Git repository is used.

An interesting use case is the modularized development of machine learning training pipelines [4]. For example, data preparation can be decoupled from model training and developed in parallel, while another team uses a different branch name to train the model. In this case, only a different branch name must be used as a parameter when starting the project in the MLflow Projects file. The final data preparation can then be pushed to the same branch name used for model training and would thus already be fully implemented in the training pipeline. The deployment can also be controlled as a sub-module within the project pipeline through a Python script via the ML Project File and can be carried out analogous to scenario 1 or 2 on a platform of your choice.
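
A hedged sketch of starting such a project run on a specific branch via the Python API (repository URL, entry point, and parameters are placeholders):

```python
# Hedged sketch: run an MLflow Project from a Git branch.
import mlflow

mlflow.projects.run(
    uri="https://github.com/your-org/fraud-model",   # hypothetical repository
    entry_point="main",
    version="feature/data-prep",                     # branch name or commit hash
    parameters={"n_estimators": 200},                # must match the MLproject entry point
)
```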

Conclusion and Outlook

MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.

In the article, three machine learning workflows of increasing complexity were presented as examples. MLflow services can support everything from simple logging of results in an interactive UI to complex, modular modeling pipelines. Naturally, there are also synergies outside the MLflow ecosystem with other tools, such as Docker/Kubernetes for model scaling or Jenkins for CI/CD pipeline control. If there is further interest in MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.

Resources

John Vicente

Management Summary

Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:

  1. Entry-level Variant: Model parameters and performance metrics are logged via a R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
  2. Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
  3. Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.

Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.

Challenges Along the Machine Learning Workflow

Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.

The added value for companies comes primarily from the systematic interaction of model training, in the form of model identification, hyperparameter tuning and fitting on the training data, and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of machine learning initiative development. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is implemented correctly, complex dependencies and costly maintenance work in the long term can arise during the operational start-up of the model [2]. The following risks are particularly noteworthy in this regard.

1. Ensuring Synchronicity

Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments or even the use of different technologies make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without an appropriate version control logic, the synchronization effort afterward can only be guaranteed with great effort.

2. Documentation Effort

To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.

3. Heterogeneity of Model Formats

In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.

4. Recovery of Prior Results

In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.

Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLFlow. The latter, in particular, is currently one of the most widely used solutions.

MLflow is an open-source project with the goal to combine the best of existing ML platforms to make the integration to existing ML libraries, algorithms, and deployment tools as straightforward as possible [3]. In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.

MLflow Services

MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be isolated.

With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.

The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. One speaks here of the so-called model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon Sagemaker.

MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.

MLflow Registry provides a centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the productive environment. The entire process can be controlled via a GUI or a REST API.

Examples of Machine Learning Pipelines Using MLflow

In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.

1. Scenario – Entry-Level Variant

Szenario 1 – Simple Metrics TrackingScenario 1 – Simple Metrics Tracking

Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.

Artikel MLFlow Tool Bild 01

In the screenshot above, the MLflow Tracking UI is shown as an example after executing a model training. The server is hosted locally in this example. Of course, it is also possible to host the server remotely. For example in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, as well as the user and the name of the underlying script, are also logged. Clicking on a specific run also displays additional information, such as the RunID and the model training duration.

Artikel MLFlow Tool Bild 02

If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.

Artikel MLFlow Tool Bild 03

The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.

Scenario 2 – Advanced Model Management

Scenario 2 – Advanced Model Management

In addition to the modules used in scenario 1, scenario 2 includes the MLflow Model Registry as a model management component. It makes it possible to register and manage models logged in specific runs, and these steps can be controlled via the API or the GUI. A basic requirement for using the Model Registry is that the backend store of the MLflow Tracking Server is a database backend store. To register a model via the GUI, select a specific run and scroll to the artifact overview.

[Image 04: Registering a model from the artifact overview]

Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.

[Image 05: Models tab with the registered model and its versions]
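Instead of clicking through the GUI, the registration can also be scripted. A minimal sketch via the Python API, assuming the RunID of a previous training run (to be filled in) and the hypothetical model name ‘fraud_classifier’:

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

run_id = "..."  # RunID of the desired training run, e.g. taken from the Tracking UI

# Registers the model artifact of the run under a new or existing name;
# if the name already exists, a new model version is created
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="fraud_classifier",
)
print(model_version.name, model_version.version)
```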

Each registered model has an overview page that shows all past versions. This is useful, for example, to track which model versions were in production at what time.

[Image 06: Model overview page with version history]

Selecting a model version opens an overview where, for example, a model description can be added. The Source Run link takes you to the run from which the model was registered; there you will also find the associated artifact, which can later be used for deployment.

[Image 07: Model version view with description and Source Run link]

In addition, individual model versions can be assigned to defined stages in the Stage area. This feature can be used, for example, to mark which model is currently being used in production or is about to be transferred there. For deployment, in contrast to scenario 1, the versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter for the desired model and its associated RunID. Similar to scenario 1, deployment can then be carried out to, for example, AWS SageMaker or AzureML via the respective Python APIs.
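A hedged sketch of this staging-based selection is shown below, assuming the hypothetical registered model name ‘fraud_classifier’, whose version 1 is moved to the Production stage and then loaded via the model registry URI scheme:

```python
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")
client = MlflowClient()

# Move a specific version into the Production stage (can also be done in the GUI)
client.transition_model_version_stage(
    name="fraud_classifier", version="1", stage="Production"
)

# Identify the version(s) currently in Production and their associated RunIDs
for mv in client.search_model_versions("name='fraud_classifier'"):
    if mv.current_stage == "Production":
        print(mv.version, mv.run_id)

# The production model can also be loaded directly by registry name and stage
model = mlflow.pyfunc.load_model("models:/fraud_classifier/Production")
```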

Scenario 3 – Collaborative Workflow Management

Scenario 3 – Full Workflow Management

In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local directory can act as a project, controlled by an MLproject file. Package dependencies are recorded in a conda.yaml, and when the project is started, the MLproject file is read and the corresponding conda environment with all dependencies is created before the model is trained and logged. This avoids the manual alignment of the development environments of all developers involved and guarantees standardized, comparable results across all runs. The latter in particular is essential in the deployment context, since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile, which has the advantage that package dependencies outside of Python can be captured as well. Provided a Git repository is used, MLflow Projects also allow different commit hashes or branch names to be specified in order to run other project states.

An interesting use case is the modularized development of machine learning training pipelines [4]. For example, data preparation can be decoupled from model training and developed in parallel, while another team trains the model on a different branch. In this case, only a different branch name has to be passed as a parameter when starting the project. The finished data preparation can then be pushed to the branch used for model training and is thus already fully integrated into the training pipeline. Deployment can also be controlled as a sub-module within the project pipeline through a Python script referenced in the MLproject file and carried out analogously to scenario 1 or 2 on a platform of your choice.
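A minimal sketch of starting such a project run against a specific branch via the Python API; the repository URL, branch name, entry point, and parameters below are purely illustrative and depend on the MLproject file of the repository in question:

```python
import mlflow.projects

# Runs the project's entry point from a Git repository; 'version' selects a
# branch name or commit hash, 'parameters' are passed to the entry point
submitted_run = mlflow.projects.run(
    uri="https://github.com/your-org/fraud-classification",
    entry_point="main",
    version="feature/data-preparation",
    parameters={"max_iter": 100},
)
print(submitted_run.run_id)
```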

Conclusion and Outlook

MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.

In this article, three machine learning workflows of increasing complexity were presented as examples. MLflow services can support everything from simple logging of results in an interactive UI to complex, modular modeling pipelines. Naturally, there are also synergies with tools outside the MLflow ecosystem, such as Docker/Kubernetes for model scaling or Jenkins for CI/CD pipeline control. If you are interested in further MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.

Resources

John Vicente