Create more value from your projects with the cloud.
Data science and data-driven decision-making have become a crucial part of many companies’ daily business and will become even more important in the upcoming years. Many organizations will have a cloud strategy in place by the end of 2022:
“70% of organizations will have a formal cloud strategy by 2022, and the ones that fail to adopt it will struggle”
– Gartner Research
By becoming a standard building block in all kinds of organizations, cloud technologies are getting more easily available, lowering the entry barrier for developing cloud-native applications.
In this blog entry, we will look at why the cloud is a good idea for data science projects. I will provide a high-level overview of the steps needed to onboard a data science project to the cloud and share some best practices from my experience to avoid common pitfalls.
I will not discuss solution patterns specific to a single cloud provider, compare them or go into detail about machine learning operations and DevOps best practices.
Data Science projects benefit from using public cloud services
One common approach to data science projects is to start on local machines: coding, crunching data, training models, and evaluating them on snapshots of the data. This helps keep pace at an early stage, when it is not yet certain that machine learning can solve the problem identified by the business. Once a first version satisfies the business needs, the question arises of how to deploy that model to generate value.
Running a machine learning model in production can usually be achieved in one of two ways: 1) run the model on on-premises infrastructure, or 2) run the model in a cloud environment with a cloud provider of your choice. Deploying the model on-premises might sound appealing at first, and there are cases where it is a viable option. However, the cost of building and maintaining infrastructure specifically for data science can be quite high. This results from diverse requirements, ranging from specific hardware and managing peak loads in training phases to additional interdependent software components.
Different cloud set-ups offer varying degrees of freedom
When using the cloud, you can choose the most suitable service level between Infrastructure as a Service (IaaS), Container as a Service (CaaS), Platform as a Service (PaaS) and Software as a Service (SaaS), usually trading off flexibility for ease of maintenance. The following picture visualizes the responsibilities in each of the service levels.
- With «On-Premises» you must take care of everything yourself: ordering and setting up the necessary hardware, setting up your data pipeline and developing, running, and monitoring your applications.
- In «IaaS» the provider takes care of the hardware components and delivers a virtual machine with a fixed version of an operating system (OS).
- With «CaaS» the provider offers a container platform and orchestration solution. You can use container images from a public registry, customize them or build your own container.
- With «PaaS» services what is usually left to do is bring your data and start developing your application. Depending on whether the solution is serverless you might not even have to provide information on the sizing.
- «SaaS» solutions, as the highest service level, are tailored to a specific purpose and require very little effort for setup and maintenance. In return, they offer quite limited flexibility, because new features usually have to be requested from the provider.
Public cloud services are already tailored to the needs of data science projects
The benefits of public cloud services include scalability, decoupling of resources and pay-as-you-go models. Those benefits are already a plus for data science applications, e.g., for scaling resources for a training run. On top of that, all 3 major cloud providers have a part of their service catalog designed specifically for data science applications, each of them with its own strengths and weaknesses.
This includes not only special hardware like GPUs, but also integrated solutions for ML operations like automated deployments, model registries and monitoring of model performance and data drift. Many new features are constantly developed and made available. To keep up with those innovations and functionalities on-premises, you would have to spend a substantial amount of resources without generating direct business impact.
If you are interested in an in-depth discussion of the importance of the cloud for the success of AI projects, be sure to take a look at the white paper published on our content hub.
Onboarding your project to the cloud takes only 5 simple steps
If you are looking to get started with using the cloud for data science projects, there are a few key decisions and steps you will have to make in advance. We will take a closer look at each of those.
1. Choosing the cloud service level
When choosing the service level, the most common patterns for data science applications are CaaS or PaaS. The reason is that IaaS can create high costs resulting from maintaining virtual machines or building up scalability across VMs. SaaS services, on the other hand, are already tailored to a specific business problem and are used instead of building your own model and application.
CaaS comes with the main advantage of containers, namely that containers can be deployed to any container platform of any provider. Also, when the application does not only consist of the machine learning model but needs additional micro-services or front-end components, they can all be hosted with CaaS. The downside is that, similar to an on-premises roll-out, container images for MLOps tools like model registry, pipelines and model performance monitoring are not available out of the box and need to be built and integrated with the application. The larger the number of tools and libraries used, the higher the likelihood that at some point future versions will have incompatibilities or will not work together at all.
PaaS services like Azure Machine Learning, Google Vertex AI or Amazon SageMaker on the other hand have all those functionalities built in. The downside of these services is that they all come with complex cost structures and are specific to the respective cloud provider. Depending on the project requirements the PaaS services may in some special cases feel too restrictive.
When comparing CaaS and PaaS it mostly comes down to the tradeoff between flexibility and a higher level of vendor lock-in. Higher vendor lock-in comes with paying an extra premium for the sake of included features, increased compatibility and rise in the speed of development. Higher flexibility on the other hand comes at the cost of increased integration and maintenance effort.
2. Making your data available in the cloud
Usually, the first step to making your data available is to upload a snapshot of the data to a cloud object storage. These are well integrated with other services and can later be replaced by a more suitable data storage solution with little effort. Once the results from the machine learning model are suitable from a business perspective, data engineers should set up a process to automatically keep your data up to date.
3. Building a pipeline for preprocessing
In any data science project, one crucial step is building a robust pipeline for data preprocessing. This ensures your data is clean and ready for modeling, which will save you time and effort in the long run. A best practice is to set up a continuous integration and continuous delivery (CI/CD) pipeline to automate deployment and testing of your preprocessing and to make it part of your DevOps cycle. The cloud helps you automatically scale your pipelines to deal with any amount of data needed for the training of your model.
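To make this more concrete, here is a minimal sketch of a preprocessing pipeline written as small, testable steps. The function names, the record layout and the age and income fields are illustrative assumptions and not tied to any cloud service. A CI/CD job could run simple checks against steps like these on every commit.

```python
from statistics import mean
from typing import Callable, Iterable

Record = dict  # one row of raw data, e.g. {"age": 31, "income": 52000.0}
Step = Callable[[list[Record]], list[Record]]


def drop_incomplete(rows: list[Record]) -> list[Record]:
    """Remove rows that contain missing (None) values."""
    return [row for row in rows if all(value is not None for value in row.values())]


def center_column(column: str) -> Step:
    """Build a step that subtracts the column mean from every value."""
    def step(rows: list[Record]) -> list[Record]:
        if not rows:
            return rows
        column_mean = mean(row[column] for row in rows)
        return [{**row, column: row[column] - column_mean} for row in rows]
    return step


def run_pipeline(rows: list[Record], steps: Iterable[Step]) -> list[Record]:
    """Apply each preprocessing step in order."""
    for step in steps:
        rows = step(rows)
    return rows


# Hypothetical raw snapshot with one incomplete row.
raw = [
    {"age": 30, "income": 40000.0},
    {"age": None, "income": 55000.0},  # incomplete row, will be dropped
    {"age": 50, "income": 60000.0},
]
clean = run_pipeline(raw, [drop_incomplete, center_column("income")])
```

Keeping each step a plain function that returns new rows makes it easy to test steps in isolation and to reorder or extend the pipeline later.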
4. Training and evaluating trained models
In this stage, the preprocessing pipeline is extended by adding modeling components. This includes hyper-parameter tuning, which cloud services once again support by scaling resources and storing the results of each training experiment for easier comparison. All cloud providers offer an automated machine learning service. This can be used to generate the first version of a model quickly and to compare performance on the data across multiple model types. This way you can quickly assess if the data and preprocessing suffice to tackle the business problem. Besides that, the result can be utilized as a benchmark for the data scientist. The best model should be stored in a model registry for deployment and transparency.
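The idea of logging every training experiment and registering the best model can be sketched in a few lines. The following is a toy illustration under stated assumptions: the data is made up, the model is a tiny nearest-neighbour regressor, and the registry is a plain dictionary standing in for a managed model registry.

```python
from statistics import mean

# Toy data: y is roughly 2 * x, split into a training and a validation part.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 12.0)]
valid = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0)]


def knn_predict(x: float, k: int) -> float:
    """Predict y as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)


def validation_mse(k: int) -> float:
    """Mean squared error on the validation set for a given k."""
    return mean((knn_predict(x, k) - y) ** 2 for x, y in valid)


# Hyper-parameter search: every run is logged so the runs can be compared later.
experiments = [{"params": {"k": k}, "val_mse": validation_mse(k)} for k in (1, 2, 4)]

# The best run is stored under a name and version, mimicking a model registry.
best = min(experiments, key=lambda run: run["val_mse"])
registry = {"demo-model": {"version": 1, **best}}
```

In a cloud setup, the list of experiments and the registry would be provided by the platform, and each run of the loop could be scheduled on its own compute resources.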
In case a model has already been trained locally or on-premises, it is possible to skip the training and just load the model into the model registry.
5. Serving models to business users
The final and likely most important step is serving the model to your business unit to create value from it. All cloud providers offer solutions to deploy the model in a scalable manner with little effort. Finally, all pieces created in the earlier steps from automatically provisioning the most recent data over applying preprocessing and feeding the data into the deployed model come together.
Now we have gone through the steps of how to onboard your data science project. With these 5 steps you are well on the way with moving your data science workflow to the cloud. To avoid some of the common pitfalls, here are some learnings from my personal experiences I would like to share, which can positively impact your project’s success.
Make your move to the cloud even easier with these useful tips
Start using the cloud early in the process.
By starting early, the team can familiarize themselves with the platform’s features. This will help you make the most of its capabilities and avoid potential problems and heavy refactoring down the road.
Make sure your data is accessible.
This may seem like a no-brainer, but it is important to make sure your data is easily accessible when you move to the cloud. This is especially true in a setup where your data is generated on-premises and needs to be transferred to the cloud.
Consider using serverless computing.
Serverless computing is a great option for data science projects because it allows you to scale your resources up or down as needed without having to provision or manage any servers.
Don’t forget about security.
While all cloud providers offer state-of-the-art IT security features, some settings are easy to miss during configuration, which can expose your project to needless risk.
Monitor your cloud expenses.
Coming from on-premises, optimization is often about peak resource usage because hardware or licenses are limited. With scalability and pay-as-you-go, this paradigm shifts more strongly towards optimizing costs. Optimizing costs is usually not the first activity when starting a project, but keeping an eye on the costs can prevent unpleasant surprises and can be used at a later stage to make a cloud application even more cost-effective.
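A small back-of-the-envelope calculation illustrates this shift. All numbers below are hypothetical, and using the same hourly price for both scenarios is a deliberate simplification. The point is only that capacity sized for the peak is paid for all month, while pay-as-you-go charges the peak only while it is used.

```python
# All numbers are hypothetical and only illustrate the shift in perspective.
HOURS_PER_MONTH = 730
price_per_node_hour = 2.0      # assumed pay-as-you-go price per compute node
peak_nodes = 8                 # nodes needed during a training run
baseline_nodes = 1             # nodes needed to serve the model
training_hours = 40            # hours per month spent on training runs

# Peak-oriented thinking: capacity is sized for the peak and paid for all month.
fixed_cost = peak_nodes * HOURS_PER_MONTH * price_per_node_hour

# Pay-as-you-go: the baseline runs all month, the peak only during training.
extra_nodes = peak_nodes - baseline_nodes
usage_cost = (
    baseline_nodes * HOURS_PER_MONTH + extra_nodes * training_hours
) * price_per_node_hour

savings_share = 1 - usage_cost / fixed_cost
print(f"fixed: {fixed_cost:.0f}, pay-as-you-go: {usage_cost:.0f}, "
      f"saved: {savings_share:.0%}")
```

The same calculation also shows where surprises come from: if the extra nodes are accidentally left running, the usage-based cost approaches the fixed cost.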
Take your data science projects to new heights with the cloud
If you’re starting your next data science project, doing so in the cloud is a great option. It is scalable, flexible, and offers a variety of services that can help you get the most out of your project. Cloud-based architectures are a modern way of developing applications and are expected to grow even more important in the future.
Following the steps presented will help you on that journey and support you in keeping up with the newest trends and developments. Plus, with the tips provided, you can avoid many of the common pitfalls that may occur on the way. So, if you’re looking for a way to get the most out of your data science project, the cloud is definitely worth considering.
In the age of open-source software projects, attacks on vulnerable software are ever present. Python is the most popular language for Data Science and Engineering and is thus increasingly becoming a target for attacks through malicious libraries. Additionally, public facing applications can be exploited by attacking vulnerabilities in the source code.
For this reason it’s crucial that your code does not contain any CVEs (common vulnerabilities and exposures) or use libraries that might be malicious. This is especially true if it’s public facing software, e.g. a web application. At statworx we look for ways to increase the quality of our code by using automated scanning tools. Hence, we’ll discuss the value of two code and package scanners for Python.
There are numerous tools for scanning code and its dependencies. Here I will provide an overview of the most popular tools designed with Python in mind. Such tools fall into one of two categories:
- Static Application Security Testing (SAST): look for weaknesses in code and vulnerable packages
- Dynamic Application Security Testing (DAST): look for vulnerabilities that occur at runtime
In what follows I will compare bandit and safety using a small streamlit application I’ve developed. Both tools fall into the category of SAST, since they don’t need the application to run in order to perform their checks. Dynamic application testing is more involved and may be the subject of a future post.
For the sake of context, here’s a brief description of the application: it was designed to visualize the convergence (or lack thereof) in the sampling distributions of random variables drawn from different theoretical probability distributions. Users can choose the distribution (e.g. Log-Normal), set the maximum number of samples and pick different sampling statistics (e.g. mean, standard deviation, etc.).
Bandit is an open-source Python code scanner that checks for vulnerabilities in code and only in your code. It decomposes the code into its abstract syntax tree and runs plugins against it to check for known weaknesses. Among other tests, it checks for plain SQL code which could provide an opening for SQL injections, for passwords stored in code, and for hints about common openings for attacks such as use of the pickle library. Bandit is designed for use with CI/CD and returns an exit status of 1 whenever it encounters any issues, thus terminating the pipeline. A report is generated, which includes information about the number of issues separated by confidence and severity according to three levels: low, medium, and high. In this case, bandit finds no obvious security flaws in our code.
Run started:2022-06-10 07:07:25.344619

Test results:
    No issues identified.

Code scanned:
    Total lines of code: 0
    Total lines skipped (#nosec): 0

Run metrics:
    Total issues (by severity):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
    Total issues (by confidence):
        Undefined: 0
        Low: 0
        Medium: 0
        High: 0
Files skipped (0):
All the more reason to configure Bandit carefully for use in your project. Sometimes it may raise a flag even though you already know that this would not be a problem at runtime. If, for example, you have a series of unit tests that use pytest and run as part of your CI/CD pipeline, Bandit will normally throw an error, since this code uses the assert statement. Bandit flags assert (check B101) because Python removes assert statements when it runs with optimizations enabled (the -O flag), so they should not be relied on in code that may run that way.
To avoid this behaviour you could:
- run scans against all files but exclude the test using the command line interface
- create a yaml configuration file to exclude the test
Here’s an example:
    # bandit_cfg.yml
    skips: ["B101"] # skips the assert check
Then we can run bandit as follows:
    bandit -c bandit_cfg.yml /path/to/python/files
and the unnecessary warnings will not crop up.
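To illustrate what running checks against an abstract syntax tree means in practice, here is a minimal sketch built on Python's standard ast module. It is not Bandit's implementation and covers only two toy checks; the class name MiniScanner and the scanned snippet are made up for this example.

```python
import ast

# Hypothetical code under inspection.
SOURCE = '''
import pickle

def load(path):
    assert path.endswith(".pkl")
    with open(path, "rb") as handle:
        return pickle.load(handle)
'''


class MiniScanner(ast.NodeVisitor):
    """Collect (line, message) findings while walking the syntax tree."""

    def __init__(self) -> None:
        self.findings: list[tuple[int, str]] = []

    def visit_Assert(self, node: ast.Assert) -> None:
        self.findings.append((node.lineno, "use of assert"))
        self.generic_visit(node)

    def visit_Import(self, node: ast.Import) -> None:
        for alias in node.names:
            if alias.name == "pickle":
                self.findings.append((node.lineno, "import of pickle"))
        self.generic_visit(node)


def scan(source: str) -> list[tuple[int, str]]:
    """Parse source code and return all findings sorted by line number."""
    scanner = MiniScanner()
    scanner.visit(ast.parse(source))
    return sorted(scanner.findings)


findings = scan(SOURCE)
for line, message in findings:
    print(f"line {line}: {message}")
# A CI job would typically exit with a non-zero status if findings exist.
```

Because the check works on the parsed tree and not on raw text, it is not fooled by comments or strings that merely mention assert or pickle.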
Developed by the team at pyup.io, safety is a package scanner that runs against a curated database consisting of manually reviewed records based on publicly available CVEs and changelogs. The package is available for Python >= 3.5 and can be installed for free. By default it uses Safety DB, which is freely accessible. Pyup.io also offers paid access to a more frequently updated database.
Running safety check --full-report -r requirements.txt in the package root directory gives us the following output (truncated for the sake of readability):
+==============================================================================+
| REPORT                                                                       |
| checked 110 packages, using free DB (updated once a month)                   |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| urllib3                    | 1.26.4    | <1.26.5                  | 43975    |
+==============================================================================+
| Urllib3 1.26.5 includes a fix for CVE-2021-33503: An issue was discovered in |
| urllib3 before 1.26.5. When provided with a URL containing many @ characters |
| in the authority component, the authority regular expression exhibits        |
| catastrophic backtracking, causing a denial of service if a URL were passed  |
| as a parameter or redirected to via an HTTP redirect.                        |
| https://github.com/advisories/GHSA-q2q7-5pp4-w6pg                            |
+==============================================================================+
The report includes the number of packages that were checked, the type of database used for reference and information on each vulnerability that was found. In this example an older version of the package urllib3 is affected by a vulnerability which technically could be used by an attacker to perform a denial-of-service attack.
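The core idea behind such a package check, comparing pinned versions against known affected ranges, can be sketched as follows. This is a simplified illustration and not how safety is implemented: the ADVISORIES dictionary is a hypothetical stand-in for the vulnerability database, the requests pin is made up, and real version specifiers are more complex than plain dotted numbers.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a simple dotted version such as '1.26.4' into (1, 26, 4)."""
    return tuple(int(part) for part in text.split("."))


def is_affected(installed: str, fixed_in: str) -> bool:
    """Return True if the installed version is older than the fixed release."""
    return parse_version(installed) < parse_version(fixed_in)


# Hypothetical stand-in for a vulnerability database: package -> (fixed_in, id).
ADVISORIES = {"urllib3": ("1.26.5", "43975")}

# Pinned requirements as they might appear in a requirements.txt file.
requirements = ["urllib3==1.26.4", "requests==2.27.1"]

report = []
for line in requirements:
    name, installed = line.split("==")
    if name in ADVISORIES:
        fixed_in, advisory_id = ADVISORIES[name]
        if is_affected(installed, fixed_in):
            report.append((name, installed, f"<{fixed_in}", advisory_id))
```

Comparing tuples of integers instead of raw strings matters here: as strings, "1.9.0" would sort after "1.26.5", although it is the older release.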
Integration into your workflow
Of course, you can always manually install both packages from PyPI on your runner if no ready-made integration like a GitHub action is available. Since both programs can be used from the command line, you could also integrate them into a pre-commit hook locally if using them on your CI/CD platform is not an option.
The CI/CD pipeline for the application above was built with GitHub Actions. After installing the application’s required packages, it runs bandit first and then safety to scan all packages. With all the packages updated, the vulnerability scans pass and the docker image is built.
[Figures: Package check | Code check]
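If no ready-made integration is available, a small wrapper script can run both scanners and fail the hook or pipeline step when either of them reports a problem. The sketch below is an assumption about how such a wrapper might look. To keep it runnable anywhere, it uses stand-in commands; the real scanner invocations are shown in the comment, and the src path in them is hypothetical.

```python
import subprocess
import sys


def run_checks(commands: list[list[str]]) -> int:
    """Run every command and return 1 if any of them failed, else 0."""
    failed = False
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"check failed: {' '.join(command)}", file=sys.stderr)
            failed = True
    return 1 if failed else 0


# Stand-in commands so the sketch runs anywhere; in a real hook these would be
# the scanner invocations, e.g. ["bandit", "-c", "bandit_cfg.yml", "-r", "src"]
# and ["safety", "check", "--full-report", "-r", "requirements.txt"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

exit_code = run_checks([passing, failing])
```

All commands are run even if an earlier one fails, so a single pass reports both code and package issues. A real script would end by passing the result to sys.exit.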
I would strongly recommend using both bandit and safety in your CI/CD pipeline, as they provide security checks for your code and your dependencies. For modern applications, manually reviewing every single package your application depends on is simply not feasible, not to mention all of the dependencies these packages have! Thus, automated scanning is indispensable if you want to have some level of awareness about how unsafe your code is.
While bandit scans your code for known exploits, it does not check any of the libraries used in your project. For this, you need safety, as it informs you about known security flaws in the libraries your application depends on. While neither tool is completely foolproof, it’s still better to be notified about some CVEs than none at all. This way, you’ll be able to either fix your vulnerable code or upgrade a vulnerable package dependency to a more secure version.
Keeping your code safe and your dependencies trustworthy can ward off potentially devastating attacks on your application.
Given the hype and the recent success of AI, it is surprising that most companies still have not integrated AI successfully. This is quite evident in many industries, especially in manufacturing (McKinsey).
In a study published by Accenture in 2019 about the implementation of AI in companies, the authors came to the conclusion that around 80% of all Proof of Concepts (PoCs) do not make it into production. Furthermore, only 5% of all interviewed companies stated that they currently have a company-wide AI strategy in place.
These findings are thought-provoking: What exactly is going wrong and why does artificial intelligence apparently not yet make the holistic transition from successful academic studies to the real world?
1. What Is Data-Centric AI?
“Data-centric AI is the discipline of systematically engineering the data used to build an AI system.”
Andrew Ng, data-centric AI pioneer
The data-centric approach focuses more on integrating the data into AI (data-first) and less on the models (model-first) to overcome the difficulties of AI with “reality”. Usually, the training data that serves as the starting point of an AI project at companies has relatively little in common with the meticulously curated and widely used benchmark datasets such as MNIST or ImageNet.
In this article, we want to consolidate different data-centric theories and frameworks in the context of an AI (project) workflow. In addition, we want to show how we approach a data-first AI implementation at statworx.
2. What’s Behind a Data-Centric Way of Thinking?
In the simplest terms, AI systems consist of two critical components: data and model (code). Data-centric thereby leans its focus more towards data, model-centric on model infrastructure – duh!
A strongly model-centric AI approach regards the data only as an extrinsic, static parameter. The iterative process of a data science project only starts after the dataset has been “delivered”, with model-related tasks such as training and fine-tuning various model architectures. This occupies the vast portion of time in a data science project, and data pre-processing steps are only an ad-hoc duty at the beginning of each project.
In contrast, the data-centric approach understands (automated) data processes as an integral part of any ML project. This incorporates any necessary steps to get from raw data to a final dataset. Internalizing these processes aims to enhance the quality and the methodical observability of the data.
Data-centric approaches can be consolidated into three broader categories that explain loosely the scope of the data-centric concept. In the following, we will assign buzzwords (frameworks) that are often used in the data-centric context to a specific category.
2.1. Integration of SMEs Into the Development Process as a Major Link Between Data and Model Knowledge
The integration of domain knowledge is an integral part of data-centric. It should help project teams to grow together and thus integrate the knowledge of Subject Matter Experts (SMEs) in the best way possible.
- Data Profiling:
Data scientists should not act as a one-man show that only shares its findings with the SMEs. With their statistical and programming abilities, they should rather act as mediators who empower SMEs to dig through the data on their own.
- Human-in-the-loop Data & Model Monitoring
Similar to profiling, this should be a central starting point to ensure that SMEs have access to the relevant components of the AI system. From this central checkpoint, not only data but also model-relevant metrics can be monitored, and examples can be visualized and checked. Sophisticated monitoring drastically decreases the response time, since errors can be investigated (and mitigated) directly, and not only by data scientists.
2.2. Data Quality Management as an Agile, Automated and Iterative Process Over (Training) Data
Continuously improving the data preparation process is key to every data science project. The model itself should be an extrinsic part at first.
- Data Catalogue, Lineage & Validation:
The documentation of data should also not be an extrinsic task, which often only arises ad-hoc towards the end of a project and could become obsolete again with every change, e.g., of a model feature. Changes should be reflected dynamically and thus automate the documentation. Data catalogue frameworks provide the capabilities to store data with associated meta data (and other necessary information).
Data lineage, as a subsequent step in the data process, then keeps track of all data wrangling steps that occur during the transition from the raw to the final dataset. The more complex a data model, the more useful a lineage graph becomes for tracking how a final column was created (graph below), for example through joining, filtering or other processing steps. Finally, validating the data during the input and transformation process allows for a consistent data foundation. The knowledge gained from data profiling helps here to develop validation rules and integrate them into the process.
- Data & Label Cleaning:
The necessity of data processing is ubiquitous and well-established among AI practitioners. However, label cleaning is a rather disregarded step (that, of course, only applies to classification problems). Wrongly classified datapoints can make it hard for some algorithms to reveal the right patterns in the data.
- Data Drifts in Production:
A well-known weak spot of (all) AI systems is data drift. This happens when the training data and the live inference data do not have the same distribution. To ensure the validity of the model’s predictions in the long run, identifying these irregularities and retraining the model accordingly is a crucial part of any ML pipeline.
- Data Versioning:
GitHub has long been the go-to standard for versioning the codebase of projects. However, for AI projects it is important to track not only code but also data changes and to tie both together. This can produce a more holistic depiction of an ML workflow with increased visibility and observability.
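As one example from the list above, data drift between training and live data can be quantified with a simple distribution comparison. The sketch below uses the population stability index (PSI) as one possible metric among many. The feature values and bin boundaries are made up, and the threshold at which drift should trigger retraining has to be chosen per project.

```python
import math


def bin_shares(values: list[float], edges: list[float]) -> list[float]:
    """Share of values per bin; a small floor avoids taking the log of zero."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        index = sum(value >= edge for edge in edges)  # bin the value falls into
        counts[index] += 1
    eps = 1e-6
    return [max(count / len(values), eps) for count in counts]


def population_stability_index(train, live, edges) -> float:
    """PSI = sum((live - train) * ln(live / train)) over all bins."""
    expected = bin_shares(train, edges)
    actual = bin_shares(live, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Hypothetical values of a single feature.
train_feature = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
live_same = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
live_shifted = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9]
edges = [2.5, 4.5]  # bin boundaries, chosen from the training data

psi_same = population_stability_index(train_feature, live_same, edges)
psi_shifted = population_stability_index(train_feature, live_shifted, edges)
```

Identical distributions give a PSI of zero, and the value grows as the live data moves away from the training data, which makes it a convenient number to track in a monitoring dashboard.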
2.3. Generating Training Data as a Programmatic Task
Producing new (labeled) training examples is often a huge roadblock for AI projects, particularly if the underlying problem is complex and thus requires vast datasets. This may lead to an unbearable overhead of manual labor.
- Data Augmentation:
In many data-intensive deep learning models, this technique has been used for a long time to create artificial data from existing data. It works particularly well on image data, with simple operations such as rotating, tilting or altering the color filters. Even in NLP and tabular (Excel and co.) use cases, there are possibilities to enlarge the dataset.
- Automated Data Labeling:
As already stated, labeling data is a rather labor-intensive task in which people assign data points to a predefined category. On the one hand, this makes the initial effort (costs) very high, and on the other hand, it is error-prone and difficult to monitor. That’s where ML can chip in. Concepts like semi-supervised learning and weak supervision can automate the manual task almost entirely.
- Data Selection:
Working with large chunks of data in a local setup is often not possible, especially once the dataset does not fit into memory anymore. And even if it fits, training runs can take forever. Data Selection tries to reduce the size by active subsampling (whether labeled or unlabeled). The “best” examples with the highest diversity and representativeness are actively selected here to ensure the best possible characterization of the input, and this is done automatically.
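Of the techniques in this list, data augmentation is the easiest to show in a few lines. The following minimal sketch creates additional labeled examples from one image by flipping and rotating it. The tiny image and the label are made up, and real projects would typically rely on an image library for these operations.

```python
Image = list[list[int]]  # grayscale image as rows of pixel values


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image: Image, label: str) -> list[tuple[Image, str]]:
    """Return the original example plus transformed copies with the same label."""
    return [
        (image, label),
        (flip_horizontal(image), label),
        (rotate_90(image), label),
    ]


# Hypothetical 2x3 image with a made-up label.
original = [
    [0, 1, 2],
    [3, 4, 5],
]
examples = augment(original, "cat")
```

Reusing the label is only valid when the transformation does not change the class. A flipped photo of an animal keeps its label, while a rotated handwritten 6 may turn into a 9.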
Needless to say, not every presented method is necessarily a good fit for every project. It is part of the work of the development team to analyze how a certain framework could benefit the final product, which also should take business considerations (e.g. cost versus benefit) into account.
3. Integration of Data-Centric at statworx
Data-centric is a crucial part of our projects, especially during the transition from PoC to production-ready model. We have also had cases in which, during this transition, we faced mainly data-related issues due to inadequate documentation, validation or poor integration of SMEs into the data process.
We therefore generally try to show our customers the importance of data management for the longevity and robustness of AI products in production and how helpful components are linked within an AI pipeline.
As part of these learnings, our data onboarding framework, a mix of profiling, catalogue and validation, aims to mitigate the aforementioned issues. Additionally, this framework helps the entire company make previously unused, undocumented data sources available for various use cases (not just AI).
A strong interaction with the SMEs on the client’s side is integral to establish trustworthy, robust and well-understood quality checks. This also helps to empower our clients to debug errors and do the first-level support themselves, which also helps with the service’s longevity.
In a stripped-down, custom data onboarding integration, we used a variety of open and closed source tools to create a platform that is easily scalable and understandable for the customer. We installed validation checks with Great Expectations (GE), a Python-based tool with reporting capabilities to create a shareable status report of the data.
This architecture can then run in different environments, for example on cloud-native software (Azure Data Factory) or orchestrated with Airflow (open-source), and can be easily extended.
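The principle behind such validation checks can be sketched in plain Python. The example below deliberately does not use the Great Expectations API. It only illustrates the idea of declarative rules that produce a shareable report, and the column names, value ranges and rule descriptions are assumptions made for this example.

```python
from typing import Callable

Row = dict
Rule = tuple[str, Callable[[Row], bool]]  # (description, check for one row)


def temperature_in_range(row: Row) -> bool:
    """Check that a temperature is present and physically plausible."""
    value = row.get("temperature")
    return value is not None and -40 <= value <= 150


# Hypothetical rules, e.g. agreed with subject matter experts during profiling.
RULES: list[Rule] = [
    ("machine_id is present", lambda row: row.get("machine_id") is not None),
    ("temperature between -40 and 150", temperature_in_range),
    ("status is a known value", lambda row: row.get("status") in {"ok", "warning", "error"}),
]


def validate(rows: list[Row], rules: list[Rule]) -> dict[str, list[int]]:
    """Return, per rule, the indices of the rows that violate it."""
    return {
        description: [index for index, row in enumerate(rows) if not check(row)]
        for description, check in rules
    }


batch = [
    {"machine_id": "A1", "temperature": 71.5, "status": "ok"},
    {"machine_id": None, "temperature": 68.0, "status": "ok"},
    {"machine_id": "B7", "temperature": 900.0, "status": "unknown"},
]
report = validate(batch, RULES)
for description, failures in report.items():
    status = "passed" if not failures else f"failed for rows {failures}"
    print(f"{description}: {status}")
```

Because every rule carries a human-readable description, the resulting report can be read and discussed by SMEs without looking at the code.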
4. Data-Centric in Relation to AI’s Status Quo
Both data-centric and model-centric describe ways to approach an AI project.
On the one hand, there already exist well-established best practices around model-centric with various production-proven frameworks.
One reason for this maturity is certainly the strong focus on model architectures and their advancements among academic researchers and leading AI companies. With computer vision and NLP leading the way, commercialized meta models, trained on enormous datasets, opened the door for successful AI use cases. With relatively limited data, those models can be fine-tuned for downstream end-use applications, which is known as transfer learning.
However, this trend helps only some of the failed projects. Especially in the context of industrial projects, a lack of compatibility or rigidity in the use cases makes the application of meta-models difficult. Non-rigidity is often found in machine-heavy manufacturing industries, where the environment in which data is produced is constantly changing and even the replacement of a single machine can have a large impact on a productive AI model. If this issue has not been properly considered in the AI process, it creates a difficult-to-calculate risk, also known as technical debt [Source: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf].
Lastly, edge cases, rare and unusual datapoints are generally a burden for any AI. An application that observes anomalies of a machine component most certainly sees only a fraction of faulty units.
5. Conclusion – Paradigm Shift in Sight?
Overcoming these problems is part of the promise of data-centric, but the approach is still rather immature at the moment.
The limited availability and immaturity of open-source frameworks support this claim, especially since there is no unified tool stack that end users can choose from. This inevitably leads to longer, more involved, and more complex AI projects, which is a significant hurdle for many companies. In addition, there are few data metrics available to give companies feedback on what exactly they are “improving.” Finally, many of the tools (e.g., data catalogues) have more indirect, distributed benefits.
Some startups that aim to address these issues have emerged in recent years. However, because they (exclusively) market paid tier software, it is rather unclear to what extent these products can really cover the broad mass of problems from different use cases.
Although the above shows that companies in general are still far away from a holistic integration of data-centric, robust data strategies have become more and more important lately (as we at statworx could see in our projects).
With increased academic research into data products, this trend will certainly intensify. Not only because new, more robust frameworks will emerge, but also because university graduates will bring more knowledge in this area to companies.
Model-centric arch: own
Data-centric arch: own
Data lineage: https://www.researchgate.net/figure/Data-lineage-visualization-example-in-DW-environment-using-Sankey-diagram_fig7_329364764
Versioning Code/Data: https://ardigen.com/7155/
Data Augmentation: https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722
Data & AI pipeline: own
Validation with GE: https://greatexpectations.io/blog/ge-data-warehouse/
What you can expect:
The aim of the fair is to connect students of business informatics, business mathematics and data science with companies.
We will be there with our own booth and will also introduce statworx in a short presentation.
Given good weather, the fair will take place outside this year. Participation is free of charge and a pre-registration is not necessary. Just come by and talk to us.
What you can expect:
The konaktiva fair at Darmstadt University of Technology is one of the oldest and largest student-organized company career fairs in Germany. In line with its motto “Students meet companies”, it brings together prospective graduates and companies every year.
This year, we are also taking part with our own booth as well as several colleagues and we are looking forward to the exchange with interested students. We will be happy to present the various entry-level opportunities at statworx – from internships to permanent positions – and share insights into our day-to-day work.
In addition, there will be the opportunity to get to know us better and to discuss individual questions and cases during pre-scheduled one-on-one meetings away from the hustle and bustle of the trade fair.
Participation in the fair is free of charge for visitors.
In the field of Data Science – as the name suggests – the topic of data, from data cleaning to feature engineering, is one of the cornerstones. Having and evaluating data is one thing, but how do you actually get data for new problems?
If you are lucky, the data you need is already available. Either by downloading a whole dataset or by using an API. Often, however, you have to gather information from websites yourself – this is called web scraping. Depending on how often you want to scrape data, it is advantageous to automate this step.
This post will be about exactly this automation. Using web scraping and GitHub Actions as an example, I will show how you can create your own data sets over a more extended period. The focus will be on the experience I have gathered over the last few months.
The code I used and the data I collected can be found in this GitHub repository.
Search for data – the initial situation
During my research for the blog post about gasoline prices, I also came across data on the utilization of parking garages in Frankfurt am Main. Obtaining this data laid the foundation for this post. After some thought and additional research, other thematically appropriate data sources came to mind:
- Road utilization
- S-Bahn and subway delays
- Events nearby
- Weather data
However, it quickly became apparent that I could not get all this data, as it is not freely available or allowed to be stored. Since I planned to store the collected data on GitHub and make it available, this was a crucial point for which data came into question. For these reasons, railway data fell out completely. I only found data for Cologne for road usage, and I wanted to avoid using the Google API as that definitely brings its own challenges. So, I was left with event and weather data.
For the weather data of the German Weather Service, the rdwd package can be used. Since this data is already historized, it is irrelevant for this blog post. GitHub Actions have proven to be very useful for getting the remaining event and parking data, even if they are not entirely trivial to use. Especially the fact that they can be used free of charge makes them a recommendable tool for such projects.
Scraping the data
Since this post will not deal with the details of web scraping, I refer you here to the post by my colleague David.
The parking data is available here in XML format and is updated every 5 minutes. Once you understand the structure of the XML, it’s a simple matter of accessing the right index, and you have the data you want. In the function get_parking_data(), I have summarized everything I need. It creates a record for the area and a record for the individual parking garages.
Example data extract area
parkingAreaOccupancy;parkingAreaStatusTime;parkingAreaTotalNumberOfVacantParkingSpaces;totalParkingCapacityLongTermOverride;totalParkingCapacityShortTermOverride;id;TIME
0.08401977;2021-12-01T01:07:00Z;556;150;607;1[Anlagenring];2021-12-01T01:07:02.720Z
0.31417114;2021-12-01T01:07:00Z;513;0;748;4[Bahnhofsviertel];2021-12-01T01:07:02.720Z
0.351417;2021-12-01T01:07:00Z;801;0;1235;5[Dom / Römer];2021-12-01T01:07:02.720Z
0.21266666;2021-12-01T01:07:00Z;1181;70;1500;2[Zeil];2021-12-01T01:07:02.720Z
Example data extract facility
parkingFacilityOccupancy;parkingFacilityStatus;parkingFacilityStatusTime;totalNumberOfOccupiedParkingSpaces;totalNumberOfVacantParkingSpaces;totalParkingCapacityLongTermOverride;totalParkingCapacityOverride;totalParkingCapacityShortTermOverride;id;TIME
0.02;open;2021-12-01T01:02:00Z;4;196;150;350;200;24276[Turmcenter];2021-12-01T01:07:02.720Z
0.11547912;open;2021-12-01T01:02:00Z;47;360;0;407;407;18944[Alte Oper];2021-12-01T01:07:02.720Z
0.0027472528;open;2021-12-01T01:02:00Z;1;363;0;364;364;24281[Hauptbahnhof Süd];2021-12-01T01:07:02.720Z
0.609375;open;2021-12-01T01:02:00Z;234;150;0;384;384;105479[Baseler Platz];2021-12-01T01:07:02.720Z
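Extracts like the two above are plain semicolon-separated text, so they are easy to read back in for later analysis. The code in this post is written in R; the following is only a minimal Python sketch, assuming `;` as the separator and the column names shown in the area extract. The function name `read_extract` and the inline sample are illustrative and not part of the original repository.

```python
import csv
import io

# Two rows copied from the area extract above; in practice this would be a file.
SAMPLE = """parkingAreaOccupancy;parkingAreaStatusTime;parkingAreaTotalNumberOfVacantParkingSpaces;totalParkingCapacityLongTermOverride;totalParkingCapacityShortTermOverride;id;TIME
0.08401977;2021-12-01T01:07:00Z;556;150;607;1[Anlagenring];2021-12-01T01:07:02.720Z
0.21266666;2021-12-01T01:07:00Z;1181;70;1500;2[Zeil];2021-12-01T01:07:02.720Z
"""


def read_extract(text):
    """Parse a semicolon-separated extract into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        # Convert the numeric columns used below; everything else stays a string.
        row["parkingAreaOccupancy"] = float(row["parkingAreaOccupancy"])
        row["parkingAreaTotalNumberOfVacantParkingSpaces"] = int(
            row["parkingAreaTotalNumberOfVacantParkingSpaces"]
        )
        rows.append(row)
    return rows


rows = read_extract(SAMPLE)
for row in rows:
    print(row["id"], row["parkingAreaOccupancy"])
```

Note that a value such as `5[Dom / Römer]` contains spaces, which is why splitting on the semicolon rather than on whitespace matters here.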
For the event data, I scrape the page stadtleben.de. Since it is an HTML page that is quite well structured, I can access the tabular event overview via the tag “kalenderListe”. The result is created by the function
Example data extract event
eventtitle;views;place;address;eventday;eventdate;request
Magical Sing Along - Das lustigste Mitsing-Event;12576;Bürgerhaus;64546 Mörfelden-Walldorf, Westendstraße 60;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Velvet-Bar-Night;1460;Velvet Club;60311 Frankfurt, Weißfrauenstraße 12-16;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Basta A-cappella-Band;465;Zeltpalast am Deutsche Bank Park;60528 Frankfurt am Main, Mörfelder Landstraße 362;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
BeThrifty Vintage Kilo Sale | Frankfurt | 04. & 05. …;1302;Batschkapp;60388 Frankfurt am Main, Gwinnerstraße 5;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Automation of workflows – GitHub Actions
The basic framework is in place. I have a function that writes the parking and event data to a .csv file when executed. Since I want to query the parking data every 5 minutes and the event data three times a day to be on the safe side, GitHub Actions come into play.
With this function of GitHub, workflows can be scheduled and executed in addition to actions triggered during merging or committing. For this purpose, a .yml file is created in the folder
The main components of my workflow are:
- schedule – Every ten minutes, the functions should be executed
- The OS – Since I develop locally on a Mac, I use the
- Environment variables – This contains my GitHub token and the path for the package management
- The individual steps in the workflow itself.
The workflow goes through the following steps:
- Setup R
- Load packages with renv
- Run script to scrape data
- Run script to update the README
- Pushing the new data back into git
Each of these steps is very small and clear in itself; however, as is often the case, the devil is in the details.
Limitation and challenges
Over the last few months, I’ve been tweaking and optimizing my workflow to deal with the bugs and issues. In the following, you will find an overview of my condensed experiences with GitHub Actions from the last months.
If you want to perform time-critical actions, you should use other services. GitHub Actions does not guarantee that jobs will run exactly on schedule (or, in some cases, that they will be executed at all).
|Time span in minutes|<= 5|<= 10|<= 20|<= 60|> 60|
|Number of queries|1720|2049|5509|3023|194|
You can see that the planned five-minute intervals were not always adhered to. I should plan a larger margin here in the future.
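To make the table easier to interpret, the counts can be turned into shares. The following is a small sketch that assumes the five buckets are disjoint (the counts are not monotonically increasing, so they do not look cumulative) and that the time span is the gap between consecutive queries; the variable names are illustrative.

```python
# Counts copied from the table above; buckets are assumed to be disjoint.
runs_per_bucket = {
    "<= 5": 1720,
    "<= 10": 2049,
    "<= 20": 5509,
    "<= 60": 3023,
    "> 60": 194,
}

total_runs = sum(runs_per_bucket.values())
shares = {bucket: count / total_runs for bucket, count in runs_per_bucket.items()}

for bucket, share in shares.items():
    print(f"{bucket:>5} min: {share:6.1%}")
```

Under these assumptions, fewer than one in seven queries fell into the planned five-minute window, and the largest group, roughly 44%, landed in the 10 to 20 minute bucket.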
In the beginning, I had two workflows, one for the parking data and one for the events. If they overlapped in time, there were merge conflicts because both processes updated the README with a timestamp. Over time, I switched to a single workflow that includes error handling.
Likewise, if one run took longer and the next one had already started, there were merge conflicts in the .csv data when pushing. Long runs were often caused by the R setup and the loading of the packages. Consequently, I extended the schedule interval from five to ten minutes.
There were a few situations where the paths or structure of the scraped data changed, so I had to adjust my functions. Here the setting to get an email if a process failed was very helpful.
Lack of testing capabilities
There is no way to test a workflow script other than to run it. So, after a typo in the evening, one can wake up to a flood of emails with spawned runs in the morning. Still, that shouldn’t stop you from doing a local test run.
No data update
Since the end of December, the parking data has not been updated or made available. This shows that even if you have an automatic process, you should still continue to monitor it. I only noticed this later, which meant that my queries at the end of December always went nowhere.
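One way to notice such an outage earlier is to compare the status time reported by the source with the time of the request. The parking extracts carry both, assuming `parkingAreaStatusTime` is the source's last update and `TIME` is the moment of the query. The following Python sketch illustrates the idea; the function names and the 30-minute threshold are hypothetical choices, not part of the original workflow.

```python
from datetime import datetime, timedelta


def parse_utc(timestamp):
    """Parse an ISO 8601 timestamp with a trailing 'Z', as in the extracts."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def is_stale(status_time, request_time, max_age=timedelta(minutes=30)):
    """Return True if the status time lags the request time by more than max_age."""
    return parse_utc(request_time) - parse_utc(status_time) > max_age


# Values from the area extract: the source was updated seconds before the query.
print(is_stale("2021-12-01T01:07:00Z", "2021-12-01T01:07:02.720Z"))

# Hypothetical later query against an unchanged status time.
print(is_stale("2021-12-01T01:07:00Z", "2021-12-01T03:00:00Z"))
```

A check like this could fail the workflow run on stale data, so that the existing failure email doubles as a freshness alert.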
Despite all these complications, I still consider the whole thing a massive success. Over the last few months, I’ve been studying the topic repeatedly and have learned the tricks described above, which will also help me solve other problems in the future. I hope that all readers of this blog post could also take away some valuable tips and thus learn from my mistakes.
Since I have now collected a good half-year of data, I can deal with the evaluation. But this will be the subject of another blog post.
Kubernetes is a technology that in many ways greatly simplifies the deployment and maintenance of applications and compute loads, especially the training and hosting of machine learning models. At the same time, it allows us to adapt the required hardware resources, providing a scalable and cost-transparent solution.
This article first discusses the transition from a server to management and orchestration of containers: isolated applications or models that are packaged once with all their requirements and can subsequently be run almost anywhere. Regardless of the server, these can be replicated at will with Kubernetes, allowing effortless and almost seamless continuous accessibility of their services even under intense demand. Likewise, their number can be reduced to a minimum level when the demand temporarily or periodically dwindles in order to use computing resources elsewhere or avoid unnecessary costs.
From the capabilities of this infrastructure emerges a useful architectural paradigm called microservices. Formerly centralized applications are thus broken down into their functionalities, which provide a high degree of reusability. These can be accessed and used by different services and scale individually according to internal needs. An example of this is large and complex language models in Natural Language Processing, which can capture the context of a text regardless of its further use and thus underlie many downstream purposes. Other microservices (models), such as for text classification or summarization, can invoke them and further process the partial results.
After a brief introduction of the general terminology and functionality of Kubernetes, as well as possible use cases, the focus turns to the most common way to use Kubernetes: with cloud providers such as Google GCP, Amazon AWS, or Microsoft Azure. These allow so-called Kubernetes clusters to dynamically consume more or fewer resources, though the costs incurred remain foreseeable on a pay-per-use basis. Other common services such as data storage, versioning, and networking can also be easily integrated by the providers. Finally, the article gives an outlook on tools and further developments, which either make using Kubernetes even more efficient or further abstract and simplify the process towards serverless architectures.
Over the last 20 years, vast amounts of new technologies have surfaced in software development and deployment, which have not only multiplied and diversified the choice of services, programming languages, and libraries but have even led to a paradigm shift in many use cases or domains.
If we also look at the way software solutions, models, or work and computing loads have been deployed over the years, we can see how innovations in this area have also led to greater flexibility, scalability, and resource efficiency, among other things.
In the beginning, these were run as local processes directly on a server (shared by several applications), which posed some limitations and problems: on the one hand, one is bound to the configuration of the server and its operating system when selecting the technical tools, and on the other hand, all applications hosted on the server are limited by its memory and processor capacities. Thus, they share not only resources in total but also a possible cross-process error-proneness.
As a first further development, virtual machines can then offer a further level of abstraction: by emulating (“virtualizing”) an independent machine on the server, modularity and thus greater freedom is created for development and deployment. For example, in the choice of operating system or the programming languages and libraries used. From the point of view of the “real” server, the resources to which the application is entitled can be better limited or guaranteed. However, their requirements are also significantly higher since the virtual machine must also maintain the virtual operating system.
Ultimately, this principle has been significantly streamlined and simplified by the proliferation of containers, especially Docker. Put simply, one builds/configures a separate virtual, isolated server for an application or machine learning model. Thus, each container has its own file system and certain system libraries, but not its own operating system. This technically turns it into a sandbox whose configuration, code dependencies or errors do not affect the host server, but which at the same time can run as a relatively “lightweight” process directly on it.
So there is the possibility to copy, install, etc., everything for the desired application and provide this in a packaged container everywhere in a consistent format. This is not only extremely useful for the production environment, but we at STATWORX also like to use it in the development of more complicated projects or the proof-of-concept phase. Intermediate steps or results, such as extracting text from images, can be used as a container like a small web server by those interested in further processing of the text, such as extracting certain key information, or determining its mood or intent.
This subdivision into so-called “microservices“ with the help of containers helps immensely in the reusability of the individual modules, in the planning and development of the architecture of complex systems; at the same time, it frees the individual work steps from technical dependencies on each other and facilitates maintenance and update procedures.
After this brief overview of the powerful and versatile possibilities of deploying software, the following text will deal with how to reliably and scalably deploy these containers (i.e., applications or models) for customers, other applications, internal services or computations with Kubernetes.
Kubernetes – 8 Essential Components
Kubernetes was introduced by Google in 2014 as open-source container management software (also called container orchestration). Internally, the company had already been using tools developed in-house for years to manage workloads and applications, and regarded the development of Kubernetes not only as a convergence of best practices and lessons learned, but also as an opportunity to open up a new business segment in cloud computing.
The name Kubernetes (Greek for helmsman) was supposedly chosen in reference to a symbolic container ship, for whose optimal operation he was responsible.
When speaking of a Kubernetes instance, it is referred to as a (Kubernetes) cluster: it consists of several servers, called nodes. One of them, called the master node, is solely responsible for administrative operations, and is the interface that is addressed by the developer. All other nodes, called worker nodes, are initially unoccupied and thus flexible. While nodes are actually physical instances, mostly in data centers, the following terms are digital concepts of Kubernetes.
If an application is to be deployed on the cluster, in the simplest case the desired container is specified and then (automatically) a so-called pod is created and assigned to a node. A pod can be thought of as a running container. If several instances of the same application are to run in parallel, for example to provide better availability, the number of replicas can be specified. In this case, the specified number of pods, each with the same application, is distributed to the nodes. If the demand for the application exceeds the capacities despite replicas, even more pods can be created automatically with the Horizontal Pod Autoscaler. Especially for Deep Learning models with relatively long inference times, metrics such as CPU or GPU utilization can be monitored here, and the number of pods can be increased or decreased automatically to optimize both capacity and cost.
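The scaling decision can be illustrated with the basic rule described in the Kubernetes documentation: the desired number of replicas is the current number scaled by the ratio of the observed metric to its target, rounded up. The following Python sketch shows only this core calculation plus the minimum/maximum bounds; it ignores details such as the tolerance band and stabilization windows, and the function and parameter names are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core autoscaling rule: scale replicas by the metric ratio, then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Two pods at 90% average CPU utilization with a 50% target: scale up.
print(desired_replicas(2, current_metric=90, target_metric=50))

# Four pods at 20% utilization with the same target: scale down.
print(desired_replicas(4, current_metric=20, target_metric=50))
```

The upper bound is what keeps costs predictable: even under extreme load, the sketch never returns more than `max_replicas`.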
To avoid confusion, ultimately every running container, i.e., every workload, is a pod. In the case of deploying an application, this is technically done via a deployment, whereas temporary compute loads are jobs. Persistent stores such as databases are managed with StatefulSets. The following figure provides an overview of the terms:
Kubernetes jobs can be used to execute both one-time and recurring jobs (so-called CronJobs) in the form of a container deployment on the cluster.
In the simplest case, these can be seen as a script, which can be used for maintenance or data preparation work of databases, for example. Furthermore, they are also used for batch processing, for example when deep learning models are to be applied to larger data volumes and it is not worthwhile to keep the model continuously on the cluster. In this case, the model container is started up, gets access to the desired dataset, performs its inference on it, saves the results and shuts down. There is also flexibility here for the origin and subsequent storage of the data, so own or cloud databases, bucket/object storage or even local data and logging frameworks can be connected.
For recurring CronJobs, a simple time scheme can be specified so that, for example, certain customer data, transactions or the like are processed at night. Natural Language Processing can be used to automatically create press reviews at night, for example, which can then be evaluated the following morning: News about a company, its industry, business locations, customers, etc. can be aggregated or sourced, evaluated with NLP, summarized, and presented with sentiment or sorted by topic/content.
Even labor-intensive ETL (Extract Transform Load) processes can be performed or prepared outside business hours.
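The “simple time scheme” of a CronJob is a standard cron expression. As a sketch, the following Python snippet builds a minimal CronJob manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The job name, the image reference, and the 2 a.m. schedule are hypothetical placeholders.

```python
import json


def nightly_cronjob(name, image, schedule="0 2 * * *"):
    """Build a minimal batch/v1 CronJob manifest (cron format: min hour dom mon dow)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            # Pods of a job must use OnFailure or Never.
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# Hypothetical nightly press-review job; name and image are placeholders.
manifest = nightly_cronjob("press-review", "registry.example.com/press-review:latest")
print(json.dumps(manifest, indent=2))
```

The container starts at the scheduled time, does its work, and exits, so no resources are held between runs.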
4. Rolling Updates
If a deployment needs to be brought up to the latest version or a rollback to an older version needs to be completed, rolling updates can be triggered in Kubernetes. These guarantee continuous accessibility of the applications and models within a Continuous Integration/Continuous Deployment pipeline.
Such a rollout can be initiated and monitored smoothly in one or a few steps. By means of a rollout history it is also possible not only to jump back to a previous container version, but also to restore the previous deployment parameters, i.e. minimum and maximum number of nodes, which resource group (GPU nodes, CPU nodes with little/much RAM,…), health checks, etc.
If a rolling update is triggered, the respective existing pods are kept running and accessible until the same number of new pods are up and accessible. Here there are methods to guarantee that no requests are lost, as well as parameters that regulate a minimum accessibility or a maximum surplus of pods for the change.
Figure 5 illustrates the rolling update.
1) The current version of an application is located on the Kubernetes cluster with 2 replicas and can be accessed as usual.
2) A rolling update to version V2 is started, the same number of pods as for V1 are created.
3) As soon as the new pods have the state “Running” and, if applicable, health checks have been completed, thus being functional, the containers of the older version are shut down.
4) The older pods are removed and the resources are released again.
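The parameters mentioned above correspond to the `maxSurge` and `maxUnavailable` settings of a Deployment's rolling update strategy. The following Python sketch simulates how the numbers of old and new pods evolve under these two limits. It is a deliberately simplified model that assumes new pods become ready immediately; the function name and the step representation are illustrative.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Simulate pod counts (old, new) after each step of a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable must not both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Start new pods without exceeding replicas + max_surge pods in total.
        new += min(replicas - new, replicas + max_surge - (old + new))
        # Stop old pods while keeping at least replicas - max_unavailable pods.
        old -= min(old, (old + new) - (replicas - max_unavailable))
        steps.append((old, new))
    return steps


# The scenario from the steps above: 2 replicas, all new pods created at once.
print(rolling_update(2, max_surge=2))

# A more conservative rollout that adds only one extra pod at a time.
print(rolling_update(2, max_surge=1))
```

With `max_unavailable=0`, the total number of pods never drops below the requested replica count in this model, which is what keeps the service reachable during the change.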
The DevOps effort and time involved here are marginal. Internally, no hostnames or the like change, while from the consumer’s point of view the service can be accessed as before in the usual way (same IP, URL, …) and has merely been updated to the latest version.
5. Platform/Infrastructure as a Service
Some of our customers have strict policies or requirements regarding (data) compliance or information security, and do not want potentially sensitive data to leave the company. They may also want to prevent data traffic from flowing through non-European nodes or ending up in foreign data centers.
Experience shows, however, that this is only the case in a very small proportion of projects. Through encryption, rights management and SLAs of the operators, we consider the use of cloud services and data centers to be generally secure and also use them for larger projects. In this regard, deployment, maintenance, CI/CD pipelines are also largely identical and easy to use thanks to methods of containerization (Docker) and abstraction (Kubernetes).
All major cloud operators like Google (GCP), Amazon (AWS) and Microsoft (Azure), but also smaller providers and soon even exciting new German projects, offer very similar managed Kubernetes services. This makes it even easier to deploy and, most importantly, scale a project or model, as auto-scaling allows the cluster to expand or shrink depending on resource needs. From a technical perspective, this largely frees us from having to estimate the demand for a service, while profitability and the cost structure stay the same. Furthermore, the services can also be hosted and operated in different (geographical) zones to guarantee the fastest possible reachability and redundancy.
The cloud operators offer a large number of different node types to satisfy all resource requirements for all use cases from the simpler web service to high performance computing. Especially in the application field of Deep Learning, the ever growing models can thus always be trained and served on the required latest hardware.
For example, while we use nodes with an average CPU and low memory for smaller NLP purposes, large Transformer models can be deployed on GPU nodes in the same cluster, which effectively enables their use in the first place and at the same time can speed up inference (application of the model) by a factor of over 20. As the importance of dedicated hardware for neural networks has been steadily increasing lately, Google also provides access to its custom TPUs optimized for TensorFlow.
The organization and grouping of all these different nodes is done in Kubernetes in so-called node pools. These can be selected or specified in the deployment so that the right resources are allocated to the pods of the models.
7. Cluster Autoscaling
The extent to which models or services are used, internally or by customers, is often unpredictable or fluctuates greatly over time. With a cluster autoscaler, new nodes can be created automatically, or unneeded “empty” nodes can be removed. Here, too, a minimum number of nodes can be specified, which should always be available, as well as, if desired, a maximum number, which cannot be exceeded, to cap the costs, if necessary.
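The effect of the minimum and maximum node counts can be sketched with a rough capacity estimate. The real cluster autoscaler reacts to pods that cannot be scheduled and to underused nodes rather than to a formula like this, so the following Python snippet is only an approximation that looks at CPU requests and ignores memory and bin-packing. All names and numbers are illustrative.

```python
import math


def nodes_needed(pod_cpu_requests, cpu_per_node, min_nodes=1, max_nodes=5):
    """Estimate the node count from CPU requests, clamped to the configured bounds."""
    required = math.ceil(sum(pod_cpu_requests) / cpu_per_node)
    return max(min_nodes, min(max_nodes, required))


# Ten pods requesting half a CPU each on 4-CPU nodes.
print(nodes_needed([0.5] * 10, cpu_per_node=4))

# No workload at all: the cluster shrinks to the configured minimum.
print(nodes_needed([], cpu_per_node=4))

# Heavy demand: the configured maximum caps the node count and thus the costs.
print(nodes_needed([2] * 20, cpu_per_node=4))
```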
8. Interfacing with Other Services
In principle, cloud services from different providers can be combined, but it is more convenient and easier to use one provider (e.g. Google GCP). This means that services such as data buckets, container registry, Lambda functions can be integrated and used internally in the cloud without major authentication processes. Furthermore, especially in a microservice architecture, network communication among the individual hosts (applications, models) is important and facilitated within a provider. Access control/RBAC can also be implemented here, and several clusters or projects can be bridged with a virtual network to better separate the areas of responsibility and competence.
Environment and Future Developments
The growing use and spread of Kubernetes has brought with it a whole environment of useful tools, as well as advancements and further abstractions that further facilitate its use.
Tools and Pipelines based on Kubernetes
For example, Kubeflow can be used to trigger the training of machine learning models as a TensorFlow training job and deploy completed models with TensorFlow Serving.
The whole process can also be packaged into a pipeline that then performs training of different models with reference to training, validation and test data in memory buckets, and also monitors or logs their metrics and compares model performance. The workflow also includes the preparation of input data, so that after the initial pipeline setup, experiments can be easily performed to explore model architectures and hyperparameter tuning.
Serverless deployment methods such as Google Cloud Run or AWS Fargate take another abstraction step away from the technical requirements. With these, containers can be deployed within seconds and scale like pods on a Kubernetes cluster without the need to create or maintain the cluster itself. So the same infrastructure has once again been simplified in its use. According to the pay-per-use principle, only the time in which the code is actually called and executed is charged.
Kubernetes has become a central pillar in machine learning deployment today. The path from data and model exploration to the prototype and finally to production has been enormously streamlined and simplified by libraries such as PyTorch, TensorFlow and Keras. At the same time, these frameworks can also be applied in enormous detail, if required, to develop customized components or to integrate and adapt existing models using transfer learning. Container technologies such as Docker subsequently allow the result to be bundled with all its requirements and dependencies and executed almost anywhere without drawbacks in speed. In the final step, their deployment, maintenance, and scaling has also become immensely simplified and powerful with Kubernetes.
All of this allows us to develop our own products as well as solutions for customers in a structured way:
- The components and the framework infrastructure have a high degree of reusability
- A first milestone or proof-of-concept can be achieved in relatively little time and cost expenditure
- Further development work expands on this process in a natural way by increasing complexity
- Ready deployments scale without additional effort, with costs proportional to demand
- This results in a reliable platform with a predictable cost structure
Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:
- Entry-level Variant: Model parameters and performance metrics are logged via a R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
- Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
- Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.
Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.
Challenges Along the Machine Learning Workflow
Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.
The added value for companies comes primarily from the systematic interaction of model training, in the form of model identification, hyperparameter tuning and fitting on the training data, and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of machine learning initiatives. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is not implemented correctly, complex dependencies and costly long-term maintenance work can arise once the model goes into operation. The following risks are particularly noteworthy in this regard.
1. Ensuring Synchronicity
Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments or even the use of different technologies make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without an appropriate version control logic, synchronization can only be ensured afterward with great effort.
2. Documentation Effort
To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.
3. Heterogeneity of Model Formats
In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.
4. Recovery of Prior Results
In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.
Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLflow. The latter, in particular, is currently one of the most widely used solutions.
MLflow is an open-source project that aims to combine the best of existing ML platforms and to make integration with existing ML libraries, algorithms, and deployment tools as straightforward as possible. In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.
MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be isolated.
With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.
The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. These storage formats are referred to as model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon SageMaker.
MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.
MLflow Registry provides centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the production environment. The entire process can be controlled via a GUI or a REST API.
Examples of Machine Learning Pipelines Using MLflow
In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.
1. Scenario – Entry-Level Variant
Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.
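The logging of a single run can be sketched as follows. This is a minimal illustration, assuming that MLflow and scikit-learn are installed and that a tracking server is reachable at the hypothetical address http://localhost:5000. The experiment name follows the example above; the placeholder dataset, the hyperparameters passed in, and the artifact path "model" are illustrative choices and not part of the original setup.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def train_and_log(params, tracking_uri="http://localhost:5000"):
    """Train an SVM and log parameters, metric, and model to MLflow.

    Third-party imports are kept local so that the helper above stays
    usable without MLflow installed. The tracking URI is an assumption.
    """
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("fraud_classification")  # one experiment, many runs

    X, y = load_iris(return_X_y=True)  # placeholder data for illustration
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    with mlflow.start_run() as run:
        model = SVC(**params).fit(X_train, y_train)
        predictions = model.predict(X_test)
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy(list(y_test), list(predictions)))
        mlflow.sklearn.log_model(model, "model")  # stored in the artifact store
        return run.info.run_id  # the unique RunID of this run
```

Calling train_and_log({"C": 1.0, "kernel": "rbf"}) would create one run in the experiment and return its RunID.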
The MLflow Tracking UI lists the logged runs after a model training has been executed. The server is hosted locally in this example. Of course, it is also possible to host the server remotely, for example in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, as well as the user and the name of the underlying script, are also logged. Clicking on a specific run also displays additional information, such as the RunID and the model training duration.
If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.
The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.
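Selecting and loading the best model could look roughly like this. Note that the sketch swaps in mlflow.search_runs and mlflow.sklearn.load_model in place of the list_run_infos and pickle route described above, assuming a recent MLflow release in which search_runs accepts experiment_names. It further assumes a scikit-learn model logged under the artifact path "model"; both function names are hypothetical.

```python
def model_uri(run_id, artifact_path="model"):
    """Build the MLflow URI that points at a model logged within a run."""
    return f"runs:/{run_id}/{artifact_path}"


def load_best_model(experiment_name, metric="accuracy"):
    """Load the model of the run with the highest value of the given metric.

    MLflow is imported locally; a configured tracking server is assumed.
    """
    import mlflow
    import mlflow.sklearn

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[f"metrics.{metric} DESC"],  # best run first
        max_results=1,
    )
    if runs.empty:
        raise LookupError(f"no runs found in experiment {experiment_name!r}")
    best_run_id = runs.iloc[0]["run_id"]
    return mlflow.sklearn.load_model(model_uri(best_run_id))
```

A script like this can serve as the entry point of the Docker image mentioned above, so that the container always picks up the best logged model at start-up.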
2. Scenario – Advanced Model Management
Scenario 2 includes, in addition to the modules used in scenario 1, MLflow Model Registry as a model management component. Here, models logged in specific runs can be registered and managed. These steps can be controlled via the API or GUI. A basic requirement for using the Model Registry is that the MLflow Tracking Server uses a database as its backend store. To register a model via the GUI, select a specific run and scroll to the artifact overview.
Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.
Each model includes an overview page that shows all past versions. This is useful, for example, to track which models were in production when.
If you now select a model version, you will get to an overview where, for example, a model description can be added. The Source Run link also takes you to the run from which the model was registered. Here you will also find the associated artifact, which can be used later for deployment.
In addition, individual model versions can be categorized into defined phases in the Stage area. This feature can be used, for example, to determine which model is currently being used in production or is to be transferred there. For deployment, in contrast to scenario 1, versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter the desired model and its associated RunID. Similar to scenario 1, deployment can then be completed to, for example, Amazon SageMaker or AzureML via the respective Python APIs.
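The stage-based lookup can be sketched as follows. The example assumes an MLflow release whose model versions still expose the current_stage attribute (newer releases favor aliases over stages) and a model registered under a name of your choice. The helper latest_in_stage and the demo objects are hypothetical and only mimic the fields the lookup needs.

```python
from types import SimpleNamespace


def latest_in_stage(versions, stage="Production"):
    """Return the highest-numbered model version in a stage, or None."""
    candidates = [v for v in versions if v.current_stage == stage]
    return max(candidates, key=lambda v: int(v.version), default=None)


def load_production_model(name):
    """Fetch the registered versions of a model and load the production one.

    MLflow is imported locally; a tracking server with a database-backed
    store is assumed to be configured.
    """
    import mlflow.pyfunc
    from mlflow.tracking import MlflowClient

    versions = MlflowClient().search_model_versions(f"name='{name}'")
    chosen = latest_in_stage(versions)
    if chosen is None:
        raise LookupError(f"no production version registered for {name!r}")
    return mlflow.pyfunc.load_model(f"models:/{name}/{chosen.version}")


# Stand-in objects carrying only the attributes used by the filter.
demo_versions = [
    SimpleNamespace(version="1", current_stage="Archived", run_id="a1"),
    SimpleNamespace(version="2", current_stage="Production", run_id="b2"),
    SimpleNamespace(version="3", current_stage="Production", run_id="c3"),
    SimpleNamespace(version="4", current_stage="Staging", run_id="d4"),
]
```

The run_id of the selected version links back to the run from which the model was registered, which is the same information the Source Run link provides in the GUI.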
3. Scenario – Collaborative Workflow Management
In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local environment can act as a project and be controlled by an MLproject file. Here, package dependencies can be recorded in a conda.yaml, and the MLproject file can be accessed when starting the project. Then the corresponding conda environment is created with all dependencies before training and logging the model. This avoids the need for manual alignment of the development environments of all developers involved and also guarantees standardized and comparable results of all runs. Especially the latter is necessary for the deployment context since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile. This offers the advantage that package dependencies independent of Python can also be defined. Likewise, MLflow Projects allow the use of different commit hashes or branch names to use other project states, provided a Git repository is used.
An interesting use case is the modularized development of machine learning training pipelines. For example, data preparation can be decoupled from model training and developed in parallel, while another team uses a different branch name to train the model. In this case, only a different branch name must be used as a parameter when starting the project via the MLproject file. The final data preparation can then be pushed to the same branch name used for model training and would thus already be fully implemented in the training pipeline. The deployment can also be controlled as a sub-module within the project pipeline through a Python script via the MLproject file and can be carried out analogously to scenario 1 or 2 on a platform of your choice.
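Starting a project from a specific branch can be sketched with the Python API mlflow.projects.run. The repository URL, the branch names, and the helper functions below are hypothetical; the sketch assumes that the repository contains an MLproject file with an entry point called main.

```python
def project_run_args(branch, repo="https://github.com/example-org/ml-pipeline"):
    """Assemble keyword arguments for mlflow.projects.run (hypothetical repo)."""
    return {
        "uri": repo,            # Git repository that holds the MLproject file
        "version": branch,      # branch name or commit hash to check out
        "entry_point": "main",  # entry point defined in the MLproject file
    }


def run_project(branch):
    """Run the project state stored on the given branch.

    MLflow is imported locally; it creates the environment described in the
    project before executing the entry point.
    """
    import mlflow.projects

    return mlflow.projects.run(**project_run_args(branch))
```

Under these assumptions, run_project("data-prep-v2") and run_project("main") would execute two different states of the same pipeline without any change to the calling code.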
Conclusion and Outlook
MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.
In this article, three machine learning workflows of increasing complexity were presented as examples. MLflow services can support everything from simple logging of results in an interactive UI to more complex, modular modeling pipelines. Naturally, there are also synergies outside the MLflow ecosystem with other tools, such as Docker/Kubernetes for model scaling or Jenkins for CI/CD pipeline control. If there is further interest in MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.
- Hidden Technical Debt in Machine Learning Systems (2015) by D. Sculley et al. (https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)
Artificial intelligence (AI) is no longer a vision of the future for German companies. According to a survey by Deloitte of around 2,700 AI experts from nine countries, over 90 percent of those surveyed say that their company uses or plans to use technologies from one of the areas of Machine Learning (ML), Deep Learning, Natural Language Processing (NLP) and Computer Vision. This high percentage cannot be explained solely by the fact that the companies have recognized the potential of AI. Instead, there are also significantly more standardized solutions available for the use of these technologies. This development has led to the fact that the entry barrier has been lowered more and more in recent years.
For example, the three major cloud providers – Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure – offer standardized solutions for certain problems (e.g., object recognition on images, translation of texts, and automated machine learning). So far, not all problems can be solved with the help of such standardized applications. There can be various reasons for this: The most common reason is that the available standard solutions do not fit the desired problem. For example, in the field of NLP, the classification of entire texts is often available as a standard solution. If, on the other hand, a classification is not to take place on the text-level but the word-level, other models are required for this purpose, which are not always available as part of standard solutions. And even if these are available, the possible categories are usually predefined and cannot be further adapted. So, a service built for the classification of words into the categories of place, person, and time cannot be used to classify words in the categories of customer, product, and price. Many companies, therefore, continue to rely on developing their own ML models. Since the development of models often takes place on local computers, it must be ensured that these models are not only available to the developer. Once a model has been developed, a significant challenge is to make the model available to different users, since only then will the model add value for the company.
ML & AI projects in the company have their own challenges in both development and deployment. While development often fails due to the lack of suitable data availability, deployment can fail because a model is not compatible with the production environment. For example, machine learning models are mostly developed with open source languages or new ML frameworks (e.g., Dataiku or H2O), while an operational production environment often works with proprietary software that has been tested and proven over many years. The close integration of these two worlds often presents both components with significant challenges. Therefore, it is essential to link the development of ML models with the work of IT Operations. This process is called MLOps because data scientists work together with IT to make models productively usable.
MLOps is an ML development culture and practice whose goal is to link the development of ML systems (Dev) and the operation of ML systems (Ops). In practice, MLOps means focusing on automation and monitoring. This principle extends to all steps of ML system configuration, such as integration, testing, sharing, deployment, and infrastructure management. The code of a model is one of many other components, as illustrated in Figure 1. The figure shows other steps of the MLOps process in addition to the ML code and illustrates that the ML code itself is a relatively small part of the overall process.
Figure 1: Important components of the MLOps process
Further aspects of MLOps include, for example, the continuous provision and quality checking of the data, as well as the testing and, if necessary, debugging of the model. Docker containers have emerged as a core technology for the provision of custom-developed ML models and are therefore presented in this article.
Why Docker Containers?
The challenge in providing ML models is that a model is written in a specific version of a programming language. This language is usually not available in the production environment and therefore has to be installed first. In addition, the model has its own libraries, runtimes, and other technical dependencies, which also have to be installed in the production environment. Docker solves this problem via so-called containers, in which applications, including all their components, can be packaged in isolation and made available as separate services. These containers contain all components that the application or ML model needs to run, including code, libraries, runtimes, and system tools. Containers can therefore be used to provide custom models and algorithms in any environment without worrying about missing or incompatible libraries leading to errors.
Figure 2: Comparison of Docker Containers and virtual machines
Before Docker’s triumphant success, virtual machines were long the tool of choice to deliver applications and ML models in isolation. However, Docker has proven to have several advantages over virtual machines. These include improved resource utilization, scalability, and faster deployment of new software. In the following, the three points will be examined in more detail.
Improved resource utilization
Figure 2 schematically compares how applications can run in Docker Containers and virtual machines. Virtual machines have their own guest operating system on which different applications run. Virtualizing the guest operating system at the hardware level requires a lot of computing power and memory. Therefore, fewer applications can run simultaneously on a virtual machine while maintaining the same efficiency.
On the other hand, Docker Containers share the host operating system and do not require a separate operating system. Therefore, applications in Docker Containers boot faster and use less processing power and memory due to the host’s shared operating system. This lower resource utilization makes it possible to run several applications in parallel on a server, which improves the utilization rate of a server.
Scalability
Containers offer a further advantage in the area of scaling: If an ML model is to be used more frequently within the company, the application must be able to handle the additional requests. Fortunately, ML models running in Docker can be scaled easily by starting additional containers with the same application. Kubernetes in particular, an open-source technology for container orchestration and scalable web services delivery, is suitable for flexible scaling due to its compatibility with Docker. With Kubernetes, web services can be scaled up or down flexibly and automatically based on the current workload.
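How workload-based scaling decides on the number of containers can be illustrated with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents: the desired replica count is the current count multiplied by the ratio of observed to target load, rounded up. The sketch below is a simplification that ignores the autoscaler's tolerance band and stabilization behavior; the function name and the replica limits are assumptions.

```python
import math


def desired_replicas(current_replicas, current_load, target_load,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to the allowed replica range.

    current_load and target_load are per-container averages of the same
    metric, for example CPU utilization in percent or requests per second.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))
```

With two containers running at 90 percent utilization and a target of 60 percent, the rule asks for three containers; when the load drops again, the same rule scales the service back down.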
Deployment of new software
Another advantage is that containers can be pushed seamlessly from local development machines to production machines. Therefore, they are easy to exchange, for example, when a new version of the model is to be provided. The isolation of the code and all dependencies in a container also leads to a more stable environment in which the model can be operated. As a result, errors due to, for example, incorrect versions of individual libraries occur less frequently and can be corrected more effectively.
The model is provided within a container as a web service that other users and applications can access via common Internet protocols (e.g., HTTP). In this way, the model can be accessed as a web service by other systems and users without the need for them to meet specific technical requirements. Thus, it is unnecessary to install libraries or the model’s programming language to make the model usable.
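A minimal sketch of such a web service, using only the Python standard library, is shown below. The predict function is a stand-in for a trained model, and the endpoint name /predict as well as the JSON layout are assumptions made for illustration. Inside a container, the server would typically bind to 0.0.0.0 on a published port instead of the loopback address used here.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a trained model: flags inputs whose sum exceeds 1.0."""
    return {"label": int(sum(features) > 1.0)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = predict(payload["features"])
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected JSON with a 'features' list")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, features):
    """Client side: any system that speaks HTTP and JSON can do the same."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Bypass any configured HTTP proxy for the local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

The caller only needs to send a JSON document over HTTP; neither the model's libraries nor its programming language have to be present on the client side.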
In addition to Docker, there are other container technologies such as rkt and Mesos. However, Docker, with its user-friendly operation and detailed documentation, makes it easy for new developers to get started. Due to the large user base, templates exist for many standard applications that can be run in containers with little effort. At the same time, these free templates serve as a basis for developing your own applications.
Not least because of these advantages, Docker is now considered best practice in the MLOps process. The process of model development increasingly resembles the software development process, not least because of Docker. This becomes clear by the fact that container-based applications are supported by standard tools for the continuous integration and provision (CI/CD) of web services.
What role do Docker Containers play in the MLOps pipeline?
As already mentioned, MLOps is a complex process of continuous provision of ML models. The central components of such a system are illustrated in Figure 1. The MLOps process is very similar to the DevOps process because the development of machine learning systems is also a form of software development. Standard concepts from the DevOps area, such as continuous integration of new code and provision of new software, can be found in the MLOps process. New ML-specific components such as continuous model training and model and data validation are added.
It is considered best practice to embed the development of ML models in an MLOps pipeline. The MLOps pipeline includes all steps from the provision and transformation of data and model training to the continuous provision of finished models on production servers. The code for each step in the pipeline is packed in a Docker container, and the pipeline starts the containers in a defined order. Here, Docker Containers show their strength. By isolating the code within individual containers, code changes can be continuously incorporated at the appropriate points of the pipeline without replacing the entire pipeline. Therefore, the costs for pipeline maintenance are relatively low. The major cloud providers (GCP, AWS, and Microsoft Azure) also offer services that allow Docker Containers to be automatically built, deployed, and hosted as web services. To make container scaling easier and as flexible as possible, cloud providers also offer fully managed Kubernetes products. For the use of ML models in the enterprise, this flexibility means cost savings, as an ML application is simply scaled down when the usage rate drops. Similarly, higher demand can be met by providing additional containers without having to stop the container with the model, so users of the application do not experience unnecessary downtime.
For the development of machine learning models and MLOps pipelines, Docker containers are a core technology. The advantages are portability, modularization, and isolation of model code, low maintenance when integrated into pipelines, faster deployment of new versions of the model, and scalability via serverless cloud products for container deployment. At STATWORX, we have recognized the potential of Docker Containers and are actively using them. With this knowledge, we support our customers in the realization of their machine learning and AI projects. Do you want to use Docker in your MLOps pipeline? Our Academy offers remote training on Data Science with Docker as well as free webinars on MLOps and Docker.
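The idea of a pipeline that starts one container per step in a defined order can be sketched in a few lines. The step names and image tags are hypothetical, and the sketch assumes a local Docker installation; real pipelines are usually driven by an orchestration tool, which this example does not attempt to replace.

```python
import subprocess

# Ordered pipeline steps as (container name, image); all names are placeholders.
PIPELINE = [
    ("prepare-data", "example-registry/prepare-data:1.0"),
    ("train-model", "example-registry/train-model:1.0"),
    ("validate-model", "example-registry/validate-model:1.0"),
]


def build_commands(steps):
    """Translate the pipeline definition into one docker command per step."""
    return [["docker", "run", "--rm", "--name", name, image]
            for name, image in steps]


def run_pipeline(steps):
    """Run the containers one after another and stop at the first failure."""
    for command in build_commands(steps):
        subprocess.run(command, check=True)  # raises if a step exits non-zero
```

Replacing a single step then only means changing one image tag in the list, while the rest of the pipeline stays untouched.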
Here at STATWORX, a Data Science and AI consulting company, we thrive on creating data-driven solutions that can be acted on quickly and translate into real business value. We provide many of our solutions in some form of web application to our customers, to allow and support them in their data-driven decision-making.
Containerization Allows Flexible Solutions
At the start of a project, we typically need to decide where and how to implement the solution we are going to create. There are several good reasons to deploy the designed solutions directly into our customer IT infrastructure instead of acquiring an external solution. Often our data science solutions use sensitive data. By deploying directly to the customers’ infrastructure, we make sure to avoid data-related compliance or security issues. Furthermore, it allows us to build pipelines that automatically extract new data points from the source and incorporate them into the solution so that it is always up to date.
However, this also imposes some constraints on us. We need to work with the infrastructure provided by our customers. On the one hand, that requires us to develop solutions that can exist in all sorts of different environments. On the other hand, we need to adapt to changes in the customers’ infrastructure quickly and efficiently. All of this can be achieved by containerizing our solutions.
The Advantages of Containerization
Containerization has evolved as a lightweight alternative to virtualization. It involves packaging up software code and all its dependencies in a “container” so that the software can run on practically any infrastructure. Traditionally, an application was developed in a specific development environment and then transferred to the production environment, often resulting in many bugs and errors, especially when these environments did not mirror each other, for example, when an application was transferred from a local desktop computer to a virtual machine or from a Linux to a Windows operating system.
A container platform like Docker allows us to store the whole application with all the necessary code, system tools, libraries, and settings in a container that can be shipped to and work uniformly in any environment. We can develop our applications dockerized and do not have to worry about the specific infrastructure environment provided by our customers.
There are some other advantages that come with using Docker in comparison to traditional virtual machines for the deployment of data science applications.
- Efficiency – As the container shares the host machine’s OS kernel and does not require a Guest OS per application, it uses the provided infrastructure more efficiently, resulting in lower infrastructure costs.
- Speed – The start of a container does not require a Guest OS reboot; it can be started, stopped, replicated, and destroyed in seconds. That speeds up the development process, the time to market, and the operational speed. Releasing new software or updates has never been so fast: Bugs can be fixed, and new features implemented in hours or days.
- Scalability – Horizontal scaling allows additional containers to be started and stopped depending on the current demand.
- Security – Docker provides the strongest default isolation capabilities in the industry. Containers run isolated from each other, which means that if one crashes, other containers serving the same applications will still be running.
The Key Benefits of a Microservices Architecture
In connection with the use of Docker for delivering data science solutions, we use another emerging method. Instead of providing a monolithic application that comes with all the required functionalities, we create small, independent services that communicate with each other and together embody the complete application. Usually, we develop WebApps for our customers. As shown in the graphic, the WebApp communicates directly with the various backend microservices. Each one is designed for a specific task and has an exposed REST API that allows for different HTTP requests.
Furthermore, the backend microservices are indirectly exposed to the mobile app. An API Gateway routes the requests to the desired microservices. It can also provide an API endpoint that invokes several backend microservices and aggregates the results. Moreover, it can be used for access control, caching, and load balancing. If suitable, you might also decide to place an API Gateway between the WebApp and the backend microservices.
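The routing and aggregation role of an API Gateway can be illustrated with a small in-process sketch. The two services, the paths, and the returned values are hypothetical; in a real deployment, each handler would be an HTTP call to the REST API of a separate container, and access control, caching, and load balancing would sit in the same place.

```python
def customer_service(request):
    """Hypothetical microservice: returns master data for a customer."""
    return {"customer_id": request["customer_id"], "segment": "B2B"}


def forecast_service(request):
    """Hypothetical microservice: returns a placeholder demand forecast."""
    return {"customer_id": request["customer_id"], "forecast": [12, 15, 11]}


# Routing table of the gateway: public path -> backend microservice.
ROUTES = {"/customers": customer_service, "/forecasts": forecast_service}


def gateway(path, request):
    """Route a request to one microservice or aggregate several of them."""
    if path == "/overview":
        # Aggregating endpoint: invoke every backend and combine the results.
        body = {route.lstrip("/"): handler(request)
                for route, handler in ROUTES.items()}
        return 200, body
    handler = ROUTES.get(path)
    if handler is None:
        return 404, {"error": f"no service registered for {path}"}
    return 200, handler(request)
```

A client such as the mobile app then needs to know only the gateway's paths, not the addresses of the individual microservices.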
In summary, splitting the application into small microservices has several advantages for us:
- Agility – As services operate independently, we can update or fix bugs for a specific microservice without redeploying the entire application.
- Technology freedom – Different microservices can be based on different technologies or languages, thus allowing us to use the best of all worlds.
- Fault isolation – If an individual microservice becomes unavailable, it will not crash the entire application. Only the function provided by the specific microservice will not be provided.
- Scalability – Services can be scaled independently. It is possible to scale the services which do the work without scaling the application.
- Reusability of services – Often, the functionalities of the services we create are also requested by other departments and for other use cases. We then expose application programming interfaces (APIs) so that the services can also be used independently of the focal application.
Containerized Microservices – The Best of Both Worlds!
The combination of Docker with a clean microservices architecture allows us to combine the mentioned advantages. Each microservice lives in its own Docker container. We deliver fast solutions that are consistent across environments, efficient in terms of resource consumption, and easily scalable and updatable. We are not bound to a specific infrastructure and can adjust to changes quickly and efficiently.
Often the deployment of a data science solution is one of the most challenging tasks within data science projects. But without a proper deployment, there won’t be any business value created. Hopefully, I was able to help you figure out how to optimize the implementation of your data science application. If you need further help bringing your data science solution into production, feel free to contact us!