Data Science, Machine Learning, and AI

Monitoring Data Science workflows

Long gone are the days when a data science project only consisted of loosely coupled notebooks for data processing and model training that the data scientists ran occasionally on their laptops. With maturity, the projects have grown into big software projects with multiple contributors and dependencies, and multiple modules with numerous classes and functions. The Data Science workflow, usually starting with data pre-processing, feature engineering, model training, tuning, evaluating, and lastly inferring– referred to as an ML pipeline, is being modularized. This modularization makes the process more scalable and automatable, ergo suitable to run in container orchestration systems or on cloud infrastructure. Extracting valuable model or data-related KPIs, if done manually, can be a labor-intense task, more so if with increasing and/or automated re-runs. This information is important for comparing different models and observing trends like distribution shifts in the training data. It can also be used for detecting unusual values, from imbalanced classes to inflated outlier deletions – whatever might be deemed necessary to ensure a model´s robustness. Libraries like MLFlow can be used to store all these sorts of metrics. Besides, operationalizing ML pipelines heavily relies on tracking run-related information for efficient troubleshooting, as well as maintaining or improving the pipeline´s resource consumption.

This not only holds in the world of data science. Today’s microservice-based architectures also add to the fact that maintaining and managing deployed code requires unified supervision. It can drastically cut the operation and maintenance hours needed by a DevOps team due to a more holistic understanding of the processes involved, plus simultaneously reduce error-related downtimes.

It is important to understand how different forms of monitoring aim to tackle the above-stated implications and how model- and data-related metrics can fit this objective too. In fact, while MLFlow has been established as the industry standard for supervision of ML-related metrics, tracking them along with all the operational information can be appealing as well.

Logs vs Metrics

Logs provide an event-based snapshot – Metrics give a bird’s eye overview

A log is a point in time, written out record (e.g., stdout/stderr) of an event that occurs discontinuously, at no pre-defined intervals. Depending on the application, logs carry information such as timestamp, trigger, name, description, and/or result of the event. Events can be anything from simple requests to user logins that the developer of the underlying code deemed important. When following best practices during this process, it can save a lot of hassle and time in setting up downstream monitoring tools. Using dedicated log libraries and writing meaningful log messages fits the bill

INFO[2021-01-06T17:44:13.368024402-08:00] starting *secrets.YamlSecrets               
INFO[2021-01-06T17:44:13.368679356-08:00] starting *config.YamlConfig                 
INFO[2021-01-06T17:44:13.369046236-08:00] starting *s3.DefaultService                 
INFO[2021-01-06T17:44:13.369518352-08:00] starting *lambda.DefaultService             
ERROR[2021-01-06T17:44:13.369694698-08:00] http server error   error="listen tcp bind: address already in use"

Fig. 1: textual event logs

Although the data footprint of a single log is negligible, log streams can exponentiate rapidly. This results in the fact that storing every single log does not scale well, especially in the shape of semi-structured text data. For debugging or auditing, however, storing logs as-is might be unavoidable. Archive storage solutions or retention periods can help.

In other cases, parsing and extracting logs on the move into other formats, like key-value pairs, further addresses these limitations. It can also preserve a lot of the event´s information while having a much lower footprint.

  time: "2021-01-06T17:44:13-08:00",
  mode: "reader",
  debug_http_error: "listen tcp bind: address already in use"
  servicePort: 8089,
  duration_ms: 180262

Fig. 2: structured event logs

Another form of reducing this footprint can be done through sampling methods, with metrics being the most prominent representatives.

A metric represents a numeric measure of a particular target (specific event) evenly distributed over intervals of time. Mathematical aggregations like sums or averages are common transformations that keep such metrics relatively small data-wise

   time: "2022-01-06T17:44:13-08:00",
   Duration_ms: 60,
   sum_requests: 500,
   sum_hits_endpoint_1: 250,
   sum_hits_endpoint_2: 117,
   avg_duration: 113,

Fig. 3: metrics

Thus, metrics are well suited for gradually reducing the data resolution into wider frequencies like daily, weekly, or even longer periods of analysis. Additionally, metrics tend to be better unifiable across multiple applications as they carry highly structured data compared to raw log messages. While this reduces the issues mentioned before, it does come at the cost of granularity. This makes metrics perfect for high-frequency events where a single event´s information is less important. Monitoring compute resources is an example of this. Both takes have their right to exist in any monitoring setup as different use cases fit the different objectives. Consider this more tangible example of a small shop to showcase their main differences:

The total balance of a bank account may fluctuate over time due to withdraws and deposits (that can occur at any point in time). If one is only concerned that there is money in the account, tracking an aggerated metric periodically should be sufficient. If one is interested in the total inflow linked to a specific client, though, logging every transaction is inevitable.

Architecture and tool stack

In most modern cloud stacks, such as Azure Appservice, most logging on infrastructure and request-side is shipped with the service itself. This can come costly with increasing volume, however. Defining the use cases, understanding the deployment environment, and matching it with the logging architecture is part of the job instruction of DevOps teams.

From a developer’s perspective, there are plenty of open-source tools that can deliver valuable monitoring solutions that only need some effort for orchestration. Leaner setups can consist of only a backend server like a time series database and a tool for visualization. More complex systems can incorporate multiple logging systems with multiple dedicated log shippers, alert managers, and other intermediate components (see picture). Some of these tools might be necessary for making logs accessible in the first place or for unifying different log streams. Understanding the workings and service area of each component is, therefore, a pivotal part.

Fig. 4: Monitoring flow of applications deployed in a Kubernetes Cluster (altered, from

Database & Design

Logs, at least when following the best practices of including a timestamp, and metrics are usually time series data that can be stored in a time series database. Although in cases where textual logs get stored as-is, other architectures utilize document-oriented types of storage with a powerful query engine on top (like ElasticSearch). Besides storage-related differences, the backend infrastructure is split into two different paradigms: push and pull. These paradigms address the questions of who is responsible (client or backend) for ingesting the data initially.

Choosing one over the other depends on the use case or type of information that should be persisted. For instance, push services are well apt for event logging where the information of a single event is important. However, this makes them also more prone to get overwhelmed by receiving too many requests which lowers robustness. On the other hand, pull systems are perfectly fit for scraping periodical information which is in line with the composition of metrics.

Dashboard & Alerting

To better comprehend the data and spot any irregularities, dashboards come in quite handy. Monitoring systems are largely suited for simple, “less complex” querying as performance matters. The purpose of these tools is specialized for the problems being tackled and they offer a more limited inventory than some of the prominent software like PowerBI. This does not make them less powerful in their area of use, however. Tools like Grafana, which is excellent at handling log-based metrics data, can connect to various database backends and build customized solutions dating from multiple sources. Tools like Kibana, which have their edge in text-based log analyses, provide users with a large querying toolkit for root cause analysis and diagnostics. It is worth mentioning that both tools expand their scope to support both worlds.

Fig. 5 Grafana example dashboard (

While monitoring is great at spotting irregularities (proactive) and targeted analysis of faulty systems (reactive), being informed about application failures right when they occur allows DevOps teams to take instant action. Alert managers provide the capability of poking for events and triggering alerts on all sorts of different communication channels, such as messaging, incidents managing programs, or via plain email.

Scrapers, Aggregators, and Shippers

Given the fact that not every microservice exposes an endpoint where logs and log-based metrics can be assessed or extracted – remember the differences between push and pull – intermediaries must chip in. Services like scrapers extract and format logs from different sources, aggregators perform some sort of combining actions (generating metrics) and shippers can pose as a push service for push-based backends. Fluentd is a perfect candidate that incorporates all the mentioned capabilities while still maintaining a smallish footprint.

End-to-end monitoring

There are paid-tier services that make a run at providing a holistic one-fits-all system for any sort of application, architecture, and independently from cloud vendors, which can be a game changer for DevOps teams. However, leaner setups can also do a cost-effective and reliable job.

When ruling out the necessity of collecting full-text logs, many standard use cases can be realized with a time series database as the backend. InfluxDB is well suited and easy to spin up with mature integrability into Grafana. Grafana, as a dashboard tool, pairs well with Prometheus´ alter manager service. As an intermediary, fluentd is perfectly fitted to extract the textual logs and perform the necessary transformations. As InfluxDB is push-based, fluentd also takes care that the data get into InfluxDB.

Building on said tools, the example infrastructure covers everything from the Data Science pipeline to the later deployed model APIs, with Dashboards dedicated to each use case. Before a new trainings-run gets approved for production, the ML metrics mentioned at the beginning, provide a good entry point to observe the model´s legitimacy. Simple user statistics, like total and unique requests, give a fair overview of its usage once the model is deployed. By tracking response times, e.g. of an API call, bottlenecks can be disclosed easily.

At the resource level, the APIs along with each pipeline step are monitored to observe any irregularities, like sudden spikes in memory consumption. Tracking the resources over time can also determine whether the types of VM that are being used are over- or underutilized. Optimizing these metrics can potentially cut unnecessary costs. Lastly, pre-defined failure events, such as an unreachable API or failed trainings runs should trigger an alert with an Email being sent out.


Fig. 6: Deployed infrastructure with logging streams and monitoring stack.

The entire architecture, consisting of the monitoring infrastructure, the data science pipeline, and deployed APIs, can all run in a (managed) Kubernetes cluster. From a DevOps perspective, knowing Kubernetes is already half the battle. This open-source stack can be scaled up and down and is not bound to any paid-tier subscription model which provides great flexibility and cost-efficiency. Plus, onboarding new log streams, deployed apps or multiple pipelines can be done painlessly. Even single frameworks could be swapped out. For instance, if Grafana is not suitable anymore, just use another visualization tool that can integrate with the backend and matches the use case requirements.


Logging and monitoring are pivotal parts of modern infrastructures not just since applications were modularized and shipped into the cloud. Yet, they surely exacerbate the struggles of not being set up properly. In addition to the increasing operationalization of the ML workflow, the need for organizations to establish well-thought-out monitoring solutions in order to keep track of models, data, and everything around them is also growing steadily.

While there are dedicated platforms designed to address these challenges, the charming idea behind the presented infrastructure is that it consists of only a single entry point for Data Science, MLOps, and Devops teams and is highly extensible. Benedikt Müller Benedikt Müller

Text classification is one of the most common applications of natural language processing (NLP). It is the task of assigning a set of predefined categories to a text snippet. Depending on the type of problem, the text snippet could be a sentence, a paragraph, or even a whole document. There are many potential real-world applications for text classification, but among the most common ones are sentiment analysis, topic modeling and intent, spam, and hate speech detection.

The standard approach to text classification is training a classifier in a supervised regime. To do so, one needs pairs of text and associated categories (aka labels) from the domain of interest as training data. Then, any classifier (e.g., a neural network) can learn a mapping function from the text to the most likely category. While this approach can work quite well for many settings, its feasibility highly depends on the availability of those hand-labeled pairs of training data.

Though pre-trained language models like BERT can reduce the amount of data needed, it does not make it obsolete altogether. Therefore, for real-world applications, data availability remains the biggest hurdle.

Zero-Shot Learning

Though there are various definitions of zero-shot learning1, it can broadly speaking be defined as a regime in which a model solves a task it was not explicitly trained on before.

It is important to understand, that a “task” can be defined in both a broader and a narrower sense: For example, the authors of GPT-2 showed that a model trained on language generation can be applied to entirely new downstream tasks like machine translation2. At the same time, a narrower definition of task would be to recognize previously unseen categories in images as shown in the OpenAI CLIP paper3.

But what all these approaches have in common is the idea of extrapolation of learned concepts beyond the training regime. A powerful concept, because it disentangles the solvability of a task from the availability of (labeled) training data.

Zero-Shot Learning for Text Classification

Solving text classification tasks with zero-shot learning can serve as a good example of how to apply the extrapolation of learned concepts beyond the training regime. One way to do this is using natural language inference (NLI) as proposed by Yin et al. (2019)4. There are other approaches as well like the calculation of distances between text embeddings or formulating the problem as a cloze

In NLI the task is to determine whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise5. A typical NLI dataset consists of sentence pairs with associated labels in the following form:

Examples from

Yin et al. (2019) proposed to use large language models like BERT trained on NLI datasets and exploit their language understanding capabilities for zero-shot text classification. This can be done by taking the text of interest as the premise and formulating one hypothesis for each potential category by using a so-called hypothesis template. Then, we let the NLI model predict whether the premise entails the hypothesis. Finally, the predicted probability of entailment can be interpreted as the probability of the label.

Zero-Shot Text Classification with Hugging Face 🤗

Let’s explore the above-formulated idea in more detail using the excellent Hugging Face implementation for zero-shot text classification.­

We are interested in classifying the sentence below into pre-defined topics:

topics = ['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport', 'Inland', 'Etat', 'Wissenschaft', 'Kultur']
test_txt = 'Eintracht Frankfurt gewinnt die Europa League nach 6:5-Erfolg im Elfmeterschießen gegen die Glasgow Rangers'

Thanks to the 🤗 pipeline abstraction, we do not need to define the prediction task ourselves. We just need to instantiate a pipeline and define the task as zero-shot-text-classification. The pipeline will take care of formulating the premise and hypothesis as well as deal with the logits and probabilities from the model.

As written above, we need a language model that was pre-trained on an NLI task. The default model for zero-shot text classification in 🤗 is bart-large-mnli. BART is a transformer encoder-decoder for sequence-2-sequence modeling with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder6. The mnli suffix means that BART was then further fine-tuned on the MultiNLI dataset7.

But since we are using German sentences and BART is English-only, we need to replace the default model with a custom one. Thanks to the 🤗 model hub, finding a suitable candidate is quite easy. In our case, mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 is such a candidate. Let’s decrypt the name shortly for a better understanding: it is a multilanguage version of DeBERTa-v3-base (which is itself an improved version of BERT/RoBERTa8) that was then fine-tuned on two cross-lingual NLI datasets (XNLI8 and multilingual-NLI-26lang10).

With the correct task and the correct model, we can now instantiate the pipeline:

from transformers import pipeline
model = 'MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7'
pipe = pipeline(task='zero-shot-classification', model=model, tokenizer=model)

Next, we call the pipeline to predict the most likely category of our text given the candidates. But as a final step, we need to replace the default hypothesis template as well. This is necessary since the default is again in English. We, therefore, define the template as 'Das Thema is {}'. Note that, {} is a placeholder for the previously defined topic candidates. You can define any template you like as long as it contains a placeholder for the candidates:

template_de = 'Das Thema ist {}'
prediction = pipe(test_txt, topics, hypothesis_template=template_de)

Finally, we can assess the prediction from the pipeline. The code below will output the three most likely topics together with their predicted probabilities:

print(f'Zero-shot prediction for: \n {prediction["sequence"]}')
top_3 = zip(prediction['labels'][0:3], prediction['scores'][0:3])
for label, score in top_3:
    print(f'{label} - {score:.2%}')
Zero-shot prediction for: 
 Eintracht Frankfurt gewinnt die Europa League nach 6:5-Erfolg im Elfmeterschießen gegen die Glasgow Rangers
Sport - 77.41%
International - 15.69%
Inland - 5.29%

As one can see, the zero-shot model produces a reasonable result with “Sport” being the most likely topic followed by “International” and “Inland”.

Below are a few more examples from other categories. Like before, the results are overall quite reasonable. Note how for the second text the model predicts an unexpectedly low probability of “Kultur”.

further_examples = ['Verbraucher halten sich wegen steigender Zinsen und Inflation beim Immobilienkauf zurück',
                    '„Die bitteren Tränen der Petra von Kant“ von 1972 geschlechtsumgewandelt und neu verfilmt',
                    'Eine 541 Millionen Jahre alte fossile Alge weist erstaunliche Ähnlichkeit zu noch heute existierenden Vertretern auf']

for txt in further_examples:
    prediction = pipe(txt, topics, hypothesis_template=template_de)
    print(f'Zero-shot prediction for: \n {prediction["sequence"]}')
    top_3 = zip(prediction['labels'][0:3], prediction['scores'][0:3])
    for label, score in top_3:
        print(f'{label} - {score:.2%}')
Zero-shot prediction for: 
  Verbraucher halten sich wegen steigender Zinsen und Inflation beim Immobilienkauf zurück 
Wirtschaft - 96.11% 
Inland - 1.69% 
Panorama - 0.70% 

Zero-shot prediction for: 
  „Die bitteren Tränen der Petra von Kant“ von 1972 geschlechtsumgewandelt und neu verfilmt 
International - 50.95% 
Inland - 16.40% 
Kultur - 7.76% 

Zero-shot prediction for: 
  Eine 541 Millionen Jahre alte fossile Alge weist erstaunliche Ähnlichkeit zu noch heute existierenden Vertretern auf 
Wissenschaft - 67.52% 
Web - 8.14% 
Inland - 6.91%

The entire code can be found on GitHub. Besides the examples from above, you will find there also applications of zero-shot text classifications on two labeled datasets including an evaluation of the accuracy. In addition, I added some prompt-tuning by playing around with the hypothesis template.

Concluding Thoughts

Zero-shot text classification offers a suitable approach when either training data is limited (or even non-existing) or as an easy-to-implement benchmark for more sophisticated methods. While explicit approaches, like fine-tuning large pre-trained models, certainly still outperform implicit approaches, like zero-shot learning, their universal applicability makes them very appealing.

In addition, we should expect zero-shot learning, in general, to become more important over the next few years. This is because the way we will use models to solve tasks will evolve with the increasing importance of large pre-trained models. Therefore, I advocate that already today zero-shot techniques should be considered part of every modern data scientist’s toolbox.




Fabian Müller Fabian Müller Fabian Müller Fabian Müller Fabian Müller Fabian Müller

What to expect:

AI, data science, machine learning, deep learning and cybersecurity talents from FrankfurtRhineMain take note!

The AI Talent Night, a networking event with a difference, offers you the opportunity to meet other talents as well as potential employers and AI experts. But that’s not all, because a great ambience, delicious food, live music, AI visualization and 3 VIP talks by leading AI experts from the business world await you.  

And the best part: Are you a talent? Then you can join us for free 

Together with our partners AI FrankfurtRheinMain e.V. and STATION HQ, we are organizing the AI Talent Night as part of the UAI conference series. Our goal is to promote AI in our region and establish an AI hub in Frankfurt. 

Become a part of our AI community!  

Tickets for the event can be purchased here: 

or on our website:


What you can expect:

The konaktiva fair at Darmstadt University of Technology is one of the oldest and largest student-organized company career fairs in Germany. In line with its motto “Students meet companies”, it brings together prospective graduates and companies every year.

This year, we are also taking part with our own booth as well as several colleagues and we are looking forward to the exchange with interested students. We will be happy to present the various entry-level opportunities at statworx – from internships to permanent positions – and share insights into our day-to-day work.

In addition, there will be the opportunity to get to know us better and to discuss individual questions and cases during pre-scheduled one-on-one meetings away from the hustle and bustle of the trade fair.

Participation in the fair is free of charge for visitors.