
At statworx, we deal intensively with how to get the best possible results from large language models (LLMs). In this blog post, I present five approaches that have proven successful both in research and in our own work with LLMs. While this text is limited to the manual design of prompts for text generation, image generation and automated prompt search will be the topic of future posts.

Mega models herald new paradigm

The arrival of the revolutionary language model GPT-3 was not only a turning point for the research field of natural language processing (NLP) but has incidentally heralded a paradigm shift in AI development: prompt learning. Prior to GPT-3, the standard was fine-tuning of medium-sized language models such as BERT, which adapts a pre-trained model to the desired use case by re-training it on new data. Such fine-tuning requires exemplary data for the desired application, as well as the computational capabilities to at least partially re-train the model.

The new large language models such as OpenAI’s GPT-3 and BigScience’s BLOOM, on the other hand, have already been trained by their development teams with such enormous amounts of resources that these models have achieved a new level of independence in their intended use: These LLMs no longer require elaborate fine-tuning to learn their specific purpose, but already produce impressive results using targeted instruction (“prompt”) in natural language.

So, we are in the midst of a revolution in AI development: Thanks to prompt learning, interaction with models no longer takes place via code, but in natural language. This is a giant step forward for the democratization of language modeling. Generating text or, most recently, even creating images requires no more than rudimentary language skills. However, this does not mean that compelling or impressive results are accessible to all. High quality outputs require high quality inputs. For us users, this means that engineering efforts in NLP are no longer focused on model architecture or training data, but on the design of the instructions that models receive in natural language. Welcome to the age of prompt engineering.

Figure 1: From prompt to prediction with a large language model.

Prompts are more than just snippets of text

Templates facilitate the handling of prompts

Since LLMs have not been trained on a specific use case, it is up to the prompt design to provide the model with the exact task. So-called “prompt templates” are used for this purpose. A template defines the structure of the input that is passed on to the model. Thus, the template takes over the function of fine-tuning and determines the expected output of the model for a specific use case. Using sentiment analysis as an example, a simple prompt template might look like this:

The expressed sentiment in text [X] is: [Z]

The model thus searches for the token z that, given its trained parameters and the text at position [X], maximizes the probability of the masked token at position [Z]. The template thus specifies the desired context of the problem to be solved and defines the relationship between the input at position [X] and the output to be predicted at position [Z]. The modular structure of templates enables the systematic processing of a large number of texts for the desired use case.

Figure 2: Prompt templates define the structure of a prompt.
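In code, such a template is nothing more than a string with a placeholder. The following minimal Python sketch shows the idea; the helper name and the example texts are illustrative and not part of any specific library:

# A prompt template as a plain Python format string. The input text fills the
# [X] slot; the model is expected to complete the prompt at position [Z].
TEMPLATE = 'The expressed sentiment in text [{x}] is:'

def build_prompt(text: str) -> str:
    """Insert one input text into the template."""
    return TEMPLATE.format(x=text)

texts = ['I love this product.', 'The delivery took far too long.']
prompts = [build_prompt(t) for t in texts]
for p in prompts:
    print(p)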

Prompts do not necessarily need examples

The template presented is an example of a so-called "0-shot" prompt, since it contains only an instruction, without any demonstration of solved examples. Originally, LLMs were called "Few-Shot Learners" by the developers of GPT-3, i.e., models whose performance can be maximized with a selection of solved examples of the problem (Brown et al., 2020). However, a follow-up study showed that with strategic prompt design, 0-shot prompts, without a large number of examples, can achieve comparable performance (Reynolds & McDonell, 2021). Thus, since different approaches are also used in research, the next section presents 5 strategies for effective prompt template design.

5 Strategies for Effective Prompt Design

Task demonstration

In the conventional few-shot setting, the problem to be solved is narrowed down by providing several examples. The solved example cases serve a similar function to the additional training samples used during fine-tuning and thus define the specific use case of the model. Text translation is a common example of this strategy, which can be represented with the following prompt template:

French: „Il pleut à Paris“

English: „It’s raining in Paris“

French: „Copenhague est la capitale du Danemark“

English: „Copenhagen is the capital of Denmark“

[…]

French: [X]

English: [Z]

While the solved examples are good for defining the problem setting, they can also cause problems. “Semantic contamination” refers to the phenomenon of the LLM interpreting the content of the translated sentences as relevant to the prediction. Examples in the semantic context of the task produce better results – and those out of context can lead to the prediction Z being “contaminated” in terms of its content (Reynolds & McDonell, 2021). Using the above template for translating complex facts, the model might well interpret the input sentence as a statement about a major European city in ambiguous cases.
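Programmatically, such a few-shot prompt is again just string assembly. A minimal sketch, in which the helper name and the final test sentence are illustrative:

# Assemble a few-shot translation prompt from solved example pairs.
EXAMPLES = [
    ('Il pleut à Paris', "It's raining in Paris"),
    ('Copenhague est la capitale du Danemark', 'Copenhagen is the capital of Denmark'),
]

def few_shot_prompt(text: str) -> str:
    """Prepend the solved examples, then leave the English slot open for the model."""
    shots = '\n\n'.join(f'French: "{fr}"\nEnglish: "{en}"' for fr, en in EXAMPLES)
    return f'{shots}\n\nFrench: "{text}"\nEnglish:'

print(few_shot_prompt('Le musée est fermé le lundi.'))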

Task specification

Recent research shows that with good prompt design, even the 0-shot approach can yield competitive results. For example, it has been demonstrated that LLMs do not require pre-solved examples at all, as long as the problem is defined as precisely as possible in the prompt (Reynolds & McDonell, 2021). This specification can take different forms, but it is always based on the same idea: to describe as precisely as possible what is to be solved, but without demonstrating how.

A simple example of the translation case would be the following prompt:

Translate from French to English [X]: [Z]

This may already work, but the researchers recommend making the prompt as descriptive as possible and explicitly mentioning translation quality:

A French sentence is provided: [X]. The masterful French translator flawlessly translates the sentence to English: [Z]

This helps the model locate the desired problem solution in the space of the learned tasks.

Figure 3: A clear task description can greatly increase the forecasting quality.

This is also recommended in use cases outside of translations. A text can be summarized with a simple command:

Summarize the following text: [X]: [Z]

However, better results can be expected with a more concrete prompt:

Rephrase this sentence with easy words so a child understands it,
emphasize practical applications and examples: [X]: [Z]

The more accurate the prompt, the greater the control over the output.

Prompts as constraints

Taken to its logical conclusion, this approach to controlling the model simply means constraining the model’s behavior through careful prompt design. This perspective is useful because during training, LLMs learn to complete many different sorts of texts and can thus solve a wide range of problems. With this design strategy, the basic approach to prompt design changes from describing the problem to excluding undesirable results by constraining model behavior. Which prompt leads to the desired result and only to the desired result? The following prompt indicates a translation task, but beyond that, it does nothing to prevent the model from simply continuing the sentence into a story.

Translate French to English Il pleut à Paris

One approach to improve this prompt is to use both semantic and syntactic means:

Translate this French sentence to English: “Il pleut à Paris.”

The use of syntactic elements such as the colon and quotation marks makes it clear where the sentence to be translated begins and ends. The explicit mention of a sentence also makes clear that only a single sentence is to be translated. These measures reduce the likelihood that this prompt will be misunderstood and not treated as a translation problem.

Use of “memetic proxies”

This strategy can be used to increase the density of information in a prompt and avoid long descriptions through culturally understood context. Memetic proxies can be used in task descriptions and use implicitly understood situations or personae instead of detailed instructions:

A primary school teacher rephrases the following sentence: [X]: [Z]

This prompt is less descriptive than the previous example of rephrasing in simple words. However, the situation described contains a much higher density of information: the mention of a primary school teacher already implies that the outcome should be understandable to children and thus hopefully increases the likelihood of practical examples in the output. Similarly, prompts can describe fictional conversations with well-known personalities so that the output reflects their worldview or way of speaking:

In this conversation, Yoda responds to the following question: [X]

Yoda: [Z]

This approach helps to keep prompts short by using implicitly understood context and to increase the information density within a prompt. Memetic proxies are also used in prompt design for other modalities. In image generation models such as DALL-E 2, the suffix “Trending on Artstation” often leads to higher quality results, although semantically no statements are made about the image to be generated.

Metaprompting

Metaprompting is how the research team of one study describes the approach of enriching prompts with instructions that are tailored to the task at hand. They describe this as a way to constrain a model with clearer instructions so that the task at hand can be better accomplished (Reynolds & McDonell, 2021). The following example can help to solve mathematical problems more reliably and to make the reasoning path comprehensible:

[X]. Let us solve this problem step-by-step: [Z]

Similarly, multiple choice questions can be enriched with metaprompts so that the model actually chooses an option in the output rather than continuing the list:

[X]. In order to solve this problem, let us analyze each option and choose the best: [Z]

Metaprompts thus represent another means of constraining model behavior and results.

Figure 4: Metaprompts can be used to define procedures for solving problems.
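To illustrate how such a metaprompt could be sent to a model in practice, here is a small sketch. It assumes the pre-1.0 openai Python client, an API key set via the OPENAI_API_KEY environment variable, and an illustrative model name and question; it is not tied to the setup used in the study.

import openai  # assumes openai < 1.0 and OPENAI_API_KEY set in the environment

question = ('A bakery sells 12 rolls per pack and a customer needs 50 rolls. '
            'How many packs are required?')
prompt = f'{question} Let us solve this problem step-by-step:'

response = openai.Completion.create(
    model='text-davinci-002',  # illustrative model choice
    prompt=prompt,
    max_tokens=256,
    temperature=0,
)
print(response['choices'][0]['text'])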

Outlook

Prompt learning is a very young paradigm, and the closely related prompt engineering is still in its infancy. However, the importance of sound prompt writing skills will undoubtedly only increase. Not only language models such as GPT-3, but also the latest image generation models require their users to have solid prompt design skills in order to create convincing results. The strategies presented here have proven themselves in both research and practice as ways to write prompts systematically and get better results from large language models.

In a future blog post, we will use this experience with text generation to unlock best practices for another category of generative models: state-of-the-art diffusion models for image generation, such as DALL-E 2, Midjourney, and Stable Diffusion.

Sources

Brown, Tom B. et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [cs]. http://arxiv.org/abs/2005.14165 (March 16, 2022).

Reynolds, Laria, and Kyle McDonell. 2021. “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.” http://arxiv.org/abs/2102.07350 (July 1, 2022).

Oliver Guggenbühl

Text classification is one of the most common applications of natural language processing (NLP). It is the task of assigning a set of predefined categories to a text snippet. Depending on the type of problem, the text snippet could be a sentence, a paragraph, or even a whole document. There are many potential real-world applications for text classification, but among the most common ones are sentiment analysis, topic modeling, and the detection of intent, spam, and hate speech.

The standard approach to text classification is training a classifier in a supervised regime. To do so, one needs pairs of text and associated categories (aka labels) from the domain of interest as training data. Then, any classifier (e.g., a neural network) can learn a mapping function from the text to the most likely category. While this approach can work quite well for many settings, its feasibility highly depends on the availability of those hand-labeled pairs of training data.

Though pre-trained language models like BERT can reduce the amount of data needed, they do not make it obsolete altogether. Therefore, for real-world applications, data availability remains the biggest hurdle.

Zero-Shot Learning

Though there are various definitions of zero-shot learning1, it can, broadly speaking, be defined as a regime in which a model solves a task it was not explicitly trained on before.

It is important to understand that a “task” can be defined in both a broader and a narrower sense: For example, the authors of GPT-2 showed that a model trained on language generation can be applied to entirely new downstream tasks like machine translation2. At the same time, a narrower definition of task would be to recognize previously unseen categories in images, as shown in the OpenAI CLIP paper3.

But what all these approaches have in common is the idea of extrapolation of learned concepts beyond the training regime. A powerful concept, because it disentangles the solvability of a task from the availability of (labeled) training data.

Zero-Shot Learning for Text Classification

Solving text classification tasks with zero-shot learning can serve as a good example of how to apply the extrapolation of learned concepts beyond the training regime. One way to do this is using natural language inference (NLI), as proposed by Yin et al. (2019)4. There are other approaches as well, like calculating distances between text embeddings or formulating the problem as a cloze task.

In NLI the task is to determine whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise5. A typical NLI dataset consists of sentence pairs with associated labels in the following form:

Examples from http://nlpprogress.com/english/natural_language_inference.html

Yin et al. (2019) proposed to use large language models like BERT trained on NLI datasets and exploit their language understanding capabilities for zero-shot text classification. This can be done by taking the text of interest as the premise and formulating one hypothesis for each potential category by using a so-called hypothesis template. Then, we let the NLI model predict whether the premise entails the hypothesis. Finally, the predicted probability of entailment can be interpreted as the probability of the label.
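To make this mechanism tangible, here is a rough sketch of a single premise-hypothesis check with an NLI model from the Hugging Face hub. The model choice and hypothesis wording are illustrative, and the entailment label index is read from the model config rather than hard-coded, since it varies between models:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'facebook/bart-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = 'Eintracht Frankfurt gewinnt die Europa League'
hypothesis = 'This text is about sport.'

# Encode premise and hypothesis as one sequence pair and get the NLI logits.
inputs = tokenizer(premise, hypothesis, return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Drop the "neutral" class and softmax over contradiction vs. entailment;
# the entailment probability is then read as the probability of the label.
label2id = {k.lower(): v for k, v in model.config.label2id.items()}
probs = logits[[label2id['contradiction'], label2id['entailment']]].softmax(dim=0)
print(f'P(entailment) = {probs[1].item():.2%}')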

Zero-Shot Text Classification with Hugging Face 🤗

Let’s explore the above-formulated idea in more detail using the excellent Hugging Face implementation for zero-shot text classification.

We are interested in classifying the sentence below into pre-defined topics:

topics = ['Web', 'Panorama', 'International', 'Wirtschaft', 'Sport', 'Inland', 'Etat', 'Wissenschaft', 'Kultur']
test_txt = 'Eintracht Frankfurt gewinnt die Europa League nach 6:5-Erfolg im Elfmeterschießen gegen die Glasgow Rangers'

Thanks to the 🤗 pipeline abstraction, we do not need to define the prediction task ourselves. We just need to instantiate a pipeline and define the task as zero-shot-classification. The pipeline will take care of formulating the premise and hypothesis as well as deal with the logits and probabilities from the model.

As written above, we need a language model that was pre-trained on an NLI task. The default model for zero-shot text classification in 🤗 is bart-large-mnli. BART is a transformer encoder-decoder for sequence-to-sequence modeling with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder6. The mnli suffix means that BART was then further fine-tuned on the MultiNLI dataset7.

But since we are using German sentences and BART is English-only, we need to replace the default model with a custom one. Thanks to the 🤗 model hub, finding a suitable candidate is quite easy. In our case, mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 is such a candidate. Let’s briefly decode the name for a better understanding: it is a multilingual version of DeBERTa-v3-base (which is itself an improved version of BERT/RoBERTa8) that was then fine-tuned on two cross-lingual NLI datasets (XNLI9 and multilingual-NLI-26lang10).

With the correct task and the correct model, we can now instantiate the pipeline:

from transformers import pipeline
model = 'MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7'
pipe = pipeline(task='zero-shot-classification', model=model, tokenizer=model)

Next, we call the pipeline to predict the most likely category of our text given the candidates. But as a final step, we need to replace the default hypothesis template as well. This is necessary since the default is again in English. We therefore define the template as 'Das Thema ist {}'. Note that {} is a placeholder for the previously defined topic candidates. You can define any template you like as long as it contains a placeholder for the candidates:

template_de = 'Das Thema ist {}'
prediction = pipe(test_txt, topics, hypothesis_template=template_de)

Finally, we can assess the prediction from the pipeline. The code below will output the three most likely topics together with their predicted probabilities:

print(f'Zero-shot prediction for: \n {prediction["sequence"]}')
top_3 = zip(prediction['labels'][0:3], prediction['scores'][0:3])
for label, score in top_3:
    print(f'{label} - {score:.2%}')
Zero-shot prediction for: 
 Eintracht Frankfurt gewinnt die Europa League nach 6:5-Erfolg im Elfmeterschießen gegen die Glasgow Rangers
Sport - 77.41%
International - 15.69%
Inland - 5.29%

As one can see, the zero-shot model produces a reasonable result with “Sport” being the most likely topic followed by “International” and “Inland”.

Below are a few more examples from other categories. Like before, the results are overall quite reasonable. Note how for the second text the model predicts an unexpectedly low probability of “Kultur”.

further_examples = ['Verbraucher halten sich wegen steigender Zinsen und Inflation beim Immobilienkauf zurück',
                    '„Die bitteren Tränen der Petra von Kant“ von 1972 geschlechtsumgewandelt und neu verfilmt',
                    'Eine 541 Millionen Jahre alte fossile Alge weist erstaunliche Ähnlichkeit zu noch heute existierenden Vertretern auf']

for txt in further_examples:
    prediction = pipe(txt, topics, hypothesis_template=template_de)
    print(f'Zero-shot prediction for: \n {prediction["sequence"]}')
    top_3 = zip(prediction['labels'][0:3], prediction['scores'][0:3])
    for label, score in top_3:
        print(f'{label} - {score:.2%}')
Zero-shot prediction for: 
  Verbraucher halten sich wegen steigender Zinsen und Inflation beim Immobilienkauf zurück 
Wirtschaft - 96.11% 
Inland - 1.69% 
Panorama - 0.70% 

Zero-shot prediction for: 
  „Die bitteren Tränen der Petra von Kant“ von 1972 geschlechtsumgewandelt und neu verfilmt 
International - 50.95% 
Inland - 16.40% 
Kultur - 7.76% 

Zero-shot prediction for: 
  Eine 541 Millionen Jahre alte fossile Alge weist erstaunliche Ähnlichkeit zu noch heute existierenden Vertretern auf 
Wissenschaft - 67.52% 
Web - 8.14% 
Inland - 6.91%

The entire code can be found on GitHub. Besides the examples from above, you will also find applications of zero-shot text classification on two labeled datasets there, including an evaluation of the accuracy. In addition, I added some prompt-tuning by playing around with the hypothesis template.

Concluding Thoughts

Zero-shot text classification offers a suitable approach when training data is limited (or even non-existent), or as an easy-to-implement benchmark for more sophisticated methods. While explicit approaches, like fine-tuning large pre-trained models, certainly still outperform implicit approaches, like zero-shot learning, the universal applicability of the latter makes them very appealing.

In addition, we should expect zero-shot learning, in general, to become more important over the next few years. This is because the way we will use models to solve tasks will evolve with the increasing importance of large pre-trained models. Therefore, I advocate that already today zero-shot techniques should be considered part of every modern data scientist’s toolbox.

 

Sources:

1 https://joeddav.github.io/blog/2020/05/29/ZSL.html
2 https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
3 https://arxiv.org/pdf/2103.00020.pdf
4 https://arxiv.org/pdf/1909.00161.pdf
5 http://nlpprogress.com/english/natural_language_inference.html
6 https://arxiv.org/pdf/1910.13461.pdf
7 https://huggingface.co/datasets/multi_nli
8 https://arxiv.org/pdf/2006.03654.pdf
9 https://huggingface.co/datasets/xnli
10 https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7

Fabian Müller

Artificially enhancing face images is all the rage

What can AI contribute?

In recent years, image filters have become wildly popular on social media. These filters let anyone adjust their face and the surroundings in different ways, leading to entertaining results. Often, filters enhance facial features that seem to match a certain beauty standard. As AI experts, we asked ourselves what our tools can achieve in the area of face representation. One issue that sparked our interest is gender representation. We were curious: how does the AI represent gender differences when creating these images? And on top of that: can we generate gender-neutral versions of existing face images?

Using StyleGAN on existing images

When thinking about what existing images to explore, we were curious to see how our own faces would be edited. Additionally, we decided to use several celebrities as inputs – after all, wouldn’t it be intriguing to observe world-famous faces morphed into different genders?

Currently, we often see text-prompt-based image generation models like DALL-E at the center of public discourse. Yet, the AI-driven creation of photo-realistic face images has long been a focus of researchers due to the apparent challenge of creating natural-looking face images. Searching for suitable AI models to approach our idea, we chose the StyleGAN architecture, which is well known for generating realistic face images.

Adjusting facial features using StyleGAN

One crucial aspect of this AI’s architecture is the use of a so-called latent space from which we sample the inputs of the neural network. You can picture this latent space as a map on which every possible artificial face has a defined coordinate. Usually, we would just throw a dart at this map and be happy about the AI producing a realistic image. But as it turns out, this latent space allows us to explore various aspects of artificial face generation. When you move from one face’s location on that map to another face’s location, you can generate mixtures of the two faces. And as you move in any arbitrary direction, you will see random changes in the generated face image.

This makes the StyleGAN architecture a promising approach for exploring gender representation in AI.

Can we isolate a gender direction?

So, are there directions that allow us to change certain aspects of the generated image? Could a gender-neutral representation of a face be approached this way? Pre-existing works have found semantically interesting directions, yielding fascinating results. One of those directions can alter a generated face image to have a more feminine or masculine appearance. This lets us explore gender representation in images.

The approach we took for this article was to generate multiple images by making small steps in each gender’s direction. That way, we can compare various versions of the faces, and the reader can, for example, decide which image comes closest to a gender-neutral face. It also allows us to examine the changes more clearly and look for unwanted characteristics in the edited versions.
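As a rough sketch of this idea, the snippet below walks a latent code along a semantic direction in small steps. The latent code, the gender direction, and the commented-out generator call are placeholders, not the actual StyleGAN code we used:

import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder latent code of one face (its "coordinate on the map") and a
# placeholder unit vector standing in for the learned gender direction.
w = rng.standard_normal(512)
gender_direction = rng.standard_normal(512)
gender_direction /= np.linalg.norm(gender_direction)

# Small steps along the direction; the sign convention is arbitrary here.
steps = np.linspace(-3.0, 3.0, num=7)
edited_codes = [w + alpha * gender_direction for alpha in steps]
# images = [generator(code) for code in edited_codes]  # `generator` = pre-trained StyleGAN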

Introducing our own faces to the AI

The described method can be utilized to alter any face generated by the AI towards a more feminine or masculine version. However, a crucial challenge remains: Since we want to use our own images as a starting point, we must be able to obtain the latent coordinate (in our analogy, the correct place on the map) for a given face image. Sounds easy at first, but the used StyleGAN architecture only allows us to go one way, from latent coordinate to generated image, not the other way around. Thankfully, other researchers have explored this very problem. Our approach thus heavily builds on the python notebook found here. The researchers built another “encoder”-AI that takes a face image as input and finds its corresponding coordinate in the latent space.

And with that, we finally have all parts necessary to realize our goal: exploring different gender representations using an AI. In the photo sequences below, the center image is the original input image. Towards the left, the generated faces appear more female; towards the right, they seem more male. Without further ado, we present the AI-generated images of our experiment:

Results: photo series from female to male

Images 1-6, from the top: Marilyn Monroe, Actress; Drake, Singer; Kim Kardashian, Entrepreneur & reality star; Harry Styles, Singer; Isabel Hermes, Co-author of this article; Alexander Müller, Co-author of this article

Unintended biases

After finding the corresponding images in the latent space, we generated artificial versions of the faces. We then moved them along the chosen gender direction, creating “feminized” and “masculinized” faces. Looking at the results, we see some unexpected behavior in the AI: it seems to recreate classic gender stereotypes.

Big smiles vs. thick eyebrows

Whenever we edited an image to look more feminine, the mouth gradually opens into a stronger smile, and vice versa. Likewise, eyes grow larger and wide open in the female direction. The Drake and Kim Kardashian examples illustrate a visible change in skin tone from darker to lighter when moving along the series from feminine to masculine. The chosen gender direction appears to edit out curls in the female direction (as opposed to the male direction), as seen in the examples of Marilyn Monroe and blog co-author Isabel Hermes. We also asked ourselves whether the lack of hair extension in Drake’s female direction would be remedied if we extended his photo series. Examining the overall extremes, eyebrows are thinned out and arched on the female side and straighter and thicker on the male side. Eye and lip makeup increase heavily on faces that move in the female direction, making the area surrounding the eyes darker and thinning out eyebrows. This may be why we perceived the male versions we generated to look more natural than the female versions.

Finally, we would like to challenge you, as the reader, to examine the photo series above closely. Try to decide which image you perceive as gender-neutral, i.e., as much male as female. What made you choose that image? Did any of the stereotypical features described above impact your perception?

A natural question that arises from image series like the ones generated for this article is whether there is a risk that the AI reinforces current gender stereotypes.

Is the AI to blame for recreating stereotypes?

Given that the adjusted images recreate certain gender stereotypes like a more pronounced smile in female images, a possible conclusion could be that the AI was trained on a biased dataset. And indeed, to train the underlying StyleGAN, image data from Flickr was used that inherits the biases from the website. However, the main goal of this training was to create realistic images of faces. And while the results might not always look as we expect or want, we would argue that the AI did precisely that in all our tests.

To alter the images, however, we used the aforementioned latent direction. In general, those latent directions rarely change only a single aspect of the created image. Instead, like walking in a random direction on our latent map, many elements of the generated face usually get changed simultaneously. Identifying a direction that alters only a single aspect of a generated image is anything but trivial. For our experiment, the chosen direction was created primarily for research purposes without accounting for said biases. It can therefore introduce unwanted artifacts in the images alongside the intended alterations. Yet it is reasonable to assume that a latent direction exists that allows us to alter the gender of a face created by the StyleGAN without affecting other facial features.

Overall, the implementations we build upon use different AIs and datasets, and therefore the complex interplay of those systems doesn’t allow us to identify the AI as a single source for these issues. Nevertheless, our observations suggest that doing due diligence to ensure the representation of different ethnic backgrounds and avoid biases in creating datasets is paramount.

Fig. 7: Picture from “A Sex Difference in Facial Contrast and its Exaggeration by Cosmetics” by Richard Russel

Subconscious bias: looking at ourselves

A study by Richard Russel deals with human perception of gender in faces. Ask yourself, which gender would you intuitively assign to the two images above? It turns out that most people perceive the left person as male and the right person as female. Look again. What separates the faces? There is no difference in facial structure. The only difference is darker eye and mouth regions. It becomes apparent that increased contrast is enough to influence our perception. Suppose our opinion on gender can be swayed by applying “cosmetics” to a face. In that case, we must question our human understanding of gender representations and whether they are simply products of our life-long exposure to stereotypical imagery. The author refers to this as the “Illusion of Sex”.
This bias relates to the selection of the latent “gender” direction: To find the latent direction that changes the perceived gender of a face, StyleGAN-generated images were divided into groups according to their appearance. While this was implemented based on yet another AI, human bias in gender perception might well have impacted this process and leaked through to the image series illustrated above.

Conclusion

Moving beyond the gender binary with StyleGANs

While a StyleGAN might not reinforce gender-related bias in and of itself, people still subconsciously harbor gender stereotypes. Gender bias is not limited to images – researchers have found the ubiquity of female voice assistants reason enough to create a new voice assistant that is neither male nor female: GenderLess Voice.

One example of a recent societal shift is the debate over gender; rather than binary, gender may be better represented as a spectrum. The idea is that there is biological gender and social gender. Being included in society as who they are is essential for somebody who identifies with a gender that differs from the one they were born with.

A question we, as a society, must stay wary of is whether the field of AI is at risk of discriminating against those beyond the assigned gender binary. The fact is that in AI research, gender is often represented as binary. Pictures fed into algorithms to train them are either labeled as male or female. Gender recognition systems based on deterministic gender-matching may also cause direct harm by mislabelling members of the LGBTQIA+ community. Currently, additional gender labels have yet to be included in ML research. Rather than representing gender as a binary variable, it could be coded as a spectrum.

Exploring female to male gender representations

We used StyleGANs to explore how AI represents gender differences. Specifically, we used a gender direction in the latent space that researchers had determined to represent male and female appearance. We saw that the generated images replicated common gender stereotypes – women smile more, have bigger eyes, longer hair, and wear heavy makeup – but importantly, we could not conclude that the StyleGAN model alone propagates this bias. Firstly, StyleGANs were created primarily to generate photo-realistic face images, not to alter the facial features of existing photos at will. Secondly, since the latent direction we used was created without correcting for biases in the StyleGAN’s training data, we see a correlation between stereotypical features and gender.

Next steps and gender neutrality

We asked ourselves which faces we perceived as gender-neutral among the image sequences we generated. For original images of men, we had to look towards the artificially generated female direction and vice versa. This was a subjective choice. We see it as a logical next step to try to automate the generation of gender-neutral versions of face images to further explore the possibilities of AI in the topic of gender and society. For this, we would first have to classify the gender of the face to be edited and then move towards the opposite gender up to the point where the classifier can no longer assign an unambiguous label. Interested readers will be able to follow the continuation of our journey in an upcoming second blog article.
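A purely hypothetical sketch of such an automation could look like this; generator and gender_score (a classifier returning the probability that a face is perceived as male) are placeholders for models that would still have to be chosen:

def find_neutral_code(w, direction, generator, gender_score,
                      step=0.25, max_steps=40, tolerance=0.05):
    """Walk the latent code towards the 'opposite' gender until the classifier
    can no longer assign an unambiguous label (P(male) close to 0.5)."""
    # Sign convention depends on how `direction` was defined; here we assume
    # that positive steps make faces appear more masculine.
    sign = -1.0 if gender_score(generator(w)) > 0.5 else 1.0
    candidate = w
    for i in range(1, max_steps + 1):
        candidate = w + sign * i * step * direction
        if abs(gender_score(generator(candidate)) - 0.5) < tolerance:
            break
    return candidate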

If you are interested in our technical implementation for this article, you can find the code here and try it out with your own images.

Resources

Photo Credits
Img. 1: © Alfred Eisenstaedt / Life Picture Collection
Img. 2: https://www.pinterest.com/pin/289989663476162265/
Img. 3: https://www.gala.de/stars/starportraets/kim-kardashian-20479282.html
Img. 4: © Charles Sykes / Picture Alliance
Img. 7: Richard Russel, “A Sex Difference in Facial Contrast and its Exaggeration by Cosmetics”

Isabel Hermes, Alexander Müller

Given the hype and the recent success of AI, it is surprising that most companies still lack a successful integration of AI. This is quite evident in many industries, especially in manufacturing (McKinsey).

In a study published by Accenture in 2019 about the implementation of AI in companies, the authors came to the conclusion that around 80% of all Proofs of Concept (PoCs) do not make it into production. Furthermore, only 5% of all interviewed companies stated that they currently have a company-wide AI strategy in place.

These findings are thought-provoking: What exactly is going wrong and why does artificial intelligence apparently not yet make the holistic transition from successful academic studies to the real world?

1. What Is Data-Centric AI?

“Data-centric AI is the discipline of systematically engineering the data used to build an AI system.”
Andrew Ng, data-centric AI pioneer

The data-centric approach focuses more on the data (data-first) and less on the models (model-first) to overcome the difficulties AI has with “reality”. Usually, the training data that serves as the starting point of an AI project in companies has relatively little in common with the meticulously curated and widely used benchmark datasets such as MNIST or ImageNet.

In this article, we want to consolidate different data-centric theories and frameworks in the context of an AI (project) workflow. In addition, we want to show how we approach a data-first AI implementation at statworx.

2. What´s Behind a Data-Centric Way of Thinking?

In the simplest terms, AI systems consist of two critical components: data and model (code). Data-centric shifts the focus towards the data, model-centric towards the model infrastructure – duh!

A strongly model-centric AI approach regards the data only as an extrinsic, static parameter. The iterative process of a data science project only starts after the dataset has been “delivered”, with model-related tasks such as training and fine-tuning various model architectures. These tasks occupy the vast portion of time in a data science project, while data pre-processing steps are only an ad-hoc duty at the beginning of each project.

model centric graphic

In contrast, data-centric understands (automated) data processes as an integral part of any ML project. This incorporates any steps necessary to get from raw data to a final dataset. Internalizing these processes aims to enhance the quality and the methodical observability of the data.

data centric graphic

Data-centric approaches can be consolidated into three broader categories that loosely outline the scope of the data-centric concept. In the following, we will assign buzzwords (frameworks) that are often used in the data-centric context to a specific category.

2.1. Integration of SMEs Into the Development Process as a Major Link Between Data and Model Knowledge

The integration of domain knowledge is an integral part of data-centric. It should help project teams to grow together and thus integrate the knowledge of Subject Matter Experts (SMEs) in the best way possible.

  • Data Profiling:
    Data scientists should not act as a one-man show that only shares its findings with the SMEs. With their statistical and programming abilities, they should rather act as mediators who empower SMEs to dig through the data themselves.
  • Human-in-the-loop Data & Model Monitoring:
    Similar to profiling, this should be a central starting point to ensure that SMEs have access to the relevant components of the AI system. From this central checkpoint, not only data but also model-relevant metrics can be monitored, and examples can be visualized and checked. Sophisticated monitoring decreases the response time drastically since errors can be directly investigated (and mitigated) – not only by data scientists.

2.2. Data Quality Management as an Agile, Automated and Iterative Process Over (Training) Data

Continuously improving the data preparation process is key to every data science project. The model itself should be an extrinsic part at first.

  • Data Catalogue, Lineage & Validation:
    The documentation of data should also not be an extrinsic task that often only arises ad-hoc towards the end of a project and becomes obsolete again with every change, e.g., of a model feature. Changes should be reflected dynamically and thus automate the documentation. Data catalogue frameworks provide the capabilities to store data with associated metadata (and other necessary information).
    Data lineage, as a subsequent step in the data process, then keeps track of all data wrangling steps that occur during the transition from the raw to the final dataset. The more complex a data model, the more useful a lineage graph becomes for tracing how a final column was created (see graph below), for example through joins, filters, or other processing steps. Finally, validating the data during the input and transformation process allows for a consistent data foundation. The knowledge gained from data profiling helps here to develop validation rules and integrate them into the process.

lineage graph

  • Data & Label Cleaning:
    The necessity of data processing is ubiquitous and well-established among AI practitioners. However, label cleaning is a rather disregarded step (that, of course, only applies to classification problems). Wrongly classified datapoints can make it hard for some algorithms to reveal the right patterns in the data.
  • Data Drifts in Production:
    A well-known weak spot of (all) AI systems are drifts in the data. These occur when the training data and the actual live inference data do not follow the same distribution. To ensure the model’s prediction validity in the long run, identifying these irregularities and retraining the model accordingly is a crucial part of any ML pipeline; a minimal sketch of such a check follows below this list.
  • Data Versioning:
    Git(Hub) has long been the go-to standard for versioning the codebase of projects. For AI projects, however, it is important to track not only code but also data changes, and to tie both together. This produces a more holistic depiction of an ML workflow with increased visibility and observability.

data/code versioning graphic
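The drift check referenced in the list above can be sketched, for a single numeric feature, with a two-sample Kolmogorov-Smirnov test. The feature values and the significance level below are illustrative:

import numpy as np
from scipy.stats import ks_2samp

def check_drift(train_values, live_values, alpha=0.01):
    """Flag drift if the live data no longer follows the training distribution."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return {'statistic': statistic, 'p_value': p_value, 'drift': p_value < alpha}

rng = np.random.default_rng(seed=0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted mean -> drift expected
print(check_drift(train, live))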

2.3. Generating Training Data as a Programmatic Task

Producing new (labeled) training examples is often a huge roadblock for AI projects, particularly if the underlying problem is complex and thus requires vast datasets. This may lead to an unbearable overhead of manual labor.

  • Data Augmentation:
    In many data-intensive deep learning applications, this technique has long been used to create artificial data from existing data. It works particularly well on image data, with simple operations such as rotating, tilting, or altering the color filters (see the sketch after this list). Even in NLP and tabular (Excel and co.) use cases, there are ways to enlarge the dataset.

augmentation graphic

  • Automated Data Labeling:
    As already stated, labeling data is a rather labor-intensive task in which people assign data points to predefined categories. On the one hand, this makes the initial effort (costs) very high; on the other hand, it is error-prone and difficult to monitor. That’s where ML can chip in. Concepts like semi-supervision and weak supervision can automate the manual task almost entirely.
  • Data Selection:
    Working with large chunks of data in a local setup is often not possible, especially once the dataset no longer fits into memory. And even if it fits, training runs can take forever. Data selection tries to reduce the dataset size by active subsampling (whether labeled or unlabeled). The “best” examples, those with the highest diversity and representativeness, are selected automatically to characterize the input as well as possible.
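As referenced in the data augmentation point above, a minimal augmentation sketch for image data could look like this, assuming torchvision is available; the specific transforms and parameters are illustrative:

from torchvision import transforms

# Random rotations, flips and colour changes create new training variants
# from existing images without collecting new data.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# augmented_tensor = augment(pil_image)  # `pil_image` stands for any PIL image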

Needless to say, not every presented method is necessarily a good fit for every project. It is part of the work of the development team to analyze how a certain framework could benefit the final product, which also should take business considerations (e.g. cost versus benefit) into account.

3. Integration of Data-Centric at statworx

Data-centric is a crucial part of our projects, especially during the transition between PoC and production-ready model. We have also had cases in which, during this transition, we faced mainly data-related issues due to inadequate documentation, validation, or poor integration of SMEs into the data process.

We therefore generally try to show our customers the importance of data management for the longevity and robustness of AI products in production and how helpful components are linked within an AI pipeline.

As part of these learnings, our data onboarding framework, a mix of profiling, catalogue, and validation, aims to mitigate the aforementioned issues. Additionally, this framework helps the entire company make previously unused, undocumented data sources available for various use cases (not just AI).

A strong interaction with the SMEs on the client’s side is integral to establish trustworthy, robust and well-understood quality checks. This also helps to empower our clients to debug errors and do the first-level support themselves, which also helps with the service’s longevity.

data & ai pipeline by statworx graphic

In a stripped-down, custom data onboarding integration, we used a variety of open and closed source tools to create a platform that is easily scalable and understandable for the customer. We installed validation checks with Great Expectations (GE), a Python-based tool with reporting capabilities to create a shareable status report of the data.

data validation graphic
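A minimal validation sketch with GE could look roughly like this; the column names are illustrative, and the pandas-based entry point refers to the classic GE API (newer releases organize this differently):

import pandas as pd
import great_expectations as ge

# Wrap a small example frame in a Great Expectations dataset (classic API).
df = ge.from_pandas(pd.DataFrame({
    'machine_id': [1, 2, 3],
    'temperature': [71.3, 69.8, None],
}))

# Two simple expectations: no missing temperatures, values in a plausible range.
df.expect_column_values_to_not_be_null('temperature')
df.expect_column_values_to_be_between('temperature', min_value=0, max_value=150)

result = df.validate()
print(result.success)  # False here, because one temperature is missing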

This architecture can then run in different environments, whether cloud-native (Azure Data Factory) or orchestrated with Airflow (open source), and can easily be extended.

4. Data-Centric in Relation to AI’s Status Quo

Both data- and model-centric describe attempts on how to approach an AI project.

On the one hand, there already exist well-established best practices around the model-centric approach, with various production-proven frameworks.

One reason for this maturity is certainly the strong focus on model architectures and their advancements among academic researchers and leading AI companies. With computer vision and NLP leading the way, commercialized meta models, trained on enormous datasets, opened the door for successful AI use cases. With relatively limited data, those models can get finetuned for downstream end-use applications – known as transfer learning.

However, this trend helps only some of the failed projects, because especially in industrial settings, a lack of compatibility or rigidity of use cases makes the application of meta-models difficult. This non-rigidity is often found in machine-heavy manufacturing industries, where the environment in which data is produced is constantly changing, and even the replacement of a single machine can have a large impact on a productive AI model. If this issue is not properly considered in the AI process, it creates a difficult-to-calculate risk, also known as technical debt [Source: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf].

Lastly, edge cases (rare and unusual data points) are generally a burden for any AI. An application that observes anomalies of a machine component most certainly sees only a fraction of faulty units.

5. Conclusion – Paradigm Shift in Sight?

Overcoming these problems is part of the promise of data-centric AI, but the approach is still rather immature at the moment.

The scarcity and immaturity of open-source frameworks reflect this, especially since there is no unified tool stack that end users can choose from. This inevitably leads to longer, more involved, and more complex AI projects, which is a significant hurdle for many companies. In addition, there are few data metrics available to give companies feedback on what exactly they are “improving”. Furthermore, many of the tools (e.g., data catalogues) have more indirect, distributed benefits.

Some startups that aim to address these issues have emerged in recent years. However, because they (exclusively) market paid-tier software, it is rather unclear to what extent these products can really cover the broad mass of problems across different use cases.

Although the above shows that companies in general are still far away from a holistic integration of data-centric AI, robust data strategies have become more and more important lately (as we at statworx have seen in our projects).

With increased academic research into data products, this trend will certainly intensify. Not only because new, more robust frameworks will emerge, but also because university graduates will bring more knowledge in this area to companies.

Image Sources

Model-centric arch: own
Data-centric arch: own
Data lineage: https://www.researchgate.net/figure/Data-lineage-visualization-example-in-DW-environment-using-Sankey-diagram_fig7_329364764
Versioning Code/Data: https://ardigen.com/7155/
Data Augmentation: https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722
Data & AI pipeline: own
Validation with GE: https://greatexpectations.io/blog/ge-data-warehouse/

Benedikt Müller

Whether deliberate or unconscious, bias in our society makes it difficult to create a gender-equal world free of stereotypes and discrimination. Unfortunately, this gender bias creeps into AI technologies, which are rapidly advancing in all aspects of our daily lives and will transform our society as we have never seen before. Therefore, creating fair and unbiased AI systems is imperative for a diverse, equitable, and inclusive future. It is crucial not only to be aware of this issue but that we act now, before these technologies reinforce our gender bias even more, including in areas of our lives where we have already eliminated them.

Solving starts with understanding: To work on solutions to eliminate gender bias and all other forms of bias in AI, we first need to understand what it is and where it comes from. Therefore, in the following, I will first introduce some examples of gender-biased AI technologies and then give you a structured overview of the different reasons for bias in AI. I will present the actions needed towards fairer and more unbiased AI systems in a second step.

Sexist AI

Gender bias in AI has many faces and has severe implications for women’s equality. While Youtube shows my single friend (male, 28) advertisements for the latest technical inventions or the newest car models, I, also single and 28, have to endure advertisements for fertility or pregnancy tests. But AI is not only used to make decisions about which products we buy or which series we want to watch next. AI systems are also being used to decide whether or not you get a job interview, how much you pay for your car insurance, how good your credit score is, or even what medical treatment you will get. And this is where bias in such systems really starts to become dangerous.

In 2015, for example, Amazon’s recruiting tool falsely learned that men are better programmers than women and thus did not rate candidates for software developer jobs and other technical posts in a gender-neutral way.

In 2019, a couple applied for the same credit card. Although the wife had a slightly better credit score and the same income, expenses, and debts as her husband, the credit card company set her credit card limit much lower, which the customer service of the credit card company could not explain.

If these sexist decisions were made by humans, we would be outraged. Fortunately, there are laws and regulations against sexist behavior for us humans. Still, AI somehow seems to stand above the law because a supposedly rational machine made the decision. So, how can a supposedly rational machine become biased, prejudiced, and racist? There are three interlinked reasons for bias in AI: data, models, and community.

Data is Destiny

First, data is a mirror of our society, with all our values, assumptions, and, unfortunately, also biases. There is no such thing as neutral or raw data. Data is always generated, measured, and collected by humans. Data has always been produced through cultural operations and shaped into cultural categories. For example, most demographic data is labeled based on simplified, binary female-male categories. When gender classification conflates gender in this way, data is unable to show gender fluidity and one’s gender identity. Also, race is a social construct, a classification system invented by us humans a long time ago to define physical differences between people, which is still present in data.

The underlying mathematical algorithm in AI systems is not sexist itself. AI learns from data with all its potential gender biases. For example, suppose a face recognition model has never seen a transgender or non-binary person because there was no such picture in the data set. In that case, it will not correctly classify a transgender or non-binary person (selection bias).

Or, as in the case of Google translate, the phrase “eine Ärztin” (a female doctor) is consistently translated into the masculine form in gender-inflected languages because the AI system has been trained on thousands of online texts where the male form of “doctor” was more prevalent due to historical and social circumstances (historical bias). According to Invisible Women, there is a big gender gap in Big Data in general, to the detriment of women. So if we do not pay attention to what data we feed these algorithms, they will take over the gender gap in the data, leading to serious discrimination of women.

Models need Education

Second, our AI models are unfortunately not smart enough to overcome the biases in the data. Because current AI models only analyze correlations and not causal structures, they blindly learn what is in the data. These algorithms have an inherent structural conservatism, as they are designed to reproduce given patterns in the data.

To illustrate this, I will use a fictional and very simplified example: Imagine a very stereotypical data set with many pictures of women in kitchens and men in cars. Based on these pictures, an image classification algorithm has to learn to predict the gender of a person in a picture. Due to the data selection, there is a high correlation between kitchens and women and between cars and men in the data set – a higher correlation than between some characteristic gender features and the respective gender. As the model cannot identify causal structures (what are gender-specific features), it thus falsely learns that having a kitchen in the picture also implies having women in the picture and the same for cars and men. As a result, if there’s a woman in a car in some image, the AI would identify the person as a man and vice versa.

However, this is not the only reason AI systems cannot overcome bias in data. It is also because we do not “tell” the systems that they should watch out for this. AI algorithms learn by optimizing a certain objective or goal defined by the developers. Usually, this performance measure is an overall accuracy metric, not including any ethical or fairness constraints. It is as if a child were to learn to get as much money as possible without any additional constraints such as suffering consequences from stealing, exploiting, or deceiving. If we want AI systems to learn that gender bias is wrong, we have to incorporate this into their training and performance evaluation.
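As a purely illustrative sketch of what such an additional constraint could look like in an evaluation metric, the toy score below subtracts a simple group fairness penalty from plain accuracy; the grouping variable, the example data, and the weighting are made up for this example:

import numpy as np

def fairness_aware_score(y_true, y_pred, group, weight=1.0):
    """Plain accuracy minus a penalty for unequal positive prediction rates
    between two groups (a simple demographic parity gap)."""
    accuracy = np.mean(y_true == y_pred)
    rate_a = np.mean(y_pred[group == 'a'])  # share of positive predictions in group a
    rate_b = np.mean(y_pred[group == 'b'])  # share of positive predictions in group b
    return accuracy - weight * abs(rate_a - rate_b)

y_true = np.array([1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0])
group = np.array(['a', 'a', 'a', 'b', 'b', 'b'])
print(round(fairness_aware_score(y_true, y_pred, group), 3))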

Community lacks Diversity

Last, it is the developing community who directly or indirectly, consciously or subconsciously introduces their own gender and other biases into AI technologies. They choose the data, define the optimization goal, and shape the usage of AI.

While there may be malicious intent in some cases, I would argue that developers often bring their own biases into AI systems at an unconscious level. We all suffer from unconscious biases, that is, unconscious errors in thinking that arise from problems related to memory, attention, and other mental mistakes. In other words, these biases result from the effort to simplify the incredibly complex world in which we live.

For example, it is easier for our brain to apply stereotypic thinking, that is, perceiving ideas about a person based on what people from a similar group might “typically” be like (e.g., a man is more suited to a CEO position) than to gather all the information to fully understand a person and their characteristics. Or, according to the affinity bias, we like people most who look and think like us, which is also a simplified way of understanding and categorizing the people around us.

We all have such unconscious biases, and since we are all different people, these biases vary from person to person. However, since the current community of AI developers comprises over 80% white cis-men, the values, ideas, and biases creeping into AI systems are very homogeneous and thus literally narrow-minded. Starting with the definition of AI, the founding fathers of AI back in 1956 were all white male engineers, a very homogeneous group of people, which led to a narrow idea of what intelligence is, namely the ability to win games such as chess. Still today, if a model is developed and reviewed by a very homogeneous group of people, without special attention and processes, they will not be able to identify discrimination against people who are different from themselves, due to unconscious biases. Indeed, this homogeneous community tends to be the group of people who barely suffer from bias in AI.

Just imagine if all the children in the world were raised and educated by 30-year-old white cis-men. That is what our AI looks like today. It is designed, developed, and evaluated by a very homogenous group, thus, passing on a one-sided perspective on values, norms, and ideas. Developers are at the core of this. They are teaching AI what is right or wrong, what is good or bad.

Break the Bias in Society

So, a crucial step towards fair and unbiased AI is a diverse and inclusive AI development community. Meanwhile, there are some technical solutions to the mentioned data and model bias problems (e.g., data diversification or causal modeling). Still, all these solutions are useless if the developers fail to think about bias problems in the first place. Diverse people can better check each other’s blind spots and each other’s biases. Many studies show that diversity in data science teams is critical in reducing bias in AI.

Furthermore, we must educate our society on AI, its risks, and its chances. We need to rethink and restructure the education of AI developers, as they need as much ethical knowledge as technical knowledge to develop fair and unbiased AI systems. We need to educate the broad population that we all can also become part of this massive transformation through AI to contribute our ideas and values to the design and development of these systems.

In the end, if we want to break the bias in AI, we need to break the bias in our society. Diversity is the solution to fair and unbiased AI, not only in AI development teams but across our whole society. AI is made by humans, by us, by our society. Our society, with its structures, brings bias into AI: through the data we produce, the goals we expect the machines to achieve, and the community developing these systems. At its core, bias in AI is not a technical problem – it is a social one.

Positive Reinforcement of AI

Finally, we need to ask ourselves: do we want AI reflecting society as it is today or a more equal society of tomorrow? Suppose we are using machine learning models to replicate the world as it is today. In that case, we are not going to make any social progress. If we fail to take action, we might lose some social progress, such as more gender equality, as AI amplifies and reinforces bias back into our lives. AI is supposed to be forward-looking. But at the same time, it is based on data, and data reflects our history and present. So, as much as we need to break the bias in society to break the bias in AI systems, we need unbiased AI systems for social progress in our world.

Having said all that, I am hopeful and optimistic. Through this amplification effect, AI has raised awareness of old fairness and discrimination issues in our society on a much broader scale. Bias in AI shows us some of the most pressing societal challenges. Ethical and philosophical questions become ever more important. And because AI has this reinforcement effect on society, we can also use it for the positive. We can use this technology for good. If we all work together, it is our chance to remake the world into a much more diverse, inclusive, and equal place.

Preface

Every data science and AI professional out there will tell you: real-world data science (DS) & AI projects involve various challenges for which neither hands-on coding competitions nor theoretical lectures will prepare you. And sometimes – alarmingly often [1, 2] – these real-life issues cause promising AI projects or whole AI initiatives to fail.

There has been an active discussion of the more technical pitfalls and potential solutions for quite some time now. The better-known issues include, for example, siloed data, poor data quality, inexperienced or under-staffed DS & AI teams, and insufficient infrastructure for training and serving models. Another issue is that too many solutions are never moved into production due to organizational problems.

Only recently, the focus of the discourse has shifted more towards strategic issues. But in my opinion, these perspectives still do not get the attention they deserve.

That is why in this post, I want to share my take on the most important (non-technical) reasons why DS & AI initiatives fail. Of course, I’ll also give you some input on how to solve these issues. Since I work as a Data & Strategy Consultant at statworx, this article is admittedly rather subjective: it reflects my personal experience of the problems and solutions I have come across.

Issue #1: Poor Alignment of Project Scope and Actual Business Problem

One problem that occurs way more often than one can imagine is the misalignment of developed data science & AI solutions and real business needs. The finished product might perfectly serve the exact task the DS & AI team set out to solve; however, the business users might look for a solution to a similar but significantly different task.

Too little exchange due to indirect communication channels or a lack of a shared language and frame of reference often leads to fundamental misunderstandings. The problem is that, quite ironically, only extremely detailed and effective communication can uncover such subtle issues.

Including Too Few or Selective Perspectives Can Lead To a Rude Awakening

In other cases, individual sub-processes or the working methods of individual users differ a lot. Often, they vary so much that a solution that is a great benefit for one of the users/processes is hardly advantageous for all the others (while sometimes an option, the development of solution variants is by far less cost-efficient). If you are lucky, you will notice this at the beginning of a project when eliciting requirements. If you are unlucky, the rude awakening only occurs during broader user testing or even during roll-out, revealing that the users or experts who influenced the previous development did not provide universally generalizable input, making the developed tool not generally applicable.

How to Counteract This Problem:

  • Conduct structured and in-depth requirements engineering. Invest the time to talk to as many experts/users as possible and try to make all implicit assumptions as explicit as possible. While requirements engineering stems from the waterfall paradigm, it can easily be adapted for agile development. The elicited requirements simply must not be understood as definite product features but as items for your initial backlog that are constantly up for (re)evaluation and (re)prioritization.

  • Be sure to define success measures. Do so before the start of the project, ideally in the form of objectively quantifiable KPIs and benchmarks. This helps significantly to pin down the business problem/business value at the heart of the sought-after solution.
  • Whenever possible and as early as possible, create prototypes or mock-ups and present these solution drafts to as many test users as possible. These methods strongly facilitate the elicitation of candid and precise user feedback to inform further development. Be sure to involve a user sample that is representative of the entirety of users.

Issue #2: Loss of Efficiency and Resources Due to Non-Orchestrated Data Science & AI Efforts

Decentralized data science & AI teams often develop their use cases with little to no exchange or alignment between the teams’ current use cases and backlogs. This can cause different teams to develop (parts of) the same (or a very similar) solution without noticing.

In most cases, when such a situation is discovered, one of the redundant DS & AI solutions is discontinued or denied any future funding for further development or maintenance. Either way, the redundant development of use cases always results in direct waste of time and other resources with no or minimal added value.

The lack of alignment between an organization’s use case portfolio and the general business or AI strategy can also be problematic. This can cause high opportunity costs: use cases that do not contribute to the general AI vision might unnecessarily bind valuable resources. Further, potential synergies between more strategically significant use cases might not be fully exploited. Lastly, competence building might happen in areas that are of little to no strategic significance.

How to Counteract This Problem:

  • Communication is key. That is why there should always be a range of possibilities for the data science professionals within an organization to connect and exchange their lessons learned and best practices – especially for decentralized DS & AI teams. To make this work, it is essential to establish an overall working atmosphere of collaboration. The free sharing of successes and failures, and thus the internal diffusion of competence, can only succeed without competitive thinking.

  • Another option to mitigate the issue is to establish a central committee entrusted with the organization’s DS & AI use case portfolio management. The committee should include representatives of all (decentralized) DS & AI departments and general management. Together, the committee oversees the alignment of use cases and the AI strategy, preventing redundancies while fully exploiting synergies.

Issue #3: Unrealistically High Expectations of Success in Data Science & AI

It may sound paradoxical, but over-optimism regarding the opportunities and capabilities of data science & AI can be detrimental to success. That is because overly optimistic expectations often result in underestimating the requirements, such as the time needed for development or the volume and quality of the required data.

At the same time, the expectations regarding model accuracy are often too high, with little to no understanding of model limitations and basic machine learning mechanics. This inexperience might prevent acknowledgment of many important facts, including but not limited to the following points: the inevitable extrapolation of historical patterns to the future; the risk that external paradigm shifts or shocks endanger the generalizability and performance of models; the complexity of harmonizing predictions of mathematically unrelated models; the low interpretability of naïve models; or the changing behavior of model specifications due to retraining.

DS & AI simply is no magic bullet, and too high expectations can lead to enthusiasm turning into deep opposition. The initial expectations are almost inevitably unfulfilled and thus often give way to a profound and undifferentiated rejection of DS & AI. Subsequently, this can cause less attention-grabbing but beneficial use cases to no longer find support.

How to Counteract This Problem:

  • When dealing with stakeholders, always try to convey realistic prospects in your communication. Make sure to use unambiguous messages and objective KPIs to manage expectations and address concerns as openly as possible.

  • The education of stakeholders and management in the basics of machine learning and AI empowers them to make more realistic judgments and thus more sensible decisions. Technical in-depth knowledge is often unnecessary. Conceptual expertise with a relatively high level of abstraction is sufficient (and luckily much easier to attain).

  • Finally, whenever possible, a PoC should be carried out before any full-fledged project. This makes it possible to gather empirical indications of the use case’s feasibility and helps in the realistic assessment of the anticipated performance measured by relevant (predefined!) KPIs. It is also important to take the results of such tests seriously. In the case of a negative prognosis, it should never simply be assumed that with more time and effort, all the problems of the PoC will disappear into thin air.

Issue #4: Resentment and Fundamental Rejection of Data Science & AI

An invisible hurdle, but one that should never be underestimated, lies in the minds of people. This can hold true for the workforce as well as for management. Often, promising data science & AI solutions are thwarted due to deep-rooted but undifferentiated reservations. The right mindset is decisive.

Although everyone is talking about DS and AI, many organizations still lack real management commitment. Frequently, lip service is paid to DS & AI and substantial funds are invested, but reservations about AI remain.

This is often ostensibly justified by the inherent biases and uncertainty of AI models and their low direct interpretability. In addition, sometimes, there is a general unwillingness to accept insights that do not correspond with one’s intuition. The fact that human intuition is often subject to much more significant – and, in contrast to AI models, unquantifiable – biases is usually ignored.

Data Science & AI Solutions Need the Acceptance and Support of the Workforce

This leads to (decision-making) processes and organizational structures (e.g., roles, responsibilities) not being adapted in a way that lets DS and AI solutions deliver their (full) benefit. But this would be necessary because data science & AI is not just another software solution that can be seamlessly integrated into existing structures.

DS & AI is a disruptive technology that inevitably will reshape entire industries and organizations alike. Organizations rejecting this change are likely to fail in the long run, precisely because of this paradigm shift. The rejection of change begins with seemingly minor matters such as shifting from project management via the waterfall method towards agile, iterative development. Irrespective of the generally positive reception of certain change measures, there is sometimes a completely irrational refusal to reform current (still) functioning processes. Yet, this is exactly what would be necessary to be – admittedly only after a phase of re-adjustment – competitive in the long term.

While vision, strategy, and structures must be changed top-down, the day-to-day operational doing can only be revolutionized bottom-up, driven by the workforce. Management commitment and the best tool in the world are useless if the end-users are not able or willing to adopt it. General uncertainty about the long-term AI roadmap and the fear of being replaced by machines lead to DS & AI solutions not being integrated into everyday work. This is, of course, more than problematic, as only the (correct) application of AI solutions creates added value.

How to Counteract This Problem:

  • Unsurprisingly, sound AI change management is the best approach to mitigate the anti-AI mindset. It should be an integral part of any DS & AI initiative rather than an afterthought, and responsibilities for this task should be clearly assigned. Early, widespread, detailed, and clear communication is vital: which steps will presumably be implemented, when, and how exactly? Remember that once trust has been lost, it is tough to regain. Therefore, any uncertainties in the planning should be addressed. It is crucial to create a basic understanding of the matter among all stakeholders and clarify the necessity of change (e.g., otherwise endangered competitiveness, success stories, or competitors’ failures). In addition, dialogue with concerned parties is of great importance. Feedback should be sought early and acted upon where possible. Concerns should always be heard and respected, even if they cannot be fully addressed. However, false promises must be strictly avoided; instead, focus on the advantages of DS & AI.

  • In addition to understanding the need for change, the fundamental ability to change is essential. The fear of the unknown or incomprehensible is inherent in us humans. Therefore, education – only at the level of abstraction and depth necessary for the respective role – can make a big difference. Appropriate training measures are not a one-time endeavor; the development of up-to-date knowledge and training in the field of DS & AI must be ensured in the long term. General data literacy of the workforce must be ensured, as well as up- or re-skilling of technical experts. Employees must be given a realistic chance to gain new and more attractive job opportunities by educating themselves and engaging with DS & AI. Losing (parts of) their old jobs to DS & AI should never appear to be the most probable outcome; instead, DS & AI must be perceived as an opportunity rather than a danger and must create perspectives, not spoil them.

  • Adopt or adapt the best practices of DS & AI leaders in terms of defining role and skill profiles, adjusting organizational structures, and value-creation processes. Battle-proven approaches can serve as a blueprint for reforming your organization and thereby ensure you remain competitive in the future.

Closing Remarks

As you might have noted, this blog post does not offer easy solutions. This is because the issues at hand are complex and multi-dimensional. This article gave you high-level ideas on how to mitigate the addressed problems, but it must be stressed that these issues call for a holistic solution approach. This requires a clear AI vision and a derived sound AI strategy according to which the vast number of necessary actions can be coordinated and directed.

That is why I must stress that we have long left the stage where experimental and unstructured data science & AI initiatives could be successful. DS & AI must not be treated as a technical topic that takes place solely in specialist departments. It is time to address AI as a strategic issue. As with the digital revolution, only organizations in which AI completely permeates and reforms daily operations and the general business strategy will be successful in the long term. As described above, this undoubtedly holds many pitfalls in store but also represents an incredible opportunity.

If you are willing to integrate these changes but don’t know where to start, we at STATWORX are happy to help. Check out our website and learn more about our AI strategy offerings!

Sources

[1] https://www.forbes.com/sites/forbestechcouncil/2020/10/14/why-do-most-ai-projects-fail/?sh=2f77da018aa3
[2] https://blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/


Management Summary

Kubernetes is a technology that in many ways greatly simplifies the deployment and maintenance of applications and compute loads, especially the training and hosting of machine learning models. At the same time, it allows us to adapt the required hardware resources, providing a scalable and cost-transparent solution.

This article first discusses the transition from a server to management and orchestration of containers: isolated applications or models that are packaged once with all their requirements and can subsequently be run almost anywhere. Regardless of the server, these can be replicated at will with Kubernetes, allowing effortless and almost seamless continuous accessibility of their services even under intense demand. Likewise, their number can be reduced to a minimum level when the demand temporarily or periodically dwindles in order to use computing resources elsewhere or avoid unnecessary costs.

From the capabilities of this infrastructure emerges a useful architectural paradigm called microservices. Formerly centralized applications are thus broken down into their functionalities, which provide a high degree of reusability. These can be accessed and used by different services and scale individually according to internal needs. An example of this is large and complex language models in Natural Language Processing, which can capture the context of a text regardless of its further use and thus underlie many downstream purposes. Other microservices (models), such as for text classification or summarization, can invoke them and further process the partial results.

After a brief introduction of the general terminology and functionality of Kubernetes, as well as possible use cases, the focus turns to the most common way to use Kubernetes: with cloud providers such as Google GCP, Amazon AWS, or Microsoft Azure. These allow so-called Kubernetes clusters to dynamically consume more or fewer resources, though the costs incurred remain foreseeable on a pay-per-use basis. Other common services such as data storage, versioning, and networking can also be easily integrated by the providers. Finally, the article gives an outlook on tools and further developments, which either make using Kubernetes even more efficient or further abstract and simplify the process towards serverless architectures.

Introduction

Over the last 20 years, vast amounts of new technologies have surfaced in software development and deployment, which have not only multiplied and diversified the choice of services, programming languages, and libraries but have even led to a paradigm shift in many use cases or domains.

Fig. 1: Google trends chart showing the above-mentioned paradigm shift

If we also look at the way software solutions, models, or work and computing loads have been deployed over the years, we can see how innovations in this area have also led to greater flexibility, scalability, and resource efficiency, among other things.

In the beginning, these were run as local processes directly on a server (shared by several applications), which posed some limitations and problems: on the one hand, one is bound to the configuration of the server and its operating system when selecting the technical tools, and on the other hand, all applications hosted on the server are limited by its memory and processor capacities. Thus, they not only share the server’s total resources but are also exposed to errors spilling over from other processes.

As a first development step, virtual machines offer an additional level of abstraction: by emulating (“virtualizing”) an independent machine on the server, modularity and thus greater freedom are created for development and deployment, for example in the choice of operating system or the programming languages and libraries used. From the point of view of the “real” server, the resources to which the application is entitled can be better limited or guaranteed. However, the resource requirements are also significantly higher since the virtual machine must also run its own virtual operating system.

Ultimately, this principle has been significantly streamlined and simplified by the proliferation of containers, especially Docker. Put simply, one builds/configures a separate virtual, isolated server for an application or machine learning model. Thus, each container has its own file system and certain system libraries, but no operating system of its own. This technically turns it into a sandbox whose configuration, code dependencies, or errors do not affect the host server, while it can still run as a relatively “lightweight” process directly on it.

Fig. 2: Comparison between virtual machine and Docker container system architecture; Source: https://i1.wp.com/www.docker.com/blog/wp-content/uploads/Blog.-Are-containers-..VM-Image-1-1024×435.png?ssl=1

So everything the desired application needs can be copied, installed, etc., once and then provided as a packaged container in a consistent format almost anywhere. This is not only extremely useful for the production environment, but we at STATWORX also like to use it in the development of more complicated projects or in the proof-of-concept phase. Intermediate steps or results, such as extracting text from images, can then be exposed as a container, like a small web server, to anyone interested in further processing the text, for example to extract certain key information or to determine its sentiment or intent.

This subdivision into so-called “microservices” with the help of containers helps immensely with the reusability of the individual modules and with the planning and development of the architecture of complex systems; at the same time, it frees the individual work steps from technical dependencies on each other and facilitates maintenance and update procedures.

After this brief overview of the powerful and versatile possibilities of deploying software, the following text will deal with how to reliably and scalably deploy these containers (i.e., applications or models) for customers, other applications, internal services or computations with Kubernetes.

Kubernetes – 8 Essential Components

Kubernetes was introduced by Google in 2014 as open-source container management software (also called container orchestration). Internally, the company had already been using tools developed in-house for years to manage workloads and applications, and regarded the development of Kubernetes not only as a convergence of best practices and lessons learned, but also as an opportunity to open up a new business segment in cloud computing.

The name Kubernetes (Greek for helmsman) was supposedly chosen in reference to a symbolic container ship, for whose optimal operation the helmsman is responsible.

1.    Nodes

A Kubernetes instance is referred to as a (Kubernetes) cluster: it consists of several servers, called nodes. One of them, the master node, is solely responsible for administrative operations and is the interface that is addressed by the developer. All other nodes, called worker nodes, are initially unoccupied and thus flexible. While nodes are actual physical instances, mostly in data centers, the concepts described in the following are logical constructs within Kubernetes.

2.    Pods

If an application is to be deployed on the cluster, in the simplest case the desired container is specified, and a so-called pod is then (automatically) created and assigned to a node. A pod essentially corresponds to a running container. If several instances of the same application are to run in parallel, for example to provide better availability, the number of replicas can be specified. In this case, the specified number of pods, each with the same application, is distributed across the nodes. If the demand for the application exceeds the capacities despite the replicas, even more pods can be created automatically with the Horizontal Pod Autoscaler. Especially for deep learning models with relatively long inference times, metrics such as CPU or GPU utilization can be monitored here, and the number of pods can be increased or decreased automatically to optimize both capacity and cost.

Fig. 3: Illustration of autoscaling and the occupancy of the nodes. The width of the bars corresponds to the resource requirements of the pods or the capacity of the nodes.
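
To make the scaling behavior described above more concrete, here is a minimal, purely illustrative sketch using the official Kubernetes Python client. The deployment name “model-serving”, the namespace, and the thresholds are assumptions made for this example; in practice, replicas and autoscaling rules are usually declared in YAML manifests or via kubectl, but the client API exposes the same concepts programmatically.

from kubernetes import client, config

# Load the local kubeconfig (e.g., created by a cloud provider's CLI).
config.load_kube_config()
apps = client.AppsV1Api()

# Manually scale the hypothetical "model-serving" deployment to 3 replicas (pods).
apps.patch_namespaced_deployment_scale(
    name="model-serving",
    namespace="default",
    body={"spec": {"replicas": 3}},
)

# Alternatively, a Horizontal Pod Autoscaler adjusts the replica count automatically,
# here targeting 70% average CPU utilization across 2 to 10 pods.
autoscaling = client.AutoscalingV1Api()
hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="model-serving-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="model-serving"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)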

To avoid confusion: ultimately, every running container, i.e., every workload, is a pod. In the case of deploying an application, this is technically done via a deployment; temporary compute loads are jobs. Persistent stores such as databases are managed with StatefulSets. The following figure provides an overview of the terms:

Fig. 4: What is what in Kubernetes? Deployment specifies what is desired; the deployment controller takes care of creating, maintaining, and scaling the model containers, which then run as individual pods on the nodes. Jobs and StatefulSets work analogously with their own controller.

3.    Jobs

Kubernetes jobs can be used to execute both one-time and recurring jobs (so-called CronJobs) in the form of a container deployment on the cluster.

In the simplest case, these can be seen as a script that is used, for example, for maintenance or data preparation work on databases. Furthermore, they are also used for batch processing, for example when deep learning models are to be applied to larger data volumes and it is not worthwhile to keep the model continuously on the cluster. In this case, the model container is started up, gets access to the desired dataset, performs its inference on it, saves the results, and shuts down. There is also flexibility here regarding the origin and subsequent storage of the data, so own or cloud databases, bucket/object storage, or even local data and logging frameworks can be connected.

For recurring CronJobs, a simple time scheme can be specified so that, for example, certain customer data, transactions or the like are processed at night. Natural Language Processing can be used to automatically create press reviews at night, for example, which can then be evaluated the following morning: News about a company, its industry, business locations, customers, etc. can be aggregated or sourced, evaluated with NLP, summarized, and presented with sentiment or sorted by topic/content.

Even labor-intensive ETL (Extract Transform Load) processes can be performed or prepared outside business hours.
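
As a rough sketch of the batch-inference scenario described above, such a one-off job could be created with the official Kubernetes Python client as follows. All names, the container image, and the script arguments are made-up placeholders, and access to the cluster via a local kubeconfig is assumed.

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

# One-off batch-inference job: start the model container once, let it process a
# dataset, and terminate. Image, arguments, and paths are illustrative only.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="nightly-nlp-batch"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry at most twice on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="nlp-model",
                        image="registry.example.com/nlp-model:latest",
                        args=["python", "run_inference.py", "--input", "s3://bucket/news/"],
                    )
                ],
            )
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)

For a recurring CronJob, essentially the same pod template is wrapped in a schedule expression (e.g., “0 3 * * *” for 3 a.m. every night).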

4.    Rolling Updates

If a deployment needs to be brought up to the latest version or a rollback to an older version needs to be completed, rolling updates can be triggered in Kubernetes. These guarantee continuous accessibility of the applications and models within a Continuous Integration/Continuous Deployment pipeline.

Such a rollout can be initiated and monitored smoothly in one or a few steps. By means of a rollout history it is also possible not only to jump back to a previous container version, but also to restore the previous deployment parameters, i.e. minimum and maximum number of nodes, which resource group (GPU nodes, CPU nodes with little/much RAM,…), health checks, etc.

If a rolling update is triggered, the respective existing pods are kept running and accessible until the same number of new pods are up and accessible. Here there are methods to guarantee that no requests are lost, as well as parameters that regulate a minimum accessibility or a maximum surplus of pods for the change.

Fig. 5: Illustration of a Rolling Update.

Figure 5 illustrates the rolling update.

1) The current version of an application is located on the Kubernetes cluster with 2 replicas and can be accessed as usual.

2) A rolling update to version V2 is started, the same number of pods as for V1 are created.

3) As soon as the new pods have the state “Running” and, if applicable, health checks have been completed, thus being functional, the containers of the older version are shut down.

4) The older pods are removed and the resources are released again.

The DevOps effort and time involved here are marginal; internally, no hostnames or the like change, while from the consumer’s point of view the service can be accessed as before in the usual way (same IP, URL, …) and has merely been updated to the latest version.
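
In practice, such a rolling update is typically triggered simply by changing the container image in the deployment’s pod template. A minimal sketch with the Kubernetes Python client (deployment and image names are again placeholders):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Changing the image in the pod template triggers a rolling update: new pods with
# :v2 are started, and the old :v1 pods are only removed once their replacements
# are running (and, if configured, passing their health checks).
apps.patch_namespaced_deployment(
    name="model-serving",
    namespace="default",
    body={
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "model", "image": "registry.example.com/model:v2"}
                    ]
                }
            }
        }
    },
)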

5.    Platform/Infrastructure as a Service

Of course, a Kubernetes cluster can also be deployed locally on your on-premises hardware as well as on partially pre-built solutions, such as DGX Workbenches.

Some of our customers have strict policies or requirements regarding (data) compliance or information security, and do not want potentially sensitive data to leave the company. Furthermore, it can be avoided that data traffic flows through non-European nodes or generally ends up in foreign data centers.

Experience shows, however, that this is only the case in a very small proportion of projects. Through encryption, rights management and SLAs of the operators, we consider the use of cloud services and data centers to be generally secure and also use them for larger projects. In this regard, deployment, maintenance, CI/CD pipelines are also largely identical and easy to use thanks to methods of containerization (Docker) and abstraction (Kubernetes).

All major cloud operators like Google (GCP), Amazon (AWS) and Microsoft (Azure), but also smaller providers and soon even exciting new German projects, offer services very similar to Kubernetes. This makes it even easier to deploy and, most importantly, scale a project or model, as auto-scaling allows the cluster to expand or shrink depending on resource needs. From a technical perspective, this largely frees us from having to estimate the demand of a service while keeping the profitability and cost structure the same. Furthermore, the services can also be hosted and operated in different (geographical) zones to guarantee fastest reachability and redundancy.

6.    Node-Variety

The cloud operators offer a large number of different node types to satisfy all resource requirements for all use cases, from simpler web services to high-performance computing. Especially in the field of deep learning, the ever-growing models can thus always be trained and served on the latest required hardware.

For example, while we use nodes with average CPU and low memory for smaller NLP purposes, large Transformer models can be deployed on GPU nodes in the same cluster, which effectively enables their use in the first place and at the same time can speed up inference (application of the model) by a factor of over 20. As the importance of dedicated hardware for neural networks has been steadily increasing, Google also provides access to its custom TPUs optimized for TensorFlow.

The organization and grouping of all these different nodes is done in Kubernetes in so-called node pools. These can be selected or specified in the deployment so that the right resources are allocated to the pods of the models.

7.    Cluster Autoscaling

The extent to which models or services are used, internally or by customers, is often unpredictable or fluctuates greatly over time. With a cluster autoscaler, new nodes can be created automatically, or unneeded “empty” nodes can be removed. Here, too, a minimum number of nodes that should always be available can be specified, as well as, if desired, a maximum number that cannot be exceeded in order to cap costs.

8.    Interfacing with Other Services

In principle, cloud services from different providers can be combined, but it is more convenient and easier to stay with one provider (e.g., Google GCP). This means that services such as data buckets, container registries, or serverless (Lambda) functions can be integrated and used internally in the cloud without major authentication processes. Furthermore, especially in a microservice architecture, network communication among the individual hosts (applications, models) is important and is facilitated within a single provider. Access control/RBAC can also be implemented here, and several clusters or projects can be bridged with a virtual network to better separate areas of responsibility and competence.

Environment and Future Developments

The growing use and spread of Kubernetes has brought with it a whole environment of useful tools, as well as advancements and further abstractions that further facilitate its use.

Tools and Pipelines based on Kubernetes

For example, Kubeflow can be used to trigger the training of machine learning models as a TensorFlow training job and deploy completed models with TensorFlow Serving.

The whole process can also be packaged into a pipeline that then performs training of different models with reference to training, validation and test data in memory buckets, and also monitors or logs their metrics and compares model performance. The workflow also includes the preparation of input data, so that after the initial pipeline setup, experiments can be easily performed to explore model architectures and hyperparameter tuning.

Serverless

Serverless deployment methods such as Cloud Run or Amazon Fargate take another abstraction step away from the technical requirements. With this, containers can be deployed within seconds and scale like pods on a Kubernetes cluster without even having to create or maintain it. So the same infrastructure has once again been simplified in its use. According to the pay-per-use principle, only the time in which the code is actually called and executed is charged.

Conclusion

Kubernetes has become a central pillar in machine learning deployment today. The path from data and model exploration to the prototype and finally to production has been enormously streamlined and simplified by libraries such as PyTorch, TensorFlow and Keras. At the same time, these frameworks can also be applied in enormous detail, if required, to develop customized components or to integrate and adapt existing models using transfer learning. Container technologies such as Docker subsequently allow the result to be bundled with all its requirements and dependencies and executed almost anywhere without drawbacks in speed. In the final step, their deployment, maintenance, and scaling has also become immensely simplified and powerful with Kubernetes.

All of this allows us to develop our own products as well as solutions for customers in a structured way:

  • The components and the framework infrastructure have a high degree of reusability
  • A first milestone or proof-of-concept can be achieved in relatively little time and cost expenditure
  • Further development work expands on this process in a natural way by increasing complexity
  • Ready deployments scale without additional effort, with costs proportional to demand
  • This results in a reliable platform with a predictable cost structure

If you would like to read further about some key components following this article, we have some more interesting articles about:



Management Summary

Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:

  1. Entry-level Variant: Model parameters and performance metrics are logged via an R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
  2. Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
  3. Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.

Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.

Challenges Along the Machine Learning Workflow

Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.

The added value for companies comes primarily from the systematic interaction of model training, in the form of model identification, hyperparameter tuning, and fitting on the training data, and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of a machine learning initiative. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is not implemented correctly, complex dependencies and costly maintenance work can arise during the operational start-up of the model [2]. The following risks are particularly noteworthy in this regard.

1. Ensuring Synchronicity

Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments or even the use of different technologies make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without appropriate version control logic, subsequent synchronization can only be achieved with great effort.

2. Documentation Effort

To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.

3. Heterogeneity of Model Formats

In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.

4. Recovery of Prior Results

In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.

Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLflow. The latter, in particular, is currently one of the most widely used solutions.

MLflow is an open-source project with the goal to combine the best of existing ML platforms to make the integration to existing ML libraries, algorithms, and deployment tools as straightforward as possible [3]. In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.

MLflow Services

MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Model Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be used in isolation.

With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.
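
As a small sketch of how this looks in code (the tracking URI, experiment name, and logged values are placeholders):

import mlflow

# Point the client at a (possibly remote) tracking server and group runs into an experiment.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("fraud_classification")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)        # hyperparameter preset
    mlflow.log_metric("accuracy", 0.93)          # model performance
    mlflow.log_artifact("confusion_matrix.png")  # any local file, e.g., a chart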

The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. One speaks here of the so-called model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon Sagemaker.

MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.

The MLflow Model Registry provides centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the production environment. The entire process can be controlled via a GUI or a REST API.

Examples of Machine Learning Pipelines Using MLflow

In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.

1. Scenario – Entry-Level Variant

Scenario 1 – Simple Metrics Tracking

Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.

[Screenshot 1: MLflow Tracking UI after a model training run]

The screenshot above shows the MLflow Tracking UI as an example after executing a model training run. In this example, the server is hosted locally; of course, it is also possible to host the server remotely, for example in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, the user, and the name of the underlying script are also logged. Clicking on a specific run displays additional information, such as the RunID and the model training duration.

[Screenshot 2: Detail view of a single run]

If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.

[Screenshot 3: Logged MLflow Model artifact of a run]

The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.
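
One possible way to retrieve and reload the best model programmatically is sketched below. It uses the fluent mlflow.search_runs API as an alternative to the client call mentioned above; the experiment name, metric, and artifact path are placeholders that must match what was logged.

import mlflow

# Identify the best run of an (assumed) experiment by a logged metric.
experiment = mlflow.get_experiment_by_name("fraud_classification")
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)
best_run_id = runs.loc[0, "run_id"]

# Load the model artifact that was logged under the artifact path "model" in that run.
model = mlflow.pyfunc.load_model(f"runs:/{best_run_id}/model")
predictions = model.predict(new_data)  # new_data: a prepared pandas DataFrame (assumed)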

2. Scenario – Advanced Model Management

Scenario 2 – Advanced Model Management

Scenario 2 includes, in addition to the modules used in scenario 1, the MLflow Model Registry as a model management component. Here, models logged in specific runs can be registered and managed. These steps can be controlled via the API or the GUI. A basic requirement for using the Model Registry is deploying the MLflow Tracking Server backend store as a database backend store. To register a model via the GUI, select a specific run and scroll to the artifact overview.

[Screenshot 4: Artifact overview with the Register Model option]

Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.

[Screenshot 5: Registered model in the Models tab]

Each model includes an overview page that shows all past versions. This is useful, for example, to track which models were in production when.

[Screenshot 6: Version overview of a registered model]

If you now select a model version, you will get to an overview where, for example, a model description can be added. The Source Run link also takes you to the run from which the model was registered. Here you will also find the associated artifact, which can be used later for deployment.

[Screenshot 7: Detail view of a model version]

In addition, individual model versions can be categorized into defined phases in the Stage area. This feature can be used, for example, to determine which model is currently being used in production or is to be transferred there. For deployment, in contrast to scenario 1, versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter the desired model and its associated RunID. Similar to scenario 1, deployment can then be completed to, for example, AWS Sagemaker or AzureML via the respective Python APIs.
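
Sketched in code, such a registry-based deployment step could look roughly like this; the model name, version, and stage are placeholders:

from mlflow.tracking import MlflowClient
import mlflow.pyfunc

client = MlflowClient()

# Inspect all registered versions of an (assumed) model "fraud_classifier".
for mv in client.search_model_versions("name='fraud_classifier'"):
    print(mv.version, mv.current_stage, mv.run_id)

# Promote a specific version to the "Production" stage.
client.transition_model_version_stage(
    name="fraud_classifier", version="3", stage="Production"
)

# Deployment code can then always resolve the current production model by stage.
model = mlflow.pyfunc.load_model("models:/fraud_classifier/Production")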

3. Scenario – Collaborative Workflow Management

Scenario 3 – Full Workflow Management

In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local environment can act as a project and be controlled by an MLproject file. Here, package dependencies can be recorded in a conda.yaml, and the MLproject file can be accessed when starting the project. Then the corresponding conda environment is created with all dependencies before training and logging the model. This avoids the need for manual alignment of the development environments of all developers involved and also guarantees standardized and comparable results of all runs. Especially the latter is necessary for the deployment context since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile. This offers the advantage that package dependencies independent of Python can also be defined. Likewise, MLflow Projects allow the use of different commit hashes or branch names to use other project states, provided a Git repository is used.

An interesting use case is the modularized development of machine learning training pipelines [4]. For example, data preparation can be decoupled from model training and developed in parallel, while another team uses a different branch name to train the model. In this case, only a different branch name must be used as a parameter when starting the project in the MLflow Projects file. The final data preparation can then be pushed to the same branch name used for model training and would thus already be fully implemented in the training pipeline. The deployment can also be controlled as a sub-module within the project pipeline through a Python script via the ML Project File and can be carried out analogous to scenario 1 or 2 on a platform of your choice.
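
For illustration, starting such a project directly from a Git repository might look as follows; the repository URL, entry point, branch name, and parameters are purely illustrative:

import mlflow

# Run an MLflow Project from a Git repository; "version" can be a branch name or
# commit hash, so different project states can be executed reproducibly.
mlflow.run(
    uri="https://github.com/example-org/fraud-training",  # hypothetical repository
    entry_point="main",
    version="feature/data-prep",
    parameters={"n_estimators": 200},
    experiment_name="fraud_classification",
)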

Conclusion and Outlook

MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.

In this article, three machine learning workflows of ascending complexity were presented as examples. MLflow services can support anything from simple logging of results in an interactive UI to more complex, modular modeling pipelines. Naturally, there are also synergies outside the MLflow ecosystem with other tools, such as Docker/Kubernetes for model scaling or Jenkins for CI/CD pipeline control. If there is further interest in MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.

Resources


Car Model Classification III: Explainability of Deep Learning Models with Grad-CAM

In the first article of this series on car model classification, we built a model using transfer learning to classify the car model through an image of a car. In the second article, we showed how TensorFlow Serving can be used to deploy a TensorFlow model using the car model classifier as an example. We dedicate this third post to another essential aspect of deep learning and machine learning in general: the explainability of model predictions.

We will start with a short general introduction to the topic of explainability in machine learning. Next, we will briefly talk about popular methods that can be used to explain and interpret predictions from CNNs. We will then explain Grad-CAM, a gradient-based method, in depth by going through an implementation step by step. Finally, we will show you the results we obtained with our Grad-CAM implementation for the car model classifier.

A Brief Introduction to Explainability in Machine Learning

For the last couple of years, explainability has been a recurring but still niche topic in machine learning. Over the past four years, however, interest in this topic has started to accelerate. At least one particular reason fuelled this development: the increased number of machine learning models in production. On the one hand, this leads to a growing number of end-users who need to understand how models are making decisions. On the other hand, an increasing number of machine learning developers need to understand why (or why not) a model is functioning in a particular way.

This increasing demand for explainability has led to some noteworthy innovations, both methodological and technical, in recent years:

Methods for Explaining CNN Outputs for Images

Deep neural networks, and especially complex architectures like CNNs, were long considered pure black-box models. As written above, this has changed in recent years, and there are now various methods available to explain CNN outputs. For example, the excellent tf-explain library implements a wide range of useful methods for TensorFlow 2.x. We will now briefly talk about the ideas behind the different approaches before turning to Grad-CAM:

Activations Visualization

This is the most straightforward visualization technique. It simply shows the output of a specific layer within the network during the forward pass. It can be helpful to get a feel for the extracted features since, during training, most of the activations tend towards zero (when using the ReLU activation). An example of the output of the first convolutional layer of the car model classifier is shown below:

Vanilla Gradients

One can use the vanilla gradients of the predicted class’s output with respect to the input image to derive input pixel importances.

We can see that the highlighted region is mainly focused on the car. Compared to other methods discussed below, the discriminative region is much less confined.
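
A minimal sketch of this idea in TensorFlow is shown below. It assumes that model (a Keras classifier), img (a preprocessed batch of shape (1, 224, 224, 3)), and label_idx (the index of the predicted class) are already defined; none of these names come from tf-explain or from the code discussed later.

import tensorflow as tf

# Vanilla gradients: the gradient of the class score w.r.t. the input pixels
# serves as a simple saliency map.
img_tensor = tf.convert_to_tensor(img)

with tf.GradientTape() as tape:
    tape.watch(img_tensor)  # inputs are not trainable variables, so watch them explicitly
    predictions = model(img_tensor)
    class_score = predictions[:, label_idx]

# Absolute gradient, reduced over the color channels, gives one importance value per pixel.
saliency = tf.reduce_max(tf.abs(tape.gradient(class_score, img_tensor)), axis=-1)[0]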

Occlusion Sensitivity

This approach computes the importance of certain parts of the input image by re-evaluating the model’s prediction with different parts of the input image hidden. Parts of the image are hidden iteratively by replacing them with grey pixels. The weaker the prediction gets with a part of the image hidden, the more important this part is for the final prediction. Based on the discriminative power of the regions of the image, a heatmap can be constructed and plotted. Applying occlusion sensitivity to our car model classifier did not yield any meaningful results. Instead, we show tf-explain’s sample image, which shows the result of applying the occlusion sensitivity procedure to a cat image.
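
The core of the procedure fits into a few lines. The following is a minimal sketch (not tf-explain’s implementation) that again assumes model, img (a single preprocessed image of shape (224, 224, 3) with values scaled to [0, 1]), and label_idx to be defined, and uses an illustrative patch size of 32 pixels:

import numpy as np

patch = 32
baseline = model.predict(img[None, ...])[0, label_idx]    # prediction on the unmodified image
heatmap = np.zeros((224 // patch, 224 // patch))

for i in range(0, 224, patch):
    for j in range(0, 224, patch):
        occluded = img.copy()
        occluded[i:i + patch, j:j + patch, :] = 0.5        # grey out one patch
        prob = model.predict(occluded[None, ...])[0, label_idx]
        heatmap[i // patch, j // patch] = baseline - prob  # large drop = important region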

CNN Fixations

Another exciting approach called CNN Fixations was introduced in this paper. The idea is to backtrack which neurons were significant in each layer, given the activations from the forward pass and the network weights. The neurons with large influence are referred to as fixations. This approach thus allows finding the essential regions for obtaining the result without the need for any recomputation (e.g., in the case of occlusion sensitivity above, where multiple predictions must be made).

The procedure can be described as follows: the node corresponding to the class is chosen as the fixation in the output layer. Then, the fixations for the previous layer are determined by computing which of its nodes have the most impact on the next higher layer's fixations found in the previous step. The importance of a node is computed by multiplying its activation with the corresponding weight. If you are interested in the details of the procedure, check out the paper or the corresponding GitHub repo. This backtracking is continued until the input image is reached, yielding a set of pixels with considerable discriminative power. An example from the paper is shown below.
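To illustrate just the core idea for a single fully connected layer, here is a toy sketch (this is a strong simplification, not the full procedure from the paper, and all names are hypothetical):

import numpy as np

# Toy sketch of the backtracking idea for one dense layer: given activations
# `a` of the previous layer, the weight matrix `W` (shape: n_in x n_out) and a
# fixation index `j` in the current layer, the most influential input neurons
# are those with the largest contribution a_i * W_ij.
def backtrack_fixation(a, W, j, top_k=5):
    contributions = a * W[:, j]            # element-wise influence on neuron j
    return np.argsort(contributions)[::-1][:top_k]

# Hypothetical example with random activations and weights
a = np.random.rand(128)                    # activations from the forward pass
W = np.random.randn(128, 10)               # weights of the dense layer
print(backtrack_fixation(a, W, j=3))       # indices of the new fixations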

CAM

Introduced in this paper, class activation mapping (CAM) is a procedure for finding the discriminative region(s) of a CNN prediction by computing class activation maps. A significant drawback of this procedure is that it requires the network to use global average pooling (GAP) as the last step before the prediction layer. It is thus not possible to apply this approach to CNNs in general. An example is shown in the figure below (taken from the CAM paper):

The class activation map assigns importance to every position (x, y) in the last convolutional layer by computing the linear combination of the activations, weighted by the corresponding output weights for the observed class (Australian terrier in the example above). The resulting class activation map is then upsampled to the size of the input image, as depicted by the heatmap above. Due to the architecture of CNNs, the activation in, e.g., the top left of any layer is directly related to the top left of the input image. This is why we can conclude which input regions are important by looking only at the last CNN layer.
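In code, the class activation map boils down to a weighted sum of the last convolutional feature maps. A sketch (assumptions: conv_maps are the activations of the last conv layer with shape (h, w, n_filters), class_weights are the output weights of the observed class with shape (n_filters,)):

import numpy as np
import cv2

# Sketch of the CAM computation (assumptions: `conv_maps` with shape
# (h, w, n_filters), `class_weights` with shape (n_filters,)).
cam = np.einsum('hwk,k->hw', conv_maps, class_weights).astype('float32')

# Upsample to the input resolution and min-max scale for plotting
cam = cv2.resize(cam, (224, 224))
cam = np.maximum(cam, 0)
cam = (cam - cam.min()) / (cam.max() - cam.min())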

The Grad-CAM procedure we will discuss in detail below is a generalization of CAM. Grad-CAM can be applied to general CNN architectures, including networks with multiple fully connected layers at the output.

Grad-CAM

Grad-CAM extends the applicability of the CAM procedure by incorporating gradient information. Specifically, the gradient of the loss w.r.t. the last convolutional layer determines the weight for each of its feature maps. As in the CAM procedure above, the further steps are to compute the weighted sum of the activations and then upsample the result to the image size so that the original image can be plotted with the obtained heatmap. We will now show and discuss the code that can be used to run Grad-CAM. The complete code is available here on GitHub.

import pickle
import numpy as np
import tensorflow as tf
import cv2
from car_classifier.modeling import TransferModel

INPUT_SHAPE = (224, 224, 3)

# Load list of targets
with open('.../classes.pickle', 'rb') as file:
    classes = pickle.load(file)

# Load model
model = TransferModel('ResNet', INPUT_SHAPE, classes=classes)
model.load('...')

# Gradient model, takes the original input and outputs tuple with:
# - output of conv layer (in this case: conv5_block3_3_conv)
# - output of head layer (original output)
grad_model = tf.keras.models.Model([model.model.inputs],
                                   [model.model.get_layer('conv5_block3_3_conv').output,
                                    model.model.output])
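# Note: `img` is assumed to be a preprocessed input batch of shape
# (1, 224, 224, 3) and `label_idx` the index of the class of interest
# (see the explanation below the code).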

# Run model and record outputs, loss, and gradients
with tf.GradientTape() as tape:
    conv_outputs, predictions = grad_model(img)
    loss = predictions[:, label_idx]

# Output of conv layer
output = conv_outputs[0]

# Gradients of loss w.r.t. conv layer
grads = tape.gradient(loss, conv_outputs)[0]

# Guided Backprop (elimination of negative values)
gate_f = tf.cast(output > 0, 'float32')
gate_r = tf.cast(grads > 0, 'float32')
guided_grads = gate_f * gate_r * grads

# Average weight of filters
weights = tf.reduce_mean(guided_grads, axis=(0, 1))

# Class activation map (cam)
# Multiply output values of conv filters (feature maps) with gradient weights
cam = np.zeros(output.shape[0: 2], dtype=np.float32)
for i, w in enumerate(weights):
    cam += w * output[:, :, i]

# Or more elegant: 
# cam = tf.reduce_sum(output * weights, axis=2)

# Rescale to original image size and min-max scale
# (np.array() works whether cam is a numpy array or a tf tensor)
cam = cv2.resize(np.array(cam), (224, 224))
cam = np.maximum(cam, 0)
heatmap = (cam - cam.min()) / (cam.max() - cam.min())

  • The first step is to load an instance of the model.
  • Then, we create a new keras.Model instance that has two outputs: the activations of the last CNN layer ('conv5_block3_3_conv') and the original model output.
  • Next, we run a forward pass for our new grad_model, using as input an image (img) of shape (1, 224, 224, 3), preprocessed with the resnetv2.preprocess_input method. tf.GradientTape is set up and applied to record the gradients (the gradients are stored in the tape object). Further, the outputs of the convolutional layer (conv_outputs) and the head layer (predictions) are stored as well. Finally, we can use label_idx to get the loss corresponding to the label we want to find the discriminative regions for.
  • Using the gradient method, one can extract the desired gradients from tape. In this case, we need the gradient of the loss w.r.t. the output of the convolutional layer.
  • In a further step, guided backpropagation is applied: gradient values are kept only where both the activations and the gradients are positive. This essentially means restricting attention to the activations that positively contribute to the desired output prediction.
  • The weights are computed by averaging the obtained guided gradients for each filter.
  • The class activation map cam is then computed as the weighted sum of the feature map activations (output). The for loop above helps to understand what the computation does in detail. A less straightforward but more efficient way to implement the CAM computation is to use tf.reduce_sum, as shown in the commented line below the loop.
  • Finally, the resizing is done using OpenCV's resize method, and the heatmap is rescaled to contain values in [0, 1] for plotting. A sketch of how the heatmap can be overlaid on the original image follows below.
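To visualize the result as in the examples below, the heatmap can be overlaid on the original image, for instance like this (a sketch assuming matplotlib and an unpreprocessed RGB version of the image, orig_img, with values scaled to [0, 1]):

import matplotlib.pyplot as plt

# Sketch of the overlay plot (assumption: `orig_img` is the unpreprocessed
# RGB image of shape (224, 224, 3) with values in [0, 1]).
plt.imshow(orig_img)
plt.imshow(heatmap, cmap='jet', alpha=0.4)   # semi-transparent heatmap on top
plt.axis('off')
plt.show()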

A version of Grad-CAM is also implemented in tf-explain.

Examples

We now use the Grad-CAM implementation to interpret and explain the predictions of the TransferModel for car model classification. We start by looking at car images taken from the front.

Grad-CAM for car images from the front

The red regions highlight the most important discriminative regions, the blue regions the least important. We can see that for images from the front, the CNN focuses on the car's grille and the area containing the logo. If the car is slightly tilted, the focus shifts more to the edge of the vehicle. This is also the case for slightly tilted images of the cars' backs, as shown in the middle image below.

Grad-CAM for car images from the back

For car images from the back, the most crucial discriminative region is near the number plate. As mentioned above, for cars looked at from an angle, the closest corner has the highest discriminative power. A very interesting example is the Mercedes-Benz C-class on the right side, where the model not only focuses on the tail lights but also puts the highest discriminative power on the model lettering.

Grad-CAM for car images from the side

When looking at images from the side, we notice that the discriminative region is restricted to the bottom half of the cars. Again, the angle from which the car image was taken determines whether the region shifts towards the front or rear corner.

In general, the most important observation is that the discriminative areas are always confined to parts of the cars. There are no images where the background has high discriminative power. Examining the heatmaps and the associated discriminative regions can thus serve as a sanity check for CNN models.

Conclusion

We discussed multiple approaches to explaining CNN classifier outputs. We introduced Grad-CAM in detail by examining the code and looking at examples for the car model classifier. Most notably, the discriminative regions highlighted by the Grad-CAM procedure are always focused on the car and never on the background of the images. The result shows that the model works as we expect and uses specific parts of the car to discriminate between different models.

In the fourth and last part of this blog series, we will show how the car classifier can be built into a web application using Dash. See you soon!

Stephan Müller

Did you ever want to make your machine learning model available to other people, but didn't know how? Or maybe you just heard the term API and want to know what's behind it? Then this post is for you!

Here at STATWORX, we use and write APIs daily. For this article, I wrote down how you can build your own API for a machine learning model that you create, and what some of the most important concepts like REST mean. After reading this short article, you will know how to make requests to your API from within a Python program. So have fun reading and learning!

What is an API?

API is short for Application Programming Interface. It allows users to interact with the underlying functionality of some written code by accessing the interface. There is a multitude of APIs, and chances are good that you have already heard about the type of API we are going to talk about in this blog post: the web API.

This specific type of API allows users to interact with functionality over the internet. In this example, we are building an API that will provide predictions through our trained machine learning model. In a real-world setting, this kind of API could be embedded in some type of application, where a user enters new data and receives a prediction in return. APIs are very flexible and easy to maintain, making them a handy tool in the daily work of a Data Scientist or Data Engineer.

An example of a publicly available machine learning API is Time Door. It provides Time Series tools that you can integrate into your applications. APIs can also be used to make data available, not only machine learning models.

API Illustration

And what is REST?

Representational State Transfer (or REST) is an approach that entails a specific style of communication through web services. When using some of the REST best practices to implement an API, we call that API a "REST API". There are other approaches to web communication, too (such as the Simple Object Access Protocol: SOAP), but REST generally runs on less bandwidth, making it preferable for serving your machine learning models.

In a REST API, the four most important types of requests are:

  • GET
  • PUT
  • POST
  • DELETE

For our little machine learning application, we will mostly focus on the POST method, since it is very versatile and, unlike GET, lets the client send input data in the request body.

It's important to mention that REST APIs are stateless: they do not save the inputs you send during an API call, so no state is preserved between requests. That's significant because it allows multiple users and applications to use the API at the same time without one user's request interfering with another's.

The Model

For this how-to article, I decided to serve a machine learning model trained on the famous iris dataset. If you don't know the dataset, you can check it out here. When making predictions, we will have four input parameters: sepal length, sepal width, petal length, and, finally, petal width. These are used to decide which type of iris flower the input describes.

For this example, I used the scikit-learn implementation of a simple KNN (k-nearest neighbors) algorithm to predict the type of iris:

# model.py
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib
import numpy as np


def train(X,y):

    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    knn = KNeighborsClassifier(n_neighbors=1)

    # fit the model
    knn.fit(X_train, y_train)
    preds = knn.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'Successfully trained model with an accuracy of {acc:.2f}')

    return knn

if __name__ == '__main__':

    iris_data = datasets.load_iris()
    X = iris_data['data']
    y = iris_data['target']

    labels = {0 : 'iris-setosa',
              1 : 'iris-versicolor',
              2 : 'iris-virginica'}

    # rename integer labels to actual flower names
    y = np.vectorize(labels.__getitem__)(y)

    mdl = train(X,y)

    # serialize model
    joblib.dump(mdl, 'iris.mdl')

As you can see, I trained the model with 70% of the data and then validated it with the remaining 30% as out-of-sample test data. After the model training has taken place, I serialize the model with the joblib library. Joblib is basically an alternative to pickle that is better suited to persisting scikit-learn estimators containing large numpy arrays (such as the KNN model, which stores all the training data). After the model is saved as a joblib file (the file extension does not matter, by the way, so don't be confused if some people call it .model or .joblib), it can be loaded again later in our application.

The API with Python and Flask

To build an API from our trained model, we will be using the popular web development package Flask together with the Flask-RESTful extension. Further, we import joblib to load our model and numpy to handle the input and output data.

In a new script, namely app.py, we can now set up an instance of a Flask app and an API and load the trained model (this requires saving the model in the same directory as the script):

from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')

The second step is to create a class that is responsible for our prediction. This class will be a child class of the Flask-RESTful class Resource. This lets our class inherit the respective class methods and allows Flask to do the work behind your API without us needing to implement everything ourselves.

In this class, we can also define the methods (REST requests) that we talked about before. So now we implement a Predict class with the .post() method mentioned earlier.

The post method allows the user to send a body along with the default API parameters. Usually, we want the body to be in JSON format. Since this body is not delivered directly in the URL but as text, we have to parse this text and fetch the arguments. The flask_restful package offers the RequestParser class for that. We simply add all the arguments we expect to find in the JSON input with the .add_argument() method and parse them into a dictionary. We then convert this dictionary into an array (note that np.fromiter assembles the values in the order the arguments were added to the parser, so this order should match the order of features the model was trained on) and return the prediction of our model as JSON.

class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200

You might be wondering what the 200 is that we are returning at the end: APIs return HTTP status codes along with their responses. You might be familiar with the famous 404 - page not found code. 200 simply means that the request was processed successfully. You basically let the user know that everything went according to plan.
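If you want to signal problems explicitly, you can return other status codes as well. As a small illustrative sketch (not part of the original app), the following lines could be placed right after parse_args() inside the post() method above to reject incomplete input with a 400 - bad request code; the message text is just one possible convention:

# Illustrative sketch: reject incomplete input with 400 (Bad Request).
# These lines would go right after `args = parser.parse_args()` in the
# post() method shown above.
if any(value is None for value in args.values()):
    return {'message': 'Missing input parameter'}, 400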

In the end, you just have to add the Predict class as a resource to the API, and write the main function:

API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')

The '/predict' you see in the .add_resource() call is the so-called API endpoint. Through this endpoint, users of your API will be able to send (in this case) POST requests. If you don't define a port, port 5000 will be the default.

You can see the whole code for the app again here:

# app.py
from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib  # in older scikit-learn versions: from sklearn.externals import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')


class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200


API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')

Run the API

Now it’s time to run and test our API!

To run the app, simply open a terminal in the same directory as your app.py script and run this command.

python app.py

You should now see a message that the API is running on your localhost on the port you defined. There are several ways of accessing the API once it is running. For debugging and testing purposes, I usually use tools like Postman. We can also access the API from within a Python application, just like another user might want to do to use your model in their code.

We use the requests module and first define the URL to access and the body to send along with our HTTP request:

import requests

url = 'http://127.0.0.1:1080/predict'  # localhost and the defined port + endpoint
body = {
    "petal_length": 2,
    "sepal_length": 2,
    "petal_width": 0.5,
    "sepal_width": 3
}
response = requests.post(url, data=body)
response.json()

The output should look something like this:

Out[1]: {'Prediction': 'iris-versicolor'}

That’s how easy it is to include an API call in your Python code! Please note that this API is just running on your localhost. You would have to deploy the API to a live server (e.g., on AWS) for others to access it.

Conclusion

In this blog article, you got a brief overview of how to build a REST API to serve your machine learning model with a web interface. Further, you now understand how to integrate simple API requests into your Python code. For the next step, maybe try securing your APIs? If you are interested in learning how to build an API with R, you should check out this post. I hope that this gave you a solid introduction to the concept and that you will be building your own APIs immediately. Happy coding!

 

Jannik Klauke
