
At statworx, we deal intensively with how to get the best possible results from large language models (LLMs). In this blog post, I present five approaches that have proven successful both in research and in our own work with LLMs. While this text is limited to the manual design of prompts for text generation, image generation and automated prompt search will be the topic of future posts.

Mega models herald a new paradigm

The arrival of the revolutionary language model GPT-3 was not only a turning point for the research field of natural language processing (NLP) but also heralded a paradigm shift in AI development: prompt learning. Prior to GPT-3, the standard was fine-tuning medium-sized language models such as BERT, where re-training on new data adapts the pre-trained model to the desired use case. Such fine-tuning requires example data for the target application, as well as the computational capacity to at least partially re-train the model.

The new large language models such as OpenAI's GPT-3 and BigScience's BLOOM, on the other hand, have been trained by their development teams with such enormous amounts of resources that these models have achieved a new level of independence in their intended use: these LLMs no longer require elaborate fine-tuning to learn their specific purpose, but already produce impressive results given a targeted instruction ("prompt") in natural language.

So, we are in the midst of a revolution in AI development: Thanks to prompt learning, interaction with models no longer takes place via code, but in natural language. This is a giant step forward for the democratization of language modeling. Generating text or, most recently, even creating images requires no more than rudimentary language skills. However, this does not mean that compelling or impressive results are accessible to all. High quality outputs require high quality inputs. For us users, this means that engineering efforts in NLP are no longer focused on model architecture or training data, but on the design of the instructions that models receive in natural language. Welcome to the age of prompt engineering.

Figure 1: From prompt to prediction with a large language model.

Prompts are more than just snippets of text

Templates facilitate the handling of prompts

Since LLMs have not been trained on a specific use case, it is up to the prompt design to provide the model with the exact task. So-called “prompt templates” are used for this purpose. A template defines the structure of the input that is passed on to the model. Thus, the template takes over the function of fine-tuning and determines the expected output of the model for a specific use case. Using sentiment analysis as an example, a simple prompt template might look like this:

The expressed sentiment in text [X] is: [Z]

The model thus searches for the token z that, based on the trained parameters and the text at position [X], maximizes the probability of the masked token at position [Z]. The template thereby specifies the context of the problem to be solved and defines the relationship between the input at position [X] and the output to be predicted at position [Z]. The modular structure of templates enables the systematic processing of large numbers of texts for the desired use case.
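In code, such a template is little more than a string with slots. The following minimal Python sketch illustrates the idea; the constant and function names are our own and not part of any library:

# A minimal prompt template for sentiment analysis: the [X] slot is filled
# with the input text, and the model is expected to predict the [Z] slot.
SENTIMENT_TEMPLATE = 'The expressed sentiment in text "{x}" is:'

def build_prompt(text: str) -> str:
    """Fill the [X] slot of the template with the given text."""
    return SENTIMENT_TEMPLATE.format(x=text)

# The same template systematically processes any number of texts.
for text in ["I love this film.", "The service was terrible."]:
    print(build_prompt(text))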

Figure 2: Prompt templates define the structure of a prompt.

Prompts do not necessarily need examples

The template presented above is an example of a so-called "0-shot" prompt, since it contains only an instruction, without any demonstration by examples. Originally, LLMs were called "Few-Shot Learners" by the developers of GPT-3, i.e., models whose performance can be maximized with a selection of solved examples of the problem (Brown et al., 2020). However, a follow-up study showed that with strategic prompt design, 0-shot prompts, without a large number of examples, can achieve comparable performance (Reynolds & McDonell, 2021). Since different approaches are also used in research, the next section presents five strategies for effective prompt template design.

5 Strategies for Effective Prompt Design

Task demonstration

In the conventional few-shot setting, the problem to be solved is narrowed down by providing several solved examples. These examples are meant to serve a function similar to that of additional training samples during fine-tuning and thus define the specific use case of the model. Text translation is a common example of this strategy, which can be represented with the following prompt template:

French: "Il pleut à Paris"

English: "It's raining in Paris"

French: "Copenhague est la capitale du Danemark"

English: "Copenhagen is the capital of Denmark"

[…]

French: [X]

English: [Z]
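Programmatically, such a few-shot prompt can be assembled from a list of example pairs. Below is a small sketch, assuming the solved examples are kept in a Python list; the helper function is purely illustrative:

# Build a few-shot translation prompt from solved example pairs.
EXAMPLES = [
    ("Il pleut à Paris", "It's raining in Paris"),
    ("Copenhague est la capitale du Danemark", "Copenhagen is the capital of Denmark"),
]

def few_shot_prompt(sentence: str) -> str:
    """Prepend the solved examples, then leave the final slot open for the model."""
    shots = "\n\n".join(f'French: "{fr}"\nEnglish: "{en}"' for fr, en in EXAMPLES)
    return f'{shots}\n\nFrench: "{sentence}"\nEnglish:'

print(few_shot_prompt("Le chat dort"))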

While the solved examples are good for defining the problem setting, they can also cause problems. "Semantic contamination" refers to the phenomenon of the LLM interpreting the content of the translated sentences as relevant to the prediction. Examples within the semantic context of the task produce better results, while examples out of context can lead to the prediction Z being "contaminated" in terms of its content (Reynolds & McDonell, 2021). If the above template were used to translate complex facts, the model might well interpret an ambiguous input sentence as a statement about a major European city.

Task Specification

Recent research shows that with good prompt design, even the 0-shot approach can yield competitive results. For example, it has been demonstrated that LLMs do not require pre-solved examples at all, as long as the problem is defined as precisely as possible in the prompt (Reynolds & McDonell, 2021). This specification can take different forms, but it is always based on the same idea: to describe as precisely as possible what is to be solved, but without demonstrating how.

A simple example of the translation case would be the following prompt:

Translate from French to English [X]: [Z]

This may already work, but the researchers recommend making the prompt as descriptive as possible and explicitly mentioning translation quality:

A French sentence is provided: [X]. The masterful French translator flawlessly translates the sentence to English: [Z]

This helps the model locate the desired problem solution in the space of the learned tasks.

Figure 3: A clear task description can greatly increase the prediction quality.

This is also recommended in use cases outside of translations. A text can be summarized with a simple command:

Summarize the following text: [X]: [Z]

However, better results can be expected with a more concrete prompt:

Rephrase this sentence with easy words so a child understands it,
emphasize practical applications and examples: [X]: [Z]

The more accurate the prompt, the greater the control over the output.
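In practice, such a prompt is simply sent to the model as a string. As a sketch, here is how the rephrasing prompt above could be passed to GPT-3 using the pre-1.0 interface of the openai Python package; the model name and sampling parameters are illustrative choices, not recommendations:

# Send a descriptive 0-shot prompt to GPT-3 (pre-1.0 openai interface).
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: replace with a real key

prompt = (
    "Rephrase this sentence with easy words so a child understands it, "
    "emphasize practical applications and examples: "
    '"Photosynthesis converts light energy into chemical energy."'
)

response = openai.Completion.create(
    model="text-davinci-002",  # illustrative model choice
    prompt=prompt,
    max_tokens=64,
    temperature=0.7,
)
print(response["choices"][0]["text"].strip())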

Prompts as constraints

Taken to its logical conclusion, this approach to controlling the model means constraining the model's behavior through careful prompt design. This perspective is useful because, during training, LLMs learn to complete many different kinds of text and can thus solve a wide range of problems. With this design strategy, the basic approach to prompt design shifts from describing the problem to excluding undesirable results by constraining the model's behavior. Which prompt leads to the desired result, and only to the desired result? The following prompt indicates a translation task, but it contains nothing to prevent the model from simply continuing the sentence into a story.

Translate French to English Il pleut à Paris

One approach to improve this prompt is to use both semantic and syntactic means:

Translate this French sentence to English: “Il pleut à Paris.”

The use of syntactic elements such as the colon and quotation marks makes it clear where the sentence to be translated begins and ends. The word "sentence" also signals that the task concerns only a single sentence. These measures reduce the likelihood that the prompt is misunderstood and treated as anything other than a translation problem.

Use of “memetic proxies”

This strategy can be used to increase the density of information in a prompt and avoid long descriptions through culturally understood context. Memetic proxies can be used in task descriptions and use implicitly understood situations or personae instead of detailed instructions:

A primary school teacher rephrases the following sentence: [X]: [Z]

This prompt is less descriptive than the previous example of rephrasing in simple words. However, the situation described carries a much higher density of information: the mention of a primary school teacher already implies that the output should be understandable to children and thus hopefully increases the likelihood of practical examples in the output. Similarly, prompts can describe fictional conversations with well-known personalities so that the output reflects their worldview or way of speaking:

In this conversation, Yoda responds to the following question: [X]

Yoda: [Z]

This approach helps keep prompts short by relying on implicitly understood context, thereby increasing the information density within a prompt. Memetic proxies are also used in prompt design for other modalities. In image generation models such as DALL-E 2, the suffix "Trending on Artstation" often leads to higher-quality results, although semantically it says nothing about the image to be generated.

Metaprompting

Metaprompting is how the research team of one study describes the approach of enriching prompts with instructions that are tailored to the task at hand. They describe this as a way to constrain a model with clearer instructions so that the task at hand can be better accomplished (Reynolds & McDonell, 2021). The following example can help to solve mathematical problems more reliably and to make the reasoning path comprehensible:

[X]. Let us solve this problem step-by-step: [Z]

Similarly, multiple choice questions can be enriched with metaprompts so that the model actually chooses an option in the output rather than continuing the list:

[X]. In order to solve this problem, let us analyze each option and choose the best: [Z]

Metaprompts thus represent another means of constraining model behavior and results.

Figure 4: Metaprompts can be used to define procedures for solving problems.

Outlook

Prompt learning is a very young paradigm, and the closely related prompt engineering is still in its infancy. However, the importance of sound prompt-writing skills will undoubtedly only increase. Not only language models such as GPT-3, but also the latest image generation models require their users to have solid prompt design skills in order to create convincing results. The strategies presented here are approaches, proven in both research and practice, for systematically writing prompts that get better results from large language models.

In a future blog post, we will use this experience with text generation to derive best practices for another category of generative models: state-of-the-art diffusion models for image generation, such as DALL-E 2, Midjourney, and Stable Diffusion.

Sources

Brown, Tom B. et al. 2020. “Language Models Are Few-Shot Learners.” arXiv:2005.14165 [cs]. http://arxiv.org/abs/2005.14165 (March 16, 2022).

Reynolds, Laria, and Kyle McDonell. 2021. "Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm." arXiv:2102.07350 [cs]. http://arxiv.org/abs/2102.07350 (July 1, 2022).

Oliver Guggenbühl

Artificially enhancing face images is all the rage

What can AI contribute?

In recent years, image filters have become wildly popular on social media. These filters let anyone adjust their face and surroundings in different ways, leading to entertaining results. Often, filters enhance facial features that match a certain beauty standard. As AI experts, we asked ourselves what our tools can achieve in the area of face representation. One issue that sparked our interest is gender representation. We were curious: how does AI represent gender differences when creating these images? And on top of that: can we generate gender-neutral versions of existing face images?

Using StyleGAN on existing images

When thinking about what existing images to explore, we were curious to see how our own faces would be edited. Additionally, we decided to use several celebrities as inputs – after all, wouldn’t it be intriguing to observe world-famous faces morphed into different genders?

Currently, we often see text-prompt-based image generation models like DALL-E in the center of public discourse. Yet, the AI-driven creation of photo-realistic face images has long been a focus of researchers due to the apparent challenge of creating natural-looking face images. Searching for suitable AI models to approach our idea, we chose the StyleGAN architectures that are well known for generating realistic face images.

Adjusting facial features using StyleGAN

One crucial aspect of this AI’s architecture is the use of a so-called latent space from which we sample the inputs of the neural network. You can picture this latent space as a map on which every possible artificial face has a defined coordinate. Usually, we would just throw a dart at this map and be happy about the AI producing a realistic image. But as it turns out, this latent space allows us to explore various aspects of artificial face generation. When you move from one face’s location on that map to another face’s location, you can generate mixtures of the two faces. And as you move in any arbitrary direction, you will see random changes in the generated face image.

This makes the StyleGAN architecture a promising approach for exploring gender representation in AI.

Can we isolate a gender direction?

So, are there directions that allow us to change certain aspects of the generated image? Could a gender-neutral representation of a face be approached this way? Pre-existing works have found semantically interesting directions, yielding fascinating results. One of those directions can alter a generated face image to have a more feminine or masculine appearance. This lets us explore gender representation in images.

The approach we took for this article was to generate multiple images by making small steps in each gender’s direction. That way, we can compare various versions of the faces, and the reader can, for example, decide which image comes closest to a gender-neutral face. It also allows us to examine the changes more clearly and look for unwanted characteristics in the edited versions.
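Conceptually, this stepping procedure amounts to simple vector arithmetic in the latent space. The following Python sketch illustrates it with NumPy stand-ins; the generator G, the encoded latent, and the gender direction are hypothetical placeholders for the pretrained components:

# Stepping along a latent "gender" direction (illustrative stand-ins only).
import numpy as np

latent_dim = 512                         # StyleGAN's usual latent size
z = np.random.randn(latent_dim)          # stand-in for an encoded face
direction = np.random.randn(latent_dim)  # stand-in for the gender direction
direction /= np.linalg.norm(direction)   # normalize to unit length

# Small steps: negative values towards "female", positive towards "male".
steps = np.linspace(-3.0, 3.0, num=7)
edited_latents = [z + alpha * direction for alpha in steps]

# With a real generator G, each latent would be rendered to an image:
# images = [G.synthesis(w) for w in edited_latents]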

Introducing our own faces to the AI

The described method can be utilized to alter any face generated by the AI towards a more feminine or masculine version. However, a crucial challenge remains: since we want to use our own images as a starting point, we must be able to obtain the latent coordinate (in our analogy, the correct place on the map) for a given face image. This sounds easy at first, but the StyleGAN architecture we used only goes one way, from latent coordinate to generated image, not the other way around. Thankfully, other researchers have explored this very problem. Our approach thus heavily builds on the Python notebook found here. The researchers built another "encoder" AI that takes a face image as input and finds its corresponding coordinate in the latent space.

And with that, we finally have all parts necessary to realize our goal: exploring different gender representations using an AI. In the photo sequences below, the center image is the original input image. Towards the left, the generated faces appear more female; towards the right, they seem more male. Without further ado, we present the AI-generated images of our experiment:

Results: photo series from female to male

Images 1-6, from the top: Marilyn Monroe, actress; Drake, singer; Kim Kardashian, entrepreneur and reality star; Harry Styles, singer; Isabel Hermes, co-author of this article; Alexander Müller, co-author of this article

Unintended biases

After finding the corresponding images in the latent space, we generated artificial versions of the faces. We then moved them along the chosen gender direction, creating “feminized” and “masculinized” faces. Looking at the results, we see some unexpected behavior in the AI: it seems to recreate classic gender stereotypes.

Big smiles vs. thick eyebrows

Whenever we edited an image to look more feminine, we gradually saw the mouth open into a stronger smile, and vice versa. Likewise, the eyes grow larger and more wide open in the female direction. The Drake and Kim Kardashian examples show a visible change in skin tone from darker to lighter when moving along the series from feminine to masculine. The chosen gender direction also appears to edit out curls in the female direction (as opposed to the male direction), as exemplified by Marilyn Monroe and blog co-author Isabel Hermes. We also asked ourselves whether the lack of hair extension in Drake's female direction would be remedied if we extended his photo series. Examining the overall extremes, eyebrows are thinned out and arched on the female side and straighter and thicker on the male side. Eye and lip makeup increases heavily on faces moving in the female direction, darkening the area surrounding the eyes and thinning out the eyebrows. This may be why we perceived the male versions we generated as more natural-looking than the female versions.

Finally, we would like to challenge you, as the reader, to examine the photo series above closely. Try to decide which image you perceive as gender-neutral, i.e., as much male as female. What made you choose that image? Did any of the stereotypical features described above impact your perception?

A natural question that arises from image series like the ones generated for this article is whether there is a risk that the AI reinforces current gender stereotypes.

Is the AI to blame for recreating stereotypes?

Given that the adjusted images recreate certain gender stereotypes like a more pronounced smile in female images, a possible conclusion could be that the AI was trained on a biased dataset. And indeed, to train the underlying StyleGAN, image data from Flickr was used that inherits the biases from the website. However, the main goal of this training was to create realistic images of faces. And while the results might not always look as we expect or want, we would argue that the AI did precisely that in all our tests.

To alter the images, however, we used the aforementioned latent direction. In general, such latent directions rarely change only a single aspect of the created image. Instead, as when walking in a random direction on our latent map, many elements of the generated face usually change simultaneously. Identifying a direction that alters only a single aspect of a generated image is anything but trivial. For our experiment, the chosen direction was created primarily for research purposes without accounting for said biases. It can therefore introduce unwanted artifacts in the images alongside the intended alterations. Yet it is reasonable to assume that a latent direction exists that allows us to alter the gender of a face created by the StyleGAN without affecting other facial features.

Overall, the implementations we build upon use different AIs and datasets, and therefore the complex interplay of those systems doesn’t allow us to identify the AI as a single source for these issues. Nevertheless, our observations suggest that doing due diligence to ensure the representation of different ethnic backgrounds and avoid biases in creating datasets is paramount.

Fig. 7: Picture from "A Sex Difference in Facial Contrast and its Exaggeration by Cosmetics" by Richard Russell

Subconscious bias: looking at ourselves

A study by Richard Russell deals with human perception of gender in faces. Ask yourself: which gender would you intuitively assign to the two images above? It turns out that most people perceive the left person as male and the right person as female. Look again. What separates the faces? There is no difference in facial structure; the only difference is darker eye and mouth regions. It becomes apparent that increased contrast is enough to influence our perception. Suppose our opinion on gender can be swayed by applying "cosmetics" to a face. In that case, we must question our human understanding of gender representations and ask whether they are simply products of our life-long exposure to stereotypical imagery. The author refers to this as the "Illusion of Sex".
This bias relates to the selection of the latent "gender" direction: to find the latent direction that changes the perceived gender of a face, StyleGAN-generated images were divided into groups according to their appearance. While this was implemented with yet another AI, human bias in gender perception might well have impacted this process and leaked through to the image rows illustrated above.

Conclusion

Moving beyond the gender binary with StyleGANs

While a StyleGAN might not reinforce gender-related bias in and of itself, people still subconsciously harbor gender stereotypes. Gender bias is not limited to images – researchers have found the ubiquity of female voice assistants reason enough to create a new voice assistant that is neither male nor female: GenderLess Voice.

One example of a recent societal shift is the debate over gender: rather than binary, gender may be better represented as a spectrum, with a distinction between biological sex and social gender. For somebody who identifies with a gender that differs from the one they were assigned at birth, being included in society as who they are is essential.

A question we, as a society, must stay wary of is whether the field of AI is at risk of discriminating against those beyond the assigned gender binary. The fact is that in AI research, gender is often represented as binary. Pictures fed into algorithms to train them are either labeled as male or female. Gender recognition systems based on deterministic gender-matching may also cause direct harm by mislabelling members of the LGBTQIA+ community. Currently, additional gender labels have yet to be included in ML research. Rather than representing gender as a binary variable, it could be coded as a spectrum.

Exploring female to male gender representations

We used StyleGANs to explore how AI represents gender differences. Specifically, we used a gender direction in the latent space. Researchers determined this direction to display male and female gender. We saw that the generated images replicated common gender stereotypes – women smile more, have bigger eyes, longer hair, and wear heavy makeup – but importantly, we could not conclude that the StyleGAN model alone propagates this bias. Firstly, StyleGANs were created primarily to generate photo-realistic face images, not to alter the facial features of existing photos at will. Secondly, since the latent direction we used was created without correcting for biases in the StyleGANs training data, we see a correlation between stereotypical features and gender.

Next steps and gender neutrality

We asked ourselves which faces we perceived as gender-neutral among the image sequences we generated. For original images of men, we had to look towards the artificially generated female direction, and vice versa; this was a subjective choice. We see it as a logical next step to automate the generation of gender-neutral versions of face images and thereby further explore the possibilities of AI in the area of gender and society. To do so, we would first classify the gender of the face to be edited and then move towards the opposite gender until the classifier can no longer assign an unambiguous label. Interested readers will be able to follow the continuation of our journey in an upcoming second blog article.

If you are interested in our technical implementation for this article, you can find the code here and try it out with your own images.

Resources

Photo Credits
Img. 1: © Alfred Eisenstaedt / Life Picture Collection
Img. 2: https://www.pinterest.com/pin/289989663476162265/
Img. 3: https://www.gala.de/stars/starportraets/kim-kardashian-20479282.html
Img. 4: © Charles Sykes / Picture Alliance
Img. 7: Richard Russell, "A Sex Difference in Facial Contrast and its Exaggeration by Cosmetics"

Isabel Hermes, Alexander Müller

Why we need AI Principles

Artificial intelligence has already begun and will continue to fundamentally transform our world. Algorithms increasingly influence how we behave, think, and feel. Companies around the globe will continue to adapt AI technology and rethink their current processes and business models. Our social structures, how we work, and how we interact with each other will change with the advancements of digitalization, especially in AI.

Beyond its social and economic influence, AI also plays a significant role in one of the biggest challenges of our time: climate change. On the one hand, AI can provide instruments to tackle parts of this urgent challenge. On the other hand, the development and the implementation of AI applications will consume a lot of energy and emit massive amounts of greenhouse gases.

Risks of AI

With the advancement of a technology that has such a high impact on all areas of our lives come huge opportunities but also big risks. To give you an impression of the risks, we just picked a few examples:

  • AI can be used to monitor people, for example through facial recognition systems. Some countries have already been using this technology extensively for a few years.
  • AI is used in very sensitive areas where minor malfunctions could have dramatic implications. Examples are autonomous driving, robot-assisted surgery, credit scoring, recruiting candidate selection, or law enforcement.
  • The Facebook and Cambridge Analytica scandal showed that data and AI technologies can be used to build psychographic profiles. These profiles allow microtargeting of individuals with customized content to influence elections. This example shows the massive power of AI technologies and its potential for abuse and manipulation.
  • With recent advancements in computer vision technology, deep learning algorithms can now be used to create deepfakes. Deepfakes are realistic videos or images of people doing or saying something they never did or said. Obviously, this technology comes with enormous risks.
  • Artificial intelligence solutions are often developed to improve or optimize manual processes. In some use cases, this will lead to the replacement of human work, a challenge that cannot be ignored and needs to be addressed early.
  • In the past, AI models reproduced discriminating patterns of the data they were trained on. For example, Amazon used an AI system in their recruiting process that clearly disadvantaged women.

These examples make clear that every company and every person developing AI systems should reflect very carefully on the impact the system will or might have on society, specific groups, or even individuals.

Therefore, the big challenge for us is to ensure that the AI technologies we develop help and enable people while minimizing any forms of associated risks.

Why are there no official regulations in place in 2022?

You might be asking yourself why there is no regulation in place to address this issue. The problem with new technology, especially artificial intelligence, is that it advances fast, sometimes even too fast.

Recent releases of new language models like GPT-3 or computer vision models like DALL-E 2 exceeded the expectations of many AI experts. The abilities and applications of AI technologies will continually advance faster than regulation can, and we are not talking about months, but years.

It is fair to say that the EU made its first attempt in this direction by proposing a regulatory framework for artificial intelligence. However, they indicate that the regulation could apply to operators in the second half of 2024 at the earliest. That is years after the above-described examples became a reality.

Our approach: statworx AI Principles

The logical consequence of this issue is that we, as a company, must address this challenge ourselves. And therefore, we are currently working on the statworx AI Principles, a set of principles that guide us when developing AI solutions.

What we have done so far and how we got here

In our task force “AI & Society”, we started to tackle this topic. First, we scanned the market and found many interesting papers but concluded that none of them could be transferred 1:1 to our business model. Often these principles or guidelines were very fuzzy or too detailed and unsuitable for a consulting company that operates in a B2B setting as a service provider. So, we decided we needed to devise a solution ourselves.

The first discussions showed four big challenges:

  • On the one hand, the AI Principles must be formulated clearly and for a high-level audience so that non-experts also understand their meaning. On the other hand, they must be specific enough to be integrated into our delivery processes.
  • As a service provider, we may have limited control and decision power about some aspects of an AI solution. Therefore, we must understand what we can decide and what is beyond our control.
  • Our AI Principles will only add sustainable value if we can act according to them. Therefore, we need to promote them in our projects to the customers. We recognize that budget constraints, financial targets, and other factors might work against the proper application of these principles as it will need additional time and money.
  • Furthermore, what is wrong and right is not always obvious. Our discussions showed that there are many different perceptions of the right and necessary things to do. This means we will have to find common ground on which we can all agree.

Our two key take-aways

A key insight from these thoughts was that we would need two things.

As a first step, we need high-level principles that are understandable, clear, and where everyone is on board. These principles act as a guiding idea and give orientation when decisions are made. In a second step, we will use them to derive best practices or a framework that translates these principles into concrete actions during all phases of our project delivery.

The second major thing we learned is that it is tough to undergo this process and ask these questions, but also that it is inevitable for every company that develops or uses AI technology.

What comes next

So far, we are nearly at the end of the first step. We will soon communicate the statworx AI Principles through our channels. If you are currently in this process, too, we would be happy to get in touch to understand what you did and learned.

References

https://www.nytimes.com/2019/04/14/technology/china-surveillance-artificial-intelligence-racial-profiling.html

https://www.nytimes.com/2018/04/04/us/politics/cambridge-analytica-scandal-fallout.html

https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence

https://www.bundesregierung.de/breg-de/themen/umgang-mit-desinformation/deep-fakes-1876736

https://www.welt.de/wirtschaft/article173642209/Jobverlust-Diese-Jobs-werden-als-erstes-durch-Roboter-ersetzt.html

Jan Fischer

Whether deliberate or unconscious, bias in our society makes it difficult to create a gender-equal world free of stereotypes and discrimination. Unfortunately, this gender bias creeps into AI technologies, which are rapidly advancing in all aspects of our daily lives and will transform our society as we have never seen before. Therefore, creating fair and unbiased AI systems is imperative for a diverse, equitable, and inclusive future. It is crucial not only to be aware of this issue but to act now, before these technologies reinforce our gender bias even more, including in areas of our lives where we have already eliminated it.

Solving starts with understanding: To work on solutions to eliminate gender bias and all other forms of bias in AI, we first need to understand what it is and where it comes from. Therefore, in the following, I will first introduce some examples of gender-biased AI technologies and then give you a structured overview of the different reasons for bias in AI. I will present the actions needed towards fairer and more unbiased AI systems in a second step.

Sexist AI

Gender bias in AI has many faces and severe implications for women's equality. While YouTube shows my single friend (male, 28) advertisements for the latest technical inventions or the newest car models, I, also single and 28, have to endure advertisements for fertility or pregnancy tests. But AI is not only used to decide which products we buy or which series we watch next. AI systems are also being used to decide whether you get a job interview, how much you pay for your car insurance, how good your credit score is, or even what medical treatment you will get. And this is where bias in such systems really starts to become dangerous.

In 2015, for example, Amazon's recruiting tool falsely learned that men are better programmers than women and thus did not rate candidates for software developer jobs and other technical posts in a gender-neutral way.

In 2019, a couple applied for the same credit card. Although the wife had a slightly better credit score and the same income, expenses, and debts as her husband, the credit card company set her credit card limit much lower, which the customer service of the credit card company could not explain.

If these sexist decisions were made by humans, we would be outraged. Fortunately, there are laws and regulations against sexist behavior for us humans. Still, AI somehow seems to stand above the law because a supposedly rational machine made the decision. So, how can a supposedly rational machine become biased, prejudiced, and racist? There are three interlinked sources of bias in AI: data, models, and community.

Data is Destiny

First, data is a mirror of our society, with all our values, assumptions, and, unfortunately, also our biases. There is no such thing as neutral or raw data. Data is always generated, measured, and collected by humans; it has always been produced through cultural operations and shaped into cultural categories. For example, most demographic data is labeled based on simplified binary female-male categories. When gender classification collapses gender in this way, the data cannot reflect gender fluidity or a person's gender identity. Race, too, is a social construct, a classification system invented by us humans a long time ago to define physical differences between people, which is still present in data.

The underlying mathematical algorithm in AI systems is not sexist itself. AI learns from data with all its potential gender biases. For example, suppose a face recognition model has never seen a transgender or non-binary person because there was no such picture in the data set. In that case, it will not correctly classify a transgender or non-binary person (selection bias).

Or, as in the case of Google Translate, the phrase "eine Ärztin" (a female doctor) is consistently translated into the masculine form in gender-inflected languages because the AI system has been trained on thousands of online texts where the male form of "doctor" was more prevalent due to historical and social circumstances (historical bias). According to Invisible Women, there is a big gender gap in Big Data in general, to the detriment of women. So if we do not pay attention to what data we feed these algorithms, they will take over the gender gap in the data, leading to serious discrimination against women.

Models need Education

Second, our AI models are unfortunately not smart enough to overcome the biases in the data. Because current AI models only analyze correlations and not causal structures, they blindly learn what is in the data. These algorithms have an inherent structural conservatism, as they are designed to reproduce the given patterns in the data.

To illustrate this, I will use a fictional and very simplified example: Imagine a very stereotypical data set with many pictures of women in kitchens and men in cars. Based on these pictures, an image classification algorithm has to learn to predict the gender of a person in a picture. Due to the data selection, there is a high correlation between kitchens and women and between cars and men in the data set – a higher correlation than between some characteristic gender features and the respective gender. As the model cannot identify causal structures (what are gender-specific features), it thus falsely learns that having a kitchen in the picture also implies having women in the picture and the same for cars and men. As a result, if there’s a woman in a car in some image, the AI would identify the person as a man and vice versa.

However, this is not the only reason AI systems cannot overcome bias in data. It is also because we do not "tell" the systems that they should watch out for it. AI algorithms learn by optimizing a certain objective or goal defined by the developers. Usually, this performance measure is an overall accuracy metric, not including any ethical or fairness constraints. It is as if a child were to learn to get as much money as possible without any additional constraints, such as suffering consequences for stealing, exploiting, or deceiving. If we want AI systems to learn that gender bias is wrong, we have to incorporate this into their training and performance evaluation.
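To make this concrete, one simple way to incorporate such a constraint is to add a fairness penalty to the training objective. The sketch below augments a binary cross-entropy loss with the gap in mean predicted rates between two groups (a demographic parity penalty); it is purely illustrative, with NumPy stand-ins for real model outputs:

# Accuracy-driven loss plus a simple fairness penalty (demographic parity gap).
import numpy as np

def loss_with_fairness(y_true, y_prob, group, lam=1.0):
    """Binary cross-entropy plus the gap in mean predictions between groups."""
    eps = 1e-7
    bce = -np.mean(
        y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps)
    )
    gap = abs(y_prob[group == 0].mean() - y_prob[group == 1].mean())
    return bce + lam * gap

y_true = np.array([1, 0, 1, 0])          # labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4])  # model predictions
group = np.array([0, 0, 1, 1])           # e.g., a binary gender attribute
print(loss_with_fairness(y_true, y_prob, group))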

Community lacks Diversity

Last, it is the developer community that, directly or indirectly, consciously or subconsciously, introduces its own gender and other biases into AI technologies. Developers choose the data, define the optimization goal, and shape the usage of AI.

While there may be malicious intent in some cases, I would argue that developers often bring their own biases into AI systems at an unconscious level. We all suffer from unconscious biases, that is, unconscious errors in thinking that arise from problems related to memory, attention, and other mental mistakes. In other words, these biases result from the effort to simplify the incredibly complex world in which we live.

For example, it is easier for our brain to apply stereotypical thinking, that is, to perceive ideas about a person based on what people from a similar group might "typically" be like (e.g., a man is more suited to a CEO position), than to gather all the information needed to fully understand a person and their characteristics. Or, according to the affinity bias, we like people most who look and think like us, which is also a simplified way of understanding and categorizing the people around us.

We all have such unconscious biases, and since we are all different people, these biases vary from person to person. However, since the current community of AI developers comprises over 80% white cis-men, the values, ideas, and biases creeping into AI systems are very homogeneous and thus literally narrow-minded. Starting with the definition of AI: the founding fathers of AI back in 1956 were all white male engineers, a very homogeneous group of people, which led to a narrow idea of what intelligence is, namely the ability to win games such as chess. From psychology, however, we know that there are many different kinds of intelligence, such as emotional or social intelligence. Still, today, if a model is developed and reviewed by a very homogeneous group of people, without special attention and processes, they will not be able to identify discrimination against people who are different from themselves due to unconscious biases. Indeed, this homogeneous community tends to be the group of people who barely suffer from bias in AI.

Just imagine if all the children in the world were raised and educated by 30-year-old white cis-men. That is what our AI looks like today. It is designed, developed, and evaluated by a very homogenous group, thus, passing on a one-sided perspective on values, norms, and ideas. Developers are at the core of this. They are teaching AI what is right or wrong, what is good or bad.

Break the Bias in Society

So, a crucial step towards fair and unbiased AI is a diverse and inclusive AI development community. Meanwhile, there are some technical solutions to the data and model bias problems mentioned above (e.g., data diversification or causal modeling). Still, all these solutions are useless if the developers fail to think about bias problems in the first place. Diverse people can better check each other's blind spots and each other's biases. Many studies show that diversity in data science teams is critical in reducing bias in AI.

Furthermore, we must educate our society on AI, its risks, and its chances. We need to rethink and restructure the education of AI developers, as they need as much ethical knowledge as technical knowledge to develop fair and unbiased AI systems. We need to educate the broad population that we all can also become part of this massive transformation through AI to contribute our ideas and values to the design and development of these systems.

In the end, if we want to break the bias in AI, we need to break the bias in our society. Diversity is the solution to fair and unbiased AI, not only in AI development teams but across our whole society. AI is made by humans, by us, by our society. Our society, with its structures, brings bias into AI: through the data we produce, the goals we expect the machines to achieve, and the community developing these systems. At its core, bias in AI is not a technical problem; it is a social one.

Positive Reinforcement of AI

Finally, we need to ask ourselves: do we want AI reflecting society as it is today or a more equal society of tomorrow? Suppose we are using machine learning models to replicate the world as it is today. In that case, we are not going to make any social progress. If we fail to take action, we might lose some social progress, such as more gender equality, as AI amplifies and reinforces bias back into our lives. AI is supposed to be forward-looking. But at the same time, it is based on data, and data reflects our history and present. So, as much as we need to break the bias in society to break the bias in AI systems, we need unbiased AI systems for social progress in our world.

Having said all that, I am hopeful and optimistic. Through this amplification effect, AI has raised awareness of old fairness and discrimination issues in our society on a much broader scale. Bias in AI shows us some of the most pressing societal challenges. Ethical and philosophical questions become ever more important. And because AI has this reinforcement effect on society, we can also use it for the positive. We can use this technology for good. If we all work together, it is our chance to remake the world into a much more diverse, inclusive, and equal place.

Livia Eichenberger

Why bother? AI and the climate crisis

According to the latest report from the Intergovernmental Panel on Climate Change (IPCC), published in August 2021, "it is unequivocal that human influence has warmed the atmosphere, ocean and land" [1]. Climate change is also occurring faster than previously thought. According to the most recent estimates, the average global surface temperature in 2010-2019 was 1.07°C higher than in 1850-1900 due to human influence. Furthermore, the atmospheric CO2 concentrations in 2019 "were higher than at any time in at least 2 million years" [1].

Still, global carbon emissions are rising, although there was a slight decrease in 2020 [2], probably due to the coronavirus pandemic and its economic effects. In 2019, 36.7 gigatons (Gt) of CO2 were emitted worldwide [2]; one Gt is one billion tons. To achieve the 1.5 °C goal with an estimated probability of about 80%, we had only 300 Gt left at the beginning of 2020 [1]. As both 2020 and 2021 are over, and assuming carbon emissions of about 35 Gt for each year, the remaining budget is about 230 Gt of CO2. If the yearly amount stayed constant over the next years, the remaining carbon budget would be exhausted in about seven years.
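The budget arithmetic behind these numbers is simple enough to spell out; the following lines just restate the figures from the text:

# Remaining carbon budget for the 1.5 degree goal, using the figures above.
budget_start_2020 = 300   # Gt CO2 left at the beginning of 2020
yearly_emissions = 35     # Gt CO2 per year, assumed constant

remaining = budget_start_2020 - 2 * yearly_emissions  # after 2020 and 2021
years_left = remaining / yearly_emissions
print(remaining, round(years_left, 1))  # 230 Gt, about 6.6 years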

In 2019, China, the USA, and India were the largest emitters. Overall, Germany is responsible for only about 2% of all global emissions, but it was still in seventh place with about 0.7 Gt in 2019 (see graph below). Altogether, the top ten emitting countries accounted for about two-thirds of all carbon emissions in 2019 [2]. Most of these countries are highly industrialized and will likely expand their usage of artificial intelligence (AI) to strengthen their economies during the coming decades.

Using AI to reduce carbon emissions

So, what about AI and carbon emissions? Well, the usage of AI is a coin with two sides [3]. On the one hand, AI has great potential to reduce carbon emissions by providing more accurate predictions or improving processes in many different fields. For example, AI can be applied to predict extreme weather events, optimize supply chains, or monitor peatlands [4, 5].

According to a recent estimation by Microsoft and PwC, using AI for environmental applications can save up to 4.4% of all greenhouse gas emissions worldwide by 2030 [6]. In absolute numbers, this corresponds to a reduction of 0.9 to 2.4 Gt of CO2e, an amount equivalent to the estimated annual emissions of Australia, Canada, and Japan combined in 2030 [7]. To be clear, greenhouse gases also include other emitted gases, like methane, that reinforce the earth's greenhouse effect. To measure all of them on a common scale, they are often expressed as equivalents of CO2 and hence abbreviated as CO2e.

AI’s carbon footprint

Despite the great potential of AI to reduce carbon emissions, the usage of AI itself also emits CO2, which is the other side of the coin. From 2012 to 2018, the estimated amount of computation used to train deep learning models increased by a factor of 300,000 (see graph below, [8]). Hence, research, training, and deployment of AI models require an increasing amount of energy and hardware. Both produce carbon emissions and thus contribute to climate change.

Note: Graph taken from [8].

Unfortunately, I could not find a study that estimates the overall carbon emissions of AI. Still, there are some estimations of the CO2 or CO2e emissions of some natural language processing (NLP) models, which have become increasingly accurate and hence popular in recent years [9]. According to the following table, the final training of Google's BERT model emitted roughly as much CO2e as one passenger on a flight from New York to San Francisco. The final training of other NLP models, like Transformer (big), emitted far less, but final training is only the last step of finding the best model. Before it, many different models are tried to find the best parameters. Accordingly, this neural architecture search for the Transformer (big) model emitted about five times the CO2e of an average car over its lifetime, fuel included. Now, look at the estimated CO2e emissions of GPT-3 and imagine how much the related neural architecture search would have emitted.

Comparison of selected human and AI carbon emissions

Human example                                        CO2e (tons)    NLP model training                                  CO2e (tons)
One passenger air travel, New York to San Francisco  0.90           Transformer (big)                                   0.09
Average human life, one year                         5.00           BERT (base)                                         0.65
Average American life, one year                      16.40          GPT-3                                               84.74
Average car lifetime, incl. fuel                     57.15          Neural architecture search for Transformer (big)    284.02

Note: All values extracted from [9], except the value for GPT-3 [17].

What you, as a data scientist, can do to reduce your carbon footprint

Overall, there are many ways you, as a data scientist, can reduce your carbon footprint during the training and deployment of AI models. As the most important areas of AI are currently machine learning (ML) and deep learning (DL), different ways to measure and reduce the carbon footprint of these models are described in the following.

1. Be aware of the negative consequences and report them

It may sound simple but being aware of the negative consequences of searching, training, and deploying ML and DL models is the first step to reducing your carbon emissions. It is essential to understand how AI negatively impacts our environment to take the extra effort and be willing to report carbon emissions systematically, which is needed to tackle climate change [8, 9, 10]. So, if you skipped the first part about AI and the climate crisis, go back and read it. It’s worth it!

2. Measure the carbon footprint of your code

To make the carbon emissions of your ML and DL models explicit, they need to be measured. Currently, there is no standardized framework to measure all sustainability aspects of AI, but one is being developed [11]. Until there is a holistic framework, you can start by making energy consumption and related carbon emissions explicit [12]. Some of the most elaborate packages for building ML and DL models are implemented in the programming language Python. Although Python is not the most efficient programming language [13], it was again rated the most popular programming language in the PYPL index in September 2021 [14]. Accordingly, there are three Python packages that you can use to track the carbon emissions of training your models:

  • CodeCarbon [15, 16]
  • CarbonTracker [17]
  • Experiment Impact Tracker [18]

In my experience, CodeCarbon and CarbonTracker seem to be the easiest ones to use. Furthermore, CodeCarbon can easily be combined with TensorFlow, while CarbonTracker hooks into any epoch-based training loop. Therefore, you find an example for each package below.

I trained a simple multilayer perceptron with two hidden layers of 256 neurons each on the MNIST data set with both packages. To compare a CPU-based and a GPU-based computation, I trained the model with TensorFlow and CodeCarbon on my local machine (a 2018 15-inch MacBook Pro with a 6-core Intel Core i7 CPU) and with Flax (a JAX-based library) and CarbonTracker in Google Colab using a Tesla K80 GPU. First, you find the TensorFlow and CodeCarbon code below.

# import needed packages
import tensorflow as tf
from codecarbon import EmissionsTracker

# prepare model training: load and normalize the MNIST images to [0, 1]
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# multilayer perceptron with two hidden layers of 256 neurons each;
# the output layer returns logits, matching from_logits=True below
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ]
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

# train model and track carbon emissions
tracker = EmissionsTracker()
tracker.start()
model.fit(x_train, y_train, epochs=10)
emissions: float = tracker.stop()
print(emissions)

After executing the code above, CodeCarbon creates a CSV file as output, which includes different output parameters like the computation duration in seconds, the total energy consumed by the underlying infrastructure in kWh, and the related CO2e emissions in kg. The training of my model took 112.15 seconds, consumed 0.00068 kWh, and created 0.00047 kg of CO2e.
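For reporting, this CSV file can be inspected directly, for example with pandas. By default the tracker writes a file named emissions.csv to the working directory; note that the exact column names below match the version we used and may differ across CodeCarbon releases:

# Inspect CodeCarbon's output log with pandas (column names may vary by version).
import pandas as pd

log = pd.read_csv("emissions.csv")
print(log[["duration", "energy_consumed", "emissions"]].tail(1))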

For CarbonTracker, I used this Google Colab notebook, which is based on Flax (a JAX library), as the basic setup. To incorporate the tracking of carbon emissions and make the two models comparable, I changed a few details of the notebook. First, I changed the model in step 2 (Define Network) from a convolutional neural network to the multilayer perceptron (I kept the class name CNN so the rest of the notebook still works):

class CNN(nn.Module):
  """A simple MLP model (the class name CNN is kept on purpose)."""

  @nn.compact
  def __call__(self, x):
    x = x.reshape((x.shape[0], -1))  # flatten the 28x28 images into vectors
    x = nn.Dense(features=256)(x)    # first hidden layer
    x = nn.relu(x)
    x = nn.Dense(features=256)(x)    # second hidden layer
    x = nn.relu(x)
    x = nn.Dense(features=10)(x)     # one output per digit class
    x = nn.log_softmax(x)
    return x

Second, I inserted the installation and import of CarbonTracker as well as the tracking of the carbon emissions in step 14 (Train and evaluate):

 !pip install carbontracker

from carbontracker.tracker import CarbonTracker

tracker = CarbonTracker(epochs=num_epochs)
for epoch in range(1, num_epochs + 1):
  tracker.epoch_start()

  # Use a separate PRNG key to permute image data during shuffling
  rng, input_rng = jax.random.split(rng)
  # Run an optimization step over a training batch
  state = train_epoch(state, train_ds, batch_size, epoch, input_rng)
  # Evaluate on the test set after each training epoch 
  test_loss, test_accuracy = eval_model(state.params, test_ds)
  print(' test epoch: %d, loss: %.2f, accuracy: %.2f' % (
      epoch, test_loss, test_accuracy * 100))
  
  tracker.epoch_end()

tracker.stop()

After executing the whole notebook, CarbonTracker prints the following output after the first training epoch is finished.

 train epoch: 1, loss: 0.2999, accuracy: 91.25
 test epoch: 1, loss: 0.22, accuracy: 93.42
CarbonTracker:
Actual consumption for 1 epoch(s):
       Time:  0:00:15
       Energy: 0.000397 kWh
       CO2eq: 0.116738 g
       This is equivalent to:
       0.000970 km travelled by car
CarbonTracker:
Predicted consumption for 10 epoch(s):
       Time:  0:02:30
       Energy: 0.003968 kWh
       CO2eq: 1.167384 g
       This is equivalent to:
       0.009696 km travelled by car

As expected, the GPU needed more energy and produced more carbon emissions. Its energy consumption was about 6 times higher, and its carbon emissions about 2.5 times higher, than those of my local CPU. The increased energy consumption is partly explained by the longer computation time: 2.5 minutes on the GPU versus less than 2 minutes on the CPU. Overall, both packages provide all the information needed to assess and report carbon emissions.

3. Compare different regions of cloud providers

In recent years, training and deploying ML or DL models in the cloud has become more important compared to local computations. Clearly, one of the reasons is the increased need for computation power [8]. Accessing GPUs in the cloud is, for most companies, faster and cheaper than building their own data center. Of course, the data centers of cloud providers also need hardware and energy for computation. It is estimated that data centers account for about 1% of worldwide electricity demand [19]. The usage of any hardware, regardless of its location, produces carbon emissions, and that is why it is also important to measure the carbon emissions emitted by training and deploying ML and DL models in the cloud.

Currently, there are two CO2e calculators that can easily be used to estimate carbon emissions in the cloud [20, 21]. The good news is that all three big cloud providers, AWS, Azure, and GCP, are incorporated in both calculators. To find out which of the three big cloud providers and which European region is best, I used the first calculator, ML CO2 Impact [20], to calculate the CO2e emissions for the final training of GPT-3. The final model training of GPT-3 required 310 GPUs (NVIDIA Tesla V100 PCIe) running non-stop for 90 days [17]. To compute the estimated emissions for the different providers and regions, I chose the available option "Tesla V100-PCIE-16GB" as the GPU. The results of the calculations can be found in the following table.

Comparison of different European regions and cloud providers

Google Cloud Computing   CO2e (tons)    AWS Cloud Computing   CO2e (tons)    Microsoft Azure   CO2e (tons)
europe-west1             54.2           EU – Frankfurt        122.5          France Central    20.1
europe-west2             124.5          EU – Ireland          124.5          France South      20.1
europe-west3             122.5          EU – London           124.5          North Europe      124.5
europe-west4             114.5          EU – Paris            20.1           West Europe       114.5
europe-west6             4.0            EU – Stockholm        10.0           UK West           124.5
europe-north1            42.2                                                UK South          124.5

Overall, at least two findings are fascinating. First, even within the same cloud provider, the chosen region has a massive impact on the estimated CO2e emissions. The largest difference appears for GCP, with a factor of more than 30. This huge difference is partly due to the very low emissions of 4 tons in the region europe-west6, which are also the lowest emissions overall. Interestingly, a factor of 30 is considerably more than the factors of 5 to 10 described in scientific papers [12]. Second, some estimated values are identical, which shows that some kind of simplification was used for these estimations. You should therefore treat the absolute values with caution, but the differences between regions still hold, as all values are based on the same (simplified) way of calculation.
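
To make the underlying arithmetic tangible, a back-of-the-envelope version of such an estimate could look like this (power draw, PUE, and carbon intensity are illustrative assumptions, not the calculator's actual internals):

# Rough CO2e estimate: energy = GPUs * power * time * PUE,
# emissions = energy * regional carbon intensity
n_gpus = 310            # NVIDIA Tesla V100 PCIe, as for GPT-3
power_kw = 0.25         # assumed average draw per GPU in kW
hours = 90 * 24         # 90 days of non-stop training
pue = 1.1               # assumed power usage effectiveness of the data center
carbon_intensity = 0.3  # assumed kg CO2e per kWh in the chosen region

energy_kwh = n_gpus * power_kw * hours * pue
co2e_tons = energy_kwh * carbon_intensity / 1000
print(f'{energy_kwh:.0f} kWh -> {co2e_tons:.1f} t CO2e')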

Finally, to choose the cloud provider with the minimal total carbon footprint, it is also essential to consider the sustainability strategies of the cloud providers. In this area, GCP and Azure seem to have more effective strategies for the future than AWS [22, 23] and have already reached 100% renewable energy through offsets and energy certificates. Still, none of them runs on 100% renewable energy itself (see table 2 in [9]). From an environmental perspective, I personally prefer GCP because their strategy convinced me the most. Furthermore, since 2021 GCP has highlighted “regions with the lowest carbon impact inside Cloud Console location selectors” [24]. These kinds of aids indicate the importance of this topic to GCP.

4. Train and deploy with care

Finally, there are many other helpful hints and tricks related to the training and deployment of ML and DL models that can help you minimize your carbon footprint as a data scientist.

  • Practice sparsity! New research that combines DL models with state-of-the-art findings from neuroscience can reduce computation times by up to 100 times and save lots of carbon emissions [25].
  • Search for simpler and less computing-intensive models with comparable accuracy and use them if appropriate. For example, there is a smaller and faster version of BERT available, called DistilBERT, with comparable accuracy values [26].
  • Consider transfer learning and foundation models [10] to maximize accuracy and minimize computations at the same time.
  • Consider federated learning to reduce carbon emissions [27].
  • Don’t just think of the accuracy of your model; consider efficiency as well. Always ponder if a 1% increase in accuracy is worth the additional environmental impact [9, 12].
  • If the region of the best hyperparameters is still unknown, use random or Bayesian hyperparameter search instead of grid search [9, 20] (see the sketch after this list).
  • If your model will be retrained periodically after deployment, choose the training interval consciously. Regarding the associated business case, it may be enough to provide a newly trained model each month and not each week.
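
To illustrate the hyperparameter point: random search evaluates a fixed budget of configurations instead of the full cartesian grid, which saves computations and thus emissions. A minimal sketch with scikit-learn (the estimator and the search space are made-up placeholders):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; adjust to your own model
param_distributions = {
    'n_estimators': range(50, 500),
    'max_depth': range(2, 20),
}

# 20 sampled configurations instead of the full 450 x 18 grid
search = RandomizedSearchCV(RandomForestClassifier(),
                            param_distributions,
                            n_iter=20,
                            cv=3)
# search.fit(X_train, y_train)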

Conclusion

Human beings and their greenhouse gas emissions influence our climate and warm the world. AI can and should be part of the solution to tackle climate change. Still, we need to keep an eye on its carbon footprint to make sure that it will be part of the solution and not part of the problem.

As a data scientist, you can do a lot. You can inform yourself and others about the positive possibilities and negative consequences of using AI. Furthermore, you can measure and explicitly state the carbon emissions of your models. You can describe your efforts to minimize their carbon footprint, too. Finally, you can also choose your cloud provider consciously and, for example, check if there are simpler models that result in a comparable accuracy but with fewer emissions.

Recently, we at statworx have formed a new initiative called AI and Environment to incorporate these aspects in our daily work as data scientists. If you want to know more about it, just get in touch with us!

References

  1. https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_SPM_final.pdf
  2. http://www.globalcarbonatlas.org/en/CO2-emissions
  3. https://doi.org/10.1007/s43681-021-00043-6
  4. https://arxiv.org/pdf/1906.05433.pdf
  5. https://www.pwc.co.uk/sustainability-climate-change/assets/pdf/how-ai-can-enable-a-sustainable-future.pdf
  6. Harness Artificial Intelligence
  7. https://climateactiontracker.org/
  8. https://arxiv.org/pdf/1907.10597.pdf
  9. https://arxiv.org/pdf/1906.02243.pdf
  10. https://arxiv.org/pdf/2108.07258.pdf
  11. https://algorithmwatch.org/de/sustain/
  12. https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf
  13. https://stefanos1316.github.io/my_curriculum_vitae/GKS17.pdf
  14. https://pypl.github.io/PYPL.html
  15. https://codecarbon.io/
  16. https://mlco2.github.io/codecarbon/index.html
  17. https://arxiv.org/pdf/2007.03051.pdf
  18. https://github.com/Breakend/experiment-impact-tracker
  19. https://www.iea.org/reports/data-centres-and-data-transmission-networks
  20. https://mlco2.github.io/impact/#co2eq
  21. http://www.green-algorithms.org/
  22. https://blog.container-solutions.com/the-green-cloud-how-climate-friendly-is-your-cloud-provider
  23. https://www.wired.com/story/amazon-google-microsoft-green-clouds-and-hyperscale-data-centers/
  24. https://cloud.google.com/blog/topics/sustainability/pick-the-google-cloud-region-with-the-lowest-co2
  25. https://arxiv.org/abs/2112.13896
  26. https://arxiv.org/abs/1910.01108
  27. https://flower.dev/blog/2021-07-01-what-is-the-carbon-footprint-of-federated-learning

Alexander Niltop

Car Model Classification III: Explainability of Deep Learning Models with Grad-CAM

In the first article of this series on car model classification, we built a model using transfer learning to classify the car model through an image of a car. In the second article, we showed how TensorFlow Serving can be used to deploy a TensorFlow model using the car model classifier as an example. We dedicate this third post to another essential aspect of deep learning and machine learning in general: the explainability of model predictions.

We will start with a short general introduction to the topic of explainability in machine learning. Next, we will briefly talk about popular methods that can be used to explain and interpret predictions from CNNs. We will then explain Grad-CAM, a gradient-based method, in depth by going through an implementation step by step. Finally, we will show you the results we obtained with our Grad-CAM implementation for the car model classifier.

A Brief Introduction to Explainability in Machine Learning

For a long time, explainability was a recurring but niche topic in machine learning. Over the past four years, however, interest in this topic has started to accelerate. At least one particular reason fuelled this development: the increased number of machine learning models in production. On the one hand, this leads to a growing number of end-users who need to understand how models make decisions. On the other hand, an increasing number of machine learning developers need to understand why (or why not) a model is functioning in a particular way.

This increasing demand for explainability has led to some noteworthy innovations, both methodological and technical, in recent years.

Methods for Explaining CNN Outputs for Images

Deep neural networks, and especially complex architectures like CNNs, were long considered pure black-box models. As written above, this has changed in recent years, and there are now various methods available to explain CNN outputs. For example, the excellent tf-explain library implements a wide range of useful methods for TensorFlow 2.x. We will now briefly talk about the ideas behind different approaches before turning to Grad-CAM:

Activations Visualization

This is the most straightforward visualization technique. It simply shows the output of a specific layer within the network during the forward pass. It can be helpful to get a feel for the extracted features since, during training, most of the activations tend towards zero (when using the ReLU activation). An example of the output of the first convolutional layer of the car model classifier is shown below:

Vanilla Gradients

One can use the vanilla gradients of the predicted class's score with respect to the input image to derive input pixel importances.

We can see that the highlighted region is mainly focused on the car. Compared to other methods discussed below, the discriminative region is much less confined.

Occlusion Sensitivity

This approach computes the importance of certain parts of the input image by re-evaluating the model's prediction with different parts of the input image hidden. Parts of the image are hidden iteratively by replacing them with grey pixels. The weaker the prediction gets with a part of the image hidden, the more important that part is for the final prediction. Based on the discriminative power of the image regions, a heatmap can be constructed and plotted. Applying occlusion sensitivity to our car model classifier did not yield any meaningful results. Thus, we show tf-explain's sample image, which shows the result of applying the occlusion sensitivity procedure to a cat image.
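
For intuition, a minimal occlusion-sensitivity sketch could look like this (a hypothetical illustration, assuming a trained Keras model, a single preprocessed img array of shape (224, 224, 3), and a target class index label_idx; patch size and grey value are arbitrary choices):

import numpy as np

def occlusion_heatmap(model, img, label_idx, patch=28):
    h, w, _ = img.shape
    heatmap = np.zeros((h // patch, w // patch))
    base = model.predict(img[None, ...])[0, label_idx]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img.copy()
            # Hide one patch with grey pixels (value depends on preprocessing)
            occluded[i:i + patch, j:j + patch, :] = 127.0
            prob = model.predict(occluded[None, ...])[0, label_idx]
            # The larger the drop in probability, the more important the patch
            heatmap[i // patch, j // patch] = base - prob
    return heatmap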

CNN Fixations

Another exciting approach, called CNN Fixations, was introduced in this paper. The idea is to backtrack which neurons were significant in each layer, given the activations from the forward pass and the network weights. The neurons with large influence are referred to as fixations. This approach thus allows finding the essential regions for obtaining the result without the need for any recomputation (unlike, e.g., occlusion sensitivity above, where multiple predictions must be made).

The procedure can be described as follows: The node corresponding to the class is chosen as the fixation in the output layer. Then, the fixations for the previous layer are determined by computing which of its nodes have the most impact on the fixations of the next higher layer, found in the last step. The node importance is computed by multiplying activations and weights. If you are interested in the details of the procedure, check out the paper or the corresponding github repo. This backtracking is done until the input image is reached, yielding a set of pixels with considerable discriminative power. An example from the paper is shown below.

CAM

Introduced in this paper, class activation mapping (CAM) is a procedure to find the discriminative region(s) for a CNN prediction by computing class activation maps. A significant drawback of this procedure is that it requires the network to use global average pooling (GAP) as the last step before the prediction layer. It is thus not possible to apply this approach to general CNNs. An example is shown in the figure below (taken from the CAM paper):

The class activation map assigns importance to every position (x, y) in the last convolutional layer by computing the linear combination of the activations, weighted by the corresponding output weights for the observed class (Australian terrier in the example above). The resulting class activation mapping is then upsampled to the size of the input image. This is depicted by the heat map above. Due to the architecture of CNNs, the activation, e.g., in the top left for any layer, is directly related to the top left of the input image. This is why we can conclude which input regions are important by only looking at the last CNN layer.

The Grad-CAM procedure we will discuss in detail below is a generalization of CAM. Grad-CAM can be applied to networks with general CNN architectures, including those containing multiple fully connected layers at the output.

Grad-CAM

Grad-CAM extends the applicability of the CAM procedure by incorporating gradient information. Specifically, the gradient of the loss w.r.t. the last convolutional layer determines the weight for each of its feature maps. As in the CAM procedure above, the further steps are to compute the weighted sum of the activations and then upsample the result to the image size so that the original image can be plotted with the obtained heatmap. We will now show and discuss the code that can be used to run Grad-CAM. The complete code is available here on GitHub.

import pickle

import cv2
import numpy as np
import tensorflow as tf

from car_classifier.modeling import TransferModel

INPUT_SHAPE = (224, 224, 3)

# Load list of targets
with open('.../classes.pickle', 'rb') as file:
    classes = pickle.load(file)

# Load model
model = TransferModel('ResNet', INPUT_SHAPE, classes=classes)
model.load('...')

# Gradient model, takes the original input and outputs tuple with:
# - output of conv layer (in this case: conv5_block3_3_conv)
# - output of head layer (original output)
grad_model = tf.keras.models.Model([model.model.inputs],
                                   [model.model.get_layer('conv5_block3_3_conv').output,
                                    model.model.output])

# Run model and record outputs, loss, and gradients
# (img is a preprocessed input of shape (1, 224, 224, 3), label_idx the index
# of the class of interest; both are explained in the walkthrough below)
with tf.GradientTape() as tape:
    conv_outputs, predictions = grad_model(img)
    loss = predictions[:, label_idx]

# Output of conv layer
output = conv_outputs[0]

# Gradients of loss w.r.t. conv layer
grads = tape.gradient(loss, conv_outputs)[0]

# Guided Backprop (elimination of negative values)
gate_f = tf.cast(output > 0, 'float32')
gate_r = tf.cast(grads > 0, 'float32')
guided_grads = gate_f * gate_r * grads

# Average weight of filters
weights = tf.reduce_mean(guided_grads, axis=(0, 1))

# Class activation map (cam)
# Multiply output values of conv filters (feature maps) with gradient weights
cam = np.zeros(output.shape[0: 2], dtype=np.float32)
for i, w in enumerate(weights):
    cam += w * output[:, :, i]

# Or more elegant: 
# cam = tf.reduce_sum(output * weights, axis=2)

# Rescale to original image size and min-max scale
cam = cv2.resize(cam.numpy(), (224, 224))
cam = np.maximum(cam, 0)
heatmap = (cam - cam.min()) / (cam.max() - cam.min())
  • The first step is to load an instance of the model.
  • Then, we create a new keras.Model instance that has two outputs: The activations of the last CNN layer ('conv5_block3_3_conv') and the original model output.
  • Next, we run a forward pass for our new grad_model using as input an image (img) of shape (1, 224, 224, 3), preprocessed with the resnet_v2.preprocess_input method. tf.GradientTape is set up and applied to record the gradients (the gradients are stored in the tape object). Further, the outputs of the convolutional layer (conv_outputs) and the head layer (predictions) are stored as well. Finally, we can use label_idx to get the loss corresponding to the label we want to find the discriminative regions for.
  • Using the gradient method, one can extract the desired gradients from the tape. In this case, we need the gradient of the loss w.r.t. the output of the convolutional layer.
  • In a further step, a guided backprop is applied. Only values for the gradients are kept where both the activations and the gradients are positive. This essentially means restricting attention to the activations which positively contribute to the wanted output prediction.
  • The weights are computed by averaging the obtained guided gradients for each filter.
  • The class activation map cam is then computed as the weighted sum of the feature map activations (output). The implementation with the for loop above helps in understanding what the computation does in detail. A less straightforward but more efficient way to implement the CAM computation is to use tf.reduce_sum, as shown in the commented line below the loop implementation.
  • Finally, the resampling (resizing) is done using OpenCV's resize method, and the heatmap is rescaled to contain values in [0, 1] for plotting.

A version of Grad-CAM is also implemented in tf-explain.
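
For comparison, using the tf-explain implementation typically boils down to a few lines; the sketch below follows tf-explain's documented core API, so treat the exact keyword arguments as an approximation (img_batch stands for preprocessed images of shape (n, 224, 224, 3)):

from tf_explain.core.grad_cam import GradCAM

explainer = GradCAM()
grid = explainer.explain(validation_data=(img_batch, None),
                         model=model.model,
                         class_index=label_idx,
                         layer_name='conv5_block3_3_conv')
explainer.save(grid, output_dir='.', output_name='grad_cam.png')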

Examples

We now use the Grad-CAM implementation to interpret and explain the predictions of the TransferModel for car model classification. We start by looking at car images taken from the front.

Grad-CAM for car images from the front

The red regions highlight the most important discriminative regions, the blue regions the least important ones. We can see that for images from the front, the CNN focuses on the car's grille and the area containing the logo. If the car is slightly tilted, the focus shifts to the edge of the vehicle. This is also the case for slightly tilted images of the cars' backs, as shown in the middle image below.

Grad-CAM for car images from the back

For car images from the back, the most crucial discriminative region is near the number plate. As mentioned above, for cars looked at from an angle, the closest corner has the highest discriminative power. A very interesting example is the Mercedes-Benz C-class on the right side, where the model not only focuses on the tail lights but also puts the highest discriminative power on the model lettering.

Grad-CAM for car images from the side

When looking at images from the side, we notice that the discriminative region is restricted to the bottom half of the cars. Again, the angle from which the car image was taken determines the shift of the region towards the front or back corner.

In general, the most important fact is that the discriminative areas are always confined to parts of the cars. There are no images where the background has high discriminative power. Looking at the heatmaps and the associated discriminative regions can be used as a sanity check for CNN models.

Conclusion

We discussed multiple approaches to explaining CNN classifier outputs. We introduced Grad-CAM in detail by examining the code and looking at examples for the car model classifier. Most notably, the discriminative regions highlighted by the Grad-CAM procedure are always focused on the car and never on the backgrounds of the images. The result shows that the model works as we expect and uses specific parts of the car to discriminate between different models.

In the fourth and last part of this blog series, we will show how the car classifier can be built into a web application using Dash. See you soon!

Stephan Müller

In the first post of the series, we discussed transfer learning and built a model for car model classification. In this blog post, we will discuss the problem of model deployment, using the TransferModel introduced in the first post as an example.

A model is of no use in actual practice if there is no simple way to interact with it. In other words: We need an API for our models. TensorFlow Serving has been developed to provide these functionalities for TensorFlow models. This blog post will show how a TensorFlow Serving server can be launched in a Docker container and how we can interact with the server using HTTP requests.

If you are new to Docker, we recommend working through Docker’s tutorial before reading this article. If you want to look at an example of deployment in Docker, we recommend reading this blog post by our colleague Oliver Guggenbühl, in which he describes how an R-script can be run in Docker. We start by giving an overview of TensorFlow Serving.

Introduction to TensorFlow Serving

TensorFlow Serving is TensorFlow's serving system, designed to enable the deployment of various models using a uniform API. Using the abstraction of Servables, which are basically objects clients use to perform computations, it is possible to serve multiple versions of deployed models. This makes it possible, for example, to upload a new version of a model while the previous version is still available to clients. Looking at the bigger picture, so-called Managers are responsible for handling the life-cycle of Servables, which means loading, serving, and unloading them.

In this post, we will show how a single model version can be deployed. The code examples below show how a server can be started in a Docker container and how the Predict API can be used to interact with it. To read more about TensorFlow Serving, we refer to the TensorFlow website.

Implementation

We will now discuss the following three steps required to deploy the model and send requests:

  • Save a model in the correct format and folder structure using TensorFlow SavedModel
  • Run a Serving server inside a Docker container
  • Interact with the model using REST requests

Saving TensorFlow Models

If you didn’t read this series’ first post, here’s a brief summary of the most important points needed to understand the code below:

The TransferModel.model is a tf.keras.Model instance, so it can be saved using Model‘s built-in save method. Further, as the model was trained on web-scraped data, the class labels can change when re-scraping the data. We thus store the index-class mapping when storing the model in classes.pickle. TensorFlow Serving requires the model to be stored in the SavedModel format. When using tf.keras.Model.save, the path must be a folder name, else the model will be stored in another format (e.g., HDF5) which is not compatible with TensorFlow Serving. Below, folderpath contains the path of the folder we want to store all model relevant information in. The SavedModel is stored in folderpath/model and the class mapping is stored as folderpath/classes.pickle.

import os

import pandas as pd


def save(self, folderpath: str):
    """
    Save the model using tf.keras.Model.save

    Args:
        folderpath: (Full) Path to folder where model should be stored
    """

    # Make sure folderpath ends on slash, else fix
    if not folderpath.endswith("/"):
        folderpath += "/"

    if self.model is not None:
        os.mkdir(folderpath)
        model_path = folderpath + "model"
        # Save model to model dir
        self.model.save(filepath=model_path)
        # Save associated class mapping
        class_df = pd.DataFrame({'classes': self.classes})
        class_df.to_pickle(folderpath + "classes.pickle")
    else:
        raise AttributeError('Model does not exist')

Start TensorFlow Serving in Docker Container

Having saved the model to the disk, you now need to start the TensorFlow Serving server. Fortunately, there is an easy-to-use Docker container available. The first step is therefore pulling the TensorFlow Serving image from DockerHub. That can be done in the terminal using the command docker pull tensorflow/serving.

Then we can use the code below to start a TensorFlow Serving container. It runs the shell command for starting a container. The options set in the docker_run_cmd are the following:

  • The serving image exposes port 8501 for the REST API, which we will use later to send requests. Thus we map the host port 8501 to the container’s 8501 port using -p.
  • Next, we mount our model to the container using -v. It is essential that the model is stored in a versioned folder (here MODEL_VERSION=1); else, the serving image will not find the model. model_path_guest thus must be of the form <path>/<model name>/MODEL_VERSION, where MODEL_VERSION is an integer.
  • Using -e, we can set the environment variable MODEL_NAME to our model’s name.
  • The --name tf_serving option is only needed to assign a specific name to our new docker container.

If we try to run this file twice in a row, the docker command will not be executed the second time, as a container with the name tf_serving already exists. To avoid this problem, we use docker_run_cmd_cond. Here, we first check if a container with this specific name already exists and is running. If it does, we leave it; if not, we check if an exited version of the container exists. If it does, it is deleted, and a new container is started; if not, a new one is created directly.

import os

MODEL_FOLDER = 'models'
MODEL_SAVED_NAME = 'resnet_unfreeze_all_filtered.tf'
MODEL_NAME = 'resnet_unfreeze_all_filtered'
MODEL_VERSION = '1'

# Define paths on host and guest system
model_path_host = os.path.join(os.getcwd(), MODEL_FOLDER, MODEL_SAVED_NAME, 'model')
model_path_guest = os.path.join('/models', MODEL_NAME, MODEL_VERSION)

# Container start command
docker_run_cmd = f'docker run ' \
                 f'-p 8501:8501 ' \
                 f'-v {model_path_host}:{model_path_guest} ' \
                 f'-e MODEL_NAME={MODEL_NAME} ' \
                 f'-d ' \
                 f'--name tf_serving ' \
                 f'tensorflow/serving'

# If container is not running, create a new instance and run it
docker_run_cmd_cond = f'if [ ! "$(docker ps -q -f name=tf_serving)" ]; then \n' \
                      f'   if [ "$(docker ps -aq -f status=exited -f name=tf_serving)" ]; then \n' \
                      f'       docker rm tf_serving \n' \
                      f'   fi \n' \
                      f'   {docker_run_cmd} \n' \
                      f'fi'

# Start container
os.system(docker_run_cmd_cond)

Instead of mounting the model from our local disk using the -v flag in the docker command, we could also copy the model into the docker image, so the model could be served simply by running a container and specifying the port assignments. It is important to note that, in this case, the model needs to be saved using the folder structure folderpath/<model name>/1, as explained above. If this is not the case, TensorFlow Serving will not find the model. We will not go into further detail here. If you are interested in deploying your models in this way, we refer to this guide on the TensorFlow website.
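
As a sketch, baking the model into a custom image could look like the following Dockerfile (a minimal illustration based on the folder structure used above, not the guide's exact recipe; note the versioned folder 1):

FROM tensorflow/serving

# Copy the SavedModel into the image using the versioned folder structure
COPY models/resnet_unfreeze_all_filtered.tf/model /models/resnet_unfreeze_all_filtered/1

# Tell the serving entrypoint which model to load
ENV MODEL_NAME=resnet_unfreeze_all_filtered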

REST Request

Since the model is now served and ready to use, we need a way to interact with it. TensorFlow Serving provides two options to send requests to the server: gRPC and a REST API, which are exposed at different ports. In the following code example, we will use REST to query the model.

First, we load an image from the disk for which we want a prediction. This can be done using TensorFlow's image module. Next, we convert the image to a numpy array using the img_to_array method. The next and final step is crucial: since we preprocessed the training images before we trained our model (e.g., normalization), we need to apply the same transformation to the image we want a prediction for. The handy preprocess_input function makes sure that all necessary transformations are applied to our image.

import json

import numpy as np
import requests
from tensorflow.keras.applications.resnet_v2 import preprocess_input
from tensorflow.keras.preprocessing import image

# Load image
img = image.load_img(path, target_size=(224, 224))
img = image.img_to_array(img)

# Preprocess and reshape data
img = preprocess_input(img)
img = img.reshape(-1, *img.shape)

# Send image as list to TF serving via json dump
request_url = 'http://localhost:8501/v1/models/resnet_unfreeze_all_filtered:predict'
request_body = json.dumps({"signature_name": "serving_default", "instances": img.tolist()})
request_headers = {"content-type": "application/json"}
json_response = requests.post(request_url, data=request_body, headers=request_headers)
response_body = json.loads(json_response.text)
predictions = response_body['predictions']

# Get label from prediction
y_hat_idx = np.argmax(predictions)
y_hat = classes[y_hat_idx]

TensorFlow Serving’s RESTful API offers several endpoints. In general, the API accepts post requests following this structure:

POST http://host:port/<URI>:<VERB>

URI: /v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
VERB: classify|regress|predict

For our model, we can use the following URL for predictions: http://localhost:8501/v1/models/resnet_unfreeze_all_filtered:predict

The port number (here 8501) is the host’s port we specified above to map to the serving image’s port 8501. As mentioned above, 8501 is the serving container’s port exposed for the REST API. The model version is optional and will default to the latest version if omitted.

In Python, the requests library can be used to send HTTP requests. As stated in the documentation, the request body for the predict API must be a JSON object with the following key-value pairs:

  • signature_name – serving signature to use (for more information, see the documentation)
  • instances – model input in row format

The response body will also be a JSON object with a single key called predictions. Since we get the probability for all 300 classes for each row in instances, we use np.argmax to return the most likely class. Alternatively, we could have used the higher-level classify API.

Conclusion

In this second blog article of the Car Model Classification series, we learned how to deploy a TensorFlow model for image recognition using TensorFlow Serving as a REST API and how to run model queries with it.

To do so, we first saved the model using the SavedModel format. Next, we started the TensorFlow Serving server in a Docker container. Finally, we showed how to request predictions from the model using the API endpoints and a correctly specified request body.

A major criticism of deep learning models of any kind is the lack of explainability of the predictions. In the third blog post, we will show how to explain model predictions using a method called Grad-CAM.

Stephan Müller

At STATWORX, we are very passionate about the field of deep learning. In this blog series, we want to illustrate how an end-to-end deep learning project can be implemented. We use the TensorFlow 2.x library for the implementation. The topics of the series include:

  • Transfer learning for computer vision.
  • Model deployment via TensorFlow Serving.
  • Interpretability of deep learning models via Grad-CAM.
  • Integrating the model into a Dash dashboard.

In the first part, we will show how you can use transfer learning to tackle car image classification. We start by giving a brief overview of transfer learning and the ResNet and then go into the implementation details. The code presented can be found in this github repository.

Introduction: Transfer Learning & ResNet

What is Transfer Learning?

In traditional (machine) learning, we develop a model and train it on new data for every new task at hand. Transfer learning differs from this approach in that knowledge is transferred from one task to another. It is a useful approach when one is faced with the problem of too little available training data. Models that are pretrained for a similar problem can be used as a starting point for training new models. The pretrained models are referred to as base models.

In our example, a deep learning model trained on the ImageNet dataset can be used as the starting point for building a car model classifier. The main idea behind transfer learning for deep learning models is that the first layers of a network extract generic low-level features (such as edges and textures) that remain similar across the kind of data treated. The final layers (also known as the head) of the original network are replaced by a custom head suitable for the problem at hand. The weights in the head are initialized randomly, and the resulting network can be trained for the specific task.

There are various ways in which the base model can be treated during training. In a first step, its weights can be fixed. If the learning progress suggests that the model is not flexible enough, certain layers or the entire base model can be “unfrozen” and thus made trainable. A further important aspect to note is that the input must be of the same dimensionality as the data the model was originally trained on – if the first layers of the base model are not modified.
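
A minimal sketch of this freezing and unfreezing logic in tf.keras could look as follows (the layer count in the last step is an arbitrary example):

from tensorflow.keras.applications import ResNet50V2

base_model = ResNet50V2(include_top=False, input_shape=(224, 224, 3), weights='imagenet')
base_model.trainable = False   # Stage 1: freeze all base layers, train only the new head

# ... train the custom head here ...

base_model.trainable = True    # Stage 2: unfreeze for fine-tuning
for layer in base_model.layers[:-10]:
    layer.trainable = False    # optionally keep all but the last 10 layers frozen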


Next, we will briefly introduce the ResNet, a popular and powerful CNN architecture for image data. Then, we will show how we used transfer learning with ResNet to do car model classification.

What is ResNet?

Training deep neural networks can quickly become challenging due to the so-called vanishing gradient problem. But what are vanishing gradients? Neural networks are commonly trained using back-propagation. This algorithm leverages the chain rule of calculus to derive gradients at deeper layers of the network by multiplying gradients from earlier layers. Since gradients get repeatedly multiplied in deep networks, they can quickly approach infinitesimally small values during back-propagation.
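
Schematically, for a chain of layer activations a_1, …, a_n and a loss L, the chain rule yields

\[
\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial a_n} \prod_{i=2}^{n} \frac{\partial a_i}{\partial a_{i-1}},
\]

so if each factor has a magnitude below one, the gradient shrinks exponentially with the depth of the network.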

ResNet is a CNN network that solves the vanishing gradient problem using so-called residual blocks (you find a good explanation of why they are called ‘residual’ here). The unmodified input is passed on to the next layer in the residual block by adding it to a layer’s output (see right figure). This modification makes sure that a better information flow from the input to the deeper layers is possible. The entire ResNet architecture is depicted in the right network in the left figure below. It is plotted alongside a plain CNN and the VGG-19 network, another standard CNN architecture.
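
For illustration, a simplified residual block in tf.keras could be sketched like this (real ResNet blocks additionally use batch normalization and a 1x1 projection shortcut when the shapes of input and output differ):

from tensorflow.keras import layers

def residual_block(x, filters):
    # Assumes x already has `filters` channels so the addition is valid
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([shortcut, y])   # add the unmodified input back in
    return layers.Activation('relu')(y)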

Figure: ResNet architecture and residual block

ResNet has proved to be a powerful network architecture for image classification problems. For example, an ensemble of ResNets with 152 layers won the ILSVRC 2015 image classification contest. Pretrained ResNet models of different sizes are available in the tensorflow.keras.applications module, namely ResNet50, ResNet101, ResNet152 and their corresponding second versions (ResNet50V2, …). The number following the model name denotes the number of layers the networks have. The available weights are pretrained on the ImageNet dataset. The models were trained on large computing clusters using hardware accelerators for significant time periods. Transfer learning thus enables us to leverage these training results by using the obtained weights as a starting point.

Classifying Car Models

As an illustrative example of how transfer learning can be applied, we treat the problem of classifying the car model given an image of the car. We will start by describing the dataset we used and how we can filter out unwanted examples. Next, we will go over how a data pipeline can be set up using tensorflow.data. In the second section, we will walk you through the model implementation and point out which aspects to be particularly careful about during training and prediction.

Data Preparation

We used the dataset described in this github repo, where you can also download the entire dataset. The author built a data scraper to scrape all car images from the car connection website. He explains that many images show the interior of the cars. As they are not wanted in the dataset, they are filtered out based on pixel color. The dataset contains 64'467 jpg images, where the file names contain information on the car's make, model, build year, etc. For a more detailed insight into the dataset, we recommend you consult the original github repo. Three sample images are shown below.

Car Collage 01

While checking through the data, we observed that the dataset still contained many unwanted images, e.g., pictures of wing mirrors, door handles, GPS panels, or lights. Examples of unwanted images can be seen below.

Car Collage 02

Thus, it is beneficial to additionally prefilter the data to clean out more of the unwanted images.

Filtering Unwanted Images Out of the Dataset

There are multiple possible approaches to filter non-car images out of the dataset:

  1. Use a pretrained model
  2. Train another model to classify car/no-car
  3. Train a generative network on a car dataset and use the discriminator part of the network

We decided to pursue the first approach since it is the most direct one and outstanding pretrained models are easily available. If you want to follow the second or third approach, you could, e.g., use this dataset to train the model. The referred dataset only contains images of cars but is significantly smaller than the dataset we used.

We chose the ResNet50V2 in the tensorflow.keras.applications module with the pretrained “imagenet” weights. In a first step, we must figure out the indices and class names of the ImageNet labels corresponding to car images.

# Class labels in imagenet corresponding to cars
# (note: 'jeep' and 'landrover' share a single imagenet class, index 609)
CAR_IDX = [656, 627, 817, 511, 468, 751, 705, 757, 717, 734, 654, 675, 864, 609, 436]

CAR_CLASSES = ['minivan', 'limousine', 'sports_car', 'convertible', 'cab', 'racer', 'passenger_car', 'recreational_vehicle', 'pickup', 'police_van', 'minibus', 'moving_van', 'tow_truck', 'jeep', 'landrover', 'beach_wagon']

Next, the pretrained ResNet50V2 model is loaded.

from tensorflow.keras.applications import ResNet50V2

model = ResNet50V2(weights='imagenet')

We can then use this model to make predictions for images. The images fed to the prediction method must be scaled identically to the images used for training. The different ResNet models are trained on different input scales. It is thus essential to apply the correct image preprocessing. The module keras.applications.resnet_v2 contains the method preprocess_input, which should be used when using a ResNetV2 network. This method expects the image arrays to be of type float and to have values in [0, 255]. Using the appropriately preprocessed input, we can then use the built-in predict method to obtain predictions for an image stored at filename:

import tensorflow as tf
from tensorflow.keras.applications.resnet_v2 import preprocess_input

image = tf.io.read_file(filename)
image = tf.image.decode_jpeg(image)
image = tf.cast(image, tf.float32)
image = tf.image.resize_with_crop_or_pad(image, target_height=224, target_width=224)
image = preprocess_input(image)
# Add a batch dimension, as the model expects input of shape (n, 224, 224, 3)
image = tf.expand_dims(image, axis=0)
predictions = model.predict(image)

There are various ideas of how the obtained predictions can be used for car detection.

  • Is one of the CAR_CLASSES among the top k predictions?
  • Is the accumulated probability of the CAR_CLASSES in the predictions greater than some defined threshold?
  • Specific treatment of unwanted images (e.g., detect and filter out wheels)

We show the code for comparing the accumulated probability mass over the CAR_CLASSES.

def is_car_acc_prob(predictions, thresh=THRESH, car_idx=CAR_IDX):
    """
    Determine if car on image by accumulating probabilities of car prediction and comparing to threshold

    Args:
        predictions: (?, 1000) matrix of probability predictions resulting from ResNet with imagenet weights
        thresh: threshold accumulative probability over which an image is considered a car
        car_idx: indices corresponding to cars

    Returns:
        np.array of booleans describing if car or not
    """
    predictions = np.array(predictions, dtype=float)
    car_probs = predictions[:, car_idx]
    car_probs_acc = car_probs.sum(axis=1)
    return car_probs_acc > thresh

The higher the threshold is set, the stricter the filtering procedure is. A threshold value that provides good results is THRESH = 0.1. This ensures we do not lose too many true car images. The choice of an appropriate threshold remains subjective, so choose as you see fit.
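
For illustration, the filtering step itself could then look like this (a hypothetical usage; predictions and filenames are placeholders for the objects created during prefiltering):

# Keep only the images classified as cars
keep = is_car_acc_prob(predictions, thresh=0.1)
car_files = [f for f, is_car in zip(filenames, keep) if is_car]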

The Colab notebook that uses the function is_car_acc_prob to filter the dataset is available in the github repository.

While tuning the prefiltering procedure, we observed the following:

  • Many of the car images with light backgrounds were classified as “beach wagons”. We thus decided to also consider the “beach wagon” class in imagenet as one of the CAR_CLASSES.
  • Images showing the front of a car are often assigned a high probability of “grille”, which is the grating at the front of a car used for cooling. This assignment is correct but leads the procedure shown above to not consider certain car images as cars since we did not include “grille” in the CAR_CLASSES. This problem results in the trade-off of either leaving many close-up images of car grilles in the dataset or filtering out several car images. We opted for the second approach since it yields a cleaner car dataset.

After prefiltering the images using the suggested procedure, 53'738 of the initial 64'467 images remain in the dataset.

Overview of the Final Datasets

The prefiltered dataset contains images from 323 car models. We decided to restrict our attention to the top 300 most frequent classes in the dataset. That makes sense since some of the least frequent classes have fewer than ten representatives and can thus not be reasonably split into a train, validation, and test set. Reducing the dataset to images in the top 300 classes leaves us with a dataset containing 53'536 labeled images. The class occurrences are distributed as follows:

Histogram

The number of images per class (car model) ranges from 24 to slightly below 500. We can see that the dataset is very imbalanced. It is essential to keep this in mind when training and evaluating the model.
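
One common way to account for such an imbalance is to weight classes inversely to their frequency and pass these weights to training (see the class_weights argument of the train method further below). A small sketch, assuming labels is an array with one integer class index per image:

import numpy as np

# Weight of a class ~ total images / (number of classes * class count)
counts = np.bincount(labels)
n_classes = len(counts)
class_weights = {i: len(labels) / (n_classes * c) for i, c in enumerate(counts)}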

Building Data Pipelines with tf.data

Even after prefiltering and reducing to the top 300 classes, we still have numerous images left. This poses a potential problem since we cannot simply load all images into the memory of our GPU at once. To tackle this problem, we will use tf.data.

tf.data and especially the tf.data.Dataset API allow creating elegant and, at the same time, very efficient input pipelines. The API contains many general methods that can be applied to load and transform potentially large datasets. tf.data.Dataset is especially useful when training models on GPU(s). It allows loading data from the HDD, applying transformations on the fly, and creating batches that are then sent to the GPU. And this is all done in such a way that the GPU never has to wait for new data.

The following functions create a tf.data.Dataset instance for our particular problem:

def construct_ds(input_files: list,
                 batch_size: int,
                 classes: list,
                 label_type: str,
                 input_size: tuple = (212, 320),
                 prefetch_size: int = 10,
                 shuffle_size: int = 32,
                 shuffle: bool = True,
                 augment: bool = False):
    """
    Function to construct a tf.data.Dataset set from list of files

    Args:
        input_files: list of files
        batch_size: number of observations in batch
        classes: list with all class labels
        input_size: size of images (output size)
        prefetch_size: buffer size (number of batches to prefetch)
        shuffle_size: shuffle size (size of buffer to shuffle from)
        shuffle: boolean specifying whether to shuffle dataset
        augment: boolean if image augmentation should be applied
        label_type: 'make' or 'model'

    Returns:
        buffered and prefetched tf.data.Dataset object with (image, label) tuple
    """
    # Create tf.data.Dataset from list of files
    ds = tf.data.Dataset.from_tensor_slices(input_files)

    # Shuffle files
    if shuffle:
        ds = ds.shuffle(buffer_size=shuffle_size)

    # Load image/labels
    ds = ds.map(lambda x: parse_file(x, classes=classes, input_size=input_size, label_type=label_type))

    # Image augmentation (applied per image in ~70% of cases, see below)
    if augment:
        ds = ds.map(lambda x, y: tf.cond(tf.random.uniform(()) < 0.7,
                                         lambda: image_augment(x, y),
                                         lambda: (x, y)))

    # Batch and prefetch data
    ds = ds.batch(batch_size=batch_size)
    ds = ds.prefetch(buffer_size=prefetch_size)

    return ds

We will now describe the tf.data methods we used:

  • from_tensor_slices() is one of the available methods for the creation of a dataset. The created dataset contains slices of the given tensor, in this case, the filenames.
  • Next, the shuffle() method fills a buffer of buffer_size elements and shuffles the items within this buffer, in isolation from the rest of the dataset. If shuffling of the complete dataset is required, buffer_size must be larger than the number of entries in the dataset. Shuffling is only performed if shuffle=True.
  • map() allows to apply arbitrary functions to the dataset. We created a function parse_file() that can be found in the github repo. It is responsible for reading and resizing the images, inferring the labels from the file name and encoding the labels using a one-hot encoder. If the augment flag is set, the data augmentation procedure is activated. Augmentation is only applied in 70% of the cases since it is beneficial to also train the model on non-modified images. The augmentation techniques used in image_augment are flipping, brightness, and contrast adjustments.
  • Finally, the batch() method is used to group the dataset into batches of batch_size elements and the prefetch() method enables preparing later elements while the current element is being processed and thus improves performance. If used after a call to batch(), prefetch_size batches are prefetched.
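
Putting it together, constructing the training dataset could look like this (a hypothetical call; train_files and classes are placeholders for the objects created during data preparation):

ds_train = construct_ds(input_files=train_files,
                        batch_size=32,
                        classes=classes,
                        label_type='model',
                        shuffle=True,
                        augment=True)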

Model Fine Tuning

Having defined our input pipeline, we now turn towards the model training part. Below you can see the code that can be used to instantiate a model based on the pretrained ResNet, which is available in tf.keras.applications:

from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D


class TransferModel:

    def __init__(self, shape: tuple, classes: list):
        """
        Class for transfer learning from ResNet

        Args:
            shape: Input shape as tuple (height, width, channels)
            classes: List of class labels
        """
        self.shape = shape
        self.classes = classes
        self.history = None
        self.model = None

        # Use pre-trained ResNet model
        self.base_model = ResNet50V2(include_top=False,
                                     input_shape=self.shape,
                                     weights='imagenet')

        # Allow parameter updates for all layers
        self.base_model.trainable = True

        # Add a new pooling layer on the original output
        add_to_base = self.base_model.output
        add_to_base = GlobalAveragePooling2D(data_format='channels_last', name='head_gap')(add_to_base)

        # Add new output layer as head
        new_output = Dense(len(self.classes), activation='softmax', name='head_pred')(add_to_base)

        # Define model
        self.model = Model(self.base_model.input, new_output)

A few more details on the code above:

  • We first create an instance of class tf.keras.applications.ResNet50V2. With include_top=False we tell the pretrained model to leave out the original head of the model (in this case designed for the classification of 1000 classes on ImageNet).
  • base_model.trainable = True makes all layers trainable.
  • Using tf.keras functional API, we then stack a new pooling layer on top of the last convolution block of the original ResNet model. This is a necessary intermediate step before feeding the output to the final classification layer.
  • The final classification layer is then defined using tf.keras.layers.Dense. We define the number of neurons to be equal to the number of desired classes. The softmax activation function makes sure that the output is a pseudo-probability in the range (0, 1).

The full version of TransferModel (see github) also contains the option to replace the base model with a VGG16 network, another standard CNN for image classification. In addition, it allows unfreezing only specific layers, meaning we can make the corresponding parameters trainable while leaving the others fixed. As a default, we have made all parameters trainable here.

After we defined the model, we need to configure it for training. This can be done using tf.keras.Model‘s compile()-method:

def compile(self, **kwargs):
    """
    Compile method
    """
    self.model.compile(**kwargs)

We then pass the following keyword arguments to our method:

  • loss = "categorical_crossentropy"for multi-class classification,
  • optimizer = Adam(0.0001) for using the Adam optimizer from tf.keras.optimizers with a relatively small learning rate (more on the learning rate below), and
  • metrics = ["categorical_accuracy"] for training and validation monitoring.
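
Put together, the call could look like this (with model being the TransferModel instance from above):

from tensorflow.keras.optimizers import Adam

model.compile(loss="categorical_crossentropy",
              optimizer=Adam(0.0001),
              metrics=["categorical_accuracy"])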

Next, we will look at the training procedure. Therefore, we define a train method for the TransferModel class introduced above:

import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

def train(self,
          ds_train: tf.data.Dataset,
          epochs: int,
          ds_valid: tf.data.Dataset = None,
          class_weights: np.array = None):
    """
    Trains model on ds_train for the given number of epochs

    Args:
        ds_train: training data as tf.data.Dataset
        epochs: number of epochs to train
        ds_valid: optional validation data as tf.data.Dataset
        class_weights: optional class weights to treat unbalanced classes

    Returns:
        Training history from self.history
    """

    # Define early stopping as callback
    early_stopping = EarlyStopping(monitor='val_loss',
                                   min_delta=0,
                                   patience=12,
                                   restore_best_weights=True)

    callbacks = [early_stopping]

    # Fitting
    self.history = self.model.fit(ds_train,
                                  epochs=epochs,
                                  validation_data=ds_valid,
                                  callbacks=callbacks,
                                  class_weight=class_weights)

    return self.history

As our model is an instance of tensorflow.keras.Model, we can train it using the fit method. To prevent overfitting, early stopping is used by passing it to the fit method as a callback. The patience parameter can be tuned to specify how soon early stopping should apply: it stands for the number of epochs after which the training is interrupted if no decrease in the validation loss is registered. Further, class weights can be passed to the fit method. Class weights allow treating unbalanced data by assigning different weights to the different classes, thus increasing the impact of classes with fewer training examples.

We can describe the training process using a pretrained model as follows: as the weights in the head are initialized randomly and the weights of the base model are pretrained, the training consists of training the head from scratch and fine-tuning the pretrained model's weights. It is recommended to use a small learning rate (e.g., 1e-4) since choosing a learning rate that is too large can destroy the near-optimal pretrained weights of the base model.

The training procedure can be sped up by first training for a few epochs without the base model being trainable. The purpose of these initial epochs is to adapt the head's weights to the problem. This speeds up the training since, when training only the head, much fewer parameters are trainable and thus updated for every batch. The resulting model weights can then be used as the starting point to train the entire model, with the base model being trainable. For the car classification problem considered here, applying this two-stage training did not achieve a notable performance enhancement.
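
For illustration, such a two-stage schedule could be sketched as follows (epoch counts and learning rates are arbitrary examples; the names follow the TransferModel class above):

from tensorflow.keras.optimizers import Adam

model = TransferModel(shape=(224, 224, 3), classes=classes)

# Stage 1: freeze the base model and train only the new head
model.base_model.trainable = False
model.compile(loss="categorical_crossentropy", optimizer=Adam(1e-3),
              metrics=["categorical_accuracy"])
model.train(ds_train, epochs=3, ds_valid=ds_valid)

# Stage 2: unfreeze the base model and fine-tune with a small learning rate
model.base_model.trainable = True
model.compile(loss="categorical_crossentropy", optimizer=Adam(1e-4),
              metrics=["categorical_accuracy"])
model.train(ds_train, epochs=10, ds_valid=ds_valid)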

Model Performance Evaluation/Prediction

When using the tf.data.Dataset API, one must pay attention to the nature of the methods used. The following method in our TransferModel class can be used as a prediction method.

def predict(self, ds_new: tf.data.Dataset, proba: bool = True):
    """
    Predict class probs or labels on ds_new
    Labels are obtained by taking the most likely class given the predicted probs

    Args:
        ds_new: New data as tf.data.Dataset
        proba: Boolean if probabilities should be returned

    Returns:
        class labels or probabilities
    """

    p = self.model.predict(ds_new)

    if proba:
        return p
    else:
        return [np.argmax(x) for x in p]

It is essential that the dataset ds_new is not shuffled, or else the predictions obtained will be misaligned with the images obtained when iterating over the dataset a second time. This is the case since the flag reshuffle_each_iteration is true by default in the shuffle method’s implementation. A further effect of shuffling is that multiple calls to the take method will not return the same data. This is important when you want to check out predictions, e.g., for only one batch. A simple example where this can be seen is:

# Use construct_ds method from above to create a shuffled dataset
ds = construct_ds(..., shuffle=True)

# Take 1 batch (e.g. 32 images) of dataset: This returns a new dataset
ds_batch = ds.take(1)

# Predict labels for one batch
predictions = model.predict(ds_batch)

# Predict labels again: The result will not be the same as predictions above due to shuffling
predictions_2 = model.predict(ds_batch)

A function to plot images annotated with the corresponding predictions could look as follows:

import matplotlib.pyplot as plt
import numpy as np

def show_batch_with_pred(model, ds, classes, rescale=True, size=(10, 10), title=None):
    for image, label in ds.take(1):
        image_array = image.numpy()
        label_array = label.numpy()
        batch_size = image_array.shape[0]
        pred = model.predict(image, proba=False)
        for idx in range(batch_size):
            label = classes[np.argmax(label_array[idx])]
            ax = plt.subplot(int(np.ceil(batch_size / 4)), 4, idx + 1)
            if rescale:
                plt.imshow(image_array[idx] / 255)
            else:
                plt.imshow(image_array[idx])
            plt.title("label: " + label + "\n"
                      + "prediction: " + classes[pred[idx]], fontsize=10)
            plt.axis('off')

The show_batch_with_pred method works for shuffled datasets as well, since image and label correspond to the same call to the take method.

Evaluating model performance can be done using keras.Model's evaluate method.
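
For instance, assuming a test dataset ds_test built with construct_ds and shuffle=False:

# Evaluate the underlying tf.keras.Model on the test data
test_loss, test_accuracy = model.model.evaluate(ds_test)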

How Accurate Is Our Final Model?

The model achieves slightly above 70% categorical accuracy for the task of predicting the car model for images from 300 model classes. To better understand the model’s predictions, it is helpful to observe the confusion matrix. Below, you can see the heatmap of the model’s predictions on the validation dataset.

heatmap

We restricted the heatmap to clip the confusion matrix's entries to [0, 5], as allowing a wider span did not significantly highlight any off-diagonal region. As can be seen from the heatmap, one class is assigned to examples of almost all classes; this is visible as the dark red vertical line two-thirds to the right in the figure above. Other than that class, there are no evident biases in the predictions. We want to stress here that categorical accuracy alone is generally not sufficient for a satisfactory assessment of a model's performance, particularly in the case of imbalanced classes.

Conclusion and Next Steps

In this blog post, we have applied transfer learning using the ResNet50V2 to classify the car model from images of cars. Our model achieves 70% categorical accuracy over 300 classes.

We found that unfreezing the entire base model and using a small learning rate achieved the best results. Now, having developed a cool car classification model is great, but how can we use our model in a productive setting? Of course, we could build our own custom model API using Flask or FastAPI…

But might there even be an easier, standardized way? In the second article of our series, “Deploying TensorFlow Models in Docker using TensorFlow Serving“, we discuss how this model can be deployed using TensorFlow Serving.

Stephan Müller

In the last three posts of this series, we explained how to train a deep-learning model to classify a car by its brand and model given an image of a car (Part 1), how to deploy that model from a docker container with TensorFlow Serving (Part 2) and how to explain the model’s predictions (Part 3). This post will teach you how to build a nice-looking interface around our car classifier using Dash.

We’ll transform our machine learning predictions and explanations into a fun and exciting game. We present the user with an image of a car. The user has to guess what kind of car model and brand it is – the machine learning model will do the same. After 5 rounds, we’ll evaluate who is better at predicting the car brand: the user or the model.

The Tech Stack – What is Dash?

Dash, as the name suggests, is software made to build dashboards in Python. In Python, you ask? Yes – you do not need to code anything directly in HTML or JavaScript (although a basic understanding of HTML certainly helps). For a great introduction, please check out the excellent blog post from my colleague Alexander Blaufuss.

To make the layout and styling of our web app easier, we also use Dash Bootstrap Components. They follow broadly the same syntax as standard Dash components and integrate seamlessly into the Dash experience.

Keep in mind that Dash is made for dashboards – which means it is made for interactivity, but not necessarily for apps with several pages. Anyway, we are going to push Dash to its limits.

Let’s Organise Everything – The Project Structure

To replicate everything, you might want to check out our GitHub repository, where all files are available. Also, you can launch all Docker containers with one click and start playing.

The files for the frontend are logically split into several parts. Although it would be possible to write everything into one file, it would be easy to lose the overview, and the code would subsequently become hard to maintain. The files follow the structure of the article:

  1. In one file, the whole layout is defined. Every button, headline, and text element is set there.
  2. In another file, the whole dashboard logic (so-called callbacks) is defined. Things like what’s going to happen after the user clicks a button are defined there.
  3. We need a module that selects 5 random images and handles the communication with the prediction and explainable API.
  4. Lastly, there are two files that are the main entry points to launch the app.

The Big Picture – Creating the Entry Points

Let's start with the last part, the main entry point for our dashboard. If you know how to write a web app, like a Dash application or a Flask app, you are familiar with the concept of an app instance. In simple terms, the app instance is everything. It contains the configuration for the app and, eventually, the whole layout. In our case, we initialize the app instance directly with the Bootstrap CSS files to make the styling more manageable. In the same step, we expose the underlying Flask app. The Flask app is used to serve the frontend in a production environment.

# app.py
import dash
import dash_bootstrap_components as dbc

# ...

# Initialize Dash App with Bootstrap CSS
app = dash.Dash(
    __name__,
    external_stylesheets=[dbc.themes.BOOTSTRAP],
)

# Underlying Flask App for productive deployment
server = app.server

This setting is used for every Dash application. In contrast to a dashboard, however, we need a way to handle several URL paths. More precisely, if the user enters /attempt, we want to allow them to guess a car; if they enter /result, we want to show the result of their prediction.

First, we define the layout. Notably, it is basically empty at first. You will find a special Dash Core Component there. This component stores the current URL and works both ways: with a callback, we can read its content, figure out which page the user wants to access, and render the layout accordingly. We can also manipulate its content, which is, practically speaking, a redirect to another page. The empty div is used as a placeholder for the actual layout.

# launch_dashboard.py
import dash_bootstrap_components as dbc
import dash_core_components as dcc
import dash_html_components as html
from app import app

# ...

# Set Layout
app.layout = dbc.Container(
    [dcc.Location(id='url', refresh=False),
     html.Div(id='main-page')])

The magic happens in the following function. The function itself has one argument, the current path as a string. Based on this input, it returns the right layout. For example, when the user accesses the page for the first time, the path is / and, therefore, the layout is start_page. We'll talk about the layouts in detail in a bit; for now, note that we always pass an instance of the app itself and the current game state to every layout.

To get this function actually working, we have to decorate it with the callback decorator. Every callback needs at least one input and at least one output, and a change in the input triggers the function. The input is simply the Location component defined above, with the property pathname. In simple terms: whenever the path changes, for whatever reason, this function gets triggered. The output is the new layout, rendered into the initially empty div.

# launch_dashboard.py
import dash_html_components as html
from dash.dependencies import Input, Output
from dash.exceptions import PreventUpdate

# ...

@app.callback(Output('main-page', 'children'), [Input('url', 'pathname')])
def display_page(pathname: str) -> html:
    """Function to define the routing. Mapping routes to layout.

    Arguments:
        pathname {str} -- pathname from url/browser

    Raises:
        PreventUpdate: Unknown/Invalid route, do nothing

    Returns:
        html -- layout
    """
    if pathname == '/attempt':
        return main_layout(app, game_data, attempt(app, game_data))

    elif pathname == '/result':
        return main_layout(app, game_data, result(app, game_data))

    elif pathname == '/finish':
        return main_layout(app, game_data, finish_page(app, game_data))

    elif pathname == '/':
        return main_layout(app, game_data, start_page(app, game_data))

    else:
        raise PreventUpdate

Everything Needs to Be Nice and Shiny – Layout

Let's start with the layout of our app – how should it look? We opted for a relatively simple look. As you can see in the animation above, the app consists of three parts: the header, the main content, and the footer. The header and footer are the same on every page; only the main content changes. Some of the main content layouts are rather difficult to build. For example, the result page consists of four boxes. The boxes should always have a width of exactly half the screen size but can vary in height depending on the image size. However, they are not allowed to overlap, and so on. Not to mention cross-browser incompatibilities.

I guess you can imagine that we could easily have spent several workdays figuring out the optimal layout. Luckily, we can rely once again on Bootstrap and the Bootstrap grid system. The main idea is that you can create as many rows as you want (two, in the case of the result page) and up to 12 columns per row (also two for the result page). The 12-column limit stems from the fact that Bootstrap internally divides the page into 12 equally sized columns. You just have to define with a simple CSS class how big a column should be. What's even cooler, you can set several layouts depending on the screen size, so it would not be difficult to make our app fully responsive.
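
As an illustration, a two-by-two grid like the one on the result page could be sketched like this with Dash Bootstrap Components; the html.Div placeholders stand in for the actual content boxes:

import dash_bootstrap_components as dbc
import dash_html_components as html

grid = dbc.Container([
    dbc.Row([
        dbc.Col(html.Div("Box 1"), width=6),  # 6 of 12 grid units = half width
        dbc.Col(html.Div("Box 2"), width=6),
    ]),
    dbc.Row([
        dbc.Col(html.Div("Box 3"), width=6),
        dbc.Col(html.Div("Box 4"), width=6),
    ]),
])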

Coming back to the Dash part, we build a function for every independent layout piece: one for the header, one for the footer, and one for every URL the user could access. For the header, it looks like this:

# layout.py
import dash_bootstrap_components as dbc
import dash_html_components as html

# ...

def get_header(app: dash.Dash, data: GameData) -> html:
    """Layout for the header

    Arguments:
        app {dash.Dash} -- dash app instance
        data {GameData} -- game data

    Returns:
        html -- html layout
    """
    logo = app.get_asset_url("logo.png")

    score_user, score_ai = count_score(data)

    header = dbc.Container(
        dbc.Navbar(
            [
                html.A(
                    # Use row and col to control vertical alignment of logo / brand
                    dbc.Row(
                        [
                            dbc.Col(html.Img(src=logo, height="40px")),
                            dbc.Col(
                                dbc.NavbarBrand("Beat the AI - Car Edition",
                                                className="ml-2")),
                        ],
                        align="center",
                        no_gutters=True,
                    ),
                    href="/",
                ),
                # You find the score counter here; Left out for clarity
            ],
            color=COLOR_STATWORX,
            dark=True,
        ),
        className='mb-4 mt-4 navbar-custom')

    return header

Again, you see that we pass the app instance and the global game data state to the layout function. In a perfect world, we would not have to mess around with either of these variables in the layout. Unfortunately, that's one of the limitations of Dash: a perfect separation of layout and logic is not possible. The app instance is needed to tell the webserver to serve the STATWORX logo as a static file.

Of course, you could serve the logo from an external server; in fact, we do this for the car images. But for just one logo, that would be overkill. For the game data, we need to calculate the current score of the user and the AI. Everything else is either regular HTML or Bootstrap components. If you are not familiar with those, I can once again refer you to the blog post by my colleague Alexander or to one of the many HTML tutorials on the internet.

Introduce Reactivity – Callbacks

As mentioned before, callbacks are the go-to way to make the layout interactive. In our case, they mainly consist of handling the dropdowns as well as the button clicks. While the dropdowns were relatively straightforward to program, the buttons caused us some headaches.

Following good programming standards, every function should have precisely one responsibility. Therefore, we set up one callback for every button. After some input validation and data manipulation, the ultimate goal is to redirect the user to the following page. While the input for the callback is the button-click event and potentially some other input forms, the output is always the Location component that redirects the user. Unfortunately, Dash does not allow more than one callback for the same output. Therefore, we were forced to squeeze the logic for every button into one function. Since we needed to validate the user input on the attempt page, we passed the current values from the dropdowns to the callback. While that worked perfectly fine for the attempt page, the button on the result page stopped working, since no dropdowns were available to pass into the function. We had to include a hidden, non-functional dummy dropdown on the result page to get the button working again. While that is a solution and works perfectly fine in our case, it might be too convoluted for a more extensive application.
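
One common way to fit several buttons into a single callback is to inspect dash.callback_context to see which input fired. This is just a sketch with hypothetical component ids (attempt-button, result-button, brand-dropdown), not the exact code from our repository:

import dash
from dash.dependencies import Input, Output, State
from dash.exceptions import PreventUpdate

@app.callback(Output('url', 'pathname'),
              [Input('attempt-button', 'n_clicks'),
               Input('result-button', 'n_clicks')],
              [State('brand-dropdown', 'value')])
def handle_buttons(attempt_clicks, result_clicks, brand_value):
    ctx = dash.callback_context
    if not ctx.triggered:
        raise PreventUpdate
    # The id of the component that triggered the callback
    button_id = ctx.triggered[0]['prop_id'].split('.')[0]
    if button_id == 'attempt-button':
        if brand_value is None:  # validate the dropdown input
            raise PreventUpdate
        return '/result'
    return '/attempt'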

We Need Cars – Data Download

Now we have a beautiful app with working buttons and so on, but the data is still missing. We still have to get images, predictions, and explanations into the app.

The high-level idea is that every component runs on its own – for example, in its own Docker container with its own webserver – and everything is loosely coupled together via APIs. The flow is the following:

  • Step 1: Request a list of all available car images, randomly select 5, and request those images from the webserver.
  • Step 2: For each of the 5 images, send a request to the prediction API and parse the result.
  • Step 3: Once again for all 5 images: send them to the explainable API and save the returned image.

Finally, we combine every output into the GameData class. A simplified sketch of this flow is shown below.
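
The endpoint URLs in this synchronous sketch are illustrative assumptions, not the actual API contract:

import random
import requests

# Step 1: list all available images and pick 5 at random
available_images = requests.get("http://image-api/images").json()
selected_images = random.sample(available_images, 5)

game_rounds = []
for name in selected_images:
    image = requests.get(f"http://image-api/images/{name}").content
    # Step 2: prediction for this image
    prediction = requests.get(f"http://prediction-api/predict/{name}").json()
    # Step 3: Grad-CAM explanation image
    explanation = requests.get(f"http://explainable-api/explain/{name}").content
    game_rounds.append({'image': image, 'prediction': prediction,
                        'explanation': explanation})
# The collected rounds are then stored in the GameData instance.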

Currently, we save the game data instance as a global variable. That allows us to access it from everywhere. While this is, in theory, a smart idea, it won't work if more than one user tries to access the app: the second user will see the game state of the first. Since we plan to show this game on a big screen at fairs and exhibitions, that's okay for now. In the future, we might launch the dashboard with ShinyProxy, so every user gets their own Docker container with an isolated global state.

Let’s Park the Car – Data Storage

The native Dash way is to store user-specific state in a Store component. It's basically the same as the Location component explained above: the data is stored in the web browser, a callback is triggered, and the data is sent to the server. The first drawback is that we would always have to transfer the whole game data instance from the browser to the server on every page change. This can cause quite a lot of traffic and slow down the entire app experience.
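
A minimal sketch of this native approach, assuming hypothetical component ids (game-data, next-button) and the app instance from app.py; this is not the code we ended up using:

import dash_core_components as dcc
from dash.dependencies import Input, Output, State

# Placed once in the layout; 'session' keeps the data per browser tab
store = dcc.Store(id='game-data', storage_type='session')

@app.callback(Output('game-data', 'data'),
              [Input('next-button', 'n_clicks')],
              [State('game-data', 'data')])
def update_game_data(n_clicks, data):
    # The whole state travels browser -> server -> browser on every call
    data = data or {'round': 0}
    data['round'] += 1
    return data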

Moreover, if we want to change the game state, we have to do it from one callback, since the one-callback-per-output limitation applies here as well. In our opinion, this does not matter much for a classic dashboard, where the responsibilities are cleanly separated; that's what Dash is meant for. In our case, however, the game state is accessed and modified by several components. We definitely pushed Dash to its limits.

Another thing you should keep in mind if you decide to build your own microservice app is the performance of the API calls. Initially, we used the famous requests library. While we are big fans of this library, all its requests are blocking: the second request gets executed only once the first one has completed. Since our requests are relatively slow (keep in mind there are fully fledged neural networks in the background), the app spent a lot of time just waiting. We therefore implemented asynchronous calls with the help of the aiohttp library. All requests are now sent out in parallel, the app spends less time waiting, and the user is ready to play sooner.
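
A sketch of the parallel variant with aiohttp; as above, the endpoint URL is an illustrative assumption:

import asyncio
import aiohttp

async def fetch_prediction(session, image_name):
    # One non-blocking request to the (assumed) prediction endpoint
    async with session.get(f"http://prediction-api/predict/{image_name}") as resp:
        return await resp.json()

async def fetch_all_predictions(image_names):
    # Open one session and fire all requests in parallel
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_prediction(session, name) for name in image_names]
        return await asyncio.gather(*tasks)

# predictions = asyncio.run(fetch_all_predictions(selected_images))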

Finally Done – Conclusion and Caveats

Even though the web app works perfectly fine, there are a few things to keep in mind. We used Dash knowing full well that it's meant as a dashboarding tool. We pushed it to the limits and beyond, which led to some interesting but suboptimal design choices.

For example, you can set just one callback per output parameter; several callbacks for the same output are currently not possible. Since routing from one page to another is essentially a change in the output parameter ('url', 'pathname'), every page change must be routed through a single callback. That increases the complexity of the code considerably.

Another problem is the difficulty of storing state across several pages. Dash offers the possibility to store user data in the frontend with the Store component. That's an excellent solution for small apps; with larger ones, you quickly face the same problem as above: just one callback, i.e., one function that can write to the store, is simply not enough. You can either use Python's global state, which causes trouble when several users access the page simultaneously, or you include a cache.

Our blog series showed you the whole life cycle of a data science project, from data exploration through model training to deployment and visualization. This is the last article of the series, and we hope you learned as much as we did while building the application.

To facilitate browsing through the four articles, here are the direct links:

Car Model Classification I: Transfer Learning with ResNet

Car Model Classification II: Deploying TensorFlow Models in Docker using TensorFlow Serving

Car Model Classification III: Explainability of Deep Learning Models with Grad-CAM

Dominique Lade

“Building trust through human-centric AI”: this is the slogan under which the European Commission presented its proposal for regulating Artificial Intelligence (AI regulation) last week. This historic step positions Europe as the first continent to uniformly regulate AI and the handling of data. With this groundbreaking attempt at regulation, Europe wishes to set standards for the use of AI and data-powered technology – even beyond European borders. That is the right step, as AI is a catalyst of the digital transformation, with significant implications for the economy, society, and the environment. Therefore, clear rules for the use of this technology are needed. This will allow Europe to position itself as a progressive market that is ready for the digital age. In its current form, however, the proposal still raises some questions about its practical implementation. Europe cannot afford to risk its digital competitiveness when competing with America and China for the AI leadership position.

Building Trust Through Transparency

Two Key Proposals for AI Regulation to Build Trust

To build trust in AI products, the proposal for AI regulation relies on two key approaches: Monitoring AI risks while cultivating an “ecosystem of AI excellence.” Specifically, the proposal includes a ban on the use of AI for manipulative and discriminatory purposes or to assess behavior through a “social scoring system”. Use cases that do not fall into these categories will still have to be screened for hazards and placed on a vague risk scale. Special requirements are placed on high-risk applications, with necessary compliance checks both before and after they are put into operation.

It is crucial that AI applications are to be assessed on a case-by-case basis, instead of under the sector-centric regulations considered previously. In last year's white paper on AI and trust, the European Commission called for labeling all applications in business sectors such as healthcare or transportation as “high-risk”. This blanket classification based on defined industries, regardless of the actual use cases, would have been obstructive and would have meant structural disadvantages for entire European industries. The case-by-case assessment allows for the agile and innovative development of AI in all sectors and subjects all industries to the same standards for risky AI applications.

Clear Definition of Risks of an AI Application Is Missing

Despite this new approach, the proposal for AI regulation lacks a concise process to assess the risks of new applications. Since developers themselves are responsible for evaluating their applications, a clearly defined scale for risk assessment is essential. Articles 6 and 7 circumscribe various risks and give examples of “high-risk applications”, but a transparent process for assessing new AI applications is yet to be defined. Startups and smaller companies are heavily represented among AI developers. These companies, in particular, rely on clearly defined standards and processes to avoid falling behind larger competitors with more extensive resources. This requires practical guidelines for risk assessment.

If a use case is classified as a “high-risk application”, then various requirements on data governance and risk management must be met before the product can be launched. For example, training data must be tested for bias and inequalities. Also, the model architecture and training parameters must be documented. After deployment, human oversight of the decisions made by the model must be ensured.

Accountability for AI products is a noble and important goal. However, the practical implementation of these requirements once more remains questionable. Many modern AI systems no longer use the traditional approach of static training and testing data. Reinforcement learning, for instance, relies on exploratory training through feedback instead of a testable data set. And even though advances in Explainable AI are steadily shedding light on the decision-making processes of black-box models, the complex model architectures of many modern neural networks make individual decisions almost impossible to trace.

The proposal also announces requirements for the accuracy of trained AI products. This poses a particular challenge for developers, because no AI system has perfect accuracy. Nor is this ever the objective; instead, misclassifications are typically managed so that they have as little impact as possible in the individual use case. Therefore, it is imperative that performance requirements for predictions and classifications be determined on a case-by-case basis, and that universal performance requirements be avoided.

Enabling AI Excellence

Europe is Falling Behind

With these requirements, the proposal for AI regulation seeks to inspire confidence in AI technology through transparency and accountability. This is a first, right step toward “AI excellence.” In addition to regulation, however, Europe as a location for Artificial Intelligence must also become more attractive to developers and investors.

According to a recently published study by the Center for Data Innovation, Europe is already falling behind both the United States and China in the battle for global leadership in AI. China has now surpassed Europe in the number of published studies on Artificial Intelligence and has taken the global lead. European AI companies are also attracting significantly less investment than their U.S. counterparts. European AI companies invest less money in research and development and are also less likely to be acquired than American companies.

A Step in the Right Direction: Supporting Research and Innovation

The European Commission recognizes that more support for AI development is needed for excellence on the European market and promises regulatory sandboxes, legal leeway to develop and test innovative AI products, and co-funding for AI research and testing sites. This is needed to make startups and smaller companies more competitive and foster European innovation and competition.

These are necessary steps to lift Europe onto the path to AI excellence, but they are far from being sufficient. AI developers need easier access to markets outside the EU, facilitating the flow of data across national borders. Opportunities to expand into the U.S. and collaborate with Silicon Valley are essential for the digital industry due to how interconnected digital products and services have become.

What is entirely missing from the proposal for AI regulation is education about AI and its potential and risks outside of expert circles. As artificial intelligence increasingly permeates all areas of everyday life, education will become more and more critical. To build trust in new technologies, they must first be understood. Educating non-specialists about both the potential and limitations of AI is an essential step in demystifying Artificial Intelligence and strengthening trust in this technology.

Potential Not Yet Fully Tapped

With this proposal, the European Commission recognizes that AI is leading the way for the future of the European market. Guidelines for a technology of this scope are important – as is the promotion of innovation. For these strategies to bear fruit, their practical implementation must also be feasible for startups and smaller companies. The potential for AI excellence is abundant in Europe. With clear rules and incentives, it can also be realized.

Oliver Guggenbühl
