Data Science, Machine Learning, and AI
Contact

We at statworx work a lot with R and often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. To share these functions within our teams and with others as well, I started to collect them and created an R package out of them called helfRlein. Besides sharing, I also wanted to have some use cases to improve my debugging and optimization skills. With time the package grew, and more and more functions came together. Last time I presented each function as part of an advent calendar. For our new website launch, I combined all of them into this one and will present each current function from the helfRlein package.

Most functions were developed when there was a problem, and one needed a simple solution. For example, the text shown was too long and needed to be shortened (see evenstrings). Other functions exist to reduce repetitive tasks – like reading in multiple files of the same type (see read_files). Therefore, these functions might be useful to you, too!

You can check out our GitHub to explore all functions in more detail. If you have any suggestions, please feel free to send me an email or open an issue on GitHub!

1. char_replace

This little helper replaces non-standard characters (such as the German umlaut “ä”) with their standard equivalents (in this case, “ae”). It is also possible to force all characters to lower case, trim whitespaces, or replace whitespaces and dashes with underscores.

Let’s look at a small example with different settings:

x <- " Élizàldë-González Strasse"
char_replace(x, to_lower = TRUE)
[1] "elizalde-gonzalez strasse"
char_replace(x, to_lower = TRUE, to_underscore = TRUE)
[1] "elizalde_gonzalez_strasse"
char_replace(x, to_lower = FALSE, rm_space = TRUE, rm_dash = TRUE)
[1] "ElizaldeGonzalezStrasse"

 

2. checkdir

This little helper checks a given folder path for existence and creates it if needed.

checkdir(path = "testfolder/subfolder")

Internaly, there is just a simple if statement which combines the base R functions file.exists() and dir.create().

3. clean_gc

This little helper frees your memory from unused objects. Well, basically, it just calls gc() a few times. I used this some time ago for a project where I worked with huge data files. Even though we were lucky enough to have a big server with 500GB RAM, we soon reached its limits. Since we typically parallelize several processes, we needed to preserve every bit and byte of RAM we could get. So, instead of having many lines like this one:

gc();gc();gc();gc()

I wrote clean_gc() for convenience. Internally, gc() is called as long as there is memory to be freed.

Some further thoughts

There is some debate about the garbage collector gc() and its usefulness. If you want to learn more about this, I suggest you check out the memory section in Advanced R. I know that R itself frees up memory if needed, but I am unsure what happens if you have multiple R processes. Can they clear the memory of other processes? If you have some insights on this, let me know!

4. count_na

This little helper counts missing values within a vector.

x <- c(NA, NA, 1, NaN, 0)
count_na(x)
3

Internally, there is just a simple sum(is.na(x)) counting the NA values. If you want the mean instead of the sum, you can set prop = TRUE.

5. evenstrings

This little helper splits a given string into smaller parts with a fixed length. But why? I needed this function while creating a plot with a long title. The text was too long for one line, and I wanted to separate it nicely instead of just cutting it or letting it run over the edges.

Given a long string like…

long_title <- c("Contains the months: January, February, March, April, May, June, July, August, September, October, November, December")

…we want to split it after split = "," with a maximum length of char = 60.

short_title <- evenstrings(long_title, split = ",", char = 60)

The function has two possible output formats, which can be chosen by setting newlines = TRUE or FALSE:

  • one string with line separators \n
  • a vector with each sub-part.

Another use case could be a message that is printed at the console with cat():

cat(long_title)
Contains the months: January, February, March, April, May, June, July, August, September, October, November, December
cat(short_title)
Contains the months: January, February, March, April, May,
 June, July, August, September, October, November, December

Code for plot example

p1 <- ggplot(data.frame(x = 1:10, y = 1:10),
  aes(x = x, y = y)) +
  geom_point() +
  ggtitle(long_title)

p2 <- ggplot(data.frame(x = 1:10, y = 1:10),
  aes(x = x, y = y)) +
  geom_point() +
  ggtitle(short_title)

multiplot(p1, p2)

6. get_files

This little helper does the same thing as the “Find in files” search within RStudio. It returns a vector with all files in a given folder that contain the search pattern. In your daily workflow, you would usually use the shortcut key SHIFT+CTRL+F. With get_files() you can use this functionality within your scripts.

7. get_network

This little helper aims to visualize the connections between R functions within a project as a flowchart. Herefore, the input is a directory path to the function or a list with the functions, and the outputs are an adjacency matrix and an igraph object. As an example, we use this folder with
some toy functions
:

net <- get_network(dir = "flowchart/R_network_functions/", simplify = FALSE)
g1 <- net$igraph

Input

There are five parameters to interact with the function:

  • A path dir which shall be searched.
  • A character vector variations with the function’s definition string -the default is c(" <- function", "<- function", "<-function").
  • A pattern a string with the file suffix – the default is "\\.R$".
  • A boolean simplify that removes functions with no connections from the plot.
  • A named list all_scripts, which is an alternative to dir. This is mainly just used for testing purposes.

For normal usage, it should be enough to provide a path to the project folder.

Output

The given plot shows the connections of each function (arrows) and also the relative size of the function’s code (size of the points). As mentioned above, the output consists of an adjacency matrix and an igraph object. The matrix contains the number of calls for each function. The igraph object has the following properties:

  • The names of the functions are used as label.
  • The number of lines of each function (without comments and empty ones) are saved as the size.
  • The folder‘s name of the first folder in the directory.
  • A color corresponding to the folder.

With these properties, you can improve the network plot, for example, like this:

library(igraph)

# create plots ------------------------------------------------------------
l <- layout_with_fr(g1)
colrs <- rainbow(length(unique(V(g1)$color)))

plot(g1,
     edge.arrow.size = .1,
     edge.width = 5*E(g1)$weight/max(E(g1)$weight),
     vertex.shape = "none",
     vertex.label.color = colrs[V(g1)$color],
     vertex.label.color = "black",
     vertex.size = 20,
     vertex.color = colrs[V(g1)$color],
     edge.color = "steelblue1",
     layout = l)
legend(x = 0,
       unique(V(g1)$folder), pch = 21,
       pt.bg = colrs[unique(V(g1)$color)],
       pt.cex = 2, cex = .8, bty = "n", ncol = 1)

example-network-helfRlein

8. get_sequence

This little helper returns indices of recurring patterns. It works with numbers as well as with characters. All it needs is a vector with the data, a pattern to look for, and a minimum number of occurrences.

Let’s create some time series data with the following code.

library(data.table)

# random seed
set.seed(20181221)

# number of observations
n <- 100

# simulationg the data
ts_data <- data.table(DAY = 1:n, CHANGE = sample(c(-1, 0, 1), n, replace = TRUE))
ts_data[, VALUE := cumsum(CHANGE)]

This is nothing more than a random walk since we sample between going down (-1), going up (1), or staying at the same level (0). Our time series data looks like this:

Assume we want to know the date ranges when there was no change for at least four days in a row.

ts_data[, get_sequence(x = CHANGE, pattern = 0, minsize = 4)]
     min max
[1,]  45  48
[2,]  65  69

We can also answer the question if the pattern “down-up-down-up” is repeating anywhere:

ts_data[, get_sequence(x = CHANGE, pattern = c(-1,1), minsize = 2)]
     min max
[1,]  88  91

With these two inputs, we can update our plot a little bit by adding some geom_rect!

Code for the plot

rect <- data.table(
  rbind(ts_data[, get_sequence(x = CHANGE, pattern = c(0), minsize = 4)],
        ts_data[, get_sequence(x = CHANGE, pattern = c(-1,1), minsize = 2)]),
  GROUP = c("no change","no change","down-up"))

ggplot(ts_data, aes(x = DAY, y = VALUE)) +
  geom_line() +
  geom_rect(data = rect,
  inherit.aes = FALSE,
  aes(xmin = min - 1,
  xmax = max,
  ymin = -Inf,
  ymax = Inf,
  group = GROUP,
  fill = GROUP),
  color = "transparent",
  alpha = 0.5) +
  scale_fill_manual(values = statworx_palette(number = 2, basecolors = c(2,5))) +
  theme_minimal()

9. intersect2

This little helper returns the intersect of multiple vectors or lists. I found this function here, thought it is quite useful and adjusted it a bit.

intersect2(list(c(1:3), c(1:4)), list(c(1:2),c(1:3)), c(1:2))
[1] 1 2

Internally, the problem of finding the intersection is solved recursively, if an element is a list and then stepwise with the next element.

10. multiplot

This little helper combines multiple ggplots into one plot. This is a function taken from the R
cookbook
.

An advantage over facets is, that you don’t need all data for all plots within one object. Also you can freely create each single plot – which can sometimes also be a disadvantage.

With the layout parameter you can arrange multiple plots with different sizes. Let’s say you have three plots and want to arrange them like this:

1    2    2
1    2    2
3    3    3

With multiplot it boils down to

multiplot(plotlist = list(p1, p2, p3),
          layout = matrix(c(1,2,2,1,2,2,3,3,3), nrow = 3, byrow = TRUE))

Code for plot example

# star coordinates
c1  =   cos((2*pi)/5)   
c2  =   cos(pi/5)
s1  =   sin((2*pi)/5)
s2  =   sin((4*pi)/5)

data_star <- data.table(X = c(0, -s2, s1, -s1, s2),
                        Y = c(1, -c2, c1, c1, -c2))

p1 <- ggplot(data_star, aes(x = X, y = Y)) +
  geom_polygon(fill = "gold") +
  theme_void()

# tree
set.seed(24122018)
n <- 10000
lambda <- 2
data_tree <- data.table(X = c(rpois(n, lambda), rpois(n, 1.1*lambda)),
                        TYPE = rep(c("1", "2"), each = n))
data_tree <- data_tree[, list(COUNT = .N), by = c("TYPE", "X")]
data_tree[TYPE == "1", COUNT := -COUNT]

p2 <- ggplot(data_tree, aes(x = X, y = COUNT, fill = TYPE)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("green", "darkgreen")) +
  coord_flip() +
  theme_minimal()

# gifts
data_gifts <- data.table(X = runif(5, min = 0, max = 10),
                         Y = runif(5, max = 0.5),
                         Z = sample(letters[1:5], 5, replace = FALSE))

p3 <- ggplot(data_gifts, aes(x = X, y = Y)) +
  geom_point(aes(color = Z), pch = 15, size = 10) +
  scale_color_brewer(palette = "Reds") +
  geom_point(pch = 12, size = 10, color = "gold") +
  xlim(0,8) +
  ylim(0.1,0.5) +
  theme_minimal() + 
  theme(legend.position="none") 


11. na_omitlist

This little helper removes missing values from a list.

y <- list(NA, c(1, NA), list(c(5:6, NA), NA, "A"))

There are two ways to remove the missing values, either only on the first level of the list or wihtin each sub level.

na_omitlist(y, recursive = FALSE)
[[1]]
[1]  1 NA

[[2]]
[[2]][[1]]
[1]  5  6 NA

[[2]][[2]]
[1] NA

[[2]][[3]]
[1] "A"
na_omitlist(y, recursive = TRUE)
[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 5 6

[[2]][[2]]
[1] "A"

12. %nin%

This little helper is just a convenience function. It is simply the same as the negated %in% operator, as you can see below. But in my opinion, it increases the readability of the code.

all.equal( c(1,2,3,4) %nin% c(1,2,5),
          !c(1,2,3,4) %in%  c(1,2,5))
[1] TRUE

Also, this operator has made it into a few other packages as well – as you can read here.

13. object_size_in_env

This little helper shows a table with the size of each object in the given environment.

If you are in a situation where you have coded a lot and your environment is now quite messy, object_size_in_env helps you to find the big fish with respect to memory usage. Personally, I ran into this problem a few times when I looped over multiple executions of my models. At some point, the sessions became quite large in memory and I did not know why! With the help of object_size_in_env and some degubbing I could locate the object that caused this problem and adjusted my code accordingly.

First, let us create an environment with some variables.

# building an environment
this_env <- new.env()
assign("Var1", 3, envir = this_env)
assign("Var2", 1:1000, envir = this_env)
assign("Var3", rep("test", 1000), envir = this_env)

To get the size information of our objects, internally format(object.size()) is used. With the unit the output format can be changed (eg. "B", "MB" or "GB") .

# checking the size
object_size_in_env(env = this_env, unit = "B")
   OBJECT SIZE UNIT
1:   Var3 8104    B
2:   Var2 4048    B
3:   Var1   56    B

14. print_fs

This little helper returns the folder structure of a given path. With this, one can for example add a nice overview to the documentation of a project or within a git. For the sake of automation, this function could run and change parts wihtin a log or news file after a major change.

If we take a look at the same example we used for the get_network function, we get the following:

print_fs("~/flowchart/", depth = 4)
1  flowchart                            
2   ¦--create_network.R                 
3   ¦--getnetwork.R                     
4   ¦--plots                            
5   ¦   ¦--example-network-helfRlein.png
6   ¦   °--improved-network.png         
7   ¦--R_network_functions              
8   ¦   ¦--dataprep                     
9   ¦   ¦   °--foo_01.R                 
10  ¦   ¦--method                       
11  ¦   ¦   °--foo_02.R                 
12  ¦   ¦--script_01.R                  
13  ¦   °--script_02.R                  
14  °--README.md 

With depth we can adjust how deep we want to traverse through our folders.

15. read_files

This little helper reads in multiple files of the same type and combines them into a data.table. Which kind of file reading function should be used can be choosen by the FUN argument.

If you have a list of files, that all needs to be loaded in with the same function (e.g. read.csv), instead of using lapply and rbindlist now you can use this:

read_files(files, FUN = readRDS)
read_files(files, FUN = readLines)
read_files(files, FUN = read.csv, sep = ";")

Internally, it just uses lapply and rbindlist but you dont have to type it all the time. The read_files combines the single files by their column names and returns one data.table. Why data.table? Because I like it. But, let’s not go down the rabbit hole of data.table vs dplyr (to the
rabbit hole …
).

16. save_rds_archive

This little helper is a wrapper around base R saveRDS() and checks if the file you attempt to save already exists. If it does, the existing file is renamed / archived (with a time stamp), and the “updated” file will be saved under the specified name. This means that existing code which depends on the file name remaining constant (e.g., readRDS() calls in other scripts) will continue to work while an archived copy of the – otherwise overwritten – file will be kept.

17. sci_palette

This little helper returns a set of colors which we often use at statworx. So, if – like me – you cannot remeber each hex color code you need, this might help. Of course these are our colours, but you could rewrite it with your own palette. But the main benefactor is the plotting method – so you can see the color instead of only reading the hex code.

To see which hex code corresponds to which colour and for what purpose to use it

sci_palette(scheme = "new")
Tech Blue       Black       White  Light Grey    Accent 1    Accent 2    Accent 3 
"#0000FF"   "#000000"   "#FFFFFF"   "#EBF0F2"   "#283440"   "#6C7D8C"   "#B6BDCC"   
Highlight 1 Highlight 2 Highlight 3 
"#00C800"   "#FFFF00"   "#FE0D6C" 
attr(,"class")
[1] "sci"

As mentioned above, there is a plot() method which gives the following picture.

plot(sci_palette(scheme = "new"))

18. statusbar

This little helper prints a progress bar into the console for loops.

There are two nessecary parameters to feed this function:

  • run is either the iterator or its number
  • max.run is either all possible iterators in the order they are processed or the maximum number of iterations.

So for example it could be run = 3 and max.run = 16 or run = "a" and max.run = letters[1:16].

Also there are two optional parameter:

  • percent.max influences the width of the progress bar
  • info is an additional character, which is printed at the end of the line. By default it is run.

A little disadvantage of this function is, that it does not work with parallel processes. If you want to have a progress bar when using apply functions check out pbapply.

19. statworx_palette

This little helper is an addition to yesterday’s sci_palette(). We picked colors 1, 2, 3, 5 and 10 to create a flexible color palette. If you need 100 different colors – say no more!

In contrast to sci_palette() the return value is a character vector. For example if you want 16 colors:

statworx_palette(16, scheme = "old")
[1] "#013848" "#004C63" "#00617E" "#00759A" "#0087AB" "#008F9C" "#00978E" "#009F7F"
[9] "#219E68" "#659448" "#A98B28" "#ED8208" "#F36F0F" "#E45A23" "#D54437" "#C62F4B"

If we now plot those colors, we get this nice rainbow like gradient.

library(ggplot2)

ggplot(plot_data, aes(x = X, y = Y)) +
  geom_point(pch = 16, size = 15, color = statworx_palette(16, scheme = "old")) +
  theme_minimal()

An additional feature is the reorder parameter, which samples the color’s order so that neighbours might be a bit more distinguishable. Also if you want to change the used colors, you can do so with basecolors .

ggplot(plot_data, aes(x = X, y = Y)) +
  geom_point(pch = 16, size = 15,
             color = statworx_palette(16, basecolors = c(4,8,10), scheme = "new")) +
  theme_minimal()

20. strsplit

This little helper adds functionality to the base R function strsplit – hence the same name! It is now possible to split before, after or between a given delimiter. In the case of between you need to specify two delimiters.

An earlier version of this function can be found in this blog post, where I describe the used regular expressions, if you are interested.

Here is a little example on how to use the new strsplit.

text <- c("This sentence should be split between should and be.")

strsplit(x = text, split = " ")
strsplit(x = text, split = c("should", " be"), type = "between")
strsplit(x = text, split = "be", type = "before")
[[1]]
[1] "This"     "sentence" "should"   "be"       "split"    "between"  "should"   "and"     
[9] "be."

[[1]]
[1] "This sentence should"             " be split between should and be."

[[1]]
[1] "This sentence should " "be split "             "between should and "  
[4] "be."

21. to_na

This little helper is just a convenience function. Some times during your data preparation, you have a vector with infinite values like Inf or -Inf or even NaN values. Thos kind of value can (they do not have to!) mess up your evaluation and models. But most functions do have a tendency to handle missing values. So, this little helper removes such values and replaces them with NA.

A small exampe to give you the idea:

test <- list(a = c("a", "b", NA),
             b = c(NaN, 1,2, -Inf),
             c = c(TRUE, FALSE, NaN, Inf))

lapply(test, to_na)
$a
[1] "a" "b" NA 

$b
[1] NA  1  2 NA

$c
[1]  TRUE FALSE    NA

A little advice along the way! Since there are different types of NA depending on the other values within a vector. You might want to check the format if you do to_na on groups or subsets.

test <- list(NA, c(NA, "a"), c(NA, 2.3), c(NA, 1L))
str(test)
List of 4
 $ : logi NA
 $ : chr [1:2] NA "a"
 $ : num [1:2] NA 2.3
 $ : int [1:2] NA 1

22. trim

This little helper removes leading and trailing whitespaces from a string. With R version 3.5.1 trimws was introduced, which does the exact same thing. This just shows, it was not a bad idea to write such a function. 😉

x <- c("  Hello world!", "  Hello world! ", "Hello world! ")
trim(x, lead = TRUE, trail = TRUE)
[1] "Hello world!" "Hello world!" "Hello world!"

The lead and trail parameters indicates if only leading, trailing or both whitspaces should be removed.

Conclusion

I hope that the helfRlein package makes your work as easy as it is for us here at statworx. If you have any questions or input about the package, please send us an email to: blog@statworx.com

Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp Jakob Gepp

In the field of Data Science – as the name suggests – the topic of data, from data cleaning to feature engineering, is one of the cornerstones. Having and evaluating data is one thing, but how do you actually get data for new problems?

If you are lucky, the data you need is already available. Either by downloading a whole dataset or by using an API. Often, however, you have to gather information from websites yourself – this is called web scraping. Depending on how often you want to scrape data, it is advantageous to automate this step.

This post will be about exactly this automation. Using web scraping and GitHub Actions as an example, I will show how you can create your own data sets over a more extended period. The focus will be on the experience I have gathered over the last few months.

The code I used and the data I collected can be found in this GitHub repository.

Search for data – the initial situation

During my research for the blog post about gasoline prices, I also came across data on the utilization of parking garages in Frankfurt am Main. Obtaining this data laid the foundation for this post. After some thought and additional research, other thematically appropriate data sources came to mind:

  • Road utilization
  • S-Bahn and subway delays
  • Events nearby
  • Weather data

However, it quickly became apparent that I could not get all this data, as it is not freely available or allowed to be stored. Since I planned to store the collected data on GitHub and make it available, this was a crucial point for which data came into question. For these reasons, railway data fell out completely. I only found data for Cologne for road usage, and I wanted to avoid using the Google API as that definitely brings its own challenges. So, I was left with event and weather data.

For the weather data of the German Weather Service, the rdwd package can be used. Since this data is already historized, it is irrelevant for this blog post. The GitHub Actions have proven to be very useful to get the remaining event and park data, even if they are not entirely trivial to use. Especially the fact that they can be used free of charge makes them a recommendable tool for such projects.

Scraping the data

Since this post will not deal with the details of web scraping, I refer you here to the post by my colleague David.

The parking data is available here in XML format and is updated every 5 minutes. Once you understand the structure of the XML, it’s a simple matter of accessing the right index, and you have the data you want. In the function get_parking_data(), I have summarized everything I need. It creates a record for the area and a record for the individual parking garages.

Example data extract area

parkingAreaOccupancy;parkingAreaStatusTime;parkingAreaTotalNumberOfVacantParkingSpaces;
totalParkingCapacityLongTermOverride;totalParkingCapacityShortTermOverride;id;TIME
0.08401977;2021-12-01T01:07:00Z;556;150;607;1[Anlagenring];2021-12-01T01:07:02.720Z
0.31417114;2021-12-01T01:07:00Z;513;0;748;4[Bahnhofsviertel];2021-12-01T01:07:02.720Z
0.351417;2021-12-01T01:07:00Z;801;0;1235;5[Dom / Römer];2021-12-01T01:07:02.720Z
0.21266666;2021-12-01T01:07:00Z;1181;70;1500;2[Zeil];2021-12-01T01:07:02.720Z

Example data extract facility

parkingFacilityOccupancy;parkingFacilityStatus;parkingFacilityStatusTime;
totalNumberOfOccupiedParkingSpaces;totalNumberOfVacantParkingSpaces;
totalParkingCapacityLongTermOverride;totalParkingCapacityOverride;
totalParkingCapacityShortTermOverride;id;TIME
0.02;open;2021-12-01T01:02:00Z;4;196;150;350;200;24276[Turmcenter];2021-12-01T01:07:02.720Z
0.11547912;open;2021-12-01T01:02:00Z;47;360;0;407;407;18944[Alte Oper];2021-12-01T01:07:02.720Z
0.0027472528;open;2021-12-01T01:02:00Z;1;363;0;364;364;24281[Hauptbahnhof Süd];2021-12-01T01:07:02.720Z
0.609375;open;2021-12-01T01:02:00Z;234;150;0;384;384;105479[Baseler Platz];2021-12-01T01:07:02.720Z

For the event data, I scrape the page stadtleben.de. Since it is a HTML that is quite well structured, I can access the tabular event overview via the tag “kalenderListe”. The result is created by the function get_event_data().

Example data extract event

eventtitle;views;place;address;eventday;eventdate;request
Magical Sing Along - Das lustigste Mitsing-Event;12576;Bürgerhaus;64546 Mörfelden-Walldorf, Westendstraße 60;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Velvet-Bar-Night;1460;Velvet Club;60311 Frankfurt, Weißfrauenstraße 12-16;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
Basta A-cappella-Band;465;Zeltpalast am Deutsche Bank Park;60528 Frankfurt am Main, Mörfelder Landstraße 362;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z
BeThrifty Vintage Kilo Sale | Frankfurt | 04. & 05. …;1302;Batschkapp;60388 Frankfurt am Main, Gwinnerstraße 5;Freitag;2022-03-04;2022-03-04T02:24:14.234833Z

Automation of workflows – GitHub Actions

The basic framework is in place. I have a function that writes the park and event data to a .csv file when executed. Since I want to query the park data every 5 minutes and the event data three times a day for security, GitHub Actions come into play.

With this function of GitHub, workflows can be scheduled and executed in addition to actions triggered during merging or committing. For this purpose, a .yml file is created in the folder /.github/workflows.

The main components of my workflow are:

  • The schedule – Every ten minutes, the functions should be executed
  • The OS – Since I develop locally on a Mac, I use the macOS-latest here.
  • Environment variables – This contains my GitHub token and the path for the package management renv.
  • The individual steps in the workflow itself.

The workflow goes through the following steps:

  • Setup R
  • Load packages with renv
  • Run script to scrape data
  • Run script to update the README
  • Pushing the new data back into git

Each of these steps is very small and clear in itself; however, as is often the case, the devil is in the details.

Limitation and challenges

Over the last few months, I’ve been tweaking and optimizing my workflow to deal with the bugs and issues. In the following, you will find an overview of my condensed experiences with GitHub Actions from the last months.

Schedule problems

If you want to perform time-critical actions, you should use other services. GitHub Action does not guarantee that the jobs will be timed exactly (or, in some cases, that they will be executed at all).

Time span in minutes <= 5 <= 10 <= 20 <= 60 > 60
Number of queries 1720 2049 5509 3023 194

You can see that the planned five-minute intervals were not always adhered to. I should plan a larger margin here in the future.

Merge conflicts

In the beginning, I had two workflows, one for the park data and one for the events. If they overlapped in time, there were merge conflicts because both processes updated the README with a timestamp. Over time, I switched to a workflow including error handling.
Even if one run took longer and the next one had already started, there were merge conflicts in the .csv data when pushing. Long runs were often caused by the R setup and the loading of the packages. Consequently, I extended the schedule interval from five to ten minutes.

Format adjustments

There were a few situations where the paths or structure of the scraped data changed, so I had to adjust my functions. Here the setting to get an email if a process failed was very helpful.

Lack of testing capabilities

There is no way to test a workflow script other than to run it. So, after a typo in the evening, one can wake up to a flood of emails with spawned runs in the morning. Still, that shouldn’t stop you from doing a local test run.

No data update

Since the end of December, the parking data has not been updated or made available. This shows that even if you have an automatic process, you should still continue to monitor it. I only noticed this later, which meant that my queries at the end of December always went nowhere.

Conclusion

Despite all these complications, I still consider the whole thing a massive success. Over the last few months, I’ve been studying the topic repeatedly and have learned the tricks described above, which will also help me solve other problems in the future. I hope that all readers of this blog post could also take away some valuable tips and thus learn from my mistakes.

Since I have now collected a good half-year of data, I can deal with the evaluation. But this will be the subject of another blog post. Jakob Gepp Jakob Gepp

Why bother? AI and the climate crisis

According to the newest report from the Intergovernmental Panel on Climate Change (IPCC) in August 2021, “it is unequivocal that human influence has warmed the atmosphere, ocean and land” [1]. Climate change also occurs faster than previously thought. Regarding most recent estimations, the average global surface temperature increased by 1.07°C from 2010 to 2019 compared to 1850 to 1900 due to human influence. Furthermore, the atmospheric CO2 concentrations in 2019 “were higher than at any time in at least 2 million years” [1].

Still, global carbon emissions are rising, although there was a slight decrease in 2020 [2], probably due to the coronavirus and its economic effects. In 2019, 36.7 gigatons (Gt) CO2 were emitted worldwide [2]. Be aware that one Gt is one billion tons. To achieve the 1.5 °C goal with an estimated probability of about 80%, we have only 300 Gts left at the beginning of 2020 [1]. As both 2020 and 2021 are over and assuming carbon emissions of about 35 Gts for each year, the remaining budget is about 230 Gt CO2. If the yearly amount stayed constant over the next years, the remaining carbon budget would be exhausted in about seven years.

In 2019, China, the USA, and India were the most emitting countries. Overall, Germany is responsible for only about 2% of all global emissions, but it was still in seventh place with about 0.7 Gt in 2019 (see graph below). Altogether, the top ten most emitting countries account for about two-thirds of all carbon emissions in 2019 [2]. Most of these countries are highly industrialized and will likely enhance their usage of artificial intelligence (AI) to strengthen their economies during the following decades.

Using AI to reduce carbon emissions

So, what about AI and carbon emissions? Well, the usage of AI is two sides of the same coin [3]. On the one hand, AI has a great potential to reduce carbon emissions by providing more accurate predictions or improving processes in many different fields. For example, AI can be applied to predict intemperate weather events, optimize supply chains, or monitor peatlands [4, 5].

According to a recent estimation of Microsoft and PwC, the usage of AI for environmental applications can save up to 4.4% of all greenhouse gas emissions worldwide by 2030 [6].
In absolute numbers, the usage of AI for environmental applications can reduce worldwide greenhouse gas emissions by 0.9 – 2.4 Gts of CO2e. This amount is equivalent to the estimated annual emissions of Australia, Canada, and Japan together in 2030 [7]. To be clear, greenhouse gases also include other emitted gases like methane that also reinforce the earth’s greenhouse effect. To easily measure all of them, they are often declared as equivalents to CO2 and hence abbreviated as CO2e.

AI’s carbon footprint

Despite the great potential of AI to reduce carbon emissions, the usage of AI itself also emits CO2, which is the other side of the coin. From 2012 to 2018, the estimated amount of computation used to train deep learning models has increased by 300.000 (see graph below, [8]). Hence, research, training, and deployment of AI models require an increasing amount of energy and hardware, of course. Both produce carbon emissions and thus contribute to climate change.

Note: Graph taken from [8].

Unfortunately, I could not find a study that estimates the overall carbon emissions of AI. Still, there are some estimations of the CO2 or CO2e emissions of some Natural Language Processing (NLP) models that have become increasingly accurate and hence popular during recent years [9]. According to the following table, the final training of Google’s BERT model roughly emitted as much CO2e as one passenger on their flight from New York to San Francisco. Of course, the training of other NLP models – like Transformerbig – emitted far less, but the final training of a model is only the last part of finding the best model. Prior to the final training, many different models are tried to find the best parameters. Accordingly, this neural architecture search for the Transformerbig model emitted about five times the CO2e emissions as an average car in its lifetime. Now, you may look at the estimated CO2e emissions of GPT-3 and imagine how much emissions resulted from the related neural architecture search.

Comparison of selected human and AI carbon emissions
Human emissions AI emissions
Example CO2e emissions (tons) NLP model training CO2e emissions (tons)
One passenger air traveling
New York San Francisco
0.90 Transformerbig 0.09
Average human life
one year
5.00 BERTbase 0.65
Average American life
one year
16.40 GPT-3 84.74
Average car lifetime
incl. fuel
57.15 Neural architecture search
for Transformerbig
284.02

Note: All values extracted from [9], except the value of for GPT-3 [17]

What you, as a data scientist, can do the reduce your carbon footprint

Overall, there are many ways you, as a data scientist, can reduce your carbon footprint during the training and deployment of AI models. As the most important areas of AI are currently machine learning (ML) and deep learning (DL), different ways to measure and reduce the carbon footprint of these models are described in the following.

1. Be aware of the negative consequences and report them

It may sound simple but being aware of the negative consequences of searching, training, and deploying ML and DL models is the first step to reducing your carbon emissions. It is essential to understand how AI negatively impacts our environment to take the extra effort and be willing to report carbon emissions systematically, which is needed to tackle climate change [8, 9, 10]. So, if you skipped the first part about AI and the climate crisis, go back and read it. It’s worth it!

2. Measure the carbon footprint of your code

To make carbon emissions of your ML and DL models explicit, they need to be measured. Currently, there is no standardized framework to measure all sustainability aspects of AI, but one is currently formed [11]. Until there is a holistic framework, you can start by making energy consumption and related carbon emissions explicit [12]. Probably, some of the most elaborated packages to compute ML and DL models are implemented in the programming language Python. Although Python is not the most efficient programming language [13], it was again rated the most popular programming language in the PYPL index in September 2021 [14]. Accordingly, there are even three Python packages that you can use to track the carbon emissions of training your models:

  • CodeCarbon [15, 16]
  • CarbonTracker [17]
  • Experiment Impact Tracker [18]

Based on my perception, CodeCarbon and CarbonTracker seem to be the easiest ones to use. Furthermore, CodeCarbon can easily be combined with TensorFlow and CarbonTracker with PyTorch. Therefore, you find an example for each package below.

I trained a simple multilayer perceptron with two hidden layers and 256 neurons using the MNIST data set for both packages. To simulate a CPU- and GPU-based computation, I trained the model with TensorFlow and CodeCarbon on my local machine (15-inches MacBook Pro from 2018 and 6 Intel Core i7 CPUs) and the one with PyTorch and Carbontracker in a Google Colab using a Tesla K80 GPU. First, you find the TensorFlow and CodeCarbon code below.

# import needed packages
import tensorflow as tf
from codecarbon import EmissionsTracker

# prepare model training
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0


model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

# train model and track carbon emissions
tracker = EmissionsTracker()
tracker.start()
model.fit(x_train, y_train, epochs=10)
emissions: float = tracker.stop()
print(emissions)

After executing the code above, Codecarbon creates a csv file as output which includes different output parameters like computation duration in seconds, total power consumed by the underlying infrastructure in kWh and the related CO2e emissions in kg. The training of my model took 112.15 seconds, consumed 0.00068 kWh, and created 0.00047 kg of CO2e.

Regarding PyTorch and CarbonTracker, I used this Google Colab notebook as the basic setup. To incorporate the tracking of carbon emissions and make the two models comparable, I changed a few details of the notebook. First, I changed the model in step 2 (Define Network) from a convolutional neural network to the multilayer perceptron (I kept the class name CNN to make the rest of the notebook still work):

 class CNN(nn.Module):
  """A simple MLP model."""

  @nn.compact
  def __call__(self, x):
    x = x.reshape((x.shape[0], -1))  # flatten
    x = nn.Dense(features=256)(x)
    x = nn.relu(x)
    x = nn.Dense(features=256)(x)
    x = nn.relu(x)
    x = nn.Dense(features=10)(x)
    x = nn.log_softmax(x)
    return x

Second, I inserted the installation and import of CarbonTracker as well as the tracking of the carbon emissions in step 14 (Train and evaluate):

 !pip install carbontracker

from carbontracker.tracker import CarbonTracker

tracker = CarbonTracker(epochs=num_epochs)
for epoch in range(1, num_epochs + 1):
  tracker.epoch_start()

  # Use a separate PRNG key to permute image data during shuffling
  rng, input_rng = jax.random.split(rng)
  # Run an optimization step over a training batch
  state = train_epoch(state, train_ds, batch_size, epoch, input_rng)
  # Evaluate on the test set after each training epoch 
  test_loss, test_accuracy = eval_model(state.params, test_ds)
  print(' test epoch: %d, loss: %.2f, accuracy: %.2f' % (
      epoch, test_loss, test_accuracy * 100))
  
  tracker.epoch_end()

tracker.stop()

After executing the whole notebook, CarbonTracker prints the following output after the first training epoch is finished.

 train epoch: 1, loss: 0.2999, accuracy: 91.25
 test epoch: 1, loss: 0.22, accuracy: 93.42
CarbonTracker:
Actual consumption for 1 epoch(s):
       Time:  0:00:15
       Energy: 0.000397 kWh
       CO2eq: 0.116738 g
       This is equivalent to:
       0.000970 km travelled by car
CarbonTracker:
Predicted consumption for 10 epoch(s):
       Time:  0:02:30
       Energy: 0.003968 kWh
       CO2eq: 1.167384 g
       This is equivalent to:
       0.009696 km travelled by car

As expected, the GPU needed more energy and produced more carbon emissions. The energy consumption was 6 times higher and the carbon emissions about 2.5 times higher compared to my local CPUs. Obviously, the increased energy consumption is related to the increased computation time that was 2.5 minutes for the GPU but only less than 2 minutes for the CPUs. Overall, both packages provide all needed information to assess and report carbon emissions and related information.

3. Compare different regions of cloud providers

In recent years, the training and deployment of ML or DL models in the cloud have become more important compared to local computations. Clearly, one of the reasons is the increased need for computation power [8]. Accessing GPUs in the cloud is, for most companies, faster and cheaper than building their own data center. Of course, data centers of cloud providers also need hardware and energy for computation. It is estimated that about 1% of worldwide electricity demand is produced by data centers [19]. The usage of every hardware, regardless of its location, produces carbon emissions, and that’s why it is also important to measure carbon emissions emitted by training and deployment of ML and DL models in the cloud.

Currently, there are two different CO2e calculators that can easily be used to calculate carbon emissions in the cloud [20, 21]. The good news is that all three big cloud providers – AWS, Azure, and GCP – are incorporated in both calculators. To find out which of the three big cloud providers and which European region is best, I used the first calculator – ML CO2 Impact [20] – to calculate the CO2e emissions for the final training of GPT-3. The final model training of GPT-3 required 310 GPUs (NVIDIA Tesla V100 PCIe) running non-stop for 90 days [17]. To compute the estimated emissions of the different providers and regions, I chose the available option “Tesla V100-PCIE-16GB” as GPU. The results of the calculations can be found in the following table.

Comparison of different European regions and cloud providers
Google Cloud Computing AWS Cloud Computing Microsoft Azure
Region CO2e emissions (tons) Region CO2e emissions (tons) Region CO2e emissions (tons)
europe-west1 54.2 EU – Frankfurt 122.5 France Central 20.1
europe-west2 124.5 EU – Ireland 124.5 France South 20.1
europe-west3 122.5 EU – London 124.5 North Europe 124.5
europe-west4 114.5 EU – Paris 20.1 West Europe 114.5
europe-west6 4.0 EU – Stockholm 10.0 UK West 124.5
europe-north1 42.2 UK South 124.5

Overall, at least two findings are fascinating. First, even within the same cloud provider, the chosen region has a massive impact on the estimated CO2e emissions. The most significant difference is present for GCP with a factor of more than 30. This huge difference is partly due to the small emissions of 4 tons in the region europe-west6, which are also the smallest emissions overall. Interestingly, such a huge factor of 30 is a lot more than those described in scientific papers, which are factors of 5 to 10 [12]. Second, some estimated values are equal, which shows that some kind of simplification was used for these estimations. Therefore, you should treat the absolute values with caution, but the difference between the regions still holds as they are all based on the same (simplified) way of calculation.

Finally, to choose the cloud provider with a minimal carbon footprint in total, it is also essential to consider the sustainability strategies of the cloud providers. In this area, GCP and Azure seem to have more effective strategies for the future compared to AWS [22, 23] and have already reached 100% renewable energy with offsets and energy certificates so far. Still, none of them uses 100% renewable energy itself (see table 2 in [9]). From an environmental perspective, I personally prefer GCP because their strategy most convinced me. Furthermore, GCP has implemented a hint for “regions with the lowest carbon impact inside Cloud Console location selectors“ since 2021 [24]. These kinds of help indicate the importance of this topic to GCP.

4. Train and deploy with care

Finally, there are many other helpful hints and tricks related to the training and deployment of ML and DL models that can help you minimize your carbon footprint as a data scientist.

  • Practice to be sparse! New research that combines DL models with state-of-the-art findings in neuroscience can reduce computation times by up to 100 times and save lots of carbon emissions [25].
  • Search for simpler and less computing-intensive models with comparable accuracy and use them if appropriate. For example, there is a smaller and faster version of BERT available called DistilBERT with comparable accuracy values [26]
  • Consider transfer learning and foundation models [10] to maximize accuracy and minimize computations at the same time.
  • Consider Federated learning to reduce carbon emissions [27].
  • Don’t just think of the accuracy of your model; consider efficiency as well. Always ponder if a 1% increase in accuracy is worth the additional environmental impact [9, 12].
  • If the region of best hyperparameters is still unknown, use random or Bayesian hyperparameter search instead of grid search [9, 20].
  • If your model will be retrained periodically after deployment, choose the training interval consciously. Regarding the associated business case, it may be enough to provide a newly trained model each month and not each week.

Conclusion

Human beings and their greenhouse gas emissions influence our climate and warm the world. AI can and should be part of the solution to tackle climate change. Still, we need to keep an eye on its carbon footprint to make sure that it will be part of the solution and not part of the problem.

As a data scientist, you can do a lot. You can inform yourself and others about the positive possibilities and negative consequences of using AI. Furthermore, you can measure and explicitly state the carbon emissions of your models. You can describe your efforts to minimize their carbon footprint, too. Finally, you can also choose your cloud provider consciously and, for example, check if there are simpler models that result in a comparable accuracy but with fewer emissions.

Recently, we at statworx have formed a new initiative called AI and Environment to incorporate these aspects in our daily work as data scientists. If you want to know more about it, just get in touch with us!

References

  1. https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_SPM_final.pdf
  2. http://www.globalcarbonatlas.org/en/CO2-emissions
  3. https://doi.org/10.1007/s43681-021-00043-6
  4. https://arxiv.org/pdf/1906.05433.pdf
  5. https://www.pwc.co.uk/sustainability-climate-change/assets/pdf/how-ai-can-enable-a-sustainable-future.pdf
  6. Harness Artificial Intelligence
  7. https://climateactiontracker.org/
  8. https://arxiv.org/pdf/1907.10597.pdf
  9. https://arxiv.org/pdf/1906.02243.pdf
  10. https://arxiv.org/pdf/2108.07258.pdf
  11. https://algorithmwatch.org/de/sustain/
  12. https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf
  13. https://stefanos1316.github.io/my_curriculum_vitae/GKS17.pdf
  14. https://pypl.github.io/PYPL.html
  15. https://codecarbon.io/
  16. https://mlco2.github.io/codecarbon/index.html
  17. https://arxiv.org/pdf/2007.03051.pdf
  18. https://github.com/Breakend/experiment-impact-tracker
  19. https://www.iea.org/reports/data-centres-and-data-transmission-networks
  20. https://mlco2.github.io/impact/#co2eq
  21. http://www.green-algorithms.org/
  22. https://blog.container-solutions.com/the-green-cloud-how-climate-friendly-is-your-cloud-provider
  23. https://www.wired.com/story/amazon-google-microsoft-green-clouds-and-hyperscale-data-centers/
  24. https://cloud.google.com/blog/topics/sustainability/pick-the-google-cloud-region-with-the-lowest-co2)
  25. https://arxiv.org/abs/2112.13896
  26. https://arxiv.org/abs/1910.01108
  27. https://flower.dev/blog/2021-07-01-what-is-the-carbon-footprint-of-federated-learning

Alexander Niltop Alexander Niltop

Introduction

The more complex any given data science project in Python gets, the harder it usually becomes to keep track of how all modules interact with each other. Undoubtedly, when working in a team on a bigger project, as is often the case here at STATWORX, the codebase can soon grow to an extent where the complexity may seem daunting. In a typical scenario, each team member works in their “corner” of the project, leaving each one merely with firm local knowledge of the project’s code but possibly only a vague idea of the overall project architecture. Ideally, however, everyone involved in the project should have a good global overview of the project. By that, I don’t mean that one has to know how each function works internally but rather to know the responsibility of the main modules and how they are interconnected.

A visual helper for learning about the global structure can be a call graph. A call graph is a directed graph that displays which function calls which. It is created from the data of a Python profiler such as cProfile.

Since such a graph proved helpful in a project I’m working on, I created a package called project_graph, which builds such a call graph for any provided python script. The package creates a profile of the given script via cProfile, converts it into a filtered dot graph via gprof2dot, and finally exports it as a .png file.

Why Are Project Graphs Useful?

As a small first example, consider this simple module.

# test_script.py

import time
from tests.goodnight import sleep_five_seconds

def sleep_one_seconds():
    time.sleep(1)

def sleep_two_seconds():
    time.sleep(2)

for i in range(3):
    sleep_one_seconds()

sleep_two_seconds()

sleep_five_seconds()

After installation (see below), by writing project_graph test_script.py into the command line, the following png-file is placed next to the script:

The script to be profiled always acts as a starting point and is the root of the tree. Each box is captioned with a function’s name, the overall percentage of time spent in the function, and its number of calls. The number in brackets represents the time spent in the function’s code, excluding time spent in other functions that are called in it.

In this case, all time is spent in the external module time‘s function sleep, which is why the number is 0.00%. Rarely a lot of time is spent in self-written functions, as the workload of a script usually quickly trickles down to very low-level functions of the Python implementation itself. Also, next to the arrows is the amount of time that one function passes to the other, along with the number of calls. The colors (RED-GREEN-BLUE, descending) and the thickness of the arrows indicate the relevance of different spots in the program.

Note that the percentages of the three functions above don’t add up to 100%. The reason behind is is that the graph is set up to only include self-written functions. In this case, the importing the time module caused the Python interpreter to spend 0.04% time in a function of the module importlib.

Evaluation with External Packages

Consider a second example:

# test_script_2.py

import pandas as pd
from tests.goodnight import sleep_five_seconds

# some random madness
for i in range(1000):
   a_frame = pd.DataFrame([[1,2,3]])

sleep_five_seconds()

capture this in the graph, we can add the external package (pandas) with the -x flag. However, initializing a Pandas DataFrame is done within many pandas-internal functions. Frankly, I am personally not interested in the inner convolutions of pandas which is why I want the tree to not “sprout” too deep into the pandas mechanics. This can be accounted for by allowing only functions to show up if a minimal percentage of the runtime is spent in them.

Exactly this can be done using the -mflag. In combination, project_graph -m 8 -x pandas test_script_2.py yields the following:

Project Graph Creation Example 02

Toy examples aside, let’s move on to something more serious. A real-life data-science project could look like this one:

Project Graph Creation Example 03

This time the tree is much bigger. It is actually even bigger than what you see in the illustration, as many more self-written functions are invoked. However, they are trimmed from the tree for clarity, as functions in which less than 0.5 % of the overall time is spent are filtered out (this is the default setting for the -m flag). Note that such a graph also really shines when searching for performance bottlenecks. One can see right away which functions carry most of the workload, when they are called, and how often they are called. This may prevent you from optimizing your program in the wrong spots while ignoring the elephant in the room.

How to Use Project Graphs

Installation

Within your project’s environment, do the following:

brew install graphviz

pip install git+https://github.com/fior-di-latte/project_graph.git

Usage

Within your project’s environment, change your current working directory to the project’s root (this is important!) and then enterfor standard usage:

project_graph myscript.py

If your script includes an argparser, use:

project_graph "myscript.py <arg1> <arg2> (...)"

If you want to see the entire graph, including all external packages, use:

project_graph -a myscript.py

If you want to use a visibility threshold other than 1%, use:

project_graph -m <percent_value> myscript.py

Finally, if you want to include external packages into the graph, you can specify them as follows:

project_graph -x <package1> -x <package2> (...) myscript.py

Conclusion & Caveats

This package has certain weaknesses, most of which can be addressed, e.g., by formatting the code into a function-based style, by trimming with the -m flag, or adding packages by using the -x flag. Generally, if something seems odd, the best first step is probably to use the -a flag to debug. Significant caveats are the following:

  • It only works on Unix systems.
  • It does not show a truthful graph when used with multiprocessing. The reason behind that is that cProfile is not compatible with multiprocessing. If multiprocessing is used, only the root process will be profiled, leading to false computation times in the graph. Switch to a non-parallel version of the target script.
  • Profiling a script can lead to a considerable overhead computation-wise. It can make sense to scale down the work done in your script (i.e., decrease the amount of input data). If so, the time spent in the functions, of course, can be distorted massively if the functions don’t scale linearly.
  • Nested functions will not show up in the graph. In particular, a decorator implicitly nests your function and will thus hide your function. That said, when using an external decorator, don’t forget to add the decorator’s package via the -x flag (for example, project_graph -x numba myscript.py).
  • If your self-written function is exclusively called from an external package’s function, you must manually add the external package with the -x flag. Otherwise, your function will not show up in the tree, as its parent is an external function and thus not considered.

Feel free to use the little package for your own project, be it for performance analysis, code introductions for new team members, or out of sheer curiosity. As for me, I find it very satisfying to see such a visualization of my projects. If you have trouble using it, don’t hesitate to hit me up on Github.

PS: If you’re looking for a similar package in R, check out Jakob’s post on flowcharts of functions.

Felix Plagge Felix Plagge

Do you want to learn Python? Or are you an R pro and you regularly miss the important functions and commands when working with Python? Or maybe you need a little reminder from time to time while coding? That’s exactly why cheatsheets were invented!

Cheatsheets help you in all these situations. Our first cheatsheet with Python basics is the start of a new blog series, where more cheatsheets will follow in our unique STATWORX style.

So you can be curious about our series of new Python cheatsheets that will cover basics as well as packages and workspaces relevant to Data Science.

Our cheatsheets are freely available for you to download, without registration or any other paywall.

 

Why have we created new cheatsheets?

As an experienced R user you will search endlessly for state-of-the-art Python cheatsheets similiar to those known from R Studio.

Sure, there are a lot of cheatsheets for every topic, but they differ greatly in design and content. As soon as we use several cheatsheets in different designs, we have to reorientate ourselves again and again and thus lose a lot of time in total. For us as data scientists it is important to have uniform cheatsheets where we can quickly find the desired function or command.

We want to counteract this annoying search for information. Therefore, we would like to regularly publish new cheatsheets in a design language on our blog in the future – and let you all participate in this work relief.

What does the first cheatsheet contain?

Our first cheatsheet in this series is aimed primarily at Python novices, R users who use Python less often, or peoples who are just starting to use it. It facilitates the introduction and overview in Python.

It makes it easier to get started and get an overview of Python. Basic syntax, data types, and how to use them are introduced, and basic control structures are introduced. This way, you can quickly access the content you learned in our STATWORX Academy, for example, or recall the basics for your next programming project.

What does the STATWORX Cheatsheet Episode 2 cover?

The next cheatsheet will cover the first step of a data scientist in a new project: Data Wrangling. Also, you can expect a cheatsheet for pandas about data loading, selection, manipulation, aggregation and merging. Happy coding!

Niklas Junker Niklas Junker

Management Summary

Kubernetes is a technology that in many ways greatly simplifies the deployment and maintenance of applications and compute loads, especially the training and hosting of machine learning models. At the same time, it allows us to adapt the required hardware resources, providing a scalable and cost-transparent solution.

This article first discusses the transition from a server to management and orchestration of containers: isolated applications or models that are packaged once with all their requirements and can subsequently be run almost anywhere. Regardless of the server, these can be replicated at will with Kubernetes, allowing effortless and almost seamless continuous accessibility of their services even under intense demand. Likewise, their number can be reduced to a minimum level when the demand temporarily or periodically dwindles in order to use computing resources elsewhere or avoid unnecessary costs.

From the capabilities of this infrastructure emerges a useful architectural paradigm called microservices. Formerly centralized applications are thus broken down into their functionalities, which provide a high degree of reusability. These can be accessed and used by different services and scale individually according to internal needs. An example of this is large and complex language models in Natural Language Processing, which can capture the context of a text regardless of its further use and thus underlie many downstream purposes. Other microservices (models), such as for text classification or summarization, can invoke them and further process the partial results.

After a brief introduction of the general terminology and functionality of Kubernetes, as well as possible use cases, the focus turns to the most common way to use Kubernetes: with cloud providers such as Google GCP, Amazon AWS, or Microsoft Azure. These allow so-called Kubernetes clusters to dynamically consume more or fewer resources, though the costs incurred remain foreseeable on a pay-per-use basis. Other common services such as data storage, versioning, and networking can also be easily integrated by the providers. Finally, the article gives an outlook on tools and further developments, which either make using Kubernetes even more efficient or further abstract and simplify the process towards serverless architectures.

Introduction

Over the last 20 years, vast amounts of new technologies have surfaced in software development and deployment, which have not only multiplied and diversified the choice of services, programming languages, and libraries but have even led to a paradigm shift in many use cases or domains.

Google Trends Paradigmenwechsel
Fig. 1: Google trends chart showing the above-mentioned paradigm shift

If we also look at the way software solutions, models, or work and computing loads have been deployed over the years, we can see how innovations in this area have also led to greater flexibility, scalability, and resource efficiency, among other things.

In the beginning, these were run as local processes directly on a server (shared by several applications), which posed some limitations and problems: on the one hand, one is bound to the configuration of the server and its operating system when selecting the technical tools, and on the other hand, all applications hosted on the server are limited by its memory and processor capacities. Thus, they share not only resources in total but also a possible cross-process error-proneness.

As a first further development, virtual machines can then offer a further level of abstraction: by emulating (“virtualizing”) an independent machine on the server, modularity and thus greater freedom is created for development and deployment. For example, in the choice of operating system or the programming languages and libraries used.  From the point of view of the “real” server, the resources to which the application is entitled can be better limited or guaranteed. However, their requirements are also significantly higher since the virtual machine must also maintain the virtual operating system.

Ultimately, this principle has been significantly streamlined and simplified by the proliferation of containers, especially Docker. Put simply, one builds/configures a separate virtual, isolated server for an application or machine learning model. Thus, each container has its own file system and certain system libraries, but not operating system. This technically turns it into a sandbox whose other configuration, code dependencies or errors do not affect the host server, but at the same time can run as relatively “lightweight” processes directly on it.

_Vergleich virtuelle Maschine und Docker Container Architektur
Fig. 2: Comparison between virtual machine and Docker container system architecture; Source: https://i1.wp.com/www.docker.com/blog/wp-content/uploads/Blog.-Are-containers-..VM-Image-1-1024×435.png?ssl=1

So there is the possibility to copy, install, etc., everything for the desired application and provide this in a packaged container everywhere in a consistent format. This is not only extremely useful for the production environment, but we at STATWORX also like to use it in the development of more complicated projects or the proof-of-concept phase. Intermediate steps or results, such as extracting text from images, can be used as a container like a small web server by those interested in further processing of the text, such as extracting certain key information, or determining its mood or intent.

This subdivision into so-called “microservices“ with the help of containers helps immensely in the reusability of the individual modules, in the planning and development of the architecture of complex systems; at the same time, it frees the individual work steps from technical dependencies on each other and facilitates maintenance and update procedures.

After this brief overview of the powerful and versatile possibilities of deploying software, the following text will deal with how to reliably and scalably deploy these containers (i.e., applications or models) for customers, other applications, internal services or computations with Kubernetes.

Kubernetes – 8 Essential Components

Kubernetes was introduced by Google in 2014 as open-source container management software (also called container orchestration). Internally, the company had already been using tools developed in-house for years to manage workloads and applications, and regarded the development of Kubernetes not only as a convergence of best practices and lessons learned, but also as an opportunity to open up a new business segment in cloud computing.

The name Kubernetes (Greek for helmsman) was supposedly chosen in reference to a symbolic container ship, for whose optimal operation he was responsible.

1.    Nodes

When speaking of a Kubernetes instance, it is referred to as a (Kubernetes) cluster: it consists of several servers, called nodes. One of them, called the master node, is solely responsible for administrative operations, and is the interface that is addressed by the developer. All other nodes, called worker nodes, are initially unoccupied and thus flexible. While nodes are actually physical instances, mostly in data centers, the following terms are digital concepts of Kubernetes.

2.    Pods

If an application is to be deployed on the cluster, in the simplest case the desired container is specified and then (automatically) a so-called pod is created and assigned to a node. The pod simply resembles a running container. If several instances of the same application are to run in parallel, for example to provide better availability, the number of replicas can be specified. In this case, the specified number of pods, each with the same application, is distributed to the nodes. If the demand for the application exceeds the capacities despite replicas, even more pods can be created automatically with the Horizontal Autoscaler. Especially for Deep Learning models with relatively long inference times, metrics such as CPU or GPU utilization can be monitored here, and the number of pods can be increased or decreased automatically to optimize both capacity and cost.

Illustration des Autoscaling und der Belegung der Nodes
Illustration 3: Illustration of the autoscaling and the occupancy of the nodes. The width of the bars corresponds to the resource requirements of the pods or the capacity of the nodes.

To avoid confusion, ultimately every running container, i.e., every workload, is a pod. In the case of deploying an application, this is technically done via a deployment, whereas temporal compute loads are jobs. Persistent stores such as databases are managed with StatefulSets. The following figure provides an overview of the terms:

Deployment-Controller
Fig. 4: What is what in Kubernetes? Deployment specifies what is desired; the deployment controller takes care of creating, maintaining, and scaling the model containers, which then run as individual pods on the nodes. Jobs and StatefulSets work analogously with their own controller.

3.    Jobs

Kubernetes jobs can be used to execute both one-time and recurring jobs (so-called CronJobs) in the form of a container deployment on the cluster.

In the simplest case, these can be seen as a script, which can be used for maintenance or data preparation work of databases, for example. Furthermore, they are also used for batch processing, for example when deep learning models are to be applied to larger data volumes and it is not worthwhile to keep the model continuously on the cluster. In this case, the model container is started up, gets access to the desired dataset, performs its inference on it, saves the results and shuts down. There is also flexibility here for the origin and subsequent storage of the data, so own or cloud databases, bucket/object storage or even local data and logging frameworks can be connected.

For recurring CronJobs, a simple time scheme can be specified so that, for example, certain customer data, transactions or the like are processed at night. Natural Language Processing can be used to automatically create press reviews at night, for example, which can then be evaluated the following morning: News about a company, its industry, business locations, customers, etc. can be aggregated or sourced, evaluated with NLP, summarized, and presented with sentiment or sorted by topic/content.

Even labor-intensive ETL (Extract Transform Load) processes can be performed or prepared outside business hours.

4.    Rolling Updates

If a deployment needs to be brought up to the latest version or a rollback to an older version needs to be completed, rolling updates can be triggered in Kubernetes. These guarantee continuous accessibility of the applications and models within a Continuous Integration/Continuous Deployment pipeline.

Such a rollout can be initiated and monitored smoothly in one or a few steps. By means of a rollout history it is also possible not only to jump back to a previous container version, but also to restore the previous deployment parameters, i.e. minimum and maximum number of nodes, which resource group (GPU nodes, CPU nodes with little/much RAM,…), health checks, etc.

If a rolling update is triggered, the respective existing pods are kept running and accessible until the same number of new pods are up and accessible. Here there are methods to guarantee that no requests are lost, as well as parameters that regulate a minimum accessibility or a maximum surplus of pods for the change.

Illustration eines Rolling Updates
Fig. 5: Illustration of a Rolling Update.

Figure 5 illustrates the rolling update.

1) The current version of an application is located on the Kubernetes cluster with 2 replicas and can be accessed as usual.

2) A rolling update to version V2 is started, the same number of pods as for V1 are created.

3) As soon as the new pods have the state “Running” and, if applicable, health checks have been completed, thus being functional, the containers of the older version are shut down.

4) The older pods are removed and the resources are released again.

The DevOps and time involved here is marginal, internally no hostnames or the like change, while from the consumer’s point of view the service can be accessed as before in the usual way (same IP, URL, …) and has merely been updated to the latest version.

5.    Platform/Infrastructure as a Service

Of course, a Kubernetes cluster can also be deployed locally on your on-premises hardware as well as on partially pre-built solutions, such as DGX Workbenches.

Some of our customers have strict policies or requirements regarding (data) compliance or information security, and do not want potentially sensitive data to leave the company. Furthermore, it can be avoided that data traffic flows through non-European nodes or generally ends up in foreign data centers.

Experience shows, however, that this is only the case in a very small proportion of projects. Through encryption, rights management and SLAs of the operators, we consider the use of cloud services and data centers to be generally secure and also use them for larger projects. In this regard, deployment, maintenance, CI/CD pipelines are also largely identical and easy to use thanks to methods of containerization (Docker) and abstraction (Kubernetes).

All major cloud operators like Google (GCP), Amazon (AWS) and Microsoft (Azure), but also smaller providers and soon even exciting new German projects, offer services very similar to Kubernetes. This makes it even easier to deploy and, most importantly, scale a project or model, as auto-scaling allows the cluster to expand or shrink depending on resource needs. From a technical perspective, this largely frees us from having to estimate the demand of a service while keeping the profitability and cost structure the same. Furthermore, the services can also be hosted and operated in different (geographical) zones to guarantee fastest reachability and redundancy.

6.    Node-Variety

The cloud operators offer a large number of different node types to satisfy all resource requirements for all use cases from the simpler web service to high performance computing. Especially in the application field of Deep Learning, the ever growing models can thus always be trained and served on the required latest hardware.

For example, while we use nodes with an average CPU and low memory for smaller NLP purposes, large Transformer models can be deployed on GPU nodes in the same cluster, which effectively enables their use in the first place and at the same time can speed up inference (application of the model) by a factor of over 20. As of late, the importance of dedicated hardware for neural networks has been steadily increasing, Google also provides access to the custom TPUs optimized for Tensorflow.

The organization and grouping of all these different nodes is done in Kubernetes in so-called node pools. These can be selected or specified in the deployment so that the right resources are allocated to the pods of the models.

7.    Cluster Autoscaling

The extent to which models or services are used, internally or by customers, is often unpredictable or fluctuates greatly over time. With a cluster autoscaler, new nodes can be created automatically, or unneeded “empty” nodes can be removed. Here, too, a minimum number of nodes can be specified, which should always be available, as well as, if desired, a maximum number, which cannot be exceeded, to cap the costs, if necessary.

8.    Interfacing with Other Services

In principle, cloud services from different providers can be combined, but it is more convenient and easier to use one provider (e.g. Google GCP). This means that services such as data buckets, container registry, Lambda functions can be integrated and used internally in the cloud without major authentication processes. Furthermore, especially in a microservice architecture, network communication among the individual hosts (applications, models) is important and facilitated within a provider. Access control/RBAC can also be implemented here, and several clusters or projects can be bridged with a virtual network to better separate the areas of responsibility and competence.

Environment and Future Developments

The growing use and spread of Kubernetes has brought with it a whole environment of useful tools, as well as advancements and further abstractions that further facilitate its use.

Tools and Pipelines based on Kubernetes

For example, Kubeflow can be used to trigger the training of machine learning models as a TensorFlow training job and deploy completed models with TensorFlow Serving.

The whole process can also be packaged into a pipeline that then performs training of different models with reference to training, validation and test data in memory buckets, and also monitors or logs their metrics and compares model performance. The workflow also includes the preparation of input data, so that after the initial pipeline setup, experiments can be easily performed to explore model architectures and hyperparameter tuning.

Serverless

Serverless deployment methods such as Cloud Run or Amazon Fargate take another abstraction step away from the technical requirements. With this, containers can be deployed within seconds and scale like pods on a Kubernetes cluster without even having to create or maintain it. So the same infrastructure has once again been simplified in its use. According to the pay-per-use principle, only the time in which the code is actually called and executed is charged.

Conclusion

Kubernetes has become a central pillar in machine learning deployment today. The path from data and model exploration to the prototype and finally to production has been enormously streamlined and simplified by libraries such as PyTorch, TensorFlow and Keras. At the same time, these frameworks can also be applied in enormous detail, if required, to develop customized components or to integrate and adapt existing models using transfer learning. Container technologies such as Docker subsequently allow the result to be bundled with all its requirements and dependencies and executed almost anywhere without drawbacks in speed. In the final step, their deployment, maintenance, and scaling has also become immensely simplified and powerful with Kubernetes.

All of this allows us to develop our own products as well as solutions for customers in a structured way:

  • The components and the framework infrastructure have a high degree of reusability
  • A first milestone or proof-of-concept can be achieved in relatively little time and cost expenditure
  • Further development work expands on this process in a natural way by increasing complexity
  • Ready deployments scale without additional effort, with costs proportional to demand
  • This results in a reliable platform with a predictable cost structure

If you would like to read further about some key components following this article, we have some more interesting articles about:

Sources

Jonas Braun Jonas Braun

Management Summary

Deploying and monitoring machine learning projects is a complex undertaking. In addition to the consistent documentation of model parameters and the associated evaluation metrics, the main challenge is to transfer the desired model into a productive environment. If several people are involved in the development, additional synchronization problems arise concerning the models’ development environments and version statuses. For this reason, tools for the efficient management of model results through to extensive training and inference pipelines are required. In this article, we present the typical challenges along the machine learning workflow and describe a possible solution platform with MLflow. In addition, we present three different scenarios that can be used to professionalize machine learning workflows:

  1. Entry-level Variant: Model parameters and performance metrics are logged via a R/Python API and clearly presented in a GUI. In addition, the trained models are stored as artifacts and can be made available via APIs.
  2. Advanced Model Management: In addition to tracking parameters and metrics, certain models are logged and versioned. This enables consistent monitoring and simplifies the deployment of selected model versions.
  3. Collaborative Workflow Management: Encapsulating Machine Learning projects as packages or Git repositories and the accompanying local reproducibility of development environments enable smooth development of Machine Learning projects with multiple stakeholders.

Depending on the maturity of your machine learning project, these three scenarios can serve as inspiration for a potential machine learning workflow. We have elaborated each scenario in detail for better understanding and provide recommendations regarding the APIs and deployment environments to use.

Challenges Along the Machine Learning Workflow

Training machine learning models is becoming easier and easier. Meanwhile, a variety of open-source tools enable efficient data preparation as well as increasingly simple model training and deployment.

The added value for companies comes primarily from the systematic interaction of model training, in the form of model identification, hyperparameter tuning and fitting on the training data, and deployment, i.e., making the model available for inference tasks. This interaction is often not established as a continuous process, especially in the early phases of machine learning initiative development. However, a model can only generate added value in the long term if a stable production process is implemented from model training, through its validation, to testing and deployment. If this process is implemented correctly, complex dependencies and costly maintenance work in the long term can arise during the operational start-up of the model [2]. The following risks are particularly noteworthy in this regard.

1. Ensuring Synchronicity

Often, in an exploratory context, data preparation and modeling workflows are developed locally. Different configurations of development environments or even the use of different technologies make it difficult to reproduce results, especially between developers or teams. In addition, there are potential dangers concerning the compatibility of the workflow if several scripts must be executed in a logical sequence. Without an appropriate version control logic, the synchronization effort afterward can only be guaranteed with great effort.

2. Documentation Effort

To evaluate the performance of the model, model metrics are often calculated following training. These depend on various factors, such as the parameterization of the model or the influencing factors used. This meta-information about the model is often not stored centrally. However, for systematic further development and improvement of a model, it is mandatory to have an overview of the parameterization and performance of all past training runs.

3. Heterogeneity of Model Formats

In addition to managing model parameters and results, there is the challenge of subsequently transferring the model to the production environment. If different models from multiple packages are used for training, deployment can quickly become cumbersome and error-prone due to different packages and versions.

4. Recovery of Prior Results

In a typical machine learning project, the situation often arises that a model is developed over a long period of time. For example, new features may be used, or entirely new architectures may be evaluated. These experiments do not necessarily lead to better results. If experiments are not versioned cleanly, there is a risk that old results can no longer be reproduced.

Various tools have been developed in recent years to solve these and other challenges in the handling and management of machine learning workflows, such as TensorFlow TFX, cortex, Marvin, or MLFlow. The latter, in particular, is currently one of the most widely used solutions.

MLflow is an open-source project with the goal to combine the best of existing ML platforms to make the integration to existing ML libraries, algorithms, and deployment tools as straightforward as possible [3]. In the following, we will introduce the main MLflow modules and discuss how machine learning workflows can be mapped via MLflow.

MLflow Services

MLflow consists of four components: MLflow Tracking, MLflow Models, MLflow Projects, and MLflow Registry. Depending on the requirements of the experimental and deployment scenario, all services can be used together, or individual components can be isolated.

With MLflow Tracking, all hyperparameters, metrics (model performance), and artifacts, such as charts, can be logged. MLflow Tracking provides the ability to collect presets, parameters, and results for collective monitoring for each training or scoring run of a model. The logged results can be visualized in a GUI or alternatively accessed via a REST API.

The MLflow Models module acts as an interface between technologies and enables simplified deployment. Depending on its type, a model is stored as a binary, e.g., a pure Python function, or as a Keras or H2O model. One speaks here of the so-called model flavors. Furthermore, MLflow Models provides support for model deployment on various machine learning cloud services, e.g., for AzureML and Amazon Sagemaker.

MLflow Projects are used to encapsulate individual ML projects in a package or Git repository. The basic configurations of the respective environment are defined via a YAML file. This can be used, for example, to control how exactly the conda environment is parameterized, which is created when MLflow is executed. MLflow Projects allows experiments that have been developed locally to be executed on other computers in the same environment. This is an advantage, for example, when developing in smaller teams.

MLflow Registry provides a centralized model management. Selected MLflow models can be registered and versioned in it. A staging workflow enables a controlled transfer of models into the productive environment. The entire process can be controlled via a GUI or a REST API.

Examples of Machine Learning Pipelines Using MLflow

In the following, three different ML workflow scenarios are presented using the above MLflow modules. These increase in complexity from scenario to scenario. In all scenarios, a dataset is loaded into a development environment using a Python script, processed, and a machine learning model is trained. The last step in all scenarios is a deployment of the ML model in an exemplary production environment.

1. Scenario – Entry-Level Variant

Szenario 1 – Simple Metrics TrackingScenario 1 – Simple Metrics Tracking

Scenario 1 uses the MLflow Tracking and MLflow Models modules. Using the Python API, the model parameters and metrics of the individual runs can be stored on the MLflow Tracking Server Backend Store, and the corresponding MLflow Model File can be stored as an artifact on the MLflow Tracking Server Artifact Store. Each run is assigned to an experiment. For example, an experiment could be called ‘fraud_classification’, and a run would be a specific ML model with a certain hyperparameter configuration and the corresponding metrics. Each run is stored with a unique RunID.

Artikel MLFlow Tool Bild 01

In the screenshot above, the MLflow Tracking UI is shown as an example after executing a model training. The server is hosted locally in this example. Of course, it is also possible to host the server remotely. For example in a Docker container within a virtual machine. In addition to the parameters and model metrics, the time of the model training, as well as the user and the name of the underlying script, are also logged. Clicking on a specific run also displays additional information, such as the RunID and the model training duration.

Artikel MLFlow Tool Bild 02

If you have logged other artifacts in addition to the metrics, such as the model, the MLflow Model Artifact is also displayed in the Run view. In the example, a model from the sklearn.svm package was used. The MLmodel file contains metadata with information about how the model should be loaded. In addition to this, a conda.yaml is created that contains all the package dependencies of the environment at training time. The model itself is located as a serialized version under model.pkl and contains the model parameters optimized on the training data.

Artikel MLFlow Tool Bild 03

The deployment of the trained model can now be done in several ways. For example, suppose one wants to deploy the model with the best accuracy metric. In that case, the MLflow tracking server can be accessed via the Python API mlflow.list_run_infos to identify the RunID of the desired model. Now, the path to the desired artifact can be assembled, and the model loaded via, for example, the Python package pickle. This workflow can now be triggered via a Dockerfile, allowing flexible deployment to the infrastructure of your choice. MLflow offers additional separate APIs for deployment on Microsoft Azure and AWS. For example, if the model is to be deployed on AzureML, an Azure ML container image can be created using the Python API mlflow.azureml.build_image, which can be deployed as a web service to Azure Container Instances or Azure Kubernetes Service. In addition to the MLflow Tracking Server, it is also possible to use other storage systems for the artifact, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, SFTP Server, NFS, and HDFS.

2. Scenario – Advanced Model Management

Szenario 2 – Advanced Model ManagementScenario 2 – Advanced Model Management

Scenario 2 includes, in addition to the modules used in scenario 1, MLflow Model Registry as a model management component. Here, it is possible to register and process the models logged there from specific runs. These steps can be controlled via the API or GUI. A basic requirement to use the Model Registry is deploying the MLflow Tracking Server Backend Store as Database Backend Store. To register a model via the GUI, select a specific run and scroll to the artifact overview.

Artikel MLFlow Tool Bild 04

Clicking on Register Model opens a new window in which a model can be registered. If you want to register a new version of an already existing model, select the desired model from the dropdown field. Otherwise, a new model can be created at any time. After clicking the Register button, the previously registered model appears in the Models tab with corresponding versioning.

Artikel MLFlow Tool Bild 05

Each model includes an overview page that shows all past versions. This is useful, for example, to track which models were in production when.

Artikel MLFlow Tool Bild 06

If you now select a model version, you will get to an overview where, for example, a model description can be added. The Source Run link also takes you to the run from which the model was registered. Here you will also find the associated artifact, which can be used later for deployment.

Artikel MLFlow Tool Bild 07

In addition, individual model versions can be categorized into defined phases in the Stage area. This feature can be used, for example, to determine which model is currently being used in production or is to be transferred there. For deployment, in contrast to scenario 1, versioning and staging status can be used to identify and deploy the appropriate model. For this, the Python API MlflowClient().search_model_versions can be used, for example, to filter the desired model and its associated RunID. Similar to scenario 1, deployment can then be completed to, for example, AWS Sagemaker or AzureML via the respective Python APIs.

3. Scenario – Collaborative Workflow Management

Szenario 3 – Full Workflow ManagementScenario 3 – Full Workflow Management

In addition to the modules used in scenario 2, scenario 3 also includes the MLflow Projects module. As already explained, MLflow Projects are particularly well suited for collaborative work. Any Git repository or local environment can act as a project and be controlled by an MLproject file. Here, package dependencies can be recorded in a conda.yaml, and the MLproject file can be accessed when starting the project. Then the corresponding conda environment is created with all dependencies before training and logging the model. This avoids the need for manual alignment of the development environments of all developers involved and also guarantees standardized and comparable results of all runs. Especially the latter is necessary for the deployment context since it cannot be guaranteed that different package versions produce the same model artifacts. Instead of a conda environment, a Docker environment can also be defined using a Dockerfile. This offers the advantage that package dependencies independent of Python can also be defined. Likewise, MLflow Projects allow the use of different commit hashes or branch names to use other project states, provided a Git repository is used.

An interesting use case is the modularized development of machine learning training pipelines [4]. For example, data preparation can be decoupled from model training and developed in parallel, while another team uses a different branch name to train the model. In this case, only a different branch name must be used as a parameter when starting the project in the MLflow Projects file. The final data preparation can then be pushed to the same branch name used for model training and would thus already be fully implemented in the training pipeline. The deployment can also be controlled as a sub-module within the project pipeline through a Python script via the ML Project File and can be carried out analogous to scenario 1 or 2 on a platform of your choice.

Conclusion and Outlook

MLflow offers a flexible way to make the machine learning workflow robust against the typical challenges in the daily life of a data scientist, such as synchronization problems due to different development environments or missing model management. Depending on the maturity level of the existing machine learning workflow, various services from the MLflow portfolio can be used to achieve a higher level of professionalization.

In the article, three machine learning workflows, ascending in complexity, were presented as examples. From simple logging of results in an interactive UI to more complex, modular modeling pipelines, MLflow services can support it. Logically, there are also synergies outside the MLflow ecosystem with other tools, such as Docker/Kubernetes for model scaling or even Jenkins for CI/CD pipeline control. If there is further interest in MLOps challenges and best practices, I refer you to the webinar on MLOps by our CEO Sebastian Heinz, which we provide free of charge.

Resources

John Vicente John Vicente

Car Model Classification III: Explainability of Deep Learning Models with Grad-CAM

In the first article of this series on car model classification, we built a model using transfer learning to classify the car model through an image of a car. In the second article, we showed how TensorFlow Serving can be used to deploy a TensorFlow model using the car model classifier as an example. We dedicate this third post to another essential aspect of deep learning and machine learning in general: the explainability of model predictions.

We will start with a short general introduction on the topic of explainability in machine learning. Next, we will briefly talk about popular methods that can be used to explain and interpret predictions from CNNs. We will then explain Grad-CAM, a gradient-based method, in-depth by going through an implementation step by step. Finally, we will show you the results we obtained by our Grad-CAM implementation for the car model classifier.

A Brief Introduction to Explainability in Machine Learning

For the last couple of years, explainability was always a recurring but still niche topic in machine learning. But, over the past four years, interest in this topic has started to accelerate. At least one particular reason fuelled this development: the increased number of machine learning models in production. On the one hand, this leads to a growing number of end-users who need to understand how models are making decisions. On the other hand, an increasing number of machine learning developers need to understand why (or why not) a model is functioning in a particular way.

This increasing demand in explainability led to some, both methodological and technical, noteworthy innovations in the last years:

Methods for Explaining CNN Outputs for Images

Deep neural networks and especially complex architectures like CNNs were long considered as pure black box models. As written above, this changed in recent years, and now there are various methods available to explain CNN outputs. For example, the excellent tf-explain library implements a wide range of useful methods for TensorFlow 2.x. We will now briefly talk about the ideas of different approaches before turning to Grad-CAM:

Activations Visualization

This is the most straightforward visualization technique. It simply shows the output of a specific layer within the network during the forward pass. It can be helpful to get a feel for the extracted features since, during training, most of the activations tend towards zero (when using the ReLu-activation). An example for the output of the first convolutional layer of the car model classifier is shown below:

Vanilla Gradients

One can use the vanilla gradients of the predicted classes’ output for the input image to derive input pixel importances.

We can see that the highlighted region is mainly focused on the car. Compared to other methods discussed below, the discriminative region is much less confined.

Occlusion Sensitivity

This approach computes the importance of certain parts of the input image by reevaluating the model’s prediction with different parts of the input image hidden. Parts of the image are hidden iteratively by replacing them by grey pixels. The weaker the prediction gets with a part of the image hidden, the more important this part is for the final prediction. Based on the discriminative power of the regions of the image, a heatmap can be constructed and plotted. Applying occlusion sensitivity for our car model classifier did not yield any meaningful results. Thus, we show tf-explain‘s sample image, showing the result of applying the occlusion sensitivity procedure for a cat image.

CNN Fixations

Another exciting approach called CNN Fixations was introduced in this paper. The idea is to backtrack which neurons were significant in each layer, given the activations from the forward pass and the network weights. The neurons with large influence are referred to as fixations. This approach thus allows finding the essential regions for obtaining the result without the need for any recomputation (e.g., in the case of occlusion sensitivity above, where multiple predictions must be made).

The procedure can be described as follows: The node corresponding to the class is chosen as the fixation in the output layer. Then, the fixations for the previous layer are computed by computing which of the nodes have the most impact on the next higher level’s fixations determined in the last step. The node importance is computed by multiplying activations and weights. If you are interested in the details of the procedure, check out the paper or the corresponding github repo. This backtracking is done until the input image is reached, yielding a set of pixels with considerable discriminative power. An example from the paper is shown below.

CAM

Introduced in this paper, class activation mapping (CAM) is a procedure to find the discriminative region(s) for a CNN prediction by computing class activation maps. A significant drawback of this procedure is that it requires the network to use global average pooling (GAP) as the last step before the prediction layer. It thus is not possible to apply this approach to general CNNs. An example is shown in the figure below (taken from the CAM paper):

The class activation map assigns importance to every position (x, y) in the last convolutional layer by computing the linear combination of the activations, weighted by the corresponding output weights for the observed class (Australian terrier in the example above). The resulting class activation mapping is then upsampled to the size of the input image. This is depicted by the heat map above. Due to the architecture of CNNs, the activation, e.g., in the top left for any layer, is directly related to the top left of the input image. This is why we can conclude which input regions are important by only looking at the last CNN layer.

The Grad-CAM procedure we will discuss in detail below is a generalization of CAM. Grad-CAM can be applied to networks with general CNN architectures, containing multiple fully connected layers at the output.

Grad-CAM

Grad-CAM extends the applicability of the CAM procedure by incorporating gradient information. Specifically, the gradient of the loss w.r.t. the last convolutional layer determines the weight for each of its feature maps. As in the CAM procedure above, the further steps are to compute the weighted sum of the activations and then upsampling the result to the image size to plot the original image with the obtained heatmap. We will now show and discuss the code that can be used to run Grad-CAM. The complete code is available here on GitHub.

import pickle
import tensorflow as tf
import cv2
from car_classifier.modeling import TransferModel

INPUT_SHAPE = (224, 224, 3)

# Load list of targets
file = open('.../classes.pickle', 'rb')
classes = pickle.load(file)

# Load model
model = TransferModel('ResNet', INPUT_SHAPE, classes=classes)
model.load('...')

# Gradient model, takes the original input and outputs tuple with:
# - output of conv layer (in this case: conv5_block3_3_conv)
# - output of head layer (original output)
grad_model = tf.keras.models.Model([model.model.inputs],
                                   [model.model.get_layer('conv5_block3_3_conv').output,
                                    model.model.output])

# Run model and record outputs, loss, and gradients
with tf.GradientTape() as tape:
    conv_outputs, predictions = grad_model(img)
    loss = predictions[:, label_idx]

# Output of conv layer
output = conv_outputs[0]

# Gradients of loss w.r.t. conv layer
grads = tape.gradient(loss, conv_outputs)[0]

# Guided Backprop (elimination of negative values)
gate_f = tf.cast(output > 0, 'float32')
gate_r = tf.cast(grads > 0, 'float32')
guided_grads = gate_f * gate_r * grads

# Average weight of filters
weights = tf.reduce_mean(guided_grads, axis=(0, 1))

# Class activation map (cam)
# Multiply output values of conv filters (feature maps) with gradient weights
cam = np.zeros(output.shape[0: 2], dtype=np.float32)
for i, w in enumerate(weights):
    cam += w * output[:, :, i]

# Or more elegant: 
# cam = tf.reduce_sum(output * weights, axis=2)

# Rescale to org image size and min-max scale
cam = cv2.resize(cam.numpy(), (224, 224))
cam = np.maximum(cam, 0)
heatmap = (cam - cam.min()) / (cam.max() - cam.min())
  • The first step is to load an instance of the model.
  • Then, we create a new keras.Model instance that has two outputs: The activations of the last CNN layer ('conv5_block3_3_conv') and the original model output.
  • Next, we run a forward pass for our new grad_model using as input an image ( img) of shape (1, 224, 224, 3), preprocessed with the resnetv2.preprocess_input method. tf.GradientTape is set up and applied to record the gradients (the gradients are stored in the tapeobject). Further, the outputs of the convolutional layer (conv_outputs) and the head layer (predictions) are stored as well. Finally, we can use label_idxto get the loss corresponding to the label we want to find the discriminative regions for.
  • Using the gradient-method, one can extract the desired gradients from tape. In this case, we need the gradient of the loss w.r.t. the output of the convolutional layer.
  • In a further step, a guided backprop is applied. Only values for the gradients are kept where both the activations and the gradients are positive. This essentially means restricting attention to the activations which positively contribute to the wanted output prediction.
  • The weights are computed by averaging the obtained guided gradients for each filter.
  • The class activation map cam is then computed as the weighted average of the feature map activations (output). The method containing the for loop above helps understanding what the function does in detail. A less straightforward but more efficient way to implement the CAM-computation is to use tf.reduce_mean and is shown in the commented line below the loop implementation.
  • Finally, the resampling (resizing) is done using OpenCV2’s resize method, and the heatmap is rescaled to contain values in [0, 1] for plotting.

A version of Grad-CAM is also implemented in tf-explain.

Examples

We now use the Grad-CAM implementation to interpret and explain the predictions of the TransferModelfor car model classification. We start by looking at car images taken from the front.

Grad-CAM for car images from the front

The red regions highlight the most important discriminative regions, the blue regions the least important. We can see that for images from the front, the CNN focuses on the car’s grille and the area containing the logo. If the car is slightly tilted, the focus is shifted more to the edge of the vehicle. This is also the case for slightly tilted images from the car’s backs, as shown in the middle image below.

Grad-CAM for car images from the back

For car images from the back, the most crucial discriminative region is near the number plate. As mentioned above, for cars looked at from an angle, the closest corner has the highest discriminative power. A very interesting example is the Mercedes-Benz C-class on the right side, where the model not only focuses on the tail lights but also puts the highest discriminative power on the model lettering.

Grad-CAM for car images from the side

When looking at images from the side, we notice the discriminative region is restricted to the bottom half of the cars. Again, the angle the car image was taken from determines the shift of the region towards the front or back corner.

In general, the most important fact is that the discriminative areas are always confined to parts of the cars. There are no images where the background has high discriminative power. Looking at the heatmaps and the associated discriminative regions can be used as a sanity check for CNN models.

Conclusion

We discussed multiple approaches to explaining CNN classifier outputs. We introduced Grad-CAM in detail by examining the code and looking at examples for the car model classifier. Most notably, the discriminative regions highlighted by the Grad-CAM procedure are always focussed on the car and never on the backgrounds of the images. The result shows that the model works as we expect and uses specific parts of the car to discriminate between different models.

In the fourth and last part of this blog series, we will show how the car classifier can be built into a web application using Dash. See you soon!

Stephan Müller Stephan Müller

In the first post of the series, we discussed transfer learning and built a model for car model classification. In this blog post, we will discuss the problem of model deployment, using the TransferModel introduced in the first post as an example.

A model is of no use in actual practice if there is no simple way to interact with it. In other words: We need an API for our models. TensorFlow Serving has been developed to provide these functionalities for TensorFlow models. This blog post will show how a TensorFlow Serving server can be launched in a Docker container and how we can interact with the server using HTTP requests.

If you are new to Docker, we recommend working through Docker’s tutorial before reading this article. If you want to look at an example of deployment in Docker, we recommend reading this blog post by our colleague Oliver Guggenbühl, in which he describes how an R-script can be run in Docker. We start by giving an overview of TensorFlow Serving.

Introduction to TensorFlow Serving

Let’s start by giving you an overview of TensorFlow Serving.

TensorFlow Serving is TensorFlow’s serving system, designed to enable the deployment of various models using a uniform API. Using the abstraction of Servables, which are basically objects clients use to perform computations, it is possible to serve multiple versions of deployed models. That enables, for example, that a new version of a model can be uploaded while the previous version is still available to clients. Looking at the bigger picture, so-called Managers are responsible for handling the life-cycle of Servables, which means loading, serving, and unloading them.

In this post, we will show how a single model version can be deployed. The code examples below show how a server can be started in a Docker container and how the Predict API can be used to interact with it. To read more about TensorFlow Serving, we refer to the TensorFlow website.

Implementation

We will now discuss the following three steps required to deploy the model and to send requests.

  • Save a model in correct format and folder structure using TensorFlow SavedModel
  • Run a Serving server inside a Docker container
  • Interact with the model using REST requests

Saving TensorFlow Models

If you didn’t read this series’ first post, here’s a brief summary of the most important points needed to understand the code below:

The TransferModel.model is a tf.keras.Model instance, so it can be saved using Model‘s built-in save method. Further, as the model was trained on web-scraped data, the class labels can change when re-scraping the data. We thus store the index-class mapping when storing the model in classes.pickle. TensorFlow Serving requires the model to be stored in the SavedModel format. When using tf.keras.Model.save, the path must be a folder name, else the model will be stored in another format (e.g., HDF5) which is not compatible with TensorFlow Serving. Below, folderpath contains the path of the folder we want to store all model relevant information in. The SavedModel is stored in folderpath/model and the class mapping is stored as folderpath/classes.pickle.

def save(self, folderpath: str):
    """
    Save the model using tf.keras.model.save

    Args:
        folderpath: (Full) Path to folder where model should be stored
    """

    # Make sure folderpath ends on slash, else fix
    if not folderpath.endswith("/"):
        folderpath += "/"

    if self.model is not None:
        os.mkdir(folderpath)
        model_path = folderpath + "model"
        # Save model to model dir
        self.model.save(filepath=model_path)
        # Save associated class mapping
        class_df = pd.DataFrame({'classes': self.classes})
        class_df.to_pickle(folderpath + "classes.pickle")
    else:
        raise AttributeError('Model does not exist')

Start TensorFlow Serving in Docker Container

Having saved the model to the disk, you now need to start the TensorFlow Serving server. Fortunately, there is an easy-to-use Docker container available. The first step is therefore pulling the TensorFlow Serving image from DockerHub. That can be done in the terminal using the command docker pull tensorflow/serving.

Then we can use the code below to start a TensorFlow Serving container. It runs the shell command for starting a container. The options set in the docker_run_cmd are the following:

  • The serving image exposes port 8501 for the REST API, which we will use later to send requests. Thus we map the host port 8501 to the container’s 8501 port using -p.
  • Next, we mount our model to the container using -v. It is essential that the model is stored in a versioned folder (here MODEL_VERSION=1); else, the serving image will not find the model. model_path_guest thus must be of the form <path>/<model name>/MODEL_VERSION, where MODEL_VERSION is an integer.
  • Using -e, we can set the environment variable MODEL_NAME to our model’s name.
  • The --name tf_serving option is only needed to assign a specific name to our new docker container.

If we try to run this file twice in a row, the docker command will not be executed the second time, as a container with the name tf_serving already exists. To avoid this problem, we use docker_run_cmd_cond. Here, we first check if a container with this specific name already exists and is running. If it does, we leave it; if not, we check if an exited version of the container exists. If it does, it is deleted, and a new container is started; if not, a new one is created directly.

import os

MODEL_FOLDER = 'models'
MODEL_SAVED_NAME = 'resnet_unfreeze_all_filtered.tf'
MODEL_NAME = 'resnet_unfreeze_all_filtered'
MODEL_VERSION = '1'

# Define paths on host and guest system
model_path_host = os.path.join(os.getcwd(), MODEL_FOLDER, MODEL_SAVED_NAME, 'model')
model_path_guest = os.path.join('/models', MODEL_NAME, MODEL_VERSION)

# Container start command
docker_run_cmd = f'docker run ' \
                 f'-p 8501:8501 ' \
                 f'-v {model_path_host}:{model_path_guest} ' \
                 f'-e MODEL_NAME={MODEL_NAME} ' \
                 f'-d ' \
                 f'--name tf_serving ' \
                 f'tensorflow/serving'

# If container is not running, create a new instance and run it
docker_run_cmd_cond = f'if [ ! "$(docker ps -q -f name=tf_serving)" ]; then \n' \
                      f'   if [ "$(docker ps -aq -f status=exited -f name=tf_serving)" ]; then 														\n' \
                      f'   		docker rm tf_serving \n' \
                      f'   fi \n' \
                      f'   {docker_run_cmd} \n' \
                      f'fi'

# Start container
os.system(docker_run_cmd_cond)

Instead of mounting the model from our local disk using the -v flag in the docker command, we could also copy the model into the docker image, so the model could be served simply by running a container and specifying the port assignments. It is important to note that, in this case, the model needs to be saved using the folder structure folderpath/<model name>/1, as explained above. If this is not the case, TensorFlow Serving will not find the model. We will not go into further detail here. If you are interested in deploying your models in this way, we refer to this guide on the TensorFlow website.

REST Request

Since the model is now served and ready to use, we need a way to interact with it. TensorFlow Serving provides two options to send requests to the server: gRCP and REST API, both exposed at different ports. In the following code example, we will use REST to query the model.

First, we load an image from the disk for which we want a prediction. This can be done using TensorFlow’s image module. Next, we convert the image to a numpy array using the img_to_array-method. The next and final step is crucial: since we preprocessed the training image before we trained our model (e.g., normalization), we need to apply the same transformation to the image we want to predict. The handypreprocess_input function makes sure that all necessary transformations are applied to our image.

from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input

# Load image
img = image.load_img(path, target_size=(224, 224))
img = image.img_to_array(img)

# Preprocess and reshape data
import json
import requests

# Send image as list to TF serving via json dump
request_url = 'http://localhost:8501/v1/models/resnet_unfreeze_all_filtered:predict'
request_body = json.dumps({"signature_name": "serving_default", "instances": img.tolist()})
request_headers = {"content-type": "application/json"}
json_response = requests.post(request_url, data=request_body, headers=request_headers)
response_body = json.loads(json_response.text)
predictions = response_body['predictions']

# Get label from prediction
y_hat_idx = np.argmax(predictions)
y_hat = classes[y_hat_idx]
img = preprocess_input(img) img = img.reshape(-1, *img.shape)

TensorFlow Serving’s RESTful API offers several endpoints. In general, the API accepts post requests following this structure:

POST http://host:port/<URI>:<VERB>

URI: /v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
VERB: classify|regress|predict

For our model, we can use the following URL for predictions: http://localhost:8501/v1/models/resnet_unfreeze_all_filtered:predict

The port number (here 8501) is the host’s port we specified above to map to the serving image’s port 8501. As mentioned above, 8501 is the serving container’s port exposed for the REST API. The model version is optional and will default to the latest version if omitted.

In python, the requests library can be used to send HTTP requests. As stated in the documentation, the request body for the predict API must be a JSON object with the below-listed key-value-pairs:

  • signature_name – serving signature to use (for more information, see the documentation)
  • instances – model input in row format

The response body will also be a JSON object with a single key called predictions. Since we get for each row in the instances the probability for all 300 classes, we use np.argmax to return the most likely class. Alternatively, we could have used the higher-level classify API.

Conclusion

In this second blog article of the Car Model Classification series, we learned how to deploy a TensorFlow model for image recognition using TensorFlow Serving as a RestAPI, and how to run model queries with it.

To do so, we first saved the model using the SavedModel format. Next, we started the TensorFlow Serving server in a Docker container. Finally, we showed how to request predictions from the model using the API endpoints and a correct specified request body.

A major criticism of deep learning models of any kind is the lack of explainability of the predictions. In the third blog post, we will show how to explain model predictions using a method called Grad-CAM.

Stephan Müller Stephan Müller

At STATWORX, we are very passionate about the field of deep learning. In this blog series, we want to illustrate how an end-to-end deep learning project can be implemented. We use TensorFlow 2.x library for the implementation. The topics of the series include:

  • Transfer learning for computer vision.
  • Model deployment via TensorFlow Serving.
  • Interpretability of deep learning models via Grad-CAM.
  • Integrating the model into a Dash dashboard.

In the first part, we will show how you can use transfer learning to tackle car image classification. We start by giving a brief overview of transfer learning and the ResNet and then go into the implementation details. The code presented can be found in this github repository.

Introduction: Transfer Learning & ResNet

What is Transfer Learning?

In traditional (machine) learning, we develop a model and train it on new data for every new task at hand. Transfer learning differs from this approach in that knowledge is transferred from one task to another. It is a useful approach when one is faced with the problem of too little available training data. Models that are pretrained for a similar problem can be used as a starting point for training new models. The pretrained models are referred to as base models.

In our example, a deep learning model trained on the ImageNet dataset can be used as the starting point for building a car model classifier. The main idea behind transfer learning for deep learning models is that the first layers of a network are used to extract important high-level features, which remain similar for the kind of data treated. The final layers (also known as the head) of the original network are replaced by a custom head suitable for the problem at hand. The weights in the head are initialized randomly, and the resulting network can be trained for the specific task.

There are various ways in which the base model can be treated during training. In the first step, its weights can be fixed. If the learning progress suggests the model not being flexible enough, certain layers or the entire base model can be “unfrozen” and thus made trainable. A further important aspect to note is that the input must be of the same dimensionality as the data on which the model was trained on – if the first layers of the base model are not modified.

image-20200319174208670

Next, we will briefly introduce the ResNet, a popular and powerful CNN architecture for image data. Then, we will show how we used transfer learning with ResNet to do car model classification.

What is ResNet?

Training deep neural networks can quickly become challenging due to the so-called vanishing gradient problem. But what are vanishing gradients? Neural networks are commonly trained using back-propagation. This algorithm leverages the chain rule of calculus to derive gradients at deeper layers of the network by multiplying gradients from earlier layers. Since gradients get repeatedly multiplied in deep networks, they can quickly approach infinitesimally small values during back-propagation.

ResNet is a CNN network that solves the vanishing gradient problem using so-called residual blocks (you find a good explanation of why they are called ‘residual’ here). The unmodified input is passed on to the next layer in the residual block by adding it to a layer’s output (see right figure). This modification makes sure that a better information flow from the input to the deeper layers is possible. The entire ResNet architecture is depicted in the right network in the left figure below. It is plotted alongside a plain CNN and the VGG-19 network, another standard CNN architecture.

Resnet-Architecture_Residual-Block

ResNet has proved to be a powerful network architecture for image classification problems. For example, an ensemble of ResNets with 152 layers won the ILSVRC 2015 image classification contest. Pretrained ResNet models of different sizes are available in the tensorflow.keras.application module, namely ResNet50, ResNet101, ResNet152 and their corresponding second versions (ResNet50V2, …). The number following the model name denotes the number of layers the networks have. The available weights are pretrained on the ImageNet dataset. The models were trained on large computing clusters using hardware accelerators for significant time periods. Transfer learning thus enables us to leverage these training results using the obtained weights as a starting point.

Classifying Car Models

As an illustrative example of how transfer learning can be applied, we treat the problem of classifying the car model given an image of the car. We will start by describing the dataset set we used and how we can filter out unwanted examples in the dataset. Next, we will go over how a data pipeline can be setup using tensorflow.data. In the second section, we will talk you through the model implementation and point out what aspects to be particularly careful about during training and prediction.

Data Preparation

We used the dataset described in this github repo, where you can also download the entire dataset. The author built a datascraper to scrape all car images from the car connection website. He explains that many images are from the interior of the cars. As they are not wanted in the dataset, they are filtered out based on pixel color. The dataset contains 64’467 jpg images, where the file names contain information on the car’s make, model, build year, etc. For a more detailed insight on the dataset, we recommend you consult the original github repo. Three sample images are shown below.

Car Collage 01

While checking through the data, we observed that the dataset still contained many unwanted images, e.g., pictures of wing mirrors, door handles, GPS panels, or lights. Examples of unwanted images can be seen below.

Car Collage 02

Thus, it is beneficial to additionally prefilter the data to clean out more of the unwanted images.

Filtering Unwanted Images Out of the Dataset

There are multiple possible approaches to filter non-car images out of the dataset:

  1. Use a pretrained model
  2. Train another model to classify car/no-car
  3. Train a generative network on a car dataset and use the discriminator part of the network

We decided to pursue the first approach since it is the most direct one and outstanding pretrained models are easily available. If you want to follow the second or third approach, you could, e.g., use this dataset to train the model. The referred dataset only contains images of cars but is significantly smaller than the dataset we used.

We chose the ResNet50V2 in the tensorflow.keras.applications module with the pretrained “imagenet” weights. In a first step, we must figure out the indices and classnames of the imagenet labels corresponding to car images.

# Class labels in imagenet corresponding to cars
CAR_IDX = [656, 627, 817, 511, 468, 751, 705, 757, 717, 734, 654, 675, 864, 609, 436]

CAR_CLASSES = ['minivan', 'limousine', 'sports_car', 'convertible', 'cab', 'racer', 'passenger_car', 'recreational_vehicle', 'pickup', 'police_van', 'minibus', 'moving_van', 'tow_truck', 'jeep', 'landrover', 'beach_wagon']

Next, the pretrained ResNet50V2 model is loaded.

from tensorflow.keras.applications import ResNet50V2

model = ResNet50V2(weights='imagenet')

We can then use this model to make predictions for images. The images fed to the prediction method must be scaled identically to the images used for training. The different ResNet models are trained on different input scales. It is thus essential to apply the correct image preprocessing. The module keras.application.resnet_v2 contains the method preprocess_input, which should be used when using a ResNetV2 network. This method expects the image arrays to be of type float and have values in [0, 255]. Using the appropriately preprocessed input, we can then use the built-in predict method to obtain predictions given an image stored at filename:

from tensorflow.keras.applications.resnet_v2 import preprocess_input

image = tf.io.read_file(filename)
image = tf.image.decode_jpeg(image)
image = tf.cast(image, tf.float32)
image = tf.image.resize_with_crop_or_pad(image, target_height=224, target_width=224)
image = preprocess_input(image)
predictions = model.predict(image)

There are various ideas of how the obtained predictions can be used for car detection.

  • Is one of the CAR_CLASSES among the top k predictions?
  • Is the accumulated probability of the CAR_CLASSES in the predictions greater than some defined threshold?
  • Specific treatment of unwanted images (e.g., detect and filter out wheels)

We show the code for comparing the accumulated probability mass over the CAR_CLASSES.

def is_car_acc_prob(predictions, thresh=THRESH, car_idx=CAR_IDX):
    """
    Determine if car on image by accumulating probabilities of car prediction and comparing to threshold

    Args:
        predictions: (?, 1000) matrix of probability predictions resulting from ResNet with imagenet weights
        thresh: threshold accumulative probability over which an image is considered a car
        car_idx: indices corresponding to cars

    Returns:
        np.array of booleans describing if car or not
    """
    predictions = np.array(predictions, dtype=float)
    car_probs = predictions[:, car_idx]
    car_probs_acc = car_probs.sum(axis=1)
    return car_probs_acc > thresh

The higher the threshold is set, the stricter the filtering procedure is. A value for the threshold that provides good results is THRESH = 0.1. This ensures we do not lose too many true car images. The choice of an appropriate threshold remains subjective, so do as you feel.

The Colab notebook that uses the function is_car_acc_prob to filter the dataset is available in the github repository.

While tuning the prefiltering procedure, we observed the following:

  • Many of the car images with light backgrounds were classified as “beach wagons”. We thus decided to also consider the “beach wagon” class in imagenet as one of the CAR_CLASSES.
  • Images showing the front of a car are often assigned a high probability of “grille”, which is the grating at the front of a car used for cooling. This assignment is correct but leads the procedure shown above to not consider certain car images as cars since we did not include “grille” in the CAR_CLASSES. This problem results in the trade-off of either leaving many close-up images of car grilles in the dataset or filtering out several car images. We opted for the second approach since it yields a cleaner car dataset.

After prefiltering the images using the suggested procedure, 53’738 of 64’467 initially remain in the dataset.

Overview of the Final Datasets

The prefiltered dataset contains images from 323 car models. We decided to reduce our attention to the top 300 most frequent classes in the dataset. That makes sense since some of the least frequent classes have less than ten representatives and can thus not be reasonably split into a train, validation, and test set. Reducing the dataset to images in the top 300 classes leaves us with a dataset containing 53’536 labeled images. The class occurrences are distributed as follows:

Histogram

The number of images per class (car model) ranges from 24 to slightly below 500. We can see that the dataset is very imbalanced. It is essential to keep this in mind when training and evaluating the model.

Building Data Pipelines with tf.data

Even after prefiltering and reducing to the top 300 classes, we still have numerous images left. This poses a potential problem since we can not simply load all images into the memory of our GPU at once. To tackle this problem, we will use tf.data.

tf.data and especially the tf.data.Dataset API allows creating elegant and, at the same time, very efficient input pipelines. The API contains many general methods which can be applied to load and transform potentially large datasets. tf.data.Dataset is especifically useful when training models on GPU(s). It allows for data loading from the HDD, applies transformation on-the-fly, and creates batches that are than sent to the GPU. And this is all done in a way such as the GPU never has to wait for new data.

The following functions create a tf.data.Dataset instance for our particular problem:

def construct_ds(input_files: list,
                 batch_size: int,
                 classes: list,
                 label_type: str,
                 input_size: tuple = (212, 320),
                 prefetch_size: int = 10,
                 shuffle_size: int = 32,
                 shuffle: bool = True,
                 augment: bool = False):
    """
    Function to construct a tf.data.Dataset set from list of files

    Args:
        input_files: list of files
        batch_size: number of observations in batch
        classes: list with all class labels
        input_size: size of images (output size)
        prefetch_size: buffer size (number of batches to prefetch)
        shuffle_size: shuffle size (size of buffer to shuffle from)
        shuffle: boolean specifying whether to shuffle dataset
        augment: boolean if image augmentation should be applied
        label_type: 'make' or 'model'

    Returns:
        buffered and prefetched tf.data.Dataset object with (image, label) tuple
    """
    # Create tf.data.Dataset from list of files
    ds = tf.data.Dataset.from_tensor_slices(input_files)

    # Shuffle files
    if shuffle:
        ds = ds.shuffle(buffer_size=shuffle_size)

    # Load image/labels
    ds = ds.map(lambda x: parse_file(x, classes=classes, input_size=input_size,                                                                                                                                        label_type=label_type))

    # Image augmentation
    if augment and tf.random.uniform((), minval=0, maxval=1, dtype=tf.dtypes.float32, seed=None, name=None) < 0.7:
        ds = ds.map(image_augment)

    # Batch and prefetch data
    ds = ds.batch(batch_size=batch_size)
    ds = ds.prefetch(buffer_size=prefetch_size)

    return ds

We will now describe the methods in the tf.data we used:

  • from_tensor_slices() is one of the available methods for the creation of a dataset. The created dataset contains slices of the given tensor, in this case, the filenames.
  • Next, the shuffle() method considers buffer_size elements one at a time and shuffles these items in isolation from the rest of the dataset. If shuffling of the complete dataset is required, buffer_size must be larger than the bumber of entries in the dataset. Shuffling is only performed if shuffle=True.
  • map() allows to apply arbitrary functions to the dataset. We created a function parse_file() that can be found in the github repo. It is responsible for reading and resizing the images, inferring the labels from the file name and encoding the labels using a one-hot encoder. If the augment flag is set, the data augmentation procedure is activated. Augmentation is only applied in 70% of the cases since it is beneficial to also train the model on non-modified images. The augmentation techniques used in image_augment are flipping, brightness, and contrast adjustments.
  • Finally, the batch() method is used to group the dataset into batches of batch_size elements and the prefetch() method enables preparing later elements while the current element is being processed and thus improves performance. If used after a call to batch(), prefetch_size batches are prefetched.

Model Fine Tuning

Having defined our input pipeline, we now turn towards the model training part. Below you can see the code that can be used to instantiate a model based on the pretrained ResNet, which is available in tf.keras.applications:

from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D


class TransferModel:

    def __init__(self, shape: tuple, classes: list):
        """
        Class for transfer learning from ResNet

        Args:
            shape: Input shape as tuple (height, width, channels)
            classes: List of class labels
        """
        self.shape = shape
        self.classes = classes
        self.history = None
        self.model = None

        # Use pre-trained ResNet model
        self.base_model = ResNet50V2(include_top=False,
                                     input_shape=self.shape,
                                     weights='imagenet')

        # Allow parameter updates for all layers
        self.base_model.trainable = True

        # Add a new pooling layer on the original output
        add_to_base = self.base_model.output
        add_to_base = GlobalAveragePooling2D(data_format='channels_last', name='head_gap')(add_to_base)

        # Add new output layer as head
        new_output = Dense(len(self.classes), activation='softmax', name='head_pred')(add_to_base)

        # Define model
        self.model = Model(self.base_model.input, new_output)

A few more details on the code above:

  • We first create an instance of class tf.keras.applications.ResNet50V2. With include_top=False we tell the pretrained model to leave out the original head of the model (in this case designed for the classification of 1000 classes on ImageNet).
  • base_model.trainable = True makes all layers trainable.
  • Using tf.keras functional API, we then stack a new pooling layer on top of the last convolution block of the original ResNet model. This is a necessary intermediate step before feeding the output to the final classification layer.
  • The final classification layer is then defined using tf.keras.layers.Dense. We define the number of neurons to be equal to the number of desired classes. And the softmax activation function makes sure that the output is a pseudo probability in the range of (0,1] .

The full version of TransferModel (see github) also contains the option to replace the base model with a VGG16 network, another standard CNN for image classification. In addition, it also allows to unfreeze only specific layers, meaning we can make the corresponding parameters trainable while leaving the others fixed. As a default, we have made all parameters trainable here.

After we defined the model, we need to configure it for training. This can be done using tf.keras.Model‘s compile()-method:

def compile(self, **kwargs):
      """
    Compile method
    """
    self.model.compile(**kwargs)

We then pass the following keyword arguments to our method:

  • loss = "categorical_crossentropy"for multi-class classification,
  • optimizer = Adam(0.0001) for using the Adam optimizer from tf.keras.optimizers with a relatively small learning rate (more on the learning rate below), and
  • metrics = ["categorical_accuracy"] for training and validation monitoring.

Next, we will look at the training procedure. Therefore we define a train-method for our TransferModel-class introduced above:

from tensorflow.keras.callbacks import EarlyStopping

def train(self,
          ds_train: tf.data.Dataset,
          epochs: int,
          ds_valid: tf.data.Dataset = None,
          class_weights: np.array = None):
    """
    Trains model in ds_train with for epochs rounds

    Args:
        ds_train: training data as tf.data.Dataset
        epochs: number of epochs to train
        ds_valid: optional validation data as tf.data.Dataset
        class_weights: optional class weights to treat unbalanced classes

    Returns
        Training history from self.history
    """

    # Define early stopping as callback
    early_stopping = EarlyStopping(monitor='val_loss',
                                   min_delta=0,
                                   patience=12,
                                   restore_best_weights=True)

    callbacks = [early_stopping]

    # Fitting
    self.history = self.model.fit(ds_train,
                                  epochs=epochs,
                                  validation_data=ds_valid,
                                  callbacks=callbacks,
                                  class_weight=class_weights)

    return self.history

As our model is an instance of tensorflow.keras.Model, we can train it using the fit method. To prevent overfitting, early stopping is used by passing it to the fit method as a callback function. The patience parameter can be tuned to specify how soon early stopping should apply. The parameter stands for the number of epochs after which, if no decrease of the validation loss is registered, the training will be interrupted. Further, class weights can be passed to the fit method. Class weights allow treating unbalanced data by assigning the different classes different weights, thus allowing to increase the impact of classes with fewer training examples.

We can describe the training process using a pretrained model as follows: As the weights in the head are initialized randomly, and the weights of the base model are pretrained, the training composes of training the head from scratch and fine-tuning the pretrained model’s weights. It is recommended to use a small learning rate (e.g. 1e-4) since choosing the learning rate too large can destroy the near-optimal pretrained weights of the base model.

The training procedure can be sped up by first training for a few epochs without the base model being trainable. The purpose of these initial epochs is to adapt the heads’ weights to the problem. This speeds up the training since when training only the head, much fewer parameters are trainable and thus updated for every batch. The resulting model weights can then be used as the starting point to train the entire model, with the base model being trainable. For the car classification problem that we are considering here, applying this two-stage training did not achieve notable performance enhancement.

Model Performance Evaluation/Prediction

When using the tf.data.Dataset API, one must pay attention to the nature of the methods used. The following method in our TransferModel class can be used as a prediction method.

def predict(self, ds_new: tf.data.Dataset, proba: bool = True):
    """
    Predict class probs or labels on ds_new
    Labels are obtained by taking the most likely class given the predicted probs

    Args:
        ds_new: New data as tf.data.Dataset
        proba: Boolean if probabilities should be returned

    Returns:
        class labels or probabilities
    """

    p = self.model.predict(ds_new)

    if proba:
        return p
    else:
        return [np.argmax(x) for x in p]

It is essential that the dataset ds_new is not shuffled, or else the predictions obtained will be misaligned with the images obtained when iterating over the dataset a second time. This is the case since the flag reshuffle_each_iteration is true by default in the shuffle method’s implementation. A further effect of shuffling is that multiple calls to the take method will not return the same data. This is important when you want to check out predictions, e.g., for only one batch. A simple example where this can be seen is:

# Use construct_ds method from above to create a shuffled dataset
ds = construct_ds(..., shuffle=True)

# Take 1 batch (e.g. 32 images) of dataset: This returns a new dataset
ds_batch = ds.take(1)

# Predict labels for one batch
predictions = model.predict(ds_batch)

# Predict labels again: The result will not be the same as predictions above due to shuffling
predictions_2 = model.predict(ds_batch)

A function to plot images annotated with the corresponding predictions could look as follows:

def show_batch_with_pred(model, ds, classes, rescale=True, size=(10, 10), title=None):
      for image, label in ds.take(1):
        image_array = image.numpy()
        label_array = label.numpy()
        batch_size = image_array.shape[0]
        pred = model.predict(image, proba=False)
        for idx in range(batch_size):
            label = classes[np.argmax(label_array[idx])]
            ax = plt.subplot(np.ceil(batch_size / 4), 4, idx + 1)
            if rescale:
                plt.imshow(image_array[idx] / 255)
            else:
                plt.imshow(image_array[idx])
            plt.title("label: " + label + "\n" 
                      + "prediction: " + classes[pred[idx]], fontsize=10)
            plt.axis('off')

The show_batch_with_pred method works for shuffled datasets as well, since image and label correspond to the same call to the take method.

Evaluating model perfomance can be done using keras.Model's evaluate method.

How Accurate Is Our Final Model?

The model achieves slightly above 70% categorical accuracy for the task of predicting the car model for images from 300 model classes. To better understand the model’s predictions, it is helpful to observe the confusion matrix. Below, you can see the heatmap of the model’s predictions on the validation dataset.

heatmap

We restricted the heatmap to clip the confusion matrix’s entries to [0, 5], as allowing a further span did not significantly highlight any off-diagonal region. As can be seen from the heat map, one class is assigned to examples of almost all classes. That can be seen from the dark red vertical line two-thirds to the right in the figure above. Other than the class mentioned before, there are no evident biases in the predictions. We want to stress here that the categorical accuracy is generally not sufficient for a satisfactory assessment of the model’s performance, particularly in the case of imbalanced classes.

Conclusion and Next Steps

In this blog post, we have applied transfer learning using the ResNet50V2 to classify the car model from images of cars. Our model achieves 70% categorical accuracy over 300 classes.

We found unfreezing the entire base model and using a small learning rate to achieve the best results. Now, having developed a cool car classification model is great, but how can we use our model in a productive setting? Of course, we could build our custom model API using Flask or FastAPI…

But might there even be an easier, standardized way? In the second article of our series, “Deploying TensorFlow Models in Docker using TensorFlow Serving“, we discuss how this model can be deployed using TensorFlow Serving.

Stephan Müller Stephan Müller

At STATWORX, we are very passionate about the field of deep learning. In this blog series, we want to illustrate how an end-to-end deep learning project can be implemented. We use TensorFlow 2.x library for the implementation. The topics of the series include:

In the first part, we will show how you can use transfer learning to tackle car image classification. We start by giving a brief overview of transfer learning and the ResNet and then go into the implementation details. The code presented can be found in this github repository.

Introduction: Transfer Learning & ResNet

What is Transfer Learning?

In traditional (machine) learning, we develop a model and train it on new data for every new task at hand. Transfer learning differs from this approach in that knowledge is transferred from one task to another. It is a useful approach when one is faced with the problem of too little available training data. Models that are pretrained for a similar problem can be used as a starting point for training new models. The pretrained models are referred to as base models.

In our example, a deep learning model trained on the ImageNet dataset can be used as the starting point for building a car model classifier. The main idea behind transfer learning for deep learning models is that the first layers of a network are used to extract important high-level features, which remain similar for the kind of data treated. The final layers (also known as the head) of the original network are replaced by a custom head suitable for the problem at hand. The weights in the head are initialized randomly, and the resulting network can be trained for the specific task.

There are various ways in which the base model can be treated during training. In the first step, its weights can be fixed. If the learning progress suggests the model not being flexible enough, certain layers or the entire base model can be “unfrozen” and thus made trainable. A further important aspect to note is that the input must be of the same dimensionality as the data on which the model was trained on – if the first layers of the base model are not modified.

image-20200319174208670

Next, we will briefly introduce the ResNet, a popular and powerful CNN architecture for image data. Then, we will show how we used transfer learning with ResNet to do car model classification.

What is ResNet?

Training deep neural networks can quickly become challenging due to the so-called vanishing gradient problem. But what are vanishing gradients? Neural networks are commonly trained using back-propagation. This algorithm leverages the chain rule of calculus to derive gradients at deeper layers of the network by multiplying gradients from earlier layers. Since gradients get repeatedly multiplied in deep networks, they can quickly approach infinitesimally small values during back-propagation.

ResNet is a CNN network that solves the vanishing gradient problem using so-called residual blocks (you find a good explanation of why they are called ‘residual’ here). The unmodified input is passed on to the next layer in the residual block by adding it to a layer’s output (see right figure). This modification makes sure that a better information flow from the input to the deeper layers is possible. The entire ResNet architecture is depicted in the right network in the left figure below. It is plotted alongside a plain CNN and the VGG-19 network, another standard CNN architecture.

Resnet-Architecture_Residual-Block

ResNet has proved to be a powerful network architecture for image classification problems. For example, an ensemble of ResNets with 152 layers won the ILSVRC 2015 image classification contest. Pretrained ResNet models of different sizes are available in the tensorflow.keras.application module, namely ResNet50, ResNet101, ResNet152 and their corresponding second versions (ResNet50V2, …). The number following the model name denotes the number of layers the networks have. The available weights are pretrained on the ImageNet dataset. The models were trained on large computing clusters using hardware accelerators for significant time periods. Transfer learning thus enables us to leverage these training results using the obtained weights as a starting point.

Classifying Car Models

As an illustrative example of how transfer learning can be applied, we treat the problem of classifying the car model given an image of the car. We will start by describing the dataset set we used and how we can filter out unwanted examples in the dataset. Next, we will go over how a data pipeline can be setup using tensorflow.data. In the second section, we will talk you through the model implementation and point out what aspects to be particularly careful about during training and prediction.

Data Preparation

We used the dataset described in this github repo, where you can also download the entire dataset. The author built a datascraper to scrape all car images from the car connection website. He explains that many images are from the interior of the cars. As they are not wanted in the dataset, they are filtered out based on pixel color. The dataset contains 64’467 jpg images, where the file names contain information on the car’s make, model, build year, etc. For a more detailed insight on the dataset, we recommend you consult the original github repo. Three sample images are shown below.

Car Collage 01

While checking through the data, we observed that the dataset still contained many unwanted images, e.g., pictures of wing mirrors, door handles, GPS panels, or lights. Examples of unwanted images can be seen below.

Car Collage 02

Thus, it is beneficial to additionally prefilter the data to clean out more of the unwanted images.

Filtering Unwanted Images Out of the Dataset

There are multiple possible approaches to filter non-car images out of the dataset:

  1. Use a pretrained model
  2. Train another model to classify car/no-car
  3. Train a generative network on a car dataset and use the discriminator part of the network

We decided to pursue the first approach since it is the most direct one and outstanding pretrained models are easily available. If you want to follow the second or third approach, you could, e.g., use this dataset to train the model. The referred dataset only contains images of cars but is significantly smaller than the dataset we used.

We chose the ResNet50V2 in the tensorflow.keras.applications module with the pretrained “imagenet” weights. In a first step, we must figure out the indices and classnames of the imagenet labels corresponding to car images.

# Class labels in imagenet corresponding to cars
CAR_IDX = [656, 627, 817, 511, 468, 751, 705, 757, 717, 734, 654, 675, 864, 609, 436]

CAR_CLASSES = ['minivan', 'limousine', 'sports_car', 'convertible', 'cab', 'racer', 'passenger_car', 'recreational_vehicle', 'pickup', 'police_van', 'minibus', 'moving_van', 'tow_truck', 'jeep', 'landrover', 'beach_wagon']

Next, the pretrained ResNet50V2 model is loaded.

from tensorflow.keras.applications import ResNet50V2

model = ResNet50V2(weights='imagenet')

We can then use this model to make predictions for images. The images fed to the prediction method must be scaled identically to the images used for training. The different ResNet models are trained on different input scales. It is thus essential to apply the correct image preprocessing. The module keras.application.resnet_v2 contains the method preprocess_input, which should be used when using a ResNetV2 network. This method expects the image arrays to be of type float and have values in [0, 255]. Using the appropriately preprocessed input, we can then use the built-in predict method to obtain predictions given an image stored at filename:

from tensorflow.keras.applications.resnet_v2 import preprocess_input

image = tf.io.read_file(filename)
image = tf.image.decode_jpeg(image)
image = tf.cast(image, tf.float32)
image = tf.image.resize_with_crop_or_pad(image, target_height=224, target_width=224)
image = preprocess_input(image)
predictions = model.predict(image)

There are various ideas of how the obtained predictions can be used for car detection.

We show the code for comparing the accumulated probability mass over the CAR_CLASSES.

def is_car_acc_prob(predictions, thresh=THRESH, car_idx=CAR_IDX):
    """
    Determine if car on image by accumulating probabilities of car prediction and comparing to threshold

    Args:
        predictions: (?, 1000) matrix of probability predictions resulting from ResNet with imagenet weights
        thresh: threshold accumulative probability over which an image is considered a car
        car_idx: indices corresponding to cars

    Returns:
        np.array of booleans describing if car or not
    """
    predictions = np.array(predictions, dtype=float)
    car_probs = predictions[:, car_idx]
    car_probs_acc = car_probs.sum(axis=1)
    return car_probs_acc > thresh

The higher the threshold is set, the stricter the filtering procedure is. A value for the threshold that provides good results is THRESH = 0.1. This ensures we do not lose too many true car images. The choice of an appropriate threshold remains subjective, so do as you feel.

The Colab notebook that uses the function is_car_acc_prob to filter the dataset is available in the github repository.

While tuning the prefiltering procedure, we observed the following:

After prefiltering the images using the suggested procedure, 53’738 of 64’467 initially remain in the dataset.

Overview of the Final Datasets

The prefiltered dataset contains images from 323 car models. We decided to reduce our attention to the top 300 most frequent classes in the dataset. That makes sense since some of the least frequent classes have less than ten representatives and can thus not be reasonably split into a train, validation, and test set. Reducing the dataset to images in the top 300 classes leaves us with a dataset containing 53’536 labeled images. The class occurrences are distributed as follows:

Histogram

The number of images per class (car model) ranges from 24 to slightly below 500. We can see that the dataset is very imbalanced. It is essential to keep this in mind when training and evaluating the model.

Building Data Pipelines with tf.data

Even after prefiltering and reducing to the top 300 classes, we still have numerous images left. This poses a potential problem since we can not simply load all images into the memory of our GPU at once. To tackle this problem, we will use tf.data.

tf.data and especially the tf.data.Dataset API allows creating elegant and, at the same time, very efficient input pipelines. The API contains many general methods which can be applied to load and transform potentially large datasets. tf.data.Dataset is especifically useful when training models on GPU(s). It allows for data loading from the HDD, applies transformation on-the-fly, and creates batches that are than sent to the GPU. And this is all done in a way such as the GPU never has to wait for new data.

The following functions create a tf.data.Dataset instance for our particular problem:

def construct_ds(input_files: list,
                 batch_size: int,
                 classes: list,
                 label_type: str,
                 input_size: tuple = (212, 320),
                 prefetch_size: int = 10,
                 shuffle_size: int = 32,
                 shuffle: bool = True,
                 augment: bool = False):
    """
    Function to construct a tf.data.Dataset set from list of files

    Args:
        input_files: list of files
        batch_size: number of observations in batch
        classes: list with all class labels
        input_size: size of images (output size)
        prefetch_size: buffer size (number of batches to prefetch)
        shuffle_size: shuffle size (size of buffer to shuffle from)
        shuffle: boolean specifying whether to shuffle dataset
        augment: boolean if image augmentation should be applied
        label_type: 'make' or 'model'

    Returns:
        buffered and prefetched tf.data.Dataset object with (image, label) tuple
    """
    # Create tf.data.Dataset from list of files
    ds = tf.data.Dataset.from_tensor_slices(input_files)

    # Shuffle files
    if shuffle:
        ds = ds.shuffle(buffer_size=shuffle_size)

    # Load image/labels
    ds = ds.map(lambda x: parse_file(x, classes=classes, input_size=input_size,                                                                                                                                        label_type=label_type))

    # Image augmentation
    if augment and tf.random.uniform((), minval=0, maxval=1, dtype=tf.dtypes.float32, seed=None, name=None) < 0.7:
        ds = ds.map(image_augment)

    # Batch and prefetch data
    ds = ds.batch(batch_size=batch_size)
    ds = ds.prefetch(buffer_size=prefetch_size)

    return ds

We will now describe the methods in the tf.data we used:

Model Fine Tuning

Having defined our input pipeline, we now turn towards the model training part. Below you can see the code that can be used to instantiate a model based on the pretrained ResNet, which is available in tf.keras.applications:

from tensorflow.keras.applications import ResNet50V2
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D


class TransferModel:

    def __init__(self, shape: tuple, classes: list):
        """
        Class for transfer learning from ResNet

        Args:
            shape: Input shape as tuple (height, width, channels)
            classes: List of class labels
        """
        self.shape = shape
        self.classes = classes
        self.history = None
        self.model = None

        # Use pre-trained ResNet model
        self.base_model = ResNet50V2(include_top=False,
                                     input_shape=self.shape,
                                     weights='imagenet')

        # Allow parameter updates for all layers
        self.base_model.trainable = True

        # Add a new pooling layer on the original output
        add_to_base = self.base_model.output
        add_to_base = GlobalAveragePooling2D(data_format='channels_last', name='head_gap')(add_to_base)

        # Add new output layer as head
        new_output = Dense(len(self.classes), activation='softmax', name='head_pred')(add_to_base)

        # Define model
        self.model = Model(self.base_model.input, new_output)

A few more details on the code above:

The full version of TransferModel (see github) also contains the option to replace the base model with a VGG16 network, another standard CNN for image classification. In addition, it also allows to unfreeze only specific layers, meaning we can make the corresponding parameters trainable while leaving the others fixed. As a default, we have made all parameters trainable here.

After we defined the model, we need to configure it for training. This can be done using tf.keras.Model‘s compile()-method:

def compile(self, **kwargs):
      """
    Compile method
    """
    self.model.compile(**kwargs)

We then pass the following keyword arguments to our method:

Next, we will look at the training procedure. Therefore we define a train-method for our TransferModel-class introduced above:

from tensorflow.keras.callbacks import EarlyStopping

def train(self,
          ds_train: tf.data.Dataset,
          epochs: int,
          ds_valid: tf.data.Dataset = None,
          class_weights: np.array = None):
    """
    Trains model in ds_train with for epochs rounds

    Args:
        ds_train: training data as tf.data.Dataset
        epochs: number of epochs to train
        ds_valid: optional validation data as tf.data.Dataset
        class_weights: optional class weights to treat unbalanced classes

    Returns
        Training history from self.history
    """

    # Define early stopping as callback
    early_stopping = EarlyStopping(monitor='val_loss',
                                   min_delta=0,
                                   patience=12,
                                   restore_best_weights=True)

    callbacks = [early_stopping]

    # Fitting
    self.history = self.model.fit(ds_train,
                                  epochs=epochs,
                                  validation_data=ds_valid,
                                  callbacks=callbacks,
                                  class_weight=class_weights)

    return self.history

As our model is an instance of tensorflow.keras.Model, we can train it using the fit method. To prevent overfitting, early stopping is used by passing it to the fit method as a callback function. The patience parameter can be tuned to specify how soon early stopping should apply. The parameter stands for the number of epochs after which, if no decrease of the validation loss is registered, the training will be interrupted. Further, class weights can be passed to the fit method. Class weights allow treating unbalanced data by assigning the different classes different weights, thus allowing to increase the impact of classes with fewer training examples.

We can describe the training process using a pretrained model as follows: As the weights in the head are initialized randomly, and the weights of the base model are pretrained, the training composes of training the head from scratch and fine-tuning the pretrained model’s weights. It is recommended to use a small learning rate (e.g. 1e-4) since choosing the learning rate too large can destroy the near-optimal pretrained weights of the base model.

The training procedure can be sped up by first training for a few epochs without the base model being trainable. The purpose of these initial epochs is to adapt the heads’ weights to the problem. This speeds up the training since when training only the head, much fewer parameters are trainable and thus updated for every batch. The resulting model weights can then be used as the starting point to train the entire model, with the base model being trainable. For the car classification problem that we are considering here, applying this two-stage training did not achieve notable performance enhancement.

Model Performance Evaluation/Prediction

When using the tf.data.Dataset API, one must pay attention to the nature of the methods used. The following method in our TransferModel class can be used as a prediction method.

def predict(self, ds_new: tf.data.Dataset, proba: bool = True):
    """
    Predict class probs or labels on ds_new
    Labels are obtained by taking the most likely class given the predicted probs

    Args:
        ds_new: New data as tf.data.Dataset
        proba: Boolean if probabilities should be returned

    Returns:
        class labels or probabilities
    """

    p = self.model.predict(ds_new)

    if proba:
        return p
    else:
        return [np.argmax(x) for x in p]

It is essential that the dataset ds_new is not shuffled, or else the predictions obtained will be misaligned with the images obtained when iterating over the dataset a second time. This is the case since the flag reshuffle_each_iteration is true by default in the shuffle method’s implementation. A further effect of shuffling is that multiple calls to the take method will not return the same data. This is important when you want to check out predictions, e.g., for only one batch. A simple example where this can be seen is:

# Use construct_ds method from above to create a shuffled dataset
ds = construct_ds(..., shuffle=True)

# Take 1 batch (e.g. 32 images) of dataset: This returns a new dataset
ds_batch = ds.take(1)

# Predict labels for one batch
predictions = model.predict(ds_batch)

# Predict labels again: The result will not be the same as predictions above due to shuffling
predictions_2 = model.predict(ds_batch)

A function to plot images annotated with the corresponding predictions could look as follows:

def show_batch_with_pred(model, ds, classes, rescale=True, size=(10, 10), title=None):
      for image, label in ds.take(1):
        image_array = image.numpy()
        label_array = label.numpy()
        batch_size = image_array.shape[0]
        pred = model.predict(image, proba=False)
        for idx in range(batch_size):
            label = classes[np.argmax(label_array[idx])]
            ax = plt.subplot(np.ceil(batch_size / 4), 4, idx + 1)
            if rescale:
                plt.imshow(image_array[idx] / 255)
            else:
                plt.imshow(image_array[idx])
            plt.title("label: " + label + "\n" 
                      + "prediction: " + classes[pred[idx]], fontsize=10)
            plt.axis('off')

The show_batch_with_pred method works for shuffled datasets as well, since image and label correspond to the same call to the take method.

Evaluating model perfomance can be done using keras.Model's evaluate method.

How Accurate Is Our Final Model?

The model achieves slightly above 70% categorical accuracy for the task of predicting the car model for images from 300 model classes. To better understand the model’s predictions, it is helpful to observe the confusion matrix. Below, you can see the heatmap of the model’s predictions on the validation dataset.

heatmap

We restricted the heatmap to clip the confusion matrix’s entries to [0, 5], as allowing a further span did not significantly highlight any off-diagonal region. As can be seen from the heat map, one class is assigned to examples of almost all classes. That can be seen from the dark red vertical line two-thirds to the right in the figure above. Other than the class mentioned before, there are no evident biases in the predictions. We want to stress here that the categorical accuracy is generally not sufficient for a satisfactory assessment of the model’s performance, particularly in the case of imbalanced classes.

Conclusion and Next Steps

In this blog post, we have applied transfer learning using the ResNet50V2 to classify the car model from images of cars. Our model achieves 70% categorical accuracy over 300 classes.

We found unfreezing the entire base model and using a small learning rate to achieve the best results. Now, having developed a cool car classification model is great, but how can we use our model in a productive setting? Of course, we could build our custom model API using Flask or FastAPI…

But might there even be an easier, standardized way? In the second article of our series, “Deploying TensorFlow Models in Docker using TensorFlow Serving“, we discuss how this model can be deployed using TensorFlow Serving.

Stephan Müller Stephan Müller