At statworx, we work a lot with R and often use the same small helper functions across our projects. These functions make our day-to-day work easier by reducing repetitive code or by creating overviews of our projects.

To share these functions within our teams and with others, I started collecting them and then built an R package called helfRlein from them. Besides sharing, I also wanted some use cases for improving my debugging and optimization skills. Over time the package grew, and more and more functions came together. Last time, I presented each function as part of an Advent calendar. For the launch of our new website, I have gathered all the functions from that calendar here and will present every current function from the helfRlein package.

Most functions were developed when there was a problem that needed a simple solution. For example, a displayed text was too long and had to be shortened (see evenstrings). Other functions exist only to reduce repetitive tasks, such as reading in multiple files of the same type (see read_files). So these functions might be useful for you, too!

To explore all functions in detail, you can visit our GitHub. If you have any suggestions, please send me an email or open an issue on GitHub!

1. char_replace

This little helper replaces special characters (for example, the umlaut "ä") with their standard equivalent (in this case "ae"). It can also convert all characters to lower case, remove spaces, or replace spaces and dashes with underscores.

Let’s look at a small example with different settings:

library(helfRlein)

x <- " Élizàldë-González Strasse"
char_replace(x, to_lower = TRUE)
[1] "elizalde-gonzalez strasse"
char_replace(x, to_lower = TRUE, to_underscore = TRUE)
[1] "elizalde_gonzalez_strasse"
char_replace(x, to_lower = FALSE, rm_space = TRUE, rm_dash = TRUE)
[1] "ElizaldeGonzalezStrasse"

2. checkdir

This little helper checks whether a given folder path exists and creates it if needed.

checkdir(path = "testfolder/subfolder")

Internally, it is just a simple if statement that combines the base R functions file.exists() and dir.create().
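
A minimal sketch of that idea in base R. The name checkdir_sketch and the use of recursive = TRUE are assumptions made for illustration; this is not the package source.

```r
# Hypothetical re-implementation for illustration only
checkdir_sketch <- function(path) {
  if (!file.exists(path)) {
    # recursive = TRUE (an assumption) also creates missing parent folders
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Work inside tempdir() so the example leaves no traces in the project
target <- file.path(tempdir(), "testfolder", "subfolder")
checkdir_sketch(target)
checkdir_sketch(target)  # calling it again is harmless, the folder already exists
dir.exists(target)
```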

3. clean_gc

This little helper frees the memory held by unused objects. Well, basically it just calls gc() a few times. I used it some time ago for a project in which I worked with huge data files. Although we were lucky enough to have a large server with 500 GB of RAM, we soon hit its limits. Since we usually parallelize several processes, we needed every bit and byte of RAM we could get. So instead of having many lines like this:

gc();gc();gc();gc()

… I wrote clean_gc() for convenience. Internally, gc() is called for as long as there is memory left to free.
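
One way such a loop could look, as a rough sketch: keep calling gc() until the memory reported as used stops shrinking. The function name, the max_iter safety cap and the stopping rule are assumptions, not the package’s actual code.

```r
# Hypothetical sketch: repeat gc() while it still frees memory
clean_gc_sketch <- function(max_iter = 10) {
  # column 2 of the gc() matrix is the memory currently in use, in Mb
  used_before <- sum(gc()[, 2])
  for (i in seq_len(max_iter)) {
    used_after <- sum(gc()[, 2])
    if (used_after >= used_before) break  # nothing more was freed
    used_before <- used_after
  }
  invisible(used_before)
}

# Create and drop a large object, then clean up
big <- numeric(1e6)
rm(big)
used_mb <- clean_gc_sketch()
used_mb
```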

Some further thoughts

There is some discussion about the garbage collector gc() and its usefulness. If you want to learn more, I suggest you have a look at the memory section in Advanced R. I know that R itself frees memory when needed, but I am not sure what happens when you have several R processes. Can they free memory held by other processes? If you know more about this, let me know!

4. count_na

This little helper counts the missing values within a vector.

x <- c(NA, NA, 1, NaN, 0)
count_na(x)
3

Internally, it is just a simple sum(is.na(x)) that counts the NA values. If you want the mean instead of the sum, you can set prop = TRUE.
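
The description translates almost directly into code. The following is a sketch under the assumption that prop simply switches from sum() to mean(); the name count_na_sketch is made up for illustration.

```r
# Hypothetical sketch of the described behaviour
count_na_sketch <- function(x, prop = FALSE) {
  if (prop) mean(is.na(x)) else sum(is.na(x))
}

x <- c(NA, NA, 1, NaN, 0)
count_na_sketch(x)               # 3, because is.na() is also TRUE for NaN
count_na_sketch(x, prop = TRUE)  # 0.6, i.e. 3 of 5 values are missing
```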

5. evenstrings

This little helper splits a given string into smaller parts of a fixed length. But why? Well, I needed this function when creating a plot with a long title. The text was too long for one line, and instead of simply cutting it off or letting it run over the margins, I wanted to break it up nicely.

Given a long string like…

long_title <- c("Contains the months: January, February, March, April, May, June, July, August, September, October, November, December")

…we want to split it at split = "," with a maximum length of char = 60.

short_title <- evenstrings(long_title, split = ",", char = 60)

The function has two possible output formats, which are selected by setting newlines = TRUE or FALSE:

  • a single string with line separators \n
  • a vector containing each part.

Another use case could be a message printed to the console with cat():

cat(long_title)
Contains the months: January, February, March, April, May, June, July, August, September, October, November, December
cat(short_title)
Contains the months: January, February, March, April, May,
 June, July, August, September, October, November, December

Code for plot example

library(ggplot2)

p1 <- ggplot(data.frame(x = 1:10, y = 1:10),
  aes(x = x, y = y)) +
  geom_point() +
  ggtitle(long_title)

p2 <- ggplot(data.frame(x = 1:10, y = 1:10),
  aes(x = x, y = y)) +
  geom_point() +
  ggtitle(short_title)

multiplot(p1, p2)

6. get_files

This little helper does the same as the "Find in Files" search in RStudio. It returns a vector of all files in a given folder that contain the search pattern. In your daily workflow, you would normally use the keyboard shortcut SHIFT+CTRL+F. With get_files() you can use this functionality in your scripts.
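
To illustrate the idea, here is a base R sketch that lists the files in a folder and keeps those with at least one line matching the pattern. The function name and its two arguments are assumptions; the real get_files() may have a different signature.

```r
# Hypothetical sketch: which files in `dir` contain `pattern`?
get_files_sketch <- function(dir, pattern) {
  files <- list.files(dir, full.names = TRUE, recursive = TRUE)
  hits <- vapply(files, function(f) {
    any(grepl(pattern, readLines(f, warn = FALSE)))
  }, logical(1))
  files[hits]
}

# Small demo folder with two scripts, only one of which uses mean()
demo_dir <- file.path(tempdir(), "get_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("x <- mean(1:10)", file.path(demo_dir, "a.R"))
writeLines("y <- sum(1:10)", file.path(demo_dir, "b.R"))

found <- get_files_sketch(demo_dir, pattern = "mean")
basename(found)
```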

7. get_network

The goal of this little helper is to visualize the connections between R functions within a project as a flowchart. The input is either a directory path to the functions or a list containing the functions, and the outputs are an adjacency matrix and a graph object. As an example, we use this folder with some toy functions:

net <- get_network(dir = "flowchart/R_network_functions/", simplify = FALSE)
g1 <- net$igraph

Input

There are five parameters for interacting with the function:

  • a path dir that should be searched.
  • a character vector variations with the definition string of the function; the default is c(" <- function", "<- function", "<-function").
  • a pattern, a string with the file suffix; the default is "\\.R$".
  • a boolean simplify that removes functions without connections from the plot.
  • a named list all_scripts, which is an alternative to dir. This list is mainly used for testing purposes.

For normal use, it should be enough to provide a path to the project folder.

Output

The resulting plot shows the connections between the individual functions (arrows) and the relative size of each function’s code (size of the points). As mentioned above, the output consists of an adjacency matrix and a graph object. The matrix contains the number of calls for each function. The graph object has the following properties:

  • The names of the functions are used as labels.
  • The number of lines of each function (without comments and blank lines) is stored as the size.
  • The folder name of the first folder in the directory.
  • A color corresponding to the folder.

With these properties, you can improve the network plot, for example like this:

library(igraph)

# create plots ------------------------------------------------------------
l <- layout_with_fr(g1)
colrs <- rainbow(length(unique(V(g1)$color)))

plot(g1,
     edge.arrow.size = .1,
     edge.width = 5*E(g1)$weight/max(E(g1)$weight),
     vertex.shape = "none",
     vertex.label.color = colrs[V(g1)$color],
     vertex.size = 20,
     vertex.color = colrs[V(g1)$color],
     edge.color = "steelblue1",
     layout = l)
legend(x = 0,
       unique(V(g1)$folder), pch = 21,
       pt.bg = colrs[unique(V(g1)$color)],
       pt.cex = 2, cex = .8, bty = "n", ncol = 1)
 

[Figure: example-network-helfRlein]

8. get_sequence

This little helper returns the indices of recurring patterns. It works with numbers as well as with characters. All it needs is a vector with the data, a pattern to look for, and a minimum number of occurrences.

Let’s create some time series data with the following code.

library(data.table)

# random seed
set.seed(20181221)

# number of observations
n <- 100

# simulating the data
ts_data <- data.table(DAY = 1:n, CHANGE = sample(c(-1, 0, 1), n, replace = TRUE))
ts_data[, VALUE := cumsum(CHANGE)]

This is nothing more than a random walk, since we choose between going down (-1), going up (1), and staying at the same level (0). Our time series data looks like this:

Suppose we want to know the date ranges in which there was no change for at least four consecutive days.

ts_data[, get_sequence(x = CHANGE, pattern = 0, minsize = 4)]
     min max
[1,]  45  48
[2,]  65  69
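
For a single repeated value, the same kind of result can be reproduced with run-length encoding. This is only a sketch of the technique with a made-up helper name (runs_of_value); it does not cover multi-element patterns such as c(-1, 1) and is not how get_sequence is necessarily implemented.

```r
# Hypothetical sketch: start and end index of every run of `value`
# that is at least `minsize` long, based on rle()
runs_of_value <- function(x, value, minsize) {
  r <- rle(x == value)
  end <- cumsum(r$lengths)        # last index of each run
  start <- end - r$lengths + 1    # first index of each run
  keep <- r$values & r$lengths >= minsize
  cbind(min = start[keep], max = end[keep])
}

x <- c(1, 0, 0, 0, 0, -1, 0, 0, 1, 0, 0, 0, 0, 0)
res <- runs_of_value(x, value = 0, minsize = 4)
res  # two runs: positions 2 to 5 and 10 to 14
```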

We can also answer the question of whether the pattern "down-up-down-up" repeats anywhere:

ts_data[, get_sequence(x = CHANGE, pattern = c(-1,1), minsize = 2)]
     min max
[1,]  88  91

With these two results, we can update our plot a little by adding some geom_rect!

Code for the plot

rect <- data.table(
  rbind(ts_data[, get_sequence(x = CHANGE, pattern = c(0), minsize = 4)],
        ts_data[, get_sequence(x = CHANGE, pattern = c(-1,1), minsize = 2)]),
  GROUP = c("no change","no change","down-up"))

ggplot(ts_data, aes(x = DAY, y = VALUE)) +
  geom_line() +
  geom_rect(data = rect,
            inherit.aes = FALSE,
            aes(xmin = min - 1,
                xmax = max,
                ymin = -Inf,
                ymax = Inf,
                group = GROUP,
                fill = GROUP),
            color = "transparent",
            alpha = 0.5) +
  scale_fill_manual(values = statworx_palette(number = 2, basecolors = c(2,5))) +
  theme_minimal()

9. intersect2

This little helper returns the intersection of multiple vectors or lists. I found this function here, thought it was quite useful, and adapted it a little.

intersect2(list(c(1:3), c(1:4)), list(c(1:2),c(1:3)), c(1:2))
[1] 1 2

Internally, the problem of finding the intersection is solved recursively if an element is a list, and then step by step with the next element.
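
A compact sketch of that recursion: every list argument is first collapsed to the intersection of its own elements, and the results are then intersected one after another with Reduce(). The name intersect2_sketch is an assumption, and the package’s code may be structured differently.

```r
# Hypothetical sketch of a recursive intersection over vectors and lists
intersect2_sketch <- function(...) {
  args <- list(...)
  # a list argument is replaced by the intersection of its elements
  flat <- lapply(args, function(a) {
    if (is.list(a)) do.call(intersect2_sketch, a) else a
  })
  # intersect the remaining vectors step by step
  Reduce(intersect, flat)
}

res <- intersect2_sketch(list(1:3, 1:4), list(1:2, 1:3), 1:2)
res  # 1 2
```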

10. multiplot

This little helper combines multiple ggplots into one plot. It is a function taken from the R cookbook.

One advantage over facets is that you don’t need all the data for all plots in a single object. You can also create each plot freely, which can sometimes be a disadvantage as well.

With the layout parameter, you can arrange multiple plots with different sizes. Say you have three plots and want to arrange them like this:

1    2    2
1    2    2
3    3    3

With multiplot, it boils down to this:

multiplot(plotlist = list(p1, p2, p3),
          layout = matrix(c(1,2,2,1,2,2,3,3,3), nrow = 3, byrow = TRUE))

Code for plot example

# star coordinates
c1 <- cos((2 * pi) / 5)
c2 <- cos(pi / 5)
s1 <- sin((2 * pi) / 5)
s2 <- sin((4 * pi) / 5)

data_star <- data.table(X = c(0, -s2, s1, -s1, s2),
                        Y = c(1, -c2, c1, c1, -c2))

p1 <- ggplot(data_star, aes(x = X, y = Y)) +
  geom_polygon(fill = "gold") +
  theme_void()

# tree
set.seed(24122018)
n <- 10000
lambda <- 2
data_tree <- data.table(X = c(rpois(n, lambda), rpois(n, 1.1*lambda)),
                        TYPE = rep(c("1", "2"), each = n))
data_tree <- data_tree[, list(COUNT = .N), by = c("TYPE", "X")]
data_tree[TYPE == "1", COUNT := -COUNT]

p2 <- ggplot(data_tree, aes(x = X, y = COUNT, fill = TYPE)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("green", "darkgreen")) +
  coord_flip() +
  theme_minimal()

# gifts
data_gifts <- data.table(X = runif(5, min = 0, max = 10),
                         Y = runif(5, max = 0.5),
                         Z = sample(letters[1:5], 5, replace = FALSE))

p3 <- ggplot(data_gifts, aes(x = X, y = Y)) +
  geom_point(aes(color = Z), pch = 15, size = 10) +
  scale_color_brewer(palette = "Reds") +
  geom_point(pch = 12, size = 10, color = "gold") +
  xlim(0,8) +
  ylim(0.1,0.5) +
  theme_minimal() + 
  theme(legend.position="none") 


11. na_omitlist

This little helper removes missing values from a list.

y <- list(NA, c(1, NA), list(c(5:6, NA), NA, "A"))

There are two ways to remove the missing values: either only on the first level of the list or within every sublevel.

na_omitlist(y, recursive = FALSE)
[[1]]
[1]  1 NA

[[2]]
[[2]][[1]]
[1]  5  6 NA

[[2]][[2]]
[1] NA

[[2]][[3]]
[1] "A"
na_omitlist(y, recursive = TRUE)
[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 5 6

[[2]][[2]]
[1] "A"

12. %nin%

This little helper is purely a convenience function. It is simply the same as the negated %in% operator, as you can see below. But in my opinion, it improves the readability of the code.

all.equal( c(1,2,3,4) %nin% c(1,2,5),
          !c(1,2,3,4) %in%  c(1,2,5))
[1] TRUE

This operator has also made it into a few other packages, as you can read here.
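
Given the equivalence above, an operator like this can be defined in one line. The definition below is a sketch of how such an operator is commonly written, not a copy of the package source.

```r
# A one-line definition with the behaviour described above
`%nin%` <- function(x, table) !(x %in% table)

res <- c(1, 2, 3, 4) %nin% c(1, 2, 5)
res  # FALSE FALSE TRUE TRUE

# %in% binds tighter than !, so the negated form needs no extra parentheses
identical(res, !c(1, 2, 3, 4) %in% c(1, 2, 5))
```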

13. object_size_in_env

This little helper shows a table with the size of each object in the given environment.

If you are in a situation where you have coded a lot and your environment is now quite cluttered, object_size_in_env helps you find the big fish in terms of memory usage. I ran into this problem a few times myself when looping over multiple runs of my models. At some point, the sessions became quite large in memory and I didn’t know why! With the help of object_size_in_env and some debugging, I was able to track down the object that caused the problem and adjust my code accordingly.

First, let’s create an environment with some variables.

# building an environment
this_env <- new.env()
assign("Var1", 3, envir = this_env)
assign("Var2", 1:1000, envir = this_env)
assign("Var3", rep("test", 1000), envir = this_env)

To get the size information of our objects, format(object.size()) is used internally. The unit argument changes the output format (e.g. "B", "MB" or "GB").

# checking the size
object_size_in_env(env = this_env, unit = "B")
   OBJECT SIZE UNIT
1:   Var3 8104    B
2:   Var2 4048    B
3:   Var1   56    B
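
A stripped-down base R version of the same idea could look like the sketch below. It is an assumption-laden illustration: the name object_sizes_sketch is invented, it returns a plain data.frame, always reports bytes, and uses as.numeric(object.size()) where the package uses format(object.size()).

```r
# Hypothetical sketch: sizes of all objects in an environment, largest first
object_sizes_sketch <- function(env) {
  obj_names <- ls(envir = env)
  sizes <- vapply(obj_names, function(n) {
    as.numeric(object.size(get(n, envir = env)))  # size in bytes
  }, numeric(1))
  out <- data.frame(OBJECT = obj_names,
                    SIZE = unname(sizes),
                    UNIT = rep("B", length(obj_names)))
  out[order(out$SIZE, decreasing = TRUE), ]
}

demo_env <- new.env()
assign("small", 3, envir = demo_env)
assign("large", rep("test", 1000), envir = demo_env)

sizes <- object_sizes_sketch(demo_env)
sizes  # "large" is listed first; exact byte counts depend on the platform
```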

14. print_fs

This little helper returns the folder structure of a given path. You can use it, for example, to add a nice overview to the documentation of a project or to a Git repository. For the sake of automation, this function could update parts of a log or news file after a major change.

If we look at the same example we used for the get_network function, we get the following:

print_fs("~/flowchart/", depth = 4)
1  flowchart                            
2   ¦--create_network.R                 
3   ¦--getnetwork.R                     
4   ¦--plots                            
5   ¦   ¦--example-network-helfRlein.png
6   ¦   °--improved-network.png         
7   ¦--R_network_functions              
8   ¦   ¦--dataprep                     
9   ¦   ¦   °--foo_01.R                 
10  ¦   ¦--method                       
11  ¦   ¦   °--foo_02.R                 
12  ¦   ¦--script_01.R                  
13  ¦   °--script_02.R                  
14  °--README.md 

With depth, we can set how deep we want to traverse our folders.

15. read_files

This little helper reads in multiple files of the same type and combines them into a data.table. Which file-reading function is used can be chosen with the FUN argument.

If you have a list of files that should all be read with the same function (e.g. read.csv), you can now use this instead of lapply and rbindlist:

read_files(files, FUN = readRDS)
read_files(files, FUN = readLines)
read_files(files, FUN = read.csv, sep = ";")

Internally, it just uses lapply and rbindlist, but you don’t have to type them all the time. read_files combines the individual files by their column names and returns a data.table. Why data.table? Because I like it. But let’s not open the can of worms of data.table vs. dplyr (to the can of worms…).
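
The lapply-then-bind pattern can be sketched without any extra packages. Note the swap: to keep the example dependency-free, it uses do.call(rbind, ...) from base R instead of data.table’s rbindlist, so it returns a data.frame rather than a data.table. The name read_files_sketch is hypothetical.

```r
# Hypothetical sketch: read every file with FUN, then stack the results
read_files_sketch <- function(files, FUN = read.csv, ...) {
  parts <- lapply(files, FUN, ...)  # one data.frame per file
  do.call(rbind, parts)             # rbind matches columns by name
}

# Two small CSV files with the same columns
tmp <- file.path(tempdir(), c("part1.csv", "part2.csv"))
write.csv(data.frame(id = 1:2, value = c(0.5, 1.5)), tmp[1], row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(2.5, 3.5)), tmp[2], row.names = FALSE)

combined <- read_files_sketch(tmp, FUN = read.csv)
combined
```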

16. save_rds_archive

This little helper is a wrapper around the base R function saveRDS() and checks whether the file you are trying to save already exists. If it does, the existing file is renamed/archived (with a timestamp), and the "updated" file is saved under the specified name. This means that existing code that relies on the file name staying constant (e.g. readRDS() calls in other scripts) will keep working, while an archived copy of the file, which would otherwise have been overwritten, is preserved.
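
The archive-then-save logic might be sketched as follows. The function name and the naming scheme for the archived file (suffix "_archived_" plus a timestamp) are assumptions chosen for illustration; the package may name archived files differently.

```r
# Hypothetical sketch: keep a timestamped copy before overwriting an .rds file
save_rds_archive_sketch <- function(object, file) {
  if (file.exists(file)) {
    stamp <- format(Sys.time(), "%Y%m%d-%H%M%S")
    archived <- paste0(tools::file_path_sans_ext(file),
                       "_archived_", stamp, ".rds")
    file.rename(file, archived)  # move the old version out of the way
  }
  saveRDS(object, file)          # the new version keeps the original name
  invisible(file)
}

out_dir <- file.path(tempdir(), "rds_demo")
unlink(out_dir, recursive = TRUE)  # start from an empty demo folder
dir.create(out_dir)
target <- file.path(out_dir, "model.rds")

save_rds_archive_sketch(1:3, target)  # first save, nothing to archive
save_rds_archive_sketch(4:6, target)  # old file is renamed, new one saved
list.files(out_dir)
```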

17. sci_palette

This little helper provides a set of colors that we use frequently at statworx. So if, like me, you can’t remember every hex color code you need, this might help. Of course, these are our colors, but you can rewrite it with your own color palette. The main benefit, though, is the plot method, which lets you see the color instead of just reading the hex code.

This way you can see which hex code corresponds to which color and what you can use it for.

sci_palette(scheme = "new")
Tech Blue       Black       White  Light Grey    Accent 1    Accent 2    Accent 3 
"#0000FF"   "#000000"   "#FFFFFF"   "#EBF0F2"   "#283440"   "#6C7D8C"   "#B6BDCC"   
Highlight 1 Highlight 2 Highlight 3 
"#00C800"   "#FFFF00"   "#FE0D6C" 
attr(,"class")
[1] "sci"

As mentioned above, there is a plot() method that produces the following image.

plot(sci_palette(scheme = "new"))

18. statusbar

This little helper prints a progress bar for loops to the console.

There are two required parameters to feed this function:

  • run is either the iterator or its number
  • max.run is either all possible iterators in the order in which they are processed, or the maximum number of iterations.

So it could be, for example, run = 3 and max.run = 16, or run = "a" and max.run = letters[1:16].

In addition, there are two optional parameters:

  • percent.max affects the width of the progress bar
  • info is an additional character string that is printed at the end of the line. By default, it is run.

A small drawback of this function is that it does not work with parallel processes. If you want a progress bar when using apply functions, have a look at pbapply.
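
To show how the two required parameters could work together, here is a minimal console progress bar in base R. It is a sketch only: statusbar_sketch and its width argument are invented for this example, and the optional percent.max and info parameters are not modelled.

```r
# Hypothetical sketch of a console progress bar
statusbar_sketch <- function(run, max.run, width = 20) {
  # accept either a counter or an element of the vector being iterated over
  if (length(max.run) > 1) {
    run <- match(run, max.run)
    max.run <- length(max.run)
  }
  filled <- floor(width * run / max.run)
  bar <- paste0(strrep("=", filled), strrep(" ", width - filled))
  line <- sprintf("[%s] %3.0f%%", bar, 100 * run / max.run)
  cat("\r", line, sep = "")      # "\r" jumps back to the start of the line
  if (run == max.run) cat("\n")  # finish the line after the last step
  invisible(line)
}

for (i in 1:16) {
  Sys.sleep(0.01)  # stand-in for real work
  statusbar_sketch(run = i, max.run = 16)
}

# the same function with an element of the iterator vector
half <- statusbar_sketch(run = "h", max.run = letters[1:16])
```

Because the bar is redrawn with "\r" on a single console line, several parallel workers writing at once would overwrite each other, which is in line with the drawback mentioned above.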

19. statworx_palette

This little helper is an addition to sci_palette(). We picked colors 1, 2, 3, 5 and 10 to create a flexible color palette. If you need 100 different colors, say no more!

In contrast to sci_palette(), the return value is a character vector. For example, if you want 16 colors:

statworx_palette(16, scheme = "old")
[1] "#013848" "#004C63" "#00617E" "#00759A" "#0087AB" "#008F9C" "#00978E" "#009F7F"
[9] "#219E68" "#659448" "#A98B28" "#ED8208" "#F36F0F" "#E45A23" "#D54437" "#C62F4B"

If we now plot these colors, we get a nice rainbow-like gradient.

library(ggplot2)

ggplot(plot_data, aes(x = X, y = Y)) +
  geom_point(pch = 16, size = 15, color = statworx_palette(16, scheme = "old")) +
  theme_minimal()

An additional feature is the reorder parameter, which samples the order of the colors so that neighbors might be a little easier to tell apart. And if you want to change the colors used, you can do so with basecolors.

ggplot(plot_data, aes(x = X, y = Y)) +
  geom_point(pch = 16, size = 15,
             color = statworx_palette(16, basecolors = c(4,8,10), scheme = "new")) +
  theme_minimal()


20. strsplit

This little helper extends the base R function strsplit, hence the same name! It is now possible to split before, after, or between a given delimiter. In the case of between, you have to specify two delimiters.

You can find an earlier version of this function in this blog post, where I describe the regular expressions used, in case you are interested.

Here is a small example of how to use the new strsplit.

text <- c("This sentence should be split between should and be.")

strsplit(x = text, split = " ")
strsplit(x = text, split = c("should", " be"), type = "between")
strsplit(x = text, split = "be", type = "before")
[[1]]
[1] "This"     "sentence" "should"   "be"       "split"    "between"  "should"   "and"     
[9] "be."

[[1]]
[1] "This sentence should"             " be split between should and be."

[[1]]
[1] "This sentence should " "be split "             "between should and "  
[4] "be."

21. to_na

This little helper is just a convenience function. During data preparation, you may end up with a vector containing infinite values such as Inf or -Inf, or even NaN values. Such values can (but don’t have to!) mess up your evaluations and models. Most functions, however, tend to be able to handle missing values. So this little helper takes such values and replaces them with NA.

A small example to give you the idea:

test <- list(a = c("a", "b", NA),
             b = c(NaN, 1,2, -Inf),
             c = c(TRUE, FALSE, NaN, Inf))

lapply(test, to_na)
$a
[1] "a" "b" NA 

$b
[1] NA  1  2 NA

$c
[1]  TRUE FALSE    NA

A quick tip on the side! Since there are different types of NA depending on the other values within a vector, you should check the format when applying to_na to groups or subsets.

test <- list(NA, c(NA, "a"), c(NA, 2.3), c(NA, 1L))
str(test)
List of 4
 $ : logi NA
 $ : chr [1:2] NA "a"
 $ : num [1:2] NA 2.3
 $ : int [1:2] NA 1
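
For numeric vectors, the replacement itself can be sketched with is.finite(). The function name to_na_sketch is hypothetical and the package’s implementation may differ; the point is that the inserted NA takes on the type of the vector it lands in.

```r
# Hypothetical sketch: replace NaN, Inf and -Inf by NA in numeric vectors
to_na_sketch <- function(x) {
  if (is.numeric(x)) {
    x[!is.finite(x)] <- NA  # is.finite() is FALSE for NaN, Inf, -Inf and NA
  }
  x
}

res <- to_na_sketch(c(NaN, 1, 2, -Inf))
res          # NA 1 2 NA
typeof(res)  # "double": the vector keeps its type, so the NAs are numeric NAs
```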

22. trim

This little helper removes leading and trailing whitespace from a string. With R version 3.2.0, trimws was introduced, which does exactly the same thing. That just shows it wasn’t a bad idea to write such a function. 😉

x <- c("  Hello world!", "  Hello world! ", "Hello world! ")
trim(x, lead = TRUE, trail = TRUE)
[1] "Hello world!" "Hello world!" "Hello world!"

The parameters lead and trail specify whether only the leading, only the trailing, or both kinds of whitespace are removed.
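
A possible way to implement the two switches with regular expressions, shown as a sketch (trim_sketch is a made-up name, and the package may use different expressions):

```r
# Hypothetical sketch: strip leading and/or trailing whitespace
trim_sketch <- function(x, lead = TRUE, trail = TRUE) {
  if (lead)  x <- sub("^\\s+", "", x)  # whitespace at the start
  if (trail) x <- sub("\\s+$", "", x)  # whitespace at the end
  x
}

x <- c("  Hello world!", "  Hello world! ", "Hello world! ")
trim_sketch(x)                # both sides trimmed
trim_sketch(x, lead = FALSE)  # only trailing whitespace removed
identical(trim_sketch(x), trimws(x))
```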

Conclusion

I hope the helfRlein package makes your work as much easier as it does ours here at statworx. If you have questions or input on the package, feel free to send us an email at blog@statworx.com

Jakob Gepp

Did you know that you can transform plain old static ggplot graphs into animated ones? Well, you can with the help of the package gganimate by RStudio’s Thomas Lin Pedersen and David Robinson, and the results are amazing! My STATWORX colleagues and I are very impressed by how effortlessly all kinds of geoms are transformed into suuuper smooth animations. That’s why in this post I will provide a short overview of some of the wonderful functionalities of gganimate. I hope you’ll enjoy them as much as we do!

Since Valentine’s Day is just around the corner, we’re going to explore the Speed Dating Experiment dataset compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar. Hopefully, we’ll learn about gganimate as well as how to find our Valentine. If you like, you can download the data from Kaggle.

Defining the basic animation: transition_*

How are static plots put into motion? Essentially, gganimate creates data subsets, which are plotted individually and constitute the substantial frames, which, when played consecutively, create the basic animation. The results of gganimate are so seamless because gganimate takes care of the so-called tweening for us by calculating data points for transition frames displayed in-between frames with actual input data.

The transition_* functions define how the data subsets are derived and thus define the general character of any animation. In this blogpost we’re going to explore three types of transitions: transition_states(), transition_reveal() and transition_filter(). But let’s start at the beginning.

We’ll start with transition_states(). Here the data is split into subsets according to the categories of the variable provided to the states argument. If several rows of a dataset pertain to the same unit of observation and should be identifiable as such, a grouping variable defining the observation units needs to be supplied. Alternatively, an identifier can be mapped to any other aesthetic.

Please note that, to ensure the readability of this post, all text concerning the interpretation of the speed dating data is written in italics. If you’re not interested in that part, you can simply skip those paragraphs. For the data prep, I’d like to refer you to my GitHub.

First, we’re going to explore what the participants of the Speed Dating Experiment look for in a partner. Participants were asked to rate the importance of attributes in a potential date by allocating a budget of 100 points to several characteristics, with higher values denoting a higher importance. The participants were asked to rate the attributes according to their own views. Further, the participants were asked to rate the same attributes according to the presumed wishes of their same-sex peers, meaning they allocated the points in the way they supposed their average same-sex peer would do.

We’re going to plot all of these ratings (x-axis) for all attributes (y-axis). Since we want to compare the individual wishes to the individually presumed wishes of peers, we’re going to transition between both sets of ratings. Color always indicates the personal wishes of a participant. A given bubble indicates the rating of one specific participant for a given attribute, switching between one’s own wishes and the wishes assumed for peers.

## Static Plot
# ...characteristic vs. (presumed) rating...
# ...color&size mapped to own rating, grouped by ID
plot1 <- ggplot(df_what_look_for, 
       aes(x = value,
           y = variable,
           color = own_rating, # bubbles are always colored according to own wishes
           size = own_rating,
           group = iid)) + # identifier of observations across states
  geom_jitter(alpha = 0.5, # to reduce overplotting: jittering & alpha
              width = 5) + 
  scale_color_viridis(option = "plasma", # use viridis' plasma scale
                      begin = 0.2, # limit range of used hues
                      name = "Own Rating") +
  scale_size(guide = "none") + # no legend for size
  labs(y = "", # no axis label
       x = "Allocation of 100 Points",  # x-axis label
       title = "Importance of Characteristics for Potential Partner") +
  theme_minimal() +  # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot1 + 
  transition_states(states = rating) # animate contrast subsets acc. to variable rating  

First off, if you’re a little confused about which state is which, please be patient: we’ll explore dynamic labels in the section about ‘frame variables’.

It’s apparent that different people look for different things in a partner. Yet attractiveness is often prioritized over other qualities. But of all attributes, the importance of attractiveness varies most strongly between individuals. Interestingly, people are quite aware that their peers’ ratings might differ from their own views. Further, the collective presumptions (= the mean values) about others in particular are not completely off, but show higher variance than the actual ratings.

So there is hope for all of us that somewhere out there somebody is looking for someone just as ambitious or just as intelligent as ourselves. However, it’s not always the inner values that count.

gganimate allows us to tailor the details of the animation according to our wishes. With the argument transition_length we can define the relative length of the transition from one real subset of data to the next, and with state_length how long, relatively speaking, each subset of original data is displayed. Only if the wrap argument is set to TRUE will the last frame be morphed back into the first frame of the animation, creating an endless and seamless loop. Of course, the arguments of different transition functions may vary.

## Animated Plot
# ...replace default arguments
plot1 + 
  transition_states(states = rating,
                    transition_length = 3, # 3/4 of total time for transitions
                    state_length = 1, # 1/4 of time to display actual data
                    wrap = FALSE) # no endless loop

Styling transitions: ease_aes

As mentioned before, gganimate takes care of tweening and calculates additional data points to create smooth transitions between successively displayed points of actual input data. With ease_aes we can control which so-called easing function is used to ‘morph’ original data points into each other. The default argument is used to declare the easing function for all aesthetics in a plot. Alternatively, easing functions can be assigned to individual aesthetics by name. Among others, quadratic, cubic, sine and exponential easing functions are available, with the linear easing function being the default. These functions can be customized further by adding a modifier suffix: with -in the function is applied as-is, with -out the function is applied in reverse, and with -in-out the function is applied as-is in the first half of the transition and reversed in the second half.
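
To make the three modifiers more concrete, here is a small base R sketch of a cubic easing function and its -in, -out and -in-out variants, applied to a transition from 10 to 20. It follows the common textbook definitions of easing curves and uses made-up helper names (ease_in, ease_out, ease_in_out, tween); it is not gganimate’s internal implementation.

```r
# t runs from 0 (start of the transition) to 1 (end of the transition)
ease_in  <- function(t) t^3                 # "cubic-in": function as-is
ease_out <- function(t) 1 - ease_in(1 - t)  # "cubic-out": applied in reverse
ease_in_out <- function(t) {                # as-is first half, reversed second half
  ifelse(t < 0.5, ease_in(2 * t) / 2, 1 - ease_in(2 * (1 - t)) / 2)
}

# position between two real data points at progress t
tween <- function(from, to, t, ease = identity) from + (to - from) * ease(t)

t <- seq(0, 1, by = 0.25)
tween(10, 20, t)               # linear: evenly spaced steps
tween(10, 20, t, ease_in)      # slow start, fast finish
tween(10, 20, t, ease_out)     # fast start, slow finish
tween(10, 20, t, ease_in_out)  # slow at both ends, fast in the middle
```

All variants start at 10 and end at 20; they only differ in how quickly the intermediate frames move between the two values.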

Here I played around with an easing function that models the bouncing of a ball.

## Animated Plot
# ...add special easing function
plot1 + 
  transition_states(states = rating) + 
  ease_aes("bounce-in") # bouncy easing function, as-is

Dynamic labelling: {frame variables}

To ensure that we, mesmerized by our animations, do not lose track of what is shown, gganimate provides so-called frame variables that supply metadata about the animation as a whole or about the previous/current/next frame. The frame variables, when wrapped in curly brackets, are available for string literal interpretation within all plot labels. For example, we can label each frame with the value of the states variable that defines the currently (or soon to be) displayed subset of actual data:

## Animated Plot
# ...add dynamic label: subtitle with current/next value of states variable
plot1 +
  labs(subtitle = "{closest_state}") + # add frame variable as subtitle
  transition_states(states = rating) 

The set of available variables depends on the transition function. To get both the names and the values of the frame variables available for an animation (by default the last one), call the frame_vars() function.

Indicating previous data: shadow_*

To accentuate the interconnection of different frames, we can apply one of gganimate’s ‘shadows’. By default, shadow_null() is used, i.e. no shadow is added to animations. In general, shadows display data points of past frames in different ways: shadow_trail() creates a trail of evenly spaced data points, while shadow_mark() displays all raw data points.

We’ll use shadow_wake() to create a little ‘wake’ of past data points which gradually shrink and fade away. The argument wake_length allows us to set the length of the wake, relative to the total number of frames. Since the wakes overlap, the transparency of geoms might need adjustment. Obviously, for plots with lots of data points, shadows can impede intelligibility.

plot1B + # same as plot1, but with alpha = 0.1 in geom_jitter
  labs(subtitle = "{closest_state}") +  
  transition_states(states = rating) +
  shadow_wake(wake_length = 0.5) # adding shadow

The benefits of transition_*

While I simply love the visuals of animated plots, I think they also offer actual improvements. Compared to faceting, transition_states makes it easier to track individual observations through the transitions. Further, no matter how many subplots we want to explore, we neither have to clutter our document with a multitude of plots nor put up with tiny plots.

Similarly, transition_reveal adds value for time series by mapping a time variable not only to one of the axes but also to actual time: the length of the transition between two frames showing actual input data corresponds to the relative time difference between the mapped events. To illustrate this, let’s take a quick look at the 'success' of all the speed dates across the different speed dating events:

## Static Plot
# ... date of event vs. interest in second date for women, men or couples
plot2 <- ggplot(data = df_match,
                aes(x = date, # date of speed dating event
                    y = count, # interest in 2nd date
                    color = info, # which group: women/men/reciprocal
                    group = info)) +
  geom_point(aes(group = seq_along(date)), # needed, otherwise transition doesn't work
             size = 4, # size of points
             alpha = 0.7) + # slightly transparent
  geom_line(aes(lty = info), # line type according to group
            alpha = 0.6) + # slightly transparent
  labs(y = "Interest After Speed Date",
       x = "Date of Event",
       title = "Overall Interest in Second Date") +
  scale_linetype_manual(values = c("Men" = "solid", # assign line types to groups
                                   "Women" = "solid",
                                   "Reciprocal" = "dashed"),
                        guide = "none") + # no legend for linetypes (guide = FALSE is deprecated in recent ggplot2)
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + # y-axis in %
  scale_color_manual(values = c("Men" = "#2A00B6", # assign colors to groups
                                "Women" = "#9B0E84",
                                "Reciprocal" = "#E94657"),
                     name = "") +
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(), # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot
plot2 +
  transition_reveal(along = date) 
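
To make the idea of mapping dates to animation time more concrete, the following base R sketch places three hypothetical event dates on a timeline of 100 frames, in proportion to the time elapsed between them. The dates, the frame count and the rounding rule are assumptions for illustration only; gganimate’s internal frame computation may differ in detail.

```r
# Hypothetical, irregularly spaced event dates (not taken from the speed dating data)
event_dates <- as.Date(c("2004-01-01", "2004-02-01", "2004-06-01"))
nframes <- 100 # assumed total number of frames

# Days elapsed since the first event
elapsed <- as.numeric(event_dates - min(event_dates))

# Share of the total time span that has passed at each event (0 = start, 1 = end)
rel_time <- elapsed / max(elapsed)

# Frame at which each event would appear if frames are spread evenly over time
reveal_frame <- round(rel_time * (nframes - 1)) + 1

data.frame(event_dates, elapsed_days = elapsed, reveal_frame)
```

The second event shows up after roughly a fifth of the animation, because 31 of the 152 days have passed by then.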

Displayed are the percentages of women and men who were interested in a second date after each of their speed dates as well as the percentage of couples in which both partners wanted to see each other again.

Most of the time, women were more interested in second dates than men. Further, the attraction between dating partners often didn’t go both ways: cases in which both partners of a couple wanted a second date were always far less frequent than the general interest of either men or women. While it’s hard to identify the most romantic time of the year, according to the data there seemed to be a lull in romance in early autumn. Maybe everybody was still heartbroken over their summer fling? Fortunately, Valentine’s Day is in February.

Another very handy option is transition_filter(), a great way to present selected key insights of your data exploration. Here the animation browses through data subsets defined by a series of filter conditions, and it’s up to you which subsets you want to stage. The data is filtered according to logical expressions passed to transition_filter(): all rows for which an expression evaluates to TRUE are included in the respective subset. We can assign names to the logical expressions, which can then be accessed as frame variables. If the keep argument is set to TRUE, the data of previous frames remains displayed in later frames.
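
The filtering logic itself can be sketched in base R. The snippet below is not gganimate’s implementation; it merely mimics the idea of named logical expressions that each select a subset of rows. The toy data frame and its values are made up for illustration.

```r
# Toy data: hypothetical ratings, not the speed dating data
df_toy <- data.frame(id = 1:5,
                     Attractive = c(8.5, 6.0, 7.0, 9.1, 4.2))

# Named logical expressions, analogous to the arguments of transition_filter()
filters <- list("More Attractive" = quote(Attractive > 7),
                "Less Attractive" = quote(Attractive <= 7))

# Evaluate each expression within the data and keep the rows where it is TRUE
subsets <- lapply(filters, function(expr) df_toy[eval(expr, df_toy), ])

# Number of rows per named subset
sapply(subsets, nrow)
```

Each list element corresponds to one stage of the animation, and its name is what a label such as "{closest_filter}" would display.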

I want to explore whether one’s own characteristics relate to the attributes one looks for in a partner. Do opposites attract? Or do birds of a feather (want to) flock together?

Displayed below is the importance the speed dating participants assigned to different attributes of a potential partner. Contrasted are subsets of participants who were rated especially funny, attractive, sincere, intelligent or ambitious by their speed dating partners. The rating scale went from 1 = low to 10 = high, so I consider values above 7 to be rather outstanding.

## Static Plot
# ...importance ratings for different attributes
plot3 <- ggplot(data = df_ratings, 
                 aes(x = variable, # different attributes
                     y = own_rating, # importance regarding potential partner
                     size = own_rating, 
                     color = variable, # different attributes
                     fill = variable)) +
  geom_jitter(alpha = 0.3) +
  labs(x = "Attributes of Potential Partner", # x-axis label
       y = "Allocation of 100 Points (Importance)",  # y-axis label
       title = "Importance of Characteristics of Potential Partner", # title
       subtitle = "Subset of {closest_filter} Participants") + # dynamic subtitle 
  scale_color_viridis_d(option = "plasma", # use viridis scale for color 
                        begin = 0.05, # limit range of used hues
                        end = 0.97,
                        guide = "none") + # don't show legend
  scale_fill_viridis_d(option = "plasma", # use viridis scale for filling
                       begin = 0.05, # limit range of used hues
                       end = 0.97, 
                       guide = "none") + # don't show legend
  scale_size_continuous(guide = "none") + # don't show legend
  theme_minimal() + # apply minimal theme
  theme(panel.grid = element_blank(),  # remove all lines of plot raster
        text = element_text(size = 16)) # increase font size

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 + # plot3 already contains geom_jitter(), so no second layer is added here
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 7) 

Of course, the number of extraordinarily attractive, intelligent or funny participants is relatively low. Surprisingly, there seems to be little difference between what the average low and high scoring participants look for in a partner. Rather, the lower scoring group includes more people with outlying expectations regarding certain characteristics. Individual tastes seem to vary more or less independently of individual characteristics.

Styling the (dis)appearance of data: enter_* / exit_*

Especially if the displayed subsets of data overlap only partially or not at all, it can be helpful to underscore this visually. A good way to do this is to use the enter_*() and exit_*() functions, which let us style the entry and exit of data points that do not persist between frames.

There are many combinable options: data points can simply (dis)appear (the default), fade (enter_fade()/exit_fade()), grow or shrink (enter_grow()/exit_shrink()), gradually change their color (enter_recolor()/exit_recolor()), fly (enter_fly()/exit_fly()) or drift (enter_drift()/exit_drift()) in and out.

We can use these stylistic devices to emphasize changes in the data underlying different frames. I used exit_fade() to let data points that are no longer included gradually fade away while they fly out of the plot area on a vertical route (y_loc = 100). Data points re-entering the sample fly in vertically from the bottom of the plot (y_loc = 0):

## Animated Plot 
# ...show ratings for different subsets of participants
plot3 + # plot3 already contains geom_jitter(), so no second layer is added here
  transition_filter("More Attractive" = Attractive > 7, # adding named filter expressions
                    "Less Attractive" = Attractive <= 7,
                    "More Intelligent" = Intelligent > 7,
                    "Less Intelligent" = Intelligent <= 7,
                    "More Fun" = Fun > 7,
                    "Less Fun" = Fun <= 7) +
  enter_fly(y_loc = 0) + # entering data: fly in vertically from bottom
  exit_fly(y_loc = 100) + # exiting data: fly out vertically to top...
  exit_fade() # ...while color is fading

Finetuning and saving: animate() & anim_save()

Fortunately, gganimate makes it very easy to finalize and save our animations. We can pass our finished gganimate object to animate() to, amongst other things, define the number of frames to be rendered (nframes), the rate of frames per second (fps) and/or the number of seconds the animation should last (duration). We can also define the device in which the individual frames are rendered (the default is device = "png", but all popular devices are available). Further, we can define arguments that are passed on to the device, such as width or height. Note that simply printing a gganimate object is equivalent to passing it to animate() with default arguments. If we plan to save our animation, the renderer argument matters: the function anim_save() lets us effortlessly save any gganimate object, but only if it was rendered with magick_renderer() or the default gifski_renderer().

The function anim_save() is quite straightforward to use. We can define the filename and path (defaults to the current working directory) as well as the animation object (defaults to the most recently rendered animation).

# create a gganimate object
gg_animation <- plot3 +
  transition_filter("More Attractive" = Attractive > 7,
                    "Less Attractive" = Attractive <= 7) 

# adjust the animation settings 
animate(gg_animation, 
        width = 900, # 900px wide
        height = 600, # 600px high
        nframes = 200, # 200 frames
        fps = 10) # 10 frames per second

# save the last created animation to the current directory 
anim_save("my_animated_plot.gif")
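
As a quick sanity check on these settings, the playing time follows directly from the number of frames and the frame rate. The small calculation below uses the values from the call above; the variable names are only illustrative.

```r
# Settings as passed to animate() above
nframes <- 200 # total number of frames
fps <- 10      # frames per second

# Resulting length of the animation in seconds
duration <- nframes / fps
duration
```

With 200 frames at 10 frames per second the animation runs for 20 seconds, so it is enough to specify two of the three quantities.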

Conclusion (and a Happy Valentine’s Day)

I hope this blog post gave you an idea of how to use gganimate to upgrade your own ggplots to beautiful and informative animations. I only scratched the surface of gganimate’s functionality, so please do not mistake this post for an exhaustive description of the presented functions or the package. There is much out there for you to explore, so don’t wait any longer and get started with gganimate!

But even more important: don’t wait on love. The speed dating data shows that most likely there’s someone out there looking for someone just like you. So from everyone here at STATWORX: Happy Valentine’s Day!

 

## 8 bit heart animation
animation2 <- ggplot(data = df_eight_bit_heart %>% # includes color and x/y position of pixels 
                       dplyr::mutate(id = dplyr::row_number()), # create row number as ID  
                     aes(x = x, 
                         y = y,
                         color = color,
                         group = id)) +
  geom_point(size = 18, # depends on height & width of animation
             shape = 15) + # square
  scale_color_manual(values = c("black" = "black", # map values of color to actual colors
                                "red" = "firebrick2",
                                "dark red" = "firebrick",
                                "white" = "white"),
                     guide = "none") + # do not include legend
  theme_void() + # remove everything but geom from plot
  transition_states(-y, # reveal from high to low y values 
                    state_length = 0) +
  shadow_mark() + # keep all past data points
  enter_grow() + # new data grows 
  enter_fade() # new data starts without color

animate(animation2, 
        width = 250, # depends on size defined in geom_point 
        height = 250, # depends on size defined in geom_point 
        end_pause = 15) # pause at end of animation

 

 

Lea Waniek


In the last Docker tutorial, Olli presented how to build a Docker image of R-Base scripts with rocker and how to run them in a container. Based on that, I’m going to discuss how to automate the process with a bash/shell script. Since we usually use containers to deploy our apps at STATWORX, I created a small test app with R Shiny to be saved in a test container. It is, of course, possible to store any other application with this automated script as well. I also created a repository on our blog GitHub, where you can find all files and the test app.

Feel free to test and use any of its content. If you are interested in writing a setup script yourself, note that it is possible to use other programming languages, such as Python, as well.

The idea behind it

$ docker-machine ls
NAME          ACTIVE   DRIVER       STATE     URL   SWARM   DOCKER    ERRORS
Dataiku       -        virtualbox   Stopped                 Unknown   
default       -        virtualbox   Stopped                 Unknown   
ShowCase      -        virtualbox   Stopped                 Unknown   
SQLworkshop   -        virtualbox   Stopped                 Unknown   
TestMachine   -        virtualbox   Stopped                 Unknown   

$ docker-machine start TestMachine
Starting "TestMachine"...
(TestMachine) Check network to re-create if needed...
(TestMachine) Waiting for an IP...
Machine "TestMachine" was started.
Waiting for SSH to be available...
Detecting the provisioner...
Started machines may have new IP addresses. You may need to re-run the `docker-machine env` command.

$ eval $(docker-machine env --no-proxy TestMachine)

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             
cfa02575ca2c        testimage           "/sbin/my_init"     2 weeks ago         
STATUS                            PORTS                    NAMES
Exited (255) About a minute ago   0.0.0.0:2000->3838/tcp   testcontainer

$ docker start testcontainer
testcontainer

$ docker ps
...

Building and rebuilding Docker images over and over again every time you make some changes to your application can get a little tedious at times, especially if you type in the same old commands all the time. Olli discussed the advantage of creating an intermediary image for the most time-consuming processes, like installing R packages to speed things up during content creation. That’s an excellent practice, and you should try and do this for every project viable. But how about speeding up the containerisation itself? A small helper tool is needed that, once it’s written, does all the work for you.

The tools to use

To create Docker images and containers, you need to install Docker on your computer. If you want to test or use all the material provided in this blog post and on our blog github, you should also install VirtualBox, R and RStudio, if you do not already have them. If you use Windows (10) as your Operating System, you also need to install the Windows Subsystem for Linux. Alternatively, you can create your own script files with PowerShell or something similar.

The tool itself is a bash/shell script that builds and runs Docker containers for you. All you have to do to use it is copy the docker_setup executable into your project directory and execute it. The only thing the tool requires from you afterwards is some naming input.

If the execution for some reason fails or produces errors, try to run the tool via the terminal.

source ./docker_setup

To replicate the script or start a new bash/shell script yourself, open your preferred text editor, create a new text file, place the shebang line #!/bin/bash at the very top of it and save it. Next, open your terminal, navigate to the directory where you just saved your script and make it executable by typing chmod +x your_script_name. To test if it works correctly, you can e.g. add the line echo 'it works!' below the shebang line.

#!/bin/bash
echo 'it works!'

If you want to check out all available mode options, visit the wiki and for a complete guide visit the Linux Shell Scripting Tutorial.

The code that runs it

If you open the docker_setup executable with your preferred text editor, the code might feel a little overwhelming or confusing at first, but it is pretty straightforward.

#!/bin/bash

# This is a setup bash for docker containers!
# activate via: chmod 0755 setup_bash or chmod +x setup_bash
# navigate to wd docker_contents
# execute in terminal via source ./setup_bash

echo ""
echo "Welcome, You are executing a setup script bash for docker containers."
echo ""

echo "Do you want to use the Default from the global configurations?"
echo ""
source global_conf.sh
echo "machine name = $machine_name"
echo "container = $container_name"
echo "image = $image_name"
echo "app name = $app_name"
echo "password = $password_name"
echo ""

docker-machine ls
echo ""
read -p "What is the name of your docker-machine [default]? " machine_name
echo ""
if [[ "$(docker-machine status "$machine_name" 2> /dev/null)" == "" ]]; then
    # no status returned: the machine does not exist yet
    echo "creating machine..." \
        && docker-machine create "$machine_name"
else
    echo "machine already exists, starting machine..." \
        && docker-machine start "$machine_name"
fi
echo ""
echo "activating machine..."
eval $(docker-machine env --no-proxy $machine_name)
echo ""

docker ps -a
echo ""
read -p "What is the name of your docker container? " container_name
echo ""

docker image ls
echo ""
read -p "What is the name of your docker image? (lower case only!!) " image_name
echo ""

The main code structure rests on nested if statements. Unlike a manual Docker setup via the terminal, the script needs to account for many different possibilities and even leave some margin for error. The first if statement, for example (shown in the code above), checks whether the requested docker-machine already exists. If the machine does not exist, it is created. If it does exist, it is simply started.

The code elements and commands used are even more straightforward. The echo command prints some information or a blank line for better readability. The read command reads user input and stores it in a variable, which is then used wherever the script needs it. Most other code elements are Docker commands and are essentially the same as the ones entered manually via the terminal. If you are interested in learning more about Docker commands, check the documentation and Olli’s awesome blog post.

The git repository

The focal point of the Git repository on our blog GitHub is the automated Docker setup, but it also contains some other conveniences and will hopefully grow into an entire collection of useful scripts. I am aware that there are potentially better, faster and more convenient solutions for everything included in the repository, but if we view it as an exercise and a form of creative exchange, I think we can get some use out of it.

The docker_error_logs executable allows for quick troubleshooting and storage of log files if your program or app fails to work within your docker container.

The git_repair executable is not fully tested yet and should be used with care. The idea is to quickly check whether your local project or repository is connected to a corresponding GitHub repository, given a URL, and if not, to 'repair' the connection. It can further manage git pulls, commits and pushes for you, but again, please use it carefully.

Next projects to come

As mentioned, I plan on further expanding the collection and usefulness of our blog github soon. In the next step I will add more convenience to the docker setup by adding a separate file that provides the option to write and store default values for repeated executions. So stay tuned and visit our STATWORX Blog again soon. Until then, happy coding.

Stephan Emmer


In the last post of this series, we dealt with axis systems. In this post, we are also dealing with axes, but this time we are taking a look at the position scales of dates, times and datetimes. Since we at STATWORX are often forecasting, and thus plotting, time series, this is an important issue for us. The choice of axis ticks and labels can make the message conveyed by a plot clearer. Oftentimes, some points in time are more important than others, e.g. due to their business implications, and should be easily identified. Unequivocal yet parsimonious labeling is key to the readability of any plot. Luckily, ggplot2 enables us to do so for dates and times with almost no effort at all.

We are using ggplot2’s economics data set. Our base plot looks like this:

library(ggplot2)

base_plot <- ggplot(data = economics) +
  geom_line(aes(x = date, y = unemploy), 
            color = "#09557f",
            alpha = 0.6,
            size = 0.6) +
  labs(x = "Date", 
       y = "US Unemployed in Thousands",
       title = "Base Plot") +
  theme_minimal()

Scale Types

As of now, ggplot2 supports three date and time classes: POSIXct, Date and hms. Depending on the class at hand, axis ticks and labels can be controlled by using scale_*_datetime, scale_*_date or scale_*_time, respectively. Depending on whether one wants to modify the x or the y axis, scale_x_* or scale_y_* is to be employed. For the sake of simplicity, only scale_x_date is used in the examples, but all discussed arguments work just the same for all mentioned scales.
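
Which scale applies therefore depends only on the class of the plotted variable. The following base R sketch checks the class of two example values and maps it to the matching scale name. The helper suggest_scale() is hypothetical and not part of ggplot2; it only spells out the correspondence described above.

```r
# Example values of two of the supported classes
x_date <- as.Date("2000-01-01")
x_datetime <- as.POSIXct("2000-01-01 12:00:00", tz = "UTC")

# Hypothetical helper (not a ggplot2 function): name the matching x-axis scale
suggest_scale <- function(x) {
  if (inherits(x, "Date")) {
    "scale_x_date"
  } else if (inherits(x, "POSIXct")) {
    "scale_x_datetime"
  } else if (inherits(x, "hms")) {
    "scale_x_time"
  } else {
    "scale_x_continuous"
  }
}

suggest_scale(x_date)
suggest_scale(x_datetime)
```

For the economics data, class(economics$date) is "Date", which is why scale_x_date is used throughout this post.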

Minor Modifications

Let’s start easy. With the argument limits the range of the displayed dates or time can be set. Two values of the correct date or time class have to be supplied.

base_plot +
  scale_x_date(limits = as.Date(c("1980-01-01", "2000-01-01"))) +
  ggtitle('limits = as.Date(c("1980-01-01","2000-01-01"))')

The expand argument ensures that there is some distance between the displayed data and the axes. The multiplicative constant is multiplied with the range of the displayed data, the additive is multiplied with one unit of the depicted data. The sum of the two resulting distances is added to the axis limits as padding. The resulting empty space is added at the left and right end of the x-axis or the top and bottom of the y-axis.

base_plot +
  scale_x_date(expand = c(0, 5000)) + # 5000/365 = 13.69863 years
  ggtitle("expand = c(0, 5000)")
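
The padding arithmetic can be reproduced by hand. The sketch below assumes axis limits from 1980 to 2000 and the illustrative constants c(0.05, 10); for a date axis, one unit is one day. It only mirrors the description above and does not call ggplot2.

```r
# Assumed axis limits and expansion constants (illustrative values)
limits <- as.Date(c("1980-01-01", "2000-01-01"))
expand <- c(0.05, 10) # multiplicative and additive constant

# Range of the displayed data in days
data_range <- as.numeric(diff(limits))

# Padding added on each side: share of the range plus a fixed number of days
padding <- expand[1] * data_range + expand[2]

# Limits after padding
padded_limits <- c(limits[1] - padding, limits[2] + padding)

data_range
padding
```

Here the range spans 7305 days, so the padding is 0.05 * 7305 + 10 = 375.25 days on each side. With expand = c(0, 5000) as in the plot above, only the additive part remains, i.e. 5000 days.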

The position argument defines where the labels are displayed: either "left" or "right" of the y-axis, or at the "top" or "bottom" of the x-axis.

base_plot +
  scale_x_date(position = "top") +
  ggtitle('position = "top"')

Axis Ticks and Grid Lines

More essential than the cosmetic modifications discussed so far are the axis ticks. There are several ways to define the axis ticks of dates and times. There are the labeled major breaks and the minor breaks, which are not labeled but marked by grid lines. These can be customized with the arguments breaks and minor_breaks, respectively. Both breaks and minor_breaks can be defined by a vector of exact positions (of the date or time class of the axis) or by a function that takes the axis limits as input and returns the breaks. Alternatively, the arguments can be set to NULL to display no (minor) breaks at all. These options are especially handy if irregular intervals between breaks are desired.

base_plot +
  scale_x_date(breaks = as.Date(c("1970-01-01", "2000-01-01")),
               minor_breaks = as.Date(c("1975-01-01", "1980-01-01",
                                        "2005-01-01", "2010-01-01"))) +
  ggtitle("(minor_)breaks = fixed Dates")

base_plot +
  scale_x_date(breaks = function(x) seq.Date(from = min(x),
                                             to = max(x),
                                             by = "12 years"),
               minor_breaks = function(x) seq.Date(from = min(x),
                                                   to = max(x),
                                                   by = "2 years")) +
  ggtitle("(minor_)breaks = custom function")

base_plot +
  scale_x_date(breaks = NULL,
               minor_breaks = NULL) +
  ggtitle("(minor_)breaks = NULL")
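
To see what such a break function returns, it can be called directly on a pair of limits. The limits below are an assumption that roughly matches the range of the economics data; in an actual plot, ggplot2 passes the (possibly expanded) axis limits, so the resulting breaks may differ slightly.

```r
# The break function from the example above
break_fun <- function(x) seq.Date(from = min(x), to = max(x), by = "12 years")

# Assumed axis limits (roughly the date range of the economics data)
limits <- as.Date(c("1967-07-01", "2015-04-01"))

# Breaks start at the lower limit and advance in steps of 12 years
major_breaks <- break_fun(limits)
major_breaks
```

The sequence stops at the last date that does not exceed the upper limit, which gives four breaks here, the last one being 2003-07-01.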

Another very convenient way to define regular breaks is to use the date_breaks and date_minor_breaks arguments. Both take a string that combines an integer with a time unit (either "sec", "min", "hour", "day", "week", "month" or "year", optionally followed by "s") to specify the interval between breaks, e.g. "10 years".

base_plot +
  scale_x_date(date_breaks = "10 years",
               date_minor_breaks = "2 years") +
  ggtitle('date_(minor_)breaks = "x years"')

If both are given, date_(minor_)breaks overrules (minor_)breaks.

Axis Labels

Similar to the axis ticks, the format of the displayed labels can be defined via either the labels or the date_labels argument. The labels argument can be set to NULL if no labels should be displayed, or to a function that takes the breaks as input and returns the labels. Alternatively, a character vector with labels for all the breaks can be supplied. This can be very useful, since virtually any character vector can be used to label the breaks. The number of labels must be the same as the number of breaks. If the breaks are defined by a function, by date_breaks or by default, the labels must be defined by a function as well.

base_plot +
  scale_x_date(date_breaks = "15 years",
               labels = function(x) paste((x-365), "(+365 days)")) +
  ggtitle("labels = custom function")

base_plot +
  scale_x_date(breaks = as.Date(c("1970-01-01", "2000-01-01")),
               labels = c("~ '70", "~ '00")) +
  ggtitle("labels = character vector")
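
A label function is simply a function that receives the vector of breaks and returns one string per break. Calling the function from the first example on two assumed break dates shows what ends up on the axis.

```r
# The label function from the example above
label_fun <- function(x) paste((x - 365), "(+365 days)")

# Assumed break positions, as ggplot2 would pass them to the function
example_breaks <- as.Date(c("1970-01-01", "2000-01-01"))

# One label per break
example_labels <- label_fun(example_breaks)
example_labels
```

Because the function returns exactly as many labels as it receives breaks, it also works when the number of breaks is not known in advance.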

Furthermore and very conveniently, the format of the labels can be controlled via the argument date_labels set to a string of formatting codes, defining order, format and elements to be displayed:

Code   Meaning
%S     second (00-59)
%M     minute (00-59)
%l     hour, in 12-hour clock (1-12)
%I     hour, in 12-hour clock (01-12)
%H     hour, in 24-hour clock (00-23)
%a     day of the week, abbreviated (Mon-Sun)
%A     day of the week, full (Monday-Sunday)
%e     day of the month (1-31)
%d     day of the month (01-31)
%m     month, numeric (01-12)
%b     month, abbreviated (Jan-Dec)
%B     month, full (January-December)
%y     year, without century (00-99)
%Y     year, with century (0000-9999)

Source: Wickham 2009 p. 99

base_plot +
  scale_x_date(date_labels = "%Y (%b)") +
  ggtitle('date_labels = "%Y (%b)"')
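
The formatting codes can be tried out without any plot, since base R’s format() understands the same codes for Date objects. The date below is an arbitrary example. Codes that produce names, such as %b or %A, depend on the locale of the R session, so only numeric codes are used here.

```r
# Arbitrary example date
d <- as.Date("1990-03-05")

# Combine codes and literal characters in any order
iso_style <- format(d, "%Y-%m-%d") # year with century, month, day
short_style <- format(d, "%d.%m.%y") # day, month, year without century

iso_style
short_style
```

Whatever string works here can be passed to date_labels unchanged.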

The choice of axis ticks and labels might seem trivial. However, one should not underestimate the amount of confusion that can be caused by too many, too few or poorly positioned axis ticks and labels. Further, economical yet clear labeling of axis ticks can increase the readability and visual appeal of any time series plot immensely. Since it is so easy to tweak the date and time axes in ggplot2, there is simply no excuse not to do so.

References

  • Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer.

 

Lea Waniek
