Show me your pipe!

Tobias Krabel Blog, Data Science, Statistics

At STATWORX, we all love R, so much so that we decided to visit eRum 2018, an R conference hosted in Budapest! And just as much as we love R, we love the piping operator %>%, as it makes our R code much neater.

I guess many of you have already seen it in action, but you can never have enough of its magic. If you work with data frames and dplyr, piping comes in particularly handy, as the two work especially well together. For example, your data aggregation script may look like this:

library(dplyr)  # provides group_by(), summarise(), arrange() and re-exports %>%

# Aggregation, the neat way!
mtcars %>% 
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp)) %>%
  arrange(desc(mean_hp))

Or like this:

# Aggregation, the horrific way!
arrange(summarise(group_by(mtcars, cyl), mean_hp = mean(hp)), desc(mean_hp))

But the pipe (which comes from the magrittr package) works independently of dplyr and should be used wherever appropriate. This blog post is an ode to the pipe, as it makes your code much easier to read and write (notice the rhyme!). However, making code beautiful is not my main focus here. Instead, I want to offer something even more valuable: a deeper understanding of R's internals.
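To see that the pipe does not need dplyr at all, here is a small sketch that uses only base R functions together with magrittr. The numbers and the variable names nested and piped are made up for illustration; the point is merely that x %>% f(y) is another way of writing f(x, y).

```r
library(magrittr)

# Nested call: has to be read from the inside out
nested <- round(sqrt(sum(c(1, 4, 9))), 2)

# Piped call: reads from left to right, one step at a time
piped <- c(1, 4, 9) %>%
  sum() %>%
  sqrt() %>%
  round(2)

# Both notations compute exactly the same value
identical(nested, piped)
```

Each step receives the result of the previous step as its first argument, which is all the pipe does under the hood.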

What piping teaches you about R

You may have worked with %>% and reached a point where it seemed you had to break the pipe because some operation appeared to be un-pipeable. Rest assured: everything is pipeable! To understand why, you need to know the following: every operator in R is a function!

If you don't believe me, then watch this:

# Adding 5 to a vector ...
1:10 + 5 
# ... can be accomplished with this notation, too!
`+`(1:10, 5)

Having said that, piping is a doddle.

1:10 %>% `+`(., 5)

Note that the . here (which stands for the result of the previous operation, in our example the vector 1:10 itself) is optional. I used it, however, to make the similarity between the piped and non-piped versions apparent. This newly acquired knowledge is not really beneficial in this context, but it makes the point clear: any operator, be it +, -, *, /, ==, or !, can be expressed as a function and therefore incorporated into a pipe.
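The same trick carries over to other operators, including the subsetting operator [. The following is a minimal sketch with a made-up vector called values; each piped line is compared against its ordinary infix counterpart.

```r
library(magrittr)

values <- c(3, 8, 15)  # arbitrary example data

# Arithmetic operators called as functions inside a pipe
doubled <- values %>% `*`(2)   # same as values * 2
reduced <- values %>% `-`(1)   # same as values - 1

# Subsetting is an operator too, and therefore a function
second <- values %>% `[`(2)    # same as values[2]

# Compare against the usual infix notation
identical(doubled, values * 2)
identical(reduced, values - 1)
identical(second, values[2])
```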

This even holds when looking at anonymous functions (so-called lambdas) or control structures.

# Working with lambdas
number <- 2
number %>%
(function(x) {
    if (!x %% 2) { print("Even") } else { print("Odd") }
})

# Working with control structures (directly)
number %>%
{
    if (!. %% 2) { print("Even") } else { print("Odd") }
}

To make things easier for the user, magrittr even offers aliases for the aforementioned operators. To give you an example, watch this:

library(magrittr)  # attaches the operator aliases such as equals()

# Use the backticks
1:10 %>% `==`(3)
# ... or its synonymous wrapper
1:10 %>% equals(3)
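equals() is only one of several aliases. As a rough sketch, here are a few more that magrittr exports (multiply_by(), subtract(), is_greater_than(), not() and extract()), each checked against the plain operator it wraps. The vector x and the result names are just for illustration.

```r
library(magrittr)

x <- 1:10

# Arithmetic aliases: (x * 2) - 1
odd_numbers <- x %>% multiply_by(2) %>% subtract(1)

# Comparison and negation aliases: !(x > 5)
at_most_five <- x %>% is_greater_than(5) %>% not()

# Subsetting alias: x[3]
third <- x %>% extract(3)

# Each alias gives the same result as the operator it stands for
identical(odd_numbers, x * 2 - 1)
identical(at_most_five, !(x > 5))
identical(third, x[3])
```

One caveat: some of these names are fairly generic (extract(), for example, also exists in tidyr), so masking can occur when several packages are attached.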

Let the piping begin

Now let us see what else we can do with piping.

The first tool in the toolbox is the creation of functions. All you have to do is start the pipeline with the . argument (I will henceforth refer to it as "dot"). For example, let us imagine that we have built a shiny app that requires some input from the user. If the user does not provide all inputs and clicks on some button, we want to tell them, in nicely formatted HTML output, which inputs are missing. So we want to write a function that takes a character vector of missing inputs and returns an HTML list containing the missing inputs as elements. One way of doing this, magrittr style, is:

# Turn a character vector into a html list.
vec2html <- . %>% 
  sprintf("<li>%s</li>", .) %>%
  paste(., collapse = " ") %>%
  sprintf("<ul>%s</ul>", .)

Now, if a user forgot to enter the inputs name and age into the form, they would receive an output like this:

missing <- c("name", "age")
vec2html(missing)
[1] "<ul><li>name</li> <li>age</li></ul>"

which renders as an HTML list if you open it in your browser. Well, it's not rocket science, but it facilitates the creation of little helper functions.
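A pipeline that starts with the dot is an ordinary function, so it can be compared directly with a hand-written equivalent. The sketch below uses a hypothetical helper called squish (and its nested twin squish_fn) that tidies up labels; neither is part of magrittr, they are just made up for this illustration.

```r
library(magrittr)

# Hypothetical helper built as a pipeline that starts with the dot
squish <- . %>%
  trimws() %>%
  tolower() %>%
  gsub("\\s+", "_", .)

# The same helper written as a conventional nested function
squish_fn <- function(x) gsub("\\s+", "_", tolower(trimws(x)))

labels <- c("  Mean HP ", "Number of  Cylinders")

is.function(squish)                           # the pipeline is a function
identical(squish(labels), squish_fn(labels))  # and behaves like its nested twin
```

Note that the dot is only spelled out in the gsub() step, because there the input is not the first argument.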

And the journey doesn't have to stop here. With the help of the dot, you can also call functions that are used for their side effects rather than for their return value, such as print() or plot(). Let's look at another example. You work at a consulting company and mention something about the normal distribution, something your client has never heard of. Oh dear! Your drawing skills are terrible, but you like to show off, so you want to illustrate the distribution together with the meaning of one of its parameters, the mean μ, using R. So, you decide to draw 10,000 observations from a standard normal distribution and plot the distribution twice: once the original draw, once the version shifted by two. You also thought about printing summary statistics on the fly, because, well, you can. This is the pipe that could accomplish that.

10000 %>%
  rnorm(.) %>% 
  {
    # Print summary statistics
    summary(.) %>% print(.)
    .
  } %>%
  {
    # Plot the first density
    density(.) %>%
    plot(., main = "", xlim = c(-4, 6))
    .
  } %>%
  # Shift by two and plot second density
  add(2) %>%
  density(.) %>%
  lines(., col = "red")

[Figure: normal densities]

As we have already seen above, curly braces group a sequence of statements into one step, which is handy if you only care about the final statement. In the case of our summary statistics, we want to print them on the fly and continue with some other calculations without breaking the pipe. The remedy is to return the dot at the end of the block of code. This way, our random vector is passed through to the plotting step without alterations, leaving a print statement on the way. Curly braces are particularly useful for the first plot, where we only need the density() function for the preliminary plot.
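As a side note, magrittr also ships a "tee" operator, %T>%, which runs the right-hand side for its side effect and then passes the left-hand side on unchanged. As far as I can tell, it covers the same pattern as returning the dot by hand. The sketch below leaves out the plots and compares both variants; the seed and the names shifted and manual are arbitrary choices for this illustration.

```r
library(magrittr)

# Variant 1: the tee operator passes its left-hand side through untouched
set.seed(42)  # arbitrary seed, only there to make both draws identical
shifted <- 10000 %>%
  rnorm() %T>%
  { print(summary(.)) } %>%
  add(2)

# Variant 2: the same pipe, returning the dot by hand
set.seed(42)
manual <- 10000 %>%
  rnorm() %>%
  {
    print(summary(.))
    .
  } %>%
  add(2)

# Both variants hand the same vector to the next step
identical(shifted, manual)
```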

Besides the pipe itself, magrittr offers other operators that are worth mentioning. If you want to update a previously created object in place, you can make your code more concise with %<>%:

# Create string
my_string <- "Hello"

# Change value and assign
my_string <- my_string %>% sprintf("%s, world!", .)

# More concise way (use this instead of the line above, not in addition to it)
my_string %<>% sprintf("%s, world!", .)

If you work with data frames and want to expose their columns, you can use %$%.

# Correlation between columns cyl and hp within the mtcars data set
mtcars %$% cor(cyl, hp)
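If you wonder what "exposing" means here: the columns become available by name on the right-hand side, much like with base R's with(). Here is a small sketch that compares the three notations; the variable names are made up for illustration.

```r
library(magrittr)

# Exposition operator: columns are visible by name on the right-hand side
piped <- mtcars %$% cor(cyl, hp)

# Base R equivalents
using_with   <- with(mtcars, cor(cyl, hp))
using_dollar <- cor(mtcars$cyl, mtcars$hp)

# All three notations yield the same correlation
isTRUE(all.equal(piped, using_with))
isTRUE(all.equal(piped, using_dollar))
```

This is mostly useful for functions like cor() that take vectors and have no data argument.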

Conclusion

Most R users become acquainted with the pipe %>% through dplyr, but there is much more to the operator than making dplyr code neater. The magrittr package is a useful library of tools that help you produce more intuitive code, independent of the other packages you are using. Furthermore, it reveals some underlying features of R.

About the author
Tobias Krabel

I am a data scientist at STATWORX, with a secret passion for data and software engineering. To compensate for my nerdy sitting-in-the-basement side, I spend even more time in the basement writing shiny applications.

ABOUT US

STATWORX is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.