bang-title

Bang Bang – How to program with dplyr

Markus Berroth Blog, Data Science

The tidyverse is making the life of a data scientist a lot easier. That’s why we at STATWORX love to execute our analytics and data science with the tidyverse. Its user-centered approach has many advantages. Instead of the base R version df_test[df_test$x > 10], we can write df_test %>% filter(x>10)), which is a lot more readable – especially if our data pipeline gets more complex and nested. Also, as you might have noticed, we can use the column names directly instead of referencing the Data Frame before. Because of those advantages, we want to use dplyr-verbs for writing our function. Imagine we want to write our own summary-function my_summary(), which takes a grouping variable and calculates some descriptive statistics. Let’s see what happens when we wrap a dplyr-pipeline into a function:

my_summary <- function(df, grouping_var){
 df %>%
  group_by(grouping_var) %>% 
  summarise(
   avg = mean(air_time),
   sum = sum(air_time),
   min = min(air_time),
   max = max(air_time),
   obs = n()
  )
}
my_summary(airline_df, origin)
Error in grouped_df_impl(data, unname(vars), drop) : 
 Column `grouping_var` is unknown 

Our new function uses group_by(), which is searching for a grouping variable grouping_var, and not for origin, as we intended. So, what happened here? group_by() is searching within its scope for the variable grouping_var, which it does not find. group_by() is quoting its arguments, grouping_var in our example. That’s why dplyr can implement custom ways of handling its operation. Throughout the tidyverse, tidy evaluation is used. Therefore we can use column names, as it is a variable. However, our data frame has no column grouping_var.

Non-Standard Evaluation

Talking about whether an argument is quoted or evaluated is a more precise way of stating whether or not a function uses non-standard evaluation (NSE).

– Hadley Wickham

The quoting used by group_by() means, that it uses non-standard evaluation, like most verbs you can find in dplyr. Nonetheless, non-standard evaluation is not only found and used within dplyr and the tidyverse.

non-standard-evaluation

Because dplyr quotes its arguments, we have to do two things to use it in our function:

  • First, we have to quote our argument
  • Second, we have to tell dplyr, that we already have quoted the argument, which we do with unquoting

We will see this quote-and-unquote pattern consequently through functions which are using tidy evaluation.

my_summary <- function(df, grouping_var){
  df %>%
    group_by(!!grouping_var) %>% 
    summarise(
      avg = mean(air_time),
      sum = sum(air_time),
      min = min(air_time),
      max = max(air_time),
      obs = n()
    )
}
my_summary(airline_df, quo(origin))

Therefore, as input in our function, we quote the origin-variable, which means that R doesn’t search for the symbol origin in the global environment, but holds evaluation. The quotation takes place with the quo() function. In order to tell group_by(), that the variable was already quoted we need to use the !!-Operator; pronounced Bang-Bang (if you wondered about the title).

If we are not using !!, group_by() at first searches for the variable within its scope, which are the columns of the given data frame. As mentioned before, throughout the tidyverse, tidy evaluation is used with its eval_tidy()-function. Whereby, it also introduces the concept of data mask, which makes data a first class object in R.

Data Mask

data-mask

Generally speaking, the data mask approach is much more convenient. However, on the programming site, we have to pay attention to some things, like the quote-and-unquote pattern from before.

As a next step, we want the quotation to take place inside of the function, so the user of the function does not have to do it. Sadly, using quo() inside the function does not work.

my_summary <- function(df, grouping_var){
  quo(grouping_var)
  df %>%
    group_by(!!grouping_var) %>% 
    summarise(
      avg = mean(air_time),
      sum = sum(air_time),
      min = min(air_time),
      max = max(air_time),
      obs = n()
    )
}
my_summary(airline_df, origin)
Error in quos(...) : object 'origin' not found 

We are getting an error message because quo() is taking it too literal and is quoting grouping_var directly instead of substituting it with origin as desired. That’s why we use the function enquo() for enriched quotation, which creates a quosure. A quosure is an object which contains an expression and an environment. Quosures redefine the internal promise object into something that can be used for programming. Thus, the following code is working, and we see the quote-and-unquote pattern again.

my_summary <- function(df, grouping_var){
  grouping_var <- enquo(grouping_var)
  df %>%
    group_by(!!grouping_var) %>% 
    summarise(
      avg = mean(air_time),
      sum = sum(air_time),
      min = min(air_time),
      max = max(air_time),
      obs = n()
    )
}
my_summary(airline_df, origin)
# A tibble: 2 x 6
  origin   avg    sum   min   max   obs
  <fct>  <dbl>  <int> <int> <int> <int>
1 JFK     166. 587966    19   415  3539
2 LAX     132. 850259     1   381  6461

All R code is a tree

To better understand what’s happening, it is useful to know that every R code can be represented by an Abstract Syntax Tree (AST) because the structure of the code is strictly hierarchical. The leaves of an AST are either symbols or constants. The more complex a function call gets, the deeper an AST is getting with more and more levels. Symbols are drawn in dark-blue and have rounded corners, whereby constants have green borders and square corners. The strings are surrounded by quotes so that they won’t be confused with symbols. The branches are function calls and are depicted as orange rectangles.

a(b(4, "s"), c(3, 4, d()))
abstract-syntax-tree

To understand how an expression is represented as an AST, it helps to write it in its prefix form.

y <- x * 10
`<-`(y, `*`(x, 10))
prefix-tree

There is also the R package called lobstr, which contains the function ast() to create an AST from R Code.

The code from the first example lobstr::ast(a(b(4, "s"), c(3, 4, d()))) results in this:

lobstr-ast

It looks as expected and just like our hand-drawn AST. The concept of ASTs helps us to understand what is happening in our function. So, if we have the following simple function, !!` introduces a placeholder (promise) for x.

x <- expr(-1)
f(!!x, y)

Due to R’s lazy evaluation, the function f() is not evaluated immediately, but once we called it. At the moment of the function call, the placeholder x is replaced by an additional AST, which can get arbitrary complex. Furthermore, it keeps the order of the operators correct, which is not the case when we use parse() and paste() with strings. So the resulting AST of our code snippet is the following:

promise-tree

Furthermore, !! also works with symbols, functions, and constants.

Perfecting our function

Now, we want to add an argument for the variable we are summarizing to refine our function. At the moment we have air_time hardcoded into it. Thus, we want to replace it with a general summary_var as an argument in our function. Additionally, we want the column names of the final output data frame to be adjusted dynamically, depending on the input variable. For adding summary_var, we follow the quote and unquote pattern from above. However, for the column-naming, we need two additional functions.

Firstly, quo_name(), which converts a quoted symbol into a string. Therefore, we can use normal string operations on it and, e.g. use the base paste command for manipulating it. However, we also need to unquote it, which would be on the Left-Hand-Side, where R is not allowing any computations. Thus, we need the second function, the vestigial operator := instead of the normal =.

my_summary <- function(df, grouping_var, summary_var){
  grouping_var <- enquo(grouping_var)
  summary_var <- enquo(summary_var)
  summary_nm <- quo_name(summary_var)
  summary_nm_avg <- paste0("avg_",summary_nm)
  summary_nm_sum <- paste0("sum_",summary_nm)
  summary_nm_obs <- paste0("obs_",summary_nm)

  df %>%
    group_by(!!grouping_var) %>% 
    summarise(
      !!summary_nm_avg := mean(!!summary_var),
      !!summary_nm_sum := sum(!!summary_var),
      !!summary_nm_obs := n()
    )
}
my_summary(airline_df, origin, air_time)
# A tibble: 2 x 4
  origin avg_air_time sum_air_time obs_air_time
  <fct>         <dbl>        <int>        <int>
1 JFK            166.       587966         3539
2 LAX            132.       850259         6461

Tidy Dots

In the next step, we want to add the possibility to summarize an arbitrary number of variables. Therefore, we need to use tidy dots (or dot-dot-dot) . E.g. if we call the documentation for select(), we get

Usage

select(.data, ...)

Arguments

... One or more unquoted expressions separated by commas.

In select() we can use any number of variables we want to select. We will use tidy dots ... in our function. However, there are some things we have to account for.

Within the function, ... is treated as a list. So we cannot use !! or enquo(), because these commands are made for single variables. However, there are counterparts for the case of .... In order to quote several arguments at once, we can use enquos(). enquos() gives back a list of quoted arguments. In order to unquote several arguments we need to use !!!, which is also called the big bang-Operator. !!! replaces arguments one-to-many, which is called unquote-splicing and respects hierarchical orders.

splicing

With using purrr, we can neatly handle the computation with our list entries provided by ... (for more information ask your Purrr-Macist). So, putting everything together, we finally arrive at our final function.

my_summary <- function(df, grouping_var, ...) {
  grouping_var <- enquo(grouping_var)

  smry_vars <- enquos(..., .named = TRUE)

  smry_avg <- purrr::map(smry_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })
  names(smry_avg) <- paste0("avg_", names(smry_avg))

  smry_sum <- purrr::map(smry_vars, function(var) {
    expr(sum(!!var, na.rm = TRUE))
  })
  names(smry_sum) <- paste0("sum_", names(smry_sum))

  df %>%
    group_by(!!grouping_var) %>%
    summarise(!!!smry_avg, !!!smry_sum, obs = n())
}

my_summary(airline_df, origin, dep_delay, arr_delay)
# A tibble: 2 x 6
  origin avg_dep_delay avg_arr_delay sum_dep_delay sum_arr_delay   obs
  <fct>          <dbl>         <dbl>         <int>         <int> <int>
1 JFK            12.9          11.8          45792         41625  3539
2 LAX             8.64          5.13         55816         33117  6461

And the tidy evaluation goes on and on

As mentioned in the beginning, tidy evaluation is not only used within dplyr but within most of the packages in the tidyverse. Thus, to know how tidy evaluation works is also helpful if one wants to use ggplot in order to create a function for a styled version of a grouped scatter plot. In this example, the function takes the data, the values for the x and y-axes as well as the grouping variable as inputs:

scatter_by <- function(.data, x, y, z=NULL) {
  x <- enquo(x)
  y <- enquo(y)
  z <- enquo(z)

  ggplot(.data) + 
    geom_point(aes(!!x, !!y, color = !!z)) +
    theme_minimal()
}
scatter_by(airline_df, distance, air_time, origin) 
scatter-1

Another example would be to use R Shiny Inputs in a sparklyr-Pipeline. input$ cannot be used directly within sparklyr, because it would try to resolve the input list object on the spark side.

server.R

library(shiny)
library(dplyr)
library(sparklyr)

# Define server logic required to filter numbers
shinyServer(function(input, output) {
    tbl_1 <- tibble(a = 1:5, b = 6:10)
    sc <- spark_connect(master = "local")

    tbl_1_sp <-
        sparklyr::copy_to(
            dest = sc,
            df = tbl_1,
            name = "tbl_1_sp",
            overwrite = TRUE
        )

    observeEvent(input$select_a, {

        number_b <- tbl_1_sp %>%
            filter(a == !!input$select_a) %>%
            collect() %>%
            pull()

        output$text_b <- renderText({
            paste0("Selected number : ", number_b)
        })
    })
})

ui.R

library(shiny)
library(dplyr)
library(sparklyr)


# Define UI for application t
shinyUI(fluidPage(
    # Application title
    titlePanel("Select Number Example"),

    # Sidebar with a slider input for number
    sidebarLayout(sidebarPanel(
        sliderInput(
            "select_a",
            "Number for 1:",
            min = 1,
            max = 5,
            value = 1
        )
    ),

    # Show a text as output
    mainPanel(textOutput("text_b")))
))

Conclusion

There are many use cases for tidy evaluation, especially for advanced programmers. With the tidyverse getting bigger by the day, knowing tidy evaluation gets more and more useful. For getting more information about the metaprogramming in R and other advanced topics, I can recommend the book Advanced R by Hadley Wickham.

“Über
Markus Berroth

Markus Berroth

I am data scientist at STATWORX and I love creating novel knowledge from data. In my time off, I am always open for a weekend trip.

ABOUT US


STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.