Optimising your R code is not always the priority. But when you run out of memory, or it just takes too long, you start to wonder if there are better ways to do things! In this blog post, I will show you my way of optimising my R code and the process behind it. As an example, I will use the code from my last blog post about gas prices in Germany – if you haven’t read it yet, click here!

There are different steps to optimise your code. From debugging and profiling over benchmarking to rethinking the whole method. You have to repeat these steps until you are satisfied with the result.

## A short recap of the initial code

I had around 1700 .csv files, each containing the price changes of one day for all German gas stations from 2014 up till 2019. My plan was to analyse the price patterns, but first I had to prepare and load all the data. Normally, I would just load all the files, combine them into one `data.table()`

, do some data preparation steps and then start my analysis. This was not possible here since the raw data was around 20 GB big and I wanted to rearrange them so that I would have half-hour prices. So, I settled for the solution of reading the files one after another, while preparing and aggregating each one at a time, and combining them at the end. There were eight steps during my loop:

```
# pseudo code
for (i.file in price_files) {
# load data
# remove wrong prices (e.g. -0.001, 0 or 8.888)
# set TIME and DATE
# build a time grid including opening times
# add last day price and save next day price
# NA handling - last observation carried forward (locf)
# adding brand, motorway and post code info
# calculating mean and quantiles
}
```

Let’s see where I can optimise my steps!

## Identifying the bottlenecks

Since I want to optimise a loop, I can just concentrate on a few iterations instead of the whole loop. Also, I do not need to include every station for this test. Therefore, I subset my data to stations which are in the area around Frankfurt am Main.

`stations_dt <- stations_dt[post_code >= "60000" & post_code < "66000",] `

**Note**: Yes, you can filter strings like this – they are sorted alphabetically. Also, they are characters and not numbers since there are cases like “01896”.

With this small test data, I will continue my optimisation process. To get an overview of bottlenecks I use the package `profvis`

. Since RStudio version 1.0, it is quite easy to use. I just select the lines of code with my for-loop and click *Profile > Profile Selected Line(s)* to get a report with the time and memory consumption in a new tab.

I filter only for those lines which have a non-zero value to have a nicer overview. My main task is to reduce time consumption, therefore I will focus on the longest time first. There is one major time consumer: `stat_fun()`

. It takes around 75% of the total 30000ms. So, what is this function of mine?

### a deeper look at `stat_fun()`

This function returns a `data.table()`

with the mean, the 10% and 90% quantile and the number of observations. I use `stat_fun`

on each group defined by `this_by <- c("DATE", "TIME", "AUTOBAHN", "BRAND", "plz")`

.

```
# get mean and quantiles 0.1 and 0.9
stat_fun <- function(x, na.rm = TRUE) {
#x <- price_time_dt[,.SD, .SDcol = c("diesel", "e10")]
x_quant1 <- x[, lapply(.SD, quantile, probs = c(0.1), na.rm = na.rm),
.SDcols = names(x)]
x_quant9 <- x[, lapply(.SD, quantile, probs = c(0.9), na.rm = na.rm),
.SDcols = names(x)]
x_mean <- x[, lapply(.SD, mean, na.rm = na.rm), .SDcols = names(x)]
x_obs <- x[, lapply(.SD, function(y) sum(!is.na(y))), .SDcols = names(x)]
setnames(x_quant1, paste0(names(x), "_Q10"))
setnames(x_quant9, paste0(names(x), "_Q90"))
setnames(x_mean, paste0(names(x), "_MEAN"))
setnames(x_obs, paste0(names(x), "_OBS"))
out <- data.table(x_quant1, x_quant9, x_mean, x_obs)
return(out)
}
price_time_dt2 <- price_time_dt[, stat_fun(.SD), .SDcols = price_vars, by = this_by]
```

Why did I build it like this? My goal was to have one function, which returns all the values I need for my analysis later: the mean, the quantiles and the number of observations. Let’s check how we can optimise this!

First, I look into the function with `profvis`

again. Since there is not much to see but that it takes 7900ms for one iteration I profile the lines in the `stat_fun`

. Therefore I set the inputs and then select the lines in the function for profiling.

```
x <- price_time_dt[,.SD, .SDcol = c("diesel", "e10")]
na.rm <- TRUE
```

The result is quite surprising, there is not much time used even though I used the whole data instead of one group.

How many times does this function get called? Well, for each combination of the by key. For the first file, this is around 1500 times. I will now check if `mean()`

and `quantile()`

are taking a lot of time, if they are used by each group.

```
x_mean <- price_time_dt[, lapply(.SD, mean, na.rm = TRUE),
.SDcols = price_vars,
by = this_by]
x_quant1 <- price_time_dt[, lapply(.SD, quantile,
probs = 0.1,
na.rm = TRUE),
.SDcols = price_vars,
by = this_by]
x_quant9 <- price_time_dt[, lapply(.SD, quantile,
probs = 0.9,
na.rm = TRUE),
.SDcols = price_vars,
by = this_by]
```

The `mean()`

function is fast and not a problem, but `quantile()`

takes quite long. I could combine the calculation for the 10% and 90% quantile into one call, but the resulting table I would have to reshape into my desired output – if I cannot find any other tweaks, I will come back to this approach.

While reading the help pages for the `quantile()`

function, I stumble upon the argument `names`

:

`names`

: logical; if true, the result has a`names`

attribute. Set to`FALSE`

for speedup with many`probs`

.

I am not using many `probs`

, but this still improves the time by around 700ms.

This is not enough, so I will have to test the other approach!

### implemanting `stat_fun()`

as part of the main code

The new approach first creates a `data.table()`

for the mean, the quantiles and the number of observations and then merges them by the groups together. With this approach, I can combine the two quantile calculations, but have to include the merging part.

```
# calculating the mean
x_mean <- price_time_dt[, lapply(.SD, mean, na.rm = TRUE),
.SDcols = price_vars,
by = this_by]
# calculating the quantiles
x_quant <- price_time_dt[, lapply(.SD, quantile, probs = c(0.1, 0.9),
na.rm = TRUE, names = FALSE),
.SDcols = price_vars,
by = this_by]
x_quant[, QUANT := c("Q10", "Q90")]
eq_quant <- paste0(paste0(this_by, collapse = "+"), "~QUANT")
x_quant <- dcast(data = x_quant, formula = eq_quant, value.var = price_vars)
setkeyv(x_quant, NULL)
# number of observations
x_obs <- price_time_dt[, lapply(.SD, function(y) sum(!is.na(y))),
.SDcols = price_vars,
by = this_by]
setnames(x_mean, price_vars, paste0(price_vars, "_MEAN"))
setnames(x_obs, price_vars, paste0(price_vars, "_OBS"))
setkeyv(price_time_dt, this_by)
price_time_dt2 <- Reduce(merge, list(x_mean, x_quant, x_obs))
```

As it turns out, merging is fast as well – I would not have thought so! Instead of nearly 30s it only needs 12s now! With the biggest issue out of the way, there seems to be some more potential in other lines as well. Next up are the time formatting and the missing values handling with the last observation carried forward (locf).

### last observations slowly carried forward

```
price_time_dt <- price_time_dt[order(station_uuid, DATE, TIME)]
price_time_dt[, c(price_vars) := lapply(.SD, zoo::na.locf, na.rm = FALSE),
.SDcols = price_vars,
by = station_uuid]
```

Sometimes it is hard to reinvent the wheel – so l search the vast knowledge of the internet and this article here comes up. They describe exactly my problem: a faster locf-method with grouping.

```
price_time_dt <- price_time_dt[order(station_uuid, DATE, TIME)]
id_change <- price_time_dt[, c(TRUE, station_uuid[-1] != station_uuid[-.N])]
price_time_dt[, c(price_vars) :=
lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)]),
.SDcols = price_vars]
```

To understand this method I made this table with a toy example. The last column is used as the index, which replaces missing values with the last observation. Because this method is vectorised and works without a `by`

group, it is much faster.

.I | id | id_change | x | !is.na(x) | (!is.na(x)|id_change)*.I | cummax(…) |
---|---|---|---|---|---|---|

1 | 1 | TRUE | 4 | TRUE | 1 * 1 = 1 | 1 |

2 | 1 | FALSE | NA | FALSE | 0 * 2 = 0 | 1 |

3 | 1 | FALSE | NA | FALSE | 0 * 3 = 0 | 1 |

4 | 2 | TRUE | 9 | TRUE | 1 * 4 = 4 | 4 |

5 | 2 | FALSE | NA | FALSE | 0 * 5 = 0 | 4 |

6 | 2 | FALSE | 5 | TRUE | 1 * 6 = 6 | 6 |

7 | 2 | FALSE | NA | FALSE | 0 * 7 = 0 | 6 |

8 | 3 | TRUE | NA | FALSE | 1 * 8 = 8 | 8 |

9 | 3 | FALSE | 17 | TRUE | 1 * 9 = 9 | 9 |

### even formatting time is relative

The raw data contains a column with the timestamp of a price change. There are multiple options to start to transform this string into a date and time variable:

*base*R with`as.POSIXct()`

- lubridate* with
`ymd_hms()`

*data.table*with`as.IDate()`

and`as.ITime()`

There are two things I want from the time transformation. First, for my aggregation over daily hours analysis, I want the date and the time in separate columns. Second, I want to round to a given minute (e.g. to half hours or hours).

Since I am a *data.table* person and I think fewer packages in a project are a good thing, my first choices for these tasks are as.IDate() and as.ITime(). But – the go-to package when it comes to time and dates – *lubridate* with the `round_date()`

function to round to a specific minute, for example, might be a good candidate as well. Time to do some benchmarking!

### my benchmark setup

To test the different functions I will use one of the files, reduced to the `date`

, `DAY`

and `TIME`

columns:

```
date DAY TIME
1: 2014-06-08 09:50:01 2014-06-08 09:50:01
2: 2014-06-08 09:50:01 2014-06-08 09:50:01
3: 2014-06-08 09:50:01 2014-06-08 09:50:01
4: 2014-06-08 09:50:01 2014-06-08 09:50:01
5: 2014-06-08 09:50:01 2014-06-08 09:50:01
```

### splitting the format into date and time

```
microbenchmark(
# lubridate
"lubridate::ymd()" = DT[, DAY_ymd := ymd(DAY)],
"lubridate::hms()" = DT[, TIME_hms := as.numeric(hms(TIME))],
# data.table
"data.table::as.IDate()" = DT[, DATE_IDate := as.IDate(date)],
"data.table::as.ITime()" = DT[, DATE_ITime := as.ITime(date)],
# settings
times = 100L, unit = "ms")
```

```
Unit: milliseconds
expr min lq mean median uq max neval
lubridate::ymd() 7.210735 10.18305 12.62154 11.96969 14.31220 27.62012 100
lubridate::hms() 9.611753 10.26598 11.08295 10.60343 11.38001 16.48317 100
data.table::as.IDate 74.908261 75.91852 78.98697 77.86806 80.62390 91.97302 100
data.table::as.ITime 298.047305 306.15576 314.78720 309.89135 315.41061 466.35973 100
```

Well, *lubridate* is the fastest! But, the output of hms() cannot simply be saved into a data.table column because it is a nested object of the class `Period`

from *lubridate*. Therefore, I will use this little workaround:

```
price_dt[, c("DATE", "TIME") := tstrsplit(date, " ")]
price_dt[, DATE := lubridate::ymd(DATE)]
price_dt[, TIME := as.ITime(as.numeric(lubridate::hms(TIME)))]
```

With this, I can continue to use the `as.ITime()`

function and its format. Also, using `as.ITime()`

on a numeric is much faster than on the original date.

### rounding to the minutes

For the second task, I will compare `lubridate::round_date()`

and `ply::round_any()`

```
minute <- 30
microbenchmark(
# rounding lubridate
"lubridate::round_date()" =
DT[, ROUND_TIME_lb := lubridate::round_date(DATE_ymd_hms, paste0(minute, " mins"))],
# rounding plyr
"plyr::round_any()" =
DT[, ROUND_TIME_pl := as.ITime(plyr::round_any(as.numeric(DATE_ITime), minute * 60))],
# settings
times = 100L, unit = "ms")
```

```
Unit: milliseconds
expr min lq mean median uq max neval
lubridate::round_date() 24.075546 25.582259 27.27308 26.944734 28.120435 39.62532 100
plyr::round_any() 1.351571 1.533044 1.97845 1.727221 2.249484 4.37735 100
```

On the one hand, `round_date()`

has an easy to understand input format, but on the other hand `round_any()`

is much faster since it works on numeric values.

With this setup, I only use `lubridate`

and `plyr`

in one part of my code and for the rest `data.table`

– That makes my inner self.data.table happy and it still is quite fast.

With all the changes I have made, I am curious how much the speed has improved.

## fixing the bottlenecks

That is quite an improvement! The code now only takes 4330ms. There are probably even more enhancements possible, but I think that is enough of a speed boost.

## to sum it all up

There are a lot of ways how you can improve and optimise your code. Some changes might be quite settled whilst others demand a change in your way of thinking about a problem. There are also a lot of packages out there which are optimised in doing a specific job (e.g. `lubridate`

). So the question often is not «Can this be improved?» but more like «Which package is better at this job?».

To be honest, it has taken me quite a while to finish this post. Whilst finding the bottlenecks and fixing them, I also found myself changing parts of the data preparation and adding new data error handling parts. So, the initial code might not be the same as the final one which you can inspect yourself in our github.

I still hope this post can help you to improve your own coding or at least gives you some ideas on how to tackle your optimisation problems.

# ABOUT US

STATWORX

is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.