
CodeR: an LSTM that writes R Code

Tobias Krabel

Everybody talks about them, many people know how to use them, few people understand them: Long Short-Term Memory Neural Networks (LSTM). At STATWORX, with the beginning of the hype around AI and projects with large amounts of data, we also started using this powerful tool to solve business problems.

In short, an LSTM is a special type of recurrent neural network – i.e. a network able to access its internal state to process sequences of inputs – which is really handy if you want to exploit some time-like structure in your data. Use cases for recurrent networks range from guessing the next frame in a video to stock prediction, but you can also use them to learn and produce original text. And that is all I will say about LSTMs in general. I won't bother you with yet another introduction to the theory of LSTMs, as there are more than enough great blog posts about their architecture (kudos to Andrej Karpathy for this very informative piece of work, which you should definitely read if you are not already bored by neural networks :)).

Especially inspired by the blog mentioned above, I thought about playing with a use case for LSTMs that actually has no intended use at all. LSTMs are good for learning text, so I thought it might be fun to let a character-level LSTM learn to write R code. It was not so important that the code is semantically correct or even solves a particular problem. Having a NN that is able to produce (more or less) syntactically correct code is already enough.

So, on my journey to CodeR, a NN that makes my own work totally obsolete, I will walk you through the three major steps of getting an RNN to write R code:

  1. Get enough text training data
  2. Build and train CodeR with that data
  3. Let CodeR write majestic R code

If you would like to try it yourself or follow along with the subsequent steps, you can get the code from my GitHub repository.

Step 1: Data Acquisition

Where to get enough Data?

Ultimately, CodeR needs data; a lot of data. Plus, the data should be of good quality and not be too heterogeneous so that CodeR is able to learn the structure from the given text. Since R is open source, the first address to search for good R code is GitHub. GitHub offers you an API to access information about its repositories, but for the flexibility and data I needed, I found the API too restrictive. That's why I decided to scrape the webpage myself using Hadley Wickham's rvest package.

Scrape GitHub

The goal is simple: Clone all R repositories from famous R users. Of course, you could manually pick R contributors that seem to be good programmers, but chances are you would miss someone who has some good and influential packages to offer. Remember that we need a lot of code and that it isn't much of a problem to reduce the data afterwards (which I in fact did).

Get trending R user names

So, let's start by getting the names of the trending users. If you visit https://github.com/trending/developers/r?since=monthly, you see a list of all trending users. On June 14, 2018, it looked like this:

[Screenshot: GitHub's list of trending R developers]

If you inspect the HTML code, you quickly see that the actual user names are the href attribute of a link surrounded by <h2> tags, so we use rvest to dig through that structure.

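library(rvest)     # read_html(), html_nodes(), html_attr()
library(glue)      # string interpolation with {}
library(magrittr)  # pipe operators %>% and %<>%

git_url <- "https://github.com"

# Take the href of every link inside an <h2> tag and strip the slashes
trending_user <- glue("{git_url}/trending/developers/r?since=monthly") %>%
  read_html() %>%
  html_nodes(., "h2") %>%
  html_nodes(., "a") %>%
  html_attr(., "href") %>%
  gsub("/", "", .)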
trending_user
[1] "hadley"        "rstudio"       "yihui"         ...

It's good to see that the names match the expected result from the webpage :).

Get R repository names

In the next step, for each user i, we need to get all of i's own (i.e. not forked) R repositories. When checking the URL that lets you inspect all repos of a user (e.g. https://github.com/hadley?page=1&tab=repositories), you realize that you need to go through all pages of a user's repository tab. I wrote a function that does that and also makes sure that:

  • The repo's main language is R
  • If the repo is forked, the repo will be assigned to the original author

With that function, it is easy to extract all R repo names from our trending users:

repos <- list()
for (user in trending_user) {
  cat("User: ", user, "\n")
  repos[[user]] <- get_r_repos(user)  # The actual magic
}
repos %<>% unlist() %>% unique()

Clone R repositories

Now that we have a bunch of repository names, the last step is to clone all those repos and to clean them so that they only contain R files. I have decided to clean a repo directly after I have cloned it since I am going to download a lot of data and don't want to use too much space on my hard drive. The example code below clones the repo where you can find all of the code above (you are welcome ;)).

repo <- "tkrabel/rcoder"
# Note the "/" between the host and the repo name
system(glue("git clone https://github.com/{repo}.git"),
       wait = TRUE)

After having cloned all repos, I simply smash their content together in one big text file (r_scripts_text.txt).

Step 2: Teach the Baby to Walk

So, we now have a big text file that is ready to be inspected by CodeR so that it can learn to produce good pieces of code of its own. But how does the training actually work? There are a few steps that need to be taken care of here:

  1. Prepare the data in a way it can actually be learned by an LSTM
  2. Construct the network's architecture
  3. The actual training step

The general idea behind step 1 is to slice the text data into overlapping sequences of characters with a pre-specified size s corresponding to the "time horizon". For example, imagine a text file containing the string "STATWORX ROCKS!" and let s = 3, meaning that you want the LSTM to use the last three characters to predict the fourth one. From this text file, you generate data that looks like this:

x1    x2    x3    y
'S'   'T'   'A'   'T'
'T'   'A'   'T'   'W'
'A'   'T'   'W'   'O'
...   ...   ...   ...
'C'   'K'   'S'   '!'
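
The slicing shown in the table can be sketched in a few lines. This is only an illustration in Python (the language the training step ends up in), not the code from the repository; the function name make_windows is made up for this example.

```python
def make_windows(text, s):
    """Slice text into overlapping (input, target) pairs.

    Each input is a run of s consecutive characters; the target is the
    character that immediately follows it.
    """
    pairs = []
    for i in range(len(text) - s):
        pairs.append((text[i:i + s], text[i + s]))
    return pairs


pairs = make_windows("STATWORX ROCKS!", s=3)

# Show the first three rows and the last row of the table
for x, y in pairs[:3] + pairs[-1:]:
    print(list(x), "->", repr(y))
```

Every window is shifted by exactly one character, so a text of n characters yields n - s training pairs.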

In the next step, you have to represent each character as a numeric object so that your model can actually work with it. The most popular way is to represent characters as unit vectors. To make this more tangible, remember that in the sentence above, we have 11 distinct characters (including the blank space and the exclamation mark). The so-called vocabulary {'S', 'T', 'A', 'W', 'O', 'R', 'X', ' ', 'C', 'K', '!'} is used to represent each character by an 11-dimensional unit vector with the 1 at its respective character position, e.g. S = (1, 0, \dots, 0)^\top (because 'S' is the first character of the vocabulary), T = (0, 1, 0, \dots, 0)^\top, and so on. With these transformations, we finally have data our model can learn from.
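
As a rough sketch of this encoding (again in plain Python and only for illustration; build_vocab and one_hot are hypothetical helper names, and real code would typically use arrays instead of lists):

```python
def build_vocab(text):
    """Return the distinct characters of text in order of first appearance."""
    vocab = []
    for ch in text:
        if ch not in vocab:
            vocab.append(ch)
    return vocab


def one_hot(ch, vocab):
    """Return the unit vector with a 1 at the position of ch in vocab."""
    vec = [0] * len(vocab)
    vec[vocab.index(ch)] = 1
    return vec


vocab = build_vocab("STATWORX ROCKS!")
print(len(vocab), "distinct characters:", vocab)
print("S ->", one_hot("S", vocab))
print("T ->", one_hot("T", vocab))
```

Ordering the vocabulary by first appearance reproduces the order used in the text; any fixed order works as long as it is the same during training and generation.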

Step 2 (building the model) is easy with the R keras package; in fact, it took only 9 lines of code to build an LSTM with one input layer, 2 hidden LSTM layers with 128 units each and a softmax output layer, making it four layers in total. I kept the model that "simple" because I knew it was going to take a long time to train. However, the results were not satisfying even after longer training times, so I decided to look for ways of training networks on better (free) hardware in order to configure much more complex models. My search brought me to Google Colaboratory, an environment that runs in the cloud and offers GPU support. The GPU support in particular sped up training enormously. However, for all R enthusiasts out there, Google's Colab has a drawback: it is a Jupyter Notebook environment and therefore requires you to write Python code, which makes my use case somewhat ironic, since I now use Python to train a network that writes R code. Well, in the end, I suppose, we all have to make some sacrifices :)!

As I started translating my code into Python, I found a very useful package, textgenrnn, that lets you build and train a model very easily. The advantage of the package is that its functions handle the whole data preparation step for you. The only thing you need to do is to specify the raw input text file from which the model learns and to configure the model; the rest is done for you (credit goes to Max Woolf for this great piece of work).

If you want to build your own version of CodeR, just copy this notebook to your Google Drive and follow the instructions.

Step 3: Let CodeR Talk to Us

After we have a trained version of CodeR, it is time to let it write some code. Starting with a blank sheet, CodeR is asked to sample the first character, which is a random one. In the next step, we feed that created character back to the model in order to write the next character. After that, we always use up to the last 40 characters as an input for the prediction of the next element in the text sequence.
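
The feedback loop described above can be sketched as follows. This is a simplified illustration, not textgenrnn's implementation: generate and toy_predict_next are hypothetical names, and the toy predictor merely stands in for the trained network, which would sample the next character from its predicted distribution.

```python
import random

MAX_CONTEXT = 40  # window size mentioned in the text


def generate(predict_next, vocab, n_chars, max_context=MAX_CONTEXT, rng=None):
    """Generate n_chars characters by feeding the output back into the model."""
    rng = rng or random.Random()
    text = rng.choice(vocab)            # the first character is a random one
    while len(text) < n_chars:
        context = text[-max_context:]   # use up to the last 40 characters
        text += predict_next(context)   # append the predicted character
    return text


def toy_predict_next(context):
    """Stand-in for the trained network: cycles through 'a', 'b', 'c'."""
    cycle = "abc"
    return cycle[(cycle.index(context[-1]) + 1) % len(cycle)]


sample = generate(toy_predict_next, list("abc"), n_chars=10, rng=random.Random(0))
print(sample)
```

The slice text[-max_context:] covers both phases of the loop: while fewer than 40 characters exist it returns everything written so far, and afterwards it returns exactly the last 40.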

There is a parameter in the corresponding textgenrnn function that determines CodeR's creativity while writing R code (the so-called temperature). The higher the value, the more creative, i.e. diverse, the text. However, the results are not checked for syntactical correctness, so choosing too high a temperature leads to more syntax errors. On the other hand, lower temperature values (e.g. 0.5) make CodeR more conservative in its predictions, keeping it closer to what it has learned. For a value of temperature = 0.5, CodeR knows how to pass any code review:

partition_by = NULL,
                                                                                                                                                                                                                                                                                                                                                                                                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

BATMAAAAAAAN!!! Looks like CodeR is lost in an infinite NA loop. Note the comma after the first statement. It is an artefact of the fact that we have some shiny code in the code base. This is, of course, an issue, as it leads to a very heterogeneous text base, destabilizing the RNN's learning process. Making the training data more homogeneous in terms of syntax is definitely a topic for future refinements.

But let's look at some code now that is funny and, quite frankly, impressive. It will also teach us something about the learning set. Staying with a temperature value of 0.5, CodeR mainly produces, well, NA loops, but also writes a lot of roxygen comments.

#' @param x An object to transform to a string.
#'
#' @param x An object of column names
#' @param categorical_column A string with a model format.
#' @param ... Export to make setting to a selection of the length of the scale.
#' @param ordered If not supported values are stored in the data state on a post container.
#' @param conf.level character vector of structure (a, b, separate the second library in the first directory for the global environment for each vector of the same as a single argument.
#'
#' @param ... Additional arguments to exponentiate for the new name of the document, not a context
#'   the \code{searchPackages} returns a function that dependencies to create a character vector of the specified values. The name of the command packages in the top level normal for a single static when the default to the selection

Quite entertaining, isn't it? I like the way CodeR is using totally confusing argument names, such as categorical_column for a "string with a model format".

With temperature values around 1, CodeR starts writing its first syntactically correct functions (although the semantics might be a problem). Let's look at some snippets I found in the output.

format.rdnore_configNames <- function(x) {
  class(x) <- y_string[[mull_bt]]
  x
}
addToN <- function(packagePath, ..., recursive = FALSE) {
  assert_that(is_string(id))

  base <- as.data.frame(names(purrr::packages$jitter))
  stop("parents has columns")
  layer_class = c(list(values, command), top = "Tar")
}
counts.latex <- function(x, ...) {
  if(is.null(x)) {
    stop("ROC Git classes:")
  }
}

spark_guess_value <- function(pd, stringsAsFactors = FALSE)
{
  if (!is.null(varify))
    cat("Installing ", cols, "/", x$date)


  return(self$registry)
}

It is impressive that CodeR correctly sets blank spaces and braces most of the time. I only needed to mildly correct it when it set a backtick instead of a quotation mark.

It can also use dplyr functions and the magrittr pipe (which is great since I am a big fan of the pipe as you can read here).

rf_car <- function(operator, input_col, config, default.uniques) %>%
  group_by(minorm) %>% summarise(week = 100)

Of course, this is just the tip of the iceberg, and there was a lot of unusable code CodeR produced. So to be fair, let's take a look at the lines that didn't make it into the hall of fame.

#' @import knit_print.j installed file
#' @param each HTML pages (in numbers matching metadata),
      # in the seed location
      if(tf$item %in% cl) {
        unknown <<- 100 : mutate(contsList)
      v = integer(1)
    }

    unused(
      {retries_cred_, revdeps, message_format = ","))
      

      if (flattenToBinour && !renamed) {

    # define validations for way top aggregated instead, constants that does not want
#' arguments  \code{list}.
#'
#' @inheritParams \dontrun{
#' # Default environment is supplied
#'
#' @keywords internal
str_replace_all <- function(pkg_lines, list(token = path),
                           list(using := force_init(), compare, installInfo$name, ") %>%
  ` %>%
#'   modifyList(list(2, coord = FALSE))
#'

If you set the temperature to 2, it becomes wildly creative.

x <- x$p>$scenqNy89L'<JW]
#' Clear tuning
#r
# verifican
ignore <- wwMap.com:(.p/qafffs.tboods,4max LNh,	rmAR',5R}/6)  Y/AS_M(SB423eyt
mf(,9] **.4L2;3) # v1.3mDE); *}

g3 <%yype_3X-C(r63,JAE)Zsd <- 1
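
Why do the samples change so much between temperatures 0.5, 1 and 2? A common way to implement temperature is to divide the log-probabilities by the temperature before renormalising them. The sketch below illustrates that idea only; it is not copied from textgenrnn, the function name apply_temperature is made up, and the example probabilities are invented.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature.

    Assumes all probabilities are strictly positive. Values below 1
    sharpen the distribution, values above 1 flatten it.
    """
    logits = [math.log(p) / temperature for p in probs]
    top = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


base = [0.7, 0.2, 0.1]                      # invented next-character probabilities
cold = apply_temperature(base, 0.5)         # conservative
hot = apply_temperature(base, 2.0)          # "wildly creative"

for name, dist in [("T=0.5", cold), ("T=1.0", base), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At a low temperature almost all probability mass sits on the most likely character, which fits the repetitive NA loops; at a high temperature unlikely characters are drawn far more often, which fits the gibberish above.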

Summary and outlook

LSTMs offer many interesting and amusing use cases. As a free-time side project, I tried to leverage the recurrent structure of an LSTM in order to train a model I call CodeR to write, well ..., R code. The results are truly entertaining and informative, as they reveal some of the training data set's structure. For example, we see that the R code contains roxygen comments to a large extent, which makes sense as we included many R packages in the training set.

One definite improvement for CodeR would be to remove all the shiny code from the training set in order to make the syntax more homogeneous and thereby improve the quality of CodeR's output. Furthermore, it may be worthwhile to remove all roxygen comments.

If you have any ideas what to do next with CodeR, if you have any suggestions on how to improve my code or if you just want to leave a comment, please feel free to shoot me a message. Especially if you trained a version of CodeR yourself, don't hesitate to share your favorite lines of code with me. I would also be very curious if you could improve CodeR's output quality by altering the training set (e.g. in the way described above).

About the author

Tobias Krabel

Tobias is a member of the Data Science team and is currently completing his second master's degree, in computer science. In his free time he is active in social causes and enjoys hiking in nature.