Flowcharts of functions

Jakob Gepp Blog, Data Science

When you work on bigger R projects, there comes a point when you may lose track of how your functions are connected. Or even worse: you inherit a large project and have to figure out what is actually happening! A possible remedy for this problem is a flowchart.

If you started your project with a flowchart: good for you. If you did not, drawing one afterwards can be a tedious job. Since it is a repetitive task, I had the idea that it could be done automatically by a function. Let's try it on a simple example, which you can find on our git!

What needs to be done?

The goal of a flowchart is to visualise the connections between the user-defined functions (UDFs). Firstly, I scan all project scripts and add all UDFs to one big list. Secondly, I search all scripts for the function names to find the dependencies. In R, functions are defined by <- function(){} (or similar), which makes this a simple search. But sometimes you define a function within a function like this:

foo_01 <- function(y) {
  print("Start foo_01")
  # define sub functions
  foo_02 <- function(x){
    if (x < 100) {
      print(x)
    } else {
      print("over 100")
    }
    foo_03 <- function(x){
      print(2 * x)
    }
    sapply(1:5, foo_03)
  }
  # main part of foo_01
  foo_02(x = y)
  foo_02(x = 10 * y)
}

In this toy example I define a function foo_01 where a second function foo_02 is defined and called. However, within foo_02 itself another function, namely foo_03, is defined and applied. This pattern is quite common in R functions and is a possible pitfall to look out for.

To tackle this, I need to separate the sub functions from the main functions to get the right connections. Otherwise I would find a direct connection between foo_01 and foo_03 instead of foo_01 to foo_02 and then foo_02 to foo_03. My solution to this problem is to count the curly brackets {} with respect to their nesting level and thus find the blocks of functions. In the toy example I would get an index like this for the curly brackets:

bracket section of foo_01
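
The bracket counting can be sketched in a few lines of base R. This is only a minimal illustration and not the function from the project: the names src, count_char and blocks are made up for this example, and it assumes that each definition has the form name <- function( with its opening bracket on the same line and that no curly brackets appear inside strings or comments.

```r
# Toy example from above, one script line per element.
src <- c(
  'foo_01 <- function(y) {',
  '  print("Start foo_01")',
  '  foo_02 <- function(x){',
  '    if (x < 100) {',
  '      print(x)',
  '    } else {',
  '      print("over 100")',
  '    }',
  '    foo_03 <- function(x){',
  '      print(2 * x)',
  '    }',
  '    sapply(1:5, foo_03)',
  '  }',
  '  foo_02(x = y)',
  '  foo_02(x = 10 * y)',
  '}'
)

# Count how often a single character occurs in each line.
count_char <- function(ch, x) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

net          <- count_char("{", src) - count_char("}", src)
depth_after  <- cumsum(net)        # nesting level after each line
depth_before <- depth_after - net  # nesting level before each line

# Lines that start a function definition (assumption: "name <- function(").
starts <- grep("<-\\s*function\\s*\\(", src)

# A block ends on the first line where the level falls back to the
# level it had before the definition started.
ends <- vapply(starts, function(s) {
  s + which(depth_after[s:length(src)] == depth_before[s])[1] - 1L
}, integer(1))

blocks <- data.frame(
  name  = sub("^\\s*([A-Za-z.][A-Za-z0-9._]*)\\s*<-.*$", "\\1", src[starts]),
  start = starts,
  end   = ends,
  level = depth_before[starts],
  stringsAsFactors = FALSE
)
print(blocks)
```

For the toy example this gives lines 1 to 16 for foo_01, 3 to 13 for foo_02 and 9 to 11 for foo_03, so the innermost block can be cut out first.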

Once I find the block, I can just remove it from the main function and add it to my list of functions. With this big list, I can search for the function calls and get a connection matrix:

        foo_01  foo_02  foo_03
foo_01       0       2       0
foo_02       0       0       1
foo_03       0       0       0

Since I only evaluate everything as a string, I do not get the right number of calls for foo_03 made via sapply: the name appears once in the code, so it is counted once, although the function runs five times. In this toy example it might be possible to get the right number, but in a project scenario the call might be much more complex. Furthermore, my function eliminates empty lines and comments, just to make it a bit tidier.
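
As a rough sketch of the string search, assume the function bodies have already been separated and stored as character vectors in a named list (the list bodies and the helpers strip_strings and count_hits are hypothetical names for this example). Removing quoted text before counting is an extra assumption of this sketch, so that the message "Start foo_01" is not counted as a call.

```r
# Bodies of the toy functions after nested definitions were cut out.
bodies <- list(
  foo_01 = c('print("Start foo_01")', 'foo_02(x = y)', 'foo_02(x = 10 * y)'),
  foo_02 = c('if (x < 100) {', '  print(x)', '} else {',
             '  print("over 100")', '}', 'sapply(1:5, foo_03)'),
  foo_03 = c('print(2 * x)')
)
fun_names <- names(bodies)

# Replace the content of double-quoted strings by an empty string.
strip_strings <- function(lines) gsub('"[^"]*"', '""', lines)

# Count plain string occurrences of a name in one piece of text.
count_hits <- function(name, text) {
  hits <- gregexpr(name, text, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match
}

# Rows are the calling functions, columns the functions being called.
connections <- sapply(fun_names, function(callee) {
  sapply(fun_names, function(caller) {
    text <- paste(strip_strings(bodies[[caller]]), collapse = "\n")
    count_hits(callee, text)
  })
})
print(connections)
```

Because this is a pure text search, the call through sapply counts once, just as in the matrix above.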

A different way to store the network is to use two data sets: one for the nodes and one for the edges. This way additional information (e.g. size, weights, labels, groupings, etc.) can be included, and one can visualise a sophisticated project flowchart with it. The igraph network further below is one example.
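
A possible way to get from the connection matrix to these two data sets, using base R only. The column names (id, label, calls_in, from, to, weight) are chosen for this illustration and are not prescribed by any plotting package.

```r
fun_names <- c("foo_01", "foo_02", "foo_03")

# Connection matrix of the toy example: rows call columns.
connections <- matrix(
  c(0, 2, 0,
    0, 0, 1,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(fun_names, fun_names)
)

# One row per function; further attributes can be added as columns.
nodes <- data.frame(
  id       = seq_along(fun_names),
  label    = fun_names,
  calls_in = unname(colSums(connections)),  # how often each function is called
  stringsAsFactors = FALSE
)

# One row per non-zero cell of the matrix.
idx <- which(connections > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(connections)[idx[, 1]],
  to     = colnames(connections)[idx[, 2]],
  weight = connections[idx],
  stringsAsFactors = FALSE
)

print(nodes)
print(edges)
```

The weight column can later drive the line width and any extra node column the point size or colour.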

Plotting the flowchart

There are different ways and packages to plot these kinds of flowcharts or networks in R. Popular network packages are ggnet, visNetwork, threejs, networkD3 and igraph. My first try is based on some test scripts and functions and is plotted with ggnet. I get the following graph:

Network with ggnet

To be honest, it’s not the prettiest graph, but it gets the job done and we can see the connections! Of course, there are a lot more ways to tweak and improve the plots with additional information. Some ideas come to mind:

  • Add color to symbolise the folder structure.
  • Vary the point size by the number of lines in a function.
  • Adjust the line width by the number of calls.
  • Make them interactive.

Among others, the threejs package makes it possible to build interactive plots. They are fun to play around with, but it is very hard to make them suitable for a project description. For instance, a 3D network is hard to read when its nodes overlap.

So, I adjust my network function a bit to include some of the ideas from above. I also have to change my underlying test scripts and functions, because there are some more special cases, which I want to test and debug. With all these improvements, this is my second version of the network:

improved network with ggnet

I also made one with igraph where recursive functions can be plotted.

network with igraph

Features and problems

So far, I am pleased with the result, but I am still missing some features that would make it a robust and sophisticated function. I look forward to working through the following list in the near future.

  • The function only looks for function calls but not for sourced scripts.
  • If the name of a function is contained in another one (e.g. foo_01 and foo_01b), the dependencies are not correct, because I just do a string search.
  • Line breaks between <- and the function name also slip through my search.
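
For the second point, one possible direction is to match only whole names instead of plain substrings. The helper count_whole_name below is a hypothetical sketch and not part of the current function; it assumes that function names consist of letters, digits, dots and underscores only.

```r
# Count occurrences of `name` that are not part of a longer name.
count_whole_name <- function(name, text) {
  # \Q...\E makes PCRE treat the name literally (a dot stays a dot);
  # the lookarounds reject neighbours that could belong to a name.
  pattern <- paste0("(?<![A-Za-z0-9._])\\Q", name, "\\E(?![A-Za-z0-9._])")
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match
}

text <- "foo_01b(1); foo_01(2); x <- foo_01(3) + my_foo_01(4)"

# Plain string search also hits foo_01b and my_foo_01.
naive <- sum(gregexpr("foo_01", text, fixed = TRUE)[[1]] > 0)

# Whole-name search only counts the two real calls of foo_01.
strict <- count_whole_name("foo_01", text)

print(c(naive = naive, strict = strict))
```

This still works on strings only, so names inside comments or quoted text would need the same cleaning as before.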

There might as well be more missing special cases. If you have an idea or a solution for one of these tasks – feel free to try it yourself and let me know!

About the author
Jakob Gepp

Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfil my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!
