
Be Safe!

In the age of open-source software projects, attacks on vulnerable software are ever-present. Python is the most popular language for Data Science and Engineering and is thus increasingly becoming a target for attacks through malicious libraries. Additionally, public-facing applications can be exploited by attacking vulnerabilities in the source code.

For this reason it’s crucial that your code does not contain any CVEs (common vulnerabilities and exposures) and does not use libraries that might be malicious. This is especially true for public-facing software, e.g. a web application. At statworx we look for ways to increase the quality of our code by using automated scanning tools. Hence, we’ll discuss the value of two code and package scanners for Python.

Automatic screening

There are numerous tools for scanning code and its dependencies. Here I will provide an overview of the most popular tools designed with Python in mind. Such tools fall into one of two categories:

  • Static Application Security Testing (SAST): tools that look for weaknesses in code and for vulnerable packages
  • Dynamic Application Security Testing (DAST): tools that look for vulnerabilities that occur at runtime

In what follows I will compare bandit and safety using a small Streamlit application I’ve developed. Both tools fall into the category of SAST, since they don’t need the application to run in order to perform their checks. Dynamic application testing is more involved and may be the subject of a future post.

The application

For the sake of context, here’s a brief description of the application: it was designed to visualize the convergence (or lack thereof) in the sampling distributions of random variables drawn from different theoretical probability distributions. Users can choose the distribution (e.g. Log-Normal), set the maximum number of samples and pick different sampling statistics (e.g. mean, standard deviation, etc.).

Bandit

Bandit is an open-source Python code scanner that checks for vulnerabilities in your code, and only in your code. It decomposes the code into its abstract syntax tree and runs plugins against it to check for known weaknesses. Among other tests, it checks for plain SQL code that could open the door to SQL injection, for passwords stored in code, and for hints about common attack vectors such as use of the pickle library. Bandit is designed for use with CI/CD: it throws an exit status of 1 whenever it encounters any issues, thus terminating the pipeline. A report is generated, which includes information about the number of issues, separated by confidence and severity according to three levels: low, medium, and high. In this case, bandit finds no obvious security flaws in our code.

Run started:2022-06-10 07:07:25.344619

Test results:
        No issues identified.

Code scanned:
        Total lines of code: 0
        Total lines skipped (#nosec): 0

Run metrics:
        Total issues (by severity):
                Undefined: 0
                Low: 0
                Medium: 0
                High: 0
        Total issues (by confidence):
                Undefined: 0
                Low: 0
                Medium: 0
                High: 0
Files skipped (0):
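
For illustration, here is a hypothetical snippet containing the kinds of patterns bandit does flag – hardcoded credentials, pickle deserialization of untrusted data, and disabled certificate verification (the test IDs in the comments are the ones bandit typically reports for these patterns):

```
# Hypothetical example of code bandit would flag
import pickle

import requests

DB_PASSWORD = "super_secret"  # hardcoded password string (B105)

def load_user_profile(raw_bytes):
    # pickle can execute arbitrary code when loading untrusted data (B301)
    return pickle.loads(raw_bytes)

def fetch_report(url):
    # certificate verification is disabled (B501)
    return requests.get(url, verify=False)
```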

Bandit should still be configured carefully for your project, because it sometimes raises a flag even though you already know that the finding would not be a problem at runtime. If, for example, you have a series of unit tests that use pytest and run as part of your CI/CD pipeline, Bandit will normally throw an error, since this code uses the assert statement, which should not be relied on in production code because assertions are removed when Python runs with the -O flag.

To avoid this behaviour you could:

  1. run scans against all files but exclude the test directory via the command line interface
  2. create a YAML configuration file that skips specific checks

Here’s an example:

# bandit_cfg.yml
skips: ["B101"] # skips the assert check

Then we can run bandit as follows: bandit -c bandit_cfg.yml -r /path/to/python/files and the unnecessary warnings will not crop up.

Safety

Developed by the team at pyup.io, this package scanner runs against a curated database which consists of manually reviewed records based on publicly available CVEs and changelogs. The package is available for Python >= 3.5 and can be installed for free. By default it uses Safety DB which is freely accessible. Pyup.io also offers paid access to a more frequently updated database.

Running safety check --full-report -r requirements.txt in the package root directory gives us the following output (truncated for the sake of readability):

+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 110 packages, using free DB (updated once a month)                   |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| urllib3                    | 1.26.4    | <1.26.5                  | 43975    |
+==============================================================================+
| Urllib3 1.26.5 includes a fix for CVE-2021-33503: An issue was discovered in |
| urllib3 before 1.26.5. When provided with a URL containing many @ characters |
| in the authority component, the authority regular expression exhibits        |
| catastrophic backtracking, causing a denial of service if a URL were passed  |
| as a parameter or redirected to via an HTTP redirect.                        |
| https://github.com/advisories/GHSA-q2q7-5pp4-w6pg                            |
+==============================================================================+

The report includes the number of packages that were checked, the type of database used for reference, and information on each vulnerability that was found. In this example an older version of the package urllib3 is affected by a vulnerability which could technically be used by an attacker to perform a denial-of-service attack.

Integration into your workflow

Both bandit and safety are available as GitHub Actions. The stable release of safety also provides integrations for TravisCI and GitLab CI/CD.

Of course, you can always manually install both packages from PyPI on your runner if no ready-made integration like a GitHub action is available. Since both programs can be used from the command line, you could also integrate them into a pre-commit hook locally if using them on your CI/CD platform is not an option.

The CI/CD pipeline for the application above was built with GitHub Actions. After installing the application’s required packages, it runs bandit first and then safety to scan all packages. With all the packages updated, the vulnerability scans pass and the docker image is built.
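
A minimal sketch of what such a workflow can look like – the job name, Python version, and file names are illustrative rather than copied from the actual pipeline:

```
# .github/workflows/security-scans.yml (sketch)
name: security-scans
on: [push]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - name: Install dependencies and scanners
        run: pip install -r requirements.txt bandit safety
      - name: Scan the application code with bandit
        run: bandit -c bandit_cfg.yml -r .
      - name: Scan the dependencies with safety
        run: safety check --full-report -r requirements.txt
```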

[Images: the “Package check” and “Code check” steps of the pipeline]

Conclusion

I would strongly recommend using both bandit and safety in your CI/CD pipeline, as they provide security checks for your code and your dependencies. For modern applications, manually reviewing every single package your application depends on is simply not feasible, not to mention all of the dependencies those packages have! Automated scanning is therefore indispensable if you want some level of awareness of how unsafe your code is.

While bandit scans your code for known exploits, it does not check any of the libraries used in your project. For this, you need safety, which informs you about known security flaws in the libraries your application depends on. While neither tool is completely foolproof, it is still better to be notified about some CVEs than none at all. This way, you will be able to either fix your vulnerable code or upgrade a vulnerable package dependency to a more secure version.

Keeping your code safe and your dependencies trustworthy can ward off potentially devastating attacks on your application.

Introduction

Artificial intelligence (AI) is no longer a vision of the future for German companies. According to a Deloitte survey of around 2,700 AI experts from nine countries, over 90 percent of respondents say that their company uses or plans to use technologies from at least one of the areas of Machine Learning (ML), Deep Learning, Natural Language Processing (NLP), and Computer Vision. This high percentage cannot be explained solely by companies having recognized the potential of AI; there are also significantly more standardized solutions available for these technologies, a development that has steadily lowered the barrier to entry in recent years.

For example, the three major cloud providers – Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure – offer standardized solutions for certain problems (e.g., object recognition on images, translation of texts, and automated machine learning). So far, however, not all problems can be solved with such standardized applications. There can be various reasons for this, the most common being that the available standard solutions do not fit the problem at hand. For example, in the field of NLP, the classification of entire texts is often available as a standard solution. If, however, the classification should happen not at the text level but at the word level, other models are required, and these are not always available as standard solutions. And even if they are available, the possible categories are usually predefined and cannot be adapted further. A service built to classify words into the categories place, person, and time cannot be used to classify words into the categories customer, product, and price. Many companies therefore continue to develop their own ML models. Since models are often developed on local computers, it must be ensured that they do not remain available only to the developer. Once a model has been developed, a significant challenge is to make it available to different users, since only then does the model add value for the company.

ML and AI projects come with their own challenges in both development and deployment. While development often fails due to a lack of suitable data, deployment can fail because a model is not compatible with the production environment. For example, machine learning models are mostly developed with open-source languages or newer ML frameworks (e.g., Dataiku or H2O), while an operational production environment often runs proprietary software that has been tested and proven over many years. Integrating these two worlds closely often poses significant challenges for both sides. It is therefore essential to link the development of ML models with the work of IT operations. This process is called MLOps, because data scientists work together with IT to make models usable in production.

MLOps is an ML development culture and practice whose goal is to link the development of ML systems (Dev) with the operation of ML systems (Ops). In practice, MLOps means focusing on automation and monitoring. This principle extends to all steps of setting up an ML system, such as integration, testing, releasing, deployment, and infrastructure management. The model code is only one component among many, as illustrated in Figure 1. The figure shows the other steps of the MLOps process in addition to the ML code and makes clear that the ML code itself is a relatively small part of the overall process.

Figure 1: Important components of the MLOps process

Further aspects of MLOps include, for example, the continuous provision and quality checking of data, as well as testing the model and, if necessary, debugging it. Docker containers have emerged as a core technology for deploying custom ML models and are therefore the focus of this article.

Why Docker Containers?

The challenge in deploying ML models is that a model is written in a specific version of a programming language. This language is usually not available in the production environment and therefore has to be installed first. In addition, the model has its own libraries, runtimes, and other technical dependencies, which also have to be installed in the production environment. Docker solves this problem via so-called containers, in which applications, including all of their components, can be packaged in isolation and made available as separate services. These containers contain everything the application or ML model needs to run, including code, libraries, runtimes, and system tools. Containers can therefore be used to deploy your own models and algorithms in any environment, without worrying about missing or incompatible libraries leading to errors.
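
As a minimal sketch of the idea – the base image, file names, and start command are assumptions, not taken from a specific project – a Dockerfile packaging a Python model as a web service might look like this:

```
# Sketch: package a Python model and its dependencies into an image
FROM python:3.9-slim

WORKDIR /app

# Install the exact library versions the model was developed with
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the serialized model and the serving code into the image
COPY model.pkl app.py ./

# Expose the port of the web service and start it
EXPOSE 5000
CMD ["python", "app.py"]
```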

Figure 2: Comparison of Docker Containers and virtual machines

Before Docker’s rise, virtual machines were long the tool of choice for delivering applications and ML models in isolation. However, Docker has proven to have several advantages over virtual machines, including improved resource utilization, scalability, and faster deployment of new software. The following sections examine these three points in more detail.

Improved resource utilization

Figure 2 schematically compares how applications run in Docker containers and in virtual machines. Virtual machines have their own guest operating system on which the different applications run. Virtualizing the guest operating system at the hardware level requires a lot of computing power and memory. Therefore, fewer applications can run simultaneously on the same hardware while maintaining the same efficiency.

On the other hand, Docker Containers share the host operating system and do not require a separate operating system. Therefore, applications in Docker Containers boot faster and use less processing power and memory due to the host’s shared operating system. This lower resource utilization makes it possible to run several applications in parallel on a server, which improves the utilization rate of a server.

Scalability

Containers offer a further advantage in the area of scaling: if an ML model is to be used more frequently within the company, the application must be able to handle the additional requests. Fortunately, ML models packaged with Docker can easily be scaled by starting additional containers with the same application. Kubernetes in particular, an open-source technology for container orchestration and the scalable delivery of web services, is well suited to flexible scaling thanks to its compatibility with Docker. With Kubernetes, web services can be scaled up or down flexibly and automatically based on the current workload.
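
For instance, assuming the model runs as a (hypothetical) Kubernetes Deployment named model-service, scaling it to a fixed number of replicas or enabling autoscaling is a single command:

kubectl scale deployment model-service --replicas=5
kubectl autoscale deployment model-service --min=2 --max=10 --cpu-percent=80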

Deployment of new software

Another advantage is that containers can be pushed seamlessly from local development machines to production machines. Therefore, they are easy to exchange, for example, when a new version of the model is to be provided. The isolation of the code and all dependencies in a container also leads to a more stable environment in which the model can be operated. As a result, errors due to, for example, incorrect versions of individual libraries occur less frequently and can be corrected more effectively.

The model is provided within a container as a web service that other users and applications can access via common Internet protocols (e.g., HTTP). In this way, the model can be accessed as a web service by other systems and users without the need for them to meet specific technical requirements. Thus, it is unnecessary to install libraries or the model’s programming language to make the model usable.
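
To illustrate, a minimal serving script – the endpoint name, model file, and port are assumptions that match the Dockerfile sketch above – could expose the model over HTTP with Flask like this:

```
# app.py – sketch of serving a model over HTTP with Flask
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model that was packaged into the container
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[0.23, 61.5, 55.0]]}
    features = request.get_json()["features"]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the service is reachable from outside the container
    app.run(host="0.0.0.0", port=5000)
```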

In addition to Docker, other container technologies such as rkt and Mesos exist. However, Docker, with its user-friendly operation and detailed documentation, makes it easy for new developers to get started. Thanks to its large user base, templates exist for many standard applications that can be run in containers with little effort. At the same time, these free templates serve as a basis for developing your own applications.

Not least because of these advantages, Docker is now considered best practice in the MLOps process, and the process of model development increasingly resembles traditional software development. This becomes clear from the fact that container-based applications are supported by standard tools for the continuous integration and delivery (CI/CD) of web services.

What role do Docker Containers play in the MLOps pipeline?

As already mentioned, MLOps is a complex process of continuous provision of ML models. The central components of such a system are illustrated in figure 1. The MLOps process is very similar to the DevOps process because the development of machine learning systems is also a form of software development. Standard concepts from the DevOps area, such as continuous integration of new code and provision of new software, can be found in the MLOps process. New ML-specific components such as continuous model training and model and data validation are added.

It is considered best practice to embed the development of ML models in an MLOps pipeline. The MLOps pipeline includes all steps from the provision and transformation of data and model training to the continuous delivery of finished models to production servers. The code for each step in the pipeline is packaged in a Docker container, and the pipeline starts the containers in a defined order. This is where Docker containers show their strength: by isolating the code within individual containers, code changes can be incorporated continuously at the appropriate points in the pipeline without replacing the entire pipeline. As a result, the cost of maintaining the pipeline is relatively low. The major cloud providers (GCP, AWS, and Microsoft Azure) also offer services that allow Docker containers to be built, deployed, and hosted as web services automatically. To make container scaling as easy and flexible as possible, cloud providers also offer fully managed Kubernetes products. For the use of ML models in the enterprise, this flexibility means cost savings, as an ML application can simply be scaled down when its usage rate drops. Similarly, higher demand can be met by starting additional containers without stopping the container that is currently serving the model, so users of the application will not experience any unnecessary downtime.

Conclusion

Docker containers are a core technology for the development of machine learning models and MLOps pipelines. Their advantages are portability, modularization, and isolation of model code, low maintenance when integrated into pipelines, faster deployment of new versions of the model, and scalability via serverless cloud products for container deployment. At STATWORX, we have recognized the potential of Docker containers and are actively using them. With this knowledge, we support our customers in the realization of their machine learning and AI projects. Do you want to use Docker in your MLOps pipeline? Our Academy offers remote training on Data Science with Docker as well as free webinars on MLOps and Docker.

Introduction

Sometimes here at STATWORX impromptu discussions about statistical methods happen. In one such discussion one of my colleagues decided to declare (albeit jokingly) Bayesian statistics unnecessary. This made me ask myself: Why would I ever use Bayesian models in the context of a standard regression problem? Surely, existing approaches such as Ridge Regression are just as good, if not better. However, the Bayesian approach has the advantage that it lets you regularize your model to prevent overfitting **and** meaningfully interpret the regularization parameters. Contrary to the usual way of looking at ridge regression, the regularization parameters are no longer abstract numbers, but can be interpreted through the Bayesian paradigm as derived from prior beliefs. In this post, I’ll show you the formal similarity between a generalized ridge estimator and the Bayesian equivalent.

A (very brief) primer on Bayesian Stats

To understand the Bayesian regression estimator a minimal amount of knowledge about Bayesian statistics is necessary, so here’s what you need to know (if you don’t already): In Bayesian statistics we think about model parameters (i.e. regression coefficients) probabilistically. In other words, the data given to us is fixed, and the parameters are considered random. This runs counter to the standard, frequentist perspective in which the underlying model parameters are treated as fixed, while the data are considered random realizations of the stochastic process driven by those fixed model parameters. The end goal of Bayesian analysis is to find the posterior distribution, which you may remember from Bayes Rule:

    \[p(\theta|y) = \frac{p(y|\theta) p(\theta)}{p(y)}\]


While p(y|\theta) is our likelihood and p(y) is a normalizing constant, p(\theta) is our prior, which does not depend on the data, y. In classical statistics, p(\theta) is set to 1 (an improper reference prior) so that when the posterior ‘probability’ is maximized, really just the likelihood is maximized, because it’s the only part that still depends on \theta. However, in Bayesian statistics we use an actual probability distribution in place of p(\theta), a Normal distribution for example. So let’s consider the case of a regression problem, and we’ll assume that our target, y, **and** our prior follow normal distributions. This leads us to **conjugate** Bayesian analysis, in which we can neatly write down an equation for the posterior distribution. In many cases this is actually not possible, and for this reason Markov Chain Monte Carlo methods were invented to sample from the posterior – taking a frequentist approach, ironically.

We’ll make the usual assumption about the data: y_i is i.i.d. N(\bold {x_i \beta}, \sigma^2) for all observations i. This gives us our standard likelihood for the Normal distribution. Now we can specify the priors for the parameters we’re trying to estimate, (\beta, \sigma^2). If we choose a Normal prior (conditional on the variance, \sigma^2) for the vector of weights \beta, i.e. N(b_0, \sigma^2 B_0), and an inverse-Gamma prior over the variance parameter, it can be shown that the posterior distribution for \beta is Normally distributed with mean

    \[\hat\beta_{Bayesian} = (B_0^{-1} + X'X)^{-1}(B_0^{-1} b_0 + X'X \hat\beta)\]


If you’re interested in a proof of this result check out Jackman (2009, p.526).
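
For reference, the conjugate prior specification described above can be written compactly as follows (the inverse-Gamma hyperparameters a_0 and c_0 play no further role here):

    \[\beta \mid \sigma^2 \sim N(b_0, \sigma^2 B_0), \qquad \sigma^2 \sim \text{Inverse-Gamma}(a_0, c_0)\]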

Let’s look at the posterior mean piece by piece:

  • \hat\beta is our standard OLS estimator, (X'X)^{-1}X'y
  • b_0 is the mean vector of the (multivariate normal) prior distribution, so it lets us specify what we think the average values of each of our model parameters are
  • B_0 is the covariance matrix and contains our respective uncertainties about the model parameters. The inverse of the variance is called the **precision**

What we can see from the equation is that the mean of our posterior is a **precision weighted average** of our prior mean (information not based on data) and the OLS estimator (based solely on the data). The second term in parentheses indicates that we are taking the precision-weighted prior mean, B_0^{-1} b_0, and adding it to the weighted OLS estimator, X'X\hat\beta. Imagine for a moment that B_0^{-1} = 0. Then

    \[\hat\beta_{Bayesian} = (X'X)^{-1}(X'X \hat\beta) = \hat\beta\]


This would mean that we are **infinitely** uncertain about our prior beliefs, so that the mean vector of our prior distribution would vanish, contributing nothing to our posterior! Likewise, if our uncertainty decreases (and the precision thus increases), the prior mean, b_0, contributes more to the posterior mean.

After this short primer on Bayesian statistics, we can now formally compare the Ridge estimator with the above Bayesian estimator. But first, we need to take a look at a more general version of the Ridge estimator.

Generalizing the Ridge estimator

A standard tool in many regression problems, the Ridge estimator is derived by minimizing the following penalized least squares loss function:

    \[L(\beta,\lambda) = \frac{1}{2}\sum(y-X\beta)^2 + \frac{1}{2} \lambda ||\beta||^2\]


While minimizing this gives us the standard Ridge estimator you have probably seen in textbooks on the subject, there’s a slightly more general version of this loss function:

    \[L(\beta,\lambda,\mu) = \frac{1}{2}\sum(y-X\beta)^2 + \frac{1}{2} \lambda ||\beta - \mu||^2\]


Let’s derive the estimator by first re-writing the loss function in terms of matrices:

    \[\begin{aligned}L(\beta,\lambda,\mu) &= \frac{1}{2}(y - X \beta)^{T}(y - X \beta) + \frac{1}{2} \lambda||\beta - \mu||^2 \\&= \frac{1}{2} y^Ty - \beta^T X^T y + \frac{1}{2} \beta^T X^T X \beta + \frac{1}{2} \lambda||\beta - \mu||^2\end{aligned}\]


Differentiating with respect to the parameter vector, we end up with this expression for the gradient:

    \[\nabla_{\beta} L (\beta, \lambda, \mu) = -X^T y + X^T X \beta + \lambda (\beta - \mu)\]


Setting the gradient to zero and solving for \beta, we get this expression for the generalized ridge estimator:

    \[\hat\beta_{Ridge} = (X'X + \lambda I )^{-1}(\lambda \mu + X'y)\]


The standard Ridge estimator can be recovered by setting \mu = 0. Usually we regard \lambda as an abstract parameter that regulates the penalty size, and \mu as a vector of values (one for each predictor) towards which the coefficients are pulled: the further the coefficients deviate from these values, the larger the loss. When \mu = 0 the coefficients are pulled towards zero.
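
Written out, the familiar textbook form with \mu = 0 is:

    \[\hat\beta_{Ridge} = (X'X + \lambda I)^{-1} X'y\]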

Let’s take a look at how the estimator behaves when the parameters \mu and \lambda change. We’ll define a meaningful ‘prior’ for our example and then vary the penalty parameter. As an example, we’ll use the `diamonds` dataset from the `ggplot2` package and model the price as a linear function of the number of carats in each diamond and of the depth, table, x, y and z attributes.
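
As a rough sketch of how this estimator can be computed – this is not the code behind the plots below, which you can find on our GitHub page – assume a design matrix X and a target vector y:

```
# Sketch of the generalized ridge estimator:
# beta_hat = (X'X + lambda * I)^(-1) (lambda * mu + X'y)
generalized_ridge <- function(X, y, lambda, mu = rep(0, ncol(X))) {
  XtX <- crossprod(X)  # X'X
  solve(XtX + lambda * diag(ncol(X)), lambda * mu + crossprod(X, y))
}
```

Leaving mu at its default of zero recovers the standard Ridge estimator.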

[Figure: penalty_plots – coefficient paths of the generalized ridge estimator as the penalty increases, without and with a prior]

As we can see from the plot, both with and without a prior the coefficient estimates change rapidly for the first few increases in the penalty size. We also see the ‘shrinkage’ effect in the first plot: as the penalty increases, the coefficients tend towards zero, some faster than others. The second plot shows how the coefficients change when we set a sensible ‘prior’. The coefficients still change, but they now tend towards the ‘prior’ we specified. This is because \lambda penalizes deviations from \mu, which means that larger values for the penalty pull the coefficients towards \mu. You might be asking yourself how this compares to the Bayesian estimator. Let’s find out!

Comparing the Ridge and Bayesian Estimator

Now that we’ve seen both the Ridge and the Bayesian estimators, it’s time to compare them. We discovered that the Bayesian estimator contains the OLS estimator. Since we know its form, let’s substitute it and see what happens:

    \[\begin{aligned}\hat\beta_{Bayesian} &= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X \hat\beta) \\&= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X (X'X)^{-1}X'y) \\&= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'y)\end{aligned}\]

This form makes the analogy much clearer:

  • \lambda I corresponds to B_0^{-1}, the matrix of precisions. In other words, since I is the identity matrix, the ridge estimator assumes no covariances between the regression coefficients and a constant precision across all coefficients (recall that \lambda is a scalar)
  • \lambda \mu corresponds to B_0^{-1} b_0, which makes sense, since the vector b_0 is the mean of our prior distribution, which essentially pulls the estimator towards it, just like \mu ‘shrinks’ the coefficients towards its values. This ‘pull’ depends on the uncertainty captured by B_0^{-1} or \lambda I in the ridge estimator.

That’s all well and good, but let’s see how changing the uncertainty in the Bayesian case compares to the behaviour of the ridge estimator. Using the same data and the same model specification as above, we’ll set the covariance matrix B_0 equal to \lambda I and then vary \lambda. Remember, smaller values of \lambda now imply a greater contribution of the prior (less uncertainty), and increasing them makes the prior less important.

[Figure: bayes_penalty_plots – coefficient paths of the Bayesian estimator as the prior precision changes, without and with a prior mean]

The above plots match our understanding so far: with a prior mean of zeros, the coefficients are shrunken towards zero when the prior dominates, i.e. when the precision is high, just as in the ridge regression case. And when a prior mean is set, the coefficients tend towards it as the precision increases. So much for the coefficients, but what about the performance? Let’s have a look!

Performance comparison

Lastly, we’ll compare the predictive performance of the two models. Although we could treat the parameters in the model as hyperparameters to be tuned, this would defeat the purpose of using prior knowledge. Instead, let’s choose a prior specification for both models and then compare the performance on a hold-out set (30% of the data). While we can use the simple X\hat\beta as our predictor for the Ridge model, the Bayesian model provides us with a full posterior predictive distribution which we can sample from to get model predictions. To estimate the model I used the `brms` package.

|                       |    RMSE |     MAE |  MAPE |
| :-------------------- | ------: | ------: | ----: |
| Bayesian Linear Model | 1625.38 | 1091.36 | 44.15 |
| Ridge Estimator       | 1756.01 | 1173.50 | 43.44 |

Overall both models perform similarly, although some error metrics slightly favor one model over the other. Judging by these errors, we could certainly improve our models by specifying a more appropriate probability distribution for our target variable. After all, prices cannot be negative, yet our models can and do produce negative predictions.

Recap

In this post, I’ve shown you how the ridge estimator compares to the Bayesian conjugate linear model. Now you understand the connection between the two models and how a Bayesian approach can provide a more readily interpretable way of regularizing your model. Normally \lambda would be considered a penalty size, but now it can be interpreted as a measure of prior uncertainty. Similarly, the parameter vector \mu can be seen as a vector of prior means for our model parameters in the extended ridge model. Beyond that, the Bayesian approach lets us use prior distributions to incorporate expert knowledge into the estimation process. This regularizes your model and allows external information to be included in it. If you are interested in the code, check it out at our [GitHub page](https://github.com/STATWORX/blog/tree/master/bayesian_regularization)!

References

– Jackman S. 2009. Bayesian Analysis for the Social Sciences. West Sussex: Wiley.


Introduction

Here at STATWORX, we value reader-friendly presentations of our work. For many R users, the choice is usually a Markdown file that generates a .html or .pdf document, or a Shiny application, which provides users with an easily navigable dashboard.

What if you want to construct a dashboard-style presentation without much hassle? Well, look no further. RStudio’s flexdashboard package gives data scientists a Markdown-based way of easily setting up dashboards without having to resort to full-on front-end development. Using Shiny may be a bit too involved when the goal is simply to present your work in a dashboard.

Why should you learn about flexdashboards? If you’re familiar with R Markdown and know a bit about Shiny, flexdashboards are easy to learn and give you an alternative to Shiny dashboards.

In this post, you will learn the basics of how to design a flexdashboard. By the end of this article, you’ll be able to:

  • build a simple dashboard involving multiple pages
  • put tabs on each page and adjust the layout
  • integrate widgets
  • deploy your Shiny document on ShinyApps.io.

The basic rules

To set up a flexdashboard, install the package from CRAN using the standard command. To get started, enter the following into the console:

rmarkdown::draft(file = "my_dashboard", template = "flex_dashboard", package = "flexdashboard")

This function creates a .Rmd file with the associated file name and uses the package’s flexdashboard template. Rendering your newly created dashboard, you get a column-oriented layout with a header, one page, and three boxes. By default, the page is divided into columns, and the left-hand column is made to be double the height of the two right-hand boxes.

You can change the layout-orientation to rows and also select a different theme. Adding runtime: shiny to the YAML header allows you to use HTML widgets.

Each row (or column) is created using a --------- header, and the panels themselves are created with a ### header followed by the title of the panel. You can introduce tabsetting for each row by adding the {.tabset} attribute after its name. To add a page, use a ======= header and put the page name above it. Row height can be modified by adding {data-height=} after a row name if you chose a row-oriented layout. Depending on the layout, it may make sense to use {data-width=} instead.

Here, I’ll design a dashboard which explores the famous diamonds dataset, found in the ggplot2 package. While the first page contains some exploratory plots, the second page compares the performance of a linear model and a ridge regression in predicting the price.

This is the skeleton of the dashboard (minus R code and descriptive text):

---
title: "Dashing diamonds"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    css: bootswatch-3.3.5-4/flatly/bootstrap.css
    logo: STATWORX_2.jpg
runtime: shiny
---

Exploratory plots 
=======================================================================

Sidebar {.sidebar data-width=700} 
-----------------------------------------------------------------------

**Exploratory plots**

<br>

**Scatterplots**

<br>

**Density plot**

<br>

**Summary statistics**

<br>


Row {.tabset}
-----------------------------------------------------------------------

### Scatterplot of selected variables

### Density plot for selected variable

Row 
-----------------------------------------------------------------------

### Maximum carats {data-width=50}

### Most expensive color {data-width=50}

### Maximal price {data-width=50}

Row {data-height=500}
-----------------------------------------------------------------------

### Summary statistics {data-width=500}

Model comparison
=======================================================================

Sidebar {.sidebar data-width=700}
-----------------------------------------------------------------------

**Model comparison**

<br>

Row{.tabset}
-----------------------------------------------------------------------

 **Comparison of Predictions and Target**

### Linear Model

### Ridge Regression 

Row
-----------------------------------------------------------------------
### Densities of predictions vs. target 

The sidebars were added by specifying the attribute {.sidebar} after the name, followed by a page or row header. Page headers (========) create global sidebars, whereas local sidebars are made using row headers (---------). If you choose a global sidebar, it appears on all pages whereas a local sidebar only appears on the page it is put on. In general, it’s a good idea to add the sidebar after the beginning of the page and before the first row of the page. Sidebars are also good for adding descriptions of what your dashboard/application is about. Here I also changed the width using the attribute data-width. That widens the sidebar and makes the description easier to read. You can also display outputs in your sidebar by adding code chunks below it.

Adding interactive widgets

Now that the basic layout is done let’s add some interactivity. Below the description in the sidebar on the first page, I’ve added several widgets.

```{r}
selectInput("x", "X-Axis", choices = names(train_df), selected = "x")
selectInput("y", "Y-Axis", choices = names(train_df), selected = "price")
selectInput("z", "Color by:", choices = names(train_df), selected = "carat")
selectInput("model_type", "Select model", choices = c("LOESS" = "loess", "Linear" = "lm"), selected = "lm")
checkboxInput("se", "Confidence intervals ?")
```

Notice that the widgets are identical to those you typically find in a Shiny application and they’ll work because runtime: shiny is specified in the YAML.

To make the plots react to changes in the widget selections, you need to specify the input IDs of your widgets within the appropriate render function. For example, the scatterplot is rendered as a plotly output:

```{r}
renderPlotly({
  p <- train_df %>% 
  	ggplot(aes_string(x = input$x, y = input$y, col = input$z)) + 
    geom_point() +
    theme_minimal() + 
    geom_smooth(method = input$model_type, position = "identity", se = input$se) + 
    labs(x = input$x, y = input$y)
  
  p %>% ggplotly()
})
```

You can use the render functions you would also use in a Shiny application. Of course, you don’t have to use render-functions to display graphics, but they have the advantage of resizing the plots whenever the browser window is resized.

Adding value boxes

Aside from plots and tables, one of the more stylish features of dashboards are value boxes. flexdashboard provides its own function for value boxes, with which you can nicely convey information about key indicators relevant to your work. Here, I’ll add three such boxes displaying the maximal price, the most expensive color of diamonds and the maximal amount of carats found in the dataset.

flexdashboard::valueBox(max(train_df$carat), 
                        caption = "maximal amount of carats",
                        color = "info",
                        icon = "fa-gem")

There are multiple sources from which icons can be drawn. In this example, I’ve used the gem icon from Font Awesome. This code chunk follows a header for what would otherwise be a plot or a table, i.e., a ### header.

Final touches and deployment

To finalize your dashboard, you can add a logo and choose from one of several themes, or attach a CSS file. Here, I’ve added a bootswatch theme and modified the colors slightly. Most themes require the logo to be 48×48 pixels in size.

---
title: "Dashing diamonds"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    css: bootswatch-3.3.5-4/flatly/bootstrap.css
    logo: STATWORX_2.jpg
runtime: shiny
---

After creating your dashboard with runtime: shiny, it can be hosted on ShinyApps.io, provided you have an account. You also need to install the package rsconnect. The document can be published with the ‘Publish to Server’ button in RStudio or with:

rsconnect::deployDoc('path')

You can use this function after you’ve obtained your account and authorized it using rsconnect::setAccountInfo() with an access token and a secret provided by the website. Make sure that all of the necessary files are part of the same folder. RStudio’s ‘Publish to Server’ button has the advantage of automatically recognizing the external files your application requires. You can view the example dashboard here and the code on our GitHub page.
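
For reference, the authorization step is a one-liner, shown here with placeholder values for the account name, token, and secret you obtain from ShinyApps.io:

rsconnect::setAccountInfo(name = "your-account", token = "YOUR_TOKEN", secret = "YOUR_SECRET")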

Recap

In this post, you’ve learned how to set up a flexdashboard, customize and deploy it – all without knowing JavaScript or CSS, or even much R Shiny. However, what you’ve learned here is only the beginning! This powerful package also allows you to create storyboards, integrate them in a more modularized way with R Shiny and even set up dashboards for mobile devices. We will explore these topics together in future posts. Stay tuned!
