This blog post is aimed at aspiring data scientists who have to decide which programming language to learn first. At STATWORX, we primarily use the two most popular languages, R and Python. Both have their strengths and weaknesses, which is why you should ideally master both. To get started, however, we recommend learning one language first and then tackling the other. Don’t forget that both languages are just tools; what matters is what you do with them. Once you have understood the concepts of working with data, the second language should generally be easier to learn. This post introduces both languages to beginners. I try to stay as unbiased as possible, but for certain tasks I prefer R and for others Python, so the recommendations reflect my own preferences. Nevertheless, I hope my experience helps you as a newcomer.
Overview of R and Python
Both Python and R are open source programming languages, which means that the source code is publicly available and can be used for free. While Python is a general-purpose programming language, R was developed for statistical analysis. Therefore, users of these languages often have different backgrounds. In general terms, software developers use Python and statisticians R.
| | R | Python |
|---|---|---|
| Release | 1993 | 1991 |
| Developer | R Core Team | Python Software Foundation |
| Package Management | CRAN | PyPI |
A collection of extensions
Both languages have a basic set of functions that can be extended with packages.
The Comprehensive R Archive Network (CRAN) is the central platform for R packages. A package must satisfy a set of requirements before it is accepted, which is how CRAN aims to ensure that the packages offered for download actually work. More than 10,000 packages are available on CRAN. Since R is the standard language for statisticians, CRAN has a suitable solution for almost any statistical problem, which makes it the right place to look for the latest statistical methods and analyses. Some packages depend on other packages, which can cause problems in specific scenarios. The packrat package aims to ensure that all dependencies are met and everything runs smoothly.
Python has two package management platforms: conda and PyPI (Python Package Index). There are also over 10,000 packages for Python, and they cover a much broader range of applications than the R packages do. However, only a small share of them is relevant for data science projects. Because complications can occur when you install Python packages globally, you can use virtual environments. They keep each project’s packages and their dependencies separate from one another, similar to what the packrat package does in R. For beginners, the idea behind the different environments can be hard to grasp at first.
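To make the idea of a virtual environment more concrete, here is a minimal sketch using the `venv` module from Python's standard library. The helper name `create_project_env` and the folder name `.venv` are illustrative choices, and pip is left out (`with_pip=False`) only to keep the sketch fast.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir):
    """Create an isolated environment in <project_dir>/.venv and return its path.

    Hypothetical helper for illustration; ".venv" is only a common convention.
    """
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False skips bootstrapping pip to keep this sketch fast;
    # for real work you would normally pass with_pip=True.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as project:
        env = create_project_env(project)
        # Each environment has its own config file and its own package folder,
        # so packages installed into it do not touch the global installation.
        print((env / "pyvenv.cfg").exists())
```

In day-to-day work the same thing is usually done from the command line with `python -m venv .venv`, followed by activating the environment before installing packages.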
Although R and Python packages are used the same way, there are some fundamental differences. Usually, an R package is developed by a single person or a small group of researchers. The authors write the package based on a scientific paper and refer to it in the documentation. Python packages such as numpy, pandas, and scikit-learn, by contrast, are often developed by large groups of developers.
That has advantages and disadvantages:
- With the large Python packages, the namespace for functions is clear, and functions share the same structure. For example, when setting up different models and comparing their performance, you’ll use the scikit-learn package in Python. In R, you’ll use different packages depending on which model you want to implement, and function and argument names differ from package to package, which can be cumbersome. Notably, the packages caret and parsnip try to smooth out these discrepancies in R after the fact.
- In some cases, the functions of scikit-learn have to be checked thoroughly. The developers are usually optimization-oriented and neglect some statistical aspects. For example, the scikit-learn function for logistic regression (sklearn.linear_model.LogisticRegression) uses L2 regularization by default. The only way to get a logistic model without regularization is to set the regularization parameter C, the inverse of the regularization strength, to a high number. That’s surprising, given the function’s name. Furthermore, the developers did not see why this poses a problem for users. Even if the regularized model generalizes better and thus might predict better, there are cases where I would like to obtain the non-regularized coefficients for inference.
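The effect of the default penalty can be seen in a small sketch. It assumes scikit-learn and numpy are installed; the synthetic data from `make_classification`, the helper name `fit_coefficients`, and the value `C=1e10` are illustrative choices rather than recommendations. A very large `C` makes the L2 penalty negligible, which is the workaround described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def fit_coefficients(X, y, C):
    """Fit a logistic regression with inverse regularization strength C."""
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X, y)
    return model.coef_.ravel()


# Synthetic data for illustration only. flip_y adds label noise so the classes
# are not perfectly separable and the almost unpenalized fit stays finite.
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)

coef_default = fit_coefficients(X, y, C=1.0)         # default: L2 penalty active
coef_weak_penalty = fit_coefficients(X, y, C=1e10)   # penalty practically off

print("C=1.0 :", np.round(coef_default, 3))
print("C=1e10:", np.round(coef_weak_penalty, 3))
print("norms :", np.linalg.norm(coef_default), np.linalg.norm(coef_weak_penalty))
```

The coefficients fitted with the default setting are shrunk toward zero, which matters when they are interpreted for inference rather than used only for prediction.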
Find your IDE
Programmers often use an integrated development environment (IDE), which makes their work easier with many small but useful tools.
For R users, RStudio has become the standard IDE. It is distributed by the company RStudio Inc., which stands commercially behind R. RStudio not only provides a pleasant working environment but also develops packages and extensions for the R language. For example, the RStudio team contributed important packages like tidyverse, packrat, and devtools, as well as popular extensions like shiny (for dashboards) and RMarkdown (for reports).
Python users can choose between numerous IDEs (PyCharm, Visual Studio Code, Spyder, and many more). However, there is no company behind Python that is comparable to RStudio Inc. Thanks to the efforts of the large community and the Python Software Foundation, new extensions for Python are continually being published.
The Art of Data Visualization
The most commonly used packages for data visualization with Python are matplotlib and seaborn. Dashboards can be created in Python with dash.
However, R has an ace up its sleeve in data visualization: the ggplot2 package, which is based on Leland Wilkinson’s book The Grammar of Graphics. With this package, you can create attractive and customized graphics, which you can share via shiny dashboards with others.
Both programming languages make it easy to create beautiful graphics. Nevertheless, the R package ggplot2 stands out for its flexibility, its visual possibilities, and the well-thought-out philosophy behind it. Sharing the graphics with a shiny dashboard is extremely easy. I should mention that there are efforts to bring ggplot2 to Python.
Props for readability
Python was designed following the motto “readability counts”. So even people who are not familiar with the programming language can interpret what the code does.
R, as a programming language, has changed a lot in recent years, mainly because of the packages developed by the RStudio team. The readability of the code has improved substantially with the dplyr package and the pipe operator, which lets code be read from left to right.
Speed at different numbers of observations
Next, I compare how long it takes to create a simulated dataset in R and Python. For a fair comparison, the conditions should be approximately the same. The data is simulated with the packages Xy and XyPy in R and Python, respectively. I used microbenchmark in R and timeit in Python to measure the time. Also, I parallelized the process using eight cores (R: parallel, Python: multiprocessing) to generate the simulation as fast as possible.
For the experiment, a dataset with 100 observations and 50 variables is simulated 100 times. The time it takes the computer to perform the simulation is measured individually for each simulation. That is then repeated for 1,000, 10,000, 100,000 and 1,000,000 observations.
The R and Python code snippets are shown below.
```r
# R
# devtools::install_github("andrebleier/Xy")
# install.packages("microbenchmark")
# Note: parallel ships with base R and needs no separate installation

# Load packages
library(Xy)
library(microbenchmark)
library(parallel)

# Extract function definition from for loop:
# time 100 simulations of a dataset with n_sim observations
sim_this <- function(n_sim) {
  sim <- microbenchmark(Xy(n = n_sim,
                           numvars = c(50, 0),
                           catvars = 0),
                        times = 100, unit = "s")
  # Column 4 of the microbenchmark summary is the mean run time
  data.frame(n = n_sim,
             mean = summary(sim)[, 4])
}

# Time measurement for different numbers of observations
n_sim <- c(1e2, 1e3, 1e4, 1e5, 1e6)
sim_in_r <- data.frame(n = rep(0, length(n_sim)),
                       t = rep(0, length(n_sim)))
for (i in seq_along(n_sim)) {
  out <- mclapply(n_sim[i],
                  FUN = sim_this,
                  mc.cores = 8)
  sim_in_r[i, 1] <- out[[1]]$n
  sim_in_r[i, 2] <- out[[1]]$mean
}
```
```python
# Python
import multiprocessing as mp
import numpy as np
import timeit
from XyPy import Xy


# Predefine function of interest
def sim_this(n_sim):
    # timeit.timeit returns the total time of all `number` runs,
    # so divide by the number of runs to get the mean per simulation
    total = timeit.timeit(lambda: Xy(n=int(n_sim),
                                     numvars=[50, 0],
                                     catvars=[0, 0],
                                     weights=[5, 10],
                                     stn=4.0,
                                     cor=[0, 0.1],
                                     interactions=1,
                                     noisevars=5), number=100)
    return total / 100


# Parallel computation; the guard is needed on platforms that spawn processes
if __name__ == "__main__":
    n_sim = np.array([1e2, 1e3, 1e4, 1e5, 1e6])
    with mp.Pool(processes=8) as pool:
        results = pool.map(sim_this, n_sim)
```
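If XyPy is not at hand, the measurement protocol itself can still be tried with the standard library alone. The sketch below is not the benchmark used for this post: `simulate_dataset` is a hypothetical stand-in that merely fills a table with Gaussian random numbers, and the sizes and the repeat count are scaled down to keep the run short.

```python
import random
import statistics
import timeit


def simulate_dataset(n_obs, n_vars=50):
    """Hypothetical stand-in for Xy: a table of Gaussian noise as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_obs)]


def mean_runtime(n_obs, repeats=5):
    """Time each simulation individually and return the mean in seconds."""
    # number=1 with repeat=repeats yields one timing per single simulation run
    timings = timeit.repeat(lambda: simulate_dataset(n_obs),
                            repeat=repeats, number=1)
    return statistics.mean(timings)


if __name__ == "__main__":
    for n_obs in (100, 1_000, 10_000):
        print(f"{n_obs:>6} observations: {mean_runtime(n_obs):.4f} s on average")
```

Timing every run separately, instead of one total over all runs, also makes it possible to look at the spread of the run times and not only their mean.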
The average duration, sorted by the number of observations, is shown in the plot below for R and Python. The x-axis is shown on a logarithmic scale with base 10, to make the plot clearer.
While R is a little faster for dataset sizes of 100 and 1,000 observations, Python is significantly faster for 100,000 and 1,000,000 observations.
For other speed comparisons, I recommend the following STATWORX blog posts: pandas vs. data.table and pandas vs. data.table part 2. These posts focus on data manipulation.
The Standard in Deep Learning
If you are particularly interested in deep learning, I recommend Python. Most popular deep learning libraries were written in Python or are designed to be used with it.
Deep learning is also possible with R, but the R deep learning community is much smaller. Implementations like Keras and TensorFlow can be called in R but are run in Python in the background. Furthermore, the packages do not provide full flexibility for the users, e.g., not all TensorFlow functions are available.
A Survey in the Community
For aspiring data scientists, Kaggle is an essential platform. There you can participate in exciting machine learning competitions, experiment for yourself, and learn from the experiences of the community.
In 2018, Kaggle conducted a Machine Learning & Data Science Survey. The poll was online for two weeks and received a total of 23,859 replies. From the results of this survey, I have created different plots to get some insights regarding my blog topic. The code for the individual plots is publicly available on GitHub.
Excursion: Python & R compared to other languages
Before we jump to R and Python, let’s see how they compare to other programming languages. Each respondent indicated which language they primarily use. The plot below aggregates the answers by language, and the result is clear: most of the participants use Python, followed by R in second place. This survey does not distinguish between fields of work, which is probably why Python, the general-purpose programming language, is so prominent.
The results for R & Python
In a direct comparison between R and Python, you can see that many R users also use Python, whereas Python users often work exclusively with Python.
Comparing the use of languages by field reveals a clear dominance of Python. In every group except statisticians, the majority uses Python.
Participants were also asked: What language do you recommend for beginners to learn first? The answers to the question are shown in the table below.
Language | Recommendations | Users | Difference |
---|---|---|---|
Python | 14,181 | 8,180 | 6,001 |
R | 2,342 | 2,046 | 296 |
SQL | 914 | 1,211 | -297 |
C++ | 339 | 739 | -400 |
Matlab | 256 | 355 | -99 |
Java | 184 | 903 | -719 |
Scala | 74 | 106 | -32 |
Javascript | 72 | 408 | -336 |
SAS | 69 | 228 | -159 |
VBA | 38 | 135 | -97 |
Go | 26 | 46 | -20 |
Other | 161 | 117 | 44 |
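When the number of recommendations and the number of users are compared, you can see that Python and R are the only named languages with a positive difference (the catch-all category Other aside). In other words, they are recommended to beginners more often than they are used as a primary language.
Again, Python (14,181) is well ahead of R (2,342).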
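The difference column can be recomputed in a few lines of Python. The numbers are copied from the table above; the function name `recommendation_surplus` is only an illustrative choice, and the share is taken over the answers listed in the table.

```python
# (recommendations, users) per language, copied from the table above
survey = {
    "Python": (14181, 8180),
    "R": (2342, 2046),
    "SQL": (914, 1211),
    "C++": (339, 739),
    "Matlab": (256, 355),
    "Java": (184, 903),
    "Scala": (74, 106),
    "Javascript": (72, 408),
    "SAS": (69, 228),
    "VBA": (38, 135),
    "Go": (26, 46),
    "Other": (161, 117),
}


def recommendation_surplus(counts):
    """Return recommendations minus users for every language."""
    return {lang: rec - users for lang, (rec, users) in counts.items()}


differences = recommendation_surplus(survey)
# Entries that are recommended more often than they are used
positive = [lang for lang, diff in differences.items() if diff > 0]
# Python's share among all recommendations listed in the table
python_share = survey["Python"][0] / sum(rec for rec, _ in survey.values())

print(differences)
print("Positive difference:", positive)
print(f"Share of recommendations for Python: {python_share:.1%}")
```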
Conclusion
First of all: both languages are very powerful, so you cannot make a wrong choice! Which language to choose depends on what kind of project you want to tackle.
As a universal programming language, Python is suitable for a variety of applications, which is why I generally recommend starting with Python. But if statistical analysis or data visualization is paramount in your projects, R has an advantage over Python.
As already mentioned, both languages have their advantages and disadvantages. As an advanced data scientist, you should ideally master both languages.
I hope this post gives you an idea of the differences between R and Python and helps you make the right choice for yourself. I could not go into much depth with the arguments for my preferences in this blog post, so you are very welcome to send me an e-mail if you have any further questions on the topic.
Happy Coding!
If you’re interested in training, feel free to check out our course catalogs for R and Python at STATWORX Academy.
References
- https://cran.r-project.org/
- https://www.python.org/
- https://pipenv.readthedocs.io/en/latest/
- https://conda.io/en/latest/
- https://github.com/scikit-learn/scikit-learn/issues/6738
- https://www.rstudio.com/
- https://www.springer.com/us/book/9780387245447
- http://wiki.c2.com/?PythonPhilosophy
- https://www.kaggle.com
- https://www.statworx.com/ai-academy/