How NOT to overplot

Lea Waniek Blog, Data Science, Statistics

Overplotting can be a serious problem that complicates data visualization and, in turn, data exploration. It describes situations in which multiple data points overlay each other within a plot, so that individual observations can no longer be distinguished. In such cases, a plot only indicates the general extent of the data, while existing relationships might be heavily obscured. Overplotting occurs especially when dealing with large data sets.

# Generating a sample from a bivariate normal distribution to plot
library(ggplot2)
library(hexbin)
library(dplyr)
library(grid)
library(gridExtra)

# Correlation of the two variables:
r <- 0.8    
sim_data <- MASS::mvrnorm(
                          # Number of observations:
                          n = 20000, 
                          # Means of the variables:
                          mu = c(20, 0), 
                          # Covariance matrix of the variables:
                          Sigma = matrix(c(1, r, 
                                           r, 1),
                                         nrow =2 ), 
                          # Treat mu and Sigma as population (not empirical) parameters:
                          empirical = FALSE)
# Keep x and y paired row-wise; shuffling each column separately
# would destroy the correlation between the two variables
x <- sim_data[, 1]
y <- sim_data[, 2]

df <- data.frame(x, y)

# Generating and storing scatter plots for differently sized subsamples
plot1_1 <- df %>%
  sample_n(size = 200) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 200")

plot1_2 <- df %>%
  sample_n(size = 2000) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 2000")

plot1_3 <- df %>%
  sample_n(size = 20000) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "N = 20000")

# Arranging plot objects in overview plot
plot_1 <- grid.arrange(
  plot1_1, plot1_2, plot1_3,
  nrow = 1,
  top = "Overplotting in differently sized samples") 

[Figure: Overplotting in differently sized samples (N = 200, N = 2000, N = 20000)]
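To put a rough number on the effect described above, one can count how many points fall into a small grid cell that is already occupied by another point. The following is only a sketch under stated assumptions: overplotted_share() is a hypothetical helper (not part of ggplot2), the 200 x 200 grid is an arbitrary stand-in for the resolution of a rendered plot, and the data are independent normal draws rather than the correlated sample used in this article.

```r
# Illustrative sketch: overplotted_share() is a hypothetical helper,
# not a ggplot2 function
set.seed(1)
n <- 20000
x <- rnorm(n, mean = 20)  # Assumption: independent normals for simplicity
y <- rnorm(n, mean = 0)

overplotted_share <- function(x, y, cells = 200,
                              xlim = c(15, 25), ylim = c(-5, 5)) {
  # Assign each point to a grid cell (integer codes, NA outside the limits)
  x_cell <- cut(x, breaks = seq(xlim[1], xlim[2], length.out = cells + 1),
                include.lowest = TRUE, labels = FALSE)
  y_cell <- cut(y, breaks = seq(ylim[1], ylim[2], length.out = cells + 1),
                include.lowest = TRUE, labels = FALSE)
  inside <- !is.na(x_cell) & !is.na(y_cell)
  cell_id <- paste(x_cell[inside], y_cell[inside])
  # Share of points that share their cell with an earlier point
  1 - length(unique(cell_id)) / length(cell_id)
}

shares <- sapply(c(200, 2000, 20000),
                 function(k) overplotted_share(x[seq_len(k)], y[seq_len(k)]))
names(shares) <- c("N = 200", "N = 2000", "N = 20000")
round(shares, 2)
```

The share of points hidden behind others grows with the sample size, which mirrors the visual impression from the three scatter plots.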

Adjusting glyphs

When considering discrete (or heavily rounded continuous) variables with a small range, overplotting is practically inevitable and graphical appraisal might be impractical. However, when exploring bivariate relationships involving at least one continuous variable, graphical analysis often grants very valuable insight. Therefore, this article demonstrates several options to circumvent overplotting using ggplot2.

The most used tool to graphically assess relationships between two continuous variables is the scatter plot. As shown above, overplotting can render scatter plots quite useless. When the degree of overplotting is moderate, modification of the glyphs might offer a solution.
Overplotting may be overcome by using small, hollow, and/or transparent glyphs (the latter option is referred to as “alpha blending”). Within a given plot layer, one can specify the glyph size in millimeters via the size argument (size = …). Alternatively, one can set the shape to a dot the size of a single pixel (shape = ".") or to one of the hollow shapes (shape = 0 / 1 / 2 / 5 / 6). The transparency can be adjusted via the alpha argument (alpha = …). The alpha level ranges between 0 and 1 and can be read as a fraction whose denominator is the number of data points that need to be overlaid to obtain a fully opaque color:

# Generating and storing scatter plots with different adjustments of glyphs
plot2_1 <- ggplot() + 
  geom_point(data = df, aes(x, y), size = 0.1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Small glyphs")

plot2_2 <- ggplot() +
  geom_point(data = df, aes(x, y), shape = 1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Hollow glyphs")

plot2_3 <- 
  ggplot() +
  geom_point(data = df, aes(x, y), alpha = 0.1) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Transparent glyphs")

# Arranging plot objects in overview plot
plot_2 <- grid.arrange(
  plot2_1, plot2_2, plot2_3,
  nrow = 1,
  top = textGrob("Counteracting overplotting by adjusting the glyphs", 
                 gp = gpar(fontsize = 18)))  

[Figure: Counteracting overplotting by adjusting the glyphs (small, hollow, and transparent glyphs)]
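The reading of the alpha level as a fraction is a rule of thumb. As a small worked example, and assuming that the graphics device blends overlapping glyphs with the usual "over" compositing rule (an assumption, since the exact behavior depends on the device), the opacity of k stacked points with the same alpha level is 1 - (1 - alpha)^k. The helper stacked_opacity() below is hypothetical and only illustrates this arithmetic.

```r
# Hypothetical helper: opacity of k stacked points under "over" compositing
# (assumption: every point has the same alpha level and color)
stacked_opacity <- function(alpha, k) {
  1 - (1 - alpha)^k
}

# Opacity reached with alpha = 0.1 for an increasing number of stacked points
k <- c(1, 5, 10, 20, 50)
opacity <- stacked_opacity(alpha = 0.1, k = k)
data.frame(k = k, opacity = round(opacity, 3))
```

Under this assumption, ten overlapping points with alpha = 0.1 reach about 65% opacity, and the color only becomes visually solid for considerably denser regions. When choosing alpha for a large data set, it can therefore be worth trying values well below 1 / (typical number of overlapping points).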

Jittering glyphs

Further, one can jitter data points by adding a small amount of random noise to a variable. To keep the distortion of the data as small as possible, jittering is ideally applied only along the less informative dimension, that is, the one pertaining to a discrete variable, if such a variable is involved:

# Generating and storing scatter plots with jittered glyphs
# Shortcut for geom_point(position = "jitter"): geom_jitter
plot3_1 <- 
  df %>%
  mutate(x = round(x, 0)) %>%           # Rounding to simulate discrete variable
  sample_n(size = 2000) %>%
  ggplot() +
  geom_point(aes(x,y), color = "black") +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "Without jittering")

plot3_2 <- 
  df %>%
  mutate(x = round(x, 0)) %>%           # Rounding to simulate discrete variable
  sample_n(size = 2000) %>%
  ggplot() +
  geom_jitter(aes(x, y), 
              width = 0.3,              # Max. horizontal noise (default for
                                        # width/height: 40% of the resolution)
              height = 0) +             # No noise along the continuous y axis
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With jittering")

# Arranging plot objects in overview plot
plot_3 <- grid.arrange(plot3_1, plot3_2, 
  nrow = 1, 
  top = textGrob("Counteracting overplotting by jittering", 
                 gp = gpar(fontsize = 18))) 

[Figure: Counteracting overplotting by jittering (without and with jittering)]
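What jittering does to the data can be sketched in base R without any plotting. The example below assumes uniform noise of at most 0.3 in either direction, which corresponds to how the width argument is used above; the variable names are illustrative only.

```r
set.seed(42)

# Simulated discrete variable: rounded draws from a normal distribution
x_discrete <- round(rnorm(2000, mean = 20), 0)

# Add uniform noise of at most +/- width to the discrete dimension only
width <- 0.3
x_jittered <- x_discrete + runif(length(x_discrete), min = -width, max = width)

# The distortion never exceeds the chosen width ...
max(abs(x_jittered - x_discrete)) <= width

# ... and since width < 0.5, rounding recovers the original values
all(round(x_jittered) == x_discrete)
```

Keeping the noise below half the distance between neighboring values ensures that the jittered columns of points do not run into each other, so each point can still be attributed to its original value.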

Plotting the joint density function

An alternative way to overcome overplotting, apart from adjusting the representation of the “raw” data points, is to consider the 2d joint density of the two variables. There are two feasible approaches.

Firstly, data points can be binned, the number of observations falling into each bin can be counted, and the resulting counts can be visualized. Mapping the count to color or alpha level is a straightforward option. With geom_hex and geom_bin2d, ggplot2 offers two implementations to do so. The geoms employ hexagonal and rectangular bins, respectively, but are otherwise quite alike. However, Carr et al. (1987) suggest using hexagonal bins, since square bins that are too small may produce visual artefacts.
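The binning and counting step can be sketched in base R. This is only an illustration of the idea with rectangular 1 x 1 bins and independent normal draws (both are assumptions made for brevity); it is not how ggplot2 implements its geoms internally.

```r
set.seed(1)
x <- rnorm(5000, mean = 20)
y <- rnorm(5000, mean = 0)

# Assign every observation to a bin along each axis
x_bin <- cut(x, breaks = seq(15, 25, by = 1))
y_bin <- cut(y, breaks = seq(-5, 5, by = 1))

# Count the observations per two-dimensional bin
counts <- table(x_bin, y_bin)
dim(counts)

# Bins close to the center of the distribution hold the most observations
counts["(19,20]", "(-1,0]"]
counts["(15,16]", "(-5,-4]"]
```

A binned plot then maps these counts to a visual property such as fill color, so that the number of drawn elements depends on the number of bins rather than on the number of observations.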

#  Generating and storing plots with geom_hex
plot4_1 <- 
  ggplot() +
  geom_hex(data = df, 
           aes(x,y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With default bins")

plot4_2 <- 
  ggplot() +
  geom_hex(data = df,
           aes(x,y), 
           binwidth = c(0.1, 0.1)) + # Width and height of the bins
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With 0.1 x 0.1 bins") 

# Arranging plot objects in overview plot
plot_4 <- grid.arrange(plot4_1, plot4_2, 
  nrow = 1, 
  top = textGrob("Counteracting overplotting with geom_hex", 
                 gp = gpar(fontsize = 18)))  

[Figure: Counteracting overplotting with geom_hex]

Alternatively with geom_bin2d:

#  Generating and storing plots with geom_bin2d
plot5_1 <- 
  ggplot() +
  geom_bin2d(data = df, 
             aes(x,y)) +
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With default number of bins")

plot5_2 <- 
  ggplot() +
  geom_bin2d(data = df,
             aes(x,y), 
             bins = 100) +  # Changing number of bins (default 30)
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "With 100 bins") 

# Arranging plot objects in overview plot
plot_5 <- grid.arrange(plot5_1, plot5_2, 
  nrow = 1,
  top = textGrob("Counteracting overplotting with geom_bin2d", 
                 gp = gpar(fontsize = 18)))  

[Figure: Counteracting overplotting with geom_bin2d]

Secondly, the 2d density can be estimated. The density can be visualized by plotting its contours or mapping it onto color or alpha level of tiles or onto the size of points. Such visualizations can stand alone or be used to supplement basic scatterplots. Within ggplot2 this statistical transformation is implemented within stat_density_2d. Several geoms are especially suitable for visualization of the transformed data and can be specified via the geom (geom = …) argument.
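To make the statistical transformation more tangible, the density can also be estimated directly. The sketch below assumes that stat_density_2d relies on MASS::kde2d, as stated in the ggplot2 documentation; the sample size, grid size, and limits are arbitrary choices for illustration.

```r
set.seed(1)

# Small correlated sample, analogous to the data used in this article
sim <- MASS::mvrnorm(n = 2000,
                     mu = c(20, 0),
                     Sigma = matrix(c(1, 0.8,
                                      0.8, 1), nrow = 2))

# Kernel density estimate evaluated on a 50 x 50 grid over the plot limits
dens <- MASS::kde2d(sim[, 1], sim[, 2],
                    n = 50,
                    lims = c(15, 25, -5, 5))

# The result holds the grid coordinates (x, y) and the density matrix (z)
dim(dens$z)

# Sanity check: the density should integrate to approximately 1
cell_area <- diff(dens$x[1:2]) * diff(dens$y[1:2])
total_mass <- sum(dens$z) * cell_area
total_mass
```

Each entry of dens$z is the estimated density at one grid point. Contours, tiles, or points are merely different ways of drawing this matrix.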

#  Generating and storing plots with stat_density_2d
plot6_1 <-
  ggplot() +
  stat_density_2d(data = df, 
    # after_stat() (ggplot2 >= 3.3.0) replaces the older ..level.. notation
    aes(x, y, color = after_stat(level))) + # Mapping density level to color
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density contours") 

plot6_2 <-
  ggplot() +
  stat_density_2d(data = df, 
    aes(x, y, fill = after_stat(level)), # Mapping density level to fill color
    geom = "polygon") +                  # Plotting filled areas
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density polygons") 

plot_6 <- grid.arrange(plot6_1, plot6_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #1")) 

[Figure: Counteracting overplotting with stat_density_2d #1 (contours and polygons)]

Alternatively, with geom = "tile":

#  Generating and storing heatmaps with stat_density_2d 
plot7_1 <-
  ggplot() +
  stat_density_2d(data = df,
    aes(x, y, fill = after_stat(density)), # Mapping density to fill color
    contour = FALSE,                # Not drawing density contours
    geom = "tile") +                # ... but tiles colored by density
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (tile)") 

plot7_2 <-
  ggplot() +
  stat_density_2d(data = df,
    aes(x, y, fill = after_stat(density)), # Mapping density to fill color
    contour = FALSE,                # Not drawing density contours
    geom = "tile",                  # ... but tiles colored by density
    h = c(0.1, 2)) +                # Kernel bandwidth in x and y direction
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (bandwidth 0.1 x 2)") 

# Arranging plot objects in overview plot
plot_7 <- grid.arrange(plot7_1, plot7_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #2"))  

[Figure: Counteracting overplotting with stat_density_2d #2 (heatmaps with tiles)]
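The h argument used for the second heatmap is the bandwidth of the kernel density estimator, one value per axis, and not the size of the tiles. Assuming again that the estimation is done by MASS::kde2d and that the default bandwidth comes from MASS::bandwidth.nrd (as the ggplot2 documentation states), the effect of the bandwidth can be sketched as follows. The two bandwidths compared are arbitrary illustrative values.

```r
set.seed(1)
sim <- MASS::mvrnorm(n = 2000,
                     mu = c(20, 0),
                     Sigma = matrix(c(1, 0.8,
                                      0.8, 1), nrow = 2))
lims <- c(15, 25, -5, 5)

# Rule-of-thumb bandwidths that are used when h is not specified
h_default <- c(MASS::bandwidth.nrd(sim[, 1]), MASS::bandwidth.nrd(sim[, 2]))
h_default

# A narrow bandwidth follows the data closely, a wide one smooths heavily
dens_narrow <- MASS::kde2d(sim[, 1], sim[, 2], h = c(0.2, 0.2),
                           n = 50, lims = lims)
dens_wide <- MASS::kde2d(sim[, 1], sim[, 2], h = c(3, 3),
                         n = 50, lims = lims)

# Heavier smoothing spreads the mass out and lowers the peak of the density
c(narrow = max(dens_narrow$z), wide = max(dens_wide$z))
```

A bandwidth that differs strongly between the two axes, such as c(0.1, 2), smooths far more along one axis than along the other, which explains the stretched look of the second heatmap.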

Or using geom = "point" and completely without color:

#  Generating and storing point heatmap with stat_density_2d 
plot8_1 <-
  ggplot() +
  stat_density_2d(data = df,
                  aes(x, y, 
                      size = after_stat(density),   # Mapping density to size
                      alpha = after_stat(density)), # ... and alpha level
                  contour = FALSE,             # Not drawing density contours
                  geom = "point",              # ... but points to map density to
                  n = 20) +                    # Number of grid points per axis
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (point, n = 20)") 

plot8_2 <-
  ggplot() +
  stat_density_2d(data = df,
                  aes(x, y, 
                      size = after_stat(density),   # Mapping density to size
                      alpha = after_stat(density)), # ... and alpha level
                  contour = FALSE,             # Not drawing density contours
                  geom = "point",              # ... but points to map density to
                  n = 30) +                    # Number of grid points per axis
  theme_minimal() +
  xlim(15, 25) + 
  ylim(-5, 5) +
  labs(subtitle = "2d density heatmap (point, n = 30)") 

# Arranging plot objects in overview plot
plot_8 <- grid.arrange(plot8_1, plot8_2,
  nrow = 1, 
  top = textGrob("Counteracting overplotting with stat_density_2d #3"))  

[Figure: Counteracting overplotting with stat_density_2d #3 (heatmaps with points)]

Whichever approach one chooses, overplotting should always be addressed to ensure that the data visualization is truly informative.

References

  • D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82(398):424–436, 1987.

About the author

Lea Waniek

I am a data scientist at STATWORX. Apart from machine learning, I love to play around with RMarkdown and ggplot2, making data science beautiful inside and out.