3MW (It's not all about a single number. Visualize uncertainty with {ggplot})

Guten Tag!

Many greetings from Ulm, Germany. Like every week, let me announce the videos that I published this week.

Ever wanted someone to show you how a cool chart can be recreated with ggplot? Well, today is your chance. In my new video, I show you how to recreate a clean dot plot from the Pew Research Center. You can find the video on YouTube. (And feel free to send me a link to a chart you want to see me recreate next.)

Alriiiiiight, moving on to this week’s newsletter: Today, I show you how to visualize uncertainty, aka distributions, with ggplot. This is important because simply reporting a single number like the mean or median will often not cut it. You need to understand how your data varies across observations too.

(All of today’s code can be found on GitHub)

Histograms

Let's start with the easiest one, namely histograms. Basically, a histogram is created in four steps:

  1. Take the full range of your data

  2. Split that range into equally sized bins.

  3. Count how many observations fall into each bin.

  4. Create a bar for each of those bins that indicates the number of observations in the bin
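To make the four steps concrete, here is a minimal base R sketch that does the binning by hand. The nine values and the choice of four bins are made up purely for illustration; they are not taken from the penguins data.

```r
# Hypothetical toy data: nine made-up measurements
x <- c(2.9, 3.2, 3.4, 3.7, 4.0, 4.3, 4.8, 5.1, 5.6)
n_bins <- 4  # assumed number of bins, chosen only for this example

# Steps 1 and 2: take the full range and split it into equally sized bins
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Step 3: count how many observations fall into each bin
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Step 4: draw one bar per bin
barplot(counts)
```

This is only meant to show the idea. In practice, ggplot does all of this counting for you.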

You can create a histogram by adding the geom_histogram() layer. Then, ggplot will calculate how many observations of your variable fall into each of 30 equally sized bins (30 is the default value).
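A call along these lines should do the trick. This is a sketch under two assumptions: that the penguins data comes from the {palmerpenguins} package and that body mass (body_mass_g) is the variable of interest. The binwidth of 250 grams in the second plot is an arbitrary value for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Default histogram: ggplot falls back to 30 bins
p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()

# Alternative: set the bin width yourself (250 is an arbitrary choice)
p_hist_wide <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250)
```

Print either object (for example by typing p_hist in the console) to draw the chart. ggplot will also remind you that it picked 30 bins for the first one.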

Instead of using bars, you can also use stacked dots to represent the number of observations in each bin. You can create such a chart with either geom_dotplot(), which comes with {ggplot2}, or geom_dotsinterval() from {ggdist}. I find the latter somewhat more appealing.
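Here is a rough sketch of both variants. As before, the {palmerpenguins} data and the body_mass_g column are assumptions on my part, and the binwidth of 100 is an arbitrary choice.

```r
library(ggplot2)
library(ggdist)
library(palmerpenguins)  # assumption: source of the penguins data

# Stacked dots with plain ggplot2 (binwidth = 100 is an arbitrary choice)
p_dotplot <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_dotplot(binwidth = 100)

# Stacked dots with ggdist
p_ggdist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_dotsinterval()
```

Keep in mind that the y-axis labels of geom_dotplot() are not meaningful, so you may want to hide them.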

Density charts

Another related visualization is the density chart, which is based on a so-called kernel density estimation. This works as follows:

  1. For each data point of the quantity that you want to visualize, create a Gaussian bell curve whose peak is centered at the corresponding data point.

  2. Average all of the bell curves.

Average curves, you say? This means that for each x value, you find the y values of each curve and then average them. A long time ago, I actually created a (rudimentary) visualization for all of this.
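If you want to see the averaging in code, here is a small base R sketch of the two steps. The five data points and the standard deviation of 0.3 for each bell curve are made-up values for illustration.

```r
# Hypothetical toy data: five made-up measurements
points <- c(3.1, 3.4, 3.9, 4.6, 5.2)
bandwidth <- 0.3  # assumed standard deviation of each bell curve

# x values at which the curves are evaluated
grid <- seq(min(points) - 4 * bandwidth,
            max(points) + 4 * bandwidth,
            length.out = 400)

# Step 1: one Gaussian bell curve per data point (one column per point)
curves <- sapply(points, function(p) dnorm(grid, mean = p, sd = bandwidth))

# Step 2: for each x value, average the y values of all curves
kde <- rowMeans(curves)

plot(grid, kde, type = "l")
```

Because every bell curve has an area of 1, the averaged curve has an area of (almost exactly) 1 as well, which is what makes it a density.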

Just like you can change the bin width of a histogram, you can tweak parameters of the density chart as well. For example, you could make the bell curve wider or narrower. Or you could replace the Gaussian bell curve with some other probability density function.

You can create a density chart by replacing geom_histogram() from above with geom_density(). Pretty easy, isn't it?

Or you could use stat_density() which does the same but displays the density chart with a filled area by default.
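The following sketch shows these options side by side. Again, the {palmerpenguins} data and the body_mass_g column are assumptions, and the values adjust = 0.5 and kernel = "epanechnikov" are just examples of what you could pass.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

base_plot <- ggplot(penguins, aes(x = body_mass_g))

# Default density chart with Gaussian bell curves
p_density <- base_plot + geom_density()

# Narrower bell curves: adjust scales the default bandwidth
p_narrow <- base_plot + geom_density(adjust = 0.5)

# A different kernel instead of the Gaussian bell curve
p_kernel <- base_plot + geom_density(kernel = "epanechnikov")

# Same estimate, drawn as a filled area by default
p_filled <- base_plot + stat_density()
```

Values of adjust below 1 make the curve wigglier, values above 1 make it smoother.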

Box plots

Next, let us talk about box plots. Let me demonstrate how they work by using 9 data points from our penguins data set and plotting them together with a boxplot. For now, imagine that these points are our complete data set.

Basically, this chart has four key indicators:

  1. the left line (called whisker) shows you in what range you can find the smallest 25% of the data

  2. the right line (whisker too) shows you in what range you can find the largest 25% of the data

  3. the box shows you in what range you can find the middle 50% of the data

  4. the thick middle line in the box shows you the median. This value is defined by the fact that half of the values lie below it and half lie above it.

Sometimes, a box plot contains extra points on the left and right. These correspond to outliers, i.e. observations whose values are “extreme” in some sense. In that case, the whiskers don’t show exactly 25% of the largest/smallest values but something close to 25%. In summary, the box plot is a pretty neat summary of key thresholds of your data.
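To see where these thresholds come from, here is a base R sketch that computes them by hand. The nine values are made up for illustration. The 1.5 times IQR rule for outliers is the default that geom_boxplot() documents, so that is what the sketch assumes for “extreme”.

```r
# Hypothetical toy data: nine made-up values, already sorted
x <- c(2.9, 3.2, 3.4, 3.7, 4.0, 4.3, 4.8, 5.1, 5.6)

# Box edges (25% and 75% quantiles) and median (50% quantile)
q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]  # width of the box

# Assumed outlier rule: more than 1.5 * IQR away from the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Whiskers reach to the most extreme points that are not outliers
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])
```

With nine sorted points, the box edges land on the 3rd and 7th point and the median is the 5th point.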

For our full data, we can create it with geom_boxplot(). Map the quantity that you want to plot to the x-aesthetic for a horizontal box plot or to the y-aesthetic for a vertical one.
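In code, the two orientations could look like this. The {palmerpenguins} data and the body_mass_g column are assumptions.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Horizontal box plot: quantity mapped to x
p_horizontal <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_boxplot()

# Vertical box plot: quantity mapped to y
p_vertical <- ggplot(penguins, aes(y = body_mass_g)) +
  geom_boxplot()
```

Both charts show the same thresholds, just rotated by 90 degrees.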

Violin plots

Beware that the distribution of the underlying data can be very different even though the box plot looks the same. Think about it. The 6th point in the chart above could move anywhere between the 5th and 7th point and the box plot would look the same. After all, in our data set of 9, the 5th and the 7th points determine how large the upper half of the box is.
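You can check this claim with a few lines of base R. The nine values are made up for illustration, and boxplot.stats() is base R's helper for computing the whisker ends, box edges and median.

```r
# Hypothetical toy data: nine made-up values, already sorted
original <- c(2.9, 3.2, 3.4, 3.7, 4.0, 4.3, 4.8, 5.1, 5.6)

# Move the 6th point close to the 7th one (it stays between 4.0 and 4.8)
moved <- original
moved[6] <- 4.75

# Lower whisker, lower box edge, median, upper box edge, upper whisker
stats_original <- boxplot.stats(original)$stats
stats_moved <- boxplot.stats(moved)$stats

identical(stats_original, stats_moved)
```

The data changed, but the five numbers that the box plot draws did not.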

That’s why some people prefer the violin plot. It shows you the density of the data that you would get from a density plot. And then for our viewing pleasure this density is mirrored so that it looks all symmetric. If you want to create this with ggplot, just replace geom_boxplot() with geom_violin().

Here, I've also mapped species to the y-aesthetic because geom_violin() needs a y-aesthetic. If you only want to have one violin plot, you could map y to a fixed string.
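A sketch of both variants might look as follows. The {palmerpenguins} data and the body_mass_g column are assumptions, and the label "All penguins" is just a placeholder string.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# One violin per species
p_violin <- ggplot(penguins, aes(x = body_mass_g, y = species)) +
  geom_violin()

# A single violin: map y to a fixed (placeholder) string
p_single <- ggplot(penguins, aes(x = body_mass_g, y = "All penguins")) +
  geom_violin()
```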

Additionally, you can combine violins and box plots by stacking the layers.
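For instance, stacking could look like this. The data set, the body_mass_g column and the box width of 0.15 are assumptions chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

p_violin_box <- ggplot(penguins, aes(x = body_mass_g, y = species)) +
  geom_violin() +
  # Narrow box plot drawn on top of each violin (0.15 is an arbitrary width)
  geom_boxplot(width = 0.15)
```

Layers are drawn in the order you add them, so the box plot has to come after the violin to sit on top.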

Alternatively, you can combine violins with the actual data points.
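One way to sketch this is with jittered points on top of the violins. The data set, the body_mass_g column, the jitter height of 0.1 and the transparency of 0.4 are assumptions chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

p_violin_points <- ggplot(penguins, aes(x = body_mass_g, y = species)) +
  geom_violin() +
  # Spread points a little vertically so they overlap less;
  # width = 0 keeps the body mass values themselves unchanged
  geom_jitter(height = 0.1, width = 0, alpha = 0.4)
```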

That’s it for today. In the next installment, I will show you how to combine these visualizations into one neat plot.

Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this mail or find me on Twitter, uhhh I mean X.

See you next week,
Albert 👋
