3 Minutes Wednesdays
Posts
3MW (Plot any function with stat_function 📈)

3MW (Plot any function with stat_function 📈)

Albert Rapp
October 18, 2023

Guten Tag!

Many greetings from Ulm, Germany. I get lots of questions on how to use the stat_function() layer in ggplot(). Typical use cases of stat_function() are drawing a line from a fitted model or comparing your data to some theoretical distribution like the Gaussian distribution.

So today, let me show you how stat_function() works. But first, let me throw in my regular announcements.

Best chart for comparisons?

People know bar plots. They’re everywhere. And when people want to compare values that should go into two separate bar plots, people are often like “hmmmm, why don’t I all of this into one chart? 🤔”

That’s how paired bar charts are born. And just like regular bar charts they’re everywhere. But they suck. They suck baaaaad. Instead, a dumbbell plot is often a better alternative. In my newest YT video, I show you how they work and how you can build them with ggplot.

Course progress

This week, I created a lesson on how to make great dumbbell plots (even better than the one from my YT video) and slope charts.

And the first 15 beta testers have been selected this week. 🥳 Once their feedback is incorporated in the the first lessons, I will likely do a pre-sale (at a lower price). If you don’t want to miss out on the lower price, don’t forget to sign up for the course mailing list.

Now, let’s jump into this week’s issue. As always, you can find all of the code on GitHub.

Plotting a single function

Imagine that you have some points that lie one a curved path like this.

As you can see, this data basically uses the function x^3 and adds some random noise to this. If we fit a polynomial model we would hopefully get something like x^3 as our model formula back and could plot this into the chart with our data.

Here, we use the stat_function() layer for this. It uses a fun argument to specify the function that needs to be plotted. And the rest is mostly aesthetics that you use for the line chart.

Compare two densities

Okay, how about comparing two density charts? Let’s take our favorite Penguins data and group the species into “Gentoo” and “Others”.

Then, we can just look throw our grouped data into a ggplot that uses geom_density() to compare the distribution of these two groups.

Add theoretical densities

Cool. We can see that Gentoo penguins seem to be heavier than the other species. Now, what if we wanted to also compare these density plots to some theoretical distribution like the Gaussian distribution?

First, we need to estimate the mean and standard deviation from the data. These are the parameters that define the Gaussian distribution.

Now, for each of those rows we’d have to call the dnorm() function to draw the densities of the Gaussian distribution. But in each case, we’d have to use different mean and sd arguments like so:

To do this in stat_function(), we iterate over the rows of params_grouped_penguins and in each iteration we use values from the mean_body_mass_g and sd_body_mass_g column inside of dnorm(). Basically, this would look like this:

But how do we get the correct values from the columns mean_body_mass_g and sd_body_mass_g into the function? The pmap() function makes all of this possible with a nifty trick.

Create a list of layers with `pmap()`

pmap() will iterate over every row and inside of every iteration, you will have to pass the current row to a new list l. Then, you can access the columns via the names in l.

This way, you get a list of stat_function() layers. And the cool thing is that you can just add any list of layers to a previous ggplot like so:

What if this fails (like for facets)?

Unfortunately, this same strategy cannot be used on a facetted plot like this.

The problem with stat_function() is that it will plot all three lines in each facet. But that’s not what we want, right? We want one line in each window.

Thus, we have to go the long route. This means that first we compute the mean and standard deviation for each species like before.

But then we also have to compute the x and y coordinates of the lines ourselves. This means that for each species, we need an x- and y-vector. We can add that to our previous params_penguins tibble with mutate() and pmap().