3MW (Data wrangling superpowers with {tidyselect})

Guten Tag!

Many greetings from Ulm, Germany.

Before we jump into this week’s issue, one quick announcement: This newsletter will go on a break for the next couple of weeks. We’ll be back on June 21. With that said, let’s talk about tidyselect helpers.

I’ve recently come to realize that they are crazy powerful. I mean, I’ve always know that they exist but I haven't really thought much about them. But as I was preparing my latest YT video, I've realized that they are super powerful. And combined with summarise() and across() they form a true superpower for data wrangling.

But to understand how that works, we first need to understand tidyselect helpers. So let’s dive into this week’s issue. You can find the full code on GitHub.

Find columns by literal name

So what are tidyselect helpers? As the name implies, they are helpers to select specific parts of your data and you can use them in a lot of tidyverse functions. Let’s go through a couple of simple examples.

When you want to use select() to, well, select columns from your data set, you can simply type the corresponding column names into select() and it just works.

But for data that has a lot of columns, this can be really tedious. For example take a look at the Ames dataset from {modeldata}. It has 74 columns.

Now if you wanted to select all columns related to the basement, i.e. those columns whose name start with “Bsmt”, then you’re going to spend a lot of time typing because there are Bsmt_Cond, Bsmt_Exposure, BsmtFin_Type_1, BsmtFin_SF_1, BsmtFin_Type_2, BsmtFin_SF_2, Bsmt_Unf_SF, Bsmt_Full_Bath, and Bsmt_Half_Bath (I used R to write this for me).

That's exactly where tidyselect helpers come in. Their sole job is to help you define some logic which you use to target specific columns. In this case that’s starts_with(). You can use it like so.

Combine helpers

You can combine helpers to select as many columns as you need. For example, maybe you want to get all of the variables from before but also those variables that contain the word “Bath”. Just use the contains() helper as well.

Or you could refine your previous selections by selecting all variables that start with “Bsmt” and contain “Bath”. That’s done by using the & operator.

AHH now you remember that you definitely always need to include a specific variable, say MS_Zoning. No problem. The ease of use of the tidyverse is truly magnificent. Just drop everything into select().

Literal or regular expressions?

Short side note: If you’re a fan of regular expressions, then watch out. The only tidyselect helper that accepts these is matches(). The others only work with literal strings. This has already confused me once.

The thing is: Using a regular expression in, say contains(), will not give you an error but will also not give you the expected results.

Find data based on type

Now, finding a variable by name is nice. But what if we don’t care about the names? What if we only want to find the numeric data to calculate important stuff with it? Well, you’re in luck. The where() helper will get the job done.

All it requires is a function that checks for your condition. In the numeric example, that function would be is.numeric() because it returns TRUE if a column is of type numeric. So shocking.

But why stop there? Why not use a custom function that checks more than that? For example, you can create your own function with \(x) to check whether a column is numeric and its mean exceeds some threshold.

Notice that I’ve checked is.numeric() first and combined it with the other condition using &&. You could also use & and get the same output but you’d also get a lot of warnings.

The reason for this is simple. Contrary to the & operator, the && operator stops after the first condition is not met. This means that if you check whether a column is numeric and that check comes back negative, then R won’t bother sticking that variable into mean(). That’s good because mean() is only defined for numeric variables and you’ll get a warning.

So what about that super combo that you promised?

Now that you understand tidyselect helpers, you can use it in combination with e.g. summarise() and across(). For example, you could compute the mean and standard deviation of all numeric columns whose mean exceeds a threshold.

For more information on how that works, check out my latest YT video.

Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this mail or find me on Twitter.

See you in June!
Albert 👋

If you like my content, you may also enjoy these:

Reply

or to participate.