3MW (Using regular expressions to clean data)

Guten Tag!

Many greetings from Ulm, Germany. Regular expressions and functions from {stringr} are a powerful combination for cleaning up your data faster. Today, I’ll show you what that can look like using the Taylor Swift data we scraped from the web last week.

New charts for the course 📊

As always, first a short announcement about the course’s progress. The pre-sale is still ongoing, and I have created the bar chart and lollipop chart for Part 2 of the course.

If you want to learn how to create great data visualizations with R, you can still sign up for the course and benefit from the reduced price.

Now, let’s jump into this week’s issue. As always, you can find all of the code on GitHub.

Previously on this newsletter

We extracted a Taylor Swift data set from Wikipedia last week. And now we’re busy cleaning up that mess. Here’s where we left off last time.

Extract country abbreviations

This data set contains information on Taylor Swift’s album sales in different countries. Unfortunately, all of that information is jumbled together in the sales_a column. Luckily for us, there is a pattern in that column. It seems like it’s always

  • some country abbreviation in all caps,

  • a colon and a white space,

  • a number (formatted with commas),

  • some weird letter in brackets, and

  • possibly a new line delimiter \n

This helps us to first figure out which countries we even have information on. All we have to do is look for something that’s written in all caps and followed by a colon.
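
In case you want to type along, here’s a minimal sketch of that step. I’m assuming the scraped table from last week lives in a tibble that I’ll call album_sales here (the real code is on GitHub):

```r
library(tidyverse)

album_sales |>
  mutate(
    # Find every all-caps sequence that is followed by a colon
    found_countries = str_extract_all(sales_a, '[A-Z]+:')
  )
```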

Here, we combined str_extract_all() with a regular expression that looks for text that contains

  • one of the capital letters A to Z ([A-Z])

  • that occur at least once or more (+) and

  • are followed by a colon.

Create country regex

This seemed to work just fine. But the results are assembled in a nested list that contains the same country abbreviations multiple times. So let’s collect all of the unique abbreviations in a vector.
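
Collecting them could look something like this (unique_countries is just a name I’m making up here):

```r
unique_countries <- album_sales |>
  pull(sales_a) |>
  str_extract_all('[A-Z]+:') |>
  unlist() |>   # flatten the nested list into one character vector
  unique()      # keep each abbreviation only once
```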

You know what? Let’s put all of this into a regular expression. That will help us extract the country sales data later on. Just pass this vector to paste0() and collapse it with the | symbol.
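
That’s a one-liner:

```r
countries_regex <- paste0(unique_countries, collapse = '|')
```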

Nice! This is a regular expression that looks for one of those country abbreviations. Basically, the | means “or” in regular expressions.

Combine regex with numbers

Next, let us use str_extract_all() again to look for combinations of

  • one of those country abbreviations,

  • a white space and

  • numbers or commas ([0-9,]+)
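
A sketch of how that could look, building on the countries_regex we just created:

```r
library(glue)

# The parentheses wrap the alternatives from countries_regex into one group
country_sales_regex <- glue('({countries_regex}) [0-9,]+')

album_sales |>
  mutate(
    sales_a = str_extract_all(sales_a, country_sales_regex)
  )
```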

Here, we used glue() to assemble the regular expression that looks for the sales numbers. Notice that we had to use parentheses in country_sales_regex to wrap countries_regex into a group. That’s a technical thing you’ll often have to do in conjunction with "|".

Format the matches

In any case, we found text matches. Let’s extract them by unnesting the sales_a column.
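
With unnest() from {tidyr}, that could look like this:

```r
album_sales |>
  mutate(sales_a = str_extract_all(sales_a, country_sales_regex)) |>
  unnest(sales_a)  # one row per extracted match
```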

Nice! We have each country’s sales in its own cell. And using separate_wider_delim() we can split the sales_a column into two columns, country and sales, using ": " as the delimiter.
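
Roughly like this (I’m repeating the full pipeline so the snippet is self-contained):

```r
split_sales <- album_sales |>
  mutate(sales_a = str_extract_all(sales_a, country_sales_regex)) |>
  unnest(sales_a) |>
  separate_wider_delim(
    sales_a,
    delim = ': ',
    names = c('country', 'sales')
  )
```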

Excellent. What’s left to do is to convert the sales column into actual numbers. (Notice the <chr> formatting).
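
parse_number() from {readr} takes care of the commas:

```r
split_sales |>
  mutate(sales = parse_number(sales))  # "1,234,567" becomes 1234567
```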

The same song again

Nice. We have cleaned up the sales column of our web-scraped table. Let’s do the same thing with the certifications column. In case you don’t remember from last week, here’s what that data looks like.

A lot of the steps are veeery similar, so I will go over this part more quickly. After all, we’ve all got other things to do than think about Taylor Swift data. (I’m only guessing here.)

First, we extract the certification names. It’s the exact same thing as with the country abbreviations.
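
A sketch, assuming the column from last week’s table is called certifications:

```r
# Same trick as before, this time on the certifications column
unique_certifications <- album_sales |>
  pull(certifications) |>
  str_extract_all('[A-Z]+:') |>
  unlist() |>
  unique()

certifications_regex <- paste0(unique_certifications, collapse = '|')
```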

Now we assemble a new regex that extracts the amount of platinum, gold etc. labels. It should contain

  • the things from certifications_regex,

  • a white space,

  • possibly something like “11×” (([0-9]+× )?) and then

  • multiple lower or capital letters ([a-zA-Z]+).
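
Assembled with glue() again (certification_amounts_regex is my made-up name):

```r
certification_amounts_regex <- glue(
  '({certifications_regex}) ([0-9]+× )?[a-zA-Z]+'
)
```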

Next, we throw that into str_extract_all() and split the resulting unnested data.
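
Putting the pieces together could look like this (the column names certifier and certifications are my choices here):

```r
certification_data <- album_sales |>
  mutate(
    certifications = str_extract_all(certifications, certification_amounts_regex)
  ) |>
  unnest(certifications) |>
  separate_wider_delim(
    certifications,
    delim = ': ',
    names = c('certifier', 'certifications')
  )
```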

Finally, let us throw in a "1× " if Taylor got only one award instead of multiple (poor Taylor). That way, we can be sure that each cell contains a number.
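
One way to do that is str_detect() plus if_else():

```r
certification_data <- certification_data |>
  mutate(
    certifications = if_else(
      str_detect(certifications, '×'),   # already has something like "11× "?
      certifications,
      paste0('1× ', certifications)      # otherwise prepend "1× "
    )
  )
```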

Now we can extract the numbers using parse_number() and remove everything that comes before a white space (".+ ") in certifications.
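
A final sketch of that step:

```r
certification_data |>
  mutate(
    amount = parse_number(certifications),              # "11× Platinum" -> 11
    certifications = str_remove(certifications, '.+ ')  # keep only "Platinum"
  )
```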

Hooray 🥳 Now we have fully cleaned our data.

Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this mail or find me on LinkedIn.

See you next week,
Albert 👋

