3 Minutes Wednesdays

Why data cleaning matters

Albert Rapp
September 01, 2024

Guten Tag! 👋

Many greetings from Munich, Germany. Ever wanted to create a fancy interactive table like this:

If so, then you probably want to get the corresponding data first. Often, that data lives in online sources like Wikipedia. Thus, you have to find a way to download that data into R (with the {rvest} package). But more importantly:

You have to find a way to clean up the web-scraped data.

You see, as my newest YouTube video demonstrates: Even if you can get the data into R with {rvest}, it will come in a reaaally messy format. And if you don’t have the right tools to clean that up, you’ll be stuck with cleaning the data before you can even begin to build the table.

Thus, let me show you a few nuggets from my video course and how they helped me to clean up the web-scraped data.

Turning characters into numbers across many columns

At some point in the video, we end up with data on the GDP of different states of Germany across multiple years. It looks something like this:

Notice how the output shows us that all the columns with the GDP data for a specific year are formatted as character values instead of numbers (you can see that by the <chr> tags).

Well, we’ve already learned that with parse_number() we can convert characters into actual numbers. But we have to apply that function on many columns here. That’s quite cumbersome if you don’t know across(). With that function (that we learn about in Part 1 of the course), you can apply parse_number() on many columns.
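Here is a minimal sketch of that step. The tibble `gdp_raw`, its state names, and its values are made up for illustration (the real scraped data looks different), and `german` is just a helper name I picked for the locale.

```r
library(dplyr)
library(readr)
library(tibble)

# Made-up stand-in for the scraped table:
# German-formatted numbers stored as text
gdp_raw <- tibble(
  state  = c("State A", "State B"),
  `2021` = c("1.234,5", "2.345,6"),
  `2022` = c("1.300,0", "2.400,1")
)

# German convention: comma as decimal mark, point as grouping mark
german <- locale(decimal_mark = ",", grouping_mark = ".")

# Apply parse_number() to every column except `state`
gdp_clean <- gdp_raw |>
  mutate(across(-state, \(x) parse_number(x, locale = german)))

gdp_clean
```

The `\(x)` lambda and the `|>` pipe need R 4.1 or newer. On older versions, `function(x)` and `%>%` should do the same job.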

With that, we have easily converted all of these columns to doubles. And we’ve even told parse_number() how German numbers are written (commas for decimals, points for groupings).

Rearranging data

But this data is not in a nice format for, say, summarise() or ggplot(). So, we can rearrange this with pivot_longer().
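A minimal sketch of that rearrangement could look like this. Again, `gdp_wide` and its values are made-up stand-ins for the cleaned data, and the new column names `year` and `gdp` are my own choice.

```r
library(tidyr)
library(tibble)

# Made-up stand-in for the cleaned, wide data (one column per year)
gdp_wide <- tibble(
  state  = c("State A", "State B"),
  `2021` = c(1234.5, 2345.6),
  `2022` = c(1300.0, 2400.1)
)

# Stack all year columns: one row per state and year
gdp_long <- gdp_wide |>
  pivot_longer(
    cols      = -state,   # everything except the state column
    names_to  = "year",   # old column names end up here (as text)
    values_to = "gdp"     # cell values end up here
  )

gdp_long
```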

That was a pretty straightforward application of pivot_longer(). In the course, we cover a few more advanced tricks involving much trickier column names. Here, the simple case was enough though.

Summarizing into list-like columns

For our {reactable} table later on, we need to have the whole series of GDPs across time in a single cell. With summarise() that’s no problem.
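Roughly, that could look like the sketch below. The tibble `gdp_long` and its values are made up for illustration, and the `.by` argument assumes dplyr 1.1.0 or newer (with older versions, a `group_by()` before `summarise()` should give the same list column).

```r
library(dplyr)
library(tibble)

# Made-up stand-in for the long data (one row per state and year)
gdp_long <- tibble(
  state = rep(c("State A", "State B"), each = 2),
  year  = rep(c(2021L, 2022L), times = 2),
  gdp   = c(1234.5, 1300.0, 2345.6, 2400.1)
)

# Wrapping the vector in list() stores the whole GDP series
# of a state in a single cell
gdp_nested <- gdp_long |>
  summarise(gdp = list(gdp), .by = state)

gdp_nested
```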

Notice how we used “list-like” columns to summarize all of the data into a single cell for every state. That’s something my Data Cleaning Master Class covers in Part 1 (which focuses on advanced tricks for transforming and summarizing data). These columns are nice in settings like this, but they are also neat in conjunction with the powerful functional programming paradigm we’ll cover in Part 5.

That’s it for today. Feel free to check out the full YT video to see these data cleaning tricks in action. And if you’re ready to level up your data cleaning skills, I’m happy to have you on board the Data Cleaning Master Class. 👇️

Data Cleaning With R Master Class

This course teaches you everything you need to know about cleaning data in R fast & efficiently.

And don’t forget: The 15%-off promo code “PART2RELEASE” is still available for 2 more days.

Happy to have you on board,
Albert
