3MW (Cleaning web-scraped tables)

Guten Tag!

Many greetings from Ulm, Germany. Did you know that you can extract HTML tables from the internet really easily using {rvest}? It’s literally just two lines of code to get the data.

But then the hard part (data cleaning) starts, and I’ll show you a couple of powerful functions that help you with that. First, though, let me share TWO fantastic resources.

Pre-sale launches 🚀 

The pre-sale of my dataviz course launched 🥳. This means that you can get the completed parts (~80% of the course) at a reduced price of $75 right now. And the rest of the course will be available to you at NO extra charge when it’s rolled out in the next 2-3 weeks. After the course is complete, the price will rise to $99.

So if you want to learn how to create great data visualizations with R, sign up for the course and benefit from the reduced price right now.

Video course on data cleaning 🧹 

Speaking of video courses, my friends over at Statistics Globe put together a big course on data cleaning 🤯. As you will see today, data cleaning is sooooo important for any real-world task.

The course comes with all the cool (and hard to implement) things like group chats and discussion sessions. This goes to show how much effort Joachim and his team put into building a great resource. So if you want to level up your data cleaning skills, check out the full announcement:

Now, let’s jump into this week’s issue. As always, you can find all of the code on GitHub.

Scrape the data

First, we need to find a website that contains an HTML table. For example, here’s one from Wikipedia about Taylor Swift albums.

And with that we can happily start to scrape the data from the tables. All we have to do is to

  • pass the URL to read_html from {rvest} and

  • pass the result to html_table().

This will give you a list of all the tables that are on the website. Here, we will just consider the second one, which is the table we looked at above.
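In case you want to code along, here’s a minimal sketch of what that could look like (the exact URL and the object names are just my placeholders):

```r
library(rvest)

# Wikipedia page with the Taylor Swift album tables (placeholder URL)
url <- "https://en.wikipedia.org/wiki/Taylor_Swift_albums_discography"

# Two lines: read the HTML, then extract every table into a list of tibbles
all_tables <- read_html(url) |>
  html_table()

# Grab the second table from that list
albums_raw <- all_tables[[2]]
```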

But if we extract the second element from our list of tables, we’re in for a surprise. This doesn’t look as nice as you might think.

First cleaning

Oh my. There are loads of bad column names and, even worse, the most important info seems to be in the first row of the table instead of the column names. That happened because the “Peak chart positions” column actually spans multiple columns. Thankfully, we can fix all of this by first using row_to_names() and then clean_names() (both from {janitor}).
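Here’s roughly how that could look (albums_raw is just my placeholder name for the scraped table):

```r
library(janitor)

albums_clean <- albums_raw |>
  # Promote the first row (the real headers) to column names
  row_to_names(row_number = 1) |>
  # Convert the messy headers to snake_case
  clean_names()
```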

Much better. But now we see that the last row actually contains information that we do not need in our data set. Also, there are three columns na, na_2 and na_3 that contain only missing values. Let’s get rid of all of those with slice_head() and select().
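One way this could look (assuming the table from the previous step is called albums_clean):

```r
library(dplyr)

albums_clean <- albums_clean |>
  # Keep everything except the last row (it only holds a footnote)
  slice_head(n = -1) |>
  # Drop the three all-NA columns
  select(-c(na, na_2, na_3))
```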

Clean up album details

Ok, that was the basic cleaning. Now begins the really hard part: cleaning the columns of the data so that everything is ready for analysis. Let’s first look at album_details.

Notice that this column actually holds information for three columns. If we use separate_wider_delim() with the \n that you see in the above output as the delimiter, then we can separate those columns. This will give us three new columns:
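A sketch of that step (the three new column names are placeholders I picked based on what comes next):

```r
library(tidyr)

albums_clean <- albums_clean |>
  separate_wider_delim(
    album_details,
    delim = "\n",
    # Placeholder names for the three new columns
    names = c("release", "label", "formats")
  )
```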

Nice. Now what’s left to do is

  • parse the release column from character to actual dates with mdy(),

  • remove the “Label:” text from the label column and

  • do the same thing with “Formats:” in the formats column (all three steps are sketched below).
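Put together, those three steps could look something like this (same placeholder column names as above):

```r
library(dplyr)
library(lubridate)
library(stringr)

albums_clean <- albums_clean |>
  mutate(
    # Turn the release text into a proper date
    release = mdy(release),
    # Strip the "Label: " and "Formats: " prefixes
    label   = str_remove(label, "Label: "),
    formats = str_remove(formats, "Formats: ")
  )
```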

Format the peak position

Super nice! We have cleaned up the first part of our data. Moving on to the peak positions:

That’s a classic use case for pivot_longer(). The column names all follow the pattern <country>_<some number>, which we can exploit to reshape the data:
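Here’s a sketch of that reshaping step (I’m assuming the chart columns are the only ones whose names end in an underscore plus a number):

```r
library(tidyr)

albums_long <- albums_clean |>
  pivot_longer(
    # Columns like us_1, aus_2, can_3, ...
    cols = matches("_\\d+$"),
    names_to = "country",
    names_pattern = "(.+)_\\d+",  # keep the country part, drop the number
    values_to = "peak_position"
  )
```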

Fantastic. We’re almost done. Do you see that the values in the peak_position column are formatted as character even though they’re numbers? Fix that with parse_number() and make sure that the dashes are interpreted as missing values.
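For example (the exact dash character used in the table is an assumption on my part):

```r
library(dplyr)
library(readr)

albums_long <- albums_long |>
  mutate(
    # Convert to numbers and treat the dash entries as missing values
    peak_position = parse_number(peak_position, na = "—")
  )
```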

Extract the sales

Alright, next we should probably extract the data from the sales_a column. Have a look:

But that’s not particularly easy, and you probably still have to digest the previous cleaning steps. So let’s take a break and resume this conversation next week (I’m really trying to stay within my 3-minute limit).

Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this email or find me on LinkedIn.

See you next week,
Albert 👋

