3 Minutes Wednesdays

3MW (Text manipulation with R)

Albert Rapp
February 15, 2023

Guten Tag!

Many greetings from Ulm, Germany.

Text manipulation is an essential data cleaning skill. Usually, it precedes all kinds of data visualization. For example, I’ve recently created this table with {gt}. Today I’m showing you how to do the necessary text manipulation that preceded that {gt} work.

And remember: I show code images here but for your convenience the code can be found on GitHub.

Renaming long names

First, we need to load the underlying data from TidyTuesday. If we do that, we will notice that IBM has a suuuuper long name.

We can rename that long name in the company column with mutate() and if_else().
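The post shows this step as a code image. A minimal sketch of the idea could look like the following, assuming a toy tibble that stands in for the TidyTuesday data (the rows and the exact long name used here are illustrative and may not match the real file):

```r
library(dplyr)

# Hypothetical stand-in for the TidyTuesday data; the real values may differ.
big_tech_companies <- tibble(
  stock_symbol = c("AAPL", "IBM"),
  company = c("Apple Inc.", "International Business Machines Corporation")
)

renamed <- big_tech_companies |>
  mutate(
    company = if_else(
      company == "International Business Machines Corporation",
      "IBM",    # used where the condition is TRUE
      company   # otherwise keep the existing name
    )
  )

renamed$company
# "Apple Inc." "IBM"
```

if_else() is vectorized, so every row is checked and only the matching one gets the short name.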

Remove extra words

Now, let's get rid of words like "Platforms" or "Corporation". Similarly, we can get rid of commas. We can do all of that with str_remove().

Notice how this did not remove everything we wanted? That's because we need to use str_remove_all() if we want to get rid of all extra words (and not just the first one that is found).
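To make the difference concrete, here is a small sketch on a toy vector. The company strings and the combined pattern are assumptions for illustration, not necessarily what the code image uses:

```r
library(stringr)

# Toy examples, not the real data
companies <- c("Meta Platforms, Inc.", "Microsoft Corporation")

# "|" means "or" in regex: match any of these three pieces
extra_words <- "Platforms|Corporation|,"

# str_remove() deletes only the first match in each string
first_only <- str_remove(companies, extra_words)
first_only
# "Meta , Inc." "Microsoft "

# str_remove_all() deletes every match in each string
all_matches <- str_remove_all(companies, extra_words)
all_matches
# "Meta  Inc." "Microsoft "
```

In the first result the comma survives because "Platforms" was the first match and str_remove() stopped there.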

Remove words with special characters

Removing “Inc.” and “.com” is a bit more tricky. That’s because they contain the special character “.”.

You see, there’s a “language” called regular expressions (regex for short) that is used for describing patterns in texts. And functions like str_remove_all() use this regex language. So far, we have used very basic regex and instructed R to remove all patterns that match e.g. “Corporation”.

If we wanted to instruct R to remove all patterns that match “Corporation” followed by any single character (including white space), we’d use str_remove_all(company, 'Corporation.'). That’s the purpose of “.” in the regex language: it matches any character (by default, anything except a line break).
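A quick sketch of that behavior, using made-up strings:

```r
library(stringr)

# Toy strings: one has a character after "Corporation", one does not
texts <- c("Microsoft Corporation, Redmond", "Microsoft Corporation")

# The unescaped "." must consume exactly one character after "Corporation"
with_dot <- str_remove_all(texts, "Corporation.")
with_dot
# "Microsoft  Redmond" "Microsoft Corporation"
```

In the first string the comma is swallowed together with "Corporation". In the second string nothing follows "Corporation", so the pattern does not match and the string is left unchanged.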

So, to remove a literal “.” we need to “escape” it with a “\”, i.e. we should write “\.” in regex. But here’s the problem. In R strings (and not just in regex), “\” is used to “escape” special characters too. And R will complain if you write “\.” in a string because R does not know any escape sequence called “\.”.

Hence, you need to write “\\.” in your R code. The first “\” “escapes” the second one so that R “sees” “\.” and does not interpret this as a special character. Instead, R will now “send” “\.” to regex like we want.
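The two layers of escaping can be checked directly. This is only an illustrative sketch with toy strings:

```r
library(stringr)

pattern <- "\\."

# The R string holds two characters: a backslash and a dot
n_chars <- nchar(pattern)
n_chars
# 2

# writeLines() prints the string as regex will receive it: \.
writeLines(pattern)

# Escaped dot: only a literal "." after "Inc" matches
escaped <- str_detect(c("Inc.", "Incu"), "Inc\\.")
escaped
# TRUE FALSE

# Unescaped dot: any character after "Inc" matches
unescaped <- str_detect(c("Inc.", "Incu"), "Inc.")
unescaped
# TRUE TRUE
```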

For a deeper intro to regex, check out the excellent R4DS chapter. Anyway, armed with our new-found knowledge we can now remove “Inc.” and “.com”.
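As a sketch of what that removal could look like (toy strings and a combined pattern chosen for illustration; the code image may split this into several steps):

```r
library(stringr)

# Toy examples, not the real data
companies <- c("Amazon.com, Inc.", "Apple Inc.")

# Escaped dots match literal dots; "|" separates the alternatives
cleaned <- str_remove_all(companies, "Inc\\.|\\.com|,")
cleaned
# "Amazon " "Apple "
```

Note the trailing space that is left behind in both results.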

Remove white space

The previous output has white space around the company names. You can see that from where the quotation marks sit in the output. We can remove that leading and trailing white space with str_trim().
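A minimal sketch on toy strings:

```r
library(stringr)

# Toy strings with leading and/or trailing white space
untrimmed <- c("Amazon ", " Apple ")

# str_trim() strips white space from both ends by default
trimmed <- str_trim(untrimmed)
trimmed
# "Amazon" "Apple"
```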

Democracy visualization

Another visualization I’ve created recently is the following remake from an article of The Economist. Their interactive visualization is beautiful (you should check it out) and shows their democracy index for many countries over time.

Here, I had to do data cleaning as well. First, I had to manually copy last year’s democracy scores from a PDF into a text file. Then, I could read the data.

Read first number

With parse_number() we can extract the first number from that. Conveniently, this first number is our desired democracy index.
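The layout of the copied text file is not shown here, so the following sketch assumes hypothetical lines of the form "name score" with made-up values:

```r
library(readr)

# Hypothetical lines; the real file layout and scores are not shown in the post
raw_lines <- c("Country A 9.5", "Country B 7.25")

# parse_number() drops the text before the first number and returns that number
scores <- parse_number(raw_lines)
scores
# 9.50 7.25
```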

Fix parsing errors

It seems that there are parsing errors for two countries that both suspiciously have a - in their name. Let’s replace those with str_replace() before using parse_number().
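One way this fix could look, sketched on a made-up country name and score (replacing the hyphen with a space is an assumption here; the code image may use a different replacement):

```r
library(readr)
library(stringr)

# Hypothetical line with a hyphenated name; not taken from the real data
raw_line <- "North-Land 6.5"

# Swap the hyphen for a space so that only the score looks like a number
fixed_line <- str_replace(raw_line, "-", " ")
fixed_line
# "North Land 6.5"

score <- parse_number(fixed_line)
score
# 6.5
```

str_replace() only touches the first hyphen in each string, which is enough for names with a single hyphen.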

This concludes this week’s 3MW. In the subscribers-only part, we continue cleaning this data set.
