3MW (Dealing With Shitty Text Data)

Guten Tag!

Many greetings from Munich, Germany. I wish I could tell you that data always comes in a nice format. Unfortunately, that’s not true. Just this week, I had to get creative to get data from a dumb text source into R. So today, I’m showing you a simplified version of what I had to do.

Employee Data

Imagine that you want to do an analysis using employee data. But unfortunately, you don’t have access to a database that contains stuff like names, positions and department information.

Instead, what you might have access to is a dumb list or PDF file that looks something like this. Your file will probably look nicer, but the structure might still be the same.

Good old copy & paste

The first thing you have to do to get this data into R is really low-tech. Just take your PDF file, highlight the parts that you need and then copy that stuff into a plain text file. Here’s how that could look.
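For illustration, here’s a small hypothetical example of what such a pasted text file might contain (your labels and values will of course differ, but the examples below all build on this made-up sample):

```
Department: Sales
Sub-department: B2B
Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01
Employee ID: 2 | Name: John Smith | Position: Analyst | Hire Date: 2018-07-15
Department: IT
Sub-department: Infrastructure
Employee ID: 3 | Name: Maria Garcia | Position: Engineer | Hire Date: 2020-01-10
```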

Reading lines

Next, we have to load the plain text file. We can do that with readr’s read_lines() function to get all the lines of our file into a character vector.
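A minimal sketch of that step (the file name and contents are made up; in practice you’d point read_lines() at your own text file):

```r
library(readr)

# Made-up file contents standing in for the copied PDF text
raw_text <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)
path <- tempfile(fileext = ".txt")  # in practice: your own "employees.txt"
writeLines(raw_text, path)

# One vector element per line of the file
lines <- read_lines(path)
lines[1]
#> [1] "Department: Sales"
```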

Extract department information

Now, with the str_extract() function, you can extract the parts of each line that match a certain pattern. For that, you can also use what is known as regular expressions (regex).

Here, we don’t need much of that powerful regex language. Instead, we just use .+ as a catch-all pattern to tell str_extract() to expect a whole bunch of characters after the things we hard-code.

The same thing can be done with the sub-departments.
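With the hypothetical sample from above, those two extraction steps could look like this. (I’m anchoring the patterns with ^ so that the "Department" pattern doesn’t also match inside "Sub-department" lines; lines without a match return NA.)

```r
library(stringr)

lines <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Extract department lines; non-matching lines become NA
departments <- str_extract(lines, "^Department: .+")
# departments is "Department: Sales" for line 1 and NA for the rest

# Same idea for the sub-departments
sub_departments <- str_extract(lines, "^Sub-department: .+")
```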

Extract only groups

Note that this also returns the words “Department” and “Sub-department” every time. We can avoid that by wrapping the interesting part of the pattern in parentheses (a regex group) and telling str_extract() to return only the matched group.
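Here’s how that could look with the sample lines; the group argument of str_extract() (available in stringr 1.5.0 and later) returns only the part inside the parentheses:

```r
library(stringr)

lines <- c("Department: Sales", "Sub-department: B2B")

# group = 1 returns only the first parenthesized group of the match
str_extract(lines, "^Department: (.+)", group = 1)
#> [1] "Sales" NA
```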

Stick that into a tibble

Cool. Now we can stick that into a tibble and fill in the NA entries. First the tibble:

Then, we can fill it:
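Both steps sketched together, still using the hypothetical sample lines (fill() carries the last non-NA value downward by default):

```r
library(tibble)
library(tidyr)
library(stringr)

lines <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# First the tibble with the extracted (mostly NA) columns
employees <- tibble(
  lines          = lines,
  department     = str_extract(lines, "^Department: (.+)", group = 1),
  sub_department = str_extract(lines, "^Sub-department: (.+)", group = 1)
)

# Then fill the NA entries downward
employees <- fill(employees, department, sub_department)
```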

Filter unnecessary rows

We can now get rid of the lines that have no information on the employees. You know, things like the first and second row. For that, we simply check whether “(D|d)epartment” occurs in the lines column. This notation matches both a capital and a lowercase “D”.
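A sketch of that filtering step, starting from the filled tibble of the previous step (values hard-coded here so the snippet stands alone):

```r
library(dplyr)
library(stringr)
library(tibble)

employees <- tibble(
  lines = c(
    "Department: Sales",
    "Sub-department: B2B",
    "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
  ),
  department     = c("Sales", "Sales", "Sales"),
  sub_department = c(NA, "B2B", "B2B")
)

# Keep only rows whose lines column does NOT mention (D|d)epartment
employees <- filter(employees, !str_detect(lines, "(D|d)epartment"))
```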

Separate the lines column

Finally, we can use the separate_wider_delim() function to split the lines column using the vertical bars as separators.
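Here’s how that split could look on one of the hypothetical employee lines (note that the delimiter includes the spaces around the bar):

```r
library(tidyr)
library(tibble)

employees <- tibble(
  lines = "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Split on " | " into four new columns
employees <- separate_wider_delim(
  employees,
  lines,
  delim = " | ",
  names = c("id", "name", "position", "hire_date")
)
employees$id
#> [1] "Employee ID: 1"
```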

Get rid of extra words in column

Once again, we can get rid of extra words like “Employee ID”. This time, let’s use str_remove() for that. We tell that function to look for either “Employee ID”, “Name”, “Position”, or “Hire Date” followed by a colon and a space. If str_remove() finds such text, it removes it for us.

And since we have to do this for all columns from id to hire_date, we can iterate over each column. So, we use across() to apply str_remove() to all of these columns.
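Combined, that cleanup step could look like this (column values hard-coded from the hypothetical sample so the snippet stands alone):

```r
library(dplyr)
library(stringr)
library(tibble)

employees <- tibble(
  id        = "Employee ID: 1",
  name      = "Name: Jane Doe",
  position  = "Position: Manager",
  hire_date = "Hire Date: 2015-03-01"
)

# Remove the label + colon + space prefix from every column
employees <- mutate(
  employees,
  across(
    id:hire_date,
    \(x) str_remove(x, "(Employee ID|Name|Position|Hire Date): ")
  )
)
employees$name
#> [1] "Jane Doe"
```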

Use regex for separation

Alternatively, we could have done the last two steps all in one go. For that, we could have used separate_wider_regex(). But this requires a bit more regex:
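One possible sketch of that combined step, again on the hypothetical sample line. In separate_wider_regex(), named regex pieces become columns while unnamed pieces are matched and dropped; the character classes below ([0-9]+, [^|]+) are my assumptions about what the fields look like:

```r
library(tidyr)
library(tibble)

employees <- tibble(
  lines = "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Labels are matched (and dropped); named pieces become the new columns
employees <- separate_wider_regex(
  employees,
  lines,
  patterns = c(
    "Employee ID: ",    id        = "[0-9]+",
    " \\| Name: ",      name      = "[^|]+",
    " \\| Position: ",  position  = "[^|]+",
    " \\| Hire Date: ", hire_date = ".+"
  )
)
```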

Either way, we have successfully transformed the shitty data format into a nice, tidy data set. As you’ve seen, text cleaning techniques are super useful. If you want to learn more about that, I’m currently prepping Part 3 of my Data Cleaning Master Class where we focus on text cleaning, regex and all the good stuff.

As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this email or finding me on LinkedIn.

See you next week,
Albert 👋
