3MW (Dealing With Shitty Text Data)

Guten Tag!

Many greetings from Munich, Germany. I wish I could tell you that data always comes in a nice format. Unfortunately, that’s not true. Just this week, I had to get creative to get data from a dumb text source into R. So today, I’m showing you a simplified version of what I had to do.

Employee Data

Imagine that you want to do an analysis using employee data. But unfortunately, you don’t have access to a database that contains stuff like names, positions and department information.

Instead, what you might have access to is a dumb list or PDF file that looks something like this. Your file will probably look nicer, but the structure might still be the same.

Good old copy & paste

The first thing you have to do to get this data into R is really low-tech. Just take your PDF file, highlight the parts that you need and then copy that stuff into a plain text file. Here’s how that could look.
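For illustration, here’s a small hypothetical example of what such a pasted text file might contain (your labels and values will of course differ, but the examples below all build on this made-up sample):

```
Department: Sales
Sub-department: B2B
Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01
Employee ID: 2 | Name: John Smith | Position: Analyst | Hire Date: 2018-07-15
Department: IT
Sub-department: Infrastructure
Employee ID: 3 | Name: Maria Garcia | Position: Engineer | Hire Date: 2020-01-10
```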

Reading lines

Next, we have to load the plain text file. We can do that with readr’s read_lines() function to get all the lines of our file into a character vector.
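A minimal sketch of that step (the file name and contents are made up; in practice you’d point read_lines() at your own text file):

```r
library(readr)

# Made-up file contents standing in for the copied PDF text
raw_text <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)
path <- tempfile(fileext = ".txt")  # in practice: your own "employees.txt"
writeLines(raw_text, path)

# One vector element per line of the file
lines <- read_lines(path)
lines[1]
#> [1] "Department: Sales"
```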

Extract department information

Now, with the str_extract() function, you can extract the parts of each line that match a certain pattern. For that, you can also use what is known as regular expressions (regex).

Here, we don’t need much of that powerful regex language. Instead, we just use .+ as a catch-all pattern to tell str_extract() to expect a whole bunch of characters after the things we hard-code.

The same thing can be done with the sub-departments.
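With the hypothetical sample from above, those two extraction steps could look like this. (I’m anchoring the patterns with ^ so that the "Department" pattern doesn’t also match inside "Sub-department" lines; lines without a match return NA.)

```r
library(stringr)

lines <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Extract department lines; non-matching lines become NA
departments <- str_extract(lines, "^Department: .+")
# departments is "Department: Sales" for line 1 and NA for the rest

# Same idea for the sub-departments
sub_departments <- str_extract(lines, "^Sub-department: .+")
```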

Extract only groups

Note that this also returns the words “Department” and “Sub-department” every time. We can avoid that by wrapping the interesting part of the pattern in parentheses (a regex group) and telling str_extract() to return only the matched group.
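Here’s how that could look with the sample lines; the group argument of str_extract() (available in stringr 1.5.0 and later) returns only the part inside the parentheses:

```r
library(stringr)

lines <- c("Department: Sales", "Sub-department: B2B")

# group = 1 returns only the first parenthesized group of the match
str_extract(lines, "^Department: (.+)", group = 1)
#> [1] "Sales" NA
```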

Stick that into a tibble

Cool. Now we can stick that into a tibble and fill in the NA entries. First the tibble:

Then, we can fill it:
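Both steps sketched together, still using the hypothetical sample lines (fill() carries the last non-NA value downward by default):

```r
library(tibble)
library(tidyr)
library(stringr)

lines <- c(
  "Department: Sales",
  "Sub-department: B2B",
  "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# First the tibble with the extracted (mostly NA) columns
employees <- tibble(
  lines          = lines,
  department     = str_extract(lines, "^Department: (.+)", group = 1),
  sub_department = str_extract(lines, "^Sub-department: (.+)", group = 1)
)

# Then fill the NA entries downward
employees <- fill(employees, department, sub_department)
```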

Filter unnecessary rows

We can now get rid of the lines that have no information on the employees. You know, things like the first and second row. For that, we simply check whether “(D|d)epartment” occurs in the lines column. This notation matches both a capital and a lowercase “D”.
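A sketch of that filtering step, starting from the filled tibble of the previous step (values hard-coded here so the snippet stands alone):

```r
library(dplyr)
library(stringr)
library(tibble)

employees <- tibble(
  lines = c(
    "Department: Sales",
    "Sub-department: B2B",
    "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
  ),
  department     = c("Sales", "Sales", "Sales"),
  sub_department = c(NA, "B2B", "B2B")
)

# Keep only rows whose lines column does NOT mention (D|d)epartment
employees <- filter(employees, !str_detect(lines, "(D|d)epartment"))
```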

Separate the lines column

Finally, we can use the separate_wider_delim() function to split the lines column using the vertical bars as separators.
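Here’s how that split could look on one of the hypothetical employee lines (note that the delimiter includes the spaces around the bar):

```r
library(tidyr)
library(tibble)

employees <- tibble(
  lines = "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Split on " | " into four new columns
employees <- separate_wider_delim(
  employees,
  lines,
  delim = " | ",
  names = c("id", "name", "position", "hire_date")
)
employees$id
#> [1] "Employee ID: 1"
```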

Get rid of extra words in column

Once again, we can get rid of extra words like “Employee ID”. This time, let’s use str_remove() for that. We tell that function to look for either “Employee ID”, “Name”, “Position”, or “Hire Date” followed by a colon and a space. If str_remove() finds such text, it removes it for us.

And since we have to do this for all columns from id to hire_date, we can iterate over each column. So, we use across() to apply str_remove() to all of these columns.
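Combined, that cleanup step could look like this (column values hard-coded from the hypothetical sample so the snippet stands alone):

```r
library(dplyr)
library(stringr)
library(tibble)

employees <- tibble(
  id        = "Employee ID: 1",
  name      = "Name: Jane Doe",
  position  = "Position: Manager",
  hire_date = "Hire Date: 2015-03-01"
)

# Remove the label + colon + space prefix from every column
employees <- mutate(
  employees,
  across(
    id:hire_date,
    \(x) str_remove(x, "(Employee ID|Name|Position|Hire Date): ")
  )
)
employees$name
#> [1] "Jane Doe"
```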

Use regex for separation

Alternatively, we could have done the last two steps all in one go. For that, we could have used separate_wider_regex(). But this requires a bit more regex:
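One possible sketch of that combined step, again on the hypothetical sample line. In separate_wider_regex(), named regex pieces become columns while unnamed pieces are matched and dropped; the character classes below ([0-9]+, [^|]+) are my assumptions about what the fields look like:

```r
library(tidyr)
library(tibble)

employees <- tibble(
  lines = "Employee ID: 1 | Name: Jane Doe | Position: Manager | Hire Date: 2015-03-01"
)

# Labels are matched (and dropped); named pieces become the new columns
employees <- separate_wider_regex(
  employees,
  lines,
  patterns = c(
    "Employee ID: ",    id        = "[0-9]+",
    " \\| Name: ",      name      = "[^|]+",
    " \\| Position: ",  position  = "[^|]+",
    " \\| Hire Date: ", hire_date = ".+"
  )
)
```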

Either way, we have successfully transformed the shitty data format into a nice, tidy data set. As you’ve seen, text cleaning techniques are super useful. If you want to learn more about that, I’m currently prepping Part 3 of my Data Cleaning Master Class where we focus on text cleaning, regex and all the good stuff.

As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this email or finding me on LinkedIn.

See you next week,
Albert 👋
