3MW (Making it easier to clean time data)

Guten Tag!

Many greetings from Ulm, Germany.

I have taken a couple of weeks off for vacation and now I'm back. Hope you’re excited! But before we get started, let me run one thing by you:

I’m planning a short video course on ggplot (max. 3 hours). The idea is to

  • go through the mechanics of ggplot quickly and then

  • teach you how to implement the most common dataviz techniques to create meaningful data visualizations.

Just like my newest YT video series, I don’t want to waste any time for chit-chat or typing out every function call. So it’s going to be a fast and highly condensed ride.

Of course, all of this is still in early stages, so I want to ask you, my readers, how you feel about this. Would you be interested in such a video course? Let me know via this super short, 1-minute, please participate in it, easy poll.

And with that, let's dive into this week's topic, namely cleaning time data. This can be really tedious and infuriating. So let me give you a couple of functions that make your life easier. As always, the full code can be found on GitHub.

Our toy data

Here's a small data set we're going to work with today. It contains a small table that has different date formats in the column dates.

As you can see from the output, the dates are formatted as character and not as actual date or datetime. This is bad because we cannot compute anything with the data as it is. So let’s fix that.

parse_date() from {reader}

We can format the dates by transforming the dates column like we would transform any column, i.e. with mutate(). The parse_date() function from {reader} (part of the tidyverse) does the heavy lifting for us. We just have to specify the date format using ominous abbreviations.

Notice that we have an NA in the result. Also, our console will show a warning like this:

All of this happens because the last date didn’t fit the bill of the format we specified. That’s one of the reasons why time data sucks: too many formats for the same thing (more on how to fix this later).

In any case, we have specified the format using weird abbreviations. Thankfully, we don’t have to know what each of those mean by heart. In the manual of parse_date(), there’s a list.

Avoid weird abbreviations

If you don’t want to use these annoying abbreviations from before, there’s a convenient parse_date_time() function in the lubridate package. That’s good news because looking up the abbreviations is annoying.

Instead of using a format argument, this function uses an orders argument where we can specify time formats without the % abbreviations. Here, mdy is a simple abbreviation for “month, day, year”. Similarly, dmy means “day, month, year”. I think you get the idea.

Also notice that we didn’t even have to specify any other characters like the , or - in order for a proper translation. I think you can see that parse_date_time() is a much more convenient function. But you haven’t even seen the best part yet.

Parse multiple formats all at once

What if the last date in our data set actually means May 06 instead of June 05? In that case the last date would use uses dmy instead of mdy like the rest.

Well, you’re in luck. parse_date_time() can parse multiple formats all at once. You just have to fill orders with a vector containing multiple formats in the right order. Sweet, isn’t it?

Enforce correct order

It can happen that parse_date_time() does not detect that you prefer dmy over mdy (despite the correct order in the orders vector). In that case, you have to brute-force your preference via the select_formats argument. This is a bit tricky and requires understanding how parse_date_time() works under the hood first.

You see, parse_date_time() translates your orders like mdy into all kinds of versions of “month, day, year” using the % notation, e.g. %B %d, %Y or %b-%d-%Y. Then, it tests these on your data. Those formats that do not cause an error on your dates are stored as the names of a vector called trained.

Finally, the default function that is passed to the select_formats argument assigns a score to each successful candidate and orders them accordingly. Then, this order determines the preference that parse_date_time() chooses.

We can take advantage of that by writing our own order function. In our case, we want all formats that start with the day (%d) to come first.

And now I can enforce my preference even if I don’t sort the vector in the orders vector correctly.

Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this mail or find me on Twitter.

See next week,
Albert 👋

If you like my content, you may also enjoy these:

Reply

or to participate.