3MW (6 Data Sets to Practice Data Science)

Guten Tag!

Many greetings from Ulm, Germany. Like every week, let me announce the videos that I published this week.

Last Saturday, I showed viewers how to change the colors in their ggplots. That’s important because waaay too many people just use the default colors, and this becomes boring over time. So instead, check out three easy ways to change your colors in ggplot.

Now, moving on to this week’s newsletter: Did you ever want to practice a new data trick but didn’t have a good data set to try it on? Whether it is a new data viz or a new machine learning algorithm that you want to try out, this list of practice data sets may help you do just that. In the past, these have helped me out a lot.

Palmerpenguins

I had to start with this one. It is my absolute favorite data set. Penguins are fun and so is the data set. It is available in many programming languages, and in R you can access it easily via the {palmerpenguins} package. Here’s an overview of what data it can offer:
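If you want to peek at the data yourself, here is a minimal sketch. It assumes that {palmerpenguins} and {dplyr} are installed; {dplyr} is my assumption here and is only used for glimpse() and count().

```r
library(palmerpenguins)
library(dplyr)

# `penguins` is the cleaned data set exported by {palmerpenguins}
glimpse(penguins)

# Count how many penguins were recorded per species and island
penguins |>
  count(species, island)
```

The native pipe `|>` needs R 4.1 or newer. On older versions, `%>%` from {dplyr} should work the same way here.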

Ames Housing

This one is also a classic. If you’ve ever been on Kaggle, then you know it. This data set is huge: lots of variables, both numeric and categorical, and lots of rows. In R, you can easily access it via the {modeldata} package.
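As a rough sketch, this is one way to load it and get a feeling for its size. It assumes {modeldata} is installed; the column-type summary is just my suggestion for a first look.

```r
# Load the `ames` data set from {modeldata} into the current session
data(ames, package = "modeldata")

# Number of rows and columns
dim(ames)

# How many columns of each type are there? (first class of each column)
col_types <- vapply(ames, function(col) class(col)[1], character(1))
table(col_types)
```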

Data sets from {gt}

The {gt} package is not only a great package for creating tables, it also comes with a good amount of neat data sets. Here are some of {gt}’s data sets that I’ve enjoyed:

towny

A dataset containing census population data from six census years (1996 to 2021) for all 414 of Ontario's local municipalities.

exibble

This is a special data set. It does not contain many rows but it contains a lot of different data types. Great data set to practice formatting numbers or texts.

sp500

Need historic financial data? This dataset provides daily price indicators for the S&P 500 index from the beginning of 1950 to the end of 2015.
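All three data sets are available as soon as {gt} is attached. Here is a small sketch of how you might look at them and practice formatting with exibble. It assumes a reasonably recent {gt} version (towny is not part of older releases), and the choice of formatting functions is only an example.

```r
library(gt)

# Quick look at the three data sets
dim(towny)
str(exibble)
range(sp500$date)

# Practice formatting: numbers with two decimals, currency values in USD
tbl <- exibble |>
  gt() |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD")

# Print `tbl` in an interactive session to render the table
```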

Pizza Paradise

Ever wanted to lead a pizza empire? Well, here’s your chance. ObservableHQ created a fake data set about pizza sales. The data set contains real trends so it may be great for some time series analysis. But it’s also great for visualization. Here is one from ObservableHQ.

Also, this data set comes in the form of parquet files. So if you ever wanted to work with this type of data, here’s your chance. You can just download the parquet files and load them into R with the {arrow} package.
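Reading a parquet file with {arrow} could look roughly like this. To keep the sketch runnable on its own, it first writes a tiny stand-in file. The file name and the toy columns are made up and are not the real pizza data. In practice, you would point `parquet_path` at one of the files you downloaded.

```r
library(arrow)

# Stand-in file so the sketch runs end to end (hypothetical name and toy values).
# Replace `parquet_path` with the path to a parquet file you downloaded.
parquet_path <- file.path(tempdir(), "orders.parquet")
write_parquet(
  data.frame(order_id = 1:3, quantity = c(2L, 1L, 4L)),
  parquet_path
)

# read_parquet() loads the whole file into memory as a data frame
orders <- read_parquet(parquet_path)
orders
```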

BONUS: TidyTuesday

Here’s a bonus resource for you. It’s not just one data set, it’s a whole collection of data sets. And each week, there’s a new one. What I’m talking about is the TidyTuesday challenge.

It’s a great weekly event to practice your data skills on a variety of data sets. And the best part is that you can always see what others are creating by checking out the #tidyTuesday hashtag on Twitter. Especially watch out for the visuals from Georgios Karamanis or Dan Oehm.

That’s it for today. Hope you’ve enjoyed this week’s newsletter. If you want to reach out to me, just reply to this mail or find me on Twitter, uhhh I mean X.

See you next week,
Albert 👋
