3MW (Turn Text Into Structured Data With R & LLMs)


This newsletter is brought to you for free. If you want to advance your R skills and support my work at the same time, then you will like my paid offerings.

Guten Tag!

Many greetings from Munich, Germany. In today’s newsletter, we continue to talk about AI and {ellmer}. You see, {ellmer} still has a pretty cool trick up its sleeve.

And that trick has to do with structured output. Namely, {ellmer} (or rather the LLMs it talks with) allows you to extract data from unstructured texts in a structured format.

That’s huge! One of the most fundamental problems with regular prompting is that you can never be sure that the output arrives in the format you want. That’s the exact problem that “structured output” solves.

And the best part: {ellmer} has built-in functionalities that allow you to use this structured mode really easily. Today, I’ll show you how. But first, it's time for my regular announcements.

Working with times and dates

I'm happy to share that I've finished recording Part 4 of my Data Cleaning Master Class and I've started to edit the videos. In this part we deal with one of the most annoying things in data science, namely times and dates!

This week, I've released the first three lessons, where we deal with date and datetime formats and how to format them. If you want to get better at tedious time calculations, you can join 100+ students via my Data Cleaning Master Class.

Automate e-mail workflows

E-mails may seem like the most boring thing out there. But they can be surprisingly effective. In my day job, I’ve helped automate tedious e-mail workflows using R.

You know, the kind of task that wastes clients’ time when they regularly have to send updated numbers to some other department. With an R script, this can be nicely automated, and in my newest YouTube video I show you how that’s done.

And now let's get back to the newsletter:

A ground truth

To test the structured output, we first need some data that we can extract things from. So let’s do something fun. Let’s come up with a little hero story (yes, these details are absolutely GPT-generated):

  • Description: Dax Ironfist is a stocky, gruff dwarf with soot-covered hands. He despises adventure but is an unmatched engineer.

  • Journey: When his latest creation, an experimental automaton, malfunctions and escapes, Dax is forced to leave his comfort zone and track it down. His pursuit takes him on a journey where he meets a young adventurer who needs his expertise. Through their trials, Dax realizes that his skills can help more than just himself.

  • Lesson Learned: Innovation isn’t just about personal success—it’s about using one’s talents to improve the lives of others.

What a beautiful story! Now we can let GPT generate a multi-page story that involves these details. This little stunt probably cost me 10 cents or so in GPT tokens. But it’s worth it. Now we have something that we can ask GPT about.

Anyway, I have saved this story in a file called story.txt. Here are the first few lines. If you want to read the full story, you can do so on GitHub.
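Here’s a minimal sketch for loading it (assuming story.txt sits in your working directory):

```r
# Read the GPT-generated story from disk
story <- paste(readLines("story.txt"), collapse = "\n")

# Peek at the start of the story
cat(substr(story, 1, 300))
```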

Extract data

To ask specifics about our story, we can create a chat with a corresponding system prompt:
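A sketch of what that can look like, with a hypothetical system prompt (and assuming your OPENAI_API_KEY environment variable is set):

```r
library(ellmer)

# The system prompt here is my own stand-in for the original one
chat <- chat_openai(
  system_prompt = "You are a careful reader. Answer questions about the story the user gives you."
)
```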

And now, we can feed in the story as a user prompt.
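Roughly like this:

```r
# Send the full story text as a regular user prompt
reply <- chat$chat(story)
```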

Just so that you can read this better, here’s also the reply in a narrower format. Don’t worry about the code here.
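One way to do this, using base R’s strwrap():

```r
# Re-wrap the reply at 60 characters so it's easier to read
cat(strwrap(reply, width = 60), sep = "\n")
```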

Use a structure

Sweet. This worked pretty well. But the thing with LLMs is that we can never be sure that they will actually adhere to this specific format.

For example, it’s easy to imagine that the LLM might also add introductory sentences like “Sure, here’s your data:”. That might not be great if we rely on the structure of the reply.

Enter structured outputs: With the extract_data() method, we can enforce a specific output format. For that, let’s create a new chat with a shorter system prompt.
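Something along these lines (again, the prompt is my own stand-in):

```r
# A shorter, hypothetical system prompt
chat <- chat_openai(
  system_prompt = "You extract structured data from stories."
)
```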

Then, we can use the extract_data() method. Inside this method, we can specify a so-called type. That’s the desired output format, and it is specified using the type_*() functions.
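For our hero, such a type could look like this (the field names are my own choice):

```r
# Describe the desired output: an object with three string fields
type_hero <- type_object(
  name       = type_string("The hero's name"),
  species    = type_string("The hero's species"),
  profession = type_string("The hero's profession")
)
```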

And with this, we can use the extract_data() method in combination with the story text.
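Putting it together:

```r
# Extract the data from the story according to our type
hero <- chat$extract_data(story, type = type_hero)
hero
```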

Excellent! Now, we have an R list that we can work with.

Access the data

And in case you’re wondering how to get to the data from the chat object itself, here’s how to do it. First, you have to get the last turn.
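Like this:

```r
# Grab the most recent (assistant) turn from the chat
last_turn <- chat$last_turn()
```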

And from there, you can follow the hierarchy all the way down to the wrapper deep inside the contents field. Just make sure that you use @, $, and brackets appropriately.
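A sketch of that path (note that {ellmer} wraps non-object types in a "wrapper" field, so whether you need the final $wrapper step depends on your type):

```r
# The structured reply sits in the first content element of the turn
last_turn@contents[[1]]@value$wrapper
```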

And of course, we could turn this thing into a tibble:
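Since the result is a named list of length-1 elements, tibble::as_tibble() does the job:

```r
library(tibble)

# One row, one column per extracted field
as_tibble(hero)
```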

Named entity recognition

Alright, let’s do one more example. Let’s try to extract all named entities from this story. This is a common thing people want to do in text analysis.

Here, we can do this by creating another type. For now, let’s just go with name and description of the entity.
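For instance:

```r
# Describe a single entity (field names are my own choice)
type_entity <- type_object(
  name        = type_string("The entity's name"),
  description = type_string("A short description of the entity")
)
```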

But of course we don’t want to find just a single entity. We want our LLM to extract all entities. Thus, let’s make sure that we ask our LLM to extract an array (or “vector” as we’d say in the R world).
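Using type_array():

```r
# Wrap the entity type in an array so the LLM returns all entities
type_entities <- type_array(items = type_entity)

entities <- chat$extract_data(story, type = type_entities)
```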

And as always, I prefer a tibble. Here, we have multiple rows (with two columns each), so we need bind_rows() to turn this into a tibble.
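Assuming extract_data() hands us a list of named lists here, bind_rows() stacks them:

```r
library(dplyr)

# Each list element holds one entity's name and description
bind_rows(entities)
```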

Adding classification

Finally, let’s complete this example by also showing you how these types can be used for automations. The trick is to modify our type_entity with yet another entry using type_enum().
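A sketch with hypothetical categories (person, place, object):

```r
# Same entity type as before, plus a categorical field
type_entity <- type_object(
  name        = type_string("The entity's name"),
  description = type_string("A short description of the entity"),
  category    = type_enum(
    values      = c("person", "place", "object"),
    description = "The category of the entity"
  )
)
```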

This particular type corresponds to a categorical variable. It’s what you’d call a factor in the R world.

And with that, we can run our entity recognition again.
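Like before, just with the extended type:

```r
entities <- chat$extract_data(story, type = type_array(items = type_entity))

# Now with a third column holding the category of each entity
bind_rows(entities)
```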

Nice! So with that, you should have a good idea of how structured outputs work. If you’re looking for more, the {ellmer} docs have a few more fantastic examples.

As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this mail or finding me on LinkedIn or on Bluesky.

See you next week,
Albert 👋

