3MW (Extract Information From Images and PDFs With R & LLMs)
Guten Tag!
Many greetings from Munich, Germany. Imagine that you have an image of an invoice and want to extract all the positions (that is, the line items). With {ellmer} and image content, that’s no problem at all.
Like every week, I’ll show you how that works. But first, it’s time for my regular announcements.
Time and date calculations
I’m happy to share that three more videos of my Data Cleaning Master Class were released this week. In these three videos, students learn
- how to add specific time lengths onto time points,
- what the difference between durations and periods is, and
- how to perform the simple but surprisingly complicated calculation of adding months to time points.
This should help students navigate the oddities of time calculations. If you want to learn that too, you can join 100+ students here.
Sending data to local LLMs
In my newest YouTube video, I show you how to send text data from data.frames to a local LLM. That’s super convenient for all kinds of text analysis, and you can find the details on YouTube.
And now, let’s dive into this week’s newsletter.
A sample file
To do what we want to do, we first need an image of an invoice. Luckily it’s not hard to find a sample online. Here’s one I found:

Create image content from local files
I’ve screenshotted this invoice and saved it as a file called invoice.png. Now, it’s time to figure out how to tell {ellmer} that it should use this image. Luckily, that’s quite easy. All we have to do is wrap the image in a ContentImage() object.

Unfortunately, this function doesn’t have any arguments we can use to stick our image into such an object. Luckily, there’s another function called content_image_file() that does the trick. You just have to stick in the path to your image and it will spit out a ContentImage object.
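Here is a minimal sketch of that step. It assumes invoice.png sits in your current working directory; the variable name invoice_image is just a placeholder I picked for illustration.

```r
library(ellmer)

# Assumption: invoice.png is located in the current working directory.
# invoice_image is a hypothetical variable name.
invoice_image <- content_image_file("invoice.png")

# Inspect what kind of object we got back.
class(invoice_image)
```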

If you run this command for the first time, it will ask you to install the {magick} package (if you haven’t already). {ellmer} uses that package to resize the original image to a size that is suitable for the AI.
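If the resized image turns out too coarse for small print, the resizing can be adjusted. As far as I recall from the {ellmer} documentation, content_image_file() has a resize argument that defaults to "low" and also accepts "high" and "none". Treat this as a sketch and check ?content_image_file for your installed version.

```r
library(ellmer)

# Assumption: the `resize` argument accepts "low" (default), "high", or "none".
# "high" keeps more detail, which may help with small text on an invoice.
invoice_image_high <- content_image_file("invoice.png", resize = "high")
```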
Send your image to your AI
Now that we have an image, it’s time to send it to an AI. Let’s create a new chat object and then use the chat() method. And the nice thing is, we can just stick the output of content_image_file() into the chat() method. Check it out:
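A minimal sketch of that call could look like this. It assumes OpenAI as the provider with an OPENAI_API_KEY environment variable set; any other chat_*() constructor from {ellmer} should work the same way. The prompt text is my own wording, not a required phrase.

```r
library(ellmer)

# Assumption: OpenAI is the provider and OPENAI_API_KEY is set.
# Swap in another chat_*() constructor if you use a different provider.
chat <- chat_openai()

# The prompt and the image are passed side by side to chat().
chat$chat(
  "What positions are listed on this invoice?",
  content_image_file("invoice.png")
)
```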

Use a formatted output
Nice! That worked pretty smoothly. But how about a structured output? Remember those from last week? Here’s what we have to do:
First, we create an invoice_position type. Then, we tell our AI to extract an array/vector of such types.
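Here is a sketch of both steps. The field names (description, quantity, unit_price, amount) are my assumptions about what a typical invoice contains, so adapt them to your document. Recent {ellmer} versions call the extraction method chat_structured(); older versions used extract_data() for the same purpose.

```r
library(ellmer)

# Hypothetical fields for one invoice position; adjust to your invoice.
invoice_position <- type_object(
  "A single position (line item) on an invoice.",
  description = type_string("What was sold or billed."),
  quantity    = type_number("Number of units."),
  unit_price  = type_number("Price per unit."),
  amount      = type_number("Total amount for this position.")
)

# Assumption: OpenAI is the provider and OPENAI_API_KEY is set.
chat <- chat_openai()

# Ask for an array of invoice_position objects.
# On older ellmer versions, use chat$extract_data() instead.
positions <- chat$chat_structured(
  "Extract all positions from this invoice.",
  content_image_file("invoice.png"),
  type = type_array(items = invoice_position)
)
```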

Convert to tibble
And just like last time, we can turn the last result into a tibble.
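One way to do the conversion is sketched below. Depending on your {ellmer} version, the structured result may already arrive as a data frame or as a list of records, so the hypothetical helper positions_to_tibble() handles both. The demo rows are made-up placeholders for illustration only, not values from the invoice.

```r
library(tibble)

# Hypothetical helper: accepts a data frame or a list of named records.
positions_to_tibble <- function(x) {
  if (is.data.frame(x)) {
    as_tibble(x)
  } else {
    # Each list element becomes one row.
    as_tibble(dplyr::bind_rows(x))
  }
}

# Placeholder input for illustration only (not taken from the invoice).
demo <- positions_to_tibble(list(
  list(description = "Item A", quantity = 1),
  list(description = "Item B", quantity = 2)
))
demo
```

With a real extraction result, you would call positions_to_tibble() on whatever the structured call returned.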

What about PDFs?
Now you might be wondering: What about PDFs? It’s probably more likely that you have invoices as PDFs. Well, in that case you can use the {pdftools} package to turn PDFs into images.
To demonstrate that, I’ve created a PDF invoice from this template:

I’ve saved this in a file called invoice.pdf and now we can turn this into an image with pdf_convert().
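A minimal sketch of the conversion, assuming a single-page invoice. The output file name invoice_pdf.png and the dpi value of 150 are my own choices for illustration; pdf_convert() picks its own file names and a lower resolution if you leave them out.

```r
library(pdftools)

# Convert the first page of the PDF into a PNG image.
# Assumptions: single-page invoice; file name and dpi are illustrative choices.
image_files <- pdf_convert(
  "invoice.pdf",
  format    = "png",
  pages     = 1,
  dpi       = 150,
  filenames = "invoice_pdf.png"
)

# pdf_convert() returns the path(s) of the image file(s) it wrote.
image_files
```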

Now, we just have to use the same code as before but use the new image file:
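Since the code is the same for both files, one option is to wrap it in a small function. This is a sketch under the same assumptions as before: OpenAI as provider, hypothetical field names, and invoice_pdf.png as a placeholder for whatever file name your PDF conversion produced. extract_positions() is a helper I made up, not part of {ellmer}.

```r
library(ellmer)

# Hypothetical fields for one invoice position; adjust to your invoice.
invoice_position <- type_object(
  "A single position (line item) on an invoice.",
  description = type_string("What was sold or billed."),
  amount      = type_number("Total amount for this position.")
)

# Hypothetical helper that runs the extraction for any image path.
extract_positions <- function(image_path, position_type = invoice_position) {
  # Assumption: OpenAI is the provider and OPENAI_API_KEY is set.
  chat <- chat_openai()
  # On older ellmer versions, use chat$extract_data() instead.
  chat$chat_structured(
    "Extract all positions from this invoice.",
    content_image_file(image_path),
    type = type_array(items = position_type)
  )
}

# Placeholder path: use the file name your PDF conversion actually wrote.
pdf_positions <- extract_positions("invoice_pdf.png")
```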

Hooray. Solid work everyone! Time to take a celebratory nap. As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this mail or finding me on LinkedIn or on Bluesky.
See you next week,
Albert 👋