3MW (Use AWS Textract With R & {paws})

Albert Rapp
February 12, 2025

Guten Tag!

Many greetings from Munich, Germany. In today’s newsletter, we’re using OCR services from AWS. This lets us extract text from any kind of document or image and feed it into text processing tools like LLMs.

But before we do that, it’s time for my regular announcements:

New Lessons On Data Cleaning With Time Data

Like every week, my data cleaning master class is making a lot of progress. This week, I’ve uploaded 6 new bite-sized lessons:

Lesson 07: Time length calculations with durations & periods
Lesson 08: Calculate time between two time points
Lesson 09: Introducing intervals
Lesson 10: Using interval functions
Lesson 11: Rounding dates
Lesson 12: Filter time columns

All of these little things add up so that you can easily handle any time-related data that comes your way. If you’re interested in joining 100+ learners, you can do so via the course landing page:

Data Cleaning With R Master Class

This course teaches you everything you need to know about cleaning data in R fast & efficiently.

Using AI Functions

If you’ve enjoyed the newsletter from a few weeks ago, then I’m sure you’re delighted to hear that there is also a video version available now. This tutorial shows you how you can use AI tools with {ellmer}.

Get a Textract client

Just like last time, we need to get a client for the specific AWS service we want to use. Here, that’s Textract.

With this client’s start_document_text_detection() method, we can send an object stored in S3 to the text detection service. Here, we’ll just use the PDF we stored in S3 last time.
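A minimal sketch of these two steps, assuming {paws} is installed and AWS credentials are configured. The helper name start_ocr_job() and the bucket and file names are made up for illustration; they are not the ones from the previous issue.

```r
# Hypothetical helper: starts an asynchronous Textract text-detection job
# for a file that already lives in S3 and returns the raw response.
start_ocr_job <- function(textract, bucket, key) {
  textract$start_document_text_detection(
    DocumentLocation = list(
      S3Object = list(Bucket = bucket, Name = key)
    )
  )
}

# Usage (needs {paws} and valid AWS credentials; names are placeholders):
# textract <- paws::textract()
# response <- start_ocr_job(textract, "my-example-bucket", "invoice.pdf")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before any real AWS call is made.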

Handle the response

Notice that we saved the result of the start processing call in a variable called response. Let’s check out what’s in it.

The response contains a JobId, which we can pass to the get_document_text_detection() method to retrieve the results.
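As a rough sketch, fetching the results could look like this. It assumes that `response` is the return value of the start call and that `textract` is a {paws} Textract client; the helper name get_ocr_results() is hypothetical.

```r
# Hypothetical helper: asks Textract for the current state and results
# of the job that was started earlier.
get_ocr_results <- function(textract, response) {
  textract$get_document_text_detection(JobId = response$JobId)
}

# Usage (assumes `textract` and `response` already exist):
# results <- get_ocr_results(textract, response)
# str(results, max.level = 1)  # peek at the top-level fields only
```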

Make sense of the output

Woah! That’s a lot of stuff. Thankfully, the {paws} documentation offers some clarity.

With that we know that there is a field called JobStatus. This one can be either "IN_PROGRESS", "SUCCEEDED", "FAILED" or "PARTIAL_SUCCESS".

So, while the OCR job is still in progress, there’s nothing to do but wait and check back every few seconds.
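One way to sketch this waiting game is a small polling loop. The helper name wait_for_ocr(), the five-second pause and the cap of 60 checks are assumptions for illustration, not values from the {paws} documentation.

```r
# Hypothetical helper: polls the job until it leaves the "IN_PROGRESS"
# state, pausing between checks, and gives up after `max_tries` checks.
wait_for_ocr <- function(textract, job_id, wait_seconds = 5, max_tries = 60) {
  for (i in seq_len(max_tries)) {
    results <- textract$get_document_text_detection(JobId = job_id)
    if (results$JobStatus != "IN_PROGRESS") {
      # Could be "SUCCEEDED", "FAILED" or "PARTIAL_SUCCESS"
      return(results)
    }
    Sys.sleep(wait_seconds)
  }
  stop("OCR job did not finish after ", max_tries, " checks.")
}

# Usage (assumes `textract` and `response` already exist):
# results <- wait_for_ocr(textract, response$JobId)
```

The loop returns on any final status, so it is still worth checking results$JobStatus afterwards before using the blocks.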

Check for next tokens

Sweet. Our text processing job finished. Let’s first check the NextToken field in the results.

This one is empty, which means that all OCR’ed texts fit into a single response. That’s great because we don’t have to loop through all the NextTokens.

You see, for long documents the results can be split across several responses. In that case, you first have to collect all responses in a list. The code for this could look something like this:
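Here is a hedged sketch of such a loop. The helper name collect_all_pages() is hypothetical, and the check for an "empty" token deliberately covers NULL, a zero-length vector and an empty string, since the exact empty value is an assumption here.

```r
# Hypothetical helper: keeps requesting result pages until the
# response no longer carries a NextToken.
collect_all_pages <- function(textract, job_id) {
  pages <- list()
  next_token <- NULL
  repeat {
    page <- if (is.null(next_token)) {
      textract$get_document_text_detection(JobId = job_id)
    } else {
      textract$get_document_text_detection(JobId = job_id, NextToken = next_token)
    }
    pages[[length(pages) + 1]] <- page
    next_token <- page$NextToken
    # Stop when there is no further page to fetch
    if (is.null(next_token) || length(next_token) == 0 || !nzchar(next_token)) break
  }
  pages
}

# Usage (assumes `textract` and `response` already exist):
# all_pages <- collect_all_pages(textract, response$JobId)
# all_blocks <- unlist(lapply(all_pages, function(p) p$Blocks), recursive = FALSE)
```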

Thankfully, we don’t have to do that here. Thus, we can directly dive into the blocks of text we want to extract.

Extract Blocks

First, we need to find all the blocks that are of type “LINE”.

From that we can extract the texts:
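Both steps can be sketched in base R as follows. The helper name extract_lines() and the mock data are made up for illustration; the mock only mimics the Blocks, BlockType and Text fields of a real response.

```r
# Hypothetical helper: keeps only the blocks of type "LINE"
# and returns their text as a character vector.
extract_lines <- function(results) {
  is_line <- vapply(
    results$Blocks,
    function(block) identical(block$BlockType, "LINE"),
    logical(1)
  )
  vapply(results$Blocks[is_line], function(block) block$Text, character(1))
}

# Mock response so that the sketch runs without AWS access
mock_results <- list(Blocks = list(
  list(BlockType = "PAGE"),
  list(BlockType = "LINE", Text = "Invoice 123"),
  list(BlockType = "WORD", Text = "Invoice"),
  list(BlockType = "LINE", Text = "Total: 42.00")
))

extract_lines(mock_results)
# returns c("Invoice 123", "Total: 42.00")
```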

Send to AI

Nice! We have extracted all the texts. With that information we can let AI handle the rest. Let’s first recall how our document looks:

Now, we define a chat and an invoice position type as discussed a few weeks ago.

And then we ask our AI to extract the information:
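The two steps could be sketched roughly like this with {ellmer}. Everything specific here is an assumption: the field names of the invoice position, the prompt wording, the choice of chat_openai() as provider, and the helper names. The name of the extraction method also depends on your {ellmer} version.

```r
# Hypothetical helper: turns the extracted lines into one prompt string.
build_invoice_prompt <- function(lines) {
  paste(
    c("Extract all invoice positions from the following OCR text:", "", lines),
    collapse = "\n"
  )
}

# Assumed shape of the result: an array of invoice positions.
# The field names are placeholders, not the ones from the earlier issue.
invoice_positions_type <- function() {
  ellmer::type_array(
    items = ellmer::type_object(
      item = ellmer::type_string("What was billed"),
      quantity = ellmer::type_number("Number of units"),
      unit_price = ellmer::type_number("Price per unit")
    )
  )
}

# Usage (needs {ellmer} and an API key for the chosen provider):
# chat <- ellmer::chat_openai()
# chat$chat_structured(
#   build_invoice_prompt(lines),
#   type = invoice_positions_type()
# )
# Older {ellmer} releases name this method extract_data() instead.
```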

Hoooray! This looks like exactly the content we need. As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this mail or finding me on LinkedIn or on Bluesky.

See you next week,
Albert 👋
