3MW (Use AWS Textract With R & {paws})
Guten Tag!
Many greetings from Munich, Germany. In today’s newsletter we’re using OCR services from AWS. This lets us extract text from any kind of document or image and feed it into text processing tools like LLMs.
But before we do that, it’s time for my regular announcements:
New Lessons On Data Cleaning With Time Data
Like every week, my data cleaning master class is making a lot of progress. This week, I’ve uploaded 6 new bite-sized lessons:
Lesson 07: Time length calculations with durations & periods
Lesson 08: Calculate time between two time points
Lesson 09: Introducing intervals
Lesson 10: Using interval functions
Lesson 11: Rounding dates
Lesson 12: Filter time columns
All of these small things add up so that you can easily handle any time-related data that comes your way. If you’re interested in joining 100+ learners, you can do so via the course landing page.
Using AI Functions
If you’ve enjoyed the newsletter from a few weeks ago, then I’m sure you’re delighted to hear that there is also a video version available now. This tutorial shows you how you can use AI tools with {ellmer}.
Get a Textract client
Just like last time, we need to get a client for the specific AWS service we want to use. Here, that’s Textract.
And with this client’s start_document_text_detection() method we can send an object that we have stored in S3 to the text detection service. Here, we will just use the PDF we stored in S3 last time.
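As a rough sketch, creating the client and starting the job could look like the code below. The helper name start_ocr_job() and the bucket and file names in the usage comment are placeholders I made up for illustration; only paws::textract() and start_document_text_detection() come from {paws}.

```r
# Hypothetical helper: starts an asynchronous Textract text detection job
# for an object that already lives in S3.
# `client` defaults to a real {paws} Textract client. The default is only
# evaluated when no client is passed in.
start_ocr_job <- function(bucket, key, client = paws::textract()) {
  client$start_document_text_detection(
    DocumentLocation = list(
      S3Object = list(Bucket = bucket, Name = key)
    )
  )
}

# Usage (placeholder bucket and key, requires AWS credentials):
# response <- start_ocr_job("my-example-bucket", "invoice.pdf")
```

Passing the client in as an argument is just a design choice that makes it easy to reuse one client across several calls.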
Handle the response
Notice that we saved the result of the start processing call in a variable called response. Let’s check out what’s in it.
The JobId in this response can be used together with the get_document_text_detection() method to fetch the results.
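A minimal sketch of that call, assuming response is the list returned by start_document_text_detection() and therefore has a JobId field. The wrapper fetch_ocr_results() is a hypothetical name used only for this example.

```r
# Hypothetical wrapper: fetches the current state of a text detection job.
# `response` is assumed to be the list returned when the job was started.
fetch_ocr_results <- function(response, client = paws::textract()) {
  client$get_document_text_detection(JobId = response$JobId)
}

# Usage (requires AWS credentials and a previously started job):
# results <- fetch_ocr_results(response)
# names(results)
```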
Make sense of the output
Woah! That’s a lot of stuff. Thankfully, the {paws} documentation offers some clarity.
With that we know that there is a field called JobStatus. This one can be either "IN_PROGRESS", "SUCCEEDED", "FAILED" or "PARTIAL_SUCCESS".
So, when the OCR job is still in progress, there’s nothing to do but to wait and check in every few seconds.
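One way to sketch this waiting game is a small polling loop. The function name wait_for_ocr_job(), the 5-second pause and the cap of 60 checks are my own assumptions, not values from AWS or {paws}. The loop returns as soon as the status is anything other than "IN_PROGRESS", so you still have to check whether the job succeeded or failed.

```r
# Hypothetical helper: polls Textract until the job leaves "IN_PROGRESS".
# wait_seconds and max_checks are arbitrary defaults chosen for this sketch.
wait_for_ocr_job <- function(job_id, client = paws::textract(),
                             wait_seconds = 5, max_checks = 60) {
  for (i in seq_len(max_checks)) {
    results <- client$get_document_text_detection(JobId = job_id)
    if (!identical(results$JobStatus, "IN_PROGRESS")) {
      # Could be "SUCCEEDED", "FAILED" or "PARTIAL_SUCCESS"
      return(results)
    }
    Sys.sleep(wait_seconds)
  }
  stop("Textract job did not finish after ", max_checks, " checks.")
}

# Usage (requires AWS credentials):
# results <- wait_for_ocr_job(response$JobId)
# results$JobStatus
```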
Check for next tokens
Sweet. Our text processing job finished. Let’s first check the NextToken field in the results.
This one is empty, which means that all of the OCR’ed text fits into a single response. That’s great because we don’t have to loop through all the NextTokens.
You see, for long documents the results can be split across several responses, so you first have to collect all of them in a list. The code for this could look something like this:
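Here is a minimal sketch of such a loop. The helper name collect_ocr_pages() is hypothetical, and I’m assuming that an exhausted NextToken shows up as a zero-length value or an empty string.

```r
# Hypothetical helper: collects every page of results for one job.
# Assumption: an "empty" NextToken is NULL, character(0) or "".
collect_ocr_pages <- function(job_id, client = paws::textract()) {
  results <- client$get_document_text_detection(JobId = job_id)
  pages <- list(results)

  while (length(results$NextToken) > 0 && nzchar(results$NextToken)) {
    # Ask for the next chunk using the token from the previous response
    results <- client$get_document_text_detection(
      JobId = job_id,
      NextToken = results$NextToken
    )
    pages[[length(pages) + 1]] <- results
  }

  pages
}

# Usage (requires AWS credentials and a finished job):
# pages <- collect_ocr_pages(response$JobId)
# all_blocks <- unlist(lapply(pages, function(p) p$Blocks), recursive = FALSE)
```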
Thankfully, we don’t have to do that here. Thus, we can directly dive into the blocks of text we want to extract.
Extract Blocks
First, we need to find all the blocks that are of type “LINE”.
From that we can extract the texts:
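Both steps can be sketched in base R like this. The helper extract_lines() is a hypothetical name, and I’m assuming results is the list returned by get_document_text_detection(), where every element of Blocks has a BlockType and, for lines, a Text field.

```r
# Hypothetical helper: keeps only the LINE blocks and pulls out their text.
extract_lines <- function(results) {
  line_blocks <- Filter(
    function(block) identical(block$BlockType, "LINE"),
    results$Blocks
  )
  vapply(line_blocks, function(block) block$Text, character(1))
}

# Small mock that mimics the shape of a Textract result
mock_results <- list(
  Blocks = list(
    list(BlockType = "PAGE", Text = character(0)),
    list(BlockType = "LINE", Text = "Invoice 001"),
    list(BlockType = "WORD", Text = "Invoice"),
    list(BlockType = "LINE", Text = "Total: 10 EUR")
  )
)

extracted_lines <- extract_lines(mock_results)
```

With the mock above, extracted_lines contains only the two LINE texts. The PAGE and WORD blocks are dropped.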
Send to AI
Nice! We have extracted all the texts. With that information we can let AI handle the rest. Let’s first recall how our document looks:
Now, we define a chat and an invoice position type as discussed a few weeks ago.
And then we ask our AI to extract the information:
Hoooray! This looks like exactly the content we need. As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this mail or finding me on LinkedIn or on Bluesky.
See you next week,
Albert 👋