Working with non-tabular files
Guten Tag!
Many greetings from Munich, Germany. Last time we talked about {ellmer} and chat objects. Now, let's figure out how to work with documents. You know, so that we can do AI-assisted document processing.
Data scientists are often great at wrangling tabular data like tibbles or data frames. Unfortunately, they don't necessarily know how to deal with other file formats. That was one reason why my data cleaning masterclass had one part solely focused on how to deal with different file formats. Still, that part focused on files containing tabular data.
In our new scenario, things are a bit different. We don't want to get a tabular file format into an LLM. Instead, we want to work with arbitrary text documents like PDFs, Word documents or other file formats. There are two ingredients for this. First, there is the file management part and then there's the text extraction part. Let's talk about both of them today.
File management
There are many functions inside R that allow you to manage files. But I feel like the {fs} package gives you a more consistent interface. Basically, the functions come in three flavors for files, paths and directories. The functions start with file_*, path_* or dir_* respectively:

Examples mostly taken from the {fs} documentation
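To make the three flavors a bit more tangible, here is a minimal sketch. The folder name fs-demo and the file name report.txt are made up for illustration, and everything happens inside the session's temporary directory so nothing on your machine gets touched.

```r
library(fs)

# Hypothetical playground inside the session's temp directory
demo_dir <- path(path_temp(), "fs-demo")
dir_create(demo_dir)

# path_*: build and inspect paths (pure string work, no file access)
report <- path(demo_dir, "report.txt")
report_name <- path_file(report)  # file name without the directories
report_ext <- path_ext(report)    # extension without the dot

# file_*: create, copy and check files
file_create(report)
backup <- file_copy(report, path_ext_set(report, "bak"), overwrite = TRUE)
is_there <- file_exists(c(report, backup))

# dir_*: list and remove directories
listing <- dir_ls(demo_dir)
print(listing)
dir_delete(demo_dir)
```

Notice that the prefix tells you what the function operates on before you even read the rest of its name.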
This allows you to manage files effectively through a consistent, unified interface. And yes, I know how incredibly boring this sounds. But it becomes really convenient when you have to deal with many files. So, if you want to learn more about that, I've released a video that goes into more detail about {fs}.
Text Extraction
Once you know how to manage files, you can extract text from them. For most use cases, you'll need text from PDFs. For that, there is the {pdftools} package. If we had an invoice stored at dat/invoice.pdf we could access the text like so:
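A minimal sketch of that step could look like the following. It relies on pdf_text() from {pdftools}, which returns one string per page. The helper name extract_pdf_text() is my own invention for this example, and the file check is only there so that the snippet also runs when dat/invoice.pdf doesn't exist on your machine.

```r
library(pdftools)

# Path from the text above; adjust it to wherever your PDF lives
invoice_path <- "dat/invoice.pdf"

# Hypothetical helper: read all pages and glue them into one string
extract_pdf_text <- function(path) {
  pages <- pdf_text(path)        # character vector, one element per page
  paste(pages, collapse = "\n")  # collapse the pages into a single text
}

# Only try to read the invoice if the file is really there
if (file.exists(invoice_path)) {
  invoice_text <- extract_pdf_text(invoice_path)
  cat(invoice_text)
}
```

Keeping the pages separate instead of collapsing them is just as reasonable, for example if you want to send a long document to an LLM page by page.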

For Word documents, you could use the {officer} package. It allows you to first extract the content elements in a structured form:
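Here is a rough sketch of how that might look. Since I can't assume you have a Word file lying around, the snippet first builds a tiny demo document in a temporary location. Its contents (the invoice number, the table) are invented for illustration. The part that matters is the combination of read_docx() and docx_summary() at the end.

```r
library(officer)

# Assumption: no real Word file at hand, so we create a small demo file
docx_path <- tempfile(fileext = ".docx")
demo_doc <- read_docx()  # starts from officer's empty default template
demo_doc <- body_add_par(demo_doc, "Invoice 2024-001")
demo_doc <- body_add_par(demo_doc, "Thank you for your purchase.")
demo_doc <- body_add_table(demo_doc, data.frame(item = "Widget", price = "10 EUR"))
print(demo_doc, target = docx_path)  # writes the .docx to disk

# The actual extraction: read the file and summarise its content elements
doc <- read_docx(docx_path)
content <- docx_summary(doc)
head(content)
```

The result is a data frame with one row per content element. The content_type column tells you whether a row is a paragraph or a table cell, and the text column holds the text itself.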

And if you filter that for paragraphs, then you'd get only the texts:
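As a sketch, the filtering could be done with {dplyr}. The snippet is self-contained, so it once more creates a small made-up demo document (two paragraphs and a one-row table) in a temporary location before summarising and filtering it.

```r
library(officer)
library(dplyr)

# Assumption: a made-up demo document with two paragraphs and one table
docx_path <- tempfile(fileext = ".docx")
demo_doc <- read_docx()
demo_doc <- body_add_par(demo_doc, "Invoice 2024-001")
demo_doc <- body_add_par(demo_doc, "Thank you for your purchase.")
demo_doc <- body_add_table(demo_doc, data.frame(item = "Widget", price = "10 EUR"))
print(demo_doc, target = docx_path)

# Structured summary: one row per content element
content <- docx_summary(read_docx(docx_path))

# Keep only the paragraph rows and pull out their text
paragraphs <- content |>
  filter(content_type == "paragraph") |>
  pull(text)

paragraphs
```

The table cells are dropped by the filter, so what remains is a plain character vector of paragraph texts that you can paste together.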

Of course, once you have the raw text, you can apply one or more of the many text cleaning functions that the {stringr} package provides you. That's particularly useful if you're looking to add or subtract specific information that can be encoded programmatically before passing the content to an LLM. Feel free to check out the {stringr} docs or my Data Cleaning Master Class for more on text cleaning.
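For instance, a small cleaning step could tidy up the whitespace and mask email addresses before anything goes to an LLM. This is only a sketch: the sample text is invented and the email pattern is deliberately simple, so it won't catch every valid address.

```r
library(stringr)

# Invented sample text with messy whitespace and an email address
raw_text <- "Invoice   No. 42\n\n  Contact:  jane.doe@example.com  \nTotal:   100 EUR"

clean_text <- raw_text |>
  # Mask anything that looks like an email address (simplified pattern)
  str_replace_all("[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}", "[EMAIL]") |>
  # Collapse runs of whitespace (including line breaks) and trim the ends
  str_squish()

clean_text
# "Invoice No. 42 Contact: [EMAIL] Total: 100 EUR"
```

Note that str_squish() also removes line breaks. If the line structure matters for your prompt, something like str_trim() on each line is the gentler option.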
That was it for today. For simple text extractions, you've got the tools you need now. If you want to go even further, you could run OCR using a service like Azure Document Intelligence or AWS Textract. But that's a story for some other time.
Best,
Albert
Whenever you're ready, there are three ways I can help you:
Automate Your Data Reports: This course helps data analysts eliminate manual copy-paste reporting by automating PDF reports end-to-end, saving hours every cycle and preventing costly mistakes. (Using the lovely Typst language)
Generate Insights in Minutes, not Hours: This comprehensive course teaches you to handle data faster, smarter, and more efficiently.
Bespoke Data Science Solutions: I've helped clients build their own data science solutions. Whether building custom web apps, PDF reports, AI automations or teaching workshops, I've got you covered. You can reach out to me via this form (or simply hit reply to this email).