3MW (Save Data at AWS S3 With {paws})
Guten Tag!
Many greetings from Munich, Germany. Modern data science is increasingly becoming a task of chaining together a bunch of LLMs and other cloud services. That’s why I will start to teach you the foundations of AWS so that you learn how to set up S3 buckets with R & the {paws} package.
Registration
The first thing you need to do is register with AWS. You can easily do so via their website. There, you’ll have to sign up using your billing information. As long as you stay within the free tier, you shouldn’t be charged.
Creating an S3 Client
First, we need a place to store files. That’s where S3 buckets come in (S3 = Simple Storage Service).
Now, because we want to use the AWS user interface as little as possible, we are actually going to use R to set things up. To do so, we need to install the {paws} package.
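In case you want to copy and paste, here’s a minimal sketch of the install step. It assumes you install from CRAN (the mirror URL is my pick, any mirror should work) and it skips the install if {paws} is already there.

```r
# Install {paws} from CRAN only if it is not available yet.
# Assumption: the CRAN mirror below is reachable from your machine.
if (!requireNamespace("paws", quietly = TRUE)) {
  install.packages("paws", repos = "https://cloud.r-project.org")
}
```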
This is a huge package. So it will take a while to install. While it installs, you can look at its cute hex sticker.
Once the install process is finished, we need to define an S3 client. Sounds ominous but here’s how you can think of it:
All AWS services are their own little restaurant. And in order to dine at this restaurant, you need to have a client that you can take out to said restaurant. So if you want to dine at the S3 restaurant, you want to have an S3 client.
And with the {paws} package you can easily create the corresponding client. All you need to do is to call the s3() function. Typically, you’ll save this client in a variable using the service name.
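Here’s a small sketch of what that could look like. The variable name `s3` simply follows the convention of naming the client after the service.

```r
library(paws)

# Create the S3 client and store it in a variable named after the service
s3 <- paws::s3()

# Peek at a few of the operations the client offers
head(names(s3))
```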
Show all buckets
Now the s3 variable is actually an object that can call a bunch of different methods. One such method is the list_buckets() method.
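As a sketch, the call looks roughly like this. The `tryCatch()` wrapper is my addition (not necessarily part of the original code) so that a failing request prints its message instead of halting your script.

```r
s3 <- paws::s3()

# Ask S3 for all buckets in the account.
# If the request fails, print the error message and return NULL instead.
result <- tryCatch(
  s3$list_buckets(),
  error = function(e) {
    message("Request failed: ", conditionMessage(e))
    NULL
  }
)
```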
Oh no! This doesn’t look right. It seems like we don’t have the necessary credentials.
Makes sense if you think about it. How the heck should R know that it needs to interact with our AWS account? We’ve never specified it. And that’s where the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY come in.
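If you’re curious whether your current R session already knows these variables, a quick check along these lines should do. It only reports whether each variable is set, so your secrets never get printed.

```r
# Check which of the credential-related environment variables are set.
# Sys.getenv() returns "" for unset variables, nzchar() turns that into FALSE.
vars <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
is_set <- nzchar(Sys.getenv(vars))
names(is_set) <- vars

print(is_set)
```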
We have to first create them using the AWS UI in the browser. And once we have them, we can set them in our .Rprofile using the Sys.setenv() command. This will tell R about our AWS account. So let’s head over to AWS.
Creating an AWS user
You want to head to the IAM service. Here, IAM stands for “Identity and Access Management”. There, you can create a new user.
First, you’ll need to come up with a name.
Then, you’ll want to attach specific policies to that user. These are the things that the new user will have access to.
In the policies you can search for “S3” and then attach the “AmazonS3FullAccess” policy. This will allow the user to use S3 without any restrictions.
Of course, AWS allows you to create more granular policies so that you don’t have to give people full access. But I think for this demo, full access is perfectly fine. While we’re at it, let’s also give the user access to the Textract service, which we will work with next week.
Once all of that is done, you can head over to the next step and create the user.
Getting user credentials
Congrats! You have now created an AWS user that can access S3 and Textract. Now, we need to get credentials for that user. These are the things we can stick into our .Rprofile.
Here’s what you need to do. First, open up the new user:
Then, you can head over to the credentials tab and click the “create access key” button.
There, you’ll need to select the option to let an application run outside of AWS. This makes sense since we want to let our R session on our computer use the credentials. Therefore, access from outside the AWS network must be possible with the credentials.
And once you click through the remaining steps (where you don’t need to do anything), you’ll have credentials:
You can now stick them into your .Rprofile file to set the environment variables. And while you’re at it, you should also define your default region. That should be something that’s close to you. For me that’s eu-central-1.
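Here’s a sketch of what the .Rprofile entry could look like. The key values are placeholders that you need to swap for your own, and I’m assuming `AWS_REGION` as the variable name for the region, which is one of the names {paws} looks for.

```r
# Contents of ~/.Rprofile (this file runs at the start of every R session).
# The two key values are placeholders, not real credentials.
Sys.setenv(
  AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_REGION = "eu-central-1"  # assumption: pick the region closest to you
)
```

Since this file now holds secrets, make sure it doesn’t end up in version control.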
Retry the list buckets command
And now after restarting your R session and creating a new S3 client, you should be able to do what we tried earlier.
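A minimal sketch of the retry could look like this. The helper `bucket_names()` is a hypothetical convenience function of mine that pulls just the names out of the response, and the call to AWS only runs if credentials are set.

```r
# Hypothetical helper: extract the bucket names from a list_buckets() response
bucket_names <- function(response) {
  vapply(response$Buckets, function(b) b$Name, character(1))
}

# Only talk to AWS when credentials are present in the environment
if (nzchar(Sys.getenv("AWS_ACCESS_KEY_ID"))) {
  s3 <- paws::s3()
  print(bucket_names(s3$list_buckets()))
}
```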
If you don’t have a bucket yet, you can create one with the create_bucket() method.
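Here’s a sketch of how that call could look. The bucket name is made up (bucket names have to be globally unique, so you’ll need your own), and I’m assuming the region comes from the `AWS_REGION` environment variable with eu-central-1 as a fallback.

```r
bucket_name <- "my-3mw-demo-bucket"  # hypothetical name, pick your own
region <- Sys.getenv("AWS_REGION", unset = "eu-central-1")

# Only talk to AWS when credentials are present in the environment
if (nzchar(Sys.getenv("AWS_ACCESS_KEY_ID"))) {
  s3 <- paws::s3()
  s3$create_bucket(
    Bucket = bucket_name,
    # Outside of us-east-1, S3 expects the region as a location constraint
    CreateBucketConfiguration = list(LocationConstraint = region)
  )
}
```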
Sweet! With that you have a bucket into which you can upload files. To demonstrate that, let’s take our invoice from last week.
And to actually upload the file, we can use the put_object() method.
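As a sketch, the upload could look like this. The file name `invoice.pdf` and the bucket name are assumptions of mine, and `read_file_raw()` is a hypothetical helper that reads the file as raw bytes so it can be passed to `Body`.

```r
# Hypothetical helper: read a whole file into a raw vector
read_file_raw <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

file_path <- "invoice.pdf"           # hypothetical local file
bucket_name <- "my-3mw-demo-bucket"  # hypothetical bucket name

# Only upload when credentials are set and the file actually exists
if (nzchar(Sys.getenv("AWS_ACCESS_KEY_ID")) && file.exists(file_path)) {
  s3 <- paws::s3()
  s3$put_object(
    Bucket = bucket_name,
    Key = basename(file_path),  # name of the object inside the bucket
    Body = read_file_raw(file_path)
  )
}
```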
And to check that this worked, we could look into our bucket.
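One way to do that is sketched below. I’m assuming the `list_objects_v2()` method here (the original may have used a different listing call), the bucket name is made up, and `object_keys()` is a hypothetical helper that extracts the file names from the response.

```r
# Hypothetical helper: extract the object keys from a list_objects_v2() response
object_keys <- function(response) {
  vapply(response$Contents, function(obj) obj$Key, character(1))
}

bucket_name <- "my-3mw-demo-bucket"  # hypothetical bucket name

# Only talk to AWS when credentials are present in the environment
if (nzchar(Sys.getenv("AWS_ACCESS_KEY_ID"))) {
  s3 <- paws::s3()
  response <- s3$list_objects_v2(Bucket = bucket_name)
  print(object_keys(response))
}
```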
Nice! Look at that! Inside the Contents part of the output you can see the file. That means our upload was successful.
Perfect! With that we have paved the way to pass data to OCR services like AWS Textract. But that’s a story for next week.
As always, if you have any questions, or just want to reach out, feel free to contact me by replying to this mail or finding me on LinkedIn or on Bluesky.
See you next week,
Albert 👋