S3 loader to use boto3 built-in credential configuration #723
Description
Until now, authenticated reads from S3 were only possible by explicitly passing AWS credentials to the S3Loader. Boto3 supports multiple methods to configure and provide credentials.
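As a rough illustration of the difference between explicit credentials and boto3's built-in configuration, here is a minimal sketch. The helper names `s3_client_kwargs` and `make_s3_client` are hypothetical and not taken from this PR, and the fallback order shown is an assumption for illustration, not necessarily the priority the S3Loader implements.

```python
from typing import Optional


def s3_client_kwargs(access_key: Optional[str] = None,
                     secret_key: Optional[str] = None) -> dict:
    """Build keyword arguments for boto3.client("s3", **kwargs).

    Hypothetical helper for illustration; not part of the S3Loader.
    """
    if access_key and secret_key:
        # Explicit credentials are handed straight to the client.
        return {
            "aws_access_key_id": access_key,
            "aws_secret_access_key": secret_key,
        }
    # No (complete) explicit credentials: pass nothing, so boto3 resolves
    # them itself (environment variables, shared credentials file,
    # IAM role, ...).
    return {}


def make_s3_client(access_key: Optional[str] = None,
                   secret_key: Optional[str] = None):
    """Create an S3 client; boto3 is imported lazily (assumed installed)."""
    import boto3

    return boto3.client("s3", **s3_client_kwargs(access_key, secret_key))
```

Calling `make_s3_client()` without arguments leaves credential discovery entirely to boto3, so no secret needs to appear in code, URLs, or configuration passed to the loader.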
The priority when configuring the S3 client is now:
The S3Loader unit tests are enabled again (disabled in 403167f) and adapted to test authenticated reads from s3://commoncrawl/. If no AWS credentials are set up, the tests are skipped. I've successfully run the tests
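Skipping tests when no credentials are configured could look roughly like the sketch below. It assumes `unittest` and uses a hypothetical helper `aws_credentials_available` and test class `S3ReadTest`; the actual tests in this PR may use a different framework and check something else.

```python
import unittest


def aws_credentials_available() -> bool:
    """Return True if boto3 can resolve credentials from its default chain."""
    try:
        import boto3
    except ImportError:
        return False
    # Session.get_credentials() returns None when no credentials are found.
    return boto3.Session().get_credentials() is not None


@unittest.skipUnless(aws_credentials_available(),
                     "no AWS credentials configured")
class S3ReadTest(unittest.TestCase):  # hypothetical test class
    def test_authenticated_list(self):
        import boto3

        s3 = boto3.client("s3")
        # Assumes the resolved credentials are allowed to list this bucket.
        response = s3.list_objects_v2(Bucket="commoncrawl", MaxKeys=1)
        self.assertEqual(
            response["ResponseMetadata"]["HTTPStatusCode"], 200)
```

The check runs once when the module is imported, so on a machine without credentials the whole class is reported as skipped rather than failed.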
Motivation and Context
Best practice is to use methods where credentials are not visible and cannot leak into log files or into stack traces in error messages. Passing credentials in the URL (s3://user:password@bucket/) poses a security risk.
In April 2022, access to Common Crawl data via the S3 API was restricted to AWS users only. Users without an AWS account are required to switch to HTTP requests using a different base URL; see the announcement introducing CloudFront access and the general instructions for accessing Common Crawl data.
Types of changes
Checklist: