S3 loader to use boto3 built-in credential configuration #723

Merged

sebastian-nagel
Contributor

Description

So far, authenticated reads from S3 were only possible by explicitly passing AWS credentials to the S3Loader. Boto3 supports multiple other methods to configure and provide credentials.

The priority when configuring the S3 client is now:

  1. use the credentials passed into the S3Loader
  2. (newly introduced) try other boto3 built-in methods (environment variables, configuration files, IAM roles)
  3. try unauthenticated / anonymous access (no AWS account required)
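
The priority order above can be sketched as follows. This is an illustration only, not the actual S3Loader code from this pull request: the function names `choose_auth_mode` and `make_s3_client` and their parameters are hypothetical, and the sketch assumes that `boto3.Session().get_credentials()` returning `None` is an acceptable signal that no built-in credential source is available.

```python
def choose_auth_mode(access_key, secret_key, builtin_credentials_found):
    """Pick one of the three options: 'explicit', 'builtin' or 'anonymous'."""
    if access_key and secret_key:
        return "explicit"   # 1. credentials passed in by the caller
    if builtin_credentials_found:
        return "builtin"    # 2. environment variables, config files, IAM roles
    return "anonymous"      # 3. unsigned requests, no AWS account required


def make_s3_client(access_key=None, secret_key=None):
    """Hypothetical helper: create an S3 client following the priority above."""
    # Imported here so that choose_auth_mode() stays usable without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    explicit = bool(access_key and secret_key)
    # Only ask boto3 to search its credential chain if nothing was passed in.
    found = (not explicit) and boto3.Session().get_credentials() is not None
    mode = choose_auth_mode(access_key, secret_key, found)

    if mode == "explicit":
        return boto3.client(
            "s3",
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        )
    if mode == "builtin":
        # boto3 resolves the credentials itself when none are given.
        return boto3.client("s3")
    # Anonymous access: disable request signing.
    return boto3.client("s3", config=Config(signature_version=UNSIGNED))
```

Keeping the decision in a small pure function makes the priority order easy to check without network access or an AWS account.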

The S3Loader unit tests are enabled again (they were disabled in 403167f) and adapted to test authenticated reads from s3://commoncrawl/. If no AWS credentials are set up, the tests are skipped. I've successfully run the tests:

  • using a profile selected by the environment variable AWS_PROFILE
  • on an EC2 instance with an IAM role attached which grants permissions to read from s3://commoncrawl/
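
A minimal sketch of how such a skip condition might look, using the standard library's `unittest` rather than the project's actual test setup. The helper `aws_credentials_available` and the test class are hypothetical names, and the `head_bucket` check assumes the configured user or role is allowed to list the bucket.

```python
import unittest


def aws_credentials_available():
    """Return True if boto3 finds credentials via any built-in method."""
    try:
        import boto3
    except ImportError:
        return False
    # get_credentials() returns None when the credential chain finds nothing.
    return boto3.Session().get_credentials() is not None


@unittest.skipUnless(aws_credentials_available(), "no AWS credentials configured")
class TestAuthenticatedS3Read(unittest.TestCase):
    def test_bucket_is_accessible(self):
        import boto3

        # Raises botocore.exceptions.ClientError if access is denied.
        boto3.client("s3").head_bucket(Bucket="commoncrawl")
```

With this approach the same test module runs unchanged on a developer machine using AWS_PROFILE, on an EC2 instance with an IAM role, and in environments without any credentials, where the tests are reported as skipped instead of failed.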

Motivation and Context

Best practice is to use methods where credentials are not visible and cannot leak into log files or into stack traces in error messages. Passing credentials in the URL (s3://user@password:bucket/) may pose a security risk.

In April 2022 access to Common Crawl data via the S3 API was restricted to AWS users only. Users with no AWS account are required to switch to HTTP requests using a different base URL, see introducing CloudFront access and general instructions to access Common Crawl data.

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

@ikreymer
Member

ikreymer commented Aug 9, 2022

Great, thanks! To have the CommonCrawl tests run, can just add any s3 credentials, right?

@ikreymer ikreymer merged commit 510c9dc into webrecorder:main Aug 9, 2022
@sebastian-nagel
Contributor Author

Great, thanks! To have the CommonCrawl tests run, can just add any s3 credentials, right?

Yes. Any credentials are sufficient as long as S3 access (in particular, reading from s3://commoncrawl/) is granted to the given user or role.

sebastian-nagel added a commit to commoncrawl/pywb that referenced this pull request Apr 3, 2023
…#723)

* S3Loader: allow authenticated S3 access using boto3 built-in
configuration methods without explicitly passing credentials, cf.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials

* S3Loader tests: re-enable tests reading from s3://commoncrawl/
in order to test authenticated reads. Tests are skipped
if no AWS credentials are configured.