docs: add integration docs for s3 backend #2658

Merged · 7 commits · Jul 11, 2024
102 changes: 102 additions & 0 deletions docs/integrations/object-storage/s3.md
@@ -0,0 +1,102 @@
# AWS S3 Storage Backend

`delta-rs` offers native support for using AWS S3 as an object storage backend.

You don't need to install any extra dependencies to read/write Delta tables to S3 with engines that use `delta-rs`. You do need to configure your AWS access credentials correctly.

## Note for boto3 users

Many Python engines use [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to connect to AWS. This library supports reading credentials automatically from your local `.aws/config` or `.aws/credentials` file.

For example, if you're running locally with the proper credentials in your `.aws/config` or `.aws/credentials` file, you can write a Parquet file to S3 with pandas like this:

```python
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
df.to_parquet("s3://avriiil/parquet-test-pandas")
```

The `delta-rs` writer does not use `boto3` and therefore does not support reading credentials from your `.aws/config` or `.aws/credentials` file. If you're used to working with writers from Python engines like Polars, pandas, or Dask, this may mean a small change to your workflow.
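
For example, here is a minimal sketch of opening a Delta table on S3 with the `deltalake` package, passing credentials explicitly through `storage_options` instead of relying on your AWS config files. The bucket path and credential values below are illustrative placeholders.

```python
from deltalake import DeltaTable

# delta-rs does not read ~/.aws/config or ~/.aws/credentials,
# so credentials are passed explicitly via storage_options.
dt = DeltaTable(
    "s3://avriiil/delta-test",  # illustrative path
    storage_options={
        "AWS_ACCESS_KEY_ID": "<key_id>",
        "AWS_SECRET_ACCESS_KEY": "<access_key>",
        "AWS_REGION": "<region_name>",
    },
)
df = dt.to_pandas()
```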

## Passing AWS Credentials

You can pass your AWS credentials explicitly by using:

- the `storage_options` kwarg
- environment variables (see the sketch after this list)
- EC2 metadata if using EC2 instances
- AWS Profiles
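
As a rough sketch of the environment-variable route, you can set the standard AWS variables before calling `delta-rs`; they are shown here from Python for convenience, but exporting them in your shell works just as well. The values are placeholders.

```python
import os

# Standard AWS credential environment variables; delta-rs picks these up
# when no explicit storage_options are provided.
os.environ["AWS_ACCESS_KEY_ID"] = "<key_id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<access_key>"
os.environ["AWS_REGION"] = "<region_name>"
```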

## Example

Let's work through an example with Polars. The same logic applies to other Python engines like pandas, Daft, and Dask.

Follow the steps below to use Delta Lake on S3 with Polars:

1. Install Polars and deltalake, for example with pip:

`pip install polars deltalake`

2. Create a dataframe with some toy data.

```python
import polars as pl

df = pl.DataFrame({'x': [1, 2, 3]})
```

3. Set your `storage_options` correctly.

```python
storage_options = {
    "AWS_REGION": <region_name>,
    "AWS_ACCESS_KEY_ID": <key_id>,
    "AWS_SECRET_ACCESS_KEY": <access_key>,
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",
}
```

4. Write data to Delta table using the `storage_options` kwarg.

```python
df.write_delta(
    "s3://bucket/delta_table",
    storage_options=storage_options,
)
```
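
To read the table back, Polars accepts the same `storage_options` mapping. A small sketch, assuming the table written above:

```python
df = pl.read_delta(
    "s3://bucket/delta_table",
    storage_options=storage_options,
)
print(df)
```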

## Delta Lake on S3: Safe Concurrent Writes

You need a locking provider to ensure safe concurrent writes when writing Delta tables to S3. This is because S3 does not guarantee mutual exclusion.

A locking provider ensures that only one writer can create a given file at a time. This prevents corrupted or conflicting data.

`delta-rs` uses DynamoDB to guarantee safe concurrent writes.

Run the command below in your terminal to create a DynamoDB table that will act as your locking provider.

```bash
aws dynamodb create-table \
--table-name delta_log \
--attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
--key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
```
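
If you prefer to create the lock table from Python, here is a boto3 sketch that mirrors the CLI command above, assuming boto3 is installed and can find credentials in the usual way; the region is a placeholder.

```python
import boto3

# Create the DynamoDB lock table used by delta-rs for safe concurrent writes.
dynamodb = boto3.client("dynamodb", region_name="<region_name>")
dynamodb.create_table(
    TableName="delta_log",
    AttributeDefinitions=[
        {"AttributeName": "tablePath", "AttributeType": "S"},
        {"AttributeName": "fileName", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "tablePath", "KeyType": "HASH"},
        {"AttributeName": "fileName", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```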

If you don't want to use DynamoDB as your locking mechanism, you can set the `AWS_S3_ALLOW_UNSAFE_RENAME` variable to `true` to enable unsafe S3 writes instead.

Read more in the [Usage](../../usage/writing/writing-to-s3-with-locking-provider.md) section.

## Delta Lake on S3: Required permissions

You need permissions to get, put, and delete objects in the S3 bucket where you're storing your data. Note that you must be allowed to delete objects even if you're only appending to the Delta table, because temporary files are written to the log folder and deleted after use.

In AWS S3, you will need the following permissions:

- s3:GetObject
- s3:PutObject
- s3:DeleteObject

In DynamoDB, you will need the following permissions:

- dynamodb:GetItem
- dynamodb:Query
- dynamodb:PutItem
- dynamodb:UpdateItem
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -82,7 +82,8 @@ nav:
- api/exceptions.md
- Integrations:
- Object Storage:
- integrations/object-storage/hdfs.md
- integrations/object-storage/hdfs.md
- integrations/object-storage/s3.md
- Arrow: integrations/delta-lake-arrow.md
- Daft: integrations/delta-lake-daft.md
- Dagster: integrations/delta-lake-dagster.md