
Implement option to write to a custom buffer during S3 writes #547

Merged: 10 commits into develop from diskbuffer, Feb 15, 2021

Conversation

@mpenkov (Collaborator) commented Oct 6, 2020

Currently, we buffer in-memory. This works well in most cases, but can
be a problem when performing many uploads in parallel. Each upload
needs to buffer at least 5MB of data (that is the minimum part size for
AWS S3). When performing hundreds or thousands of uploads, memory can
become a bottleneck.

This PR relieves that bottleneck by buffering to temporary files
instead. Each part gets saved to a separate temporary file and discarded
after upload. This reduces memory usage at the cost of increased IO with
the local disk.

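To make the trade-off concrete, here is a rough sketch (illustrative only, not this PR's actual code) of the two kinds of part buffers involved:

import io
import tempfile

# In-memory part buffer (the current behaviour): each in-flight upload
# holds at least one ~5 MB part in RAM, so a thousand simultaneous
# uploads need roughly 5 GB just for buffering.
memory_part = io.BytesIO()

# Disk-backed part buffer (what this change enables): the part lives in
# a temporary file and is discarded once it has been uploaded.
disk_part = tempfile.TemporaryFile()

# Both expose the same file-like protocol the uploader relies on.
for buf in (memory_part, disk_part):
    buf.write(b'x' * 1024)
    buf.seek(0)
    assert buf.read() == b'x' * 1024
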
@piskvorky (Owner) left a comment:

What's the motivation? Who needs thousands of parallel uploads and why?

Seems like a strange obfuscation so -1 on including, unless clearly motivated.

smart_open/s3.py (two review threads, outdated, resolved)

Review thread on the following change:

FILE
-/home/misha/git/smart_open/smart_open/__init__.py
+/Users/misha/git/smart_open/smart_open/__init__.py
A contributor commented:

osx migration, again :)

@mpenkov (Collaborator, Author) replied:

Migration between home and office :)

Commit by @mpenkov: re-use the same buffer in the same writer instance

@mpenkov (Collaborator, Author) commented Oct 7, 2020

> What's the motivation? Who needs thousands of parallel uploads and why?

I'm partitioning a large (100GB, hundreds of millions of lines) dataset in preparation for some data processing. I need each partition to be relatively small, so I'm using a large number of partitions (hundreds/thousands). If I have all the partitions open for writing simultaneously, the partitioning is very simple:

for record in dataset:
    partition_index = calculate_partition(record)
    partitions[partition_index].write(record)
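
For context, the partition writers above could be opened along these lines (the bucket name, partition count and calculate_partition are hypothetical, not part of this PR):

import smart_open

NUM_PARTITIONS = 1000  # hypothetical

# One writer per partition, all open simultaneously; each writer buffers
# its current multipart-upload part independently.
partitions = [
    smart_open.open('s3://my-bucket/partitions/%04d.txt' % i, 'w')
    for i in range(NUM_PARTITIONS)
]

# ... the loop above writes each record to its partition ...

for p in partitions:
    p.close()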

> Seems like a strange obfuscation so -1 on including, unless clearly motivated.

I updated the implementation. It's not really an obfuscation anymore - the custom buffer is now an optional parameter. If people don't pass it in, the implementation behaves identically to the current version in PyPI.
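
Assuming the optional parameter is exposed as a writebuffer transport parameter (the name here is an assumption), usage might look like this:

import tempfile
import smart_open

# Default: parts are buffered in memory, exactly as before.
with smart_open.open('s3://my-bucket/key.txt', 'wb') as fout:
    fout.write(b'hello world\n')

# Opt-in: supply a custom, disk-backed buffer for the parts instead.
tmp = tempfile.NamedTemporaryFile()
with smart_open.open(
    's3://my-bucket/key.txt',
    'wb',
    transport_params={'writebuffer': tmp},  # assumed parameter name
) as fout:
    fout.write(b'hello world\n')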

@mpenkov changed the title from "Implement option to buffer to disk for S3 writes" to "Implement option write to a custom buffer S3 writes" on Oct 7, 2020
@mpenkov changed the title from "Implement option write to a custom buffer S3 writes" to "Implement option to write to a custom buffer during S3 writes" on Oct 7, 2020
smart_open/s3.py (review thread, resolved)
@mpenkov merged commit 648d198 into develop on Feb 15, 2021
@mpenkov deleted the diskbuffer branch on February 15, 2021 06:41
howto.md (review thread, resolved)