Implement option to write to a custom buffer during S3 writes #547
Conversation
Currently, we buffer in-memory. This works well in most cases, but can be a problem when performing many uploads in parallel. Each upload needs to buffer at least 5MB of data (that is the minimum part size for AWS S3). When performing hundreds or thousands of uploads, memory can become a bottleneck. This PR relieves that bottleneck by buffering to temporary files instead. Each part gets saved to a separate temporary file and discarded after upload. This reduces memory usage at the cost of increased IO with the local disk.
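For illustration, a minimal sketch of the temp-file approach described above might look like this (this is not the PR's actual code; the boto3 client usage and the helper name are assumptions):

```python
import tempfile

import boto3

MIN_PART_SIZE = 5 * 1024 * 1024  # AWS S3 minimum part size for multipart uploads


def upload_part_via_tempfile(s3_client, bucket, key, upload_id, part_number, data):
    """Spool one part to a temporary file, upload it, then discard the file."""
    with tempfile.TemporaryFile() as buf:
        buf.write(data)   # the part lives on local disk instead of RAM
        buf.seek(0)
        response = s3_client.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            PartNumber=part_number,
            Body=buf,
        )
    # the temporary file is deleted as soon as the `with` block exits
    return {"PartNumber": part_number, "ETag": response["ETag"]}
```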
What's the motivation? Who needs thousands of parallel uploads and why?
Seems like a strange obfuscation so -1 on including, unless clearly motivated.
```diff
 FILE
-/home/misha/git/smart_open/smart_open/__init__.py
+/Users/misha/git/smart_open/smart_open/__init__.py
```
osx migration, again :)
Migration between home and office :)
re-use the same buffer in the same writer instance
I'm partitioning a large (100GB, hundreds of millions of lines) dataset in preparation for some data processing. I need each partition to be relatively small, so I'm using a large number of partitions (hundreds/thousands). If I have all the partitions open for writing simultaneously, the partitioning is very simple:
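(Reconstructed sketch; the original snippet isn't in this excerpt. The bucket name, partition count, and partition-key logic below are placeholders.)

```python
from smart_open import open as sopen

NUM_PARTITIONS = 1000  # hypothetical partition count

# Open one writer per partition up front, then route each record as it streams by.
writers = [
    sopen("s3://my-bucket/partitions/%04d.txt" % i, "w")
    for i in range(NUM_PARTITIONS)
]
try:
    with open("huge-dataset.txt") as fin:
        for line in fin:
            key = line.split("\t", 1)[0]  # partition on the first column
            writers[hash(key) % NUM_PARTITIONS].write(line)
finally:
    for w in writers:
        w.close()
```

With the default in-memory buffering, each of those writers holds at least 5MB, which is why memory becomes the bottleneck at this scale.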
I updated the implementation. It's not really an obfuscation anymore: the custom buffer is now an optional parameter. If people don't pass it in, the implementation behaves identically to the current version in PyPI.
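Assuming the optional buffer ends up exposed through transport_params (the writebuffer name here reflects this PR's direction and should be checked against the merged API), usage might look like:

```python
import tempfile

from smart_open import open as sopen

# Pass an explicit, disk-backed buffer; omit it to keep the default in-memory behaviour.
with tempfile.TemporaryFile() as tmp:
    with sopen(
        "s3://my-bucket/big-output.txt",
        "wb",
        transport_params={"writebuffer": tmp},  # parameter name assumed from this PR
    ) as fout:
        fout.write(b"hello world\n")
```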