Implement option to write to a custom buffer during S3 writes #547
Conversation
Currently, we buffer in-memory. This works well in most cases, but can be a problem when performing many uploads in parallel. Each upload needs to buffer at least 5MB of data (that is the minimum part size for AWS S3). When performing hundreds or thousands of uploads, memory can become a bottleneck. This PR relieves that bottleneck by buffering to temporary files instead. Each part gets saved to a separate temporary file and discarded after upload. This reduces memory usage at the cost of increased IO with the local disk.
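For illustration, a minimal sketch of the temp-file approach described above might look like this (this is not the PR's actual code; the boto3 client usage and the helper name are assumptions):

```python
import tempfile

import boto3

MIN_PART_SIZE = 5 * 1024 * 1024  # AWS S3 minimum part size for multipart uploads


def upload_part_via_tempfile(s3_client, bucket, key, upload_id, part_number, data):
    """Spool one part to a temporary file, upload it, then discard the file."""
    with tempfile.TemporaryFile() as buf:
        buf.write(data)   # the part lives on local disk instead of RAM
        buf.seek(0)
        response = s3_client.upload_part(
            Bucket=bucket,
            Key=key,
            UploadId=upload_id,
            PartNumber=part_number,
            Body=buf,
        )
    # the temporary file is deleted as soon as the `with` block exits
    return {"PartNumber": part_number, "ETag": response["ETag"]}
```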
What's the motivation? Who needs thousands of parallel uploads and why?
Seems like a strange obfuscation so -1 on including, unless clearly motivated.
```diff
 FILE
-/home/misha/git/smart_open/smart_open/__init__.py
+/Users/misha/git/smart_open/smart_open/__init__.py
```
osx migration, again :)
Migration between home and office :)
re-use the same buffer in the same writer instance
I'm partitioning a large (100GB, hundreds of millions of lines) dataset in preparation for some data processing. I need each partition to be relatively small, so I'm using a large number of partitions (hundreds/thousands). If I have all the partitions open for writing simultaneously, the partitioning is very simple:
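(Reconstructed sketch; the original snippet isn't in this excerpt. The bucket name, partition count, and partition-key logic below are placeholders.)

```python
from smart_open import open as sopen

NUM_PARTITIONS = 1000  # hypothetical partition count

# Open one writer per partition up front, then route each record as it streams by.
writers = [
    sopen("s3://my-bucket/partitions/%04d.txt" % i, "w")
    for i in range(NUM_PARTITIONS)
]
try:
    with open("huge-dataset.txt") as fin:
        for line in fin:
            key = line.split("\t", 1)[0]  # partition on the first column
            writers[hash(key) % NUM_PARTITIONS].write(line)
finally:
    for w in writers:
        w.close()
```

With the default in-memory buffering, each of those writers holds at least 5MB, which is why memory becomes the bottleneck at this scale.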
I updated the implementation. It's not really an obfuscation anymore: the custom buffer is now an optional parameter. If people don't pass it in, the implementation behaves identically to the current version in PyPI.
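Assuming the optional buffer ends up exposed through transport_params (the writebuffer name here reflects this PR's direction and should be checked against the merged API), usage might look like:

```python
import tempfile

from smart_open import open as sopen

# Pass an explicit, disk-backed buffer; omit it to keep the default in-memory behaviour.
with tempfile.TemporaryFile() as tmp:
    with sopen(
        "s3://my-bucket/big-output.txt",
        "wb",
        transport_params={"writebuffer": tmp},  # parameter name assumed from this PR
    ) as fout:
        fout.write(b"hello world\n")
```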