Improve S3 read performance by not copying buffer #284
Conversation
Previously, the buffer of an S3 file opened in 'rb' mode was split into two copies whenever its `read()` method was called: one copy was returned to the caller, and the other copy became the new buffer. This led to bad performance when performing many small reads with the buffer set to a size large enough to amortize the cost of transferring data from S3.

In our use case, we set the buffer to 4 MiB when reading a 16.5 MiB file containing 694k serialized protobuf messages, each prefixed by its byte length. This required calling the S3 file's `read()` method nearly 1.4 million times, which in turn meant that the up-to-4-MiB buffer needed to be copied and garbage collected the same number of times. All of that copying and garbage collecting alone took 152 seconds, for a total of 179 seconds to download and parse the file.

Eliminate the copying on each `read()` by adding an instance attribute `_buffer_pos` to smart_open.s3's BufferedInputBase and its descendants, in order to track the current position within the buffer. Change `read()` to use `_buffer_pos` to create the slice of the buffer to return to the caller, and then advance `_buffer_pos` by the size of that slice.

Add two new internal instance methods to BufferedInputBase to accommodate the fact that the buffer now contains already-read data. The first, `_len_remaining_buffer()`, calculates how many unread bytes remain in the buffer (previously this was simply the length of the buffer); it is now used everywhere that `len(self._buffer)` was previously used in BufferedInputBase. The second, `_remaining_buffer()`, returns a copy of the unread portion of the buffer; it is used in BufferedInputBase's `readline()` method to find the next unread newline.

With these changes, that same 16.5 MiB protobuf file with 694k messages can be downloaded and parsed in 10 seconds, a 94% improvement in performance!
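As a rough illustration of the approach (a simplified sketch, not the actual smart_open code; in particular, the real `read()` keeps fetching from S3 until it has `size` bytes, whereas this toy version may return a short read at a buffer boundary):

```python
import io

class PositionTrackingReader:
    """Illustrative sketch only -- the real classes live in smart_open.s3."""

    def __init__(self, raw, buffer_size=4 * 1024 ** 2):
        self._raw = raw              # any file-like object stands in for the S3 stream
        self._buffer = b''
        self._buffer_pos = 0         # cursor into self._buffer
        self._buffer_size = buffer_size

    def _len_remaining_buffer(self):
        # Unread bytes left in the buffer (replaces len(self._buffer)).
        return len(self._buffer) - self._buffer_pos

    def _remaining_buffer(self):
        # Copy of the unread portion, e.g. for readline()'s newline search.
        return self._buffer[self._buffer_pos:]

    def read(self, size=-1):
        if self._len_remaining_buffer() == 0:
            self._buffer = self._raw.read(self._buffer_size)
            self._buffer_pos = 0
            if not self._buffer:
                return b''           # end of stream
        if size < 0:
            size = self._len_remaining_buffer()
        # Slice out the caller's bytes and advance the cursor -- the rest of
        # the buffer is never copied, unlike the old two-way split.
        part = self._buffer[self._buffer_pos:self._buffer_pos + size]
        self._buffer_pos += len(part)
        return part

reader = PositionTrackingReader(io.BytesIO(b'hello world'), buffer_size=6)
print(reader.read(2), reader.read(2))   # b'he' b'll'
```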
Left a comment about creating a new class to encapsulate the bytestring-based buffer functionality.
This class encapsulates the buffer functionality implemented in smart_open.s3.BufferedInputBase and in smart_open.http.BufferedInputBase as a standalone class. It uses the read implementation introduced to the S3 BufferedInputBase in the previous commit, to minimize copying of the buffer.

Because the use case for these ByteBuffers is to amortize the network transfer overhead costs of reading data from S3 or via HTTP, it's better to read a consistent number of bytes when filling up the buffer, to consistently amortize that overhead cost, rather than filling the buffer up to some a priori capacity. To that end, the ByteBuffer constructor takes a `chunk_size` that sets the number of bytes to read when filling the buffer, rather than a fixed buffer capacity (which could easily lead to reading a different number of bytes from the reader if the caller ever filled the buffer when it was not empty). However, the caller can still ask for fewer than `chunk_size` bytes to be read during filling, using the fill method's `size` argument.
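For orientation, a minimal sketch of the interface being described (a toy class; the real implementation in smart_open.bytebuffer may differ in detail):

```python
import io

class ByteBufferSketch:
    """Toy version of the idea; see smart_open.bytebuffer.ByteBuffer for the real class."""

    def __init__(self, chunk_size=io.DEFAULT_BUFFER_SIZE):
        self._chunk_size = chunk_size   # how much to pull from the source per fill
        self._bytes = b''
        self._pos = 0                   # cursor into self._bytes

    def __len__(self):
        # Only the unread bytes count.
        return len(self._bytes) - self._pos

    def read(self, size=-1):
        if size < 0 or size > len(self):
            size = len(self)
        part = self._bytes[self._pos:self._pos + size]
        self._pos += size               # advance instead of copying the remainder
        return part

    def fill(self, source, size=-1):
        # Pull up to chunk_size bytes (or fewer, if `size` says so) from a
        # file-like object and append them after the unread bytes.
        size = self._chunk_size if size < 0 else min(size, self._chunk_size)
        new_bytes = source.read(size)
        self._bytes = self._bytes[self._pos:] + new_bytes
        self._pos = 0
        return len(new_bytes)

buf = ByteBufferSketch(chunk_size=4)
buf.fill(io.BytesIO(b'abcdefgh'))
print(buf.read(2), len(buf))   # b'ab' 2
```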
Change the BufferedInputBase class (and descendants) in both smart_open.s3 and smart_open.http to use instances of ByteBuffer for their read buffers, instead of bytestrings. This simplifies their implementation, while still avoiding copying on reads to maintain the 94% performance improvement for many small reads on an S3 file introduced two commits prior.
Change the tests in the `size` argument's default-setting conditionals for smart_open.s3.BufferedInputBase from `size > 0` to `size >= 0`. The first one could be a genuine bug (though the class never called it that way), because when a caller asked for zero bytes, they'd get up to a whole buffer_size's worth back. The second one is a semantics change: if a caller really wants to fill the buffer with zero bytes, we'll let them.
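Roughly the difference in behaviour, as a hypothetical helper (not code from the PR):

```python
def _effective_fill_size(size, chunk_size):
    # Sketch of the semantics change: with `size > 0`, a request for zero
    # bytes silently fell through to the default; with `size >= 0`, zero is
    # honoured as a legitimate (if unusual) request.
    return size if size >= 0 else chunk_size

print(_effective_fill_size(0, 4096))    # 0 (previously this would have been 4096)
print(_effective_fill_size(-1, 4096))   # 4096
```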
Looks very good. Please have a look at the comments I left.
Ok, I'll address these comments on Monday, between 16:00 and 23:00 UTC.
Move the smart_open.bytebuffer module's docstring to be the ByteBuffer class's docstring instead, and replace the module docstring with a short summary of its contents. Explicitly document all the parameters for ByteBuffer methods.
It would be beneficial to fill ByteBuffers with lists that contain bytestrings, in order to support experimentation at the REPL and to make doctest examples more succinct. To that end, change the implementation of its `fill` method to use a for loop to iterate over the byte source when it's not a file-like object, and break out of the loop once the condition is met (rather than a while loop that uses `next`, which doesn't work on lists). Now that `fill` accepts lists in addition to file-like objects and iterators, change the byte source argument name from the clunky `reader_or_iterable` to `source`, and explain what it's allowed to be in the parameter section of the docstring. Add a test to ensure that filling from lists works.
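The shape of that change, sketched as a standalone helper (hypothetical names, not the actual `fill` implementation):

```python
import io

def _read_from_source(source, size):
    # Hypothetical helper showing the dispatch described above: file-like
    # objects are read directly; anything else is iterated with a for loop,
    # which works for lists as well as iterators and generators.
    if hasattr(source, 'read'):
        return source.read(size)
    chunks, total = [], 0
    for chunk in source:
        chunks.append(chunk)
        total += len(chunk)
        if total >= size:
            break                      # stop once enough bytes are collected
    return b''.join(chunks)

print(_read_from_source(io.BytesIO(b'abcdef'), 4))   # b'abcd'
print(_read_from_source([b'ab', b'cd', b'ef'], 3))   # b'abcd' (whole chunks are kept)
```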
Sorry, hit the wrong button. I've addressed all of your latest comments except the doctest example in the ByteBuffer class docstring. I wasn't able to get the same example to pass in both Python 2 and 3; you can see the commit I pushed up to another branch for the details.
Don't worry about Py2/3 compatibility in the doctests. As long as it passes under Py3 and the example clearly states that, it's OK. More in-depth Py2/3 support details get handled in the unit tests. So, please add your doctests to this PR.
This only passes as a doctest in Python 3 because of the bytestring literals for some of the expected values. In Python 2, those values are simply regular strings and doctest seems to work by comparing the string representations of the values rather than the values themselves, so those lines fail because they don't have the `b` prefix.
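A hypothetical two-line doctest illustrates the mismatch:

```python
>>> data = b'he'   # bytes in Python 3, but just an alias for str in Python 2
>>> data           # doctest compares reprs, so the expected line must match exactly
b'he'
```

Under Python 3 the repr is `b'he'` and the expected line matches; under Python 2 the same literal is a plain `str` whose repr is `'he'`, so the `b`-prefixed expected line fails.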
Force-pushed from c604a78 to cec4679.
Sorry, I force-pushed because I accidentally pushed up the earlier version of the doctest commit that was lacking the note about it only passing in Python 3. Hope it's not too much of an issue.
Is this ready to merge? I believe I've addressed all your comments (I don't know why GitHub still shows changes requested given all those comments have been resolved).
From your description, the performance gains are huge, but our test benchmarks don't pick them up. This is likely because the benchmarks are lacking. It would be good to add a test case to our integration tests (see integration-tests/test_s3.py) to prevent future regressions. Can you please have a look at this?
I've got a benchmark; I'm iterating on the size so it demonstrates the improvement without taking an excessively long time (I'm shooting for a minute or so).
Add the `test_s3_performance_small_reads` benchmark to the S3 integration tests. This benchmark creates a one-megabyte test file filled with 65,536 (64 Ki) "messages", each prefixed with a byte indicating its length. The file is downloaded into a buffer of the same size, and then each length byte and message is read one at a time from the buffer, for a total of 131,072 `read()` calls. Before the changes in this PR, downloading and reading the file this way took 4.97 seconds on average. After this PR's changes, it takes only 1.61 seconds, a 68% improvement in performance. The magnitude of the improvement increases as the size of the buffer increases or the size of the messages decreases.
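Roughly what such a benchmark does, as a self-contained sketch (the real test lives in integration-tests/test_s3.py; the URL here is hypothetical, the `smart_open.smart_open` entry point is the one from the smart_open versions this PR targets, and the `buffer_size` keyword is assumed to be forwarded through to the S3 reader):

```python
import io
import struct

import smart_open

NUM_MESSAGES = 65536          # 64 Ki one-byte-prefixed messages
MESSAGE_BODY = b'x' * 15      # 1 length byte + 15 body bytes each = exactly 1 MiB total

def build_payload():
    out = io.BytesIO()
    for _ in range(NUM_MESSAGES):
        out.write(struct.pack('B', len(MESSAGE_BODY)))
        out.write(MESSAGE_BODY)
    return out.getvalue()

def benchmark_small_reads(url):
    # `url` is a hypothetical s3:// URL pointing at a bucket you control.
    payload = build_payload()
    with smart_open.smart_open(url, 'wb') as fout:
        fout.write(payload)
    # Read it back one length byte and one message at a time:
    # 2 * NUM_MESSAGES = 131,072 read() calls in total.
    with smart_open.smart_open(url, 'rb', buffer_size=len(payload)) as fin:
        while True:
            prefix = fin.read(1)
            if not prefix:
                break
            fin.read(struct.unpack('B', prefix)[0])

# benchmark_small_reads('s3://my-test-bucket/issue_284_small_reads.bin')
```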
Results from running the benchmark on the current master branch:
Results from running the benchmark on this PR branch:
@aperiodic Congrats on your first merged PR to smart_open 🥇 Awesome work!
Thank you, and thanks for the review! I enjoyed contributing. |