Improve S3 read performance by not copying buffer #284
Conversation
Previously, the buffer of an S3 file opened in 'rb' mode was split into two copies whenever its `read()` method was called: one copy was returned to the caller, and the other copy became the new buffer. This led to bad performance when performing many small reads with the buffer set to a size large enough to amortize the cost of transferring data from S3.

In our use case, we set the buffer to 4 MiB when reading a 16.5 MiB file containing 694k serialized protobuf messages, each prefixed by its byte length. This required calling the S3 file's `read()` method nearly 1.4 million times, which in turn meant that the up-to-4-MiB buffer needed to be copied and garbage collected the same number of times. All of that copying and garbage collecting alone took 152 seconds, for a total of 179 seconds to download and parse the file.

Eliminate the copying on each `read()` by adding an instance attribute `_buffer_pos` to smart_open.s3's BufferedInputBase and its descendants, in order to track the current position within the buffer. Change `read()` to use `_buffer_pos` to create the slice of the buffer to return to the caller, and then advance `_buffer_pos` by the size of that slice.

Add two new internal instance methods to BufferedInputBase to accommodate the fact that the buffer now contains already-read data. The first, `_len_remaining_buffer()`, calculates how many unread bytes remain in the buffer (previously this was simply the length of the buffer); it is now used everywhere that `len(self._buffer)` was previously used in BufferedInputBase. The second, `_remaining_buffer()`, returns a copy of the unread portion of the buffer; it is used in BufferedInputBase's `readline()` method to find the next unread newline.

With these changes, that same 16.5 MiB protobuf file with 694k messages can be downloaded and parsed in 10 seconds, a 94% improvement in performance!
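As a rough illustration of the approach (a simplified sketch, not the actual smart_open code; in particular, the real `read()` keeps fetching from S3 until it has `size` bytes, whereas this toy version may return a short read at a buffer boundary):

```python
import io

class PositionTrackingReader:
    """Illustrative sketch only -- the real classes live in smart_open.s3."""

    def __init__(self, raw, buffer_size=4 * 1024 ** 2):
        self._raw = raw              # any file-like object stands in for the S3 stream
        self._buffer = b''
        self._buffer_pos = 0         # cursor into self._buffer
        self._buffer_size = buffer_size

    def _len_remaining_buffer(self):
        # Unread bytes left in the buffer (replaces len(self._buffer)).
        return len(self._buffer) - self._buffer_pos

    def _remaining_buffer(self):
        # Copy of the unread portion, e.g. for readline()'s newline search.
        return self._buffer[self._buffer_pos:]

    def read(self, size=-1):
        if self._len_remaining_buffer() == 0:
            self._buffer = self._raw.read(self._buffer_size)
            self._buffer_pos = 0
            if not self._buffer:
                return b''           # end of stream
        if size < 0:
            size = self._len_remaining_buffer()
        # Slice out the caller's bytes and advance the cursor -- the rest of
        # the buffer is never copied, unlike the old two-way split.
        part = self._buffer[self._buffer_pos:self._buffer_pos + size]
        self._buffer_pos += len(part)
        return part

reader = PositionTrackingReader(io.BytesIO(b'hello world'), buffer_size=6)
print(reader.read(2), reader.read(2))   # b'he' b'll'
```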
Left a comment about creating a new class to encapsulate the bytestring-based buffer functionality.
This class encapsulates the buffer functionality implemented in smart_open.s3.BufferedInputBase and in smart_open.http.BufferedInputBase as a standalone class. It uses the read implementation introduced to the S3 BufferedInputBase in the previous commit, to minimize copying of the buffer.

Because the use case for these ByteBuffers is to amortize the network transfer overhead costs of reading data from S3 or via HTTP, it's better to read a consistent number of bytes when filling up the buffer, to consistently amortize that overhead cost, rather than filling the buffer up to some a priori capacity. To that end, the ByteBuffer constructor takes a `chunk_size` that sets the number of bytes to read when filling the buffer, rather than a fixed buffer capacity (which could easily lead to reading a different number of bytes from the reader if the caller ever filled the buffer when it was not empty). However, the caller can still ask for fewer than `chunk_size` bytes to be read during filling, using the fill method's `size` argument.
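For orientation, a minimal sketch of the interface being described (a toy class; the real implementation in smart_open.bytebuffer may differ in detail):

```python
import io

class ByteBufferSketch:
    """Toy version of the idea; see smart_open.bytebuffer.ByteBuffer for the real class."""

    def __init__(self, chunk_size=io.DEFAULT_BUFFER_SIZE):
        self._chunk_size = chunk_size   # how much to pull from the source per fill
        self._bytes = b''
        self._pos = 0                   # cursor into self._bytes

    def __len__(self):
        # Only the unread bytes count.
        return len(self._bytes) - self._pos

    def read(self, size=-1):
        if size < 0 or size > len(self):
            size = len(self)
        part = self._bytes[self._pos:self._pos + size]
        self._pos += size               # advance instead of copying the remainder
        return part

    def fill(self, source, size=-1):
        # Pull up to chunk_size bytes (or fewer, if `size` says so) from a
        # file-like object and append them after the unread bytes.
        size = self._chunk_size if size < 0 else min(size, self._chunk_size)
        new_bytes = source.read(size)
        self._bytes = self._bytes[self._pos:] + new_bytes
        self._pos = 0
        return len(new_bytes)

buf = ByteBufferSketch(chunk_size=4)
buf.fill(io.BytesIO(b'abcdefgh'))
print(buf.read(2), len(buf))   # b'ab' 2
```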
Change the BufferedInputBase class (and descendants) in both smart_open.s3 and smart_open.http to use instances of ByteBuffer for their read buffers, instead of bytestrings. This simplifies their implementation, while still avoiding copying on reads to maintain the 94% performance improvement for many small reads on an S3 file introduced two commits prior.
Change the tests in the `size` argument's default-setting conditionals for smart_open.s3.BufferedInputBase from `size > 0` to `size >= 0`. The first one could be a genuine bug (though the class never called it that way), because when a caller asked for zero bytes, they'd get up to a whole buffer_size's worth back. The second one is a semantics change: if a caller really wants to fill the buffer with zero bytes, we'll let them.
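Roughly the difference in behaviour, as a hypothetical helper (not code from the PR):

```python
def _effective_fill_size(size, chunk_size):
    # Sketch of the semantics change: with `size > 0`, a request for zero
    # bytes silently fell through to the default; with `size >= 0`, zero is
    # honoured as a legitimate (if unusual) request.
    return size if size >= 0 else chunk_size

print(_effective_fill_size(0, 4096))    # 0 (previously this would have been 4096)
print(_effective_fill_size(-1, 4096))   # 4096
```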
Looks very good. Please have a look at the comments I left.
Ok, I'll address these comments on Monday, between 16:00 and 23:00 UTC.
Move the smart_open.bytebuffer module's docstring to be the ByteBuffer class's docstring instead, and replace the module docstring with a short summary of its contents. Explicitly document all the parameters for ByteBuffer methods.
It would be beneficial to fill ByteBuffers with lists that contain bytestrings, in order to support experimentation at the REPL and to make doctest examples more succinct. To that end, change the implementation of its `fill` method to use a for loop to iterate over the byte source when it's not a file-like object, and break out of the loop once the condition is met (rather than a while loop that uses `next`, which doesn't work on lists). Now that `fill` accepts lists in addition to file-like objects and iterators, change the byte source argument name from the clunky `reader_or_iterable` to `source`, and explain what it's allowed to be in the parameter section of the docstring. Add a test to ensure that filling from lists works.
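The shape of that change, sketched as a standalone helper (hypothetical names, not the actual `fill` implementation):

```python
import io

def _read_from_source(source, size):
    # Hypothetical helper showing the dispatch described above: file-like
    # objects are read directly; anything else is iterated with a for loop,
    # which works for lists as well as iterators and generators.
    if hasattr(source, 'read'):
        return source.read(size)
    chunks, total = [], 0
    for chunk in source:
        chunks.append(chunk)
        total += len(chunk)
        if total >= size:
            break                      # stop once enough bytes are collected
    return b''.join(chunks)

print(_read_from_source(io.BytesIO(b'abcdef'), 4))   # b'abcd'
print(_read_from_source([b'ab', b'cd', b'ef'], 3))   # b'abcd' (whole chunks are kept)
```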
Sorry, hit the wrong button. I've addressed all of your latest comments except the doctest example in the ByteBuffer class docstring. I wasn't able to get the same example to pass in both Python 2 and 3; you can see the commit I pushed up to another branch for the details.
Don't worry about Py2/3 compatibility in the doctests. As long as it passes under Py3 and the example clearly states that, it's OK. More in-depth Py2/3 support details get handled in the unit tests. So, please add your doctests to this PR.
This only passes as a doctest in Python 3 because of the bytestring literals for some of the expected values. In Python 2, those values are simply regular strings and doctest seems to work by comparing the string representations of the values rather than the values themselves, so those lines fail because they don't have the `b` prefix.
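A hypothetical two-line doctest illustrates the mismatch:

```python
>>> data = b'he'   # bytes in Python 3, but just an alias for str in Python 2
>>> data           # doctest compares reprs, so the expected line must match exactly
b'he'
```

Under Python 3 the repr is `b'he'` and the expected line matches; under Python 2 the same literal is a plain `str` whose repr is `'he'`, so the `b`-prefixed expected line fails.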
Force-pushed from c604a78 to cec4679.
Sorry, I force-pushed because I accidentally pushed up the earlier version of the doctest commit that was lacking the note about it only passing in Python 3. Hope it's not too much of an issue.
Is this ready to merge? I believe I've addressed all your comments (I don't know why GitHub still shows changes requested given all those comments have been resolved).
From your description, the performance gains are huge, but our test benchmarks don't pick them up. This is likely because the benchmarks are lacking. It would be good to add a test case to our integration tests (see integration-tests/test_s3.py) to prevent future regressions. Can you please have a look at this?
I've got a benchmark; I'm iterating on the size so it demonstrates the improvement without taking an excessively long time (I'm shooting for a minute or so).
Add the `test_s3_performance_small_reads` benchmark to the S3 integration tests. This benchmark creates a one-megabyte test file filled with 65,536 (64 Ki) "messages", each prefixed with a byte indicating its length. The file is downloaded into a buffer of the same size, and then each length byte and message is read one at a time from the buffer, for a total of 131,072 `read()` calls. Before the changes in this PR, downloading and reading the file this way took 4.97 seconds on average. After this PR's changes, it takes only 1.61 seconds, a 68% improvement in performance. The magnitude of the improvement increases as the size of the buffer increases or the size of the messages decreases.
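Roughly what such a benchmark does, as a self-contained sketch (the real test lives in integration-tests/test_s3.py; the URL here is hypothetical, the `smart_open.smart_open` entry point is the one from the smart_open versions this PR targets, and the `buffer_size` keyword is assumed to be forwarded through to the S3 reader):

```python
import io
import struct

import smart_open

NUM_MESSAGES = 65536          # 64 Ki one-byte-prefixed messages
MESSAGE_BODY = b'x' * 15      # 1 length byte + 15 body bytes each = exactly 1 MiB total

def build_payload():
    out = io.BytesIO()
    for _ in range(NUM_MESSAGES):
        out.write(struct.pack('B', len(MESSAGE_BODY)))
        out.write(MESSAGE_BODY)
    return out.getvalue()

def benchmark_small_reads(url):
    # `url` is a hypothetical s3:// URL pointing at a bucket you control.
    payload = build_payload()
    with smart_open.smart_open(url, 'wb') as fout:
        fout.write(payload)
    # Read it back one length byte and one message at a time:
    # 2 * NUM_MESSAGES = 131,072 read() calls in total.
    with smart_open.smart_open(url, 'rb', buffer_size=len(payload)) as fin:
        while True:
            prefix = fin.read(1)
            if not prefix:
                break
            fin.read(struct.unpack('B', prefix)[0])

# benchmark_small_reads('s3://my-test-bucket/issue_284_small_reads.bin')
```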
Results from running the benchmark on the current master branch:
Results from running the benchmark on this PR branch:
@aperiodic Congrats on your first merged PR to smart_open 🥇 Awesome work!
Thank you, and thanks for the review! I enjoyed contributing. |