Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grow buffers exponentially, not one byte at a time #121

Merged
merged 2 commits into from
Aug 18, 2022

Conversation

jakobnissen
Copy link
Collaborator

To check if a TranscodingStream is eof, it first checks if it has any unused
bytes in its buffer. If not, it will attempt to fill the buffer, and only return
true if no bytes were filled.
If there is no margin in the buffer to fill, it will call makemargin(buffer, 1)
to ensure at least 1 byte of space.
In the happy case, this will remove used up bytes.
Unfortunately, buffers do never move data pointed to by their mark. So, when no
space is available due to the entire buffer being filled up with marked data,
the buffer will grow one byte at a time in a loop until the data fills.
This is unacceptable as megabytes of data can theoretically be marked.

This commit changes the growth behaviour to request at least half the buffer's
length in the margin. In most cases, this will do nothing. But in the unhappy
case mentioned above, it will cause the buffer to grow expoentially, until a
proper size is quickly reached.

To check if a TranscodingStream is eof, it first checks if it has any unused
bytes in its buffer. If not, it will attempt to fill the buffer, and only return
true if no bytes were filled.
If there is no margin in the buffer to fill, it will call `makemargin(buffer, 1)`
to ensure at least 1 byte of space.
In the happy case, this will remove used up bytes.
Unfortunately, buffers do never move data pointed to by their mark. So, when no
space is available due to the entire buffer being filled up with marked data,
the buffer will grow one byte at a time in a loop until the data fills.
This is unacceptable as megabytes of data can theoretically be marked.

This commit changes the growth behaviour to request at least half the buffer's
length in the margin. In most cases, this will do nothing. But in the unhappy
case mentioned above, it will cause the buffer to grow expoentially, until a
proper size is quickly reached.
@jakobnissen
Copy link
Collaborator Author

jakobnissen commented Aug 11, 2022

Pinging @SabrinaJaye - this should solve #48 . Here is a breakdown of how this problem manifests itself in Automa.

In generate_reader, eof(stream) is checked in the main loop to figure out if p_eof should be set. The stream checks eof by first checking if there are any available data in the buffer. This happens to never be the case in Automa, because the machines consume all available data. Then TranscodingStreams (TS) calls fillbuffer - if zero bytes are filled, the stream is eof. However, the buffer might be full of used up data preventing TS from filling the buffer in fillbuffer, so, in fillbuffer TS calls makemargin!(buffer, 1). This ensures at least 1 byte of margin space, which it tries to obtain first by discarding any used up bytes. This is assumed to work most of the time - after all, why would a buffer run out of usable data if it's not because it has used and served a bunch of data?

In Automa specifically, we make heavy use of the buffer's marks. The mark is an integer that points to some byte in the buffer, which TS guarantees is never removed from the buffer even when it has been consumed. We use it in Automa to ensure that e.g. a 100 KB DNA sequence is kept in the buffer, even as the buffer is only 16 KB long. Hence, at some point, makemargin(buffer, 1) fails to remove any data since it's all marked for use by Automa.

And then it... resizes the buffer by 1 byte as requested. It will then continue to do this until the entire sequence fits in the buffer.

Copy link
Member

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems reasonable. I just suggested some defensive changes to always RoundUp the div.

src/noop.jl Outdated Show resolved Hide resolved
src/stream.jl Outdated Show resolved Hide resolved
@jakobnissen
Copy link
Collaborator Author

Good call - I've gone with the max(1, div(...)) option.

@mkitti mkitti merged commit fde7eea into JuliaIO:master Aug 18, 2022
@jakobnissen jakobnissen deleted the buffergrowth branch January 9, 2023 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants