Changes BytesMut not to rely on Vec + Arc #435

Open

wants to merge 18 commits into base: master
Conversation

Matthias247
Contributor

@Matthias247 Matthias247 commented Oct 18, 2020

This changes the internal BytesMut representation to allocate memory on
its own, rather than using Arc + Vec under the hood.

Doing this allows a single representation that is shareable (by
modifying an embedded refcount) at zero cost right from the start. No
representation changes have to be performed over the lifetime of
BytesMut; therefore BytesMut::freeze() is now also free.

The change also opens the door to supporting buffer pools or custom
allocators, since an allocator is called directly. Adding that support
would mostly mean passing an allocator to BytesMut which is called
instead of a fixed function. This would, however, be a separate change.

Fixes #401
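As a rough illustration of the single-representation idea (a sketch only; `Shared`, `layout_for`, and the layout here are hypothetical stand-ins, not the crate's actual internals):

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical layout: the refcount header lives inline, directly in
// front of the data, so sharing never changes the representation.
#[repr(C)]
struct Shared {
    refcount: AtomicUsize,
}

// Layout for one allocation holding the header plus `data_len` bytes.
fn layout_for(data_len: usize) -> Layout {
    Layout::new::<Shared>()
        .extend(Layout::array::<u8>(data_len).unwrap())
        .unwrap()
        .0
        .pad_to_align()
}

fn main() {
    let layout = layout_for(1024);
    unsafe {
        let ptr = alloc(layout) as *mut Shared;
        std::ptr::write(ptr, Shared { refcount: AtomicUsize::new(1) });
        // A freeze or clone only bumps the embedded refcount; no
        // representation change, no extra allocation.
        (*ptr).refcount.fetch_add(1, Ordering::Relaxed);
        assert_eq!((*ptr).refcount.load(Ordering::Relaxed), 2);
        dealloc(ptr as *mut u8, layout);
    }
}
```

The point of the single allocation is that a shared handle and a unique handle look identical in memory; only the refcount value differs.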


The change is WIP and is missing the following parts:
- The buffer growth strategy is not exactly the same as in the old
  version. It needs to be reviewed what the best strategy is.
- Some situations where an empty buffer (which has no backing
  storage) is manipulated might have bugs. This needs another
  round of review, and some additional tests.
// There are some situations, when using `reserve_exact`, where the
// buffer capacity could be below `original_capacity`, so do a
// check.
let double = self.cap.checked_shl(1).unwrap_or(new_cap);
Contributor Author

This is one part that leads to the failing tests.
The old version doubled the original capacity of the storage. However, if that storage was split, the capacity of the split part could be much lower.

I'm not really sure whether going back to the original capacity, or even doubling it, makes sense. If someone splits off a 100 byte chunk of a 1 MB BytesMut and wants to write a few more bytes into it, it shouldn't grow towards 1 or even 2 MB, right?

I'm somewhat leaning towards ignoring the original capacity entirely and making decisions based only on the capacity of the current part. But maybe I'm just missing why it was added that way? Maybe to make sure that a receive buffer somewhere which uses BytesMut, and from which deserialized data is split off, always stays at the same size? If people want that, they can still .reserve() that size themselves after splitting.
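The alternative growth policy discussed here (ignore the original capacity, base the decision only on the current part) could be sketched as follows; `grow_target` and its constants are illustrative, not the crate's actual code:

```rust
// Hypothetical growth policy: compute the new capacity only from the
// current part's capacity, ignoring the original allocation's size.
fn grow_target(current_cap: usize, additional: usize) -> usize {
    let needed = current_cap + additional;
    // Double the current capacity, falling back to `needed` on shift
    // overflow, and never return less than what was requested.
    let doubled = current_cap.checked_shl(1).unwrap_or(needed);
    doubled.max(needed)
}

fn main() {
    // A 100-byte part split off a 1 MB buffer grows to 200 bytes,
    // not back towards 1 MB.
    assert_eq!(grow_target(100, 50), 200);
    // When doubling is insufficient, the requested size wins.
    assert_eq!(grow_target(100, 500), 600);
}
```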

Contributor

I agree that it seems like surprising behavior. I would vote to ignore the original capacity.

@carllerche
Member

cc/ @seanmonstar who did most of this work most recently.

@Matthias247 Matthias247 changed the title WIP: Changes BytesMut not to rely on Vec + Arc Changes BytesMut not to rely on Vec + Arc Oct 18, 2020
@Matthias247
Contributor Author

I'm now rather happy with this, so I removed the WIP.

I added some more documentation around the memory allocation strategy, based on my understanding of previous versions, and questions on what we should do there.
I'm somewhat inclined to drop the behavior of taking the capacity of an old backing storage into account for reservations; it just seems like a footgun leading to high memory usage when people split BytesMut and later work on the parts individually.

@Matthias247
Contributor Author

Not sure what is going on with Loom. I think it doesn't like my use of alloc::alloc::alloc (direct use of the global allocator), but I don't know how to fix it.

@Matthias247
Contributor Author

cargo bench comparison:

name                      old ns/iter      new ns/iter      diff ns/iter   diff %  speedup 
 alloc_big                 52               53                          1    1.92%   x 0.98
 alloc_mid                 53               51                         -2   -3.77%   x 1.04
 alloc_small               53,018           51,740                 -1,278   -2.41%   x 1.02
 alloc_write_split_to_mid  104              67                        -37  -35.58%   x 1.55
 bytes_mut_extend          568 (225 MB/s)   538 (237 MB/s)            -30   -5.28%   x 1.06 
 clone_frozen              9,711            9,785                      74    0.76%   x 0.99
 deref_shared              242              241                        -1   -0.41%   x 1.00
 deref_two                 237              235                        -2   -0.84%   x 1.01
 deref_unique              239              240                         1    0.42%   x 1.00
 deref_unique_unroll       257              258                         1    0.39%   x 1.00
 drain_write_drain         298              228                       -70  -23.49%   x 1.31 
 fmt_write                 14 (2642 MB/s)   14 (2642 MB/s)              0    0.00%   x 1.00
 put_slice_bytes_mut       15 (8533 MB/s)   14 (9142 MB/s)             -1   -6.67%   x 1.07
 put_slice_vec             16 (8000 MB/s)   15 (8533 MB/s)             -1   -6.25%   x 1.07
 put_slice_vec_extend      10 (12800 MB/s)  10 (12800 MB/s)             0    0.00%   x 1.00
 put_u8_bytes_mut          508 (251 MB/s)   454 (281 MB/s)            -54  -10.63%   x 1.12
 put_u8_vec                538 (237 MB/s)   513 (249 MB/s)            -25   -4.65%   x 1.05
 put_u8_vec_push           128 (1000 MB/s)  127 (1007 MB/s)            -1   -0.78%   x 1.01


Co-authored-by: Cameron Bytheway <[email protected]>
@carllerche
Member

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

@carllerche
Copy link
Member

carllerche commented Oct 20, 2020

Looking through this, I believe this should be able to land w/o a breaking change. I don't think we need to block 0.6 on this.

edit: fixed the version number.

@Matthias247
Contributor Author

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

Are you aware of any applications which do that? I don't think you can currently rely on that, since your system memory allocator might not preserve it either; e.g. it might also need to store its metadata inside the allocation header and therefore take size away from it. But it's true that buffer sizes which are multiples of 4 KB are often programmers' favorites.

What we could do is round allocations up to page size, so that the capacity isn't lost. But that might require adding a with_exact_capacity method for use-cases which don't want that.
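The round-up idea could look roughly like this (a sketch; `rounded_capacity`, the 4 KB page size, and the assumed header size are illustrative, not the crate's actual code):

```rust
const PAGE: usize = 4096;

// Hypothetical: round the total allocation (header + data) up to a
// whole number of pages, then hand the slack back as extra capacity,
// so no space is silently lost to the allocator.
fn rounded_capacity(requested: usize, header: usize) -> usize {
    let total = requested + header;
    let pages = (total + PAGE - 1) / PAGE;
    pages * PAGE - header
}

fn main() {
    let header = 32; // assumed size_of::<Shared>(), for illustration
    // Asking for exactly one page of data spills into a second page
    // once the header is added, but the user sees the extra capacity.
    assert_eq!(rounded_capacity(4096, header), 2 * 4096 - 32);
    // Small requests still cost one page, minus the header.
    assert_eq!(rounded_capacity(100, header), 4096 - 32);
}
```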

If users want that behavior, they can still reserve the desired
capacity manually.
@Matthias247
Contributor Author

regarding

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

I confirmed that e.g. mimalloc will optimize fixed-size chunks, and adding Shared might mean getting pushed to the next chunk size.

Based on this, I see 2 ways forward:

  1. Don't care about it in the normal case, and add more optimized APIs for people that care about the difference. E.g. BytesMut::with_exact_capacity(size) would return exactly what the user requires (even if it is not optimal), while BytesMut::with_optimized_capacity(size) could either round up (to the next power of 2) or down (subtract size_of::<Shared>()).
  2. Add another pointer to Bytes itself which tracks the shared state separately from the data. The benefit of this is that it could also eliminate the special casing for Vec in Bytes, since the field can be used to store the capacity. The downsides, however, are:
  • 2 allocations for BytesMut instead of 1.
  • 2 times the pointer chasing on resizes/frees/etc.

I guess for the use-case of "have a small serialization buffer and continuously add stuff", approach 2) will end up slower than 1) (and this CR). However, it's more deterministic for managing large chunks of data.

@carllerche
Member

Do we know why we can't entirely hide the Arc / Vec implementation behind the vtable? IIRC it is due to wanting to avoid an allocation when converting Vec, and needing to track the pointer to the head of the memory as well as the original capacity? Is there another reason that I am forgetting?

@qm3ster

qm3ster commented Feb 1, 2022

Would this also allow for free Bytes::try_mut()? #368
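With a single refcounted representation, a free try_mut would presumably reduce to a uniqueness check on the embedded refcount. A minimal sketch (names are hypothetical, not the crate's API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the shared header.
struct SharedBuf {
    refcount: AtomicUsize,
}

// If we hold the only reference, mutable access is safe without
// copying; otherwise the caller would have to fall back to a copy.
fn try_unique(shared: &SharedBuf) -> bool {
    shared.refcount.load(Ordering::Acquire) == 1
}

fn main() {
    let buf = SharedBuf { refcount: AtomicUsize::new(1) };
    assert!(try_unique(&buf));
    buf.refcount.fetch_add(1, Ordering::Relaxed); // a clone now exists
    assert!(!try_unique(&buf));
}
```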

Comment on lines +1193 to +1194
let shared_addr = self as *mut Self;
let data_addr = shared_addr.offset(1) as *mut u8;
Contributor

FYI, under Stacked Borrows this pointer only has provenance over Self, so using it to access the data section would be UB. Under PNVI (the proposed C aliasing model) I believe this is allowed. It's not entirely clear what the finalized rules for Rust will be, but you should probably do one of:
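One provenance-safe pattern for this kind of layout (a sketch under Stacked Borrows assumptions; `Shared` and `roundtrip` here are illustrative stand-ins, not the crate's code) is to keep the `*mut u8` returned by the allocator, which has provenance over the whole block, and derive both the header and the data pointers from it:

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::mem::{align_of, size_of};

#[repr(C)]
struct Shared {
    refcount: usize,
}

// Allocate a header plus `len` data bytes, write `fill` into the first
// data byte, and read it back. Every pointer is derived from the
// allocator's base pointer, never by offsetting a `*mut Shared` past
// the end of `Shared`.
fn roundtrip(len: usize, fill: u8) -> u8 {
    let layout =
        Layout::from_size_align(size_of::<Shared>() + len, align_of::<Shared>()).unwrap();
    unsafe {
        let base: *mut u8 = alloc(layout);
        // Header view of the same allocation.
        let shared = base as *mut Shared;
        std::ptr::write(shared, Shared { refcount: 1 });
        // Data pointer derived from `base`, not from `shared.offset(1)`,
        // so it retains provenance over the data section.
        let data = base.add(size_of::<Shared>());
        data.write(fill);
        let got = *data;
        dealloc(base, layout);
        got
    }
}

fn main() {
    assert_eq!(roundtrip(16, 42), 42);
}
```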


Successfully merging this pull request may close these issues.

BytesMut::freeze is not zero-cost
7 participants