Changes BytesMut not to rely on Vec + Arc #435

Open

wants to merge 18 commits into base: master
Conversation

Matthias247
Contributor

@Matthias247 Matthias247 commented Oct 18, 2020

This changes the internal BytesMut representation to allocate memory on
its own, rather than using Arc + Vec under the hood.

Doing this allows a single representation that is shareable (by
modifying an embedded refcount) at zero cost right from the start. No
representation changes have to be performed over the lifetime of
BytesMut; therefore BytesMut::freeze() is now also free.

The change also opens the door to supporting buffer pools or custom
allocators, since an allocator is called directly. Adding that support
would mostly mean passing an allocator to BytesMut which is called
instead of a fixed function. This would, however, be a separate change.

Fixes #401
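As a rough illustration of the single-representation idea (a sketch only; `Shared`, `layout_for`, and the layout here are hypothetical stand-ins, not the crate's actual internals):

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical layout: the refcount header lives inline, directly in
// front of the data, so sharing never changes the representation.
#[repr(C)]
struct Shared {
    refcount: AtomicUsize,
}

// Layout for one allocation holding the header plus `data_len` bytes.
fn layout_for(data_len: usize) -> Layout {
    Layout::new::<Shared>()
        .extend(Layout::array::<u8>(data_len).unwrap())
        .unwrap()
        .0
        .pad_to_align()
}

fn main() {
    let layout = layout_for(1024);
    unsafe {
        let ptr = alloc(layout) as *mut Shared;
        std::ptr::write(ptr, Shared { refcount: AtomicUsize::new(1) });
        // A freeze or clone only bumps the embedded refcount; no
        // representation change, no extra allocation.
        (*ptr).refcount.fetch_add(1, Ordering::Relaxed);
        assert_eq!((*ptr).refcount.load(Ordering::Relaxed), 2);
        dealloc(ptr as *mut u8, layout);
    }
}
```

The point of the single allocation is that a shared handle and a unique handle look identical in memory; only the refcount value differs.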


The change is WIP and is missing the following parts:
- The buffer growth strategy is not exactly the same as in the old
  version. It needs to be reviewed what the best strategy is.
- Some situations where an empty buffer (which has no backing
  storage) is manipulated might have bugs. This needs another
  round of review, and some additional tests.
// There are some situations, when using `reserve_exact`, where the
// buffer capacity could be below `original_capacity`, so do a
// check.
let double = self.cap.checked_shl(1).unwrap_or(new_cap);
Contributor Author

This is one part that leads to the failing tests.
The old version doubled the original capacity of the storage. However, if that storage was split, the capacity of the split part could be much lower.

I'm not really sure whether going back to the original capacity, or even doubling it, makes sense. If someone splits off a 100 byte chunk of a 1 MB BytesMut and wants to write a few more bytes into it, it shouldn't grow towards 1 or even 2 MB, right?

I'm somewhat leaning towards ignoring the original capacity entirely and making decisions based only on the capacity of the current part. But maybe I'm just missing why it was added that way? Maybe to make sure that a receive buffer somewhere which uses BytesMut, and from which deserialized data is split off, always stays at the same size? If people want that, they can still .reserve() that size themselves after splitting.
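The alternative growth policy discussed here (ignore the original capacity, base the decision only on the current part) could be sketched as follows; `grow_target` and its constants are illustrative, not the crate's actual code:

```rust
// Hypothetical growth policy: compute the new capacity only from the
// current part's capacity, ignoring the original allocation's size.
fn grow_target(current_cap: usize, additional: usize) -> usize {
    let needed = current_cap + additional;
    // Double the current capacity, falling back to `needed` on shift
    // overflow, and never return less than what was requested.
    let doubled = current_cap.checked_shl(1).unwrap_or(needed);
    doubled.max(needed)
}

fn main() {
    // A 100-byte part split off a 1 MB buffer grows to 200 bytes,
    // not back towards 1 MB.
    assert_eq!(grow_target(100, 50), 200);
    // When doubling is insufficient, the requested size wins.
    assert_eq!(grow_target(100, 500), 600);
}
```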

Contributor

I agree that it seems like surprising behavior. I would vote to ignore the original capacity.

@carllerche
Member

cc/ @seanmonstar who did most of this work most recently.

@Matthias247 Matthias247 changed the title WIP: Changes BytesMut not to rely on Vec + Arc Changes BytesMut not to rely on Vec + Arc Oct 18, 2020
@Matthias247
Contributor Author

I'm now rather happy with this, so I removed the WIP.

I added some more documentation around the memory allocation strategy, based on my understanding of previous versions, and questions on what we should do there.
I'm somewhat inclined to drop the behavior of taking the capacity of an old backing storage into account for reservations; it just seems like a footgun leading to high memory usage when people split BytesMut and later work on the parts individually.

@Matthias247
Contributor Author

Not sure what is going on with Loom. I think it doesn't like my use of alloc::alloc::alloc (direct use of the global allocator), but I don't know how to fix it.

@Matthias247
Contributor Author

cargo bench comparison:

name                      old ns/iter      new ns/iter      diff ns/iter   diff %  speedup 
 alloc_big                 52               53                          1    1.92%   x 0.98
 alloc_mid                 53               51                         -2   -3.77%   x 1.04
 alloc_small               53,018           51,740                 -1,278   -2.41%   x 1.02
 alloc_write_split_to_mid  104              67                        -37  -35.58%   x 1.55
 bytes_mut_extend          568 (225 MB/s)   538 (237 MB/s)            -30   -5.28%   x 1.06 
 clone_frozen              9,711            9,785                      74    0.76%   x 0.99
 deref_shared              242              241                        -1   -0.41%   x 1.00
 deref_two                 237              235                        -2   -0.84%   x 1.01
 deref_unique              239              240                         1    0.42%   x 1.00
 deref_unique_unroll       257              258                         1    0.39%   x 1.00
 drain_write_drain         298              228                       -70  -23.49%   x 1.31 
 fmt_write                 14 (2642 MB/s)   14 (2642 MB/s)              0    0.00%   x 1.00
 put_slice_bytes_mut       15 (8533 MB/s)   14 (9142 MB/s)             -1   -6.67%   x 1.07
 put_slice_vec             16 (8000 MB/s)   15 (8533 MB/s)             -1   -6.25%   x 1.07
 put_slice_vec_extend      10 (12800 MB/s)  10 (12800 MB/s)             0    0.00%   x 1.00
 put_u8_bytes_mut          508 (251 MB/s)   454 (281 MB/s)            -54  -10.63%   x 1.12
 put_u8_vec                538 (237 MB/s)   513 (249 MB/s)            -25   -4.65%   x 1.05
 put_u8_vec_push           128 (1000 MB/s)  127 (1007 MB/s)            -1   -0.78%   x 1.01


Co-authored-by: Cameron Bytheway <[email protected]>
@carllerche
Member

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

@carllerche
Copy link
Member

carllerche commented Oct 20, 2020

Looking through this, I believe this should be able to land w/o a breaking change. I don't think we need to block 0.6 on this.

edit: fixed the version number.

@Matthias247
Contributor Author

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

Are you aware of any applications which do that? I don't think you can currently rely on that, since your system memory allocator might not preserve it either; e.g. it might also need to store its metadata inside the allocation header and therefore take size away from it. But it's true that buffer sizes which are multiples of 4 KB are often programmers' favorites.

What we could do is round allocations up to page size, so that the capacity isn't lost. But that might require adding a with_exact_capacity method for use-cases which don't want that.
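The round-up idea could look roughly like this (a sketch; `rounded_capacity`, the 4 KB page size, and the assumed header size are illustrative, not the crate's actual code):

```rust
const PAGE: usize = 4096;

// Hypothetical: round the total allocation (header + data) up to a
// whole number of pages, then hand the slack back as extra capacity,
// so no space is silently lost to the allocator.
fn rounded_capacity(requested: usize, header: usize) -> usize {
    let total = requested + header;
    let pages = (total + PAGE - 1) / PAGE;
    pages * PAGE - header
}

fn main() {
    let header = 32; // assumed size_of::<Shared>(), for illustration
    // Asking for exactly one page of data spills into a second page
    // once the header is added, but the user sees the extra capacity.
    assert_eq!(rounded_capacity(4096, header), 2 * 4096 - 32);
    // Small requests still cost one page, minus the header.
    assert_eq!(rounded_capacity(100, header), 4096 - 32);
}
```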

If users want that behavior, they can still reserve the desired
capacity manually.
@Matthias247
Contributor Author

regarding

It looks like you are inlining the capacity with the data? Given that buffers are often allocated to be page length, would this not result in odd allocation patterns? The size would often be size_of::<Shared>() + PAGE.

I confirmed that e.g. mimalloc will optimize fixed-size chunks, and adding Shared might mean getting pushed to the next chunk size.

Based on this, I see 2 ways forward:

  1. Don't care about it in the normal case, and add more optimized APIs for people that care about the difference. E.g. BytesMut::with_exact_capacity(size) would return exactly what the user requires (even if it is not optimal), while BytesMut::with_optimized_capacity(size) could either round up (to the next power of 2) or down (subtract size_of::<Shared>()).
  2. Add another pointer to Bytes itself which tracks the shared state separately from the data. The benefit of this is that it could also eliminate the special casing for Vec in Bytes, since the field can be used to store the capacity. The downsides, however, are:
  • 2 allocations for BytesMut instead of 1.
  • 2 times the pointer chasing on resizes/frees/etc.

I guess for the use-case of "have a small serialization buffer and continuously add stuff", approach 2) will end up slower than 1) (and this CR). However, it's more deterministic for managing large chunks of data.

@carllerche
Member

Do we know why we can't entirely hide the Arc / Vec implementation behind the vtable? IIRC it is due to wanting to avoid an allocation when converting Vec, and needing to track the pointer to the head of the memory as well as the original capacity? Is there another reason that I am forgetting?

@qm3ster

qm3ster commented Feb 1, 2022

Would this also allow for free Bytes::try_mut()? #368
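With a single refcounted representation, a free try_mut would presumably reduce to a uniqueness check on the embedded refcount. A minimal sketch (names are hypothetical, not the crate's API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the shared header.
struct SharedBuf {
    refcount: AtomicUsize,
}

// If we hold the only reference, mutable access is safe without
// copying; otherwise the caller would have to fall back to a copy.
fn try_unique(shared: &SharedBuf) -> bool {
    shared.refcount.load(Ordering::Acquire) == 1
}

fn main() {
    let buf = SharedBuf { refcount: AtomicUsize::new(1) };
    assert!(try_unique(&buf));
    buf.refcount.fetch_add(1, Ordering::Relaxed); // a clone now exists
    assert!(!try_unique(&buf));
}
```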

Comment on lines +1193 to +1194
let shared_addr = self as *mut Self;
let data_addr = shared_addr.offset(1) as *mut u8;
Contributor

FYI, under Stacked Borrows this pointer only has provenance over Self, so using it to access the data section would be UB. Under PNVI (the proposed C aliasing model) I believe this is allowed. It's not entirely clear what the finalized rules for Rust will be, but you should probably do one of:
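One provenance-safe pattern for this kind of layout (a sketch under Stacked Borrows assumptions; `Shared` and `roundtrip` here are illustrative stand-ins, not the crate's code) is to keep the `*mut u8` returned by the allocator, which has provenance over the whole block, and derive both the header and the data pointers from it:

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::mem::{align_of, size_of};

#[repr(C)]
struct Shared {
    refcount: usize,
}

// Allocate a header plus `len` data bytes, write `fill` into the first
// data byte, and read it back. Every pointer is derived from the
// allocator's base pointer, never by offsetting a `*mut Shared` past
// the end of `Shared`.
fn roundtrip(len: usize, fill: u8) -> u8 {
    let layout =
        Layout::from_size_align(size_of::<Shared>() + len, align_of::<Shared>()).unwrap();
    unsafe {
        let base: *mut u8 = alloc(layout);
        // Header view of the same allocation.
        let shared = base as *mut Shared;
        std::ptr::write(shared, Shared { refcount: 1 });
        // Data pointer derived from `base`, not from `shared.offset(1)`,
        // so it retains provenance over the data section.
        let data = base.add(size_of::<Shared>());
        data.write(fill);
        let got = *data;
        dealloc(base, layout);
        got
    }
}

fn main() {
    assert_eq!(roundtrip(16, 42), 42);
}
```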


Successfully merging this pull request may close these issues.

BytesMut::freeze is not zero-cost
7 participants