
Proof of Concept: Outlined RawTable #76

Closed
wants to merge 3 commits

Conversation

Gankra commented Apr 27, 2019

Needs some cleanup, but here's the basic implementation of what I suggested in #69

Doesn't seem to have been that hard, since we were already set up for singleton shenanigans and RawTable is actually a fairly small file.

This reduces size_of::<hashbrown::HashMap>() to 8 and size_of::<std::HashMap>() to 24 on 64-bit platforms (the only difference is the SipHash random seed). However, it strictly increases the overall memory footprint of the map by 8 bytes, since there is now an additional pointer to the header. This can potentially be avoided (see notes below).
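
As a rough sketch of the outlined layout being described (field names mirror RawTable's existing fields, but this is illustrative rather than the PR's actual code), all of the table's bookkeeping moves behind a single pointer:

// Everything RawTable used to store inline now lives in a heap-allocated Header.
struct Header<T> {
    bucket_mask: usize,  // number of buckets minus one
    ctrl: *const u8,     // pointer to the control bytes
    data: *const T,      // pointer to the bucket array
    growth_left: usize,  // how many more items fit before a resize
    items: usize,        // number of occupied buckets
}

// The table itself shrinks to a single pointer (8 bytes on 64-bit targets).
struct RawTable<T> {
    header: *const Header<T>,
}

fn main() {
    assert_eq!(std::mem::size_of::<RawTable<u64>>(), std::mem::size_of::<usize>());
}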

Some quick notes in random order:

  • the ctrl and data ptrs had to be changed to raw pointers, as NonNull cannot be constructed in a const context
  • consequently, I also had to make them *const to get the right variance
  • the Sync impl for Header is so it can be put in a static
  • is_empty_singleton changed to a pointer comparison against a static value, which should be faster than looking up the len (see the sketch after this list)
  • added a ton of getters for convenience; these should presumably be optimized away
  • I believe it is always safe to access all the header fields immutably, as even new_uninitialized initializes those fields immediately.
  • We avoid any dependency on the size or align of T with the singleton header because its only reference to T is an always-aligned null pointer to T.
  • Group::empty_singleton() is removed, as it was duplicate code and it needs to be inlined into the static singleton's expression. Also, ALIGNED_BYTES is changed to a static so that it is guaranteed to have stable pointer identity (miri was upset about this).
  • I adjusted calculate_layout to only align the data array to T instead of max(T, Group). It's unclear to me why the implementation was doing that, except that I guess it let that statically-computed value be reused for the alloc call. I also just refactored that code to be clearer since it was getting a bit hairy with 3 components.
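
A hedged sketch of the empty-singleton pieces described in these notes (identifiers are illustrative, not the PR's exact code): a single static Header stands in for every empty table, so creating an empty map allocates nothing and the emptiness check is pointer identity.

use std::ptr;

// One Group's worth of EMPTY (0xFF) control bytes. A static (not a const) so
// its address is stable; repr(align(16)) stands in for Group alignment here.
#[repr(align(16))]
struct AlignedBytes([u8; 16]);
static ALIGNED_BYTES: AlignedBytes = AlignedBytes([0xFF; 16]);

struct Header {
    bucket_mask: usize,
    ctrl: *const u8,  // raw pointer: NonNull can't be constructed in a const context
    data: *const u8,  // the real code stores an always-aligned null pointer to T;
                      // u8 here just keeps the sketch non-generic
    items: usize,
}

// Raw-pointer fields make Header !Sync by default; the singleton is never
// mutated, so sharing it between threads is sound, and Sync lets it be a static.
unsafe impl Sync for Header {}

static EMPTY_SINGLETON: Header = Header {
    bucket_mask: 0,
    ctrl: &ALIGNED_BYTES as *const AlignedBytes as *const u8,
    data: ptr::null(),
    items: 0,
};

fn is_empty_singleton(header: *const Header) -> bool {
    // Pointer identity against the shared singleton, not a length check.
    ptr::eq(header, &EMPTY_SINGLETON)
}

fn main() {
    assert!(is_empty_singleton(&EMPTY_SINGLETON));
}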

Potential Optimization Worth Considering: ctrl has a statically-known offset from the header, so it may be possible to remove the ctrl pointer entirely and just compute it on demand.
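
For illustration, a minimal sketch of that computation, reusing the non-generic Header type from the previous sketch and assuming the control bytes sit immediately after the header in the same allocation (the real offset would need rounding up to Group alignment):

fn ctrl_ptr(header: *const Header) -> *const u8 {
    // The offset is a compile-time constant, so nothing needs to be stored.
    unsafe { (header as *const u8).add(std::mem::size_of::<Header>()) }
}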

I have not yet benchmarked this at all. Not sure what tools we have for this outside of me just running cargo bench locally.


Gankra commented Apr 27, 2019

Also when running cargo test locally on my mac I see:

...
test map::test_map::test_occupied_entry_key ... ok
test map::test_map::test_remove ... ok
test map::test_map::test_remove_entry ... ok
test map::test_map::test_raw_entry ... ok
test map::test_map::test_show ... ok
hashbrown-94db3bfa716f052b(5235,0x70000df2e000) malloc: can't allocate region
*** mach_vm_map(size=13835058055282167808) failed (error code=3)
hashbrown-94db3bfa716f052b(5235,0x70000df2e000) malloc: *** set a breakpoint in malloc_error_break to debug
test map::test_map::test_size_hint ... ok
test map::test_map::test_retain ... ok
test map::test_map::test_try_reserve ... ok
test map::test_map::test_vacant_entry_key ... ok
...

This is troubling, but all the tests pass. Is there just a test that's checking we fail to allocate a massive thing? (edit: yes)


Gankra commented Apr 27, 2019

cargo bench results:

before:

test insert_erase_i32 ... bench:      27,526 ns/iter (+/- 1,773)
test insert_erase_i64 ... bench:      30,098 ns/iter (+/- 5,014)
test insert_i32       ... bench:      21,701 ns/iter (+/- 1,867)
test insert_i64       ... bench:      24,181 ns/iter (+/- 2,313)
test iter_i32         ... bench:       1,794 ns/iter (+/- 131)
test iter_i64         ... bench:       1,805 ns/iter (+/- 103)
test lookup_fail_i32  ... bench:       1,821 ns/iter (+/- 149)
test lookup_fail_i64  ... bench:       1,795 ns/iter (+/- 157)
test lookup_i32       ... bench:       2,713 ns/iter (+/- 292)
test lookup_i64       ... bench:       2,802 ns/iter (+/- 361)

after:

running 10 tests
test insert_erase_i32 ... bench:      24,184 ns/iter (+/- 6,220)
test insert_erase_i64 ... bench:      31,264 ns/iter (+/- 4,505)
test insert_i32       ... bench:      18,267 ns/iter (+/- 1,264)
test insert_i64       ... bench:      25,695 ns/iter (+/- 1,993)
test iter_i32         ... bench:       1,797 ns/iter (+/- 136)
test iter_i64         ... bench:       1,791 ns/iter (+/- 173)
test lookup_fail_i32  ... bench:       2,059 ns/iter (+/- 90)
test lookup_fail_i64  ... bench:       2,061 ns/iter (+/- 87)
test lookup_i32       ... bench:       3,035 ns/iter (+/- 117)
test lookup_i64       ... bench:       3,215 ns/iter (+/- 389)


Gankra commented Apr 27, 2019

lol sick, all of the CI except macos fails 🙃


Amanieu commented Apr 27, 2019

I'm a bit skeptical; this feels like a somewhat roundabout way of simply boxing the HashMap. The only difference is that you have 2 allocations instead of 1.


Gankra commented Apr 27, 2019

Bench results with the ctrl pointer removed:

test insert_erase_i32 ... bench:      24,478 ns/iter (+/- 665)
test insert_erase_i64 ... bench:      31,363 ns/iter (+/- 4,813)
test insert_i32       ... bench:      18,021 ns/iter (+/- 1,253)
test insert_i64       ... bench:      25,321 ns/iter (+/- 1,040)
test iter_i32         ... bench:       1,818 ns/iter (+/- 82)
test iter_i64         ... bench:       1,822 ns/iter (+/- 109)
test lookup_fail_i32  ... bench:       2,125 ns/iter (+/- 79)
test lookup_fail_i64  ... bench:       1,824 ns/iter (+/- 125)
test lookup_i32       ... bench:       2,979 ns/iter (+/- 76)
test lookup_i64       ... bench:       3,194 ns/iter (+/- 178)


Gankra commented Apr 27, 2019

Well, it's significantly better than Box<HashMap> for several reasons:

  • One allocation
  • Doesn't allocate for HashMap::new()
  • with ctrl removed: same memory footprint as not-boxing
  • edit: and HashMap::new() actually now has a smaller memory footprint

FWIW, Swift's Dictionary type has a similar outlined design, although they obviously have fairly different constraints, what with reference-counted CoW collections and ObjC interop.

At the very least, the implementation is here, so it would definitely be interesting for parties interested in shrinking the footprint of HashMap to grab this and profile/benchmark their workloads with it. Locally on my laptop it does seem to regress cargo bench a bit, but this sort of change is probably more significant in non-synthetic situations, where memory usage and cache pressure matter more (keep in mind the empty singleton is shared, so performing ops on it is more likely to hit cache).


Gankra commented Apr 27, 2019

(also the miri build seems broken for an unrelated reason..)


Gankra commented Apr 27, 2019

Reran the benchmarks; honestly, not too bad a difference between no-ctrl and master. Close enough that some massaging might plausibly get us to parity.

Disclaimer: my laptop is very prone to thermal throttling, so someone with a more stable system might want to try it.

old:
test insert_erase_i32 ... bench:      28,049 ns/iter (+/- 2,981)
test insert_erase_i64 ... bench:      29,961 ns/iter (+/- 4,659)
test insert_i32       ... bench:      22,098 ns/iter (+/- 856)
test insert_i64       ... bench:      24,722 ns/iter (+/- 4,997)
test iter_i32         ... bench:       1,810 ns/iter (+/- 167)
test iter_i64         ... bench:       1,813 ns/iter (+/- 72)
test lookup_fail_i32  ... bench:       1,835 ns/iter (+/- 85)
test lookup_fail_i64  ... bench:       1,796 ns/iter (+/- 101)
test lookup_i32       ... bench:       2,758 ns/iter (+/- 153)
test lookup_i64       ... bench:       2,813 ns/iter (+/- 111)


new (ctrl):
test insert_erase_i32 ... bench:      29,437 ns/iter (+/- 834)
test insert_erase_i64 ... bench:      25,578 ns/iter (+/- 3,472)
test insert_i32       ... bench:      23,261 ns/iter (+/- 2,539)
test insert_i64       ... bench:      19,313 ns/iter (+/- 973)
test iter_i32         ... bench:       1,804 ns/iter (+/- 43)
test iter_i64         ... bench:       1,785 ns/iter (+/- 130)
test lookup_fail_i32  ... bench:       2,055 ns/iter (+/- 81)
test lookup_fail_i64  ... bench:       2,044 ns/iter (+/- 275)
test lookup_i32       ... bench:       3,170 ns/iter (+/- 208)
test lookup_i64       ... bench:       3,127 ns/iter (+/- 261)

new (no ctrl):
test insert_erase_i32 ... bench:      23,930 ns/iter (+/- 882)
test insert_erase_i64 ... bench:      30,739 ns/iter (+/- 4,469)
test insert_i32       ... bench:      17,866 ns/iter (+/- 710)
test insert_i64       ... bench:      24,982 ns/iter (+/- 679)
test iter_i32         ... bench:       1,782 ns/iter (+/- 141)
test iter_i64         ... bench:       1,790 ns/iter (+/- 171)
test lookup_fail_i32  ... bench:       2,049 ns/iter (+/- 146)
test lookup_fail_i64  ... bench:       1,793 ns/iter (+/- 151)
test lookup_i32       ... bench:       2,871 ns/iter (+/- 215)
test lookup_i64       ... bench:       3,116 ns/iter (+/- 161)


jonhoo commented Apr 27, 2019

On AMD Ryzen 5 2600X with SMT disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  17,217          21,650                    4,433   25.75%   x 0.80
 insert_erase_i64  22,715          17,995                   -4,720  -20.78%   x 1.26
 insert_i32        13,793          17,832                    4,039   29.28%   x 0.77
 insert_i64        19,142          14,302                   -4,840  -25.28%   x 1.34
 iter_i32          1,600           1,620                        20    1.25%   x 0.99
 iter_i64          1,566           1,603                        37    2.36%   x 0.98
 lookup_fail_i32   1,407           1,411                         4    0.28%   x 1.00
 lookup_fail_i64   1,455           1,464                         9    0.62%   x 0.99
 lookup_i32        1,855           1,886                        31    1.67%   x 0.98
 lookup_i64        2,033           2,058                        25    1.23%   x 0.99

On Intel(R) Xeon(R) CPU E5-2660 v3 with HT and NUMA disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  31,722          32,894                    1,172    3.69%   x 0.96
 insert_erase_i64  41,297          33,095                   -8,202  -19.86%   x 1.25
 insert_i32        24,041          25,062                    1,021    4.25%   x 0.96
 insert_i64        33,884          25,184                   -8,700  -25.68%   x 1.35
 iter_i32          2,717           2,730                        13    0.48%   x 1.00
 iter_i64          2,719           2,718                        -1   -0.04%   x 1.00
 lookup_fail_i32   2,791           3,123                       332   11.90%   x 0.89
 lookup_fail_i64   3,123           2,774                      -349  -11.18%   x 1.13
 lookup_i32        3,831           4,196                       365    9.53%   x 0.91
 lookup_i64        4,314           4,508                       194    4.50%   x 0.96

On Intel(R) Xeon(R) CPU E3-1240 v5 with HT disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  16,977          17,181                      204    1.20%   x 0.99
 insert_erase_i64  26,322          18,239                   -8,083  -30.71%   x 1.44
 insert_i32        12,781          12,889                      108    0.85%   x 0.99
 insert_i64        22,082          13,922                   -8,160  -36.95%   x 1.59
 iter_i32          1,853           1,853                         0    0.00%   x 1.00
 iter_i64          1,853           1,853                         0    0.00%   x 1.00
 lookup_fail_i32   1,852           1,936                        84    4.54%   x 0.96
 lookup_fail_i64   1,935           1,852                       -83   -4.29%   x 1.04
 lookup_i32        2,334           2,466                       132    5.66%   x 0.95
 lookup_i64        2,576           2,614                        38    1.48%   x 0.99


Gankra commented Apr 27, 2019

well that seems very promising


Amanieu commented Apr 27, 2019

You can ignore the miri failures for now; it's a bug in miri.


Amanieu commented Apr 28, 2019

One thing you could try is to have HashMap point to ctrl directly rather than the header. You can then address the header at a negative offset from the pointer.
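
A minimal sketch of that scheme (reusing the non-generic Header type sketched earlier, and assuming the header is laid out immediately before the control bytes in the same allocation; not the PR's actual code):

fn header_ptr(ctrl: *const u8) -> *const Header {
    // The map stores only ctrl; the header lives at a fixed negative offset,
    // so it can be recovered without an extra field.
    unsafe { ctrl.sub(std::mem::size_of::<Header>()) as *const Header }
}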


Gankra commented Apr 29, 2019

One thing you could try is to have HashMap point to ctrl directly rather than the header. You can then address the header at a negative offset from the pointer.

Pushed up an implementation for that. Locally it seems like a wash, but again I don't have a good experimental rig.

With this I've got 3 commits representing 3 different implementation strategies, and basically all of the correctness-sensitive work. I've never been much of an optimizing buff, so someone who's interested in pursuing this path (@SimonSapin?) will need to pick this up, test it out in servo/firefox/rustc to see if it's useful in larger use cases, and then possibly profile and analyze the result to figure out whether it can be tuned into a total slam dunk rather than a tradeoff.


Gankra commented May 1, 2019

@Mark-Simulacrum would this be something you could help us do some profiling for? Specifically for this change integrated into rust-lang/rust? It has CI that can do profiling/benchmarking try runs, right?

Mark-Simulacrum commented

Copying from Discord: "best bet is probably to file a PR bumping the submodule/dep and bors try that; or you can do it locally if you'd like (lower latency there)"

To expand on that a bit, https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector#how-to-benchmark-a-change-on-your-own-machine has some directions on how to run perf locally. That may be simpler, and is likely to be much faster latency-wise, than queueing perf builds.


bors commented May 28, 2019

☔ The latest upstream changes (presumably #80) made this pull request unmergeable. Please resolve the merge conflicts.


Gankra commented May 28, 2019

Gonna close this for lack of super compelling results.
