
Proof of Concept: Outlined RawTable #76

Closed
wants to merge 3 commits

Conversation

Gankra commented Apr 27, 2019

Needs some cleanup, but here's the basic implementation of what I suggested in #69

Doesn't seem to have been that hard, since we were already set up for singleton shenanigans and RawTable is actually a fairly small file.

This reduces size_of::<hashbrown::HashMap>() to 8 and size_of::<std::HashMap>() to 24 on 64-bit platforms (the only difference is the SipHash random seed). However, it strictly increases the overall memory footprint of the map by 8 bytes, since there is now an additional pointer to the header. This can potentially be avoided (see notes below).
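
As a rough sketch of the outlined layout being described (field names mirror RawTable's existing fields, but this is illustrative rather than the PR's actual code), all of the table's bookkeeping moves behind a single pointer:

// Everything RawTable used to store inline now lives in a heap-allocated Header.
struct Header<T> {
    bucket_mask: usize,  // number of buckets minus one
    ctrl: *const u8,     // pointer to the control bytes
    data: *const T,      // pointer to the bucket array
    growth_left: usize,  // how many more items fit before a resize
    items: usize,        // number of occupied buckets
}

// The table itself shrinks to a single pointer (8 bytes on 64-bit targets).
struct RawTable<T> {
    header: *const Header<T>,
}

fn main() {
    assert_eq!(std::mem::size_of::<RawTable<u64>>(), std::mem::size_of::<usize>());
}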

Some quick notes in random order:

  • the ctrl and data ptrs had to be changed to raw pointers, as NonNull cannot be constructed in a const context
  • consequently, I also had to make them *const to get the right variance
  • the Sync impl for Header is so it can be put in a static
  • is_empty_singleton changed to a pointer comparison against a static value, which should be faster than looking up the len (see the sketch after this list)
  • added a ton of getters for convenience; these should presumably be optimized away
  • I believe it is always safe to access all the header fields immutably, as even new_uninitialized initializes those fields immediately.
  • We avoid any dependency on the size or align of T with the singleton header because its only reference to T is an always-aligned null pointer to T.
  • Group::empty_singleton() is removed, as it was duplicate code and it needs to be inlined into the static singleton's expression. Also, ALIGNED_BYTES is changed to a static so that it is guaranteed to have stable pointer identity (miri was upset about this).
  • I adjusted calculate_layout to only align the data array to T instead of max(T, Group). It's unclear to me why the implementation was doing that, except that I guess it let that statically-computed value be reused for the alloc call. I also just refactored that code to be clearer since it was getting a bit hairy with 3 components.
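
A hedged sketch of the empty-singleton pieces described in these notes (identifiers are illustrative, not the PR's exact code): a single static Header stands in for every empty table, so creating an empty map allocates nothing and the emptiness check is pointer identity.

use std::ptr;

// One Group's worth of EMPTY (0xFF) control bytes. A static (not a const) so
// its address is stable; repr(align(16)) stands in for Group alignment here.
#[repr(align(16))]
struct AlignedBytes([u8; 16]);
static ALIGNED_BYTES: AlignedBytes = AlignedBytes([0xFF; 16]);

struct Header {
    bucket_mask: usize,
    ctrl: *const u8,  // raw pointer: NonNull can't be constructed in a const context
    data: *const u8,  // the real code stores an always-aligned null pointer to T;
                      // u8 here just keeps the sketch non-generic
    items: usize,
}

// Raw-pointer fields make Header !Sync by default; the singleton is never
// mutated, so sharing it between threads is sound, and Sync lets it be a static.
unsafe impl Sync for Header {}

static EMPTY_SINGLETON: Header = Header {
    bucket_mask: 0,
    ctrl: &ALIGNED_BYTES as *const AlignedBytes as *const u8,
    data: ptr::null(),
    items: 0,
};

fn is_empty_singleton(header: *const Header) -> bool {
    // Pointer identity against the shared singleton, not a length check.
    ptr::eq(header, &EMPTY_SINGLETON)
}

fn main() {
    assert!(is_empty_singleton(&EMPTY_SINGLETON));
}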

Potential Optimization Worth Considering: ctrl has a statically-known offset from the header, so it may be possible to remove the ctrl pointer entirely and just compute it on demand.
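
For illustration, a minimal sketch of that computation, reusing the non-generic Header type from the previous sketch and assuming the control bytes sit immediately after the header in the same allocation (the real offset would need rounding up to Group alignment):

fn ctrl_ptr(header: *const Header) -> *const u8 {
    // The offset is a compile-time constant, so nothing needs to be stored.
    unsafe { (header as *const u8).add(std::mem::size_of::<Header>()) }
}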

I have not yet benchmarked this at all. Not sure what tools we have for this outside of me just running cargo bench locally.


Gankra commented Apr 27, 2019

Also when running cargo test locally on my mac I see:

...
test map::test_map::test_occupied_entry_key ... ok
test map::test_map::test_remove ... ok
test map::test_map::test_remove_entry ... ok
test map::test_map::test_raw_entry ... ok
test map::test_map::test_show ... ok
hashbrown-94db3bfa716f052b(5235,0x70000df2e000) malloc: can't allocate region
*** mach_vm_map(size=13835058055282167808) failed (error code=3)
hashbrown-94db3bfa716f052b(5235,0x70000df2e000) malloc: *** set a breakpoint in malloc_error_break to debug
test map::test_map::test_size_hint ... ok
test map::test_map::test_retain ... ok
test map::test_map::test_try_reserve ... ok
test map::test_map::test_vacant_entry_key ... ok
...

This is troubling, but all the tests pass. Is there just a test that's checking we fail to allocate a massive thing? (edit: yes)


Gankra commented Apr 27, 2019

cargo bench results:

before:

test insert_erase_i32 ... bench:      27,526 ns/iter (+/- 1,773)
test insert_erase_i64 ... bench:      30,098 ns/iter (+/- 5,014)
test insert_i32       ... bench:      21,701 ns/iter (+/- 1,867)
test insert_i64       ... bench:      24,181 ns/iter (+/- 2,313)
test iter_i32         ... bench:       1,794 ns/iter (+/- 131)
test iter_i64         ... bench:       1,805 ns/iter (+/- 103)
test lookup_fail_i32  ... bench:       1,821 ns/iter (+/- 149)
test lookup_fail_i64  ... bench:       1,795 ns/iter (+/- 157)
test lookup_i32       ... bench:       2,713 ns/iter (+/- 292)
test lookup_i64       ... bench:       2,802 ns/iter (+/- 361)

after:

running 10 tests
test insert_erase_i32 ... bench:      24,184 ns/iter (+/- 6,220)
test insert_erase_i64 ... bench:      31,264 ns/iter (+/- 4,505)
test insert_i32       ... bench:      18,267 ns/iter (+/- 1,264)
test insert_i64       ... bench:      25,695 ns/iter (+/- 1,993)
test iter_i32         ... bench:       1,797 ns/iter (+/- 136)
test iter_i64         ... bench:       1,791 ns/iter (+/- 173)
test lookup_fail_i32  ... bench:       2,059 ns/iter (+/- 90)
test lookup_fail_i64  ... bench:       2,061 ns/iter (+/- 87)
test lookup_i32       ... bench:       3,035 ns/iter (+/- 117)
test lookup_i64       ... bench:       3,215 ns/iter (+/- 389)


Gankra commented Apr 27, 2019

lol sick, all of the CI except macos fails 🙃


Amanieu commented Apr 27, 2019

I'm a bit skeptical; this feels like a somewhat roundabout way of simply boxing the HashMap. The only difference is that you have 2 allocations instead of 1.


Gankra commented Apr 27, 2019

Bench results with the ctrl pointer removed:

test insert_erase_i32 ... bench:      24,478 ns/iter (+/- 665)
test insert_erase_i64 ... bench:      31,363 ns/iter (+/- 4,813)
test insert_i32       ... bench:      18,021 ns/iter (+/- 1,253)
test insert_i64       ... bench:      25,321 ns/iter (+/- 1,040)
test iter_i32         ... bench:       1,818 ns/iter (+/- 82)
test iter_i64         ... bench:       1,822 ns/iter (+/- 109)
test lookup_fail_i32  ... bench:       2,125 ns/iter (+/- 79)
test lookup_fail_i64  ... bench:       1,824 ns/iter (+/- 125)
test lookup_i32       ... bench:       2,979 ns/iter (+/- 76)
test lookup_i64       ... bench:       3,194 ns/iter (+/- 178)


Gankra commented Apr 27, 2019

Well, it's significantly better than Box<HashMap> for several reasons:

  • One allocation
  • Doesn't allocate for HashMap::new()
  • with ctrl removed: same memory footprint as not-boxing
  • edit: and HashMap::new() actually now has a smaller memory footprint

FWIW, Swift's Dictionary type has a similar outlined design, although they obviously have fairly different constraints, what with reference-counted CoW collections and ObjC interop.

At the very least, the implementation is here, so it would definitely be interesting for parties interested in shrinking the footprint of HashMap to grab this and profile/benchmark their workloads with it. Locally on my laptop it does seem to regress cargo bench a bit, but this sort of change is probably more significant in non-synthetic situations, where memory usage and cache pressure matter more (keep in mind the empty singleton is shared, so performing ops on it is more likely to hit cache).


Gankra commented Apr 27, 2019

(also the miri build seems broken for an unrelated reason..)


Gankra commented Apr 27, 2019

Reran the benchmarks; honestly, not too bad a difference between no-ctrl and master. Close enough that some massaging might plausibly get us to parity.

Disclaimer: my laptop is very prone to thermal throttling, so someone with a more stable system might want to try it.

old:
test insert_erase_i32 ... bench:      28,049 ns/iter (+/- 2,981)
test insert_erase_i64 ... bench:      29,961 ns/iter (+/- 4,659)
test insert_i32       ... bench:      22,098 ns/iter (+/- 856)
test insert_i64       ... bench:      24,722 ns/iter (+/- 4,997)
test iter_i32         ... bench:       1,810 ns/iter (+/- 167)
test iter_i64         ... bench:       1,813 ns/iter (+/- 72)
test lookup_fail_i32  ... bench:       1,835 ns/iter (+/- 85)
test lookup_fail_i64  ... bench:       1,796 ns/iter (+/- 101)
test lookup_i32       ... bench:       2,758 ns/iter (+/- 153)
test lookup_i64       ... bench:       2,813 ns/iter (+/- 111)


new (ctrl):
test insert_erase_i32 ... bench:      29,437 ns/iter (+/- 834)
test insert_erase_i64 ... bench:      25,578 ns/iter (+/- 3,472)
test insert_i32       ... bench:      23,261 ns/iter (+/- 2,539)
test insert_i64       ... bench:      19,313 ns/iter (+/- 973)
test iter_i32         ... bench:       1,804 ns/iter (+/- 43)
test iter_i64         ... bench:       1,785 ns/iter (+/- 130)
test lookup_fail_i32  ... bench:       2,055 ns/iter (+/- 81)
test lookup_fail_i64  ... bench:       2,044 ns/iter (+/- 275)
test lookup_i32       ... bench:       3,170 ns/iter (+/- 208)
test lookup_i64       ... bench:       3,127 ns/iter (+/- 261)

new (no ctrl):
test insert_erase_i32 ... bench:      23,930 ns/iter (+/- 882)
test insert_erase_i64 ... bench:      30,739 ns/iter (+/- 4,469)
test insert_i32       ... bench:      17,866 ns/iter (+/- 710)
test insert_i64       ... bench:      24,982 ns/iter (+/- 679)
test iter_i32         ... bench:       1,782 ns/iter (+/- 141)
test iter_i64         ... bench:       1,790 ns/iter (+/- 171)
test lookup_fail_i32  ... bench:       2,049 ns/iter (+/- 146)
test lookup_fail_i64  ... bench:       1,793 ns/iter (+/- 151)
test lookup_i32       ... bench:       2,871 ns/iter (+/- 215)
test lookup_i64       ... bench:       3,116 ns/iter (+/- 161)


jonhoo commented Apr 27, 2019

On AMD Ryzen 5 2600X with SMT disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  17,217          21,650                    4,433   25.75%   x 0.80
 insert_erase_i64  22,715          17,995                   -4,720  -20.78%   x 1.26
 insert_i32        13,793          17,832                    4,039   29.28%   x 0.77
 insert_i64        19,142          14,302                   -4,840  -25.28%   x 1.34
 iter_i32          1,600           1,620                        20    1.25%   x 0.99
 iter_i64          1,566           1,603                        37    2.36%   x 0.98
 lookup_fail_i32   1,407           1,411                         4    0.28%   x 1.00
 lookup_fail_i64   1,455           1,464                         9    0.62%   x 0.99
 lookup_i32        1,855           1,886                        31    1.67%   x 0.98
 lookup_i64        2,033           2,058                        25    1.23%   x 0.99

On Intel(R) Xeon(R) CPU E5-2660 v3 with HT and NUMA disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  31,722          32,894                    1,172    3.69%   x 0.96
 insert_erase_i64  41,297          33,095                   -8,202  -19.86%   x 1.25
 insert_i32        24,041          25,062                    1,021    4.25%   x 0.96
 insert_i64        33,884          25,184                   -8,700  -25.68%   x 1.35
 iter_i32          2,717           2,730                        13    0.48%   x 1.00
 iter_i64          2,719           2,718                        -1   -0.04%   x 1.00
 lookup_fail_i32   2,791           3,123                       332   11.90%   x 0.89
 lookup_fail_i64   3,123           2,774                      -349  -11.18%   x 1.13
 lookup_i32        3,831           4,196                       365    9.53%   x 0.91
 lookup_i64        4,314           4,508                       194    4.50%   x 0.96

On Intel(R) Xeon(R) CPU E3-1240 v5 with HT disabled:

 name              master ns/iter  singleton ns/iter  diff ns/iter   diff %  speedup
 insert_erase_i32  16,977          17,181                      204    1.20%   x 0.99
 insert_erase_i64  26,322          18,239                   -8,083  -30.71%   x 1.44
 insert_i32        12,781          12,889                      108    0.85%   x 0.99
 insert_i64        22,082          13,922                   -8,160  -36.95%   x 1.59
 iter_i32          1,853           1,853                         0    0.00%   x 1.00
 iter_i64          1,853           1,853                         0    0.00%   x 1.00
 lookup_fail_i32   1,852           1,936                        84    4.54%   x 0.96
 lookup_fail_i64   1,935           1,852                       -83   -4.29%   x 1.04
 lookup_i32        2,334           2,466                       132    5.66%   x 0.95
 lookup_i64        2,576           2,614                        38    1.48%   x 0.99


Gankra commented Apr 27, 2019

well that seems very promising


Amanieu commented Apr 27, 2019

You can ignore the miri failures for now; it's a bug in miri.


Amanieu commented Apr 28, 2019

One thing you could try is to have HashMap point to ctrl directly rather than the header. You can then address the header at a negative offset from the pointer.
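
A minimal sketch of that scheme (reusing the non-generic Header type sketched earlier, and assuming the header is laid out immediately before the control bytes in the same allocation; not the PR's actual code):

fn header_ptr(ctrl: *const u8) -> *const Header {
    // The map stores only ctrl; the header lives at a fixed negative offset,
    // so it can be recovered without an extra field.
    unsafe { ctrl.sub(std::mem::size_of::<Header>()) as *const Header }
}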


Gankra commented Apr 29, 2019

One thing you could try is to have HashMap point to ctrl directly rather than the header. You can then address the header at a negative offset from the pointer.

Pushed up an implementation for that. Locally it seems like a wash, but again I don't have a good experimental rig.

With this I've got 3 commits representing 3 different implementation strategies, and basically all of the correctness-sensitive work. I've never been much of an optimizing buff, so someone who's interested in pursuing this path (@SimonSapin?) will need to pick this up, test it out in servo/firefox/rustc to see if it's useful in larger use cases, and then possibly profile and analyze the result to figure out whether it can be tuned into a total slam dunk rather than a tradeoff.


Gankra commented May 1, 2019

@Mark-Simulacrum would this be something you could help us do some profiling for? Specifically for this change integrated into rust-lang/rust? It has CI that can do profiling/benchmarking try runs, right?

Mark-Simulacrum commented

Copying from Discord: "best bet is probably to file a PR bumping the submodule/dep and bors try that; or you can do it locally if you'd like (lower latency there)"

To expand on that a bit, https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector#how-to-benchmark-a-change-on-your-own-machine has some directions on how to run perf locally. That may be simpler, and is likely to be much faster latency-wise, than queueing perf builds.


bors commented May 28, 2019

☔ The latest upstream changes (presumably #80) made this pull request unmergeable. Please resolve the merge conflicts.


Gankra commented May 28, 2019

Gonna close this for lack of super compelling results.
