Proof of Concept: Outlined RawTable #76
Conversation
Also when running
This is troubling, but all the tests pass. Is there just a test that's checking we fail to allocate a massive thing? (edit: yes)
cargo bench results: before:
after:
lol sick, all of the CI except macos fails 🙃
I'm a bit skeptical, this feels like a somewhat roundabout way of simply boxing the
remove ctrl bench:
Well it's significantly better than
fwiw Swift's Dictionary type has a similar outlined design, although they obviously have fairly different constraints, what with reference-counted CoW collections and ObjC interop. At the very least, the implementation is here, so it would definitely be interesting for parties interested in shrinking the footprint of HashMap to grab this and profile/benchmark their workloads with it. At least locally on my laptop it does seem to result in some hurting on
(also the miri build seems broken for an unrelated reason..)
reran benchmarks, honestly not too bad a difference between no-ctrl and master. Close enough that some massaging might plausibly get us to parity. Disclaimer: my laptop is very prone to thermal throttling, so someone with a more stable system might want to try it.
On AMD Ryzen 5 2600X with SMT disabled:
On Intel(R) Xeon(R) CPU E5-2660 v3 with HT and NUMA disabled:
On Intel(R) Xeon(R) CPU E3-1240 v5 with HT disabled:
well that seems very promising
You can ignore the miri failures for now, it's a bug in miri.
One thing you could try is to have
Pushed up an implementation for that. Locally it seems like a wash, but again I don't have a good experimental rig. With this I've got 3 commits representing 3 different implementation strategies, and basically all the correctness-sensitive work. I've never been much of an optimizing buff, so someone who's interested in pursuing this path (@SimonSapin?) will need to pick this up, test it out in servo/firefox/rustc to see if it's useful in larger use cases, and then possibly profile + analyze the result to figure out if it can be tuned to make it a total slam dunk, and not a tradeoff.
@Mark-Simulacrum would this be something you could help us do some profiling for? Specifically for this change integrated into rust-lang/rust? It has CI that can do profiling/benchmarking try runs, right?
Copying from Discord: “best bet is probably to file a PR bumping the submodule/dep and bors try that; or you can do it locally if you'd like (lower latency there)”. To expand on that a bit, https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector#how-to-benchmark-a-change-on-your-own-machine has some directions on how to run perf locally. That may be simpler, and is likely to be much faster latency-wise, than queueing perf builds.
☔ The latest upstream changes (presumably #80) made this pull request unmergeable. Please resolve the merge conflicts.
Gonna close this for lack of super compelling results.
Needs some cleanup, but here's the basic implementation of what I suggested in #69
Doesn't seem to have been that hard since we already were set up for singleton shenanigans and RawTable is actually a fairly small file.
This reduces `size_of::<hashbrown::HashMap>()` to 8, and `size_of::<std::HashMap>()` to 24, on 64-bit platforms (the only difference is the SipHash random seed). However, it does strictly increase the overall memory footprint of the map by 8, since there is now an additional pointer-to-the-header. This can potentially be avoided (see notes below).
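To make those numbers concrete, here is a minimal sketch of the outlined shape, using hypothetical `Header` and `OutlinedRawTable` types with assumed field names rather than hashbrown's actual definitions: the table shrinks to a single pointer, and the bookkeeping that used to sit inline moves into a header at the front of the one heap allocation.

```rust
use core::marker::PhantomData;
use core::mem;
use core::ptr::NonNull;

// Hypothetical header stored at the front of the single heap allocation;
// the control bytes and buckets would follow it in that same allocation.
#[allow(dead_code)]
struct Header {
    bucket_mask: usize, // number of buckets minus one
    ctrl: *const u8,    // pointer to the control bytes
    growth_left: usize, // buckets left to fill before a resize is needed
    items: usize,       // number of occupied buckets
}

// The outlined table is just a pointer to that header.
#[allow(dead_code)]
struct OutlinedRawTable<T> {
    header: NonNull<Header>,
    marker: PhantomData<T>,
}

fn main() {
    // One pointer inline: 8 bytes on 64-bit targets.
    assert_eq!(mem::size_of::<OutlinedRawTable<u32>>(), 8);
}
```

The inline 8 bytes are just the pointer; the header fields now live on the heap, which is also where the extra 8 bytes of total footprint mentioned above come from.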
Some quick notes in random order:

- `const` context
- `*const` to get the right variance
- `Sync` impl for Header is so it can be put in a static
- `is_empty_singleton` changed to a pointer comparison to a static value, which should be faster than looking up the len (see the sketch after this list)
- `T` instead of `max(T, Group)`. It's unclear to me why the implementation was doing that, except that I guess it let that statically-computed value be reused for the `alloc` call. I also just refactored that code to be clearer since it was getting a bit hairy with 3 components.
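A rough sketch of the pointer-comparison idea for `is_empty_singleton` (again with assumed names and a stripped-down `Header`, not the PR's actual code): the empty table points at a shared static header, so the singleton check becomes an address comparison rather than a read of the item count.

```rust
use core::ptr::{self, NonNull};

#[allow(dead_code)]
struct Header {
    bucket_mask: usize,
    growth_left: usize,
    items: usize,
}

// Shared empty singleton. It can live in a static because `Header` is `Sync`
// (automatic here, since every field is a plain usize). The real table must
// never write through a pointer derived from this.
static EMPTY_HEADER: Header = Header {
    bucket_mask: 0,
    growth_left: 0,
    items: 0,
};

struct OutlinedRawTable {
    header: NonNull<Header>,
}

impl OutlinedRawTable {
    // Hypothetical constructor for the empty table: point at the static header.
    fn new_empty() -> Self {
        OutlinedRawTable {
            header: NonNull::from(&EMPTY_HEADER),
        }
    }

    // Compare addresses against the static instead of loading a length field.
    fn is_empty_singleton(&self) -> bool {
        ptr::eq(self.header.as_ptr() as *const Header, &EMPTY_HEADER as *const Header)
    }
}

fn main() {
    assert!(OutlinedRawTable::new_empty().is_empty_singleton());
}
```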
Potential Optimization Worth Considering:

`ctrl` should have a statically-known offset from `header`. As such, it's possible we can remove the `ctrl` pointer and just compute it on demand? (See the sketch at the end.)

I have not yet benchmarked this at all. Not sure what tools we have for this outside of me just running `cargo bench` locally.
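As a rough illustration of the offset-based `ctrl` idea above (hypothetical types and field names again, and assuming the control bytes start immediately after the header in the same allocation), the stored `ctrl` pointer could be replaced with a constant-offset computation on the header pointer:

```rust
use core::mem;
use core::ptr::NonNull;

#[allow(dead_code)]
struct Header {
    bucket_mask: usize,
    growth_left: usize,
    items: usize,
    // In this sketch the control bytes sit immediately after the header.
}

struct OutlinedRawTable {
    header: NonNull<Header>,
}

impl OutlinedRawTable {
    // Derive the control-byte pointer from the header pointer at a fixed,
    // statically known offset instead of storing a separate `ctrl` field.
    fn ctrl(&self) -> *const u8 {
        // The real table would have to guarantee the allocation extends past the
        // header; this sketch only does the pointer arithmetic and never reads.
        unsafe { (self.header.as_ptr() as *const u8).add(mem::size_of::<Header>()) }
    }
}

fn main() {
    // Hypothetical backing storage: a header immediately followed by control bytes.
    #[repr(C)]
    struct Backing {
        header: Header,
        ctrl: [u8; 4],
    }
    let backing = Box::new(Backing {
        header: Header { bucket_mask: 3, growth_left: 3, items: 0 },
        ctrl: [0x80; 4],
    });
    let table = OutlinedRawTable {
        header: NonNull::from(&backing.header),
    };
    assert_eq!(table.ctrl(), backing.ctrl.as_ptr());
}
```

This trades a load of a stored field for a constant add on the header pointer; whether that is actually a win is exactly the kind of thing the benchmarking mentioned above would need to answer.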