Hines/thread padding #2951
base: master
Conversation
so introduce Memb_list m_cache_offset in analogy with m_storage_offset.
✔️ 7492f82 -> Azure artifacts URL
✔️ 7ab6c4e -> Azure artifacts URL
Do you have a sense of how often there is false sharing, i.e., how often they are written to?
Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly caching pointers to doubles such as diam and area). With
You are exactly correct. Ironically, that was how it was done in long-ago precursors to 8.2. In 9.0, threads constitute merely a permutation of the data (into thread partitions) where each partition for each SoA variable (with this PR) begins on a cache-line boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low-level permutation. The 8.2-and-earlier implementation of threads was a far-reaching and complex change involving a great deal of memory management and copying between representations. At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the #2787 model with gcc. That performance edge disappears on the Apple M1 and is much smaller with the Intel compiler on x86_64. The bottom line so far with this experiment is that it appears that destructive cache-line sharing is not a significant performance issue for #2787. But pinning that down with confidence requires a bit more testing.
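For readers less familiar with the 9.0 layout, here is a minimal, hypothetical sketch of the kind of rounding that makes each thread's partition of an SoA array of doubles start on a 64-byte cache-line boundary. This is not NEURON's actual code; `pad_to_cache_line`, `partition_offsets`, and the example partition sizes are invented for illustration.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical illustration: compute per-thread starting offsets into one
// SoA array of doubles so that every thread partition begins on a
// 64-byte (8-double) cache-line boundary.
constexpr std::size_t cache_line_bytes = 64;
constexpr std::size_t doubles_per_line = cache_line_bytes / sizeof(double);

std::size_t pad_to_cache_line(std::size_t offset) {
    // round up to the next multiple of 8 doubles
    return (offset + doubles_per_line - 1) / doubles_per_line * doubles_per_line;
}

// counts[i] = number of entries owned by thread i; returns the padded start
// offset of each partition and the padded total size of the array.
std::vector<std::size_t> partition_offsets(const std::vector<std::size_t>& counts,
                                           std::size_t& padded_size) {
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    for (auto count : counts) {
        offset = pad_to_cache_line(offset);  // next partition starts on a new line
        offsets.push_back(offset);
        offset += count;
    }
    padded_size = pad_to_cache_line(offset);
    return offsets;
}

int main() {
    std::size_t padded_size = 0;
    // e.g. 4 threads owning 1000, 995, 1003 and 998 entries
    for (auto off : partition_offsets({1000, 995, 1003, 998}, padded_size)) {
        std::cout << off << ' ';
    }
    std::cout << "(total " << padded_size << " doubles)\n";
}
```

The point is that the per-thread "permutation" only needs this small amount of padding bookkeeping; no copying between representations is required.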
With the Just for reference, with
Interesting, it's too bad more concrete proof didn't surface.
Also very interesting. It's certainly nice that it's relatively simple to implement with the updated system.
and
Whenever I see small perf changes like the above, I always remember the paper Producing Wrong Data Without Doing Anything Obviously Wrong!. It's old, but the takeaway is pretty stunning: there can be measurement bias simply due to link order or environment size. Their findings aren't necessarily applicable here; it's more that small perf improvements may have different root causes, and teasing out the causal relationships is important.
✔️ 19c2887 -> Azure artifacts URL
✔️ 13d7fd7 -> Azure artifacts URL
An experiment to see whether performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).
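For context, a minimal, self-contained illustration (not NEURON code; the struct names and iteration count are arbitrary) of the destructive sharing being avoided: two threads repeatedly write counters that share one 64-byte cache line, versus counters padded with alignas(64) onto separate lines.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Hypothetical demo of false sharing: the "Shared" pair almost certainly sits
// in one cache line; the "Padded" pair gives each counter its own 64-byte line.
struct Shared {
    std::atomic<std::uint64_t> a{0}, b{0};
};
struct Padded {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

template <typename Counters>
double time_writes(Counters& c, std::uint64_t iters) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (std::uint64_t i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (std::uint64_t i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    constexpr std::uint64_t iters = 50'000'000;
    Shared s;
    Padded p;
    std::cout << "same cache line: " << time_writes(s, iters) << " s\n";
    std::cout << "padded (64 B):   " << time_writes(p, iters) << " s\n";
}
```

On typical multi-core hardware the padded version is noticeably faster, because the shared version forces the cache line to ping-pong between cores on every write.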
Using the model of #2787, improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are
Perhaps there is a data handle overhead for plotting that could be solved by caching.