Explicitly stage strided loads #7230
Conversation
The excellent PR description of what's going on now (and what we used to do) should be captured in a code comment somewhere.
Some backends don't like non-power-of-two vectors. Do two overlapping half-sized loads and shuffle instead of one funny-sized load.
LGTM -- don't see any red flags inside Google.
}
// TODO: We do not yet handle nested vectorization here for
// ramps which have not already collapsed. We could potentially
// handle more interesting types of shuffle than simple flat slices.
Should this TODO be an open issue?
src/StageStridedLoads.cpp (Outdated)
// The loop body definitely runs

// TODO: worry about different iterations of the loop somehow not
// providing the evidence we thought it did.
(same as above) should this TODO be an issue? Or a blocker for this PR? I'm also not quite sure what the TODO means.
Lgtm with nits
* Add a pass to do explicit densification of strided loads
* densify more types of strided load
* Reorder downsample in local laplacian for slightly better performance
* Move allocation padding into the IR. Still WIP.
* Simplify concat_bits handling
* Use evidence from parent scopes to densify
* Disallow padding allocations with custom new expressions
* Add test for parent scopes
* Remove debugging prints. Avoid nested ramps.
* Avoid parent scope loops
* Update cmakefiles
* Fix for large_buffers
* Pad stack allocations too
* Restore vld2/3/4 generation on non-Apple ARM chips
* Appease clang-format and clang-tidy
* Silence clang-tidy
* Better comments
* Comment improvements
* Nuke code that reads out of bounds
* Fix stage_strided_loads test
* Change strategy for loads from external buffers

  Some backends don't like non-power-of-two vectors. Do two overlapping half-sized loads and shuffle instead of one funny-sized load.

* Add explanatory comment to ARM backend
* Fix cpp backend shuffling
* Fix missing msan annotations
* Magnify heap cost effect in stack_vs_heap performance test
* Address review comments
* clang-tidy
* Fix for when same load node occurs in two different allocate nodes
This PR adds a new compiler pass that converts strided loads into dense loads followed by shuffles.
For a stride of two, the trick is to do a dense load of twice the size and then extract either the even or the odd lanes. This was previously done in codegen, where it was challenging because it's hard to tell at that point whether the double-sized load is safe: it reads either one element beyond or one element before the original load. We used the alignment of the ramp base to try to tell whether it was safe to shift backwards, and we added padding to internal allocations so that, for those at least, it was safe to shift forwards. Unfortunately, the alignment of the ramp base is usually unknown if you don't know anything about the strides of the input, and adding padding to allocations was a serious wart in our memory allocators.
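A minimal standalone sketch of that stride-two trick, written as scalar C++ with an illustrative fixed lane count (the names here are not Halide's internal API):

```cpp
#include <array>
#include <cstddef>

constexpr size_t kLanes = 8;  // illustrative vector width

// The strided load being transformed: src[base], src[base + 2], ...
std::array<int, kLanes> strided_load_x2(const int *src, size_t base) {
    std::array<int, kLanes> out{};
    for (size_t i = 0; i < kLanes; i++) out[i] = src[base + 2 * i];
    return out;
}

// Staged form: one dense load of twice the width, then a deinterleaving
// shuffle that keeps the even or odd lanes. The dense load reads one
// element beyond (even case) or one element before (odd case) the strided
// load's footprint, which is why its safety has to be established somehow.
std::array<int, kLanes> staged_load_x2(const int *src, size_t dense_base, bool odd) {
    std::array<int, 2 * kLanes> dense{};
    for (size_t i = 0; i < 2 * kLanes; i++) dense[i] = src[dense_base + i];  // dense load
    std::array<int, kLanes> out{};
    for (size_t i = 0; i < kLanes; i++) out[i] = dense[2 * i + (odd ? 1 : 0)];  // shuffle
    return out;
}
```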
This PR instead actively looks for evidence elsewhere in the Stmt (at some location which definitely executes whenever the load being transformed executes) that it's safe to read further forwards or backwards in memory. The evidence is in the form of a load at the same base address with a different constant offset. It also clusters groups of these loads so that they each do the same dense load and extract the appropriate slice of lanes. For loads from external buffers, it just does a shorter load, and for loads from internal allocations that we can't shift forwards or backwards, it adds padding to the allocation explicitly, via a new padding field on Allocate nodes.
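To illustrate the clustering, here is a hedged sketch in the same scalar style (illustrative code, not the pass itself): two stride-two loads at the same base with different constant offsets (even and odd lanes) are each other's evidence that the combined dense footprint is readable, so both can be served by one dense load plus two shuffles.

```cpp
#include <array>
#include <cstddef>

constexpr size_t kLanes = 8;

struct EvenOdd {
    std::array<int, kLanes> even;  // src[base + 0], src[base + 2], ...
    std::array<int, kLanes> odd;   // src[base + 1], src[base + 3], ...
};

// Both strided loads share one dense load covering their combined
// footprint (base .. base + 2 * kLanes - 1); each then extracts its own
// slice of lanes. Nothing outside memory already touched by the pair of
// loads is read.
EvenOdd load_even_and_odd(const int *src, size_t base) {
    std::array<int, 2 * kLanes> dense{};
    for (size_t i = 0; i < 2 * kLanes; i++) dense[i] = src[base + i];

    EvenOdd out{};
    for (size_t i = 0; i < kLanes; i++) {
        out.even[i] = dense[2 * i];     // even slice
        out.odd[i] = dense[2 * i + 1];  // odd slice
    }
    return out;
}
```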
Edit: Some backends don't like non-power-of-two sized vectors, so for loads from external buffers, we now split the load into two overlapping dense loads of half the size, then shuffle out appropriate lanes from each and concat the results.
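As a rough sketch of that split (same illustrative scalar style, not the actual backend code): the footprint of a stride-two load of kLanes elements is 2 * kLanes - 1 contiguous elements, so instead of one odd-sized dense load, one way to arrange it is two overlapping power-of-two-sized loads, taking the even lanes of the first, the odd lanes of the second, and concatenating.

```cpp
#include <array>
#include <cstddef>

constexpr size_t kLanes = 8;  // assumed even for this sketch

// Replaces a stride-two load of kLanes elements starting at base.
// The two dense loads cover base .. base + kLanes - 1 and
// base + kLanes - 1 .. base + 2 * kLanes - 2, so nothing outside the
// original footprint (base .. base + 2 * kLanes - 2) is read.
std::array<int, kLanes> staged_load_x2_external(const int *src, size_t base) {
    std::array<int, kLanes> lo{}, hi{};
    for (size_t i = 0; i < kLanes; i++) {
        lo[i] = src[base + i];
        hi[i] = src[base + kLanes - 1 + i];
    }
    std::array<int, kLanes> out{};
    for (size_t i = 0; i < kLanes / 2; i++) {
        out[i] = lo[2 * i];                   // even lanes of the low load
        out[kLanes / 2 + i] = hi[2 * i + 1];  // odd lanes of the high load
    }
    return out;
}
```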