[RISCV] Combine vslidedown_vl with known VL and offset to a smaller LMUL #66267
Conversation
@llvm/pr-subscribers-backend-risc-v

Changes

If we know the VL of a vslidedown_vl, we can work out the minimum number of registers it's going to operate across. We can reuse the logic from extract_vector_elt to perform it in a smaller type and reduce the LMUL.

One observation from adding this is that the vslide*_vl nodes all take a mask operand, but currently anything other than vmset_vl will fail to select, as all the patterns expect true_mask.

This could maybe also be done in InsertVSETVLI, but I haven't explored it yet.

Patch is 308.49 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/66267.diff

14 Files Affected:
This optimization increases the number of dynamic instructions in cases where the surrounding instructions prefer to use the larger LMUL: a sequence that previously stayed at a single LMUL gets optimized to one that needs an extra vsetivli to toggle to the smaller LMUL and back.

If this sequence is in a loop, then the effect is magnified. Going from M8, M4, or M2 to a smaller LMUL will likely see an improvement that outweighs the cost of the extra vsetivli. However, it is less clear whether going from M1, MF2, or MF4 will outweigh that cost. Should we consider only doing this for LMULs where we are confident it brings a benefit?
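As an illustration of such a gate (this is an assumption on my part, not something the patch does; getLMUL1VT is the LMUL-1 container-type helper used in RISCVISelLowering.cpp):

// Only shrink when the original container is wider than LMUL=1, where saving
// LMUL should clearly outweigh the extra vsetivli the combine introduces.
MVT ContainerVT = N->getSimpleValueType(0);
if (ContainerVT.bitsLE(getLMUL1VT(ContainerVT)))
  return SDValue();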
This is good enough for me.
This patch isn't sound. Just because the VL is known doesn't mean the slidedown offset is <= VL. The elements written to the low lanes of the destination may come from the high-LMUL part of the input.

Sorry, you're right. I think this should be bounded by offset + VL rather than just VL.
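To make that bound concrete, here is a small standalone C++ illustration (not LLVM code; the VLEN, SEW, and example values are assumptions) of why offset + VL, rather than VL alone, determines the smallest register group the slide can use:

#include <iostream>

int main() {
  const unsigned VLEN = 128;          // assumed minimum VLEN (e.g. Zvl128b)
  const unsigned SEW = 64;            // element width in bits
  const unsigned PerReg = VLEN / SEW; // elements per register at LMUL=1

  const unsigned VL = 1, Offset = 6;  // e.g. extracting element 6 via a slidedown
  // The result's lanes [0, VL) are read from source lanes [Offset, Offset + VL),
  // so the slide has to see the first Offset + VL source elements.
  const unsigned EltsTouched = Offset + VL;

  // Round up to the next power-of-two register group (LMUL).
  unsigned LMUL = 1;
  while (LMUL * PerReg < EltsTouched)
    LMUL *= 2;

  // Using VL alone would wrongly conclude a fraction of one register suffices.
  std::cout << "elements touched: " << EltsTouched
            << ", minimum LMUL: m" << LMUL << "\n";
  return 0;
}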
If we know the VL and offset of a vslidedown_vl, we can work out the minimum number of registers it's going to operate across. We can reuse the logic from extract_vector_elt to perform it in a smaller type and reduce the LMUL. The aim is to generalize llvm#65598 and hopefully extend this to vslideup_vl too so that we can get the same optimisation for insert_subvector and insert_vector_elt. One observation from adding this is that the vslide*_vl nodes all take a mask operand, but currently anything other than vmset_vl will fail to select, as all the patterns expect true_mask. So we need to create a new vmset_vl instead of using extract_subvector on the existing vmset_vl.
force-pushed from b48e459 to 569fb00
Similar to llvm#66267, we can perform a vslideup_vl on a smaller type if we know the highest lane that will be written to, which can be determined from VL. This is an alternative to llvm#65997 and llvm#66087.
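A rough sketch of that bound for the vslideup_vl direction (the surrounding context and the name CVL are illustrative, not taken from the patch):

// With a constant VL, a vslideup only writes destination lanes [Offset, VL),
// so VL alone bounds how much of the destination the slide touches.
if (auto *CVL = dyn_cast<ConstantSDNode>(VL)) {
  uint64_t LanesTouched = CVL->getZExtValue();
  // ...choose the smallest container type covering LanesTouched elements,
  // perform the slideup at the reduced LMUL, then insert_subvector the result
  // back into the original type...
}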
DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, *ShrunkVT, N->getOperand(1),
            DAG.getVectorIdxConstant(0, DL));

// The only mask ever used in vslide*_vl nodes is vmset_vl, and the only
// patterns that select them expect true_mask, so create a new vmset_vl
// rather than reusing the existing mask via extract_subvector.
This is an awful hack. How hard would it be to add a pattern for vslidedown_vl with an arbitrary mask?
Agreed. I'll try adding a masked pattern; hopefully RISCVISelDAGToDAG will be able to convert it to the unmasked pseudo.
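For reference, the shape of the current workaround being discussed (a sketch with assumed surrounding context: ShrunkVT, DL, and VL are the shrunk container type, debug location, and VL operand from the combine):

// Build a fresh all-ones mask of the shrunk mask type instead of taking an
// extract_subvector of the incoming mask, which only selects for true_mask.
MVT ShrunkMaskVT = ShrunkVT->changeVectorElementType(MVT::i1);
SDValue NewMask = DAG.getNode(RISCVISD::VMSET_VL, DL, ShrunkMaskVT, VL);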
; CHECK-NEXT: vmv.x.s a0, v8
; CHECK-NEXT: li a1, 32
; CHECK-NEXT: vsetivli zero, 1, e64, m4, ta, ma |
Not a problem with your change, but why on earth are we using m4 for VL=1?
; RV32-NEXT: vsrl.vx v16, v12, a0
; RV32-NEXT: vmv.x.s a3, v16
; RV32-NEXT: vmv.x.s a3, v12
; RV32-NEXT: vsetivli zero, 1, e64, m4, ta, ma |
Same as previous
See the previous batch of one-off changes - sorry, got confused by the PR web interface.
To summarize:
- We need a more general solution to the vslidedown w/mask case. Either via general select, or by removing the mask parameter on the node. The current form is highly confusing.
- The vsrl using a high LMUL suggests maybe we should go with the extract-specific case, as that LMUL should be reducible as well.
Sometimes with mask vectors that have been widened, there is a CopyToRegClass node in between the VMSET and the CopyToReg. This is a resurrection of https://reviews.llvm.org/D148524, and is needed to remove the mask operand when it's extracted from a subvector as planned in #66267 (comment)
We were previously only matching on the true_mask pattern. This patch allows arbitrary masks to be matched, which means we can avoid the workaround used in #66267. We can just add patterns for the _MASK pseudo variants because RISCVDAGToDAGISel::doPeepholeMaskedRVV will transform them to the unmasked variant if the mask is all ones.
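A rough sketch of that idea (the helper names below are assumptions for illustration only; the real logic lives in RISCVDAGToDAGISel::doPeepholeMaskedRVV):

// If a _MASK pseudo's mask operand is known to be all ones, rewrite it to the
// unmasked pseudo, so true_mask-only patterns are no longer needed.
if (usesAllOnesMask(MaskedNode, MaskOpIdx))   // assumed helper
  ReplaceWithUnmaskedPseudo(MaskedNode);      // illustrative name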
Marking as a draft as it needs to be rebased, and it may not be worthwhile if #66671 is not able to fully subsume the lowering logic for insert_vector_elt and insert_subvector.
If we know the VL and offset of a vslidedown_vl, we can work out the minimum
number of registers it's going to operate across. We can reuse the logic from
extract_vector_elt to perform it in a smaller type and reduce the LMUL.
The aim is to generalize #65598 and hopefully extend this to vslideup_vl too so
that we can get the same optimisation for insert_subvector and
insert_vector_elt.
One observation from adding this is that the vslide*_vl nodes all take a mask
operand, but currently anything other than vmset_vl will fail to select, as all
the patterns expect true_mask. So we need to create a new vmset_vl instead of
using extract_subvector on the existing vmset_vl.
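Putting the pieces above together, a minimal sketch of the combine (helper names such as getSmallestContainerForElements and shrinkSlidedown are illustrative assumptions, and the VSLIDEDOWN_VL operand layout is simplified; this is not the actual patch):

static SDValue shrinkSlidedown(SDNode *N, SelectionDAG &DAG) {
  // N: (vslidedown_vl merge, src, offset, mask, vl)
  SDValue Src = N->getOperand(1);
  auto *Offset = dyn_cast<ConstantSDNode>(N->getOperand(2));
  auto *VL = dyn_cast<ConstantSDNode>(N->getOperand(4));
  if (!Offset || !VL)
    return SDValue();

  // Result lanes [0, VL) come from source lanes [Offset, Offset + VL), so
  // Offset + VL bounds the number of elements (and hence registers) touched.
  uint64_t EltsTouched = Offset->getZExtValue() + VL->getZExtValue();

  MVT VT = N->getSimpleValueType(0);
  std::optional<MVT> ShrunkVT =
      getSmallestContainerForElements(VT, EltsTouched); // assumed helper
  if (!ShrunkVT || *ShrunkVT == VT)
    return SDValue();

  SDLoc DL(N);
  SDValue SubSrc = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, *ShrunkVT, Src,
                               DAG.getVectorIdxConstant(0, DL));
  // Fresh all-ones mask for the shrunk type (see the mask discussion above).
  SDValue Mask = DAG.getNode(RISCVISD::VMSET_VL, DL,
                             ShrunkVT->changeVectorElementType(MVT::i1),
                             N->getOperand(4));
  SDValue Slide = DAG.getNode(RISCVISD::VSLIDEDOWN_VL, DL, *ShrunkVT,
                              DAG.getUNDEF(*ShrunkVT), SubSrc,
                              N->getOperand(2), Mask, N->getOperand(4));
  // Widen back to the original type; only lanes [0, VL) are meaningful.
  return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT), Slide,
                     DAG.getVectorIdxConstant(0, DL));
}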