-
Notifications
You must be signed in to change notification settings - Fork 12.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. #68002
Conversation
; GFX10-NEXT: v_fmac_f32_e32 v1, v2, v0 | ||
; GFX10-NEXT: v_max_f32_e32 v0, 0, v1 | ||
; GFX10-NEXT: ; return to shader part epilog | ||
; | ||
; GFX11-LABEL: _amdgpu_ps_main: | ||
; GFX11: ; %bb.0: ; %.entry | ||
; GFX11-NEXT: image_sample v[0:1], v[0:1], s[0:7], s[0:3] dmask:0x3 dim:SQ_RSRC_IMG_2D | ||
; GFX11-NEXT: v_mov_b32_e32 v4, 0 | ||
; GFX11-NEXT: v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 0x3ca3d70a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a use of v7 after this. That seems strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see v_dual_mul_f32 v3, v4, v6 :: v_dual_fmamk_f32 v4, v5, 0x3c23d70a, v7
at line 129.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I missed it. It looks fine.
@@ -3250,9 +3250,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, | |||
MachineOperand *Src2 = getNamedOperand(UseMI, AMDGPU::OpName::src2); | |||
|
|||
// Multiplied part is the constant: Use v_madmk_{f16, f32}. | |||
// We should only expect these to be on src0 due to canonicalization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the referenced canonicalization and should we perhaps fix that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment was added long ago in f078330. The tests there don't seem to use any instrinsics, so I guess the comment was referring to fmul/fadd canonicalisation as it was at the time. Matt @arsenm may know better.
The test file, madmk.ll, still exists, but it seems doesn't rely on that custom code anymore.
Canonicalisation does generally make sense to me, and we do canonicalise (fma c, x, y) to (fma x, c, y) in SDAGCombiner, but here we are at a much later stage dealing with concrete legalised instructions, and for V_FMAC_F16/F32 specificaly we have special code in SIInstrInfo::legalizeOperandsVOP3()
that inserts an SGPR->VGPR COPY for src1. We then fold the immediate operand of the COPY to V_MOV_B32_e32 <imm>
but do not fold that any further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.
@@ -7149,7 +7149,7 @@ define amdgpu_kernel void @udiv_i64_oddk_denom(ptr addrspace(1) %out, i64 %x) { | |||
; GFX6-NEXT: v_mul_f32_e32 v0, 0x5f7ffffc, v0 | |||
; GFX6-NEXT: v_mul_f32_e32 v1, 0x2f800000, v0 | |||
; GFX6-NEXT: v_trunc_f32_e32 v1, v1 | |||
; GFX6-NEXT: v_mac_f32_e32 v0, 0xcf800000, v1 | |||
; GFX6-NEXT: v_madmk_f32 v0, v1, 0xcf800000, v0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point out any cases where this patch improves the generated code? It looks like it just replaces one VOP2 instruction with another VOP2 instruction. I guess in theory v_madmk_f32 is preferable to v_mac_f32_e32 because it gives the option of using different registers for dst and src2, but in practice it looks like that never actually happens, as far as I can see from the test updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In llvm.log.ll we seem to be able to eliminate some register moves. (I was wondering it myself whether that folding logic is actually a dead code.)
; GFX1100-SDAG-NEXT: s_waitcnt_depctr 0xfff
-; GFX1100-SDAG-NEXT: v_fmac_f32_e32 v1, 0x3f317218, v0
-; GFX1100-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX1100-SDAG-NEXT: v_mov_b32_e32 v0, v1
+; GFX1100-SDAG-NEXT: v_fmamk_f32 v0, v0, 0x3f317218, v1
; GFX1100-SDAG-NEXT: s_setpc_b64 s[30:31]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
; GFX10-NEXT: v_fmac_f32_e32 v1, v2, v0 | ||
; GFX10-NEXT: v_max_f32_e32 v0, 0, v1 | ||
; GFX10-NEXT: ; return to shader part epilog | ||
; | ||
; GFX11-LABEL: _amdgpu_ps_main: | ||
; GFX11: ; %bb.0: ; %.entry | ||
; GFX11-NEXT: image_sample v[0:1], v[0:1], s[0:7], s[0:3] dmask:0x3 dim:SQ_RSRC_IMG_2D | ||
; GFX11-NEXT: v_mov_b32_e32 v4, 0 | ||
; GFX11-NEXT: v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 0x3ca3d70a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I missed it. It looks fine.
@@ -3250,9 +3250,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, | |||
MachineOperand *Src2 = getNamedOperand(UseMI, AMDGPU::OpName::src2); | |||
|
|||
// Multiplied part is the constant: Use v_madmk_{f16, f32}. | |||
// We should only expect these to be on src0 due to canonicalization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.
@@ -3266,18 +3269,22 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, | |||
if (pseudoToMCOpcode(NewOpc) == -1) | |||
return false; | |||
|
|||
// We need to swap operands 0 and 1 since madmk constant is at operand 1. | |||
// V_FMAMK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this part in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, #66202 handles the V_FMAAK_F16_t16
case above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
deeb33d
to
3f93118
Compare
3f93118
to
12165f5
Compare
No description provided.