[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. #68002

kosarev · 2023-10-02T16:19:16Z

No description provided.

Sisyph · 2023-10-02T17:14:29Z

llvm/test/CodeGen/AMDGPU/dagcombine-fma-fmad.ll

 ; GFX10-NEXT:    v_fmac_f32_e32 v1, v2, v0
 ; GFX10-NEXT:    v_max_f32_e32 v0, 0, v1
 ; GFX10-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: _amdgpu_ps_main:
 ; GFX11:       ; %bb.0: ; %.entry
 ; GFX11-NEXT:    image_sample v[0:1], v[0:1], s[0:7], s[0:3] dmask:0x3 dim:SQ_RSRC_IMG_2D
-; GFX11-NEXT:    v_mov_b32_e32 v4, 0
+; GFX11-NEXT:    v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 0x3ca3d70a


I don't see a use of v7 after this. That seems strange.

I can see v_dual_mul_f32 v3, v4, v6 :: v_dual_fmamk_f32 v4, v5, 0x3c23d70a, v7 at line 129.

Thanks, I missed it. It looks fine.

Sisyph · 2023-10-02T17:16:29Z

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ -3250,9 +3250,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
    MachineOperand *Src2 = getNamedOperand(UseMI, AMDGPU::OpName::src2);

    // Multiplied part is the constant: Use v_madmk_{f16, f32}.
-    // We should only expect these to be on src0 due to canonicalization.


What is the referenced canonicalization and should we perhaps fix that instead?

The comment was added long ago in f078330. The tests there don't seem to use any intrinsics, so I guess the comment was referring to fmul/fadd canonicalisation as it was at the time. Matt @arsenm may know better.

The test file, madmk.ll, still exists, but it doesn't seem to rely on that custom code anymore.

Canonicalisation does generally make sense to me, and we do canonicalise (fma c, x, y) to (fma x, c, y) in SDAGCombiner, but here we are at a much later stage dealing with concrete legalised instructions, and for V_FMAC_F16/F32 specifically we have special code in SIInstrInfo::legalizeOperandsVOP3() that inserts an SGPR->VGPR COPY for src1. We then fold the immediate operand of the COPY to V_MOV_B32_e32 <imm> but do not fold that any further.

Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.
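For readers following the thread, the idea of accepting the constant in either multiplicand slot can be sketched outside LLVM. This is a minimal model, not the code in SIInstrInfo::FoldImmediate(): the `Operand`, `Fmac` and `Fmamk` structs and the `foldMultiplicandImm` helper are hypothetical names invented for illustration, and the sketch ignores opcode selection, register classes and the f16 variants. It assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified operand: either a literal or a register number.
struct Operand {
  bool IsImm;
  uint32_t Value; // literal bits if IsImm, register number otherwise
};

// Simplified V_FMAC: dst = src0 * src1 + dst (dst doubles as the addend).
struct Fmac {
  Operand Src0, Src1;
  uint32_t DstReg;
};

// Simplified V_FMAMK: dst = src0 * K + src2, where K is a literal.
struct Fmamk {
  uint32_t DstReg, Src0Reg, K, Src2Reg;
};

// Fold a literal multiplicand into the K slot. The literal may sit in
// src0 or in src1; the other multiplicand must be a register.
std::optional<Fmamk> foldMultiplicandImm(const Fmac &MI) {
  const Operand *Imm = nullptr;
  const Operand *Reg = nullptr;
  if (MI.Src0.IsImm && !MI.Src1.IsImm) {
    Imm = &MI.Src0;
    Reg = &MI.Src1;
  } else if (MI.Src1.IsImm && !MI.Src0.IsImm) {
    Imm = &MI.Src1;
    Reg = &MI.Src0;
  } else {
    return std::nullopt; // nothing to fold, or both operands are literals
  }
  assert(Imm->IsImm && !Reg->IsImm);
  // The register multiplicand always ends up in src0; the addend stays dst.
  return Fmamk{MI.DstReg, Reg->Value, Imm->Value, MI.DstReg};
}
```

With the literal in src0, as in `v_mac_f32_e32 v0, 0xcf800000, v1`, the model yields dst 0, src0 1, K 0xcf800000, src2 0, which matches the shape of `v_madmk_f32 v0, v1, 0xcf800000, v0` in the updated test. The src1 case produces the same layout, which is the point of dropping the src0-only assumption.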

jayfoad · 2023-10-02T18:44:31Z

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll

@@ -7149,7 +7149,7 @@ define amdgpu_kernel void @udiv_i64_oddk_denom(ptr addrspace(1) %out, i64 %x) {
 ; GFX6-NEXT:    v_mul_f32_e32 v0, 0x5f7ffffc, v0
 ; GFX6-NEXT:    v_mul_f32_e32 v1, 0x2f800000, v0
 ; GFX6-NEXT:    v_trunc_f32_e32 v1, v1
-; GFX6-NEXT:    v_mac_f32_e32 v0, 0xcf800000, v1
+; GFX6-NEXT:    v_madmk_f32 v0, v1, 0xcf800000, v0


Can you point out any cases where this patch improves the generated code? It looks like it just replaces one VOP2 instruction with another VOP2 instruction. I guess in theory v_madmk_f32 is preferable to v_mac_f32_e32 because it gives the option of using different registers for dst and src2, but in practice it looks like that never actually happens, as far as I can see from the test updates.

In llvm.log.ll we seem to be able to eliminate some register moves. (I was wondering myself whether that folding logic is actually dead code.)

 ; GFX1100-SDAG-NEXT:    s_waitcnt_depctr 0xfff
-; GFX1100-SDAG-NEXT:    v_fmac_f32_e32 v1, 0x3f317218, v0
-; GFX1100-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1100-SDAG-NEXT:    v_mov_b32_e32 v0, v1
+; GFX1100-SDAG-NEXT:    v_fmamk_f32 v0, v0, 0x3f317218, v1
 ; GFX1100-SDAG-NEXT:    s_setpc_b64 s[30:31]

Looks good, thanks.
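To see why the llvm.log.ll change is safe, the two sequences can be modelled on a pair of float registers. This is only a sketch under simplifying assumptions: `Regs`, `bitsToFloat`, `runFmacThenMov` and `runFmamk` are hypothetical names, `std::fma` stands in for the hardware fused multiply-add, and denormal and rounding modes are ignored.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float; the test literals are raw bits.
float bitsToFloat(uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof F);
  return F;
}

// Two VGPRs are enough to model the sequences from llvm.log.ll.
struct Regs {
  float V0, V1;
};

// Old sequence: v_fmac_f32_e32 v1, K, v0 ; v_mov_b32_e32 v0, v1
Regs runFmacThenMov(Regs R, float K) {
  R.V1 = std::fma(K, R.V0, R.V1); // dst is tied to the addend, so v1 is overwritten
  R.V0 = R.V1;                    // extra move into the return register
  return R;
}

// New sequence: v_fmamk_f32 v0, v0, K, v1
Regs runFmamk(Regs R, float K) {
  R.V0 = std::fma(R.V0, K, R.V1); // dst is independent of the addend
  return R;
}
```

For any input, both functions leave the same value in `V0`, since the multiplicands of a fused multiply-add commute. The difference is that the old sequence also clobbers `V1`, while the new one leaves it intact, which is what lets the move and the `s_delay_alu` that precedes it go away.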

Sisyph

LGTM

Sisyph · 2023-10-03T14:42:56Z

Thanks, I missed it. It looks fine.

Sisyph · 2023-10-03T16:41:26Z

Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.

jayfoad · 2023-10-05T06:59:11Z

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ -3266,18 +3269,22 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
      if (pseudoToMCOpcode(NewOpc) == -1)
        return false;

-      // We need to swap operands 0 and 1 since madmk constant is at operand 1.
+      // V_FMAMK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite


Isn't this part in a separate PR?

No, #66202 handles the V_FMAAK_F16_t16 case above.

jayfoad

LGTM, thanks.

kosarev requested review from jayfoad, arsenm, Sisyph and rampitec October 2, 2023 16:19

llvmbot added the backend:AMDGPU label Oct 2, 2023

Sisyph reviewed Oct 2, 2023

jayfoad reviewed Oct 2, 2023

kosarev mentioned this pull request Oct 3, 2023

[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. #66202

Merged

kosarev requested review from Sisyph and jayfoad October 3, 2023 14:21

Sisyph approved these changes Oct 3, 2023

jayfoad reviewed Oct 3, 2023

kosarev requested a review from jayfoad October 4, 2023 09:47

jayfoad reviewed Oct 5, 2023

kosarev requested a review from jayfoad October 5, 2023 10:18

jayfoad approved these changes Oct 5, 2023

kosarev force-pushed the fold_src1_imms_in_fmas branch from deeb33d to 3f93118 Compare October 5, 2023 10:38

[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/F…

12165f5

…MAC.

kosarev force-pushed the fold_src1_imms_in_fmas branch from 3f93118 to 12165f5 Compare October 5, 2023 11:08

kosarev merged commit f04aa1f into llvm:main Oct 5, 2023

kosarev deleted the fold_src1_imms_in_fmas branch October 5, 2023 11:23

stepthomas mentioned this pull request Oct 10, 2023

AMDGPU stepthomas atomic csub no rtn forms ver2 stepthomas/llvm-project#1

Closed
