Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. #68002

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

kosarev
Copy link
Collaborator

@kosarev kosarev commented Oct 2, 2023

No description provided.

; GFX10-NEXT: v_fmac_f32_e32 v1, v2, v0
; GFX10-NEXT: v_max_f32_e32 v0, 0, v1
; GFX10-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: _amdgpu_ps_main:
; GFX11: ; %bb.0: ; %.entry
; GFX11-NEXT: image_sample v[0:1], v[0:1], s[0:7], s[0:3] dmask:0x3 dim:SQ_RSRC_IMG_2D
; GFX11-NEXT: v_mov_b32_e32 v4, 0
; GFX11-NEXT: v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 0x3ca3d70a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a use of v7 after this. That seems strange.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see v_dual_mul_f32 v3, v4, v6 :: v_dual_fmamk_f32 v4, v5, 0x3c23d70a, v7 at line 129.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed it. It looks fine.

@@ -3250,9 +3250,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
MachineOperand *Src2 = getNamedOperand(UseMI, AMDGPU::OpName::src2);

// Multiplied part is the constant: Use v_madmk_{f16, f32}.
// We should only expect these to be on src0 due to canonicalization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the referenced canonicalization and should we perhaps fix that instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was added long ago in f078330. The tests there don't seem to use any instrinsics, so I guess the comment was referring to fmul/fadd canonicalisation as it was at the time. Matt @arsenm may know better.

The test file, madmk.ll, still exists, but it seems doesn't rely on that custom code anymore.

Canonicalisation does generally make sense to me, and we do canonicalise (fma c, x, y) to (fma x, c, y) in SDAGCombiner, but here we are at a much later stage dealing with concrete legalised instructions, and for V_FMAC_F16/F32 specificaly we have special code in SIInstrInfo::legalizeOperandsVOP3() that inserts an SGPR->VGPR COPY for src1. We then fold the immediate operand of the COPY to V_MOV_B32_e32 <imm> but do not fold that any further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.

@@ -7149,7 +7149,7 @@ define amdgpu_kernel void @udiv_i64_oddk_denom(ptr addrspace(1) %out, i64 %x) {
; GFX6-NEXT: v_mul_f32_e32 v0, 0x5f7ffffc, v0
; GFX6-NEXT: v_mul_f32_e32 v1, 0x2f800000, v0
; GFX6-NEXT: v_trunc_f32_e32 v1, v1
; GFX6-NEXT: v_mac_f32_e32 v0, 0xcf800000, v1
; GFX6-NEXT: v_madmk_f32 v0, v1, 0xcf800000, v0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point out any cases where this patch improves the generated code? It looks like it just replaces one VOP2 instruction with another VOP2 instruction. I guess in theory v_madmk_f32 is preferable to v_mac_f32_e32 because it gives the option of using different registers for dst and src2, but in practice it looks like that never actually happens, as far as I can see from the test updates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In llvm.log.ll we seem to be able to eliminate some register moves. (I was wondering it myself whether that folding logic is actually a dead code.)

 ; GFX1100-SDAG-NEXT:    s_waitcnt_depctr 0xfff
-; GFX1100-SDAG-NEXT:    v_fmac_f32_e32 v1, 0x3f317218, v0
-; GFX1100-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX1100-SDAG-NEXT:    v_mov_b32_e32 v0, v1
+; GFX1100-SDAG-NEXT:    v_fmamk_f32 v0, v0, 0x3f317218, v1
 ; GFX1100-SDAG-NEXT:    s_setpc_b64 s[30:31]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks.

Copy link
Contributor

@Sisyph Sisyph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

; GFX10-NEXT: v_fmac_f32_e32 v1, v2, v0
; GFX10-NEXT: v_max_f32_e32 v0, 0, v1
; GFX10-NEXT: ; return to shader part epilog
;
; GFX11-LABEL: _amdgpu_ps_main:
; GFX11: ; %bb.0: ; %.entry
; GFX11-NEXT: image_sample v[0:1], v[0:1], s[0:7], s[0:3] dmask:0x3 dim:SQ_RSRC_IMG_2D
; GFX11-NEXT: v_mov_b32_e32 v4, 0
; GFX11-NEXT: v_dual_mov_b32 v4, 0 :: v_dual_mov_b32 v7, 0x3ca3d70a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed it. It looks fine.

@@ -3250,9 +3250,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
MachineOperand *Src2 = getNamedOperand(UseMI, AMDGPU::OpName::src2);

// Multiplied part is the constant: Use v_madmk_{f16, f32}.
// We should only expect these to be on src0 due to canonicalization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking. The compiler appears to have not been emitting madmk for a while. I'm fine with this approach but can't comment on whether some other approach using canonicalization is possible or desirable.

@kosarev kosarev requested a review from jayfoad October 4, 2023 09:47
@@ -3266,18 +3269,22 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI,
if (pseudoToMCOpcode(NewOpc) == -1)
return false;

// We need to swap operands 0 and 1 since madmk constant is at operand 1.
// V_FMAMK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this part in a separate PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, #66202 handles the V_FMAAK_F16_t16 case above.

@kosarev kosarev requested a review from jayfoad October 5, 2023 10:18
Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@kosarev kosarev force-pushed the fold_src1_imms_in_fmas branch from deeb33d to 3f93118 Compare October 5, 2023 10:38
@kosarev kosarev force-pushed the fold_src1_imms_in_fmas branch from 3f93118 to 12165f5 Compare October 5, 2023 11:08
@kosarev kosarev merged commit f04aa1f into llvm:main Oct 5, 2023
@kosarev kosarev deleted the fold_src1_imms_in_fmas branch October 5, 2023 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants