-
Notifications
You must be signed in to change notification settings - Fork 12.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. #66202
Conversation
…rand legalization. V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating operands' register classes leads to MachineVerifier complaining about the classes not matching instruction definitions. The problem only reveals itself of builds with expensive checks enabled because of missing -verify-machineinstrs in the test. This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0, partially reverted in eaf737a.
@llvm/pr-subscribers-backend-amdgpu ChangesV_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating operands' register classes leads to MachineVerifier complaining about the classes not matching instruction definitions. The problem only reveals itself of builds with expensive checks enabled because of missing -verify-machineinstrs in the test.This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0, partially reverted in eaf737a.Full diff: https://github.com/llvm/llvm-project/pull/66202.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp index 38b5e0114903cdf..a887f7567d76c89 100644 --- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp +++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp @@ -3318,6 +3318,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, if (pseudoToMCOpcode(NewOpc) == -1) return false; + // V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite + // would also require restricting their register classes. For now + // just bail out. + if (NewOpc == AMDGPU::V_FMAAK_F16_t16) + return false; + const int64_t Imm = ImmOp->getImm(); // FIXME: This would be a lot easier if we could return a new instruction diff --git a/llvm/test/CodeGen/AMDGPU/fma.f16.ll b/llvm/test/CodeGen/AMDGPU/fma.f16.ll index 23971f2b681cb0c..e8423ce9fbc36a2 100644 --- a/llvm/test/CodeGen/AMDGPU/fma.f16.ll +++ b/llvm/test/CodeGen/AMDGPU/fma.f16.ll @@ -1,8 +1,10 @@ ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py -; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG -; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL -; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG -; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL +; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-SDAG +; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9,GFX9-GISEL +; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-SDAG +; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10,GFX10-GISEL +; RUN: llc -global-isel=0 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-SDAG +; RUN: llc -global-isel=1 -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX11,GFX11-GISEL declare half @llvm.fma.f16(half, half, half) declare half @llvm.maxnum.f16(half, half) @@ -19,6 +21,12 @@ define half @test_fma(half %x, half %y, half %z) { ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) ; GFX10-NEXT: v_fma_f16 v0, v0, v1, v2 ; GFX10-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-LABEL: test_fma: +; GFX11: ; %bb.0: +; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-NEXT: v_fma_f16 v0, v0, v1, v2 +; GFX11-NEXT: s_setpc_b64 s[30:31] %r = call half @llvm.fma.f16(half %x, half %y, half %z) ret half %r } @@ -36,6 +44,12 @@ define half @test_fmac(half %x, half %y, half %z) { ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) ; GFX10-NEXT: v_fmac_f16_e32 v0, v1, v2 ; GFX10-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-LABEL: test_fmac: +; GFX11: ; %bb.0: +; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-NEXT: v_fmac_f16_e32 v0, v1, v2 +; GFX11-NEXT: s_setpc_b64 s[30:31] %r = call half @llvm.fma.f16(half %y, half %z, half %x) ret half %r } @@ -61,6 +75,12 @@ define half @test_fmaak(half %x, half %y, half %z) { ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) ; GFX10-NEXT: v_fmaak_f16 v0, v0, v1, 0x4200 ; GFX10-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-LABEL: test_fmaak: +; GFX11: ; %bb.0: +; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-NEXT: v_fmaak_f16 v0, v0, v1, 0x4200 +; GFX11-NEXT: s_setpc_b64 s[30:31] %r = call half @llvm.fma.f16(half %x, half %y, half 0xH4200) ret half %r } @@ -86,6 +106,12 @@ define half @test_fmamk(half %x, half %y, half %z) { ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) ; GFX10-NEXT: v_fmamk_f16 v0, v0, 0x4200, v2 ; GFX10-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-LABEL: test_fmamk: +; GFX11: ; %bb.0: +; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-NEXT: v_fmamk_f16 v0, v0, 0x4200, v2 +; GFX11-NEXT: s_setpc_b64 s[30:31] %r = call half @llvm.fma.f16(half %x, half 0xH4200, half %z) ret half %r } @@ -139,6 +165,33 @@ define i32 @test_D139469_f16(half %arg) { ; GFX10-GISEL-NEXT: s_or_b32 s4, vcc_lo, s4 ; GFX10-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s4 ; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-SDAG-LABEL: test_D139469_f16: +; GFX11-SDAG: ; %bb.0: ; %bb +; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-SDAG-NEXT: v_mov_b32_e32 v1, 0x211e +; GFX11-SDAG-NEXT: v_mul_f16_e32 v2, 0x291e, v0 +; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1) +; GFX11-SDAG-NEXT: v_fmac_f16_e32 v1, 0x291e, v0 +; GFX11-SDAG-NEXT: v_min_f16_e32 v0, v2, v1 +; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) +; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v0 +; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc_lo +; GFX11-SDAG-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-GISEL-LABEL: test_D139469_f16: +; GFX11-GISEL: ; %bb.0: ; %bb +; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-GISEL-NEXT: s_movk_i32 s0, 0x291e +; GFX11-GISEL-NEXT: v_mul_f16_e32 v1, 0x291e, v0 +; GFX11-GISEL-NEXT: v_fmaak_f16 v0, s0, v0, 0x211e +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2) +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1 +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s0, 0, v0 +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(SALU_CYCLE_1) +; GFX11-GISEL-NEXT: s_or_b32 s0, vcc_lo, s0 +; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0 +; GFX11-GISEL-NEXT: s_setpc_b64 s[30:31] bb: %i = fmul contract half %arg, 0xH291E %i1 = fcmp olt half %i, 0xH0000 @@ -213,6 +266,44 @@ define <2 x i32> @test_D139469_v2f16(<2 x half> %arg) { ; GFX10-GISEL-NEXT: s_or_b32 s4, s6, s5 ; GFX10-GISEL-NEXT: v_cndmask_b32_e64 v1, 0, 1, s4 ; GFX10-GISEL-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-SDAG-LABEL: test_D139469_v2f16: +; GFX11-SDAG: ; %bb.0: ; %bb +; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-SDAG-NEXT: s_movk_i32 s0, 0x211e +; GFX11-SDAG-NEXT: v_pk_mul_f16 v1, 0x291e, v0 op_sel_hi:[0,1] +; GFX11-SDAG-NEXT: v_pk_fma_f16 v0, 0x291e, v0, s0 op_sel_hi:[0,1,0] +; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1) +; GFX11-SDAG-NEXT: v_pk_min_f16 v0, v1, v0 +; GFX11-SDAG-NEXT: v_lshrrev_b32_e32 v1, 16, v0 +; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v0 +; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc_lo +; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_3) +; GFX11-SDAG-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1 +; GFX11-SDAG-NEXT: v_cndmask_b32_e64 v1, 0, 1, vcc_lo +; GFX11-SDAG-NEXT: s_setpc_b64 s[30:31] +; +; GFX11-GISEL-LABEL: test_D139469_v2f16: +; GFX11-GISEL: ; %bb.0: ; %bb +; GFX11-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) +; GFX11-GISEL-NEXT: s_mov_b32 s0, 0x291e291e +; GFX11-GISEL-NEXT: v_pk_mul_f16 v1, v0, 0x291e op_sel_hi:[1,0] +; GFX11-GISEL-NEXT: v_pk_fma_f16 v0, v0, s0, 0x211e op_sel_hi:[1,1,0] +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2) +; GFX11-GISEL-NEXT: v_lshrrev_b32_e32 v2, 16, v1 +; GFX11-GISEL-NEXT: v_lshrrev_b32_e32 v3, 16, v0 +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e32 vcc_lo, 0, v1 +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s0, 0, v0 +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4) +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s1, 0, v2 +; GFX11-GISEL-NEXT: v_cmp_gt_f16_e64 s2, 0, v3 +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(SALU_CYCLE_1) +; GFX11-GISEL-NEXT: s_or_b32 s0, vcc_lo, s0 +; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0 +; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(SALU_CYCLE_1) +; GFX11-GISEL-NEXT: s_or_b32 s0, s1, s2 +; GFX11-GISEL-NEXT: v_cndmask_b32_e64 v1, 0, 1, s0 +; GFX11-GISEL-NEXT: s_setpc_b64 s[30:31] bb: %i = fmul contract <2 x half> %arg, <half 0xH291E, half 0xH291E> %i1 = fcmp olt <2 x half> %i, <half 0xH0000, half 0xH0000> |
I don't understand. Is there some mismatch between the way the t16 forms of V_FMA/FMAC_* are defined, vs the t16 form of V_FMAAK_F16 ? What is the exact opcode that was rewritten to V_FMAAK_F16_t16 in the failing case? |
@@ -3318,6 +3318,12 @@ bool SIInstrInfo::FoldImmediate(MachineInstr &UseMI, MachineInstr &DefMI, | |||
if (pseudoToMCOpcode(NewOpc) == -1) | |||
return false; | |||
|
|||
// V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands, so the rewrite | |||
// would also require restricting their register classes. For now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it is of any practical value. For fake True16 instructions, it seems we already have enough support and generally don't want to invest more in that. For real True16 instructions that new code wouldn't be used anyway; VGPR_16_Lo128 is not even allocatable downstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the reasoning for real True16 is right, but the conclusion we don't want to shrink is right. VGPR_16_Lo128 (or something derived from it ) would still be used downstream.
dag InOperandList = (ins VSrcT_f16_Lo128_Deferred:$src0, VGPRSrc_16_Lo128:$src1, KImmFP16:$imm);
We don't want to shrink pre-RA because it would restrict the register allocation to only use the bottom 128 vgprs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks irrelevant to me. The comment aims to explain why we bail out for fake True16 and VGPR_32_Lo128
, not VGPR_16_Lo128
. The instruction in the condition is supposed to be changed to V_FMAAK_F16_fake16
when we have one, and then removed along with the rest of the fake16 support.
Oh I see, it's because we are folding a VOP3 instruction (V_FMAC_F16_t16_e64) to V_FMAAK_F16_t16 which is VOP2 only. |
Yes. Which I understand makes the shrinking pass the right place to rewrite to V_FMAAK_F16_t16, should we want that. |
Unfortunately it is not that simple. It would be nice if folding immediates and shrinking VOP3 to VOP2 were orthogonal and could be implemented in separate passes, but this is one of the awkward cases where there is some interaction - you can only fold if you also shrink. In our compiler we do folding first and then shrinking. So the options are:
|
Ping. |
Is it true that this function FoldImmediate is only called pre-RA? It seems from my cursory check just now that is the case. If so, then I agree that bailing out on FMAAK_F16_t16 and FMAMK_F16_t16 is correct. If not, then I think we would want to check the assigned registers similar to what is done in SIShrinkInstructions::shouldShrinkTrue16. |
Yes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original instructions would have VGPR_32 operands. Switching the opcodes without updating operands' register classes leads to MachineVerifier complaining about the classes not matching instruction definitions. The problem only reveals itself of builds with expensive checks enabled because of missing -verify-machineinstrs in the test.
This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for GFX11, following the second attempt in a1e38e0, partially reverted in eaf737a.