[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

daisy202309 · 2023-09-04T17:23:52Z

No description provided.

efriedma-quic

Please note in the commit message whether this is actually a visible miscompile; at first glance, multiplication is commutative.

Assuming it is a miscompile, please verify that using the intrinsics from clang produces the correct result.

Otherwise LGTM

daisy202309 · 2023-09-05T10:21:55Z

Please note in the commit message whether this is actually a visible miscompile; at first glance, multiplication is commutative.

Assuming it is a miscompile, please verify that using the intrinsics from clang produces the correct result.

Otherwise LGTM

For the fcmla instruction with the rot value of 90 or 270, the results obtained by exchanging the positions of the last two parameters are different.

Here is an example:

#include<arm_neon.h>
#include<stdio.h>

float64x2_t test_rot0(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot90(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot90_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot180(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot180_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot270(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot270_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

int main() {
  float64_t acc[] = {100, 100};
  float64_t lhs[] = {10, 10};
  float64_t rhs[] = {77, 77};
  float64_t r0[] = {1000, 1000};
  float64_t r90[] = {1000, 1000};
  float64_t r180[] = {1000, 1000};
  float64_t r270[] = {1000, 1000};

  vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 1);

  vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 1);

  vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 1);

  vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 1);

  printf("r0: %lf %lf\n", r0[0], r0[1]);
  printf("r90: %lf %lf\n", r90[0], r90[1]);
  printf("r180: %lf %lf\n", r180[0], r180[1]);
  printf("r270: %lf %lf\n", r270[0], r270[1]);
}

The result:

r0: 870.000000 870.000000
r90: -670.000000 870.000000
r180: -670.000000 -670.000000
r270: 870.000000 -670.000000

I'm not sure whether I need add this test case in my patch.

efriedma-quic · 2023-09-08T06:01:34Z

I don't think we need an end-to-end testcase; it should be sufficient to have testcases for clang->llvm IR and llvm IR->asm. So should be fine as-is.

davemgreen · 2023-09-08T06:19:45Z

The test case you have is testing different lanes from the vcmla instrinsics, not just commutativity. The different lanes would be expected to be different. The lane indices would need to be the same to test commutivity:

  vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 0);

  vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 0);

  vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 0);

  vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 0);

When you say "Fix wrong operand order" is there a bug here? And if so can you explain where. The same change may need to be applied for ARM too, but I was under the impression that enough testing had been done to catch problems like this (baring maybe some edge cases with Nan's and whatnot)

daisy202309 · 2023-09-08T14:26:57Z

The test case you have is testing different lanes from the vcmla instrinsics, not just commutativity. The different lanes would be expected to be different. The lane indices would need to be the same to test commutivity:
  vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 0);

  vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 0);

  vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 0);

  vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs) , 0);
  vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 0);
When you say "Fix wrong operand order" is there a bug here? And if so can you explain where. The same change may need to be applied for ARM too, but I was under the impression that enough testing had been done to catch problems like this (baring maybe some edge cases with Nan's and whatnot)

I meet a run time error in SPECCPU2006 433 because getting a wrong calculation result. And I can get a right result by adding this patch or close the optimization of complex-deinterleaving.
The order of (InputA, inputB) is used in other places of this function, only this place is in the order of (InputB, InputA). I guess this is a clerical error. I tried to build a small example to reproduce this error, but I have not succeeded so far, and the actual scenario is more complicated.

davemgreen · 2023-09-12T21:47:40Z

Is the milc compiled with -Ofast or without fast math? I don't think we've seen the same thing here.

From what I can tell this this should just change from one complex multiply to another. It will be different but I'm not sure it's better or worse. For example with a normal complex mul:

vcmla A, B, C, #0
vcmla A, B, C, #90

would be:

Ar = ((Ar + Br.Cr) - Bi.Ci)
Ai = ((Ai + Br.Ci) + Bi.Cr)

vs, if the operands were the other way:

Ar = ((Ar + Cr.Br) - Ci.Bi)
Ai = ((Ai + Cr.Bi) + Ci.Br)

The brackets show where the fused multiply-accumulates happen. So there will be slightly different rounding between the two versions, but neither match the original exactly.

The new order sounds like a more natural order, so I'm not against it. It might just be that milc is a little susceptible to fast math rounding differences.

…sic (llvm#65278) Co-authored-by: lizhijin <[email protected]>

daisy202309 requested a review from a team as a code owner September 4, 2023 17:23

daisy202309 marked this pull request as draft September 4, 2023 17:32

daisy202309 marked this pull request as ready for review September 4, 2023 17:32

efriedma-quic approved these changes Sep 5, 2023

View reviewed changes

davemgreen requested review from igogo-x86 and NickGuy-Arm September 5, 2023 07:19

igogo-x86 approved these changes Sep 5, 2023

View reviewed changes

github-actions bot added the backend:AArch64 label Sep 8, 2023

daisy202309 closed this Sep 8, 2023

daisy202309 force-pushed the main branch from 140e07a to 2c93e3c Compare September 8, 2023 13:54

daisy202309 reopened this Sep 8, 2023

daisy202309 force-pushed the main branch 2 times, most recently from 8cdfd48 to 2593952 Compare September 12, 2023 16:03

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic

b910732

daisy202309 force-pushed the main branch from 2593952 to b910732 Compare September 13, 2023 14:34

wweiandrew merged commit 9ef15f4 into llvm:main Sep 14, 2023

kstoimenov pushed a commit to kstoimenov/llvm-project that referenced this pull request Sep 14, 2023

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrin…

ed36b24

…sic (llvm#65278) Co-authored-by: lizhijin <[email protected]>

This was referenced Sep 14, 2023

[LSAN][NFC] Add a new line to a log kstoimenov/llvm-project#5

Closed

[LSAN][NFC] Add a new line to a log kstoimenov/llvm-project#6

Closed

ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrin…

c6da1e1

…sic (llvm#65278) Co-authored-by: lizhijin <[email protected]>

ellishg mentioned this pull request Jul 25, 2024

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrin… swiftlang/llvm-project#9012

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

daisy202309 commented Sep 4, 2023

efriedma-quic left a comment

daisy202309 commented Sep 5, 2023 •

edited

Loading

efriedma-quic commented Sep 8, 2023

davemgreen commented Sep 8, 2023

daisy202309 commented Sep 8, 2023 •

edited

Loading

davemgreen commented Sep 12, 2023

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

Conversation

daisy202309 commented Sep 4, 2023

efriedma-quic left a comment

Choose a reason for hiding this comment

daisy202309 commented Sep 5, 2023 • edited Loading

efriedma-quic commented Sep 8, 2023

davemgreen commented Sep 8, 2023

daisy202309 commented Sep 8, 2023 • edited Loading

davemgreen commented Sep 12, 2023

daisy202309 commented Sep 5, 2023 •

edited

Loading

daisy202309 commented Sep 8, 2023 •

edited

Loading