
optimize for ppc64le using VSX intrinsics #784

Merged (5 commits) on May 12, 2024

Conversation

penghongbo (Contributor)

I would like to contribute to ggml by improving performance on ppc64le. Here is the updated code using VSX intrinsics. Please review. Thank you.

penghongbo (Contributor, Author)

@ggerganov would you please review this PR or assign reviewers? I work for IBM, and we would like to have this optimization on ppc64le. Thank you.

penghongbo (Contributor, Author)

@ggerganov is this your preferred place to open PRs for enhancements and optimizations, or shall I open the PR in llama.cpp and have you sync it to this repository?

ggerganov (Member) left a comment

Hey, thanks, and sorry for the delay. This implementation fits the existing pattern, and it seems all the new code is behind the `__POWER9_VECTOR__` define, which is great.

I don't have the hardware to test this - can you share some sample numbers for the speed-up that you observe?

penghongbo (Contributor, Author)

Here are the vec_dot_q float32 throughput speedups relative to the current master branch, measured on RHEL 9.2 (Power10 machine) with `test-quantize-perf -i 10000`. The code was compiled with gcc 12.2.1.

| Type | q4_0 | q4_1 | q5_0 | q5_1 | q8_0 | q2_K | q3_K | q4_K | q5_K | q6_K | iq3_xxs | iq4_nl | iq3_s | iq2_s | iq4_xs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speedup | 4.99 | 3.39 | 4.27 | 3.17 | 1.37 | 7.70 | 4.56 | 5.26 | 4.78 | 5.14 | 3.57 | 5.97 | 3.61 | 4.91 | 6.99 |

The float32 throughput of q8_0 in the quantize_row_q and quantize_row_q_dot functions also gets a speedup of around 4.81.

I also ran `test-quantize-fns` and verified the code in llama.cpp to check functionality. All tests passed in the environment described above.

Thanks for your review and approval.

@ggerganov ggerganov merged commit 9149580 into ggml-org:master May 12, 2024
4 checks passed