AVX Q4_0 and Q8_0 sgemm #6891

Merged: 12 commits merged into ggml-org:master from sgemm-avx on May 8, 2024

Conversation

@netrunnereve (Collaborator)

As promised in #6414, here's a regular AVX implementation of the sgemm Q4_0 and Q8_0 kernels for Sandy Bridge and Ivy Bridge users. There's definitely some performance loss relative to the AVX2 version, since all integer operations have to be done with 128-bit SSE instructions, but the speedup over master is still decent and clearly visible when processing a long prompt without a GPU.
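
To illustrate the constraint: AVX-only chips lack the 256-bit integer intrinsics (e.g. _mm256_sign_epi8), so each integer step has to run on two 128-bit halves, and the results can only be combined in a 256-bit register through the float-domain VINSERTF128, which AVX1 does provide. A minimal sketch of the pattern (not the PR's exact code; the helper name is made up):

#include <immintrin.h>

// Multiply-accumulate 32 signed int8 pairs into 8 int32 lanes, AVX1-style:
// all integer work happens in 128-bit halves using SSSE3/SSE2.
static inline __m256i madd_i8_avx(__m128i a0, __m128i a1,
                                  __m128i b0, __m128i b1) {
    // _mm_maddubs_epi16 needs an unsigned first operand, so take |a| and
    // move a's sign onto b, the same trick the kernel uses via _mm_sign_epi8
    __m128i p0 = _mm_maddubs_epi16(_mm_sign_epi8(a0, a0), _mm_sign_epi8(b0, a0));
    __m128i p1 = _mm_maddubs_epi16(_mm_sign_epi8(a1, a1), _mm_sign_epi8(b1, a1));
    // widen adjacent int16 products to int32 sums
    const __m128i ones = _mm_set1_epi16(1);
    __m128i s0 = _mm_madd_epi16(p0, ones);
    __m128i s1 = _mm_madd_epi16(p1, ones);
    // only this final combine touches a 256-bit register (legal on AVX1)
    return _mm256_insertf128_si256(_mm256_castsi128_si256(s0), s1, 1);
}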

On my 4c/8t Xeon v2:

| model                  |     size | params | backend | threads |   test |         t/s | speedup |
| ---------------------- | -------: | -----: | ------- | ------: | -----: | ----------: | ------: |
| llama 7B Q4_0 (master) | 3.56 GiB | 6.74 B | CPU     |       8 | pp 512 | 5.99 ± 0.01 |         |
| llama 7B Q4_0 (PR)     | 3.56 GiB | 6.74 B | CPU     |       8 | pp 512 | 7.36 ± 0.01 |     23% |
| llama 7B Q8_0 (master) | 6.67 GiB | 6.74 B | CPU     |       8 | pp 512 | 6.03 ± 0.00 |         |
| llama 7B Q8_0 (PR)     | 6.67 GiB | 6.74 B | CPU     |       8 | pp 512 | 8.18 ± 0.00 |     36% |
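
For reference, a CPU-only pp 512 run like this can be produced with llama.cpp's llama-bench, e.g. `./llama-bench -m <model.gguf> -p 512 -n 0 -t 8` (the model path here is a placeholder).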

@cebtenzzre (Collaborator)

cc @jart

@jart (Contributor) left a review comment:

In that case LGTM. Thanks for hacking on this!

@jart (Contributor) commented on Apr 29, 2024:

Please sync to HEAD. This change should be merged before #6840 is finished with review, otherwise there will be additional conflicts.

@netrunnereve (Collaborator, Author)

As a test I tried using a single 256-bit load for q8_0 (since one 256-bit memory read followed by some processing might be faster than two 128-bit reads), but that actually turned out to be 8% slower than my original.

diff --git a/sgemm.cpp b/sgemm.cpp
index 40ba9d7..c07f338 100644
--- a/sgemm.cpp
+++ b/sgemm.cpp
@@ -732,10 +732,9 @@ class tinyBLAS_Q0_AVX {
                                              _mm256_sign_epi8(load(B + ldb * (jj + j) + l),
                                                               load(A + lda * (ii + i) + l)));
 #else
-                        __m128i ali0 = load0(A + lda * (ii + i) + l);
-                        __m128i ali1 = load1(A + lda * (ii + i) + l);
-                        __m128i blj0 = load0(B + ldb * (jj + j) + l);
-                        __m128i blj1 = load1(B + ldb * (jj + j) + l);
+                        __m128i ali0, ali1, blj0, blj1;
+                        load(A + lda * (ii + i) + l, &ali0, &ali1);
+                        load(B + ldb * (jj + j) + l, &blj0, &blj1);
 
                         __m128i sepAA0 = _mm_sign_epi8(ali0, ali0);
                         __m128i sepAA1 = _mm_sign_epi8(ali1, ali1);
@@ -763,26 +762,20 @@ class tinyBLAS_Q0_AVX {
         return _mm256_loadu_si256((const __m256i *)b->qs);
     }
 
-    inline __m128i load0(const block_q8_0 *b) {
-        return _mm_loadu_si128((const __m128i *)b->qs);
-    }
-
-    inline __m128i load1(const block_q8_0 *b) {
-        return _mm_loadu_si128(((const __m128i *)b->qs) + 1);
+    inline void load(const block_q8_0 *b, __m128i *r0, __m128i *r1) {
+        __m256i bl = _mm256_loadu_si256((const __m256i *)b->qs);
+        *r0 = _mm256_extractf128_si256(bl, 0);
+        *r1 = _mm256_extractf128_si256(bl, 1);
     }
 
     inline __m256i load(const block_q4_0 *b) {
         return _mm256_sub_epi8(denibble(b->qs), _mm256_set1_epi8(8));
     }
 
-    inline __m128i load0(const block_q4_0 *b) {
-        const __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
-        return _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), x), _mm_set1_epi8(8));
-    }
-
-    inline __m128i load1(const block_q4_0 *b) {
-        const __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
-        return _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), _mm_srli_epi16(x, 4)), _mm_set1_epi8(8));
+    inline void load(const block_q4_0 *b, __m128i *r0, __m128i *r1) {
+        __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
+        *r0 = _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), x), _mm_set1_epi8(8));
+        *r1 = _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), _mm_srli_epi16(x, 4)), _mm_set1_epi8(8));
     }
 
     inline __m256 updot(__m256i u, __m256i s) {

Anyway, this has been synced with master and is ready for review. The CI is failing because the SDE emulation is so slow that the test timed out.

@ggerganov merged commit 465263d into ggml-org:master on May 8, 2024 (58 checks passed).
@netrunnereve deleted the sgemm-avx branch on May 8, 2024 at 22:53.