AVX Q4_0 and Q8_0 sgemm #6891

Merged: 12 commits merged into ggml-org:master from sgemm-avx on May 8, 2024

Conversation

@netrunnereve (Collaborator)

As promised in #6414, here's a regular AVX implementation of the sgemm Q4_0 and Q8_0 kernels for Sandy Bridge and Ivy Bridge users. There's definitely some performance loss relative to the AVX2 version, since all integer operations have to be done with 128-bit SSE instructions, but the speedup over master is still decent and clearly visible when processing a long prompt without a GPU.
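
To illustrate the constraint: AVX-only chips lack the 256-bit integer intrinsics (e.g. _mm256_sign_epi8), so each integer step has to run on two 128-bit halves, and the results can only be combined in a 256-bit register through the float-domain VINSERTF128, which AVX1 does provide. A minimal sketch of the pattern (not the PR's exact code; the helper name is made up):

#include <immintrin.h>

// Multiply-accumulate 32 signed int8 pairs into 8 int32 lanes, AVX1-style:
// all integer work happens in 128-bit halves using SSSE3/SSE2.
static inline __m256i madd_i8_avx(__m128i a0, __m128i a1,
                                  __m128i b0, __m128i b1) {
    // _mm_maddubs_epi16 needs an unsigned first operand, so take |a| and
    // move a's sign onto b, the same trick the kernel uses via _mm_sign_epi8
    __m128i p0 = _mm_maddubs_epi16(_mm_sign_epi8(a0, a0), _mm_sign_epi8(b0, a0));
    __m128i p1 = _mm_maddubs_epi16(_mm_sign_epi8(a1, a1), _mm_sign_epi8(b1, a1));
    // widen adjacent int16 products to int32 sums
    const __m128i ones = _mm_set1_epi16(1);
    __m128i s0 = _mm_madd_epi16(p0, ones);
    __m128i s1 = _mm_madd_epi16(p1, ones);
    // only this final combine touches a 256-bit register (legal on AVX1)
    return _mm256_insertf128_si256(_mm256_castsi128_si256(s0), s1, 1);
}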

On my 4c/8t Xeon v2:

| model                  |     size | params | backend | threads |   test |         t/s | speedup |
| ---------------------- | -------: | -----: | ------- | ------: | -----: | ----------: | ------: |
| llama 7B Q4_0 (master) | 3.56 GiB | 6.74 B | CPU     |       8 | pp 512 | 5.99 ± 0.01 |         |
| llama 7B Q4_0 (PR)     | 3.56 GiB | 6.74 B | CPU     |       8 | pp 512 | 7.36 ± 0.01 |     23% |
| llama 7B Q8_0 (master) | 6.67 GiB | 6.74 B | CPU     |       8 | pp 512 | 6.03 ± 0.00 |         |
| llama 7B Q8_0 (PR)     | 6.67 GiB | 6.74 B | CPU     |       8 | pp 512 | 8.18 ± 0.00 |     36% |
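
For reference, a CPU-only pp 512 run like this can be produced with llama.cpp's llama-bench, e.g. `./llama-bench -m <model.gguf> -p 512 -n 0 -t 8` (the model path here is a placeholder).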

@cebtenzzre (Collaborator)

cc @jart

@jart (Contributor) left a review comment:

In that case LGTM. Thanks for hacking on this!

@jart (Contributor) commented on Apr 29, 2024:

Please sync to HEAD. This change should be merged before #6840 is finished with review, otherwise there will be additional conflicts.

@netrunnereve (Collaborator, Author)

As a test I tried using a single 256-bit load for q8_0 (since one 256-bit memory read followed by some processing might be faster than two 128-bit reads), but that actually turned out to be 8% slower than my original.

diff --git a/sgemm.cpp b/sgemm.cpp
index 40ba9d7..c07f338 100644
--- a/sgemm.cpp
+++ b/sgemm.cpp
@@ -732,10 +732,9 @@ class tinyBLAS_Q0_AVX {
                                              _mm256_sign_epi8(load(B + ldb * (jj + j) + l),
                                                               load(A + lda * (ii + i) + l)));
 #else
-                        __m128i ali0 = load0(A + lda * (ii + i) + l);
-                        __m128i ali1 = load1(A + lda * (ii + i) + l);
-                        __m128i blj0 = load0(B + ldb * (jj + j) + l);
-                        __m128i blj1 = load1(B + ldb * (jj + j) + l);
+                        __m128i ali0, ali1, blj0, blj1;
+                        load(A + lda * (ii + i) + l, &ali0, &ali1);
+                        load(B + ldb * (jj + j) + l, &blj0, &blj1);
 
                         __m128i sepAA0 = _mm_sign_epi8(ali0, ali0);
                         __m128i sepAA1 = _mm_sign_epi8(ali1, ali1);
@@ -763,26 +762,20 @@ class tinyBLAS_Q0_AVX {
         return _mm256_loadu_si256((const __m256i *)b->qs);
     }
 
-    inline __m128i load0(const block_q8_0 *b) {
-        return _mm_loadu_si128((const __m128i *)b->qs);
-    }
-
-    inline __m128i load1(const block_q8_0 *b) {
-        return _mm_loadu_si128(((const __m128i *)b->qs) + 1);
+    inline void load(const block_q8_0 *b, __m128i *r0, __m128i *r1) {
+        __m256i bl = _mm256_loadu_si256((const __m256i *)b->qs);
+        *r0 = _mm256_extractf128_si256(bl, 0);
+        *r1 = _mm256_extractf128_si256(bl, 1);
     }
 
     inline __m256i load(const block_q4_0 *b) {
         return _mm256_sub_epi8(denibble(b->qs), _mm256_set1_epi8(8));
     }
 
-    inline __m128i load0(const block_q4_0 *b) {
-        const __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
-        return _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), x), _mm_set1_epi8(8));
-    }
-
-    inline __m128i load1(const block_q4_0 *b) {
-        const __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
-        return _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), _mm_srli_epi16(x, 4)), _mm_set1_epi8(8));
+    inline void load(const block_q4_0 *b, __m128i *r0, __m128i *r1) {
+        __m128i x = _mm_loadu_si128((const __m128i *)(b->qs));
+        *r0 = _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), x), _mm_set1_epi8(8));
+        *r1 = _mm_sub_epi8(_mm_and_si128(_mm_set1_epi8(15), _mm_srli_epi16(x, 4)), _mm_set1_epi8(8));
     }
 
     inline __m256 updot(__m256i u, __m256i s) {

Anyway, this has been synced with master and is ready for review. The CI is failing because the SDE emulation is so slow that the test timed out.

@ggerganov merged commit 465263d into ggml-org:master on May 8, 2024 (58 checks passed).
@netrunnereve deleted the sgemm-avx branch on May 8, 2024 at 22:53.