
ggml : add AVX support based on AVX2 code #1430

Merged: 4 commits, May 14, 2023
Conversation

katsu560 (Contributor)

I reopened a new PR on the latest master.
I added AVX support, based on the AVX2 code, to the functions below:
static inline __m256i bytes_from_bits_32(const uint8_t * x)
static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi)
static inline __m256 sum_i16_pairs_float(const __m128i xh, const __m128i xl)
static inline __m256 mul_sum_i8_pairs_float(const __m256i x, const __m256i y)
static void ggml_vec_dot_q4_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q5_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q5_1_q8_1(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)
static void ggml_vec_dot_q8_0_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy)

Performance improved:
with q5_0 model: 1084790.97 ms -> 156907.19 ms
with q5_1 model: 1072326.98 ms -> 301364.32 ms

@ggerganov and @sw, please review this PR.
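
For reference, here is a minimal sketch of how one of the helpers above, bytes_from_nibbles_32, can be written with AVX1-only (128-bit integer) intrinsics, assuming <immintrin.h> is included; this is illustrative and may not match the exact code in this PR:

static inline __m256i bytes_from_nibbles_32(const uint8_t * rsi)
{
    // load 16 bytes = 32 packed 4-bit nibbles
    const __m128i tmp = _mm_loadu_si128((const __m128i *) rsi);
    const __m128i lowMask = _mm_set1_epi8(0x0F);
    // low nibble of each byte
    const __m128i lo = _mm_and_si128(tmp, lowMask);
    // high nibble of each byte, shifted down
    const __m128i hi = _mm_and_si128(_mm_srli_epi16(tmp, 4), lowMask);
    // combine the two 128-bit halves into one 256-bit register (AVX1 intrinsics only)
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}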

@sw (Contributor) left a comment

@katsu560 Thank you, this is working fine.

However, for ggml_vec_dot_q4_1_q8_1 and ggml_vec_dot_q8_0_q8_0, only a single line differs between AVX and AVX2; I think those should be merged.

Also, please learn to use Git and PRs, and don't open a new PR for every modification on the same topic.

@prusnak (Collaborator) commented on May 13, 2023

> only a single line is different for AVX and AVX2, I think those should be merged.

It seems like you should be able to use the following trick if only one line differs:

#elif defined(__AVX__) || defined(__AVX2__)
//
// common code for both AVX1 and AVX2
//
#if defined(__AVX__)
//
// code specific for AVX1
//
#else // defined(__AVX2__)
//
// code specific for AVX2
//
#endif
//
// common code for both AVX1 and AVX2
//
#elif ...
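
To make the layout concrete, here is a small, self-contained sketch (a hypothetical helper, not code from ggml.c) that follows this pattern: only the 256-bit integer add, which needs AVX2, sits in the inner #if, while the setup and the float reduction are shared.

#include <immintrin.h>

#if defined(__AVX__) || defined(__AVX2__)
// hypothetical helper: add 1 to each of 8 packed 32-bit ints, then sum them as a float
static inline float add_one_and_sum(__m256i v) {
    // common code for both AVX1 and AVX2
    const __m256i one = _mm256_set1_epi32(1);
#if defined(__AVX2__)
    // AVX2 provides 256-bit integer addition directly
    v = _mm256_add_epi32(v, one);
#else // defined(__AVX__)
    // AVX1 has no 256-bit integer ops, so work on the two 128-bit halves
    const __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(v), _mm256_castsi256_si128(one));
    const __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(v, 1), _mm256_extractf128_si256(one, 1));
    v = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
#endif
    // common code for both AVX1 and AVX2: convert to float and reduce horizontally
    const __m256 f = _mm256_cvtepi32_ps(v);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(f), _mm256_extractf128_ps(f, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
#endif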

@sw (Contributor) commented on May 13, 2023

I don't know if it's really worthwhile to add SIMD optimizations for the quantization of the Q4, Q5 formats. These were removed with #1405. I think we'd want optimized dot-product routines, and the Q8 quantizations.

@ggerganov (Owner)

> I don't know if it's really worthwhile to add SIMD optimizations for the quantization of the Q4, Q5 formats

Yes, let's not complicate the implementation with Q4 and Q5 SIMD quantize / dequantize for now.
We can SIMD-ify these at a later stage, when we are confident which quantization formats will remain.

@katsu560 (Contributor, Author)

I pushed the code that merges the AVX2 and AVX paths.
Please review it and merge it into your code.

Regarding adding SIMD optimizations for the quantization of the Q4 and Q5 formats, I understand you are withholding those optimizations for now.

@katsu560 requested a review from @sw on May 14, 2023 at 09:38