AVX Q4_0 and Q8_0 sgemm #6891
cc @jart
In that case LGTM. Thanks for hacking on this!
Please sync to HEAD. This change should be merged before #6840 is finished with review, otherwise there will be additional conflicts.
As a test, I tried using a single 256-bit load for q8_0, on the theory that one 256-bit memory read followed by some processing might be faster than two 128-bit reads, but that actually turned out to be 8% slower than my original.
Anyway, this has been synced with master and is ready for review. The CI is failing because the SDE emulation is so slow that the test timed out.
As promised in #6414, here's a regular AVX implementation of the sgemm Q4_0 and Q8_0 kernels for Sandy Bridge and Ivy Bridge users. There definitely is a performance loss, since I have to use 128-bit SSE instructions for all integer operations, but the speedup is still decent and visible if you're loading a long prompt without a GPU.
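For readers unfamiliar with the constraint: plain AVX has 256-bit floating-point operations, but 256-bit integer arithmetic only arrived with AVX2, so a 32-value Q8_0 block has to be processed as two 16-byte halves. The sketch below illustrates that idea only and is not the code in this PR. `block_q8_sketch`, `block_idot`, and `vec_dot_q8_sketch` are hypothetical names, the scale is stored as `float` rather than ggml's fp16, and it restricts itself to SSE2 intrinsics (with a scalar fallback) so it builds without extra flags, whereas a real AVX build could use SSSE3/SSE4.1 instructions as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define QK 32 /* values per block, matching Q8_0's block size */

/* Hypothetical simplified block: a float scale plus 32 int8 quants. */
typedef struct {
    float  d;
    int8_t qs[QK];
} block_q8_sketch;

#if defined(__SSE2__)
/* Sign-extend 16 int8 lanes into two vectors of 8 int16 lanes. */
static inline void widen_i8(__m128i v, __m128i *lo, __m128i *hi) {
    __m128i zero = _mm_setzero_si128();
    /* Put each byte in the high half of a 16-bit lane, then shift back
       arithmetically so the sign bit is replicated. */
    *lo = _mm_srai_epi16(_mm_unpacklo_epi8(zero, v), 8);
    *hi = _mm_srai_epi16(_mm_unpackhi_epi8(zero, v), 8);
}

/* Multiply 16 int8 pairs and return 4 partial int32 sums. */
static inline __m128i dot16_i8(__m128i a, __m128i b) {
    __m128i alo, ahi, blo, bhi;
    widen_i8(a, &alo, &ahi);
    widen_i8(b, &blo, &bhi);
    return _mm_add_epi32(_mm_madd_epi16(alo, blo), _mm_madd_epi16(ahi, bhi));
}

/* Horizontal sum of the 4 int32 lanes. */
static inline int32_t hsum_i32(__m128i v) {
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);
}
#endif

/* Integer dot product of the quants in one pair of blocks. */
static int32_t block_idot(const block_q8_sketch *x, const block_q8_sketch *y) {
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    /* Two 128-bit loads per operand cover the 32-byte block. */
    for (int i = 0; i < QK; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(x->qs + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(y->qs + i));
        acc = _mm_add_epi32(acc, dot16_i8(a, b));
    }
    return hsum_i32(acc);
#else
    int32_t sum = 0;
    for (int i = 0; i < QK; i++) sum += (int32_t)x->qs[i] * y->qs[i];
    return sum;
#endif
}

/* Dot product over nb blocks: integer sum per block, scaled in float. */
float vec_dot_q8_sketch(const block_q8_sketch *x, const block_q8_sketch *y,
                        size_t nb) {
    assert(x != NULL && y != NULL);
    float sum = 0.0f;
    for (size_t b = 0; b < nb; b++)
        sum += x[b].d * y[b].d * (float)block_idot(&x[b], &y[b]);
    return sum;
}
```

Calling `vec_dot_q8_sketch(x, y, nb)` on two arrays of `nb` blocks returns the scaled dot product. The integer work stays in 128-bit registers throughout; only the final per-block scaling is done in floating point, which is the part a real AVX kernel can widen to 256 bits.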
On my 4c/8t Xeon v2: