Ideas

Instead of permuting the scales individually, multiply unpermuted scales, then permute the result (doesn't seem to improve performance).
Get rid of masked loads so that the compiler can use vperm* directly on memory operands (promising).
Somehow use one vpdpbusds instead of two (doesn't seem to be possible).
Somehow use the accumulator in vpdpbusds instead of a separate subtraction at the very end (also doesn't seem to be possible).
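
For readers unfamiliar with `vpdpbusds`, the arithmetic behind the last two ideas can be sketched in scalar C++. This is only an illustration, not the code in this repository. It assumes the 4-bit quants are stored as unsigned nibbles `q` in `[0, 15]` that encode the signed value `q - 8`, and that the two `vpdpbusds` calls compute `dot(q, y)` and `dot(8, y)`, whose difference is the subtraction mentioned above. The names `dpbusds_lane`, `dot_reference` and `dot_two_dpbusds` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit lane of vpdpbusds: four unsigned bytes of `u`
// times four signed bytes of `s`, added to `acc` with signed saturation.
int32_t dpbusds_lane(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    int64_t sum = acc;
    for (int i = 0; i < 4; ++i)
        sum += int32_t(u[i]) * int32_t(s[i]);
    if (sum > INT32_MAX) sum = INT32_MAX;  // saturate instead of wrapping
    if (sum < INT32_MIN) sum = INT32_MIN;
    return static_cast<int32_t>(sum);
}

// Assumption: nibble q encodes the signed value q - 8.
// Reference result: dot(q - 8, y) computed directly.
int32_t dot_reference(const uint8_t q[4], const int8_t y[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t(q[i]) - 8) * int32_t(y[i]);
    return sum;
}

// Two-instruction form (assumed): dot(q, y) - dot(8, y).
// The first operand must be unsigned, so the offset is handled separately.
int32_t dot_two_dpbusds(const uint8_t q[4], const int8_t y[4]) {
    const uint8_t eights[4] = {8, 8, 8, 8};
    const int32_t qy  = dpbusds_lane(0, q, y);       // dot(q, y)
    const int32_t off = dpbusds_lane(0, eights, y);  // 8 * sum(y)
    return qy - off;                                 // the final subtraction
}
```

Under these assumptions both functions agree for any nibbles and any signed bytes, because `dot(q - 8, y) = dot(q, y) - 8 * sum(y)`. The offset term depends on `y`, which is why it cannot simply be preloaded into the accumulator as a constant.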

