riscv64: Implement optimised crc using zbc and zbb extensions #299
Use the base implementations for every function.

Signed-off-by: Daniel Gregory <[email protected]>
The Zbc extension defines instructions for carryless multiplication that can be used to accelerate the calculation of CRC checksums. This technique is described in Intel's whitepaper, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction".

The Zbb extension defines, among other bit manipulation operations, an instruction for byte-reversing a register (rev8). This is used when doing endianness swaps.

crc_fold_common_clmul.h defines a macro that reduces a double-word aligned buffer to 128 bits by folding four 128-bit chunks in parallel, then folding a single 128-bit chunk at a time until fewer than two remain. This macro can be reused for all the CRC algorithms with some parametrisation controlling:

- where the seed is xor-ed into the first fold
- whether an endianness swap is needed on double-words read in
- whether the algorithm is reflected, which affects whether clmulh gives back the high double word of a result or the low double word

Where the algorithms differ more is in how the final 128 bits are reduced to a 32/64-bit result (which also changes if the algorithm is reflected) and how the buffer is made to be double-word aligned.

32-bit CRCs use a Barrett reduction to reduce the buffer enough to be double-word aligned and to reduce any excess left over after folding. As the different CRC32 algorithms isa-l supports differ in whether the seed is inverted and in function signature, the alignment, excess and 128-bit reduction are defined as macros in crc32_*_common_clmul.h that the implementations (crc32_*.S) include and surround with algorithm-specific assembly and precomputed constants. This also makes it straightforward to reuse the macros to calculate crc16_t10dif.

64-bit CRCs use a table-based reduction to align the buffer and handle excess. All of isa-l's CRC64 algorithms pass arguments in the same order and invert the seed before and after folding, so the crc64_*_common_clmul.h headers each contain a macro for defining a CRC64 function with a particular name. Each of the crc64_*.S files then contains a call to that macro along with the precomputed constants and lookup table.

The .h header files added don't contain C code and so are excluded from Clang formatting, similarly to the header files defined for aarch64.

Signed-off-by: Daniel Gregory <[email protected]>
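
For readers less familiar with Zbc, the following is a minimal C sketch of what the clmul and clmulh instructions compute on RV64, plus one fold step built on top of them. It is an illustration only, not the assembly in this patch: the names `clmul64`, `clmulh64`, `struct u128` and `fold_step` are hypothetical, the fold shown is the non-reflected form, and the constants `k_hi` and `k_lo` stand in for the precomputed powers of x modulo the CRC polynomial that a real implementation would supply.

```c
#include <stdint.h>

/* Low 64 bits of the 128-bit carry-less product a * b (what clmul returns on RV64). */
static uint64_t clmul64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* High 64 bits of the same 128-bit carry-less product (what clmulh returns). */
static uint64_t clmulh64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 1; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a >> (64 - i);
    return r;
}

/* A 128-bit chunk held as two double words. */
struct u128 {
    uint64_t hi;
    uint64_t lo;
};

/*
 * One fold step (non-reflected form): multiply each half of the accumulator
 * by its constant, then xor both 128-bit products into the next chunk.
 * k_hi and k_lo are assumed to be precomputed for the chosen polynomial.
 */
static struct u128 fold_step(struct u128 acc, struct u128 next,
                             uint64_t k_hi, uint64_t k_lo)
{
    struct u128 r;
    r.hi = clmulh64(acc.hi, k_hi) ^ clmulh64(acc.lo, k_lo) ^ next.hi;
    r.lo = clmul64(acc.hi, k_hi) ^ clmul64(acc.lo, k_lo) ^ next.lo;
    return r;
}
```

As a quick sanity check, `clmul64(3, 3)` is 5, since (x + 1)^2 = x^2 + 1 when additions are xor. Folding four chunks in parallel amounts to keeping four such accumulators and using constants for a four-chunk distance.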
Rather than duplicating all the crc32 4-folding and modifying it to write back to the destination the read-in bytes, write a very simple memcpy that then tail calls crc16_t10dif. This makes the performance of crc16_t10dif_copy much worse than crc16_t10dif, but still about twice as fast as crc16_t10dif_copy_base.

Signed-off-by: Daniel Gregory <[email protected]>
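
The copy-then-checksum structure described in this commit can be sketched in C as below. This is an assumption-laden illustration rather than the patch's assembly: `crc16_t10dif_ref` is a plain bitwise reference (polynomial 0x8BB7, no reflection, no final xor) standing in for the optimised routine, `crc16_t10dif_copy_sketch` is a hypothetical name, and computing the CRC over the destination rather than the source is a choice made here, not something the commit message specifies.

```c
#include <stdint.h>
#include <string.h>

/* Bitwise CRC16 T10-DIF reference: polynomial 0x8BB7, MSB first, no final xor. */
static uint16_t crc16_t10dif_ref(uint16_t crc, const uint8_t *buf, uint64_t len)
{
    for (uint64_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Copy first, then hand over to the checksum routine as the last call. */
static uint16_t crc16_t10dif_copy_sketch(uint16_t crc, uint8_t *dst,
                                         const uint8_t *src, uint64_t len)
{
    memcpy(dst, src, (size_t)len);
    return crc16_t10dif_ref(crc, dst, len);
}
```

The trade-off is the one the commit message names: the buffer is walked twice, so the copy variant is slower than the plain checksum, but no second copy of the folding loop has to be maintained.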
aab4a5b to a62dd04
Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!
Any update here, @daniel-gregory?
He has already left the company. I will continue to improve this patch and will reissue it later.
The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions, based on the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.
This patch also requires the target to have the Zbb bit-manipulation extension. This provides a hardware endianness swap instruction (rev8), which makes up a fair part of the core folding loop for non-reflected CRCs.
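
To make the cost of the endianness swap concrete, here is a small C sketch of the operation that rev8 performs in a single instruction on RV64. The helper names `byte_reverse64` and `load_be64` are hypothetical and the code is a portable illustration, not the patch's assembly. On a little-endian target, a plain 64-bit load followed by rev8 yields the same value as `load_be64`.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value (the effect of rev8 on RV64). */
static uint64_t byte_reverse64(uint64_t x)
{
    /* Swap adjacent bytes, then adjacent 16-bit pairs, then the 32-bit halves. */
    x = ((x & 0x00FF00FF00FF00FFULL) << 8) | ((x >> 8) & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Read eight bytes as a big-endian double word, independent of host endianness. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}
```

Without a byte-reverse instruction, each double word read in by a non-reflected CRC would need a shift-and-mask sequence like the one in `byte_reverse64`, which is why the extension matters for the folding loop.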
On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers, observing around a 20x increase in throughput for reflected algorithms and 17x for normal algorithms. The smaller gain for normal algorithms is likely due to the extra endianness swap instructions they need.
This patch doesn't currently have functionality for picking which version to use at runtime like the CRC implementations for aarch64 and x86_64 do. The approach used by them (reading either cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternate routes: either using the Linux riscv hwprobe interface, which would require an up-to-date version of the kernel (v6.4+), or detecting at build time with compiler flags (gcc/clang only, and this doesn't help detect at runtime). It would be great to get your opinion on which approach would be preferred.
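
As a starting point for that discussion, the two routes could look roughly like the C sketch below. It is a sketch under assumptions, not part of this patch: the function names are hypothetical, the runtime path relies on the kernel UAPI header `<asm/hwprobe.h>` defining `RISCV_HWPROBE_EXT_ZBB` and `RISCV_HWPROBE_EXT_ZBC` (older headers may lack one or both, in which case the sketch falls back to reporting "not available"), and on anything other than Linux on RISC-V it simply returns 0.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() on glibc */
#endif
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv) && defined(__linux__) && defined(__has_include)
#if __has_include(<asm/hwprobe.h>)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#define SKETCH_HAVE_HWPROBE 1
#endif
#endif

/* Runtime route: ask the kernel whether every online hart has Zbb and Zbc. */
static int riscv_has_zbb_zbc_runtime(void)
{
#if defined(SKETCH_HAVE_HWPROBE) && defined(__NR_riscv_hwprobe) && \
    defined(RISCV_HWPROBE_EXT_ZBB) && defined(RISCV_HWPROBE_EXT_ZBC)
    struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0, .value = 0 };
    const uint64_t need = RISCV_HWPROBE_EXT_ZBB | RISCV_HWPROBE_EXT_ZBC;

    /* cpusetsize 0 and a NULL cpu set mean "all online CPUs". */
    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
        return 0; /* syscall missing or failed: use the base implementation */
    return (pair.value & need) == need;
#else
    return 0;
#endif
}

/* Build-time route: true only if the compiler was told to target both extensions. */
static int riscv_has_zbb_zbc_buildtime(void)
{
#if defined(__riscv_zbb) && defined(__riscv_zbc)
    return 1;
#else
    return 0;
#endif
}
```

A dispatcher could call the runtime check once and cache a function pointer to either the Zbc/Zbb routine or the base one, while the build-time macros would only decide whether the optimised routines are compiled in at all.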