-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: introduce an optimized integer compare algorithm for lists #3813
Conversation
Problem: when lists are long OpRem will spend lots of time comparing an element with records in the list. For integer-based lists, most of the time is spent in lpCompare. In addition, lpGet has lots of branches that penalize integers use-cases. This PR: 1. Introduces lpGet2 - the improved version with less branches. 2. Benchmarks lpCompare vs an algorithm that compares records to an integer. 3. Benchmarks lpGet vs lpGet2 ``` ---------------------------------------------------------- Benchmark Time CPU Iterations ---------------------------------------------------------- BM_LpCompare 1187 ns 1187 ns 4715144 BM_LpCompareInt 371 ns 371 ns 15216611 BM_LpGet/1 265 ns 265 ns 21473149 BM_LpGet/2 214 ns 214 ns 26075164 ``` There are no functional changes to the Dragonfly code. Signed-off-by: Roman Gershman <[email protected]>
src/redis/listpack.c
Outdated
switch (encoding) { | ||
case LP_ENCODING_16BIT_INT: | ||
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8; | ||
negstart = (uint64_t)1<<15; | ||
negmax = UINT16_MAX; | ||
break; | ||
case LP_ENCODING_24BIT_INT: | ||
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16; | ||
negstart = (uint64_t)1<<23; | ||
negmax = UINT32_MAX>>8; | ||
break; | ||
case LP_ENCODING_32BIT_INT: | ||
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24; | ||
negstart = (uint64_t)1<<31; | ||
negmax = UINT32_MAX; | ||
break; | ||
case LP_ENCODING_64BIT_INT: | ||
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24 | | ||
(uint64_t)p[5] << 32 | (uint64_t)p[6] << 40 | (uint64_t)p[7] << 48 | (uint64_t)p[8] << 56; | ||
negstart = (uint64_t)1<<63; | ||
negmax = UINT64_MAX; | ||
break; | ||
default: | ||
// Invalid encoding, return a large integer value. | ||
uval = 12345678900000000ULL + p[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this part can be vectorized using the example x86 _mm_slli_epi16 instruction (maybe exist some extension for GCC or we can use separate intrinsic for arm and x86) also exist vector instructions that merge two int8 into int16 and 2 int16 into int32 etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see what can be vectorized here.
The whole logic is around a single byte
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I select too much, all shifts and OR operations like
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24 |
(uint64_t)p[5] << 32 | (uint64_t)p[6] << 40 | (uint64_t)p[7] << 48 | (uint64_t)p[8] << 56;
negstart = (uint64_t)1<<63;
negmax = UINT64_MAX;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you paste it into godbolt you can see it has been compiled into a single load instruction. in any case, this is why I added micro benchmarks - to be able to iterate and improve if needed. I think it's fairly good as is.
Problem: when lists are long OpRem will spend lots of time comparing an element with records in the list. For integer-based lists, most of the time is spent in lpCompare. In addition, lpGet has lots of branches that penalize integers use-cases.
This PR:
On c7g:
There are no functional changes to the Dragonfly code.
While by itself
LpGet/2
is not worth the hassle,BM_LpCompareInt
shows that comparing integers is significantly fasterthan comparing strings