
chore: introduce an optimized integer compare algorithm for lists #3813

Merged
2 commits merged into main on Sep 29, 2024

Conversation

romange
Collaborator

@romange romange commented Sep 27, 2024

Problem: when lists are long, OpRem spends a lot of time comparing an element against the records in the list. For integer-based lists, most of that time is spent in lpCompare. In addition, lpGet has many branches that penalize integer use cases.

This PR:

  1. Introduces lpGet2, an improved version with fewer branches.
  2. Benchmarks lpCompare against an algorithm that compares records to an integer directly.
  3. Benchmarks lpGet against lpGet2.

On c7g:

```
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
BM_LpCompare          3068 ns         3068 ns      1844176
BM_LpCompareInt        705 ns          705 ns      8040593
BM_LpGet/1             385 ns          385 ns     14604564
BM_LpGet/2             309 ns          309 ns     18110999
```

There are no functional changes to the Dragonfly code.
While BM_LpGet/2 by itself is not worth the hassle, BM_LpCompareInt shows that comparing integers is significantly faster than comparing strings.
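The BM_LpCompareInt speedup comes from decoding an integer-encoded entry into a number and comparing numerically, instead of materializing a string representation. A minimal sketch of the idea for one encoding width (hypothetical helper names; the actual PR code differs):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: decode a 16-bit listpack integer entry (one tag byte
 * followed by a 2-byte little-endian payload) into a signed 64-bit value. */
static int64_t decode_int16_entry(const uint8_t *p) {
  uint64_t uval = (uint64_t)p[1] | (uint64_t)p[2] << 8;
  uint64_t negstart = (uint64_t)1 << 15; /* payloads >= this encode negatives */
  uint64_t negmax = UINT16_MAX;
  if (uval >= negstart)
    return -(int64_t)(negmax - uval) - 1; /* sign-extend the 16-bit payload */
  return (int64_t)uval;
}

/* Compare an integer-encoded entry against a target without building a string. */
static int entry_equals_int(const uint8_t *p, int64_t target) {
  return decode_int16_entry(p) == target;
}
```

For example, a payload of `{0xE8, 0x03}` decodes to 1000, and `{0xFF, 0xFF}` decodes to -1, so an OpRem-style scan can compare against the target integer directly.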

Problem: when lists are long, OpRem spends a lot of time comparing an element against the records in the list.
For integer-based lists, most of that time is spent in lpCompare.
In addition, lpGet has many branches that penalize integer use cases.

This PR:
1. Introduces lpGet2, an improved version with fewer branches.
2. Benchmarks lpCompare against an algorithm that compares records to an integer directly.
3. Benchmarks lpGet against lpGet2.

```
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
BM_LpCompare          1187 ns         1187 ns      4715144
BM_LpCompareInt        371 ns          371 ns     15216611
BM_LpGet/1             265 ns          265 ns     21473149
BM_LpGet/2             214 ns          214 ns     26075164
```

There are no functional changes to the Dragonfly code.

Signed-off-by: Roman Gershman <[email protected]>
@romange romange requested a review from chakaz September 27, 2024 15:47
Comment on lines 661 to 685
```c
switch (encoding) {
case LP_ENCODING_16BIT_INT:
    uval = (uint64_t)p[1] | (uint64_t)p[2] << 8;
    negstart = (uint64_t)1 << 15;
    negmax = UINT16_MAX;
    break;
case LP_ENCODING_24BIT_INT:
    uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16;
    negstart = (uint64_t)1 << 23;
    negmax = UINT32_MAX >> 8;
    break;
case LP_ENCODING_32BIT_INT:
    uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24;
    negstart = (uint64_t)1 << 31;
    negmax = UINT32_MAX;
    break;
case LP_ENCODING_64BIT_INT:
    uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24 |
           (uint64_t)p[5] << 32 | (uint64_t)p[6] << 40 | (uint64_t)p[7] << 48 | (uint64_t)p[8] << 56;
    negstart = (uint64_t)1 << 63;
    negmax = UINT64_MAX;
    break;
default:
    // Invalid encoding, return a large integer value.
    uval = 12345678900000000ULL + p[0];
```
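For context, `negstart` and `negmax` are consumed after this switch to sign-extend `uval` into a signed value; this mirrors the upstream listpack decoding convention. A standalone sketch of that step (my illustration, not the PR's code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: map an unsigned payload of a given width to a signed 64-bit value.
 * Payloads at or above negstart encode negatives; negmax is the width's
 * all-ones value (e.g. UINT16_MAX for 16-bit entries). */
static int64_t lp_sign_extend(uint64_t uval, uint64_t negstart, uint64_t negmax) {
  if (uval >= negstart)
    return -(int64_t)(negmax - uval) - 1; /* e.g. 0xFFFF -> -1 for 16-bit */
  return (int64_t)uval;
}
```

For 16-bit entries, `0xFFFF` maps to -1 and `0x8000` maps to -32768, matching two's-complement interpretation of the payload.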
Contributor
@BorysTheDev BorysTheDev commented Sep 27, 2024

I think this part can be vectorized, for example using the x86 _mm_slli_epi16 instruction (maybe there is a GCC extension, or we can use separate intrinsics for ARM and x86). There are also vector instructions that merge two int8 values into an int16, two int16 into an int32, etc.

Collaborator Author
@romange

I do not see what can be vectorized here; the whole logic revolves around a single byte.

Contributor
@BorysTheDev BorysTheDev commented Sep 28, 2024

I selected too much; I meant all the shift and OR operations, like:

```c
uval = (uint64_t)p[1] | (uint64_t)p[2] << 8 | (uint64_t)p[3] << 16 | (uint64_t)p[4] << 24 |
       (uint64_t)p[5] << 32 | (uint64_t)p[6] << 40 | (uint64_t)p[7] << 48 | (uint64_t)p[8] << 56;
negstart = (uint64_t)1 << 63;
negmax = UINT64_MAX;
```

Collaborator Author
@romange

If you paste it into godbolt, you can see it compiles into a single load instruction. In any case, this is why I added micro benchmarks: to be able to iterate and improve if needed. I think it's fairly good as is.
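The shift-and-OR idiom is indeed recognized by modern compilers: on little-endian targets, gcc and clang at -O2 turn it into a single (possibly unaligned) 64-bit load, the same code a memcpy into a uint64_t produces. A standalone sketch of that equivalence (my example, not the PR's code; the memcpy form matches the shift form only on little-endian hosts):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The shift-and-OR form from the snippet above, reading 8 bytes. */
static uint64_t load_u64_shifts(const uint8_t *p) {
  return (uint64_t)p[0] | (uint64_t)p[1] << 8 | (uint64_t)p[2] << 16 |
         (uint64_t)p[3] << 24 | (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40 |
         (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}

/* The memcpy form; on little-endian targets both typically compile to one
 * load instruction, with no undefined behavior from misalignment. */
static uint64_t load_u64_memcpy(const uint8_t *p) {
  uint64_t v;
  memcpy(&v, p, sizeof(v));
  return v; /* byte order matches the shift form only on little-endian hosts */
}
```

The shift form has the advantage of being endian-independent by construction, which is likely why the listpack code is written that way.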

@romange romange requested review from chakaz and removed request for chakaz September 28, 2024 07:55
@romange romange merged commit a3cabd1 into main Sep 29, 2024
12 checks passed
@romange romange deleted the OptimizeCompare branch September 29, 2024 06:04