Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress #12539

Merged
merged 2 commits into from
Mar 12, 2024

Conversation

itschrispeck
Copy link
Collaborator

@itschrispeck itschrispeck commented Mar 2, 2024

Addresses #12534

Based on my understanding, calling _decompressor.decompress(compressedInput, decompressedOutput) is problematic because it moves the position of the src and dest ByteBuffers. It is possible to move the position of the src buffer beyond the limit of the buffer (see issue for details).

The change calls the same underlying method, but only moves the position of the dest buffer. The reason why this was a V4 raw index writer/reader version issue is that this only happens with the length prefixed LZ4 library, and this code path is not used for other raw index writer versions.

Testing: Set up a unit test locally, now able to read a previously unreadable segment when loading/iterating through a problematic column using PinotSegmentRecordReader

@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.81%. Comparing base (59551e4) to head (a6f65ac).
Report is 119 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12539      +/-   ##
============================================
+ Coverage     61.75%   61.81%   +0.05%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2453      +17     
  Lines        133233   133864     +631     
  Branches      20636    20764     +128     
============================================
+ Hits          82274    82743     +469     
- Misses        44911    45006      +95     
- Partials       6048     6115      +67     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.75% <100.00%> (+0.04%) ⬆️
java-21 61.69% <100.00%> (+0.07%) ⬆️
skip-bytebuffers-false 61.77% <100.00%> (+0.03%) ⬆️
skip-bytebuffers-true 61.67% <100.00%> (+33.94%) ⬆️
temurin 61.81% <100.00%> (+0.05%) ⬆️
unittests 61.80% <100.00%> (+0.05%) ⬆️
unittests1 46.93% <0.00%> (+0.04%) ⬆️
unittests2 27.69% <100.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@richardstartin
Copy link
Member

This is a good catch. I have a couple of remarks:

  1. It would be good to add a test because this is a subtle bug that has been hiding in plain sight for over two years, and it could regress in refactoring
  2. Whilst I introduced this bug and am a committer, I’m not really involved in the project any more and can only rubber stamp this change.

@itschrispeck itschrispeck changed the title bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress [don't merge] do not move src ByteBuffer position for LZ4 length prefixed decompress Mar 4, 2024
@itschrispeck itschrispeck force-pushed the lz4_length_decompress branch from 055aa92 to 2865f63 Compare March 4, 2024 07:44
@itschrispeck itschrispeck changed the title [don't merge] do not move src ByteBuffer position for LZ4 length prefixed decompress bugfix: do not move src ByteBuffer position for LZ4 length prefixed decompress Mar 4, 2024
@Jackie-Jiang
Copy link
Contributor

Trying to understand the problem better here. Why would it cause problem when the source position is moved? It is allowed for decompress() to move the position of the compressed buffer, and it should not break.

@itschrispeck
Copy link
Collaborator Author

itschrispeck commented Mar 6, 2024

Trying to understand the problem better here. Why would it cause problem when the source position is moved? It is allowed for decompress() to move the position of the compressed buffer, and it should not break.

I don't think the last bit is always true. While it is allowed to move the position, lz4_decompress_fast can move it beyond the source buffer's limit. See this example, the returned bytes read value is greater than src.limit:

image

There's a note in the docs:

The function never writes past the output buffer. However, since it doesn't know its 'src' size, it may read past the intended input. Also, because match offsets are not validated during decoding, reads from 'src' may underflow.

I'm also wondering if the the fast decompressor is the best choice. It may be better to move to the safe decompressor which is already used for non v4 indexes - doing that would also address this bug. The lz4 and lz4-java docs disagree here, the former states no significant performance difference but the latter suggests safe is a bit slower (may be out of date, or slower if non JNI path is used).

LZ4_decompress_fast() : unsafe! This function used to be a bit faster than LZ4_decompress_safe(), though situation has changed in recent versions, and now LZ4_decompress_safe() can be as fast and sometimes faster than LZ4_decompress_fast(). Moreover, LZ4_decompress_fast() is not protected vs malformed input, as it doesn't perform full validation of compressed data. As a consequence, this function is no longer recommended, and may be deprecated in future versions. It's only remaining specificity is that it can decompress data without knowing its compressed size.

@Jackie-Jiang
Copy link
Contributor

@itschrispeck Thanks for digging into it! I feel using safe decompressor is the safe go here, and we can get consistent behavior for v2 and v4.

@Jackie-Jiang
Copy link
Contributor

+1 on adding a unit test to prevent future regression

@itschrispeck
Copy link
Collaborator Author

Updated to use safe decompressor. This will move src to src.limit() instead of based on bytes read (ref).

I wasn't able to figure out a good way to round trip unit test this. I've added a comment on why the safe decompressor is chosen to hopefully avoid refactoring into the fast decompressor. With more understanding of the LZ4 algorithm I could probably edit some compressed data's bytes to cause this, but given that fast decompressor is deprecated I'd prefer if we can merge this without the test.

I did a basic validation of the underlying decompress implementations. In general, perf differences are a wash and vary based on host/input:

+--------------+------------------------------+--------------+----------------+--------------+--------------+
|Source        |Function Benchmarked          | Total Seconds|  Iterations/sec|  ns/Iteration|  % of default|
+--------------+------------------------------+--------------+----------------+--------------+--------------+
|Normal Text   |LZ4_decompress_safe()         | 140.382024000|          356170|          2807|       100.00%|
|Normal Text   |LZ4_decompress_fast()         | 133.453277000|          374662|          2669|        95.06%|
|              |                              |              |                |              |              |
|Compressible  |LZ4_decompress_safe()         |  28.044312000|         1782892|           560|        19.98%|
|Compressible  |LZ4_decompress_fast()         |  32.404868999|         1542978|           648|        23.08%|
+--------------+------------------------------+--------------+----------------+--------------+--------------+

@Jackie-Jiang Jackie-Jiang merged commit da2604b into apache:master Mar 12, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants