-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce more compression libraries and implementations #3354
Comments
@mulugetam there is an issue on Apache Lucene to support ZSTD [1], [2], plus somewhat unrelated to index storage but ZSTD [3] for snapshots. I believe the main concern is related to usage of the JNI bindings, I suspect your implementation is also using that? Thank you. [1] https://issues.apache.org/jira/browse/LUCENE-8739 |
This is awesome @mulugetam, looking forward! |
@reta on JNI, I think the issue is primarily native code, not the bindings themselves, as Lucene is strongly opposed to including native code? This isn't a limitation of OpenSearch, we have plugins, such as k-nn, that use native libraries. I don't love JNI btw as it requires one to write JNI bindings in C. I would prefer to replace all native boilerplate code written in OpenSearch with https://github.com/java-native-access/jna (see opensearch-project/k-NN#35 for an example). |
thanks @dblock, yeah I think the native code + JNI bindings (as the consequence) is the concern for Apache Lucene (as per discussion thread), I am just curious what the compression implementation was used, as you said, OpenSearch has no such limitations or / and constraints. |
Compression for snapshotting I added here: #2996. @mulugetam Do you implement Lucene codes with ZSTD compression support? Regarding lz4 Lucense has its own implementation the missing part is a support of lz4 InputStream. |
@willyborankin Yes. From Lucene, you'd set your codec, e.g. zstd, using:
and lz4-jni (native) like:
From OpenSearch, only the EngineConfig.java and CodecService.java source files need to be updated. You can then use it by setting the Below is how I used it when running OpenSearch-Benchmark:
The current lz4 implementation in Lucene, which is written in pure Java, has a very poor performance and that is why we went for the native implementation. Compression for snapshotting and lz4 InputStream is something we'd need to look into. I'll be upstreaming the patches as soon as I'm done with our internal process here. |
@mulugetam This is perfect. I misread your comment about LZ4 looks really good. Regarding lz4 and InputStream, I can implement it like it was done for Kafka here: https://github.com/apache/kafka/tree/99b9b3e84f4e98c3f07714e1de6a139a004cbc5b/clients/src/main/java/org/apache/kafka/common/compress. |
Maybe it makes sense to add a global parameter in |
@willyborankin what would be the benefit to compress LZ4 indices with LZ4 again while snapshotting? |
@reta Better compression but slow read even compare to ZSTD ... I think it is up to the end-user to decide. Wdyt? |
Ok, let me go back to your original question:
Let say the indices are compressed with LZ4 / ZSTD, what would be the benefit to compress them again with LZ4 / ZSTD while snapshotting? I am trying to understand this part. |
The idea is to set this parameter in case you want to use the same compression for indexes and snapshotting otherwise use lets say a granular settings for indexes and snapshotting separately. |
@willyborankin @reta @dblock Have created the PR here: #3577. |
This is tangentially related to encryption, #3469, just putting it on the radar here so we can think of consistent behavior in indexes vs. snapshots. |
@dblock answered |
Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: opensearch-project#3354. Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary). Signed-off-by: Mulugeta Mammo <[email protected]>
* Add experimental support for zstd and lz4 (native) compression. Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354. Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary). Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix DCO and and issues with CodecTests.java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix forbidden api violation error for lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license headers. Remove and fix unnecessary fields. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix magic numbers. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use protected access modifier for Zstd and LZ4 compression mode classes. Signed-off-by: Mulugeta Mammo <[email protected]> * Allow negative compression levels for zstd. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use a more restrictive permission for loading zstd-jni and lz4-java libraries. Signed-off-by: Mulugeta Mammo <[email protected]> * Rename a file (follow a consistent version naming convention). Signed-off-by: Mulugeta Mammo <[email protected]> * Refactor and create a new custom-codecs sandbox module. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove blank lines. Signed-off-by: Mulugeta Mammo <[email protected]> * Restore Lucene92CustomCodec to extend from FilterCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Make use of the compressionLevel argument. Signed-off-by: Mulugeta Mammo <[email protected]> * Make Lucene92CustomCodec abstract and use a package-private access modifier. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix lint errors. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix the description for the custom-codecs plugin. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix wildcard import and improve documentation. Signed-off-by: Mulugeta Mammo <[email protected]> * Access control exception fixed. Removed lz4-java support for now. - PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged. - The zstd-jni PR is merged since version 1.5.4-1. - The lz4-java support temporarily removed until the PR gets merged. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Fix lint issue with plugin-security. - Remove thridPartyAudit that was there for supporting lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Add test cases for compression/decompression. Other minor changes. - add test cases for compression/decompression. - rename package. - add a CHANGELOG entry. - add more checks for signed integer arithmetic. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove ES grant in plugin-security.policy. Fix minor javadoc issue. - Remove ES grant in plugin-security.policy file. - Replace @link and @see to fix javadoc error. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade jettison version to 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> * Update SHA for jettison 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> --------- Signed-off-by: Mulugeta Mammo <[email protected]> Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
* Add experimental support for zstd and lz4 (native) compression. Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354. Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary). Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix DCO and and issues with CodecTests.java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix forbidden api violation error for lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license headers. Remove and fix unnecessary fields. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix magic numbers. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use protected access modifier for Zstd and LZ4 compression mode classes. Signed-off-by: Mulugeta Mammo <[email protected]> * Allow negative compression levels for zstd. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use a more restrictive permission for loading zstd-jni and lz4-java libraries. Signed-off-by: Mulugeta Mammo <[email protected]> * Rename a file (follow a consistent version naming convention). Signed-off-by: Mulugeta Mammo <[email protected]> * Refactor and create a new custom-codecs sandbox module. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove blank lines. Signed-off-by: Mulugeta Mammo <[email protected]> * Restore Lucene92CustomCodec to extend from FilterCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Make use of the compressionLevel argument. Signed-off-by: Mulugeta Mammo <[email protected]> * Make Lucene92CustomCodec abstract and use a package-private access modifier. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix lint errors. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix the description for the custom-codecs plugin. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix wildcard import and improve documentation. Signed-off-by: Mulugeta Mammo <[email protected]> * Access control exception fixed. Removed lz4-java support for now. - PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged. - The zstd-jni PR is merged since version 1.5.4-1. - The lz4-java support temporarily removed until the PR gets merged. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Fix lint issue with plugin-security. - Remove thridPartyAudit that was there for supporting lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Add test cases for compression/decompression. Other minor changes. - add test cases for compression/decompression. - rename package. - add a CHANGELOG entry. - add more checks for signed integer arithmetic. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove ES grant in plugin-security.policy. Fix minor javadoc issue. - Remove ES grant in plugin-security.policy file. - Replace @link and @see to fix javadoc error. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade jettison version to 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> * Update SHA for jettison 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> --------- Signed-off-by: Mulugeta Mammo <[email protected]> Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]> (cherry picked from commit f071c9b) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add experimental support for zstd and lz4 (native) compression. Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: opensearch-project#3354. Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary). Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix DCO and and issues with CodecTests.java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix forbidden api violation error for lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix license headers. Remove and fix unnecessary fields. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix magic numbers. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use protected access modifier for Zstd and LZ4 compression mode classes. Signed-off-by: Mulugeta Mammo <[email protected]> * Allow negative compression levels for zstd. Use more restrictive access modifiers. Signed-off-by: Mulugeta Mammo <[email protected]> * Use a more restrictive permission for loading zstd-jni and lz4-java libraries. Signed-off-by: Mulugeta Mammo <[email protected]> * Rename a file (follow a consistent version naming convention). Signed-off-by: Mulugeta Mammo <[email protected]> * Refactor and create a new custom-codecs sandbox module. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove blank lines. Signed-off-by: Mulugeta Mammo <[email protected]> * Restore Lucene92CustomCodec to extend from FilterCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Make use of the compressionLevel argument. Signed-off-by: Mulugeta Mammo <[email protected]> * Make Lucene92CustomCodec abstract and use a package-private access modifier. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix lint errors. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix the description for the custom-codecs plugin. Signed-off-by: Mulugeta Mammo <[email protected]> * Fix wildcard import and improve documentation. Signed-off-by: Mulugeta Mammo <[email protected]> * Access control exception fixed. Removed lz4-java support for now. - PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged. - The zstd-jni PR is merged since version 1.5.4-1. - The lz4-java support temporarily removed until the PR gets merged. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Fix lint issue with plugin-security. - Remove thridPartyAudit that was there for supporting lz4-java. Signed-off-by: Mulugeta Mammo <[email protected]> * Add test cases for compression/decompression. Other minor changes. - add test cases for compression/decompression. - rename package. - add a CHANGELOG entry. - add more checks for signed integer arithmetic. Signed-off-by: Mulugeta Mammo <[email protected]> * Remove ES grant in plugin-security.policy. Fix minor javadoc issue. - Remove ES grant in plugin-security.policy file. - Replace @link and @see to fix javadoc error. Signed-off-by: Mulugeta Mammo <[email protected]> * Upgrade jettison version to 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> * Update SHA for jettison 1.5.4. Signed-off-by: Mulugeta Mammo <[email protected]> --------- Signed-off-by: Mulugeta Mammo <[email protected]> Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]> Signed-off-by: Valentin Mitrofanov <[email protected]>
* Add experimental support for zstd and lz4 (native) compression. Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354. Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary). * Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions. * Fix DCO and and issues with CodecTests.java. * Fix forbidden api violation error for lz4-java. * Fix license headers. Remove and fix unnecessary fields. * Fix magic numbers. Use more restrictive access modifiers. * Use protected access modifier for Zstd and LZ4 compression mode classes. * Allow negative compression levels for zstd. Use more restrictive access modifiers. * Use a more restrictive permission for loading zstd-jni and lz4-java libraries. * Rename a file (follow a consistent version naming convention). * Refactor and create a new custom-codecs sandbox module. * Remove blank lines. * Restore Lucene92CustomCodec to extend from FilterCodec. * Make use of the compressionLevel argument. * Make Lucene92CustomCodec abstract and use a package-private access modifier. * Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec. * Fix lint errors. * Fix the description for the custom-codecs plugin. * Fix wildcard import and improve documentation. * Access control exception fixed. Removed lz4-java support for now. - PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged. - The zstd-jni PR is merged since version 1.5.4-1. - The lz4-java support temporarily removed until the PR gets merged. * Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Upgrade plugin to use Lucene95Codec. Rename files accordingly. - Fix lint issue with plugin-security. - Remove thridPartyAudit that was there for supporting lz4-java. * Add test cases for compression/decompression. Other minor changes. - add test cases for compression/decompression. - rename package. - add a CHANGELOG entry. - add more checks for signed integer arithmetic. * Remove ES grant in plugin-security.policy. Fix minor javadoc issue. - Remove ES grant in plugin-security.policy file. - Replace @link and @see to fix javadoc error. * Upgrade jettison version to 1.5.4. * Update SHA for jettison 1.5.4. --------- (cherry picked from commit f071c9b) Signed-off-by: Mulugeta Mammo <[email protected]> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
Will this work on AWS graviton or AMD processors, or is it just going to work when a server is using an Intel chip? And is there a downside to the non-intel users? |
I opened opensearch-project/documentation-website#3816 to document this and other codecs. @mulugetam Interested in drafting something? |
Can someone explain what the impacts are on backward compatibility when a new compression codec is introduced? Does a cluster need to do a full restart and/or reindex? Is it possible to upgrade with zero downtime? |
What would be the steps to take this feature out of experimental @mulugetam? |
Hi @dblock, as far as I understand all you need to do is just move |
@mulugetam any tests have been performed, internally or externally, you could share results of? should we run opensearch-benchmarks workloads with this new codec? |
Lucene, and by extension OpenSearch, currently support only zlib and lz4 compression algorithms offering best compression ratio and best speed, respectively. There are a couple of limitations with this:
To address the limitations, we have implemented a small library that adds support for both zstd and lz4 (native) compressions to OpenSearch. The library requires a couple of source code changes to OpenSearch, EngineConfig.java and CodecService.java, but does not require any code changes to Lucene.
The charts below summarize the relative performance gain we measured for the different compression algorithms and implementations using the StoredFieldsBenchmark.
Indexing Time (relative, less is better):
Retrieval Time (relative, less is better):
Stored size (relative, less is better):
We have also used the OpenSearch-Benchmark with the nyc_taxis workload to compare the indexing performance of the different compression algorithms and implementations above and measured decent gains as well.
We'll be upstreaming our source code and patches in a couple of weeks and discuss on how best we can integrate this into OpenSearch and make them available for users. This post is intended to start the discussion.
The text was updated successfully, but these errors were encountered: