Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce more compression libraries and implementations #3354

Open
mulugetam opened this issue May 17, 2022 · 22 comments
Open

Introduce more compression libraries and implementations #3354

mulugetam opened this issue May 17, 2022 · 22 comments
Labels
discuss Issues intended to help drive brainstorming and decision making distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@mulugetam
Copy link
Contributor

Lucene, and by extension OpenSearch, currently support only zlib and lz4 compression algorithms offering best compression ratio and best speed, respectively. There are a couple of limitations with this:

  • Use cases that prefer a middle ground between best compression ratio and best speed, such as that offered by zstd, are not supported.
  • The lz4 implementation is written entirely in Java and, based on our benchmark results, is less performant than a native lz4 implementation.

To address the limitations, we have implemented a small library that adds support for both zstd and lz4 (native) compressions to OpenSearch. The library requires a couple of source code changes to OpenSearch, EngineConfig.java and CodecService.java, but does not require any code changes to Lucene.

The charts below summarize the relative performance gain we measured for the different compression algorithms and implementations using the StoredFieldsBenchmark.

Indexing Time (relative, less is better):
image

Retrieval Time (relative, less is better):
image

Stored size (relative, less is better):
image

We have also used the OpenSearch-Benchmark with the nyc_taxis workload to compare the indexing performance of the different compression algorithms and implementations above and measured decent gains as well.

We'll be upstreaming our source code and patches in a couple of weeks and discuss on how best we can integrate this into OpenSearch and make them available for users. This post is intended to start the discussion.

@mulugetam mulugetam added enhancement Enhancement or improvement to existing feature or request untriaged labels May 17, 2022
@reta
Copy link
Collaborator

reta commented May 18, 2022

@mulugetam there is an issue on Apache Lucene to support ZSTD [1], [2], plus somewhat unrelated to index storage but ZSTD [3] for snapshots. I believe the main concern is related to usage of the JNI bindings, I suspect your implementation is also using that? Thank you.

[1] https://issues.apache.org/jira/browse/LUCENE-8739
[2] apache/lucene#439
[3] #2192

@dblock
Copy link
Member

dblock commented May 18, 2022

This is awesome @mulugetam, looking forward!

@dblock
Copy link
Member

dblock commented May 18, 2022

@reta on JNI, I think the issue is primarily native code, not the bindings themselves, as Lucene is strongly opposed to including native code? This isn't a limitation of OpenSearch, we have plugins, such as k-nn, that use native libraries.

I don't love JNI btw as it requires one to write JNI bindings in C. I would prefer to replace all native boilerplate code written in OpenSearch with https://github.com/java-native-access/jna (see opensearch-project/k-NN#35 for an example).

@reta
Copy link
Collaborator

reta commented May 18, 2022

thanks @dblock, yeah I think the native code + JNI bindings (as the consequence) is the concern for Apache Lucene (as per discussion thread), I am just curious what the compression implementation was used, as you said, OpenSearch has no such limitations or / and constraints.

@willyborankin
Copy link
Contributor

willyborankin commented May 28, 2022

Compression for snapshotting I added here: #2996. @mulugetam Do you implement Lucene codes with ZSTD compression support? Regarding lz4 Lucense has its own implementation the missing part is a support of lz4 InputStream.

@mulugetam
Copy link
Contributor Author

mulugetam commented Jun 2, 2022

@willyborankin Yes. From Lucene, you'd set your codec, e.g. zstd, using:

indexWriterConfig.setCodec(new Lucene92CustomCodec(Lucene91CustomCodec.Mode.ZSTD));

and lz4-jni (native) like:

indexWriterConfig.setCodec(new Lucene92CustomCodec(Lucene92CustomCodec.Mode.LZ4_NATIVE));

From OpenSearch, only the EngineConfig.java and CodecService.java source files need to be updated. You can then use it by setting the index.codec property.

Below is how I used it when running OpenSearch-Benchmark:

echo '{"index_settings":{"index.codec": "zstd"}}' > params.json
opensearch-benchmark execute_test \
--workload=nyc_taxis \
--workload-params=params.json \
--target-hosts=127.0.0.1:9200 \
--pipeline=benchmark-only

The current lz4 implementation in Lucene, which is written in pure Java, has a very poor performance and that is why we went for the native implementation.

Compression for snapshotting and lz4 InputStream is something we'd need to look into. I'll be upstreaming the patches as soon as I'm done with our internal process here.

@willyborankin
Copy link
Contributor

@mulugetam This is perfect. I misread your comment about LZ4 looks really good. Regarding lz4 and InputStream, I can implement it like it was done for Kafka here: https://github.com/apache/kafka/tree/99b9b3e84f4e98c3f07714e1de6a139a004cbc5b/clients/src/main/java/org/apache/kafka/common/compress.

@willyborankin
Copy link
Contributor

willyborankin commented Jun 8, 2022

Maybe it makes sense to add a global parameter in opensearch.yml such this:
cluster.compression_type: which defines compression type both for indexes and snapshotting. WDYT?

@reta
Copy link
Collaborator

reta commented Jun 8, 2022

@willyborankin what would be the benefit to compress LZ4 indices with LZ4 again while snapshotting?

@willyborankin
Copy link
Contributor

@reta Better compression but slow read even compare to ZSTD ... I think it is up to the end-user to decide. Wdyt?

@reta
Copy link
Collaborator

reta commented Jun 8, 2022

Ok, let me go back to your original question:

Maybe it makes sense to add a global parameter in opensearch.yml such this:
cluster.compression_type: which defines compression type both for indexes and snapshotting.

Let say the indices are compressed with LZ4 / ZSTD, what would be the benefit to compress them again with LZ4 / ZSTD while snapshotting? I am trying to understand this part.

@willyborankin
Copy link
Contributor

The idea is to set this parameter in case you want to use the same compression for indexes and snapshotting otherwise use lets say a granular settings for indexes and snapshotting separately.

@anasalkouz anasalkouz added the discuss Issues intended to help drive brainstorming and decision making label Jun 9, 2022
@mulugetam
Copy link
Contributor Author

@willyborankin @reta @dblock Have created the PR here: #3577.

@dblock
Copy link
Member

dblock commented Jun 14, 2022

This is tangentially related to encryption, #3469, just putting it on the radar here so we can think of consistent behavior in indexes vs. snapshots.

@willyborankin
Copy link
Contributor

tangentially

@dblock answered

mulugetam added a commit to mulugetam/OpenSearch that referenced this issue Jul 7, 2022
Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: opensearch-project#3354.

Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary).

Signed-off-by: Mulugeta Mammo <[email protected]>
dblock added a commit that referenced this issue Apr 4, 2023
* Add experimental support for zstd and lz4 (native) compression.

Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354.

Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix DCO and and issues with CodecTests.java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix forbidden api violation error for lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license headers. Remove and fix unnecessary fields.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix magic numbers. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use protected access modifier for Zstd and LZ4 compression mode classes.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Allow negative compression levels for zstd. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use a more restrictive permission for loading zstd-jni and lz4-java libraries.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Rename a file (follow a consistent version naming convention).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Refactor and create a new custom-codecs sandbox module.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove blank lines.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Restore Lucene92CustomCodec to extend from FilterCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make use of the compressionLevel argument.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make Lucene92CustomCodec abstract and use a package-private access modifier.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix lint errors.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix the description for the custom-codecs plugin.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix wildcard import and improve documentation.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Access control exception fixed. Removed lz4-java support for now.

- PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged.
- The zstd-jni PR is merged since version 1.5.4-1.
- The lz4-java support temporarily removed until the PR gets merged.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade plugin to use Lucene95Codec. Rename files accordingly.

- Upgrade plugin to use Lucene95Codec. Rename files accordingly.
- Fix lint issue with plugin-security.
- Remove thridPartyAudit that was there for supporting lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Add test cases for compression/decompression. Other minor changes.

  - add test cases for compression/decompression.
  - rename package.
  - add a CHANGELOG entry.
  - add more checks for signed integer arithmetic.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove ES grant in plugin-security.policy. Fix minor javadoc issue.

  - Remove ES grant in plugin-security.policy file.
  - Replace @link and @see to fix javadoc error.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade jettison version to 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Update SHA for jettison 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

---------

Signed-off-by: Mulugeta Mammo <[email protected]>
Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
opensearch-trigger-bot bot pushed a commit that referenced this issue Apr 4, 2023
* Add experimental support for zstd and lz4 (native) compression.

Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354.

Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix DCO and and issues with CodecTests.java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix forbidden api violation error for lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license headers. Remove and fix unnecessary fields.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix magic numbers. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use protected access modifier for Zstd and LZ4 compression mode classes.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Allow negative compression levels for zstd. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use a more restrictive permission for loading zstd-jni and lz4-java libraries.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Rename a file (follow a consistent version naming convention).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Refactor and create a new custom-codecs sandbox module.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove blank lines.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Restore Lucene92CustomCodec to extend from FilterCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make use of the compressionLevel argument.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make Lucene92CustomCodec abstract and use a package-private access modifier.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix lint errors.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix the description for the custom-codecs plugin.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix wildcard import and improve documentation.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Access control exception fixed. Removed lz4-java support for now.

- PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged.
- The zstd-jni PR is merged since version 1.5.4-1.
- The lz4-java support temporarily removed until the PR gets merged.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade plugin to use Lucene95Codec. Rename files accordingly.

- Upgrade plugin to use Lucene95Codec. Rename files accordingly.
- Fix lint issue with plugin-security.
- Remove thridPartyAudit that was there for supporting lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Add test cases for compression/decompression. Other minor changes.

  - add test cases for compression/decompression.
  - rename package.
  - add a CHANGELOG entry.
  - add more checks for signed integer arithmetic.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove ES grant in plugin-security.policy. Fix minor javadoc issue.

  - Remove ES grant in plugin-security.policy file.
  - Replace @link and @see to fix javadoc error.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade jettison version to 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Update SHA for jettison 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

---------

Signed-off-by: Mulugeta Mammo <[email protected]>
Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
(cherry picked from commit f071c9b)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this issue Apr 5, 2023
* Add experimental support for zstd and lz4 (native) compression.

Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: opensearch-project#3354.

Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix DCO and and issues with CodecTests.java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix forbidden api violation error for lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix license headers. Remove and fix unnecessary fields.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix magic numbers. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use protected access modifier for Zstd and LZ4 compression mode classes.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Allow negative compression levels for zstd. Use more restrictive access modifiers.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Use a more restrictive permission for loading zstd-jni and lz4-java libraries.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Rename a file (follow a consistent version naming convention).

Signed-off-by: Mulugeta Mammo <[email protected]>

* Refactor and create a new custom-codecs sandbox module.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove blank lines.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Restore Lucene92CustomCodec to extend from FilterCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make use of the compressionLevel argument.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Make Lucene92CustomCodec abstract and use a package-private access modifier.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix lint errors.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix the description for the custom-codecs plugin.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Fix wildcard import and improve documentation.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Access control exception fixed. Removed lz4-java support for now.

- PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged.
- The zstd-jni PR is merged since version 1.5.4-1.
- The lz4-java support temporarily removed until the PR gets merged.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade plugin to use Lucene95Codec. Rename files accordingly.

- Upgrade plugin to use Lucene95Codec. Rename files accordingly.
- Fix lint issue with plugin-security.
- Remove thridPartyAudit that was there for supporting lz4-java.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Add test cases for compression/decompression. Other minor changes.

  - add test cases for compression/decompression.
  - rename package.
  - add a CHANGELOG entry.
  - add more checks for signed integer arithmetic.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Remove ES grant in plugin-security.policy. Fix minor javadoc issue.

  - Remove ES grant in plugin-security.policy file.
  - Replace @link and @see to fix javadoc error.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Upgrade jettison version to 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

* Update SHA for jettison 1.5.4.

Signed-off-by: Mulugeta Mammo <[email protected]>

---------

Signed-off-by: Mulugeta Mammo <[email protected]>
Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
Signed-off-by: Valentin Mitrofanov <[email protected]>
reta pushed a commit that referenced this issue Apr 5, 2023
* Add experimental support for zstd and lz4 (native) compression.

Add experimental support for zstd (with and without dictionary support) and lz4 (native) compressions as discussed in: #3354.

Users would be able to set the index.codec setting with the values "lz4_native" (for lz4 native), "zstd" (for zstd with dictionary), and "zstd_no_dict" (for zstd without a dictionary).



* Fix license issues, add tests for zstd, zstd_no_dict, and lz4_native compressions.



* Fix DCO and and issues with CodecTests.java.



* Fix forbidden api violation error for lz4-java.



* Fix license headers. Remove and fix unnecessary fields.



* Fix magic numbers. Use more restrictive access modifiers.



* Use protected access modifier for Zstd and LZ4 compression mode classes.



* Allow negative compression levels for zstd. Use more restrictive access modifiers.



* Use a more restrictive permission for loading zstd-jni and lz4-java libraries.



* Rename a file (follow a consistent version naming convention).



* Refactor and create a new custom-codecs sandbox module.



* Remove blank lines.



* Restore Lucene92CustomCodec to extend from FilterCodec.



* Make use of the compressionLevel argument.



* Make Lucene92CustomCodec abstract and use a package-private access modifier.



* Fix missing JavaDoc issues. Remove unused field in PerFieldMappingPostingFormatCodec.



* Fix lint errors.



* Fix the description for the custom-codecs plugin.



* Fix wildcard import and improve documentation.



* Access control exception fixed. Removed lz4-java support for now.

- PRs were made to zstd-jni and lz4-java to use AccessController.doPrivileged.
- The zstd-jni PR is merged since version 1.5.4-1.
- The lz4-java support temporarily removed until the PR gets merged.



* Upgrade plugin to use Lucene95Codec. Rename files accordingly.

- Upgrade plugin to use Lucene95Codec. Rename files accordingly.
- Fix lint issue with plugin-security.
- Remove thridPartyAudit that was there for supporting lz4-java.



* Add test cases for compression/decompression. Other minor changes.

  - add test cases for compression/decompression.
  - rename package.
  - add a CHANGELOG entry.
  - add more checks for signed integer arithmetic.



* Remove ES grant in plugin-security.policy. Fix minor javadoc issue.

  - Remove ES grant in plugin-security.policy file.
  - Replace @link and @see to fix javadoc error.



* Upgrade jettison version to 1.5.4.



* Update SHA for jettison 1.5.4.



---------



(cherry picked from commit f071c9b)

Signed-off-by: Mulugeta Mammo <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Daniel (dB.) Doubrovkine <[email protected]>
@wbeckler
Copy link

Will this work on AWS graviton or AMD processors, or is it just going to work when a server is using an Intel chip? And is there a downside to the non-intel users?

@dblock
Copy link
Member

dblock commented Apr 20, 2023

I opened opensearch-project/documentation-website#3816 to document this and other codecs. @mulugetam Interested in drafting something?

@wbeckler
Copy link

Can someone explain what the impacts are on backward compatibility when a new compression codec is introduced? Does a cluster need to do a full restart and/or reindex? Is it possible to upgrade with zero downtime?

@dblock
Copy link
Member

dblock commented May 17, 2023

What would be the steps to take this feature out of experimental @mulugetam?

@willyborankin
Copy link
Contributor

What would be the steps to take this feature out of experimental @mulugetam?

Hi @dblock, as far as I understand all you need to do is just move sandbox/plugins/custom-codecs into modules (maybe additional tests need to be added). @reta correct me if I'm wrong. I can do it after #2996 has been merged

@reta
Copy link
Collaborator

reta commented May 24, 2023

@mulugetam any tests have been performed, internally or externally, you could share results of? should we run opensearch-benchmarks workloads with this new codec?

@sarthakaggarwal97
Copy link
Contributor

@reta you can view the results of benchmarks over at #7805

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issues intended to help drive brainstorming and decision making distributed framework enhancement Enhancement or improvement to existing feature or request
Projects
None yet
Development

No branches or pull requests

8 participants