Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zstd compressor (#3) #1604

Draft
wants to merge 38 commits into
base: main
Choose a base branch
from
Draft

Conversation

clusty
Copy link

@clusty clusty commented Dec 21, 2023

This PR adds a new compression algorithm optimized for Deep images.
During the development, deep images were shrunk by 20-30% compared to ZIPS with similar compression performance and 35-45% smaller when the compression level is turned to the max (about a 3-4x slowdown compared to ZIPS). The decompression speed slightly faster than ZIPS.

I did not spend a lot of time trying to optimize for non-Deep images, but
certain small non-deep example images with HALF pixels yielded a slightly worse compression rate compared to ZIPS.

The core algorithm is implemented in Blosc, with ZSTD codec with an appropriate data prefiltering.
The main advantage is that it offers a unified API access to multiple codecs and is under active development. Also the binary serialization stays compatible if ever we want to change the codecs/config parameters in the future, without introducing breaking changes.

Open questions:
I still have not figured out how to integrate Blosc properly into the OpenEXR build pipeline.

Co-authored-by: Philippe Leprince <[email protected]>
Signed-off-by: Vlad Lazar <[email protected]>

* better multy-type compression

* Version the Stream

---------

Signed-off-by: Vlad Lazar <[email protected]>
Co-authored-by: Philippe Leprince <[email protected]>
Copy link

linux-foundation-easycla bot commented Dec 21, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@cary-ilm
Copy link
Member

cary-ilm commented Jan 3, 2024

A belated thanks for this, it mostly looks pretty straightforward, with some caveats.

Obviously, the blosc configuration needs to be sorted out. If you haven't already, you might look at OpenVDB for template for building against blosc: https://github.com/AcademySoftwareFoundation/openvdb/blob/master/CMakeLists.txt#L99 and https://github.com/AcademySoftwareFoundation/openvdb/blob/master/cmake/FindBlosc.cmake

We'll obviously need to carefully consider the consequences and timing of extensions to the file format, as well as library dependencies, and possibly want to bundle this with other changes, but we have already been anticipating adding new compression algorithms, so having a working example will be helpful.

A few pedantic requests:

  • We require a CLA, can you sign via the red "NOT COVERED" button above?
  • We also require copyright notices at the top of each file.
  • In addition to the example code, there should be tests in the test suite in src/test.
  • We'll need the parallel implementation in the OpenEXRCore library.
  • I'll leave a few comments inline about minimizing whitespace changes.
  • Also, the associated docs should be updated as well.

Copy link
Member

@cary-ilm cary-ilm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our preference is to minimize whitespace changes wherever possible, and format code via clang-format.

@pleprince
Copy link
Contributor

I will add the cmake changes shortly. Thanks

@clusty
Copy link
Author

clusty commented Jan 11, 2024

@cary-ilm Is there a way to share the implementation of my compressor between OpenEXR and OpenEXRCore ? The 2 are the the same minus some memory management bits.

@@ -220,3 +222,7 @@ openexr_define_library(OpenEXR
OpenEXR::IlmThread
OpenEXR::OpenEXRCore
)

target_include_directories(OpenEXR PUBLIC "/home/vladal/bin/include/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not forget to change this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, I noticed. Thanks

Clang-format deepExamples.cpp and fix a comment typo.

Clang-format deepExamples.cpp and fix a comment typo.
@cary-ilm
Copy link
Member

OpenEXR depends on OpenEXRCore (minimally at this point but that may well increase), so if you able to put an implementation in OpenEXRCore that can be used in OpenEXR, that's fine. But obviously you're aware that OpenEXRCore is a C-language library, so no C++ classes there.

@peterhillman
Copy link
Contributor

I see you updated exrheader to dump out the compression type. exrmaketiled, exrenvmap and exrmultiview take a command line string and turn that to a compression type. Should they be extended to support ZSTD compression?

Perhaps it's worth making library functions that turn a string like 'piz' into PIZ_COMPRESSION, and vice versa, to simplify updating these utliities (and end-user code) as new compression types are added.


target_include_directories(OpenEXR PUBLIC "/home/vladal/bin/include/")
target_link_directories(OpenEXR PUBLIC "/home/vladal/bin/lib")
target_link_libraries(OpenEXR PUBLIC "dl" "blosc2")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library you use here is this one here -> https://github.com/Blosc/c-blosc2 right? Does the newest release of the library (2.12.0) work with this code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should work.

@pleprince
Copy link
Contributor

pleprince commented Jan 12, 2024

Hi @cary-ilm & @peterhillman

I have made a number of cmake modifications based on the Imath lib section. That seem to work but there are a few lines that I copied and pasted that I don't understand and any help would be much appreciated.

  # the install creates this but if we're using the library locally we
  # haven't installed the header files yet, so need to extract those
  # and make a variable for header only usage
  if(NOT TARGET Imath::ImathConfig)
    get_target_property(imathinc Imath INTERFACE_INCLUDE_DIRECTORIES)
    get_target_property(imathconfinc ImathConfig INTERFACE_INCLUDE_DIRECTORIES)
    list(APPEND imathinc ${imathconfinc})
    set(IMATH_HEADER_ONLY_INCLUDE_DIRS ${imathinc})
    message(STATUS "Imath interface dirs ${IMATH_HEADER_ONLY_INCLUDE_DIRS}")
  endif()

my problem here is IMATH_HEADER_ONLY_INCLUDE_DIRS isn't referenced anywhere else in the project. Is that being picked by magic / naming convention ? I didn't find anything in the cmake docs. I am assuming this is adding the Imath API headers to the install, in which case I don't need it.

The other bit I don't understand is:

    # Propagate OpenEXR's setting for pkg-config generation to Imath:
    # If OpenEXR is generating it, the internal Imath should, too.
    set(IMATH_INSTALL_PKG_CONFIG ${OPENEXR_INSTALL_PKG_CONFIG}) 

What's exactly the point here ? Another bit of magic, I'm sure ! :-)

Thanks in advance

@peterhillman
Copy link
Contributor

I got this to build OK, so it looks like the Blosc build is working.
Did you research different settings for the number of scanlines in each chunk? 32 scanlines of deep image data could get very large. It's also a little more complicated to multi-thread efficiently with multi-scanline compression schemes. I wonder whether there should be a ZSTD1 as well as a ZSTD32? Or perhaps just a ZSTD1?

As it happens, it seems the library doesn't properly support writing multi-scanline compression formats to deep images.
I wrote a test tool to read and write deep images with a different compression. It only allocates a single deep scanline at a time, points a DeepFrameBuffer at that, then loops reading the scanline from the input and then calling writePixels(1). It appears LineBufferTask::execute in ImfDeepScanLineOutputFile waits until writePixels() has been called 32 times, then it copies the pixel counts from the FrameBuffer for all 32 scanlines before writing. In my case, I've overwritten the data from the previous 31 scanlines.
I had a brief go at fixing it so it saves the pixel counts as soon as writePixels is called, but haven't got that working properly yet. I hacked the ZSTD_COMPRESSION to be single scanline, and that works.

@clusty
Copy link
Author

clusty commented Jan 16, 2024

I got this to build OK, so it looks like the Blosc build is working. Did you research different settings for the number of scanlines in each chunk? 32 scanlines of deep image data could get very large. It's also a little more complicated to multi-thread efficiently with multi-scanline compression schemes. I wonder whether there should be a ZSTD1 as well as a ZSTD32? Or perhaps just a ZSTD1?

The 32 number came from some experiements with some real world data I had. You are probably right that 32 might be large, when the number of samples is large. Will test with 1 and see how much does the compression degrade.

As it happens, it seems the library doesn't properly support writing multi-scanline compression formats to deep images. I wrote a test tool to read and write deep images with a different compression. It only allocates a single deep scanline at a time, points a DeepFrameBuffer at that, then loops reading the scanline from the input and then calling writePixels(1). It appears LineBufferTask::execute in ImfDeepScanLineOutputFile waits until writePixels() has been called 32 times, then it copies the pixel counts from the FrameBuffer for all 32 scanlines before writing. In my case, I've overwritten the data from the previous 31 scanlines. I had a brief go at fixing it so it saves the pixel counts as soon as writePixels is called, but haven't got that working properly yet. I hacked the ZSTD_COMPRESSION to be single scanline, and that works.

Thanks very much for looking at this. We had some discussions about this topic. I was experience spurious deadlocks in LineBufferTask. At some point all the reliability issues had gone away, so I chucked it to my imagination :)

I think this all points to the idea to make ZSTD work only on 1 scanline for now. (might make a 32 scanline version later)

@pleprince
Copy link
Contributor

Perhaps it's worth making library functions that turn a string like 'piz' into PIZ_COMPRESSION, and vice versa, to simplify updating these utliities (and end-user code) as new compression types are added.

I have worked on that today and will make a separate PR.

@peterhillman
Copy link
Contributor

I had issues doing cmake ../openexr -DCMAKE_BUILD_TYPE=Debug - it seems that causes blosc to generate a libblosc2_d.a but OpenEXR still tries to link against libblosc2.a.
In other news I think I have a version of OpenEXR which fixes deep reading and writing to allow single deep scanlines to be written and read with codecs that compress multiple scanlines together. The current code is using the supplied SampleCount slice to store per-pixel sample counts, so that must be valid for all scanlines of the chunk (and the library will write the sample count to your buffer for all scanlines of the chunk, even if you only ask it to read one scanline). There are no issues with the current implementation, as long as all deep codecs are single scanline only.
My changes further enforce that deep images cannot have subsampling other than 1, which means every sample has the same set of channels. Reading is inefficient: I believe reading scanlines consecutively from the same chunk causes the chunk to be decoded every time.

I would suggest that we have a ZSTD single scanline codec, but if decide to add a multiscanline version too, then we will need to make these changes. I'm a little bit timid about pushing my changes unless we need them. If we have multiple versions of ZSTD, then I think they should be in the same release

@lji-ilm
Copy link

lji-ilm commented Jan 18, 2024

I did not spend a lot of time trying to optimize for non-Deep images, but
certain small non-deep example images with HALF pixels yielded a slightly worse compression rate compared to ZIPS.

I looked up blosc2 a bit and it seems it turns on byte swizzle by default. This is another empirical optimization chasing after the idea that "exponent might be smooth while mantissa might fluctuate rapidly"; in other words, the higher bytes and lower bytes might exhibit different behaviours inside a float and we should collect them into separate chunks. But the blosc2 implementation probably only works for 4 byte IEEE float. I have a hard time imaging how this would optimize half:

https://www.slideshare.net/PyData/blosc-py-data-2014#17

blosc2shuffle

openEXR's internal zip chased a similar idea, however the existing code swizzled by 2 bytes for an entire block of numbers rather than doing a local 16 byte swizzle on a 4 byte pattern: https://github.com/AcademySoftwareFoundation/openexr/blob/86d0fb09859951d1e51a889e4ff2b7b3baecf021/src/lib/OpenEXR/ImfZip.cpp#L50C2-L72

This might be one reason why it works well for float seems not be so performant on half.

@Vertexwahn
Copy link
Contributor

Vertexwahn commented Jan 21, 2024

FYI: I just created a PR on the Bazel Central Registry to add c-blosc2 2.12.0 (bazelbuild/bazel-central-registry#1349). Once this is merged Bazel builds can be fixed this way:

Add to MODULE.bazel:

bazel_dep(name = "c-blosc2", version = "2.12.0")

Add to BUILD.bazel under the deps section of cc_library `OpenEXR´:

"@c-blosc2//c-blosc2",

* Whitespaces and licensing

* WIP OpenEXRCore implementation

* Brand new spanking blosc build.

* Switch to single Scanline zstd compression and Single implementation

* Fixed the tests

* Undo whitespace changes

* Last touches

* Revert extra build changes
@Vertexwahn
Copy link
Contributor

Vertexwahn commented Apr 5, 2024

@pleprince I just added the updated version of c-blosc2 to my personal Bazel registry.
You can create a .bazelrc (at the root dir or the OpenEXR repo, next to MODULE.bazel file) and add the following content:

common --registry=https://raw.githubusercontent.com/Vertexwahn/bazel-registry/main/
common --registry=https://bcr.bazel.build

This way also my personal registry will be considered that has already bazel_dep(name = "c-blosc2", version = "2.12.0.bcr.2") available. Did not test it - but might be a working workaround until bazelbuild/bazel-central-registry#1773 is merged.

@Vertexwahn
Copy link
Contributor

@pleprince Just tested on Ubunut 22.04:

diff --git a/.bazelversion b/.bazelversion
index a8907c02..21c8c7b4 100644
--- a/.bazelversion
+++ b/.bazelversion
@@ -1 +1 @@
-7.0.2
+7.1.1
diff --git a/MODULE.bazel b/MODULE.bazel
index 63d3f067..ba4e4e2d 100644
--- a/MODULE.bazel
+++ b/MODULE.bazel
@@ -10,4 +10,4 @@ bazel_dep(name = "bazel_skylib", version = "1.5.0")
 bazel_dep(name = "imath", version = "3.1.11")
 bazel_dep(name = "libdeflate", version = "1.19")
 bazel_dep(name = "platforms", version = "0.0.8")
-bazel_dep(name = "c-blosc2", version = "2.12.0.bcr.1")
+bazel_dep(name = "c-blosc2", version = "2.12.0.bcr.2")

bazel build //... && bazel test //... works for me with the above applied patches

@pleprince
Copy link
Contributor

Thank you so much @Vertexwahn !

@pleprince
Copy link
Contributor

pleprince commented Apr 8, 2024

@cary-ilm @peterhillman
The fuzz test is failing to build. It is compiling with clang++ but fails to link because ld doesn't seem to support DWARF-5 on that system image. clang 14+ now defaults to DWARF-5 and I wonder what's the best option here:

  • modify the compile flags to generate DWARF-4
  • link the fuzz binaries with lld instead of ld (pass -fuse-ld=lld to clang++). Not sure the disk image has lld installed though.
    Thanks

meteorcloudy pushed a commit to bazelbuild/bazel-central-registry that referenced this pull request Apr 16, 2024
AcademySoftwareFoundation/openexr#1604 revealed
that the c-blosc2 module still has some linker issues. This PR fixes
those issues. This PR is an improvement of the c-blosc2 model at version
2.12.0.

In detail, missing c files were added to the c-blosc2 target. Also,
textual_hdrs were reworked. Besides this cc_tests targets were
introduced to make sure the whole thing works across different systems.

The defines `HAVE_PLUGINS` and `HAVE_ZSTD` were added. Without the
define `HAVE_PLUGINS` zfp plugin is not initialized (which is required
by OpenEXR).

@phaedon Maybe this fixes also some issues on your side ;). You can
create a PR and add yourself as a maintainer to this module if you like.
This way you will get notified if "improvements" are made or at least it
gives you a chance to veto changes ;). Furthermore, the PR policy was
changed in a way that a positive review of a maintainer can lead under
some circumstances to a successful merge without the need to get a
review of one of the "admins/repo owners"
@Vertexwahn
Copy link
Contributor

@pleprince bazelbuild/bazel-central-registry#1773 is merged now. Use bazel_dep(name = "c-blosc2", version = "2.12.0.bcr.2") now! Should fix the Bazel build issues.

@peterhillman
Copy link
Contributor

@pleprince I'm a little out of my depth here. I would have thought it was OK to change the fuzz test compile flags, since the purpose of fuzz testing is to verify the code correctness, not that it builds on all architectures. Does this issue mean that adding zstd adds an additional constraint on compiling debug builds in general, or is it just a quirk of the way that the oss-fuzz builds are configured?

@clusty
Copy link
Author

clusty commented Apr 19, 2024

@pleprince I'm a little out of my depth here. I would have thought it was OK to change the fuzz test compile flags, since the purpose of fuzz testing is to verify the code correctness, not that it builds on all architectures. Does this issue mean that adding zstd adds an additional constraint on compiling debug builds in general, or is it just a quirk of the way that the oss-fuzz builds are configured?

That error seems to pop up because for some reason blosc is compiled with DWAF5 format symbols, but the linker used for the fuzzer does not understand this ( clang/llvm forums seem to mention this error ). I have not encountered this issue locally, but then again I have a fairly recent toolchain.

What I don't understand is what makes blosc so special: I presume the fuzzer uses the same linker as the regular shared object EXR build chain ?

aiuto pushed a commit to aiuto/bazel-central-registry that referenced this pull request Jun 3, 2024
AcademySoftwareFoundation/openexr#1604 revealed
that the c-blosc2 module still has some linker issues. This PR fixes
those issues. This PR is an improvement of the c-blosc2 model at version
2.12.0.

In detail, missing c files were added to the c-blosc2 target. Also,
textual_hdrs were reworked. Besides this cc_tests targets were
introduced to make sure the whole thing works across different systems.

The defines `HAVE_PLUGINS` and `HAVE_ZSTD` were added. Without the
define `HAVE_PLUGINS` zfp plugin is not initialized (which is required
by OpenEXR).

@phaedon Maybe this fixes also some issues on your side ;). You can
create a PR and add yourself as a maintainer to this module if you like.
This way you will get notified if "improvements" are made or at least it
gives you a chance to veto changes ;). Furthermore, the PR policy was
changed in a way that a positive review of a maintainer can lead under
some circumstances to a successful merge without the need to get a
review of one of the "admins/repo owners"
@pleprince
Copy link
Contributor

Hello @kdt3rd and @cary-ilm
I looked into why OSS Fuzz is failing on this PR and it looks like I need to add libblosc2 as a lib to the clang++ invocation.
This seems to hardcoded in https://github.com/google/oss-fuzz/blob/master/projects/openexr/build.sh but I don't know how to proceed. It haven't found a way to, for example, create PR on oss-fuzz and reference it in my openexr PR.
Thanks for your time.

@peterhillman
Copy link
Contributor

Perhaps it would be simplest to ignore the oss-fuzz test failure for now. Once this is all merged then we can do a PR to oss-fuzz to update the build.sh
It must be common that the various oss-fuzz projects need to update their build configurations to support newly added features, and would like that to happen without a build failure being reported, but I haven't found out how that's done.

// clevel 9 is about a 20% increase in compression compared to 5.
// Decompression speed is unchanged.
int zstd_level;
exr_get_default_zstd_compression_level (&zstd_level);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be getting the current compression level, not the default one? Changing the compression level doesn't seem to make a difference to the output size

}

blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
int typeSize = inSize % 4 == 0 ? 4 : 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I follow correctly: if the total data size is a multiple of four, then this will pair together two halfs into a 4 byte 'type', but otherwise it could split floats into two 2 byte types. Is that a performance overhead?

A more pedantic approach might be to do multiple compression blocks, each containing a single data type, so blosc knows the exact size of each type. Perhaps each consecutive group of channels of the same type are compressed into a single group. (For deep, you also need to compute the total number of samples in the data, by looping through the channel list to compute the total size of the samples and dividing the data size by that)

Copy link
Author

@clusty clusty Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while since I looked at that code.
There is a hard requirement inside Blosc, if I remember right, that the number of data points needs to be a multiple of number of typesize. I think this code prevented a crash when you had only 1 half channel.
So if you have an even number of half channels, they are treated in 4 byte chunks, where as if you have an even number of halfs everything is a treated as half.

At some point I implemented that channel sorting to have the correct type, but for some reason, the compression rate dropped: I had 1 blosc call per channel. I suspect that 1 scanline was not enough data if it is sliced for every channel (I did not do 1 blosc call per channel size type though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my tests, If there are only half float channels, then setting typeSize to 2 gives files 52.7% of uncompressed size, setting it to 4 gives 56.8%. For comparison, ZIPS gives 53.3%, so ZIPS performs better than ZSTD compression unless inSize is not a multiple of four. If there are full float channels in the mix and typeSize is 4, ZSTD performs better than ZIPS all round. All that suggests that it might be worth compressing channels with the correct typeSize. Fortunately, with the common set of channel names (A,B,G,R,Z,ZBack,id) it just happens that the alphanumeric sorting of the channel names groups together the 16 bit channels and the 32 bit channels.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterhillman your suggestion is to sort the channel data by size and potentially do 2 blosc calls ? ( when you have mixed float / half channels ).

the complication is that the pixel and the sample count tables are not available to the compressor for deep files. I have a branch that does a refactor to that effect but it was fairly large and probably full of bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking you do mulitple blosc calls, at worst one per channel. If A and Z are stored as 32 bit float, and B,G,R as 16, you would do A by itself, then B,G,R together, and finallty Z, so you would have three blosc data chunks. That way you can compress channels together and don't need the overhead of rearranging the data.

It would be a minor refactor, yes, but just of the zstd code. You shouldn't need the sample count table, but you do need the ChannelList, which you can get from the header in the same way that other codecs do. You sum up the total size of all the channels and divide inSize by that to give you the number of samples. Then the first (numberofSamples*firstChannelSize) bytes are all the data for the first channel alphanumerically, then the data for the second channel, and so on through the block.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense…. I’d also need to detect when I am getting called with less data than expected ( if I remember right, first call to the compressor for the scan line is for the sample counts )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point. You would have to look at the dataWindow and compute the expected size of the sample count table to detect whether the data is sample count or pixel data, and it would be possible the pixel data coincidentally happens to be exactly that size. If that happened, then the data would be compressed as single 4 byte channel, even if it wasn't. That wouldn't do much harm, though, because the data size would have to be relatively small.

I suppose a minor extension to the API would be to pass an extra parameter to the compressors to indicate whether they are getting normal pixel data, sample counts, or deep data.

Presumably zstd compressed data would have some kind of header in it to indicate the number of blocks and the uncompressed size of each, so the decompresser would do the right thing regardless of what the compressor did (and would mean the decompressor doesn't need to look at the header at all - it just goes off the header in the compressed data block). That would mean the API can be changed later without breaking the zstd codec.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I got it right:
If the number of bytes I get is datawindow.x * 4 ( or whatever is the type for counts ), it means its a sample count compress call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the sample count table will always be (datawindow.max.x + 1 - datawindow.min.x)*sizeof(int) bytes for scanline images, and it will be (tile_area)*sizeof(int) bytes for tiled deep images. The actual pixel data will always be divisible by the sum of the channel sizes.

This code will need to be removed before merging.

Signed-off-by: Philippe Leprince <[email protected]>
@lji-ilm
Copy link

lji-ilm commented Aug 24, 2024

While we all agree that zstd is a respectable modern compressor to add, there seems to be more questions regarding blosc2 and its parameter tuning for different data types in OpenEXR. I also think the performance benefit of blosc2's memory marshalling is somewhat related to the hardware architecture, and more exotic CPU<->Memory bridges/caches might not benefit as much, although I am still learning the details. It is also designed to be CPU only, although none of OpenEXR's compression types current can support GPU (but some of the simpler ones might be quite easy to port to a GPU implementation).

I'm wondering if anyone here have experimented simply introducing zstd by itself to OpenEXR but not the blosc2 memory marshaling wrapping? It can be its own compression type, or even simply just replace current libdeflate invocation if we just want to see some results. This could introduce a smaller PR/add less dependency to OpenEXR in this one step; and we can better quantify the benefit of the two technologies, blosc2 and zstd, separately.

@clusty
Copy link
Author

clusty commented Aug 25, 2024

While we all agree that zstd is a respectable modern compressor to add, there seems to be more questions regarding blosc2 and its parameter tuning for different data types in OpenEXR. I also think the performance benefit of blosc2's memory marshalling is somewhat related to the hardware architecture, and more exotic CPU<->Memory bridges/caches might not benefit as much, although I am still learning the details. It is also designed to be CPU only, although none of OpenEXR's compression types current can support GPU (but some of the simpler ones might be quite easy to port to a GPU implementation).

I'm wondering if anyone here have experimented simply introducing zstd by itself to OpenEXR but not the blosc2 memory marshaling wrapping? It can be its own compression type, or even simply just replace current libdeflate invocation if we just want to see some results. This could introduce a smaller PR/add less dependency to OpenEXR in this one step; and we can better quantify the benefit of the two technologies, blosc2 and zstd, separately.

I had an initial prototype where I integrated zstd directly, so definitely doable. I wanted to mention that a lot of the compression gain can be attributed to the proper data preprocessing: currently using only a byte shuffle provided by blosc. We are in the process of trying to improve on this with some actual knowledge of the data being compressed which hopefully will increase the performance for half pixels and maybe even more.

The main performance of blosc comes for specialized SIMD implementations, so you are right that exotic architectures will not benefit out of the box, but can always be implemented inside blosc if need be.

Once we are done with iterating with the algo, we’ll deal with the dependency issues raised by this PR raises.

@lji-ilm
Copy link

lji-ilm commented Aug 25, 2024

I had an initial prototype where I integrated zstd directly, so definitely doable. I wanted to mention that a lot of the compression gain can be attributed to the proper data preprocessing: currently using only a byte shuffle provided by blosc. We are in the process of trying to improve on this with some actual knowledge of the data being compressed which hopefully will increase the performance for half pixels and maybe even more.

The main performance of blosc comes for specialized SIMD implementations, so you are right that exotic architectures will not benefit out of the box, but can always be implemented inside blosc if need be.

Once we are done with iterating with the algo, we’ll deal with the dependency issues raised by this PR raises.

I agree that a proper pre-filtering of byte patterns would benefit compression, as we're also investigate this aspect on our side. However, currently my work focus on only RGB-half data and does not look at deep data at all. I believe this PR is much further down the road in terms of deep data compression, and its backward compatibility with half.

For deep data that is mostly float32 streams, I would like to mention the zfp project https://github.com/LLNL/zfp and the scientific publication that outlines its principles https://ieeexplore.ieee.org/document/6876024, but maybe you & co are already familiar with this line of work.

@clusty
Copy link
Author

clusty commented Aug 25, 2024

Blosc already has a zfp implementation. I was not able to test it properly before since I did not implemented pixel unpacking for tiles ( this is the current bit that we are debugging ). Once that is done, this experiment is very easy to perform in a few lines of code.

@pleprince
Copy link
Contributor

pleprince commented Aug 25, 2024

FWIW, Zfp was my first idea. I experimented with it a bit but the compression rate was inferior to zstd on my deep datasets. It performed better in lossy mode but this not a fair comparison to zstd anymore and I didn't assess which level of loss is actually acceptable.
Might be worth revisiting with a larger data set though.

@aras-p
Copy link
Contributor

aras-p commented Sep 25, 2024

In my tests a few years ago, ZFP in lossless mode also was not great on EXR data. Both compression ratio and performance were not impressive. https://aras-p.info/blog/2021/08/27/EXR-Filtering-and-ZFP/

@TodicaIonut
Copy link
Contributor

all the codec implementations have moved from OpenEXR/Imf*Compressor to OpenEXRCore/internal_*? @pleprince @clusty

@pleprince
Copy link
Contributor

Hi @TodicaIonut. I merged the latest release and a new batch of changes in a private branch. We will update this PR when it's ready.

@TodicaIonut
Copy link
Contributor

I merged the latest release

commit a6ee9f0 Zstd Compressor ImfZstdCompressor.cpp, ImfZstdCompressor.h.
@pleprince

@clusty clusty marked this pull request as draft January 13, 2025 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants