Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bzip2, lz4, and zstd filters from HDF5Plugins.jl #875

Closed

Conversation

mkitti
Copy link
Member

@mkitti mkitti commented Nov 1, 2021

This adds bzip2, lz4, and zstd filters from HDF5Plugins.jl

To Do: Add tests

Closes #874

Copy link
Member

@musm musm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick comments, haven't reviewed the PR fully yet.

Project.toml Show resolved Hide resolved
src/filters/TestUtils.jl Outdated Show resolved Hide resolved
src/HDF5.jl Outdated Show resolved Hide resolved
@mkitti
Copy link
Member Author

mkitti commented Nov 1, 2021

I'm going to convert this to a draft, since I'm seeing several additional items I might want to change beyond the comments above.

  1. Should we create a new module to encapsulate the Blosc code? I notice that the Blosc exports are leaking into the Filters module. For the new filters, I've created a filter for each of them, mimicking the name of the original C source files.
  2. Should we rename FILTER_BLOSC to H5Z_FILTER_BLOSC to be consistent with the other builtin and compression filters?
  3. Should the Filter type be parameterized by the filterid? This would simplify the interface and allow us to define filterid for all filters. We might even be able to autopopulate the FILTERS dictionary just by querying the subtypes of Filter.

@mkitti mkitti marked this pull request as draft November 1, 2021 15:36
src/filters/H5Zzstd.jl Outdated Show resolved Hide resolved
@musm
Copy link
Member

musm commented Nov 12, 2021

It would also be good for @simonbyrne to take a look and review this since he overhauled much of the Filters module recently.

@musm
Copy link
Member

musm commented Nov 12, 2021

Should we create a new module to encapsulate the Blosc code? I notice that the Blosc exports are leaking into the Filters module. For the new filters, I've created a filter for each of them, mimicking the name of the original C source files.
👍
Should we rename FILTER_BLOSC to H5Z_FILTER_BLOSC to be consistent with the other builtin and compression filters?
👍

I think those make sense

@mkitti
Copy link
Member Author

mkitti commented Nov 12, 2021

I'm almost ready to take this out of draft status. I've been battle testing this against some exotic real world applications, and it appears to be holding up. Compatibility with h5py and hdf5plugins from conda-forge seems to work well.

I'm starting to think about a second iteration since we may be able to multithread the ZSTD compressor easily after I merge the parameters API over at CodecZstd.jl:
JuliaIO/CodecZstd.jl#31

Using the C API directly this would be

cctx = LibZstd.ZSTD_createCCtx()
LibZstd.ZSTD_CCtx_setParameter(cctx, LibZstd.ZSTD_c_nbWorkers, Threads.nthreads())
LibZstd.ZSTD_compress2(cctx, ...)

src/filters/H5Zbzip2.jl Outdated Show resolved Hide resolved
@mkitti mkitti marked this pull request as ready for review November 15, 2021 00:42
Register the filter with the HDF5 library via API.h5z_register.
Also add F to the FILTERS dictionary.
"""
function register_filter(filter::F) where F <: Filter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason this needs to be defined on an instance rather than just a type?

src/filters/filters.jl Outdated Show resolved Hide resolved
@mkitti
Copy link
Member Author

mkitti commented Nov 21, 2021

It looks like the nightly tests are hanging and taking 6 hours. Is it waiting for input?

@musm
Copy link
Member

musm commented Nov 22, 2021

It looks like the nightly tests are hanging and taking 6 hours. Is it waiting for input?

Nightly tests are currently broken on master. You can safely ignore them.

LICENSE.txt Outdated Show resolved Hide resolved
@musm musm mentioned this pull request Nov 23, 2021
@musm
Copy link
Member

musm commented Nov 23, 2021

So main 'breaking' change is that users will now have to load in the Blosc module in order to use that compression filter, since it's currently loaded by default on master. I think that makes sense and is reasonable. I'm leaving this as a note to myself to update the changelog with this addition for the next release.

Alternatively, we could decide to automatically load in Blosc by default and hot-load the rest, but is there any reason to prefer this? My guess is no, we should just let them all be hot-loaded on an as need basis.

In any case, I think it is superior to hot-load the compression filters on a as need basis. In other words, why load in all the filters if we don't need them. So to me the latest changes are more elegant even if it requires some Requires.jl machinery.

Copy link
Member

@musm musm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes to abbreviate license comments

src/filters/H5Zbzip2.jl Outdated Show resolved Hide resolved
src/filters/H5Zlz4.jl Outdated Show resolved Hide resolved
test/filter.jl Outdated Show resolved Hide resolved
@mkitti
Copy link
Member Author

mkitti commented Dec 2, 2021

I did some refactoring and got the generic register_filter to work #875 (comment) . The generic methods are now all defined on the type rather than the instance. The generic methods on the instances just refer to the methods defined on the type now.

I suppose we could drop the instance methods, but I do anticipate that there may be a filter where one may need to define methods based on the instance rather than just the type.

@mkitti
Copy link
Member Author

mkitti commented Dec 2, 2021

As of 351c7e1 on Julia 1.7:

julia> @time using HDF5
  0.441126 seconds (157.25 k allocations: 10.384 MiB, 44.38% compilation time)

julia> @time HDF5.Filters.register_filters()
  1.231270 seconds (2.75 M allocations: 152.838 MiB, 2.90% gc time, 82.70% compilation time)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

@kylebeggs
Copy link

I've been having a lot of trouble with HDF5 on M1 Macs and this fixed my issues!

@musm
Copy link
Member

musm commented Dec 3, 2021

As of 351c7e1 on Julia 1.7:

julia> @time using HDF5
  0.441126 seconds (157.25 k allocations: 10.384 MiB, 44.38% compilation time)

julia> @time HDF5.Filters.register_filters()
  1.231270 seconds (2.75 M allocations: 152.838 MiB, 2.90% gc time, 82.70% compilation time)

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Yeah registering filters is probably a lot slower than we'd like. I think at this stage, I'll review the PR again and get it merged this weekend. We can then look into improving compilation times of the submodules.

@mkitti
Copy link
Member Author

mkitti commented Dec 3, 2021

Specifically @kylebeggs was having an issue with Blosc. Since this branch does not automatically load Blosc, he was able to use this branch to read HDF5 files which did not require Blosc decompression.

@mkitti
Copy link
Member Author

mkitti commented Dec 6, 2021

@simonbyrne and @musm , please let me know if there any further unresolved issues here.

@musm
Copy link
Member

musm commented Dec 7, 2021

@mkitti sounds good, thanks for the patience with this one. I plan on giving it another review sometime this week merging.

@simonbyrne if you have any additional comments, that would be great to add beforehand, otherwise I think this PR should be good to go.

The major follow up work is to package out the submodules so we can take advantage of precompilation.


See `API.h5z_register`
"""
filter_func(::F) where {F<:Filter} = filter_func(F)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless the functions on instances are actually used, I would remove them and just keep the type ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific? Do you want to remove filtername(::F) where F <: Filter as well?

julia> HDF5.Filters.filtername(BloscFilter())
ERROR: MethodError: no method matching filtername(::BloscFilter)
Closest candidates are:
  filtername(::HDF5.Filters.UnknownFilter) at C:\Users\kittisopikulm\.julia\dev\HDF5\src\filters\filters.jl:157
  filtername(::Type{HDF5.Filters.UnknownFilter}) at C:\Users\kittisopikulm\.julia\dev\HDF5\src\filters\filters.jl:158
  filtername(::Type{HDF5.Filters.H5Zblosc.BloscFilter}) at C:\Users\kittisopikulm\.julia\dev\HDF5\src\filters\H5Zblosc.jl:139
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[66]:1

julia> HDF5.Filters.filtername(BloscFilter)
"blosc"

julia> HDF5.Filters.filtername(typeof(BloscFilter()))
"blosc"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess unless you need it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted here since the general public API should be invoking this on the instance. It allows for more general implementations of filters. Using filtername or filter_func on an instance always works at the moment whether it is an UnknownFilter or a BloscFilter instance. Being able to garner the information statically from a type is the special case here with an important technical detail (a la @cfunction).

What I'm thinking about is if a package like JLD2 wanted to query information about the filters to do HDF5-compatible filtering or compression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this, I concur. I've removed the methods on the Filter instances in c450eba

@mkitti
Copy link
Member Author

mkitti commented Dec 7, 2021

The major follow up work is to package out the submodules so we can take advantage of precompilation.

I'm already prototyping this. The question I'm coming up against is how do we want to structure the dependencies in the future and how the HDF5.Filters.register_filters() function should behave. What is the long term interface for loading filters? Should the Blosc filter be loaded upon using Blosc or using H5ZBlosc?

  1. We could drop all the compressor dependencies from HDF5.jl. In this case, H5Z*.jl packages would depend on HDF5 but not vice versa. If people wanted to use one of the filters they would have to install the H5Z* packages. We may be able to provide a fallback to load the module as we do now without precompilaton, but there may be some unfriendly warnings or errors from trying to load codec package (e.g. Blosc or CodecLz4) that is not listed in the dependencies.

  2. We could add a reverse dependency from HDF5 to the H5Z* packages. While HDF5 might not directly depend on the compression codecs, the H5Z packages would. Thus packages like Blosc[2?], CodecBzip2, CodecLz4, CodecZstd, would remain as indirect dependencies.

Comment on lines +143 to +145
filter_cfunc(::Type{BloscFilter}) = @cfunction(blosc_filter, Csize_t,
(Cuint, Csize_t, Ptr{Cuint}, Csize_t,
Ptr{Csize_t}, Ptr{Ptr{Cvoid}}))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed? Won't the generic one work?

Copy link
Member Author

@mkitti mkitti Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is useful since all the arguments here are constants and @cfunction will return a pointer rather than a closure CFunction. The closure depends on LLVM trampolines, which are not available on all platforms.

https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/#Closure-cfunctions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right. maybe add a comment to that effect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to that effect in c450eba (still needs to be pushed to this branch)

@musm
Copy link
Member

musm commented Dec 7, 2021

Ideally we could design the dependency to work as follows:

using HDF5

then if a user needs a filter they would load

using CodecLz4

In other words, I'd like the interface to remain exactly as in this PR.

Would something like the following work?

function __init__()

    @require Blosc="a74b3585-a348-5f62-a45c-50e91977d574" @eval begin
        using H5Zblosc # registered package living in this repo
        import .H5Zblosc: BloscFilter
        register_filter(BloscFilter)
    end

end

@mkitti
Copy link
Member Author

mkitti commented Dec 7, 2021

Would something like the following work?

In my prototype, HDF5.Filters.__init__() is as follows:

function __init__()
    @require Blosc="a74b3585-a348-5f62-a45c-50e91977d574" @eval begin
        try
               @eval using H5Zblosc
        catch err
            @warn "Using embedded copy of H5Zblosc. Use Pkg.add(\"H5Zblosc\") for precompilation."
            include("H5ZBlosc/src/H5Zblosc.jl")
            import .H5Zblosc: BloscFilter
        end
        register_filter(BloscFilter)
    end
    @require CodecBzip2="523fee87-0ab8-5b00-afb7-3ecf72e48cfd" @eval begin
        try
            @eval using H5Zbzip2
        catch err
            @warn "Using embedded copy of H5Zbzip2. Use Pkg.add(\"H5Zbzip2\") for precompilation."
            include("H5Zbzip2/src/H5Zbzip2.jl")
            import .H5Zbzip2: Bzip2Filter
        end
        register_filter(Bzip2Filter)
    end
    @require CodecLz4="5ba52731-8f18-5e0d-9241-30f10d1ec561" @eval begin
        try
            @eval using H5Zlz4
        catch err
            @warn "Using embedded copy of H5Zlz4. Use Pkg.add(\"H5Zlz4\") for precompilation."
            include("H5Zlz4/src/H5Zlz4.jl")
            import .H5Zlz4: Lz4Filter
        end
        register_filter(Lz4Filter)
    end
    @require CodecZstd="6b39b394-51ab-5f42-8807-6242bab2b4c2" @eval begin
        try
            @eval using H5Zzstd
        catch err
            @warn "Using embedded copy of H5Zzstd Use Pkg.add(\"H5Zzstd\") for precompilation."
            include("H5Zzstd/src/H5Zzstd.jl")
            import .H5Zzstd: ZstdFilter
        end
        register_filter(ZstdFilter)
    end
end

The H5Zblosc module exports BloscFilter.

@simonbyrne
Copy link
Collaborator

Requires is a bit of a hack, personally I think it would be easiest to create subpackages (HDF5Blosc etc) that live in the same repo, and register them when need be. In that way, user have to explicitly add the dependency if they want the functionality (otherwise another package using Blosc could trigger the requires hook inadvertently).

@musm
Copy link
Member

musm commented Dec 7, 2021

(otherwise another package using Blosc could trigger the requires hook inadvertently

That's a good point and I didn't think about this scenario. How realistic/big a problem is that in practice?

So I suppose the alternative suggestion is instead of relying on the original compression packages e.g. Blosc.jl. We could instead on the H5Z* packages (I think this is @mkitti second suggestion), as in:

function __init__()

    @require H5Zblosc ="a74b3585-a348-5f62-a45c-50e91977d574" @eval begin
        @eval using H5Zblosc # registered package (submodule) living in this repo
        import .H5Zblosc: BloscFilter
        register_filter(BloscFilter)
    end

end

Thus users could use

using HDF5
...
using H5Zblosc # load and register Blosc in HDF5

@mkitti
Copy link
Member Author

mkitti commented Dec 7, 2021

In the second scenario, where the user does using H5Zblosc to load the filter, the registration code should be in the __init__ for H5Zblosc. Requires.jl is no longer needed, but may be useful in a fallback scenario. The second scenario is also cleaner in that it reduces the number of dependencies for HDF5.jl.

@mkitti
Copy link
Member Author

mkitti commented Dec 7, 2021

If we are going with the second scenario that does not use Requires.jl, then perhaps I should commit my prototype now to this branch. The main effect is creating some new folders with Project.tomls and then moving the compression modules into those folders. The HDF5.jl Project.toml also changes as well. Per the comment above, we may need to do some modification of the __init__ functions.

In any scenario, perhaps we should drop HDF5.Filters.register_filters() since that does not appear to be a method we want the user to use. They should either do using Blosc or using H5Zblosc.

HDF5.Filters.register_filters() only really works if we make the compression libraries direct dependencies. If I drop the direct dependencies from HDF5.jl for Blosc.jl, but have Blosc.jl in my project I get a warning when trying to use it. If I do not have the package (e.g. CodecBZip2) in my project, but try to use it, I get an error.

julia> HDF5.Filters.register_filters()
┌ Warning: Package HDF5 does not have Blosc in its dependencies:
│ - If you have HDF5 checked out for development and have
│   added Blosc as a dependency but haven't updated your primary
│   environment's manifest file, try `Pkg.resolve()`.
│ - Otherwise you may need to report an issue with HDF5
└ Loading Blosc into HDF5 from project dependency, future warnings for HDF5 are suppressed.
[ Info: Precompiling H5Zblosc [c8ec2601-a99c-407f-b158-e79c03c2f5f7]

ERROR: ArgumentError: Package HDF5 does not have CodecBzip2 in its dependencies:
- If you have HDF5 checked out for development and have
  added CodecBzip2 as a dependency but haven't updated your primary
  environment's manifest file, try `Pkg.resolve()`.
- Otherwise you may need to report an issue with HDF5

@mkitti
Copy link
Member Author

mkitti commented Dec 8, 2021

I've implemented the independent H5Zblosc, H5Zbzip2, H5Zlz4, and H5Zzstd subdirectory packages in #880. If we are already committed to moving in that direction, I can just merge that branch into this one.

@musm
Copy link
Member

musm commented Dec 9, 2021

This is looking good. So after merging the appropriate PRs here. I'll need to register each individual subpackage? Now since the individual subpackages are 'independent' can you now just move the appropriate licenses over to each subpackages' license file (right now each subpackage doesn't have a license, which presumably we want to add since they can be installed independently of HDF5?).

@mkitti
Copy link
Member Author

mkitti commented Dec 9, 2021

I'll need to register each individual subpackage?

Yes. Well, technically anyone in JuliaIO could register them, so the burden of doing so does not need be completely on you. I've had to look into how to structure and register subdirectory packages for #880 , so perhaps I could do the initial registration which would be a template for you to do future updates? I could also just tell you how to do it. Frankly, the process is not as well documented as it should be.

Now since the individual subpackages are 'independent' can you now just move the appropriate licenses over to each subpackages' license file (right now each subpackage doesn't have a license, which presumably we want to add since they can be installed independently of HDF5?).

Yes. We would need to apply a license to the individual sub-packages in order to register them. Presumably we would copy the same LICENSE file on master. To be clear, the individual sub-packages should be under MIT as well. We could also move the appropriate other copyright notices into the individual package subdirectories, although leaving them where they are at the root of the repository would also still work. It may be better to keep them where they are now just so it is clear that some of the code in the repository does have some additional notice requirements.

@musm
Copy link
Member

musm commented Dec 16, 2021

superseded by #875

@musm musm closed this Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HDF5Plugins.jl: lz4, zstd, and bzip2 filter support
4 participants