Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bitinformation of masked data #29

Open
milankl opened this issue Mar 12, 2022 · 14 comments
Open

Bitinformation of masked data #29

milankl opened this issue Mar 12, 2022 · 14 comments
Labels
enhancement New feature or request

Comments

@milankl
Copy link
Owner

milankl commented Mar 12, 2022

As requested by @aaronspring in #25, we need to extend the bitinformation function with a method for masked arrays. This is an issue to discuss a potential implementation of this and to discuss progress. The methods we would need are

function bitinformation(A::AbstractArray{NF},    # input array A
                        mask::BitArray,          # mask of A
                        kwargs...) where {T<:Union{Integer,AbstractFloat}}

Then I further suggest separating out the permutation for information along dimension dim,

function roll_dim_forward(A::AbstractArray,dim::Int)

which is basically this bit

if dim > 1
permu = collect(1:ndims(A)) # permutation
perm0 = vcat(permu[2:end],permu[1]) # used to permute permu
for _ in 1:dim-1 # create permutation array
permute!(permu,perm0)
end
A = PermutedDimsArray(A,permu) # permute A, such that desired dim is 1st dim
end

and moving the zeroing of insignificant information into its own function, like so

function set_insignificant_zero!(H::Vector,nelements::Int,confidence::Real)

which should basically include these lines

if set_zero_insignificant
Hf = binom_free_entropy(nelements,confidence) # free entropy of random 50/50 at trial size
# get chance p for 1 (or 0) from binom distr
@inbounds for i in eachindex(M)
M[i] = M[i] <= Hf ? 0 : M[i] # set zero what's insignificant
end
end

@milankl
Copy link
Owner Author

milankl commented Mar 21, 2022

This issues will be addressed in #30

@milankl
Copy link
Owner Author

milankl commented Mar 23, 2022

I've added more tests in #30 and merged the PR as all of them pass. I'd still love to see the impact if you could share your results @aaronspring at some point. Many thanks

@milankl
Copy link
Owner Author

milankl commented Mar 28, 2022

Originally posted by @aaronspring in #30 (comment)

@milankl I find differences whether using mask or not, specially the first bits and within 99-100% information:

dim: 4 time
image from https://gist.github.com/aaronspring/5de0bc6be5a8d547f3503ff8b1aef8c6

dim: 1 x
image from https://gist.github.com/aaronspring/7b0675e36467e5c647f2fe3f546d4bf5

dim analysis:
image from https://gist.github.com/aaronspring/52662dd885eebfd6ce4b88939be016c6

@milankl
Copy link
Owner Author

milankl commented Mar 28, 2022

Thanks Aaron, that looks awesome. Good to see that all these information patches in the exponent bits disappear! For dissicos could you convert your arrays to signed exponent bits? It looks like there's a bunch of exponent bits that simultaneously flips over (which happens when your data covers a range across floating-point 2)

julia> A = rand(Float32,3,3)
3×3 Matrix{Float32}:
 0.408687   0.922863  0.0622634
 0.838222   0.947521  0.141393
 0.0944574  0.991588  0.576514

julia> signed_exponent(A)
3×3 Matrix{Float32}:
 13.078    7.3829   127.515
  6.70577  7.58017   18.0983
 48.3622   7.93271    4.61211

It looks wrong, because Julia will interpret the exponent bits still as being biased (the information that the exponent bits are now to be interpreted differently is not stored), but you can always check that nothing went wrong by applying the inverse biased_exponent. If you use signed_exponent as a preprocessing step, note that also your mask will change

julia> signed_exponent([-9e-33])
1-element Vector{Float64}:
 -4.739053125085073e32

Originally posted by @milankl in #30 (comment)

@aaronspring
Copy link

as you say, signed_exponent added makes the black bar in dissicos disappear
by default 2022-03-28 at 15 21 26

@milankl
Copy link
Owner Author

milankl commented Mar 28, 2022

Perfect, that looks really good to me. Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

We haven't addressed the interpolation problem due to the ocean-atmosphere coupling, I'd be curious whether this has any consquences.

@aaronspring
Copy link

could you point me to an example how to save a NetCDF inside julia with one of your compressions? I only find sizeof but no writing to disk

@milankl
Copy link
Owner Author

milankl commented Mar 28, 2022

Use NetCDF.jl, but switch lossless compression on, like here. If you then pass on arrays with rounded mantissas via round(::Array,keepbits::Int) from BitInformation.jl you'll get it accordingly compressed. Alternatively, use nco like

ncks --baa=8 --ppc varname=3 file.nc file_compressed.nc

where baa=8 is bitrounding, varname=3 means 3 mantissa bits to keep, varname is the name of the nc variables, then input and output file. See this thread too nco/nco#250 (comment)

@milankl
Copy link
Owner Author

milankl commented Mar 29, 2022

Branch #main now contains bitinformation(::AbstractArray{T};masked_value::T) where T so that you should be able to do directly

julia> using BitInformation

julia> A = rand(Float32,3,3);

julia> round!(A,1)
3×3 Matrix{Float32}:
 0.5   0.1875    0.25
 1.0   0.046875  0.0625
 0.25  0.5       0.75

julia> bitinformation(A,masked_value=0.5f0)

note that masked_value has to be of the same eltype as the array. That means if eltype(A)=Float32 then masked_value=0.5 will throw an error. However, I've used === to create the mask, so once #34 is merged you can even have NaNs in your data and simply use bitinformation(A,masked_value=NaN) (or NaN32) although == comparison for NaN always yields false 😉

@milankl
Copy link
Owner Author

milankl commented Mar 29, 2022

v0.5 got merged into the Julia registry now. Will leave this issue open to eventually pick some of the posts here for the documentation.

@aaronspring
Copy link

Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

now with the masking enabled, I am quite happy with the bitrounding results:
compared mean and std on monthly and annualized output as well as variance-weighted mean period as my variability analysis.
https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

@milankl
Copy link
Owner Author

milankl commented Mar 29, 2022

https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

Many thanks for sharing this. Makes me very happy to see that (once we've sorted out the masked array issue, which I wanted to do for a long time anyway!) we basically have another example where the bitwise real information content-approach just works. Sure you are left with the choice of 99% ... to 100% of information, but I believe that this will always be the case and in general dependent on your storage requirements. E.g. is 6x compression enough or will 10x be a game changer (because you can store more ensemble members, a higher temporal resolution, etc)? Is it data that should be kept for archiving purposes but won't be touched much?

If you have any other remarks, suggestions, concerns etc. feel free to reach out here!

@aaronspring
Copy link

Is it data that should be kept for archiving purposes but won't be touched much?

I just need to archive some data and probably wont touch it again.

But I think the potential for bitrounding is enormous. I will soon try it on high spatial resolution data. I expect much higher compression gains there. I am going to pitch my results in our ocean group meeting in early may.

@milankl
Copy link
Owner Author

milankl commented Mar 30, 2022

But I think the potential for bitrounding is enormous.

I agree 😄 Especially once netcdf supports more lossless codecs, I believe zstd is soon-ish going to be supported (discussion here Unidata/netcdf-c#2227) which likely yields even higher compression factors in the "archiving won't touch much" case, or higher performance in the "we'll read this data every day" case.

@milankl milankl added the enhancement New feature or request label Feb 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants