Bitinformation of masked data #29

milankl · 2022-03-12T16:38:31Z

As requested by @aaronspring in #25, we need to extend the bitinformation function with a method for masked arrays. This is an issue to discuss a potential implementation of this and to discuss progress. The methods we would need are

function bitinformation(A::AbstractArray{NF},    # input array A
                        mask::BitArray,          # mask of A
                        kwargs...) where {T<:Union{Integer,AbstractFloat}}

Then I further suggest separating out the permutation for information along dimension dim,

function roll_dim_forward(A::AbstractArray,dim::Int)

which is basically this bit

BitInformation.jl/src/mutual_information.jl

Lines 66 to 73 in fa2dd50

    
           if dim > 1 
        
               permu = collect(1:ndims(A))             # permutation 
        
               perm0 = vcat(permu[2:end],permu[1])     # used to permute permu 
        
               for _ in 1:dim-1                       # create permutation array 
        
                   permute!(permu,perm0) 
        
               end 
        
               A = PermutedDimsArray(A,permu)          # permute A, such that desired dim is 1st dim     
        
           end

and moving the zeroing of insignificant information into its own function, like so

function set_insignificant_zero!(H::Vector,nelements::Int,confidence::Real)

which should basically include these lines

BitInformation.jl/src/mutual_information.jl

Lines 48 to 54 in fa2dd50

    
           if set_zero_insignificant 
        
               Hf = binom_free_entropy(nelements,confidence)   # free entropy of random 50/50 at trial size 
        
                                                               # get chance p for 1 (or 0) from binom distr 
        
               @inbounds for i in eachindex(M) 
        
                   M[i] = M[i] <= Hf ? 0 : M[i]                # set zero what's insignificant 
        
               end                           
        
           end

The text was updated successfully, but these errors were encountered:

milankl · 2022-03-21T16:54:53Z

This issues will be addressed in #30

milankl · 2022-03-23T12:57:13Z

I've added more tests in #30 and merged the PR as all of them pass. I'd still love to see the impact if you could share your results @aaronspring at some point. Many thanks

milankl · 2022-03-28T12:15:42Z

Originally posted by @aaronspring in #30 (comment)

@milankl I find differences whether using mask or not, specially the first bits and within 99-100% information:

dim: 4 time
from https://gist.github.com/aaronspring/5de0bc6be5a8d547f3503ff8b1aef8c6

dim: 1 x
from https://gist.github.com/aaronspring/7b0675e36467e5c647f2fe3f546d4bf5

dim analysis:
from https://gist.github.com/aaronspring/52662dd885eebfd6ce4b88939be016c6

milankl · 2022-03-28T12:17:59Z

Thanks Aaron, that looks awesome. Good to see that all these information patches in the exponent bits disappear! For dissicos could you convert your arrays to signed exponent bits? It looks like there's a bunch of exponent bits that simultaneously flips over (which happens when your data covers a range across floating-point 2)

julia> A = rand(Float32,3,3)
3×3 Matrix{Float32}:
 0.408687   0.922863  0.0622634
 0.838222   0.947521  0.141393
 0.0944574  0.991588  0.576514

julia> signed_exponent(A)
3×3 Matrix{Float32}:
 13.078    7.3829   127.515
  6.70577  7.58017   18.0983
 48.3622   7.93271    4.61211

It looks wrong, because Julia will interpret the exponent bits still as being biased (the information that the exponent bits are now to be interpreted differently is not stored), but you can always check that nothing went wrong by applying the inverse biased_exponent. If you use signed_exponent as a preprocessing step, note that also your mask will change

julia> signed_exponent([-9e-33])
1-element Vector{Float64}:
 -4.739053125085073e32

Originally posted by @milankl in #30 (comment)

aaronspring · 2022-03-28T13:24:09Z

as you say, signed_exponent added makes the black bar in dissicos disappear

milankl · 2022-03-28T13:30:41Z

Perfect, that looks really good to me. Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

We haven't addressed the interpolation problem due to the ocean-atmosphere coupling, I'd be curious whether this has any consquences.

aaronspring · 2022-03-28T13:57:46Z

could you point me to an example how to save a NetCDF inside julia with one of your compressions? I only find sizeof but no writing to disk

milankl · 2022-03-28T14:12:00Z

Use NetCDF.jl, but switch lossless compression on, like here. If you then pass on arrays with rounded mantissas via round(::Array,keepbits::Int) from BitInformation.jl you'll get it accordingly compressed. Alternatively, use nco like

ncks --baa=8 --ppc varname=3 file.nc file_compressed.nc

where baa=8 is bitrounding, varname=3 means 3 mantissa bits to keep, varname is the name of the nc variables, then input and output file. See this thread too nco/nco#250 (comment)

milankl · 2022-03-29T11:50:56Z

Branch #main now contains bitinformation(::AbstractArray{T};masked_value::T) where T so that you should be able to do directly

julia> using BitInformation

julia> A = rand(Float32,3,3);

julia> round!(A,1)
3×3 Matrix{Float32}:
 0.5   0.1875    0.25
 1.0   0.046875  0.0625
 0.25  0.5       0.75

julia> bitinformation(A,masked_value=0.5f0)

note that masked_value has to be of the same eltype as the array. That means if eltype(A)=Float32 then masked_value=0.5 will throw an error. However, I've used === to create the mask, so once #34 is merged you can even have NaNs in your data and simply use bitinformation(A,masked_value=NaN) (or NaN32) although == comparison for NaN always yields false 😉

milankl · 2022-03-29T13:57:56Z

v0.5 got merged into the Julia registry now. Will leave this issue open to eventually pick some of the posts here for the documentation.

aaronspring · 2022-03-29T14:45:01Z

Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.

now with the masking enabled, I am quite happy with the bitrounding results:
compared mean and std on monthly and annualized output as well as variance-weighted mean period as my variability analysis.
https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

milankl · 2022-03-29T16:21:55Z

https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574

Many thanks for sharing this. Makes me very happy to see that (once we've sorted out the masked array issue, which I wanted to do for a long time anyway!) we basically have another example where the bitwise real information content-approach just works. Sure you are left with the choice of 99% ... to 100% of information, but I believe that this will always be the case and in general dependent on your storage requirements. E.g. is 6x compression enough or will 10x be a game changer (because you can store more ensemble members, a higher temporal resolution, etc)? Is it data that should be kept for archiving purposes but won't be touched much?

If you have any other remarks, suggestions, concerns etc. feel free to reach out here!

aaronspring · 2022-03-30T07:51:33Z

Is it data that should be kept for archiving purposes but won't be touched much?

I just need to archive some data and probably wont touch it again.

But I think the potential for bitrounding is enormous. I will soon try it on high spatial resolution data. I expect much higher compression gains there. I am going to pitch my results in our ocean group meeting in early may.

milankl · 2022-03-30T09:53:33Z

But I think the potential for bitrounding is enormous.

I agree 😄 Especially once netcdf supports more lossless codecs, I believe zstd is soon-ish going to be supported (discussion here Unidata/netcdf-c#2227) which likely yields even higher compression factors in the "archiving won't touch much" case, or higher performance in the "we'll read this data every day" case.

milankl mentioned this issue Mar 22, 2022

Bitinformation of masked arrays #30

Merged

milankl added the enhancement New feature or request label Feb 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bitinformation of masked data #29

Bitinformation of masked data #29

milankl commented Mar 12, 2022

milankl commented Mar 21, 2022

milankl commented Mar 23, 2022

milankl commented Mar 28, 2022 •

edited

Loading

milankl commented Mar 28, 2022

aaronspring commented Mar 28, 2022

milankl commented Mar 28, 2022

aaronspring commented Mar 28, 2022

milankl commented Mar 28, 2022

milankl commented Mar 29, 2022

milankl commented Mar 29, 2022

aaronspring commented Mar 29, 2022

milankl commented Mar 29, 2022

aaronspring commented Mar 30, 2022

milankl commented Mar 30, 2022

Bitinformation of masked data #29

Bitinformation of masked data #29

Comments

milankl commented Mar 12, 2022

milankl commented Mar 21, 2022

milankl commented Mar 23, 2022

milankl commented Mar 28, 2022 • edited Loading

milankl commented Mar 28, 2022

aaronspring commented Mar 28, 2022

milankl commented Mar 28, 2022

aaronspring commented Mar 28, 2022

milankl commented Mar 28, 2022

milankl commented Mar 29, 2022

milankl commented Mar 29, 2022

aaronspring commented Mar 29, 2022

milankl commented Mar 29, 2022

aaronspring commented Mar 30, 2022

milankl commented Mar 30, 2022

milankl commented Mar 28, 2022 •

edited

Loading