
BitInformation for data previously reduced in precision #46

Open
rsignell-usgs opened this issue Feb 16, 2023 · 5 comments
@rsignell-usgs commented Feb 16, 2023

We have a large dataset from WRF, where the providers already reduced precision on some variables before passing to us. For example:

ALBEDO:number_of_significant_digits = 5 ;          
LAI:number_of_significant_digits = 3 ;
LH:number_of_significant_digits = 5 ;   
...

Can we use this knowledge somehow to create an alternate algorithm for calculating the appropriate keepbits?

@rsignell-usgs rsignell-usgs changed the title BitInformation for data known to be already reduced in precision BitInformation for data previously reduced in precision Feb 16, 2023

milankl commented Feb 16, 2023

Do you know whether they rounded in decimal or in binary? While rounding in one base also quantizes in the other, it doesn't *round* in the other base, i.e. it doesn't make the trailing digits/bits shorter or simpler, which is the actual point of rounding. Example:

julia> a = randn(Float32,5)
5-element Vector{Float32}:
 -0.35390654
  1.223104
 -0.04762434
 -1.5507344
  0.89877653

julia> b = round.(a,digits=3)
5-element Vector{Float32}:
 -0.354
  1.223
 -0.048
 -1.551
  0.899

julia> bitstring.(b,:split)
5-element Vector{String}:
 "1 01111101 01101010011111101111101"
 "0 01111111 00111001000101101000100"
 "1 01111010 10001001001101110100110"
 "1 01111111 10001101000011100101011"
 "0 01111110 11001100010010011011101"

While the mantissa bits aren't all zero beyond some position, the resulting array may still be more compressible, as the number of possible mantissa bitpatterns has been greatly reduced.
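By contrast, rounding in binary zeroes the trailing mantissa bits directly. A minimal Python sketch of round-to-nearest bitrounding with ties to even (an illustrative reimplementation, not the package's code; float32 with 23 mantissa bits and 0 < keepbits < 23 assumed):

```python
import numpy as np

def bitround(x, keepbits):
    """Round float32 values to `keepbits` mantissa bits (of 23),
    round-to-nearest with ties to even; assumes 0 < keepbits < 23."""
    ui = np.asarray(x, dtype=np.float32).view(np.uint32).copy()
    shift = 23 - keepbits
    half = np.uint32(1 << (shift - 1))                  # half of the discarded range
    mask = np.uint32((0xFFFFFFFF << shift) & 0xFFFFFFFF)
    ui += half - np.uint32(1) + ((ui >> shift) & np.uint32(1))  # nearest, ties to even
    ui &= mask                                          # zero the discarded trailing bits
    return ui.view(np.float32)
```

After `bitround(a, 10)` every value has at least 13 trailing zero bits, so unlike decimal rounding the trailing bitpatterns really are all zeros.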

But to actually answer your question, you can (roughly) translate between significant digits and bits via

nsb(nsd::Integer) = Integer(ceil(log(10)/log(2)*nsd))

which just arises from the idea that with $d$ digits you have a maximum absolute error of $\tfrac{10^{-d}}{2}$, and with $b$ bits a maximum absolute error of $\tfrac{2^{-b}}{2}$. Equating the two and rounding up (rather than down) gives the number of bits needed so that the error is not greater than the error from rounding in decimal. There are some caveats that @czender can elaborate more on.

In your case this means that if you already know that your dataset contains only $d$ digits of precision you can definitely round to

| $d$ digits | $b$ bits |
|---|---|
| 1 | 4 |
| 2 | 7 |
| 3 | 10 |
| 4 | 14 |
| 5 | 17 |
| 6 | 20 |
| 7 | 24 |
without losing any information. And in fact, you probably want to, because if there appears to be information in the mantissa bits beyond those, it is an artefact of the quantization.


milankl commented Feb 16, 2023

In fact, we already discussed that here nco/nco#250

@milankl milankl added the documentation Improvements or additions to documentation label Feb 16, 2023

rsignell-usgs commented Feb 17, 2023

@pnorton-usgs and I checked, and these variables were processed with NCO using args like `ppc ALBEDO:5`, which specifies the Number of Significant Digits (as opposed to the Decimal Significant Digits).

So a few values look like this:

import struct

a = ds['ALBEDO'][100,500:510,600].values

a
array([0.2371893 , 0.22657919, 0.22525072, 0.21817589, 0.20056486,
       0.2204647 , 0.22222233, 0.22418547, 0.22438478, 0.22910786],
      dtype=float32)
def binary(num):
    return ''.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))

for v in list(a):
    print(binary(v))
00111110011100101110000111000000
00111110011010000000010001100000
00111110011001101010100000100000
00111110010111110110100110000000
00111110010011010110000011100000
00111110011000011100000110000000
00111110011000111000111001000000
00111110011001011001000011100000
00111110011001011100010100100000
00111110011010101001101101000000


milankl commented Feb 17, 2023

Just checked what Charlie means by NSD vs DSD: the former bounds a relative error whereas the latter bounds an absolute error. Great that you used nco's ppc option, because then rounding is actually done in binary. Depending on the version of nco this should also do granular bitrounding (see here for more on this), meaning that keepbits varies somewhat from value to value. I guess it's between 16-18 keepbits here, looking at the trailing zeros, but it's obviously impossible to say for sure without knowing the full-precision values.

That gives you an upper bound on the keepbits, but I'd still just run the bitinformation analysis over the data and see what it says.
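Counting trailing zero bits makes that per-value estimate concrete. A quick Python sketch (illustrative names; on the ALBEDO sample printed above, `23 - tz` comes out between 16 and 18, consistent with the guess above):

```python
import numpy as np

def trailing_zeros(x):
    """Trailing zero bits of each float32's 32-bit pattern (32 for 0.0)."""
    ui = np.asarray(x, dtype=np.float32).view(np.uint32)
    return np.array([(int(u) & -int(u)).bit_length() - 1 if u else 32 for u in ui])

# the ALBEDO sample from above
a = np.array([0.2371893, 0.22657919, 0.22525072, 0.21817589, 0.20056486,
              0.2204647, 0.22222233, 0.22418547, 0.22438478, 0.22910786],
             dtype=np.float32)
tz = trailing_zeros(a)
print(23 - tz)  # per-value keepbits estimate
```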

@rsignell-usgs

Will do @milankl !! Thanks for the great rounding examples also. I feel like I'm starting to understand this stuff! :)
