
ds.to_netcdf() changes values of variable #6272

Closed

ArcticSnow opened this issue Feb 13, 2022 · 12 comments
Labels: needs mcve, usage question

ArcticSnow commented Feb 13, 2022

What happened?

I am very puzzled by an odd behavior of ds.to_netcdf() that modifies the values of a variable, partly offsetting them by 44500.

I have a list of files containing meteorological variables organized along the dimensions 'time', 'longitude', and 'latitude'. The files are loaded using xr.open_mfdataset('*.nc'). No problem here. The dataset is loaded in chunks using Dask.

In [11]: ds
Out[11]: 
<xarray.Dataset>
Dimensions:    (longitude: 10, latitude: 7, level: 14, time: 35760)
Coordinates:
  * longitude  (longitude) float32 6.5 6.75 7.0 7.25 7.5 7.75 8.0 8.25 8.5 8.75
  * latitude   (latitude) float32 61.25 61.0 60.75 60.5 60.25 60.0 59.75
  * level      (level) int32 600 650 700 750 775 800 ... 900 925 950 975 1000
  * time       (time) datetime64[ns] 1978-10-01 ... 2021-08-31T23:00:00
Data variables:
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2021-12-08 19:22:33 GMT by grib_to_netcdf-2.23.0: /opt/ecmw...

Now, I would like to save a subset of this dataset to a netcdf file as follows: ds.isel(latitude=[1,2,3], longitude=[3,4,5]).to_netcdf('sub.nc'). So far nothing unusual.

The variable z is a float32 which varies from 2000 to -2000 along the time dimension. After being saved in the subset, z is still a float32, but the values that are less than -1000 are offset by 44500. However, if I do (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub.nc') instead, then all values in the subsampled netcdf are fine. I am very puzzled by this behavior. Could this be an odd interaction between dask chunks and to_netcdf()?
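For reference, a minimal sketch of the two code paths (file names as in my setup; assuming the monthly source files sit in the working directory):

import xarray as xr

# the monthly source files are loaded lazily, in dask chunks
ds = xr.open_mfdataset('*.nc')

# path 1: after re-opening sub.nc, values below -1000 come back offset by ~44500
ds.z.isel(latitude=[1, 2, 3], longitude=[3, 4, 5]).to_netcdf('sub.nc')

# path 2: multiplying by 1 first, and the values round-trip unchanged
(ds.z.isel(latitude=[1, 2, 3], longitude=[3, 4, 5]) * 1).to_netcdf('sub1.nc')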

What did you expect to happen?

I expected no modification of the data after saving to netcdf, no matter what.

Minimal Complete Verifiable Example

I will share files upon request.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.5
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.5.0
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 7.31.1
sphinx: None

@ArcticSnow added the bug and needs triage labels Feb 13, 2022
andersy005 (Member) commented

@ArcticSnow,

The variable z is a float32 which varies from 2000 to -2000 along the time dimension. After being saved in the subset, z is still a float32, but the values that are less than -1000 are offset by 44500.

You may have scale_factor and add_offset attributes in your dataset.

However, if I do (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub.nc')

There's a chance xarray is discarding the attributes/encoding during the ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1 operation, and as a result, netCDF ends up not encoding z during the to_netcdf() call.
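(A quick way to see this with a toy array, a minimal sketch:)

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims='x', name='z')
da.encoding = {'dtype': 'int16', 'scale_factor': 0.5, 'add_offset': 100.0}
print(da.encoding)        # the packing info is attached
print((da * 1).encoding)  # {} -- arithmetic returns a new object without encoding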

What's the output of

print(ds.z.encoding)
print(ds.z.attrs)

??

@andersy005 added the usage question label and removed the bug and needs triage labels Feb 14, 2022
ArcticSnow (Author) commented Feb 14, 2022

Thank you for your reply @andersy005.

So this is the encoding before writing to netcdf, when loaded with xr.open_mfdataset():

In [61]: ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).encoding
Out[61]: 
{'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/inputs/climate/PLEV_197810.nc',
 'original_shape': (744, 14, 7, 10),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.6796473581594864,
 'add_offset': 21239.89345268811}

In [63]: ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).attrs
Out[63]: 
{'units': 'm**2 s**-2',
 'long_name': 'Geopotential',
 'standard_name': 'geopotential'}

After saving to netcdf with ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).to_netcdf('sub.nc') and re-opening the file as d = xr.open_dataset('sub.nc'):

In [64]: d.z.encoding
Out[64]: 
{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': True,
 'chunksizes': None,
 'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/sub.nc',
 'original_shape': (35760, 14, 3, 3),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.6796473581594864,
 'add_offset': 21239.89345268811}

So the chunks were concatenated into this single file. Now, if I look at, for instance, the same timeseries before and after saving to sub.nc:

[figure: the same z timeseries plotted before and after saving to sub.nc; parts of the saved series are offset by ~44500]

As this offset is not applied uniformly along the time dimension, I thought this could be seen as a "bug". Could it be that, when the encoding is not specified, each dask chunk is encoded independently?

*Sorry for the large data gap between 1980 and the 2000s.
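(One way to check the related possibility that the source files disagree, a sketch assuming they are in the working directory:)

import glob
import xarray as xr

# print the packing parameters stored in each source file
for f in sorted(glob.glob('*.nc')):
    with xr.open_dataset(f) as d:
        print(f, d.z.encoding.get('scale_factor'), d.z.encoding.get('add_offset'))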

ArcticSnow (Author) commented

And in case I multiply the variable z by 1 prior to saving with to_netcdf():

In [78]: (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub1.nc')

In [79]: d1 = xr.open_dataset('sub1.nc')

In [80]: d1.z.encoding
Out[80]: 
{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': True,
 'chunksizes': None,
 'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/sub1.nc',
 'original_shape': (35760, 14, 3, 3),
 'dtype': dtype('float32'),
 '_FillValue': nan}

ArcticSnow (Author) commented

Oh, and another strange thing: the timeseries I multiply by 1 and save to sub1.nc is not exactly the same:
[figure: the original z timeseries and the version saved to sub1.nc, showing small differences]

yuting-chen-0604 commented

I came across the same issue! @ArcticSnow, have you found any solution here? I struggled a bit but didn't find anything yet...

yuting-chen-0604 commented Nov 24, 2023

Changing the line from ds.to_netcdf("%s/ds.nc" % save_fp) to (ds*1).to_netcdf("%s/ds.nc" % save_fp) solves everything.
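(An equivalent but more explicit workaround is to clear the stale encoding before writing; a sketch, assuming ds and save_fp are defined as above:)

# drop the packing parameters inherited from the first source file,
# so xarray writes plain floats instead of re-packing to int16
for var in ds.variables.values():
    var.encoding = {}
ds.to_netcdf("%s/ds.nc" % save_fp)
# newer xarray versions also provide a Dataset.drop_encoding() helper for this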

@max-sixty added the needs mcve and plan to close labels Nov 24, 2023
max-sixty (Collaborator) commented

If someone has an MCVE, please feel free to add it here. Otherwise, we can close this until we have one.

ArcticSnow (Author) commented

@max-sixty, what is meant by MCVE?
@yuting-chen-mck, I moved away from using open_mfdataset().

max-sixty (Collaborator) commented

@max-sixty, what is meant by MCVE?

https://stackoverflow.com/help/minimal-reproducible-example

headtr1ck (Collaborator) commented

Without checking or knowing what is going on, I assume that ds*1 drops the encoding/attributes.

kmuehlbauer (Contributor) commented

Late to the party here; thanks for reviving this, @yuting-chen-mck.

Another issue with packed data (with scale_factor/add_offset) is that these values can differ across the source files. When concatenating, only the scale_factor/add_offset of the first dataset/file survives in the encoding. When writing out again, the packing uses that scale_factor/add_offset (and might even overflow the packed dtype):

import numpy as np

packed_value1 = np.int16(-30000)

# packing parameters of the file the value came from (the nth file)
scale_factor1 = 0.6796473581594864
add_offset1 = 21239.89345268811

# packing parameters of the first file (the only ones that survive concatenation)
scale_factor2 = 0.4796473581594864
add_offset2 = 21239.89345268811

# unpacking with the original parameters
unpacked_value = packed_value1 * scale_factor1 + add_offset1
print(f"1st unpacking: {packed_value1} -> {unpacked_value}")
# repacking with the first file's parameters (overflows int16 and wraps around)
packed_value2 = np.int16((unpacked_value - add_offset2) / scale_factor2)
print(f"repacking: {unpacked_value} -> {packed_value2}")
# unpacking again
unpacked_value2 = packed_value2 * scale_factor2 + add_offset2
print(f"2nd unpacking: {packed_value2} -> {unpacked_value2}")

# Output:
# 1st unpacking: -30000 -> 850.4727079035183
# repacking: 850.4727079035183 -> 23027
# 2nd unpacking: 23027 -> 32284.7331690266

If that's the case here, there is more reading in issue #5739 (and the issues linked therein). Bottom line: either drop the encoding or provide an encoding with fitting scale_factor/add_offset values.
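(Incidentally, an int16 wraparound of 65536 scaled by the surviving scale_factor, 65536 × 0.6796 ≈ 44541, matches the ~44500 offset reported above. A sketch of the second option, with illustrative min/max-based parameters; ds and the variable name z are from the example above:)

# choose packing parameters that fit the full concatenated range into int16
vmin = float(ds.z.min())
vmax = float(ds.z.max())
scale = (vmax - vmin) / (2**16 - 2)  # leave one packed value free for _FillValue
offset = (vmax + vmin) / 2
ds.to_netcdf(
    'sub.nc',
    encoding={'z': {'dtype': 'int16',
                    'scale_factor': scale,
                    'add_offset': offset,
                    '_FillValue': -32767}},
)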

@max-sixty removed the plan to close label Dec 5, 2023
kmuehlbauer (Contributor) commented

So the root cause here is probably how encoding is handled. For that, please see #6323.
