
ds.to_netcdf() changes values of variable #6272

Closed

ArcticSnow opened this issue Feb 13, 2022 · 12 comments
Labels: needs mcve, usage question

ArcticSnow commented Feb 13, 2022

What happened?

I am very puzzled by an odd behavior of ds.to_netcdf() that modifies the values of a variable, partly offsetting them by 44500.

I have a list of files containing meteorological variables organized along the dimensions 'time', 'longitude', and 'latitude'. The files are loaded using xr.open_mfdataset('*.nc'). No problem here. The dataset is loaded in chunks using Dask.

In [11]: ds
Out[11]: 
<xarray.Dataset>
Dimensions:    (longitude: 10, latitude: 7, level: 14, time: 35760)
Coordinates:
  * longitude  (longitude) float32 6.5 6.75 7.0 7.25 7.5 7.75 8.0 8.25 8.5 8.75
  * latitude   (latitude) float32 61.25 61.0 60.75 60.5 60.25 60.0 59.75
  * level      (level) int32 600 650 700 750 775 800 ... 900 925 950 975 1000
  * time       (time) datetime64[ns] 1978-10-01 ... 2021-08-31T23:00:00
Data variables:
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 dask.array<chunksize=(744, 14, 7, 10), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2021-12-08 19:22:33 GMT by grib_to_netcdf-2.23.0: /opt/ecmw...

Now, I would like to save a subset of this dataset to a netcdf file as follows: ds.isel(latitude=[1,2,3], longitude=[3,4,5]).to_netcdf('sub.nc'). So far nothing unusual.

The variable z is a float32 which varies from 2000 to -2000 along the time dimension. After being saved in the subset, z is still a float32, but the values that are less than -1000 are offset by 44500. However, if I do (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub.nc') instead, then all values in the subsampled netcdf are fine. I am very puzzled by this behavior. Could this be an odd interaction between dask chunks and to_netcdf()?
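For reference, a minimal sketch of the two code paths (file names as in my setup; assuming the monthly source files sit in the working directory):

import xarray as xr

# the monthly source files are loaded lazily, in dask chunks
ds = xr.open_mfdataset('*.nc')

# path 1: after re-opening sub.nc, values below -1000 come back offset by ~44500
ds.z.isel(latitude=[1, 2, 3], longitude=[3, 4, 5]).to_netcdf('sub.nc')

# path 2: multiplying by 1 first, and the values round-trip unchanged
(ds.z.isel(latitude=[1, 2, 3], longitude=[3, 4, 5]) * 1).to_netcdf('sub1.nc')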

What did you expect to happen?

I expected no modification of the data after saving to netcdf, no matter what.

Minimal Complete Verifiable Example

I will share files upon request.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-28-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.5
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.5.0
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.01.0
cupy: None
pint: None
sparse: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 7.31.1
sphinx: None

@ArcticSnow added the bug and needs triage labels Feb 13, 2022
andersy005 (Member) commented

@ArcticSnow,

The variable z is a float32 which varies from 2000 to -2000 along the time dimension. After being saved in the subset, z is still a float32, but the values that are less than -1000 are offset by 44500.

You may have scale_factor and add_offset attributes in your dataset.

However, if I do (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub.nc')

There's a chance xarray is discarding the attributes/encoding during the ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1 operation, and as a result, netCDF ends up not encoding z during the to_netcdf() call.
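(A quick way to see this with a toy array, a minimal sketch:)

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), dims='x', name='z')
da.encoding = {'dtype': 'int16', 'scale_factor': 0.5, 'add_offset': 100.0}
print(da.encoding)        # the packing info is attached
print((da * 1).encoding)  # {} -- arithmetic returns a new object without encoding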

What's the output of

print(ds.z.encoding)
print(ds.z.attrs)

??

@andersy005 added the usage question label and removed the bug and needs triage labels Feb 14, 2022
ArcticSnow (Author) commented Feb 14, 2022

Thank you for your reply @andersy005.

So this is the encoding before writing to netcdf, when loaded with xr.open_mfdataset():

In [61]: ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).encoding
Out[61]: 
{'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/inputs/climate/PLEV_197810.nc',
 'original_shape': (744, 14, 7, 10),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.6796473581594864,
 'add_offset': 21239.89345268811}

In [63]: ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).attrs
Out[63]: 
{'units': 'm**2 s**-2',
 'long_name': 'Geopotential',
 'standard_name': 'geopotential'}

After saving to netcdf with ds.z.isel(latitude=[1,2,3], longitude=[3,4,5]).to_netcdf('sub.nc') and re-opening the file as d = xr.open_dataset('sub.nc'):

In [64]: d.z.encoding
Out[64]: 
{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': True,
 'chunksizes': None,
 'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/sub.nc',
 'original_shape': (35760, 14, 3, 3),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.6796473581594864,
 'add_offset': 21239.89345268811}

So the chunks were concatenated into this single file. Now, if I look at, for instance, the same timeseries before and after saving to sub.nc:

[figure: the same z timeseries plotted before and after saving to sub.nc; parts of the saved series are offset by ~44500]

As this offset is not applied uniformly along the time dimension, I thought this could be seen as a "bug". Could it be that, when the encoding is not specified, each dask chunk is encoded independently?

*Sorry for the large data gap between 1980 and the 2000s.
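(One way to check the related possibility that the source files disagree, a sketch assuming they are in the working directory:)

import glob
import xarray as xr

# print the packing parameters stored in each source file
for f in sorted(glob.glob('*.nc')):
    with xr.open_dataset(f) as d:
        print(f, d.z.encoding.get('scale_factor'), d.z.encoding.get('add_offset'))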

ArcticSnow (Author) commented

And in case I multiply the variable z by 1 prior to saving with to_netcdf():

In [78]: (ds.z.isel(latitude=[1,2,3], longitude=[3,4,5])*1).to_netcdf('sub1.nc')

In [79]: d1 = xr.open_dataset('sub1.nc')

In [80]: d1.z.encoding
Out[80]: 
{'zlib': False,
 'shuffle': False,
 'complevel': 0,
 'fletcher32': False,
 'contiguous': True,
 'chunksizes': None,
 'source': '/home/simonfi/github/TopoPyScale_examples/ex1_norway_finse/sub1.nc',
 'original_shape': (35760, 14, 3, 3),
 'dtype': dtype('float32'),
 '_FillValue': nan}

ArcticSnow (Author) commented

Oh, and another strange thing: the timeseries I multiply by 1 and save to sub1.nc is not exactly the same:
[figure: the original z timeseries and the version saved to sub1.nc, showing small differences]

yuting-chen-0604 commented

I came across the same issue! @ArcticSnow, have you found any solution here? I struggled a bit but didn't find anything yet...

yuting-chen-0604 commented Nov 24, 2023

Changing the line from ds.to_netcdf("%s/ds.nc" % save_fp) to (ds*1).to_netcdf("%s/ds.nc" % save_fp) solves everything.
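(An equivalent but more explicit workaround is to clear the stale encoding before writing; a sketch, assuming ds and save_fp are defined as above:)

# drop the packing parameters inherited from the first source file,
# so xarray writes plain floats instead of re-packing to int16
for var in ds.variables.values():
    var.encoding = {}
ds.to_netcdf("%s/ds.nc" % save_fp)
# newer xarray versions also provide a Dataset.drop_encoding() helper for this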

@max-sixty added the needs mcve and plan to close labels Nov 24, 2023
max-sixty (Collaborator) commented

If someone has an MCVE, please feel free to add it here. Otherwise, we can close this until we have one.

ArcticSnow (Author) commented

@max-sixty, what is meant by MCVE?
@yuting-chen-mck, I moved away from using open_mfdataset().

max-sixty (Collaborator) commented

@max-sixty, what is meant by MCVE?

https://stackoverflow.com/help/minimal-reproducible-example

headtr1ck (Collaborator) commented

Without checking or knowing what is going on, I assume that ds*1 drops the encoding/attributes.

kmuehlbauer (Contributor) commented

Late to the party here; thanks for reviving this, @yuting-chen-mck.

Another issue with packed data (with scale_factor/add_offset) is that these values can differ across the source files. When concatenating, only the scale_factor/add_offset of the first dataset/file survives in the encoding. When writing out again, the packing uses that scale_factor/add_offset (and might even overflow the packed dtype):

import numpy as np

packed_value1 = np.int16(-30000)

# packing parameters of the file the value came from (the nth file)
scale_factor1 = 0.6796473581594864
add_offset1 = 21239.89345268811

# packing parameters of the first file (the only ones that survive concatenation)
scale_factor2 = 0.4796473581594864
add_offset2 = 21239.89345268811

# unpacking with the original parameters
unpacked_value = packed_value1 * scale_factor1 + add_offset1
print(f"1st unpacking: {packed_value1} -> {unpacked_value}")
# repacking with the first file's parameters (overflows int16 and wraps around)
packed_value2 = np.int16((unpacked_value - add_offset2) / scale_factor2)
print(f"repacking: {unpacked_value} -> {packed_value2}")
# unpacking again
unpacked_value2 = packed_value2 * scale_factor2 + add_offset2
print(f"2nd unpacking: {packed_value2} -> {unpacked_value2}")

# Output:
# 1st unpacking: -30000 -> 850.4727079035183
# repacking: 850.4727079035183 -> 23027
# 2nd unpacking: 23027 -> 32284.7331690266

If that's the case here, there is more reading in issue #5739 (and the issues linked therein). Bottom line: either drop the encoding or provide an encoding with fitting scale_factor/add_offset values.
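(Incidentally, an int16 wraparound of 65536 scaled by the surviving scale_factor, 65536 × 0.6796 ≈ 44541, matches the ~44500 offset reported above. A sketch of the second option, with illustrative min/max-based parameters; ds and the variable name z are from the example above:)

# choose packing parameters that fit the full concatenated range into int16
vmin = float(ds.z.min())
vmax = float(ds.z.max())
scale = (vmax - vmin) / (2**16 - 2)  # leave one packed value free for _FillValue
offset = (vmax + vmin) / 2
ds.to_netcdf(
    'sub.nc',
    encoding={'z': {'dtype': 'int16',
                    'scale_factor': scale,
                    'add_offset': offset,
                    '_FillValue': -32767}},
)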

@max-sixty removed the plan to close label Dec 5, 2023
kmuehlbauer (Contributor) commented

So the root cause here is probably how encoding is handled. For that, please see #6323.
