Different data values from xarray open_mfdataset when using chunks #3686
Comments
Just a guess, but I think the problem here is that the calculations are done in floating-point arithmetic (probably float32), and you get accumulated precision errors depending on the number of chunks. Internally in the NetCDF file the data is stored as 16-bit integers. I'm basically just starting to use xarray myself, so please someone correct me if any of the above is wrong.
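To illustrate the kind of accumulation error being guessed at here: summing the very same float32 values with a float32 versus a float64 accumulator already produces visibly different totals. This is a standalone sketch, not code from the thread:

```python
import numpy as np

# One million copies of the same float32 value
x = np.full(1_000_000, np.float32(0.1), dtype=np.float32)

# Same data, different accumulator precision
s32 = np.sum(x, dtype=np.float32)
s64 = np.sum(x, dtype=np.float64)

print(s32, s64)  # the totals differ in the low digits
```

Splitting an array into chunks changes the order and precision of these partial sums in the same way, which is why chunked and unchunked reductions can disagree slightly.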
@dmedv Thanks for this, it all makes sense to me and I see the same results. However, I wasn't able to "convert back" using the same approach; that returns a different result than what I expect. This led me to another seemingly related issue: #2304. Loss of precision seems to be the key here, so coercing the data to a higher-precision float returns:
@abarciauskas-bgse Yes, indeed, I forgot about that.
Thanks for the useful issue @abarciauskas-bgse and valuable test @dmedv. I believe this is fundamentally a Dask issue. In general, Dask's algorithms do not guarantee numerically identical results for different chunk sizes. Roundoff errors accrue slightly differently based on how the array is split up. These errors are usually acceptable to users. For example, in 290.13754 vs. 290.13757 the error is in the 8th significant digit, 1 part in 100,000,000. Since there are only 65,536 distinct 16-bit integers (the original data type in the netCDF file), this seems more than adequate precision to me.

There appears to be a second issue here related to fill values, but I haven't quite grasped whether we think there is a bug.
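The chunk-size dependence described above is easy to reproduce directly in Dask, independent of xarray or the original file. A minimal sketch with synthetic data:

```python
import dask.array as da
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(1_000_000).astype("float32")

# Same data, two different chunkings
m_one_chunk = da.from_array(x, chunks=1_000_000).mean().compute()
m_many_chunks = da.from_array(x, chunks=1_000).mean().compute()

# The results agree to float32 precision but need not be bit-identical
print(m_one_chunk, m_many_chunks)
```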
There may be a reason why these operations are coupled. I would have to look more closely at the code to know for sure.
Actually, there is no need to separate them. One can simply do something like this to apply the mask:
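For instance, something along these lines. This is a hypothetical sketch of the manual decode: the integer values and CF attributes below stand in for whatever a variable read with mask_and_scale=False would actually contain.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a variable opened with mask_and_scale=False:
# raw int16 values plus the CF attributes xarray would otherwise decode.
raw = xr.DataArray(
    np.array([[-32768, 12000], [13000, 14000]], dtype="int16"),
    attrs={"_FillValue": -32768, "scale_factor": 0.001, "add_offset": 298.15},
)

fill = raw.attrs["_FillValue"]
scale = raw.attrs["scale_factor"]
offset = raw.attrs["add_offset"]

# Apply the mask and the scaling manually, in float64, in one step
decoded = xr.where(raw != fill, raw.astype("float64") * scale + offset, np.nan)
```

Doing the arithmetic in float64 from the start avoids the float32 roundoff discussed earlier in the thread.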
It's not a bug, but if we set
Thanks @rabernat. I would like to use assert_allclose to test the output, but at first pass it seems that might be prohibitively slow for large datasets. Do you recommend sampling or other good testing strategies (e.g. to assert that the xarray datasets are equal to some precision)?
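One possibility along these lines (a sketch, not code from the thread): xarray ships xr.testing.assert_allclose, which accepts rtol/atol, and comparing only a random sample of indices keeps the check cheap on large datasets.

```python
import numpy as np
import xarray as xr

ds_a = xr.Dataset({"sst": ("x", np.linspace(290.0, 291.0, 1000))})
ds_b = ds_a + 1e-7  # simulated chunking-induced roundoff

# Full comparison with a tolerance; raises AssertionError on mismatch
xr.testing.assert_allclose(ds_a, ds_b, rtol=1e-5)

# Or compare only a random sample of indices to keep the check cheap
idx = np.random.default_rng(0).choice(1000, size=50, replace=False)
xr.testing.assert_allclose(ds_a.isel(x=idx), ds_b.isel(x=idx), rtol=1e-5)
```

Sampling trades completeness for speed; a full assert_allclose on the final stored dataset is still the stronger integrity check.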
Closing as using
MCVE Code Sample
You will first need to download the data from PO.DAAC (or mount PO.DAAC's drive), which requires credentials:
Then run the following code:
Note, these are just a few examples but I tried a variety of other chunk options and got similar discrepancies between the unchunked and chunked datasets.
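The original snippet is not preserved above. A self-contained sketch of the same kind of comparison, using synthetic stand-ins for the MUR SST granules (the file names, the variable name analysed_sst, and the chunk sizes are hypothetical):

```python
import numpy as np
import xarray as xr

# Write two small synthetic granules (hypothetical stand-ins for the real files)
rng = np.random.default_rng(0)
paths = []
for day in (1, 2):
    ds = xr.Dataset(
        {"analysed_sst": (("time", "lat", "lon"),
                          (rng.random((1, 100, 100)) + 290.0).astype("float32"))},
        coords={"time": [day], "lat": np.arange(100), "lon": np.arange(100)},
    )
    path = f"mur_sst_day{day}.nc"
    ds.to_netcdf(path)
    paths.append(path)

# Open the same files with and without an explicit chunking
ds_unchunked = xr.open_mfdataset(paths, combine="by_coords")
ds_chunked = xr.open_mfdataset(paths, combine="by_coords",
                               chunks={"time": 1, "lat": 25, "lon": 25})

mean_unchunked = float(ds_unchunked["analysed_sst"].mean())
mean_chunked = float(ds_chunked["analysed_sst"].mean())

# The values agree to float32 precision but may not be bit-identical
print(mean_unchunked, mean_chunked)
```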
Output:
Expected Output
Values output from queries of the chunked and unchunked xarray datasets are equal.
Problem Description
I want to understand how to chunk or query the data to verify that data opened using chunks will have the same output as data opened without chunking. I would ultimately like to store the data in Zarr, but verifying data integrity is critical.
Output of xr.show_versions()
xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.9.1
distributed: 2.9.1
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: None
IPython: 7.11.1
sphinx: None