Data from years 2018 onwards cannot be extracted with xarray #1
I was looking through several NetCDF files and climate variables from 2015 to 2018 and, unfortunately, I did not find any differences in terms of data structure. See an example for max temp at https://github.com/JJguri/bestiapop/blob/master/sample_data/netcdf_exploration.ipynb
Can you try to get to a single data point in both of those? Use a combination of lat-lon-time to get there and, if possible, display the data values for a whole week or something like that. We need to see that they are both reachable via the same code.
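A minimal sketch of such a check (the local file names, variable name, and coordinates are illustrative assumptions, not confirmed SILO paths):

```python
import xarray as xr

# Hypothetical local copies of the SILO annual max_temp files
for path in ("2015.max_temp.nc", "2018.max_temp.nc"):
    ds = xr.open_dataset(path)
    # Nearest grid cell to an example point, first seven days of the file
    week = (
        ds["max_temp"]
        .sel(lat=-27.5, lon=151.9, method="nearest")
        .isel(time=slice(0, 7))
    )
    print(path, week.values)
```

If both files print a week of values through the identical call chain, they are reachable via the same code.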
OK, this is good. We then need to extract the
I was playing around a little bit and, finally, I found that the 2015 and 2018 files are equal. The only difference I found was the position of the Attributes in the arrays. I am not sure whether it could be affecting the download; otherwise, the error is not in the structure of the file. Please see the netcdf_exploration notebook for details.
SILO used two different FillValue formats because the NetCDF files were constructed using different software tools. Data up to 2016 is in 64-bit format, with fill values of -32768. bestiapop should be able to read and skip all of them.
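As an illustration of how those fill values can be skipped: when xarray applies CF decoding on load, any value equal to the declared _FillValue is replaced with NaN, so skipping fill values reduces to masking NaNs. A sketch, with file and variable names assumed:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("2015.max_temp.nc")  # hypothetical local file
point = ds["max_temp"].sel(lat=-27.5, lon=151.9, method="nearest").values
# -32768 on disk arrives here as NaN after CF decoding; drop those entries
valid = point[~np.isnan(point)]
print(len(point), len(valid))
```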
I cannot find the "-32768" fill values (assuming that by "fill values" you mean non-existent values). I've tested multiple variables for year 2018 and they are all still "NaN". I still can't determine why the data exploration slows down for 2018 data. Perhaps the error is not where we thought it was?
I think it is also related to the dtype, as was discussed in pydata/xarray#2304.
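One way to check whether the on-disk dtype differs between years is to compare each variable's encoding metadata, which xarray keeps after decoding. A sketch (file and variable names assumed):

```python
import xarray as xr

for path in ("2015.max_temp.nc", "2018.max_temp.nc"):
    ds = xr.open_dataset(path)
    var = ds["max_temp"]
    # .encoding records the on-disk dtype, _FillValue, chunking, etc.
    print(path, var.dtype, var.encoding.get("dtype"), var.encoding.get("_FillValue"))
```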
Could this bug generate an issue when we want to publish the package? Which 'easy' options do we have to avoid this?
Closing this issue, as the problem was not with BestiaPop but rather with how SILO compiled its NetCDF4 files. As explained by the SILO team, as of June 2020, SILO has refactored all its NetCDF4 files to perform better when extracting spatial data points rather than time-based data points. This effectively means that it is slower to extract data for all days of the year, for a single combination of lat/lon, than it is to extract data for all combinations of lat/lon for a single day. Since SILO NetCDF4 files are split into one file per year and per climate variable, these factors create a double bottleneck when directly loading NetCDF4 files from AWS S3 buckets: a multi-year extraction for a single lat/lon must open many files, and each file is slow for exactly that access pattern.
This issue has now been circumvented by leveraging SILO's cloud API, which most likely is tied to a fast backend database and/or reads directly from their local NetCDF4 files while leveraging the power of cloud computing.
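A minimal sketch of the API-based approach; the endpoint URL and parameter names below are placeholders, not SILO's actual interface, so consult SILO's API documentation for the real ones:

```python
import requests

# Placeholder endpoint and parameters for a point/date-range query
response = requests.get(
    "https://example.com/silo-api/datadrill",
    params={"lat": -27.5, "lon": 151.9, "start": "20180101", "finish": "20181231"},
)
response.raise_for_status()
print(response.text[:500])  # e.g. a CSV payload with one row per day
```

The key design point is that a single HTTP request replaces opening and scanning an entire annual NetCDF4 file per variable.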
Xarray seems to have some issues extracting slices of data from SILO NetCDF4 files from 2018 onwards. The error thrown is:
Specifically, the error is here:
This seems to be related to the way files from 2018 onwards are encoded. We need to investigate further by obtaining one such file and exploring it interactively in Jupyter.
To troubleshoot this, we need to import xarray and then load the NetCDF4 file inside a Jupyter Notebook:
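A sketch of that step, assuming a locally downloaded 2018 file (the path is illustrative):

```python
import xarray as xr

# Load one of the post-2017 SILO files for interactive exploration
value_array = xr.open_dataset("2018.daily_rain.nc")
value_array  # in Jupyter, this displays dimensions, coordinates, and variables
```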
This will store the dataset in the variable value_array; selecting a point and calling .values then returns a numpy array. We need to figure out why this code
```python
value_array[variable_short_name].sel(lat=some_lat, lon=some_lon).values
```
(where variable_short_name is the name of the variable stored in the NetCDF file, such as "daily_rain") doesn't work for files later than 2017. It is likely that the values for that lat/lon combination are stored inside another layer of that array.
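To test that hypothesis interactively, one can compare the structure of a pre-2018 and a post-2017 file side by side. A small sketch (file and variable names assumed):

```python
import xarray as xr

for path in ("2017.daily_rain.nc", "2018.daily_rain.nc"):
    ds = xr.open_dataset(path)
    # Compare dimensions and on-disk encoding between the two file vintages;
    # a structural difference here would explain why .sel() behaves differently
    print(path, dict(ds.dims))
    print(ds["daily_rain"].encoding)
```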