Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding time according to CF conventions raises error if a NaN is found #1662

Closed
gmaze opened this issue Oct 26, 2017 · 4 comments
Closed

Comments

@gmaze
Copy link
Contributor

gmaze commented Oct 26, 2017

Working with Argo data, I have difficulties decoding time-related variables:
More specifically, it may happens that a variable being a date contains FillValue that are set to NaN at the opening of the netcdf file. That makes the decoding to raise an error.

Sure I can open the netcdf file with the decode_times = False option but it's not an issue of being able or not to decode the data, it seems to me to be about how to handle FillValue in a time axis.

I understand that with most of gridded datasets, the time axis/dimension/coordinate is full and does not contains missing values, that may be explaining why nobody have reported this before.

Here is a simple way to reproduce the error:

attrs = {'units': 'days since 1950-01-01 00:00:00 UTC'} # Classic Argo data Julian Day units

# OK !
jd = [24658.46875, 24658.46366898, 24658.47256944] # Sample of Julian date from Argo data
ds = xr.Dataset({'time': ('time', jd, attrs)})
print xr.decode_cf(ds)

<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) datetime64[ns] 2017-07-06T11:15:00 ...
Data variables:
    *empty*

But then:

# Not OK with a NaN
jd = [24658.46875, 24658.46366898, 24658.47256944, np.NaN] # Another sample of Julian date from Argo data
ds = xr.Dataset({'time': ('time', jd, attrs)})
print xr.decode_cf(ds)

ValueError: unable to decode time units 'days since 1950-01-01 00:00:00 UTC' with the default calendar. Try opening your dataset with decode_times=False. Full traceback:
Traceback (most recent call last):
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 389, in __init__
    result = decode_cf_datetime(example_value, units, calendar)
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 157, in decode_cf_datetime
    dates = _decode_datetime_with_netcdf4(flat_num_dates, units, calendar)
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 99, in _decode_datetime_with_netcdf4
    dates = np.asarray(nc4.num2date(num_dates, units, calendar))
  File "netCDF4/_netCDF4.pyx", line 5244, in netCDF4._netCDF4.num2date (netCDF4/_netCDF4.c:64839)
ValueError: cannot convert float NaN to integer

I would expect the decoding to work like in the first case and to simply preserve NaNs where they are.

Any ideas or suggestions ?
Thanks

@rabernat
Copy link
Contributor

Hi Guillaume! Nice to see so many old friends showing up on the xarray repo...

The issue you raise is totally reasonable from a user perspective: missing values in datetime data should be permitted. But there are some upstream issues that make it challenging to solve (like most of our headaches related to datetime data).

In numpy (and computer arithmetic in general), NaN only exists in floating point datatypes. It is impossible to have a numpy datetime array with NaN in it:

>>> a = np.array(['2010-01-01', '2010-01-02'], dtype='datetime64[ns]')
>>> a[0] = np.nan
ValueError: Could not convert object to NumPy datetime

The same error would be raised if a were an integer array; to get around that, xarray automatically casts integers with missing data to floats. But that approach obviously doesn't work with datetime dtypes.

Further downstream, xarray relies on netcdf4-python's num2date function to decode the date. The error is raised by that package.

This is my understanding of the problem. Some other folks here like @jhamman and @spencerkclark might have ideas about how to solve it. They are working on a new package called netcdftime which will isolate and hopefully enhance such time encoding / decoding functions.

@shoyer
Copy link
Member

shoyer commented Oct 26, 2017

I'm pretty sure this used to work in some form. I definitely worked with a dataset in the infancy of xarray that had coordinates with missing times.

The current issue appears to be that pandas represents the NaT values as an integer, and then (predictably) suffers from numeric overflow:

In [8]: import pandas as pd

In [9]: pd.to_timedelta(['24658 days 11:15:00', 'NaT']) + pd.Timestamp('1950-01-01')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-9-cc287bf4c401> in <module>()
----> 1 pd.to_timedelta(['24658 days 11:15:00', 'NaT']) + pd.Timestamp('1950-01-01')

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py in __add__(self, other)
    658                 return self.shift(other)
    659             elif isinstance(other, (Timestamp, datetime)):
--> 660                 return self._add_datelike(other)
    661             else:  # pragma: no cover
    662                 return NotImplemented

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/indexes/timedeltas.py in _add_datelike(self, other)
    354             other = Timestamp(other)
    355             i8 = self.asi8
--> 356             result = checked_add_with_arr(i8, other.value)
    357             result = self._maybe_mask_results(result, fill_value=iNaT)
    358         return DatetimeIndex(result, name=self.name, copy=False)

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/algorithms.py in checked_add_with_arr(arr, b, arr_mask, b_mask)
    889
    890     if to_raise:
--> 891         raise OverflowError("Overflow in int64 addition")
    892     return arr + b
    893

OverflowError: Overflow in int64 addition

This appears to be specific to our use of a TimedeltaIndex. Overflow doesn't appear if you add either value as scalars:

In [11]: pd.NaT + pd.Timestamp('1950-01-01')
Out[11]: NaT

In [12]: pd.Timedelta('24658 days 11:15:00') + pd.Timestamp('1950-01-01')
Out[12]: Timestamp('2017-07-06 11:15:00')

@gmaze
Copy link
Contributor Author

gmaze commented Oct 27, 2017

Hi Ryan, never been very far, following/promoting xarray around here, and congrats for Pangeo !

Ok, I get the datatype being wrong, but about the issue from pandas TimedeltaIndex:
Does this means that a quick/dirty fix should be to decode value by value rather than on a vector ?

@gmaze
Copy link
Contributor Author

gmaze commented Oct 27, 2017

Note that if the xarray decode_cf is given a NaT, in a datetime64, it works:

attrs = {'units': 'days since 1950-01-01 00:00:00 UTC'} # Classic Argo data Julian Day reference
jd = [24658.46875, 24658.46366898, 24658.47256944, np.NaN]  # Sample

def dirtyfixNaNjd(ref,day):
    td = pd.NaT
    if not np.isnan(day):
        td = pd.Timedelta(days=day)
    return pd.Timestamp(ref) + td

jd = [dirtyfixNaNjd('1950-01-01',day) for day in jd]
print jd
[Timestamp('2017-07-06 11:15:00'), Timestamp('2017-07-06 11:07:40.999872'), Timestamp('2017-07-06 11:20:29.999616'), NaT]

then:

ds = xr.Dataset({'time': ('time', jd, {'units': 'ns'})}) # Update the units attribute appropriately
ds = xr.decode_cf(ds)
print ds['time'].values
['2017-07-06T11:15:00.000000000' '2017-07-06T11:07:40.999872000'
 '2017-07-06T11:20:29.999616000' 'NaT']

@gmaze gmaze closed this as completed Nov 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants