Decoding time according to CF conventions raises error if a NaN is found #1662

gmaze · 2017-10-26T11:33:44Z

Working with Argo data, I have difficulties decoding time-related variables:
More specifically, it may happens that a variable being a date contains FillValue that are set to NaN at the opening of the netcdf file. That makes the decoding to raise an error.

Sure I can open the netcdf file with the decode_times = False option but it's not an issue of being able or not to decode the data, it seems to me to be about how to handle FillValue in a time axis.

I understand that with most of gridded datasets, the time axis/dimension/coordinate is full and does not contains missing values, that may be explaining why nobody have reported this before.

Here is a simple way to reproduce the error:

attrs = {'units': 'days since 1950-01-01 00:00:00 UTC'} # Classic Argo data Julian Day units

# OK !
jd = [24658.46875, 24658.46366898, 24658.47256944] # Sample of Julian date from Argo data
ds = xr.Dataset({'time': ('time', jd, attrs)})
print xr.decode_cf(ds)

<xarray.Dataset>
Dimensions:  (time: 3)
Coordinates:
  * time     (time) datetime64[ns] 2017-07-06T11:15:00 ...
Data variables:
    *empty*

But then:

# Not OK with a NaN
jd = [24658.46875, 24658.46366898, 24658.47256944, np.NaN] # Another sample of Julian date from Argo data
ds = xr.Dataset({'time': ('time', jd, attrs)})
print xr.decode_cf(ds)

ValueError: unable to decode time units 'days since 1950-01-01 00:00:00 UTC' with the default calendar. Try opening your dataset with decode_times=False. Full traceback:
Traceback (most recent call last):
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 389, in __init__
    result = decode_cf_datetime(example_value, units, calendar)
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 157, in decode_cf_datetime
    dates = _decode_datetime_with_netcdf4(flat_num_dates, units, calendar)
  File "/Users/gmaze/anaconda/envs/obidam/lib/python2.7/site-packages/xarray/conventions.py", line 99, in _decode_datetime_with_netcdf4
    dates = np.asarray(nc4.num2date(num_dates, units, calendar))
  File "netCDF4/_netCDF4.pyx", line 5244, in netCDF4._netCDF4.num2date (netCDF4/_netCDF4.c:64839)
ValueError: cannot convert float NaN to integer

I would expect the decoding to work like in the first case and to simply preserve NaNs where they are.

Any ideas or suggestions ?
Thanks

The text was updated successfully, but these errors were encountered:

rabernat · 2017-10-26T12:16:03Z

Hi Guillaume! Nice to see so many old friends showing up on the xarray repo...

The issue you raise is totally reasonable from a user perspective: missing values in datetime data should be permitted. But there are some upstream issues that make it challenging to solve (like most of our headaches related to datetime data).

In numpy (and computer arithmetic in general), NaN only exists in floating point datatypes. It is impossible to have a numpy datetime array with NaN in it:

>>> a = np.array(['2010-01-01', '2010-01-02'], dtype='datetime64[ns]')
>>> a[0] = np.nan
ValueError: Could not convert object to NumPy datetime

The same error would be raised if a were an integer array; to get around that, xarray automatically casts integers with missing data to floats. But that approach obviously doesn't work with datetime dtypes.

Further downstream, xarray relies on netcdf4-python's num2date function to decode the date. The error is raised by that package.

This is my understanding of the problem. Some other folks here like @jhamman and @spencerkclark might have ideas about how to solve it. They are working on a new package called netcdftime which will isolate and hopefully enhance such time encoding / decoding functions.

shoyer · 2017-10-26T16:56:03Z

I'm pretty sure this used to work in some form. I definitely worked with a dataset in the infancy of xarray that had coordinates with missing times.

The current issue appears to be that pandas represents the NaT values as an integer, and then (predictably) suffers from numeric overflow:

In [8]: import pandas as pd

In [9]: pd.to_timedelta(['24658 days 11:15:00', 'NaT']) + pd.Timestamp('1950-01-01')
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-9-cc287bf4c401> in <module>()
----> 1 pd.to_timedelta(['24658 days 11:15:00', 'NaT']) + pd.Timestamp('1950-01-01')

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py in __add__(self, other)
    658                 return self.shift(other)
    659             elif isinstance(other, (Timestamp, datetime)):
--> 660                 return self._add_datelike(other)
    661             else:  # pragma: no cover
    662                 return NotImplemented

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/indexes/timedeltas.py in _add_datelike(self, other)
    354             other = Timestamp(other)
    355             i8 = self.asi8
--> 356             result = checked_add_with_arr(i8, other.value)
    357             result = self._maybe_mask_results(result, fill_value=iNaT)
    358         return DatetimeIndex(result, name=self.name, copy=False)

~/conda/envs/xarray-py36/lib/python3.6/site-packages/pandas/core/algorithms.py in checked_add_with_arr(arr, b, arr_mask, b_mask)
    889
    890     if to_raise:
--> 891         raise OverflowError("Overflow in int64 addition")
    892     return arr + b
    893

OverflowError: Overflow in int64 addition

This appears to be specific to our use of a TimedeltaIndex. Overflow doesn't appear if you add either value as scalars:

In [11]: pd.NaT + pd.Timestamp('1950-01-01')
Out[11]: NaT

In [12]: pd.Timedelta('24658 days 11:15:00') + pd.Timestamp('1950-01-01')
Out[12]: Timestamp('2017-07-06 11:15:00')

gmaze · 2017-10-27T14:42:56Z

Hi Ryan, never been very far, following/promoting xarray around here, and congrats for Pangeo !

Ok, I get the datatype being wrong, but about the issue from pandas TimedeltaIndex:
Does this means that a quick/dirty fix should be to decode value by value rather than on a vector ?

gmaze · 2017-10-27T15:44:11Z

Note that if the xarray decode_cf is given a NaT, in a datetime64, it works:

attrs = {'units': 'days since 1950-01-01 00:00:00 UTC'} # Classic Argo data Julian Day reference
jd = [24658.46875, 24658.46366898, 24658.47256944, np.NaN]  # Sample

def dirtyfixNaNjd(ref,day):
    td = pd.NaT
    if not np.isnan(day):
        td = pd.Timedelta(days=day)
    return pd.Timestamp(ref) + td

jd = [dirtyfixNaNjd('1950-01-01',day) for day in jd]
print jd

[Timestamp('2017-07-06 11:15:00'), Timestamp('2017-07-06 11:07:40.999872'), Timestamp('2017-07-06 11:20:29.999616'), NaT]

then:

ds = xr.Dataset({'time': ('time', jd, {'units': 'ns'})}) # Update the units attribute appropriately
ds = xr.decode_cf(ds)
print ds['time'].values

['2017-07-06T11:15:00.000000000' '2017-07-06T11:07:40.999872000'
 '2017-07-06T11:20:29.999616000' 'NaT']

shoyer mentioned this issue Oct 26, 2017

NaT in TimedeltaIndex + Timestamp overflows pandas-dev/pandas#17991

Closed

gmaze closed this as completed Nov 21, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decoding time according to CF conventions raises error if a NaN is found #1662

Decoding time according to CF conventions raises error if a NaN is found #1662

gmaze commented Oct 26, 2017

rabernat commented Oct 26, 2017

shoyer commented Oct 26, 2017

gmaze commented Oct 27, 2017

gmaze commented Oct 27, 2017

Decoding time according to CF conventions raises error if a NaN is found #1662

Decoding time according to CF conventions raises error if a NaN is found #1662

Comments

gmaze commented Oct 26, 2017

rabernat commented Oct 26, 2017

shoyer commented Oct 26, 2017

gmaze commented Oct 27, 2017

gmaze commented Oct 27, 2017