invalid timestamps in the future #975

Closed
mathause opened this issue Aug 18, 2016 · 6 comments · Fixed by #984

Comments

@mathause
Collaborator

mathause commented Aug 18, 2016

If I have a netCDF file that has invalid timesteps from the 'future', it is wrongly converted to datetime64[ns].

import netCDF4 as nc
import numpy as np
import xarray as xr

# create netCDF file
ncf = nc.Dataset('test_future.nc', 'w')
ncf.createDimension('time')
ncf.createVariable('time', np.int64, dimensions=('time',))
ncf.variables['time'].units = 'days since 1850-01-01 00:00:00'
ncf.variables['time'].calendar = 'standard'
ncf.variables['time'][:] = np.arange(850) * 365
ncf.close()

# open with xr
ds = xr.open_dataset('test_future.nc')
# this works
ds
# ds.time is a datetime64[ns] object
# this fails
ds.time

If I choose calendar='noleap', the dates wrap around!

ncf = nc.Dataset('test_future_noleap.nc', 'w')
ncf.createDimension('time')
ncf.createVariable('time', np.int64, dimensions=('time',))
ncf.variables['time'].units = 'days since 1850-01-01 00:00:00'
ncf.variables['time'].calendar = 'noleap'
ncf.variables['time'][:] = np.arange(850) * 365
ncf.close()

# open with xr
ds = xr.open_dataset('test_future_noleap.nc')
# after 2262 they go back to 1678!
ds.time

If my 'invalid' time is from the 'past' it works as expected:

ncf = nc.Dataset('test_past.nc', 'w')
ncf.createDimension('time')
ncf.createVariable('time', np.int64, dimensions=('time',))
ncf.variables['time'].units = 'days since 1000-01-01 00:00:00'
ncf.variables['time'].calendar = 'standard'
ncf.variables['time'][:] = np.arange(850) * 365
ncf.close()

# open with xr
ds = xr.open_dataset('test_past.nc')
# this works
ds
# ds.time has dtype object
ds.time
@shoyer
Member

shoyer commented Aug 18, 2016

This is almost certainly related to the fact that datetime64[ns] does not support dates outside the years 1678-2262:
#789
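
For reference, these bounds follow from datetime64[ns] storing a signed 64-bit count of nanoseconds since 1970-01-01. A minimal NumPy sketch (not part of the original thread) that prints the exact limits:

```python
import numpy as np

# datetime64[ns] is an int64 count of nanoseconds since 1970-01-01.
# The smallest int64 value is reserved for NaT, so the usable range
# starts one nanosecond above it.
info = np.iinfo(np.int64)
earliest = np.datetime64(int(info.min) + 1, "ns")
latest = np.datetime64(int(info.max), "ns")

print(earliest)  # 1677-09-21T00:12:43.145224193
print(latest)    # 2262-04-11T23:47:16.854775807
```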

@mathause
Collaborator Author

Yes, definitely. However, the documentation states that we should get back a netcdftime.datetime object when we are out of these bounds (http://xarray.pydata.org/en/stable/time-series.html#creating-datetime64-data). This worked only for example (3) and not for example (1), that is, only for dates before 1678 and not for dates after 2262. I have not looked into why this occurs.

In my example (2) it returned a working datetime64[ns] object by wrapping around the year, which looks like an overflow problem. I also haven't looked into where this happens (xarray or netcdftime or ...), but this feels a bit dangerous to me.

@mathause
Collaborator Author

mathause commented Aug 19, 2016

I tried to look into the logic of decoding datetimes and I am not sure I got it. So the dtype of the dates should be:

if 1678 < year < 2262:
    `datetime64[ns]`
else:
    if calendar in ['standard', 'gregorian', 'proleptic_gregorian']:
        `datetime.datetime`
    else:
        `netcdftime._datetime.datetime`

(Is it ever a timedelta64[ns]?)

The necessary conversion seems to be determined lazily (which may be the core of my problem above). Try this:

import xarray as xr
import numpy as np

units = 'days since 1850-01-01 00:00:00'
dates = np.arange(850) * 365

dta = xr.conventions.DecodedCFDatetimeArray(dates, units)

dta[0:1] # a datetime64[ns] object
dta[-1] # a datetime.datetime object
dta[:] # a datetime.datetime object

However, when I load these dates from a netCDF file (see the example in the first post), it results in an error. (Thus, the behavior is not exactly the same as when using DecodedCFDatetimeArray; I haven't figured out why.)

Another (or the same) problem is that DecodedCFDatetimeArray.__init__ tests only the first element when trying to generate the error message. Maybe both the first and the last element should be tested.

Here: (https://github.com/pydata/xarray/blob/master/xarray/conventions.py, line 375)
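
The suggestion to test both the first and the last element could be sketched as follows. This is not xarray code: the function name `endpoints_fit_ns` is hypothetical, and the sketch assumes day offsets, a standard calendar, and monotonically ordered times.

```python
import numpy as np

# First and last whole days that fit into datetime64[ns].
LOWER = np.datetime64("1677-09-22", "D")
UPPER = np.datetime64("2262-04-11", "D")

def endpoints_fit_ns(day_offsets, origin="1850-01-01"):
    """Return True if the first and last dates fit into datetime64[ns].

    Hypothetical helper: offsets are whole days relative to `origin`.
    """
    offsets = np.asarray(day_offsets, dtype=np.int64)
    if offsets.size == 0:
        return True
    ref = np.datetime64(origin, "D")  # day resolution has a very wide range
    for off in (offsets[0], offsets[-1]):
        date = ref + np.timedelta64(int(off), "D")
        if not (LOWER <= date <= UPPER):
            return False
    return True

dates = np.arange(850) * 365
print(endpoints_fit_ns(dates[:1]))  # only the first element: looks fine
print(endpoints_fit_ns(dates))      # first and last element: out of range
```

Checking only the endpoints is still a heuristic; it misses out-of-range values in the middle of an unsorted array.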

@shoyer
Member

shoyer commented Aug 21, 2016

Yes, checking the first and last elements would be an improvement over our current heuristics.

I think you get the dtype selection right (we should write this down!). You can get timedelta64, but only if you have a units attribute that is just "units" not "units since origin".
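
Written down as a small function, the dtype selection described in this thread might look like the sketch below. The function name `expected_decoded_type` and its parameters are hypothetical, the year bounds are an approximation of the exact datetime64[ns] limits, and the function only returns a descriptive string rather than decoding anything.

```python
STANDARD_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}

def expected_decoded_type(units, first_year, last_year, calendar="standard"):
    """Describe the type that decoded time values are expected to have.

    Hypothetical summary of the rules discussed in this thread.
    """
    # Units without a reference date ("days" instead of "days since ...")
    # describe durations, not points in time.
    if " since " not in units:
        return "timedelta64[ns]"
    # Years 1678 through 2261 fit entirely into datetime64[ns].
    if first_year >= 1678 and last_year <= 2261:
        return "datetime64[ns]"
    if calendar in STANDARD_CALENDARS:
        return "datetime.datetime"
    return "netcdftime._datetime.datetime"

print(expected_decoded_type("days since 1850-01-01 00:00:00", 1850, 2698))
print(expected_decoded_type("days since 1850-01-01 00:00:00", 1850, 2698, "noleap"))
print(expected_decoded_type("days", 0, 0))
```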

@mathause
Collaborator Author

As hinted at above, there seem to be several issues here. I tried to implement the check of the first and last element, which seems to work for problem (1) in my original post. However, the OverflowError persisted, so I looked into that as well. My code is now a mess, but I figured out this second problem.

Pandas does not raise an overflow error when adding a TimedeltaIndex and a Timestamp.

import pandas as pd

# overflow error
pd.to_timedelta(106580, 'D') + pd.Timestamp('2000')
# no overflow error
pd.to_timedelta([106580], 'D') + pd.Timestamp('2000')

This screws up line 145 in https://github.com/pydata/xarray/blob/master/xarray/conventions.py.
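
Until the array case raises as well, one possible workaround is to check the range explicitly before adding. The sketch below uses only NumPy and Python integers (which cannot overflow); `add_days_checked` is a hypothetical helper, not a pandas or xarray function.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
INT64_MIN = np.iinfo(np.int64).min  # reserved for NaT in datetime64
NS_PER_DAY = 86_400 * 10**9

def add_days_checked(origin, days):
    """Return origin + days as datetime64[ns], or raise OverflowError.

    Hypothetical helper: the sums are computed with Python ints so that
    an out-of-range result is detected instead of wrapping around.
    """
    origin_ns = int(np.datetime64(origin, "ns").astype(np.int64))
    totals = [origin_ns + int(d) * NS_PER_DAY
              for d in np.atleast_1d(days).tolist()]
    if any(t > INT64_MAX or t <= INT64_MIN for t in totals):
        raise OverflowError("result does not fit into datetime64[ns]")
    return np.array(totals, dtype=np.int64).astype("datetime64[ns]")

print(add_days_checked("2000-01-01", [1, 366]))
try:
    add_days_checked("2000-01-01", [106580])  # same offset as above
except OverflowError as err:
    print("OverflowError:", err)
```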

@mathause
Collaborator Author

pandas-dev/pandas#14068
