API: Interpolate at new values #9340

rubennj · 2015-01-22T22:10:34Z

The first time I used the .interpolate() method, I thought it would take a new index and interpolate onto it, similar to scipy.interpolate.interp1d.
From the SciPy documentation:

import numpy as np
from scipy import interpolate
x = np.arange(0, 10)
y = np.exp(-x/3.0)
f = interpolate.interp1d(x, y)
xnew = np.arange(0, 9, 0.1)
ynew = f(xnew)   # use interpolation function returned by `interp1d`

Later I saw the .reindex() method, so I understood that this role is done by .reindex(). However .reindex() is not really doing a powerful interpolation, just extending the current values using the method keyword.

The current way to achieve this in version 0.15.0 is to join the previous and new indexes, reindex onto the joined index, interpolate, and then reindex onto the new index:

index_joined = df.index.join(new_index, how='outer')
df.reindex(index=index_joined).interpolate().reindex(new_index)

A simpler syntax could be accepting the methods from .interpolate() into the 'method' keyword of .reindex():

df.reindex(index=new_index, method='linear')
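For readers on a recent pandas, the workaround above can be sketched roughly as follows. This is a minimal sketch, not an official recipe: the sample data and the names `s` and `new_index` are made up for illustration, and it assumes a sorted numeric index. It uses `Index.union` instead of `join`, and `method="index"` so that the interpolation respects the spacing of the index values rather than treating rows as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: y = 2 * x on a float index.
s = pd.Series([0.0, 2.0, 4.0, 6.0], index=[0.0, 1.0, 2.0, 3.0])
new_index = pd.Index([0.5, 1.25, 2.0])

# Sorted union of old and new labels; the new labels get NaN after reindexing.
joined = s.index.union(new_index)

# method="index" interpolates using the numeric index values;
# the default method="linear" ignores the index and assumes equal spacing.
result = s.reindex(joined).interpolate(method="index").reindex(new_index)
print(result)
```

The final `reindex(new_index)` keeps only the requested labels, including any that already existed in the original index.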

TomAugspurger · 2015-01-22T23:01:36Z

See the (long) discussion at #4915

Basically, we wanted to keep the API of interpolate simple, but it's probably too clever since you have to be familiar with reindexing first. I'd actually favor changing the API of interpolate to have a new parameter at, which is an array you evaluate the interpolation function at (xnew in your first example).

Also it could take some kind of parameter for whether to return just the interpolated values, or all the values.

rubennj · 2015-01-24T15:59:12Z

I probably didn't explain this well. Actually, I see that reindexing is what I really meant, since pandas works with objects that already have an index, and the real intention of this proposal is to change the index (and consequently to interpolate the corresponding values).

I would say that .interpolate() is well defined as already is (to act at missing datapoints), since .reindex() is present.

Anyhow, to have a simple syntax (on .reindex() or .interpolate()) to get this action is very helpful.

rubennj · 2015-02-04T16:00:52Z

Any news? @TomAugspurger

shoyer · 2015-02-18T03:13:56Z

@rubennj I agree, we should have some sort of interpolation method that works like reindex. See also my recent PR to add a 'nearest' method to reindex: #9258

It is a bit awkward from an internals perspective to put this on reindex, because reindex currently does not do any interpolation, but rather only takes existing (possibly repeated) values. This is sometimes advantageous: you can always reindex, even if the data values are non-numeric.
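A small illustration of that point, as a sketch with made-up data: `reindex` with a `method` only picks existing values, so it works on strings, where numeric interpolation would have no meaning. The `"nearest"` option is the one added by the PR mentioned above.

```python
import pandas as pd

# Hypothetical non-numeric data on a sorted float index.
s = pd.Series(["a", "b", "c"], index=[0.0, 1.0, 2.0])

# Each new label takes the value of an existing label; nothing is computed.
nearest = s.reindex([0.4, 1.6], method="nearest")
forward = s.reindex([0.4, 1.6], method="ffill")
print(nearest.tolist(), forward.tolist())
```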

shoyer · 2015-02-18T03:38:53Z

What about creating two interpolate methods: .interpolate_na() and .interpolate_at()? The former would be an alias for the current interpolate (eventually to be deprecated); the latter would work for this new functionality.

I worry that hiding this functionality in reindex means it's unlikely to be easily found -- "interpolate" is a much more obvious name.

rubennj · 2015-02-18T22:56:47Z

OK, I see that .reindex() shouldn't be touched.

I think that one method, .interpolate(), modulated by parameters looks more compact, in the way @TomAugspurger suggested, but I wonder how compatible it would be with the current behaviour.

shoyer · 2015-02-18T23:20:55Z

We could add a new optional parameter at to interpolate, which if not None will trigger this alternate interpolate API.

My hesitation with combining the functionality into one method is that there are at least two steps required to transform from one mode to the other, e.g., s.interpolate_na() <=> s.dropna().interpolate_at(s.index) (the equivalent to s.interpolate_at(target) is even worse, as shown in the first post). The two methods do pretty fundamentally different things, though both involve interpolation.

Another option would be to move the interpolate_na functionality to fillna, and reserve interpolate for the interpolate_at functionality (at least pending deprecation cycles, etc.). That might be a slightly more awkward transition, though.

jreback · 2015-02-18T23:30:13Z

I think a nice soln here is to create a cookbook recipe for this pattern and a link from the docs

shoyer · 2015-02-18T23:46:27Z

> I think a nice soln here is to create a cookbook recipe for this pattern and a link from the docs

At least in my experience and anecdotally from my colleagues, the current API of interpolate (filling missing values) is unexpected and not what we were looking for. For example, it's pretty different from what scipy and numpy's interpolate functions do. So I think some sort of API change to make this more intuitive is warranted :).

Interpolation at new values is also a very common pattern in my experience. The cookbook recipe will need to be even more complex than @rubennj's example if it is to handle propagating NA correctly, e.g., to ensure that the result of pd.Series([1, 2, np.nan, 4]).interpolate_at([2.5]) is NaN, not 2.5. It's also non-trivial to wrap SciPy directly, because SciPy uses a different meaning for NaN (see scipy/scipy#4086 and the linked issues). So I believe pretty strongly that incorporating some sort of interpolate_at functionality in pandas itself would be a good idea (regardless of what it's called).
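To make the NA-propagation requirement concrete, here is one possible sketch. The helper name `interpolate_at_propagate_na` is hypothetical and not part of pandas; it assumes a sorted numeric index and linear interpolation, and it returns NaN whenever a target lies outside the index range or sits next to a missing source value.

```python
import numpy as np
import pandas as pd


def interpolate_at_propagate_na(s, target):
    """Hypothetical helper: linear interpolation that yields NaN when a
    bracketing source point is NaN or the target is outside the index."""
    x = s.index.to_numpy(dtype="float64")
    y = s.to_numpy(dtype="float64")
    t = np.asarray(target, dtype="float64")

    # Interpolate over the valid points only (np.interp needs increasing x).
    valid = ~np.isnan(y)
    out = np.interp(t, x[valid], y[valid])

    # Positions of the closest source points on either side of each target.
    left = np.searchsorted(x, t, side="right") - 1
    right = np.searchsorted(x, t, side="left")
    inside = (left >= 0) & (right < len(x))

    # Mask targets outside the range or adjacent to a missing source value.
    bad = ~inside
    bad[inside] = np.isnan(y[left[inside]]) | np.isnan(y[right[inside]])
    out[bad] = np.nan
    return pd.Series(out, index=pd.Index(t))


s = pd.Series([1, 2, np.nan, 4])
propagated = interpolate_at_propagate_na(s, [0.5, 2.5])
print(propagated)
```

With this input, the value at 0.5 is interpolated normally, while the value at 2.5 is NaN because its left neighbour (label 2) is missing.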

rubennj · 2015-02-19T23:15:03Z

I also thought that .interpolate() was doing that at first.
Interpolating onto a new index looks fundamental when you want to compare datasets with different but closely related indexes, e.g. several spectra or time series measured with different instruments.

I don't understand why overloading interpolate() is a problem, adding the new parameter at. It's not the most elegant solution, but the "two functions" solution doesn't look so ideally clear and it will be a pity to lose the interpolate() method.

To transfer the current function to fill_na() and to use interpolate() for interpolation with a new index is ideal and quite clear to understand in my opinion. However I don't know how bad the transition can be in this case.
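The time-series use case mentioned above might look like this with existing pandas methods. This is a sketch under assumptions: the timestamps and values are invented, and `method="time"` (an existing option of `interpolate` for a DatetimeIndex) is chosen so that the spacing between timestamps is taken into account.

```python
import pandas as pd

# Hypothetical readings from instrument A, one hour apart.
a = pd.Series(
    [0.0, 10.0],
    index=pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"]),
)

# Timestamps at which instrument B was sampled.
b_index = pd.to_datetime(["2015-01-01 00:15", "2015-01-01 00:30"])

# Align A onto B's timestamps: union, interpolate in time, select.
aligned = a.reindex(a.index.union(b_index)).interpolate(method="time").reindex(b_index)
print(aligned)
```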

TomAugspurger · 2015-08-14T16:15:49Z

I'm in favor of adding a new method, since the behavior is different enough from the current interpolate:

df.interpolate_at(new_values, method='linear')

@denfromufa are you interested in submitting a pull request for this? Otherwise I'll add it to my list.

den-run-ai · 2015-08-15T02:17:09Z

@TomAugspurger, sure I can submit, please confirm the logic:

def interpolate_at(df, new_idxs):
    return df.drop_duplicates().dropna(
    ).reindex(
        np.concatenate(
        (df.index, np.unique(new_idxs)))
        ).sort().interpolate().ix[new_idxs]

Also, should axis be an input, or is transposing easy enough?

shoyer · 2015-08-15T02:58:53Z

My thoughts:

The function signature should exactly match interpolate except for the values argument.
The implementation should probably be at a lower level to ensure that it makes a minimal number of copies.

den-run-ai · 2015-08-17T05:39:57Z

copying the signature from interpolate() looks feasible.

for lower level to avoid copies do you mean something like:

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

shoyer · 2015-08-17T05:51:07Z

My thought was that it's probably worth looking at the implementation of df.interpolate (which calls scipy.interpolate) and using that logic at a lower level.

den-run-ai · 2015-08-17T15:54:12Z

Keeping interpolate_at() at such a low level (scipy) would force me to duplicate all the other code related to fill_na and to filtering the options, just to preserve the signature of interpolate().

Please correct me if I am misunderstanding something.

The only reason I would go that low level is if we wanted to implement scattered grid interpolation on a MultiIndex :)

shoyer · 2015-08-17T16:38:33Z

OK, fair enough -- just thought it would be worth taking a look. I agree that we don't want to duplicate that logic, but it may be possible to refactor it pretty straightforwardly to make it work for both cases.

den-run-ai · 2015-08-18T15:26:05Z

Two options from the interpolate() signature look incompatible with interpolate_at():

limit : int, default None
    Maximum number of consecutive NaNs to fill.
inplace : bool, default False
    Update the NDFrame in place if possible.

shoyer · 2015-08-18T17:22:33Z

@denfromufa agreed, you can skip those.

As far as overall API goes, I would still advocate for renaming the existing interpolate to interpolate_na, just to make it clear what these interpolation methods do and that they are on equal footing.

fillna serves a distinct purpose -- it does reindexing style assignment to the locations with NAs.

jreback · 2015-08-18T21:38:33Z

for consistency let's call this interpolatena or interpna (the existing one).

den-run-ai · 2015-08-19T02:07:41Z

IMO, interpolate_na and interpolate_at are better choices, like suggested
originally, although not consistent with dropna & fillna.

shoyer · 2015-08-19T04:51:31Z

I don't like interpolatena because the words blend together without a separating character (especially with a vowel followed by a consonant). IMO, it's clearer with the extra _ character (which also makes it PEP8 compliant).

interpna matches an R function of the same name (and function) so there is some precedent there. But it's not immediately obvious like spelling the word out fully. I recall discussing this sort of thing on my tolerance pull request.

@denfromufa one more thought on implementation. I'm pretty sure that interpolate_at is a closer fit to the signature of the scipy functions than interpolate_na. This suggests that it might actually be a better idea (more efficient) to refactor interpolate_na to call interpolate_at rather than the other way around, e.g.,

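def interpolate_na(series, inplace=False):
    # Sketch only: interpolate_at is the proposed method and does not exist yet.
    na_locs = series.isnull()
    target = series.index[na_locs]
    # Evaluate the non-null points at the labels that currently hold NaN.
    new_values = series[~na_locs].interpolate_at(target)
    if not inplace:
        series = series.copy()
    series[na_locs] = new_values
    return series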
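A runnable variant of that idea, as a sketch: `interpolate_at` here is a hypothetical stand-in built on `np.interp`, not a pandas method, and it assumes a sorted numeric index and linear interpolation. `interpolate_na` is then a thin wrapper that evaluates the valid points at the labels holding NaN.

```python
import numpy as np
import pandas as pd


def interpolate_at(series, target):
    """Hypothetical stand-in for the proposed method, built on np.interp."""
    target = pd.Index(target)
    values = np.interp(
        target.to_numpy(dtype="float64"),
        series.index.to_numpy(dtype="float64"),
        series.to_numpy(dtype="float64"),
    )
    return pd.Series(values, index=target)


def interpolate_na(series):
    """Fill NaNs by evaluating the valid points at the NaN labels."""
    na_locs = series.isna()
    filled = series.copy()
    new_values = interpolate_at(series[~na_locs], series.index[na_locs])
    filled[na_locs] = new_values.to_numpy()
    return filled


s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0], index=[0.0, 1.0, 2.0, 3.0, 4.0])
filled = interpolate_na(s)
print(filled)
```

For interior gaps this agrees with `s.interpolate(method="index")`. Note that `np.interp` clamps to the end values outside the data range, so leading or trailing NaNs would be handled differently from the existing pandas behaviour.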

den-run-ai · 2015-08-30T15:15:37Z

I'm looking at this code now to add interpolate_at() and have a hard time with this code:

https://github.com/pydata/pandas/blob/ba0704f336c733f89ac8fa23c8700bd22ae620d4/pandas/core/common.py#L1632

            firstIndex = valid.argmax()
            valid = valid[firstIndex:]
            invalid = invalid[firstIndex:]
            result = yvalues.copy()
            if valid.all():
                return yvalues

can anyone explain it?

den-run-ai · 2015-08-30T23:10:40Z

@shoyer @TomAugspurger ok, firstIndex is probably the position of the first non-null value in the Series. All values before it (assuming the Series is sorted) cannot be interpolated. Then why isn't a lastIndex defined in the same way?
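The `argmax` step can be checked in isolation. This is a standalone sketch with made-up values, not the pandas source; the variable names mirror the snippet quoted above.

```python
import numpy as np

yvalues = np.array([np.nan, np.nan, 1.0, np.nan, 3.0])
invalid = np.isnan(yvalues)
valid = ~invalid

# argmax on a boolean array returns the position of the first True,
# i.e. the position of the first non-null value.
first_index = valid.argmax()

# Values before first_index have no left neighbour to interpolate from,
# so only the tail starting at first_index is considered.
valid_tail = valid[first_index:]
invalid_tail = invalid[first_index:]
print(first_index, invalid_tail.tolist())
```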

shoyer · 2015-08-30T23:49:17Z

@denfromufa It looks like the lack of this behavior at the end is a bug: #8000

den-run-ai · 2015-08-31T00:02:05Z

ok, I started making changes in https://github.com/denfromufa/pandas, I have some TODO items in the code

den-run-ai · 2015-10-28T18:18:32Z

Today I found one corner-case, hopefully this can be fixed at lower-level.

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

If after reindex() operation some indices are duplicates, then ix[new_idxs] generates some weird things with these duplicates [this part needs explanation]. Hence drop_duplicates() needs to be called after reindex().

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(np.concatenate(
        (df.index, np.unique(new_idxs))), inplace=True)
    df.drop_duplicates(inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

den-run-ai · 2015-10-29T13:54:49Z

It is even deeper :(

Duplicates need to be removed even before .reindex()! This is because the new_idxs and df.index may have some duplicate items.

Hopefully others do not step on the same rake while I'm finishing my interpolation pull request.

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df.reindex(
        np.concatenate(
        np.unique(
        (df.index, np.unique(new_idxs)))), inplace=True)
    df.drop_duplicates(inplace=True)
    df.sort(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

den-run-ai · 2016-11-07T23:32:50Z

latest version, previous one had bugs and sort() is deprecated:

def interpolate_at(df, new_idxs):
    df=df.drop_duplicates()
    df.dropna(inplace=True)
    df=df.reindex(
        np.unique(
        np.concatenate(
        (df.index, np.unique(new_idxs)))))
    df.drop_duplicates(inplace=True)
    df.sort_index(inplace=True)
    df.interpolate(inplace=True)
    return df.ix[new_idxs]

Note that this accepts both Series and DataFrames.

jreback · 2016-11-07T23:43:22Z

generally don't use inplace (it doesn't offer any benefit and makes code much harder to read)

use Index operations rather than numpy functions
numpy ops don't generally handle the full dtype set very well

den-run-ai · 2016-11-08T00:06:19Z

Oh, I see! How is unique in pandas much faster than numpy?!

http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.Index.unique.html#pandas-index-unique

shoyer · 2016-11-08T00:07:01Z

Pandas uses a hash table, whereas numpy just sorts.
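One visible consequence of the two approaches, shown as a small sketch with made-up labels: the hash-based pandas `unique` returns values in order of first appearance, while the sort-based `np.unique` returns them sorted.

```python
import numpy as np
import pandas as pd

labels = [3, 1, 3, 2, 1]

# Hash-table based: keeps the order in which values first appear.
pandas_unique = list(pd.Index(labels).unique())

# Sort based: the result comes back sorted.
numpy_unique = np.unique(labels).tolist()

print(pandas_unique, numpy_unique)
```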

den-run-ai · 2016-11-08T16:19:03Z

Here is cleaned up version using only pandas machinery, also fixed one more bug:

def interpolate_at(df, new_idxs):
    new_idxs = pd.Index(new_idxs)
    df = df.drop_duplicates().dropna()
    df = df.reindex(df.index.append(new_idxs).unique())
    df = df.sort_index()
    df = df.interpolate()
    return df.ix[new_idxs]

den-run-ai · 2016-12-06T03:58:12Z

This is an interesting related question:

http://stackoverflow.com/questions/40919497/is-it-possible-to-construct-a-pandas-series-which-auto-interpolates

Jostikas · 2017-07-24T06:52:56Z

To be clear, is the cookbook solution by denfromufa the current "best_for_many_cases" way to do this?

proinsias · 2021-04-03T23:54:10Z

@shoyer or @TomAugspurger - any more feedback for @denfromufa?

auxym · 2021-11-30T17:01:40Z

For one thing, I believe ix is deprecated in favor of loc?

But I also would be happy to see this feature in pandas. I thought it would fit well as an additional method in reindex, but interpolate_at works too.

auxym · 2021-11-30T17:46:07Z

drop_duplicates() though seems to be causing problems in my usage. If anything we'd want to remove duplicate indices, but drop_duplicates removes rows with duplicate values.
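Taking the two points above together, one possible update of the recipe is sketched below. It is an assumption-laden sketch rather than an agreed implementation: it replaces `.ix` with `.loc`, removes repeated index labels with `Index.duplicated` instead of calling `drop_duplicates()` on the values, and uses `method="index"` (my choice, not from the thread) so that the index spacing is respected. It assumes a sortable numeric index.

```python
import pandas as pd


def interpolate_at(obj, new_idxs):
    """Sketch: interpolate a Series or DataFrame at new index labels."""
    new_idxs = pd.Index(new_idxs)
    obj = obj.dropna()
    # Drop repeated index labels (keeping the first), not repeated values.
    obj = obj[~obj.index.duplicated(keep="first")]
    # union() combines both sets of labels, sorted where possible.
    obj = obj.reindex(obj.index.union(new_idxs.unique()))
    obj = obj.interpolate(method="index")
    return obj.loc[new_idxs]


# A flat segment followed by a rise: two rows share the value 1.0.
s = pd.Series([1.0, 1.0, 2.0], index=[0.0, 1.0, 2.0])
out = interpolate_at(s, [0.5, 1.5])
print(out)
```

In this example, dropping rows with duplicate values would remove the point at label 1.0 and shift both results, which is the kind of problem described above.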

TomAugspurger mentioned this issue Aug 14, 2015

support for simple interpolation with new values #10811

Closed

TomAugspurger changed the title ~~Improve of syntax to interpolate~~ Improve of syntax to interpolate at new values Aug 14, 2015

TomAugspurger changed the title ~~Improve of syntax to interpolate at new values~~ API: Interpolate at new values Aug 14, 2015

TomAugspurger added the API Design label Aug 14, 2015

TomAugspurger added this to the 0.17.0 milestone Aug 14, 2015

jreback mentioned this issue Aug 19, 2015

DataFrame.interpolate() is not equivalent to scipy.interpolate.interp1d #8796

Open

jreback modified the milestones: Next Major Release, 0.17.0 Aug 31, 2015

jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 31, 2015

shoyer mentioned this issue Oct 20, 2017

WIP: Feature/interpolate pydata/xarray#1640

Merged

4 tasks

mroeschke added Enhancement and removed API Design labels Apr 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
