-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lead/lag with respect to a time variable #37
base: master
Are you sure you want to change the base?
Conversation
Any thought? |
Thanks for taking a stab at this! My main concern is that ideally this would have a lazy implementation (just like I'm not sure whether it should be done by simply allowing cc: @nalimilan (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem). |
I’m not sure it would be that ideal to have a lazy implementation of this lag
function — no one wants a vector that is as slow as a
dictionary. The trade-off does not seem to be worth it in this case.
The issue is that this package exports lag, so any method that
uses base types needs to be in this package to avoid type piracy. On the
other hand, I am not sure that ShiftedArray is the right type here.
…On Mon, May 4, 2020 at 2:41 PM Pietro Vertechi ***@***.***> wrote:
Thanks for taking a stab at this!
My main concern is that ideally this would have a lazy implementation
(just like lag, lead normally do), so that this line
<https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37/files#diff-278249179b68fe78888f2d5e18630430R142>
is executed on getindex.
I'm not sure whether it should be done by simply allowing ShiftedArray to
have two arrays (keys and indices), or whether one could somehow get it to
work using as parent array some custom array type which also stores its
indices.
cc: @nalimilan <https://github.com/nalimilan> (he is using this package
in StatsKit, so he may have feedback on what is the best way to add time
series support in a way that is helpful to the statistical ecosystem).
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXNBA4IGAN7OFJT73PDRP4D4LANCNFSM4MNL2TJA>
.
|
To me it sounds like this should be a specialization of |
How would that work with panel data though? ie using lag on a grouped
DataFrame?
…On Mon, May 4, 2020 at 3:05 PM Milan Bouchet-Valat ***@***.***> wrote:
To me it sounds like this should be a specialization of lead and lag for
TimeArrays instead. TimeArrays can be constructed by passing timestamps
and values, so lag(TimeArray(times, values), Day(1)) could give you the
same result as what this function does.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXNVNQOHXGYHMBMRW23RP4GV3ANCNFSM4MNL2TJA>
.
|
Well |
I don’t think that would work: time stamps of a TimeArray need to be
unique.
…On Mon, May 4, 2020 at 3:31 PM Milan Bouchet-Valat ***@***.***> wrote:
Well GroupedDataFrame gives SubArray views of the columns for rows in the
group, so you'd get a SubArray of a TimeArray and lag would be
implemented by that type I guess.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXNASFWD7DCGTS3RW7DRP4JX7ANCNFSM4MNL2TJA>
.
|
Ah you mean that you want duplicate times in the data frame that would be unique only within each group? |
Yes, exactly. |
OK. Then the only solution would be to call It's still annoying that This choice between lazy and eager is a broader issue that doesn't have a very good solution currently. Julia has By the way, if methods from this PR requires times to be sorted (at least by default), wouldn't they be able to work much more efficiently, just by checking whether the previous timestamp is equal to the current one minus |
Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the As a side comment, if this becomes a special method for |
I don’t think it would work since you may want to lag by more than one
period.
Actually, it looks like TimeArray actually has a lag function, which does
not seem to do what I propose, and also does not inherit from
shiftedarray....
…On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi ***@***.***> wrote:
Concerning lazy versus eager: I agree that it only makes sense if the
timestamps are sorted (or if one stores the sortperm in the lazy object).
I'm actually not 100% sure that the lazy object should be an AbstractArray,
or just an iterator. It feels like it is simpler to define iteration
efficiently rather than getindex`.
As a side comment, if this becomes a special method for TimeArray (which
IMO makes a lot of sense), I think it should live in TimeSeries
(ShiftedArrays is much more lightweight, so it would make more sense for
TimeSeries to depend on it than viceversa).
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#37 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA>
.
|
The other thing is that TimedArray does not work for dates in Int, which is
common to hold years. This + the inability to have duplicate times makes it sound overly restrictive to be the only one type that can be lagged.
@iblis17
On Mon, May 4, 2020 at 8:22 PM Matthieu Gomez <[email protected]>
wrote:
… I don’t think it would work since you may want to lag by more than one
period.
Actually, it looks like TimeArray actually has a lag function, which does
not seem to do what I propose, and also does not inherit fro
shiftedarray....
On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi ***@***.***>
wrote:
> Concerning lazy versus eager: I agree that it only makes sense if the
> timestamps are sorted (or if one stores the sortperm in the lazy
> object). I'm actually not 100% sure that the lazy object should be an AbstractArray,
> or just an iterator. It feels like it is simpler to define iteration
> efficiently rather than getindex`.
>
> As a side comment, if this becomes a special method for TimeArray (which
> IMO makes a lot of sense), I think it should live in TimeSeries
> (ShiftedArrays is much more lightweight, so it would make more sense for
> TimeSeries to depend on it than viceversa).
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <#37 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA>
> .
>
|
Is that really a problem? You can just check go back by one timestamp until you find one which is either equal or older than the current timestamp minus That said, I'm no longer sure that using Can you point at the corresponding APIs in other languages? That may give us some ideas. |
I have a proposal for these functions:
Since the project goal of Oh, another issue: Is there a interface package for
And the same story will happen on |
Defining a new timed data frame type is certainly an option, although it would be much more complicated than what this pull request does. Another downside, AFIAK, is that it would not extend to other type of Tables. As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc... |
Thanks for the references. It would be useful if you could file an issue in DataFrames listing the operations that should be supported. That way we will be able to discuss what's the best solution to implement a time-aware type in/based on DataFrames.
I guess you could assume that the vector of timestamps isn't mutated. That's a natural expectation for a lazy view.
@iblis17 For |
👍 I will check them and try to move to
Oh, okay. I just read the doc of ShiftedArrays. |
@nalimilan that seems way too error prone too me. I think the best option would be to rename the lag function in this package to shift, which is more consistent with the name of this package anyway. lag would be reserved for a time non lazy lag. |
Thanks @matthieugomez for working on this! This functionality has been hugely (also surprisingly) missing outside of Stata-verse. Here are my two cents on it:
Pandas now support shift by frequency. Not sure about the underlying implementation but it's pretty slow at least on quarterly series. Also, it's not compatible with MultiIndex in pandas which is also unfriendly to panel data. This leaves me wondering after so many years how people deal with panel data outside of Stata. I like the current implementation to take another argument of another time index. Since there potentially be multiple columns on the same timestamps, it'll be redundant to attach every column with time information as in Concerning sorting and laziness: I find it totally reasonable to assume the time index has been sorted. In most of the applications we have sorted time indexes anyways. Operating on a sorted index is much more efficient as you only need to check n-steps back, so I tend to think it should be the default algorithm while the dictionary-based approach can be a fallback. In Stata/Pandas the lag/lead operators also require the data to be sorted by time indexes (the time variable in their applications are literally the indexes of the data so it's admittedly less error-prone). Also consistent with the current method of this package, the lag&lead are relative to the index of the arrays. This should give the user a hint that any other indexes they are passing to the time-aware method better to be sorted. |
Bump! It would be great to be able to use this package for (unbalanced) panel data. I don't think everybody is only focused on time series ;) |
The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running |
Agreed! I did not think of that. This is a great insight! |
@matthieugomez I don't understand this point. I guess you would do something like EDIT: Incidentally, this is how PanelShift.jl implements this, even if it's not lazy, so that's probably an efficient strategy in general. Cc: @FuZhiyu |
Laziness vs being sorted seems two orthogonal issues. We need As for PanelShift.jl, I didn't implement the lazy approach because 1) I'm not familiar with the machinery of laziness 2) Usually we don't have a need for a lazy lagged array in panel data. But if the maintainer of |
Is it really cheap to check that |
ok I see the issue. If for laziness we are not allowed to store the new index for In the case where there are gaps in time arrays, we could either store the whole new index in the |
My point is that we could check whether times are sorted only when creating the object. Then |
I guess, pragmatically, I am not sure what would be returned by this command: df = DataFrame(times = [0, 1], x = ["V1", "V2"])
df.y = lag(df.x, df.times)
sort!(df, :times, rev = true)
df.y |
@nalimilan The other thing is that, assuming that the underlying times are not mutated seems very error prone to me. Safety seems more important than memory efficiency. |
I now feel lazy vectors may not be a good idea for time vectors with gaps. Let me clarify. The current type
Both approaches are safe for post-creation mutation to the time vector though. but neither seem to have an advantage of a non-lazy method. Or is there any other implementation for lazy shiftedarrays that I haven't thought of? |
I think it may be important to give some perspective from the point of view of the ShiftedArrays library on this type of functionality. In my understanding, the main goal of this package is to provide a lazy, shifted array abstraction (
I don't think adding additional eager IMO, possible ways forward would be the following.
|
Thanks for commenting. I agree it's reasonable to keep ShiftedArrays oriented towards lazy operations only, and put eager methods in another package. I'd tend to prefer your solutions 1 or 2. I actually think it would make sense to define empty |
I would love Solution 1 or 2 |
Option 1 is the best choice. You can always remove them later as in Option 2. |
Solve #13