Lead/lag with respect to a time variable #37

matthieugomez · 2020-04-21T14:48:31Z

Solve #13

matthieugomez · 2020-05-04T17:05:03Z

Any thought?

piever · 2020-05-04T18:40:54Z

Thanks for taking a stab at this!

My main concern is that ideally this would have a lazy implementation (just like lag, lead normally do), so that this line is executed on getindex.

I'm not sure whether it should be done by simply allowing ShiftedArray to have two arrays (keys and indices), or whether one could somehow get it to work using as parent array some custom array type which also stores its indices.

cc: @nalimilan (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem).

matthieugomez · 2020-05-04T18:51:46Z

I’m not sure it would be that ideal to have a lazy implementation of this lag function — no one wants a vector that is as slow as a dictionary. The trade-off does not seem to be worth it in this case. The issue is that this package exports lag, so any method that uses base types needs to be in this package to avoid type piracy. On the other hand, I am not sure that ShiftedArray is the right type here.

…

On Mon, May 4, 2020 at 2:41 PM Pietro Vertechi ***@***.***> wrote: Thanks for taking a stab at this! My main concern is that ideally this would have a lazy implementation (just like lag, lead normally do), so that this line <https://github.com/JuliaArrays/ShiftedArrays.jl/pull/37/files#diff-278249179b68fe78888f2d5e18630430R142> is executed on getindex. I'm not sure whether it should be done by simply allowing ShiftedArray to have two arrays (keys and indices), or whether one could somehow get it to work using as parent array some custom array type which also stores its indices. cc: @nalimilan <https://github.com/nalimilan> (he is using this package in StatsKit, so he may have feedback on what is the best way to add time series support in a way that is helpful to the statistical ecosystem). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABPPPXNBA4IGAN7OFJT73PDRP4D4LANCNFSM4MNL2TJA> .

nalimilan · 2020-05-04T19:04:47Z

To me it sounds like this should be a specialization of lead and lag for TimeArrays instead. TimeArrays can be constructed by passing timestamps and values, so lag(TimeArray(times, values), Day(1)) could give you the same result as what this function does.

matthieugomez · 2020-05-04T19:08:14Z

How would that work with panel data though? ie using lag on a grouped DataFrame?

…

On Mon, May 4, 2020 at 3:05 PM Milan Bouchet-Valat ***@***.***> wrote: To me it sounds like this should be a specialization of lead and lag for TimeArrays instead. TimeArrays can be constructed by passing timestamps and values, so lag(TimeArray(times, values), Day(1)) could give you the same result as what this function does. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABPPPXNVNQOHXGYHMBMRW23RP4GV3ANCNFSM4MNL2TJA> .

nalimilan · 2020-05-04T19:30:54Z

Well GroupedDataFrame gives SubArray views of the columns for rows in the group, so you'd get a SubArray of a TimeArray and lag would be implemented by that type I guess.

matthieugomez · 2020-05-04T19:37:23Z

I don’t think that would work: time stamps of a TimeArray need to be unique.

…

On Mon, May 4, 2020 at 3:31 PM Milan Bouchet-Valat ***@***.***> wrote: Well GroupedDataFrame gives SubArray views of the columns for rows in the group, so you'd get a SubArray of a TimeArray and lag would be implemented by that type I guess. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABPPPXNASFWD7DCGTS3RW7DRP4JX7ANCNFSM4MNL2TJA> .

nalimilan · 2020-05-04T20:22:24Z

Ah you mean that you want duplicate times in the data frame that would be unique only within each group?

matthieugomez · 2020-05-04T20:23:26Z

Yes, exactly.

nalimilan · 2020-05-04T20:54:37Z

OK. Then the only solution would be to call lag(TimeArray(times, values), Day(1)) for each group inside combine/select. But that wouldn't have any advantage over adding methods like in this PR, so...

It's still annoying that lead and lag are currently lazy in ShiftedArrays. This makes sense for the current methods since it's very efficient, but for methods added by this PR the overhead is much higher. Being lazy can be useful even there if you go over the data only once, but otherwise collecting it will be faster.

This choice between lazy and eager is a broader issue that doesn't have a very good solution currently. Julia has filter and Iterators.filter, DataFrames has stack and stackview (unexported)... all of that is quite inconsistent.

By the way, if methods from this PR requires times to be sorted (at least by default), wouldn't they be able to work much more efficiently, just by checking whether the previous timestamp is equal to the current one minus period? Then it could make sense to be lazy. It may even be faster to sort the data (notably using radix sort) than to use a Dict.

piever · 2020-05-04T22:18:38Z

Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex.

As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa).

matthieugomez · 2020-05-05T00:22:40Z

I don’t think it would work since you may want to lag by more than one period. Actually, it looks like TimeArray actually has a lag function, which does not seem to do what I propose, and also does not inherit from shiftedarray....

…

On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi ***@***.***> wrote: Concerning lazy versus eager: I agree that it only makes sense if the timestamps are sorted (or if one stores the sortperm in the lazy object). I'm actually not 100% sure that the lazy object should be an AbstractArray, or just an iterator. It feels like it is simpler to define iteration efficiently rather than getindex`. As a side comment, if this becomes a special method for TimeArray (which IMO makes a lot of sense), I think it should live in TimeSeries (ShiftedArrays is much more lightweight, so it would make more sense for TimeSeries to depend on it than viceversa). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#37 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA> .

matthieugomez · 2020-05-05T00:38:31Z

The other thing is that TimedArray does not work for dates in Int, which is common to hold years. This + the inability to have duplicate times makes it sound overly restrictive to be the only one type that can be lagged. @iblis17 On Mon, May 4, 2020 at 8:22 PM Matthieu Gomez <[email protected]> wrote:

…

I don’t think it would work since you may want to lag by more than one period. Actually, it looks like TimeArray actually has a lag function, which does not seem to do what I propose, and also does not inherit fro shiftedarray.... On Mon, May 4, 2020 at 6:18 PM Pietro Vertechi ***@***.***> wrote: > Concerning lazy versus eager: I agree that it only makes sense if the > timestamps are sorted (or if one stores the sortperm in the lazy > object). I'm actually not 100% sure that the lazy object should be an AbstractArray, > or just an iterator. It feels like it is simpler to define iteration > efficiently rather than getindex`. > > As a side comment, if this becomes a special method for TimeArray (which > IMO makes a lot of sense), I think it should live in TimeSeries > (ShiftedArrays is much more lightweight, so it would make more sense for > TimeSeries to depend on it than viceversa). > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#37 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ABPPPXNQOGADIJCYUSTJCVLRP45MXANCNFSM4MNL2TJA> > . >

nalimilan · 2020-05-05T07:49:00Z

I don’t think it would work since you may want to lag by more than one
period.

Is that really a problem? You can just check go back by one timestamp until you find one which is either equal or older than the current timestamp minus period. That should be efficient unless you want to lag by a large number of periods compared to the average resolution of the timestamps.

That said, I'm no longer sure that using TimeArray is the best approach. Indeed the proposed methods work for any type (notably integers), and not just dates/times.

Can you point at the corresponding APIs in other languages? That may give us some ideas.

iblislin · 2020-05-05T08:51:17Z

Hi @matthieugomez

I have a proposal for these functions:

Let's implement them in TimeSeries.jl for both AbstractArray and TimeArray.
Furthermore, we can add a new type like SimpleTimeArray to accept dates in Int.

Since the project goal of TimeSeries.jl is a versatile time series toolbox.
I want to make some common functions available for AbstractArray, but I don't do it yet. :p

Oh, another issue: Is there a interface package for lag, lead, head, tail?
I always got this:

WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified

And the same story will happen on TimeSeries and ShiftedArray.

matthieugomez · 2020-05-05T14:02:13Z

@nalimilan

Stata has a concept of timed DataFrame and timed grouped DataFrame. Once a time variable is defined, doing L.x creates the lagged variable as in this pull request (per group if dataframe is grouped)
dplyr includes a simple function defined on vectors lag(values, n = 1, orderby = times). It does something a bit different: it returns the value for the previous time (whether or not the previous time is equal to the current time minus 1). So in effect, n corresponds to a number of rows to lag by, rather than a period. I think this is to maintain compatibility with SQL. Now there is new user defined package that defines timed tibble and timed group tibble. It includes a function keyed_lag doing the same thing as this pull request, but defined only within a mutate call lag(), lead() and difference() for regular series with missingness tidyverts/tsibble#55
I don’t know Panda that well but it seems that the shift function corresponds to the lag function in dplyr (ie shifts by a number of rows, not a period). https://stackoverflow.com/questions/33875746/make-a-shift-by-index-with-a-pandas-dataframe

Defining a new timed data frame type is certainly an option, although it would be much more complicated than what this pull request does. Another downside, AFIAK, is that it would not extend to other type of Tables.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

nalimilan · 2020-05-06T09:59:39Z

Thanks for the references. It would be useful if you could file an issue in DataFrames listing the operations that should be supported. That way we will be able to discuss what's the best solution to implement a time-aware type in/based on DataFrames.

As for your idea about sorted, yes that sounds like a great way to get laziness. Although every indexing would have to check that the whole vector is still sorted right? Stuff like checking missing obs would take a lot of times etc...

I guess you could assume that the vector of timestamps isn't mutated. That's a natural expectation for a lazy view.

Oh, another issue: Is there a interface package for lag, lead, head, tail?
I always got this:
WARNING: both TimeSeries and DataFrames export "tail"; uses of it in module Main must be qualified
And the same story will happen on TimeSeries and ShiftedArray.

@iblis17 head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

iblislin · 2020-05-06T13:00:44Z

@nalimilan

head and tail are deprecated in DataFrames, so they will be removed at some point. FWIW, we moved to using first and last so maybe that would be an option for TimesSeries too?

👍 I will check them and try to move to first and last.

For lag and lead, we could define them in DataAPI, but it's difficult to coordinate multiple packagesif they want to defined methods that operate on types they don't own, which is quite typical for theese functions (like ShiftedArrays does). That would require defining very clearly which package is supposed to provide which method.

Oh, okay. I just read the doc of ShiftedArrays.
Seem the "just-let-them-conflict" policy is suitable for the types not owned by the package.

matthieugomez · 2020-05-06T14:24:42Z

@nalimilan that seems way too error prone too me. I think the best option would be to rename the lag function in this package to shift, which is more consistent with the name of this package anyway. lag would be reserved for a time non lazy lag.

FuZhiyu · 2020-10-11T18:19:50Z

Thanks @matthieugomez for working on this! This functionality has been hugely (also surprisingly) missing outside of Stata-verse. Here are my two cents on it:

I don’t know Panda that well but it seems that the shift function corresponds to the lag function in dplyr

Pandas now support shift by frequency. Not sure about the underlying implementation but it's pretty slow at least on quarterly series. Also, it's not compatible with MultiIndex in pandas which is also unfriendly to panel data. This leaves me wondering after so many years how people deal with panel data outside of Stata.

I like the current implementation to take another argument of another time index. Since there potentially be multiple columns on the same timestamps, it'll be redundant to attach every column with time information as in TimeArray. Another approach would be to define a TimeDataFrame but I also agree with @matthieugomez on the downside of that path. The current implementation is more flexible and simple.

Concerning sorting and laziness: I find it totally reasonable to assume the time index has been sorted. In most of the applications we have sorted time indexes anyways. Operating on a sorted index is much more efficient as you only need to check n-steps back, so I tend to think it should be the default algorithm while the dictionary-based approach can be a fallback. In Stata/Pandas the lag/lead operators also require the data to be sorted by time indexes (the time variable in their applications are literally the indexes of the data so it's admittedly less error-prone).

Also consistent with the current method of this package, the lag&lead are relative to the index of the arrays. This should give the user a hint that any other indexes they are passing to the time-aware method better to be sorted.

eloualiche · 2022-02-17T21:08:05Z

Bump! It would be great to be able to use this package for (unbalanced) panel data.

I don't think everybody is only focused on time series ;)

matthieugomez · 2022-02-17T21:18:36Z

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

eloualiche · 2022-02-17T22:10:05Z

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

Agreed! I did not think of that. This is a great insight!

nalimilan · 2022-03-23T08:49:11Z

The sorted-lazy approach won’t work for panel data — how would one check that the vector is sorted within each group? Running issorted at each getindex call would be too costly.

@matthieugomez I don't understand this point. I guess you would do something like combine(groupby(df, :key), [:x, :times] => lag => :x_lagged) right? Then lag just has to check that the SubArray it gets passed for times (corresponding to a particular group) is sorted (this is super cheap).

EDIT: Incidentally, this is how PanelShift.jl implements this, even if it's not lazy, so that's probably an efficient strategy in general. Cc: @FuZhiyu

FuZhiyu · 2022-03-27T23:17:48Z

Laziness vs being sorted seems two orthogonal issues. We need :times to be sorted, while we probably want :x_lagged to be lazy. As long as times is not lazy, checking whether it's sorted seems pretty cheap.

As for PanelShift.jl, I didn't implement the lazy approach because 1) I'm not familiar with the machinery of laziness 2) Usually we don't have a need for a lazy lagged array in panel data.

But if the maintainer of ShiftedArrays.jl thinks it's a good idea to incorporate PanelShifts into this package I'm also happy to dig into the lazy wood to make it lazy.

matthieugomez · 2022-03-28T00:28:48Z

Is it really cheap to check that times is sorted every time getindex(:x_lagged, i) is called?

FuZhiyu · 2022-03-28T19:46:42Z

ok I see the issue. If for laziness we are not allowed to store the new index for x_lagged, then indeed we need to check whether times is sorted every time we call getindex.

In the case where there are gaps in time arrays, we could either store the whole new index in the ShiftedArray, or store multiple shifts and use them depending on the queried index. But both solutions require different designs of ShiftedArray and seem a little bit awkward.

nalimilan · 2022-03-28T20:52:21Z

My point is that we could check whether times are sorted only when creating the object. Then getindex would simply assume that the underlying times are not mutated. Am I missing something?

matthieugomez · 2022-03-28T21:32:52Z

I guess, pragmatically, I am not sure what would be returned by this command:

df = DataFrame(times = [0, 1], x = ["V1", "V2"])
df.y = lag(df.x, df.times)
sort!(df, :times, rev = true)
df.y

matthieugomez · 2022-03-28T21:39:23Z

@nalimilan The other thing is that, assuming that the underlying times are not mutated seems very error prone to me. Safety seems more important than memory efficiency.

FuZhiyu · 2022-03-29T01:57:45Z

I now feel lazy vectors may not be a good idea for time vectors with gaps. Let me clarify.

The current type ShiftedArray stores shifts, so when getindex(:x_lagged, i) is called, it returns x_lagged[i-shift]. However, when there are gaps in time, shifts are no longer constant so the current design cannot be applied directly. Say we have x_lagged=lag(time, x, 1), where x is a length-N vector We can do:

Store a length-N vector of shifts in x_lagged, so when we call getindex(:x_lagged, t), it returns x_lagged[t-shift[t]]. But since we already saved another shift vector, what's the benefit of being lazy?
A true lazy approach, so every call of getindex(x_lagged, t), as @matthieugomez suggests, checks whether time is sorted and searches for x at t-1. Super safe, super inefficient.

Both approaches are safe for post-creation mutation to the time vector though. but neither seem to have an advantage of a non-lazy method.

Or is there any other implementation for lazy shiftedarrays that I haven't thought of?

piever · 2022-03-30T12:39:02Z

I think it may be important to give some perspective from the point of view of the ShiftedArrays library on this type of functionality.

In my understanding, the main goal of this package is to provide a lazy, shifted array abstraction (ShiftedArray and CircShiftedArray) and, whenever possible, add lazy, memory-efficient versions of established functions (Base.circshift, AbstractFFTs.fftshift).

lag and lead are exceptions to this, and at the moment they are mostly helpers for the ShiftedArray function, i.e., lag(v, n) = ShiftedArray(v, n), lead(v, n) = ShiftedArray(v, -n). They are not crucial, in that the constructor already does most of the job.

I don't think adding additional eager lag and lead methods (based on a Dict) here is consistent with the rest of the library. It would also increase the maintenance burden for this package, which at the moment is quite low, as it is a low-level package with no dependencies. I'm afraid that would change if ShiftedArrays became more of a "toolkit to work with sequence data".

IMO, possible ways forward would be the following.

Unexport lag and lead from ShiftedArrays. A separate package (StatsBase.jl, or a hypothetical WindowsFunctions.jl) would define eager versions, with all the required features. ShiftedArrays.lag would be a lazy version of WindowsFunctions.lag whenever possible. Same could be done for the diff proposal in Add ShiftedArrays.diff for differences between elements in a vector #51 (WindowsFunctions.diff is eager, ShiftedArrays.diff is lazy).
Same as 1., but lag and lead are removed altogether from here, the ShiftedArray constructor is probably enough (see Lead/lag with respect to a time variable #37 (comment)).
Do not change anything in ShiftedArrays, but use instead a custom vector type and specialize lag(CustomVector(times, values), deltat). I think CustomVector should be some sort of AxisArray, with a list of sorted indices and a list of values (see Lead/lag with respect to a time variable #37 (comment)).

nalimilan · 2022-03-30T12:47:46Z

Thanks for commenting. I agree it's reasonable to keep ShiftedArrays oriented towards lazy operations only, and put eager methods in another package.

I'd tend to prefer your solutions 1 or 2. I actually think it would make sense to define empty lead and lag functions in StatsAPI so that other packages can provide appropriate methods for custom array types (as long as they are consistent with the general definition). Fallback definitions for any AbstractVectors, which would be eager, could live in StatsBase (if others agree).

matthieugomez · 2022-03-30T12:53:17Z

I would love Solution 1 or 2

mkitti · 2022-09-02T16:23:20Z

Option 1 is the best choice. You can always remove them later as in Option 2.

matthieugomez added 16 commits April 21, 2020 10:47

Lead/lag with respect to a time variable

341ce72

kwarg

10ab355

Update lag.jl

a011898

add default for TimeType

ddfc16a

don't import Dates

db52b7a

up

725574e

add Dates

08b8f50

Update lag.jl

f1e480e

Update Project.toml

41923fd

add more tests

8f94ab6

Lead/lag with respect to a time variable

07731dd

add more tests

b7933c4

Merge branch 'master' of https://github.com/matthieugomez/ShiftedArra…

1583cff

…ys.jl

add comments

879ed0b

multiple lines

5d3ead9

Update lag.jl

6ecc377

iblislin mentioned this pull request May 6, 2020

check last and first of DataFrames.jl JuliaStats/TimeSeries.jl#452

Open

Nosferican mentioned this pull request Oct 15, 2020

Dynamic panel data models Nosferican/Econometrics.jl#29

Open

matthieugomez mentioned this pull request Mar 11, 2022

Lag/Lead Operators in @formula FixedEffects/FixedEffectModels.jl#193

Closed

eirikbrandsaas mentioned this pull request Jul 14, 2022

Collection of related discussion eirikbrandsaas/PanelDataTools.jl#1

Open

piever mentioned this pull request Sep 2, 2022

Please release a new version #54

Closed

piever mentioned this pull request Nov 5, 2022

why lag/lead no longer exported? #62

Open

piever mentioned this pull request Jan 11, 2023

unexport lag and lead #55

Merged

Lead/lag with respect to a time variable #37

Are you sure you want to change the base?

Lead/lag with respect to a time variable #37

Conversation

matthieugomez commented Apr 21, 2020

matthieugomez commented May 4, 2020

piever commented May 4, 2020

matthieugomez commented May 4, 2020 via email • edited Loading

nalimilan commented May 4, 2020

matthieugomez commented May 4, 2020 via email

nalimilan commented May 4, 2020

matthieugomez commented May 4, 2020 via email • edited Loading

nalimilan commented May 4, 2020

matthieugomez commented May 4, 2020

nalimilan commented May 4, 2020

piever commented May 4, 2020 • edited Loading

matthieugomez commented May 5, 2020 via email

matthieugomez commented May 5, 2020 via email • edited Loading

nalimilan commented May 5, 2020

iblislin commented May 5, 2020 • edited Loading

matthieugomez commented May 5, 2020 • edited Loading

nalimilan commented May 6, 2020

iblislin commented May 6, 2020

matthieugomez commented May 6, 2020 • edited Loading

FuZhiyu commented Oct 11, 2020 • edited Loading

eloualiche commented Feb 17, 2022

matthieugomez commented Feb 17, 2022 • edited Loading

eloualiche commented Feb 17, 2022

nalimilan commented Mar 23, 2022 • edited Loading

FuZhiyu commented Mar 27, 2022 • edited Loading

matthieugomez commented Mar 28, 2022

FuZhiyu commented Mar 28, 2022

nalimilan commented Mar 28, 2022

matthieugomez commented Mar 28, 2022

matthieugomez commented Mar 28, 2022 • edited Loading

FuZhiyu commented Mar 29, 2022

piever commented Mar 30, 2022

nalimilan commented Mar 30, 2022

matthieugomez commented Mar 30, 2022

mkitti commented Sep 2, 2022

matthieugomez commented May 4, 2020 via email •

edited

Loading

matthieugomez commented May 4, 2020 via email •

edited

Loading

piever commented May 4, 2020 •

edited

Loading

matthieugomez commented May 5, 2020 via email •

edited

Loading

iblislin commented May 5, 2020 •

edited

Loading

matthieugomez commented May 5, 2020 •

edited

Loading

matthieugomez commented May 6, 2020 •

edited

Loading

FuZhiyu commented Oct 11, 2020 •

edited

Loading

matthieugomez commented Feb 17, 2022 •

edited

Loading

nalimilan commented Mar 23, 2022 •

edited

Loading

FuZhiyu commented Mar 27, 2022 •

edited

Loading

matthieugomez commented Mar 28, 2022 •

edited

Loading