aggregate adds leading zeros to series with different dates #189

Closed
NudnikShpilkis opened this issue May 11, 2023 · 0 comments · Fixed by #190

The aggregate function pads series with leading zeros when the time series in a dataset start on different dates. Here's a minimal example:

import pandas as pd
import statsforecast.models as sfm
import hierarchicalforecast.methods as hfm
from statsforecast.utils import generate_series
from statsforecast import StatsForecast
from hierarchicalforecast.utils import aggregate
from hierarchicalforecast.core import HierarchicalReconciliation

max_tenure = 24
dates = pd.date_range(start='2019-01-31', freq='M', periods=max_tenure)
cohort_tenure = [24, 23, 22, 21]

ts_list = []

# Create ts for each cohort
for i in range(len(cohort_tenure)):
    ts_list.append(
        generate_series(n_series=1, freq='M', min_length=cohort_tenure[i], max_length=cohort_tenure[i]).reset_index() \
            .assign(ult=i) \
            .assign(ds=dates[-cohort_tenure[i]:]) \
            .drop(columns=['unique_id'])
    )
df = pd.concat(ts_list, ignore_index=True)

# Create categories
df.loc[df['ult'] < 2, 'pen'] = 'a'
df.loc[df['ult'] >= 2, 'pen'] = 'b'
# Note: the columns used to build unique_id must be strings
df['ult'] = df['ult'].astype(str)

hier_levels = [
    ['pen'],
    ['pen', 'ult'],
]

hier_df, S_df, tags = aggregate(df=df, spec=hier_levels)
hier_df = hier_df.reset_index()
    # .query("unique_id.str.split('/').str[0] <= ds.dt.strftime('%Y-%m')")
print('S_df.shape', S_df.shape)
print('hier_df.shape', hier_df.shape)

If you query the third cohort in the input, you should see dates starting at 2019-03-31:

df.query("ult == '2'")

But if you query hier_df, the output of aggregate, the same series starts at 2019-01-31, the earliest date in the whole dataset, with zeros filling the months before 2019-03-31:

hier_df.query("unique_id.str.split('/').str[-1] == '2'")
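One way to confirm the padding without eyeballing the output is to compare each series' first date before and after aggregation. The sketch below uses only pandas and small hand-built stand-ins for the two frames; the toy values and the helper name `first_dates` are assumptions for illustration and are not part of hierarchicalforecast. With the real data you would pass `df` and `hier_df` instead.

```python
import pandas as pd


def first_dates(frame: pd.DataFrame, id_col: str) -> pd.Series:
    """Return the earliest ds for each series in frame (hypothetical helper)."""
    return frame.groupby(id_col)["ds"].min()


dates = pd.to_datetime(["2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30"])

# Toy stand-in for the input: series '2' starts two months after series '0'.
raw = pd.DataFrame({
    "ult": ["0"] * 4 + ["2"] * 2,
    "ds": list(dates) + list(dates[2:]),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Toy stand-in for padded output: 'b/2' has zeros for the first two months.
padded = pd.DataFrame({
    "unique_id": ["a/0"] * 4 + ["b/2"] * 4,
    "ds": list(dates) * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 5.0, 6.0],
})

# Key both frames on the bottom-level id (last component of unique_id).
padded["ult"] = padded["unique_id"].str.split("/").str[-1]

comparison = pd.DataFrame({
    "raw_start": first_dates(raw, "ult"),
    "agg_start": first_dates(padded, "ult"),
})
# A series was padded if it starts earlier after aggregation than before.
comparison["padded"] = comparison["agg_start"] < comparison["raw_start"]
print(comparison)
```

Any row with `padded` set to True marks a series whose history was extended backwards.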

If you remove the leading zeros, reconcile fails because forecast_fitted_values cannot be reshaped to the length of `S_df`.
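A reshape error like this is what you would expect from any code that assumes a balanced panel, meaning one row per series per date. The snippet below is a generic NumPy illustration of that assumption, not the library's actual code: it uses the four cohort lengths from the example, considers only the bottom-level series, and the target shape is hypothetical.

```python
import numpy as np

n_dates = 24
lengths = [24, 23, 22, 21]  # cohort lengths from the example above
n_series = len(lengths)

# Balanced panel: every series has n_dates values, so the reshape works.
balanced = np.zeros(n_series * n_dates)
balanced_shape = balanced.reshape(n_series, n_dates).shape

# Ragged panel: leading rows removed, so the total count no longer matches.
ragged = np.zeros(sum(lengths))
try:
    ragged.reshape(n_series, n_dates)
    reshape_ok = True
except ValueError as err:
    reshape_ok = False
    print("reshape failed:", err)
```

This suggests the padding and the reshape are two sides of the same assumption, so dropping rows after the fact cannot work on its own.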
