Improve feature design and selection #309

krsnapaudel · 2024-08-21T06:38:23Z

The basic expectation here is that we should beat the naive or average yield model. Try different ideas and improve feature design or feature selection to make this happen.
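For reference, the average yield baseline we need to beat can be sketched as below. This is a minimal illustration in plain Python, not the CY-Bench implementation; the function names, the record layout, and the toy values are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def fit_average_yield(train_records):
    """Return per-location mean yield and the global mean as a fallback.

    train_records: list of (location, year, yield) tuples (hypothetical layout).
    """
    by_loc = defaultdict(list)
    for loc, _year, y in train_records:
        by_loc[loc].append(y)
    loc_means = {loc: mean(ys) for loc, ys in by_loc.items()}
    global_mean = mean(y for _loc, _year, y in train_records)
    return loc_means, global_mean


def predict_average_yield(loc_means, global_mean, locations):
    """Predict each location's training mean; unseen locations get the global mean."""
    return [loc_means.get(loc, global_mean) for loc in locations]


# Toy data, for illustration only.
train = [
    ("NL11", 2010, 8.0),
    ("NL11", 2011, 9.0),
    ("ES41", 2010, 3.0),
    ("ES41", 2011, 4.0),
]
loc_means, global_mean = fit_average_yield(train)
baseline_preds = predict_average_yield(loc_means, global_mean, ["NL11", "ES41", "NL99"])
```

Any feature set worth keeping should give lower errors than these predictions on the same test years.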

krsnapaudel · 2024-12-13T13:19:29Z

Some experiments with features shared by Monique

Test years 2015-2020, SklearnRidge model

CY-Bench features
ES {'normalized_rmse': 128.89452058000492, 'mape': 1.3208657278284044, 'r2': -9.250500723690777}
NL {'normalized_rmse': 55.099017165404376, 'mape': 0.5418177229882343, 'r2': -24.873093530121597}

Monique's features
ES {'normalized_rmse': 46.88929648938571, 'mape': 0.4749722496641766, 'r2': -1.3146495440847468}
NL {'normalized_rmse': 14.483919522064255, 'mape': 0.12855531515078933, 'r2': -0.7878573336813055}

Now with residual models:

Test years 2015-2020, RidgeRes model

CY-Bench features
ES {'normalized_rmse': 51.08394263080526, 'mape': 0.4382275845476348, 'r2': -0.6100708475975807}
NL {'normalized_rmse': 13.884076637448246, 'mape': 0.11857026466507622, 'r2': -0.6428376928356032}

Monique's features
ES {'normalized_rmse': 62.315290936306084, 'mape': 0.6534064667630376, 'r2': -3.08815228631894}
NL {'normalized_rmse': 9.155708659272959, 'mape': 0.07468484769146325, 'r2': 0.2855948386281204}
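To help read these numbers: a negative r2 means the model did worse than always predicting the mean of the test targets. The sketch below shows how such metrics can be computed. It assumes normalized RMSE is RMSE expressed as a percentage of the mean target; the actual definitions in `evaluate_predictions` may differ.

```python
import math


def normalized_rmse(y_true, y_pred):
    """RMSE as a percentage of the mean target (assumed definition)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, for illustration only.
y_true = [4.0, 5.0, 6.0]
mean_pred = [5.0, 5.0, 5.0]  # always predicting the test mean gives r2 == 0
poor_pred = [7.0, 5.0, 3.0]  # predictions worse than the mean give r2 < 0

r2_mean = r2(y_true, mean_pred)
r2_poor = r2(y_true, poor_pred)
nrmse_mean = normalized_rmse(y_true, mean_pred)
```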

krsnapaudel · 2024-12-20T13:17:40Z

Periods: pre-planting (or before emergence), planting (emergence), vegetative, flowering, grain-filling, harvest

How many periods, and how to determine which dates correspond to which period
Which variables we care about per period
Which aggregation per variable

See examples in Tables 1 and 2 here: https://doi.org/10.1016/j.agsy.2020.103016
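One possible way to frame these three choices is a small configuration that maps each period to a date range and to the variables and aggregations that matter in it. The sketch below is only an illustration: the period dates, variable names (`tavg`, `tmax`, `prec`), and aggregation choices are assumptions, not values taken from the paper or from CY-Bench.

```python
from datetime import date
from statistics import mean

# Hypothetical period boundaries for one location and season (illustrative dates only).
PERIODS = {
    "vegetative": (date(2018, 3, 1), date(2018, 4, 30)),
    "flowering": (date(2018, 5, 1), date(2018, 5, 31)),
}

# Which variables we care about per period, and which aggregation per variable (assumed).
AGGREGATIONS = {
    "vegetative": {"tavg": mean, "prec": sum},
    "flowering": {"tmax": max, "prec": sum},
}


def period_features(daily_rows):
    """Aggregate daily rows (dicts with 'date' plus variables) into period features."""
    features = {}
    for period, (start, end) in PERIODS.items():
        rows = [r for r in daily_rows if start <= r["date"] <= end]
        for var, agg in AGGREGATIONS[period].items():
            values = [r[var] for r in rows]
            # Leave the feature missing when a period has no observations.
            features[f"{agg.__name__}_{var}_{period}"] = agg(values) if values else None
    return features


# Toy daily weather, for illustration only.
daily = [
    {"date": date(2018, 3, 15), "tavg": 6.0, "tmax": 11.0, "prec": 2.0},
    {"date": date(2018, 4, 10), "tavg": 10.0, "tmax": 16.0, "prec": 0.0},
    {"date": date(2018, 5, 20), "tavg": 15.0, "tmax": 24.0, "prec": 5.0},
]
features = period_features(daily)
```

Keeping the periods and aggregations in data rather than in code makes it easy to try different designs without touching the aggregation logic.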

krsnapaudel · 2024-12-20T14:01:17Z

Example code to test improved features:

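    import os

    import pandas as pd

    # Dataset, SklearnRidge, evaluate_predictions, PATH_DATA_DIR and the KEY_*
    # constants are assumed to be imported from the CY-Bench package.

    # Test 1: Test with raw data
    dataset_wheat = Dataset.load("wheat")
    targets_df = dataset_wheat._df_y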

    all_years = list(range(2001, 2019))
    test_years = [2017, 2018]
    train_years = [yr for yr in all_years if yr not in test_years]
    train_dataset, test_dataset = dataset_wheat.split_on_years(
        (train_years, test_years)
    )

    # Model
    model = SklearnRidge()
    model.fit(train_dataset)

    targets = test_dataset.targets()
    test_preds, _ = model.predict(test_dataset)
    assert test_preds.shape[0] == len(test_dataset)
    metrics1 = evaluate_predictions(targets, test_preds)

    # Test 2: Test with predesigned features
    # Training dataset
    improved_csv = os.path.join(PATH_DATA_DIR, "monique_features.csv")
    improved_df = pd.read_csv(improved_csv, index_col=[KEY_LOC, KEY_YEAR])
    # filter for wheat
    improved_df = improved_df[improved_df["crop"] == "wheat"].drop(columns=["crop"])
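    # Build the second dataset from the full targets DataFrame (targets_df),
    # not from the test-set targets array returned by test_dataset.targets().
    dataset_wheat2 = Dataset(
        "wheat", targets_df, {KEY_COMBINED_FEATURES: improved_df}
    )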

    train_dataset2, test_dataset2 = dataset_wheat2.split_on_years(
        (train_years, test_years)
    )

    model.fit(train_dataset2)

    targets2 = test_dataset2.targets()
    test_preds2, _ = model.predict(test_dataset2)
    assert test_preds2.shape[0] == len(test_dataset2)
    # Evaluate against the targets of the second test dataset.
    metrics2 = evaluate_predictions(targets2, test_preds2)

pkj002 · 2024-12-20T14:21:14Z

Thanks Dilli

mnqoliveira · 2024-12-26T12:28:37Z

I revisited my script to add the static features and the target, and I realized we were not comparing the same things 😞 There were some mistakes in my code. I (hopefully) fixed everything and was able to run some tests.

I used the following thresholds:

Maize: base=8, upper=50, thresholds: 500, 1000, 1500, 2000
Wheat: base=0, upper=45, thresholds: 500, 1000, 1500, 2000

These have not yet been carefully thought through (they are not even related to the periods in your paper), so this choice is still pending. I just wanted to show that I was able to run the tests and to point out some issues in the process.
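For readers following along, here is a minimal sketch of how such settings could split a season into periods. It assumes `base` and `upper` are temperature bounds (in °C) for growing degree days and that the thresholds are cumulative degree-day values; that interpretation, the function names, and the toy temperatures are assumptions, not the actual script.

```python
from bisect import bisect_right


def daily_gdd(tavg, base, upper):
    """Growing degree days for one day, with the temperature capped at `upper`."""
    return max(0.0, min(tavg, upper) - base)


def assign_periods(tavg_series, base, upper, thresholds):
    """Return a period index per day: 0 before the first threshold, 1 after it, etc."""
    cumulative = 0.0
    periods = []
    for tavg in tavg_series:
        cumulative += daily_gdd(tavg, base, upper)
        # Number of thresholds already reached by the end of this day.
        periods.append(bisect_right(thresholds, cumulative))
    return periods


# Wheat settings listed above; the temperatures are toy values.
wheat_thresholds = [500, 1000, 1500, 2000]
tavg_series = [20.0] * 60  # 60 days at 20 degrees C, i.e. 20 degree days per day with base=0
periods = assign_periods(tavg_series, base=0, upper=45, thresholds=wheat_thresholds)
```

The resulting period index per day can then be used to group the daily variables before aggregating them into features.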

wheat NL

Original - SklearnRidge {'normalized_rmse': 42.79182885351677, 'mape': 0.41091685689652985, 'r2': -14.011533645734382}
Modified - SklearnRidge {'normalized_rmse': 14.377558108303093, 'mape': 0.11481023485890013, 'r2': -0.6946250862837775}

Original - RidgeRes {'normalized_rmse': 8.891690336417946, 'mape': 0.0752928093435701, 'r2': 0.3518552208859771}
Modified - RidgeRes {'normalized_rmse': 14.062994933734718, 'mape': 0.12258434244185754, 'r2': -0.6212836713262293}

wheat ES

Original - SklearnRidge {'normalized_rmse': 38.022887420217536, 'mape': 0.47719593050118486, 'r2': 0.16427242385022855}
Modified - SklearnRidge {'normalized_rmse': 32.5682994566381, 'mape': 0.38287948506398756, 'r2': 0.38685285484081866}

Original - RidgeRes {'normalized_rmse': 48.781203476171804, 'mape': 0.3993343352764365, 'r2': -0.3755600884502228}
Modified - RidgeRes {'normalized_rmse': 41.13821275424611, 'mape': 0.3777453115636452, 'r2': 0.021715021391573774}

maize NL

Original - SklearnRidge {'normalized_rmse': 20.690912736393592, 'mape': 0.1827639729644808, 'r2': -0.20693118758279616}
Modified - SklearnRidge {'normalized_rmse': 38.714284946197495, 'mape': 3495033330573030.5, 'r2': -0.39146575944442197}

Original - RidgeRes {'normalized_rmse': 28.566423643255995, 'mape': 0.24972368856241334, 'r2': -1.3005673752745484}
Modified - RidgeRes {'normalized_rmse': 38.02736454571835, 'mape': 3338890672533422.0, 'r2': -0.34252535167362463}

maize ES

Original - SklearnRidge {'normalized_rmse': 80.53221377905166, 'mape': 0.9290330288460491, 'r2': -3.4319357073350094}
Modified - SklearnRidge {'normalized_rmse': 23.801433455100057, 'mape': 0.24336528787951453, 'r2': 0.5014194452846327}

Original - RidgeRes {'normalized_rmse': 23.337056497920795, 'mape': 0.2135844654568528, 'r2': 0.6278257418567065}
Modified - RidgeRes {'normalized_rmse': 22.514457098316793, 'mape': 0.21274445729239302, 'r2': 0.5538796156926478}
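MAPE values on the order of 1e15, like the ones for maize NL, are what a percentage error produces when a target is zero and the denominator is clamped at machine epsilon, as scikit-learn's `mean_absolute_percentage_error` does. Whether the metric here is computed that way is an assumption; the sketch below only reproduces the effect on toy values and shows a hypothetical minimum-target filter.

```python
import sys

EPS = sys.float_info.epsilon  # about 2.2e-16


def mape(y_true, y_pred):
    """MAPE as a fraction, with the denominator clamped at machine epsilon."""
    terms = [abs(t - p) / max(abs(t), EPS) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def mape_filtered(y_true, y_pred, min_target):
    """MAPE over observations whose target is at least `min_target` (hypothetical filter)."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) >= min_target]
    return mape([t for t, _ in kept], [p for _, p in kept])


# Toy targets with a single zero yield, for illustration only.
y_true = [10.0, 12.0, 0.0]
y_pred = [11.0, 11.0, 9.0]
blown_up = mape(y_true, y_pred)
filtered = mape_filtered(y_true, y_pred, min_target=0.1)
```

A single zero target is enough to dominate the average, so the other two metrics are more informative for that run.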

To note:

I added the results for the CY-Bench default features because, as you can see, I was not able to reproduce exactly the results you posted above, even when I ran only the lines associated with Test 1 with the sample data in the proper folder. I also don't know whether the sample data corresponds exactly to the current version of the full dataset.
The maize NL MAPE shows that I'm not using exactly the same dataset as the one obtained when running the script that builds the original features, since that weird result comes from a very low value in the target. Could this be an internal filter? I think I saw something like that. (In the previous evaluation, this was part of the mistake I had made: I had sent you fewer observations, which is likely why it did not appear.)
I'll investigate on my side but, regardless, we should think of a way to make it easier to ensure that outside comparisons use the same data. If I made this mistake, other people could too.
I need a better way to fill NAs and to replace Infs (I'm using -9999 and am not proud of it).
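On the last point, one alternative to a sentinel such as -9999 is to treat non-finite values as missing and fill them with the median of the training rows for that feature, since a large sentinel can distort a linear model such as Ridge. The sketch below is a minimal illustration in plain Python; the function names and the feature name `gdd_p1` are hypothetical.

```python
import math
from statistics import median


def fit_medians(train_rows, feature_names):
    """Median of the finite training values of each feature."""
    medians = {}
    for name in feature_names:
        finite = [
            row[name]
            for row in train_rows
            if row[name] is not None and math.isfinite(row[name])
        ]
        # Fall back to 0.0 when a feature has no finite training value at all.
        medians[name] = median(finite) if finite else 0.0
    return medians


def impute(rows, medians):
    """Replace None, NaN and +/-Inf with the training median of the feature."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name, fill in medians.items():
            value = new_row[name]
            if value is None or not math.isfinite(value):
                new_row[name] = fill
        cleaned.append(new_row)
    return cleaned


# Toy rows, for illustration only.
train_rows = [
    {"gdd_p1": 480.0},
    {"gdd_p1": float("inf")},
    {"gdd_p1": 520.0},
    {"gdd_p1": float("nan")},
]
test_rows = [{"gdd_p1": float("-inf")}, {"gdd_p1": 510.0}]
medians = fit_medians(train_rows, ["gdd_p1"])
train_clean = impute(train_rows, medians)
test_clean = impute(test_rows, medians)
```

Computing the medians on the training years only keeps information from the test years out of the features.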

mnqoliveira · 2025-01-02T13:58:54Z

Changing the thresholds to (300, 750, 1200, 1600) for wheat NL led to considerably better results than my previous ones. But these should really be determined more rigorously than by my trial and error alone.

Original - SklearnRidge {'normalized_rmse': 42.79182885351677, 'mape': 0.41091685689652985, 'r2': -14.011533645734382}
Modified - SklearnRidge {'normalized_rmse': 10.420792696322232, 'mape': 0.08489138518201052, 'r2': 0.1097645567108323}
Original - RidgeRes {'normalized_rmse': 8.891690336417946, 'mape': 0.0752928093435701, 'r2': 0.3518552208859771}
Modified - RidgeRes {'normalized_rmse': 9.198665535419105, 'mape': 0.07304645153782414, 'r2': 0.30632982112381446}

krsnapaudel assigned poudelpratishtha Aug 21, 2024

krsnapaudel added the baseline-models label Aug 21, 2024

krsnapaudel assigned smkuhlani Aug 23, 2024

krsnapaudel assigned mnqoliveira Dec 13, 2024
