BestPractices
A primary goal behind PyForecast development is to eliminate unnecessary subjectivity and use reproducible, statistically defensible methods for forecast development. To this end, PyForecast uses a search algorithm to determine the "best" forecasting models. In addition, DelSole and Shukla (2009) found that pre-screening variables based on their correlation with the predictand can result in model bias and artificial skill. We recommend erring toward inclusiveness rather than eliminating variables subjectively.
However, it may be useful to eliminate related but not identical variables (Helsel and Hirsch, 2002). For example, air temperature and dewpoint temperature may carry nearly the same information while differing slightly. Variables should also be selected based upon physical meaningfulness to avoid spurious correlations.
It should be noted that NRCS (2011) recommends subjective variable selection based on "the hydrologist’s judgment and input from the State Program staff." PyForecast strives to reduce subjectivity as much as possible and to ensure reproducible forecasts. For this reason, PyForecast uses a kernel density estimation technique to combine forecast models and avoid subjective selection of a single model.
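As an illustration, the sketch below shows how forecasts from several candidate models might be combined with a Gaussian kernel density estimate. The model values, forecast range, and bandwidth choice (scipy's default) are illustrative assumptions, not PyForecast's internal implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical April-July volume forecasts (KAF) from several candidate models
model_forecasts = np.array([212.0, 198.5, 225.3, 205.1, 219.8])

# Fit a Gaussian KDE across the model forecasts; the bandwidth uses
# scipy's default (Scott's rule), which is an assumption here
kde = gaussian_kde(model_forecasts)

# Evaluate the combined density over a plausible forecast range
grid = np.linspace(150, 275, 500)
density = kde(grid)

# The density peak gives a central forecast; percentiles of the density
# could be used to characterize the forecast range
modal_forecast = grid[np.argmax(density)]
print(f"Combined (modal) forecast: {modal_forecast:.1f} KAF")
```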
NRCS (2011) recommends the use of at least 10 years of data for forecast generation. As the SNOTEL SWE period of record grows, one unresolved question is whether the full period of record should be used or only the most recent period, which better represents current climatology and basin land use/land cover. PyForecast models assume stationarity, and this assumption is likely not valid (Milly et al., 2008). Pagano and Garen (2005) found decadal differences in streamflow variability and persistence. It may be useful to investigate models using decadal teleconnections, such as the Pacific Decadal Oscillation, and varying data lengths. It may also be useful to update models on an annual basis.
Historically, Reclamation used future variables in its forecast regressions. Models were trained using, for example, April through July accumulated precipitation for the forecasted year, which requires forecasters to predict accumulated precipitation at the time of forecast issuance. Garen (1992) examined this practice at Anderson Ranch Dam on the South Fork Boise River and found that using only data known at forecast time improved forecast accuracy. Using future variables likely yields an artificially narrow prediction interval, because the significant uncertainty associated with predicting accumulated precipitation is not reflected in the model errors.
Because spring and summer precipitation can be an important component of runoff in the Western US, we suggest incorporating teleconnection indicators into PyForecast models.
NRCS (2011) states that "Month-to-month consistency in variable usage is important so that forecast changes during the season reflect hydrometeorological conditions and are not just statistical noise caused by changing predictor variables." Due to PyForecast's kernel density estimation technique, the selection of a specific model or variable set is less important to the final forecast and density function. Also, maintaining the same variables throughout the season may reduce forecast skill by discarding data that are skillful at different times. For example, a low-elevation SNOTEL station may provide useful information on March 1 but be melted out by April 1. Finally, this practice may be a form of anchoring bias (Tversky and Kahneman, 1974), in which people make insufficient adjustments from an initial estimate when arriving at their final answer.
The assumption of linearity can be tested by examining scatterplots of residuals versus predicted values. According to Helsel and Hirsch (2002), problems will manifest in this plot as curvature or heteroscedasticity. Probability plots (quantiles versus residuals) may also be useful for verifying the assumption of normality. Data transformations may correct these issues.
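A minimal sketch of these two diagnostic plots, using matplotlib and scipy with stand-in residuals, is shown below; the data and plotting choices are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical fitted values and residuals from a regression model
rng = np.random.default_rng(0)
predicted = np.linspace(100, 400, 40)    # predicted runoff volumes (KAF)
residuals = rng.normal(0, 15, size=40)   # stand-in residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predicted: curvature or a funnel shape suggests
# non-linearity or heteroscedasticity (Helsel and Hirsch, 2002)
ax1.scatter(predicted, residuals)
ax1.axhline(0, color="gray", linestyle="--")
ax1.set_xlabel("Predicted value")
ax1.set_ylabel("Residual")

# Probability plot: departures from the straight line suggest non-normal residuals
stats.probplot(residuals, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```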
Both PCA and Z-score regression have benefits. Principal components analysis creates orthogonal variables, which can correct for multicollinearity, an issue that frequently affects data used in streamflow forecasting.
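The sketch below illustrates principal components regression with scikit-learn on a synthetic, collinear predictor set; it demonstrates the general technique rather than PyForecast's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix: columns are correlated SWE / precipitation stations
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=30)   # induce multicollinearity
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=30)   # hypothetical runoff volume

# Standardize, rotate onto orthogonal principal components, then regress
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print("PCR R-squared:", pcr.score(X, y))
```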
Z-score regression is useful for building models despite missing data; it may allow inclusion of datasets with short or incomplete records, such as soil moisture.
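One common formulation of Z-score regression standardizes each predictor and averages whatever z-scores are available in a given year into a composite index before regressing. The sketch below follows that formulation with synthetic data; PyForecast's actual implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical station data with gaps (e.g., a short soil-moisture record)
df = pd.DataFrame({
    "swe_a": [10.2, 12.5, 8.1, 14.0, 9.7, 11.3],
    "swe_b": [20.1, 24.3, 16.0, 27.5, 18.2, 22.0],
    "soil":  [np.nan, np.nan, 30.5, 45.2, 28.1, 38.9],  # missing early years
})
runoff = np.array([180.0, 230.0, 140.0, 260.0, 160.0, 210.0])  # hypothetical KAF

# Standardize each predictor, then average the z-scores that are
# available in each year to form a single composite index
z = (df - df.mean()) / df.std(ddof=0)
composite = z.mean(axis=1, skipna=True).to_numpy().reshape(-1, 1)

# Regress the predictand on the composite index
model = LinearRegression().fit(composite, runoff)
print("Composite-index R-squared:", model.score(composite, runoff))
```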
Ultimately, it may be useful to create models with both techniques and include them in the kernel density estimation.
It is useful to understand the meaning of each skill metric before deciding which to use. PyForecast uses four metrics for model comparison: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), adjusted R-squared, and Mallows' Cp. Each provides different information regarding the skill of forecast models. All are based upon model errors, or residuals: the difference between the predicted and observed values.
- Root mean squared error (RMSE): RMSE is the square root of the mean of the squared errors. In ordinary least squares regression, models are fit by minimizing the sum of squared errors. For linear regression, RMSE is closely related to the standard error of the estimate. Because errors are squared, models are penalized more heavily for large errors and outliers. RMSE is useful in circumstances where a forecast error of 100 KAF is more than twice as bad as a forecast error of 50 KAF.
- Mean absolute error (MAE): MAE describes the magnitude of the average forecast error and is easily interpretable. When comparing models, if doubling the error only doubles its impact, MAE is more appropriate than RMSE.
- Adjusted R-squared: Adjusted R-squared describes the proportion of variance explained by the model, but penalizes models with a large number of predictors.
- Mallows' Cp: Mallows' Cp estimates the bias introduced into predicted responses by an underspecified model (Penn State University, 2018). Models with Cp values near the number of parameters are approximately unbiased; among candidate models, lower values are generally preferred.
Ultimately, all of these metrics are useful for model selection. A forecaster should likely examine all of the statistics, select "good" models, and include them in the KDE process.
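The sketch below computes the four metrics for a hypothetical set of observed and predicted volumes. The adjusted R-squared and Mallows' Cp formulas are the textbook definitions, used here as an assumption about how the metrics are calculated.

```python
import numpy as np

def forecast_skill(y_obs, y_pred, p, sigma2_full):
    """Compute RMSE, MAE, adjusted R-squared, and Mallows' Cp.

    p           -- number of fitted parameters (predictors plus intercept)
    sigma2_full -- error variance estimate from the full candidate model,
                   needed for the textbook Cp formulation (an assumption here)
    """
    n = len(y_obs)
    resid = y_obs - y_pred
    sse = np.sum(resid ** 2)

    rmse = np.sqrt(sse / n)                        # root mean squared error
    mae = np.mean(np.abs(resid))                   # mean absolute error

    sst = np.sum((y_obs - np.mean(y_obs)) ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p)  # penalize extra predictors

    cp = sse / sigma2_full - n + 2 * p             # Mallows' Cp

    return rmse, mae, adj_r2, cp

# Hypothetical observed and predicted April-July volumes (KAF)
y_obs = np.array([180.0, 230.0, 140.0, 260.0, 160.0, 210.0, 195.0, 175.0])
y_pred = np.array([172.0, 238.0, 150.0, 251.0, 168.0, 205.0, 201.0, 170.0])
print(forecast_skill(y_obs, y_pred, p=3, sigma2_full=90.0))
```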
Cross-validation is a method of estimating how well a model will perform in a predictive application. It divides the dataset into training values and values to be predicted. Leave-one-out validation predicts each value using a model trained on all remaining values. The k-fold method divides the record into k folds (typically 5 or 10) and predicts each fold using a model trained on the others. Leave-one-out tends to be more computationally intensive than k-fold methods, but k-fold methods may have higher bias in small datasets such as those used in our streamflow forecasting models.
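A minimal sketch of both schemes with scikit-learn is shown below; the data are synthetic and the mean-absolute-error scoring choice is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Hypothetical predictor matrix and observed runoff volumes
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=25)

model = LinearRegression()

# Leave-one-out: each year is predicted from a model trained on all other years
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_absolute_error")

# 5-fold: the record is split into five folds, each held out in turn
kf_scores = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0),
                            scoring="neg_mean_absolute_error")

print("Leave-one-out MAE:", -loo_scores.mean())
print("5-fold MAE:       ", -kf_scores.mean())
```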
We recommend against modifying forecasts subjectively because it may result in forecasts that are not reproducible or defensible. Glantz (1982) describes the potential impacts of subjectivity through a case study in the Yakima basin in 1977. Similarly, a number of cognitive biases can degrade subjective forecast skill. We recommend evaluating operations by taking forecasts at face value and examining the potential outcomes throughout the prediction interval. Rather than changing the forecast to match desired operations, operational decisions can be made by managing to an inflow other than the median forecast.
We do not have current guidance on this subject, and plan to perform experiments to determine best practices in the density analysis arena.