Potential pitfalls with hindcasting as a proving ground for computer models, Part 2
Hindcasting is a method to validate the predictive ability of computer models. In Part 2, we will look at the robustness of computer models to complex and unpredictable change.
Hindcasting is one method to test the ability of a computer model to forecast the future.
Ideally, a credible hindcasting check of forecasting ability requires a span of time for which you have real-world, observed data for the system you are modeling. You divide this span into two periods: a “training data” period and a “testing data” period. The training data is used to tune and refine the computer model; the testing data is used to check the accuracy of the resulting model.
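As a concrete sketch of that split, here is what dividing an observed record into training and testing windows might look like in code; the years, breakpoint, and “observations” below are placeholders rather than data from any real system.

```python
# A minimal sketch of a hindcasting split; the years, breakpoint, and
# "observations" are placeholders, not real data.
import numpy as np

years = np.arange(1980, 2021)                       # full observed record
observed = np.random.default_rng(0).normal(100.0, 5.0, years.size)  # stand-in data

breakpoint_year = 2005
train = observed[years < breakpoint_year]           # used to tune the model
test = observed[years >= breakpoint_year]           # used only to check accuracy

print(f"{train.size} training years, {test.size} testing years")
```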
In Part 1 of this series, we examined the concept of blinded experiments as applied to hindcasting runs. In Part 2, let’s further explore how hindcast performance, used as a proxy for forecast performance, may break down when the system being modeled is complex, unpredictable, and ever-changing.
Potential Pitfall: What if conditions change during or after the hindcasted time period?
In a computer model, the conditions of the modeled system are represented by input data and parameters. Sometimes these inputs are known quantities because they are measured and tracked. Consider the field of climate modeling, where atmospheric carbon dioxide (and, later, other gases) has been measured since 1958 and is used as input for global climate models. Sometimes inputs are unknown and have to be tuned. In climate modeling, such tuned parameters may be related to the equilibrium climate sensitivity (ECS) or the impact of clouds.
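To make the distinction concrete, here is a deliberately toy function, not code from any actual climate model: the CO2 concentration is the kind of measured input that gets passed in as data, while the ECS-like sensitivity and cloud factor are the kind of knobs that would be tuned, with assumed default values.

```python
import math

# A toy contrast between a measured input (co2_ppm) and tuned knobs (ecs,
# cloud_factor); this is schematic, not code from any real climate model.
def toy_temperature_anomaly(co2_ppm, ecs=3.0, cloud_factor=1.0):
    """Toy response: anomaly scales with the log2 of CO2 relative to 280 ppm."""
    return cloud_factor * ecs * math.log2(co2_ppm / 280.0)

# The input comes from measurement; the knob values are assumptions to be tuned.
print(toy_temperature_anomaly(co2_ppm=420.0))
```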
In a complex and chaotic system, conditions change all of the time. Therefore, the model parameters might change all of the time too.
What if model parameters change throughout time, but we can only determine them when tuning the model with training data? What if a researcher tunes a model by fiddling with the knobs representing the various parameters, gets it perfect with the training data, but the parameters end up changing for the testing data time period?
In Part 1 I outlined a hypothetical scenario of a scientist who is developing a computer model for predicting polar bear population numbers. We will continue with this example here to illustrate a further weakness of the hindcasting technique. Suppose the polar bear computer model has tunable parameters such as the sensitivity of fertility rate to food abundance, and also non-physical parameters such as k. The training data period is set from 1973-2000, and the testing data period from 2000-2021. Our researcher works hard at getting the computer model to fit the bear population data from the training data period of 1973-2000, and finds fertility sensitivity and k parameters that fit the data pretty well. Emboldened, she then runs the model on the testing data period of 2000-2021… and finds that the model is off by quite a lot.
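The sketch below is a toy stand-in for that tuning step, not the researcher’s actual model: a simple yearly population update with a hypothetical fertility-sensitivity knob and a non-physical factor k, fitted to placeholder training-period data by a brute-force parameter sweep.

```python
import numpy as np

def simulate(pop0, food_index, fert_sens, k, n_years):
    """Toy model: yearly growth depends on food abundance and a fudge factor k."""
    pops = [float(pop0)]
    for t in range(n_years - 1):
        growth = fert_sens * (food_index[t] - 1.0) - k   # hypothetical form
        pops.append(pops[-1] * (1.0 + growth))
    return np.array(pops)

# Placeholder training-period data (1973-2000); real observations would go here.
train_years = np.arange(1973, 2001)
t = np.arange(train_years.size)
food_index = 1.0 + 0.1 * np.sin(0.4 * t)                          # made-up food abundance
observed = 25000.0 * (1.0 + 0.002 * t) + 300.0 * np.sin(0.4 * t)  # made-up bear counts

# Sweep the knobs and keep the combination with the lowest RMSE on training data.
best = None
for fert_sens in np.linspace(0.0, 0.5, 51):
    for k in np.linspace(-0.05, 0.05, 51):
        modeled = simulate(observed[0], food_index, fert_sens, k, train_years.size)
        rmse = float(np.sqrt(np.mean((modeled - observed) ** 2)))
        if best is None or rmse < best[0]:
            best = (rmse, fert_sens, k)

print(f"training-period fit: RMSE={best[0]:.0f}, fert_sens={best[1]:.2f}, k={best[2]:.3f}")
```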
There are many possibilities as to why the model was a poor fit to the testing data. Perhaps the computer model does not adequately describe how polar bears live, die, and reproduce. Perhaps the tunable parameters vary over time in an unpredictable way. Perhaps things like fertility sensitivity to food abundance are far too complex to be encapsulated in a single number. Perhaps all are true.
Now, just because the model was off in the testing phase for a single run doesn’t mean the exercise is useless. Say our researcher goes through the train→test→tune loop a few times to improve the model. In each run, the model matches the training data very well but matches the testing data only variably well. What can we say about this situation?
Let’s make an assumption here: the reason for the variable fits to the testing data is that the model parameters change unpredictably over time, and there’s really nothing we can do about this inherent uncertainty. Assume also that our researcher is fully aware that the model parameters change over time, and that this fact limits the ability of her model to predict the future polar bear population size. She looks at her testing data runs and sees that, starting at the breakpoint between the training and testing periods, the model predictions become increasingly variable over time compared with the actual observed population numbers.
This is actually useful information. With multiple train→test→tune runs, the researcher can quantify the accuracy of her model in the form of an error band as a function of time. The accuracy quantification might be something like “accurate to within ±1050 polar bears at a 5-year projection,” with evaluations at various other lead times. Note that this quantification only extends as far as the duration of the testing period of 2000-2021, i.e. 21 years. Her model is therefore only validated, and its error only quantified, for 21 years past the training data breakpoint, so it would be of no practical use for, say, a 100-year projection.
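As a sketch of that quantification step, suppose several train→test→tune runs each leave model-minus-observed residuals over the 2000-2021 testing period (placeholder numbers below); the spread of those residuals at each lead time gives the error band.

```python
import numpy as np

# Placeholder residuals (model minus observed) for 5 train->test->tune runs over
# the 21-year testing period; real runs would supply these numbers.
lead_years = np.arange(1, 22)                    # 1..21 years past the breakpoint
rng = np.random.default_rng(7)
residuals = rng.normal(0.0, 90.0, (5, lead_years.size)) * np.sqrt(lead_years)

# A simple two-sigma band from the spread of residuals at each lead time.
band = 2.0 * residuals.std(axis=0)
for lead in (5, 10, 21):
    print(f"roughly ±{band[lead - 1]:.0f} bears at a {lead}-year projection")
```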
So, let’s close the loop on this. Our researcher trained and tested her model using the hindcasting technique, and has now quantified the error band as a function of time out to 21 years. What she can do now is train her computer model on the full data set, i.e. the combined training and testing periods (in our example, 1973-2021). This final training run will result in model parameters that are representative of this time range but unknown beyond it, i.e. in the true future. Presumably these model parameters are at least somewhat representative of the near future, but their relevance likely degrades as time goes on. That degradation is quantified by the earlier testing data runs, where an error band as a function of time was determined, and it can now be applied to the final forecast. In our example, with ever-changing model parameters into the future, the computer model is only validated over a relatively short forecasting range of 21 years.
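To close the loop in code, here is a sketch of attaching the hindcast-derived band to a forward run of the fully retuned model. The forecast values are placeholders, and the band grows by 210 bears per year only so that it matches the hypothetical ±1050-at-5-years figure above.

```python
def forecast_with_band(forecast_values, band_by_lead, max_valid_lead=21):
    """Pair each forecast year with ±band; past the validated range, report none."""
    rows = []
    for lead, value in enumerate(forecast_values, start=1):
        if lead <= max_valid_lead and lead <= len(band_by_lead):
            half = band_by_lead[lead - 1]
            rows.append((lead, value - half, value + half))
        else:
            rows.append((lead, None, None))      # unvalidated this far out
    return rows

# Placeholder forecast from the model retuned on 1973-2021, with a band that
# grows 210 bears/year to match the hypothetical ±1050-at-5-years figure.
forecast = [25000 - 80 * i for i in range(1, 26)]
band = [210 * i for i in range(1, 22)]
print(forecast_with_band(forecast, band)[4])     # 5-year lead: (5, low, high)
print(forecast_with_band(forecast, band)[24])    # 25-year lead: unvalidated
```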
---
Hindcasting is an important tool to validate a computer model’s ability to forecast; however, the technique is not without its pitfalls. In Part 3 of this series, we will look at situations where a computer model is validated with hindcasting, but is invalidated when the world is not blinded to its results.
Check out Part 3 in this series, and please offer constructive comments below.