Potential pitfalls with hindcasting as a proving ground for computer models, Part 1
Hindcasting is a method to validate the predictive ability of computer models. In Part 1, we will look at a potential pitfall when the experiment is improperly blinded.
Hindcasting is one available method to test the ability of a computer model to forecast the future.
Ideally, a credible hindcasting check of forecasting ability requires a period of time for which you have reliable data. You divide this period into two: a “training data” period and a “testing data” period. The training data set is used to tune and refine the computer model; the testing data set is used to check the accuracy of the resulting model.
Say you have a computer model designed to predict crop yields. You take a period of time, say from 1950 to 1980, and call that your training data set. You then tweak and tune your computer model until it matches the crop yield data perfectly, at least within the training period. Next, you define a testing data set as the period from 1980 to 2020, to which the computer model is naïve, and run the model to see how accurately it performs over those years. This is hindcasting (also called backtesting): picking a point in the past and using the model to forecast from that date onwards, with the benefit that we already know what happened and can compare real data with the computer result.
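In code, the split itself is simple. Here is a minimal sketch in Python; the yield series is fabricated for illustration, and all variable names are hypothetical:

```python
import numpy as np

# Fabricated yearly crop-yield series, 1950-2020: a slow upward trend
# plus noise, standing in for real observations.
rng = np.random.default_rng(0)
years = np.arange(1950, 2021)
yields = 2.0 + 0.03 * (years - 1950) + rng.normal(0.0, 0.1, years.size)

# Split at 1980: the model may only be tuned on the training period;
# the testing period is held out for the hindcast.
train = years < 1980
train_years, train_yields = years[train], yields[train]
test_years, test_yields = years[~train], yields[~train]

# Tune the model on (train_years, train_yields) only, then compare its
# forecasts for test_years against test_yields.
```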
Hindcasting can be useful because you can rapidly test the predictive accuracy of a computer model with existing data. The alternative is forecasting: have the computer model make a prediction, then wait for real-world data to confirm or refute it, which might take a very long time, possibly decades. Hence the appeal of the hindcasting technique. Hindcasting can also be used to study how various factors influenced past observations as a matter of academic interest, though in this article we will focus on the use of hindcasting to test a model’s ability to predict the future.
Hindcasting is used in a variety of fields. Climatologists evaluate climate models with hindcasting. Financial analysts backtest trading and economic models against historical market data. Epidemiologists use hindcasting to test models of how a disease spreads during a pandemic.
The claim of hindcasting is that pitting the computer model against historical testing data demonstrates its accuracy. After all, the model is naïve to the testing data and does not “know” its outcome.
At first glance, this sounds valid. A scientist cooks up a computer model that predicts some facet of our reality, trains it to model that reality, rigorously tests the model with real-life data, then publishes the result for all to see. The model may then be used to make predictions of the unknown future, with some degree of confidence lent by the process of hindcasting.
But…
There are some potential problems here. In order to explain better, it is instructive to review what a blinded experiment is.
Blinded experiments
In a blinded experiment, one or more of the participants are unaware of some aspects of the methodology and/or results. A classic example would be a drug trial with, say, 1,000 participants with a common ailment. Five hundred of the participants would get the actual test drug, and five hundred would get a placebo. All participants would be blinded to which study arm they are in. The researchers who set up the trial can also be blinded to the actual results; instead, independent analysts without a stake in the outcome would review the data. Blinding the participants isolates the effectiveness of the drug itself from any psychological effects of knowingly taking either the drug or the placebo. Blinding the researchers to the results prevents their biases from skewing the analysis, as perhaps they have invested much time in the drug’s development and thus have an interest in seeing it succeed. As humans, we are often not aware when our biases are putting a thumb on the scale, hence the importance of blinding.
Effective blinding strategies in science improve the credibility of the conclusions, reduce human bias in the methodology, and increase the likelihood that the researchers have nailed the correct result.
How does blinding (or not blinding for that matter) impact hindcasting?
Consider a hypothetical example for illustration: A scientist spends a couple of years creating a computer model that predicts polar bear population numbers in the Arctic. She works hard with a training data set of polar bear numbers from 1973-2000, tuning her model until it matches the real-world data nearly perfectly. She then applies her model to the period from 2000-2021 to validate it with hindcasting. The results appear on the computer screen, and then...
Potential Pitfall: What if the researcher or the computer model is not blinded to the testing data?
For just about all computer models, there are model parameters whose values are tuned by the experimenter. For our polar bear example, the researcher adjusted these parameters so that the model output matched the observed population numbers from 1973-2000 as closely as she could get them. Some of these parameters might be things like the sensitivity of food abundance to temperature, or the fertility rate of polar bears, or just some tunable parameter called k. She then runs the computer model through the testing data set (from 2000-2021) to check whether her model still matches the polar bear population numbers.
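To make the tuning step concrete, here is a minimal sketch of what fitting such parameters might look like. The model form, the parameters r and k, and the data are hypothetical stand-ins, not drawn from any real polar bear study; scipy’s minimize does the tuning:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical polar bear model: each year the population grows at a
# fertility-driven rate r and shrinks with food sensitivity k times a
# warming index w(t). Every name and number here is made up.
def simulate(params, n0, warming):
    r, k = params
    pop, out = n0, []
    for w in warming:
        pop = pop * (1.0 + r - k * w)
        out.append(pop)
    return np.array(out)

def training_loss(params, n0, warming, observed):
    # Sum of squared errors over the training period only.
    return np.sum((simulate(params, n0, warming) - observed) ** 2)

# Toy "observations" for 1973-2000, generated from known parameters plus
# noise so we can watch the tuning step recover something close to them.
rng = np.random.default_rng(1)
warming = np.linspace(0.0, 0.5, 27)
observed = simulate((0.02, 0.06), 5000, warming) * (1 + rng.normal(0, 0.01, 27))

fit = minimize(training_loss, x0=[0.01, 0.01],
               args=(5000, warming, observed), method="Nelder-Mead")
print(fit.x)  # tuned (r, k): should land near the generating values
```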
This second step, that is, checking against the testing data set, adds credibility to the model because the computer model is “blinded” to the testing data. In an ideal world, the principal researcher would also be blinded to what the testing data looks like, and an independent party would evaluate the computer model with the testing data.
So what happens when the researcher and/or the computer model are not blinded to the testing data?
For many (most?) complex computer models, there are multiple sets of parameter values that fit the training data equally well. So with a given training data set, you can often get a bang-on match between the model output and the observed data through more than one combination of parameters. Moreover, researchers who are intimately familiar with their models have a good feel for how to tweak the parameters to produce a desired result.
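A toy illustration of this non-uniqueness (sometimes called equifinality): if one of the driving inputs barely changes over the training period, different parameter combinations are indistinguishable in training yet disagree sharply about the future. Again, every name and number below is made up:

```python
import numpy as np

# Growth each year is (1 + r - k * w). If the warming index w is nearly
# constant during training, only the combination r - k * w is pinned
# down by the data; r and k are not separately identified.
def simulate(r, k, n0, warming):
    pop, out = n0, []
    for w in warming:
        pop = pop * (1.0 + r - k * w)
        out.append(pop)
    return np.array(out)

w_train = np.full(27, 0.2)           # flat warming during training
w_test = np.linspace(0.2, 1.0, 21)   # warming accelerates afterwards

a = simulate(r=0.03, k=0.10, n0=5000, warming=w_train)  # r - 0.2*k = 0.01
b = simulate(r=0.05, k=0.20, n0=5000, warming=w_train)  # r - 0.2*k = 0.01
print(np.allclose(a, b))  # True: indistinguishable on the training data

# Yet the two "equally good" models disagree sharply about the future:
print(simulate(0.03, 0.10, a[-1], w_test)[-1])
print(simulate(0.05, 0.20, b[-1], w_test)[-1])
```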
What if our researcher is not blinded to the polar bear population numbers, having studied them for years? Even though she is working with only the 1973-2000 data set for training the model, she might be biased to use her “foreknowledge” of the 2000-2021 data to steer the computer model towards the real-life trend, even without explicitly including the “future” data. Because she is not blinded to the testing data set, the researcher may introduce bias into the construction of the model. Granted, it may often be unavoidable that a modeler is intimately familiar with the outcome data; however, we can still recognize this unblinded situation as a potential weakness in the methodology.
What if the computer model itself is not blinded to the data? Actually, what does this even mean, since computers don’t have human biases? Say our researcher includes the entirety of the available data from 1973-2021 in the training of the computer model, and also uses this data as the hindcasting test. In other words, training data = testing data. She then tunes the model to a close match to the data. The computer model is effectively unblinded to the testing data.
Let’s take it a step further and say that our computer model is trained using machine learning, and that training data = testing data. The entirety of the data set from 1973-2021 is fed into the computer, the deep learning algorithms do their thing, and a near-perfect match to the observed polar bear population data is produced. Does that mean that the model is validated if the computer model is not tested against data to which it is naïve?
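Here is a minimal sketch of that situation using scikit-learn (the series itself is fabricated for illustration). A flexible learner simply memorizes the data, so scoring it on the very data it memorized produces a perfect, and perfectly meaningless, “hindcast”:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fabricated polar bear series for 1973-2021: trend plus noise.
rng = np.random.default_rng(2)
years = np.arange(1973, 2022).reshape(-1, 1)
pop = 5000 + 30 * (years.ravel() - 1973) + rng.normal(0, 50, years.shape[0])

# Train on ALL the data, then "test" on the same data: the tree simply
# memorizes every point, so the hindcast score is perfect by construction.
model = DecisionTreeRegressor().fit(years, pop)
print(model.score(years, pop))   # R^2 of 1.0: a flawless "validation"

# Ask it about years it has never seen and the memorization shows:
# every future year just returns the last memorized leaf value.
print(model.predict(np.arange(2022, 2032).reshape(-1, 1)))
```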
This situation of an “unblinded” computer is a really important one to mull over for a moment. Consider a very simple example for further clarity. Nearly any one-dimensional data set can be closely fit by a polynomial of sufficiently high order. Say we have our polar bear data showing a slow increase in numbers from 1973-2021. A third- or fourth-order polynomial would probably be sufficient to fit the data very well. Are we done, then? Anyone looking at the situation can see that this model has no demonstrated predictive ability, since the polynomial was never tested against data of which it has no “knowledge”.
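A sketch of that curve-fitting trap, with a fabricated series standing in for the polar bear data:

```python
import numpy as np

# Fabricated series: a slow rise in polar bear numbers, 1973-2021.
rng = np.random.default_rng(3)
years = np.arange(1973, 2022)
pop = 5000 + 25 * (years - 1973) + rng.normal(0, 100, years.size)

# Fit a fourth-order polynomial to the WHOLE series (training = testing).
x = (years - 1973) / 48.0                  # rescale for numerical stability
coeffs = np.polyfit(x, pop, deg=4)
resid = np.polyval(coeffs, x) - pop
print(np.sqrt(np.mean(resid ** 2)))        # small in-sample error: looks great

# "Forecasting" 2022-2031 is just extrapolating the polynomial. Nothing
# above tested the model on unseen data, so these numbers are unearned.
print(np.polyval(coeffs, (np.arange(2022, 2032) - 1973) / 48.0))
```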
Let’s take it one final step further. What if our researcher trains a model, tests it, and finds that during the testing period it is unacceptably inaccurate? Well, she would re-tune the model in the training phase, then try it against the testing data once again. What happens if this loop (train→test→tune, train→test→tune) repeats many, many times? The more this loop repeats, the closer we get to the situation of training data = testing data. In other words, the longer the train→test→tune loop goes on, the less blinded both the researcher and the computer model are.
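To see why repeated re-tuning erodes the blinding, consider this deliberately crude sketch: the model is “improved” purely by keeping whichever random parameter tweak scores best on the held-out data. The test set is never fed to the model directly, yet it ends up selecting the parameters all the same. All data and parameters here are fabricated:

```python
import numpy as np

# "Re-tuning" is a random tweak to two parameters; we keep whichever
# tweak scores best on the HELD-OUT 2000-2021 data.
rng = np.random.default_rng(4)
t = np.arange(22)                                  # years 2000-2021
test_pop = 5000 + 25 * t + rng.normal(0, 100, 22)  # the "held-out" data

best_err, best_params = np.inf, None
for attempt in range(10_000):                      # each pass: train->test->tune
    params = rng.uniform(-50, 75, size=2)          # candidate (offset, slope)
    prediction = 5000 + params[0] + params[1] * t
    err = np.sqrt(np.mean((prediction - test_pop) ** 2))
    if err < best_err:                             # selected ON the test set
        best_err, best_params = err, params

# The final "test" error is flattering by construction: the test data
# chose the parameters, so it no longer measures forecasting skill.
print(best_err, best_params)
```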
The less blinded a hindcasting run is, the less credible it should be. When evaluating any hindcasting check, the questions above should be considered at all steps in the experimental method.
---
Hindcasting is an important tool to validate a computer model’s ability to forecast; however, the technique is not without its pitfalls. In Part 2 of this series, we will look at the ability of a computer model to be robust to conditions (and thus model parameters) that change in a complex and chaotic world.