
By Katie Daisey
Nov 04, 2025
Q. How should I determine whether my model is useful and appropriate?
A. Perhaps the most famous quote by any statistician is from George Box: “Essentially all models are wrong, but some are useful.” When evaluating a model, the fundamental question is how useful the model is in solving the problem for which it was created. Unfortunately, “usefulness” is a complicated and multi-layered metric and not easy to determine. For a model to be deemed useful, it must first be shown to be appropriate. As with “usefulness,” there are many layers to demonstrating appropriateness.
Before examining this in any detail, and before model construction begins, one must first define the purpose of the model. Then, one must ensure that the data employed in model construction is also appropriate. We often use data we have already collected for another purpose without considering whether it is biased or reflective of the data we might see in the future. Or, as input to a model intended to determine static operating conditions for a long-term process, we include data gathered while the process was out of control. Further, it is not unusual to find that the available data was collected only because it was easy or cheap to collect, not because it was appropriate for answering the question at hand. Data selection is vital, often requiring close collaboration between a statistician and a subject matter expert (SME). The impulse to save money by repurposing data can lead to costly errors. A designed experiment (or a series of them) or a special data-collection program may be required.
The second level of assessment occurs after modeling. The “eye test,” a graphical examination of model behavior, requires one to plot various aspects of the model results in several different ways to identify trends or anomalies that are masked when a single metric is calculated to represent the entire model. It is a fundamental building block of model assessment. Many of these plots are not well-known to non-statistician practitioners. Examples used in regression and ANOVA include plots of model residuals versus model predictions for examining systematic bias (for instance, missing terms in the model), spread-level plots (indicating whether error variability is constant or not), and Q-Q plots for checking normality of residuals. In comparing two different test methods using paired measurements, the Tukey mean-difference (also known as Bland-Altman) plot examines agreement and systematic bias between the methods.
Regardless of the particular plots, the eye test is a quick analysis that identifies any regions where the model performs poorly. Two classic examples of problems easily identified by the eye test are: 1) increased model error at one extreme of the predicted outcome, and 2) a class of data where the model performs significantly worse than the other classes.
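To make the eye test concrete, the following is a minimal Python sketch (assuming numpy, scipy, and matplotlib are available) that draws a residuals-versus-predictions plot and a normal Q-Q plot. The arrays y_obs and y_pred are invented stand-ins for observed values and the predictions of some fitted model.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data: observed values and predictions from some fitted model.
rng = np.random.default_rng(0)
y_obs = rng.normal(loc=50, scale=5, size=100)
y_pred = y_obs + rng.normal(scale=2, size=100)  # stand-in for model predictions
residuals = y_obs - y_pred

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predictions: curvature suggests missing terms in the model;
# a funnel shape suggests non-constant error variance.
axes[0].scatter(y_pred, residuals)
axes[0].axhline(0, linestyle="--")
axes[0].set_xlabel("Predicted value")
axes[0].set_ylabel("Residual")
axes[0].set_title("Residuals vs. predictions")

# Normal Q-Q plot: points far from the reference line indicate non-normal residuals.
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title("Normal Q-Q plot of residuals")

plt.tight_layout()
plt.show()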
The third level of assessment is the more traditional statistical metrics that provide calculable measures of how well a model performs. Although the coefficient of determination (R², the squared correlation between observed and predicted values) is widely known, it fails to detect model bias and overfitting and is completely misleading when used to compare regression models with and without a constant. Mean absolute error (MAE) and root mean squared error (RMSE), reported in the same units as the outcome variable, attempt to capture the average difference between predicted and observed values. Here, lower values are better (zero is perfect), but they are applicable only to continuous outcomes. MAE and RMSE can be incredibly helpful for an SME gauging the usefulness of a particular model, as they show directly whether a model achieves the level of precision needed to solve the predictive problem at hand.
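As a minimal illustration of how these error metrics are computed, the sketch below defines MAE and RMSE with plain numpy; the arrays y_obs and y_pred are hypothetical measurements and predictions in the same (invented) units.

import numpy as np

def mae(y_obs, y_pred):
    # Mean absolute error: the average magnitude of the prediction errors,
    # reported in the same units as the outcome variable.
    return np.mean(np.abs(y_obs - y_pred))

def rmse(y_obs, y_pred):
    # Root mean squared error: squaring penalizes large errors more heavily
    # before the square root returns the result to the original units.
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

# Hypothetical observed values and model predictions (e.g., temperatures in deg C).
y_obs = np.array([20.1, 21.5, 19.8, 22.0, 20.7])
y_pred = np.array([20.4, 21.0, 20.1, 21.6, 20.9])

print(mae(y_obs, y_pred), rmse(y_obs, y_pred))  # lower is better; zero is perfect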
The next set of assessment metrics applies to classification and is based on the confusion matrix of true positives, true negatives, false positives, and false negatives. Accuracy (the number of correct predictions over all predictions) is the most commonly used, but derived metrics such as sensitivity (also called recall), precision, and the F1 score attempt to build the model's intended use into the assessment. For instance, a model that predicts whether smoke detectors are constructed correctly would have dire consequences if it passed a non-working detector as functional. Treating “defective” as the positive class, the best model is the one with the highest sensitivity, as a false negative (a defective detector shipped, which may later fail to alarm) is much worse than a false positive (a working detector flagged for rework or re-inspection).
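The sketch below, using invented binary labels with 1 standing for “defective,” shows how the confusion-matrix counts and the metrics derived from them can be computed; it illustrates the definitions rather than any prescribed implementation.

import numpy as np

# Hypothetical labels: 1 = defective detector, 0 = working detector.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))  # defects correctly flagged
tn = np.sum((y_true == 0) & (y_pred == 0))  # working units correctly passed
fp = np.sum((y_true == 0) & (y_pred == 1))  # working units flagged needlessly
fn = np.sum((y_true == 1) & (y_pred == 0))  # defects missed (the costly error here)

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall: share of defective units caught
precision = tp / (tp + fp)     # share of flagged units that are truly defective
f1 = 2 * precision * sensitivity / (precision + sensitivity)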
When considering a model that outputs class probabilities instead of discrete binary predictions, the receiver operating characteristic (ROC) curve can be used to understand the predictive power of a model without concern for particular probability thresholds or class imbalance within the data. A statistician can use the ROC curve and the area under the curve (AUC) to improve the model before setting the prediction threshold according to the application, as discussed previously. Note that while ROC and AUC are often applied to non-binary classification tasks, where a sample may belong to one of many possible classes (or to several at once), they are strictly defined only for binary outcomes. Cross-entropy loss (used during model training, it penalizes a model more heavily for being confidently incorrect) and Cohen's kappa (used during model evaluation, it accounts for unequal class sizes) are more complex but better suited to the task of finding a useful model in multi-class settings.
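A brief sketch of how these probability-based metrics might be computed, assuming scikit-learn is available; the labels and predicted probabilities are invented, and the 0.5 cutoff used for the kappa calculation is only a placeholder for whatever application-specific threshold is eventually chosen.

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, cohen_kappa_score

# Hypothetical binary labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_hat = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.6])

auc = roc_auc_score(y_true, p_hat)   # threshold-free measure of ranking quality
ce = log_loss(y_true, p_hat)         # cross-entropy: punishes confident mistakes
kappa = cohen_kappa_score(y_true, (p_hat >= 0.5).astype(int))  # agreement beyond chance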
Another very important group of metrics is the information criteria, which balance the performance and complexity of a model in an attempt to counter overfitting. While it may seem obvious that a simpler model is preferred, there is also a mathematical argument against complexity. If a model has as many fitted parameters as it has samples, a well-chosen model can be trained to fit the existing dataset exactly. If the training data were identical to all possible future data, there would be no problem. Of course, this is never the case, so the model is overfit to the training set and will predict future data poorly. Common metrics in this group include the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
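For a least-squares fit with Gaussian errors, AIC and BIC are commonly written in terms of the residual sum of squares, the sample size n, and the number of fitted parameters k, with constant terms dropped because they cancel when candidate models are compared. A minimal sketch under that assumption:

import numpy as np

def aic_bic_gaussian(y_obs, y_pred, k):
    # AIC and BIC for a least-squares fit with Gaussian errors (constants dropped).
    # k is the number of fitted parameters; n is the number of samples.
    n = len(y_obs)
    rss = np.sum((y_obs - y_pred) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

Both criteria reward a smaller residual sum of squares but charge a penalty for each additional parameter; BIC's penalty grows with the sample size, so it tends to favor simpler models than AIC does.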
Finally, the last group of model-assessment methods is the classical goodness-of-fit tests. These complement the Q-Q plots mentioned earlier and are used to determine whether model residuals follow a specific probability distribution. This is important to determine which terms are significant and to assess how to construct outputs such as prediction and tolerance limits.
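As one concrete possibility, the Shapiro-Wilk and Anderson-Darling tests are commonly used to check whether residuals are consistent with a normal distribution; the sketch below, with invented residuals, shows how they might be run using scipy.

import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model.
rng = np.random.default_rng(1)
residuals = rng.normal(scale=1.5, size=80)

# Shapiro-Wilk: a small p-value casts doubt on the normality assumption.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Anderson-Darling: compare the statistic against the tabulated critical values.
ad_result = stats.anderson(residuals, dist="norm")
print(shapiro_p, ad_result.statistic, ad_result.critical_values)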
We have illustrated a progression of model precautions and assessments, depending on the type of model and the requirements of the decision problem. Ultimately, the best model is the one that is “useful” or “fit for purpose” and can be constructed with available data that satisfy the underlying requirements of the model. ●
Katie Daisey, Ph.D., is a senior research scientist at Arkema Inc., leading R&D and manufacturing in the areas of data science and digital transformation. Daisey currently serves as chair of the committee on quality and statistics (E11).
November / December 2025