Measurement-Uncertainty Protocol for Regression-Based Analytical Instruments
Within the scientific community, measurement uncertainty is a topic of great interest these days. Although it would be nice to have a single protocol that would calculate the uncertainty for all types of measurements made in an analytical chemistry lab, the real world does not lend itself to such an ideal. The members of ASTM Committee D19 on Water have recognized this reality and the group is moving to address a specific portion of the measurement-uncertainty arena. This article summarizes the work that is being done on proposed standard WK13183, Practice for the Estimation of Measurement Uncertainty of a Test Method Involving a Regression-Based Analytical Instrument.
For purposes of this discussion, an analytical instrument will be classified according to the answer to the question, “Is the instrument regression-based or not?” An example of an instrument that is not regression-based is a balance. If an object is placed on a balance, the readout is in the desired units. No user intervention is required to get to the needed result.
However, for an instrument such as a chromatograph or a spectrometer, the raw data are in user-unfriendly units (e.g., peak area or absorbance). The operator must undertake a process to transform the instrumental data into useful units, typically concentration. Regression is at the core of this transformation process. Solutions of known analyte concentrations must be prepared and analyzed on the instrument. The raw responses are plotted versus the true concentrations, and some type of curve is drawn through the data points. This plot can be used to calculate the concentrations of sample analytes from the corresponding raw data. The statistical technique used to determine an appropriate curve is called regression, thereby giving a name to the classification scheme for this paper.
One additional distinction will be made regarding the applicability of this protocol. This article will deal only with intralaboratory data. In other words, the variability introduced by collecting results from more than one lab is not being considered. The examples that are shown here are for one instrument with one operator.
A brief illustration will start the discussion. A sample is to be analyzed to determine if it is under the upper specification limit of 5 (the actual units of concentration do not matter). The final test result is 4.5. The question then is whether the sample should pass or fail. Clearly, 4.5 is less than 5. If the numbers are treated as being absolute, then the sample will pass.
However, such a judgment call ignores the variability that always exists with a measurement. The width of any measurement’s uncertainty interval depends not only on the noisiness of the data, but also on the confidence level the user wishes to assume. This latter consideration is not a statistical decision, but a political choice that must be based on the needs of the customer and/or the intended use of the data.
Once the confidence level is chosen, the interval can be calculated from the data. In this example, if the uncertainty is determined to be ±1.0, then there is serious doubt as to whether the sample passes or not, since the true value could be anywhere between 3.5 and 5.5. On the other hand, if the uncertainty is only ±0.1, then the sample could be passed with a high level of comfort. Only by making a sound evaluation of the uncertainty can the user determine how to apply the sample estimate he or she has obtained. The following protocol is designed to answer questions such as:
4.5 ± ?
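The pass/fail reasoning above can be sketched in a few lines of Python. This is only an illustration: the function name classify and the three outcome labels are assumptions made for this sketch and are not part of the proposed standard.

```python
def classify(result, half_width, upper_limit):
    """Compare a result and its uncertainty interval with an upper limit.

    The name and the outcome labels are illustrative assumptions.
    """
    low = result - half_width
    high = result + half_width
    if high < upper_limit:
        return "pass"            # whole interval lies below the limit
    if low > upper_limit:
        return "fail"            # whole interval lies above the limit
    return "indeterminate"       # interval straddles the limit


print(classify(4.5, 1.0, 5.0))   # interval runs from 3.5 to 5.5
print(classify(4.5, 0.1, 5.0))   # interval runs from 4.4 to 4.6
```

How an "indeterminate" outcome is handled (retest, reject, or accept at the customer's risk) is a policy matter, in keeping with the point that the confidence level is chosen by the user.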
To calculate the uncertainty associated with measurements from regression-based instruments, the following protocol is used. The basic steps are as follows:
1. Calibration study;
2. Regression diagnostics;
3. Recovery study; and
4. Regression diagnostics.
The calibration study is conducted in pure solvent. The choice of this liquid will depend primarily on the solubility of the analytes. Water is typically used in ion chromatography; an organic solvent is often associated with gas chromatography.
With the study design, the ultimate goal is to decide what concentrations (or levels) will be included, and how many replicates of each solution will be analyzed. To make these decisions, several questions should be addressed.
First, what is the concentration range of interest? Some prior knowledge is needed of the levels expected in the samples that will have to be tested eventually. This range should be wide enough to prevent having to extrapolate the calibration curve.
Second, will the sensitivity of the instrument be challenged? Are reliable data necessary in the low-end region, meaning that sufficient levels and replicates are needed in this area? For work in this region, a well-chosen blank typically is necessary.
Third, will high precision be needed in at least some portions of the working range, indicating that an adequate number of replicates are required at each concentration?
Fourth, are the data expected to exhibit curvature? If so, then an adequate number of concentrations should be assigned to the suspect portion of the range.
Fifth, are there specification limits that are of concern? Such critical concentrations should be included in the design and should also be bracketed tightly.
Once the above questions (and any others that are of concern) have been answered, the actual concentration range, along with the number of concentrations and the number of replicates, can be selected. It is not mandatory that the same number of replicates be analyzed for each concentration. Also, the confidence level should be set, since that determination must be made before data can be analyzed properly. Finally, within each set of replicates, the set of concentrations should be analyzed in random order. This process allows for the determination of such phenomena as carryover.
There is no “magic” design that works for all calibration studies. However, a good starting place is a 5 × 5 arrangement (i.e., five replicates of each of five concentrations). The numbers can and should be adapted to fit the needs of the study (and, ultimately, the analytical method). It is good to keep in mind that having a high number of data points is desirable.
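As a rough sketch of how such a design might be turned into a run sheet, the Python below randomizes the order of the concentrations within each replicate set. The function name run_order, the seed argument, and the five concentration values are hypothetical choices made for illustration only.

```python
import random


def run_order(concentrations, n_replicates, seed=None):
    """Return a list of (replicate, concentration) pairs.

    Within each replicate set, the concentrations are shuffled so that
    they are analyzed in random order.
    """
    rng = random.Random(seed)
    order = []
    for rep in range(1, n_replicates + 1):
        levels = list(concentrations)
        rng.shuffle(levels)
        order.extend((rep, level) for level in levels)
    return order


# Hypothetical design: five levels (a blank plus four standards),
# five replicates of each.
levels = [0, 5, 10, 25, 50]
plan = run_order(levels, n_replicates=5, seed=1)
for rep, level in plan[:5]:
    print(rep, level)
```

Passing a fixed seed makes the run sheet reproducible; omitting it gives a fresh random order each time.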
Regression Diagnostics for Calibration Data
Once the study has been performed, the data must be examined. Analysts who routinely use chromatographs and spectrometers are familiar with the basics of the regression process. The final results are: 1) a plot that visually relates the responses (on the y-axis) to the true concentrations (on the x-axis), and 2) an equation that mathematically relates the two variables.
Underlying these outputs are two basic choices: 1) a model, such as a straight line or some sort of curved line, and 2) a fitting technique, which is a version of least squares. The first choice is well known to most analysts, but the second is less well understood and will be discussed in more detail.
There are in essence two forms of least-squares fitting. One is ordinary least squares (OLS) and is the default in software packages. The other choice is weighted least squares (WLS). Which technique is needed depends on the behavior of the standard deviations of the responses. If these deviations trend with concentration, then WLS is needed. WLS is the same as OLS, except that the data are weighted according to how noisy they are. Values that are relatively “tight” are afforded more weight (and therefore influence the regression line more) than are the more scattered numbers.
Several formulas have been used for calculating the weights. The simplest is 1/x (where x = true concentration), followed by 1/x². At each true concentration, the reciprocal square of the actual standard deviation has also been used. However, the preferred formula comes from modeling the standard deviation. In other words, the actual standard-deviation values are plotted versus true concentration; an appropriate model is then fitted to the data. The reciprocal square of the equation for the line is then used to calculate the weights.
The simplest model is a straight line, but more precise modeling should be done if the situation requires it. (In practice, it is best to normalize the weight formula by dividing by the sum of all the reciprocal squares. This process assures that the root mean square error is correct.)
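The modeled-standard-deviation approach might be sketched as follows, assuming the simplest case of a straight-line model for the standard deviation. The function names and the response values are hypothetical; the sketch only illustrates the arithmetic (fit the line, take the reciprocal square, divide by the sum).

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def modeled_weights(replicates):
    """replicates maps each true concentration to its list of responses.

    Returns {concentration: normalized weight}; the raw weight is the
    reciprocal square of the modeled standard deviation.
    """
    conc = sorted(replicates)
    sds = [stdev(replicates[c]) for c in conc]
    a, b = fit_line(conc, sds)                   # model: sd = a + b * x
    modeled = [a + b * c for c in conc]
    if min(modeled) <= 0:
        raise ValueError("modeled standard deviation must be positive")
    raw = [1.0 / m ** 2 for m in modeled]
    total = sum(raw)                             # normalize by the sum
    return {c: r / total for c, r in zip(conc, raw)}


# Hypothetical responses whose scatter grows with concentration.
data = {
    0: [10, 16, 6, 14, 4],
    10: [110, 118, 105, 114, 103],
    20: [205, 222, 198, 215, 190],
    40: [410, 440, 385, 425, 370],
}
for c, w in modeled_weights(data).items():
    print(c, round(w, 4))
```

The guard against a non-positive modeled standard deviation is an addition of this sketch; a fitted line can dip to or below zero near the blank, in which case a different model for the standard deviation would be needed.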
In sum, two choices, which are independent of each other, must be made in performing regression. These two choices are a model and a fitting technique. In practice, the options are usually obtained from those in Table 1.
However, a straight line is not automatically associated with OLS, nor is a quadratic automatically paired with WLS. The choice of a fitting technique depends solely on the behavior of the responses’ standard deviations (i.e., do they trend with concentrations?). The choice of a model depends only on whether the data points exhibit some type of curvature.
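The independence of the two choices can be made concrete with a small fitting routine in which the model and the fitting technique are separate arguments. The following is a minimal sketch in plain Python, not a replacement for a statistics package; the function name fit_polynomial and the example numbers are assumptions for illustration.

```python
def fit_polynomial(x, y, degree, weights=None):
    """Least-squares polynomial fit; returns [c0, c1, ...] for
    c0 + c1*x + c2*x**2 + ...

    degree selects the model (1 = straight line, 2 = quadratic).
    weights selects the fitting technique: None gives OLS, a list of
    per-point weights gives WLS.
    """
    if weights is None:
        weights = [1.0] * len(x)
    p = degree + 1
    # Build the weighted normal equations A c = b.
    A = [[sum(w * xi ** (r + c) for w, xi in zip(weights, x))
          for c in range(p)] for r in range(p)]
    b = [sum(w * yi * xi ** r for w, xi, yi in zip(weights, x, y))
         for r in range(p)]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * p
    for r in reversed(range(p)):
        s = sum(A[r][c] * coef[c] for c in range(r + 1, p))
        coef[r] = (b[r] - s) / A[r][r]
    return coef


# Hypothetical data; any pairing of model and technique is allowed.
x = [0, 1, 2, 3, 4, 5]
y = [1.0, 3.4, 7.1, 11.4, 17.2, 23.4]
w = [1.0 / (1.0 + xi) ** 2 for xi in x]
print(fit_polynomial(x, y, 1))             # straight line, OLS
print(fit_polynomial(x, y, 1, weights=w))  # straight line, WLS
print(fit_polynomial(x, y, 2))             # quadratic, OLS
print(fit_polynomial(x, y, 2, weights=w))  # quadratic, WLS
```

Solving the normal equations directly is adequate for low-degree models over a modest concentration range; for anything larger, a numerically sturdier routine from an established library would be the safer choice.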
Once an appropriate model and fitting technique have been selected, the regression line and plot can be determined. One other very important feature can also be calculated and graphed. That feature is the prediction interval, which is an “envelope” around the line itself and which reports the uncertainty (at the chosen confidence level) in a future measurement predicted from the line. An example is given in Figure 1. The solid line is the regression line; the dashed lines form the prediction interval.
The interval in Figure 1 is parallel to the regression line. This geometry will occur when OLS is the appropriate fitting technique. However, if WLS is needed, the interval will flare. This WLS phenomenon makes sense, since the uncertainty in relatively noisy data will be larger than will the uncertainty in “tight” data.
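For the straight-line, OLS case, the prediction interval can be computed with the standard textbook formula, sketched below. The function name and the data are hypothetical, and the Student t critical value is supplied by the caller (from a t table, for instance) rather than computed. Strictly speaking, the (x0 - mean of x) term makes the OLS band slightly wider toward the ends of the range, although with many data points it can look essentially parallel to the line.

```python
from math import sqrt
from statistics import mean


def ols_prediction_interval(x, y, x0, t_crit):
    """Prediction interval for one future response at x0 from a
    straight-line OLS fit.

    t_crit is the two-sided Student t critical value for the chosen
    confidence level with len(x) - 2 degrees of freedom.
    Returns (lower, predicted, upper).
    """
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))                      # root mean square error
    y0 = intercept + slope * x0
    half = t_crit * s * sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    return y0 - half, y0, y0 + half


# Hypothetical data: 10 points, so 8 degrees of freedom.
# 2.306 is the tabulated two-sided 95 percent t value for 8 df.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
print(ols_prediction_interval(x, y, 5.5, 2.306))
print(ols_prediction_interval(x, y, 10, 2.306))
```

In a weighted fit, the leading 1 under the square root is replaced by the reciprocal of the weight at x0, which is what produces the flare described above.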
While the concept of a model is familiar to most analysts, the statistically sound process for selecting an adequate model (and fitting technique) typically is not. A series of regression diagnostics will guide the user. The basic steps are as follows:
1. Plot y vs. x.
2. Determine y’s standard deviations.
3. Fit proposed model.
4. Examine residuals.
5. Conduct lack-of-fit test.
6. Evaluate prediction interval.
Step 1 generates a scatterplot. This graph is helpful for spotting potential errant data points (which may simply be due to typographical errors in the data table), as well as for getting a general sense about the behavior of the responses’ standard deviations and any curvature in the data. Step 2 will show which fitting technique (i.e., OLS or WLS) is needed. Steps 3 through 5 allow for the selection of an adequate model. Step 6 provides the information needed to decide if the uncertainty in the measurements is at an acceptable level.
A recovery study is in essence the calibration study conducted in the matrix itself, instead of pure solvent. In most cases, the design from the calibration study can be used for the recovery study.
Once this second study has been conducted, the response data for the matrix spikes are transformed into recovered concentrations via the calibration curve that was previously constructed.
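Transforming a response into a recovered concentration means solving the calibration equation for concentration. A minimal sketch for a straight-line or quadratic calibration curve follows; the function name, the coefficient values, and the range check are assumptions made for illustration.

```python
from math import sqrt


def recovered_concentration(response, c0, c1, c2=0.0,
                            x_min=0.0, x_max=float("inf")):
    """Solve c0 + c1*x + c2*x**2 = response for x within [x_min, x_max].

    c2 = 0 corresponds to a straight-line calibration curve.
    """
    if c2 == 0.0:
        roots = [(response - c0) / c1]
    else:
        disc = c1 ** 2 - 4.0 * c2 * (c0 - response)
        if disc < 0:
            raise ValueError("response is outside the range of the curve")
        roots = [(-c1 + sqrt(disc)) / (2.0 * c2),
                 (-c1 - sqrt(disc)) / (2.0 * c2)]
    # Keep only roots that fall inside the calibrated range.
    valid = [r for r in roots if x_min <= r <= x_max]
    if len(valid) != 1:
        raise ValueError("no unique concentration in the calibrated range")
    return valid[0]


# Hypothetical quadratic calibration: response = 100 + 50*x + 0.5*x**2,
# calibrated from 0 to 100 concentration units.
print(recovered_concentration(650.0, 100.0, 50.0, 0.5, x_min=0.0, x_max=100.0))
```

Restricting the answer to the calibrated range both selects the physically meaningful root of a quadratic and guards against extrapolating the curve.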
Regression Diagnostics for Recovery Data
The diagnostics for these results are the same steps that were discussed above for calibration work. The only difference is that the recovery plot is generated by regressing the recovered concentrations (minus any amount in the blanks) versus the true (i.e., spike) concentrations. This plot is similar to the calibration graph. In both cases, the x-axis is the true concentration. However, with the calibration curve, the peak areas (for the various standards) are on the y-axis, while with the recovery curve, the recovered concentrations are the responses.
These steps can best be illustrated with a real-world example. The calibration and spiking studies will be discussed in that order.
Calibration Study and Diagnostics
The scatterplot in Figure 2 is from a calibration study involving sulfate in water. Ten concentrations (including a blank) were analyzed, in random order, on each of eight separate days. The data do not exhibit obvious curvature, but the scatter does seem to increase with concentration.
The next step is to assess the behavior of the responses’ standard deviations. In Figure 3, the calculated (from the actual data) standard deviations are plotted vs. true concentration. A straight line, with OLS fitting, has been regressed through the data. The decision regarding trending depends on the slope of the line, and the p-value for the slope is the criterion. If the p-value is less than 0.01, then the slope is significant, meaning that the standard deviation does trend with concentration. Since 0.0003 is much less than the cutoff, WLS is needed in this situation. The basic formula for the weights is the reciprocal square of the line’s expression, (340 + 233 × ppb), where ppb is the true concentration.
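The slope test can be sketched as follows. Because the Python standard library has no Student t distribution, this sketch returns the t statistic and compares it with a tabulated critical value instead of computing the p-value; the two criteria are equivalent. The function name and the standard-deviation values are hypothetical and are not the data behind Figure 3.

```python
from math import sqrt
from statistics import mean


def slope_t_statistic(x, y):
    """OLS straight-line fit of y on x.

    Returns (intercept, slope, t), where t is the slope divided by its
    standard error, with len(x) - 2 degrees of freedom.
    """
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / se_slope


# Hypothetical standard deviations at ten concentrations (8 df).
conc = [0, 1, 2, 5, 10, 20, 30, 40, 50, 60]
sds = [350, 600, 800, 1400, 2700, 5100, 7200, 9800, 11900, 14500]
intercept, slope, t = slope_t_statistic(conc, sds)

T_CRIT = 3.355  # tabulated two-sided t value for p = 0.01 with 8 df
print("use WLS" if abs(t) > T_CRIT else "OLS is adequate")
```

A statistics package would report the p-value for the slope directly, which is the form of the criterion used in the text.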
Once the fitting technique has been determined, proposed models can be tried. In Figure 4, a straight line (SL) has been fitted via WLS. To determine the adequacy of this model, the diagnostics of the lack-of-fit (LOF) test and the residual pattern are used. If the p-value for the LOF test is less than 0.05, then the model is not adequate, meaning that one or more terms is missing. Here, the <0.0001 p-value indicates that a straight line is not adequate to explain the data. (Unfortunately, this test does not provide guidance for a better model.)
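The lack-of-fit test compares how far the mean response at each concentration lies from the fitted model with the scatter of the replicates about those means. A minimal sketch of the F statistic follows, with optional per-concentration weights for a WLS fit. The function name and data are hypothetical, and converting F to a p-value requires the F distribution (from tables or a statistics package), which is left out here.

```python
from statistics import mean


def lack_of_fit_f(replicates, predict, n_params, weights=None):
    """Lack-of-fit F statistic.

    replicates: {concentration: [responses]}
    predict:    fitted model as a function of concentration
    n_params:   number of fitted coefficients (2 for a straight line)
    weights:    optional {concentration: weight} for a WLS fit
    Returns (F, df_lack_of_fit, df_pure_error).
    """
    if weights is None:
        weights = {c: 1.0 for c in replicates}
    ss_pe = ss_lof = 0.0
    n_total = 0
    for c, ys in replicates.items():
        ybar = mean(ys)
        # Pure error: scatter of replicates about their own mean.
        ss_pe += weights[c] * sum((y - ybar) ** 2 for y in ys)
        # Lack of fit: distance of each mean from the model.
        ss_lof += weights[c] * len(ys) * (ybar - predict(c)) ** 2
        n_total += len(ys)
    df_lof = len(replicates) - n_params
    df_pe = n_total - len(replicates)
    return (ss_lof / df_lof) / (ss_pe / df_pe), df_lof, df_pe


# Hypothetical curved data: three replicates at each of five levels.
data = {c: [c ** 2 - 1, c ** 2, c ** 2 + 1] for c in range(5)}

# The OLS straight line through these data is y = -2 + 4x.
print(lack_of_fit_f(data, lambda c: -2 + 4 * c, n_params=2))
print(lack_of_fit_f(data, lambda c: c ** 2, n_params=3))
```

For these made-up data, the straight line gives F = 14 on 3 and 10 degrees of freedom, while the quadratic gives F = 0, mirroring the contrast between an inadequate and an adequate model.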
The second diagnostic (i.e., the residual pattern) often can help not only with determining the adequacy of the proposed model, but also with selecting a better one if the current model is not appropriate. Figure 5 shows the residual plot for the SL/WLS fit. A residual for a given data point is the actual response (for that particular concentration) minus the response the model predicts (also at that particular concentration). In other words, the residual is the part of the response that is not explained by the model. The plot of all the residuals should be a random scatter of the points about the zero line. The ideal pattern has the zero line going through the mean of the residuals at each concentration.
In this SL/WLS trial, the pattern is not random. The highest two concentrations ride high, with the mean of each group well above the zero line. This single-inflection-point type of curve suggests that a quadratic might be a better choice. Thus, the pattern agrees with the LOF test in indicating that an SL is not adequate.
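A numerical companion to the residual plot is the mean residual at each concentration, which ideally sits near zero everywhere. The sketch below uses hypothetical data and a hypothetical function name; it is not the sulfate data set.

```python
from statistics import mean


def residual_means(replicates, predict):
    """Mean residual (actual response minus predicted response) at each
    concentration.  replicates: {concentration: [responses]}."""
    return {c: mean(y - predict(c) for y in ys)
            for c, ys in sorted(replicates.items())}


# Hypothetical curved data with a straight line (y = -2 + 4x) forced
# through it: the means swing from positive to negative and back.
data = {c: [c ** 2 - 1, c ** 2, c ** 2 + 1] for c in range(5)}
print(residual_means(data, lambda c: -2 + 4 * c))
```

A systematic run of same-signed means over part of the range, as in this made-up case, is the numerical signature of the nonrandom pattern described above.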
The quadratic model, with WLS fitting, is tested next. The resulting plot, along with the associated prediction interval (at 95 percent confidence) and LOF p-value, is shown in Figure 6. The p-value of 0.9900 is well above the 0.05 cutoff, thereby indicating that the quadratic is an appropriate model.
The associated residual pattern (see Figure 7) supports the LOF-test results. The points for the top two concentrations are now much better centered around the zero line. The trumpet shape of the pattern is characteristic of data where the responses’ standard deviations trend with concentration.
The results of the regression process and diagnostics are that the calibration curve is a quadratic curve that is fitted via WLS. The width of the prediction interval (at 95-percent confidence) is ± ~2 ppb at the widest point; the adequacy of this uncertainty will depend on the needs of the user.
Recovery Study and Diagnostics
From the scatterplot, an appropriate model and fitting technique are found for the data, using the same diagnostic steps outlined above for the calibration study. The scatterplot is shown in Figure 8. These data are from a recovery study that involved the calibration curve in the above example. The matrix was 30-percent hydrogen peroxide. The procedure calls for the digestion of the H2O2 to water, followed by analysis for sulfate.
The regression diagnostics revealed that a straight line was an adequate model; however, WLS was needed for the fitting technique. The final plot, with the prediction interval at 95 percent confidence, is given in Figure 9.
Evidence for the adequacy of the model is indicated by the fact that the LOF p-value was 0.2016. The residual pattern (see Figure 10) also supported the choice of an SL.
In summary, the protocol for determining measurement uncertainty for results from regression-based instruments involves four main steps:
1. Calibration study;
2. Regression diagnostics;
3. Recovery study; and
4. Regression diagnostics.
Other key points to remember are:
1. Keep the total number of data points in each study high.
2. Determine if OLS or WLS is needed for fitting the proposed model to the data.
3. Use the residual pattern and the LOF test to evaluate the adequacy of the chosen model.
4. Evaluate the width of the prediction interval (at the user-chosen confidence level), remembering that accepting or rejecting the amount of uncertainty is a judgment call, not a statistical decision.
Finally, once the recovery plot and associated prediction interval are available, it is possible to answer the type of question posed in the introduction:
4.5 ± ?
The answer is:
4.5 ± (p.i. / 2)
(i.e., 4.5 plus-or-minus the half-width of the prediction interval).
I would like to acknowledge David E. Coleman (Alcoa Technical Center), John R. Hubbling (Metropolitan Council), and James Rice (retired consultant) for their help in making this article a reality. David has been my statistics mentor since I met him in 1995. Without his tutoring, I would never have been able to write this article or lead the related task group. John is the chairman of the D19.02 subcommittee and is the person who, over the years, has urged me to form a measurement-uncertainty task group. Jim took me under his wing when I first joined ASTM. When I was uncertain about accepting the chairmanship of my first task group (precision and bias), he convinced me to “forge ahead.” I owe all three of these friends a debt of gratitude for their encouragement and support.