Testing for Equivalence

Why the TOST Procedure Works

Q: What statistical procedures are used to demonstrate that two sets of test results are equivalent?

A: Changes are often made in laboratory operations for expansion or improvement, and these may involve new apparatus, personnel, test methods or testing sites, for example. These changes must be supported by statistical methods to ascertain whether the test results from the old and new sources are close enough to be considered equivalent; that is, whether or not the average test results from the two sources agree within a prespecified limit. This article will use a comparison of a new and a current instrument as an example.

A typical experimental design generates n test results from each instrument on the same material. The test result average and standard deviation from each instrument are calculated. The next step is to calculate D, the difference between the two averages, and its standard error sD. Finally, the two-sided confidence interval on the true mean difference Δ is calculated as

D ± tsD (1)

The value for t is taken from a Student's t table (found in any basic statistics text) and is based on the number of test results n and a selected confidence level. This two-sided confidence interval is an estimate of the range of Δ values supported by the data.
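
As a minimal sketch of Equation 1, the following Python computes D, sD and the interval from two equal-sized sets of results. It assumes independent samples with a common variance, so that sD is pooled with 2n - 2 degrees of freedom; the article does not spell this out. The data values and the function name `mean_difference_ci` are made up for illustration, and t is passed in from a table rather than computed.

```python
import math
import statistics


def mean_difference_ci(new, current, t_value):
    """Return (D, sD, lower, upper) for Equation 1, D ± t*sD.

    Assumption: two independent samples of equal size n with a common
    variance, so sD = sqrt((s_new^2 + s_current^2) / n).
    """
    n = len(new)
    if n != len(current) or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    d = statistics.mean(new) - statistics.mean(current)
    s_d = math.sqrt((statistics.variance(new) + statistics.variance(current)) / n)
    return d, s_d, d - t_value * s_d, d + t_value * s_d


# Hypothetical data: n = 10 results per instrument on the same material.
new = [102.1, 99.8, 101.5, 100.9, 103.2, 100.4, 101.8, 99.5, 102.6, 101.0]
current = [100.2, 98.9, 100.7, 99.6, 101.1, 99.0, 100.5, 98.4, 101.3, 99.9]

# Table value for 95 percent two-sided confidence, 2n - 2 = 18 degrees of freedom.
T_95_DF18 = 2.101

d, s_d, lower, upper = mean_difference_ci(new, current, T_95_DF18)
print(f"D = {d:.2f}, sD = {s_d:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

With unequal sample sizes or clearly unequal variances, the standard error and degrees of freedom would need a different formula than the one assumed here.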

The statistical decision to accept or reject a zero value of Δ is based on the two-sample t test (TST). (Here, "sample" refers to a set of data.) The TST is found in most statistical computer packages and yields a p value, the probability of observing a difference at least as large as D if Δ were truly zero. If the p value is less than a predetermined value, say 0.05, then Δ = 0 is not supported and a significant difference is declared at a 5 percent risk of a wrong decision; otherwise, a zero difference is not rejected. An analogous approach is to reject a zero difference if the confidence interval from Equation 1 fails to cover zero, or to not reject Δ = 0 if the confidence interval includes zero. In the latter case the two instruments would be deemed equivalent by the TST.
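
The TST decision rule can be sketched without a p value by comparing |D| / sD with the tabulated t, which is the same as checking whether the interval from Equation 1 covers zero. The function name `tst_rejects_zero` and the summary values below are hypothetical; 2.101 is the two-sided 5 percent table value for 18 degrees of freedom.

```python
def tst_rejects_zero(d, s_d, t_crit):
    """Two-sided test of a zero true difference.

    Rejects zero when |D| / sD exceeds the table value t, which is
    equivalent to D ± t*sD failing to cover zero.
    """
    return abs(d) / s_d > t_crit


# Hypothetical summary values (not taken from the article).
print(tst_rejects_zero(d=1.32, s_d=0.55, t_crit=2.101))  # ratio 2.4 exceeds 2.101
print(tst_rejects_zero(d=0.80, s_d=0.55, t_crit=2.101))  # ratio about 1.45 does not
```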

However, the TST is primarily designed to prove or disprove that a statistically significant difference exists. This test is unsuitable for proving equivalence, because the TST is set up to directly control the risk of declaring a difference when there is truly no difference. The opposing risk of failing to declare a difference when a specified difference actually exists is controlled indirectly by the sample size n.

The proper statistical test to use for equivalence is the two one-sided test (TOST) for means. This test requires a predetermined value of the difference, termed an equivalence limit (symbol E) that is undesirable to exceed, based on subject matter requirements. The TOST performs two statistical tests on the difference, one that posits that Δ ≥ E and another one that posits that Δ ≤ -E, assuming symmetry around zero. If the two statistical tests each reject the propositions, then the two instruments are declared to be equivalent. TOST directly controls the risk of falsely declaring equivalence, thus protecting the laboratory and its customers from approving a non-equivalent instrument.
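
A minimal sketch of the two one-sided tests follows, assuming each is run as a one-sided t test at the same risk level with a tabulated critical value. The function name `tost_equivalent` and the numbers are illustrative only; 1.734 is the one-sided 5 percent table value for 18 degrees of freedom.

```python
def tost_equivalent(d, s_d, e_limit, t_crit_one_sided):
    """Two one-sided tests with a symmetric equivalence limit E.

    Test 1 rejects "true difference <= -E" when (D + E) / sD > t.
    Test 2 rejects "true difference >= E"  when (E - D) / sD > t.
    Equivalence is declared only when both propositions are rejected.
    """
    t_lower = (d + e_limit) / s_d
    t_upper = (e_limit - d) / s_d
    return t_lower > t_crit_one_sided and t_upper > t_crit_one_sided


# Hypothetical summary values (not taken from the article).
print(tost_equivalent(d=1.32, s_d=0.55, e_limit=3.0, t_crit_one_sided=1.734))
print(tost_equivalent(d=1.32, s_d=0.55, e_limit=2.0, t_crit_one_sided=1.734))
```

With E = 3.0 both statistics (about 7.85 and 3.05) exceed the table value, so equivalence is declared; with E = 2.0 the upper statistic (about 1.24) falls short and equivalence is not declared.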

An analogous TOST approach is to declare equivalency if the confidence interval in Equation 1 is completely contained within an equivalence range of -E to E. (It should be noted that the width of the confidence interval for the TOST is slightly narrower than the corresponding interval for the TST.)

For a more intuitive insight into comparing the TST and TOST procedures, Figure 1 depicts confidence intervals on Δ for four cases. These illustrate the similarities and differences between conclusions reached by the TST and TOST procedures. In this example the equivalence limit was set at E = 10 measurement units, so the equivalence range becomes the interval 0 ± 10 units.

  • Case 1 - The confidence interval does not include zero and is not completely contained within the equivalence range, so TST declares that there is a significant difference from zero, and TOST fails equivalence. The conclusions of the two tests agree that the new instrument does not meet the acceptance criterion for equivalence and that more work may be needed to determine the qualification of the new instrument.
  • Case 2 - The confidence interval includes zero, and it is also completely contained within the equivalence range, so TST declares that there is no significant difference and TOST accepts equivalence. The outcomes of both tests agree that the new instrument meets the acceptance criterion.

Figure 1 - Confidence Intervals for True Mean Difference Δ

  • Case 3 - The confidence interval does not include zero but is completely contained within the equivalence range, so TST declares a significant difference, but TOST accepts equivalence. The conclusions disagree. Because the variability is low, small differences may appear statistically significant with TST even though the difference is not of practical significance. This situation may occur when the test method is very precise or when n is much larger than needed. However, this situation actually improves the performance of TOST by giving a tighter estimate of the true difference.
  • Case 4 - The confidence interval includes zero but is not completely contained within the equivalence range, so TST declares no difference and TOST fails equivalence. The conclusions disagree. Since the test variability is high, large differences do not appear statistically significant with TST, but there is a high probability of a practically significant difference because a portion of the confidence interval exceeds the equivalence range. This outcome suggests that additional data should be generated to determine whether the confidence interval shrinks towards equivalence or does not.
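
The four cases can be sketched as a small classification of a confidence interval against zero and against the equivalence range. The interval endpoints below are made up to reproduce each pattern with E = 10; they are not read from Figure 1. For simplicity the sketch applies one interval per case to both procedures, although the TOST interval would be slightly narrower than the TST interval.

```python
def compare_conclusions(lower, upper, e_limit):
    """Return (TST conclusion, TOST conclusion) for one confidence interval."""
    covers_zero = lower <= 0.0 <= upper
    within_range = -e_limit < lower and upper < e_limit
    tst = "no significant difference" if covers_zero else "significant difference"
    tost = "equivalent" if within_range else "not equivalent"
    return tst, tost


E_LIMIT = 10.0  # equivalence limit used in the article's example

# Hypothetical intervals chosen to match the pattern of each case.
cases = {
    "Case 1": (4.0, 14.0),
    "Case 2": (-3.0, 5.0),
    "Case 3": (2.0, 6.0),
    "Case 4": (-4.0, 13.0),
}

for name, (lower, upper) in cases.items():
    tst, tost = compare_conclusions(lower, upper, E_LIMIT)
    print(f"{name}: TST -> {tst}; TOST -> {tost}")
```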

These examples illustrate potential problems with the TST and show that the TOST is the proper statistical procedure for the equivalence application.

Equivalence testing based on two independent samples of data is covered in greater detail in ASTM E2935, Practice for Conducting Equivalence Testing in Laboratory Applications. A revision of this standard is in progress to add a paired-samples design that addresses a range of materials and test levels.

Thomas D. Murphy, T.D. Murphy Statistical Consulting LLC, Morristown, New Jersey, is chairman of Subcommittee E11.20 on Test Method Evaluation and Quality Control, a part of ASTM Committee E11 on Quality and Statistics.

Dean V. Neubauer, Corning Inc., Corning, New York, is an ASTM International fellow, chairman of E11.90.03 on Publications and coordinator of the DataPoints column; he is immediate past chairman of Committee E11 on Quality and Statistics.
