1. Assessing agreement for diagnostic devices.
FDA/Industry Statistics Workshop, September 28-29, 2006.
Bipasa Biswas, Mathematical Statistician, Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, FDA.
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.

2. Outline.
Accuracy measures for diagnostic tests with a dichotomous outcome.
  - Ideal world: tests with a reference standard. Two indices to measure accuracy: sensitivity and specificity.
  - Assessing agreement between two tests in the absence of a reference standard: overall agreement, Cohen's Kappa, McNemar's test, proposed remedy.
Extending agreement to tests with more than 2 outcomes.
  - Cohen's Kappa.
  - Extension to the random marginal agreement coefficient (RMAC).
  - Should agreement per cell be reported?

3. Ideal world: tests with a perfect reference standard (single test).
If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), then we can represent the data as:

                  True status
  Test      D+       D-       Total
  T+        TP       FP       TP+FP
  T-        FN       TN       FN+TN
  Total     TP+FN    FP+TN    TP+FP+FN+TN

If the true status of the disease is known, then we can estimate the sensitivity Se = TP/(TP+FN) and the specificity Sp = TN/(TN+FP).

4. Ideal world: tests with a perfect reference standard (comparing two tests).
McNemar's test is used to test equality of either sensitivity or specificity.

  True status: Disease (D+)
                  Comparator test
  New test    R+       R-       Total
  T+          a1       b1       a1+b1
  T-          c1       d1       c1+d1
  Total       a1+c1    b1+d1    a1+b1+c1+d1

  True status: No disease (D-)
                  Comparator test
  New test    R+       R-       Total
  T+          a2       b2       a2+b2
  T-          c2       d2       c2+d2
  Total       a2+c2    b2+d2    a2+b2+c2+d2

McNemar chi-square:
  Equality of sensitivities of the two tests: (|b1 - c1| - 1)^2 / (b1 + c1)
  Equality of specificities of the two tests: (|c2 - b2| - 1)^2 / (c2 + b2)

5. Ideal world: tests with a perfect reference standard (comparing two tests). Example.

  True status: Disease (D+)
                  Comparator test
  New test    R+     R-     Total
  T+          80     5      85
  T-          10     5      15
  Total       90     10     100

  True status: No disease (D-)
                  Comparator test
  New test    R+     R-     Total
  T+          85     20     105
  T-          5      790    795
  Total       90     810    900

  Se(T) = 85.0% (85/100)    Sp(T) = 88.3% (795/900)
  Se(R) = 90.0% (90/100)    Sp(R) = 90.0% (810/900)

McNemar chi-square:
  Equality of sensitivities of the two tests: (|5 - 10| - 1)^2 / (5 + 10), p-value = 0.30, 95% CI for the difference (-13.5%, 3.5%).
  Equality of specificities of the two tests: (|5 - 20| - 1)^2 / (5 + 20), p-value = 0.005, 95% CI for the difference (-2.9%, -0.5%).

6. McNemar's test when a reference standard exists.
Note, however, that McNemar's test only checks for equality: the null hypothesis is equivalence and the alternative hypothesis is difference. This is not an appropriate hypothesis, because a failure to find a statistically significant difference is naively interpreted as evidence for equivalence. The 95% confidence interval of the difference in sensitivities and specificities gives a better idea of the difference between the two tests.

7. Imperfect reference standard.
A subject's true disease status is seldom known with certainty. What is the effect on sensitivity and specificity when the comparator test R itself has error?

              Imperfect reference test (comparator test)
  New test    R+     R-     Total
  T+          a      b      a+b
  T-          c      d      c+d
  Total       a+c    b+d    a+b+c+d

8. Imperfect reference standard. Example 1.
Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) which misses 20% of the diseased subjects but never falsely indicates disease.

                  True status
  New test    D+     D-     Total
  T+          80     30     110
  T-          20     70     90
  Total       100    100    200

                  Imperfect reference test
  New test    R+     R-     Total
  T+          64     46     110
  T-          16     74     90
  Total       80     120    200

  Se = 80.0% (80/100)    Se (relative to R) = 80.0% (64/80)
  Sp = 70.0% (70/100)    Sp (relative to R) = 61.7% (74/120)

9. Imperfect reference standard. Example 2.
Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R which misses 20% of the diseased subjects, but the error in R is related to the error in T.

                  True status
  New test    D+     D-     Total
  T+          80     30     110
  T-          20     70     90
  Total       100    100    200

                  Imperfect reference test
  New test    R+     R-     Total
  T+          80     30     110
  T-          0      90     90
  Total       80     120    200

  Se = 80.0% (80/100)    Se (relative to R) = 100.0% (80/80)
  Sp = 70.0% (70/100)    Sp (relative to R) = 75.0% (90/120)

10. Imperfect reference standard. Example 3.
Now suppose our test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

                  True status
  New test    D+     D-     Total
  T+          100    0      100
  T-          0      100    100
  Total       100    100    200

                  Imperfect reference test
  New test    R+     R-     Total
  T+          90     10     100
  T-          10     90     100
  Total       100    100    200

  Se = 100.0% (100/100)    Se (relative to R) = 90.0% (90/100)
  Sp = 100.0% (100/100)    Sp (relative to R) = 90.0% (90/100)

11. Challenges in assessing agreement in the absence of a reference standard.
Two commonly used overall measures are the overall agreement measure and Cohen's Kappa. A third approach is McNemar's test.
Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).

12. Estimate of agreement.
The overall percent agreement can be calculated as:
  100% x (a+d)/(a+b+c+d)
The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives. Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (reference: Feinstein et al.).
  PPA = 100% x a/(a+c)
  NPA = 100% x d/(b+d)

13. Why not report just the overall percent agreement?
The overall percent agreement is insensitive to off-diagonal imbalance.

              Imperfect reference test
  New test    R+     R-     Total
  T+          70     15     85
  T-          0      15     15
  Total       70     30     100

The overall percent agreement is 85.0%, and yet it does not account for the off-diagonal imbalance. The PPA is 100% and the NPA is only 50%.

14. Why report both PPA and NPA?

  Table 1
              Imperfect reference test
  New test    R+     R-     Total
  T+          5      5      10
  T-          5      85     90
  Total       10     90     100

  Overall percent agreement = 90.0%
  PPA = 50.0% (5/10)   [95% CI = 18.7%, 81.3%]
  NPA = 94.4% (85/90)  [95% CI = 87.5%, 98.2%]

  Table 2
              Imperfect reference test
  New test    R+     R-     Total
  T+          35     5      40
  T-          5      55     60
  Total       40     60     100

  Overall percent agreement = 90.0%
  PPA = 87.5% (35/40)  [95% CI = 73.2%, 95.8%]
  NPA = 91.7% (55/60)  [95% CI = 81.6%, 97.2%]

15. Kappa measure of agreement.
Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum difference. It ranges from -1 to 1.

              Imperfect reference standard
  New test    R+     R-     Total
  T+          a      b      a+b
  T-          c      d      c+d
  Total       a+c    b+d    n = a+b+c+d

  kappa = (Io - Ie)/(1 - Ie), where
  Io = (a+d)/n
  Ie = ((a+c)(a+b) + (b+d)(c+d))/n^2

16. Kappa measure of agreement. Example.

              Imperfect reference standard
  New test    R+     R-     Total
  T+          35     15     50
  T-          15     35     50
  Total       50     50     100

  Io = 70/100 = 0.70
  Ie = ((50)(50) + (50)(50))/10000 = 0.50
  kappa = (0.70 - 0.50)/(1 - 0.50) = 0.40  [95% CI = 0.22, 0.58]

The overall percent agreement is 70.0%.

17. Is the kappa measure of agreement sensitive to the off-diagonal cells?

              Imperfect reference test
  New test    R+     R-     Total
  T+          35     30     65
  T-          0      35     35
  Total       35     65     100

  kappa = 0.45  [95% CI = 0.31, 0.59]

Although the overall agreement stayed the same (70%) and the marginal differences are much bigger than before, the kappa index is higher than before (0.45 versus 0.40). The kappa statistic is affected by the marginal totals even though the overall agreement is the same.

18. McNemar's test to check for equality in the absence of a reference standard.
Hypothesis: equality of rates of positive response.

              Imperfect reference test
  New test    R+     R-     Total
  T+          37     30     67
  T-          5      28     33
  Total       42     58     100

  McNemar chi-square = (|b - c| - 1)^2 / (b + c) = (|30 - 5| - 1)^2 / (30 + 5) = 16.46
  Two-sided p-value = 0.00005

19. McNemar's test (insensitivity to the main diagonal).

              Imperfect reference test
  New test    R+      R-      Total
  T+          3700    30      3730
  T-          5       2800    2805
  Total       3705    2830    6535

The p-value is the same as when a = 37 and d = 28, even though the new and the old test agree on 99.5% of individual cases.

20. McNemar's test (insensitivity to the main diagonal).

              Imperfect reference test
  New test    R+     R-     Total
  T+          0      19     19
  T-          18     0      18
  Total       18     19     37

The two-sided p-value is 1 even though the old and the new test agree on no cases.

21. Proposed remedy.
Instead of reporting overall agreement, kappa, or the McNemar's test p-value, report both positive percent agreement and negative percent agreement. In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.

22. Agreement of tests with more than two outcomes.
For example, in radiology one often compares the standard film mammogram to a digital mammogram, where the radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity. The article by Fay (2005) in Biostatistics proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen's Kappa).

23. Comparing two tests with more than two outcomes.
The advantage of RMAC is that differences between the two marginal distributions will not induce greater apparent agreement. However, as stated in the paper, like Cohen's Kappa with the fixed marginal assumption, the RMAC also depends on the heterogeneity of the population. Thus, in cases where the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.

24. Comparing two tests with more than two outcomes.
An omnibus agreement index for situations with more than two outcomes is beset by problems similar to those faced for tests with a dichotomous outcome. Also, in a regulatory setting where a new test device is being compared to a predicate device, RMAC may not be appropriate, as it gives equal weight to the marginals from the test and the predicate device. Instead, report individual agreement for each category.

25. Summary.
If a perfect reference standard exists, then for a dichotomous test both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.
If a new test is being compared to an imperfect predicate test, then the positive percent agreement and negative percent agreement, along with their 95% confidence intervals, are a more appropriate way of comparison than reporting the overall agreement, the kappa statistic, or McNemar's test.
In the case of tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion would be to report agreement for each cell.

26. References.
Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd ed. John Wiley & Sons, New York.
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1-6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1-12 and in British Medical Journal (2003) 326(7379), 41-44.)

27. References (continued).
Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol., 43(6), 543-549.
Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol., 43(6), 551-558.
Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6, 171-180.
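
The agreement measures defined in this presentation (overall percent agreement, PPA, NPA, Cohen's kappa, and the continuity-corrected McNemar chi-square) can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original slides. The function names are hypothetical, the 2x2 layout (a = T+R+, b = T+R-, c = T-R+, d = T-R-) follows the slides, and two choices are assumptions of this sketch: the p-value is taken from the chi-square distribution with 1 degree of freedom, and the corrected numerator is floored at zero when b equals c.

```python
import math


def agreement(a, b, c, d):
    """Return (overall, PPA, NPA) in percent for a 2x2 table.

    Rows are the new test (T+, T-), columns the comparator (R+, R-):
    a = T+R+, b = T+R-, c = T-R+, d = T-R-.
    """
    n = a + b + c + d
    overall = 100.0 * (a + d) / n
    ppa = 100.0 * a / (a + c)   # agreement among comparator positives
    npa = 100.0 * d / (b + d)   # agreement among comparator negatives
    return overall, ppa, npa


def cohen_kappa(a, b, c, d):
    """Cohen's kappa = (Io - Ie) / (1 - Ie) for a 2x2 table."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2
    return (observed - expected) / (1 - expected)


def mcnemar(b, c):
    """Continuity-corrected McNemar chi-square and two-sided p-value.

    Assumption of this sketch: the corrected numerator is floored at 0
    (relevant only when b == c), and b + c must be positive.
    """
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    print(agreement(70, 15, 0, 15))     # table from slide 13
    print(cohen_kappa(35, 15, 15, 35))  # table from slide 16
    print(cohen_kappa(35, 30, 0, 35))   # table from slide 17
    print(mcnemar(30, 5))               # off-diagonal cells from slides 18 and 19
    print(mcnemar(19, 18))              # off-diagonal cells from slide 20
```

Because `mcnemar` takes only the off-diagonal counts b and c, the insensitivity to the main diagonal shown on slides 19 and 20 is visible directly in its signature. The confidence intervals quoted on the slides are not reproduced here, since the slides do not state which interval method was used.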
