The Observers Needed to Evaluate Subjective Tests (ONEST) software implements a statistical method described in Reisenbichler et al. (2020) to determine the minimum number of evaluators needed to estimate agreement involving a large number of raters. This method could be used by regulatory agencies, such as the FDA, when evaluating agreement levels of a newly proposed subjective laboratory test. Input to the program should be binary (1/0) pathology data, where “0” may stand for negative and “1” for positive. The example datasets in this software are from Rimm et al. (2017) (the SP142 assay) and Reisenbichler et al. (2020). This program runs in R version 3.5.0 and above.
We briefly introduce the statistical model and inference implemented by this program. Let $p^*$ denote the proportion of concordant (i.e., identical) reads among a group of raters, where the group size can be two or more. Let $p_+$ denote the proportion of tissue cases that are always evaluated positive by all raters, and $p_-$ the proportion that are always evaluated negative. Among the remaining proportion $1 - p_+ - p_-$ of cases, which could be rated either positive or negative, each case has probability $p$ of being rated positive by any pathologist. The proportion of concordant reads among $k$ pathologists can then be written as $p^*(k) = p_+ + p_- + (1 - p_+ - p_-)[p^k + (1 - p)^k]$.
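As a minimal sketch of the formula for the proportion of concordant reads, the R snippet below evaluates it for 2 to 18 raters. The function name `agreement_prop` is hypothetical and not part of the package; the parameter values are the estimates printed in the `$estimates` output for sp142_bin in this document.

```r
# Proportion of concordant reads among k raters:
# p_plus + p_minus + (1 - p_plus - p_minus) * [p^k + (1 - p)^k]
# (agreement_prop is a hypothetical helper, not a package function)
agreement_prop <- function(k, p, p_plus, p_minus) {
  p_plus + p_minus + (1 - p_plus - p_minus) * (p^k + (1 - p)^k)
}

# Estimates copied from the $estimates output shown in this vignette
p_hat       <- 0.4984245
p_plus_hat  <- 0.2794118
p_minus_hat <- 0.1029412

# Model-based agreement for k = 2, ..., 18 raters
agreement <- agreement_prop(2:18, p_hat, p_plus_hat, p_minus_hat)
print(round(agreement[1:3], 4))
```

Up to rounding of the inputs, the first few values should agree with the `consist_p` column of the example output (about 0.6912 for 2 raters and 0.5368 for 3 raters).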
Let $I$ denote the minimal sufficient number of pathologists, defined as the smallest integer $i$ that satisfies $p^*(i) - p^*(i+1) < p^{\delta}$ with a large probability (e.g., 95%), where $p^{\delta}$ is a threshold for the change in the percentage agreement due to including one additional pathologist. Let $p_c = p_+ + p_-$.
The statistical inference is based on the joint likelihood function of the parameters $p_+$, $p_-$, and $p$. For $n$ cases and $k$ pathologists, the data are $\{y_{ij};\ i = 1, \ldots, n,\ j = 1, \ldots, k\}$. Each observation $y_{ij}$ is binary, with $y_{ij} = 1$ if the read is positive and $y_{ij} = 0$ if the read is negative. The probabilities of the two outcomes are $P(y_{ij} = 1) = p_+ + p(1 - p_+ - p_-)$ and $P(y_{ij} = 0) = p_- + (1 - p)(1 - p_+ - p_-)$, respectively. We assume all $\{y_{ij}\}$ are independently and identically distributed. The likelihood function can then be written as $L(p, p_+, p_- \mid \{y_{ij}\}) = [p_+ + p(1 - p_+ - p_-)]^{T}\,[p_- + (1 - p)(1 - p_+ - p_-)]^{nk - T}$, where $T$ is the total number of reads equal to 1 among all $nk$ reads. With $k$ pathologists, let $n_c$ denote the number of cases with consistent reads among the $n$ cases, so $n_c \sim \mathrm{Bin}(n, p_c)$. Similarly, $n_+ \sim \mathrm{Bin}(n, p_+)$ and $n_- \sim \mathrm{Bin}(n, p_-)$, where $n_+$ and $n_-$ denote the numbers of cases that all pathologists read positive and negative, respectively.
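The quantities that enter the likelihood can be computed directly from a binary case-by-rater matrix. The sketch below uses a small made-up matrix, and the variable and function names (`total_pos`, `log_lik`, and so on) are illustrative assumptions rather than package internals.

```r
# Toy data (made up for illustration): 5 cases in rows, 3 raters in columns
y <- matrix(c(1, 1, 1,
              0, 0, 0,
              1, 0, 1,
              0, 1, 0,
              1, 1, 1), ncol = 3, byrow = TRUE)
n <- nrow(y)
k <- ncol(y)

total_pos <- sum(y)                  # T: number of reads equal to 1
n_plus    <- sum(rowSums(y) == k)    # cases read positive by every rater
n_minus   <- sum(rowSums(y) == 0)    # cases read negative by every rater
n_c       <- n_plus + n_minus        # cases with consistent reads

# Log of the likelihood written above; P(y = 0) equals 1 - P(y = 1)
log_lik <- function(p, p_plus, p_minus) {
  prob_pos <- p_plus + p * (1 - p_plus - p_minus)
  total_pos * log(prob_pos) + (n * k - total_pos) * log(1 - prob_pos)
}

print(c(T = total_pos, n_plus = n_plus, n_minus = n_minus, n_c = n_c))
```

Working on the log scale avoids numerical underflow when $nk$ is large; that is a choice made for this sketch, not a statement about how the package evaluates the likelihood.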
Based on binomial maximum likelihood estimation, the estimates satisfy $\hat{p}_+ = n_+/n$, $\hat{p}_- = n_-/n$, and $\hat{p}_+ + \hat{p}(1 - \hat{p}_+ - \hat{p}_-) = T/(nk)$, so that $\hat{p} = [T/(nk) - \hat{p}_+]/(1 - \hat{p}_+ - \hat{p}_-)$. We then estimate $p^*$ by plugging the estimates of $\{p_c, p\}$ into the equation $p^*(k) = p_c + (1 - p_+ - p_-)[p^k + (1 - p)^k]$. We define the objective function as $D(i) = p^*(i) - p^*(i+1) = (1 - p_+ - p_-)[p^i(1 - p) + p(1 - p)^i]$. The estimate of $p$ depends on the product of $n$ and $k$, and the estimate of $p_c$ is $n_c/n$. We use 95% as the probability threshold. Based on the central limit theorem, the asymptotic 95% lower bound of $p_c$ is $n_c/n - 1.645\,[n_c(n - n_c)/n^3]^{1/2}$. Because $D(i)$ decreases as $p_c$ increases, plugging in this lower bound of $p_c$ gives the upper bound of $D(i)$ at the 95% confidence level. If the upper bound of $D(i)$ is less than $p^{\delta}$, we conclude that $i$ is a sufficient number of pathologists.
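The bound and the stopping rule described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the package's implementation: the estimate of p and the number of cases are copied from the sp142_bin output shown in this document, `n_c = 26` is derived from the printed `p_plus + p_minus` times 68, and the threshold `p_delta = 0.01` is a hypothetical value chosen only for the example.

```r
# Summary quantities taken or derived from the sp142_bin output in this vignette
n     <- 68          # number of cases (size_case)
n_c   <- 26          # cases read identically by all raters: 68 * (p_plus + p_minus)
p_hat <- 0.4984245   # printed estimate of p

# Asymptotic 95% lower bound of p_c
pc_low <- n_c / n - 1.645 * sqrt(n_c * (n - n_c) / n^3)

# Upper bound of D(i) = (1 - p_c) * [p^i (1 - p) + p (1 - p)^i],
# obtained by plugging in the lower bound of p_c
d_upper <- function(i) {
  (1 - pc_low) * (p_hat^i * (1 - p_hat) + p_hat * (1 - p_hat)^i)
}

p_delta    <- 0.01   # hypothetical threshold, for illustration only
candidates <- 2:17
sufficient_i <- candidates[which(d_upper(candidates) < p_delta)[1]]
print(sufficient_i)
```

Here `d_upper(2)` is about 0.1786, in line with the first `diff_high` entry of the example output, and with the illustrative threshold of 0.01 the first group size whose upper bound falls below the threshold is 7.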
This software has one driver file ONEST_main. Input to ONEST_main include
Meanings of the output values are listed below.
consist_p: a vector of length k-1, giving the proportion of identical reads among a set of pathologists. For example, the first element of “consist_p” is the estimated agreement percentage for 2 raters, and the (k-1)-th element is the estimated agreement percentage for k raters.
consist_low: a vector of length k-1, indicating the lower bound of the agreement percentage with 95% confidence level corresponding to “consist_p”.
diff_consist: a vector of length k-2, giving the differences between consecutive elements of “consist_p”. For example, the first element of “diff_consist” is the estimated change in agreement percentage after increasing from 2 to 3 raters, and the (k-2)-th element is the change after increasing from k-1 to k raters.
diff_high: a vector of length k-2, indicating the upper bound of the change of agreement percentage corresponding to “diff_consist” with 95% confidence level.
size_case: number of cases n.
size_rater: number of raters k.
p: the probability of being rated positive among the proportion ‘1-p_plus-p_minus’ of cases.
p_plus: proportion of the cases rated positive by all raters.
p_minus: proportion of the cases rated negative by all raters.
empirical: a matrix of dimension k-1 by 3, including the empirical estimate of the agreement percentage, and the empirical 95% confidence intervals (CI) of the agreement percentage with equal tail probabilities on the two sides. The empirical estimate and CI were calculated by permuting the raters with 1000 random permutations, and using the mean, 2.5th percentile, and 97.5th percentile.
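The permutation procedure behind the empirical estimate can be sketched as follows. This is a simplified illustration on simulated data, not the package's code: the data, the helper `agreement_curve`, and the seed are all assumptions made for the example, and missing values are not handled.

```r
set.seed(1)

# Simulated binary reads (hypothetical): 40 cases in rows, 6 raters in columns
n <- 40
k <- 6
y <- matrix(rbinom(n * k, size = 1, prob = 0.6), nrow = n)

# For m = 2, ..., k: proportion of cases on which the first m raters all agree
agreement_curve <- function(mat) {
  sapply(2:ncol(mat), function(m) {
    s <- rowSums(mat[, 1:m, drop = FALSE])
    mean(s == 0 | s == m)
  })
}

# 1000 random orderings of the raters; result has k-1 rows and 1000 columns
curves <- replicate(1000, agreement_curve(y[, sample(k)]))

# Mean and equal-tailed 95% interval across permutations
empirical <- cbind(
  lower_bound = apply(curves, 1, quantile, probs = 0.025),
  mean        = rowMeans(curves),
  upper_bound = apply(curves, 1, quantile, probs = 0.975)
)
print(round(empirical, 3))
```

The last row always has identical bounds, because agreement among all k raters does not depend on their order; the same pattern is visible in the final row of the `$empirical` output for sp142_bin.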
All the outputs were saved in the following structure.
consistency: This output includes “consist_p” and “consist_low,” where the data are used to plot figure(5).
difference: This output includes “diff_consist” and “diff_high”, where the data are used to plot figure(6) that can be used to determine the minimum number of evaluators needed to estimate agreement.
estimates: This output includes the ONEST estimates “size_case”, “size_rater”, “p”, “p_plus”, and “p_minus”.
empirical: This output has the empirical estimation data for plotting figure(3). The first and third columns are the 2.5% and 97.5% lower and upper bounds of the empirical CI, respectively. The second column is the estimated agreement percentage using the empirical mean.
The dataset “sp142_bin” is a pathology dataset of triple negative breast cancer from Reisenbichler et al. (2020), stored as a 68 by 18 matrix (68 cases, 18 raters). The element in position (i, j) is 0 if the j-th rater evaluated the i-th case as negative and 1 if the evaluation was positive.
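To make the input layout concrete, the sketch below builds a toy matrix with cases in rows and raters in columns, then counts how many raters called each case positive. The matrix and its dimension names are made up for illustration and are not taken from any dataset in the package.

```r
# Toy input (made up): rows are cases, columns are raters, entries are 0 or 1
toy_reads <- matrix(c(1, 1, 1, 1,
                      0, 0, 0, 0,
                      1, 0, 1, 1,
                      0, 1, 0, 0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("case", 1:4), paste0("rater", 1:4)))

# The method expects binary data only
stopifnot(all(toy_reads %in% c(0, 1)))

# Number of raters calling each case positive, sorted from lowest to highest
positive_calls <- sort(rowSums(toy_reads))
print(positive_calls)
```

Sorted counts of this kind are the quantity described for the bar chart in figure(4).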
Details about other datasets in the package can be found in the reference manual.
The following code produces the same results as ONEST_main(sp142_bin), but it applies only to the example dataset sp142_bin and is used here to reduce the time needed to build the vignette. In practice, please use the ONEST_main function instead.
# figure(1): Plot of the agreement percentage in the order of columns in the inputs;
# figure(2): Plot of the 100 randomly chosen permutations;
# figure(3): Plot of the empirical confidence interval;
# figure(4): Barchart: the x axis is the case number and the Y axis is the number of pathologists that called that case positive, sorted from lowest to highest on the y axis;
# figure(5): Plot of the proportion of identical reads among a set of pathologists;
# figure(6): Plot of the difference between the proportion of identical reads among a set of pathologists;
# ONEST_main(sp142_bin)
data('empirical')
ONEST_vignettes(sp142_bin,empirical)
#> $consistency
#> consist_p consist_low
#> [1,] 0.6911795 0.6427088
#> [2,] 0.5367693 0.4640632
#> [3,] 0.4595634 0.3747395
#> [4,] 0.4209597 0.3300768
#> [5,] 0.4016573 0.3077448
#> [6,] 0.3920057 0.2965783
#> [7,] 0.3871797 0.2909948
#> [8,] 0.3847665 0.2882029
#> [9,] 0.3835598 0.2868068
#> [10,] 0.3829564 0.2861087
#> [11,] 0.3826547 0.2857597
#> [12,] 0.3825039 0.2855851
#> [13,] 0.3824284 0.2854978
#> [14,] 0.3823907 0.2854542
#> [15,] 0.3823718 0.2854324
#> [16,] 0.3823624 0.2854214
#> [17,] 0.3823577 0.2854160
#>
#> $difference
#> diff_consist diff_high
#> [1,] -1.544102e-01 1.786456e-01
#> [2,] -7.720588e-02 8.932368e-02
#> [3,] -3.860371e-02 4.466273e-02
#> [4,] -1.930243e-02 2.233203e-02
#> [5,] -9.651598e-03 1.116646e-02
#> [6,] -4.826038e-03 5.583506e-03
#> [7,] -2.413163e-03 2.791919e-03
#> [8,] -1.206665e-03 1.396057e-03
#> [9,] -6.033806e-04 6.980838e-04
#> [10,] -3.017172e-04 3.490731e-04
#> [11,] -1.508736e-04 1.745539e-04
#> [12,] -7.544503e-05 8.728646e-05
#> [13,] -3.772701e-05 4.364843e-05
#> [14,] -1.886594e-05 2.182703e-05
#> [15,] -9.434279e-06 1.091503e-05
#> [16,] -4.717841e-06 5.458327e-06
#>
#> $estimates
#> size_case size_rater p p_plus p_minus
#> [1,] 68 18 0.4984245 0.2794118 0.1029412
#>
#> $empirical
#> lower_bound mean upper_bound
#> [1,] 0.6029412 0.7898235 0.9264706
#> [2,] 0.5294118 0.6951176 0.8529412
#> [3,] 0.4558824 0.6306912 0.7941176
#> [4,] 0.4264706 0.5833529 0.7352941
#> [5,] 0.3970588 0.5447941 0.6911765
#> [6,] 0.3823529 0.5124412 0.6617647
#> [7,] 0.3676471 0.4878088 0.6176471
#> [8,] 0.3676471 0.4642941 0.5882353
#> [9,] 0.3529412 0.4468235 0.5735294
#> [10,] 0.3529412 0.4298824 0.5441176
#> [11,] 0.3529412 0.4145588 0.5147059
#> [12,] 0.3529412 0.4013088 0.5000000
#> [13,] 0.3529412 0.3902059 0.4852941
#> [14,] 0.3529412 0.3786765 0.4705882
#> [15,] 0.3529412 0.3684853 0.4558824
#> [16,] 0.3529412 0.3608382 0.4411765
#> [17,] 0.3529412 0.3529412 0.3529412
A small p-value from this score test indicates significant evidence that the observers’ agreement will converge to a non-zero proportion.
# (1) With example dataset sp263_bin:
# data("sp263_bin")
# ONEST_main(sp263_bin)
# ONEST_inflation_test(sp263_bin)
# (2) With example dataset NCCN_sp142:
# data("NCCN_sp142")
# ONEST_main(NCCN_sp142)
# ONEST_inflation_test(NCCN_sp142)
# (3) With example dataset NCCN_sp142_t:
# data("NCCN_sp142_t")
# ONEST_main(NCCN_sp142_t)
# ONEST_inflation_test(NCCN_sp142_t)
# (4) With example dataset NCCN_22c3_t:
# data("NCCN_22c3_t")
# ONEST_main(NCCN_22c3_t)
# ONEST_inflation_test(NCCN_22c3_t)
Reisenbichler, E. S., Han, G., Bellizzi, A., Bossuyt, V., Brock, J., Cole, K., Fadare, O., Hameed, O., Hanley, K., Harrison, B. T., Kuba, M. G., Ly, A., Miller, D., Podoll, M., Roden, A. C., Singh, K., Sanders, M. A., Wei, S., Wen, H., Pelekanou, V., Yaghoobi, V., Ahmed, F., Pusztai, L., and Rimm, D. L. (2020) “Prospective multi-institutional evaluation of pathologist assessment of PD-L1 assays for patient selection in triple negative breast cancer,” Mod Pathol, DOI: 10.1038/s41379-020-0544-x, PMID: 32300181.
Rimm, D. L., Han, G., Taube, J. M., Yi, E. S., Bridge, J. A., Flieder, D. B., Homer, R., West, W. W., Wu, H., Roden, A. C., Fujimoto, J., Yu, H., Anders, R., Kowalewski, A., Rivard, C., Rehman, J., Batenchuk, C., Burns, V., Hirsch, F. R., and Wistuba, I. I. (2017) “A Prospective, Multi-institutional, Pathologist-Based Assessment of 4 Immunohistochemistry Assays for PD-L1 Expression in Non-Small Cell Lung Cancer,” JAMA Oncol, 3(8), 1051-1058, DOI: 10.1001/jamaoncol.2017.0013, PMID: 28278348.