Soft ROC Curves

Research Article

Austin Biom and Biostat. 2014;1(2): 6.

Xin Huang1, Narayanaswamy Balakrishnan2,3 and Yixin Fang4*

1Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, USA

2Department of Mathematics and Statistics, McMaster University, Canada

3King Saud University, Saudi Arabia

4Division of Biostatistics, Department of Population Health, New York University, USA

*Corresponding author: Yixin Fang, Division of Biostatistics, Department of Population Health, New York University, New York, NY 10016, USA.

Received: August 25, 2014; Accepted: October 13, 2014; Published: November 18, 2014

Abstract

Receiver operating characteristic (ROC) curves are a popular tool for evaluating continuous diagnostic tests. However, the traditional definition of ROC curves incorporates implicitly the idea of "hard" thresholding, which cannot encompass the situation when some intermediate classes are introduced between test result positive and negative, and also results in the empirical curves being step functions. For this reason, we introduce here the definition of soft ROC curves, which incorporates the idea of "soft" thresholding. The softness of a soft ROC curve is controlled by a regularization parameter that can be selected suitably by a cross-validation procedure. A byproduct of the soft ROC curves is that the corresponding empirical curves are smooth. The methods developed here are then examined through some simulation studies as well as a real illustrative example.

Keywords: Cross-validation; Diagnostic test; Intermediate class; Regularization parameter; Thresholding

Introduction

Receiver Operating Characteristic (ROC) curves are a popular tool for evaluating continuous diagnostic tests; see, for example, Pepe [1]. However, the traditional definition of ROC curves implicitly incorporates the idea of "hard" thresholding. To be specific, let T be the outcome of a continuous diagnostic test and D be the disease status. Given a threshold c, the hard-thresholding scheme classifies a subject as diseased ($\hat{D} = 1$) if the test result T = t exceeds c, and as non-diseased ($\hat{D} = 0$) otherwise. It thus results in a binary classifier,

$$I(t-c) = \begin{cases} 1, & t-c \ge 0, \\ 0, & t-c < 0. \end{cases} \tag{H}$$

The ROC curve is then a graphical plot of true positives, E{I(T-c)|D=1}, versus false positives, E{I(T-c)|D=0}, for -∞<c<∞. It can be expressed as

$$R(p) = 1 - G\left[F^{-1}(1-p)\right], \quad 0 < p < 1,$$

where F(.) and G(.) are the distributions of T given D = 0 and D = 1, respectively.
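As an illustration of the hard-thresholding construction, the following minimal sketch computes the classical empirical ROC points by replacing the expectations with sample means. The function name `empirical_roc` and the toy data are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def empirical_roc(x, y):
    """Empirical ROC points under the hard rule I(t - c) = 1 if t >= c, else 0.

    x: test results of non-diseased subjects (D = 0)
    y: test results of diseased subjects (D = 1)
    Returns (fpr, tpr), evaluated at thresholds taken in decreasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start at +inf so that the curve begins at (0, 0), then use every observed value.
    pooled = np.sort(np.concatenate((x, y)))[::-1]
    thresholds = np.concatenate(([np.inf], pooled))
    fpr = np.array([(x >= c).mean() for c in thresholds])  # sample version of E{I(T-c)|D=0}
    tpr = np.array([(y >= c).mean() for c in thresholds])  # sample version of E{I(T-c)|D=1}
    return fpr, tpr

# Hypothetical toy data, for illustration only.
x = [0.1, 0.4, 0.5, 0.9]
y = [0.6, 0.8, 1.2, 1.5]
fpr, tpr = empirical_roc(x, y)
```

Because each sample mean changes only when c crosses an observation, the resulting curve is a step function.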

However, from the medical practitioner's point of view, if the test result is close to the given threshold c, then one may be indecisive about the disease status. This is a common situation for tests with ambiguous thresholds (e.g., prostate-specific antigen, which has been shown not to be a dichotomous marker [2]). Thus, practitioners tend to implement an intermediate class between negative and positive [3], within which patients are diagnosed as diseased or non-diseased according to some probability model. Hozo and Djulbegovic [4] provide a definition of the acceptable regret threshold to explain this phenomenon. They demonstrate that different practitioners might adopt different acceptable regret thresholds for withholding treatment even when the diagnostic test results exceed the pre-defined threshold. Unfortunately, the existing hard-thresholding scheme does not incorporate such intermediate classes. Furthermore, there are other disadvantages in the hard-thresholding scheme. In particular, the discontinuity of the binary classifier results in the corresponding estimated ROC curve being a step function, while the underlying ROC curve is likely to be smooth. Consequently, due to the discontinuity in the step function, the variability of the estimated ROC curve becomes large.

To overcome these disadvantages, we consider the soft-thresholding scheme,

$$I_\delta(t-c) = \begin{cases} 1, & t-c \ge \delta, \\ ?, & -\delta < t-c < \delta, \\ 0, & t-c < -\delta, \end{cases} \tag{S}$$

where the value ? is between 0 and 1 and will be discussed in the next section, and δ is a regularization parameter controlling the softness. In particular, when δ=0, the soft thresholding simply becomes the hard thresholding. When the decision-making rule Iδ is applied with threshold c, the sensitivity (a.k.a. the true positive probability) equals E{Iδ(T-c)|D=1} and the specificity (a.k.a. the true negative probability) equals E{1-Iδ(T-c)|D=0}.
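To make the rule (S) concrete, here is a minimal sketch that fills the intermediate value ? with a linear ramp, (t-c+δ)/(2δ). The ramp is only an assumption chosen for illustration, since the text leaves the form of ? open at this point. The names `soft_indicator` and `soft_sens_spec` are likewise hypothetical.

```python
import numpy as np

def soft_indicator(u, delta):
    """Soft-thresholding rule I_delta(u), with u = t - c.

    ASSUMPTION: inside (-delta, delta) the value "?" is the linear ramp
    (u + delta) / (2 * delta). This is one possible choice, not the paper's.
    """
    u = np.asarray(u, dtype=float)
    if delta == 0:
        # delta = 0 recovers the hard rule (H).
        return (u >= 0).astype(float)
    return np.clip((u + delta) / (2.0 * delta), 0.0, 1.0)

def soft_sens_spec(x, y, c, delta):
    """Sample versions of E{I_delta(T-c)|D=1} and E{1-I_delta(T-c)|D=0}."""
    sens = soft_indicator(np.asarray(y, dtype=float) - c, delta).mean()
    spec = 1.0 - soft_indicator(np.asarray(x, dtype=float) - c, delta).mean()
    return sens, spec
```

Under this assumed ramp, a subject whose result equals the threshold exactly is assigned the value 1/2, and the assignment moves linearly toward 0 or 1 as the result moves away from c.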

The rationale of this soft-thresholding scheme is that if the test result is close to the given threshold c, then one may be indecisive about the status of the disease. Hence, we refer to Iδ(.) as the indecisive function. The probability model within the intermediate class can be formulated by ? in the indecisive function. We will show that different indecisive functions result in different soft ROC curves. The idea used here is similar in principle to the one used in designing randomization tests to achieve a given significance level in hypothesis testing [5]. The indecisive function has been considered in the literature of ROC analysis. Many authors have used smooth functions to approximate the indicator function, which can also be considered as indecisive functions. For example, Liu et al. [6] and Liu and Tan [7] used an S-type function to approximate the indicator function for the empirical False Positive Rate (FPR) and True Positive Rate (TPR). Huang et al. [8], Wang et al. [9], and Ma and Huang [10,11] used the sigmoid function to approximate the indicator function in the empirical estimate of the Area Under the ROC Curve (AUC).

Instead of looking for an approximation, in this work, we examine the definition of ROC curves directly and introduce the soft ROC curves based on soft thresholding. More importantly, we build a bridge between the approximation of an ROC curve and the approximation of its AUC. Moreover, continuity of the proposed soft ROC curves is a promising byproduct, although it is not our primary goal. We should point out that in the ROC literature, many authors have discussed methods to smooth ROC curves. For example, Zou et al. [12] proposed a non-parametric estimator from kernel estimates of the distribution functions F and G. Peng and Zhou [13] proposed a local linear regression for the ROC curve, while Ren et al. [14] proposed a Penalized Spline Linear Mixed-Effects model (PSLME). In this paper, we demonstrate that the proposed soft ROC method not only performs similarly to the local linear regression and the PSLME methods in terms of smoothing, but also offers a clearer interpretation of the smoothing parameter and a much easier implementation.

The remainder of this paper is organized as follows. In Section 2, we define the soft ROC curve, and derive some of its properties. In Section 3, we propose methods to choose the regularization parameter δ. In Section 4, the proposed methods are examined through some simulation studies and a real data example. Finally, some discussion is made in Section 5, and all technical details are relegated to the Appendix.

Soft ROC Curves

When an indecisive function Iδ is applied with threshold c, we can define a soft ROC curve as follows.

Definition 1: A plot of true positives, E{Iδ(T-c)|D=1}, versus false positives, E{Iδ(T-c)|D=0}, for all possible values of c, is called the soft ROC curve with respect to the indecisive function Iδ.
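Definition 1 can be traced numerically on a sample by replacing both expectations with sample means over a grid of thresholds c. The sketch below does this under the assumption of a linear-ramp indecisive function; the ramp, the function names and the grid size `num` are illustrative choices, not part of the definition.

```python
import numpy as np

def soft_indicator(u, delta):
    """ASSUMED linear-ramp indecisive function: 0 below -delta, 1 above delta."""
    return np.clip((np.asarray(u, dtype=float) + delta) / (2.0 * delta), 0.0, 1.0)

def soft_roc_points(x, y, delta, num=201):
    """Points (fpr, tpr) of the empirical soft ROC curve, parametrized by c.

    x: results of non-diseased subjects; y: results of diseased subjects.
    Requires delta > 0 (delta = 0 is the classical hard-thresholding case).
    """
    if delta <= 0:
        raise ValueError("delta must be positive for the soft ROC curve")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo = min(x.min(), y.min()) - delta   # at c = lo every subject is classified positive
    hi = max(x.max(), y.max()) + delta   # at c = hi every subject is classified negative
    cs = np.linspace(hi, lo, num)        # decreasing c: the curve runs from (0,0) to (1,1)
    fpr = soft_indicator(x[None, :] - cs[:, None], delta).mean(axis=1)
    tpr = soft_indicator(y[None, :] - cs[:, None], delta).mean(axis=1)
    return fpr, tpr

# Hypothetical toy data, for illustration only.
fpr, tpr = soft_roc_points([0.1, 0.4, 0.5, 0.9], [0.6, 0.8, 1.2, 1.5], delta=0.2)
```

With a continuous indecisive function, both coordinates vary continuously in c, so the traced curve has no jumps.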

Assume that a test is performed on m non-diseased subjects, yielding testing outcomes Xi, and on n diseased subjects, yielding outcomes Yj . Then, an empirical estimate of the soft ROC curve w.r.t. Iδ is

$$\hat{R}_\delta(p) = 1 - \hat{G}_\delta\left[\hat{F}_\delta^{-1}(1-p)\right], \quad p \in (0,1),$$

where $\hat{G}_\delta(c) = \frac{1}{n}\sum_{j=1}^{n} I_\delta(Y_j - c)$ and $\hat{F}_\delta(c) = \frac{1}{m}\sum_{i=1}^{m} I_\delta(X_i - c)$.

The area under the soft ROC curve w.r.t. Iδ, denoted by AUCδ, is derived in the following theorem, and its proof is presented in Appendix A.

Theorem 1: For the soft ROC curve w.r.t. the indecisive function Iδ(.), we have

AUCδ=E{Kδ(Y-X)},

where X~F(.), Y~G(.), $K_\delta(Y-X) = \int_{-\infty}^{\infty} I_\delta(Y-c)\,\dot{I}_\delta(X-c)\,dc$, and $\dot{I}_\delta$ is the derivative of Iδ.

We remark that for piecewise constant functions, the derivative is defined by using the Dirac delta function. From Theorem 1, we see that an unbiased estimate of AUCδ is given by

$$\widehat{\mathrm{AUC}}_\delta = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} K_\delta(Y_j - X_i).$$
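The pairwise estimate above is straightforward to compute once Kδ is available. The following minimal sketch assumes a linear-ramp indecisive function, Iδ(u) = (u+δ)/(2δ) on (-δ, δ), for which the integral in Theorem 1 reduces to Kδ(d) = [J(d+δ) - J(d-δ)]/(2δ), where J is the antiderivative of Iδ. Both the ramp and the helper names are assumptions made for illustration; other indecisive functions lead to different kernels.

```python
import numpy as np

def ramp_antiderivative(v, delta):
    """J(v): integral of the assumed linear-ramp I_delta from -infinity to v."""
    v = np.asarray(v, dtype=float)
    middle = (v + delta) ** 2 / (4.0 * delta)   # valid for -delta <= v <= delta
    return np.where(v <= -delta, 0.0, np.where(v >= delta, v, middle))

def kernel_linear(d, delta):
    """K_delta(d) for the assumed linear ramp, with d = y - x."""
    d = np.asarray(d, dtype=float)
    if delta == 0:
        # Hard thresholding: K_0(d) is the indicator of d >= 0.
        return (d >= 0).astype(float)
    upper = ramp_antiderivative(d + delta, delta)
    lower = ramp_antiderivative(d - delta, delta)
    return (upper - lower) / (2.0 * delta)

def soft_auc(x, y, delta):
    """Average of K_delta(Y_j - X_i) over all m * n pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return kernel_linear(y[None, :] - x[:, None], delta).mean()

# Hypothetical toy data, for illustration only.
x = [0.1, 0.4, 0.5, 0.9]
y = [0.6, 0.8, 1.2, 1.5]
auc_soft = soft_auc(x, y, delta=0.2)
auc_hard = soft_auc(x, y, delta=0.0)
```

In this sketch the kernel equals 1/2 for tied pairs (d = 0) and saturates at 0 or 1 once |d| exceeds 2δ, so only pairs of nearby observations are weighted differently from the hard-thresholding case.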

It is worth emphasizing that if the hard-thresholding decision rule (H) is applied, then we use the classical ROC curve to evaluate its performance, whereas if the soft-thresholding decision rule (S) is applied, then we use the newly proposed soft ROC curve to evaluate its performance. In other words, which type of ROC curve is used for evaluation depends on the underlying decision rule that is applied. Actually, it is not necessary to define a new ROC curve for every new decision rule. However, we define soft ROC curves for at least three reasons. First, the soft-thresholding decision rule is simple and appropriate. Second, the resulting empirical soft ROC curve is continuous. Third, the relationship between Kδ and Iδ is mathematically beautiful.
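As a short worked check of how the two cases connect, consider Theorem 1 when δ = 0, so that Iδ is the hard indicator I of rule (H). Its derivative is then a Dirac delta, written here as $\delta_{\mathrm{Dirac}}$ to avoid a clash with the regularization parameter (this symbol is introduced only for this illustration):

```latex
\begin{aligned}
\dot{I}(x-c) &= \delta_{\mathrm{Dirac}}(x-c), \\
K_0(y-x) &= \int_{-\infty}^{\infty} I(y-c)\,\delta_{\mathrm{Dirac}}(x-c)\,dc = I(y-x), \\
\mathrm{AUC}_0 &= E\{I(Y-X)\} = P(Y \ge X).
\end{aligned}
```

This is the familiar probabilistic interpretation of the area under the classical ROC curve, so the soft AUC reduces to the classical one under hard thresholding.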

Two-sided soft ROC curves

We can categorize indecisive functions and soft ROC curves into one-sided and two-sided according to the following definition.

Definition 2: If Iδ(t-c)=0 for t<c, then Iδ and the corresponding soft ROC curve are said to be one-sided. Otherwise, they are said to be two-sided.

We now present some examples of indecisive functions Iδ and their corresponding Kδ, which are all displayed in Figure 1. The corresponding detailed calculations are presented in Appendix B.