Receiver Operating Characteristic analysis (Rev #2)

When creating any system which maps from inputs to a classification output, but particularly machine learning classifiers, there are multiple kinds of error (false positive, etc). It’s desirable to be able to be able to say which are the “better” classifiers even without fully committing to the relative importance of the various kinds of errors. **Receiver Operating Characteristic (ROC) analysis** is a method for doing this to extent the possible.

Note: for simplicity, we describe the case where the *class prior* probability of a data item being a particular class is equal; analogous results hold in the uneven prior case.

Consider a system where each item has a *feature vector* $f \in F$ and a *class* $c \in C$. Then given classifier $\xi[p] : F \rightarrow C$ (where $p$ is some changeable internal parameter of the classifier), on a data set $\{(f_i,c_i)\}$ it will in general have some misclassifications $j$ where $\xi[p](f_j) = d_j \ne c_j$; each is an instance of misclassifying $c_j$ as $d_j$. In the two class case, which we specialise to until further notice, these get the special names *false positive (fp)* ($d=true$ when $c=false$) and *false negative (fn)* ($d=false$ when $c=true$), along with the correct classifications *true positive (tp)* and *true negatives (tn)*. Note that in the limit there are the relations

(1)$tp+fp=1 \qquad and \qquad tn+fn=1$

so that one has some freedom in terms of which variables to use. In ROC analysis it’s conventional to work in terms of the true positives and false positives; for consistency with external works we use that here, although it can be more intuitive to think in terms of other variables.

An idea that we will use several times is the idea of mixing classifiers: if we have classifers $\xi_1$ with error profile $(tp_1,fp_1)$ and $\xi_2$ with error profile $(tp_2,fp_2)$ then, in expectation and under the limit of large amounts of data we can create a classifer $\xi$ with error profile $(\alpha tp_1 + (1-\alpha) tp_2,\alpha fp_1 + (1-\alpha) fp_2)$ by, upon each use of the classifier to classify an individual feature vector, generating a random number $r$ and if $r \le \alpha$ using $\xi_1$ and otherwise using $\xi_2$. Clearly this is simply saying given two error profile points one can achieve any value that falls on the line between them. (Note that this is only in the limit, and in may not be desired to do this in practice as in both increases the random “standard deviation” of the performance and makes the resulting classifier non-deterministic.)

For a tunable classifier $\xi$ on a dataset, one can find a set of achievable values either by systematically varying $p$ and learning $\xi[p]$ or else by selecting, say, a $tp$ value and “searching” for the $p$ values which produces it; this will have an associated $fp$ value. In either case we will have a set of attainable $\{(tp_i,fp_i)\}$ values for the classifier. We can plot those on a graph (as shown in XXXX) where, by convention, $tp$ is on the vertical axis and $fp$ is on the horizontal axis. Note however, by the above method of mixing classifiers upon $\xi$ with different $p$ values we can actually replace the curve with its *convex hull*. The curves are called ROC curves, while ROC analysis generally denotes the process of generating ROC curves and then using them to select classifiers.

In addition we can see that for an effective classifier its curve must always lie above the diagonal curve from $(tp,fp)=(0,0)$ to $(tp,fp)=(1,1)$ since we can create two trivial classifiers “always say negative” (with error profile $(0,0)$) and “always say positive” (with error profile $(1,1)$) and use the mixing classifiers trick on them to produce points on the diagonal line. We will call a curve segment `above' or `

outside’ another one if each point is to the left/above; the important point is that a classifier is better than other classifiers at those points where it is all of the others.

Given multiple tunable classifiers $\xi_1$ to $\xi_N$, each with its own parameters, we can plot their ROC curves on the same graph. This will produce a set of curves which, in general, will intersect each other multiple times. However, we can generally divide the horizontal axis (or the vertical axis) into segments where a given curve is on the `outside'. In most practical cases there will be only a few of these and, as mentioned above, in each segment that classifier is better than the others.

This can be used to choose classifiers in the cases where it’s not known what the precise $(tp,fp)$ trade-off that is desired is. For example, there are the extreme cases that if a classifier is always outside then it is always the best classifier and should be unconditionally chosen and that if a classifier is never outside then it is never the best classifier and can be removed from any consideration.

Additionally it is possible to do things like select the best classifiers with a given hard limit on false positive rate (draw a vertical line at that rate from the horizontal axis and select the last curve it intersects) or with a given trade off between increasing true positive rate by $V$ “allowing” and increase in the false positive rate by $H$ by “sweeping” a line with gradient $V/H$ down from the top-right corner of the graph and selecting the classifier corresponding to the first curve it “hits”.

Although not commonly done, ROC analysis can be extended to the cases of either multi-class classification or where there is a vector of different classes associated with each data item. Part of the reason these are not used very much is that they give rise to surfaces in multi-dimensional space which are very difficult to visualise.

- Receiver operating characteristic, Wikipedia.

category: statistical methods