Receiver Operating Characteristic analysis

Idea

When creating any system which maps from inputs to a classification output, but particularly machine learning classifiers, there are multiple kinds of error (false positives, false negatives, etc.). It's desirable to be able to say which classifiers are the "better" ones even without fully committing to the relative importance of the various kinds of error. Receiver Operating Characteristic (ROC) analysis is a method for doing this, to the extent that it is possible.

Details

Note: for simplicity, we describe the case where the prior probability of a data item belonging to each class is equal; analogous results hold in the case of unequal priors.

Kinds of errors

Consider an environment where each item has a feature vector $f \in F$ and a class $c \in C$. Then given a classifier $\xi = \xi[p]$ (where $p$ is some changeable internal parameter of the classifier) with $\xi : F \rightarrow C$, on a data set $\{(f_i, c_i)\}$ it will in general have some misclassifications $\{j\}$ where $\xi(f_j) = d_j \ne c_j$; each is an instance of misclassifying class $c_j$ as class $d_j$. In the two-class case, which we specialise to until further notice, these types of error get the special names false positive rate ($fp$) ($d = \mathrm{true}$ when $c = \mathrm{false}$) and false negative rate ($fn$) ($d = \mathrm{false}$ when $c = \mathrm{true}$), along with the correct classification rates true positive rate ($tp$) and true negative rate ($tn$). Note that in the limit these rates satisfy the relations

(1)  $tp + fn = 1 \qquad \text{and} \qquad tn + fp = 1$

(since the rates for each true class must sum to one), so that one has some freedom in which two independent variables to use. In ROC analysis it's conventional to work in terms of the true positive and false positive rates; for consistency with external work we use that here, although it can be more intuitive to think in terms of other variables.
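To make the definitions concrete, here is a minimal sketch in Python (the data is hypothetical) of computing a classifier's $(tp, fp)$ error profile from its predictions, illustrating the relations in (1):

    def error_profile(predictions, labels):
        """Compute the (tp, fp) rates from boolean predictions and true labels."""
        positives = sum(1 for c in labels if c)
        negatives = len(labels) - positives
        tp = sum(1 for d, c in zip(predictions, labels) if d and c) / positives
        fp = sum(1 for d, c in zip(predictions, labels) if d and not c) / negatives
        return tp, fp

    # Hypothetical data: true classes and a classifier's outputs.
    labels      = [True, True, True, False, False, False, False, True]
    predictions = [True, False, True, False, True, False, False, True]

    tp, fp = error_profile(predictions, labels)
    fn, tn = 1 - tp, 1 - fp   # the relations tp + fn = 1 and tn + fp = 1
    print(f"tp = {tp:.2f}, fp = {fp:.2f}, fn = {fn:.2f}, tn = {tn:.2f}")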

Mixing classifiers

An idea that we will use several times is that of mixing classifiers: if we have classifiers $\xi_1$ with error profile $(tp_1, fp_1)$ and $\xi_2$ with error profile $(tp_2, fp_2)$ then, in expectation and in the limit of large amounts of data, we can create a classifier $\xi$ with error profile $(\alpha\, tp_1 + (1-\alpha)\, tp_2,\ \alpha\, fp_1 + (1-\alpha)\, fp_2)$ (for $0 \le \alpha \le 1$) by the following: upon each use of the classifier to classify an individual feature vector, generate a random number $r \in [0,1)$ and, if $r \le \alpha$, use $\xi_1$ to classify that item, otherwise use $\xi_2$. Clearly this simply says that, given two error profile points, one can produce a classifier achieving any value on the line segment between them. (Note that this holds only in the limit, and it may not be desirable in practice: the resulting classifier has a high variance in performance, so that comparing experimental results on different datasets is difficult, and is non-deterministic, so that reproducing results is difficult.)
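A minimal sketch of the construction in Python (the uniform random draw plays the role of $r$ above):

    import random

    def mix_classifiers(xi_1, xi_2, alpha):
        """Randomised mixture: use xi_1 with probability alpha, else xi_2.
        In expectation the error profile is the convex combination
        (alpha*tp_1 + (1-alpha)*tp_2, alpha*fp_1 + (1-alpha)*fp_2)."""
        def xi(f):
            r = random.random()          # r in [0, 1)
            return xi_1(f) if r <= alpha else xi_2(f)
        return xi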

The ROC curve for a single classifier

For a tunable classifier $\xi = \xi[p]$ on a dataset, one can find a set of achievable error profile values by, for example,

  1. Systematically varying $p$, learning $\xi[p]$ and evaluating its $tp$ and $fp$ values.

  2. Selecting, say, target $tp$ values and in each case "searching" for the $p$ values for which $\xi[p]$ produces it; each will have an associated $fp$ value.

In either case we will have a set of attainable $\{(tp_i, fp_i)\}$ values for the classifier. We can plot these on a graph (as shown in XXXX) where, by convention, $tp$ is on the vertical axis and $fp$ is on the horizontal axis. Note, however, that by applying the above mixing trick to $\xi[p]$ with different $p$ values we can actually replace the curve with its convex hull. These curves (either the originals or their convex hulls) are called ROC curves, while ROC analysis generally denotes the process of generating ROC curves and then using them to select classifiers.
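As an illustrative sketch of method 1, for the common special case where the parameter $p$ is a decision threshold on a real-valued score (an assumption about the form of $p$), the following traces the attainable points and replaces the curve with its convex hull; it reuses `error_profile` from the earlier sketch:

    def roc_points(scores, labels):
        """Sweep a decision threshold p over the scores (method 1 above,
        with p the threshold) and collect the attainable (fp, tp) pairs."""
        points = []
        for p in sorted(set(scores)) + [float("inf")]:
            predictions = [s >= p for s in scores]
            tp, fp = error_profile(predictions, labels)
            points.append((fp, tp))
        return sorted(points)

    def upper_convex_hull(points):
        """Replace the curve by its upper convex hull: by the mixing
        argument, every point on a segment between attainable points
        is itself (in expectation) attainable."""
        hull = []
        for q in sorted(points):
            while len(hull) >= 2:
                (x1, y1), (x2, y2) = hull[-2], hull[-1]
                # Pop hull[-1] if it lies on or below the segment hull[-2] -> q.
                if (x2 - x1) * (q[1] - y1) >= (q[0] - x1) * (y2 - y1):
                    hull.pop()
                else:
                    break
            hull.append(q)
        return hull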

In addition we can see that an effective classifier's curve must always lie above the diagonal line from $(tp, fp) = (0,0)$ to $(tp, fp) = (1,1)$, since we can create two trivial classifiers, "always say negative" (with error profile $(0,0)$) and "always say positive" (with error profile $(1,1)$), and use the mixing trick on them to produce any point on that diagonal. We will call a curve segment "above" or "outside" another one if each of its points is to the left of/above the other; the important point is that a classifier is better than the other classifiers at those points on its curve where it is outside all of the other curves.
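For instance, continuing the mixing sketch above, the diagonal baseline is realised as follows (illustrative):

    # The two trivial classifiers and their error profiles.
    always_negative = lambda f: False   # (tp, fp) = (0, 0)
    always_positive = lambda f: True    # (tp, fp) = (1, 1)

    # Mixing them with weight alpha yields, in expectation, the point
    # (alpha, alpha) on the diagonal -- here alpha = 0.3.
    baseline = mix_classifiers(always_positive, always_negative, 0.3)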

ROC curves for multiple classifiers

Given multiple tunable classifiers $\xi_1$ to $\xi_N$, each with its own parameters, we can plot their ROC curves on the same graph. This will produce a set of curves which, in general, will intersect each other multiple times. However, we can generally divide the horizontal axis (or the vertical axis) into segments where a given curve is on the "outside". In most practical cases there will be only a few of these and, as mentioned above, in each segment that classifier is better than the others.

This can be used to choose classifiers in cases where the precise $(tp, fp)$ trade-off desired is not known. For example, in the extreme cases: if a classifier's curve is always outside then it is always the best classifier and should be unconditionally chosen, while if its curve is never outside then it is never the best classifier and can be removed from consideration.

Additionally it is possible to do things like select the best classifier subject to a hard limit on the false positive rate (draw a vertical line at that rate from the horizontal axis and select the last curve it intersects), or with a given trade-off of increasing the true positive rate by $V$ while "allowing" an increase in the false positive rate by $H$, by "sweeping" a line with gradient $V/H$ down from the top-left corner of the graph and selecting the classifier corresponding to the first curve it "hits".
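A sketch of both selection rules in Python, assuming each classifier's hull is given as a list of $(fp, tp)$ vertices (the hulls below are hypothetical; for simplicity only hull vertices are examined, although by the mixing trick points on the connecting segments are also attainable):

    def best_under_fp_limit(curves, fp_limit):
        """Hard limit on the false positive rate: for each hull, take the
        best tp among vertices with fp <= fp_limit, then pick the winner."""
        def best_tp(hull):
            feasible = [tp for fp, tp in hull if fp <= fp_limit]
            return max(feasible) if feasible else 0.0
        return max(curves, key=lambda name: best_tp(curves[name]))

    def best_under_tradeoff(curves, V, H):
        """Sweeping a line of gradient V/H down the graph first touches the
        hull maximising the intercept tp - (V/H) * fp."""
        def score(hull):
            return max(tp - (V / H) * fp for fp, tp in hull)
        return max(curves, key=lambda name: score(curves[name]))

    # Hypothetical hulls for two classifiers.
    curves = {
        "xi_1": [(0.0, 0.0), (0.1, 0.6), (0.4, 0.8), (1.0, 1.0)],
        "xi_2": [(0.0, 0.0), (0.2, 0.7), (0.5, 0.9), (1.0, 1.0)],
    }
    print(best_under_fp_limit(curves, 0.15))  # -> xi_1 (better at low fp)
    print(best_under_tradeoff(curves, 1, 2))  # -> xi_2 (slope 1/2 favours it)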

Extensions

Although not commonly done, ROC analysis can be extended to multi-class classification, or to the case where a vector of different class labels is associated with each data item. Part of the reason these extensions are not used very much is that they give rise to surfaces in higher-dimensional spaces which are very difficult to visualise.
