When creating any system which maps from inputs to a classification output, but particularly machine learning classifiers, there are multiple kinds of error (false positive, etc). It’s desirable to be able to be able to say which are the “better” classifiers even without fully committing to the relative importance of the various kinds of errors. Receiver Operating Characteristic (ROC) analysis is a method for doing this to the extent this is possible.
Note: for simplicity, we describe the case where the class prior probability of a data item being a particular class is equal; analogous results hold in the uneven prior case.
Consider an environment where each item has a feature vector $f \in F$ and a class $c \in C$. Then given classifier $\xi = \xi[p]$ (where $p$ is some changeable internal parameter of the classifier) with $\xi : F \rightarrow C$ , on a data set $\{(f_i,c_i)\}$ it will in general have some misclassifications $\{j\}$ where $\xi(f_j) = d_j \ne c_j$; each is an instance of misclassifying class $c_j$ as class $d_j$. In the two class case, which we specialise to until further notice, these types of error get the special names false positive rate ($fp$) ($d=true$ when $c=false$) and false negative rate($fn$) ($d=false$ when $c=true$), along with the correct classifications true positive rate ($tp$) and true negative rate ($tn$). Note that in the limit there are the relations
so that one has some freedom in terms of which two independent variables to use. In ROC analysis it’s conventional to work in terms of the true positives and false positives; for consistency with external works we use that here, although it can be more intuitive to think in terms of other variables.
An idea that we will use several times is the idea of mixing classifiers: if we have classifers $\xi_1$ with error profile $(tp_1,fp_1)$ and $\xi_2$ with error profile $(tp_2,fp_2)$ then, in expectation and under the limit of large amounts of data we can create a classifer $\xi$ with error profile $(\alpha tp_1 + (1-\alpha) tp_2,\alpha fp_1 + (1-\alpha) fp_2)$ (for $0 \le \alpha \le 1)$ by the following: upon each use of the classifier to classify an individual feature vector, generate a random number $r \in [0,1)$ and if $r \le \alpha$ use $\xi_1$ to classify that item and otherwise use $\xi_2$. Clearly this is simply saying given two error profile points one can produce a classifier that achieves any specific value that falls on the line between them. (Note that this is only in the limit, and it may not be desired to do this in practice as it will both give a resulting classifier with both a high variance in performance – so that comparing experimental results on different datasets is difficult – and which is non-deterministic – so that reproducing results is difficult.)
For a tunable classifier $\xi=\xi[p]$ on a dataset, one can find a set of achievable error profile values by, for example,
Systematically varying $p$, learning $\xi[p]$ and evaluating its $tp$ and $fp$ values.
Selecting, say, target $tp$ values and in each case “searching” for the $p$ values for which $\xi[p]$ produces it; each will have an associated $fp$ value.
In either case we will have a set of attainable $\{(tp_i,fp_i)\}$ values for the classifier. We can plot those on a graph (as shown in XXXX) where, by convention, $tp$ is on the vertical axis and $fp$ is on the horizontal axis. Note however, by the above method of mixing classifiers upon $\xi$ with different $p$ values we can actually replace the curve with its convex hull. These curves (either the originals or the convex hulls) are called ROC curves, while ROC analysis generally denotes the process of generating ROC curves and then using them to select classifiers.
In addition we can see that for an effective classifier its curve must always lie above the diagonal curve from $(tp,fp)=(0,0)$ to $(tp,fp)=(1,1)$ since we can create two trivial classifiers “always say negative” (with error profile $(0,0)$) and “always say positive” (with error profile $(1,1)$) and use the mixing classifiers trick on them to produce points on the diagonal line. We will call a curve segment “above” or “outside” another one if each point is to the left/above; the important point is that a classifier is better than other classifiers at those points on its curve where it is outside all of the other curves.
Given multiple tunable classifiers $\xi_1$ to $\xi_N$, each with its own parameters, we can plot their ROC curves on the same graph. This will produce a set of curves which, in general, will intersect each other multiple times. However, we can generally divide the horizontal axis (or the vertical axis) into segments where a given curve is on the `outside'. In most practical cases there will be only a few of these and, as mentioned above, in each segment that classifier is better than the others.
This can be used to choose classifiers in the cases where it’s not known what the precise $(tp,fp)$ trade-off that is desired is. For example, there are the extreme cases that if a classifier is always outside then it is always the best classifier and should be unconditionally chosen and that if a classifier is never outside then it is never the best classifier and can be removed from any consideration.
Additionally it is possible to do things like select the best classifiers with a given hard limit on false positive rate (draw a vertical line at that rate from the horizontal axis and select the last curve it intersects) or with a given trade off between increasing true positive rate by $V$ “allowing” and increase in the false positive rate by $H$ by “sweeping” a line with gradient $V/H$ down from the top-right corner of the graph and selecting the classifier corresponding to the first curve it “hits”.
Although not commonly done, ROC analysis can be extended to the cases of either multi-class classification or where there is a vector of different classes associated with each data item. Part of the reason these are not used very much is that they give rise to surfaces in multi-dimensional space which are very difficult to visualise.