Showing changes from revision #2 to #3:
Added | Removed | Changed
When creating any system which maps from inputs to a classification output, but particularly machine learning classifiers, there are multiple kinds of error (false positive, etc). It’s desirable to be able to be able to say which are the “better” classifiers even without fully committing to the relative importance of the various kinds of errors. Receiver Operating Characteristic (ROC) analysis is a method for doing this to extent the extent this is possible.
Note: for simplicity, we describe the case where the class prior probability of a data item being a particular class is equal; analogous results hold in the uneven prior case.
Consider a an system environment where each item has afeature vector $f \in F$ and a class $c \in C$. Then given classifier $\xi =\xi [p]:F\to C$ \xi = \xi[p] : F \rightarrow C (where $p$ is some changeable internal parameter of the classifier), classifier) on with a data set$\xi : F \rightarrow C$ , on a data set $\{(f_i,c_i)\}$ it will in general have some misclassifications $\{j\}$ j \{j\} where $\xi [p]({f}_{j})={d}_{j}\ne {c}_{j}$ \xi[p](f_j) \xi(f_j) = d_j \ne c_j ; each is an instance of misclassifying class$c_j$ as class$d_j$ . In the two class case, which we specialise to until further notice, these types of error get the special names false positive (fp) rate ($fp$) ($d=true$ when $c=false$) and false negative (fn) rate($fn$) ($d=false$ when $c=true$), along with the correct classifications true positive (tp) rate ($tp$) and true negatives negative (tn) rate ($tn$). Note that in the limit there are the relations
so that one has some freedom in terms of which two independent variables to use. In ROC analysis it’s conventional to work in terms of the true positives and false positives; for consistency with external works we use that here, although it can be more intuitive to think in terms of other variables.
An idea that we will use several times is the idea of mixing classifiers: if we have classifers $\xi_1$ with error profile $(tp_1,fp_1)$ and $\xi_2$ with error profile $(tp_2,fp_2)$ then, in expectation and under the limit of large amounts of data we can create a classifer $\xi$ with error profile $(\alpha tp_1 + (1-\alpha) tp_2,\alpha fp_1 + (1-\alpha) fp_2)$ by, (for upon each use of the classifier to classify an individual feature vector, generating a random number$r0\le \alpha \le 1)$ r 0 \le \alpha \le 1) and by if the following: upon each use of the classifier to classify an individual feature vector, generate a random number$r \in [0,1)$ and if $r \le \alpha$ using use$\xi_1$ to classify that item and otherwise using use$\xi_2$ . Clearly this is simply saying given two error profile points one can achieve produce a classifier that achieves any specific value that falls on the line between them. (Note that this is only in the limit, and in it may not be desired to do this in practice as in it will both increases give the a random “standard deviation” of the performance and makes the resulting classifier non-deterministic.) with both a high variance in performance – so that comparing experimental results on different datasets is difficult – and which is non-deterministic – so that reproducing results is difficult.)
For a tunable classifier $\xi =\xi [p]$ \xi \xi=\xi[p] on a dataset, one can find a set of achievable error profile values either by, by for systematically example, varying$p$ and learning $\xi[p]$ or else by selecting, say, a $tp$ value and “searching” for the $p$ values which produces it; this will have an associated $fp$ value. In either case we will have a set of attainable $\{(tp_i,fp_i)\}$ values for the classifier. We can plot those on a graph (as shown in XXXX) where, by convention, $tp$ is on the vertical axis and $fp$ is on the horizontal axis. Note however, by the above method of mixing classifiers upon $\xi$ with different $p$ values we can actually replace the curve with its convex hull. The curves are called ROC curves, while ROC analysis generally denotes the process of generating ROC curves and then using them to select classifiers.
In addition we can see that for an effective classifier its curve must always lie above the diagonal curve from $(tp,fp)=(0,0)$ to $(tp,fp)=(1,1)$ since we can create two trivial classifiers “always say negative” (with error profile $(0,0)$) and “always say positive” (with error profile $(1,1)$) and use the mixing classifiers trick on them to produce points on the diagonal line. We will call a curve segment above' or
outside’ another one if each point is to the left/above; the important point is that a classifier is better than other classifiers at those points where it is all of the others.
Systematically varying $p$, learning $\xi[p]$ and evaluating its $tp$ and $fp$ values.
Selecting, say, target $tp$ values and in each case “searching” for the $p$ values for which $\xi[p]$ produces it; each will have an associated $fp$ value.
In either case we will have a set of attainable $\{(tp_i,fp_i)\}$ values for the classifier. We can plot those on a graph (as shown in XXXX) where, by convention, $tp$ is on the vertical axis and $fp$ is on the horizontal axis. Note however, by the above method of mixing classifiers upon $\xi$ with different $p$ values we can actually replace the curve with its convex hull. These curves (either the originals or the convex hulls) are called ROC curves, while ROC analysis generally denotes the process of generating ROC curves and then using them to select classifiers.
In addition we can see that for an effective classifier its curve must always lie above the diagonal curve from $(tp,fp)=(0,0)$ to $(tp,fp)=(1,1)$ since we can create two trivial classifiers “always say negative” (with error profile $(0,0)$) and “always say positive” (with error profile $(1,1)$) and use the mixing classifiers trick on them to produce points on the diagonal line. We will call a curve segment “above” or “outside” another one if each point is to the left/above; the important point is that a classifier is better than other classifiers at those points on its curve where it is outside all of the other curves.
Given multiple tunable classifiers $\xi_1$ to $\xi_N$, each with its own parameters, we can plot their ROC curves on the same graph. This will produce a set of curves which, in general, will intersect each other multiple times. However, we can generally divide the horizontal axis (or the vertical axis) into segments where a given curve is on the `outside'. In most practical cases there will be only a few of these and, as mentioned above, in each segment that classifier is better than the others.
This can be used to choose classifiers in the cases where it’s not known what the precise $(tp,fp)$ trade-off that is desired is. For example, there are the extreme cases that if a classifier is always outside then it is always the best classifier and should be unconditionally chosen and that if a classifier is never outside then it is never the best classifier and can be removed from any consideration.
Additionally it is possible to do things like select the best classifiers with a given hard limit on false positive rate (draw a vertical line at that rate from the horizontal axis and select the last curve it intersects) or with a given trade off between increasing true positive rate by $V$ “allowing” and increase in the false positive rate by $H$ by “sweeping” a line with gradient $V/H$ down from the top-right corner of the graph and selecting the classifier corresponding to the first curve it “hits”.
Although not commonly done, ROC analysis can be extended to the cases of either multi-class classification or where there is a vector of different classes associated with each data item. Part of the reason these are not used very much is that they give rise to surfaces in multi-dimensional space which are very difficult to visualise.