ROC and AUC Interpretation

Author

Madiyar Aitbayev

Published

January 31, 2025


Introduction

A binary classification model is a machine learning model that classifies input data into one of two classes. We need different metrics to train ML models and to evaluate their performance. The Area Under the Receiver Operating Characteristic Curve (ROC AUC) score is a popular metric for evaluating binary classification models. In this post, we will try to understand the intuition behind the ROC AUC with simple and interactive visualizations.

Subscribe to get a notification about future posts.

Recap

Feel free to skip to the next section if you are already familiar with the confusion matrix, the true positive rate, and the false positive rate.

I suggest you also check out the following resources:

  1. evidentlyai.com/classification-metrics/… - a quite detailed explanation
  2. developers.google.com/machine-learning/… - a beginner-friendly explanation
  3. madrury.github.io/jekyll/update/statistics/… - my favourite so far; it gives a probabilistic intuition of the ROC AUC score
  4. www.alexejgossmann.com/auc/ - a more advanced explanation

Let’s say we have two classes (“positive” and “negative”) and a machine learning model that predicts a probability score between 0 and 1. A probability of 0.9 signifies that the input is closer to “positive” than “negative”, a probability of 0.2 signifies that the input is closer to “negative”, and so on.

However, our task is to produce a binary output: “positive” or “negative”. To achieve this, we choose a threshold. Any input with a predicted probability score above the threshold is classified as positive, while inputs with lower scores are classified as negative.
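
As a minimal sketch of this thresholding step (the variable names here are illustrative, not from the post):

```python
# Illustrative example: turning probability scores into binary labels.
scores = [0.9, 0.2, 0.7, 0.4]
threshold = 0.5

# Scores above the threshold are classified as "positive", the rest as "negative".
predictions = ["positive" if s > threshold else "negative" for s in scores]
print(predictions)  # ['positive', 'negative', 'positive', 'negative']
```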

Confusion Matrix

There are four outcomes of the predicted binary output, which can be nicely summarized with the following table, called a confusion matrix:

[Figure: confusion matrix]

The green row corresponds to the positive items in our dataset, while the red row corresponds to the negative items in the dataset. The columns correspond to the model predictions. The cells outlined with dark green are the items our model classified correctly, i.e., the accurate predictions of our model.
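
To make the four cells concrete, here is a small sketch (the `confusion_counts` helper is my own naming, not code from the post) that counts them from the true labels and predicted scores:

```python
def confusion_counts(y_true, y_score, threshold):
    """Count (TP, FN, FP, TN) given true labels (1 = positive, 0 = negative),
    predicted probability scores, and a classification threshold."""
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score > threshold
        if label == 1 and predicted_positive:
            tp += 1   # actual positive, predicted positive
        elif label == 1:
            fn += 1   # actual positive, predicted negative
        elif predicted_positive:
            fp += 1   # actual negative, predicted positive
        else:
            tn += 1   # actual negative, predicted negative
    return tp, fn, fp, tn
```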

Now, we can define a few relevant metrics based on the confusion matrix.

Accuracy

Accuracy is the proportion of all accurate predictions among all items in the dataset. It is defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy can be misleading, especially with imbalanced datasets. For example, if 99% of our dataset is positive, a model that always predicts positive will have an accuracy of 99%, but this doesn’t provide meaningful insight. Hence, we need other metrics.
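
The imbalanced-dataset caveat is easy to verify with a quick sketch (hypothetical numbers: 99 actual positives, 1 actual negative, and a model that always predicts positive):

```python
# Hypothetical imbalanced dataset: 99 actual positives, 1 actual negative,
# and a model that always predicts "positive".
tp, fn, fp, tn = 99, 0, 1, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99, even though the model never identifies a negative item
```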

True Positive Rate

The true positive rate or recall is the proportion of accurate predictions among positive items:

$$\text{Recall or } TPR = \frac{TP}{TP + FN}$$

The recall only considers the green row (actual positives) from our confusion table, and completely ignores the red row.

True Negative Rate

The true negative rate or specificity is the proportion of accurate predictions among negative items:

$$\text{Specificity or } TNR = \frac{TN}{FP + TN}$$

The specificity only considers the red row (actual negatives) from our confusion table.

False Positive Rate

The false positive rate is the proportion of inaccurate predictions among negative items:

$$FPR = \frac{FP}{FP + TN}$$

alternatively:

$$FPR = 1 - TNR$$

The false positive rate is related to the true negative rate. However, we will be using FPR more than TNR in the next sections.
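
All three rates follow directly from the confusion-matrix counts; here is a short illustrative helper (again, my own naming):

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, TNR, FPR) from the confusion-matrix counts."""
    tpr = tp / (tp + fn)  # recall: accurate predictions among actual positives
    tnr = tn / (fp + tn)  # specificity: accurate predictions among actual negatives
    fpr = fp / (fp + tn)  # fall-out: inaccurate predictions among actual negatives, 1 - TNR
    return tpr, tnr, fpr
```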

Visualization

For reference, the predicted probability scores used in the visualizations below are:

positives = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
negatives = [0.1, 0.2, 0.3, 0.5, 0.8, 1]

Let’s set up a visualization for better understanding. Assume that we have:

  • A dataset with positive and negative items
  • An ML model that predicts a probability score from 0 to 1, representing the probability that the input belongs to the positive class.

Then, we can visualize the dataset items and their probability predictions together, as below:

[Interactive plot: the dataset items placed along a probability axis from 0 to 1]

Positive items in the dataset are shown in green, and negative items are shown in red. The sizes of circles represent the predicted probability scores, with smaller circles representing scores close to 0 and larger circles representing scores close to 1. The items are ordered according to their probability scores, from smallest to largest.

Next, we choose a threshold depending on the application of our ML model. But, for now, let’s visualize the threshold as well:

[Interactive plot: the same probability axis, now with an adjustable threshold]
thresholdSlider = 0.5

The circles with a dark green outline represent items that are classified accurately, in other words, the true positives and true negatives.

Why not calculate the true positive (TP), false negative (FN), false positive (FP), and true negative (TN) values of the confusion matrix:

| Threshold | TP | FN | FP | TN |
|-----------|----|----|----|----|
| 0.5       | 5  | 3  | 2  | 4  |

Tip: You can adjust the threshold using the slider above (give it a try), and the tables above and below will update accordingly.

And the metrics as defined above:

| Accuracy | TPR | FPR | TNR |
|----------|-----|-----|-----|
| 9/14     | 5/8 | 2/6 | 4/6 |

Play with the threshold slider and make sure that you understand the different metrics, especially the true positive rate and the false positive rate. Maybe try to reproduce the metrics yourself first at different thresholds and compare with the table?
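
If you prefer code to the slider, the two tables above can be reproduced with a short sketch using the positives and negatives scores listed earlier (threshold fixed at 0.5):

```python
positives = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # scores of the actual positives
negatives = [0.1, 0.2, 0.3, 0.5, 0.8, 1.0]            # scores of the actual negatives
threshold = 0.5

tp = sum(s > threshold for s in positives)  # 5
fn = len(positives) - tp                    # 3
fp = sum(s > threshold for s in negatives)  # 2
tn = len(negatives) - fp                    # 4

accuracy = (tp + tn) / (len(positives) + len(negatives))  # 9/14
tpr = tp / len(positives)                                 # 5/8
fpr = fp / len(negatives)                                 # 2/6
tnr = tn / len(negatives)                                 # 4/6
```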

ROC and AUC

ROC Curve

Choosing a specific threshold can be difficult since it depends on the particular application. Luckily, we have metrics that show the performance of an ML model across varying threshold values. One of them is the receiver operating characteristic curve, or ROC curve.

The term “Receiver Operating Characteristic” originates from World War II, where it was used in radar systems for detecting enemy objects.

The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at every threshold value.

Let’s bring back a slightly modified version of our visualization:

[Interactive plot: the dataset items along the probability axis with an adjustable threshold]

Earlier, we visualized the true negative rate. Now, we also visualize the false positive rate, with the false positives outlined in dark red.

thresholdSlider2 = 0.5

The true positive and false positive rates at this threshold are:

| Threshold | TPR | FPR |
|-----------|-----|-----|
| 0.5       | 5/8 | 2/6 |

Now, we need to plot FPR (x-axis) against TPR (y-axis).
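
To trace the ROC curve programmatically, we can sweep the threshold over every distinct score and record the (FPR, TPR) pair at each step. A sketch with my own helper name, reusing the scores above:

```python
def roc_points(positives, negatives):
    """Return the (FPR, TPR) points obtained by sweeping the threshold."""
    # Add thresholds below and above every score so the curve spans (0, 0) to (1, 1).
    thresholds = sorted(set(positives + negatives) | {-1.0, 2.0}, reverse=True)
    points = []
    for t in thresholds:
        tpr = sum(s > t for s in positives) / len(positives)
        fpr = sum(s > t for s in negatives) / len(negatives)
        points.append((fpr, tpr))
    return points

print(roc_points(positives, negatives))  # starts at (0.0, 0.0), ends at (1.0, 1.0)
```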

[Interactive plot: True Positive Rate (y-axis) against False Positive Rate (x-axis), with the current threshold shown as a black point]

Tip: You can adjust the slider above, and this visualization will update accordingly.

Play with the slider above and compare the values in the table and (x, y) values of the black point.

The new 2D visualization is similar to the previous 1D visualization, except for the following differences:

  • The positive items (green) are visualized along the Y-axis
  • The negative items (red) are visualized along the X-axis
  • The threshold is visualized by the black circle

Area Under the Curve

The area under the ROC curve is called the ROC AUC score. ROC AUC stands for “Receiver Operating Characteristic Area Under the Curve”.

The ROC AUC score is a single number that summarizes the ML model’s performance across all threshold values. Note that the ROC curve summarizes performance with a visual plot, whereas the ROC AUC score summarizes it with a single number.

[Interactive plot: the ROC curve, with True Positive Rate on the y-axis, False Positive Rate on the x-axis, and the area under the curve shaded in light gray]

The area of the light gray region in the above visualization is the ROC AUC Score.

The ROC AUC score is 0.5 for a classifier that performs no better than random guessing and approaches 1.0 for a classifier with perfect performance.
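
One common way to compute the area is the trapezoidal rule over the ROC points. Here is a sketch, assuming the `roc_points` helper from above, with an optional cross-check against scikit-learn’s `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score  # optional cross-check

def roc_auc(positives, negatives):
    """Area under the ROC curve via the trapezoidal rule."""
    points = roc_points(positives, negatives)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between consecutive points
    return area

print(roc_auc(positives, negatives))

# Cross-check with scikit-learn (labels: 1 = positive, 0 = negative).
y_true = [1] * len(positives) + [0] * len(negatives)
y_score = positives + negatives
print(roc_auc_score(y_true, y_score))  # same value
```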

Probabilistic Interpretation

The ROC AUC score has a very nice probabilistic interpretation:

The ROC AUC is the probability that the model assigns a higher probability score to a randomly selected positive item than to a randomly selected negative item.

It is nicely explained with the visualization that we already have:

[Interactive plot: the ROC curve with a selectable point under it, connected by dotted segments to a green circle and a red circle]
pointSelection = 50

Every point (e.g., the black dot above) under the ROC curve has a corresponding green circle (follow the dotted horizontal segment) and a corresponding red circle (follow the dotted vertical segment). The corresponding green circle is always bigger than or equal to the red circle.

Tip: Try the slider above and compare the matching green circle with the matching red circle.

This matches the probabilistic interpretation above as the number of items in our dataset approaches infinity.
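
The probabilistic interpretation can also be checked directly with a naive pairwise sketch (quadratic in the dataset size, so not the efficient implementation planned for the next post): compare every (positive, negative) pair of scores and count how often the positive score is higher, with ties counted as one half.

```python
def auc_by_pairs(positives, negatives):
    """Estimate the ROC AUC as the probability that a randomly selected positive
    item gets a higher score than a randomly selected negative item."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(positives) * len(negatives))

print(auc_by_pairs(positives, negatives))  # matches the trapezoidal area above
```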

The End

I hope you enjoyed this post.

In the next post, we will work on an efficient Python implementation of the ROC AUC score based on probabilistic intuition.

Subscribe to get a notification about future posts.