Chapter 7: Classifier Evaluation

When working with classification models, it's not enough to know whether a prediction is right or wrong; we need to understand how it was right or wrong, and why. This is especially important when:

  • the cost of mistakes is unequal (e.g. medical diagnosis, fraud detection),
  • the classes are imbalanced (e.g. detecting rare events), or
  • you need to tune a classifier beyond raw accuracy.

In this chapter, we'll explore:

  • the confusion matrix as the foundation of classification metrics,
  • definitions and use cases for accuracy, precision, and recall,
  • how to interpret ROC curves and AUC, and
  • real-world considerations for evaluating classifiers effectively.

7.1 The Confusion Matrix

A confusion matrix is a table that breaks down the predictions of a classification model by comparing them to the actual outcomes.

                     Predicted: Positive     Predicted: Negative
Actual: Positive     True Positive (TP)      False Negative (FN)
Actual: Negative     False Positive (FP)     True Negative (TN)
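
If you compute these counts with scikit-learn, note that confusion_matrix orders rows and columns by label value, so for 0/1 labels the matrix comes back as [[TN, FP], [FN, TP]]. A minimal sketch (the label arrays here are hypothetical):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 1]   # predicted labels (hypothetical)

# scikit-learn returns [[TN, FP], [FN, TP]] for 0/1 labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")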

7.2 Accuracy

Accuracy is the proportion of total predictions that were correct.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

When It Works: When classes are balanced; when false positives and false negatives are equally costly.

When It Fails: With imbalanced datasets.

Example: In a disease detection model where only 1% of patients have the disease, predicting “negative” for everyone gives 99% accuracy but is useless.
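
To make this concrete, here is a minimal sketch (with simulated labels at 1% prevalence) showing that the trivial "always negative" model reaches 99% accuracy while catching none of the positives:

from sklearn.metrics import accuracy_score, recall_score

# Simulated labels: 10 positives out of 1,000 patients (1% prevalence)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000            # predict "negative" for everyone

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99
print("Recall:", recall_score(y_true, y_pred))      # 0.0 -- every positive is missed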

7.3 Precision and Recall

Precision

The proportion of predicted positives that were actually positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

  • High precision = few false positives
  • Useful when false positives are costly (e.g. spam filters, legal alerts)

Recall (Sensitivity or True Positive Rate)

The proportion of actual positives that were correctly identified.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

  • High recall = few false negatives
  • Useful when missing a positive is costly (e.g. cancer detection, fraud detection)

F1-Score

The F1-score is the harmonic mean of precision and recall, offering a balanced metric when you want to consider both. It's especially useful when the dataset is imbalanced.

\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
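
A short sketch pulling the three metrics of this section together (assuming scikit-learn and 0/1 labels; the arrays are hypothetical), including a check that F1 really is the harmonic mean of precision and recall:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f1:.2f}")
print("Harmonic mean check:", 2 * p * r / (p + r))   # matches f1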

7.4 ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots:

  • x-axis: False Positive Rate (FPR) = FP / (FP + TN)
  • y-axis: True Positive Rate (Recall) = TP / (TP + FN)

Area Under the Curve (AUC)

  • AUC = 1.0: perfect classifier
  • AUC = 0.5: random guessing

AUC is useful for comparing classifiers independent of a classification threshold.

from sklearn.metrics import roc_curve, auc

# y_true: actual 0/1 labels; y_scores: predicted probabilities for the positive class
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
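
If you want to see the curve itself, a minimal plotting sketch with matplotlib (reusing the fpr, tpr, and roc_auc values computed above):

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()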

7.5 Precision-Recall Tradeoff

There's often a tradeoff between precision and recall: if you increase the classification threshold, precision usually goes up but recall goes down; lowering the threshold does the opposite.

Use precision-recall curves to explore this tradeoff and set thresholds based on your use case. The area under the precision-recall curve (PR AUC) is also useful for evaluating performance with imbalanced datasets.
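
As a sketch of how to explore the tradeoff in code (assuming y_true labels and y_scores probabilities as in the ROC example above), precision_recall_curve returns precision and recall at each candidate threshold, and average_precision_score is one common summary of the PR curve:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Inspect (threshold, precision, recall) triples to choose an operating point
for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

print("PR AUC (average precision):", average_precision_score(y_true, y_scores))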

7.6 Python Example

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# True labels and hard (0/1) predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)

# Full classification report
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

# AUC if you have prediction probabilities
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.85, 0.05, 0.6, 0.92, 0.3]
print("ROC AUC Score:", roc_auc_score(y_true, y_scores))
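Building on the same y_true, y_scores, and classification_report, you can regenerate hard predictions at a different cutoff to see the precision-recall tradeoff from Section 7.5 in action (the 0.7 threshold is an arbitrary illustrative choice):

# Raising the threshold trades recall for precision
threshold = 0.7
y_pred_strict = [int(score >= threshold) for score in y_scores]
print(classification_report(y_true, y_pred_strict, target_names=['Negative', 'Positive']))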

7.7 Metric Selection Guide

Goal / Concern                              Best Metric(s)
General performance (balanced classes)      Accuracy
False positives are costly                  Precision
False negatives are costly                  Recall
Comparing models across thresholds          ROC AUC, PR AUC
Imbalanced classes                          Precision, Recall, F1, PR AUC

Key Takeaways

  • The confusion matrix is the foundation of classification evaluation.
  • Accuracy alone can be misleading, especially in imbalanced settings.
  • Precision and recall are vital when different types of errors carry different costs.
  • ROC and precision-recall curves help visualize classifier performance across thresholds.