Chapter 7: Classifier Evaluation

When working with classification models, it's not enough to know whether a prediction is right or wrong; we need to understand how it was right or wrong, and why. This is especially important when:

  • the cost of mistakes is unequal (e.g. medical diagnosis, fraud detection),
  • the classes are imbalanced (e.g. detecting rare events), or
  • you need to tune a classifier beyond raw accuracy.

In this chapter, we'll explore:

  • the confusion matrix as the foundation of classification metrics,
  • definitions and use cases for accuracy, precision, and recall,
  • how to interpret ROC curves and AUC, and
  • real-world considerations for evaluating classifiers effectively.

7.1 The Confusion Matrix

A confusion matrix is a table that breaks down the predictions of a classification model by comparing them to the actual outcomes.

                     Predicted: Positive     Predicted: Negative
Actual: Positive     True Positive (TP)      False Negative (FN)
Actual: Negative     False Positive (FP)     True Negative (TN)
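
If you compute these counts with scikit-learn, note that confusion_matrix orders rows and columns by label value, so for 0/1 labels the matrix comes back as [[TN, FP], [FN, TP]]. A minimal sketch (the label arrays here are hypothetical):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 1]   # predicted labels (hypothetical)

# scikit-learn returns [[TN, FP], [FN, TP]] for 0/1 labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")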

7.2 Accuracy

Accuracy is the proportion of total predictions that were correct.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

When It Works: When classes are balanced; when false positives and false negatives are equally costly.

When It Fails: With imbalanced datasets.

Example: In a disease detection model where only 1% of patients have the disease, predicting “negative” for everyone gives 99% accuracy but is useless.
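
To make this concrete, here is a minimal sketch (with simulated labels at 1% prevalence) showing that the trivial "always negative" model reaches 99% accuracy while catching none of the positives:

from sklearn.metrics import accuracy_score, recall_score

# Simulated labels: 10 positives out of 1,000 patients (1% prevalence)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000            # predict "negative" for everyone

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.99
print("Recall:", recall_score(y_true, y_pred))      # 0.0 -- every positive is missed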

7.3 Precision and Recall

Precision

The proportion of predicted positives that were actually positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \]

  • High precision = few false positives
  • Useful when false positives are costly (e.g. spam filters, legal alerts)

Recall (Sensitivity or True Positive Rate)

The proportion of actual positives that were correctly identified.

\[ \text{Recall} = \frac{TP}{TP + FN} \]

  • High recall = few false negatives
  • Useful when missing a positive is costly (e.g. cancer detection, fraud detection)

F1-Score

The F1-score is the harmonic mean of precision and recall, offering a balanced metric when you want to consider both. It's especially useful when the dataset is imbalanced.

\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
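
A short sketch pulling the three metrics of this section together (assuming scikit-learn and 0/1 labels; the arrays are hypothetical), including a check that F1 really is the harmonic mean of precision and recall:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f1:.2f}")
print("Harmonic mean check:", 2 * p * r / (p + r))   # matches f1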

7.4 ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots:

  • x-axis: False Positive Rate (FPR) = FP / (FP + TN)
  • y-axis: True Positive Rate (Recall) = TP / (TP + FN)

Area Under the Curve (AUC)

  • AUC = 1.0: perfect classifier
  • AUC = 0.5: random guessing

AUC is useful for comparing classifiers independent of a classification threshold.

from sklearn.metrics import roc_curve, auc

# y_true: actual 0/1 labels; y_scores: predicted probabilities for the positive class
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
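
If you want to see the curve itself, a minimal plotting sketch with matplotlib (reusing the fpr, tpr, and roc_auc values computed above):

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guessing")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()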

7.5 Precision-Recall Tradeoff

There's often a tradeoff between precision and recall: if you increase the classification threshold, precision usually goes up but recall goes down; lowering the threshold does the opposite.

Use precision-recall curves to explore this tradeoff and set thresholds based on your use case. The area under the precision-recall curve (PR AUC) is also useful for evaluating performance with imbalanced datasets.
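
As a sketch of how to explore the tradeoff in code (assuming y_true labels and y_scores probabilities as in the ROC example above), precision_recall_curve returns precision and recall at each candidate threshold, and average_precision_score is one common summary of the PR curve:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Inspect (threshold, precision, recall) triples to choose an operating point
for t, p, r in zip(thresholds, precision, recall):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")

print("PR AUC (average precision):", average_precision_score(y_true, y_scores))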

7.6 Python Example

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# True labels and hard (0/1) predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)

# Full classification report
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

# AUC if you have prediction probabilities
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.85, 0.05, 0.6, 0.92, 0.3]
print("ROC AUC Score:", roc_auc_score(y_true, y_scores))
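Building on the same y_true, y_scores, and classification_report, you can regenerate hard predictions at a different cutoff to see the precision-recall tradeoff from Section 7.5 in action (the 0.7 threshold is an arbitrary illustrative choice):

# Raising the threshold trades recall for precision
threshold = 0.7
y_pred_strict = [int(score >= threshold) for score in y_scores]
print(classification_report(y_true, y_pred_strict, target_names=['Negative', 'Positive']))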

7.7 Metric Selection Guide

Goal / Concern                              Best Metric(s)
General performance (balanced classes)      Accuracy
False positives are costly                  Precision
False negatives are costly                  Recall
Comparing models across thresholds          ROC AUC, PR AUC
Imbalanced classes                          Precision, Recall, F1, PR AUC

Key Takeaways

  • The confusion matrix is the foundation of classification evaluation.
  • Accuracy alone can be misleading, especially in imbalanced settings.
  • Precision and recall are vital when different types of errors carry different costs.
  • ROC and precision-recall curves help visualize classifier performance across thresholds.