Chapter 7: Classifier Evaluation
When working with classification models, it's not enough to know whether a prediction is right or wrong; we need to understand how it was right or wrong, and why. This is especially important when:

- The cost of mistakes is unequal (e.g. medical diagnosis, fraud detection)
- Classes are imbalanced (e.g. detecting rare events)
- You need to tune a classifier beyond plain accuracy

In this chapter, we'll explore:

- The confusion matrix as the foundation of classification metrics
- Definitions and use cases for accuracy, precision, and recall
- How to interpret ROC curves and AUC
- Real-world considerations for evaluating classifiers effectively
7.1 The Confusion Matrix
A confusion matrix is a table that breaks down the predictions of a classification model by comparing them to the actual outcomes.
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
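As a quick sketch (assuming scikit-learn and a small set of made-up labels), the four cells can be read directly out of `confusion_matrix`:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, scikit-learn orders the matrix as [[TN, FP], [FN, TP]],
# i.e. the negative class comes first (unlike the table above)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```

Unpacking with `.ravel()` and explicit names keeps the bookkeeping straight when you compute metrics by hand.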
7.2 Accuracy
Accuracy is the proportion of total predictions that were correct.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
When It Works: When classes are balanced; when false positives and false negatives are equally costly.
When It Fails: With imbalanced datasets.
Example: In a disease detection model where only 1% of patients have the disease, predicting “negative” for everyone gives 99% accuracy but is useless.
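A minimal sketch of that failure mode, assuming scikit-learn and a synthetic dataset in which only 1% of the labels are positive:

```python
from sklearn.metrics import accuracy_score

# Synthetic, illustrative data: 1,000 patients, 10 (1%) actually have the disease
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "classifier" that always predicts negative

# 0.99 accuracy, yet not a single sick patient is detected
print("Accuracy:", accuracy_score(y_true, y_pred))
```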
7.3 Precision and Recall
Precision
The proportion of predicted positives that were actually positive.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
- High precision = few false positives
- Useful when false positives are costly (e.g. spam filters, legal alerts)
Recall (Sensitivity or True Positive Rate)
The proportion of actual positives that were correctly identified.
\[ \text{Recall} = \frac{TP}{TP + FN} \]
- High recall = few false negatives
- Useful when missing a positive is costly (e.g. cancer detection, fraud detection)
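Both metrics are available directly in scikit-learn. A short sketch, reusing the hypothetical labels from the confusion matrix example above:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels, as in the confusion matrix sketch above
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
```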
F1-Score
The F1-score is the harmonic mean of precision and recall, offering a balanced metric when you want to consider both. It's especially useful when the dataset is imbalanced.
\[ \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
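As a quick check (again with the made-up labels above), scikit-learn's `f1_score` returns exactly the harmonic mean of the separately computed precision and recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# The library call and the harmonic-mean formula agree
print(f1_score(y_true, y_pred))
print(2 * p * r / (p + r))
```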
7.4 ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots:
- x-axis: False Positive Rate (FPR) = FP / (FP + TN)
- y-axis: True Positive Rate (Recall) = TP / (TP + FN)
Area Under the Curve (AUC)
- AUC = 1.0: perfect classifier
- AUC = 0.5: random guessing
AUC is useful for comparing classifiers independent of a classification threshold.
```python
from sklearn.metrics import roc_curve, auc

# y_true: true binary labels; y_scores: predicted probabilities for the positive class
fpr, tpr, _ = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
```
7.5 Precision-Recall Tradeoff
There's often a tradeoff between precision and recall: raising the classification threshold (the probability cutoff for predicting the positive class) usually increases precision but decreases recall, and lowering the threshold does the opposite.
Use precision-recall curves to explore this tradeoff and set thresholds based on your use case. The area under the precision-recall curve (PR AUC) is also useful for evaluating performance with imbalanced datasets.
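A brief sketch of computing PR AUC with scikit-learn, reusing the toy labels and probability scores from the Python example in 7.6:

```python
from sklearn.metrics import precision_recall_curve, auc

# Toy true labels and predicted probabilities (same as the example in 7.6)
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.85, 0.05, 0.6, 0.92, 0.3]

# precision_recall_curve sweeps the threshold over all distinct scores
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)
print("PR AUC:", pr_auc)
```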
7.6 Python Example
```python
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# True labels and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)

# Full classification report (per-class precision, recall, F1, and support)
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

# AUC if you have prediction probabilities
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.85, 0.05, 0.6, 0.92, 0.3]
print("ROC AUC Score:", roc_auc_score(y_true, y_scores))
```
7.7 Metric Selection Guide
| Goal/Concern | Best Metric(s) |
|---|---|
| General performance (balanced classes) | Accuracy |
| False positives are costly | Precision |
| False negatives are costly | Recall |
| Comparing models across thresholds | ROC AUC, PR Curve AUC |
| Imbalanced classes | Precision/Recall |
Key Takeaways
- The confusion matrix is the foundation of classification evaluation.
- Accuracy alone can be misleading, especially in imbalanced settings.
- Precision and recall are vital when different types of errors carry different costs.
- ROC and precision-recall curves help visualize classifier performance across thresholds.