Chapter 6: Model Training and Evaluation

This chapter focuses on how to build the best model possible by walking through best practices for machine learning workflows, including: proper train-test-validation splits, cross-validation, hyperparameter tuning (grid search and random search), learning curves, bootstrapping, and building reliable model evaluation pipelines.

6.1 Train-Test-Validation Splits

To evaluate model performance fairly, we split our data into different sets:

  • Training set: Used to learn model parameters.
  • Validation set: Used to tune hyperparameters and choose models.
  • Test set: Used for final evaluation after all tuning is done.
Dataset       Purpose
Training      Fit the model
Validation    Tune hyperparameters, compare models
Test          Simulate performance on unseen data

Important Tips: The test set should be treated as untouchable until the final evaluation. A common starting point is an 80/20 split, with 80% of the data used for training and the remaining 20% held out for testing. Depending on how much data you have and how your model behaves, you can adjust the train/test ratio as you see fit.
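
A minimal sketch of how these splits can be created with scikit-learn's train_test_split is shown below; the variable names X and y and the exact split ratios are assumptions for illustration.

from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the untouched test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remaining data again to obtain a validation set
# (25% of the training portion, i.e. 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)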

6.2 Cross-Validation

Rather than depending on one split of training and validation data, cross-validation provides a more stable and generalized estimate of performance.

k-Fold Cross-Validation

  • Divide the data into k equal parts (folds).
  • Train the model on k - 1 folds and validate on the remaining one.
  • Repeat k times, rotating the validation fold each time.
  • Average the validation scores across the folds.

This process gives a more robust estimate of how the model is likely to perform on new data. An example in Python of doing this:

from sklearn.model_selection import cross_val_score

# model can be any scikit-learn estimator, e.g. Ridge()
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())            # average score and its spread

Advantages:

  • Uses all data for both training and validation
  • Helps detect high variance or model instability
  • Reduces dependence on any single split

Drawbacks:

  • Slower to run, especially for large datasets or complex models

6.3 Parameters vs Hyperparameters

Type            Example                                         Set By
Parameter       Weights in linear regression                    Learned from training data
Hyperparameter  Regularization strength, number of neighbors    Manually tuned

Hyperparameters are not learned by the model; they must be set by the practitioner, usually through a tuning process.
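
As a small illustration of the distinction (assuming a Ridge regression and some training data X_train, y_train), alpha below is a hyperparameter chosen before fitting, while the coefficients and intercept are parameters learned during fitting:

from sklearn.linear_model import Ridge

# alpha is a hyperparameter: chosen by the practitioner before training
model = Ridge(alpha=1.0)

# the coefficients and intercept are parameters: learned from the data
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)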

6.4 Hyperparameter Tuning: Grid Search

Grid search exhaustively tries every combination of the specified hyperparameter values and selects the best-performing one.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Candidate values for the regularization strength
param_grid = {'alpha': [0.01, 0.1, 1, 10]}

# Evaluate every combination with 5-fold cross-validation
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)

6.5 Random Search

Random search samples hyperparameter combinations at random from specified distributions. It is more efficient than grid search when dealing with many hyperparameters or wide value ranges.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge
from scipy.stats import uniform

# Sample alpha uniformly from [0.01, 10.01); try 100 random draws with 5-fold CV
param_dist = {'alpha': uniform(0.01, 10)}
search = RandomizedSearchCV(Ridge(), param_dist, n_iter=100, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

6.6 Learning Curves

Learning curves plot model performance on both training and validation data as the number of training examples increases. They help answer:

  • Is the model underfitting? (both errors high)
  • Is the model overfitting? (training error low, validation error high)
  • Will adding more data help?

Learning curves are essential diagnostics for understanding what’s going wrong in model training.
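
A minimal sketch of plotting a learning curve with scikit-learn's learning_curve utility is shown below; the Ridge estimator, the data X and y, and the plotting details are assumptions for illustration.

from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
import numpy as np
import matplotlib.pyplot as plt

# Compute cross-validated training and validation scores
# at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('Number of training examples')
plt.ylabel('Score')
plt.legend()
plt.show()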

6.7 Bootstrapping

Bootstrapping is a statistical resampling technique used to estimate model performance variability or to generate confidence intervals.

  • Randomly sample data with replacement to create multiple "bootstrap samples."
  • Train or evaluate the model on each sample.
  • Analyze the variation in performance across samples.

Bootstrapping is especially useful for small datasets or ensemble methods like bagging.

from sklearn.utils import resample

bootstrap_scores = []
for _ in range(100):
    # Sample the training data with replacement
    X_resampled, y_resampled = resample(X, y)
    # Refit the model (any scikit-learn estimator) on the bootstrap sample
    model.fit(X_resampled, y_resampled)
    # Score each refit model on the same held-out test set
    bootstrap_scores.append(model.score(X_test, y_test))

6.8 Typical Model Evaluation Workflow

  • Preprocess and clean the data.
  • Split into training, validation, and test sets.
  • Use cross-validation to tune hyperparameters on the training data.
  • Retrain the final model on the full training set using the best hyperparameters.
  • Evaluate once on the untouched test set.
  • Optionally, use bootstrapping to estimate confidence intervals for performance metrics (a sketch of the full workflow follows this list).
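
A minimal sketch of this workflow, under the assumption that Ridge is the model and X, y are the already preprocessed data, might look like this:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.utils import resample

# 1. Split off an untouched test set (preprocessing assumed done)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Tune hyperparameters with cross-validation on the training data
param_grid = {'alpha': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, y_train)

# 3. Retrain on the full training set with the best hyperparameters
best_model = Ridge(**grid.best_params_)
best_model.fit(X_train, y_train)

# 4. Evaluate once on the held-out test set
print('Test score:', best_model.score(X_test, y_test))

# 5. Optionally, bootstrap to estimate variability in the test metric
bootstrap_scores = []
for _ in range(100):
    X_bs, y_bs = resample(X_train, y_train)
    bs_model = Ridge(**grid.best_params_).fit(X_bs, y_bs)
    bootstrap_scores.append(bs_model.score(X_test, y_test))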

Key Takeaways

  • A reliable machine learning workflow requires proper data splitting and evaluation strategies.
  • Cross-validation helps obtain stable performance estimates and prevents overfitting to a validation set.
  • Hyperparameter tuning (via grid or random search) is crucial for model optimization.
  • Learning curves diagnose underfitting and overfitting.
  • Bootstrapping provides uncertainty estimates and works well with limited data.