Chapter 5: Regularization
In Chapter 4, we discussed the bias-variance tradeoff and how overfitting can cause our model to memorize the training data instead of learning general patterns. One of the most effective tools to combat overfitting is regularization.
Regularization works by discouraging complexity in the model. In this chapter, we explore two of the most widely used regularization techniques in machine learning: L1 regularization (Lasso) and L2 regularization (Ridge). We will cover what they are, how they work, the mathematical intuition behind them, when to use each, and how to apply them in practice.
5.1 What is Regularization?
Regularization adds a penalty term to the loss function to discourage extreme model weights and reduce overfitting. Without regularization, a model focuses purely on minimizing prediction error. With regularization, it also tries to keep its parameter values small, preventing it from relying too heavily on any one feature or capturing noise.
5.2 General Form
The general form of regularization starts with a loss function such as the mean squared error (MSE). To regularize, we add a penalty term:
\[ \text{Loss} = \text{MSE} + \lambda \cdot \Omega(w) \]
- \( \lambda \) is the regularization strength, a hyperparameter we choose
- \( \Omega(w) \) is a penalty function of the model weights \( w \)
- The larger \( \lambda \), the more the model is penalized for large weights
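To make this concrete, here is a minimal NumPy sketch of the regularized loss for a linear model. The function and variable names are illustrative, not taken from any particular library.

```python
# Minimal sketch: MSE plus a weight penalty for a linear model y ~ Xw.
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Return MSE + lambda * Omega(w) for an L1 or L2 penalty."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    if penalty == "l1":
        omega = np.sum(np.abs(w))   # L1: sum of absolute weights
    else:
        omega = np.sum(w ** 2)      # L2: sum of squared weights
    return mse + lam * omega

# Tiny illustrative example: three samples, two features.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
print(regularized_loss(w, X, y, lam=0.1))
```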
5.3 L1 Regularization: Lasso Regression
L1 regularization penalizes the absolute values of the weights, so the loss becomes:
\[ \text{Loss} = \text{MSE} + \lambda \sum_i |w_i| \]
This penalty encourages sparse models: some weights become exactly zero. Lasso therefore performs feature selection by eliminating irrelevant predictors, which makes it well suited to high-dimensional data with many irrelevant features.
Geometrically, the L1 penalty corresponds to a diamond-shaped constraint region. The contours of the loss often touch this region at a corner, which lies on an axis, so some weights are driven exactly to zero.
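Here is a short sketch of Lasso in scikit-learn on synthetic data. Note that scikit-learn names the regularization strength `alpha` rather than \( \lambda \), and its Lasso objective scales the MSE term slightly differently, but the behavior is the same: irrelevant features tend to receive coefficients of exactly zero. The dataset and parameter values below are purely illustrative.

```python
# Sketch: Lasso on synthetic data where only a few features carry signal.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 samples, 10 features, but only 3 are actually informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)   # alpha plays the role of lambda
lasso.fit(X, y)

# Coefficients of the irrelevant features are typically driven to exactly zero.
print("Coefficients:     ", np.round(lasso.coef_, 2))
print("Zeroed-out count: ", int(np.sum(lasso.coef_ == 0)))
```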
5.4 L2 Regularization: Ridge Regression
L2 regularization penalizes the squares of the weights, so the loss becomes:
\[ \text{Loss} = \text{MSE} + \lambda \sum_i w_i^2 \]
Ridge regression shrinks the weights toward zero but never makes them exactly zero. It keeps every feature while reducing its influence, which is useful when you have many correlated features.
Geometrically, the L2 penalty corresponds to a circular constraint region, so the solution shrinks smoothly and the weights tend to be spread across correlated features rather than concentrated in a few.
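A quick sketch of how Ridge behaves with two nearly identical features. The data is synthetic and the `alpha` value is arbitrary; the point is simply that ordinary least squares can assign large, unstable weights to correlated features, while Ridge shrinks and balances them.

```python
# Sketch: Ridge vs. plain least squares on two highly correlated features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)      # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # alpha plays the role of lambda

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # can be large and unstable
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk and more balanced
```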
5.5 Choosing the Regularization Strength (λ)
The regularization strength \( \lambda \) is a hyperparameter, so it must be chosen carefully.
- If \( \lambda = 0 \): no regularization; the model is free to overfit
- If \( \lambda \) is too large: the penalty dominates and the model underfits
- If \( \lambda \) is just right: the model balances fit and simplicity
We use cross-validation to find the best \( \lambda \): try a range of values and select the one with the lowest validation error.
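As a rough sketch of that procedure, scikit-learn's `LassoCV` searches over a grid of candidate strengths (again called `alpha`) using cross-validation. The grid and synthetic data below are illustrative choices, not recommendations.

```python
# Sketch: choosing the regularization strength by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Candidate strengths spanning several orders of magnitude.
alphas = np.logspace(-3, 2, 30)
model = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("Best alpha:          ", model.alpha_)
print("Nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```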
5.6 When to Use What?
| Situation | Recommended Regularization |
|---|---|
| Many irrelevant features | L1 |
| Many correlated features | L2 |
| You want simple, interpretable models | L1 |
| You want to avoid eliminating features | L2 |
5.7 Limitations of Regularization
- Penalties are sensitive to feature scale, so always standardize your features first (see the sketch after this list)
- Requires tuning \( \lambda \)
- Lasso can behave erratically with highly correlated features
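One common way to handle the scaling issue is to standardize inside a pipeline, so the scaler is fit only on the training data. The following scikit-learn sketch is one reasonable setup, with illustrative data and an arbitrary `alpha`.

```python
# Sketch: standardize features before applying a penalized model.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

# Scaling first means the penalty treats every feature on the same footing.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```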
5.8 Key Takeaways
- Regularization prevents overfitting by penalizing large model weights.
- L1 promotes sparsity and can eliminate features.
- L2 shrinks all weights smoothly, which helps when features are collinear.
- Regularization is critical in high-dimensional, real-world data science problems.