Chapter 3: Optimization and Gradient Descent

Once we have a model, we need a way to train it. Training means adjusting the model’s internal parameters (weights) so that it performs well on our training data. How do we find the best parameters? This is where optimization comes into play. In machine learning, optimization is the process of minimizing a model’s loss function: a mathematical expression of how far off our predictions are from the true values. The most widely used optimization technique in machine learning is gradient descent, a powerful tool for training everything from linear models to neural networks. In this chapter we will cover gradient descent without the complicated underlying mathematics, but we will link to the proofs and applications for those interested in going deeper.

3.1 What is a Loss Function?

A loss function tells us how well or poorly our model is doing. One example, which we discussed in the previous chapter, is Mean Squared Error (MSE) for regression. The formula for MSE is:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \]
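
As a quick illustration, here is a minimal sketch of MSE in Python using NumPy; the arrays y_true and y_pred are placeholders for your own data:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average squared difference
    between predictions and true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Example: predictions that are slightly off
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.02
```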

3.2 What is Gradient Descent?

Gradient descent is an iterative algorithm that gradually adjusts parameters to reduce loss. It works like this:

  • Start with random values for your model parameters.
  • Compute the loss on the training data.
  • Calculate the gradient—the direction and rate of steepest increase of the loss.
  • Move in the opposite direction of the gradient (because we want to minimize loss).
  • Repeat until the loss stops decreasing significantly.

A simple analogy: imagine you're walking down a hill, blindfolded. You feel the slope under your feet and take a step downward. That’s gradient descent.

The gradient is a vector of partial derivatives. It tells us how to adjust each model parameter to decrease the loss. For a linear model with weights \( w_1, w_2, \ldots, w_n \) the gradient is:

\[ \nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_n} \right] \]

We use this gradient to update parameters:

\[ \theta \leftarrow \theta - \alpha \cdot \nabla L \]

Where:
\( \theta \) = the model parameters
\( \alpha \) = learning rate (step size)
\( \nabla L \) = gradient
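
Putting the pieces together, here is a minimal sketch of this update rule for a one-feature linear model \( \hat{y} = wx + b \) trained with the MSE loss from Section 3.1. The data, learning rate, and number of steps are illustrative assumptions, not prescriptions:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (an illustrative assumption)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

w, b = 0.0, 0.0      # start from arbitrary parameter values
alpha = 0.05         # learning rate (step size)

for step in range(1000):
    y_hat = w * x + b                 # predictions with current parameters
    error = y_hat - y                 # residuals
    grad_w = 2 * np.mean(error * x)   # dL/dw for MSE
    grad_b = 2 * np.mean(error)       # dL/db for MSE
    w -= alpha * grad_w               # theta <- theta - alpha * gradient
    b -= alpha * grad_b

print(w, b)  # approaches roughly 2 and 1
```

Each pass through the loop is one full cycle of the steps listed above: predict, measure the loss via its gradient, and move both parameters downhill.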

3.3 Learning Rate (α)

This hyperparameter controls how big each step is during gradient descent.

  • If \( \alpha \) is too small: convergence is slow.
  • If \( \alpha \) is too large: you might overshoot the minimum and never converge.
  • If \( \alpha \) is just right: smooth convergence to the minimum.

In practice, learning rates are tuned by experimentation or adjusted automatically using learning rate schedules or adaptive optimizers.
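
As one concrete (and deliberately simple) example of a schedule, here is a sketch of exponential decay; the starting rate and decay factor are illustrative assumptions:

```python
def exponential_decay(initial_lr, decay_rate, step):
    """A simple exponential learning-rate schedule:
    the step size shrinks geometrically as training progresses."""
    return initial_lr * (decay_rate ** step)

# Illustrative values: start at 0.1 and decay by 5% per epoch
for epoch in range(5):
    print(epoch, exponential_decay(0.1, 0.95, epoch))
```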

3.4 Variants of Gradient Descent

3.4.1 Batch Gradient Descent

Computes the gradient using the entire training dataset.
Pros: Stable, accurate direction.
Cons: Slow on large datasets.

3.4.2 Stochastic Gradient Descent (SGD)

Uses only one example at a time to compute the gradient.
Pros: Fast, can escape local minima.
Cons: Noisy updates, less stable.

3.4.3 Mini-Batch Gradient Descent

A compromise: compute the gradient on a small batch of samples (e.g., 32 or 64).
Pros: Faster than batch, more stable than SGD.
Most commonly used in practice.
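
Since the three variants differ only in how many examples feed each gradient estimate, they can be expressed with a single batch_size parameter: the full dataset for batch gradient descent, 1 for SGD, and a small number such as 32 in between. Here is a sketch reusing the linear model from Section 3.2, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(x, y, alpha=0.05, batch_size=32, epochs=100):
    """Gradient descent on a 1-D linear model with MSE.
    batch_size = len(x) -> batch GD; batch_size = 1 -> SGD."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * x[idx] + b - y[idx]
            w -= alpha * 2 * np.mean(error * x[idx])
            b -= alpha * 2 * np.mean(error)
    return w, b

x = rng.uniform(0, 4, 200)
y = 2 * x + 1 + rng.normal(0, 0.3, 200)         # noisy y = 2x + 1
print(minibatch_gd(x, y, batch_size=32))        # roughly (2, 1)
```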

3.5 Gradient Descent Visualization

Imagine a bowl-shaped surface representing the loss function.
The lowest point = optimal parameters.
The slope of the surface = gradient.
Gradient descent rolls a ball down the surface, step by step, until it settles at the bottom.
With each step, the model learns to make slightly better predictions.
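
If you want to draw this picture yourself, the following sketch (assuming NumPy and Matplotlib are available) traces gradient descent down the one-dimensional bowl \( f(w) = w^2 \):

```python
import numpy as np
import matplotlib.pyplot as plt

f = lambda w: w ** 2          # a bowl-shaped loss
grad = lambda w: 2 * w        # its slope (gradient)

w, alpha, path = 4.0, 0.1, [4.0]
for _ in range(20):
    w -= alpha * grad(w)      # one gradient descent step
    path.append(w)

ws = np.linspace(-5, 5, 200)
plt.plot(ws, f(ws), label="loss surface")
plt.plot(path, [f(p) for p in path], "o-", label="descent path")
plt.xlabel("parameter w"); plt.ylabel("loss"); plt.legend()
plt.show()
```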

3.6 What Could Go Wrong in Gradient Descent

If the loss surface is not a simple bowl, gradient descent can get stuck before it reaches the true (global) minimum. It may settle into a local minimum, a point lower than everything nearby but not the lowest point overall, or stall at a saddle point, where the gradient is zero even though the point is not a minimum at all. Either way, the final loss looks converged while being worse than the optimum. The learning rate matters here: a step size that is too small makes it easier to get trapped, while a well-chosen \( \alpha \) (and the noise in SGD updates) can help the algorithm escape. A toy saddle-point example follows.
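
As a toy illustration, take \( f(x, y) = x^2 - y^2 \), which has a saddle at the origin. Starting on the ridge where \( y = 0 \) (an artificial but instructive choice), the gradient in the \( y \) direction is zero everywhere, so plain gradient descent slides to the saddle and stops:

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 - y**2 (a saddle at the origin)."""
    return 2 * x, -2 * y

x, y, alpha = 3.0, 0.0, 0.1   # start exactly on the saddle's ridge
for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - alpha * gx, y - alpha * gy

print(x, y)  # converges to (0, 0), a saddle point, not a minimum:
             # f keeps decreasing along the y direction
```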

3.7 Alternatives to Gradient Descent

Some models (like linear regression) admit exact, closed-form solutions using linear algebra. But for larger or non-linear models, gradient descent is often the only practical choice.
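
For linear regression, the exact solution is a least-squares problem that can be solved in a few lines; this sketch uses NumPy's least-squares routine on the same toy data as before:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with a bias column: each row is [x_i, 1]
X = np.column_stack([x, np.ones_like(x)])

# Solve min ||Xw - y||^2 exactly, no iteration required
(w, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, b)  # roughly 2 and 1, matching gradient descent
```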

There are also advanced optimizers built on top of gradient descent: Momentum, AdaGrad, RMSProp, and Adam. We’ll explore these later in the context of training neural networks.