Chapter 8: Regression

Up to this point, we’ve explored how variables can be related through joint distributions, how they move together with covariance, and how we quantify their strength with correlation. These concepts are deeply rooted in probability. Understanding joint behavior, calculating expectations, and interpreting the spread and direction of relationships between variables prepares us to take the next step: regression.

Regression allows us to go beyond describing relationships and actually start predicting one variable from another. In this chapter, we’ll introduce simple linear regression and explain the key concepts behind it at a high level.

8.1 What is Regression?

Regression is a statistical method for modeling the relationship between two or more variables. In its most basic form, simple linear regression uses one variable (the predictor or independent variable, usually \( X \)) to predict another variable (the response or dependent variable, \( Y \)). The basic regression model is:

\[ Y = \beta_0 + \beta_1 X + \varepsilon \]

  • \( \beta_0 \): Intercept — the expected value of \( Y \) when \( X = 0 \)
  • \( \beta_1 \): Slope — the expected change in \( Y \) for a one-unit increase in \( X \)
  • \( \varepsilon \): Error term — accounts for variation in \( Y \) not explained by \( X \)

Our goal is to estimate \( \beta_0 \) and \( \beta_1 \) from data by minimizing the squared distance between observed and predicted values.
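
For reference, minimizing the total squared distance (the least squares criterion described in the next section) yields closed-form estimates in terms of sample means:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \]

Equivalently, \( \hat{\beta}_1 = r \, s_Y / s_X \), where \( r \) is the sample correlation and \( s_X \), \( s_Y \) are the sample standard deviations, which ties the slope directly back to the correlation ideas above.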

8.2 Least Squares and Interpretation

Imagine plotting your data on a scatterplot with \( X \) on the horizontal axis and \( Y \) on the vertical. The regression line is the best-fitting line that minimizes the total squared vertical distances between data points and the line — this is the method of least squares.

This connects to probability because each point can be seen as a realization of random variables. Minimizing squared differences is equivalent to estimating the conditional expectation of \( Y \) given \( X \).

A positive slope suggests that as \( X \) increases, \( Y \) tends to increase. A negative slope suggests the opposite. A slope of zero suggests that \( X \) has no linear predictive power for \( Y \).
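
To make this concrete, here is a minimal Python sketch of the least squares fit; the hours-studied and exam-score values below are made up purely for illustration:

    import numpy as np

    # Hypothetical data: hours studied (x) and exam scores (y), made up for illustration
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([55.0, 61.0, 64.0, 71.0, 73.0, 80.0])

    # Least squares estimates of the slope and intercept
    x_bar, y_bar = x.mean(), y.mean()
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    print(f"Intercept: {beta0_hat:.2f}, Slope: {beta1_hat:.2f}")

    # np.polyfit with degree 1 fits the same line and returns (slope, intercept)
    slope, intercept = np.polyfit(x, y, 1)

The call to np.polyfit is simply a convenient check that the hand-computed formulas and the library agree on the same line.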

8.3 Example

Suppose we fit a regression model that predicts exam scores from hours studied and obtain:

\[ \text{Exam Score} = 50 + 5 \times \text{Hours Studied} \]

  • The intercept 50: If a student studies 0 hours, we predict their score to be 50.
  • The slope 5: For every additional hour studied, the score increases by 5 points.
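
For example, plugging in 4 hours of studying gives a predicted score of

\[ 50 + 5 \times 4 = 70. \]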

Always interpret coefficients within the context of the problem.

8.4 Residuals

The residual is the difference between the actual value and the predicted value:

\[ e_i = Y_i - \hat{Y}_i \]

Ideally, the residuals should:

  • Be centered around 0 (no consistent over- or underestimation)
  • Have roughly constant spread (homoscedasticity)
  • Show no systematic patterns (randomness)

Patterns in residuals can indicate problems with the regression model and may tie back to assumptions about the distribution of errors.
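
As a minimal sketch, residuals for the hypothetical data from Section 8.2 can be computed directly (the fit is repeated here so the snippet runs on its own):

    import numpy as np

    # Same hypothetical data as the Section 8.2 sketch
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([55.0, 61.0, 64.0, 71.0, 73.0, 80.0])
    beta1_hat, beta0_hat = np.polyfit(x, y, 1)

    # Residuals: observed values minus the values predicted by the fitted line
    y_hat = beta0_hat + beta1_hat * x
    residuals = y - y_hat

    # With an intercept in the model, least squares residuals average to essentially zero
    print(f"Mean residual: {residuals.mean():.4f}")
    print(residuals)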

8.5 Assumptions

  • Linearity: \( Y \) changes linearly with \( X \)
  • Independence: Observations are independent of each other
  • Homoscedasticity: Constant variance of residuals across all \( X \)
  • Normality of errors: The residuals follow a normal distribution
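
One informal way to check these assumptions is to plot residuals against fitted values and apply a rough normality test to the residuals. A minimal sketch, continuing the earlier hypothetical example and assuming matplotlib and SciPy are available:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Hypothetical data and fit from the earlier sketches
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([55.0, 61.0, 64.0, 71.0, 73.0, 80.0])
    beta1_hat, beta0_hat = np.polyfit(x, y, 1)
    fitted = beta0_hat + beta1_hat * x
    residuals = y - fitted

    # Residuals vs. fitted values: look for a roughly even band around zero
    plt.scatter(fitted, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals vs. fitted values")
    plt.show()

    # Shapiro-Wilk test: a rough numerical check of the normality assumption
    stat, p_value = stats.shapiro(residuals)
    print(f"Shapiro-Wilk p-value: {p_value:.3f}")

With only a handful of points such checks are merely suggestive; with real data, the goal is a roughly even band of residuals around zero and no strong evidence against normality.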

8.6 Cautions and Practical Notes

  • Correlation is not causation: A statistical relationship does not imply that changes in \( X \) cause changes in \( Y \)
  • Outliers: Extreme values can distort the regression line
  • Extrapolation: Making predictions outside the range of observed \( X \)-values is risky and often unreliable

Regression is a powerful and foundational tool in data science, but it requires careful interpretation and consideration of context and assumptions.