Chapter 7: Joint Distribution and Covariance

In the real world, variables are rarely isolated from one another: when one changes, another often changes alongside it. This concept is central to data science, where much of the work involves identifying patterns and potential causal links between variables to make predictions or inform decisions. Understanding these relationships begins with two foundational concepts: joint distributions and covariance.

7.1 Joint Distributions

The joint distribution gives us the probability of a combination of values occurring together. Previously we asked “What is the chance that \(X = 5\)?” — now we ask “What is the chance that \(X = 5\) and \(Y = 3\)?”.

There are two main types of joint distributions:

  • Discrete joint distributions: Probabilities are assigned to individual pairs of values using tables.
  • Continuous joint distributions (or joint densities): Relative likelihoods are described by a density function \(f(x, y)\), visualized as a surface over the \((X, Y)\) plane.

For discrete variables, consider two random variables \(X\) and \(Y\). Let \(X\) be the outcome of a coin flip: 1 for heads, 0 for tails. Let \(Y\) be the outcome of a die roll (1 through 6). A joint probability table can show all \(P(X = x, Y = y)\) values. From this, we can compute:

  • Marginal probabilities: e.g., \(P(X = 1)\)
  • Conditional probabilities: e.g., \(P(Y = 6 \mid X = 1)\)
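
As a concrete sketch in Python (assuming a fair coin and a fair die, so every cell of the table has probability \(1/2 \cdot 1/6 = 1/12\); the variable names are just placeholders):

```python
import numpy as np

# Joint probability table: rows index X (0 = tails, 1 = heads),
# columns index Y (die faces 1..6). Fair coin, fair die, independent,
# so every cell holds (1/2) * (1/6) = 1/12.
joint = np.full((2, 6), 1 / 12)

# Marginal probabilities: sum out the other variable.
p_x = joint.sum(axis=1)   # P(X = 0), P(X = 1)
p_y = joint.sum(axis=0)   # P(Y = 1), ..., P(Y = 6)

# Conditional probability: P(Y = 6 | X = 1) = P(X = 1, Y = 6) / P(X = 1).
p_y6_given_x1 = joint[1, 5] / p_x[1]

print(p_x)            # [0.5 0.5]
print(p_y6_given_x1)  # 0.1667: conditioning changes nothing here (independence)
```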

For continuous variables, instead of tables, we use 3D surfaces. The joint probability density function is denoted \(f(x, y)\). Taller areas of the surface represent more likely combinations. The total volume under the surface equals 1.

From a joint density function, we can also compute:

  • Marginal densities: Integrate out one variable, e.g. \( f_X(x) = \int f(x, y)\, dy \).
  • Conditional densities: Fix one variable and examine the distribution of the other: \( f_{Y \mid X}(y \mid x) = f(x, y) / f_X(x) \).
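
As a numerical sketch of both ideas, here is a check for one particular joint density, a bivariate normal with correlation 0.5 (an illustrative assumption, using scipy):

```python
from scipy.stats import multivariate_normal
from scipy.integrate import dblquad, quad

# Joint density f(x, y): bivariate normal with correlation 0.5.
rv = multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]])

# dblquad integrates func(y, x), with y as the inner variable.
f = lambda y, x: rv.pdf([x, y])

# Total volume under the surface should be 1 (limits of +/-8 are wide
# enough that the neglected tails are negligible).
volume, _ = dblquad(f, -8, 8, -8, 8)
print(volume)    # ~1.0

# Marginal density of X at x = 0: integrate out y.
fx_at_0, _ = quad(lambda y: f(y, 0), -8, 8)
print(fx_at_0)   # ~0.3989, the standard normal density at 0
```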

Joint distributions are powerful: they let us examine dependence between variables, compute correlations, fit regression models, and understand the structure of multivariate data.

7.2 Covariance

Covariance quantifies how two random variables move together. If they tend to increase or decrease together, the covariance is positive. If one increases while the other decreases, it’s negative. If there’s no consistent pattern, the covariance is near zero.

The formal definition is:

\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \]

This measures the expected value of the product of deviations from the mean. When deviations in \(X\) and \(Y\) share the same sign, the product is positive, pulling the covariance upward; when they have opposite signs, the product is negative, pulling it downward.
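
A short simulation makes the definition concrete (the coefficient 2 and the sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)   # y tends to move with x; Cov should be ~2

# Covariance straight from the definition: the average product of
# deviations from each variable's mean.
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# The same quantity via numpy; bias=True divides by n, matching above.
cov_np = np.cov(x, y, bias=True)[0, 1]

print(cov_def, cov_np)   # both ~2.0
```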

7.3 Properties of Covariance

  • Constants have no variation: \( \text{Cov}(X, c) = 0 \)
  • Symmetry: \( \text{Cov}(X, Y) = \text{Cov}(Y, X) \)
  • Alternative form: \( \text{Cov}(X, Y) = \mathbb{E}[XY] - \mu_X \mu_Y \)
  • If \(X\) and \(Y\) are independent, then \( \text{Cov}(X, Y) = 0 \) (the converse does not hold: zero covariance does not imply independence)
  • Additivity: \( \text{Cov}(X + Y, Z) = \text{Cov}(X, Z) + \text{Cov}(Y, Z) \)
  • Scaling: \( \text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y) \)
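
Several of these properties are easy to sanity-check by simulation; a minimal sketch (distributions and constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)

# Population-style covariance (divide by n), matching the definition.
cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))

print(np.isclose(cov(x, y), cov(y, x)))                             # symmetry
print(np.isclose(cov(x, y), np.mean(x * y) - x.mean() * y.mean()))  # alternative form
print(np.isclose(cov(3 * x, 5 * y), 15 * cov(x, y)))                # scaling
print(np.isclose(cov(x, np.full_like(x, 7.0)), 0.0))                # constant
```

All four checks print True. Independence, by contrast, can only be checked approximately by simulation: with independent samples the estimated covariance is near zero but not exactly zero.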

7.4 Variance of a Sum

To compute the variance of a sum, we must include covariance terms. The formula is:

\[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \cdot \text{Cov}(X, Y) \]

Also note: \( \text{Cov}(X, X) = \text{Var}(X) \)
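
The formula follows from the properties in Section 7.3, expanding variance as the covariance of a variable with itself:

\[ \text{Var}(X + Y) = \text{Cov}(X + Y, X + Y) = \text{Cov}(X, X) + 2\,\text{Cov}(X, Y) + \text{Cov}(Y, Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y) \]

In particular, when \(X\) and \(Y\) are independent, the covariance term vanishes and \( \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \).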

7.5 From Covariance to Correlation

Covariance has units that depend on both variables (e.g., inches·pounds), making interpretation difficult. To standardize it, we divide by the standard deviations:

\[ \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]

This is the definition of correlation, which ranges from -1 to 1 and provides a unit-free measure of linear association. Correlation is a cornerstone metric in data science.
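
A closing sketch, computing correlation directly from the formula and comparing with numpy's built-in (the coefficients below are chosen so the true correlation is \(0.8\)):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
y = 0.8 * x + 0.6 * rng.normal(size=10_000)   # constructed so Corr(X, Y) = 0.8

# Correlation = covariance divided by the product of standard deviations.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / (x.std() * y.std())

print(corr)                      # ~0.8
print(np.corrcoef(x, y)[0, 1])   # matches the formula above
```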