Probability Chapter 6: Density

6.1 Introduction

Now that we understand the importance of the normal distribution through the Central Limit Theorem, we can dive into deeper applications involving the density of a distribution. In simple terms, a density curve describes how probability is spread across values, and the entire area under the curve adds up to a total of 1. In a probability sense, 1 corresponds to 100%, which comes back to a basic rule of probability: the probabilities of all possible outcomes must sum to 100%, and no individual probability can be less than 0% or greater than 100%.

We will be using two tools to further our understanding of density: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF). We will also dive into the Standard Normal Distribution (SND) and confidence intervals.

6.2 PDF and CDF

In this text we will not go over the integrals necessary to calculate the density of a distribution. We will show some equations without their proofs, but if you are interested in learning more about them, we welcome you to check out this resource. We will instead put an emphasis on how and when to use each function, and we will also show you how to use the functions in Python.

6.2.1 Probability Density Function (PDF)

The probability density function describes the shape of the normal distribution: it tells us how likely a random variable is to fall near a given value. Where the curve is taller, the probability is more concentrated around that value. Note that the height of the PDF at a single point is not itself a probability; instead, the probability that the variable falls within a certain range is the integral of the PDF over that range, i.e., the area under the curve.


6.2.2 Cumulative Distribution Function (CDF)

The cumulative distribution function gives the probability that a random variable \(X\) takes a value less than or equal to some value \(x\):

\[ \text{CDF}(x) = P(X \leq x) \]

On the normal distribution graph, the CDF at \(x\) is the area under the curve to the left of \(x\). For the normal distribution there is no closed-form formula for the CDF, so we usually calculate it using Z-scores and a table, or with a Python package.

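A minimal sketch of the CDF in Python, again using SciPy with the same illustrative mean of 80 and standard deviation of 5:

```python
from scipy.stats import norm

mu, sigma = 80, 5  # hypothetical test-score distribution

# P(X <= 85): area under the curve to the left of 85
p_left = norm.cdf(85, loc=mu, scale=sigma)

# P(X > 85): the complementary area to the right
p_right = 1 - p_left

print(round(p_left, 4))   # 0.8413
print(round(p_right, 4))  # 0.1587
```

SciPy also provides `norm.sf` (the survival function) as a direct, numerically stabler way to compute the right-tail probability.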

6.3 Standard Normal Distribution (SND)

Any normal distribution, regardless of its mean and standard deviation, can be converted into the standard normal distribution. The SND has a mean of 0 and a standard deviation of 1. The Z-score tells us how many standard deviations the value \(X\) is away from the mean. Once you have the Z-score, you can use the standard normal table to find the needed probabilities.

The equation to calculate the Z-score is:

\[ Z = \frac{X - \mu}{\sigma} \]

Example: Suppose that a test score is 85, the mean is 80, and the standard deviation is 5. What is the Z-score, and what percentile does it correspond to?

\[ Z = \frac{85 - 80}{5} = 1 \]

If we look at the standard normal table, we see that a Z-score of 1 corresponds to an area of about 0.8413 to the left, i.e., roughly the 84th percentile of the distribution.
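The same calculation can be sketched in Python, where `norm.cdf` plays the role of the standard normal table:

```python
from scipy.stats import norm

x, mu, sigma = 85, 80, 5

# Z-score: how many standard deviations x is from the mean
z = (x - mu) / sigma

# Area to the left of z on the standard normal distribution
percentile = norm.cdf(z)

print(z)                     # 1.0
print(round(percentile, 4))  # 0.8413 -- about the 84th percentile
```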

6.4 Confidence Intervals

Confidence intervals are a very important concept to understand for data science, especially when it comes to regression analysis.

Imagine that you collect a sample of data and calculate its mean. For this example, we will use the average height of plants in a field. This gives us one number, but we know that if we sampled another 50 plants we might get a different mean. So how can we use the sample we got to estimate the true average height of all the plants in the field?

This is where the importance of confidence intervals comes in. A confidence interval gives us a range of values that we believe likely contains the true population mean. The core idea is that the sample mean is a random variable; it will vary from sample to sample. But thanks to the Central Limit Theorem, we know that if the sample is large, the distribution of the sample mean will be approximately normal.

The general calculation of the confidence interval is:

\[ \bar{X} \pm z^* \cdot \frac{\sigma}{\sqrt{n}} \]

  • \(\bar{X}\) is the sample mean
  • \(\sigma\) is the population standard deviation
  • \(n\) is the sample size
  • \(z^*\) is the critical value from the standard normal distribution that corresponds to the desired confidence level

Critical values:

  • 90% confidence interval: \(z^* = 1.645\)
  • 95% confidence interval: \(z^* = 1.96\)
  • 99% confidence interval: \(z^* = 2.576\)

Imagine a bell curve centered at \(\bar{X}\), your sample mean. A confidence interval is like marking off a section of that curve so that a certain percentage of area (like 95%) is captured in the middle.

You're saying: “I’m 95% confident that the true mean falls within this range.” It doesn’t mean the true mean has a 95% chance of being in your specific interval — that’s a common misunderstanding. The correct interpretation is that 95% of intervals built this way would capture the true mean.
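This interpretation can be checked with a quick simulation: draw many samples from a population with a known mean, build a 95% interval from each, and count how often the interval captures the truth. The population parameters below (mean 24.5 cm, standard deviation 3 cm) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 24.5, 3, 50  # hypothetical "true" population and sample size
z_star = 1.96               # critical value for 95% confidence
trials = 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)
    x_bar = sample.mean()
    margin = z_star * sigma / np.sqrt(n)
    # Does this particular interval capture the true mean?
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

# The coverage rate is typically very close to 0.95
print(covered / trials)
```

Roughly 95% of the simulated intervals contain the true mean, exactly as the interpretation above claims.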

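As a worked sketch, suppose a sample of 50 plants has a mean height of 24.5 cm and the population standard deviation is known to be 3 cm (all numbers here are hypothetical). A 95% confidence interval follows directly from the formula above:

```python
import math

x_bar, sigma, n = 24.5, 3, 50  # hypothetical sample mean, known sigma, sample size
z_star = 1.96                  # critical value for a 95% confidence interval

# Margin of error: z* * sigma / sqrt(n)
margin = z_star * sigma / math.sqrt(n)
low, high = x_bar - margin, x_bar + margin

print(round(low, 2), round(high, 2))  # 23.67 25.33
```

So we are 95% confident that the true mean plant height lies between about 23.67 cm and 25.33 cm.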

6.5 Interpreting Confidence Intervals

  • The sample must be random and representative.
  • The CLT tells us the sample mean is approximately normal, especially when \(n \geq 30\).
  • A confidence interval gets wider when the confidence level increases, the population standard deviation is larger, or the sample size is smaller.
  • A confidence interval gets narrower when the confidence level is lower or when the sample size is larger.
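The width behavior in the list above is easy to verify numerically. Using the same hypothetical plant-height numbers (sigma of 3 cm, sample of 50), we can compare interval widths across confidence levels:

```python
import math

sigma, n = 3, 50  # hypothetical: sigma = 3 cm, sample size 50

widths = {}
for level, z_star in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
    # Full interval width is twice the margin of error
    widths[level] = 2 * z_star * sigma / math.sqrt(n)
    print(f"{level:.0%} CI width: {widths[level]:.2f} cm")
```

Higher confidence demands a wider interval; quadrupling the sample size would halve each width, since \(n\) enters through \(\sqrt{n}\).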

This concept is a cornerstone of statistical inference. It connects sample data to population-level conclusions in a transparent, quantifiable way.