background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Probability Concepts]
## .base-blue[and the]
## .base-blue[Central Limit Theorem]

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 12 | Fall 2022]

]

---

### Announcements

* Week of Thanksgiving:
    + No sections or wrap-up sessions
    + Normal OH schedule for Sun Nov 20th - Tues Nov 22nd
    + No OHs from Wed Nov 23rd - Sun Nov 27th

****************************

--

### Goals for Today

.pull-left[
* Recap probability concepts
* Cover continuous random variables
]

.pull-right[
* Learn important **named** random variables
* Start approximating sampling distributions with the **Central Limit Theorem**
]

---

### Random Variables

For a **discrete** random variable, `\(X\)`, we care about its:

* **Distribution**: Probability function:

`$$p(x) = P(X = x)$$`

* **Center**: Mean of a discrete RV:

$$
\mu = \sum x p(x)
$$

* **Spread**: Standard deviation of a discrete RV:

$$
\sigma = \sqrt{ \sum (x - \mu)^2 p(x)}
$$

---

### Random Variables

If a random variable, `\(X\)`, is a **continuous** RV, then it can take on any value in an interval.

* Probability function:
    + `\(P(X = x) = 0\)`, so

--

$$
p(x) \color{orange}{\approx} P(X = x)
$$

but if `\(p(4) > p(2)\)`, that still means that `\(X\)` is more likely to take on values around 4 than values around 2.

---

### Random Variables: Continuous

Change `\(\sum\)` to `\(\color{orange}{\int}\)`:

--

* `\(\color{orange}{\int} p(x) dx = 1\)`.

--

* Center: Mean/Expected value:

$$
\mu = \color{orange}{\int} x p(x) dx
$$

--

* Spread: Standard deviation:

$$
\sigma = \sqrt{ \color{orange}{\int} (x - \mu)^2 p(x) dx}
$$

---

class: middle, center

## Why do we care about random variables?

--

#### We will recast our sample statistics as random variables.

--

#### This gives us a probability function that approximates the sampling distribution!
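---

### Random Variables: Checking the Formulas

The discrete mean and standard deviation formulas are easy to verify numerically. A small sketch in Python (the fair four-sided die here is a made-up example, not from the deck):

```python
import math

# Probability function for a hypothetical fair four-sided die:
# p(x) = 0.25 for x = 1, 2, 3, 4
dist = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

# mu = sum over x of x * p(x)
mu = sum(x * p for x, p in dist.items())

# sigma = sqrt( sum over x of (x - mu)^2 * p(x) )
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

print(mu)     # 2.5
print(sigma)  # about 1.118
```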
---

class: middle, center

## Specific Named Random Variables

---

### Specific Named Random Variables

* There is a vast array of random variables out there.

--

* But there are a few particular ones that we will find useful.
    + Because these ones are used often, they have been given names.

--

* We will identify these named RVs using the following format:

$$
X \sim \mbox{Name(values of key parameters)}
$$

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

--

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

--

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

--

.pull-right[
* Mean: `\(1*p + 0*(1 - p) = p\)`

* Standard deviation: `\(\sqrt{(1 - p)^2*p + (0 - p)^2*(1 - p)} = \sqrt{p(1 - p)}\)`
]

---

### Specific Named Random Variables

.pull-left[
(2) `\(X \sim\)` Normal `\((\mu, \sigma)\)`

* Probability Function:

$$
p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp{\left(-\frac{(x - \mu)^2}{2\sigma^2} \right)}
$$

where `\(-\infty < x < \infty\)`

* Mean: `\(\mu\)`

* Standard deviation: `\(\sigma\)`
]

--

.pull-right[
<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-left[
**Notes:**

(a) Area under the curve = 1.

(b) Height `\(\approx\)` how likely values are to occur
]

--

.pull-right[
(c) Super special Normal RV: `\(Z \sim\)` Normal `\((\mu = 0, \sigma = 1)\)`.
]

---

### Specific Named Random Variables

(2) `\(X \sim\)` Normal `\((\mu, \sigma)\)`

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-5-1.png" width="576" style="display: block; margin: auto;" />

**Normal will be a good approximation for MANY distributions.**

--

But sometimes its **tails** just aren't fat enough.
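---

### Aside: Checking the Normal Probability Function

The two Normal facts above (the area under the curve is 1, and the curve is tallest near `\(\mu\)`) can be checked numerically. A rough sketch in Python (the grid limits and step size are arbitrary choices, not from the deck):

```python
import math

def normal_pdf(x, mu, sigma):
    # The Normal probability function from the slides
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma = 0.0, 1.0   # the "super special" Z ~ Normal(0, 1)
dx = 0.001
grid = [i * dx for i in range(-8000, 8001)]   # -8 to 8 covers essentially all the area

# Riemann sum approximation of the area under the curve
area = sum(normal_pdf(x, mu, sigma) * dx for x in grid)

# Grid point where the curve is tallest
peak = max(grid, key=lambda x: normal_pdf(x, mu, sigma))

print(round(area, 3))  # 1.0  (area under the curve)
print(peak)            # 0.0  (tallest at mu)
```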
---

### Specific Named Random Variables

.pull-left[
(3) `\(X \sim\)` t(df)

* Probability Function:

$$
p(x) = \frac{\Gamma(2^{-1}(\mbox{df} + 1))}{\sqrt{\mbox{df}\pi} \Gamma(2^{-1}\mbox{df})}\left(1 + \frac{x^2}{\mbox{df}} \right)^{-\frac{\mbox{df} + 1}{2}}
$$

where `\(-\infty < x < \infty\)`

* Mean: 0

* Standard deviation: `\(\sqrt{\mbox{df}/(\mbox{df} - 2)}\)` (for df > 2)
]

--

.pull-right[
* Probability Function:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />
]

---

class: middle, center

## Time to Start Viewing Sample Statistics also as Random Variables

---

### Sample Statistics as Random Variables

Here are some of the sample statistics we've seen lately:

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

--

Why are these all random variables?

--

But none of these are **Bernoulli** random variables.

--

Nor are they **Normal** random variables.

--

Nor are they **t** random variables.

--

#### *"All models are wrong but some are useful."* -- George Box

---

### Approximating These Distributions

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

--

.pull-left[
* We generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

* We generated its Null Distribution:

.pull-left[
<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is well approximated by the probability function of a N(0.25, 0.024).
]

---

### Approximating These Distributions

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

--

.pull-left[
* In your p-set this week, you are generating its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

.pull-left[
* In your p-set this week, you are generating its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is well approximated by the probability function of a N(0, 0.056).
]

---

### Approximating These Distributions

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

--

.pull-left[
* In your p-set last week, you generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

.pull-left[
* In your p-set last week, you generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-12-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is kinda somewhat well-ish approximated by the probability function of a N(0, 0.16).
]

---

### Approximating These Distributions

* How do I know **which** probability function is a good approximation for my sample statistic's distribution?
--

* Once I have figured out a probability function that approximates the distribution of my sample statistic, how do I **use it** to do statistical inference?

---

### Approximating Sampling Distributions

--

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

--

**Example**: Trees in Mount Tabor

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />

---

### Approximating Sampling Distributions

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

**Example**: Trees in Mount Tabor

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-14-1.png" width="576" style="display: block; margin: auto;" />

--

But **which** Normal? (What is the value of `\(\mu\)` and `\(\sigma\)`?)

---

### Approximating Sampling Distributions

**Question**: But **which** Normal? (What is the value of `\(\mu\)` and `\(\sigma\)`?)

--

* The sampling distribution of a statistic is always centered around the true value of the parameter.

--

* The CLT also provides formula estimates of the standard error.
    + The formula varies based on the statistic.

---

### Approximating the Sampling Distribution of a Sample Proportion

CLT says: For large `\(n\)` (at least 10 successes and 10 failures),

--

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

--

**Example**: Trees in Mount Tabor

--

* Parameter: `\(p\)` = proportion of Douglas Fir = 0.615

--

* Statistic: `\(\hat{p}\)` = proportion of Douglas Fir in a sample of 50 trees

--

$$
\hat{p} \sim N \left(0.615, \sqrt{\frac{0.615(1-0.615)}{50}} \right)
$$

--

**NOTE**: We can plug in the true parameter here because we had data on the whole population.

---

### Approximating the Sampling Distribution of a Sample Proportion

**Question**: What do we do when we don't have access to the whole population?
--

* We have:

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

--

**Answer**: We will have to estimate the SE.

---

### Approximating the Sampling Distribution of a Sample Mean

* There is a version of the CLT for many of our sample statistics.

* For the sample mean, the CLT says: For large `\(n\)` (at least 30 observations),

$$
\bar{x} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)
$$

---

class: middle, center

## Now we need to use the approximate distribution of the sample statistic (given by the CLT) to construct confidence intervals and to conduct hypothesis tests!
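---

### Aside: Simulating the CLT for a Sample Proportion

The Mount Tabor claim can be checked by simulation: draw many samples of `\(n = 50\)` trees where each tree is Douglas Fir with probability `\(p = 0.615\)`, and compare the simulated mean and SE of `\(\hat{p}\)` to the CLT's `\(N\left(p, \sqrt{p(1-p)/n}\right)\)`. A sketch in Python (the seed and number of repetitions are arbitrary choices, not from the deck):

```python
import math
import random

random.seed(100)
p, n, reps = 0.615, 50, 20000   # Mount Tabor: p = proportion of Douglas Fir

phats = []
for _ in range(reps):
    # One sample of n trees: 1 = Douglas Fir with probability p
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    phats.append(sum(sample) / n)

sim_mean = sum(phats) / reps
sim_se = math.sqrt(sum((ph - sim_mean) ** 2 for ph in phats) / reps)
clt_se = math.sqrt(p * (1 - p) / n)   # the CLT's formula for the SE

print(round(sim_mean, 3))                  # close to 0.615
print(round(sim_se, 3), round(clt_se, 3))  # both close to 0.069
```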