Probability Concepts

background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

## .whitish[Probability Concepts]

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 10 | Spring 2022]


---

### Announcements

* P-Set 8 is due on Wednesday.

****************************

### Goals for Today

* Recap probability concepts

* Learn about **conditional** probabilities and practice computing them


* Learn important **named** random variables

* Start approximating sampling distributions


---

### Random Variables

For a random variable, `$X$`, we care about its:

* **Distribution**: Probability function:

`$$p(x) = P(X = x)$$`

* **Center**: Mean of a discrete RV:

$$
\mu = \sum x p(x)
$$

* **Spread**: Standard deviation of a discrete RV:

$$
\sigma = \sqrt{ \sum (x - \mu)^2 p(x)}
$$
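
As a quick check on the two formulas above, here is a minimal Python sketch that computes the mean and standard deviation of a discrete RV straight from its probability function. The values in `pmf` are made up for illustration and do not come from the course.

```python
import math

# Hypothetical probability function p(x) for a discrete RV (values are made up).
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

# A probability function must sum to 1.
assert math.isclose(sum(pmf.values()), 1.0)

# Mean: sum of x * p(x) over all possible values x.
mu = sum(x * p for x, p in pmf.items())

# Standard deviation: square root of the sum of (x - mu)^2 * p(x).
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))

print(f"mean = {mu:.2f}, sd = {sigma:.2f}")
```

The same three lines work for any finite probability function: only the `pmf` dictionary changes.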

---

### Random Variables

If a random variable, `$X$`, is a **continuous** RV, then it can take on any value in an interval.

* Probability function: 
    + `$P(X = x) = 0$` so

$$
p(x) \color{orange}{\approx} P(X = x)
$$

But if `$p(4) > p(2)$`, that still means that `$X$` is more likely to take on values around 4 than values around 2.

---

### Random Variables: Continuous

Change `$\sum$` to `$\color{orange}{\int}$`:

* `$\color{orange}{\int} p(x) dx = 1$`.

* Center: Mean/Expected value:

$$
\mu = \color{orange}{\int} x p(x) dx
$$

* Spread: Standard deviation:

$$
\sigma = \sqrt{ \color{orange}{\int} (x - \mu)^2 p(x) dx}
$$
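
To see the integrals in action without doing calculus by hand, the sketch below approximates them numerically. It assumes a hypothetical density `$p(x) = 2x$` on `$[0, 1]$`, chosen only because the exact answers are easy to verify, and uses a simple midpoint rule rather than any particular library routine.

```python
import math

def density(x):
    """Hypothetical density p(x) = 2x on [0, 1] (chosen only for illustration)."""
    return 2 * x

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Total area under the density should be 1.
total = integrate(density, 0, 1)

# Mean: integral of x * p(x).
mu = integrate(lambda x: x * density(x), 0, 1)

# Standard deviation: square root of the integral of (x - mu)^2 * p(x).
sigma = math.sqrt(integrate(lambda x: (x - mu) ** 2 * density(x), 0, 1))

print(f"area = {total:.4f}, mean = {mu:.4f}, sd = {sigma:.4f}")
```

For this density the exact values are area 1, mean 2/3, and standard deviation `$\sqrt{1/18}$`, so the printed numbers can be checked against them.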

---

### Example from Last Class:

Suppose 4 students have still not received their graded Stat 100 Midterm (yes, let's pretend we actually have hand-written work) and that I hand back the exams randomly to each student.  Let X = the number of students who get their correct exam.

**Questions:**

* Let's say the students' names are A(licia), B(ob), C(olin), and D(onna) and they are sitting in a row ABCD.  One possible outcome is ABDC (1st exam goes to A, 2nd to B, 3rd to D, 4th to C).  In that case, what does X equal?

* List out all possible outcomes.  And for each outcome, determine what X equals.

* Why is P(X = 3) = 0?

* Write out the probability distribution for X.

* Determine the mean value of X.

* Determine the standard deviation of X.

* What is the probability that at least one student gets their correct exam?
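
One way to check your answers to these questions is to let a computer list the outcomes. The Python sketch below enumerates all 24 orderings, assuming (as in the ABDC example) that the i-th letter of an outcome is the student who receives the i-th exam and that every ordering is equally likely. The variable and function names are illustrative.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import permutations

students = "ABCD"  # seating order; exam i belongs to the student in seat i

def matches(outcome):
    """X = number of students who receive their own exam in this outcome."""
    return sum(got == owner for got, owner in zip(outcome, students))

# All equally likely ways to hand back the four exams.
outcomes = list(permutations(students))

# Probability function of X: count outcomes for each value, divide by 24.
counts = Counter(matches(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

mu = sum(x * p for x, p in pmf.items())
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))
p_at_least_one = 1 - pmf[0]

print("X for ABDC:", matches(tuple("ABDC")))
for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
print("mean:", mu, "sd:", sigma, "P(X >= 1):", p_at_least_one)
```

Notice that 3 never appears as a key of `pmf`: if three students have their own exams, the fourth exam can only belong to the fourth student.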

---

## Why do we care about random variables?

### We will recast our sample statistics as random variables.

### This gives us a probability function that approximates the sampling distribution!

---

### Conditional Probabilities

> "Conditioning is the soul of statistics." — Joe Blitzstein, Stat 110 Prof

**Question**: What do we mean by "conditioning"?

Most polar bears are twins. Therefore, if you're a twin, you're probably a polar bear.

* P(twin given polar bear) `$\neq$` P(polar bear given twin)

The p-value is a conditional probability.

* P-value = P(data given `$H_o$`) ( `$\neq$` P( `$H_o$` given data))

---

### Conditional Probabilities

**Other favorite examples:**

* P(have COVID given wear mask) `$\neq$` P(wear mask given have COVID)
    + In a CDC study, P(wear mask given have COVID) = 0.71 while P(have COVID given wear mask) is much lower.
    
--

* P(it is raining given there are clouds directly overhead) `$\neq$` P(there are clouds directly overhead given it is raining)

**Notation**: P(A given B) = P(A | B)

---

### Example

Testing for COVID-19 is an important part of *Keep Harvard Healthy*.  There are a variety of COVID-19 tests out there but for this problem let's assume the following:

* The test gives a **false negative** result 13% of the time where a false negative case is a person with COVID-19 but the test says they don't have it.
* The test gives a **false positive** result 5% of the time where a false positive case is a person who doesn't have COVID-19 but the test says they do.

Let's assume the true prevalence is 1%.  Each week they test about 30,000 Harvard affiliates.  Use the assumed percentages to fill in the following table of potential outcomes:

|                              | Positive Test Result | Negative Test Result | Total  |
|------------------------------|----------------------|----------------------|--------|
| Actually have COVID-19       |                      |                      |        |
| Actually don't have COVID-19 |                      |                      |        |
| Total                        |                      |                      | 30,000 |

* P(test - | have COVID) =

* P(have COVID | test -) =


* P(test + | don't have COVID) =

* P(don't have COVID | test +) =
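
After filling in the table by hand, you can compare against the following Python sketch. It uses only the assumed numbers from this slide (30,000 people, 1% prevalence, 13% false negative rate, 5% false positive rate). The variable names are illustrative.

```python
n = 30_000
prevalence = 0.01
false_negative_rate = 0.13  # P(test - | have COVID)
false_positive_rate = 0.05  # P(test + | don't have COVID)

# Row totals.
have = round(n * prevalence)
dont = n - have

# Split each row into positive and negative test results.
have_neg = round(have * false_negative_rate)
have_pos = have - have_neg
dont_pos = round(dont * false_positive_rate)
dont_neg = dont - dont_pos

# Conditional probabilities: divide by the total of the group we condition on.
p_neg_given_have = have_neg / have
p_have_given_neg = have_neg / (have_neg + dont_neg)
p_pos_given_dont = dont_pos / dont
p_dont_given_pos = dont_pos / (have_pos + dont_pos)

print("have COVID :", have_pos, have_neg, have)
print("don't have :", dont_pos, dont_neg, dont)
print("total      :", have_pos + dont_pos, have_neg + dont_neg, n)
print(f"P(test - | have COVID)       = {p_neg_given_have:.4f}")
print(f"P(have COVID | test -)       = {p_have_given_neg:.4f}")
print(f"P(test + | don't have COVID) = {p_pos_given_dont:.4f}")
print(f"P(don't have COVID | test +) = {p_dont_given_pos:.4f}")
```

The key design choice is the denominator: conditioning on a row uses the row total, while conditioning on a test result uses the column total.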


---

### Example

The false negative rate of COVID-19 tests has varied wildly.  One paper estimated it could be as high as 54%.

Recreate the table with this new false negative rate.

* P(test - | have COVID) =

* P(have COVID | test -) =


* P(test + | don't have COVID) =

* P(don't have COVID | test +) =
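
Instead of rebuilding the whole table, the same conditional probabilities can be computed directly from the rates. The Python sketch below does this with Bayes' rule, a standard identity that is not named on these slides, keeping the assumed 1% prevalence and 5% false positive rate and comparing false negative rates of 13% and 54%. The function names are illustrative.

```python
def p_have_given_negative(prevalence, fn_rate, fp_rate):
    """P(have COVID | test -) via Bayes' rule."""
    have_and_neg = fn_rate * prevalence
    dont_and_neg = (1 - fp_rate) * (1 - prevalence)
    return have_and_neg / (have_and_neg + dont_and_neg)

def p_dont_given_positive(prevalence, fn_rate, fp_rate):
    """P(don't have COVID | test +) via Bayes' rule."""
    dont_and_pos = fp_rate * (1 - prevalence)
    have_and_pos = (1 - fn_rate) * prevalence
    return dont_and_pos / (dont_and_pos + have_and_pos)

for fn_rate in (0.13, 0.54):
    a = p_have_given_negative(0.01, fn_rate, 0.05)
    b = p_dont_given_positive(0.01, fn_rate, 0.05)
    print(f"FN rate {fn_rate:.2f}: "
          f"P(have | test -) = {a:.4f}, P(don't have | test +) = {b:.4f}")
```

Each numerator and denominator here is a table cell divided by 30,000, so the results agree with the counts-based approach.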


---

## Discuss Specific Named Random Variables

---

### Specific Named Random Variables

* There are an **infinite** number of random variables out there.

* But there are a few particular ones that we will find useful.
    + Because these ones are used often, they have been given names.

* We will identify these named RVs using the following format:

$$
X \sim \mbox{Name(values of key parameters)}
$$

---

### Specific Named Random Variables

(1) `$X \sim$` Bernoulli `$(p)$`

`\begin{align*}
X=   \left\{
\begin{array}{ll}
      1 & \mbox{success} \\
      0 & \mbox{failure} \\
\end{array} 
\right.  
\end{align*}`

* Important parameter: `$p$` = probability of success = `$P(X = 1)$`


* Probability Function:

| `$x$`    | 0   | 1 |
|------|-----|---|
| `$p(x)$` | 1 - `$p$` | `$p$` |


---


### Specific Named Random Variables

(1) `$X \sim$` Bernoulli `$(p)$`

`\begin{align*}
X=   \left\{
\begin{array}{ll}
      1 & \mbox{success} \\
      0 & \mbox{failure} \\
\end{array} 
\right.  
\end{align*}`

* Probability Function:

| `$x$`    | 0   | 1 |
|------|-----|---|
| `$p(x)$` | 1 - `$p$` | `$p$` |


* Mean: `$1*p + 0*(1 - p) = p$`

* Standard deviation: `$\sqrt{(1 - p)^2*p + (0 - p)^2*(1 - p)} = \sqrt{p(1 - p)}$`
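
The mean and standard deviation formulas can be sanity-checked by simulation. The Python sketch below draws many Bernoulli values and compares the simulated mean and standard deviation to `$p$` and `$\sqrt{p(1 - p)}$`. The value `p = 0.3`, the seed, and the number of draws are arbitrary choices for illustration.

```python
import math
import random

p = 0.3                   # hypothetical probability of success
rng = random.Random(100)  # fixed seed so the run is reproducible

# Each draw is 1 (success) with probability p and 0 (failure) otherwise.
draws = [1 if rng.random() < p else 0 for _ in range(100_000)]

sim_mean = sum(draws) / len(draws)
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in draws) / len(draws))

formula_mean = p
formula_sd = math.sqrt(p * (1 - p))

print(f"simulated: mean = {sim_mean:.3f}, sd = {sim_sd:.3f}")
print(f"formula:   mean = {formula_mean:.3f}, sd = {formula_sd:.3f}")
```

With this many draws the simulated values should land close to the formula values, though they will not match exactly.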


---

### Specific Named Random Variables

(2) `$X \sim$` Normal `$(\mu, \sigma)$`

* Probability Function:

$$
p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp{\left(-\frac{(x - \mu)^2}{2\sigma^2} \right)}
$$
where `$-\infty < x < \infty$`

* Mean: `$\mu$`

* Standard deviation: `$\sigma$`


**Notes:**

(a) Area under the curve = 1.

(b) Height `$\approx$` how likely values are to occur
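
Both notes can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module with `$\mu = 0$` and `$\sigma = 1$`, which are illustrative choices rather than values from the course.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)  # illustrative parameter values

# (b) Height of the curve: larger near the mean than far from it.
height_at_mean = X.pdf(0)
height_far = X.pdf(2)

# Probabilities are areas under the curve, computed as differences of the CDF.
within_one_sd = X.cdf(1) - X.cdf(-1)

# (a) Nearly all of the area lies within 8 standard deviations of the mean.
nearly_all = X.cdf(8) - X.cdf(-8)

print(f"height at 0 = {height_at_mean:.4f}, height at 2 = {height_far:.4f}")
print(f"P(-1 < X < 1) = {within_one_sd:.4f}")
print(f"area between -8 and 8 = {nearly_all:.6f}")
```

The heights themselves are not probabilities, which is why the interval probability comes from `cdf` rather than `pdf`.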


---

### Specific Named Random Variables

(2) `$X \sim$` Normal `$(\mu, \sigma)$`

**Normal will be a good approximation for MANY distributions.**

But sometimes its **tails** just aren't fat enough.

---

### Specific Named Random Variables

(3) `$X \sim$` t(df)

* Probability Function:

$$
p(x) = \frac{\Gamma\left(\frac{\mbox{df} + 1}{2}\right)}{\sqrt{\mbox{df}\pi}\, \Gamma\left(\frac{\mbox{df}}{2}\right)}\left(1 + \frac{x^2}{\mbox{df}} \right)^{-\frac{\mbox{df} + 1}{2}}
$$
where `$-\infty < x < \infty$`

* Mean: 0 (for df > 1)

* Standard deviation: `$\sqrt{\mbox{df}/(\mbox{df} - 2)}$` (for df > 2)
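
To see what "fatter tails" means in numbers, the Python sketch below evaluates the t density, in its standard form with `$\Gamma((\mbox{df} + 1)/2)$` in the numerator, and compares it to the standard Normal density. The choice `df = 3` and the evaluation points are arbitrary, and `t_density` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x ** 2 / df) ** (-(df + 1) / 2)

z = NormalDist()  # standard Normal, mean 0 and sd 1
df = 3            # illustrative degrees of freedom

for x in (0, 2, 4):
    print(f"x = {x}: t density = {t_density(x, df):.5f}, "
          f"Normal density = {z.pdf(x):.5f}")
```

Near 0 the Normal curve is taller, but far from 0 the t curve sits well above it, so extreme values are more likely under the t.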


---

## Time to Start Viewing Sample Statistics also as Random Variables

---

### Sample Statistics as Random Variables

Here are some of the sample statistics we've seen lately:

* `$\hat{p}$` = sample proportion of correct receiver guesses out of 329 trials

* `$r$` = correlation between audience and critic ratings for a sample of movies

* `$\hat{p}_D - \hat{p}_Y$` = difference in improvement proportions between those who swam with dolphins and those who did yoga

Why are these all random variables?

But none of these are Bernoulli RVs.

Nor are they Normal RVs.

Nor are they t RVs.

> "All models are wrong but some are useful."  -- George Box

---

### Approximating These Distributions

* `$\hat{p}$` = sample proportion of correct receiver guesses out of 329 trials

* We generated its Null Distribution:


* Its Null Distribution is well approximated by the probability function of a N(0.25, 0.024).
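
The sketch below shows one way such a null distribution could be simulated and summarized in Python. It assumes the null hypothesis is a 0.25 chance of a correct guess on each of the 329 trials (the center of the Normal above). The number of repetitions and the seed are arbitrary, and this is not necessarily how the distribution was generated in class.

```python
import math
import random

n_trials = 329  # trials in the study
p_null = 0.25   # assumed chance of a correct guess under the null
reps = 2_000    # number of simulated samples (arbitrary)
rng = random.Random(2022)

# Each simulated sample proportion: fraction of correct guesses in 329 trials.
p_hats = [
    sum(rng.random() < p_null for _ in range(n_trials)) / n_trials
    for _ in range(reps)
]

null_mean = sum(p_hats) / reps
null_sd = math.sqrt(sum((ph - null_mean) ** 2 for ph in p_hats) / reps)

print(f"simulated null distribution: mean = {null_mean:.3f}, sd = {null_sd:.3f}")
```

The simulated mean and standard deviation should come out near 0.25 and 0.024, the two parameters of the approximating Normal.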


---

### Approximating These Distributions

* `$r$` = correlation between audience and critic ratings for a sample of movies

* In your p-set this week, you are generating its Null Distribution:


* Its Null Distribution is well approximated by the probability function of a N(0, 0.023).


---

### Approximating These Distributions

* `$\hat{p}_D - \hat{p}_Y$` = difference in improvement proportions between those who swam with dolphins and those who did yoga

* In your p-set last week, you generated its Null Distribution:


* Its Null Distribution is kinda somewhat well-ish approximated by the probability function of a N(0, 0.16).


---

### Approximating These Distributions

* How did I know what was a **good** approximation distribution?

* How do these distributions help us??