background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Theory-Based Inference]

## .whitish[and]

## .whitish[Sample Size Calculations]

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 11 | Spring 2022]

]

---

background-image: url("img/ggparty.001.jpeg")
background-position: center
background-size: 90%

---

background-image: url("img/ggparty.002.jpeg")
background-position: center
background-size: 90%

---

background-image: url("img/ggparty.003.jpeg")
background-position: center
background-size: 90%

---

class: middle, center

### If you are able to attend, please RSVP here:

### [bit.ly/stat100-ggparty](https://bit.ly/stat100-ggparty)

---

### Announcements

* Project Assignment 3 is due Friday, April 22nd at 5pm.

****************************

--

### Goals for Today

.pull-left[
* Theory-based inference
* Sample size calculations
]

.pull-right[
* Paired data
* Lots of examples
]

---

### Recap:

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

--

.pull-left[

#### Sample Proportion Version

When `\(n\)` is large (at least 10 successes and 10 failures):

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

]

--

.pull-right[

#### Sample Mean Version

When `\(n\)` is large (at least 30):

$$
\bar{x} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)
$$

]

---

### But There Are [Several Versions](https://mcconvil.github.io/stat100s22/inference_procedures.html)!
<table class="table table-responsive table-bordered" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Response </th>
   <th style="text-align:left;"> Explanatory </th>
   <th style="text-align:left;"> Numerical Quantity </th>
   <th style="text-align:left;"> Parameter </th>
   <th style="text-align:left;"> Statistic </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:left;"> `\(\mu\)` </td>
   <td style="text-align:left;"> `\(\bar{x}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> proportion </td>
   <td style="text-align:left;"> `\(p\)` </td>
   <td style="text-align:left;"> `\(\hat{p}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in means </td>
   <td style="text-align:left;"> `\(\mu_1 - \mu_2\)` </td>
   <td style="text-align:left;"> `\(\bar{x}_1 - \bar{x}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in proportions </td>
   <td style="text-align:left;"> `\(p_1 - p_2\)` </td>
   <td style="text-align:left;"> `\(\hat{p}_1 - \hat{p}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> correlation </td>
   <td style="text-align:left;"> `\(\rho\)` </td>
   <td style="text-align:left;"> `\(r\)` </td>
  </tr>
 </tbody>
</table>

---

### Recap:

**Z-score test statistics**:

$$
\mbox{Z-score} = \frac{\mbox{statistic} - \mu}{\sigma}
$$

--

* Usually follows a **standard normal** or a **t** distribution.

--

* Use the approximate distribution to find the p-value.
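
For instance, if a two-sided test produced a Z-score of 2.1 (a hypothetical value for illustration), a sketch of the p-value calculation in R:

```r
# Hypothetical Z-score test statistic
z_score <- 2.1

# Two-sided p-value: area in both tails
# of the standard normal distribution
2 * pnorm(q = abs(z_score), lower.tail = FALSE)
```

```
## [1] 0.03572884
```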
---

### Recap: **Formula-Based** P*100% Confidence Intervals

$$
\mbox{statistic} \pm z^* SE
$$

where `\(P(-z^* \leq Z \leq z^*) = P\)`

--

OR

$$
\mbox{statistic} \pm t^* SE
$$

where `\(P(-t^* \leq t \leq t^*) = P\)`

---

class: inverse, middle, center

## How do we do probability model calculations in `R`?

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1)
```

```
## [1] 0.8413447
```

```r
pt(q = 1, df = 52)
```

```
## [1] 0.8390293
```
]

**Doesn't seem quite right**...

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
```

```
## [1] 0.1586553
```

```r
pt(q = 1, df = 52, lower.tail = FALSE)
```

```
## [1] 0.1609707
```
]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: How do I find the correct critical values `\((z^* \mbox{ or } t^*)\)` for the confidence interval?

<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
qnorm(p = 0.975, mean = 0, sd = 1)
```

```
## [1] 1.959964
```

```r
qt(p = 0.975, df = 52)
```

```
## [1] 2.006647
```
]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: What percentile/quantile do I need for a 90% CI?
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
qnorm(p = 0.95, mean = 0, sd = 1)
```

```
## [1] 1.644854
```

```r
qt(p = 0.95, df = 52)
```

```
## [1] 1.674689
```
]

---

### Probability Calculations in R

**To help you remember**:

Want a **P**robability?

--

→ Use `pnorm()`, `pt()`, ...

--

Want a **Q**uantile (i.e., percentile)?

--

→ Use `qnorm()`, `qt()`, ...

---

### Probability Calculations in R

**Question**: When might I want to do probability calculations in R?

--

→ You computed a test statistic that is approximated by a named random variable and want to compute the p-value with `p---()`.

--

→ You are constructing a confidence interval and want to find the critical value with `q---()`.

--

→ You want to do a **Sample Size Calculation**.

---

### Sample Size Calculations

* Very important part of the data analysis process!

--

* Happens BEFORE you collect data.

--

* You determine how large your sample size needs to be to achieve a desired precision in your CI.
    + There is also a hypothesis test version that we won't be covering in Stat 100.

---

### Sample Size Calculations

**Question**: Why do we need sample size calculations?

--

**Example**: Let's return to the example of swimming with dolphins to treat depression.

--

With a sample size of 30 and 95% confidence, we estimate that the improvement rate for depression is between 14.5 percentage points and 75 percentage points higher if you swim with a dolphin instead of doing yoga.

--

With a width of 60.5 percentage points, this 95% CI is a **wide**, very imprecise interval.

--

**Question**: How could we make it narrower? How could we decrease the Margin of Error (ME)?

--

→ **Decrease** the confidence level!

--

→ **Increase** the sample size!

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.
Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence, and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

--

We want

$$
z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \leq B
$$

--

We need to derive a formula that looks like

$$
n \geq \quad ...
$$

--

**Question**: How can we isolate `\(n\)` on a side by itself?

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.

Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence, and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

**Sample size calculation:**

$$
n \geq \frac{\hat{p}(1 - \hat{p})z^{*2}}{B^2}
$$

--

* What do we plug in for `\(\hat{p}\)`, `\(z^{*}\)`, and `\(B\)`?

--

* We will consider sample size calculations when estimating a **mean** on this week's p-set.

---

class: middle, center, inverse

## Another important concept: Paired data

---

### Paired Data: Mean Difference

**Example**: Is the mean number of free throw attempts awarded to the Miami Heat during games different from the mean number attempted by their opponents?

```r
library(tidyverse)
library(Lock5Data)

# Data
data("MiamiHeat")
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

* Variables of interest:

* Parameter of interest:

---

### Paired Data: Mean Difference

```r
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

What are the cases?
--

How could we control for case-to-case variability?

---

### Paired Data: Mean Difference

* Paired data: **repeated** observations on the same case

--

* Paired data should not be treated as **independent** observations.
    + Why not?

--

* **Benefit of pairing**: By accounting for the case-to-case variability, any differences we see are more directly related to the explanatory variable.

---

### Paired Data: Mean Difference

.pull-left[

```r
# Calculate the difference for each game
MiamiHeat <- MiamiHeat %>%
  mutate(diff_FTA = FTA - OppFTA)

# Visualize
ggplot(data = MiamiHeat,
       mapping = aes(x = diff_FTA)) +
  geom_histogram()
```
]

.pull-right[
<img src="stat100_wk11mon_files/figure-html/heat-1.png" width="768" style="display: block; margin: auto;" />
]

---

## Conducting a theory-based t-test

.pull-left[

```r
# One-sample t-test
t.test(x = MiamiHeat$diff_FTA)
```

```
## 
## 	One Sample t-test
## 
## data:  MiamiHeat$diff_FTA
## t = 3.4946, df = 81, p-value = 0.0007727
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.617539 5.894657
## sample estimates:
## mean of x 
##  3.756098
```
]

.pull-right[

```r
# Two-sample t-test,
# ignoring the pairing
t.test(x = MiamiHeat$FTA,
       y = MiamiHeat$OppFTA)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  MiamiHeat$FTA and MiamiHeat$OppFTA
## t = 3.2633, df = 158.73, p-value = 0.001348
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.482830 6.029365
## sample estimates:
## mean of x mean of y 
##  27.90244  24.14634
```
]

---

### Assumptions

All of these methods that rely on the CLT assume:

* The sample size is large.

--

* The sample is a random sample.
    + Observations are independent of each other.

---

class: inverse, center, middle

### Let's go through more examples!
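
---

### Sample Size Calculation in R -- Sketch

A minimal sketch of the single-proportion sample size formula in R, assuming the conservative guess `\(\hat{p} = 0.5\)` (which maximizes `\(\hat{p}(1 - \hat{p})\)`), the bound `\(B = 0.02\)` from the COVID example, and 95% confidence:

```r
p_hat <- 0.5   # conservative guess: maximizes p_hat * (1 - p_hat)
B <- 0.02      # desired bound on the margin of error
z_star <- qnorm(p = 0.975, mean = 0, sd = 1)  # 95% confidence

# n >= p_hat * (1 - p_hat) * z_star^2 / B^2, rounded up
ceiling(p_hat * (1 - p_hat) * z_star^2 / B^2)
```

```
## [1] 2401
```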