background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Theory-Based Inference]

## .whitish[and]

## .whitish[Sample Size Calculations]

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 11 | Spring 2022]

]

---

background-image: url("img/ggparty.001.jpeg")
background-position: center
background-size: 90%

---

background-image: url("img/ggparty.002.jpeg")
background-position: center
background-size: 90%

---

background-image: url("img/ggparty.003.jpeg")
background-position: center
background-size: 90%

---

class: middle, center

### If you are able to attend, please RSVP here:

### [bit.ly/stat100-ggparty](https://bit.ly/stat100-ggparty)

---

### Announcements

* Project Assignment 3 is due Friday, April 22nd at 5pm.

****************************

--

### Goals for Today

.pull-left[
* Theory-based inference
* Sample size calculations
]

.pull-right[
* Paired data
* Lots of examples
]

---

### Recap:

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

--

.pull-left[

#### Sample Proportion Version

When `\(n\)` is large (at least 10 successes and 10 failures):

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

]

--

.pull-right[

#### Sample Mean Version

When `\(n\)` is large (at least 30):

$$
\bar{x} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)
$$

]

---

### But There Are [Several Versions](https://mcconvil.github.io/stat100s22/inference_procedures.html)!
<table class="table table-responsive table-bordered" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Response </th>
   <th style="text-align:left;"> Explanatory </th>
   <th style="text-align:left;"> Numerical Quantity </th>
   <th style="text-align:left;"> Parameter </th>
   <th style="text-align:left;"> Statistic </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:left;"> `\(\mu\)` </td>
   <td style="text-align:left;"> `\(\bar{x}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> proportion </td>
   <td style="text-align:left;"> `\(p\)` </td>
   <td style="text-align:left;"> `\(\hat{p}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in means </td>
   <td style="text-align:left;"> `\(\mu_1 - \mu_2\)` </td>
   <td style="text-align:left;"> `\(\bar{x}_1 - \bar{x}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in proportions </td>
   <td style="text-align:left;"> `\(p_1 - p_2\)` </td>
   <td style="text-align:left;"> `\(\hat{p}_1 - \hat{p}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> correlation </td>
   <td style="text-align:left;"> `\(\rho\)` </td>
   <td style="text-align:left;"> `\(r\)` </td>
  </tr>
 </tbody>
</table>

---

### Recap:

**Z-score test statistics**:

$$
\mbox{Z-score} = \frac{\mbox{statistic} - \mu}{\sigma}
$$

--

* Usually follows a **standard normal** or a **t** distribution.

--

* Use the approximate distribution to find the p-value.
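
For instance, if a two-sided test produced a Z-score of 2.1 (a hypothetical value for illustration), a sketch of the p-value calculation in R:

```r
# Hypothetical Z-score test statistic
z_score <- 2.1

# Two-sided p-value: area in both tails
# of the standard normal distribution
2 * pnorm(q = abs(z_score), lower.tail = FALSE)
```

```
## [1] 0.03572884
```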
---

### Recap: **Formula-Based** P*100% Confidence Intervals

$$
\mbox{statistic} \pm z^* SE
$$

where `\(P(-z^* \leq Z \leq z^*) = P\)`

--

OR

$$
\mbox{statistic} \pm t^* SE
$$

where `\(P(-t^* \leq t \leq t^*) = P\)`

---

class: inverse, middle, center

## How do we do probability model calculations in `R`?

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1)
```

```
## [1] 0.8413447
```

```r
pt(q = 1, df = 52)
```

```
## [1] 0.8390293
```
]

**Doesn't seem quite right**...

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
```

```
## [1] 0.1586553
```

```r
pt(q = 1, df = 52, lower.tail = FALSE)
```

```
## [1] 0.1609707
```
]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: How do I find the correct critical values `\((z^* \mbox{ or } t^*)\)` for the confidence interval?

<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
qnorm(p = 0.975, mean = 0, sd = 1)
```

```
## [1] 1.959964
```

```r
qt(p = 0.975, df = 52)
```

```
## [1] 2.006647
```
]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: What percentile/quantile do I need for a 90% CI?
<img src="stat100_wk11mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
qnorm(p = 0.95, mean = 0, sd = 1)
```

```
## [1] 1.644854
```

```r
qt(p = 0.95, df = 52)
```

```
## [1] 1.674689
```
]

---

### Probability Calculations in R

**To help you remember**:

Want a **P**robability?

--

→ Use `pnorm()`, `pt()`, ...

--

Want a **Q**uantile (i.e., percentile)?

--

→ Use `qnorm()`, `qt()`, ...

---

### Probability Calculations in R

**Question**: When might I want to do probability calculations in R?

--

→ You computed a test statistic that is approximated by a named random variable and want to compute the p-value with `p---()`.

--

→ You are constructing a confidence interval and want to find the critical value with `q---()`.

--

→ You want to do a **Sample Size Calculation**.

---

### Sample Size Calculations

* Very important part of the data analysis process!

--

* Happens BEFORE you collect data.

--

* You determine how large your sample size needs to be to achieve a desired precision in your CI.
    + There is also a hypothesis test version that we won't be covering in Stat 100.

---

### Sample Size Calculations

**Question**: Why do we need sample size calculations?

--

**Example**: Let's return to the example of swimming with dolphins to treat depression.

--

With a sample size of 30 and 95% confidence, we estimate that the improvement rate for depression is between 14.5 percentage points and 75 percentage points higher if you swim with a dolphin instead of doing yoga.

--

With a width of 60.5 percentage points, this 95% CI is a **wide**, very imprecise interval.

--

**Question**: How could we make it narrower? How could we decrease the Margin of Error (ME)?

--

→ **Decrease** the confidence level!

--

→ **Increase** the sample size!

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.
Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence, and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

--

We want

$$
z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \leq B
$$

--

We need to derive a formula that looks like

$$
n \geq \quad ...
$$

--

**Question**: How can we isolate `\(n\)` on a side by itself?

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.

Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence, and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

**Sample size calculation:**

$$
n \geq \frac{\hat{p}(1 - \hat{p})z^{*2}}{B^2}
$$

--

* What do we plug in for `\(\hat{p}\)`, `\(z^{*}\)`, and `\(B\)`?

--

* We will consider sample size calculations when estimating a **mean** on this week's p-set.

---

class: middle, center, inverse

## Another important concept: Paired data

---

### Paired Data: Mean Difference

**Example**: Is the mean number of free throw attempts awarded to the Miami Heat during games different from the mean number attempted by their opponents?

```r
library(tidyverse)
library(Lock5Data)

# Data
data("MiamiHeat")
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

* Variables of interest:

* Parameter of interest:

---

### Paired Data: Mean Difference

```r
select(MiamiHeat, Game, Location, Opp, FTA, OppFTA) %>%
  slice(1:6)
```

```
##   Game Location Opp FTA OppFTA
## 1    1     Away BOS  25     25
## 2    2     Away PHI  31     11
## 3    3     Home ORL  27     34
## 4    4     Away NJN  34     23
## 5    5     Home MIN  31     38
## 6    6     Away NOH  24     17
```

What are the cases?
--

How could we control for case-to-case variability?

---

### Paired Data: Mean Difference

* Paired data: **repeated** observations on the same case

--

* Paired data should not be treated as **independent** observations.
    + Why not?

--

* **Benefit of pairing**: By accounting for the case-to-case variability, any differences we see are more directly related to the explanatory variable.

---

### Paired Data: Mean Difference

.pull-left[

```r
# Calculate the difference for each game
MiamiHeat <- MiamiHeat %>%
  mutate(diff_FTA = FTA - OppFTA)

# Visualize
ggplot(data = MiamiHeat,
       mapping = aes(x = diff_FTA)) +
  geom_histogram()
```
]

.pull-right[
<img src="stat100_wk11mon_files/figure-html/heat-1.png" width="768" style="display: block; margin: auto;" />
]

---

## Conducting a theory-based t-test

.pull-left[

```r
# One-sample t-test
t.test(x = MiamiHeat$diff_FTA)
```

```
## 
## 	One Sample t-test
## 
## data:  MiamiHeat$diff_FTA
## t = 3.4946, df = 81, p-value = 0.0007727
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.617539 5.894657
## sample estimates:
## mean of x 
##  3.756098
```
]

.pull-right[

```r
# Two-sample t-test,
# ignoring the pairing
t.test(x = MiamiHeat$FTA,
       y = MiamiHeat$OppFTA)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  MiamiHeat$FTA and MiamiHeat$OppFTA
## t = 3.2633, df = 158.73, p-value = 0.001348
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.482830 6.029365
## sample estimates:
## mean of x mean of y 
##  27.90244  24.14634
```
]

---

### Assumptions

All of these methods that rely on the CLT assume:

* The sample size is large.

--

* The sample is a random sample.
    + Observations are independent of each other.

---

class: inverse, center, middle

### Let's go through more examples!
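
---

### Sample Size Calculation in R -- Sketch

A minimal sketch of the single-proportion sample size formula in R, assuming the conservative guess `\(\hat{p} = 0.5\)` (which maximizes `\(\hat{p}(1 - \hat{p})\)`), the bound `\(B = 0.02\)` from the COVID example, and 95% confidence:

```r
p_hat <- 0.5   # conservative guess: maximizes p_hat * (1 - p_hat)
B <- 0.02      # desired bound on the margin of error
z_star <- qnorm(p = 0.975, mean = 0, sd = 1)  # 95% confidence

# n >= p_hat * (1 - p_hat) * z_star^2 / B^2, rounded up
ceiling(p_hat * (1 - p_hat) * z_star^2 / B^2)
```

```
## [1] 2401
```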