background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

.pull-right[

## .base-blue[Theory-Based Inference]

## .base-blue[and]

## .base-blue[Sample Size Calculations]

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 13 | Fall 2022]

]

---
background-image: url("img/ggparty_f22.png")
background-position: contain
background-size: 90%

---
class: middle, center

### If you are able to attend, please RSVP here:

### [bit.ly/stat100-ggparty](https://bit.ly/stat100-ggparty)

---

### Announcements

* Extra Credit Lecture Quiz: Correct the problems you missed in prior lecture quizzes and get up to two quiz grades back.
  + Overall quiz grade can't exceed 100% so if you haven't missed many points on the lecture quiz, you don't have to do many (or any) corrections.
  + Opens today and closes at noon on Dec 1st.
* A bit of thanks and a (small) surprise.
* No regular lecture quiz this week.
* P-Set 9 (the LAST p-set) is due Th, Dec 1st!
* Project Assignment 3 is due Tu, Dec 6th.

****************************

--

### Goals for Today

.pull-left[

* Theory-based inference
* Sample size calculations

]

.pull-right[

* **Lots of examples**

]

---
class: middle, center

## One of the most challenging inferential ideas:

--

### Understanding the **many roles of the sample statistic**:

--

.pull-left[

As a number

]

--

.pull-right[

As a point estimate

]

--

.pull-left[

As a test statistic

]

--

.pull-right[

As a random variable

]

---

### Practice Problem

#### Identify the different roles of a statistic in the following example:

Researchers presented young children (aged 5 to 8 years) with a choice between two toy characters who were offering stickers. One character was described as mean, and the other was described as nice. The mean character offered two stickers, and the nice character offered one sticker.
Researchers wanted to investigate whether children would tend to select the nice character over the mean character, despite receiving fewer stickers. They found that 80% of the 20 children in the study selected the nice character. If the children had no preference, the probability that 80% or more would select the nice character is approximately equal to 0.0036. My best guess for the true proportion of children who would select the nice character is 0.8 (with a margin of error of 0.19 for the 95% CI).

.pull-left[

* As a number

]

.pull-right[

* As a point estimate

]

.pull-left[

* As a test statistic

]

.pull-right[

* As a random variable

]

---

### Recap:

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

--

.pull-left[

#### Sample Proportion Version

When `\(n\)` is large (at least 10 successes and 10 failures):

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

]

--

.pull-right[

#### Sample Mean Version

When `\(n\)` is large (at least 30):

$$
\bar{x} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)
$$

]

---

### But There Are [Several Versions](https://mcconvil.github.io/stat100f22/inference_procedures.html) of the CLT!
<table class="table table-responsive table-bordered" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Response </th>
   <th style="text-align:left;"> Explanatory </th>
   <th style="text-align:left;"> Numerical_Quantity </th>
   <th style="text-align:left;"> Parameter </th>
   <th style="text-align:left;"> Statistic </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:left;"> `\(\mu\)` </td>
   <td style="text-align:left;"> `\(\bar{x}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> proportion </td>
   <td style="text-align:left;"> `\(p\)` </td>
   <td style="text-align:left;"> `\(\hat{p}\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in means </td>
   <td style="text-align:left;"> `\(\mu_1 - \mu_2\)` </td>
   <td style="text-align:left;"> `\(\bar{x}_1 - \bar{x}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> categorical </td>
   <td style="text-align:left;"> difference in proportions </td>
   <td style="text-align:left;"> `\(p_1 - p_2\)` </td>
   <td style="text-align:left;"> `\(\hat{p}_1 - \hat{p}_2\)` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> quantitative </td>
   <td style="text-align:left;"> correlation </td>
   <td style="text-align:left;"> `\(\rho\)` </td>
   <td style="text-align:left;"> `\(r\)` </td>
  </tr>
</tbody>
</table>

---

### Recap:

**Z-score test statistics**:

$$
\mbox{Z-score} = \frac{\mbox{statistic} - \mu}{\sigma}
$$

--

* Usually follows a **standard normal** or a **t** distribution.

--

* Use the approximate distribution to find the p-value.
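---

### Recap: Z-score test statistics in `R`

A minimal sketch of the recipe above, applied to the sticker practice problem (80% of 20 children chose the nice character). It assumes the null value is `\(p = 0.5\)` (no preference) and a normal approximation to the sampling distribution; the object names `p_hat`, `p_0`, `se_null`, `z_score`, and `p_value` are made up for illustration.

```r
# Values taken from the sticker practice problem
p_hat <- 0.8   # observed sample proportion
n     <- 20    # number of children in the study
p_0   <- 0.5   # assumed null value: no preference between characters

# Standard error of p-hat if the null hypothesis were true
se_null <- sqrt(p_0 * (1 - p_0) / n)

# Z-score test statistic: (statistic - mean) / standard deviation
z_score <- (p_hat - p_0) / se_null

# One-sided p-value: probability of a Z-score at least this large
p_value <- pnorm(q = z_score, mean = 0, sd = 1, lower.tail = FALSE)

z_score
p_value
```

Under these assumptions, the p-value lands close to the 0.0036 quoted in the practice problem.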
---

### Recap:

**Formula-Based** P*100% Confidence Intervals

$$
\mbox{statistic} \pm z^* SE
$$

where `\(P(-z^* \leq Z \leq z^*) = P\)`

--

OR

$$
\mbox{statistic} \pm t^* SE
$$

where `\(P(-t^* \leq t \leq t^*) = P\)`

---
class: , middle, center

## How do we do probability model calculations in `R`?

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[

<img src="stat100_wk13mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1)
```

```
## [1] 0.8413447
```

```r
pt(q = 1, df = 52)
```

```
## [1] 0.8390293
```

]

**Doesn't seem quite right**...

---

### Probability Calculations in R

**Question**: How do I compute **probabilities** in R?

.pull-left[

<img src="stat100_wk13mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />

]

.pull-right[

```r
pnorm(q = 1, mean = 0, sd = 1, lower.tail = FALSE)
```

```
## [1] 0.1586553
```

```r
pt(q = 1, df = 52, lower.tail = FALSE)
```

```
## [1] 0.1609707
```

]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: How do I find the correct critical values `\((z^* \mbox{ or } t^*)\)` for the confidence interval?

<img src="stat100_wk13mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
qnorm(p = 0.975, mean = 0, sd = 1)
```

```
## [1] 1.959964
```

```r
qt(p = 0.975, df = 52)
```

```
## [1] 2.006647
```

]

---

### P*100% CI for parameter:

$$
\mbox{statistic} \pm z^* SE
$$

.pull-left[

**Question**: What percentile/quantile do I need for a 90% CI?
<img src="stat100_wk13mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
qnorm(p = 0.95, mean = 0, sd = 1)
```

```
## [1] 1.644854
```

```r
qt(p = 0.95, df = 52)
```

```
## [1] 1.674689
```

]

---

### Probability Calculations in R

**To help you remember**:

Want a **P**robability?

--

→ use `pnorm()`, `pt()`, ...

--

Want a **Q**uantile (i.e. percentile)?

--

→ use `qnorm()`, `qt()`, ...

---

### Probability Calculations in R

**Question**: When might I want to do probability calculations in R?

--

→ Computed a test statistic that is approximated by a named random variable. Want to compute the p-value with `p---()`.

--

→ Computing a confidence interval. Want to find the critical value with `q---()`.

--

→ To do a **Sample Size Calculation**.

---

### Sample Size Calculations

* Very important part of the data analysis process!

--

* Happens BEFORE you collect data.

--

* You determine how large your sample size needs to be for a desired precision in your CI.
  + There is also a hypothesis test version that we won't be covering in Stat 100.

---

### Sample Size Calculations

**Question**: Why do we need sample size calculations?

--

**Example**: Let's return to the dolphins for treating depression example.

--

With a sample size of 30 and 95% confidence, we estimate that the improvement rate for depression is between 14.5 percentage points and 75 percentage points higher if you swim with a dolphin instead of doing yoga.

--

With a width of 60.5 percentage points, this 95% CI is a **wide**/very imprecise interval.

--

**Question**: How could we make it narrower? How could we decrease the Margin of Error (ME)?

--

→ **Decrease** the confidence level!

--

→ **Increase** the sample size!

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.
Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

--

Want

$$
z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \leq B
$$

--

Need to derive a formula that looks like

$$
n \geq \quad ...
$$

--

**Question**: How can we isolate `\(n\)` to be on a side by itself?

---

### Sample Size Calculations -- Single Proportion

Let's focus on estimating a single proportion.
Suppose we want to estimate the current proportion of Harvard undergraduates with COVID with 95% confidence and we want the margin of error on our interval to be less than or equal to 0.02. **How large does our sample size need to be?**

**Sample size calculation:**

$$
n \geq \frac{\hat{p}(1 - \hat{p})z^{*2}}{B^2}
$$

--

* What do we plug in for `\(\hat{p}\)`, `\(z^{*}\)`, `\(B\)`?

--

* Consider sample size calculations when estimating a **mean** on P-Set 9!

---
class: , center, middle

### A bit of thanks and a small surprise.

### And then, let's go through more examples with the "InferenceExamples.Rmd" handout!

---
background-image: url("img/hyp_testing_diagram.png")
background-position: contain
background-size: 80%

### Have Learned Two Routes to Statistical Inference

--

Which is **better**?

---

## Is Simulation-Based Inference or Theory-Based Inference better?

--

Depends on how you define **better**.

.pull-left[

* If **better** = Leads to better understanding:

]

--

.pull-right[

→ Research tends to show students have a better understanding of **p-values** and **confidence** from learning simulation-based methods.

]

--

.pull-left[

* If **better** = More flexible/robust to assumptions:

]

--

.pull-right[

→ The simulation-based methods tend to be more flexible but that generally requires learning extensions beyond what we've seen in Stat 100.
]

--

.pull-left[

* If **better** = More commonly used:

]

--

.pull-right[

→ Definitely the theory-based methods but the simulation-based methods are becoming more common.

]

Good to be comfortable with both as you will find both approaches used in journal and news articles!
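---

### Sample Size Calculations -- Single Proportion in `R`

Squaring both sides of `\(z^* \sqrt{\hat{p}(1 - \hat{p})/n} \leq B\)` and solving for `\(n\)` gives the sample size formula from the Single Proportion slides. Here is a minimal sketch of that calculation for the 95% confidence, `\(B = 0.02\)` example. Since no data have been collected yet, it assumes the conservative guess `\(\hat{p} = 0.5\)`, which makes `\(\hat{p}(1 - \hat{p})\)` as large as possible; that choice and the object names `conf_level`, `B`, `p_guess`, `z_star`, and `n_min` are assumptions for illustration, not something stated on the slides.

```r
conf_level <- 0.95  # desired confidence level
B          <- 0.02  # desired upper bound on the margin of error
p_guess    <- 0.5   # assumed conservative guess for p-hat (maximizes p(1 - p))

# Critical value z*: the quantile that leaves (1 - conf_level) / 2 in each tail
z_star <- qnorm(p = 1 - (1 - conf_level) / 2, mean = 0, sd = 1)

# Smallest whole-number n satisfying n >= p(1 - p) * z*^2 / B^2
n_min <- ceiling(p_guess * (1 - p_guess) * z_star^2 / B^2)

n_min
```

If a reasonable prior estimate of the proportion is available, plugging it in for `p_guess` would typically give a smaller required sample size than the conservative 0.5.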