background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Probability Concepts]
## .base-blue[and the]
## .base-blue[Central Limit Theorem]

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 12 | Fall 2022]

]

---

### Announcements

* Week of Thanksgiving:
    + No sections or wrap-up sessions
    + Normal OH schedule for Sun Nov 20th - Tues Nov 22nd
    + No OHs from Wed Nov 23rd - Sun Nov 27th

****************************

--

### Goals for Today

.pull-left[
* Recap probability concepts
* Cover continuous random variables
]

.pull-right[
* Learn important **named** random variables
* Start approximating sampling distributions with the **Central Limit Theorem**
]

---

### Random Variables

For a **discrete** random variable, `\(X\)`, we care about its:

* **Distribution**: Probability function:

`$$p(x) = P(X = x)$$`

* **Center**: Mean of a discrete RV:

$$
\mu = \sum x p(x)
$$

* **Spread**: Standard deviation of a discrete RV:

$$
\sigma = \sqrt{ \sum (x - \mu)^2 p(x)}
$$

---

### Random Variables

If a random variable, `\(X\)`, is a **continuous** RV, then it can take on any value in an interval.

* Probability function:
    + `\(P(X = x) = 0\)`, so

--

$$
p(x) \color{orange}{\approx} P(X = x)
$$

but if `\(p(4) > p(2)\)`, that still means that `\(X\)` is more likely to take on values around 4 than values around 2.

---

### Random Variables: Continuous

Change `\(\sum\)` to `\(\color{orange}{\int}\)`:

--

* `\(\color{orange}{\int} p(x) dx = 1\)`.

--

* Center: Mean/Expected value:

$$
\mu = \color{orange}{\int} x p(x) dx
$$

--

* Spread: Standard deviation:

$$
\sigma = \sqrt{ \color{orange}{\int} (x - \mu)^2 p(x) dx}
$$

---

class: middle, center

## Why do we care about random variables?

--

#### We will recast our sample statistics as random variables.

--

#### This gives us a probability function that approximates the sampling distribution!
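---

### Random Variables: Checking the Formulas

The discrete mean and standard deviation formulas are easy to verify numerically. A small sketch in Python (the fair four-sided die here is a made-up example, not from the deck):

```python
import math

# Probability function for a hypothetical fair four-sided die:
# p(x) = 0.25 for x = 1, 2, 3, 4
dist = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

# mu = sum over x of x * p(x)
mu = sum(x * p for x, p in dist.items())

# sigma = sqrt( sum over x of (x - mu)^2 * p(x) )
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

print(mu)     # 2.5
print(sigma)  # about 1.118
```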
---

class: middle, center

## Specific Named Random Variables

---

### Specific Named Random Variables

* There is a vast array of random variables out there.

--

* But there are a few particular ones that we will find useful.
    + Because these ones are used often, they have been given names.

--

* We will identify these named RVs using the following format:

$$
X \sim \mbox{Name(values of key parameters)}
$$

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

--

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

--

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)`
]

.pull-left[
* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

---

### Specific Named Random Variables

.pull-left[
(1) `\(X \sim\)` Bernoulli `\((p)\)`

`\begin{align*}
X = \left\{
\begin{array}{ll}
1 & \mbox{success} \\
0 & \mbox{failure} \\
\end{array}
\right.
\end{align*}`

* Probability Function:

| `\(x\)` | 0 | 1 |
|------|-----|---|
| `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` |
]

--

.pull-right[
* Mean: `\(1*p + 0*(1 - p) = p\)`

* Standard deviation: `\(\sqrt{(1 - p)^2*p + (0 - p)^2*(1 - p)} = \sqrt{p(1 - p)}\)`
]

---

### Specific Named Random Variables

.pull-left[
(2) `\(X \sim\)` Normal `\((\mu, \sigma)\)`

* Probability Function:

$$
p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp{\left(-\frac{(x - \mu)^2}{2\sigma^2} \right)}
$$

where `\(-\infty < x < \infty\)`

* Mean: `\(\mu\)`

* Standard deviation: `\(\sigma\)`
]

--

.pull-right[
<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-left[
**Notes:**

(a) Area under the curve = 1.

(b) Height `\(\approx\)` how likely values are to occur
]

--

.pull-right[
(c) Super special Normal RV: `\(Z \sim\)` Normal `\((\mu = 0, \sigma = 1)\)`.
]

---

### Specific Named Random Variables

(2) `\(X \sim\)` Normal `\((\mu, \sigma)\)`

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-5-1.png" width="576" style="display: block; margin: auto;" />

**Normal will be a good approximation for MANY distributions.**

--

But sometimes its **tails** just aren't fat enough.
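---

### Aside: Checking the Normal Probability Function

The two Normal facts above (the area under the curve is 1, and the curve is tallest near `\(\mu\)`) can be checked numerically. A rough sketch in Python (the grid limits and step size are arbitrary choices, not from the deck):

```python
import math

def normal_pdf(x, mu, sigma):
    # The Normal probability function from the slides
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

mu, sigma = 0.0, 1.0   # the "super special" Z ~ Normal(0, 1)
dx = 0.001
grid = [i * dx for i in range(-8000, 8001)]   # -8 to 8 covers essentially all the area

# Riemann sum approximation of the area under the curve
area = sum(normal_pdf(x, mu, sigma) * dx for x in grid)

# Grid point where the curve is tallest
peak = max(grid, key=lambda x: normal_pdf(x, mu, sigma))

print(round(area, 3))  # 1.0  (area under the curve)
print(peak)            # 0.0  (tallest at mu)
```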
---

### Specific Named Random Variables

.pull-left[
(3) `\(X \sim\)` t(df)

* Probability Function:

$$
p(x) = \frac{\Gamma(2^{-1}(\mbox{df} + 1))}{\sqrt{\mbox{df}\pi} \Gamma(2^{-1}\mbox{df})}\left(1 + \frac{x^2}{\mbox{df}} \right)^{-\frac{\mbox{df} + 1}{2}}
$$

where `\(-\infty < x < \infty\)`

* Mean: 0

* Standard deviation: `\(\sqrt{\mbox{df}/(\mbox{df} - 2)}\)` (for df > 2)
]

--

.pull-right[
* Probability Function:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />
]

---

class: middle, center

## Time to Start Viewing Sample Statistics also as Random Variables

---

### Sample Statistics as Random Variables

Here are some of the sample statistics we've seen lately:

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

--

Why are these all random variables?

--

But none of these are **Bernoulli** random variables.

--

Nor are they **Normal** random variables.

--

Nor are they **t** random variables.

--

#### *"All models are wrong but some are useful."* -- George Box

---

### Approximating These Distributions

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

--

.pull-left[
* We generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials

* We generated its Null Distribution:

.pull-left[
<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is well approximated by the probability function of a N(0.25, 0.024).
]

---

### Approximating These Distributions

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

--

.pull-left[
* In your p-set this week, you are generating its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(r\)` = sample correlation coefficient for vitamin D levels and age for female schoolchildren in Thailand

.pull-left[
* In your p-set this week, you are generating its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is well approximated by the probability function of a N(0, 0.056).
]

---

### Approximating These Distributions

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

--

.pull-left[
* In your p-set last week, you generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" />
]

---

### Approximating These Distributions

* `\(\hat{p}_D - \hat{p}_Y\)` = difference in sample improvement proportions between those who swam with dolphins and those who did yoga

.pull-left[
* In your p-set last week, you generated its Null Distribution:

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-12-1.png" width="576" style="display: block; margin: auto;" />
]

.pull-right[
* Its Null Distribution is kinda somewhat well-ish approximated by the probability function of a N(0, 0.16).
]

---

### Approximating These Distributions

* How do I know **which** probability function is a good approximation for my sample statistic's distribution?
--

* Once I have figured out a probability function that approximates the distribution of my sample statistic, how do I **use it** to do statistical inference?

---

### Approximating Sampling Distributions

--

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

--

**Example**: Trees in Mount Tabor

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />

---

### Approximating Sampling Distributions

**Central Limit Theorem (CLT):** For random samples and a large sample size `\((n)\)`, the sampling distribution of many sample statistics is approximately normal.

**Example**: Trees in Mount Tabor

<img src="stat100_wk12mon_files/figure-html/unnamed-chunk-14-1.png" width="576" style="display: block; margin: auto;" />

--

But **which** Normal? (What is the value of `\(\mu\)` and `\(\sigma\)`?)

---

### Approximating Sampling Distributions

**Question**: But **which** Normal? (What is the value of `\(\mu\)` and `\(\sigma\)`?)

--

* The sampling distribution of a statistic is always centered around the true value of the parameter.

--

* The CLT also provides formula estimates of the standard error.
    + The formula varies based on the statistic.

---

### Approximating the Sampling Distribution of a Sample Proportion

CLT says: For large `\(n\)` (at least 10 successes and 10 failures),

--

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

--

**Example**: Trees in Mount Tabor

--

* Parameter: `\(p\)` = proportion of Douglas Fir = 0.615

--

* Statistic: `\(\hat{p}\)` = proportion of Douglas Fir in a sample of 50 trees

--

$$
\hat{p} \sim N \left(0.615, \sqrt{\frac{0.615(1-0.615)}{50}} \right)
$$

--

**NOTE**: We can plug in the true parameter here because we had data on the whole population.

---

### Approximating the Sampling Distribution of a Sample Proportion

**Question**: What do we do when we don't have access to the whole population?
--

* We have:

$$
\hat{p} \sim N \left(p, \sqrt{\frac{p(1-p)}{n}} \right)
$$

--

**Answer**: We will have to estimate the SE.

---

### Approximating the Sampling Distribution of a Sample Mean

* There is a version of the CLT for many of our sample statistics.

* For the sample mean, the CLT says: For large `\(n\)` (at least 30 observations),

$$
\bar{x} \sim N \left(\mu, \frac{\sigma}{\sqrt{n}} \right)
$$

---

class: middle, center

## Now we need to use the approximate distribution of the sample statistic (given by the CLT) to construct confidence intervals and to conduct hypothesis tests!
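---

### Aside: Simulating the CLT for a Sample Proportion

The Mount Tabor claim can be checked by simulation: draw many samples of `\(n = 50\)` trees where each tree is Douglas Fir with probability `\(p = 0.615\)`, and compare the simulated mean and SE of `\(\hat{p}\)` to the CLT's `\(N\left(p, \sqrt{p(1-p)/n}\right)\)`. A sketch in Python (the seed and number of repetitions are arbitrary choices, not from the deck):

```python
import math
import random

random.seed(100)
p, n, reps = 0.615, 50, 20000   # Mount Tabor: p = proportion of Douglas Fir

phats = []
for _ in range(reps):
    # One sample of n trees: 1 = Douglas Fir with probability p
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    phats.append(sum(sample) / n)

sim_mean = sum(phats) / reps
sim_se = math.sqrt(sum((ph - sim_mean) ** 2 for ph in phats) / reps)
clt_se = math.sqrt(p * (1 - p) / n)   # the CLT's formula for the SE

print(round(sim_mean, 3))                  # close to 0.615
print(round(sim_se, 3), round(clt_se, 3))  # both close to 0.069
```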