background-image: url("img/DAW.png") background-position: left background-size: 50% class: middle, center, inverse .pull-right[ ## .whitish[Probability Concepts] <br> <br> ### .whitish[Kelly McConville] #### .yellow[ Stat 100 | Week 10 | Spring 2022] ] --- ### Announcements * P-Set 8 is due on Wednesday. **************************** -- ### Goals for Today .pull-left[ * Recap probability concepts * Learn about **conditional** probabilities and practicing computing them ] .pull-right[ * Learn important **named** random variables * Start approximating sampling distributions ] --- ### Random Variables For a random variable, `\(X\)`, care about its: * **Distribution**: Probability function: `$$p(x) = P(X = x)$$` * **Center**: Mean of a discrete RV: $$ \mu = \sum x p(x) $$ * **Spread**: Standard deviation of a discrete RV: $$ \sigma = \sqrt{ \sum (x - \mu)^2 p(x)} $$ --- ### Random Variables If a random variable, `\(X\)`, is a **continuous** RV, then it can take on any value in an interval. * Probability function: + `\(P(X = x) = 0\)` so -- $$ p(x) \color{orange}{\approx} P(X = x) $$ but if `\(p(4) > p(2)\)` that still means that X is more likely to take on values around 4 than values around 2. --- ### Random Variables: Continuous Change `\(\sum\)` to `\(\color{orange}{\int}\)`: -- * `\(\color{orange}{\int} p(x) dx = 1\)`. -- * Center: Mean/Expected value: $$ \mu = \color{orange}{\int} x p(x) dx $$ -- * Spread: Standard deviation: $$ \sigma = \sqrt{ \color{orange}{\int} (x - \mu)^2 p(x) dx} $$ --- ### Example from Last Class: Suppose 4 students have still not received their graded Stat 100 Midterm (yes, let's pretend we actually have hand-written work) and that I hand back the exams randomly to each student. Let X = the number of students who get their correct exam. -- **Questions:** * Let's say the student's names are A(licia), B(ob), C(olin), and D(onna) and they are sitting in a row ABCD. One possible outcome is ABDC (1st exam goes to A, 2nd to B, 3rd to D, 4th to C). In that case, what does X equal? * List out all possible outcomes. And for each outcome, determine what X equals. * Why is P(X = 3) = 0? * Write out the probability distribution for X. * Determine the mean value of X. * Determine the standard deviation of X. * What is the probability that at least one student gets their correct exam? --- class: inverse, middle, center ## Why do we care about random variables? -- ### We will recast our sample statistics as random variables. -- ### This gives us a probability function that approximates the sampling distribution! --- ### Conditional Probabilities > "Conditioning is the soul of statistics." — Joe Blitzstein, Stat 110 Prof -- **Question**: What do we mean by "conditioning"? -- Most polar bears are twins. Therefore, if you're a twin, you're probably a polar bear. -- * P(twin given polar bear) `\(\neq\)` P(polar bear given twin) -- The p-value is a conditional probability. -- * P-value = P(data given `\(H_o\)`) ( `\(\neq\)` P( `\(H_o\)` given data)) --- ### Conditional Probabilities **Other favorite examples:** * P(have COVID given wear mask) `\(\neq\)` P(wear mask given have COVID) + In a CDC study, P(wear mask given have COVID) = 0.71 while P(have COVID given wear mask) is much lower. -- * P(it is raining given there are clouds directly overhead) `\(\neq\)` Pr(there are clouds directly overhead given it is raining) **Notation**: P(A given B) = P(A | B) --- ### Example Testing for COVID-19 is an important part of *Keep Harvard Healthy*. There are a variety of COVID-19 tests out there but for this problem let's assume the following: * The test gives a **false negative** result 13% of the time where a false negative case is a person with COVID-19 but the test says they don't have it. * The test gives a **false positive** result 5% of the time where a false positive case is a person who doesn't have COVID-19 but the test says they do. -- Let's assume the true prevalence is 1%. Each week they test about 30,000 Harvard affiliates. Use the assumed percentages to fill in the following table of potential outcomes: | | Positive Test Result | Negative Test Result | Total | |------------------------------|----------------------|----------------------|--------| | Actually have COVID-19 | | | | | Actually don't have COVID-19 | | | | | Total | | | 30,000 | .pull-left[ * P(test - | have COVID) = * P(have COVID | test -) = ] .pull-right[ * P(test + | don't have COVID) = * P(don't have COVID | test +) = ] --- ### Example The false negative rate of COVID-19 tests have varied wildly. One paper estimated it could be as high as 54%. Recreate the table with this new false negative rate. | | Positive Test Result | Negative Test Result | Total | |------------------------------|----------------------|----------------------|--------| | Actually have COVID-19 | | | | | Actually don't have COVID-19 | | | | | Total | | | 30,000 | .pull-left[ * P(test - | have COVID) = * P(have COVID | test -) = ] .pull-right[ * P(test + | don't have COVID) = * P(don't have COVID | test +) = ] --- class: inverse, middle, center ## Discuss Specific Named Random Variables --- ### Specific Named Random Variables * There are an **infinite** number of random variables out there. -- * But there are a few particular ones that we will find useful. + Because these ones are used often, they have been given names. -- * Will identify these named RVs using the following format: $$ X \sim \mbox{Name(values of key parameters)} $$ --- ### Specific Named Random Variables .pull-left[ (1) `\(X \sim\)` Bernoulli `\((p)\)` `\begin{align*} X= \left\{ \begin{array}{ll} 1 & \mbox{success} \\ 0 & \mbox{failure} \\ \end{array} \right. \end{align*}` * Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)` ] -- .pull-left[ * Probability Function: | `\(x\)` | 0 | 1 | |------|-----|---| | `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` | ] -- <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-1-1.png" width="576" style="display: block; margin: auto;" /> --- ### Specific Named Random Variables .pull-left[ (1) `\(X \sim\)` Bernoulli `\((p)\)` `\begin{align*} X= \left\{ \begin{array}{ll} 1 & \mbox{success} \\ 0 & \mbox{failure} \\ \end{array} \right. \end{align*}` * Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)` ] .pull-left[ * Probability Function: | `\(x\)` | 0 | 1 | |------|-----|---| | `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` | ] <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" /> --- ### Specific Named Random Variables .pull-left[ (1) `\(X \sim\)` Bernoulli `\((p)\)` `\begin{align*} X= \left\{ \begin{array}{ll} 1 & \mbox{success} \\ 0 & \mbox{failure} \\ \end{array} \right. \end{align*}` * Important parameter: `\(p\)` = probability of success = `\(P(X = 1)\)` ] .pull-left[ * Probability Function: | `\(x\)` | 0 | 1 | |------|-----|---| | `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` | ] <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" /> --- ### Specific Named Random Variables .pull-left[ (1) `\(X \sim\)` Bernoulli `\((p)\)` `\begin{align*} X= \left\{ \begin{array}{ll} 1 & \mbox{success} \\ 0 & \mbox{failure} \\ \end{array} \right. \end{align*}` * Probability Function: | `\(x\)` | 0 | 1 | |------|-----|---| | `\(p(x)\)` | 1 - `\(p\)` | `\(p\)` | ] -- .pull-right[ * Mean: `\(1*p + 0*(1 - p) = p\)` * Standard deviation: `\(\sqrt{(1 - p)^2*p + (0 - p)^2*(1 - p)} = \sqrt{p(1 - p)}\)` ] --- ### Specific Named Random Variables .pull-left[ (2) `\(X \sim\)` Normal `\((\mu, \sigma)\)` * Probability Function: $$ p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp{\left(-\frac{(x - \mu)^2}{2\sigma^2} \right)} $$ where `\(-\infty < x < \infty\)` * Mean: `\(\mu\)` * Standard deviation: `\(\sigma\)` ] -- .pull-right[ <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-4-1.png" width="576" style="display: block; margin: auto;" /> ] -- .pull-left[ **Notes:** (a) Area under the curve = 1. (b) Height `\(\approx\)` how likely values are to occur ] -- .pull-right[ (c) Super special Normal RV: `\(Z \sim\)` Normal `\((\mu = 0, \sigma = 1)\)`. ] --- ### Specific Named Random Variables (2) `\(X \sim\)` Normal `\((\mu, \sigma)\)` <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-5-1.png" width="576" style="display: block; margin: auto;" /> **Normal will be a good approximation for MANY distributions.** -- But sometimes its **tails** just aren't fat enough. --- ### Specific Named Random Variables .pull-left[ (3) `\(X \sim\)` t(df) * Probability Function: $$ p(x) = \frac{\Gamma(\mbox{df} + 1)}{\sqrt{\mbox{df}\pi} \Gamma(2^{-1}\mbox{df})}\left(1 + \frac{x^2}{\mbox{df}} \right)^{-\frac{df + 1}{2}} $$ where `\(-\infty < x < \infty\)` * Mean: 0 * Standard deviation: `\(\sqrt{\mbox{df}/(\mbox{df} - 2)}\)` ] -- .pull-right[ * Probability Function: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" /> ] --- class: inverse, middle, center ## Time to Start Viewing Sample Statistics also as Random Variables --- ### Sample Statistics as Random Variables Here are some of the sample statistics we've seen lately: * `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials * `\(r\)` = correlation between audience and critic ratings for a sample of movies * `\(\hat{p}_D - \hat{p}_Y\)` = difference in improvement proportions between those who swam with dolphins and those who did yoga -- Why are these all random variables? -- But none of these are Bernoulli RVs. -- Nor are they Normal RVs. -- Nor are they t RVs. -- > "All models are wrong but some are useful." -- George Box --- ### Approximating These Distributions * `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials -- .pull-left[ * We generated its Null Distribution: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Approximating These Distributions * `\(\hat{p}\)` = sample proportion of correct receiver guesses out of 329 trials * We generated its Null Distribution: .pull-left[ <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" /> ] .pull-right[ * Its Null Distribution is well approximated by the probability function of a N(0.25, 0.024). ] --- ### Approximating These Distributions * `\(r\)` = correlation between audience and critic ratings for a sample of movies -- .pull-left[ * In your p-set this week, you are generating its Null Distribution: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Approximating These Distributions * `\(r\)` = correlation between audience and critic ratings for a sample of movies .pull-left[ * In your p-set this week, you are generating its Null Distribution: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" /> ] .pull-right[ * Its Null Distribution is well approximated by the probability function of a N(0, 0.023). ] --- ### Approximating These Distributions * `\(\hat{p}_D - \hat{p}_Y\)` = difference in improvement proportions between those who swam with dolphins and those who did yoga -- .pull-left[ * In your p-set last week, you generated its Null Distribution: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Approximating These Distributions * `\(\hat{p}_D - \hat{p}_Y\)` = difference in improvement proportions between those who swam with dolphins and those who did yoga .pull-left[ * In your p-set last week, you generated its Null Distribution: <img src="stat100_wk10mon_files/figure-html/unnamed-chunk-12-1.png" width="576" style="display: block; margin: auto;" /> ] .pull-right[ * Its Null Distribution is kinda somewhat well-ish approximated by the probability function of a N(0, 0.16). ] --- ### Approximating These Distributions * How did I know what was a **good** approximation distribution? -- * How do these distributions help us??