background-image: url("img/DAW.png") background-position: left background-size: 50% class: middle, center, .pull-right[ ## .base-blue[Estimation] <br> <br> ### .purple[Kelly McConville] #### .purple[ Stat 100 | Week 9 | Fall 2022] ] --- ### Announcements * Project Assignment 2 due on Friday at 5pm. **************************** -- ### Goals for Today .pull-left[ * Modeling & Ethics: Algorithmic bias * **Sampling Distribution** + Properties + Construction in `R` ] .pull-right[ * Estimation ] --- class: middle, center, # Data Ethics: Algorithmic Bias --- class: middle, center, # Data Ethics ## Return to the Americian Statistical Association's ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx) --- class: , center, middle ## Integrity of Data and Methods *"The ethical statistical practitioner seeks to understand and mitigate known or suspected limitations, defects, or biases in the data or methods and communicates potential impacts on the interpretation, conclusions, recommendations, decisions, or other results of statistical practices."* -- *"For models and algorithms designed to inform or implement decisions repeatedly, develops and/or implements plans to validate assumptions and assess performance over time, as needed. Considers criteria and mitigation plans for model or algorithm failure and retirement."* --- ## Algorithmic Bias **Algorithmic bias**: when the model systematically creates unfair outcomes, such as privileging one group over another. **Example**: [The Coded Gaze](https://www.youtube.com/watch?v=162VzSzzoPs) <div class="figure" style="text-align: center"> <img src="img/coded_gaze.png" alt="Joy Buolamwini" width="35%" /> <p class="caption">Joy Buolamwini</p> </div> -- * Facial recognition software struggles to see faces of color. -- * Algorithms built on a non-diverse, biased dataset. --- ## Algorithmic Bias **Algorithmic bias**: when the model systematically creates unfair outcomes, such as privileging one group over another. **Example**: [COMPAS model](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) used throughout the country to predict recidivism -- * Differences in predictions across race and gender <div class="figure" style="text-align: center"> <img src="img/compas.png" alt="ProPublica Analysis" width="55%" /> <p class="caption">ProPublica Analysis</p> </div> --- class: middle, center, ## Will consider issues related to models and ethics in P-Set 6. --- ## Sampling Distribution of a Statistic .pull-left[ Steps to Construct an (Approximate) Sampling Distribution: 1. Decide on a sample size, `\(n\)`. 2. Randomly select a sample of size `\(n\)` from the population. 3. Compute the sample statistic. 4. Put the sample back in. 5. Repeat Steps 2 - 4 many (1000+) times. ] .pull-right[ <img src="img/samp_dist.png" width="55%" style="display: block; margin: auto;" /> ] **What happens to the center/spread/shape as we increase the sample size?** **What happens to the center/spread/shape if the true parameter changes?** --- ### Let's go through the `samplingDistributions.Rmd` handout to see how to construct a sampling distribution in `R`. #### Important Notes * To construct a **sampling distribution** for a statistic, we need access to the entire population so that we can take **repeated samples** from the population. -- * But if we have access to the entire population, then we don't need a sampling distribution because we **know** the value of the population parameter. -- * So, frustrating, we will find that the sampling distribution is needed in the exact scenario where we can't compute it, the scenario where we only have a **single sample**. We will learn how to **estimate** the sampling distribution soon. -- * In this hand-out, we have the **entire population** and are constructing sampling distributions anyway to study their properties! --- ## Key Features of a Sampling Distribution What did we learn about sampling distributions? -- → **Standard error** = standard deviation of the statistic -- → Centered around the true population parameter. -- → As the sample size increases, the **standard error** (SE) of the statistic decreases. -- → As the sample size increases, the shape of the sampling distribution becomes more bell-shaped and symmetric. -- ### Question: How do sampling distributions help us **quantify uncertainty**? -- ### Question: If I am estimating a parameter in a real example, why won't I be able to construct the sampling distribution?? --- ## Estimation **Goal**: Estimate the value of a population parameter using data from the sample. -- **Question**: How do I know which population parameter I am interesting in estimating? -- → **Answer**: Likely depends on the research question and structure of your data! -- **Point Estimate**: The corresponding statistic * Single best guess for the parameter ```r library(tidyverse) ce <- read_csv("~/shared_data/stat100/data/ce.csv") summarize(ce, meanFINCBTAX = mean(FINCBTAX)) ``` ``` ## # A tibble: 1 × 1 ## meanFINCBTAX ## <dbl> ## 1 62480. ``` --- ### Potential Parameters and Point Estimates --- ## Confidence Intervals .pull-left[ It is time to move **beyond** just point estimates to interval estimates that quantify our uncertainty. ] .pull-right[ ```r summarize(ce, meanFINCBTAX = mean(FINCBTAX)) ``` ``` ## # A tibble: 1 × 1 ## meanFINCBTAX ## <dbl> ## 1 62480. ``` ] -- **Confidence Interval**: Interval of **plausible** values for a parameter -- **Form**: `\(\mbox{statistic} \pm \mbox{Margin of Error}\)` -- **Question**: How do we find the Margin of Error (ME)? -- → **Answer**: If the sampling distribution of the statistic is approximately bell-shaped and symmetric, then a statistic will be within 2 SEs of the parameter for 95% of the samples. -- **Form**: `\(\mbox{statistic} \pm 2\mbox{SE}\)` -- Called a 95% confidence interval (CI). (Will discuss the meaning of **confidence** soon) --- ## Confidence Intervals **95% CI Form**: $$ \mbox{statistic} \pm 2\mbox{SE} $$ Let's use the `ce` data to produce a CI for the average household income before taxes. ```r summarize(ce, meanFINCBTAX = mean(FINCBTAX)) ``` ``` ## # A tibble: 1 × 1 ## meanFINCBTAX ## <dbl> ## 1 62480. ``` What else do we need to construct the CI? -- **Problem**: To compute the SE, we need many samples from the population. We have 1 sample. -- **Solution**: Approximate the sampling distribution using **ONLY OUR ONE SAMPLE!** --- ## Reminders: * Project Assignment 2 due on Friday at 5pm.