background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

.pull-right[

## .base-blue[Data Collection]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 5 | Fall 2022]

]

---

## Announcements

* Make sure to start working on Project Assignment 1 with your group members.

****************************

--

## Goals for Today

.pull-left[

* Finish up data wrangling with `dplyr` discussion.

]

--

.pull-right[

* Cover data collection/acquisition.

]

---

class: , middle, center

## Data Collection

<img src="img/twitter-study-design.png" width="50%" style="display: block; margin: auto;" />

---

## Who are the data supposed to represent?

<img src="img/week4.002.jpeg" width="80%" style="display: block; margin: auto;" />

--

**Key questions:**

+ What evidence is there that the data are **representative**?
+ Who is present? Who is absent?
+ Who is overrepresented? Who is underrepresented?

---

## Who are the data supposed to represent?

<img src="img/week4.003.jpeg" width="80%" style="display: block; margin: auto;" />

--

**Census**: We have data on the whole population!

---

## Who are the data supposed to represent?

<img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" />

---

## Who are the data supposed to represent?

<img src="img/week4.005.jpeg" width="80%" style="display: block; margin: auto;" />

--

**Key questions:**

+ What evidence is there that the **sample** is **representative** of the **population**?
+ Who is present? Who is absent?
+ Who is overrepresented? Who is underrepresented?

---

## Sampling Bias

<img src="img/sampling.001.jpeg" width="80%" style="display: block; margin: auto;" />

**Sampling bias**: When the sampled units are **systematically different** from the non-sampled units on the variables of interest.

---

### Sampling Bias Example

The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932.
In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists.

--

**Population of Interest**:

<br>

**Sample**:

<br>

**Sampling bias**:

---

## Random Sampling

Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias.

#### Types of random sampling

* Simple random sampling
* Cluster sampling
* Stratified random sampling
* Systematic sampling

--

Why aren't all samples generated using simple random sampling?

---

<img src="img/fs.png" width="12%" style="float:left; padding:10px" style="display: block; margin: auto;" />

### [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

> Mission: "Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US."

Need a **random sample** of ground plots to say something about the state of our nation's forests!

---

### FIA: Simple Random Sampling

.pull-left[

- Break the landscape up into equally sized plots (~1 acre).

- Number each plot from 1 to 6,755,200.

- Use a **random** mechanism to sample a plot for about every 6,000 acres.

```r
sample(x = 1:6755200, size = 1126) %>%
  head()
```

```
## [1] 5117382 5263094 2435611 1584603 6586114 1684609
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-9-1.png" width="576" style="display: block; margin: auto;" />

]

Thoughts on this sampling design?

---

### FIA: Cluster Random Sampling

.pull-left[

- Break the landscape up into equally sized plots (~1 acre).

- Put each plot in a cluster.
    + For our example: cluster = county.

- Number each cluster.

- Use a **random** mechanism to sample 2 clusters.

- Sample **all** plots in those 2 clusters.

```r
sample(x = 1:14, size = 2)
```

```
## [1] 6 9
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-11-1.png" width="576" style="display: block; margin: auto;" />

]

Thoughts on this sampling design?

---

### FIA: Cluster Random Sampling

.pull-left[

- Break the landscape up into equally sized plots (~1 acre).

- Put each plot in a cluster.
    + For our example: cluster = county.

- Number each cluster.

- Use a **random** mechanism to sample 2 clusters.

- Take a **simple random sample** within the sampled clusters.

```r
sample(x = 1:14, size = 2)
```

```
## [1] 2 9
```

```r
sample(x = 1:---, size = ---)
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-14-1.png" width="576" style="display: block; margin: auto;" />

]

Thoughts on this sampling design?

---

### FIA: Cluster Random Sampling

.pull-left[

<img src="img/mass_forests.png" width="95%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-16-1.png" width="576" style="display: block; margin: auto;" />

]

- Are our clusters based on counties **homogeneous**?

- Why is **homogeneity** important for cluster sampling?

---

### FIA: Stratified Random Sampling

.pull-left[

- Break the landscape up into equally sized plots (~1 acre).

- Put each plot in a stratum.
    + For our example: stratum = county.

- Take a **simple random sample** within every stratum.
    + Don't have to be equally sized!

```r
# Do this for each stratum
sample(x = 1:---, size = ---)
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-18-1.png" width="576" style="display: block; margin: auto;" />

]

Thoughts on this sampling design?

---

### FIA: Systematic Random Sampling

.pull-left[

This is FIA's **actual** sampling design (okay, slightly simplified).

- Break the landscape up into equally sized plots (~1 acre).

- Number each plot from 1 to 6,755,200.

- Use a **random** mechanism to pick a starting point.
Then sample about once every 6,000 acres.

```r
sample(x = 1:6755200, size = 1)
```

```
## [1] 214156
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-20-1.png" width="576" style="display: block; margin: auto;" />

]

Why is this design **better** than simple random sampling?

---

<img src="img/nhanes_logo.jpg" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

<br>

### National Health and Nutrition Examination Survey

<br>

Why are these data collected?

--

→ To assess the health of people in the US.

--

How are these data collected?

--

→ **Stage 1**: US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.

--

→ **Stage 2**: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

--

→ **Stage 3**: From sampled city blocks, households are randomly selected. (Households are clusters.)

--

→ **Stage 4**: From sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.

--

**Why don't they use simple random sampling?**

---

### Careful Using Non-Simple Random Sample Data

.pull-left[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-22-1.png" width="576" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-23-1.png" width="576" style="display: block; margin: auto;" />

]

--

* If you are dealing with data collected using a complex sampling design, I'd recommend taking an additional stats course, like Stat 160: Sample Surveys!

---

class: middle, center,

## Detour: Data Ethics

---

### Data Ethics

> "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations."
-- Committee on Professional Ethics of the American Statistical Association (ASA)

--

The ASA has created ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx)

--

→ These guidelines are for EVERYONE doing statistical work.

--

→ There are ethical decisions at all steps of the Data Analysis Process.

--

→ We will periodically refer to specific guidelines throughout this class.

--

> "Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."

---

class: , center, middle

## Responsibilities to Research Subjects

> "The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research."

---

### Responsibilities to Research Subjects

> "Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records."

<img src="stat100_wk05mon_files/figure-html/unnamed-chunk-24-1.png" width="576" style="display: block; margin: auto;" />

Why does the `Age` variable max out at 80?
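---

### Sketch: Capping Ages to Protect Privacy

One plausible reading of the cap at 80 is top-coding: every age at or above a cutoff is recorded as the cutoff, so the oldest participants are harder to re-identify. (The documentation for the `NHANES` package notes that subjects 80 years or older were recorded as 80.) The sketch below illustrates the idea on made-up ages. The names `true_age` and `top_code()` are hypothetical and are not part of the package.

```r
# Hypothetical ages for eight people (made-up values, not NHANES data)
true_age <- c(34, 61, 79, 80, 83, 88, 92, 45)

# Top-code: record every age above the cutoff as the cutoff itself
top_code <- function(age, cutoff = 80) {
  pmin(age, cutoff)
}

released_age <- top_code(true_age)

# Compare the original and released values side by side
data.frame(true_age, released_age)

# The largest released age can never exceed the cutoff
max(released_age)
```

The trade-off: the released data are safer to share, but a summary like the mean age will be pulled down a bit for the oldest group.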
---

## Detour from Our Detour

--

.pull-left[

```r
library(tidyverse)
library(NHANES)

ggplot(data = NHANES,
       mapping = aes(x = Age, y = Height)) +
  geom_point(alpha = 0.1) +
  stat_smooth(color = "blue")
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/points-1.png" width="768" style="display: block; margin: auto;" />

]

---

## Detour from Our Detour

.pull-left[

```r
library(tidyverse)
library(NHANES)
library(emojifont)

NHANES <- mutate(NHANES, heart = fontawesome("fa-heart"))

ggplot(data = NHANES,
       mapping = aes(x = Age, y = Height, label = heart)) +
  geom_text(alpha = 0.1, color = "red",
            family = "fontawesome-webfont", size = 8) +
  stat_smooth(color = "lavender") +
  labs(title = "Happy Love Note Day")
```

]

.pull-right[

<img src="stat100_wk05mon_files/figure-html/hearts-1.png" width="768" style="display: block; margin: auto;" />

]

---

class: middle, center,

## Back to Data Collection

---

### Who are the data supposed to represent?

<img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" />

---

### Who are the data supposed to represent?

<img src="img/week4.006.jpeg" width="80%" style="display: block; margin: auto;" />

**Key questions:**

+ What evidence is there that the **respondents** are **representative** of the **population**?
+ Who is present? Who is absent?
+ Who is overrepresented? Who is underrepresented?

---

## Nonresponse bias

<img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" />

**Nonresponse bias**: The respondents are **systematically** different from the non-respondents for the variables of interest.

---

### Come Back to Literary Digest Example

Of the 10 million people surveyed, more than 2.4 million responded, with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the current President Franklin Delano Roosevelt.

<br>

**Nonresponse bias**?
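---

### Sketch: Simulating Nonresponse Bias

A small simulation can make the idea concrete. Every number below is invented for illustration and is not an estimate from the 1936 poll: a hypothetical population where 40% favor candidate A, and A's supporters are three times as likely to mail back the questionnaire.

```r
set.seed(100)

# Hypothetical population: 40% favor candidate A (made-up value)
n_pop <- 100000
favors_a <- rbinom(n_pop, size = 1, prob = 0.4) == 1

# Everyone is mailed a questionnaire, but response rates differ by group
# (made-up rates: 30% for A's supporters, 10% for everyone else)
p_respond <- ifelse(favors_a, 0.30, 0.10)
responded <- rbinom(n_pop, size = 1, prob = p_respond) == 1

# Share favoring A in the whole population vs. among respondents only
true_share <- mean(favors_a)
respondent_share <- mean(favors_a[responded])

c(population = true_share, respondents = respondent_share)
```

Under these assumed rates, the share among respondents should land near 0.12 / (0.12 + 0.06), or about two thirds, even though only about 40% of the population favors A. Mailing more questionnaires would not fix this, because the gap comes from who responds.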
---

## Tackling Nonresponse bias

<img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" />

--

→ Use multiple modes and multiple attempts for reaching sampled cases.

--

→ Explore key demographic variables to see how respondents and non-respondents vary.

---

## Is Bigger Always Better?

--

<img src="img/sampling.004.jpeg" width="80%" style="display: block; margin: auto;" />

--

For our **Literary Digest Example**, Gallup predicted Roosevelt would win based on a survey of **50,000** people (instead of 2.4 million).

---

### Big Data Paradox

<img src="img/meng.jpg" width="10%" style="float:left; padding:10px" style="display: block; margin: auto;" />

> "Without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves." -- Xiao-Li Meng

--

**Example:**

* During Spring of 2021, Delphi-Facebook estimated vaccine uptake at 70% and the U.S. Census estimated it at 67%.

--

* The CDC reported it to be 53%.

--

And, once we learn about **quantifying uncertainty**, we will see that large sample sizes produce very small measures of uncertainty.

--

> "If you have the resources, invest in data quality far more than you invest in data quantity. Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data." -- Xiao-Li Meng

---

## Thoughts on Data Collection: Sampling

**Random** sampling is important to ensure the sample is **representative** of the population.

--

Representativeness isn't about size.

+ Small random samples will tend to be more representative than large non-random samples.

--

But I bet most samples you will encounter won't have arisen from a random mechanism.

--

How do we draw conclusions about the population from **non-random samples**?

--

→ Determine if your sampled cases (and respondents) are systematically different from the non-sampled cases (and non-respondents) for the variables you care about.

--

→ Adjust your population of interest.

--

→ Take a survey stats course to learn how to adjust the sample to make it more representative.

---

class: center, middle

## Now let's shift the discussion.

--

### Suppose we have our sample and have determined the population it represents.

--

### What kind of conclusions can we draw?

---

### Typical Analysis Goals

--

**Descriptive**: Want to estimate quantities related to the population.

> How many trees are in Alaska?

--

**Predictive**: Want to predict the value of a variable.

> Can I use remotely sensed data to predict forest types in Alaska?

--

**Causal**: Want to determine if changes in a variable cause changes in another variable.

> Are insects causing the increased mortality rates for pinyon-juniper woodlands?

---

### Typical Analysis Goals

For these goals, we will differentiate between variables:

* **Response variable**: Variable I want to better understand

* **Explanatory/predictor variables**: Variables I think might explain/predict the response variable

--

> How many trees are in Alaska?

> Can I use remotely sensed data to predict forest types in Alaska?

> Are insects causing the increased mortality rates for pinyon-juniper woodlands?

---

### Key Mechanism for Causal Goal

**Random assignment**: Cases are randomly assigned to categories of the **explanatory variable**

--

→ If the data were collected using **random assignment**, then I can determine if the explanatory variable **causes** changes in the response variable.

--

**Example**: [COVID Vaccine Trials](https://www.nih.gov/news-events/nih-research-matters/experimental-coronavirus-vaccine-highly-effective)

To study the effectiveness of the Moderna vaccine (mRNA-1273), researchers carried out a study on over 30,000 adult volunteers with no known previous COVID-19 infection.
Volunteers were randomly assigned to either receive two doses of the vaccine or two shots of saline. The incidence of symptomatic COVID-19 was 94% lower in those who received the vaccine than those who did not.

--

**Why does random assignment allow us to conclude that this vaccine was effective at preventing (early strains of) COVID-19?**

---

## Causal Inference

We often want to conclude that an explanatory variable causes changes in a response variable, but we did not randomly assign the explanatory variable.

--

**Confounding variable**: When the explanatory variable and response variable vary, so does the confounder.

→ Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.

<img src="img/confound.png" width="70%" style="display: block; margin: auto;" />

---

## Causal Inference

We often want to conclude that an explanatory variable causes changes in a response variable, but we did not randomly assign the explanatory variable.

**Confounding variable**: When the explanatory variable and response variable vary, so does the confounder.

→ Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.

<img src="img/confound2.png" width="70%" style="display: block; margin: auto;" />

---

## Causal Inference

* **Spurious relationship**: Two variables are associated but not causally related
    + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations).

--

> "Correlation does not imply causation."

--

> "Correlation does not imply not causation."

--

* **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment.

---

## Types of Studies

**Experiment:** Interested in causal relationships so utilize random assignment. Other key features include:

+ Control group
+ Placebo
+ Blinding

--

**Example**: COVID vaccine trials

--

+ Control group: Those who got no vaccine.

--

+ Placebo: The control group got saline shots so they didn't know their group.

--

+ Blinding: Subjects and researchers interacting with subjects did not know which group the subjects were in.

---

## Types of Studies

**Observational Study:** Collect data in a way that doesn't interfere

--

**Example**: Hand washing study

To estimate what percent of people in the US wash their hands after using a public restroom, researchers pretended to comb their hair while observing 6,000 people in public restrooms throughout the United States. They found that 85% of the people who were observed washed their hands after going to the bathroom.

---

## Thoughts on Data Collection: Goals

Consider your analysis goals when making conclusions.

If your goal is to show causal relationships, ask:

* Do I have convincing evidence that any differences in the response aren't just due to "random chance"?

* Do I have convincing evidence against the explanatory groups differing to begin with?

--

Random assignment allows you to explore **causal** relationships between your explanatory variables and the response variable because the randomization makes the explanatory groups roughly similar.

--

How do we draw causal conclusions from studies without random assignment?

--

→ With extreme care! Try to control for all possible confounding variables.

--

→ Discuss the associations/correlations you found. Use domain knowledge to address potentially causal links.

--

→ Take more stats to learn more about causal inference.

--

**Bottom Line:** We often have to use imperfect data to make decisions.

---

### Reminders

* **Participation/Engagement:**
    + In class and section
    + By Oct 14th, must:
        + Attend at least one office hour.
        + Post at least two messages on Slack.