More Data Collection
Kelly McConville
Stat 100
Week 5 | Fall 2023
“Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” – Committee on Professional Ethics of the American Statistical Association (ASA)
The ASA has created “Ethical Guidelines for Statistical Practice”
→ These guidelines are for EVERYONE doing statistical work.
→ There are ethical decisions at all steps of the Data Analysis Process.
→ We will periodically refer to specific guidelines throughout this class.
“Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical.”
“The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research.”
Why do you think the Age variable maxes out at 80?
“Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records.”
library(tidyverse)
library(NHANES)
library(emojifont)

# Use a Font Awesome heart glyph as the plotting symbol
NHANES <- mutate(NHANES,
                 heart = fontawesome("fa-heart"))

ggplot(data = NHANES,
       mapping = aes(x = Age,
                     y = Height,
                     label = heart)) +
  geom_text(alpha = 0.1, color = "red",
            family = 'fontawesome-webfont',
            size = 16) +
  stat_smooth(color = "deeppink")
Key question: Who responded to the survey, and who did not?
Nonresponse bias: The respondents are systematically different from the non-respondents for the variables of interest.
In 1936, the Literary Digest mailed surveys to 10 million people. More than 2.4 million responded, with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the sitting President Franklin Delano Roosevelt.
Nonresponse bias?
Use multiple modes (mail, phone, in-person) and multiple attempts for reaching sampled cases.
Explore key demographic variables to see how respondents and non-respondents vary.
Take a survey stats course to learn how to create survey weights to adjust for potential nonresponse bias.
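A tiny simulation (all numbers hypothetical) shows the mechanism: if support for a candidate is related to the propensity to respond, the respondent percentage can land far from the population percentage, no matter how many surveys go out.

```r
set.seed(1936)

# Hypothetical population of 1,000,000 voters: 40% support Candidate A
population <- rbinom(1e6, size = 1, prob = 0.40)

# Nonresponse mechanism: supporters of A respond 60% of the time,
# everyone else only 20% of the time
respond_prob <- ifelse(population == 1, 0.60, 0.20)
responded <- rbinom(1e6, size = 1, prob = respond_prob) == 1

mean(population)            # true support, about 0.40
mean(population[responded]) # respondent support, about 0.67 -- badly biased
```

The respondents overstate support by roughly 27 percentage points, and collecting more responses under the same mechanism would not fix it.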
For our Literary Digest Example, Gallup predicted Roosevelt would win based on a survey of 50,000 people (instead of 2.4 million).
“Without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.” – Xiao-Li Meng
Example:
During Spring of 2021, Delphi-Facebook estimated vaccine uptake at 70% and U.S. Census estimated it at 67%.
The CDC reported it to be 53%.
And, once we learn about quantifying uncertainty, we will see that large sample sizes produce very small measures of uncertainty.
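As a sketch of why big samples feel so convincing: the standard error of a sample proportion is roughly sqrt(p(1 - p)/n), which shrinks toward zero as n grows, even when the estimate itself is biased.

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# With 2.4 million respondents, the Literary Digest's 57% estimate
# came with a minuscule standard error...
se_prop(0.57, 2.4e6)   # about 0.0003

# ...yet the prediction was off by almost 20 percentage points, because
# the uncertainty calculation assumes a random, representative sample
```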
“If you have the resources, invest in data quality far more than you invest in data quantity. Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data.” – Xiao-Li Meng
Random sampling is important to ensure the sample is representative of the population.
Representativeness isn’t about size.
However, I bet most samples you will encounter won’t have arisen from a random mechanism.
How do we draw conclusions about the population from non-random samples?
Descriptive: Want to estimate quantities related to the population.
→ How many trees are in Alaska?
Predictive: Want to predict the value of a variable.
→ Can I use remotely sensed data to predict forest types in Alaska?
Causal: Want to determine if changes in a variable cause changes in another variable.
→ Are insects causing the increased mortality rates for pinyon-juniper woodlands?
For these goals, we will differentiate between the roles of the variables:
Response variable: Variable I want to better understand
Explanatory/predictor variables: Variables I think might explain/predict the response variable
Random assignment: Cases are randomly assigned to categories of the explanatory variable
Example: COVID Vaccine Trials
To study the effectiveness of the Moderna vaccine (mRNA-1273), researchers carried out a study on over 30,000 adult volunteers with no known previous COVID-19 infection. Volunteers were randomly assigned to either receive two doses of the vaccine or two shots of saline. The incidence of symptomatic COVID-19 was 94% lower in those who received the vaccine than those who did not.
Question: Why does random assignment allow us to conclude that this vaccine was effective at preventing (early strains of) COVID-19?
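One way to see why (a simulated sketch with made-up numbers): randomly assigning 30,000 volunteers to two groups makes pre-existing characteristics, like age, nearly identical across the groups, so a difference in outcomes can be attributed to the treatment.

```r
set.seed(100)

# Hypothetical ages for 30,000 volunteers
age <- rnorm(30000, mean = 50, sd = 15)

# Random assignment: each volunteer is independently assigned to a group
group <- sample(c("vaccine", "placebo"), size = 30000, replace = TRUE)

# The groups end up with nearly identical average ages -- and the same
# balancing happens for variables we never even measured
tapply(age, group, mean)
```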
We have data on the number of Methodist ministers in New England and the number of barrels of rum imported into Boston each year. The data range from 1860 to 1940.
Confounding variable: A third variable that is associated with both the explanatory variable and the response variable.
Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.
→ “Correlation does not imply causation.”
→ “Correlation does not imply not causation.”
Observational study: A study in which the researchers don't actively control the value of any variable, but simply observe the values as they naturally exist.
Example: Hand washing study
Experiment: A study in which the researcher actively controls one or more of the explanatory variables through random assignment.
Example: COVID Trial
Random assignment allows you to explore causal relationships between the explanatory variables and the response variable because the randomization makes the explanatory groups roughly similar.
How do we draw causal conclusions from studies without random assignment?
But also consider the goals of your analysis. Often the research question isn’t causal.
Bottom Line: We often have to use imperfect data to make decisions.