Data Collection

background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

## .base-blue[Data Collection]

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 5 | Fall 2022]

---

## Announcements

* Make sure to start working on Project Assignment 1 with your group members.

****************************

## Goals for Today

* Finish up our discussion of data wrangling with `dplyr`.

* Cover data collection/acquisition.

---

## Data Collection

---

##  Who are the data supposed to represent?

**Key questions:**

+ What evidence is there that the data are **representative**?
+ Who is present?  Who is absent?
+ Who is overrepresented?  Who is underrepresented?

---

##  Who are the data supposed to represent?

**Census**: We have data on the whole population!

---

##  Who are the data supposed to represent?

**Key questions:**

+ What evidence is there that the **sample** is **representative** of the **population**?
+ Who is present?  Who is absent?
+ Who is overrepresented?  Who is underrepresented?

---

## Sampling Bias

**Sampling bias**: When the sampled units are **systematically different** from the non-sampled units on the variables of interest.

---

### Sampling Bias Example

The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932.  In 1936, they conducted the most extensive (to that date) public opinion poll.  They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists.

**Population of Interest**:

**Sample**:

**Sampling bias**:

---

## Random Sampling

Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias.

#### Types of random sampling

* Simple random sampling

* Cluster sampling

* Stratified random sampling

* Systematic sampling

Why aren't all samples generated using simple random sampling?

---

###  [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

> Mission: "Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US."

Need a **random sample** of ground plots to say something about the state of our nation's forests!

---

### FIA: Simple Random Sampling

- Break the landscape up into equally sized plots (~1 acre).
- Number each plot from 1 to 6,755,200.
- Use a **random** mechanism to sample a plot for about every 6,000 acres.

```r
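library(dplyr)  # provides the %>% pipe

# Simple random sample of 1,126 plot numbers; show the first six
sample(x = 1:6755200, size = 1126) %>%
  head()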
```

```
## [1] 5117382 5263094 2435611 1584603 6586114 1684609
```

Thoughts on this sampling design?

---

### FIA: Cluster Random Sampling

- Break the landscape up into equally sized plots (~1 acre).
- Put each plot in a cluster.
    + For our example: cluster = county.
- Number each cluster.
- Use a **random** mechanism to sample 2 clusters.
- Sample **all** plots in those 2 clusters.

```r
sample(x = 1:14, size = 2) 
```

```
## [1] 6 9
```

Thoughts on this sampling design?

---

### FIA: Cluster Random Sampling

```r
sample(x = 1:14, size = 2) 
```

```
## [1] 2 9
```

```r
sample(x = 1:---, size = ---)
```

Thoughts on this sampling design?
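
One way to read the blank `sample()` call above is as a second stage: after sampling counties, sample plots within each sampled county. Here is a minimal sketch of that idea. The `plots` data frame, its column names, and the sizes (100 plots per county, 10 plots per sampled county) are made up for illustration and are not FIA data.

```r
# Hypothetical sampling frame: 14 counties, 100 plots per county
set.seed(100)
plots <- data.frame(
  plot_id = 1:1400,
  county  = rep(1:14, each = 100)
)

# Stage 1: randomly sample 2 counties (clusters)
sampled_counties <- sample(x = 1:14, size = 2)

# Stage 2: within each sampled county, randomly sample 10 plots
two_stage <- do.call(rbind, lapply(sampled_counties, function(cty) {
  in_county <- plots[plots$county == cty, ]
  in_county[sample(x = nrow(in_county), size = 10), ]
}))

# Number of sampled plots in each sampled county
table(two_stage$county)
```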

---

### FIA: Cluster Random Sampling

- Are our clusters based on counties **homogeneous**?

- Why is **homogeneity** important for cluster sampling?

---

### FIA: Stratified Random Sampling

- Break the landscape up into equally sized plots (~1 acre).
- Put each plot in a stratum.
    + For our example: stratum = county.
- Take a **simple random sample** within every stratum.
    + Don't have to be equally sized!

```r
# Do this for each stratum
sample(x = 1:---, size = ---)
```

Thoughts on this sampling design?
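
To make the "do this for each stratum" step concrete, here is a rough sketch on a toy frame. The three counties, their sizes, and the per-stratum sample sizes in `n_per_stratum` are all assumptions chosen for illustration.

```r
# Hypothetical sampling frame: 3 counties (strata) of different sizes
set.seed(100)
plots <- data.frame(
  plot_id = 1:600,
  county  = rep(c("A", "B", "C"), times = c(100, 200, 300))
)

# Chosen sample size for each stratum (they don't have to be equal)
n_per_stratum <- c(A = 5, B = 10, C = 15)

# Take a simple random sample within every stratum
by_county <- split(plots, plots$county)
stratified <- do.call(rbind, lapply(names(by_county), function(cty) {
  stratum <- by_county[[cty]]
  stratum[sample(x = nrow(stratum), size = n_per_stratum[[cty]]), ]
}))

# Every county appears, with the sample size we asked for
table(stratified$county)
```

Unlike the cluster design, every county is guaranteed to show up in the sample.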

---

### FIA: Systematic Random Sampling

This is FIA's **actual** sampling design (okay, slightly simplified).

- Break the landscape up into equally sized plots (~1 acre).
- Number each plot from 1 to 6,755,200.
- Use a **random** mechanism to pick a starting point.  Then sample about once every 6,000 acres.

```r
sample(x = 1:6755200, size = 1) 
```

```
## [1] 214156
```

Why is this design **better** than simple random sampling?
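
Here is a minimal sketch of how the full systematic sample could be built once a starting point is chosen. As a simplifying assumption, the start is drawn from the first 6,000 plots (rather than from all plots) so the sample can step through the whole list without wrapping around. The names `N`, `k`, `start`, and `systematic` are just for this example.

```r
N <- 6755200   # number of plots in the frame
k <- 6000      # sampling interval: one plot per ~6,000 acres

set.seed(100)
start <- sample(x = 1:k, size = 1)          # random starting plot

# Take every k-th plot from the starting point to the end of the list
systematic <- seq(from = start, to = N, by = k)

length(systematic)
head(systematic)
```

The only randomness is in `start`; after that, the sampled plots are evenly spread over the list.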

---

### National Health and Nutrition Examination Survey

Why are these data collected?

&rarr; To assess the health of people in the US.

How are these data collected?

&rarr; **Stage 1**: US is stratified by geography and distribution of minority populations.  Counties are randomly selected within each stratum.

&rarr; **Stage 2**: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

&rarr; **Stage 3**: From sampled city blocks, households are randomly selected. (Households are clusters.)

&rarr; **Stage 4**: From sampled households, people are randomly selected.  For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.

**Why don't they use simple random sampling?**

---

### Careful Using Non-Simple Random Sample Data

* If you are dealing with data collected using a complex sampling design, I'd recommend taking an additional stats course, like Stat 160: Sample Surveys!

---

## Detour: Data Ethics

---

### Data Ethics

> "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations." -- Committee on Professional Ethics of the American Statistical Association (ASA)

The ASA has created ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx).

&rarr; These guidelines are for EVERYONE doing statistical work.

&rarr; There are ethical decisions at all steps of the Data Analysis Process.

&rarr; We will periodically refer to specific guidelines throughout this class.

> "Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."

---

##  Responsibilities to Research Subjects

> "The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research."

---

###  Responsibilities to Research Subjects

> "Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records."

Why does the `Age` variable max out at 80?
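
One likely answer is **top-coding**: as I understand the `NHANES` package documentation, participants aged 80 or older are recorded as 80, which makes the oldest respondents harder to re-identify. Here is a small sketch of the idea with made-up ages (not NHANES data):

```r
# Hypothetical true ages; not NHANES data
true_age <- c(34, 67, 80, 85, 92)

# Top-code: report any age of 80 or more as 80
reported_age <- pmin(true_age, 80)
reported_age
```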

---

## Detour from Our Detour

```r
library(tidyverse)
library(NHANES)

ggplot(data = NHANES, 
       mapping = aes(x = Age,
                     y = Height)) +
  geom_point(alpha = 0.1) +
  stat_smooth(color = "blue")
```

---

## Detour from Our Detour

```r
library(tidyverse)
library(NHANES)
library(emojifont)

NHANES <- mutate(NHANES, 
 heart = fontawesome("fa-heart"))

ggplot(data = NHANES, 
       mapping = aes(x = Age,
                     y = Height,
                     label = heart)) +
  geom_text(alpha = 0.1, color = "red",
            family='fontawesome-webfont',
            size = 8) +
  stat_smooth(color = "lavender") +
  labs(title = "Happy Love Note Day")
```

---

## Back to Data Collection

---

###  Who are the data supposed to represent?

**Key questions:**

+ What evidence is there that the **respondents** are **representative** of the **population**?
+ Who is present?  Who is absent?
+ Who is overrepresented?  Who is underrepresented?

---

## Nonresponse bias

**Nonresponse bias**: The respondents are **systematically** different from the non-respondents for the variables of interest.

---

### Come Back to Literary Digest Example

Of the 10 million people surveyed, more than 2.4 million responded, with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the incumbent, President Franklin Delano Roosevelt.

**Non-response bias**?

---

## Tackling Nonresponse bias

&rarr; Use multiple modes and multiple attempts for reaching sampled cases.

&rarr; Explore key demographic variables to see how respondents and non-respondents vary.
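
As a rough sketch of the second suggestion, suppose we know one demographic variable for everyone we sampled, whether or not they responded. The data below are simulated, and the response rates in `response_rate` are invented for illustration.

```r
set.seed(100)
n <- 1000

# Hypothetical sampled cases with one demographic variable known for everyone
sampled <- data.frame(
  age_group = sample(c("18-39", "40-64", "65+"), size = n, replace = TRUE)
)

# Assumed response rates that differ by age group (made up for illustration)
response_rate <- c("18-39" = 0.2, "40-64" = 0.4, "65+" = 0.6)
sampled$responded <- runif(n) < response_rate[sampled$age_group]

# Share of each age group among non-respondents (FALSE) and respondents (TRUE)
age_by_response <- prop.table(
  table(sampled$responded, sampled$age_group),
  margin = 1
)
round(age_by_response, 2)
```

If the two rows look very different, the respondents differ systematically from the non-respondents on that variable.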

---

## Is Bigger Always Better?

For our **Literary Digest Example**, Gallup predicted Roosevelt would win based on a survey of **50,000** people (instead of 2.4 million).
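
A small simulation can show how a huge non-random sample can lose to a small random one. Everything here is hypothetical: the population size, the share of phone or car owners, and the support rates are assumed values, not 1936 data.

```r
set.seed(1936)
N <- 1000000

# Hypothetical population: 30% own a phone or car (an assumed figure)
owns_phone_or_car <- runif(N) < 0.3

# Assumed support for Roosevelt: 45% among owners, 67% among everyone else
supports_fdr <- runif(N) < ifelse(owns_phone_or_car, 0.45, 0.67)

# Large non-random sample: 200,000 people drawn only from owners
big_biased <- sample(which(owns_phone_or_car), size = 200000)

# Small simple random sample: 5,000 people drawn from the whole population
small_srs <- sample(x = 1:N, size = 5000)

estimates <- c(
  truth      = mean(supports_fdr),
  big_biased = mean(supports_fdr[big_biased]),
  small_srs  = mean(supports_fdr[small_srs])
)
round(estimates, 3)
```

Under these assumptions, the small random sample should land much closer to the truth than the sample that is 40 times larger.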

---

### Big Data Paradox

> "Without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves." -- Xiao-Li Meng

**Example:**

* During Spring of 2021, Delphi-Facebook estimated vaccine uptake at 70% and the U.S. Census Bureau estimated it at 67%.

* The CDC reported it to be 53%.

And, once we learn about **quantifying uncertainty**, we will see that large sample sizes produce very small measures of uncertainty.

> "If you have the resources, invest in data quality far more than you invest in data quantity. Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data. " -- Xiao-Li Meng
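
As a preview of quantifying uncertainty, here is a rough sketch using the usual approximate 95% margin of error for a proportion. The formula assumes a simple random sample, which the Literary Digest poll was not, and the helper `margin_of_error()` is a name made up for this example.

```r
# Approximate 95% margin of error for a sample proportion,
# valid under simple random sampling
margin_of_error <- function(p_hat, n) {
  1.96 * sqrt(p_hat * (1 - p_hat) / n)
}

# Literary Digest: 57% for Landon among about 2.4 million respondents
margin_of_error(p_hat = 0.57, n = 2400000)

# The same proportion with a sample of 50,000
margin_of_error(p_hat = 0.57, n = 50000)
```

With 2.4 million respondents, the margin of error is well under a tenth of a percentage point. The estimate looks extremely precise even though it was far from the truth, because this measure of uncertainty ignores bias.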

---

## Thoughts on Data Collection: Sampling

**Random** sampling is important to ensure the sample is **representative** of the population.

Representativeness isn't about size.

+ Small random samples will tend to be more representative than large non-random samples.

But I bet most samples you will encounter won't have arisen from a random mechanism.

How do we draw conclusions about the population from **non-random samples**?

&rarr; Determine if your sampled cases (and respondents) are systematically different from the non-sampled cases (and non-respondents) for the variables you care about.

&rarr; Adjust your population of interest.

&rarr; Take a survey stats course to learn how to adjust the sample to make it more representative.

---

## Now let's shift the discussion.

### Suppose we have our sample and have determined the population it represents.

### What kind of conclusions can we draw?

---

### Typical Analysis Goals

**Descriptive**: Want to estimate quantities related to the population.

> How many trees are in Alaska?

**Predictive**: Want to predict the value of a variable.

> Can I use remotely sensed data to predict forest types in Alaska?

**Causal**: Want to determine if changes in a variable cause changes in another variable.

> Are insects causing the increased mortality rates for pinyon-juniper woodlands?

---

### Typical Analysis Goals

For these goals, we will differentiate between two kinds of variables:

* **Response variable**: Variable I want to better understand

* **Explanatory/predictor variables**: Variables I think might explain/predict the response variable

> How many trees are in Alaska?

> Can I use remotely sensed data to predict forest types in Alaska?

> Are insects causing the increased mortality rates for pinyon-juniper woodlands?

---

### Key Mechanism for Causal Goal

**Random assignment**: Cases are randomly assigned to categories of the **explanatory variable**

&rarr; If the data were collected using **random assignment**, then I can determine if the explanatory variable **causes** changes in the response variable.

**Example**: [COVID Vaccine Trials](https://www.nih.gov/news-events/nih-research-matters/experimental-coronavirus-vaccine-highly-effective)

To study the effectiveness of the Moderna vaccine (mRNA-1273), researchers carried out a study on over 30,000 adult volunteers with no known previous COVID-19 infection.  Volunteers were randomly assigned to either receive two doses of the vaccine or two shots of saline.  The incidence of symptomatic COVID-19 was 94% lower in those who received the vaccine than those who did not.

**Why does random assignment allow us to conclude that this vaccine was effective at preventing (early strains of) COVID-19?**
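
One way to build intuition is to simulate random assignment and check whether the groups look alike on a trait the researchers did not control. This is a sketch with made-up volunteers, not the trial data: the `volunteers` data frame and the age range are assumptions.

```r
set.seed(100)
n <- 30000

# Hypothetical volunteers with a pre-existing trait (age)
volunteers <- data.frame(
  id  = 1:n,
  age = sample(18:90, size = n, replace = TRUE)
)

# Randomly assign exactly half of the volunteers to each group
volunteers$arm <- sample(rep(c("vaccine", "saline"), each = n / 2))

# Average age by group: should be close because assignment ignores age
mean_age_by_arm <- tapply(volunteers$age, volunteers$arm, mean)
mean_age_by_arm
```

The same balancing should hold, on average, for traits we never measured, so a large difference in outcomes is hard to blame on the groups differing to begin with.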

---

## Causal Inference

We often want to conclude that an explanatory variable causes changes in a response variable even though the explanatory variable was not randomly assigned.

**Confounding variable**: When the explanatory variable and response variable vary, so does the confounder.

&rarr; Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.
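
Here is a simulated sketch of confounding. By construction, the explanatory variable has **no** effect on the response; both simply depend on a third variable. All three variables are invented for illustration.

```r
set.seed(100)
n <- 10000

# Hypothetical confounder (for example, daily temperature)
confounder <- rnorm(n)

# Explanatory and response variables both depend on the confounder,
# but the explanatory variable has no effect on the response
explanatory <- confounder + rnorm(n)
response    <- confounder + rnorm(n)

# Association ignoring the confounder: clearly positive
cor(explanatory, response)

# Slope for the explanatory variable after adjusting for the confounder
adjusted_slope <- coef(lm(response ~ explanatory + confounder))[["explanatory"]]
adjusted_slope
```

The two variables are noticeably correlated, yet the adjusted slope should be close to zero. In real data we rarely know, or have measured, every confounder.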

---

## Causal Inference

* **Spurious relationship**: Two variables are associated but not causally related
    + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations).

> "Correlation does not imply causation."

>  "Correlation does not imply not causation."

* **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment.

---

## Types of Studies

**Experiment:** Interested in causal relationships so utilize random assignment.  Other key features include:

+ Control group
+ Placebo
+ Blinding

**Example**: COVID vaccine trials

+ Control group: Those who got no vaccine.

+ Placebo: The control group got saline shots so they didn't know their group.

+ Blinding: Neither the subjects nor the researchers interacting with them knew which group each subject was in.

---

## Types of Studies

**Observational Study:** Collect data in a way that doesn't interfere with how the data arise.

**Example**: Hand washing study

To estimate what percent of people in the US wash their hands after using a public restroom, researchers pretended to comb their hair while observing 6000 people in public restrooms throughout the United States. They found that 85% of the people who were observed washed their hands after going to the bathroom.

---

## Thoughts on Data Collection: Goals

Consider your analysis goals when making conclusions.  If your goal is to show causal relationships, ask:

* Do I have convincing evidence that any differences in the response aren't just due to "random chance?"
* Do I have convincing evidence against the explanatory groups differing to begin with?

Random assignment allows you to explore **causal** relationships between your explanatory variables and the response variable because the randomization makes the explanatory groups roughly similar.

How do we draw causal conclusions from studies without random assignment?

&rarr; With extreme care!  Try to control for all possible confounding variables.

&rarr; Discuss the associations/correlations you found.  Use domain knowledge to address potentially causal links.

&rarr; Take more stats to learn more about causal inference.

**Bottom Line:** We often have to use imperfect data to make decisions.

---

### Reminders

* **Participation/Engagement:**
    + In class and section
    + By Oct 14th, must:
        + Attend at least one office hour.
        + Post at least two messages on Slack.