background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Data Collection]

<br>
<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 4 | Spring 2022]

]

---

## Announcements

* Make sure to start working on Project Assignment 1 with your group members.
* P-Set 3 is now due Friday at 5pm because of the server hiccup.

****************************

--

## Goals for Today

.pull-left[
* Go through data joins.
]

--

.pull-right[
* Discuss data collection/acquisition.
]

---

## Data Joins

* Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes.

--

* Need a principled way to combine them.
  + Need a **key** that links the two dataframes together.

--

* Multiple dataframes linked by keys are called **relational data**.

<!-- --- -->

<!-- ## Data Joins: Language -->

<!-- Three families of verbs for working with relational data: -->

<!-- -- -->

<!-- * **Mutating joins**: Add variables to one data frame from matching observations in another. -->

<!-- -- -->

<!-- * **Filtering joins**: Filter observations from one data frame based on whether or not they match an observation in the other table. -->

<!-- -- -->

<!-- * **Set operations**: Treat observations as if they were set elements. -->

---

## Example: BLS Consumer Expenditure Data

* Household survey, but data are also collected on individuals
  + fmli: household-level data
  + memi: household member-level data

```r
# Read in data with the readr package
library(tidyverse)
fmli <- read_csv("~/shared_data/stat100/data/fmli.csv",
                 na = c("NA", "."))
memi <- read_csv("~/shared_data/stat100/data/memi.csv",
                 na = c("NA", "."))
```

* Want the principal earner's variables to be added to the household data

---

## CE Data

* Key variable(s)?
```r library(dplyr) glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1… ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "… ## $ EDUCA2 <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4… ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1… ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1… ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2… ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1… ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3… ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "… ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4… ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3… ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3… ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4… ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1… ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1… ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "… ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "… ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 
9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, … ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "… ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3… ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 11… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1… ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, NA… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, N… ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, N… ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, N… ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", "",… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1… ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, N… ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA,… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, N… ``` --- ## CE Data * Key variables? + Problem with class? ```r glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Key variables? + Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) glimpse(select(fmli, 1, 2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1… ``` ```r glimpse(select(memi, 1, 2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * Mutating joins: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. 
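---

## Aside: Filtering Joins

A minimal sketch of the *filtering* joins mentioned on the previous slide, included just for reference (the tiny tibbles here are made up for illustration):

```r
library(dplyr)
a <- tibble(X = c(1, 2, 3))
b <- tibble(X = c(2, 4))

# semi_join(): keep the rows of `a` that have a match in `b`
# (here, the row with X = 2)
semi_join(a, b, by = "X")

# anti_join(): keep the rows of `a` that have no match in `b`
# (here, the rows with X = 1 and X = 3)
anti_join(a, b, by = "X")
```

* Unlike mutating joins, filtering joins only keep or drop rows; they never add columns.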
--- ## Example Dataframes ```r d1 <- tibble(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- tibble(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## # A tibble: 5 × 3 ## V X Y ## <int> <dbl> <dbl> ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## # A tibble: 3 × 2 ## X S ## <dbl> <dbl> ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 4 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 3 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 6 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) %>% arrange(NEWID) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 × 64 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 1 25985. 116920 1 2 16 15 63 ## 3 03324174 1 25985. 116920 1 2 16 15 63 ## 4 03324174 1 25985. 116920 1 2 16 15 63 ## 5 03324204 1 6581. 200 1 3 15 15 50 ## 6 03324204 1 6581. 200 1 3 15 15 50 ## 7 03324204 1 6581. 200 1 3 15 15 50 ## 8 03324204 1 6581. 200 1 3 15 15 50 ## 9 03324204 1 6581. 200 1 3 15 15 50 ## 10 03324204 1 6581. 200 1 3 15 15 50 ## # … with 15,402 more rows, and 55 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! 
```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) %>% arrange(MEMBNO) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 15,412 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 2 NA NA NA NA <NA> NA NA ## 3 03324174 3 NA NA NA NA <NA> NA NA ## 4 03324174 4 NA NA NA NA <NA> NA NA ## 5 03324204 1 6581. 
200 1 3 15 15 50 ## 6 03324204 2 NA NA NA NA <NA> NA NA ## 7 03324204 3 NA NA NA NA <NA> NA NA ## 8 03324204 4 NA NA NA NA <NA> NA NA ## 9 03324204 5 NA NA NA NA <NA> NA NA ## 10 03324204 6 NA NA NA NA <NA> NA NA ## # … with 15,402 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. --- class: middle, center ## Quick Survey Time ### Once you have received your number, provide your responses here: # [https://bit.ly/stat100-survey](https://bit.ly/stat100-survey) --- class: inverse, middle, center ## Data Collection <img src="img/twitter-study-design.png" width="50%" style="display: block; margin: auto;" /> --- ## Who are the data supposed to represent? <img src="img/week4.002.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Who are the data supposed to represent? <img src="img/week4.003.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Census**: We have data on the whole population! --- ## Who are the data supposed to represent? <img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" /> --- ## Who are the data supposed to represent? <img src="img/week4.005.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Sampling Bias <img src="img/sampling.001.jpeg" width="80%" style="display: block; margin: auto;" /> **Sampling bias**: When the sampled units are **systematically different** from the non-sampled units on the variables of interest. --- ### Sampling Bias Example The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932. In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists. More than 2.4 million responded with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the current President Franklin Delano Roosevelt. -- **Population of Interest**: <br> **Sample**: <br> **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? **Sampling bias**: --- ## Random Sampling Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias. 
#### Types of random sampling

* Simple random sampling
* Stratified random sampling
* Cluster sampling

--

Why aren't all samples generated using simple random sampling?

---

## National Health and Nutrition Examination Survey (NHANES)

Why are these data collected?

--

→ To assess the health of people in the US.

--

How are these data collected?

--

→ **Stage 1**: The US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.

--

→ **Stage 2**: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

--

→ **Stage 3**: From the sampled city blocks, households are randomly selected. (Households are clusters.)

--

→ **Stage 4**: From the sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.

--

**Why don't they use simple random sampling?**

---

### Careful Using Non-SRS Data

.pull-left[
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-26-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-27-1.png" width="576" style="display: block; margin: auto;" />
]

--

* If you are dealing with data collected using a complex sampling design, I'd recommend taking an additional stats course, like Stat 160: Sample Surveys!

---

class: middle, center, inverse

## Detour: Data Ethics

---

### Data Ethics

> "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations." -- Committee on Professional Ethics of the American Statistical Association (ASA)

--

The ASA has created ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx)

--

→ These guidelines are for EVERYONE doing statistical work.

--

→ There are ethical decisions at all steps of the Data Analysis Process.

--

→ We will periodically refer to specific guidelines throughout this class.

--

> "Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."

---

class: inverse, center, middle

## Responsibilities to Research Subjects

> "The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research."

---

## Responsibilities to Research Subjects

> "Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records."
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-28-1.png" width="576" style="display: block; margin: auto;" /> --- ## Detour from Our Detour -- .pull-left[ ```r library(tidyverse) library(NHANES) ggplot(data = NHANES, mapping = aes(x = Age, y = Height)) + geom_point(alpha = 0.1) + stat_smooth(color = "blue") ``` ] .pull-right[ <img src="stat100_wk04mon_files/figure-html/points-1.png" width="768" style="display: block; margin: auto;" /> ] --- ## Detour from Our Detour .pull-left[ ```r library(tidyverse) library(NHANES) library(emojifont) NHANES <- mutate(NHANES, heart = fontawesome("fa-heart")) ggplot(data = NHANES, mapping = aes(x = Age, y = Height, label = heart)) + geom_text(alpha = 0.1, color = "red", family='fontawesome-webfont', size = 8) + stat_smooth(color = "lavender") ``` ] .pull-right[ <img src="stat100_wk04mon_files/figure-html/hearts-1.png" width="768" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse ## Back to Data Collection --- ### Who are the data supposed to represent? <img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" /> --- ### Who are the data supposed to represent? <img src="img/week4.006.jpeg" width="80%" style="display: block; margin: auto;" /> **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Nonresponse bias <img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" /> **Nonresponse bias**: The respondents are **systematically** different from the non-respondents for the variables of interest. --- ### Come Back to Literary Digest Example The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932. In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists. More than 2.4 million responded with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the current President Franklin Delano Roosevelt. <br> **Non-response bias**: --- ## Tackling Nonresponse bias <img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" /> -- → Use multiple modes and multiple attempts for reaching sampled cases. -- → Explore key demographic variables to see how respondents and non-respondents vary. --- ## Is Bigger Always Better? -- <img src="img/sampling.004.jpeg" width="80%" style="display: block; margin: auto;" /> -- For our **Literary Digest Example**, Gallup predicted Roosevelt would win based on a survey of **50,000** people, (instead of 2.4 million). --- ### Big Data Paradox <img src="img/meng.jpg" width="10%" style="float:left; padding:10px" style="display: block; margin: auto;" /> > "Without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves." -- Xiao-Li Meng -- **Example:** * During Spring of 2021, Delphi-Facebook estimated vaccine uptake at 70% and U.S. Census estimated it at 67%. -- * The CDC reported it to be 53%. -- And, once we learn about **quantifying uncertainty**, we will see that large sample sizes produce very small measures of uncertainty. 
-- > "If you have the resources, invest in data quality far more than you invest in data quantity. Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data. " -- Xiao-Li Meng --- ## Other Key Random Mechanism: Random Assignment -- **Random assignment**: Cases are randomly assigned to categories of the **explanatory variable** * **Response variable**: Variable I want to better understand * **Explanatory variables**: Variables I think might explain the response variable -- → If the data were collected using **random assignment**, then I can determine if the explanatory variable **causes** changes in the response variable. --- ## Causal Inference Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable. -- **Confounding variable**: When the explanatory variable and response variable vary, so does the confounder. → Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response. <img src="img/confound.png" width="70%" style="display: block; margin: auto;" /> --- ## Causal Inference Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable. **Confounding variable**: When the explanatory variable and response variable vary, so does the confounder. → Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response. <img src="img/confound2.png" width="70%" style="display: block; margin: auto;" /> --- ## Causal Inference * **Spurious relationship**: Two variables are associated but not causally related + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations). -- > "Correlation does not imply causation." -- > "Correlation does not imply not causation." -- * **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment. --- ## Types of Studies **Observational Study:** Collect data in a way that doesn't interfere -- → **Example**: Studies of hand washing frequency -- **Experiment:** Interested in causal relationships so utilize random assignment. Other key features include: + Blinding + Control group + Placebo -- → **Example**: COVID vaccine trials --- ## Thoughts on Data Collection **Random Sampling** Random sampling is important to ensure the sample is representative of the population. -- Representativeness isn't about size. + Small random samples will tend to be more representative than large non-random samples. -- How do we draw conclusions about the population from non-random samples? -- → Investigate how your sampled cases (and respondents) are systematically different from the non-sampled cases (and non-respondents). --- ## Thoughts on Data Collection **Random Assignment** Random assignment allows you to explore **causal** relationships between your explanatory variables and the predictor variables. -- How do we draw causal conclusions from studies without random assignment? -- → With extreme care! Try to control for all possible confounding variables. -- → Discuss the associations/correlations you found. Use domain knowledge to address potentially causal links. -- → Take more stats to learn more about causal inference. -- **Bottom Line:** We often have to use imperfect data to make decisions. 
---

## Two Key Random Mechanisms

<img src="img/ims_ch2.png" width="80%" style="display: block; margin: auto;" />

---

### Reminders

* **Participation/Engagement:**
  + In class and section
  + Office hours: Must attend **at least one** office hour during the first five weeks of the semester
  + On Slack: **At least two** posts before Spring Break
* Will practice applying data collection ideas in Section this week!