background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Data Collection]

<br>
<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 4 | Spring 2022]

]

---

## Announcements

* Make sure to start working on Project Assignment 1 with your group members.
* P-Set 3 is now due Friday at 5pm because of the server hiccup.

****************************

--

## Goals for Today

.pull-left[
* Go through data joins.
]

--

.pull-right[
* Discuss data collection/acquisition.
]

---

## Data Joins

* Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes.

--

* Need a principled way to combine them.
  + Need a **key** that links the two dataframes together.

--

* Multiple dataframes linked by keys are called **relational data**.

<!-- --- -->

<!-- ## Data Joins: Language -->

<!-- Three families of verbs for working with relational data: -->

<!-- -- -->

<!-- * **Mutating joins**: Add variables to one data frame from matching observations in another. -->

<!-- -- -->

<!-- * **Filtering joins**: Filter observations from one data frame based on whether or not they match an observation in the other table. -->

<!-- -- -->

<!-- * **Set operations**: Treat observations as if they were set elements. -->

---

## Example: BLS Consumer Expenditure Data

* Household survey, but data are also collected on individuals
  + fmli: household-level data
  + memi: household member-level data

```r
# Read in data with the readr package
library(tidyverse)
fmli <- read_csv("~/shared_data/stat100/data/fmli.csv",
                 na = c("NA", "."))
memi <- read_csv("~/shared_data/stat100/data/memi.csv",
                 na = c("NA", "."))
```

* Want the principal earner's variables to be added to the household data

---

## CE Data

* Key variable(s)?
```r library(dplyr) glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1… ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "… ## $ EDUCA2 <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4… ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1… ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1… ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2… ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1… ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3… ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "… ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4… ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3… ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3… ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4… ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1… ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1… ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "… ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "… ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 
9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, … ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "… ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3… ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 11… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1… ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, NA… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, N… ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, N… ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, N… ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", "",… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1… ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, N… ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA,… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, N… ``` --- ## CE Data * Key variables? + Problem with class? ```r glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Key variables? + Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) glimpse(select(fmli, 1, 2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1… ``` ```r glimpse(select(memi, 1, 2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * Mutating joins: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. 
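---

## Aside: Filtering Joins

A minimal sketch of the *filtering* joins mentioned on the previous slide, included just for reference (the tiny tibbles here are made up for illustration):

```r
library(dplyr)
a <- tibble(X = c(1, 2, 3))
b <- tibble(X = c(2, 4))

# semi_join(): keep the rows of `a` that have a match in `b`
# (here, the row with X = 2)
semi_join(a, b, by = "X")

# anti_join(): keep the rows of `a` that have no match in `b`
# (here, the rows with X = 1 and X = 3)
anti_join(a, b, by = "X")
```

* Unlike mutating joins, filtering joins only keep or drop rows; they never add columns.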
--- ## Example Dataframes ```r d1 <- tibble(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- tibble(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## # A tibble: 5 × 3 ## V X Y ## <int> <dbl> <dbl> ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## # A tibble: 3 × 2 ## X S ## <dbl> <dbl> ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 4 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 3 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 6 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) %>% arrange(NEWID) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 × 64 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 1 25985. 116920 1 2 16 15 63 ## 3 03324174 1 25985. 116920 1 2 16 15 63 ## 4 03324174 1 25985. 116920 1 2 16 15 63 ## 5 03324204 1 6581. 200 1 3 15 15 50 ## 6 03324204 1 6581. 200 1 3 15 15 50 ## 7 03324204 1 6581. 200 1 3 15 15 50 ## 8 03324204 1 6581. 200 1 3 15 15 50 ## 9 03324204 1 6581. 200 1 3 15 15 50 ## 10 03324204 1 6581. 200 1 3 15 15 50 ## # … with 15,402 more rows, and 55 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! 
```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) %>% arrange(MEMBNO) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 15,412 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 2 NA NA NA NA <NA> NA NA ## 3 03324174 3 NA NA NA NA <NA> NA NA ## 4 03324174 4 NA NA NA NA <NA> NA NA ## 5 03324204 1 6581. 
200 1 3 15 15 50 ## 6 03324204 2 NA NA NA NA <NA> NA NA ## 7 03324204 3 NA NA NA NA <NA> NA NA ## 8 03324204 4 NA NA NA NA <NA> NA NA ## 9 03324204 5 NA NA NA NA <NA> NA NA ## 10 03324204 6 NA NA NA NA <NA> NA NA ## # … with 15,402 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. --- class: middle, center ## Quick Survey Time ### Once you have received your number, provide your responses here: # [https://bit.ly/stat100-survey](https://bit.ly/stat100-survey) --- class: inverse, middle, center ## Data Collection <img src="img/twitter-study-design.png" width="50%" style="display: block; margin: auto;" /> --- ## Who are the data supposed to represent? <img src="img/week4.002.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Who are the data supposed to represent? <img src="img/week4.003.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Census**: We have data on the whole population! --- ## Who are the data supposed to represent? <img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" /> --- ## Who are the data supposed to represent? <img src="img/week4.005.jpeg" width="80%" style="display: block; margin: auto;" /> -- **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Sampling Bias <img src="img/sampling.001.jpeg" width="80%" style="display: block; margin: auto;" /> **Sampling bias**: When the sampled units are **systematically different** from the non-sampled units on the variables of interest. --- ### Sampling Bias Example The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932. In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists. More than 2.4 million responded with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the current President Franklin Delano Roosevelt. -- **Population of Interest**: <br> **Sample**: <br> **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? **Sampling bias**: --- ## Random Sampling Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias. 
#### Types of random sampling

* Simple random sampling
* Stratified random sampling
* Cluster sampling

--

Why aren't all samples generated using simple random sampling?

---

## National Health and Nutrition Examination Survey (NHANES)

Why are these data collected?

--

→ To assess the health of people in the US.

--

How are these data collected?

--

→ **Stage 1**: The US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.

--

→ **Stage 2**: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)

--

→ **Stage 3**: From the sampled city blocks, households are randomly selected. (Households are clusters.)

--

→ **Stage 4**: From the sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.

--

**Why don't they use simple random sampling?**

---

### Careful Using Non-SRS Data

.pull-left[
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-26-1.png" width="576" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-27-1.png" width="576" style="display: block; margin: auto;" />
]

--

* If you are dealing with data collected using a complex sampling design, I'd recommend taking an additional stats course, like Stat 160: Sample Surveys!

---

class: middle, center, inverse

## Detour: Data Ethics

---

### Data Ethics

> "Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations." -- Committee on Professional Ethics of the American Statistical Association (ASA)

--

The ASA has created ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx)

--

→ These guidelines are for EVERYONE doing statistical work.

--

→ There are ethical decisions at all steps of the Data Analysis Process.

--

→ We will periodically refer to specific guidelines throughout this class.

--

> "Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical."

---

class: inverse, center, middle

## Responsibilities to Research Subjects

> "The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research."

---

## Responsibilities to Research Subjects

> "Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records."
<img src="stat100_wk04mon_files/figure-html/unnamed-chunk-28-1.png" width="576" style="display: block; margin: auto;" /> --- ## Detour from Our Detour -- .pull-left[ ```r library(tidyverse) library(NHANES) ggplot(data = NHANES, mapping = aes(x = Age, y = Height)) + geom_point(alpha = 0.1) + stat_smooth(color = "blue") ``` ] .pull-right[ <img src="stat100_wk04mon_files/figure-html/points-1.png" width="768" style="display: block; margin: auto;" /> ] --- ## Detour from Our Detour .pull-left[ ```r library(tidyverse) library(NHANES) library(emojifont) NHANES <- mutate(NHANES, heart = fontawesome("fa-heart")) ggplot(data = NHANES, mapping = aes(x = Age, y = Height, label = heart)) + geom_text(alpha = 0.1, color = "red", family='fontawesome-webfont', size = 8) + stat_smooth(color = "lavender") ``` ] .pull-right[ <img src="stat100_wk04mon_files/figure-html/hearts-1.png" width="768" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse ## Back to Data Collection --- ### Who are the data supposed to represent? <img src="img/sampling.002.jpeg" width="90%" style="display: block; margin: auto;" /> --- ### Who are the data supposed to represent? <img src="img/week4.006.jpeg" width="80%" style="display: block; margin: auto;" /> **Key questions:** + What evidence is there that the data are **representative**? + Who is present? Who is absent? + Who is overrepresented? Who is underrepresented? --- ## Nonresponse bias <img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" /> **Nonresponse bias**: The respondents are **systematically** different from the non-respondents for the variables of interest. --- ### Come Back to Literary Digest Example The **Literary Digest** was a political magazine that correctly predicted the presidential outcomes from 1916 to 1932. In 1936, they conducted the most extensive (to that date) public opinion poll. They mailed questionnaires to over **10 million people** (about 1/3 of US households) whose names and addresses they obtained from telephone books and vehicle registration lists. More than 2.4 million responded with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election instead of the current President Franklin Delano Roosevelt. <br> **Non-response bias**: --- ## Tackling Nonresponse bias <img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" /> -- → Use multiple modes and multiple attempts for reaching sampled cases. -- → Explore key demographic variables to see how respondents and non-respondents vary. --- ## Is Bigger Always Better? -- <img src="img/sampling.004.jpeg" width="80%" style="display: block; margin: auto;" /> -- For our **Literary Digest Example**, Gallup predicted Roosevelt would win based on a survey of **50,000** people, (instead of 2.4 million). --- ### Big Data Paradox <img src="img/meng.jpg" width="10%" style="float:left; padding:10px" style="display: block; margin: auto;" /> > "Without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves." -- Xiao-Li Meng -- **Example:** * During Spring of 2021, Delphi-Facebook estimated vaccine uptake at 70% and U.S. Census estimated it at 67%. -- * The CDC reported it to be 53%. -- And, once we learn about **quantifying uncertainty**, we will see that large sample sizes produce very small measures of uncertainty. 
-- > "If you have the resources, invest in data quality far more than you invest in data quantity. Bad-quality data is essentially wiping out the power you think you have. That’s always been a problem, but it’s magnified now because we have big data. " -- Xiao-Li Meng --- ## Other Key Random Mechanism: Random Assignment -- **Random assignment**: Cases are randomly assigned to categories of the **explanatory variable** * **Response variable**: Variable I want to better understand * **Explanatory variables**: Variables I think might explain the response variable -- → If the data were collected using **random assignment**, then I can determine if the explanatory variable **causes** changes in the response variable. --- ## Causal Inference Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable. -- **Confounding variable**: When the explanatory variable and response variable vary, so does the confounder. → Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response. <img src="img/confound.png" width="70%" style="display: block; margin: auto;" /> --- ## Causal Inference Often want to conclude that an explanatory variable causes changes in a response variable but you did not randomly assign the explanatory variable. **Confounding variable**: When the explanatory variable and response variable vary, so does the confounder. → Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response. <img src="img/confound2.png" width="70%" style="display: block; margin: auto;" /> --- ## Causal Inference * **Spurious relationship**: Two variables are associated but not causally related + In the age of big data, lots of good examples [out there](https://tylervigen.com/spurious-correlations). -- > "Correlation does not imply causation." -- > "Correlation does not imply not causation." -- * **Causal inference**: Methods for finding causal relationships even when the data were collected without random assignment. --- ## Types of Studies **Observational Study:** Collect data in a way that doesn't interfere -- → **Example**: Studies of hand washing frequency -- **Experiment:** Interested in causal relationships so utilize random assignment. Other key features include: + Blinding + Control group + Placebo -- → **Example**: COVID vaccine trials --- ## Thoughts on Data Collection **Random Sampling** Random sampling is important to ensure the sample is representative of the population. -- Representativeness isn't about size. + Small random samples will tend to be more representative than large non-random samples. -- How do we draw conclusions about the population from non-random samples? -- → Investigate how your sampled cases (and respondents) are systematically different from the non-sampled cases (and non-respondents). --- ## Thoughts on Data Collection **Random Assignment** Random assignment allows you to explore **causal** relationships between your explanatory variables and the predictor variables. -- How do we draw causal conclusions from studies without random assignment? -- → With extreme care! Try to control for all possible confounding variables. -- → Discuss the associations/correlations you found. Use domain knowledge to address potentially causal links. -- → Take more stats to learn more about causal inference. -- **Bottom Line:** We often have to use imperfect data to make decisions. 
---

## Two Key Random Mechanisms

<img src="img/ims_ch2.png" width="80%" style="display: block; margin: auto;" />

---

### Reminders

* **Participation/Engagement:**
  + In class and section
  + Office hours: Must attend **at least one** office hour during the first five weeks of the semester
  + On Slack: **At least two** posts before Spring Break
* Will practice applying data collection ideas in Section this week!