background-image: url("img/DAW.png") background-position: left background-size: 50% class: middle, center, inverse .pull-right[ ## .whitish[Data Wrangling] <br> <br> ### .whitish[Kelly McConville] #### .yellow[ Stat 100 | Week 3 | Spring 2022] ] --- ## Announcements * P-Set 3 will be posted today. **************************** -- ## Goals for Today .pull-left[ * Discuss the Project and Project Assignment 1. ] -- .pull-right[ * Go through the main data wrangling operations in `dplyr`. ] --- ## Stat 100 Group Project * Goal: Practice working through the data analysis process with a real dataset and research question(s) -- * Structure: + Groups of 2-3 people + Key due dates [here](https://docs.google.com/spreadsheets/d/1OoXraUpgu3TBQl5TlFpji0Q4QndjDymqAVlYH5qUu3E/edit?usp=sharing) + Three intermediate assignments and a final 10 minute video presentation -- * Project Assignment 1 + Pick a data set and research questions + Explore and visualize the data to start answering the research questions + Due on Gradescope: Wednesday, February 23rd + Only one person from the group needs to turn it in. --- ## dplyr for Data Wrangling * Six common wrangling verbs: + `select()` + `mutate()` + `filter()` + `arrange()` + `summarize()` + `---_join()` * One action: + `group_by()` --- ## Data Setting: [BLS Consumer Expenditure Survey](https://www.bls.gov/cex/) Last quarter of the 2016 Bureau of Labor Statistics Consumer Expenditure Survey. ```r library(tidyverse) ce <- read_csv("~/shared_data/stat100/data/ce.csv") glimpse(ce) ``` ``` ## Rows: 6,301 ## Columns: 49 ## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1… ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "… ## $ EDUCA2 <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4… ## $ AGE2 <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44",… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1… ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1… ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2… ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1… ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3… ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "… ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", "… ## $ IRAX <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "4… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4… ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3… ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3… ## $ ROOMSQ <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10", … ## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40"… ## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65", … ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1… ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1… ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", N… ## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, N… ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "25… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, … ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141… ## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", "… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CREDITX <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", "."… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "… ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- class: inverse, middle, center ## Now let's go through the Data Wrangling handout! --- ## Naming Wrangled Data -- Should I name my new dataframe `ce` or `ce1`? + *My* answer: + Is your new dataset structurally different? If so, give it a **new name**. + Are you removing values you will need for a future analysis within the same document? If so, give it a **new name**. + Are you just adding to or cleaning the data? If so, then **write over** the original. --- ### Sage Advice from ModernDive > "Crucial: Unless you are very confident in what you are doing, it is worthwhile not starting to code right away. Rather, first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code)." -- Will practice writing pseudocode in section this week! --- ## Data Joins * Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes. -- * Need principled way to combine. + Need a **key** that links two dataframes together. -- * These multiple dataframes are called **relational data**. <!-- --- --> <!-- ## Data Joins: Language --> <!-- Three families of verbs for working with relational data: --> <!-- -- --> <!-- * **Mutating joins**: Add variables to one data frame but matching observations in another. --> <!-- -- --> <!-- * **Filtering joins**: Filter observations from one data frame based on whether or not they match an observation in the other table. --> <!-- -- --> <!-- * **Set operations**: treat observations as if they were set elements. --> --- ## Example: BLS Consumer Expenditure Data * Household survey but data are also collected on individuals + fmli: household data + memi: household member-level data ```r #Read in data with readr package library(tidyverse) fmli <- read_csv("~/shared_data/stat100/data/fmli.csv", na = c("NA", ".")) memi <- read_csv("~/shared_data/stat100/data/memi.csv", na = c("NA", ".")) ``` * Want variables on principal earner to be added to the household data --- ## CE Data * Key variable(s)? ```r library(dplyr) glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1… ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "… ## $ EDUCA2 <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4… ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1… ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1… ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2… ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1… ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3… ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "… ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4… ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3… ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3… ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4… ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1… ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1… ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "… ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "… ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, … ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "… ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3… ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 11… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1… ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, NA… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, N… ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, N… ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, N… ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", "",… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1… ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, N… ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA,… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, N… ``` --- ## CE Data * Key variables? + Problem with class? ```r glimpse(select(fmli, 1,2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ``` ```r glimpse(select(memi, 1:2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Key variables? + Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) glimpse(select(fmli, 1, 2)) ``` ``` ## Rows: 6,301 ## Columns: 2 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <int> 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1… ``` ```r glimpse(select(memi, 1, 2)) ``` ``` ## Rows: 15,412 ## Columns: 2 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "03… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, … ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * Mutating joins: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. --- ## Example Dataframes ```r d1 <- tibble(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- tibble(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## # A tibble: 5 × 3 ## V X Y ## <int> <dbl> <dbl> ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## # A tibble: 3 × 2 ## X S ## <dbl> <dbl> ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 4 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 3 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## # A tibble: 6 × 4 ## V X Y S ## <int> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## # A tibble: 5 × 4 ## V X Y S ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) %>% arrange(NEWID) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 × 64 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 1 25985. 116920 1 2 16 15 63 ## 3 03324174 1 25985. 116920 1 2 16 15 63 ## 4 03324174 1 25985. 116920 1 2 16 15 63 ## 5 03324204 1 6581. 200 1 3 15 15 50 ## 6 03324204 1 6581. 200 1 3 15 15 50 ## 7 03324204 1 6581. 200 1 3 15 15 50 ## 8 03324204 1 6581. 200 1 3 15 15 50 ## 9 03324204 1 6581. 200 1 3 15 15 50 ## 10 03324204 1 6581. 200 1 3 15 15 50 ## # … with 15,402 more rows, and 55 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! ```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) %>% arrange(MEMBNO) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324204 1 6581. 200 1 3 15 15 50 ## 3 03324214 1 20208. 117000 1 4 16 13 47 ## 4 03324244 1 18078. 0 1 2 15 NA 37 ## 5 03324274 2 20112. 2000 1 2 14 NA 51 ## 6 03324284 1 19907. 942 1 2 11 NA 63 ## 7 03324294 1 11705. 0 1 1 10 NA 77 ## 8 03324304 1 24431. 91000 1 2 13 15 37 ## 9 03324324 2 42859. 95000 2 5 12 15 51 ## 10 03324334 1 17481. 40037 1 2 12 14 64 ## # … with 6,291 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` ``` ## # A tibble: 15,412 × 63 ## NEWID PRINEARN FINLWT21 FINCBTAX BLS_URBN POPSIZE EDUC_REF EDUCA2 AGE_REF ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 ## 2 03324174 2 NA NA NA NA <NA> NA NA ## 3 03324174 3 NA NA NA NA <NA> NA NA ## 4 03324174 4 NA NA NA NA <NA> NA NA ## 5 03324204 1 6581. 200 1 3 15 15 50 ## 6 03324204 2 NA NA NA NA <NA> NA NA ## 7 03324204 3 NA NA NA NA <NA> NA NA ## 8 03324204 4 NA NA NA NA <NA> NA NA ## 9 03324204 5 NA NA NA NA <NA> NA NA ## 10 03324204 6 NA NA NA NA <NA> NA NA ## # … with 15,402 more rows, and 54 more variables: AGE2 <dbl>, SEX_REF <dbl>, ## # SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, ## # FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, ## # HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, ## # TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, ## # TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, ## # VEHQ <dbl>, ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) %>% arrange(NEWID) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. <!-- XXX: Reminders? --> <!-- XXX: Joins: next week! -->