background-image: url("img/logo_padded.001.jpeg") background-position: left background-size: 60% class: middle, center, .pull-right[ <br> ## .base_color[Data Joins and] ## .base_color[Reshaping] <br> #### .navy[Kelly McConville] #### .navy[ Stat 108 | Week 4 | Spring 2023] ] --- ## Announcements * No lecture on Monday -- University Holiday. * Some Monday Office Hours will also be cancelled. Make sure to check [the schedule](https://docs.google.com/spreadsheets/d/1HqEmr4tEtFPWRrF5TJHd1VgtVD030w6aAYySoFTnhBw/edit?usp=sharing). ************************ ## Week 3 Goals .pull-left[ **Mon Lecture** * GitHub + RStudio Projects workflow * Exploring `R` Objects ] .pull-right[ **Wed Lecture** * Data Joins * Data reshaping with `tidyr` * Coding Style ] --- class: middle, center ## So what about **generic vectors/lists**? #### Recall they are heterogeneous collections of any type of `R` objects. --- ### Lists * Think of these as the most general way to store things. ```r groceries <- list() groceries$whole_foods <- c("apples", "chocolate", "kale", "garlic") groceries$star <- c("vinegar", "soap") groceries$lizzys <- c("french vanilla", "black cherry", "mint chip") groceries$budget <- data.frame(stores = c("whole_foods", "star", "lizzys"), fund = c(100, 25, 200)) class(groceries) ``` ``` ## [1] "list" ``` --- ### Lists * Notice the nested structure ```r groceries ``` ``` ## $whole_foods ## [1] "apples" "chocolate" "kale" "garlic" ## ## $star ## [1] "vinegar" "soap" ## ## $lizzys ## [1] "french vanilla" "black cherry" "mint chip" ## ## $budget ## stores fund ## 1 whole_foods 100 ## 2 star 25 ## 3 lizzys 200 ``` --- ```r holidays <- c("Valentine's", "President's") feb <- list(groceries = groceries, holidays = holidays) feb ``` ``` ## $groceries ## $groceries$whole_foods ## [1] "apples" "chocolate" "kale" "garlic" ## ## $groceries$star ## [1] "vinegar" "soap" ## ## $groceries$lizzys ## [1] "french vanilla" "black cherry" "mint chip" ## ## $groceries$budget ## stores fund ## 1 whole_foods 100 ## 2 star 25 ## 3 lizzys 200 ## ## ## $holidays ## [1] "Valentine's" "President's" ``` --- ### Indexing lists * Distinguishing between `[ ]` and `[[]]` ```r thing1 <- feb$groceries[3] thing1 ``` ``` ## $lizzys ## [1] "french vanilla" "black cherry" "mint chip" ``` ```r class(thing1) ``` ``` ## [1] "list" ``` ```r thing2 <- feb$groceries[[3]] thing2 ``` ``` ## [1] "french vanilla" "black cherry" "mint chip" ``` ```r class(thing2) ``` ``` ## [1] "character" ``` --- ## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) * `x` is a list: the pepper shaker containing packets of pepper. <img src="img/list.png" width="30%" style="display: block; margin: auto;" /> --- ## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) * `x[1]` is a pepper shaker containing the first packet of pepper. <img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" /> --- ## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) * `x[2]` is what? -- <img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" /> --- ## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) * `x[[1]]` is what? -- <img src="img/innerobject.png" width="30%" style="display: block; margin: auto;" /> --- ## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments) * `x[[1]][[1]]` is what? -- <img src="img/innerobject2.png" width="30%" style="display: block; margin: auto;" /> --- ## Data Frames Let's relate `list()`s to our favorite R object: `data.frame()`s! * `data.frame()`s are `list()`s. * Each variable of a `data.frame`() is a `vector()`. * The `vector()`s all have the same length but not necessary the same class. --- ## Data Frames ```r dat <- data.frame(x = c(1, 2, 4), y = c("cat", "llama", "pig")) dat ``` ``` ## x y ## 1 1 cat ## 2 2 llama ## 3 4 pig ``` ```r dat ``` ``` ## x y ## 1 1 cat ## 2 2 llama ## 3 4 pig ``` ```r dat$x ``` ``` ## [1] 1 2 4 ``` --- ## Data Frames ```r str(dat[1]) ``` ``` ## 'data.frame': 3 obs. of 1 variable: ## $ x: num 1 2 4 ``` ```r str(dat[[1]]) ``` ``` ## num [1:3] 1 2 4 ``` ```r dat[1, 2] ``` ``` ## [1] "cat" ``` ```r dat[1, ] ``` ``` ## x y ## 1 1 cat ``` --- ## R Data Objects * We can create and interact with (atomic) **vectors** and **lists** (generic vectors). -- * When writing and debugging code, it is a good idea to check the `class()` of an object. -- * R will sometimes change the type of an object without telling us (but based on our actions). -- * **Vectorization** makes `R` code speedy but also can cause you to do something you didn't mean to do. -- * While we will primarily interact with `vector()`s and `data.frame()`s, we will sometimes need `list()`s. + Including in our interactive dashboarding. --- ## Coding Style > "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- The Tidyverse Style Guide .pull-left[ ```r messy4324 <-select(dataset,thing1,thing2)%>%filter(thing1=="uncool") ``` ] .pull-right[ ```r clean <- select(dataset, thing1, thing2) %>% filter(thing1 == "uncool") ``` ] --- ## Coding Style * Part of writing **easily** reproducible code. -- * Lots of reasonable coding styles. -- * Important piece is **consistency** across all coders on a project. -- * My opinions: + Use spaces liberally. + Let `R` help you with the proper indenting. + Think carefully about names. -- * Check out the [Tidyverse Style Guide](https://style.tidyverse.org/index.html). --- ## Relational Data -- * Most organizations that have a fair amount of data don't store that data in a **single** table. + Think Google, Facebook, LinkedIn, ... -- * Instead, they use a **relational database**: collection of linkable tables that are linked together by **keys**. <img src="img/imdb_schema.png" width="60%" style="display: block; margin: auto;" /> -- * Typically use a SQL server for storing and managing that data. * We are going to learn to how to join relational data in `R` using `dplyr`. --- ### Data Joins Most common data joins are **mutating joins**: * `left_join()` * `inner_join()` * `right_join()` * Arguments: + Left-hand `data.frame()` + Right-hand `data.frame()` + Key matching -- When joining two `data.frame()`s together, ask yourself: * What are the **keys**? * Which rows do I want to keep from the left-hand `data.frame()`? Which rows do I want to keep from the right-hand `data.frame()`? * What dimensions should the resulting `data.frame()` have? --- ### Data Joins What do the following movies all have in common? ```r library(tidyverse) movies <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/harvard_movies.csv") movies ``` ``` ## # A tibble: 3 × 3 ## id title production_year ## <dbl> <chr> <dbl> ## 1 3870185 Legally Blonde 2001 ## 2 3905748 Love Story 1970 ## 3 4424677 The Social Network 2010 ``` --- Have two more `data.frame()`s: .pull-left[ ```r keywords_movies <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/keywords_movie.csv") keywords_movies ``` ``` ## # A tibble: 14 × 2 ## movie_id keyword_id ## <dbl> <dbl> ## 1 3870185 1865 ## 2 3870185 514 ## 3 3870185 31570 ## 4 3905748 3417 ## 5 3905748 514 ## 6 3905748 31570 ## 7 3905748 20986 ## 8 4424677 20887 ## 9 4424677 3417 ## 10 4424677 514 ## 11 4424677 31570 ## 12 4424677 22883 ## 13 4424677 6485 ## 14 4424677 20986 ``` ] .pull-right[ ```r keyword_ids <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/keywords_id.csv") keyword_ids ``` ``` ## # A tibble: 11 × 2 ## keyword_id keyword ## <dbl> <chr> ## 1 20887 algorithm ## 2 1865 blonde ## 3 3417 class-differences ## 4 514 college-student ## 5 27064 donkey ## 6 71175 fairy-tale-parody ## 7 31570 harvard ## 8 22883 harvard-university ## 9 6485 identical-twins ## 10 20986 private-club ## 11 13065 true-love ``` ] --- ### Let's try some joins ```r left_join(keywords_movies, keyword_ids) ``` ``` ## # A tibble: 14 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ``` --- ### Let's try some joins ```r left_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id")) ``` ``` ## # A tibble: 14 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ``` --- ### Let's try some joins ```r left_join(keyword_ids, keywords_movies, by = c("keyword_id" = "keyword_id")) ``` ``` ## # A tibble: 17 × 3 ## keyword_id keyword movie_id ## <dbl> <chr> <dbl> ## 1 20887 algorithm 4424677 ## 2 1865 blonde 3870185 ## 3 3417 class-differences 3905748 ## 4 3417 class-differences 4424677 ## 5 514 college-student 3870185 ## 6 514 college-student 3905748 ## 7 514 college-student 4424677 ## 8 27064 donkey NA ## 9 71175 fairy-tale-parody NA ## 10 31570 harvard 3870185 ## 11 31570 harvard 3905748 ## 12 31570 harvard 4424677 ## 13 22883 harvard-university 4424677 ## 14 6485 identical-twins 4424677 ## 15 20986 private-club 3905748 ## 16 20986 private-club 4424677 ## 17 13065 true-love NA ``` --- ### Let's try some joins ```r inner_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id")) ``` ``` ## # A tibble: 14 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ``` --- ### Let's try some joins ```r full_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id")) ``` ``` ## # A tibble: 17 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ## 15 NA 27064 donkey ## 16 NA 71175 fairy-tale-parody ## 17 NA 13065 true-love ``` Can you guess what the fourth movie is? --- ### Movie Keywords Keep the resulting `data.frame()` that has the keywords for our three Harvard movies: ```r keywords_h <- left_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id")) keywords_h ``` ``` ## # A tibble: 14 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ``` --- ### Movie Keywords What `---_join()` should we do to add the movie titles? ```r keywords_h ``` ``` ## # A tibble: 14 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 3870185 1865 blonde ## 2 3870185 514 college-student ## 3 3870185 31570 harvard ## 4 3905748 3417 class-differences ## 5 3905748 514 college-student ## 6 3905748 31570 harvard ## 7 3905748 20986 private-club ## 8 4424677 20887 algorithm ## 9 4424677 3417 class-differences ## 10 4424677 514 college-student ## 11 4424677 31570 harvard ## 12 4424677 22883 harvard-university ## 13 4424677 6485 identical-twins ## 14 4424677 20986 private-club ``` --- ### Movie Keywords What `---_join()` should we use to add the movie titles? ```r keywords_movies_h <- left_join(keywords_h, movies, by = c("movie_id" = "id")) keywords_movies_h ``` ``` ## # A tibble: 14 × 5 ## movie_id keyword_id keyword title production_year ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 3870185 1865 blonde Legally Blonde 2001 ## 2 3870185 514 college-student Legally Blonde 2001 ## 3 3870185 31570 harvard Legally Blonde 2001 ## 4 3905748 3417 class-differences Love Story 1970 ## 5 3905748 514 college-student Love Story 1970 ## 6 3905748 31570 harvard Love Story 1970 ## 7 3905748 20986 private-club Love Story 1970 ## 8 4424677 20887 algorithm The Social Network 2010 ## 9 4424677 3417 class-differences The Social Network 2010 ## 10 4424677 514 college-student The Social Network 2010 ## 11 4424677 31570 harvard The Social Network 2010 ## 12 4424677 22883 harvard-university The Social Network 2010 ## 13 4424677 6485 identical-twins The Social Network 2010 ## 14 4424677 20986 private-club The Social Network 2010 ``` --- ### Order Matters! * Focus on the keyword "harvard-university" ```r keyword_harvard <- filter(keywords_h, keyword == "harvard-university") keyword_harvard ``` ``` ## # A tibble: 1 × 3 ## movie_id keyword_id keyword ## <dbl> <dbl> <chr> ## 1 4424677 22883 harvard-university ``` --- ### Order Matters! ```r left_join(keyword_harvard, movies, by = c("movie_id" = "id")) ``` ``` ## # A tibble: 1 × 5 ## movie_id keyword_id keyword title production_year ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 4424677 22883 harvard-university The Social Network 2010 ``` ```r right_join(keyword_harvard, movies, by = c("movie_id" = "id")) ``` ``` ## # A tibble: 3 × 5 ## movie_id keyword_id keyword title production_year ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 4424677 22883 harvard-university The Social Network 2010 ## 2 3870185 NA <NA> Legally Blonde 2001 ## 3 3905748 NA <NA> Love Story 1970 ``` --- ## Data Wrangling in the `tidyverse` **Question**: Why do we need TWO different data wrangling packages in the `tidyverse`? -- **Question**: What does `dplyr` do? -- * Performs various data manipulation tasks, such as extracting and summarizing -- **Question**: So what is left for `tidyr` to do? * Reshape data into different formats (e.g. long and wide) --- ## Key Idea: Reshape Data to a Tidy Data Format > "Happy families are all alike; every unhappy family is unhappy in its own way." -- Leo Tolstoy -- > "Tidy datasets are all alike, but every messy dataset is messy in its own way." -- Hadley Wickham -- * By tidy, we don't mean neat. * Tidy data satisfy rules that make it easy to work with the data. --- ## Tidy Data Rules <img src="img/tidyRules.png" width="80%" style="display: block; margin: auto;" /> * Each column is a single variable. * Each row is a unique observation. * Each value must have its own cell. --- ## Tidy Data ```r keywords_movies_h ``` ``` ## # A tibble: 14 × 5 ## movie_id keyword_id keyword title production_year ## <dbl> <dbl> <chr> <chr> <dbl> ## 1 3870185 1865 blonde Legally Blonde 2001 ## 2 3870185 514 college-student Legally Blonde 2001 ## 3 3870185 31570 harvard Legally Blonde 2001 ## 4 3905748 3417 class-differences Love Story 1970 ## 5 3905748 514 college-student Love Story 1970 ## 6 3905748 31570 harvard Love Story 1970 ## 7 3905748 20986 private-club Love Story 1970 ## 8 4424677 20887 algorithm The Social Network 2010 ## 9 4424677 3417 class-differences The Social Network 2010 ## 10 4424677 514 college-student The Social Network 2010 ## 11 4424677 31570 harvard The Social Network 2010 ## 12 4424677 22883 harvard-university The Social Network 2010 ## 13 4424677 6485 identical-twins The Social Network 2010 ## 14 4424677 20986 private-club The Social Network 2010 ``` --- ## Un-Tidy Data * Let's look at a dataset of CDC health stats for PA, NY, and TX ``` ## # A tibble: 5 × 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` * Which tidy rule was broken? --- ## Tidy Version of the Data ``` ## # A tibble: 15 × 5 ## YearStart LocationAbbr DataValue Question DataValueType ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 NY 66.8 Mortality from heart failure Age-adjusted R… ## 2 2014 PA 87.7 Mortality from heart failure Age-adjusted R… ## 3 2014 TX 96.4 Mortality from heart failure Age-adjusted R… ## 4 2013 NY 67.8 Mortality from heart failure Age-adjusted R… ## 5 2013 PA 87.3 Mortality from heart failure Age-adjusted R… ## 6 2013 TX 95.6 Mortality from heart failure Age-adjusted R… ## 7 2012 NY 67.7 Mortality from heart failure Age-adjusted R… ## 8 2012 PA 83.2 Mortality from heart failure Age-adjusted R… ## 9 2012 TX 93.5 Mortality from heart failure Age-adjusted R… ## 10 2011 NY 69.9 Mortality from heart failure Age-adjusted R… ## 11 2011 PA 83.8 Mortality from heart failure Age-adjusted R… ## 12 2011 TX 92.1 Mortality from heart failure Age-adjusted R… ## 13 2010 NY 69.4 Mortality from heart failure Age-adjusted R… ## 14 2010 PA 83 Mortality from heart failure Age-adjusted R… ## 15 2010 TX 96.5 Mortality from heart failure Age-adjusted R… ``` --- ## Un-Tidy Data ``` ## # A tibble: 15 × 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` * Which tidy rule was broken? --- ## Tidy Data ``` ## # A tibble: 5 × 6 ## YearStart LocationAbbr Question Age-adjus…¹ Crude…² Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from heart failure 87.7 126. 16091 ## 2 2013 PA Mortality from heart failure 87.3 124. 15894 ## 3 2012 PA Mortality from heart failure 83.2 117. 14893 ## 4 2011 PA Mortality from heart failure 83.8 116. 14827 ## 5 2010 PA Mortality from heart failure 83 114. 14411 ## # … with abbreviated variable names ¹`Age-adjusted Rate`, ²`Crude Rate` ``` --- ## `tidyr` * `tidyr` is an R package which will allow us to switch a dataset from a wide to a long format and vice versa. ```r library(tidyverse) ``` -- * **Problem**: When column names are not variables but values of a variable. Each row represents multiple observations. + **Solution**: Use `pivot_longer()`. -- * **Problem**: When an observation is scattered across multiple rows. + **Solution**: Use `pivot_wider()`. -- * **Problem**: Two observations are in one cell. + **Solution**: Use `separate()`. --- ## `pivot_longer()` * What columns do we want to gather? + names_to = name of variable whose values are the column names + values_to = name of variable whose values are spread over the cells ```r CDC_wide ``` ``` ## # A tibble: 5 × 6 ## YearStart PA NY TX DataValueType Question ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2014 87.7 66.8 96.4 Age-adjusted Rate Mortality from heart failure ## 2 2013 87.3 67.8 95.6 Age-adjusted Rate Mortality from heart failure ## 3 2012 83.2 67.7 93.5 Age-adjusted Rate Mortality from heart failure ## 4 2011 83.8 69.9 92.1 Age-adjusted Rate Mortality from heart failure ## 5 2010 83 69.4 96.5 Age-adjusted Rate Mortality from heart failure ``` --- ## `pivot_longer()` ```r CDC_narrow_tidy <- pivot_longer(CDC_wide, cols = c(PA, NY, TX), names_to = "State", values_to = "MortalityRate") %>% select(YearStart, State, MortalityRate) CDC_narrow_tidy ``` ``` ## # A tibble: 15 × 3 ## YearStart State MortalityRate ## <dbl> <chr> <dbl> ## 1 2014 PA 87.7 ## 2 2014 NY 66.8 ## 3 2014 TX 96.4 ## 4 2013 PA 87.3 ## 5 2013 NY 67.8 ## 6 2013 TX 95.6 ## 7 2012 PA 83.2 ## 8 2012 NY 67.7 ## 9 2012 TX 93.5 ## 10 2011 PA 83.8 ## 11 2011 NY 69.9 ## 12 2011 TX 92.1 ## 13 2010 PA 83 ## 14 2010 NY 69.4 ## 15 2010 TX 96.5 ``` --- ## `pivot_wider()` * Which columns do we want to spread across multiple rows? + `names_from` = column that contains the variable names + `values_from` = column that contains the values of the variables ```r CDC_narrow_untidy ``` ``` ## # A tibble: 15 × 5 ## YearStart LocationAbbr DataValue DataValueType Question ## <dbl> <chr> <dbl> <chr> <chr> ## 1 2014 PA 87.7 Age-adjusted Rate Mortality from heart fail… ## 2 2014 PA 126. Crude Rate Mortality from heart fail… ## 3 2014 PA 16091 Number Mortality from heart fail… ## 4 2013 PA 87.3 Age-adjusted Rate Mortality from heart fail… ## 5 2013 PA 124. Crude Rate Mortality from heart fail… ## 6 2013 PA 15894 Number Mortality from heart fail… ## 7 2012 PA 83.2 Age-adjusted Rate Mortality from heart fail… ## 8 2012 PA 117. Crude Rate Mortality from heart fail… ## 9 2012 PA 14893 Number Mortality from heart fail… ## 10 2011 PA 83.8 Age-adjusted Rate Mortality from heart fail… ## 11 2011 PA 116. Crude Rate Mortality from heart fail… ## 12 2011 PA 14827 Number Mortality from heart fail… ## 13 2010 PA 83 Age-adjusted Rate Mortality from heart fail… ## 14 2010 PA 114. Crude Rate Mortality from heart fail… ## 15 2010 PA 14411 Number Mortality from heart fail… ``` --- ## `pivot_wider()` ```r CDC_wide_tidy <- pivot_wider(CDC_narrow_untidy, names_from = DataValueType, values_from = DataValue) CDC_wide_tidy ``` ``` ## # A tibble: 5 × 6 ## YearStart LocationAbbr Question Age-adjus…¹ Crude…² Number ## <dbl> <chr> <chr> <dbl> <dbl> <dbl> ## 1 2014 PA Mortality from heart failure 87.7 126. 16091 ## 2 2013 PA Mortality from heart failure 87.3 124. 15894 ## 3 2012 PA Mortality from heart failure 83.2 117. 14893 ## 4 2011 PA Mortality from heart failure 83.8 116. 14827 ## 5 2010 PA Mortality from heart failure 83 114. 14411 ## # … with abbreviated variable names ¹`Age-adjusted Rate`, ²`Crude Rate` ``` --- ## `separate()` * Separate one column into multiple columns. * What column might we want to separate into two? ```r read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% glimpse() ``` ``` ## Rows: 74,811 ## Columns: 5 ## $ GeoLocation <chr> NA, "(32.84057112200048, -86.63186076199969)", "(64.84507… ## $ YearStart <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201… ## $ LocationAbbr <chr> "US", "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC… ## $ DataValue <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, 19.0, 16.7, 17.0, 25.… ## $ Question <chr> "Binge drinking prevalence among adults aged >= 18 years"… ``` --- ## `separate()` * Separate one column into multiple columns. + `col` = name of column to separate + `into` = character vector with names of new columns + `sep` = character to split on ```r CDC_sep <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") CDC_sep ``` ``` ## # A tibble: 74,811 × 6 ## lat long YearStart Location…¹ DataV…² Quest…³ ## <chr> <chr> <dbl> <chr> <dbl> <chr> ## 1 <NA> <NA> 2016 US 16.9 Binge … ## 2 (32.84057112200048 -86.63186076199969) 2016 AL 13 Binge … ## 3 (64.84507995700051 -147.72205903599973) 2016 AK 18.2 Binge … ## 4 (34.865970280000454 -111.76381127699972) 2016 AZ 15.6 Binge … ## 5 (34.74865012400045 -92.27449074299966) 2016 AR 15 Binge … ## 6 (37.63864012300047 -120.99999953799971) 2016 CA 16.3 Binge … ## 7 (38.843840757000464 -106.13361092099967) 2016 CO 19 Binge … ## 8 (41.56266102000046 -72.64984095199964) 2016 CT 16.7 Binge … ## 9 (39.008830667000495 -75.57774116799965) 2016 DE 17 Binge … ## 10 (38.907192 -77.036871) 2016 DC 25.6 Binge … ## # … with 74,801 more rows, and abbreviated variable names ¹LocationAbbr, ## # ²DataValue, ³Question ``` --- ## `separate()` ```r library(stringr) CDC_sep <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% select(GeoLocation, YearStart, LocationAbbr, DataValue, Question) %>% separate(col = GeoLocation, into = c("lat", "long"), sep = ", ") %>% mutate(lat = parse_number(lat), long = parse_number(long)) ``` --- ## Another Example: Live Coding * Also a chance to see some additional `dplyr` functions. ```r library(readxl) url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx" destfile <- "table_01_33_102020.xlsx" curl::curl_download(url, destfile) Amtrak <- read_excel(destfile) Amtrak ``` ``` ## # A tibble: 20 × 35 ## Table 1-…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 <NA> 1972 1975 1980 1985 1990 1991 1992 1993 1994 1995 ## 2 Locomotiv… <NA> NA NA NA NA NA NA NA NA NA ## 3 Percent a… U 87 83 93 84 86 83 84 85 88 ## 4 Average a… 22.3 14.4 7.4 7 12 13 13 13.2 13.4 13.9 ## 5 Passenger… <NA> NA NA NA NA NA NA NA NA NA ## 6 Percent a… U 82 77 90 90 92 90 89 88 90 ## 7 Average a… 22 24.7 14.3 14.2 20 21 21.5 22.6 22.4 21.8 ## 8 KEY: U =… <NA> NA NA NA NA NA NA NA NA NA ## 9 <NA> <NA> NA NA NA NA NA NA NA NA NA ## 10 a Year-en… <NA> NA NA NA NA NA NA NA NA NA ## 11 b Fiscal … <NA> NA NA NA NA NA NA NA NA NA ## 12 <NA> <NA> NA NA NA NA NA NA NA NA NA ## 13 NOTES <NA> NA NA NA NA NA NA NA NA NA ## 14 1972 was … <NA> NA NA NA NA NA NA NA NA NA ## 15 Roadraile… <NA> NA NA NA NA NA NA NA NA NA ## 16 <NA> <NA> NA NA NA NA NA NA NA NA NA ## 17 SOURCES <NA> NA NA NA NA NA NA NA NA NA ## 18 1972-80: … <NA> NA NA NA NA NA NA NA NA NA ## 19 1985-2000… <NA> NA NA NA NA NA NA NA NA NA ## 20 2001-19: … <NA> NA NA NA NA NA NA NA NA NA ## # … with 24 more variables: ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ...15 <dbl>, ## # ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>, ...20 <dbl>, ## # ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>, ...25 <dbl>, ## # ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>, ...30 <dbl>, ## # ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>, ...35 <chr>, and ## # abbreviated variable name ## # ¹`Table 1-33: Age and Availability of Amtrak Locomotive and Car Fleets` ```