Data Joins and Reshaping

background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center,

.pull-right[

<br>

## .base_color[Data Joins and]

## .base_color[Reshaping]
 
<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 4 | Spring 2023]

]

---

## Announcements

* No lecture on Monday -- University Holiday.
* Some Monday Office Hours will also be cancelled.  Make sure to check [the schedule](https://docs.google.com/spreadsheets/d/1HqEmr4tEtFPWRrF5TJHd1VgtVD030w6aAYySoFTnhBw/edit?usp=sharing).

************************

## Week 3 Goals

.pull-left[

**Mon Lecture**

* GitHub + RStudio Projects workflow

* Exploring `R` Objects

]

.pull-right[

**Wed Lecture**

* Data Joins
* Data reshaping with `tidyr`
* Coding Style

]

---

class: middle, center

##  So what about **generic vectors/lists**?

#### Recall they are heterogeneous collections of any type of `R` objects.

---

### Lists

* Think of these as the most general way to store things.

```r
groceries <- list()
groceries$whole_foods <- c("apples", "chocolate", "kale", "garlic")
groceries$star <- c("vinegar", "soap")
groceries$lizzys <- c("french vanilla", "black cherry", "mint chip")
groceries$budget <- data.frame(stores = c("whole_foods", "star", "lizzys"), 
                               fund = c(100, 25, 200))
class(groceries)
```

```
## [1] "list"
```

---

### Lists

* Notice the nested structure

```r
groceries
```

```
## $whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"   
## 
## $star
## [1] "vinegar" "soap"   
## 
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"     
## 
## $budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
```

---

```r
holidays <- c("Valentine's", "President's")
feb <- list(groceries  = groceries, holidays = holidays)
feb
```

```
## $groceries
## $groceries$whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"   
## 
## $groceries$star
## [1] "vinegar" "soap"   
## 
## $groceries$lizzys
## [1] "french vanilla" "black cherry"   "mint chip"     
## 
## $groceries$budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
## 
## 
## $holidays
## [1] "Valentine's" "President's"
```

---

### Indexing lists

* Distinguishing between `[ ]` and `[[]]`

```r
thing1 <- feb$groceries[3]
thing1
```

```
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing1)
```

```
## [1] "list"
```

```r
thing2 <- feb$groceries[[3]]
thing2
```

```
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing2)
```

```
## [1] "character"
```

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x` is a list: the pepper shaker containing packets of pepper.

<img src="img/list.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[1]` is a pepper shaker containing the first packet of pepper.

<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[2]` is what?

--

<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]]` is what?

--

<img src="img/innerobject.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]][[1]]` is what?

--

<img src="img/innerobject2.png" width="30%" style="display: block; margin: auto;" />

---

## Data Frames

Let's relate `list()`s to our favorite R object: `data.frame()`s!

* `data.frame()`s are `list()`s.
* Each variable of a `data.frame`() is a `vector()`.
* The `vector()`s all have the same length but not necessary the same class.

---

## Data Frames

```r
dat <- data.frame(x = c(1, 2, 4), y = c("cat", "llama", "pig"))
dat
```

```
##   x     y
## 1 1   cat
## 2 2 llama
## 3 4   pig
```

```r
dat
```

```
##   x     y
## 1 1   cat
## 2 2 llama
## 3 4   pig
```

```r
dat$x
```

```
## [1] 1 2 4
```

---

## Data Frames

```r
str(dat[1])
```

```
## 'data.frame':	3 obs. of  1 variable:
##  $ x: num  1 2 4
```

```r
str(dat[[1]])
```

```
##  num [1:3] 1 2 4
```

```r
dat[1, 2]
```

```
## [1] "cat"
```

```r
dat[1, ]
```

```
##   x   y
## 1 1 cat
```

---

## R Data Objects

* We can create and interact with (atomic) **vectors** and **lists** (generic vectors).

--

* When writing and debugging code, it is a good idea to check the `class()` of an object.

--

* R will sometimes change the type of an object without telling us (but based on our actions).

--

* **Vectorization** makes `R` code speedy but also can cause you to do something you didn't mean to do.

--

* While we will primarily interact with `vector()`s and `data.frame()`s, we will sometimes need `list()`s.
    + Including in our interactive dashboarding.

---

## Coding Style

> "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- The Tidyverse Style Guide

.pull-left[

```r
messy4324 <-select(dataset,thing1,thing2)%>%filter(thing1=="uncool")
```

]

.pull-right[

```r
clean <- select(dataset, thing1, thing2) %>%
  filter(thing1 == "uncool")
```

]

---

## Coding Style

* Part of writing **easily** reproducible code.

--

* Lots of reasonable coding styles.

--

* Important piece is **consistency** across all coders on a project.

--

* My opinions:
    + Use spaces liberally.
    + Let `R` help you with the proper indenting.
    + Think carefully about names.

--

* Check out the [Tidyverse Style Guide](https://style.tidyverse.org/index.html).

---

## Relational Data

--

* Most organizations that have a fair amount of data don't store that data in a **single** table.
    + Think Google, Facebook, LinkedIn, ...

--

* Instead, they use a **relational database**: collection of linkable tables that are linked together by **keys**.

<img src="img/imdb_schema.png" width="60%" style="display: block; margin: auto;" />

--

* Typically use a SQL server for storing and managing that data.

* We are going to learn to how to join relational data in `R` using `dplyr`.

---

### Data Joins

Most common data joins are **mutating joins**:

* `left_join()`
* `inner_join()`
* `right_join()`

* Arguments:
    + Left-hand `data.frame()`
    + Right-hand `data.frame()`
    + Key matching

--

When joining two `data.frame()`s together, ask yourself:

* What are the **keys**?
* Which rows do I want to keep from the left-hand `data.frame()`?  Which rows do I want to keep from the right-hand `data.frame()`?
* What dimensions should the resulting `data.frame()` have?

---

### Data Joins

What do the following movies all have in common?

```r
library(tidyverse)
movies <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/harvard_movies.csv")
movies
```

```
## # A tibble: 3 × 3
##        id title              production_year
##     <dbl> <chr>                        <dbl>
## 1 3870185 Legally Blonde                2001
## 2 3905748 Love Story                    1970
## 3 4424677 The Social Network            2010
```

---

Have two more `data.frame()`s:

.pull-left[

```r
keywords_movies <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/keywords_movie.csv")
keywords_movies
```

```
## # A tibble: 14 × 2
##    movie_id keyword_id
##       <dbl>      <dbl>
##  1  3870185       1865
##  2  3870185        514
##  3  3870185      31570
##  4  3905748       3417
##  5  3905748        514
##  6  3905748      31570
##  7  3905748      20986
##  8  4424677      20887
##  9  4424677       3417
## 10  4424677        514
## 11  4424677      31570
## 12  4424677      22883
## 13  4424677       6485
## 14  4424677      20986
```

]

.pull-right[

```r
keyword_ids <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/keywords_id.csv")
keyword_ids
```

```
## # A tibble: 11 × 2
##    keyword_id keyword           
##         <dbl> <chr>             
##  1      20887 algorithm         
##  2       1865 blonde            
##  3       3417 class-differences 
##  4        514 college-student   
##  5      27064 donkey            
##  6      71175 fairy-tale-parody 
##  7      31570 harvard           
##  8      22883 harvard-university
##  9       6485 identical-twins   
## 10      20986 private-club      
## 11      13065 true-love
```

]

---

### Let's try some joins

```r
left_join(keywords_movies, keyword_ids)
```

```
## # A tibble: 14 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club
```

---

### Let's try some joins

```r
left_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id"))
```

```
## # A tibble: 14 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club
```

---

### Let's try some joins

```r
left_join(keyword_ids, keywords_movies, by = c("keyword_id" = "keyword_id"))
```

```
## # A tibble: 17 × 3
##    keyword_id keyword            movie_id
##         <dbl> <chr>                 <dbl>
##  1      20887 algorithm           4424677
##  2       1865 blonde              3870185
##  3       3417 class-differences   3905748
##  4       3417 class-differences   4424677
##  5        514 college-student     3870185
##  6        514 college-student     3905748
##  7        514 college-student     4424677
##  8      27064 donkey                   NA
##  9      71175 fairy-tale-parody        NA
## 10      31570 harvard             3870185
## 11      31570 harvard             3905748
## 12      31570 harvard             4424677
## 13      22883 harvard-university  4424677
## 14       6485 identical-twins     4424677
## 15      20986 private-club        3905748
## 16      20986 private-club        4424677
## 17      13065 true-love                NA
```

---

### Let's try some joins

```r
inner_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id"))
```

```
## # A tibble: 14 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club
```

---

### Let's try some joins

```r
full_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id"))
```

```
## # A tibble: 17 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club      
## 15       NA      27064 donkey            
## 16       NA      71175 fairy-tale-parody 
## 17       NA      13065 true-love
```

Can you guess what the fourth movie is?

---

### Movie Keywords

Keep the resulting `data.frame()` that has the keywords for our three Harvard movies:

```r
keywords_h <- left_join(keywords_movies, keyword_ids, by = c("keyword_id" = "keyword_id"))
keywords_h
```

```
## # A tibble: 14 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club
```

---

### Movie Keywords

What `---_join()` should we do to add the movie titles?

```r
keywords_h
```

```
## # A tibble: 14 × 3
##    movie_id keyword_id keyword           
##       <dbl>      <dbl> <chr>             
##  1  3870185       1865 blonde            
##  2  3870185        514 college-student   
##  3  3870185      31570 harvard           
##  4  3905748       3417 class-differences 
##  5  3905748        514 college-student   
##  6  3905748      31570 harvard           
##  7  3905748      20986 private-club      
##  8  4424677      20887 algorithm         
##  9  4424677       3417 class-differences 
## 10  4424677        514 college-student   
## 11  4424677      31570 harvard           
## 12  4424677      22883 harvard-university
## 13  4424677       6485 identical-twins   
## 14  4424677      20986 private-club
```

---

### Movie Keywords

What `---_join()` should we use to add the movie titles?

```r
keywords_movies_h <- left_join(keywords_h, movies, by = c("movie_id" = "id"))
keywords_movies_h
```

```
## # A tibble: 14 × 5
##    movie_id keyword_id keyword            title              production_year
##       <dbl>      <dbl> <chr>              <chr>                        <dbl>
##  1  3870185       1865 blonde             Legally Blonde                2001
##  2  3870185        514 college-student    Legally Blonde                2001
##  3  3870185      31570 harvard            Legally Blonde                2001
##  4  3905748       3417 class-differences  Love Story                    1970
##  5  3905748        514 college-student    Love Story                    1970
##  6  3905748      31570 harvard            Love Story                    1970
##  7  3905748      20986 private-club       Love Story                    1970
##  8  4424677      20887 algorithm          The Social Network            2010
##  9  4424677       3417 class-differences  The Social Network            2010
## 10  4424677        514 college-student    The Social Network            2010
## 11  4424677      31570 harvard            The Social Network            2010
## 12  4424677      22883 harvard-university The Social Network            2010
## 13  4424677       6485 identical-twins    The Social Network            2010
## 14  4424677      20986 private-club       The Social Network            2010
```

---

### Order Matters!

* Focus on the keyword "harvard-university"

```r
keyword_harvard <- filter(keywords_h, keyword == "harvard-university")
keyword_harvard
```

```
## # A tibble: 1 × 3
##   movie_id keyword_id keyword           
##      <dbl>      <dbl> <chr>             
## 1  4424677      22883 harvard-university
```

---

### Order Matters!

```r
left_join(keyword_harvard, movies, by = c("movie_id" = "id"))
```

```
## # A tibble: 1 × 5
##   movie_id keyword_id keyword            title              production_year
##      <dbl>      <dbl> <chr>              <chr>                        <dbl>
## 1  4424677      22883 harvard-university The Social Network            2010
```

```r
right_join(keyword_harvard, movies, by = c("movie_id" = "id"))
```

```
## # A tibble: 3 × 5
##   movie_id keyword_id keyword            title              production_year
##      <dbl>      <dbl> <chr>              <chr>                        <dbl>
## 1  4424677      22883 harvard-university The Social Network            2010
## 2  3870185         NA <NA>               Legally Blonde                2001
## 3  3905748         NA <NA>               Love Story                    1970
```

---

## Data Wrangling in the `tidyverse`

**Question**: Why do we need TWO different data wrangling packages in the `tidyverse`?

--

**Question**: What does `dplyr` do?

--

* Performs various data manipulation tasks, such as extracting and summarizing

--

**Question**: So what is left for `tidyr` to do?

* Reshape data into different formats (e.g. long and wide)

---

## Key Idea: Reshape Data to a Tidy Data Format

> "Happy families are all alike; every unhappy family is unhappy in its own way." -- Leo Tolstoy

--

> "Tidy datasets are all alike, but every messy dataset is messy in its own way." -- Hadley Wickham

--

* By tidy, we don't mean neat.  
* Tidy data satisfy rules that make it easy to work with the data.

---

## Tidy Data Rules

<img src="img/tidyRules.png" width="80%" style="display: block; margin: auto;" />

* Each column is a single variable.

* Each row is a unique observation.

* Each value must have its own cell.

---

## Tidy Data

```r
keywords_movies_h
```

```
## # A tibble: 14 × 5
##    movie_id keyword_id keyword            title              production_year
##       <dbl>      <dbl> <chr>              <chr>                        <dbl>
##  1  3870185       1865 blonde             Legally Blonde                2001
##  2  3870185        514 college-student    Legally Blonde                2001
##  3  3870185      31570 harvard            Legally Blonde                2001
##  4  3905748       3417 class-differences  Love Story                    1970
##  5  3905748        514 college-student    Love Story                    1970
##  6  3905748      31570 harvard            Love Story                    1970
##  7  3905748      20986 private-club       Love Story                    1970
##  8  4424677      20887 algorithm          The Social Network            2010
##  9  4424677       3417 class-differences  The Social Network            2010
## 10  4424677        514 college-student    The Social Network            2010
## 11  4424677      31570 harvard            The Social Network            2010
## 12  4424677      22883 harvard-university The Social Network            2010
## 13  4424677       6485 identical-twins    The Social Network            2010
## 14  4424677      20986 private-club       The Social Network            2010
```

---

## Un-Tidy Data

* Let's look at a dataset of CDC health stats for PA, NY, and TX

```
## # A tibble: 5 × 6
##   YearStart    PA    NY    TX DataValueType     Question                    
##       <dbl> <dbl> <dbl> <dbl> <chr>             <chr>                       
## 1      2014  87.7  66.8  96.4 Age-adjusted Rate Mortality from heart failure
## 2      2013  87.3  67.8  95.6 Age-adjusted Rate Mortality from heart failure
## 3      2012  83.2  67.7  93.5 Age-adjusted Rate Mortality from heart failure
## 4      2011  83.8  69.9  92.1 Age-adjusted Rate Mortality from heart failure
## 5      2010  83    69.4  96.5 Age-adjusted Rate Mortality from heart failure
```

* Which tidy rule was broken?

---

## Tidy Version of the Data

```
## # A tibble: 15 × 5
##    YearStart LocationAbbr DataValue Question                     DataValueType  
##        <dbl> <chr>            <dbl> <chr>                        <chr>          
##  1      2014 NY                66.8 Mortality from heart failure Age-adjusted R…
##  2      2014 PA                87.7 Mortality from heart failure Age-adjusted R…
##  3      2014 TX                96.4 Mortality from heart failure Age-adjusted R…
##  4      2013 NY                67.8 Mortality from heart failure Age-adjusted R…
##  5      2013 PA                87.3 Mortality from heart failure Age-adjusted R…
##  6      2013 TX                95.6 Mortality from heart failure Age-adjusted R…
##  7      2012 NY                67.7 Mortality from heart failure Age-adjusted R…
##  8      2012 PA                83.2 Mortality from heart failure Age-adjusted R…
##  9      2012 TX                93.5 Mortality from heart failure Age-adjusted R…
## 10      2011 NY                69.9 Mortality from heart failure Age-adjusted R…
## 11      2011 PA                83.8 Mortality from heart failure Age-adjusted R…
## 12      2011 TX                92.1 Mortality from heart failure Age-adjusted R…
## 13      2010 NY                69.4 Mortality from heart failure Age-adjusted R…
## 14      2010 PA                83   Mortality from heart failure Age-adjusted R…
## 15      2010 TX                96.5 Mortality from heart failure Age-adjusted R…
```

---

## Un-Tidy Data

```
## # A tibble: 15 × 5
##    YearStart LocationAbbr DataValue DataValueType     Question                  
##        <dbl> <chr>            <dbl> <chr>             <chr>                     
##  1      2014 PA                87.7 Age-adjusted Rate Mortality from heart fail…
##  2      2014 PA               126.  Crude Rate        Mortality from heart fail…
##  3      2014 PA             16091   Number            Mortality from heart fail…
##  4      2013 PA                87.3 Age-adjusted Rate Mortality from heart fail…
##  5      2013 PA               124.  Crude Rate        Mortality from heart fail…
##  6      2013 PA             15894   Number            Mortality from heart fail…
##  7      2012 PA                83.2 Age-adjusted Rate Mortality from heart fail…
##  8      2012 PA               117.  Crude Rate        Mortality from heart fail…
##  9      2012 PA             14893   Number            Mortality from heart fail…
## 10      2011 PA                83.8 Age-adjusted Rate Mortality from heart fail…
## 11      2011 PA               116.  Crude Rate        Mortality from heart fail…
## 12      2011 PA             14827   Number            Mortality from heart fail…
## 13      2010 PA                83   Age-adjusted Rate Mortality from heart fail…
## 14      2010 PA               114.  Crude Rate        Mortality from heart fail…
## 15      2010 PA             14411   Number            Mortality from heart fail…
```

* Which tidy rule was broken?

---

## Tidy Data

```
## # A tibble: 5 × 6
##   YearStart LocationAbbr Question                     Age-adjus…¹ Crude…² Number
##       <dbl> <chr>        <chr>                              <dbl>   <dbl>  <dbl>
## 1      2014 PA           Mortality from heart failure        87.7    126.  16091
## 2      2013 PA           Mortality from heart failure        87.3    124.  15894
## 3      2012 PA           Mortality from heart failure        83.2    117.  14893
## 4      2011 PA           Mortality from heart failure        83.8    116.  14827
## 5      2010 PA           Mortality from heart failure        83      114.  14411
## # … with abbreviated variable names ¹`Age-adjusted Rate`, ²`Crude Rate`
```

---

## `tidyr`

* `tidyr` is an R package which will allow us to switch a dataset from a wide to a long format and vice versa.

```r
library(tidyverse)
```

--

* **Problem**:  When column names are not variables but values of a variable.  Each row represents multiple observations.
    + **Solution**: Use `pivot_longer()`.

--

* **Problem**: When an observation is scattered across multiple rows.
    + **Solution**: Use `pivot_wider()`.

--
    
* **Problem**: Two observations are in one cell.
    + **Solution**: Use `separate()`.

---

## `pivot_longer()`

* What columns do we want to gather?
    + names_to = name of variable whose values are the column names
    + values_to = name of variable whose values are spread over the cells

```r
CDC_wide
```

```
## # A tibble: 5 × 6
##   YearStart    PA    NY    TX DataValueType     Question                    
##       <dbl> <dbl> <dbl> <dbl> <chr>             <chr>                       
## 1      2014  87.7  66.8  96.4 Age-adjusted Rate Mortality from heart failure
## 2      2013  87.3  67.8  95.6 Age-adjusted Rate Mortality from heart failure
## 3      2012  83.2  67.7  93.5 Age-adjusted Rate Mortality from heart failure
## 4      2011  83.8  69.9  92.1 Age-adjusted Rate Mortality from heart failure
## 5      2010  83    69.4  96.5 Age-adjusted Rate Mortality from heart failure
```

---

## `pivot_longer()`

```r
CDC_narrow_tidy <- pivot_longer(CDC_wide, cols = c(PA, NY, TX), 
                                names_to = "State", 
                                values_to = "MortalityRate") %>%
  select(YearStart, State, MortalityRate)
CDC_narrow_tidy
```

```
## # A tibble: 15 × 3
##    YearStart State MortalityRate
##        <dbl> <chr>         <dbl>
##  1      2014 PA             87.7
##  2      2014 NY             66.8
##  3      2014 TX             96.4
##  4      2013 PA             87.3
##  5      2013 NY             67.8
##  6      2013 TX             95.6
##  7      2012 PA             83.2
##  8      2012 NY             67.7
##  9      2012 TX             93.5
## 10      2011 PA             83.8
## 11      2011 NY             69.9
## 12      2011 TX             92.1
## 13      2010 PA             83  
## 14      2010 NY             69.4
## 15      2010 TX             96.5
```

---

## `pivot_wider()`

* Which columns do we want to spread across multiple rows?
    + `names_from` = column that contains the variable names
    + `values_from` = column that contains the values of the variables

```r
CDC_narrow_untidy 
```

```
## # A tibble: 15 × 5
##    YearStart LocationAbbr DataValue DataValueType     Question                  
##        <dbl> <chr>            <dbl> <chr>             <chr>                     
##  1      2014 PA                87.7 Age-adjusted Rate Mortality from heart fail…
##  2      2014 PA               126.  Crude Rate        Mortality from heart fail…
##  3      2014 PA             16091   Number            Mortality from heart fail…
##  4      2013 PA                87.3 Age-adjusted Rate Mortality from heart fail…
##  5      2013 PA               124.  Crude Rate        Mortality from heart fail…
##  6      2013 PA             15894   Number            Mortality from heart fail…
##  7      2012 PA                83.2 Age-adjusted Rate Mortality from heart fail…
##  8      2012 PA               117.  Crude Rate        Mortality from heart fail…
##  9      2012 PA             14893   Number            Mortality from heart fail…
## 10      2011 PA                83.8 Age-adjusted Rate Mortality from heart fail…
## 11      2011 PA               116.  Crude Rate        Mortality from heart fail…
## 12      2011 PA             14827   Number            Mortality from heart fail…
## 13      2010 PA                83   Age-adjusted Rate Mortality from heart fail…
## 14      2010 PA               114.  Crude Rate        Mortality from heart fail…
## 15      2010 PA             14411   Number            Mortality from heart fail…
```

---

## `pivot_wider()`

```r
CDC_wide_tidy <- pivot_wider(CDC_narrow_untidy, 
                             names_from = DataValueType,
                             values_from = DataValue)
CDC_wide_tidy
```

```
## # A tibble: 5 × 6
##   YearStart LocationAbbr Question                     Age-adjus…¹ Crude…² Number
##       <dbl> <chr>        <chr>                              <dbl>   <dbl>  <dbl>
## 1      2014 PA           Mortality from heart failure        87.7    126.  16091
## 2      2013 PA           Mortality from heart failure        87.3    124.  15894
## 3      2012 PA           Mortality from heart failure        83.2    117.  14893
## 4      2011 PA           Mortality from heart failure        83.8    116.  14827
## 5      2010 PA           Mortality from heart failure        83      114.  14411
## # … with abbreviated variable names ¹`Age-adjusted Rate`, ²`Crude Rate`
```

---

## `separate()`

* Separate one column into multiple columns.

* What column might we want to separate into two?

```r
read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% 
  select(GeoLocation, YearStart, LocationAbbr, DataValue, 
         Question) %>%
  glimpse()
```

```
## Rows: 74,811
## Columns: 5
## $ GeoLocation  <chr> NA, "(32.84057112200048, -86.63186076199969)", "(64.84507…
## $ YearStart    <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 201…
## $ LocationAbbr <chr> "US", "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC…
## $ DataValue    <dbl> 16.9, 13.0, 18.2, 15.6, 15.0, 16.3, 19.0, 16.7, 17.0, 25.…
## $ Question     <chr> "Binge drinking prevalence among adults aged >= 18 years"…
```

---

## `separate()`

* Separate one column into multiple columns.
    + `col` = name of column to separate
    + `into` = character vector with names of new columns
    + `sep` = character to split on

```r
CDC_sep <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% 
  select(GeoLocation, YearStart, LocationAbbr, DataValue, 
         Question) %>%
  separate(col = GeoLocation, into = c("lat", "long"), 
           sep = ", ")
CDC_sep
```

```
## # A tibble: 74,811 × 6
##    lat                 long                 YearStart Location…¹ DataV…² Quest…³
##    <chr>               <chr>                    <dbl> <chr>        <dbl> <chr>  
##  1 <NA>                <NA>                      2016 US            16.9 Binge …
##  2 (32.84057112200048  -86.63186076199969)       2016 AL            13   Binge …
##  3 (64.84507995700051  -147.72205903599973)      2016 AK            18.2 Binge …
##  4 (34.865970280000454 -111.76381127699972)      2016 AZ            15.6 Binge …
##  5 (34.74865012400045  -92.27449074299966)       2016 AR            15   Binge …
##  6 (37.63864012300047  -120.99999953799971)      2016 CA            16.3 Binge …
##  7 (38.843840757000464 -106.13361092099967)      2016 CO            19   Binge …
##  8 (41.56266102000046  -72.64984095199964)       2016 CT            16.7 Binge …
##  9 (39.008830667000495 -75.57774116799965)       2016 DE            17   Binge …
## 10 (38.907192          -77.036871)               2016 DC            25.6 Binge …
## # … with 74,801 more rows, and abbreviated variable names ¹LocationAbbr,
## #   ²DataValue, ³Question
```

---

## `separate()`

```r
library(stringr)
CDC_sep <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/CDC2.csv") %>% 
  select(GeoLocation, YearStart, LocationAbbr, DataValue,
         Question) %>%
  separate(col = GeoLocation, into = c("lat", "long"),
           sep = ", ") %>%
  mutate(lat = parse_number(lat), 
         long = parse_number(long))
```

---

## Another Example: Live Coding

* Also a chance to see some additional `dplyr` functions.

```r
library(readxl)
url <- "https://www.bts.gov/sites/bts.dot.gov/files/table_01_33_102020.xlsx"
destfile <- "table_01_33_102020.xlsx"
curl::curl_download(url, destfile)
Amtrak <- read_excel(destfile)
Amtrak
```

```
## # A tibble: 20 × 35
##    Table 1-…¹ ...2    ...3   ...4   ...5  ...6  ...7   ...8   ...9  ...10  ...11
##    <chr>      <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 <NA>       1972  1975   1980   1985    1990  1991 1992   1993   1994   1995  
##  2 Locomotiv… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
##  3 Percent a… U       87     83     93      84    86   83     84     85     88  
##  4 Average a… 22.3    14.4    7.4    7      12    13   13     13.2   13.4   13.9
##  5 Passenger… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
##  6 Percent a… U       82     77     90      90    92   90     89     88     90  
##  7 Average a… 22      24.7   14.3   14.2    20    21   21.5   22.6   22.4   21.8
##  8 KEY:  U =… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
##  9 <NA>       <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 10 a Year-en… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 11 b Fiscal … <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 12 <NA>       <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 13 NOTES      <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 14 1972 was … <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 15 Roadraile… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 16 <NA>       <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 17 SOURCES    <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 18 1972-80: … <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 19 1985-2000… <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## 20 2001-19: … <NA>    NA     NA     NA      NA    NA   NA     NA     NA     NA  
## # … with 24 more variables: ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ...15 <dbl>,
## #   ...16 <dbl>, ...17 <chr>, ...18 <chr>, ...19 <dbl>, ...20 <dbl>,
## #   ...21 <dbl>, ...22 <dbl>, ...23 <dbl>, ...24 <dbl>, ...25 <dbl>,
## #   ...26 <dbl>, ...27 <dbl>, ...28 <dbl>, ...29 <dbl>, ...30 <dbl>,
## #   ...31 <dbl>, ...32 <chr>, ...33 <chr>, ...34 <chr>, ...35 <chr>, and
## #   abbreviated variable name
## #   ¹`Table 1-33:  Age and Availability of Amtrak Locomotive and Car Fleets`
```