Data Summarization and Joins

background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

## .base-blue[Data Summarization]

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 4 | Fall 2022]

]

---

## Announcements

* Don't forget that P-Set 2 is due tomorrow by 5pm on Gradescope.  
    + Come by office hours with questions!

****************************

## Goals for Today

* Consider measures for **summarizing** quantitative data
    + Center
    + Spread/variability

* Consider measures for **summarizing** categorical data

]

* Define **data wrangling**

* Learn data **joins**

]

---

## Load Necessary Packages

`dplyr` is part of this collection of data science packages.

```r
# Load necessary packages
library(tidyverse)
```

---

## Import the [Data](https://data.cambridgema.gov/Transportation-Planning/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg)

```r
bike_counter <- read_csv("https://data.cambridgema.gov/api/views/q8v9-mcfg/rows.csv")

# Inspect the data
glimpse(bike_counter)
```

```
## Rows: 245,890
## Columns: 7
## $ DateTime  <chr> "06/24/2015 12:00:00 AM", "06/24/2015 12:15:00 AM", "06/24/2…
## $ Day       <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Wednesd…
## $ Date      <chr> "06/24/2015", "06/24/2015", "06/24/2015", "06/24/2015", "06/…
## $ Time      <time> 00:00:00, 00:15:00, 00:30:00, 00:45:00, 01:00:00, 01:15:00,…
## $ Total     <dbl> 4, 3, 4, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Westbound <dbl> 1, 3, 3, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Eastbound <dbl> 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
```

---

## Summarizing Data

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
</tbody>
</table>

* Hard to do by eyeballing a spreadsheet with many rows!

---

##  Summarizing Data Visually

]

For a quantitative variable, want to answer:

* What is an **average** value?

* What is the **trend/shape** of the variable?

* How much **variation** is there from case to case?

]

---

## Summarizing Quantitative Variables

For a quantitative variable, want to answer:

* What is an average value?

* What is the trend/shape of the variable?

* How much variation is there from case to case?

Need to learn some **summary statistics**: Numerical values computed based on the observed cases.

---

## Measures of Center

* `$n$` = Number of cases (sample size)
* `$x_i$` = value of the i-th observation
* Denote by `$\bar{x}$`

$$
\bar{x}  = \frac{1}{n} \sum_{i = 1}^n x_i
$$

]

]

```r
# Mean
(5 + 10 + 6 + 13 +
   9 + 12 + 7)/7
```

```
## [1] 8.857143
```
{{content}}

---

## Measures of Center

* Denote by `$m$`
* If `$n$` is even, then it is the average of the middle two values

]

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
{{content}}

```r
# Median
9
```

```
## [1] 9
```
{{content}}

---

## Measures of Center

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

```r
# Mean
(5 + 10 + 6 + 13 +
   9 + 12 + 7)/7
```

```
## [1] 8.857143
```

```r
# Median
9
```

```
## [1] 9
```

]

* Suppose the 13 bikes was actually 130 bikes.  How would these summary statistics change?

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

Here is my proposal:

* Find how much each observation deviates from the mean.
* Compute the average of the deviations.

$$
\frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})
$$

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

Here is my proposal:

* Find how much each observation deviates from the mean.
* Compute the average of the deviations.

$$
\frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})
$$

**Problem?**

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

Here is my **NEW** proposal:

* Find how much each observation deviates from the mean.
* Compute the average of the **squared** deviations.

$$
\frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2
$$

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> Deviations </th>
   <th style="text-align:right;"> Dev_sqd </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> -3.86 </td>
   <td style="text-align:right;"> 14.88 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> -2.86 </td>
   <td style="text-align:right;"> 8.16 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> -1.86 </td>
   <td style="text-align:right;"> 3.45 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 0.14 </td>
   <td style="text-align:right;"> 0.02 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.14 </td>
   <td style="text-align:right;"> 1.31 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 3.14 </td>
   <td style="text-align:right;"> 9.88 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 4.14 </td>
   <td style="text-align:right;"> 17.16 </td>
  </tr>
</tbody>
</table>

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

Here is my **NEW** proposal:

* Find how much each observation deviates from the mean.
* Compute the average of the **squared** deviations.

$$
\frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2
$$

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

]

```r
# Calculate the measure of variability
(14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/7
```

```
## [1] 7.837143
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

Here is the **ACTUAL**:

* Find how much each observation deviates from the mean.
* Compute the (nearly) average of the **squared** deviations.
* Called **sample variance** `$s^2$`.

$$
s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2
$$

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

]

```r
# Calculate the measure of variability
(14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/6
```

```
## [1] 9.143333
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

* Find how much each observation deviates from the mean.
* Compute the (nearly) average of the **squared** deviations.
* Called the sample variance, `$s^2$`.
* The square root of the sample variance is called the **sample standard deviation** `$s$`.

$$
s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2}
$$

]

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Time </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 09:45:00 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:15:00 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:15:00 </td>
   <td style="text-align:right;"> 7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:45:00 </td>
   <td style="text-align:right;"> 9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:00:00 </td>
   <td style="text-align:right;"> 10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 11:00:00 </td>
   <td style="text-align:right;"> 12 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 07/04/2015 </td>
   <td style="text-align:left;"> 10:30:00 </td>
   <td style="text-align:right;"> 13 </td>
  </tr>
</tbody>
</table>
]

]

```r
# Calculate the measure of variability
sqrt((14.88 + 8.16 + 3.45 + 0.020 + 1.31 +9.88 + 17.16)/6)
```

```
## [1] 3.023795
```

---

## Measures of Variability

* In addition to the sample standard deviation and the sample variance, there is the Interquartile Range (IQR):

$$
\mbox{IQR} = \mbox{Q}_3 - \mbox{Q}_1
$$

* Which is more robust to outliers, the IQR or `$s$`?

* Which is more commonly used, the IQR or `$s$`?

---

## Now let's go through the Data Summarization handout!

## Two Minute Stretch

---

---

### Data Wrangling: Transformations done on the data

**Why wrangle the data?**

To **summarize** the data.

]

&#8594; To compute the mean and standard deviation of the bike counts.

]

To **drop** missing values.  (Need to be careful here!)

]

&#8594; In your P-Set 2, `geom_histogram()` is dropping NAs before creating the graph.

]

To **filter** to a particular subset of the data.

]

&#8594; To subset the bike counts data to 2 days in July of 2019.

]

To **collapse** the categories of a categorical variable.

]

&#8594; To go from 86 dog breeds to just mixed or single breed.

]

To **arrange** the data to make it easier to display.

]

&#8594; To sort from most common dog name to least common.

]

To fix how `R` **stores** a variable.

]

&#8594; I converted `Day` from a character variable/vector to a date variable/vector.

]

OR, to **combine** data frames when information about your cases is stored in multiple places!

---

## Data Joins

* Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes.

* Need principled way to combine.
    + Need a **key** that links two dataframes together.

* These multiple dataframes are called **relational data**.

---

## [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm) Consumer Expenditure Data

> .mustard[BLS Mission]: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

* Household survey but data are also collected on individuals
    + fmli: household data
    + memi: household member-level data

```r
#Read in data with readr package
library(tidyverse)
fmli <- read_csv("~/shared_data/stat100/data/fmli.csv", 
                 na = c("NA", "."))
memi <- read_csv("~/shared_data/stat100/data/memi.csv", 
                 na = c("NA", "."))
```

* Want variables on the **principal earner** to be added to the household data

---

## CE Data

* Key variable(s)?

```r
glimpse(fmli)
```

```
## Rows: 6,301
## Columns: 51
## $ NEWID    <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "…
## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "…
## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3…
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10…
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1…
## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "…
## $ EDUCA2   <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1…
## $ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4…
## $ AGE2     <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4…
## $ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1…
## $ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,…
## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,…
## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,…
## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2…
## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1…
## $ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3…
## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "…
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ IRAX     <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N…
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4…
## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3…
## $ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3…
## $ ROOMSQ   <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,…
## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4…
## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6…
## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1…
## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1…
## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "…
## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "…
## $ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25…
## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, …
## $ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141…
## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,…
## $ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CREDITX  <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200…
## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "…
## $ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
```

---

## CE Data

* Key variables?

```r
glimpse(memi)
```

```
## Rows: 15,412
## Columns: 14
## $ NEWID    <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "…
## $ MEMBNO   <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3…
## $ AGE      <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 11…
## $ SEX      <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1…
## $ EARNER   <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, NA…
## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, N…
## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, N…
## $ INCOMEY  <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, N…
## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", "",…
## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1…
## $ PAYSTUB  <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, N…
## $ SALARYX  <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA,…
## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, N…
```

---

## CE Data

* Key variables?
    + Problem with class?

```r
class(fmli$NEWID)
```

```
## [1] "character"
```

```r
class(memi$NEWID)
```

```
## [1] "character"
```

```r
class(fmli$PRINEARN)
```

```
## [1] "character"
```

```r
class(memi$MEMBNO)
```

```
## [1] "numeric"
```

---

## CE Data

* Key variables?
    + Problem with class?

```r
fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN))
class(fmli$PRINEARN)
```

```
## [1] "integer"
```

```r
class(memi$MEMBNO)
```

```
## [1] "numeric"
```

---

## CE Data

* Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data
    + What type of join is that?

---

## The World of Joins

* **Mutating joins**: Add new variables to one dataset from matching observations in another.
    + `left_join()` (and `right_join()`)
    + `inner_join()`
    + `full_join()`

* There are also *filtering* joins but we won't cover those today.

---

## Example Dataframes

```r
d1 <- data.frame(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4))
d2 <- data.frame(X = c(2, 4, 1), S = c(4, 13, 8))
d1
```

```
##   V X  Y
## 1 1 1 14
## 2 2 2  3
## 3 3 2  6
## 4 4 3  1
## 5 5 1  4
```

```r
d2
```

```
##   X  S
## 1 2  4
## 2 4 13
## 3 1  8
```

---

## `left_join()`

```r
d1_new <- left_join(d1, d2)
```

```
## Joining, by = "X"
```

```r
d1_new
```

```
##   V X  Y  S
## 1 1 1 14  8
## 2 2 2  3  4
## 3 3 2  6  4
## 4 4 3  1 NA
## 5 5 1  4  8
```

---

## `left_join()`

```r
d1_new <- left_join(d1, d2, by = c("X" = "X"))
d1_new
```

```
##   V X  Y  S
## 1 1 1 14  8
## 2 2 2  3  4
## 3 3 2  6  4
## 4 4 3  1 NA
## 5 5 1  4  8
```

---

## `left_join()`

```r
d1_new <- left_join(d1, d2, by = c("V" = "X"))
d1_new
```

```
##   V X  Y  S
## 1 1 1 14  8
## 2 2 2  3  4
## 3 3 2  6 NA
## 4 4 3  1 13
## 5 5 1  4 NA
```

---

## `inner_join()`

```r
d1_d2 <- inner_join(d1, d2, by = c("X" = "X"))
d1_d2
```

```
##   V X  Y S
## 1 1 1 14 8
## 2 2 2  3 4
## 3 3 2  6 4
## 4 5 1  4 8
```

---

## `inner_join()`

```r
d1_d2 <- inner_join(d1, d2, by = c("V" = "X"))
d1_d2
```

```
##   V X  Y  S
## 1 1 1 14  8
## 2 2 2  3  4
## 3 4 3  1 13
```

---

## `full_join()`

```r
d1_d2 <- full_join(d1, d2, by = c("X" = "X"))
d1_d2
```

```
##    V X  Y  S
## 1  1 1 14  8
## 2  2 2  3  4
## 3  3 2  6  4
## 4  4 3  1 NA
## 5  5 1  4  8
## 6 NA 4 NA 13
```

---

## `full_join()`

```r
d1_d2 <- full_join(d1, d2, by = c("V" = "X"))
d1_d2
```

```
##   V X  Y  S
## 1 1 1 14  8
## 2 2 2  3  4
## 3 3 2  6 NA
## 4 4 3  1 13
## 5 5 1  4 NA
```

---

## Back to our Example

* What kind of join do we want for the Consumer Expenditure data?
    + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data

---

## Look at the Possible Joins

```r
left_join(fmli, memi) 
```

```
## Joining, by = "NEWID"
```

```
## # A tibble: 15,412 × 64
##    NEWID    PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF  AGE2
##    <chr>      <int>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl>   <dbl> <dbl>
##  1 03324174       1  25985.  116920       1       2 16          15      63    50
##  2 03324174       1  25985.  116920       1       2 16          15      63    50
##  3 03324174       1  25985.  116920       1       2 16          15      63    50
##  4 03324174       1  25985.  116920       1       2 16          15      63    50
##  5 03324204       1   6581.     200       1       3 15          15      50    47
##  6 03324204       1   6581.     200       1       3 15          15      50    47
##  7 03324204       1   6581.     200       1       3 15          15      50    47
##  8 03324204       1   6581.     200       1       3 15          15      50    47
##  9 03324204       1   6581.     200       1       3 15          15      50    47
## 10 03324204       1   6581.     200       1       3 15          15      50    47
## # … with 15,402 more rows, 54 more variables: SEX_REF <dbl>, SEX2 <dbl>,
## #   REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>,
## #   MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>,
## #   EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>,
## #   HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>,
## #   STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>,
## #   ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, …
```

---

## Look at the Possible Joins

* Be careful.  This erroneous example made my R crash!

```r
left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO"))
```

---

## Look at the Possible Joins

```r
left_join(fmli, memi, by = c("NEWID" = "NEWID",
                             "PRINEARN" = "MEMBNO"))
```

```
## # A tibble: 6,301 × 63
##    NEWID    PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF  AGE2
##    <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl>   <dbl> <dbl>
##  1 03324174       1  25985.  116920       1       2 16          15      63    50
##  2 03324204       1   6581.     200       1       3 15          15      50    47
##  3 03324214       1  20208.  117000       1       4 16          13      47    46
##  4 03324244       1  18078.       0       1       2 15          NA      37    NA
##  5 03324274       2  20112.    2000       1       2 14          NA      51    NA
##  6 03324284       1  19907.     942       1       2 11          NA      63    NA
##  7 03324294       1  11705.       0       1       1 10          NA      77    NA
##  8 03324304       1  24431.   91000       1       2 13          15      37    36
##  9 03324324       2  42859.   95000       2       5 12          15      51    53
## 10 03324334       1  17481.   40037       1       2 12          14      64    67
## # … with 6,291 more rows, 53 more variables: SEX_REF <dbl>, SEX2 <dbl>,
## #   REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>,
## #   MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>,
## #   EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>,
## #   HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>,
## #   STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>,
## #   ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, …
```

---

## Look at the Possible Joins

```r
inner_join(fmli, memi, by = c("NEWID" = "NEWID",
                              "PRINEARN" = "MEMBNO"))
```

* Why does this give us the same answer as `left_join` for this situation?

---

## Look at the Possible Joins

```r
full_join(fmli, memi, by = c("NEWID" = "NEWID",
                             "PRINEARN" = "MEMBNO"))
```

```
## # A tibble: 15,412 × 63
##    NEWID    PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF  AGE2
##    <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl>   <dbl> <dbl>
##  1 03324174       1  25985.  116920       1       2 16          15      63    50
##  2 03324204       1   6581.     200       1       3 15          15      50    47
##  3 03324214       1  20208.  117000       1       4 16          13      47    46
##  4 03324244       1  18078.       0       1       2 15          NA      37    NA
##  5 03324274       2  20112.    2000       1       2 14          NA      51    NA
##  6 03324284       1  19907.     942       1       2 11          NA      63    NA
##  7 03324294       1  11705.       0       1       1 10          NA      77    NA
##  8 03324304       1  24431.   91000       1       2 13          15      37    36
##  9 03324324       2  42859.   95000       2       5 12          15      51    53
## 10 03324334       1  17481.   40037       1       2 12          14      64    67
## # … with 15,402 more rows, 53 more variables: SEX_REF <dbl>, SEX2 <dbl>,
## #   REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>,
## #   MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>,
## #   EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>,
## #   HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>,
## #   STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>,
## #   ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, …
```

---

## Joining Tips

```r
fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID",
                                     "PRINEARN" = "MEMBNO"))
```

* FIRST: conceptualize for yourself what you think you want the final dataset to look like!
* Check initial dimensions and final dimensions.
* Use variable names when joining even if they are the same.

---

---

## Data Viz

]

## Data Wrangling

]