background-image: url("img/DAW.png") background-position: left background-size: 50% class: middle, center, .pull-right[ ## .base-blue[Data Summarization] <br> <br> ### .purple[Kelly McConville] #### .purple[ Stat 100 | Week 4 | Fall 2022] ] --- ## Announcements * Don't forget that P-Set 2 is due tomorrow by 5pm on Gradescope. + Come by office hours with questions! **************************** -- ## Goals for Today .pull-left[ * Consider measures for **summarizing** quantitative data + Center + Spread/variability * Consider measures for **summarizing** categorical data ] -- .pull-right[ * Define **data wrangling** * Learn data **joins** ] --- <img src="img/dplyr.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" /> ## Load Necessary Packages `dplyr` is part of this collection of data science packages. ```r # Load necessary packages library(tidyverse) ``` --- ## Import the [Data](https://data.cambridgema.gov/Transportation-Planning/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg) ```r bike_counter <- read_csv("https://data.cambridgema.gov/api/views/q8v9-mcfg/rows.csv") # Inspect the data glimpse(bike_counter) ``` ``` ## Rows: 245,890 ## Columns: 7 ## $ DateTime <chr> "06/24/2015 12:00:00 AM", "06/24/2015 12:15:00 AM", "06/24/2… ## $ Day <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Wednesd… ## $ Date <chr> "06/24/2015", "06/24/2015", "06/24/2015", "06/24/2015", "06/… ## $ Time <time> 00:00:00, 00:15:00, 00:30:00, 00:45:00, 01:00:00, 01:15:00,… ## $ Total <dbl> 4, 3, 4, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, … ## $ Westbound <dbl> 1, 3, 3, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, … ## $ Eastbound <dbl> 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ``` --- ## Summarizing Data <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> </tbody> </table> -- * Hard to do by eyeballing a spreadsheet with many rows! --- ## Summarizing Data Visually .pull-left[ <img src="stat100_wk04mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" /> ] -- .pull-right[ For a quantitative variable, want to answer: * What is an **average** value? * What is the **trend/shape** of the variable? * How much **variation** is there from case to case? ] --- ## Summarizing Quantitative Variables For a quantitative variable, want to answer: * What is an average value? * What is the trend/shape of the variable? * How much variation is there from case to case? -- Need to learn some **summary statistics**: Numerical values computed based on the observed cases. --- ## Measures of Center .pull-left[ **Mean: average of all the observations** * `\(n\)` = Number of cases (sample size) * `\(x_i\)` = value of the i-th observation * Denote by `\(\bar{x}\)` $$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$ ] .pull-right[ {{content}} ] -- <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> </tbody> </table> {{content}} -- ```r # Mean (5 + 10 + 6 + 13 + 9 + 12 + 7)/7 ``` ``` ## [1] 8.857143 ``` {{content}} --- ## Measures of Center .pull-left[ #### Median: Middle value, 50% * Denote by `\(m\)` * If `\(n\)` is even, then it is the average of the middle two values ] .pull-right[ {{content}} ] -- <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> {{content}} -- ```r # Median 9 ``` ``` ## [1] 9 ``` {{content}} --- ## Measures of Center .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ ```r # Mean (5 + 10 + 6 + 13 + 9 + 12 + 7)/7 ``` ``` ## [1] 8.857143 ``` ```r # Median 9 ``` ``` ## [1] 9 ``` ] * Suppose the 13 bikes was actually 130 bikes. How would these summary statistics change? --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean -- .pull-left[ Here is my proposal: * Find how much each observation deviates from the mean. * Compute the average of the deviations. $$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$ ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> </tr> </tbody> </table> ] ] --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean .pull-left[ Here is my proposal: * Find how much each observation deviates from the mean. * Compute the average of the deviations. $$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$ **Problem?** ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> </tr> </tbody> </table> ] ] --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean .pull-left[ Here is my **NEW** proposal: * Find how much each observation deviates from the mean. * Compute the average of the **squared** deviations. $$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$ ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr> </tbody> </table> ] ] --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean .pull-left[ Here is my **NEW** proposal: * Find how much each observation deviates from the mean. * Compute the average of the **squared** deviations. $$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$ ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr> </tbody> </table> ] ] ```r # Calculate the measure of variability (14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/7 ``` ``` ## [1] 7.837143 ``` --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean .pull-left[ Here is the **ACTUAL**: * Find how much each observation deviates from the mean. * Compute the (nearly) average of the **squared** deviations. * Called **sample variance** `\(s^2\)`. $$ s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2 $$ ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr> </tbody> </table> ] ] ```r # Calculate the measure of variability (14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/6 ``` ``` ## [1] 9.143333 ``` --- ## Measures of Variability * Want a statistic that captures how much observations will likely deviate from the mean .pull-left[ * Find how much each observation deviates from the mean. * Compute the (nearly) average of the **squared** deviations. * Called the sample variance, `\(s^2\)`. * The square root of the sample variance is called the **sample standard deviation** `\(s\)`. $$ s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2} $$ ] .pull-right[ .pull-left[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr> <tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr> <tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr> <tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr> <tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr> <tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr> <tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr> </tbody> </table> ] ] ```r # Calculate the measure of variability sqrt((14.88 + 8.16 + 3.45 + 0.020 + 1.31 +9.88 + 17.16)/6) ``` ``` ## [1] 3.023795 ``` --- ## Measures of Variability * In addition to the sample standard deviation and the sample variance, there is the Interquartile Range (IQR): $$ \mbox{IQR} = \mbox{Q}_3 - \mbox{Q}_1 $$ * Which is more robust to outliers, the IQR or `\(s\)`? * Which is more commonly used, the IQR or `\(s\)`? --- class: center, middle, ## Now let's go through the Data Summarization handout! -- ## Two Minute Stretch --- class: middle, center <img src="img/dplyr_wrangling.png" width="750px"/> --- ### Data Wrangling: Transformations done on the data -- **Why wrangle the data?** -- .pull-left[ To **summarize** the data. ] .pull-right[ → To compute the mean and standard deviation of the bike counts. ] -- .pull-left[ To **drop** missing values. (Need to be careful here!) ] .pull-right[ → In your P-Set 2, `geom_histogram()` is dropping NAs before creating the graph. ] -- .pull-left[ To **filter** to a particular subset of the data. ] .pull-right[ → To subset the bike counts data to 2 days in July of 2019. ] -- .pull-left[ To **collapse** the categories of a categorical variable. ] .pull-right[ → To go from 86 dog breeds to just mixed or single breed. ] -- .pull-left[ To **arrange** the data to make it easier to display. ] .pull-right[ → To sort from most common dog name to least common. ] -- .pull-left[ To fix how `R` **stores** a variable. ] .pull-right[ → I converted `Day` from a character variable/vector to a date variable/vector. ] OR, to **combine** data frames when information about your cases is stored in multiple places! --- ## Data Joins * Often in the data analysis workflow, we have more than one data source, which means more than one dataframe, and we want to combine these dataframes. -- * Need principled way to combine. + Need a **key** that links two dataframes together. -- * These multiple dataframes are called **relational data**. --- <img src="img/bls2.png" width="15%" style="float:left; padding:15px" style="display: block; margin: auto;" /> ## [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm) Consumer Expenditure Data > .mustard[BLS Mission]: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making." -- * Household survey but data are also collected on individuals + fmli: household data + memi: household member-level data ```r #Read in data with readr package library(tidyverse) fmli <- read_csv("~/shared_data/stat100/data/fmli.csv", na = c("NA", ".")) memi <- read_csv("~/shared_data/stat100/data/memi.csv", na = c("NA", ".")) ``` * Want variables on the **principal earner** to be added to the household data --- ## CE Data * Key variable(s)? ```r glimpse(fmli) ``` ``` ## Rows: 6,301 ## Columns: 51 ## $ NEWID <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "… ## $ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "… ## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3… ## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10… ## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ POPSIZE <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1… ## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "… ## $ EDUCA2 <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1… ## $ AGE_REF <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4… ## $ AGE2 <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4… ## $ SEX_REF <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1… ## $ SEX2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,… ## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1… ## $ RACE2 <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,… ## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1… ## $ HISP2 <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,… ## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2… ## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1… ## $ REGION <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3… ## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "… ## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ FOODCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TRANSCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ EDUCACQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… ## $ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ IRAX <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N… ## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4… ## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3… ## $ VEHQ <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3… ## $ ROOMSQ <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,… ## $ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4… ## $ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6… ## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1… ## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1… ## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "… ## $ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "… ## $ STATE <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25… ## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, … ## $ TOTXEST <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141… ## $ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,… ## $ CREDITB <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CREDITX <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200… ## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "… ## $ ST_HOUS <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2… ## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- ## CE Data * Key variables? ```r glimpse(memi) ``` ``` ## Rows: 15,412 ## Columns: 14 ## $ NEWID <chr> "03552611", "03552641", "03552641", "03552651", "03552651", "… ## $ MEMBNO <dbl> 1, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3… ## $ AGE <dbl> 58, 54, 49, 39, 10, 32, 7, 9, 38, 34, 11, 8, 6, 3, 65, 61, 11… ## $ SEX <dbl> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1… ## $ EARNER <dbl> 1, 1, 2, 2, NA, 2, NA, NA, 1, 2, NA, NA, NA, NA, 1, 2, NA, NA… ## $ EARNTYPE <dbl> 2, 1, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA, 4, NA, N… ## $ INC_HRSQ <dbl> 20, 56, NA, NA, NA, NA, NA, NA, 50, NA, NA, NA, NA, NA, 25, N… ## $ INCOMEY <dbl> 4, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 4, NA, N… ## $ OCCUCODE <chr> "10", "05", "", "", "", "", "", "", "03", "", "", "", "", "",… ## $ HISPANIC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ MEMBRACE <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1… ## $ PAYSTUB <dbl> 1, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 2, NA, N… ## $ SALARYX <dbl> 8982, NA, NA, NA, NA, NA, NA, NA, 280500, NA, NA, NA, NA, NA,… ## $ WKSTATUS <dbl> 1, 1, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, N… ``` --- ## CE Data * Key variables? + Problem with class? ```r class(fmli$NEWID) ``` ``` ## [1] "character" ``` ```r class(memi$NEWID) ``` ``` ## [1] "character" ``` ```r class(fmli$PRINEARN) ``` ``` ## [1] "character" ``` ```r class(memi$MEMBNO) ``` ``` ## [1] "numeric" ``` --- ## CE Data * Key variables? + Problem with class? ```r fmli <- mutate(fmli, PRINEARN = as.integer(PRINEARN)) class(fmli$PRINEARN) ``` ``` ## [1] "integer" ``` ```r class(memi$MEMBNO) ``` ``` ## [1] "numeric" ``` --- ## CE Data * Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data + What type of join is that? --- ## The World of Joins * **Mutating joins**: Add new variables to one dataset from matching observations in another. + `left_join()` (and `right_join()`) + `inner_join()` + `full_join()` * There are also *filtering* joins but we won't cover those today. --- ## Example Dataframes ```r d1 <- data.frame(V = 1:5, X = c(1, 2, 2, 3, 1), Y = c(14, 3, 6, 1, 4)) d2 <- data.frame(X = c(2, 4, 1), S = c(4, 13, 8)) d1 ``` ``` ## V X Y ## 1 1 1 14 ## 2 2 2 3 ## 3 3 2 6 ## 4 4 3 1 ## 5 5 1 4 ``` ```r d2 ``` ``` ## X S ## 1 2 4 ## 2 4 13 ## 3 1 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2) ``` ``` ## Joining, by = "X" ``` ```r d1_new ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("X" = "X")) d1_new ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ``` --- ## `left_join()` ```r d1_new <- left_join(d1, d2, by = c("V" = "X")) d1_new ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 5 1 4 8 ``` --- ## `inner_join()` ```r d1_d2 <- inner_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 4 3 1 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("X" = "X")) d1_d2 ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 4 ## 4 4 3 1 NA ## 5 5 1 4 8 ## 6 NA 4 NA 13 ``` --- ## `full_join()` ```r d1_d2 <- full_join(d1, d2, by = c("V" = "X")) d1_d2 ``` ``` ## V X Y S ## 1 1 1 14 8 ## 2 2 2 3 4 ## 3 3 2 6 NA ## 4 4 3 1 13 ## 5 5 1 4 NA ``` --- ## Back to our Example * What kind of join do we want for the Consumer Expenditure data? + Want to add columns of `memi` to `fmli` that correspond to the principal earner's memi data --- ## Look at the Possible Joins ```r left_join(fmli, memi) ``` ``` ## Joining, by = "NEWID" ``` ``` ## # A tibble: 15,412 × 64 ## NEWID PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF AGE2 ## <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 50 ## 2 03324174 1 25985. 116920 1 2 16 15 63 50 ## 3 03324174 1 25985. 116920 1 2 16 15 63 50 ## 4 03324174 1 25985. 116920 1 2 16 15 63 50 ## 5 03324204 1 6581. 200 1 3 15 15 50 47 ## 6 03324204 1 6581. 200 1 3 15 15 50 47 ## 7 03324204 1 6581. 200 1 3 15 15 50 47 ## 8 03324204 1 6581. 200 1 3 15 15 50 47 ## 9 03324204 1 6581. 200 1 3 15 15 50 47 ## 10 03324204 1 6581. 200 1 3 15 15 50 47 ## # … with 15,402 more rows, 54 more variables: SEX_REF <dbl>, SEX2 <dbl>, ## # REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, ## # MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, ## # EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, ## # HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, ## # STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ## # ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins * Be careful. This erroneous example made my R crash! ```r left_join(fmli, memi, by = c("PRINEARN" = "MEMBNO")) ``` --- ## Look at the Possible Joins ```r left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF AGE2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 50 ## 2 03324204 1 6581. 200 1 3 15 15 50 47 ## 3 03324214 1 20208. 117000 1 4 16 13 47 46 ## 4 03324244 1 18078. 0 1 2 15 NA 37 NA ## 5 03324274 2 20112. 2000 1 2 14 NA 51 NA ## 6 03324284 1 19907. 942 1 2 11 NA 63 NA ## 7 03324294 1 11705. 0 1 1 10 NA 77 NA ## 8 03324304 1 24431. 91000 1 2 13 15 37 36 ## 9 03324324 2 42859. 95000 2 5 12 15 51 53 ## 10 03324334 1 17481. 40037 1 2 12 14 64 67 ## # … with 6,291 more rows, 53 more variables: SEX_REF <dbl>, SEX2 <dbl>, ## # REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, ## # MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, ## # EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, ## # HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, ## # STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ## # ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Look at the Possible Joins ```r inner_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) ``` ``` ## # A tibble: 6,301 × 63 ## NEWID PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF AGE2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 50 ## 2 03324204 1 6581. 200 1 3 15 15 50 47 ## 3 03324214 1 20208. 117000 1 4 16 13 47 46 ## 4 03324244 1 18078. 0 1 2 15 NA 37 NA ## 5 03324274 2 20112. 2000 1 2 14 NA 51 NA ## 6 03324284 1 19907. 942 1 2 11 NA 63 NA ## 7 03324294 1 11705. 0 1 1 10 NA 77 NA ## 8 03324304 1 24431. 91000 1 2 13 15 37 36 ## 9 03324324 2 42859. 95000 2 5 12 15 51 53 ## 10 03324334 1 17481. 40037 1 2 12 14 64 67 ## # … with 6,291 more rows, 53 more variables: SEX_REF <dbl>, SEX2 <dbl>, ## # REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, ## # MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, ## # EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, ## # HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, ## # STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ## # ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` * Why does this give us the same answer as `left_join` for this situation? --- ## Look at the Possible Joins ```r full_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) ``` ``` ## # A tibble: 15,412 × 63 ## NEWID PRINE…¹ FINLW…² FINCB…³ BLS_U…⁴ POPSIZE EDUC_…⁵ EDUCA2 AGE_REF AGE2 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 03324174 1 25985. 116920 1 2 16 15 63 50 ## 2 03324204 1 6581. 200 1 3 15 15 50 47 ## 3 03324214 1 20208. 117000 1 4 16 13 47 46 ## 4 03324244 1 18078. 0 1 2 15 NA 37 NA ## 5 03324274 2 20112. 2000 1 2 14 NA 51 NA ## 6 03324284 1 19907. 942 1 2 11 NA 63 NA ## 7 03324294 1 11705. 0 1 1 10 NA 77 NA ## 8 03324304 1 24431. 91000 1 2 13 15 37 36 ## 9 03324324 2 42859. 95000 2 5 12 15 51 53 ## 10 03324334 1 17481. 40037 1 2 12 14 64 67 ## # … with 15,402 more rows, 53 more variables: SEX_REF <dbl>, SEX2 <dbl>, ## # REF_RACE <dbl>, RACE2 <dbl>, HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, ## # MARITAL1 <dbl>, REGION <dbl>, SMSASTAT <dbl>, HIGH_EDU <chr>, ## # EHOUSNGC <dbl>, TOTEXPCQ <dbl>, FOODCQ <dbl>, TRANSCQ <dbl>, ## # HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>, TOBACCCQ <dbl>, ## # STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, FAM_SIZE <dbl>, VEHQ <dbl>, ## # ROOMSQ <dbl>, INC_HRS1 <dbl>, INC_HRS2 <dbl>, EARNCOMP <dbl>, … ``` --- ## Joining Tips ```r fmli <- left_join(fmli, memi, by = c("NEWID" = "NEWID", "PRINEARN" = "MEMBNO")) ``` * FIRST: conceptualize for yourself what you think you want the final dataset to look like! * Check initial dimensions and final dimensions. * Use variable names when joining even if they are the same. --- class: middle, center <img src="img/DAW.png" width="750px"/> --- class: center, middle .pull-left[ ## Data Viz <iframe src="https://giphy.com/embed/d31vTpVi1LAcDvdm" width="480" height="362" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/netflix-d31vTpVi1LAcDvdm">via GIPHY</a></p> ] -- .pull-right[ ## Data Wrangling <iframe src="https://giphy.com/embed/DbaUtl1DcLyrdwhzGJ" width="480" height="362" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/Amalgia-DbaUtl1DcLyrdwhzGJ">via GIPHY</a></p> ]