background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Data Summarization]

<br>

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 3 | Spring 2022]

]

---

## Announcements

* Don't forget that P-Set 2 is due on Wednesday at 9am on Gradescope.
    + No late work accepted.
* Fill out the [Project Assignment Group Preferences Survey](https://forms.gle/S4jDMoPMUHg7Zcfv5) if:
    + You have particular people you'd like to work with.
    + Or you are a graduate student/auditor and want to work alone.

****************************

--

## Goals for Today

.pull-left[

* Measures for **summarizing** quantitative data
    + Center
    + Spread/variability

* Measures for **summarizing** categorical data

]

--

.pull-right[

* Start discussing **data wrangling**

]

---

<img src="img/dplyr.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

## Load Necessary Packages

`dplyr` is part of the `tidyverse` collection of data science packages.
```r
# Load necessary packages
library(tidyverse)
```

---

## Import the [Data](https://data.cambridgema.gov/Transportation-Planning/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg)

```r
bike_counter <- read_csv("https://data.cambridgema.gov/api/views/q8v9-mcfg/rows.csv")

# Inspect the data
glimpse(bike_counter)
```

```
## Rows: 224,907
## Columns: 7
## $ DateTime  <chr> "06/24/2015 12:00:00 AM", "06/24/2015 12:15:00 AM", "06/24/2…
## $ Day       <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Wednesd…
## $ Date      <chr> "06/24/2015", "06/24/2015", "06/24/2015", "06/24/2015", "06/…
## $ Time      <time> 00:00:00, 00:15:00, 00:30:00, 00:45:00, 01:00:00, 01:15:00,…
## $ Total     <dbl> 4, 3, 4, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Westbound <dbl> 1, 3, 3, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Eastbound <dbl> 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
```

---

## Summarizing Data

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td
style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
</tbody>
</table>

--

* Hard to do by eyeballing a spreadsheet with many rows!

---

## Summarizing Data Visually

.pull-left[

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />

]

--

.pull-right[

For a quantitative variable, want to answer:

* What is an **average** value?

* What is the **trend/shape** of the variable?

* How much **variation** is there from case to case?

]

---

## Summarizing Quantitative Variables

For a quantitative variable, want to answer:

* What is an average value?

* What is the trend/shape of the variable?

* How much variation is there from case to case?

--

Need to learn some **summary statistics**: Numerical values computed based on the observed cases.

---

## Measures of Center

.pull-left[

**Mean: average of all the observations**

* `\(n\)` = Number of cases (sample size)
* `\(x_i\)` = value of the i-th observation
* Denote by `\(\bar{x}\)`

$$ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i $$

]

.pull-right[

{{content}}

]

--

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015
</td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
</tbody>
</table>

{{content}}

--

```r
# Mean
(5 + 10 + 6 + 13 + 9 + 12 + 7)/7
```

```
## [1] 8.857143
```

{{content}}

---

## Measures of Center

.pull-left[

#### Median: Middle value (50th percentile)

* Denote by `\(m\)`
* If `\(n\)` is even, then it is the average of the middle two values

]

.pull-right[

{{content}}

]

--

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;">
07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

{{content}}

--

```r
# Median: the 4th of the 7 sorted counts
9
```

```
## [1] 9
```

{{content}}

---

## Measures of Center

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

```r
# Mean
(5 + 10 + 6 + 13 + 9 + 12 + 7)/7
```

```
## [1] 8.857143
```

```r
# Median: the 4th of the 7 sorted counts
9
```

```
## [1] 9
```

]

* Suppose the count of 13 bikes was actually 130 bikes. How would these summary statistics change?

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

--

.pull-left[

Here is my proposal:

* Find how much each observation deviates from the mean.

* Compute the average of the deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> </tr>
<tr> <td style="text-align:right;"> 0.14 </td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> </tr>
</tbody>
</table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my proposal:

* Find how much each observation deviates from the mean.

* Compute the average of the deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) $$

**Problem?**

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> </tr>
<tr> <td
style="text-align:right;"> 0.14 </td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> </tr>
</tbody>
</table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my **NEW** proposal:

* Find how much each observation deviates from the mean.

* Compute the average of the **squared** deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr>
<tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr>
</tbody>
</table>

]

]

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is my **NEW** proposal:

* Find how much each observation deviates from the mean.

* Compute the average of the **squared** deviations.

$$ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr>
<tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02
</td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr>
</tbody>
</table>

]

]

```r
# Average of the squared deviations (dividing by n = 7)
(14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/7
```

```
## [1] 7.837143
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

Here is the **ACTUAL** measure:

* Find how much each observation deviates from the mean.

* Compute (nearly) the average of the **squared** deviations: divide by `\(n - 1\)` instead of `\(n\)`.

* Called **sample variance** `\(s^2\)`.

$$ s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2 $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td
style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr>
<tr> <td style="text-align:right;"> 0.14 </td> <td style="text-align:right;"> 0.02 </td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr>
</tbody>
</table>

]

]

```r
# Sample variance: sum of the squared deviations divided by n - 1 = 6
(14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/6
```

```
## [1] 9.143333
```

---

## Measures of Variability

* Want a statistic that captures how much observations will likely deviate from the mean

.pull-left[

* Find how much each observation deviates from the mean.

* Compute (nearly) the average of the **squared** deviations: divide by `\(n - 1\)` instead of `\(n\)`.

* Called the sample variance, `\(s^2\)`.

* The square root of the sample variance is called the **sample standard deviation** `\(s\)`.

$$ s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2} $$

]

.pull-right[

.pull-left[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Total </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 09:45:00 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:15:00 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:15:00 </td> <td style="text-align:right;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:45:00 </td> <td style="text-align:right;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:00:00 </td> <td style="text-align:right;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 11:00:00 </td> <td style="text-align:right;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> 07/04/2015 </td> <td style="text-align:left;"> 10:30:00 </td> <td style="text-align:right;"> 13 </td> </tr>
</tbody>
</table>

]

.pull-right[

<table class="table table-responsive table-bordered table-striped" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> Deviations </th> <th style="text-align:right;"> Dev_sqd </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> -3.86 </td> <td style="text-align:right;"> 14.88 </td> </tr>
<tr> <td style="text-align:right;"> -2.86 </td> <td style="text-align:right;"> 8.16 </td> </tr>
<tr> <td style="text-align:right;"> -1.86 </td> <td style="text-align:right;"> 3.45 </td> </tr>
<tr> <td style="text-align:right;"> 0.14 </td> <td
style="text-align:right;"> 0.02 </td> </tr>
<tr> <td style="text-align:right;"> 1.14 </td> <td style="text-align:right;"> 1.31 </td> </tr>
<tr> <td style="text-align:right;"> 3.14 </td> <td style="text-align:right;"> 9.88 </td> </tr>
<tr> <td style="text-align:right;"> 4.14 </td> <td style="text-align:right;"> 17.16 </td> </tr>
</tbody>
</table>

]

]

```r
# Sample standard deviation: square root of the sample variance
sqrt((14.88 + 8.16 + 3.45 + 0.020 + 1.31 + 9.88 + 17.16)/6)
```

```
## [1] 3.023795
```

---

## Measures of Variability

* In addition to the sample standard deviation and the sample variance, there is the Interquartile Range (IQR):

$$ \mbox{IQR} = \mbox{Q}_3 - \mbox{Q}_1 $$

* Which is more robust to outliers, the IQR or `\(s\)`?

* Which is more commonly used, the IQR or `\(s\)`?

---
class: center, middle, inverse

## Two Minute Stretch

--

## Now let's go through the Data Summarization handout!

---
class: middle, center

<img src="img/dplyr_wrangling.png" width="750px"/>

---

### Data Wrangling: Transformations done on the data

--

**Why wrangle the data?**

--

.pull-left[
To **summarize** the data.
]

.pull-right[
→ To compute the mean and standard deviation of the bike counts.
]

--

.pull-left[
To **drop** missing values. (Need to be careful here!)
]

.pull-right[
→ In your P-Set 2, you are dropping rows with NAs for particular variables before creating some of your graphs.
]

--

.pull-left[
To **filter** to a particular subset of the data.
]

.pull-right[
→ To subset the bike counts data to 2 days in July of 2019.
]

--

.pull-left[
To **collapse** the categories of a categorical variable.
]

.pull-right[
→ To go from 86 dog breeds to just mixed or single breed.
]

--

.pull-left[
To **arrange** the data to make it easier to display.
]

.pull-right[
→ To sort from most common dog name to least common.
]

--

.pull-left[
To fix how `R` **stores** a variable.
]

.pull-right[
→ I converted `Day` from a character variable/vector to a date variable/vector.
]

---
class: middle, center

<img src="img/DAW.png" width="750px"/>

---
class: center, middle

.pull-left[

## Data Viz

<iframe src="https://giphy.com/embed/d31vTpVi1LAcDvdm" width="480" height="362" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/netflix-d31vTpVi1LAcDvdm">via GIPHY</a></p>

]

--

.pull-right[

## Data Wrangling

<iframe src="https://giphy.com/embed/DbaUtl1DcLyrdwhzGJ" width="480" height="362" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/Amalgia-DbaUtl1DcLyrdwhzGJ">via GIPHY</a></p>

]

---

## Reminders

* Don't forget that P-Set 2 is due on Wednesday at 9am on Gradescope.
    + No late work accepted.
* Fill out the [Project Assignment Group Preferences Survey](https://forms.gle/S4jDMoPMUHg7Zcfv5) if:
    + You have particular people you'd like to work with.
    + Or you are a graduate student/auditor and want to work alone.
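
---

## Checking the Summaries in R

The hand calculations for the seven 07/04/2015 counts can be checked with base `R` functions. This is a minimal sketch rather than code from the lecture: the object names `total_counts` and `with_outlier` are made up for illustration, and the second vector simply assumes the count of 13 had been recorded as 130.

```r
# Seven 15-minute counts from the 07/04/2015 example table
# (the name `total_counts` is hypothetical, chosen for this sketch)
total_counts <- c(5, 10, 6, 13, 9, 12, 7)

# Measures of center
x_bar <- mean(total_counts)    # sum of the counts divided by n = 7
m     <- median(total_counts)  # middle value of the sorted counts

# Measures of variability
s_sq <- var(total_counts)      # sample variance, divides by n - 1
s    <- sd(total_counts)       # sample standard deviation, sqrt of the variance
iqr  <- IQR(total_counts)      # Q3 - Q1, using R's default quantile rule

# Hypothetical outlier: the count of 13 recorded as 130 instead
with_outlier <- replace(total_counts, total_counts == 13, 130)

# Compare each statistic with and without the outlier
data.frame(
  statistic    = c("mean", "median", "sd", "IQR"),
  original     = c(x_bar, m, s, iqr),
  with_outlier = c(mean(with_outlier), median(with_outlier),
                   sd(with_outlier), IQR(with_outlier))
)
```

Because `var()` and `sd()` work from unrounded deviations, their results can differ in the later decimal places from the hand calculations, which sum squared deviations rounded to two decimals. Quartiles can be defined in several ways, so the IQR from `IQR()` may not match a by-hand value computed with a different rule.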