background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Data Visualization]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 3 | Fall 2022]

]

---

## Announcements

* Remember that the [standard office hours schedule](https://docs.google.com/spreadsheets/d/1eHOdLQGw3mEEvahOM1cKi9p0eu_2TdI6G8b4vdb9hqE/edit?usp=sharing) is now in full swing.

****************************

--

## Goals for Today

.pull-left[

First Segment:

* Motivate data visualizations.
* Develop **language** to talk about the components of a graphic.
* Practice deconstructing graphics.
* Discuss good graphical practices.

]

--

.pull-right[

Second Segment:

* Learn the general structure of `ggplot2`.
* Learn a few standard graphs for numerical/quantitative data:
    + **Histogram**: one numerical variable
    + **Side-by-side boxplot**: one numerical variable and one categorical variable
    + **Side-by-side violin plot**: one numerical variable and one categorical variable

]

---

## But First...

--

* Where do we access p-sets?

--

* First step should always be saving the shared `Rmd` to your home folder.

--

* `RMarkdown` Workflow:
    * Modify the Rmd file.
    * If the modifications are code, run the code in the console to debug.
    * Knit.
    * Look over the output document.
    * Repeat.

---

class: middle, center

# Why construct a graph?

--

### To **summarize** the data.

--

### To look for **trends** between variables and make **comparisons**.

--

### To tell a compelling **story**.

---

# Challenger

* On January 27th, 1986, engineers from Morton Thiokol recommended that NASA delay the launch of the space shuttle *Challenger* due to cold weather.
    * Believed cold weather impacted the o-rings that held the rockets together.
    * Used 13 charts in their argument.

--

* After a two-hour conference call, the engineers' recommendation was overruled due to lack of persuasive evidence and the launch proceeded.

--

* The *Challenger* exploded 73 seconds into launch.
---

# Challenger

.left-column[

* Here's one of those charts.

]

.right-column[

<img src="img/table_o_rings.jpg" width="90%" style="display: block; margin: auto;" />

]

---

.left-column[

# Challenger

* Here's another one of those charts.

]

.right-column[

<img src="img/o_ring_rockets.jpg" width="45%" style="display: block; margin: auto;" />

]

---

# Challenger

.left-column[

* Here's a graphic I created from [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte)'s data.

]

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" />

---

# Challenger

.left-column[

* This adaptation is a recreation of Edward Tufte's graphic.
* For more information on this example and other examples, check out [Tufte's book](https://www.edwardtufte.com/tufte/books_visex).

]

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" />

---

class: middle, center

## Now let's learn the .orange[Grammar of Graphics].

--

#### We will use this grammar to:

--

Decompose and understand existing graphs.

--

Create our own graphs with the `R` package `ggplot2`.

---

# Grammar of Graphics

* **data**: Data frame that contains the raw data
    + Variables
* **geom**: Geometric **shape** that the data are mapped to.
    + Point, line, bar, text, ...
* **aesthetic**: Visual properties of the **geom**
    + X (horizontal) position, y (vertical) position, color, fill, shape
* **scale**: Controls how data are mapped to the visual values of the aesthetic.
    + EX: particular colors
* **guide**: Legend/key to help user convert visual display back to the data

--

For right now, we won't focus on the **names** of particular types of graphs (e.g., scatterplot) but on the **elements** of graphs.

---

.left-column[

### Example 1

* What story is the graph telling?
* What are the variables here?
* What **geom** are the variables mapped to?
* What are the **aesthetic**s of the **geom**?
* Which variable sets the value of that **aesthetic**?
* What additional context does this graphic provide?

]

.right-column[

<img src="img/anne_hathaway.png" width="80%" style="display: block; margin: auto;" />

]

---

.left-column[

### Example 2

* What story is the graph telling?
* What are the variables here?
* What **geom** are the variables mapped to?
* What are the **aesthetic**s of the **geom**?
* Which variable sets the value of that **aesthetic**?
* What additional context does this graphic provide?

]

.right-column[

<img src="img/harassment_graphic.png" width="70%" style="display: block; margin: auto;" />

]

---

.left-column[

### Example 3

* What story is the graph telling?
* What are the variables here?
* What **geom** are the variables mapped to?
* What are the **aesthetic**s of the **geom**?
* Which variable sets the value of that **aesthetic**?
* What additional context does this graphic provide?

]

.right-column[

<img src="img/covid-wastewater.png" width="90%" style="display: block; margin: auto;" />

]

---

## Best Practices: Context

* Think about the **stories/questions** your visualization answers.

--

* Determine what **context/background information** your viewer needs.

--

* For context, at a minimum include:
    + Axis labels (with units reported).
    + Legends.
    + Data source.

--

* Visualizing data involves **editorial choices**.
    + What to highlight.
    + What comparisons to make easy to see.
    + What scales to use.

---

.left-column[

## Context Example

]

.right-column[

<img src="img/moma_size.png" width="90%" style="display: block; margin: auto;" />

]

---

.left-column[

## Best Practices: Order of perception

]

.right-column[

<div class="figure" style="text-align: center">
<img src="img/visual_cues.png" alt="Yau (2013)" width="60%" />
<p class="caption">Yau (2013)</p>
</div>

]

---

## Best Practices: Be careful with color.
.left-column[

* Consider color blindness.

]

.right-column[

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-10-1.png" width="576" style="display: block; margin: auto;" />

]

---

## Best Practices: Color Palettes -- Sequential

.left-column[

* [Dude map](https://qz.com/316906/the-dude-map-how-american-men-refer-to-their-bros/)
* Note: Maps are also a great way to provide context!

]

.right-column[

<img src="img/dude.png" width="80%" style="display: block; margin: auto;" />

]

---

## Best Practices: Color Palettes -- Diverging

.left-column[

* [Adam Pearce's 2015 NBA Games](https://roadtolarissa.com/nba-minutes/)

]

.right-column[

<img src="img/nba_2015.png" width="95%" style="display: block; margin: auto;" />

]

---

## Best Practices: Color Palettes -- Qualitative

.left-column[

* [information is beautiful's Best in Show](https://www.informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/)

]

.right-column[

<img src="img/dogs.png" width="85%" style="display: block; margin: auto;" />

]

---

## Best Practices: Aspect Ratio

* Aspect ratio affects our perception of the rate of change.

<div class="figure" style="text-align: center">
<img src="img/aspect_ratio.png" alt="http://socviz.co/lookatdata.html" width="80%" />
<p class="caption">http://socviz.co/lookatdata.html</p>
</div>

---

## Many Ways To Visually Tell A Story

Washington Post's Approach:

<img src="img/wp_shooter.png" width="50%" style="display: block; margin: auto;" />

[Periscopic's Approach](https://guns.periscopic.com/?year=2013)

---

## Bad Graphics

.left-column[

* It is much easier to make a bad graph than a good graph.
]

.right-column[

<div class="figure" style="text-align: center">
<img src="img/pizza_pie.png" alt="YouGov" width="50%" />
<p class="caption">YouGov</p>
</div>

]

---

## Misleading Graphics

.left-column[

* Ethical data viz: Be careful that your editorial choices don't lead your viewer to draw incorrect conclusions about the data.

]

.right-column[

<div class="figure" style="text-align: center">
<img src="img/FLguns.jpg" alt="Modern Data Science with R" width="70%" />
<p class="caption">Modern Data Science with R</p>
</div>

]

---

## Summary Thoughts on Best Practices

* Good graphics are ones where the findings and insights are **obvious** to the viewer.
    + You have power over the patterns your viewer sees and the conclusions they draw. Use that power ethically.

--

* Facilitate the **comparisons** that correspond to the research question.

--

* Add information and key **context**.

--

* Data visualizations are **not neutral**.

--

* It is easier to see the differences and similarities between different types of graphics if we learn the **grammar of graphics**.

--

* Practicing **decomposing** graphics should make it easier for us to **compose** our own graphics.

---

class: center, middle

## Two Minute Stretch

---

<img src="img/ggplot2.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

## Load Necessary Packages

`ggplot2` is part of the `tidyverse` collection of data science packages.
```r
# Load necessary packages
library(tidyverse)
```

---

## Data Setting: [Eco-Totem Broadway Bicycle Count](https://data.cambridgema.gov/Transportation-Planning/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg)

.pull-left[

<img src="img/counter.jpg" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="img/bike_counter_map.png" width="90%" style="display: block; margin: auto;" />

]

---

## Import the [Data](https://data.cambridgema.gov/Transportation-Planning/Eco-Totem-Broadway-Bicycle-Count/q8v9-mcfg)

```r
bike_counter <- read_csv("https://data.cambridgema.gov/api/views/q8v9-mcfg/rows.csv")

# Inspect the data
glimpse(bike_counter)
```

```
## Rows: 245,218
## Columns: 7
## $ DateTime  <chr> "06/24/2015 12:00:00 AM", "06/24/2015 12:15:00 AM", "06/24/2…
## $ Day       <chr> "Wednesday", "Wednesday", "Wednesday", "Wednesday", "Wednesd…
## $ Date      <chr> "06/24/2015", "06/24/2015", "06/24/2015", "06/24/2015", "06/…
## $ Time      <time> 00:00:00, 00:15:00, 00:30:00, 00:45:00, 01:00:00, 01:15:00,…
## $ Total     <dbl> 4, 3, 4, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Westbound <dbl> 1, 3, 3, 2, 2, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ Eastbound <dbl> 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
```

What does a row represent here?
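---

## Inspect the Data

One way to explore what a row represents is to count the rows that each date contributes. The sketch below is only an illustration: `toy_counter` is a hypothetical stand-in for `bike_counter`, its first three `Total` values are copied from the output above, and its last row is invented.

```r
library(tidyverse)

# Toy stand-in for bike_counter (hypothetical; the last row is made up)
toy_counter <- tibble(
  Date  = c("06/24/2015", "06/24/2015", "06/24/2015", "06/25/2015"),
  Time  = c("00:00:00", "00:15:00", "00:30:00", "00:00:00"),
  Total = c(4, 3, 4, 1)
)

# Count how many rows (time intervals) each date contributes
rows_per_day <- count(toy_counter, Date)
rows_per_day
```

Swapping `toy_counter` for `bike_counter` would show how many rows each date has in the real data, which is a useful clue about what one row measures.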
---

## Inspect the Data

```r
# Look at first few rows
head(bike_counter)
```

```
## # A tibble: 6 × 7
##   DateTime               Day       Date       Time   Total Westbound Eastbound
##   <chr>                  <chr>     <chr>      <time> <dbl>     <dbl>     <dbl>
## 1 06/24/2015 12:00:00 AM Wednesday 06/24/2015 00:00      4         1         3
## 2 06/24/2015 12:15:00 AM Wednesday 06/24/2015 00:15      3         3         0
## 3 06/24/2015 12:30:00 AM Wednesday 06/24/2015 00:30      4         3         1
## 4 06/24/2015 12:45:00 AM Wednesday 06/24/2015 00:45      2         2         0
## 5 06/24/2015 01:00:00 AM Wednesday 06/24/2015 01:00      2         2         0
## 6 06/24/2015 01:15:00 AM Wednesday 06/24/2015 01:15      0         0         0
```

---

## Inspect the Data

```r
# Determine type
# To access one variable: dataset$variable
class(bike_counter$Day)
```

```
## [1] "character"
```

```r
class(bike_counter$Total)
```

```
## [1] "numeric"
```

```r
class(bike_counter)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

---

## Data Wrangling

**We haven't learned this topic yet.**

**I only included this code for completeness/transparency.**

```r
# Fix Date column to be stored with the date class
library(lubridate)
bike_counter <- mutate(bike_counter, Date = mdy(Date))

# Filter to only include two days in July 2019
july_2019 <- filter(bike_counter, Date %in% c(mdy("07/04/2019"), mdy("07/11/2019")))

# Add an Occasion column
july_2019 <- mutate(july_2019, Occasion = if_else(Date == mdy("07/04/2019"), "Fourth of July", "Normal Thursday"))
```

---

# Grammar of Graphics

* **data**: Data frame that contains the raw data
    + Variables
* **geom**: Geometric **shape** that the data are mapped to.
    + Point, line, bar, text, ...
* **aesthetic**: Visual properties of the **geom**
    + X (horizontal) position, y (vertical) position, color, fill, shape
* **scale**: Controls how data are mapped to the visual values of the aesthetic.
    + EX: particular colors
* **guide**: Legend/key to help user convert visual display back to the data

---

# `ggplot2` example code

**Guiding Principle**: We will map variables from the **data** to the **aes**thetic attributes (e.g.
location, size, shape, color) of **geom**etric objects (e.g. points, lines, bars).

```r
ggplot(data = ---, mapping = aes(---)) +
  geom_---(---)
```

* There are other layers, such as `scale_---_---()` and `labs()`, but we will wait on those.

---

# Histograms

.left-column[

* Binned counts of data.
* Great for assessing shape.

]

.right-column[

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-28-1.png" width="576" style="display: block; margin: auto;" />

]

---

# Data Shapes

<img src="stat100_wk03mon_files/figure-html/unnamed-chunk-29-1.png" width="864" style="display: block; margin: auto;" />

---

# Histograms

.pull-left[

```r
# Create histogram
ggplot(data = july_2019, mapping = aes(x = Total)) +
  geom_histogram()
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/hist-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Histograms

.pull-left[

```r
# Create histogram
ggplot(data = july_2019, mapping = aes(x = Total)) +
  geom_histogram(color = "white", fill = "violetred1", bins = 50)
```

#### For aesthetics:

* **mapping** to a variable goes in `aes()`
* **setting** to a specific value goes in the `geom_---()`

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/hist2-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplots

.pull-left[

* **Five number summary**:
    + Minimum
    + First quartile (Q1)
    + Median
    + Third quartile (Q3)
    + Maximum
* Interquartile range (IQR) `\(=\)` Q3 `\(-\)` Q1
* Outliers: **unusual** points
    + Boxplot defines unusual as being more than `\(1.5*IQR\)` below `\(Q1\)` or above `\(Q3\)`.
* Whiskers: reach out to the furthest point that is NOT an outlier

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/box-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplots

.pull-left[

```r
ggplot(data = july_2019, mapping = aes(x = Occasion, y = Total)) +
  geom_boxplot()
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/box-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplots

.pull-left[

```r
ggplot(data = july_2019, mapping = aes(x = Occasion, y = Total)) +
  geom_boxplot(fill = "springgreen1")
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/box2-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplots

.pull-left[

```r
ggplot(data = july_2019, mapping = aes(x = Occasion, y = Total, fill = Occasion)) +
  geom_boxplot()
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/box3-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplots

.pull-left[

```r
ggplot(data = july_2019, mapping = aes(x = Occasion, y = Total, fill = Occasion)) +
  geom_boxplot() +
  guides(fill = "none")
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/box4-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Violin Plots

.pull-left[

```r
ggplot(data = july_2019, mapping = aes(x = Occasion, y = Total, fill = Occasion)) +
  geom_violin() +
  guides(fill = "none")
```

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/vio-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Boxplot Versus Violin Plots

.pull-left[

<img src="stat100_wk03mon_files/figure-html/box4-1.png" width="768" style="display: block; margin: auto;" />

]

.pull-right[

<img src="stat100_wk03mon_files/figure-html/vio-1.png" width="768" style="display: block; margin: auto;" />

]

---

# Recap: `ggplot2`

```r
library(tidyverse)
ggplot(data = ---, mapping = aes(---)) +
  geom_---(---)
```
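---

# Recap: `ggplot2`

As a hedged illustration of the template, here is one way the blanks might be filled in. The sketch assumes a small made-up data frame, `toy_counts` (a hypothetical name with invented values, not the real bike counter data), so that it runs on its own.

```r
library(tidyverse)

# Toy data with made-up counts (hypothetical; not the real bike counter values)
toy_counts <- tibble(
  Occasion = rep(c("Fourth of July", "Normal Thursday"), each = 5),
  Total    = c(2, 5, 8, 11, 14, 10, 20, 30, 40, 50)
)

# data: toy_counts
# aesthetics: Occasion mapped to x and fill, Total mapped to y
# geom: boxplot
p <- ggplot(data = toy_counts,
            mapping = aes(x = Occasion, y = Total, fill = Occasion)) +
  geom_boxplot() +
  guides(fill = "none")  # drop the legend, which repeats the x-axis labels
p
```

Swapping `geom_boxplot()` for `geom_violin()` gives the violin version of the same comparison; the data and the aesthetic mapping stay the same.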