background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center,

.pull-right[

<br>

## .base_color[Data of Different Types:]

## .base_color[Dates, Factors, Strings]

<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 6 | Spring 2023]

]

---

## Announcements

* Please fill out [this form](https://forms.gle/w7Z4izaxburr9jur6) to help us create groups for Project 1.
* Remember that P-Set 3 is due on Wed at 5pm. Make sure to come by office hours to get your questions answered!

************************

## Week's Goals

.pull-left[

**Mon Lecture**

* Finish up maps -- interactive maps.
* More data types
    + Dates with `lubridate`
    + Factors with `forcats`
    + Strings with `stringr`

]

.pull-right[

**Wed Lecture**

* More wrangling of strings
* Text analysis with `tidytext`

]

---

### Why do we need to talk about dates and times?

**Question:** When did the crashes happen?

.pull-left[

```r
library(tidyverse)
crashes <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cambridge_cyclist_ped_crash.csv")

crashes %>%
  count(crash_date) %>%
  ggplot(mapping = aes(x = crash_date, y = n)) +
  geom_point()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/crash1-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Dates

```r
class(crashes$crash_date)
```

```
## [1] "character"
```

--

What class should it be?

---

### Converting Strings to Dates

* Identify the order of year, month, day, hour, minute, second
* Pick the `lubridate` function that replicates that order.

```r
sample(crashes$crash_date, size = 10)
```

```
## [1] "09/04/2020" "07/02/2018" "10/19/2018" "06/13/2017" "07/09/2018"
## [6] "12/11/2018" "10/22/2021" "06/21/2022" "11/11/2021" "03/28/2019"
```

```r
library(lubridate)
crashes <- crashes %>%
  mutate(crash_date = mdy(crash_date))
class(crashes$crash_date)
```

```
## [1] "Date"
```

---

### Why do we need to talk about dates and times?
**Question:** When did the crashes happen?

.pull-left[

```r
crashes %>%
  count(crash_date) %>%
  ggplot(mapping = aes(x = crash_date, y = n)) +
  geom_point()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/crash2-1.png" width="768" style="display: block; margin: auto;" />

]

* Hard to see daily patterns. Switch time interval?

---

### Let's Look at [Portland's Biketown Data](https://www.biketownpdx.com/system-data)

All check-outs for July - August of 2017

```r
biketown <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/biketown_2017_07_09.csv") %>%
  filter(Distance_Miles < 1000)

biketown_dt <- biketown %>%
  select(StartDate, StartTime, EndDate, EndTime, Distance_Miles, BikeID)

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 6
## $ StartDate      <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
```

---

### Let's Look at [Portland's Biketown Data](https://www.biketownpdx.com/system-data)

* Fix the class of the date columns.
* Create date-time columns.

```r
library(lubridate)
biketown_dt <- biketown_dt %>%
  mutate(StartDate = mdy(StartDate),
         EndDate = mdy(EndDate)) %>%
  mutate(StartDateTime = ymd_hms(paste(StartDate, StartTime, sep = " ")),
         EndDateTime = ymd_hms(paste(EndDate, EndTime, sep = " ")))

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 8
## $ StartDate      <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
## $ StartDateTime  <dttm> 2017-07-01 00:00:00, 2017-07-01 00:00:00, 2017-07-01 0…
## $ EndDateTime    <dttm> 2017-07-01 00:06:00, 2017-07-01 00:16:00, 2017-07-01 0…
```

---

### Grabbing Components

```r
biketown_dt$StartDateTime[40008]
```

```
## [1] "2017-07-23 13:44:00 UTC"
```

```r
year(biketown_dt$StartDateTime[40008])
```

```
## [1] 2017
```

```r
month(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
day(biketown_dt$StartDateTime[40008])
```

```
## [1] 23
```

---

### Grabbing Components

```r
week(biketown_dt$StartDateTime[40008])
```

```
## [1] 30
```

```r
wday(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

```r
hour(biketown_dt$StartDateTime[40008])
```

```
## [1] 13
```

```r
minute(biketown_dt$StartDateTime[40008])
```

```
## [1] 44
```

---

### Grabbing Components

.pull-left[

```r
ggplot(data = biketown_dt,
       mapping = aes(month(StartDateTime, label = TRUE))) +
  geom_bar()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/crash3-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Grabbing Components

.pull-left[

```r
ggplot(data = biketown_dt,
       mapping = aes(wday(StartDateTime, label = TRUE))) +
  geom_bar()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/crash4-1.png" width="768" style="display: block; margin: auto;" />

]

---

### And if you are in R and want to know the current date/time:

```r
today()
```

```
## [1] "2023-02-25"
```

```r
now()
```

```
## [1] "2023-02-25 13:38:43 UTC"
```

---

class: middle, center

## Topic Shift!

### Factors with `forcats`

<img src="img/forcats.png" width="20%" style="display: block; margin: auto;" />

---

### Motivation: Imposing Structure on Categorical Variables

```r
library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()

five_most_common <- c("Douglas-Fir", "Norway Maple", "Western Redcedar",
                      "Northern Red Oak", "Pin Oak")

pdxCommon <- pdxTrees %>%
  filter(Common_Name %in% five_most_common)
```

---

### Motivation: Imposing Structure on Categorical Variables

.pull-left[

How might we want to restructure this graph?

```r
ggplot(data = pdxCommon,
       mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees1-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Levels and Class

* Why does `Common_Name` have no levels?

```r
levels(pdxCommon$Common_Name)
```

```
## NULL
```

--

```r
class(pdxCommon$Common_Name)
```

```
## [1] "character"
```

```r
pdxCommon <- mutate(pdxCommon, Common_Name = factor(Common_Name))
levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"
## [5] "Western Redcedar"
```

```r
class(pdxCommon$Common_Name)
```

```
## [1] "factor"
```

* How is `R` deciding the order of the levels?

---

### What Are the levels/categories?

```r
fct_unique(pdxCommon$Common_Name)
```

```
## [1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak
## [5] Western Redcedar
## 5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
```

```r
unique(pdxCommon$Common_Name)
```

```
## [1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak
## [5] Western Redcedar
## 5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
```

---

### Reorder the Levels

.pull-left[

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees2-1.png" width="768" style="display: block; margin: auto;" />

]

+ Note: This code didn't permanently change the order in `pdxCommon`.
+ Why?

---

### Reorder the Levels

.pull-left[

How might we want to restructure this graph?

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name),
         Common_Name = fct_rev(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees3-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Or, If You Love the Pipe...

.pull-left[

How might we want to restructure this graph?

```r
pdxCommon %>%
  mutate(Common_Name = fct_infreq(Common_Name) %>%
           fct_rev()) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees4-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Reorder the Levels

.pull-left[

* Can also relevel manually

```r
pdxCommon %>%
  mutate(Common_Name = fct_relevel(Common_Name, five_most_common)) %>%
  ggplot(mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees5-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Reorder the Levels

.pull-left[

* Or maybe I just want to bring one or two categories to the front

```r
pdxCommon %>%
  mutate(Common_Name = fct_relevel(Common_Name, "Norway Maple", "Pin Oak")) %>%
  ggplot(mapping = aes(x = Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees6-1.png" width="768" style="display: block; margin: auto;" />

]

---

### What Have We Wrangled Here?

```r
DBH_by_name <- pdxCommon %>%
  group_by(Common_Name) %>%
  summarize(mean_DBH = mean(DBH),
            lb_DBH = mean_DBH - 2 * sd(DBH) / sqrt(n()),
            ub_DBH = mean_DBH + 2 * sd(DBH) / sqrt(n()))
DBH_by_name
```

```
## # A tibble: 5 × 4
##   Common_Name      mean_DBH lb_DBH ub_DBH
##   <fct>               <dbl>  <dbl>  <dbl>
## 1 Douglas-Fir          29.6   29.3   29.8
## 2 Northern Red Oak     29.4   28.3   30.5
## 3 Norway Maple         20.3   19.9   20.8
## 4 Pin Oak              25.6   24.8   26.4
## 5 Western Redcedar     18.1   17.3   18.9
```

---

### Reordering by Another Variable

.pull-left[

* How might we want to reorder `Common_Name`?

```r
ggplot(data = DBH_by_name,
       mapping = aes(y = mean_DBH, x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping = aes(ymin = lb_DBH, ymax = ub_DBH),
                width = 0.4)
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees7-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Reordering by Another Variable

.pull-left[

```r
DBH_by_name %>%
  mutate(Common_Name = fct_reorder(Common_Name, -mean_DBH)) %>%
  ggplot(mapping = aes(y = mean_DBH, x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping = aes(ymin = lb_DBH, ymax = ub_DBH),
                width = 0.4)
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees8-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Reordering by Other Variables

.pull-left[

* How might we want to reorder `Condition`?

```r
ggplot(data = pdxCommon,
       mapping = aes(x = DBH, y = Total_Annual_Services, color = Condition)) +
  geom_smooth()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees9-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Reordering by Other Variables

.pull-left[

```r
mutate(pdxCommon,
       Condition = fct_reorder2(Condition, DBH, Total_Annual_Services)) %>%
  ggplot(mapping = aes(x = DBH, y = Total_Annual_Services, color = Condition)) +
  geom_smooth()
```

]

.pull-right[

<img src="stat108_wk06mon_files/figure-html/trees10-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Factors

Other useful functions in `forcats`:

* `fct_collapse()`: Collapse some levels together
* `fct_drop()`: Remove levels (useful after a `filter()`!)
* `fct_recode()`: Change names of levels

---

class: middle, center

## And now:

## Strings with `stringr`!

<img src="img/stringr.png" width="20%" style="display: block; margin: auto;" />

---

### Language

**String**

--

```r
x <- "cat"
```

**Character vector**

--

```r
x <- c("dog", "cat", "mouse")
```

--

**Factor vector**

--

```r
x <- factor(x)
levels(x)
```

```
## [1] "cat"   "dog"   "mouse"
```

---

### String Manipulation with [Stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)

* Learn how to handle character vectors!
    + Character manipulation
    + Pattern matching
* Let's look at some of the functionalities of `stringr` using a character vector of song lyrics.

```r
library(stringr)
```

---

### Our Toy Lyric

* Song?
* Artist?

```r
lyric <- c("But I would walk 500 miles,",
           "And I would walk 500 more,",
           "Just to be the man who walks a 1000 miles,",
           "To fall down at your door")
lyric
```

```
## [1] "But I would walk 500 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

<img src="img/The_Proclaimers_500_Miles.jpg" width="133" style="display: block; margin: auto;" />

---

### String Length

```r
length(lyric)
```

```
## [1] 4
```

--

```r
str_length(lyric)
```

```
## [1] 27 26 42 25
```

---

### Accessing and Replacing

```r
str_sub(string = lyric[1], start = 18, end = 20)
```

```
## [1] "500"
```

--

```r
str_sub(string = lyric[1], start = 18, end = 20) <- "2"
```

--

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

### Change Cases

```r
str_to_upper(lyric)
```

```
## [1] "BUT I WOULD WALK 2 MILES,"
## [2] "AND I WOULD WALK 500 MORE,"
## [3] "JUST TO BE THE MAN WHO WALKS A 1000 MILES,"
## [4] "TO FALL DOWN AT YOUR DOOR"
```

```r
str_to_title(lyric)
```

```
## [1] "But I Would Walk 2 Miles,"
## [2] "And I Would Walk 500 More,"
## [3] "Just To Be The Man Who Walks A 1000 Miles,"
## [4] "To Fall Down At Your Door"
```

```r
str_to_lower(lyric)
```

```
## [1] "but i would walk 2 miles,"
## [2] "and i would walk 500 more,"
## [3] "just to be the man who walks a 1000 miles,"
## [4] "to fall down at your door"
```

---

### Sorting

```r
str_sort(lyric)
```

```
## [1] "And I would walk 500 more,"
## [2] "But I would walk 2 miles,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

### Pattern Matching

* Learn to:
    + Detect pattern
    + Extract pattern
    + Replace pattern
    + Split pattern

---

### Common Goal: Match a particular pattern

* I want to match the pattern `500` from `lyric`.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

```r
str_view_all(string = lyric, pattern = "500")
```

```
## [1] │ But I would walk 2 miles,
## [2] │ And I would walk <500> more,
## [3] │ Just to be the man who walks a 1000 miles,
## [4] │ To fall down at your door
```

---

### Let's make it more general.

* I want to locate all the numbers.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"
## [2] "And I would walk 500 more,"
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

--

```r
str_view_all(lyric, "500|1000|2")
```

```
## [1] │ But I would walk <2> miles,
## [2] │ And I would walk <500> more,
## [3] │ Just to be the man who walks a <1000> miles,
## [4] │ To fall down at your door
```

---

### Trivia Time!

Name the artist and song title for each of the following!

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics
```

```
## [1] "But I would walk 500 miles"
## [2] "2000 0 0 party over oops out of time!"
## [3] "1 is the loneliest number that you'll ever do"
## [4] "When I'm 64"
## [5] "Where 2 and 2 always makes a 5"
## [6] "1, 2, 3, 4: Tell me that you love me more"
```

```r
str_view_all(lyrics, "500|1000|2")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2>000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where <2> and <2> always makes a 5
## [6] │ 1, <2>, 3, 4: Tell me that you love me more
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics
```

```
## [1] "But I would walk 500 miles"
## [2] "2000 0 0 party over oops out of time!"
## [3] "1 is the loneliest number that you'll ever do"
## [4] "When I'm 64"
## [5] "Where 2 and 2 always makes a 5"
## [6] "1, 2, 3, 4: Tell me that you love me more"
```

```r
str_view_all(lyrics, "500|1000|0|2000|1|64|2|5|3|4")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2000> <0> <0> party over oops out of time!
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <64>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```

---

### Need for More Sophisticated Pattern Matching

But now imagine you had a very long vector and wanted to locate every number in it. Would you list each one by hand?

```r
str_view_all(lyrics, "1|2|3|4...")
```

* Not a good approach!
* Next time: **Regular Expressions**!
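
---

### Preview: Matching Any Number

As a hedged preview of where this is heading, here is a minimal sketch that finds every number in the trivia lyrics without listing each one by hand. It assumes the regular expression `\\d+` (one or more digits in a row) and uses the `stringr` functions `str_detect()` and `str_extract_all()`. The object names `number_pattern`, `has_number`, and `numbers` are made up for this illustration and are not part of the lecture code.

```r
library(stringr)

# Same trivia lyrics as on the earlier slide
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")

# Assumed pattern: "\\d+" matches one or more consecutive digits
number_pattern <- "\\d+"

# Does each lyric contain at least one number? (logical vector)
has_number <- str_detect(lyrics, number_pattern)

# Extract every number from each lyric (a list, one element per lyric)
numbers <- str_extract_all(lyrics, number_pattern)
numbers[[2]]
```

* One short pattern covers all of the lyrics, and would keep working if more lyrics were added to the vector.
* Note that the extracted numbers are still character strings, not numeric values.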