More Data Types

background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[

<br>

## .base_color[Data of Different Types:]

## .base_color[Dates, Factors, Strings]
 
<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 6 | Spring 2023]

]

---

## Announcements

* Please fill out [this form](https://forms.gle/w7Z4izaxburr9jur6) to help us create groups for Project 1.
* Remember that P-Set 3 is due on Wed at 5pm.  Make sure to come by office hours to get your questions answered!

************************

## This Week's Goals

.pull-left[

**Mon Lecture**

* Finish up maps -- interactive maps.

* More data types
    + Dates with `lubridate`
    + Factors with `forcats`
    + Strings with `stringr`

]

.pull-right[

**Wed Lecture**

* More wrangling of strings
* Text analysis with `tidytext`

]

---

### Why do we need to talk about dates and times?

**Question:** When did the crashes happen?

.pull-left[

```r
library(tidyverse)
crashes <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cambridge_cyclist_ped_crash.csv")

crashes %>%
  count(crash_date) %>%
  ggplot(mapping = 
           aes(x = crash_date,
               y = n)) +
  geom_point()
```

]

.pull-right[

]

---

### Dates

```r
class(crashes$crash_date)
```

```
## [1] "character"
```

What class should it be?

---

### Converting Strings to Dates

* Identify the order of year, month, day, hour, minute, second

* Pick the `lubridate` function that replicates that order.

```r
sample(crashes$crash_date, size = 10)
```

```
##  [1] "09/04/2020" "07/02/2018" "10/19/2018" "06/13/2017" "07/09/2018"
##  [6] "12/11/2018" "10/22/2021" "06/21/2022" "11/11/2021" "03/28/2019"
```

```r
library(lubridate)

crashes <- crashes %>%
  mutate(crash_date = mdy(crash_date))

class(crashes$crash_date)
```

```
## [1] "Date"
```
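
As a rough sketch of the matching step: the same calendar day written in a different order needs a different parser. The strings below are made up for illustration and are not taken from the crash data.

```r
library(lubridate)

# Hypothetical strings: the same day written in three different orders
from_mdy <- mdy("09/04/2020")   # month, day, year
from_dmy <- dmy("04/09/2020")   # day, month, year
from_ymd <- ymd("2020-09-04")   # year, month, day

# All three parse to the same Date
from_mdy == from_dmy
from_mdy == from_ymd

# Add _h, _hm, or _hms when the string also carries a time of day
with_time <- ymd_hm("2020-09-04 13:45")
with_time
```

Picking the wrong function either fails to parse or silently swaps month and day, so it is worth checking a sample of the raw strings first.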

---

### Why do we need to talk about dates and times?

**Question:** When did the crashes happen?

.pull-left[

```r
crashes %>%
  count(crash_date) %>%
  ggplot(mapping = 
           aes(x = crash_date,
               y = n)) +
  geom_point()
```

]

.pull-right[

]

* Hard to see daily patterns.  Switch time interval?
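
One possible way to switch the interval, sketched on a made-up handful of dates rather than the real `crashes` data: round each date down to the first of its month with `floor_date()` and count by month instead of by day. The names `toy_crashes` and `crash_month` are illustrative.

```r
library(dplyr)
library(lubridate)

# Hypothetical crash dates (not the Cambridge data)
toy_crashes <- tibble(
  crash_date = mdy(c("09/04/2020", "09/17/2020",
                     "10/19/2020", "10/22/2020", "10/30/2020"))
)

# Round every date down to the first day of its month, then count
monthly <- toy_crashes %>%
  mutate(crash_month = floor_date(crash_date, unit = "month")) %>%
  count(crash_month)

monthly
```

The resulting `crash_month` column is still a `Date`, so it can go straight onto the x-axis of the plot above.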

---

### Let's Look at [Portland's Biketown Data](https://www.biketownpdx.com/system-data)

All check-outs for July - August of 2017

```r
biketown <- 
  read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/data/biketown_2017_07_09.csv") %>%
  filter(Distance_Miles < 1000)

biketown_dt <- biketown %>%
  select(StartDate, StartTime, EndDate, EndTime, Distance_Miles,
         BikeID)

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 6
## $ StartDate      <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <chr> "7/1/2017", "7/1/2017", "7/1/2017", "7/1/2017", "7/1/20…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
```

---

### Let's Look at [Portland's Biketown Data](https://www.biketownpdx.com/system-data)

* Fix the class of the date columns.
* Create date-time columns.

```r
library(lubridate)
biketown_dt <- biketown_dt %>%
  mutate(StartDate = mdy(StartDate),
         EndDate = mdy(EndDate)) %>%
  mutate(StartDateTime = ymd_hms(paste(StartDate, StartTime, sep = " ")),
         EndDateTime = ymd_hms(paste(EndDate, EndTime, sep = " ")))

glimpse(biketown_dt)
```

```
## Rows: 134,838
## Columns: 8
## $ StartDate      <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ StartTime      <time> 00:00:00, 00:00:00, 00:00:00, 00:01:00, 00:03:00, 00:0…
## $ EndDate        <date> 2017-07-01, 2017-07-01, 2017-07-01, 2017-07-01, 2017-0…
## $ EndTime        <time> 00:06:00, 00:16:00, 00:02:00, 00:33:00, 00:06:00, 00:0…
## $ Distance_Miles <dbl> 0.55, 2.03, 0.17, 2.75, 0.40, 0.40, 5.08, 0.95, 2.39, 2…
## $ BikeID         <dbl> 7375, 6191, 6321, 6434, 6850, 6420, 6593, 6160, 7380, 6…
## $ StartDateTime  <dttm> 2017-07-01 00:00:00, 2017-07-01 00:00:00, 2017-07-01 0…
## $ EndDateTime    <dttm> 2017-07-01 00:06:00, 2017-07-01 00:16:00, 2017-07-01 0…
```
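
One reason to build full date-time columns, sketched with a made-up ride: a trip that crosses midnight cannot be timed from the clock times alone, but subtracting date-times handles the change of day. The variable names here are illustrative.

```r
library(lubridate)

# Hypothetical ride that starts before midnight and ends after it
start_dt <- ymd_hms(paste("2017-07-01", "23:50:00", sep = " "))
end_dt   <- ymd_hms(paste("2017-07-02", "00:10:00", sep = " "))

# Difference between the two date-times, expressed in minutes
ride_minutes <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
ride_minutes
```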

---

### Grabbing Components

```r
biketown_dt$StartDateTime[40008]
```

```
## [1] "2017-07-23 13:44:00 UTC"
```

```r
year(biketown_dt$StartDateTime[40008])
```

```
## [1] 2017
```

```r
month(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
```

```r
day(biketown_dt$StartDateTime[40008])
```

```
## [1] 23
```

---

### Grabbing Components

```r
week(biketown_dt$StartDateTime[40008])
```

```
## [1] 30
```

```r
wday(biketown_dt$StartDateTime[40008], label = TRUE)
```

```
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
```

```r
hour(biketown_dt$StartDateTime[40008])
```

```
## [1] 13
```

```r
minute(biketown_dt$StartDateTime[40008])
```

```
## [1] 44
```

---

### Grabbing Components

.pull-left[

```r
ggplot(data = biketown_dt, 
       mapping = 
         aes(month(StartDateTime,
                   label = TRUE))) +
  geom_bar()
```

]

.pull-right[

]

---

### Grabbing Components

.pull-left[

```r
ggplot(data = biketown_dt, 
       mapping = aes(wday(StartDateTime,
                          label = TRUE))) +
  geom_bar()
```

]

.pull-right[

]

---

### And if you are in R and want to know the current date/time:

```r
today()
```

```
## [1] "2023-02-25"
```

```r
now()
```

```
## [1] "2023-02-25 13:38:43 UTC"
```

---

class: middle, center

##  Topic Shift!

### Factors with `forcats`

---

### Motivation: Imposing Structure on Categorical Variables

```r
library(pdxTrees)
pdxTrees <- get_pdxTrees_parks()

five_most_common <- c("Douglas-Fir", "Norway Maple",
                      "Western Redcedar", "Northern Red Oak",
                      "Pin Oak")

pdxCommon <- pdxTrees %>%
  filter(Common_Name %in% five_most_common)
```

---

### Motivation: Imposing Structure on Categorical Variables

.pull-left[

How might we want to restructure this graph?

```r
ggplot(data = pdxCommon,
       mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

---

### Levels and Class

* Why does `Common_Name` have no levels?

```r
levels(pdxCommon$Common_Name)
```

```
## NULL
```

```r
class(pdxCommon$Common_Name)
```

```
## [1] "character"
```

```r
pdxCommon <- mutate(pdxCommon, Common_Name = factor(Common_Name))

levels(pdxCommon$Common_Name)
```

```
## [1] "Douglas-Fir"      "Northern Red Oak" "Norway Maple"     "Pin Oak"         
## [5] "Western Redcedar"
```

```r
class(pdxCommon$Common_Name)
```

```
## [1] "factor"
```

* How is `R` deciding the order of the levels?
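
A small sketch of the default behavior, using a made-up vector: when `factor()` is not given `levels`, it uses the sorted unique values, which for character data means alphabetical order. Supplying `levels` overrides that.

```r
# Hypothetical character vector, deliberately not in alphabetical order
trees <- c("Pin Oak", "Douglas-Fir", "Norway Maple", "Douglas-Fir")

# Default: levels are the sorted unique values
default_levels <- levels(factor(trees))
default_levels

# Explicit: levels follow the order supplied in `levels`
custom_levels <- levels(
  factor(trees, levels = c("Pin Oak", "Norway Maple", "Douglas-Fir"))
)
custom_levels
```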

---

### What Are the Levels/Categories?

```r
fct_unique(pdxCommon$Common_Name)
```

```
## [1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak         
## [5] Western Redcedar
## 5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
```

```r
unique(pdxCommon$Common_Name)
```

```
## [1] Douglas-Fir      Northern Red Oak Norway Maple     Pin Oak         
## [5] Western Redcedar
## 5 Levels: Douglas-Fir Northern Red Oak Norway Maple ... Western Redcedar
```

---

### Reorder the Levels

.pull-left[

```r
pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

+ Note: This code didn't permanently change the order in `pdxCommon`.
    + Why?
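
A minimal sketch of the reason, on a made-up tibble: `mutate()` returns a modified copy, and unless that copy is assigned back, the original object keeps its old level order. `toy_trees` and `levels_before` are illustrative names.

```r
library(dplyr)
library(forcats)

# Hypothetical data: "Pin Oak" is most frequent but second alphabetically
toy_trees <- tibble(name = factor(c("Pin Oak", "Pin Oak", "Douglas-Fir")))

# The reordered copy is piped onward and then discarded
toy_trees %>%
  mutate(name = fct_infreq(name)) %>%
  pull(name) %>%
  levels()

# The stored object is unchanged: levels are still alphabetical
levels_before <- levels(toy_trees$name)
levels_before

# Assigning the result back is what makes the change stick
toy_trees <- toy_trees %>%
  mutate(name = fct_infreq(name))

levels(toy_trees$name)
```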

---

### Reorder the Levels

.pull-left[

How might we want to restructure this graph?

```r
pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name),
         Common_Name = 
           fct_rev(Common_Name)) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

---

### Or, If You Love the Pipe...

.pull-left[

How might we want to restructure this graph?

```r
pdxCommon %>%
  mutate(Common_Name = 
           fct_infreq(Common_Name) %>%
           fct_rev()) %>%
  ggplot(mapping = aes(Common_Name)) +
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

---

### Reorder the Levels

.pull-left[

* Can also relevel manually

```r
pdxCommon %>%
  mutate(Common_Name = 
           fct_relevel(Common_Name, 
                       five_most_common)) %>%
  ggplot(mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

---

### Reorder the Levels

.pull-left[

* Or maybe I just want to bring one or two categories to the front

```r
pdxCommon %>%
  mutate(Common_Name = 
           fct_relevel(Common_Name,
                       "Norway Maple",
                       "Pin Oak")) %>%
  ggplot(mapping = aes(x = Common_Name)) + 
  geom_bar() +
  coord_flip()
```

]

.pull-right[

]

---

### What Have We Wrangled Here?

```r
DBH_by_name <- pdxCommon %>%
  group_by(Common_Name) %>%
  summarize(mean_DBH = mean(DBH),
            lb_DBH = mean_DBH - 2 * sd(DBH) / sqrt(n()),
            ub_DBH = mean_DBH + 2 * sd(DBH) / sqrt(n()))
DBH_by_name
```

```
## # A tibble: 5 × 4
##   Common_Name      mean_DBH lb_DBH ub_DBH
##   <fct>               <dbl>  <dbl>  <dbl>
## 1 Douglas-Fir          29.6   29.3   29.8
## 2 Northern Red Oak     29.4   28.3   30.5
## 3 Norway Maple         20.3   19.9   20.8
## 4 Pin Oak              25.6   24.8   26.4
## 5 Western Redcedar     18.1   17.3   18.9
```
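
The bounds above appear to follow the usual "mean plus or minus two standard errors" recipe, where the standard error is `sd / sqrt(n)`. A worked sketch on three made-up diameters (not values from `pdxCommon`):

```r
# Hypothetical diameters for a single species
dbh <- c(10, 20, 30)

mean_dbh <- mean(dbh)                      # 20
se_dbh   <- sd(dbh) / sqrt(length(dbh))    # standard error of the mean

# Lower and upper bounds: two standard errors on either side of the mean
lb_dbh <- mean_dbh - 2 * se_dbh
ub_dbh <- mean_dbh + 2 * se_dbh

c(lower = lb_dbh, mean = mean_dbh, upper = ub_dbh)
```

Species with more trees get a smaller standard error, which would explain why some intervals in the table are much narrower than others.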

---

### Reordering by Another Variable

.pull-left[

* How might we want to reorder `Common_Name`?

```r
ggplot(data = DBH_by_name, 
      mapping = aes(y = mean_DBH,
                    x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping =
                  aes(ymin = lb_DBH,
                      ymax = ub_DBH),
                width = 0.4)
```

]

.pull-right[

]

---

### Reordering by Another Variable

.pull-left[

```r
DBH_by_name %>%
  mutate(Common_Name =
           fct_reorder(Common_Name,
                       -mean_DBH)) %>%
  ggplot(mapping = aes(y = mean_DBH,
                       x = Common_Name)) +
  geom_point() +
  geom_errorbar(mapping =
                  aes(ymin = lb_DBH,
                      ymax = ub_DBH),
                width = 0.4)
```

]

.pull-right[

]
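
Negating the variable is one way to get largest-first ordering; `fct_reorder()` also has a `.desc` argument that should give the same result. A sketch on made-up values (the names `species` and `mean_dbh` are illustrative):

```r
library(forcats)

# Hypothetical species with made-up mean diameters
species  <- factor(c("a", "b", "c"))
mean_dbh <- c(2, 3, 1)

ascending  <- fct_reorder(species, mean_dbh)                 # smallest mean first
negated    <- fct_reorder(species, -mean_dbh)                # largest mean first
descending <- fct_reorder(species, mean_dbh, .desc = TRUE)   # same idea, no minus sign

levels(ascending)
levels(negated)
levels(descending)
```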

---

### Reordering by Other Variables

.pull-left[

* How might we want to reorder `Condition`?

```r
ggplot(data = pdxCommon,
       mapping = 
         aes(x = DBH,
             y = Total_Annual_Services,
             color = Condition)) +
  geom_smooth()
```

]

.pull-right[

]

---

### Reordering by Other Variables

.pull-left[

```r
mutate(pdxCommon,
       Condition = 
         fct_reorder2(Condition,
                      DBH, 
                      Total_Annual_Services)) %>%
  ggplot(mapping = 
           aes(x = DBH,
               y = Total_Annual_Services,
               color = Condition)) +
  geom_smooth()
```

]

.pull-right[

]
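
As a rough sketch of what `fct_reorder2()` does with its defaults: for each level it finds the `y` value at the largest `x` and orders the levels from largest to smallest, so the legend order lines up with where the lines end on the right of the plot. The toy vectors below are made up for illustration.

```r
library(forcats)

# Hypothetical lines: group "a" ends low, group "b" ends high
group <- factor(c("a", "a", "b", "b"))
x     <- c(1, 2, 1, 2)
y     <- c(4, 2, 1, 5)

# Default level order is alphabetical
levels(group)

# After reordering, the group with the largest y at the largest x comes first
reordered <- fct_reorder2(group, x, y)
levels(reordered)
```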

---

### Factors

Other useful functions in `forcats`:

* `fct_collapse()`: Collapse some levels together
* `fct_drop()`: Remove levels (useful after a `filter()`!)
* `fct_recode()`: Change names of levels
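
A brief sketch of the three functions on a made-up factor. Both `fct_collapse()` and `fct_recode()` take arguments of the form `new name = old name(s)`.

```r
library(forcats)

# Hypothetical factor of tree names
trees <- factor(c("Pin Oak", "Northern Red Oak", "Douglas-Fir"))

# fct_collapse(): merge both oaks into one new level
collapsed <- fct_collapse(trees, Oak = c("Pin Oak", "Northern Red Oak"))
table(collapsed)

# fct_drop(): subsetting keeps unused levels until they are dropped
no_fir <- trees[trees != "Douglas-Fir"]
nlevels(no_fir)             # "Douglas-Fir" is still listed as a level
dropped <- fct_drop(no_fir)
nlevels(dropped)

# fct_recode(): rename a single level
recoded <- fct_recode(trees, "Doug Fir" = "Douglas-Fir")
levels(recoded)
```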

---

class: middle, center

## And now:

##  Strings with `stringr`!

---

### Language

**String**

```r
x <- "cat"
```

**Character vector**

```r
x <- c("dog", "cat", "mouse")
```

**Factor vector**

```r
x <- factor(x)
levels(x)
```

```
## [1] "cat"   "dog"   "mouse"
```

---
  
### String Manipulation with [Stringr](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html)

* Learn how to handle character vectors!
    + Character manipulation
    + Pattern matching

* Let's explore some of what `stringr` can do, using a character vector of song lyrics.

```r
library(stringr)
```

---

### Our Toy Lyric

* Song?
* Artist?

```r
lyric <- c("But I would walk 500 miles,",
              "And I would walk 500 more,",
              "Just to be the man who walks a 1000 miles,",
              "To fall down at your door")
lyric
```

```
## [1] "But I would walk 500 miles,"               
## [2] "And I would walk 500 more,"                
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

### String Length

```r
length(lyric)
```

```
## [1] 4
```

```r
str_length(lyric)
```

```
## [1] 27 26 42 25
```

---

### Accessing and Replacing

```r
str_sub(string = lyric[1], start = 18, end = 20)
```

```
## [1] "500"
```

```r
str_sub(string = lyric[1], start = 18, end = 20) <- "2"
```

```r
lyric
```

```
## [1] "But I would walk 2 miles,"                 
## [2] "And I would walk 500 more,"                
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

### Change Cases

```r
str_to_upper(lyric)
```

```
## [1] "BUT I WOULD WALK 2 MILES,"                 
## [2] "AND I WOULD WALK 500 MORE,"                
## [3] "JUST TO BE THE MAN WHO WALKS A 1000 MILES,"
## [4] "TO FALL DOWN AT YOUR DOOR"
```

```r
str_to_title(lyric)
```

```
## [1] "But I Would Walk 2 Miles,"                 
## [2] "And I Would Walk 500 More,"                
## [3] "Just To Be The Man Who Walks A 1000 Miles,"
## [4] "To Fall Down At Your Door"
```

```r
str_to_lower(lyric)
```

```
## [1] "but i would walk 2 miles,"                 
## [2] "and i would walk 500 more,"                
## [3] "just to be the man who walks a 1000 miles,"
## [4] "to fall down at your door"
```

---

### Sorting

```r
str_sort(lyric)
```

```
## [1] "And I would walk 500 more,"                
## [2] "But I would walk 2 miles,"                 
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

---

### Pattern Matching

* Learn to:
    + Detect pattern
    + Extract pattern
    + Replace pattern
    + Split pattern
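
Each of these tasks has a matching `stringr` function. A minimal sketch with a literal pattern and two lines of the lyric (`two_lines` is an illustrative name):

```r
library(stringr)

two_lines <- c("But I would walk 500 miles,",
               "To fall down at your door")

# Detect: does each string contain the pattern?
detected <- str_detect(two_lines, "500")
detected

# Extract: pull out the first match (NA when there is none)
extracted <- str_extract(two_lines, "500")
extracted

# Replace: swap the first match for something else
replaced <- str_replace(two_lines, "500", "five hundred")
replaced

# Split: break a string apart wherever the pattern occurs
pieces <- str_split(two_lines[2], " ")
pieces
```

`str_split()` returns a list with one element per input string, which is why `pieces[[1]]` is needed to get at the individual words.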

---

### Common Goal: Match a particular pattern

* I want to match the pattern `500` from `lyric`.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"                 
## [2] "And I would walk 500 more,"                
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

```r
str_view_all(string = lyric, pattern = "500")
```

```
## [1] │ But I would walk 2 miles,
## [2] │ And I would walk <500> more,
## [3] │ Just to be the man who walks a 1000 miles,
## [4] │ To fall down at your door
```

---

### Let's make it more general.

* I want to locate all the numbers.

```r
lyric
```

```
## [1] "But I would walk 2 miles,"                 
## [2] "And I would walk 500 more,"                
## [3] "Just to be the man who walks a 1000 miles,"
## [4] "To fall down at your door"
```

```r
str_view_all(lyric, "500|1000|2")
```

```
## [1] │ But I would walk <2> miles,
## [2] │ And I would walk <500> more,
## [3] │ Just to be the man who walks a <1000> miles,
## [4] │ To fall down at your door
```

---

### Trivia Time!

Name the artist and song title for each of the following!

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!", 
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics
```

```
## [1] "But I would walk 500 miles"                   
## [2] "2000 0 0 party over oops out of time!"        
## [3] "1 is the loneliest number that you'll ever do"
## [4] "When I'm 64"                                  
## [5] "Where 2 and 2 always makes a 5"               
## [6] "1, 2, 3, 4: Tell me that you love me more"
```

```r
str_view_all(lyrics, "500|1000|2")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2>000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where <2> and <2> always makes a 5
## [6] │ 1, <2>, 3, 4: Tell me that you love me more
```

---

How should we modify the code to locate all the numbers from these lyrics of various songs?

```r
lyrics
```

```r
str_view_all(lyrics, "500|1000|0|2000|1|64|2|5|3|4")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2000> <0> <0> party over oops out of time!
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <64>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```
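
One detail worth checking when listing alternatives by hand: at each position the regex engine tries the alternatives left to right and takes the first one that fits, so a short alternative placed before a longer one that starts the same way can cut a number short. A minimal sketch with one of the lyrics (variable names are illustrative):

```r
library(stringr)

year_lyric <- "2000 0 0 party over oops out of time!"

# "2" is listed first, so it wins at the start of "2000"
short_first <- str_extract_all(year_lyric, "2|2000")[[1]]
short_first

# "2000" is listed first, so the whole number is matched
long_first <- str_extract_all(year_lyric, "2000|2")[[1]]
long_first
```

This is why the pattern above lists `2000` before `2`, `500` before `5`, and `1000` before `1`.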

---

### Need for More Sophisticated Pattern Matching

But now imagine you have a very long vector and want to locate every number in it.

```r
str_view_all(lyrics, "1|2|3|4...")
```

* Not a good approach!

* Next time: **Regular Expressions**!