background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[
<br>
## .base_color[String Manipulation and]
## .base_color[Text Analysis]
<br>
#### .navy[Kelly McConville]
#### .navy[ Stat 108 | Week 6 | Spring 2023]
]

---

## Announcements

* Remember that P-Set 3 is due today at 10pm.
* Lecture quiz posted after lecture today.

---

## Week's Goals

.pull-left[
**Mon Lecture**

* Finish up maps -- interactive maps.
* More data types
    + Dates with `lubridate`
    + Factors with `forcats`
    + Strings with `stringr`
]

.pull-right[
**Wed Lecture**

* More wrangling of strings
* Text analysis with `tidytext`
]

---

### Recap: How should we modify the code to locate all the numbers in these lyrics from various songs?

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

```r
str_view_all(lyrics, "500|1000|0|2000|1|64|2|5|3|4")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2000> <0> <0> party over oops out of time!
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <64>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```

---

### Recap: But now imagine you had a very long vector and wanted to locate any number.

```r
str_view_all(lyrics, "1|2|3|4...")
```

Not a good approach!

---

## Regular Expressions

* A concise language for describing patterns in strings.
    + But not super easy to read.
    + Good to have cheatsheets and the internet for help!
* Neat RStudio Addin to help: [`RegExplain`](https://www.garrickadenbuie.com/project/regexplain/) --- ## Regular Expressions * `[:digit:]` is a particular [Character Class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html) * Character classes are a way of specifying that you want to match one of the following characters. ```r str_view_all(lyrics, "[:digit:]") ``` ``` ## [1] │ But I would walk <5><0><0> miles ## [2] │ <2><0><0><0> <0> <0> party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm <6><4> ## [5] │ Where <2> and <2> always makes a <5> ## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more ``` --- ## Regular Expressions * `+` is a quantifier * `+`: One or more ```r str_view_all(lyrics, "[:digit:]+") ``` ``` ## [1] │ But I would walk <500> miles ## [2] │ <2000> <0> <0> party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where <2> and <2> always makes a <5> ## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more ``` --- ## Regular Expressions What does `{n}` do? ```r str_view_all(lyrics, "[:digit:]{2}") ``` ``` ## [1] │ But I would walk <50>0 miles ## [2] │ <20><00> 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Quantifiers * `?`: 0 or 1 * `*`: 0 or more * `+`: 1 or more * `{n}`: Exactly n * `{n,}`: n or more * `{,m}`: at most m * `{n,m}`: between n and m --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:alpha:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> 500 <m><i><l><e><s> ## [2] │ 2000 0 0 <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e>! 
## [3] │ 1 <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u>'<l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I>'<m> 64 ## [5] │ <W><h><e><r><e> 2 <a><n><d> 2 <a><l><w><a><y><s> <m><a><k><e><s> <a> 5 ## [6] │ 1, 2, 3, 4: <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:upper:]") ``` ``` ## [1] │ <B>ut <I> would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ <W>hen <I>'m 64 ## [5] │ <W>here 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: <T>ell me that you love me more ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:alnum:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> <5><0><0> <m><i><l><e><s> ## [2] │ <2><0><0><0> <0> <0> <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e>! ## [3] │ <1> <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u>'<l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I>'<m> <6><4> ## [5] │ <W><h><e><r><e> <2> <a><n><d> <2> <a><l><w><a><y><s> <m><a><k><e><s> <a> <5> ## [6] │ <1>, <2>, <3>, <4>: <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:punct:]") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time<!> ## [3] │ 1 is the loneliest number that you<'>ll ever do ## [4] │ When I<'>m 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1<,> 2<,> 3<,> 4<:> Tell me that you love me more ``` --- ## Character Classes * What pattern does this regexp match? 
```r str_view_all(lyrics, "[:graph:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> <5><0><0> <m><i><l><e><s> ## [2] │ <2><0><0><0> <0> <0> <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e><!> ## [3] │ <1> <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u><'><l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I><'><m> <6><4> ## [5] │ <W><h><e><r><e> <2> <a><n><d> <2> <a><l><w><a><y><s> <m><a><k><e><s> <a> <5> ## [6] │ <1><,> <2><,> <3><,> <4><:> <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:space:]") ``` ``` ## [1] │ But< >I< >would< >walk< >500< >miles ## [2] │ 2000< >0< >0< >party< >over< >oops< >out< >of< >time! ## [3] │ 1< >is< >the< >loneliest< >number< >that< >you'll< >ever< >do ## [4] │ When< >I'm< >64 ## [5] │ Where< >2< >and< >2< >always< >makes< >a< >5 ## [6] │ 1,< >2,< >3,< >4:< >Tell< >me< >that< >you< >love< >me< >more ``` --- ## Character Classes * Can also create your own. * What pattern does this regexp match? ```r str_view_all(lyrics, "[aeiou]") ``` ``` ## [1] │ B<u>t I w<o><u>ld w<a>lk 500 m<i>l<e>s ## [2] │ 2000 0 0 p<a>rty <o>v<e>r <o><o>ps <o><u>t <o>f t<i>m<e>! ## [3] │ 1 <i>s th<e> l<o>n<e>l<i><e>st n<u>mb<e>r th<a>t y<o><u>'ll <e>v<e>r d<o> ## [4] │ Wh<e>n I'm 64 ## [5] │ Wh<e>r<e> 2 <a>nd 2 <a>lw<a>ys m<a>k<e>s <a> 5 ## [6] │ 1, 2, 3, 4: T<e>ll m<e> th<a>t y<o><u> l<o>v<e> m<e> m<o>r<e> ``` --- ## Other Handy Regexps * What pattern does this regexp match? * Why do we need an extra `\`? ```r str_view_all(lyrics, "\\d") ``` ``` ## [1] │ But I would walk <5><0><0> miles ## [2] │ <2><0><0><0> <0> <0> party over oops out of time! 
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <6><4>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```

---

## Escaping Metacharacters

* `\` is a special character with its own meaning in R strings, so the regexp `\d` must be typed as `"\\d"`: one `\` escapes the other in the string, leaving `\d` for the regexp engine.
* You can see all the special characters that need escaping in the help page for `'`:

```r
?"'"
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, ".w.")
```

```
## [1] │ But I< wo>uld< wa>lk 500 miles
## [2] │ 2000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where 2 and 2 a<lwa>ys makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "\\W")
```

```
## [1] │ But< >I< >would< >walk< >500< >miles
## [2] │ 2000< >0< >0< >party< >over< >oops< >out< >of< >time<!>
## [3] │ 1< >is< >the< >loneliest< >number< >that< >you<'>ll< >ever< >do
## [4] │ When< >I<'>m< >64
## [5] │ Where< >2< >and< >2< >always< >makes< >a< >5
## [6] │ 1<,>< >2<,>< >3<,>< >4<:>< >Tell< >me< >that< >you< >love< >me< >more
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "Whe(n|re)")
```

```
## [1] │ But I would walk 500 miles
## [2] │ 2000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ <When> I'm 64
## [5] │ <Where> 2 and 2 always makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Groups

* What pattern does this regexp match?

```r
str_view_all(lyrics, "(\\d)\\1")
```

```
## [1] │ But I would walk 5<00> miles
## [2] │ 2<00>0 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where 2 and 2 always makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Groups

* What pattern does this regexp match?
```r str_view_all(lyrics, "(\\d)\\1\\1") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2<000> 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Groups * What pattern does this regexp match? ```r str_view_all(lyrics, "([:alnum:])(\\s)[:alnum:]+\\2\\1") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 200<0 0 0> party over oops ou<t of t>ime! ## [3] │ 1 is the lonelies<t number t>hat you'll ever do ## [4] │ When I'm 64 ## [5] │ Where <2 and 2> always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Other Handy Regexps * What pattern does this regexp match? ```r str_view_all(lyrics, "\\b") ``` ``` ## [1] │ <>But<> <>I<> <>would<> <>walk<> <>500<> <>miles<> ## [2] │ <>2000<> <>0<> <>0<> <>party<> <>over<> <>oops<> <>out<> <>of<> <>time<>! ## [3] │ <>1<> <>is<> <>the<> <>loneliest<> <>number<> <>that<> <>you<>'<>ll<> <>ever<> <>do<> ## [4] │ <>When<> <>I<>'<>m<> <>64<> ## [5] │ <>Where<> <>2<> <>and<> <>2<> <>always<> <>makes<> <>a<> <>5<> ## [6] │ <>1<>, <>2<>, <>3<>, <>4<>: <>Tell<> <>me<> <>that<> <>you<> <>love<> <>me<> <>more<> ``` --- ## Anchors * What pattern does this regexp match? ```r str_view_all(lyrics, "^\\d+") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ <2000> 0 0 party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ <1>, 2, 3, 4: Tell me that you love me more ``` --- ## Anchors * What pattern does this regexp match? ```r str_view_all(lyrics, "\\d+$") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time! 
## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where 2 and 2 always makes a <5> ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Alternates * What pattern does this regexp match? ```r str_view_all(lyrics, "[^aeiou]") ``` ``` ## [1] │ <B>u<t>< ><I>< ><w>ou<l><d>< ><w>a<l><k>< ><5><0><0>< ><m>i<l>e<s> ## [2] │ <2><0><0><0>< ><0>< ><0>< ><p>a<r><t><y>< >o<v>e<r>< >oo<p><s>< >ou<t>< >o<f>< ><t>i<m>e<!> ## [3] │ <1>< >i<s>< ><t><h>e< ><l>o<n>e<l>ie<s><t>< ><n>u<m><b>e<r>< ><t><h>a<t>< ><y>ou<'><l><l>< >e<v>e<r>< ><d>o ## [4] │ <W><h>e<n>< ><I><'><m>< ><6><4> ## [5] │ <W><h>e<r>e< ><2>< >a<n><d>< ><2>< >a<l><w>a<y><s>< ><m>a<k>e<s>< >a< ><5> ## [6] │ <1><,>< ><2><,>< ><3><,>< ><4><:>< ><T>e<l><l>< ><m>e< ><t><h>a<t>< ><y>ou< ><l>o<v>e< ><m>e< ><m>o<r>e ``` --- ## Alternates * What pattern does this regexp match? ```r str_view_all(lyrics, "o[m-z]") ``` ``` ## [1] │ But I w<ou>ld walk 500 miles ## [2] │ 2000 0 0 party <ov>er <oo>ps <ou>t of time! ## [3] │ 1 is the l<on>eliest number that y<ou>'ll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that y<ou> l<ov>e me m<or>e ``` --- ## Look Arounds ```r str_view_all(lyrics, "(?<=\\d )[:alpha:]+") ``` ``` ## [1] │ But I would walk 500 <miles> ## [2] │ 2000 0 0 <party> over oops out of time! ## [3] │ 1 <is> the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 <and> 2 <always> makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Look Arounds ```r str_view_all(lyrics, "[[:alpha:]+[:punct:]]+[:alpha:]+(?= \\d+)") ``` ``` ## [1] │ But I would <walk> 500 miles ## [2] │ 2000 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When <I'm> 64 ## [5] │ <Where> 2 <and> 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ### Pattern Matching * The `str_view_all()` is a nice helper function. 
* Now we need functions that take **action** based on our regular expression pattern matching.
* Detect a pattern with:
    + `str_detect()`
    + `str_subset()`
    + `str_count()`
* Extract a pattern with:
    + `str_extract()` and `str_extract_all()`
* Replace a pattern with:
    + `str_replace()` and `str_replace_all()`
* Split on a pattern with:
    + `str_split()`
    + But we are going to let `tidytext` do the splitting for us!

---

### New Example: [Taylor Swift song](https://github.com/shaynak/taylor-swift-lyrics)

```r
ts <- read_csv("https://raw.githubusercontent.com/shaynak/taylor-swift-lyrics/main/lyrics.csv")
shake_it_off <- ts %>%
  select(Song, Album, Lyric) %>%
  filter(Song == "Shake It Off")
shake_it_off
```

```
## # A tibble: 62 × 3
##    Song         Album         Lyric
##    <chr>        <chr>         <chr>
##  1 Shake It Off 1989 (Deluxe) I stay out too late
##  2 Shake It Off 1989 (Deluxe) Got nothin' in my brain
##  3 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  4 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  5 Shake It Off 1989 (Deluxe) I go on too many dates (Haha)
##  6 Shake It Off 1989 (Deluxe) But I can't make them stay
##  7 Shake It Off 1989 (Deluxe) At least that's what people say, mmm-mmm
##  8 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  9 Shake It Off 1989 (Deluxe) But I keep cruisin'
## 10 Shake It Off 1989 (Deluxe) Can't stop, won't stop movin'
## # … with 52 more rows
```

---

## Detect

```r
str_detect(string = shake_it_off$Lyric, pattern = "ake")
```

```
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE
```

---

## Detect

```r
str_subset(string =
shake_it_off$Lyric, pattern = "ake") ``` ``` ## [1] "But I can't make them stay" ## [2] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [3] "I shake it off, I shake it off" ## [4] "Heartbreakers gonna break, break, break, break, break" ## [5] "And the fakers gonna fake, fake, fake, fake, fake" ## [6] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [7] "I shake it off, I shake it off" ## [8] "I make the moves up as I go (Moves up as I go)" ## [9] "Shake it off, I shake it off" ## [10] "I, I, I shake it off, I shake it off" ## [11] "I, I, I shake it off, I shake it off" ## [12] "I, I, I shake it off, I shake it off" ## [13] "She's like, \"Oh my God!\" But I'm just gonna shake" ## [14] "We can shake, shake, shake" ## [15] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [16] "I shake it off, I shake it off (Ha!)" ## [17] "Heartbreakers gonna break, break, break, break, break (Mmm)" ## [18] "And the fakers gonna fake, fake, fake, fake, fake" ## [19] "(And fake, and fake, and fake)" ## [20] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [21] "I shake it off, I shake it off (I, I, I)" ## [22] "I, I, I shake it off, I shake it off" ## [23] "Shake it off, I shake it off" ## [24] "I, I, I shake it off, I shake it off" ## [25] "I, I, I shake it off, I shake it off (Yeah!)" ## [26] "Shake it off, I shake it off" ## [27] "I, I, I shake it off, I shake it off (You got to)" ## [28] "I, I, I shake it off, I shake it off" ``` --- ## Detect ```r shake <- shake_it_off %>% filter(str_detect(string = Lyric, pattern = "ake")) shake ``` ``` ## # A tibble: 28 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) But I can't make them stay ## 2 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 3 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 4 Shake It Off 1989 (Deluxe) Heartbreakers gonna break, break, break, break, b… ## 5 Shake It Off 1989 (Deluxe) And the fakers gonna 
fake, fake, fake, fake, fake ## 6 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 7 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 8 Shake It Off 1989 (Deluxe) I make the moves up as I go (Moves up as I go) ## 9 Shake It Off 1989 (Deluxe) Shake it off, I shake it off ## 10 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## # … with 18 more rows ``` --- ## Detect ```r str_count(string = shake_it_off$Lyric, pattern = "ake") ``` ``` ## [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 5 2 1 6 5 2 0 0 0 0 0 1 0 0 0 0 0 2 2 2 2 0 0 0 ## [39] 0 0 0 1 0 0 3 0 0 0 5 2 1 6 3 5 2 2 2 2 2 2 2 2 ``` --- ## Detect ```r shake_it_off %>% filter(str_count(string = shake_it_off$Lyric, pattern = "ake") > 1) ``` ``` ## # A tibble: 23 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 2 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 3 Shake It Off 1989 (Deluxe) And the fakers gonna fake, fake, fake, fake, fake ## 4 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 5 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 6 Shake It Off 1989 (Deluxe) Shake it off, I shake it off ## 7 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 8 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 9 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 10 Shake It Off 1989 (Deluxe) We can shake, shake, shake ## # … with 13 more rows ``` --- ## Extract ```r str_subset(string = shake_it_off$Lyric, pattern = "[:punct:]") %>% str_extract(pattern = "[:punct:]") ``` ``` ## [1] "'" "'" "'" "(" "'" "'" "'" "'" "'" "'" "'" "'" "," "," "," "," "," "," "," ## [20] "'" "'" "'" "'" "(" "'" "'" "'" "'" "'" "," "," "," "," "," "," "," "'" "-" ## [39] "'" "'" "," "," "," "(" "," "," "," "," "(" "," "," "," "," "," "," "," "," ## [58] "," ``` --- ## Extract ```r all <- str_subset(string = 
shake_it_off$Lyric, pattern = "[:punct:]") %>% str_extract_all(pattern = "[:punct:]") class(all) ``` ``` ## [1] "list" ``` ```r all ``` ``` ## [[1]] ## [1] "'" ## ## [[2]] ## [1] "'" "," "-" ## ## [[3]] ## [1] "'" "," "-" ## ## [[4]] ## [1] "(" ")" ## ## [[5]] ## [1] "'" ## ## [[6]] ## [1] "'" "," "-" ## ## [[7]] ## [1] "'" "," "-" ## ## [[8]] ## [1] "'" ## ## [[9]] ## [1] "'" "," "'" "'" ## ## [[10]] ## [1] "'" ## ## [[11]] ## [1] "'" "," "\"" "'" "\"" ## ## [[12]] ## [1] "'" "," "," "," "," ## ## [[13]] ## [1] "," "," "," "," ## ## [[14]] ## [1] "," "'" "," "," "," "," ## ## [[15]] ## [1] "," ## ## [[16]] ## [1] "," "," "," "," ## ## [[17]] ## [1] "," "," "," "," ## ## [[18]] ## [1] "," "'" "," "," "," "," ## ## [[19]] ## [1] "," ## ## [[20]] ## [1] "'" "'" ## ## [[21]] ## [1] "'" "'" "," "-" ## ## [[22]] ## [1] "'" "'" "," "-" ## ## [[23]] ## [1] "'" "'" "(" "'" ")" ## ## [[24]] ## [1] "(" ")" ## ## [[25]] ## [1] "'" "'" "," "-" ## ## [[26]] ## [1] "'" "'" "," "-" ## ## [[27]] ## [1] "'" ## ## [[28]] ## [1] "'" "," "'" "'" ## ## [[29]] ## [1] "'" ## ## [[30]] ## [1] "," ## ## [[31]] ## [1] "," "," "," ## ## [[32]] ## [1] "," "," "," ## ## [[33]] ## [1] "," "," "," ## ## [[34]] ## [1] "," "," ## ## [[35]] ## [1] "," "'" "'" ## ## [[36]] ## [1] "," ## ## [[37]] ## [1] "'" "'" ## ## [[38]] ## [1] "-" ## ## [[39]] ## [1] "'" "," "\"" "!" "\"" "'" ## ## [[40]] ## [1] "'" "," "?" ## ## [[41]] ## [1] "," "," ## ## [[42]] ## [1] "," "," "," ## ## [[43]] ## [1] "," "," "," "," ## ## [[44]] ## [1] "(" ")" ## ## [[45]] ## [1] "," "'" "," "," "," "," ## ## [[46]] ## [1] "," "(" "!" ")" ## ## [[47]] ## [1] "," "," "," "," "(" ")" ## ## [[48]] ## [1] "," "," "," "," ## ## [[49]] ## [1] "(" "," "," ")" ## ## [[50]] ## [1] "," "'" "," "," "," "," ## ## [[51]] ## [1] "," "(" "," "," ")" ## ## [[52]] ## [1] "," "," "," ## ## [[53]] ## [1] "," ## ## [[54]] ## [1] "," "," "," ## ## [[55]] ## [1] "," "," "," "(" "!" 
")" ## ## [[56]] ## [1] "," ## ## [[57]] ## [1] "," "," "," "(" ")" ## ## [[58]] ## [1] "," "," "," ``` --- ## Extract ```r str_view_all(shake_it_off$Lyric, pattern = "(?<= gonna )\\w+") ``` ``` ## [1] │ I stay out too late ## [2] │ Got nothin' in my brain ## [3] │ That's what people say, mmm-mmm ## [4] │ That's what people say, mmm-mmm ## [5] │ I go on too many dates (Haha) ## [6] │ But I can't make them stay ## [7] │ At least that's what people say, mmm-mmm ## [8] │ That's what people say, mmm-mmm ## [9] │ But I keep cruisin' ## [10] │ Can't stop, won't stop movin' ## [11] │ It's like I got this music in my mind ## [12] │ Sayin', "It's gonna <be> alright" ## [13] │ 'Cause the players gonna <play>, play, play, play, play ## [14] │ And the haters gonna <hate>, hate, hate, hate, hate ## [15] │ Baby, I'm just gonna <shake>, shake, shake, shake, shake ## [16] │ I shake it off, I shake it off ## [17] │ Heartbreakers gonna <break>, break, break, break, break ## [18] │ And the fakers gonna <fake>, fake, fake, fake, fake ## [19] │ Baby, I'm just gonna <shake>, shake, shake, shake, shake ## [20] │ I shake it off, I shake it off ## ... 
and 42 more ``` --- ## Extract ```r shake_it_off %>% filter(str_detect(Lyric, "gonna")) %>% mutate(after = str_extract(Lyric, "(?<= gonna )\\w+")) ``` ``` ## # A tibble: 14 × 4 ## Song Album Lyric after ## <chr> <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) "Sayin', \"It's gonna be alright\"" be ## 2 Shake It Off 1989 (Deluxe) "'Cause the players gonna play, play, play,… play ## 3 Shake It Off 1989 (Deluxe) "And the haters gonna hate, hate, hate, hat… hate ## 4 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 5 Shake It Off 1989 (Deluxe) "Heartbreakers gonna break, break, break, b… break ## 6 Shake It Off 1989 (Deluxe) "And the fakers gonna fake, fake, fake, fak… fake ## 7 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 8 Shake It Off 1989 (Deluxe) "She's like, \"Oh my God!\" But I'm just go… shake ## 9 Shake It Off 1989 (Deluxe) "And the haters gonna hate, hate, hate, hat… hate ## 10 Shake It Off 1989 (Deluxe) "(Haters gonna hate)" hate ## 11 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 12 Shake It Off 1989 (Deluxe) "Heartbreakers gonna break, break, break, b… break ## 13 Shake It Off 1989 (Deluxe) "And the fakers gonna fake, fake, fake, fak… fake ## 14 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ``` --- ## Replace ```r run_it_off <- shake_it_off %>% mutate(Lyric = str_replace_all(Lyric, "shake", "run")) run_it_off %>% filter(str_detect(Lyric, "run")) ``` ``` ## # A tibble: 21 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna run, run, run, run, run" ## 2 Shake It Off 1989 (Deluxe) "I run it off, I run it off" ## 3 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna run, run, run, run, run" ## 4 Shake It Off 1989 (Deluxe) "I run it off, I run it off" ## 5 Shake It Off 1989 (Deluxe) "Shake it off, I run it off" ## 6 Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 7 
Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 8 Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 9 Shake It Off 1989 (Deluxe) "She's like, \"Oh my God!\" But I'm just gonna ru… ## 10 Shake It Off 1989 (Deluxe) "We can run, run, run" ## # … with 11 more rows ``` --- ### Basic Text Analysis with `tidytext` Topics: * Tokenizing to a tidy format * Word frequencies * Sentiment analysis --- ### Tidy Text * A data table with one token per row. -- * **Token**: meaningful unit of text + What is the unit for `shake_it_off`? ```r shake_it_off ``` ``` ## # A tibble: 62 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) I stay out too late ## 2 Shake It Off 1989 (Deluxe) Got nothin' in my brain ## 3 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 4 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 5 Shake It Off 1989 (Deluxe) I go on too many dates (Haha) ## 6 Shake It Off 1989 (Deluxe) But I can't make them stay ## 7 Shake It Off 1989 (Deluxe) At least that's what people say, mmm-mmm ## 8 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 9 Shake It Off 1989 (Deluxe) But I keep cruisin' ## 10 Shake It Off 1989 (Deluxe) Can't stop, won't stop movin' ## # … with 52 more rows ``` --- ### Tidy Text * A data table with one token per row. * **Token**: meaningful unit of text + What is the unit for `shake_it_off`? * Other common tokens are words, sentences, paragraphs. * Some text analysis should be done on text data in a non-tidy format. 
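
* Before we hand tokenizing off to `tidytext`, here is a rough sketch of the idea using `stringr` tools we already know. (This is an illustration only -- not the exact cleaning that `unnest_tokens()` performs.)

```r
library(stringr)

# Sketch of word tokenization: lowercase, split on whitespace,
# then trim punctuation from the ends of each token.
line <- "That's what people say, mmm-mmm"
tokens <- str_split(str_to_lower(line), "[:space:]+")[[1]]
str_remove_all(tokens, "^[:punct:]+|[:punct:]+$")
```

* Internal punctuation (the apostrophe in "that's", the hyphen in "mmm-mmm") survives this sketch; `unnest_tokens()` makes its own choices about such cases.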
--- ### Tidying Text Data * **Tokenize**: Break text into individual tokens ```r library(tidytext) shake_it_off_words <- shake_it_off %>% unnest_tokens(output = word, input = Lyric, token = "words") shake_it_off_words ``` ``` ## # A tibble: 473 × 3 ## Song Album word ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) i ## 2 Shake It Off 1989 (Deluxe) stay ## 3 Shake It Off 1989 (Deluxe) out ## 4 Shake It Off 1989 (Deluxe) too ## 5 Shake It Off 1989 (Deluxe) late ## 6 Shake It Off 1989 (Deluxe) got ## 7 Shake It Off 1989 (Deluxe) nothin ## 8 Shake It Off 1989 (Deluxe) in ## 9 Shake It Off 1989 (Deluxe) my ## 10 Shake It Off 1989 (Deluxe) brain ## # … with 463 more rows ``` --- ### Tidying Text Data * What is an `ngram`? ```r shake_it_off_ngram <- shake_it_off %>% unnest_tokens(output = ngram, input = Lyric, token = "ngrams", n = 2) shake_it_off_ngram ``` ``` ## # A tibble: 411 × 3 ## Song Album ngram ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) i stay ## 2 Shake It Off 1989 (Deluxe) stay out ## 3 Shake It Off 1989 (Deluxe) out too ## 4 Shake It Off 1989 (Deluxe) too late ## 5 Shake It Off 1989 (Deluxe) got nothin ## 6 Shake It Off 1989 (Deluxe) nothin in ## 7 Shake It Off 1989 (Deluxe) in my ## 8 Shake It Off 1989 (Deluxe) my brain ## 9 Shake It Off 1989 (Deluxe) that's what ## 10 Shake It Off 1989 (Deluxe) what people ## # … with 401 more rows ``` --- ### Word Frequencies * Common text mining task * What have we learned about the frequency of words in "Shake it Off"? * Which words in this list do we maybe not care about? 
```r
shake_it_off_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 111 × 2
##    word      n
##    <chr> <int>
##  1 i        57
##  2 shake    54
##  3 it       30
##  4 off      30
##  5 mmm      17
##  6 gonna    14
##  7 fake     13
##  8 and      12
##  9 hate     11
## 10 the      11
## # … with 101 more rows
```

---

### Word Frequencies

* **Stop words**: Common words that are not useful for analysis

```r
data("stop_words")
stop_words
```

```
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## # … with 1,139 more rows
```

---

### Word Frequencies

* I want to remove from `shake_it_off_words` the rows that contain stop words.
    + Get to learn a new `join`!

--

```r
shake_it_off_words <- shake_it_off_words %>%
  anti_join(stop_words, by = "word")
shake_it_off_words
```

```
## # A tibble: 192 × 3
##    Song         Album         word
##    <chr>        <chr>         <chr>
##  1 Shake It Off 1989 (Deluxe) stay
##  2 Shake It Off 1989 (Deluxe) late
##  3 Shake It Off 1989 (Deluxe) nothin
##  4 Shake It Off 1989 (Deluxe) brain
##  5 Shake It Off 1989 (Deluxe) people
##  6 Shake It Off 1989 (Deluxe) mmm
##  7 Shake It Off 1989 (Deluxe) mmm
##  8 Shake It Off 1989 (Deluxe) people
##  9 Shake It Off 1989 (Deluxe) mmm
## 10 Shake It Off 1989 (Deluxe) mmm
## # … with 182 more rows
```

---

### Word Frequencies

* What graph should we construct?

```r
shake_it_off_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 49 × 2
##    word       n
##    <chr>  <int>
##  1 shake     54
##  2 mmm       17
##  3 gonna     14
##  4 fake      13
##  5 hate      11
##  6 break     10
##  7 baby       5
##  8 play       5
##  9 people     4
## 10 stop       4
## # … with 39 more rows
```

---

### Word Frequencies

.pull-left[
* Which `forcats` function should we use to reorder the bars?
```r shake_it_off_words %>% count(word, sort = TRUE) %>% filter(n > 2) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Word Frequencies .pull-left[ * Which `forcats` function should we use to reorder the bars? ```r shake_it_off_words %>% count(word, sort = TRUE) %>% filter(n > 2) %>% mutate(word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake2-1.png" width="768" style="display: block; margin: auto;" /> ] * Wordclouds are also a pretty common way of displaying this data. Will create wordclouds on P-Set 4! --- ### Comparisons Across Albums/Texts ```r two_albums <- ts %>% select(Song, Album, Lyric) %>% filter(Album %in% c("1989 (Deluxe)", "reputation")) %>% mutate(Album = str_replace(Album, " \\(.+", "")) ``` --- ### Word Frequencies Across Albums ```r two_albums_tidy <- two_albums %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% anti_join(stop_words, by = "word") %>% filter(!(word %in% c("ah", "ahh", "eh", "uh", "ooh", "huh", "mmm", "di", "ha"))) %>% count(Album, word) %>% group_by(Album) %>% mutate(prop = n/sum(n)) two_albums_tidy ``` ``` ## # A tibble: 1,166 × 4 ## # Groups: Album [2] ## Album word n prop ## <chr> <chr> <int> <dbl> ## 1 1989 2 3 0.00220 ## 2 1989 ace 1 0.000734 ## 3 1989 admit 1 0.000734 ## 4 1989 afraid 1 0.000734 ## 5 1989 aglow 1 0.000734 ## 6 1989 ahead 1 0.000734 ## 7 1989 aids 2 0.00147 ## 8 1989 airplanes 1 0.000734 ## 9 1989 alright 1 0.000734 ## 10 1989 anthem 2 0.00147 ## # … with 1,156 more rows ``` --- ### Word Frequencies Across Albums .pull-left[ ```r two_albums_tidy %>% group_by(Album) %>% arrange(desc(n)) %>% slice(1:10) %>% ungroup() %>% mutate(word = factor(word), word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + 
geom_col() + facet_wrap(~Album) + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake3-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Word Frequencies Across Albums .pull-left[ ```r two_albums_tidy %>% group_by(Album) %>% arrange(desc(n)) %>% slice(1:10) %>% ungroup() %>% mutate(word = factor(word), word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + facet_wrap(~Album, scales = "free_y") + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake4-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Sentiment Analysis * Was one album a more negative album than the other? * Need to add a column that measures the sentiment of each token. + From [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) + Generalizability to other English-speaking countries or time periods? ```r sentiments ``` ``` ## # A tibble: 6,786 × 2 ## word sentiment ## <chr> <chr> ## 1 2-faces negative ## 2 abnormal negative ## 3 abolish negative ## 4 abominable negative ## 5 abominably negative ## 6 abominate negative ## 7 abomination negative ## 8 abort negative ## 9 aborted negative ## 10 aborts negative ## # … with 6,776 more rows ``` --- ### Sentiment Analysis * Keep stop words this time. 
```r two_albums_tidy <- two_albums %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% count(Album, word) %>% group_by(Album) %>% mutate(prop = n/sum(n)) two_albums_tidy ``` ``` ## # A tibble: 1,707 × 4 ## # Groups: Album [2] ## Album word n prop ## <chr> <chr> <int> <dbl> ## 1 1989 2 3 0.000561 ## 2 1989 a 73 0.0136 ## 3 1989 about 6 0.00112 ## 4 1989 ace 1 0.000187 ## 5 1989 admit 1 0.000187 ## 6 1989 afraid 1 0.000187 ## 7 1989 aglow 1 0.000187 ## 8 1989 ah 41 0.00767 ## 9 1989 ahead 1 0.000187 ## 10 1989 ahh 1 0.000187 ## # … with 1,697 more rows ``` --- ### Sentiment Analysis What are the most common **negative words** on each album? ```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% filter(sentiment == "negative") %>% arrange(desc(n)) ``` ``` ## # A tibble: 174 × 5 ## # Groups: Album [2] ## Album word n prop sentiment ## <chr> <chr> <int> <dbl> <chr> ## 1 1989 shake 54 0.0101 negative ## 2 reputation bad 25 0.00418 negative ## 3 1989 fake 13 0.00243 negative ## 4 1989 lost 13 0.00243 negative ## 5 1989 hate 12 0.00224 negative ## 6 1989 break 10 0.00187 negative ## 7 1989 bad 9 0.00168 negative ## 8 1989 worse 8 0.00150 negative ## 9 1989 insane 7 0.00131 negative ## 10 reputation break 7 0.00117 negative ## # … with 164 more rows ``` --- ### Sentiment Analysis What are the most common **positive words** on each album? 
```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% filter(sentiment == "positive") %>% arrange(desc(n)) ``` ``` ## # A tibble: 94 × 5 ## # Groups: Album [2] ## Album word n prop sentiment ## <chr> <chr> <int> <dbl> <chr> ## 1 1989 like 42 0.00785 positive ## 2 reputation like 42 0.00702 positive ## 3 1989 love 34 0.00636 positive ## 4 1989 clear 27 0.00505 positive ## 5 reputation good 23 0.00384 positive ## 6 1989 welcome 19 0.00355 positive ## 7 reputation love 16 0.00267 positive ## 8 1989 good 15 0.00280 positive ## 9 reputation gorgeous 14 0.00234 positive ## 10 reputation right 14 0.00234 positive ## # … with 84 more rows ``` --- ### What is the distribution of positive and negative words? .pull-left[ * Remember that words not in the lexicon are dropped! * Issue with word-based sentiment analysis? ```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% group_by(Album, sentiment) %>% summarize(n = sum(n)) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = Album, y = prop, fill = sentiment)) + geom_col() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake5-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Sentiment Analysis ```r two_albums %>% filter(str_detect(Lyric, "love")) ``` ``` ## # A tibble: 53 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 All You Had to Do Was Stay 1989 The love they gave away ## 2 All You Had to Do Was Stay 1989 The love they pushed aside ## 3 Bad Blood 1989 You know it used to be mad love ## 4 Bad Blood 1989 If you love like that, blood runs cold ## 5 Bad Blood 1989 You know it used to be mad love ## 6 I Know Places 1989 Just grab my hand and don't ever drop it, m… ## 7 I Know Places 1989 Just grab my hand and don't ever drop it, m… ## 8 I Wish You Would 1989 We're a crooked love in a straight line down ## 9 I Wish You Would 1989 This mad, mad love makes you come running ## 10 I Wish You Would 1989 We're a crooked love in a straight line down ## # … with 43 more rows ``` --- 
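### Sentiment Analysis

* Word-by-word scoring also ignores context such as negation. A minimal sketch of the problem, using a made-up sentence (not a lyric) and the same `sentiments` lexicon:

```r
library(dplyr)
library(tidytext)

tibble(Lyric = "I am not happy and I don't love it") %>%
  unnest_tokens(output = word, input = Lyric, token = "words") %>%
  inner_join(sentiments, by = "word")
# Only "happy" and "love" should survive the join, both tagged
# positive -- the negations "not" and "don't" are discarded
# along with every other token not in the lexicon.
```

---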
### Sentiment Analysis

* We should also try out other lexicons.

```r
library(textdata)
nrc <- get_sentiments("nrc")
nrc
```

```
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows
```

---
### Sentiment Analysis

.pull-left[
```r
two_albums_tidy %>%
  inner_join(nrc, by = "word") %>%
  group_by(Album, sentiment) %>%
  summarize(n = sum(n)) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(fill = Album, y = prop,
             x = sentiment)) +
  geom_col(position = "dodge") +
  coord_flip()
```
]

.pull-right[
<img src="stat108_wk06wed_files/figure-html/shake6-1.png" width="768" style="display: block; margin: auto;" />
]

---
### Measuring Differences: `tf_idf`

* tf = number of times a word appears in a given text, divided by the total number of words in that text
* idf = log(number of texts/number of texts containing the word)
* tf `\(*\)` idf = a measure of a word's frequency within a text that accounts for how common the word is across texts

If we have 6 texts and "you" shows up in all of them, then tf `\(*\)` idf equals what?
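We can check the answer with base R arithmetic (the term frequency below is a made-up value for illustration):

```r
n_texts <- 6
n_texts_with_word <- 6           # "you" shows up in every text
idf <- log(n_texts / n_texts_with_word)
tf <- 0.05                       # hypothetical term frequency
tf * idf
# idf = log(6/6) = 0, so tf * idf = 0: a word that appears in
# every text is zeroed out, no matter how frequent it is.
```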
--- ### Measuring Differences: `tf_idf` ```r ts_albums <- c("Taylor Swift", "Fearless (Taylor’s Version)", "Speak Now", "1989 (Deluxe)", "reputation", "Lover") taylor_tidy <- ts %>% mutate(Album = factor(Album, levels = ts_albums)) %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% filter(Album %in% ts_albums) %>% count(Album, word, sort = TRUE) %>% filter(!(word %in% c("that", "na", "mm", "la", "di", "da", "eeh", "eh", "e"))) %>% distinct(Album, word, .keep_all = TRUE) %>% bind_tf_idf(word, Album, n) taylor_tidy %>% arrange(tf_idf) ``` ``` ## # A tibble: 5,287 × 6 ## Album word n tf idf tf_idf ## <fct> <chr> <int> <dbl> <dbl> <dbl> ## 1 Fearless (Taylor’s Version) you 376 0.0532 0 0 ## 2 Fearless (Taylor’s Version) i 347 0.0491 0 0 ## 3 Lover i 335 0.0588 0 0 ## 4 1989 (Deluxe) i 309 0.0587 0 0 ## 5 Fearless (Taylor’s Version) and 266 0.0377 0 0 ## 6 Speak Now you 255 0.0523 0 0 ## 7 reputation i 251 0.0427 0 0 ## 8 reputation you 251 0.0427 0 0 ## 9 1989 (Deluxe) you 239 0.0454 0 0 ## 10 Fearless (Taylor’s Version) the 232 0.0329 0 0 ## # … with 5,277 more rows ``` --- ### Measuring Differences: `tf_idf` .pull-left[ ```r taylor_tidy %>% group_by(Album) %>% slice_max(tf_idf, n = 10) %>% ungroup() %>% mutate(word = fct_reorder(word, tf_idf)) %>% ggplot(aes(x = word, y = tf_idf, fill = Album)) + geom_col(show.legend = FALSE) + coord_flip() + facet_wrap(~Album, ncol = 3, scales = "free") ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake7-1.png" width="768" style="display: block; margin: auto;" /> ] --- What happened to **shake**?! <img src="stat108_wk06wed_files/figure-html/shake7-1.png" width="768" style="display: block; margin: auto;" /> --- ### Measuring Differences: `tf_idf` * What happened to **shake**?! 
```r
taylor_tidy %>%
  filter(word == "shake")
```

```
## # A tibble: 4 × 6
##   Album                       word      n       tf   idf    tf_idf
##   <fct>                       <chr> <int>    <dbl> <dbl>     <dbl>
## 1 1989 (Deluxe)               shake    54 0.0103   0.405 0.00416  
## 2 Fearless (Taylor’s Version) shake     1 0.000142 0.405 0.0000574
## 3 reputation                  shake     1 0.000170 0.405 0.0000690
## 4 Lover                       shake     1 0.000176 0.405 0.0000712
```

---
## Further Text Analysis Topics

* Topic models: latent Dirichlet allocation (LDA)
* Sentence-level sentiment analysis with `coreNLP`, `cleanNLP`, and/or `sentimentr`
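As a taste of the first topic, here is a minimal LDA sketch with the `topicmodels` package (the number of topics `k = 2` is an arbitrary choice for illustration), starting from a tidy word-count table like `taylor_tidy`:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the tidy (Album, word, n) counts into a document-term
# matrix, fit a 2-topic LDA model, then tidy the fit to get
# per-topic word probabilities.
dtm <- taylor_tidy %>%
  cast_dtm(document = Album, term = word, value = n)
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))
tidy(lda_fit, matrix = "beta")
```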