background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[
<br>
## .base_color[String Manipulation and]
## .base_color[Text Analysis]
<br>
#### .navy[Kelly McConville]
#### .navy[ Stat 108 | Week 6 | Spring 2023]
]

---

## Announcements

* Remember that P-Set 3 is due today at 10pm.
* Lecture quiz posted after lecture today.

---

## Week's Goals

.pull-left[
**Mon Lecture**

* Finish up maps -- interactive maps.
* More data types
    + Dates with `lubridate`
    + Factors with `forcats`
    + Strings with `stringr`
]

.pull-right[
**Wed Lecture**

* More wrangling of strings
* Text analysis with `tidytext`
]

---

### Recap: How should we modify the code to locate all the numbers in these lyrics from various songs?

```r
lyrics <- c("But I would walk 500 miles",
            "2000 0 0 party over oops out of time!",
            "1 is the loneliest number that you'll ever do",
            "When I'm 64",
            "Where 2 and 2 always makes a 5",
            "1, 2, 3, 4: Tell me that you love me more")
```

```r
str_view_all(lyrics, "500|1000|0|2000|1|64|2|5|3|4")
```

```
## [1] │ But I would walk <500> miles
## [2] │ <2000> <0> <0> party over oops out of time!
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <64>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```

---

### Recap: But now imagine you had a very long vector and wanted to locate any number.

```r
str_view_all(lyrics, "1|2|3|4...")
```

Not a good approach!

---

## Regular Expressions

* A concise language for describing patterns in strings.
    + But not super easy to read.
    + Good to have cheatsheets and the internet for help!
* Neat RStudio Addin to help: [`RegExplain`](https://www.garrickadenbuie.com/project/regexplain/) --- ## Regular Expressions * `[:digit:]` is a particular [Character Class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html) * Character classes are a way of specifying that you want to match one of the following characters. ```r str_view_all(lyrics, "[:digit:]") ``` ``` ## [1] │ But I would walk <5><0><0> miles ## [2] │ <2><0><0><0> <0> <0> party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm <6><4> ## [5] │ Where <2> and <2> always makes a <5> ## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more ``` --- ## Regular Expressions * `+` is a quantifier * `+`: One or more ```r str_view_all(lyrics, "[:digit:]+") ``` ``` ## [1] │ But I would walk <500> miles ## [2] │ <2000> <0> <0> party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where <2> and <2> always makes a <5> ## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more ``` --- ## Regular Expressions What does `{n}` do? ```r str_view_all(lyrics, "[:digit:]{2}") ``` ``` ## [1] │ But I would walk <50>0 miles ## [2] │ <20><00> 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Quantifiers * `?`: 0 or 1 * `*`: 0 or more * `+`: 1 or more * `{n}`: Exactly n * `{n,}`: n or more * `{,m}`: at most m * `{n,m}`: between n and m --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:alpha:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> 500 <m><i><l><e><s> ## [2] │ 2000 0 0 <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e>! 
## [3] │ 1 <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u>'<l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I>'<m> 64 ## [5] │ <W><h><e><r><e> 2 <a><n><d> 2 <a><l><w><a><y><s> <m><a><k><e><s> <a> 5 ## [6] │ 1, 2, 3, 4: <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:upper:]") ``` ``` ## [1] │ <B>ut <I> would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ <W>hen <I>'m 64 ## [5] │ <W>here 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: <T>ell me that you love me more ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:alnum:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> <5><0><0> <m><i><l><e><s> ## [2] │ <2><0><0><0> <0> <0> <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e>! ## [3] │ <1> <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u>'<l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I>'<m> <6><4> ## [5] │ <W><h><e><r><e> <2> <a><n><d> <2> <a><l><w><a><y><s> <m><a><k><e><s> <a> <5> ## [6] │ <1>, <2>, <3>, <4>: <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:punct:]") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time<!> ## [3] │ 1 is the loneliest number that you<'>ll ever do ## [4] │ When I<'>m 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1<,> 2<,> 3<,> 4<:> Tell me that you love me more ``` --- ## Character Classes * What pattern does this regexp match? 
```r str_view_all(lyrics, "[:graph:]") ``` ``` ## [1] │ <B><u><t> <I> <w><o><u><l><d> <w><a><l><k> <5><0><0> <m><i><l><e><s> ## [2] │ <2><0><0><0> <0> <0> <p><a><r><t><y> <o><v><e><r> <o><o><p><s> <o><u><t> <o><f> <t><i><m><e><!> ## [3] │ <1> <i><s> <t><h><e> <l><o><n><e><l><i><e><s><t> <n><u><m><b><e><r> <t><h><a><t> <y><o><u><'><l><l> <e><v><e><r> <d><o> ## [4] │ <W><h><e><n> <I><'><m> <6><4> ## [5] │ <W><h><e><r><e> <2> <a><n><d> <2> <a><l><w><a><y><s> <m><a><k><e><s> <a> <5> ## [6] │ <1><,> <2><,> <3><,> <4><:> <T><e><l><l> <m><e> <t><h><a><t> <y><o><u> <l><o><v><e> <m><e> <m><o><r><e> ``` --- ## Character Classes * What pattern does this regexp match? ```r str_view_all(lyrics, "[:space:]") ``` ``` ## [1] │ But< >I< >would< >walk< >500< >miles ## [2] │ 2000< >0< >0< >party< >over< >oops< >out< >of< >time! ## [3] │ 1< >is< >the< >loneliest< >number< >that< >you'll< >ever< >do ## [4] │ When< >I'm< >64 ## [5] │ Where< >2< >and< >2< >always< >makes< >a< >5 ## [6] │ 1,< >2,< >3,< >4:< >Tell< >me< >that< >you< >love< >me< >more ``` --- ## Character Classes * Can also create your own. * What pattern does this regexp match? ```r str_view_all(lyrics, "[aeiou]") ``` ``` ## [1] │ B<u>t I w<o><u>ld w<a>lk 500 m<i>l<e>s ## [2] │ 2000 0 0 p<a>rty <o>v<e>r <o><o>ps <o><u>t <o>f t<i>m<e>! ## [3] │ 1 <i>s th<e> l<o>n<e>l<i><e>st n<u>mb<e>r th<a>t y<o><u>'ll <e>v<e>r d<o> ## [4] │ Wh<e>n I'm 64 ## [5] │ Wh<e>r<e> 2 <a>nd 2 <a>lw<a>ys m<a>k<e>s <a> 5 ## [6] │ 1, 2, 3, 4: T<e>ll m<e> th<a>t y<o><u> l<o>v<e> m<e> m<o>r<e> ``` --- ## Other Handy Regexps * What pattern does this regexp match? * Why do we need an extra `\`? ```r str_view_all(lyrics, "\\d") ``` ``` ## [1] │ But I would walk <5><0><0> miles ## [2] │ <2><0><0><0> <0> <0> party over oops out of time! 
## [3] │ <1> is the loneliest number that you'll ever do
## [4] │ When I'm <6><4>
## [5] │ Where <2> and <2> always makes a <5>
## [6] │ <1>, <2>, <3>, <4>: Tell me that you love me more
```

---

## Escaping Metacharacters

* `\` is a special character with its own meaning in R strings, so the regexp `\d` must be typed as `"\\d"`: one `\` escapes the other in the string, leaving `\d` for the regexp engine.
* You can see all the special characters that need escaping in the help page for `'`:

```r
?"'"
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, ".w.")
```

```
## [1] │ But I< wo>uld< wa>lk 500 miles
## [2] │ 2000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where 2 and 2 a<lwa>ys makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "\\W")
```

```
## [1] │ But< >I< >would< >walk< >500< >miles
## [2] │ 2000< >0< >0< >party< >over< >oops< >out< >of< >time<!>
## [3] │ 1< >is< >the< >loneliest< >number< >that< >you<'>ll< >ever< >do
## [4] │ When< >I<'>m< >64
## [5] │ Where< >2< >and< >2< >always< >makes< >a< >5
## [6] │ 1<,>< >2<,>< >3<,>< >4<:>< >Tell< >me< >that< >you< >love< >me< >more
```

---

## Other Handy Regexps

* What pattern does this regexp match?

```r
str_view_all(lyrics, "Whe(n|re)")
```

```
## [1] │ But I would walk 500 miles
## [2] │ 2000 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ <When> I'm 64
## [5] │ <Where> 2 and 2 always makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Groups

* What pattern does this regexp match?

```r
str_view_all(lyrics, "(\\d)\\1")
```

```
## [1] │ But I would walk 5<00> miles
## [2] │ 2<00>0 0 0 party over oops out of time!
## [3] │ 1 is the loneliest number that you'll ever do
## [4] │ When I'm 64
## [5] │ Where 2 and 2 always makes a 5
## [6] │ 1, 2, 3, 4: Tell me that you love me more
```

---

## Groups

* What pattern does this regexp match?
```r str_view_all(lyrics, "(\\d)\\1\\1") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2<000> 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Groups * What pattern does this regexp match? ```r str_view_all(lyrics, "([:alnum:])(\\s)[:alnum:]+\\2\\1") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 200<0 0 0> party over oops ou<t of t>ime! ## [3] │ 1 is the lonelies<t number t>hat you'll ever do ## [4] │ When I'm 64 ## [5] │ Where <2 and 2> always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Other Handy Regexps * What pattern does this regexp match? ```r str_view_all(lyrics, "\\b") ``` ``` ## [1] │ <>But<> <>I<> <>would<> <>walk<> <>500<> <>miles<> ## [2] │ <>2000<> <>0<> <>0<> <>party<> <>over<> <>oops<> <>out<> <>of<> <>time<>! ## [3] │ <>1<> <>is<> <>the<> <>loneliest<> <>number<> <>that<> <>you<>'<>ll<> <>ever<> <>do<> ## [4] │ <>When<> <>I<>'<>m<> <>64<> ## [5] │ <>Where<> <>2<> <>and<> <>2<> <>always<> <>makes<> <>a<> <>5<> ## [6] │ <>1<>, <>2<>, <>3<>, <>4<>: <>Tell<> <>me<> <>that<> <>you<> <>love<> <>me<> <>more<> ``` --- ## Anchors * What pattern does this regexp match? ```r str_view_all(lyrics, "^\\d+") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ <2000> 0 0 party over oops out of time! ## [3] │ <1> is the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ <1>, 2, 3, 4: Tell me that you love me more ``` --- ## Anchors * What pattern does this regexp match? ```r str_view_all(lyrics, "\\d+$") ``` ``` ## [1] │ But I would walk 500 miles ## [2] │ 2000 0 0 party over oops out of time! 
## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When I'm <64> ## [5] │ Where 2 and 2 always makes a <5> ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Alternates * What pattern does this regexp match? ```r str_view_all(lyrics, "[^aeiou]") ``` ``` ## [1] │ <B>u<t>< ><I>< ><w>ou<l><d>< ><w>a<l><k>< ><5><0><0>< ><m>i<l>e<s> ## [2] │ <2><0><0><0>< ><0>< ><0>< ><p>a<r><t><y>< >o<v>e<r>< >oo<p><s>< >ou<t>< >o<f>< ><t>i<m>e<!> ## [3] │ <1>< >i<s>< ><t><h>e< ><l>o<n>e<l>ie<s><t>< ><n>u<m><b>e<r>< ><t><h>a<t>< ><y>ou<'><l><l>< >e<v>e<r>< ><d>o ## [4] │ <W><h>e<n>< ><I><'><m>< ><6><4> ## [5] │ <W><h>e<r>e< ><2>< >a<n><d>< ><2>< >a<l><w>a<y><s>< ><m>a<k>e<s>< >a< ><5> ## [6] │ <1><,>< ><2><,>< ><3><,>< ><4><:>< ><T>e<l><l>< ><m>e< ><t><h>a<t>< ><y>ou< ><l>o<v>e< ><m>e< ><m>o<r>e ``` --- ## Alternates * What pattern does this regexp match? ```r str_view_all(lyrics, "o[m-z]") ``` ``` ## [1] │ But I w<ou>ld walk 500 miles ## [2] │ 2000 0 0 party <ov>er <oo>ps <ou>t of time! ## [3] │ 1 is the l<on>eliest number that y<ou>'ll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 and 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that y<ou> l<ov>e me m<or>e ``` --- ## Look Arounds ```r str_view_all(lyrics, "(?<=\\d )[:alpha:]+") ``` ``` ## [1] │ But I would walk 500 <miles> ## [2] │ 2000 0 0 <party> over oops out of time! ## [3] │ 1 <is> the loneliest number that you'll ever do ## [4] │ When I'm 64 ## [5] │ Where 2 <and> 2 <always> makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ## Look Arounds ```r str_view_all(lyrics, "[[:alpha:]+[:punct:]]+[:alpha:]+(?= \\d+)") ``` ``` ## [1] │ But I would <walk> 500 miles ## [2] │ 2000 0 0 party over oops out of time! ## [3] │ 1 is the loneliest number that you'll ever do ## [4] │ When <I'm> 64 ## [5] │ <Where> 2 <and> 2 always makes a 5 ## [6] │ 1, 2, 3, 4: Tell me that you love me more ``` --- ### Pattern Matching * The `str_view_all()` is a nice helper function. 
* Now we need functions that take **action** based on our regular expression pattern matching.
* Detect a pattern with:
    + `str_detect()`
    + `str_subset()`
    + `str_count()`
* Extract a pattern with:
    + `str_extract()` and `str_extract_all()`
* Replace a pattern with:
    + `str_replace()` and `str_replace_all()`
* Split on a pattern with:
    + `str_split()`
    + But we are going to let `tidytext` do the splitting for us!

---

### New Example: [Taylor Swift song](https://github.com/shaynak/taylor-swift-lyrics)

```r
ts <- read_csv("https://raw.githubusercontent.com/shaynak/taylor-swift-lyrics/main/lyrics.csv")
shake_it_off <- ts %>%
  select(Song, Album, Lyric) %>%
  filter(Song == "Shake It Off")
shake_it_off
```

```
## # A tibble: 62 × 3
##    Song         Album         Lyric
##    <chr>        <chr>         <chr>
##  1 Shake It Off 1989 (Deluxe) I stay out too late
##  2 Shake It Off 1989 (Deluxe) Got nothin' in my brain
##  3 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  4 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  5 Shake It Off 1989 (Deluxe) I go on too many dates (Haha)
##  6 Shake It Off 1989 (Deluxe) But I can't make them stay
##  7 Shake It Off 1989 (Deluxe) At least that's what people say, mmm-mmm
##  8 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm
##  9 Shake It Off 1989 (Deluxe) But I keep cruisin'
## 10 Shake It Off 1989 (Deluxe) Can't stop, won't stop movin'
## # … with 52 more rows
```

---

## Detect

```r
str_detect(string = shake_it_off$Lyric, pattern = "ake")
```

```
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE
```

---

## Detect

```r
str_subset(string =
shake_it_off$Lyric, pattern = "ake") ``` ``` ## [1] "But I can't make them stay" ## [2] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [3] "I shake it off, I shake it off" ## [4] "Heartbreakers gonna break, break, break, break, break" ## [5] "And the fakers gonna fake, fake, fake, fake, fake" ## [6] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [7] "I shake it off, I shake it off" ## [8] "I make the moves up as I go (Moves up as I go)" ## [9] "Shake it off, I shake it off" ## [10] "I, I, I shake it off, I shake it off" ## [11] "I, I, I shake it off, I shake it off" ## [12] "I, I, I shake it off, I shake it off" ## [13] "She's like, \"Oh my God!\" But I'm just gonna shake" ## [14] "We can shake, shake, shake" ## [15] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [16] "I shake it off, I shake it off (Ha!)" ## [17] "Heartbreakers gonna break, break, break, break, break (Mmm)" ## [18] "And the fakers gonna fake, fake, fake, fake, fake" ## [19] "(And fake, and fake, and fake)" ## [20] "Baby, I'm just gonna shake, shake, shake, shake, shake" ## [21] "I shake it off, I shake it off (I, I, I)" ## [22] "I, I, I shake it off, I shake it off" ## [23] "Shake it off, I shake it off" ## [24] "I, I, I shake it off, I shake it off" ## [25] "I, I, I shake it off, I shake it off (Yeah!)" ## [26] "Shake it off, I shake it off" ## [27] "I, I, I shake it off, I shake it off (You got to)" ## [28] "I, I, I shake it off, I shake it off" ``` --- ## Detect ```r shake <- shake_it_off %>% filter(str_detect(string = Lyric, pattern = "ake")) shake ``` ``` ## # A tibble: 28 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) But I can't make them stay ## 2 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 3 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 4 Shake It Off 1989 (Deluxe) Heartbreakers gonna break, break, break, break, b… ## 5 Shake It Off 1989 (Deluxe) And the fakers gonna 
fake, fake, fake, fake, fake ## 6 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 7 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 8 Shake It Off 1989 (Deluxe) I make the moves up as I go (Moves up as I go) ## 9 Shake It Off 1989 (Deluxe) Shake it off, I shake it off ## 10 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## # … with 18 more rows ``` --- ## Detect ```r str_count(string = shake_it_off$Lyric, pattern = "ake") ``` ``` ## [1] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 5 2 1 6 5 2 0 0 0 0 0 1 0 0 0 0 0 2 2 2 2 0 0 0 ## [39] 0 0 0 1 0 0 3 0 0 0 5 2 1 6 3 5 2 2 2 2 2 2 2 2 ``` --- ## Detect ```r shake_it_off %>% filter(str_count(string = shake_it_off$Lyric, pattern = "ake") > 1) ``` ``` ## # A tibble: 23 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 2 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 3 Shake It Off 1989 (Deluxe) And the fakers gonna fake, fake, fake, fake, fake ## 4 Shake It Off 1989 (Deluxe) Baby, I'm just gonna shake, shake, shake, shake, … ## 5 Shake It Off 1989 (Deluxe) I shake it off, I shake it off ## 6 Shake It Off 1989 (Deluxe) Shake it off, I shake it off ## 7 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 8 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 9 Shake It Off 1989 (Deluxe) I, I, I shake it off, I shake it off ## 10 Shake It Off 1989 (Deluxe) We can shake, shake, shake ## # … with 13 more rows ``` --- ## Extract ```r str_subset(string = shake_it_off$Lyric, pattern = "[:punct:]") %>% str_extract(pattern = "[:punct:]") ``` ``` ## [1] "'" "'" "'" "(" "'" "'" "'" "'" "'" "'" "'" "'" "," "," "," "," "," "," "," ## [20] "'" "'" "'" "'" "(" "'" "'" "'" "'" "'" "," "," "," "," "," "," "," "'" "-" ## [39] "'" "'" "," "," "," "(" "," "," "," "," "(" "," "," "," "," "," "," "," "," ## [58] "," ``` --- ## Extract ```r all <- str_subset(string = 
shake_it_off$Lyric, pattern = "[:punct:]") %>% str_extract_all(pattern = "[:punct:]") class(all) ``` ``` ## [1] "list" ``` ```r all ``` ``` ## [[1]] ## [1] "'" ## ## [[2]] ## [1] "'" "," "-" ## ## [[3]] ## [1] "'" "," "-" ## ## [[4]] ## [1] "(" ")" ## ## [[5]] ## [1] "'" ## ## [[6]] ## [1] "'" "," "-" ## ## [[7]] ## [1] "'" "," "-" ## ## [[8]] ## [1] "'" ## ## [[9]] ## [1] "'" "," "'" "'" ## ## [[10]] ## [1] "'" ## ## [[11]] ## [1] "'" "," "\"" "'" "\"" ## ## [[12]] ## [1] "'" "," "," "," "," ## ## [[13]] ## [1] "," "," "," "," ## ## [[14]] ## [1] "," "'" "," "," "," "," ## ## [[15]] ## [1] "," ## ## [[16]] ## [1] "," "," "," "," ## ## [[17]] ## [1] "," "," "," "," ## ## [[18]] ## [1] "," "'" "," "," "," "," ## ## [[19]] ## [1] "," ## ## [[20]] ## [1] "'" "'" ## ## [[21]] ## [1] "'" "'" "," "-" ## ## [[22]] ## [1] "'" "'" "," "-" ## ## [[23]] ## [1] "'" "'" "(" "'" ")" ## ## [[24]] ## [1] "(" ")" ## ## [[25]] ## [1] "'" "'" "," "-" ## ## [[26]] ## [1] "'" "'" "," "-" ## ## [[27]] ## [1] "'" ## ## [[28]] ## [1] "'" "," "'" "'" ## ## [[29]] ## [1] "'" ## ## [[30]] ## [1] "," ## ## [[31]] ## [1] "," "," "," ## ## [[32]] ## [1] "," "," "," ## ## [[33]] ## [1] "," "," "," ## ## [[34]] ## [1] "," "," ## ## [[35]] ## [1] "," "'" "'" ## ## [[36]] ## [1] "," ## ## [[37]] ## [1] "'" "'" ## ## [[38]] ## [1] "-" ## ## [[39]] ## [1] "'" "," "\"" "!" "\"" "'" ## ## [[40]] ## [1] "'" "," "?" ## ## [[41]] ## [1] "," "," ## ## [[42]] ## [1] "," "," "," ## ## [[43]] ## [1] "," "," "," "," ## ## [[44]] ## [1] "(" ")" ## ## [[45]] ## [1] "," "'" "," "," "," "," ## ## [[46]] ## [1] "," "(" "!" ")" ## ## [[47]] ## [1] "," "," "," "," "(" ")" ## ## [[48]] ## [1] "," "," "," "," ## ## [[49]] ## [1] "(" "," "," ")" ## ## [[50]] ## [1] "," "'" "," "," "," "," ## ## [[51]] ## [1] "," "(" "," "," ")" ## ## [[52]] ## [1] "," "," "," ## ## [[53]] ## [1] "," ## ## [[54]] ## [1] "," "," "," ## ## [[55]] ## [1] "," "," "," "(" "!" 
")" ## ## [[56]] ## [1] "," ## ## [[57]] ## [1] "," "," "," "(" ")" ## ## [[58]] ## [1] "," "," "," ``` --- ## Extract ```r str_view_all(shake_it_off$Lyric, pattern = "(?<= gonna )\\w+") ``` ``` ## [1] │ I stay out too late ## [2] │ Got nothin' in my brain ## [3] │ That's what people say, mmm-mmm ## [4] │ That's what people say, mmm-mmm ## [5] │ I go on too many dates (Haha) ## [6] │ But I can't make them stay ## [7] │ At least that's what people say, mmm-mmm ## [8] │ That's what people say, mmm-mmm ## [9] │ But I keep cruisin' ## [10] │ Can't stop, won't stop movin' ## [11] │ It's like I got this music in my mind ## [12] │ Sayin', "It's gonna <be> alright" ## [13] │ 'Cause the players gonna <play>, play, play, play, play ## [14] │ And the haters gonna <hate>, hate, hate, hate, hate ## [15] │ Baby, I'm just gonna <shake>, shake, shake, shake, shake ## [16] │ I shake it off, I shake it off ## [17] │ Heartbreakers gonna <break>, break, break, break, break ## [18] │ And the fakers gonna <fake>, fake, fake, fake, fake ## [19] │ Baby, I'm just gonna <shake>, shake, shake, shake, shake ## [20] │ I shake it off, I shake it off ## ... 
and 42 more ``` --- ## Extract ```r shake_it_off %>% filter(str_detect(Lyric, "gonna")) %>% mutate(after = str_extract(Lyric, "(?<= gonna )\\w+")) ``` ``` ## # A tibble: 14 × 4 ## Song Album Lyric after ## <chr> <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) "Sayin', \"It's gonna be alright\"" be ## 2 Shake It Off 1989 (Deluxe) "'Cause the players gonna play, play, play,… play ## 3 Shake It Off 1989 (Deluxe) "And the haters gonna hate, hate, hate, hat… hate ## 4 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 5 Shake It Off 1989 (Deluxe) "Heartbreakers gonna break, break, break, b… break ## 6 Shake It Off 1989 (Deluxe) "And the fakers gonna fake, fake, fake, fak… fake ## 7 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 8 Shake It Off 1989 (Deluxe) "She's like, \"Oh my God!\" But I'm just go… shake ## 9 Shake It Off 1989 (Deluxe) "And the haters gonna hate, hate, hate, hat… hate ## 10 Shake It Off 1989 (Deluxe) "(Haters gonna hate)" hate ## 11 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ## 12 Shake It Off 1989 (Deluxe) "Heartbreakers gonna break, break, break, b… break ## 13 Shake It Off 1989 (Deluxe) "And the fakers gonna fake, fake, fake, fak… fake ## 14 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna shake, shake, shake, … shake ``` --- ## Replace ```r run_it_off <- shake_it_off %>% mutate(Lyric = str_replace_all(Lyric, "shake", "run")) run_it_off %>% filter(str_detect(Lyric, "run")) ``` ``` ## # A tibble: 21 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna run, run, run, run, run" ## 2 Shake It Off 1989 (Deluxe) "I run it off, I run it off" ## 3 Shake It Off 1989 (Deluxe) "Baby, I'm just gonna run, run, run, run, run" ## 4 Shake It Off 1989 (Deluxe) "I run it off, I run it off" ## 5 Shake It Off 1989 (Deluxe) "Shake it off, I run it off" ## 6 Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 7 
Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 8 Shake It Off 1989 (Deluxe) "I, I, I run it off, I run it off" ## 9 Shake It Off 1989 (Deluxe) "She's like, \"Oh my God!\" But I'm just gonna ru… ## 10 Shake It Off 1989 (Deluxe) "We can run, run, run" ## # … with 11 more rows ``` --- ### Basic Text Analysis with `tidytext` Topics: * Tokenizing to a tidy format * Word frequencies * Sentiment analysis --- ### Tidy Text * A data table with one token per row. -- * **Token**: meaningful unit of text + What is the unit for `shake_it_off`? ```r shake_it_off ``` ``` ## # A tibble: 62 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) I stay out too late ## 2 Shake It Off 1989 (Deluxe) Got nothin' in my brain ## 3 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 4 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 5 Shake It Off 1989 (Deluxe) I go on too many dates (Haha) ## 6 Shake It Off 1989 (Deluxe) But I can't make them stay ## 7 Shake It Off 1989 (Deluxe) At least that's what people say, mmm-mmm ## 8 Shake It Off 1989 (Deluxe) That's what people say, mmm-mmm ## 9 Shake It Off 1989 (Deluxe) But I keep cruisin' ## 10 Shake It Off 1989 (Deluxe) Can't stop, won't stop movin' ## # … with 52 more rows ``` --- ### Tidy Text * A data table with one token per row. * **Token**: meaningful unit of text + What is the unit for `shake_it_off`? * Other common tokens are words, sentences, paragraphs. * Some text analysis should be done on text data in a non-tidy format. 
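
* Before we hand tokenizing off to `tidytext`, here is a rough sketch of the idea using `stringr` tools we already know. (This is an illustration only -- not the exact cleaning that `unnest_tokens()` performs.)

```r
library(stringr)

# Sketch of word tokenization: lowercase, split on whitespace,
# then trim punctuation from the ends of each token.
line <- "That's what people say, mmm-mmm"
tokens <- str_split(str_to_lower(line), "[:space:]+")[[1]]
str_remove_all(tokens, "^[:punct:]+|[:punct:]+$")
```

* Internal punctuation (the apostrophe in "that's", the hyphen in "mmm-mmm") survives this sketch; `unnest_tokens()` makes its own choices about such cases.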
--- ### Tidying Text Data * **Tokenize**: Break text into individual tokens ```r library(tidytext) shake_it_off_words <- shake_it_off %>% unnest_tokens(output = word, input = Lyric, token = "words") shake_it_off_words ``` ``` ## # A tibble: 473 × 3 ## Song Album word ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) i ## 2 Shake It Off 1989 (Deluxe) stay ## 3 Shake It Off 1989 (Deluxe) out ## 4 Shake It Off 1989 (Deluxe) too ## 5 Shake It Off 1989 (Deluxe) late ## 6 Shake It Off 1989 (Deluxe) got ## 7 Shake It Off 1989 (Deluxe) nothin ## 8 Shake It Off 1989 (Deluxe) in ## 9 Shake It Off 1989 (Deluxe) my ## 10 Shake It Off 1989 (Deluxe) brain ## # … with 463 more rows ``` --- ### Tidying Text Data * What is an `ngram`? ```r shake_it_off_ngram <- shake_it_off %>% unnest_tokens(output = ngram, input = Lyric, token = "ngrams", n = 2) shake_it_off_ngram ``` ``` ## # A tibble: 411 × 3 ## Song Album ngram ## <chr> <chr> <chr> ## 1 Shake It Off 1989 (Deluxe) i stay ## 2 Shake It Off 1989 (Deluxe) stay out ## 3 Shake It Off 1989 (Deluxe) out too ## 4 Shake It Off 1989 (Deluxe) too late ## 5 Shake It Off 1989 (Deluxe) got nothin ## 6 Shake It Off 1989 (Deluxe) nothin in ## 7 Shake It Off 1989 (Deluxe) in my ## 8 Shake It Off 1989 (Deluxe) my brain ## 9 Shake It Off 1989 (Deluxe) that's what ## 10 Shake It Off 1989 (Deluxe) what people ## # … with 401 more rows ``` --- ### Word Frequencies * Common text mining task * What have we learned about the frequency of words in "Shake it Off"? * Which words in this list do we maybe not care about? 
```r
shake_it_off_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 111 × 2
##    word      n
##    <chr> <int>
##  1 i        57
##  2 shake    54
##  3 it       30
##  4 off      30
##  5 mmm      17
##  6 gonna    14
##  7 fake     13
##  8 and      12
##  9 hate     11
## 10 the      11
## # … with 101 more rows
```

---

### Word Frequencies

* **Stop words**: Common words that are not useful for analysis

```r
data("stop_words")
stop_words
```

```
## # A tibble: 1,149 × 2
##    word        lexicon
##    <chr>       <chr>
##  1 a           SMART
##  2 a's         SMART
##  3 able        SMART
##  4 about       SMART
##  5 above       SMART
##  6 according   SMART
##  7 accordingly SMART
##  8 across      SMART
##  9 actually    SMART
## 10 after       SMART
## # … with 1,139 more rows
```

---

### Word Frequencies

* I want to remove from `shake_it_off_words` the rows that contain stop words.
    + Get to learn a new `join`!

--

```r
shake_it_off_words <- shake_it_off_words %>%
  anti_join(stop_words, by = "word")
shake_it_off_words
```

```
## # A tibble: 192 × 3
##    Song         Album         word
##    <chr>        <chr>         <chr>
##  1 Shake It Off 1989 (Deluxe) stay
##  2 Shake It Off 1989 (Deluxe) late
##  3 Shake It Off 1989 (Deluxe) nothin
##  4 Shake It Off 1989 (Deluxe) brain
##  5 Shake It Off 1989 (Deluxe) people
##  6 Shake It Off 1989 (Deluxe) mmm
##  7 Shake It Off 1989 (Deluxe) mmm
##  8 Shake It Off 1989 (Deluxe) people
##  9 Shake It Off 1989 (Deluxe) mmm
## 10 Shake It Off 1989 (Deluxe) mmm
## # … with 182 more rows
```

---

### Word Frequencies

* What graph should we construct?

```r
shake_it_off_words %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 49 × 2
##    word       n
##    <chr>  <int>
##  1 shake     54
##  2 mmm       17
##  3 gonna     14
##  4 fake      13
##  5 hate      11
##  6 break     10
##  7 baby       5
##  8 play       5
##  9 people     4
## 10 stop       4
## # … with 39 more rows
```

---

### Word Frequencies

.pull-left[
* Which `forcats` function should we use to reorder the bars?
```r shake_it_off_words %>% count(word, sort = TRUE) %>% filter(n > 2) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Word Frequencies .pull-left[ * Which `forcats` function should we use to reorder the bars? ```r shake_it_off_words %>% count(word, sort = TRUE) %>% filter(n > 2) %>% mutate(word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake2-1.png" width="768" style="display: block; margin: auto;" /> ] * Wordclouds are also a pretty common way of displaying this data. Will create wordclouds on P-Set 4! --- ### Comparisons Across Albums/Texts ```r two_albums <- ts %>% select(Song, Album, Lyric) %>% filter(Album %in% c("1989 (Deluxe)", "reputation")) %>% mutate(Album = str_replace(Album, " \\(.+", "")) ``` --- ### Word Frequencies Across Albums ```r two_albums_tidy <- two_albums %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% anti_join(stop_words, by = "word") %>% filter(!(word %in% c("ah", "ahh", "eh", "uh", "ooh", "huh", "mmm", "di", "ha"))) %>% count(Album, word) %>% group_by(Album) %>% mutate(prop = n/sum(n)) two_albums_tidy ``` ``` ## # A tibble: 1,166 × 4 ## # Groups: Album [2] ## Album word n prop ## <chr> <chr> <int> <dbl> ## 1 1989 2 3 0.00220 ## 2 1989 ace 1 0.000734 ## 3 1989 admit 1 0.000734 ## 4 1989 afraid 1 0.000734 ## 5 1989 aglow 1 0.000734 ## 6 1989 ahead 1 0.000734 ## 7 1989 aids 2 0.00147 ## 8 1989 airplanes 1 0.000734 ## 9 1989 alright 1 0.000734 ## 10 1989 anthem 2 0.00147 ## # … with 1,156 more rows ``` --- ### Word Frequencies Across Albums .pull-left[ ```r two_albums_tidy %>% group_by(Album) %>% arrange(desc(n)) %>% slice(1:10) %>% ungroup() %>% mutate(word = factor(word), word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + 
geom_col() + facet_wrap(~Album) + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake3-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Word Frequencies Across Albums .pull-left[ ```r two_albums_tidy %>% group_by(Album) %>% arrange(desc(n)) %>% slice(1:10) %>% ungroup() %>% mutate(word = factor(word), word = fct_reorder(word, n)) %>% ggplot(mapping = aes(x = word, y = n)) + geom_col() + facet_wrap(~Album, scales = "free_y") + coord_flip() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake4-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Sentiment Analysis * Was one album a more negative album than the other? * Need to add a column that measures the sentiment of each token. + From [Bing Liu and collaborators](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) + Generalizability to other English-speaking countries or time periods? ```r sentiments ``` ``` ## # A tibble: 6,786 × 2 ## word sentiment ## <chr> <chr> ## 1 2-faces negative ## 2 abnormal negative ## 3 abolish negative ## 4 abominable negative ## 5 abominably negative ## 6 abominate negative ## 7 abomination negative ## 8 abort negative ## 9 aborted negative ## 10 aborts negative ## # … with 6,776 more rows ``` --- ### Sentiment Analysis * Keep stop words this time. 
```r two_albums_tidy <- two_albums %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% count(Album, word) %>% group_by(Album) %>% mutate(prop = n/sum(n)) two_albums_tidy ``` ``` ## # A tibble: 1,707 × 4 ## # Groups: Album [2] ## Album word n prop ## <chr> <chr> <int> <dbl> ## 1 1989 2 3 0.000561 ## 2 1989 a 73 0.0136 ## 3 1989 about 6 0.00112 ## 4 1989 ace 1 0.000187 ## 5 1989 admit 1 0.000187 ## 6 1989 afraid 1 0.000187 ## 7 1989 aglow 1 0.000187 ## 8 1989 ah 41 0.00767 ## 9 1989 ahead 1 0.000187 ## 10 1989 ahh 1 0.000187 ## # … with 1,697 more rows ``` --- ### Sentiment Analysis What are the most common **negative words** on each album? ```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% filter(sentiment == "negative") %>% arrange(desc(n)) ``` ``` ## # A tibble: 174 × 5 ## # Groups: Album [2] ## Album word n prop sentiment ## <chr> <chr> <int> <dbl> <chr> ## 1 1989 shake 54 0.0101 negative ## 2 reputation bad 25 0.00418 negative ## 3 1989 fake 13 0.00243 negative ## 4 1989 lost 13 0.00243 negative ## 5 1989 hate 12 0.00224 negative ## 6 1989 break 10 0.00187 negative ## 7 1989 bad 9 0.00168 negative ## 8 1989 worse 8 0.00150 negative ## 9 1989 insane 7 0.00131 negative ## 10 reputation break 7 0.00117 negative ## # … with 164 more rows ``` --- ### Sentiment Analysis What are the most common **positive words** on each album? 
```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% filter(sentiment == "positive") %>% arrange(desc(n)) ``` ``` ## # A tibble: 94 × 5 ## # Groups: Album [2] ## Album word n prop sentiment ## <chr> <chr> <int> <dbl> <chr> ## 1 1989 like 42 0.00785 positive ## 2 reputation like 42 0.00702 positive ## 3 1989 love 34 0.00636 positive ## 4 1989 clear 27 0.00505 positive ## 5 reputation good 23 0.00384 positive ## 6 1989 welcome 19 0.00355 positive ## 7 reputation love 16 0.00267 positive ## 8 1989 good 15 0.00280 positive ## 9 reputation gorgeous 14 0.00234 positive ## 10 reputation right 14 0.00234 positive ## # … with 84 more rows ``` --- ### What is the distribution of positive and negative words? .pull-left[ * Remember that words not in the lexicon are dropped! * Issue with word-based sentiment analysis? ```r two_albums_tidy %>% inner_join(sentiments, by = "word") %>% group_by(Album, sentiment) %>% summarize(n = sum(n)) %>% mutate(prop = n/sum(n)) %>% ggplot(aes(x = Album, y = prop, fill = sentiment)) + geom_col() ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake5-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Sentiment Analysis ```r two_albums %>% filter(str_detect(Lyric, "love")) ``` ``` ## # A tibble: 53 × 3 ## Song Album Lyric ## <chr> <chr> <chr> ## 1 All You Had to Do Was Stay 1989 The love they gave away ## 2 All You Had to Do Was Stay 1989 The love they pushed aside ## 3 Bad Blood 1989 You know it used to be mad love ## 4 Bad Blood 1989 If you love like that, blood runs cold ## 5 Bad Blood 1989 You know it used to be mad love ## 6 I Know Places 1989 Just grab my hand and don't ever drop it, m… ## 7 I Know Places 1989 Just grab my hand and don't ever drop it, m… ## 8 I Wish You Would 1989 We're a crooked love in a straight line down ## 9 I Wish You Would 1989 This mad, mad love makes you come running ## 10 I Wish You Would 1989 We're a crooked love in a straight line down ## # … with 43 more rows ``` --- 
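### Sentiment Analysis

* Word-by-word scoring also ignores context such as negation. A minimal sketch of the problem, using a made-up sentence (not a lyric) and the same `sentiments` lexicon:

```r
library(dplyr)
library(tidytext)

tibble(Lyric = "I am not happy and I don't love it") %>%
  unnest_tokens(output = word, input = Lyric, token = "words") %>%
  inner_join(sentiments, by = "word")
# Only "happy" and "love" should survive the join, both tagged
# positive -- the negations "not" and "don't" are discarded
# along with every other token not in the lexicon.
```

---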
### Sentiment Analysis

* We should also try out other lexicons.

```r
library(textdata)
nrc <- get_sentiments("nrc")
nrc
```

```
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows
```

---
### Sentiment Analysis

.pull-left[
```r
two_albums_tidy %>%
  inner_join(nrc, by = "word") %>%
  group_by(Album, sentiment) %>%
  summarize(n = sum(n)) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(fill = Album, y = prop,
             x = sentiment)) +
  geom_col(position = "dodge") +
  coord_flip()
```
]

.pull-right[
<img src="stat108_wk06wed_files/figure-html/shake6-1.png" width="768" style="display: block; margin: auto;" />
]

---
### Measuring Differences: `tf_idf`

* tf = number of times a word appears in a given text, divided by the total number of words in that text
* idf = log(number of texts/number of texts containing the word)
* tf `\(*\)` idf = a measure of a word's frequency within a text that accounts for how common the word is across texts

If we have 6 texts and "you" shows up in all of them, then tf `\(*\)` idf equals what?
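We can check the answer with base R arithmetic (the term frequency below is a made-up value for illustration):

```r
n_texts <- 6
n_texts_with_word <- 6           # "you" shows up in every text
idf <- log(n_texts / n_texts_with_word)
tf <- 0.05                       # hypothetical term frequency
tf * idf
# idf = log(6/6) = 0, so tf * idf = 0: a word that appears in
# every text is zeroed out, no matter how frequent it is.
```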
--- ### Measuring Differences: `tf_idf` ```r ts_albums <- c("Taylor Swift", "Fearless (Taylor’s Version)", "Speak Now", "1989 (Deluxe)", "reputation", "Lover") taylor_tidy <- ts %>% mutate(Album = factor(Album, levels = ts_albums)) %>% unnest_tokens(output = word, input = Lyric, token = "words") %>% filter(Album %in% ts_albums) %>% count(Album, word, sort = TRUE) %>% filter(!(word %in% c("that", "na", "mm", "la", "di", "da", "eeh", "eh", "e"))) %>% distinct(Album, word, .keep_all = TRUE) %>% bind_tf_idf(word, Album, n) taylor_tidy %>% arrange(tf_idf) ``` ``` ## # A tibble: 5,287 × 6 ## Album word n tf idf tf_idf ## <fct> <chr> <int> <dbl> <dbl> <dbl> ## 1 Fearless (Taylor’s Version) you 376 0.0532 0 0 ## 2 Fearless (Taylor’s Version) i 347 0.0491 0 0 ## 3 Lover i 335 0.0588 0 0 ## 4 1989 (Deluxe) i 309 0.0587 0 0 ## 5 Fearless (Taylor’s Version) and 266 0.0377 0 0 ## 6 Speak Now you 255 0.0523 0 0 ## 7 reputation i 251 0.0427 0 0 ## 8 reputation you 251 0.0427 0 0 ## 9 1989 (Deluxe) you 239 0.0454 0 0 ## 10 Fearless (Taylor’s Version) the 232 0.0329 0 0 ## # … with 5,277 more rows ``` --- ### Measuring Differences: `tf_idf` .pull-left[ ```r taylor_tidy %>% group_by(Album) %>% slice_max(tf_idf, n = 10) %>% ungroup() %>% mutate(word = fct_reorder(word, tf_idf)) %>% ggplot(aes(x = word, y = tf_idf, fill = Album)) + geom_col(show.legend = FALSE) + coord_flip() + facet_wrap(~Album, ncol = 3, scales = "free") ``` ] .pull-right[ <img src="stat108_wk06wed_files/figure-html/shake7-1.png" width="768" style="display: block; margin: auto;" /> ] --- What happened to **shake**?! <img src="stat108_wk06wed_files/figure-html/shake7-1.png" width="768" style="display: block; margin: auto;" /> --- ### Measuring Differences: `tf_idf` * What happened to **shake**?! 
```r
taylor_tidy %>%
  filter(word == "shake")
```

```
## # A tibble: 4 × 6
##   Album                       word      n       tf   idf    tf_idf
##   <fct>                       <chr> <int>    <dbl> <dbl>     <dbl>
## 1 1989 (Deluxe)               shake    54 0.0103   0.405 0.00416  
## 2 Fearless (Taylor’s Version) shake     1 0.000142 0.405 0.0000574
## 3 reputation                  shake     1 0.000170 0.405 0.0000690
## 4 Lover                       shake     1 0.000176 0.405 0.0000712
```

---
## Further Text Analysis Topics

* Topic models: latent Dirichlet allocation (LDA)
* Sentence-level sentiment analysis with `coreNLP`, `cleanNLP`, and/or `sentimentr`
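As a taste of the first topic, here is a minimal LDA sketch with the `topicmodels` package (the number of topics `k = 2` is an arbitrary choice for illustration), starting from a tidy word-count table like `taylor_tidy`:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Cast the tidy (Album, word, n) counts into a document-term
# matrix, fit a 2-topic LDA model, then tidy the fit to get
# per-topic word probabilities.
dtm <- taylor_tidy %>%
  cast_dtm(document = Album, term = word, value = n)
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))
tidy(lda_fit, matrix = "beta")
```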