background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[

<br>

## .base_color[Data Objects]

## .base_color[in `R`]

<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 4 | Spring 2023]

]

---
class: middle, center

## Let's tinker with our RStudio Settings!

---

## Announcements

* Make sure to come by office hours if you are still having issues syncing your class GitHub repo with an RStudio Project.

************************

## Week 3 Goals

.pull-left[

**Mon Lecture**

* GitHub + RStudio Projects workflow

* Coding Style

* Exploring `R` Objects

]

.pull-right[

**Wed Lecture**

* Data Joins

* Data reshaping with `tidyr`

]

---

## GitHub + RStudio Workflow

Once your GitHub repo and RStudio project are synced, here's your workflow:

* **Pull** the most recent version of the repo from GitHub to your RStudio project.
* Do some work on your project in RStudio.
* **Commit** that work.
    + Committing takes a snapshot of all the files in the project.
    + Look over the **Diff**, which shows what has changed since your last commit.
    + Include a quick note, the **Commit Message**, to summarize the motivation for the changes.
* **Push** your commit to GitHub from RStudio.

---
class: middle, center

## Workflow Demo

---

## Ignoring Files

* There are several files that we do **NOT** want to push to GitHub.

* These include:
    + `.gitignore`
    + `.DS_Store`

* Add these files to the `.gitignore`.

---

## Test the waters: Let's go through the workflow.

* Pull. (Yes, there is nothing to pull yet but it is always good practice to start here.)
* Click on the ReadMe.
* Add something to the ReadMe.
* Click on the git tab. Check the box next to the ReadMe.md. Hit commit.
* Put in a commit message. Look over the diff.
* Push.

**Look for updates in the ReadMe on GitHub.com.**

---

## Git Collaboration: Merge conflicts

* What if my collaborators and I both make changes?
    + Scenario: Your collaborator makes changes to a file, commits, and pushes to GitHub. You also modify that file, commit, and push.
    + Result: Your push will fail because there's a commit on GitHub that you don't have.
    + Usual Solution: Pull and *usually* git will merge their work nicely with yours. Then push. If that doesn't work, you have a **merge conflict**. Let's cross that bridge when we get there.

* How to avoid merge conflicts?
    + Always pull when you are going to work on your project.
    + Always commit and push when you are done, even if you made small changes.

---

## Collaboration: Git Style

* **Projects**: Use these to create to-do lists and stay organized.

* **Issues**: A useful way to communicate with your group members.

---

### Time to Take Stock of Learning Goals

<img src="img/ggplot2.png" width="10%" style="float:right; padding:25px" style="display: block; margin: auto;" />

#### So far we have:

#### Created **static** data visualizations of multivariate data with `ggplot2`.

<br>

--

#### Created **animated** data visualizations of multivariate data with `gganimate`.

--

<img src="img/plotly.jpeg" width="10%" style="float:right; padding:25px" style="display: block; margin: auto;" />

<br>

#### Created simple **interactive** data visualizations of multivariate data with `plotly`.

<br>

--

#### Wrangled `data.frame()`s with `dplyr` and `factor()` vectors with `forcats`.

<img src="img/dplyr.png" width="10%" style="float:right; padding:5px" style="display: block; margin: auto;" /><img src="img/forcats.png" width="10%" style="float:right; padding:5px" style="display: block; margin: auto;" />

---

### Looking Ahead to Project 1

<img src="img/flexdashboard.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

#### Goal: Create interactive dashboards with `shinydashboard` and `flexdashboard`.
<br>

<img src="img/fia_dash.png" width="50%" style="display: block; margin: auto;" />

---
class: middle

### What We Need First

--

#### Need to explore more **data objects** in `R`.

--

#### Need to learn how to interact with and wrangle these **data objects**.

---

### Today's discussion is a mix of

.pull-left[

<img src="img/STAT108Logo_Wrangling.png" width="60%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="img/STAT108Logo_Programming.png" width="60%" style="display: block; margin: auto;" />

]

---

## Stat 108 `R` Data Objects So Far:

```r
library(tidyverse)  # provides read_csv()
crash_data <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cambridge_cyclist_ped_crash.csv")
class(crash_data)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

---

## Stat 108 `R` Data Objects So Far:

```r
pretty_colors <- c("deeppink3", "cyan4", "darkolivegreen")
class(pretty_colors)
```

```
## [1] "character"
```

```r
size <- 3
class(size)
```

```
## [1] "numeric"
```

These are both examples of what are called **atomic vectors**.

---

## R Objects: Vectors

* **Vectors** are the fundamental building blocks of data in `R`.
    + Come in **2** flavors!

--

* **Flavor 1:** **Atomic vectors**
    + Homogeneous collections of the same type.
    + Confusingly, people usually just call these *vectors*.

--

* **Flavor 2:** **Generic vectors**
    + Heterogeneous collections of any type of `R` objects.
    + Commonly called *lists*.

---
class: middle, center

### Deep Dive into (atomic) vectors

--

#### Let's explore the common types and how to interact with them.

--

#### This will allow us to use vectors to do useful things and

--

#### To debug our code more effectively!
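---

### Atomic versus Generic Vectors: A Quick Check

The two flavors described earlier can be told apart directly in the console. The following is a minimal sketch using only base `R`; the object names `scores`, `mixed`, and `bundle` are made up for illustration.

```r
# Atomic vector: every element shares one type
scores <- c(90, 85.5, 77)
typeof(scores)      # "double"
is.atomic(scores)   # TRUE

# Mixing types inside c() still yields ONE common type
mixed <- c(1, "a", TRUE)
typeof(mixed)       # "character"

# Generic vector (list): each element keeps its own type
bundle <- list(1, "a", TRUE)
bundle_types <- sapply(bundle, typeof)
bundle_types        # "double" "character" "logical"
is.atomic(bundle)   # FALSE
is.list(bundle)     # TRUE
```

Comparing `mixed` with `bundle` is a useful debugging habit: if a value changed type unexpectedly, it probably passed through an atomic vector.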
---

### Flavors of Atomic Vectors

.pull-left[

**Logical**: TRUEs and FALSEs

```r
happy <- c(TRUE, FALSE, TRUE, TRUE)
class(happy)
```

```
## [1] "logical"
```

**Numeric**: Integers, real numbers (double-precision floating point numbers)

```r
x <- c(1, 4, 5)
class(x)
```

```
## [1] "numeric"
```

```r
x <- c(1L, 4L, 5L)
class(x)
```

```
## [1] "integer"
```

]

--

.pull-right[

**Character**: Strings (contains 1 or more characters)

```r
animals <- c("llama", "pig", "cat")
class(animals)
```

```
## [1] "character"
```

**Factors**: Strings with order

```r
animals <- as.factor(animals)
class(animals)
```

```
## [1] "factor"
```

]

---

### Factors versus Characters

.pull-left[

**Character**: Strings (contains 1 or more characters)

```r
animals <- c("llama", "pig", "cat")
class(animals)
```

```
## [1] "character"
```

```r
typeof(animals)
```

```
## [1] "character"
```

```r
str(animals)
```

```
##  chr [1:3] "llama" "pig" "cat"
```

]

--

.pull-right[

**Factors**: Strings with order

```r
animals <- as.factor(animals)
class(animals)
```

```
## [1] "factor"
```

```r
typeof(animals)
```

```
## [1] "integer"
```

```r
str(animals)
```

```
##  Factor w/ 3 levels "cat","llama",..: 2 3 1
```

What?!

]

If you want to read more about this rabbit hole, [go here](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html).

---

### Concatenation

Atomic vectors are created and combined with `c()`.
.pull-left[

```r
x <- 5
x
```

```
## [1] 5
```

```r
y <- c(4, 1, 7)
y
```

```
## [1] 4 1 7
```

]

--

.pull-right[

```r
z <- c(y, x, c(20))
z
```

```
## [1]  4  1  7  5 20
```

```r
quick <- c(2:5, 13)
quick
```

```
## [1]  2  3  4  5 13
```

]

---

### Checking Type

.pull-left[

```r
is.logical(c(FALSE, TRUE))
```

```
## [1] TRUE
```

```r
is.logical(c(0, 1))
```

```
## [1] FALSE
```

]

--

.pull-right[

```r
is.factor(c("llama", "pig", "cat"))
```

```
## [1] FALSE
```

```r
is.character(c("llama", "pig", "cat"))
```

```
## [1] TRUE
```

]

---

### Checking Type

.pull-left[

```r
is.integer(1)
```

```
## [1] FALSE
```

```r
is.double(1)
```

```
## [1] TRUE
```

```r
is.numeric(1)
```

```
## [1] TRUE
```

]

--

.pull-right[

```r
is.integer(1L)
```

```
## [1] TRUE
```

```r
is.double(1L)
```

```
## [1] FALSE
```

```r
is.numeric(1L)
```

```
## [1] TRUE
```

]

```r
is.atomic(1)
```

```
## [1] TRUE
```

---

### Type Coercion

* R will often change the type of a vector with no warning.

* Usually it makes the smart choice.

.pull-left[

```r
y <- c(4, 1, 7)
class(y)
```

```
## [1] "numeric"
```

```r
z <- c(y, "cat")
class(z)
```

```
## [1] "character"
```

```r
z <- c(y, 3L)
class(z)
```

```
## [1] "numeric"
```

]

--

.pull-right[

```r
a <- c(FALSE, 4)
class(a)
```

```
## [1] "numeric"
```

```r
a
```

```
## [1] 0 4
```

```r
b <- c("NA", 4)
class(b)
```

```
## [1] "character"
```

]

---

### Operator Coercion

* Functions and operators (like `+`) will often try to convert a vector to an appropriate type.

* Once we are writing our own functions, we will consider building in tests to ensure the user provided the correct type.

.pull-left[

```r
1L + 3.1415
```

```
## [1] 4.1415
```

```r
log(TRUE)
```

```
## [1] 0
```

```r
sum(c(FALSE, TRUE, TRUE))
```

```
## [1] 2
```

]

--

.pull-right[

```r
TRUE & FALSE
```

```
## [1] FALSE
```

```r
TRUE | FALSE
```

```
## [1] TRUE
```

```r
TRUE & 7
```

```
## [1] TRUE
```

```r
FALSE | !7
```

```
## [1] FALSE
```

]

---

### Changing Type

* We can also explicitly change the type.
.pull-left[

```r
as.logical(c(0, 6.4, 1))
```

```
## [1] FALSE  TRUE  TRUE
```

```r
as.character(c(FALSE, TRUE, TRUE))
```

```
## [1] "FALSE" "TRUE"  "TRUE"
```

```r
as.integer(pi)
```

```
## [1] 3
```

]

--

.pull-right[

```r
as.numeric(c(FALSE, TRUE, TRUE))
```

```
## [1] 0 1 1
```

```r
as.double(c("01", "02", "03"))
```

```
## [1] 1 2 3
```

```r
as.double(c("one", "two", "three"))
```

```
## [1] NA NA NA
```

]

---

## Vectorized

* R is built to work with vectors.

* Many operations are **vectorized**: they happen component-wise when given a vector as input.

.pull-left[

```r
y <- c(4, 1, 7)
y + 4
```

```
## [1]  8  5 11
```

```r
y * 2
```

```
## [1]  8  2 14
```

]

--

.pull-right[

```r
x <- c(-4, 0, 2)
y + x
```

```
## [1] 0 1 9
```

```r
rnorm(n = 3, mean = c(-5, 0, 5))
```

```
## [1] -4.2706889 -0.6995773  3.5507746
```

]

---

* But we need to be careful!

```r
dat <- data.frame(x = c(1, 2, 2, 4, 1, 3),
                  y = c(8, 7, 6, 5, 8, 6))
dat
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 4 5
## 5 1 8
## 6 3 6
```

* Want the rows where `x` equals 1 or 2
    + What happened?

```r
library(tidyverse)
dat %>%
  filter(x == c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 1 8
```

---

## Recycling

* R recycles vectors if they are not the necessary length.
    + Here `c(1, 2)` was recycled to `c(1, 2, 1, 2, 1, 2)` and compared with `x` element by element.
    + Notice we got NO error but that wasn't what we wanted.

* To keep the rows where `x` equals 1 or 2, use `%in%` instead of `==`.

```r
library(tidyverse)
dat %>%
  filter(x %in% c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 1 8
```

---

## Indexing a Vector

.pull-left[

```r
x <- c(4, 6, 9)
x[1]
```

```
## [1] 4
```

```r
x[-1]
```

```
## [1] 6 9
```

]

--

.pull-right[

```r
x[c(2, 4)]
```

```
## [1]  6 NA
```

```r
a <- c(2, 4)
x[a]
```

```
## [1]  6 NA
```

]

---
class: middle, center

## So what about **generic vectors/lists**?

#### Recall they are heterogeneous collections of any type of `R` objects.

---

### Lists

* Think of these as the most general way to store things.
```r
groceries <- list()
groceries$whole_foods <- c("apples", "chocolate", "kale", "garlic")
groceries$star <- c("vinegar", "soap")
groceries$lizzys <- c("french vanilla", "black cherry", "mint chip")
groceries$budget <- data.frame(stores = c("whole_foods", "star", "lizzys"),
                               fund = c(100, 25, 200))
class(groceries)
```

```
## [1] "list"
```

---

### Lists

* Notice the nested structure

```r
groceries
```

```
## $whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"
##
## $star
## [1] "vinegar" "soap"
##
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"
##
## $budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
```

---

```r
holidays <- c("Valentine's", "President's")
feb <- list(groceries = groceries, holidays = holidays)
feb
```

```
## $groceries
## $groceries$whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"
##
## $groceries$star
## [1] "vinegar" "soap"
##
## $groceries$lizzys
## [1] "french vanilla" "black cherry"   "mint chip"
##
## $groceries$budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
##
##
## $holidays
## [1] "Valentine's" "President's"
```

---

### Indexing lists

* Distinguishing between `[ ]` and `[[ ]]`

```r
thing1 <- feb$groceries[3]
thing1
```

```
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing1)
```

```
## [1] "list"
```

```r
thing2 <- feb$groceries[[3]]
thing2
```

```
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing2)
```

```
## [1] "character"
```

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x` is a list: the pepper shaker containing packets of pepper.

<img src="img/list.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[1]` is a pepper shaker containing the first packet of pepper.
<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[2]` is what?

--

<img src="img/innerlist.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]]` is what?

--

<img src="img/innerobject.png" width="30%" style="display: block; margin: auto;" />

---

## [`[ ]` versus `[[ ]]`](https://r4ds.had.co.nz/vectors.html#lists-of-condiments)

* `x[[1]][[1]]` is what?

--

<img src="img/innerobject2.png" width="30%" style="display: block; margin: auto;" />

---

## Data Frames

Let's relate `list()`s to our favorite R object: `data.frame()`s!

* `data.frame()`s are `list()`s.

* Each variable of a `data.frame()` is a `vector()`.

* The `vector()`s all have the same length but not necessarily the same class.

---

## Data Frames

```r
dat
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 4 5
## 5 1 8
## 6 3 6
```

```r
dat$x
```

```
## [1] 1 2 2 4 1 3
```

---

## Data Frames

```r
str(dat[1])
```

```
## 'data.frame': 6 obs. of 1 variable:
##  $ x: num 1 2 2 4 1 3
```

```r
str(dat[[1]])
```

```
##  num [1:6] 1 2 2 4 1 3
```

```r
dat[1, 2]
```

```
## [1] 8
```

```r
dat[1, ]
```

```
##   x y
## 1 1 8
```

---

## R Data Objects

* We can create and interact with (atomic) **vectors** and **lists** (generic vectors).

--

* When writing and debugging code, it is a good idea to check the `class()` of an object.

--

* R will sometimes change the type of an object without telling us (but based on our actions).

--

* **Vectorization** makes `R` code speedy but also can cause you to do something you didn't mean to do.

--

* While we will primarily interact with `vector()`s and `data.frame()`s, we will sometimes need `list()`s.

---

### Reminders

* Make sure to come by office hours if you are still having issues syncing your class GitHub repo with an RStudio Project.
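---

### Appendix: Recycling Under the Hood

The surprising `filter(x == c(1, 2))` result from the Recycling slide can be reproduced without `dplyr`. This is a small sketch in base `R`; the names `recycled_match`, `by_hand`, and `set_match` are made up for illustration.

```r
x <- c(1, 2, 2, 4, 1, 3)

# == compares element by element, so c(1, 2) is recycled to length 6
recycled_match <- x == c(1, 2)
recycled_match   # TRUE TRUE FALSE FALSE TRUE FALSE

# Writing the recycled vector out by hand gives the same comparison
by_hand <- x == rep(c(1, 2), times = 3)
identical(recycled_match, by_hand)   # TRUE

# %in% asks a different question: is each element of x anywhere in the set?
set_match <- x %in% c(1, 2)
set_match        # TRUE TRUE TRUE FALSE TRUE FALSE
```

Only three elements pass the `==` test while four pass `%in%`, which matches the three-row and four-row outputs on the earlier slides.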