Data Objects in R

background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

## .base_color[Data Objects]

## .base_color[in `R`]

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 4 | Spring 2023]

---

## Let's tinker with our RStudio Settings!

---

## Announcements

* Make sure to come by office hours if you are still having issues syncing your class GitHub repo with an RStudio Project.

************************

## Week 3 Goals

**Mon Lecture**

* GitHub + RStudio Projects workflow
* Coding Style
* Exploring `R` Objects

**Wed Lecture**

* Data Joins
* Data reshaping with `tidyr`

---

## GitHub + RStudio Workflow

Once your GitHub repo and RStudio project are synced, here's your workflow:

* **Pull** the most recent version of the repo from GitHub to your RStudio project.

* Do some work on your project in RStudio.

* **Commit** that work.
    + Committing takes a snapshot of all the files in the project.
    + Look over the **Diff**, which shows what has changed since your last commit.
    + Include a quick note, the **Commit Message**, to summarize the motivation for the changes.

* **Push** your commit to GitHub from RStudio.

---

## Workflow Demo

---

## Ignoring Files

* There are several files that we do **NOT** want to push to GitHub.

* These include:
    + `.gitignore`
    + `.DS_Store`

* Add these files to the `.gitignore`.

---

## Test the waters: Let's go through the workflow.

* Pull. (Yes, there is nothing to pull yet but it is always good practice to start here.)
* Click on the ReadMe.
* Add something to the ReadMe.
* Click on the git tab.  Check the box next to the ReadMe.md. Hit commit.
* Put in a commit message.  Look over the diff.  
* Push.

**Look for updates in the ReadMe on GitHub.com.**

---

## Git Collaboration: Merge conflicts

* What if my collaborators and I both make changes? 
    + Scenario: Your collaborator makes changes to a file, commits, and pushes to GitHub. You also modify that file, commit and push.  
    + Result: Your push will fail because there's a commit on GitHub that you don't have.  
    + Usual Solution: Pull and *usually* git will merge their work nicely with yours.  Then push.  If that doesn't work, you have a **merge conflict**.  Let's cross that bridge when we get there.  
    
* How to avoid merge conflicts?
    + Always pull when you are going to work on your project.
    + Always commit and push when you are done even if you made small changes.

---

## Collaboration: Git Style

* **Projects**: Use these to create to-do lists and stay organized.

* **Issues**: A useful way to communicate with your group members.

---

### Time to Take Stock of Learning Goals

#### So far we have:

#### Created **static** data visualizations of multivariate data with `ggplot2`.

#### Created **animated** data visualizations of multivariate data with `gganimate`.

#### Created simple **interactive** data visualizations of multivariate data with `plotly`.

--

#### Wrangled `data.frame`s with `dplyr` and `factor` vectors with `forcats`.

---

###  Looking Ahead to Project 1

#### Goal: Create interactive dashboards with `shinydashboard` and `flexdashboard`.

---

### What We Need First

#### Need to explore more **data objects** in `R`.

#### Need to learn how to interact with and wrangle these **data objects**.

---

### Today's discussion is a mix of

---

## Stat 108 `R` Data Objects So Far:

```r
library(readr)  # provides read_csv()
crash_data <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cambridge_cyclist_ped_crash.csv")
class(crash_data)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
```

---

## Stat 108 `R` Data Objects So Far:

```r
pretty_colors <- c("deeppink3", "cyan4", "darkolivegreen")
class(pretty_colors)
```

```
## [1] "character"
```

```r
size <- 3
class(size)
```

```
## [1] "numeric"
```

These are both examples of what are called **atomic vectors**.

---

## R Objects: Vectors

* **Vectors** are the fundamental building blocks of data in `R`.  
    + Come in **2** flavors!

* **Flavor 1:** **Atomic vectors**
    + Homogeneous: every element has the same type.
    + Confusingly, people usually just call these *vectors*.

* **Flavor 2:** **Generic vectors**
    + Heterogeneous collections of any type of `R` objects.
    + Commonly called *lists*.
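
A minimal sketch of the two flavors, using only base `R`. The object names `scores` and `record` are made up for illustration.

```r
# Atomic vector: every element shares one type
scores <- c(90, 85.5, 77)
typeof(scores)      # "double"
is.atomic(scores)   # TRUE

# Generic vector (list): elements can differ in type and length
record <- list(name = "llama", scores = scores, happy = TRUE)
typeof(record)      # "list"
is.atomic(record)   # FALSE

# Each list element keeps its own class
sapply(record, class)
```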

---

### Deep Dive into (atomic) vectors

#### Let's explore the common types and how to interact with them.

#### This will allow us to do useful things with vectors and to debug our code more effectively!

---

### Flavors of Atomic Vectors

**Logical**: TRUEs and FALSEs

```r
happy <- c(TRUE, FALSE, TRUE, TRUE)
class(happy)
```

```
## [1] "logical"
```

**Numeric**: Integers, real numbers (double-precision floating point numbers)

```r
x <- c(1, 4, 5)
class(x)
```

```
## [1] "numeric"
```

```r
x <- c(1L, 4L, 5L)
class(x)
```

```
## [1] "integer"
```

**Character**: Strings (zero or more characters)

```r
animals <- c("llama", "pig", "cat")
class(animals)
```

```
## [1] "character"
```

**Factors**: Strings restricted to a fixed, ordered set of levels

```r
animals <- as.factor(animals)
class(animals)
```

```
## [1] "factor"
```

---

### Factors versus Characters

**Character**: Strings (zero or more characters)

```r
animals <- c("llama", "pig", "cat")
class(animals)
```

```
## [1] "character"
```

```r
typeof(animals)
```

```
## [1] "character"
```

```r
str(animals)
```

```
##  chr [1:3] "llama" "pig" "cat"
```

**Factors**: Strings restricted to a fixed, ordered set of levels

```r
animals <- as.factor(animals)
class(animals)
```

```
## [1] "factor"
```

```r
typeof(animals)
```

```
## [1] "integer"
```

```r
str(animals)
```

```
##  Factor w/ 3 levels "cat","llama",..: 2 3 1
```

What?!

If you want to read more about this rabbit hole, [go here](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html).
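
One way to see why `typeof()` reports `"integer"`: a factor stores each distinct label once (its levels) plus an integer code per element. The sketch below uses base `R` only; the `years` vector is a hypothetical example of a common gotcha.

```r
animals <- factor(c("llama", "pig", "cat"))

# The labels are stored once, sorted alphabetically by default
levels(animals)        # "cat" "llama" "pig"

# Each element is an integer code pointing into the levels
as.integer(animals)    # 2 3 1

# Converting back to character recovers the original strings
as.character(animals)  # "llama" "pig" "cat"

# Gotcha: number-like labels convert to their codes, not their values
years <- factor(c("2021", "2019", "2023"))
as.numeric(years)                # 2 1 3
as.numeric(as.character(years))  # 2021 2019 2023
```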

---

### Concatenation

Atomic vectors are created and combined with `c()`.

```r
x <- 5
x
```

```
## [1] 5
```

```r
y <- c(4, 1, 7)
y
```

```
## [1] 4 1 7
```

```r
z <- c(y, x, c(20))
z
```

```
## [1]  4  1  7  5 20
```

```r
quick <- c(2:5, 13)
quick
```

```
## [1]  2  3  4  5 13
```
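
Two more properties of `c()` worth trying, sketched with made-up objects (`nested`, `fund`): the result is always flat, and elements can be named as they are combined.

```r
# c() always returns a flat vector, however the pieces are nested
nested <- c(1, c(2, c(3, 4)))
nested          # 1 2 3 4
length(nested)  # 4

# Elements can be named as they are combined
fund <- c(whole_foods = 100, star = 25)
fund <- c(fund, lizzys = 200)
names(fund)     # "whole_foods" "star" "lizzys"
```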

---

### Checking Type

```r
is.logical(c(FALSE, TRUE))
```

```
## [1] TRUE
```

```r
is.logical(c(0, 1))
```

```
## [1] FALSE
```

```r
is.factor(c("llama", "pig", "cat"))
```

```
## [1] FALSE
```

```r
is.character(c("llama", "pig", "cat"))
```

```
## [1] TRUE
```

---

### Checking Type

```r
is.integer(1)
```

```
## [1] FALSE
```

```r
is.double(1)
```

```
## [1] TRUE
```

```r
is.numeric(1)
```

```
## [1] TRUE
```

```r
is.integer(1L)
```

```
## [1] TRUE
```

```r
is.double(1L)
```

```
## [1] FALSE
```

```r
is.numeric(1L)
```

```
## [1] TRUE
```

```r
is.atomic(1)
```

```
## [1] TRUE
```
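
To compare several checks side by side, one option is to run them over a list of test values. This is only a sketch; `values` and `type_summary` are illustrative names.

```r
# A list, so each value keeps its own type
values <- list(1, 1L, "1", TRUE, factor("1"))

type_summary <- data.frame(
  class      = sapply(values, class),
  typeof     = sapply(values, typeof),
  is_numeric = sapply(values, is.numeric),
  stringsAsFactors = FALSE
)
type_summary
```

Note the last row: a factor has `typeof()` `"integer"`, yet `is.numeric()` returns `FALSE` for it.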

---

### Type Coercion

* R will often change the type of a vector with no warning.

* Usually it makes the smart choice.

```r
y <- c(4, 1, 7)
class(y)
```

```
## [1] "numeric"
```

```r
z <- c(y, "cat")
class(z)
```

```
## [1] "character"
```

```r
z <- c(y, 3L)
class(z)
```

```
## [1] "numeric"
```

```r
a <- c(FALSE, 4)
class(a)
```

```
## [1] "numeric"
```

```r
a
```

```
## [1] 0 4
```

```r
b <- c("NA", 4)
class(b)
```

```
## [1] "character"
```
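
The examples above follow a fixed ladder: when types are mixed, `c()` converts everything to the most flexible type present. A short base `R` sketch of that ladder:

```r
# Combining types moves the result up the ladder:
# logical -> integer -> double -> character
typeof(c(TRUE, 2L))             # "integer"
typeof(c(TRUE, 2L, 3.5))        # "double"
typeof(c(TRUE, 2L, 3.5, "a"))   # "character"

# A real missing value does not force a type change; the string "NA" does
typeof(c(NA, 4))     # "double"
typeof(c("NA", 4))   # "character"
is.na(c("NA", 4))    # FALSE FALSE
```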

---

### Operator Coercion

* Functions and operators (like `+`) will often try to convert a vector to an appropriate type.

* Once we are writing our own functions, we will consider building in tests to ensure the user provided the correct type.

.pull-left[

```r
1L + 3.1415
```

```
## [1] 4.1415
```

```r
log(TRUE)
```

```
## [1] 0
```

```r
sum(c(FALSE, TRUE, TRUE))
```

```
## [1] 2
```

]

.pull-right[

```r
TRUE & FALSE
```

```
## [1] FALSE
```

```r
TRUE | FALSE
```

```
## [1] TRUE
```

```r
TRUE & 7
```

```
## [1] TRUE
```

```r
FALSE | !7
```

```
## [1] FALSE
```

]
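
As a preview of building type checks into our own functions, here is one possible sketch. `mean_strict()` is a hypothetical helper, not a function from any package; it uses base `R`'s `stopifnot()` to refuse input that would otherwise be silently coerced.

```r
# Hypothetical helper: refuse non-numeric input instead of coercing it
mean_strict <- function(x) {
  stopifnot(is.numeric(x))
  mean(x)
}

mean(c(TRUE, FALSE, TRUE))   # logicals are silently coerced: 0.6666667
mean_strict(c(4, 1, 7))      # 4

# tryCatch() captures the error so the rest of the script keeps running
result <- tryCatch(
  mean_strict(c(TRUE, FALSE, TRUE)),
  error = function(e) "not numeric"
)
result                       # "not numeric"
```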

---

### Changing Type

* We can also explicitly change the type.

```r
as.logical(c(0, 6.4, 1))
```

```
## [1] FALSE  TRUE  TRUE
```

```r
as.character(c(FALSE, TRUE, TRUE))
```

```
## [1] "FALSE" "TRUE"  "TRUE"
```

```r
as.integer(pi)
```

```
## [1] 3
```

```r
as.numeric(c(FALSE, TRUE, TRUE))
```

```
## [1] 0 1 1
```

```r
as.double(c("01", "02", "03"))
```

```
## [1] 1 2 3
```

```r
as.double(c("one", "two", "three"))
```

```
## [1] NA NA NA
```
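
When a string cannot be parsed as a number, `as.double()` returns `NA` and raises a warning rather than an error. A small sketch for locating the offending entries, assuming a made-up input vector `raw`:

```r
raw <- c("01", "two", "3.5")

# suppressWarnings() hides the coercion warning for this one call
parsed <- suppressWarnings(as.double(raw))
parsed                 # 1.0  NA 3.5

# The NA positions point back to the values that failed to convert
raw[is.na(parsed)]     # "two"
```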

---

## Vectorized

* R is built to work with vectors.

* Many operations are **vectorized**: given a vector as input, they are applied element-wise.

```r
y <- c(4, 1, 7)
y + 4
```

```
## [1]  8  5 11
```

```r
y * 2
```

```
## [1]  8  2 14
```

```r
x <- c(-4, 0, 2)
y + x
```

```
## [1] 0 1 9
```

```r
rnorm(n = 3, mean = c(-5, 0, 5))
```

```
## [1] -4.2706889 -0.6995773  3.5507746
```
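
For comparison, here is what a vectorized operation saves us from writing. This is just a sketch with illustrative names (`doubled_loop`, `doubled_vec`); both versions give the same result.

```r
y <- c(4, 1, 7)

# Loop version: handle one element at a time
doubled_loop <- numeric(length(y))
for (i in seq_along(y)) {
  doubled_loop[i] <- y[i] * 2
}

# Vectorized version: one expression, same result
doubled_vec <- y * 2
identical(doubled_loop, doubled_vec)  # TRUE

# Comparisons are vectorized too
y > 3                                 # TRUE FALSE TRUE
```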

---

* But we need to be careful!

```r
dat <- data.frame(x = c(1, 2, 2, 4, 1, 3), y = c(8, 7, 6, 5, 8, 6))
dat
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 4 5
## 5 1 8
## 6 3 6
```

* We want the rows where `x` equals 1 or 2.
    + What happened?

```r
library(tidyverse)
dat %>%
  filter(x == c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 1 8
```

---

## Recycling

* R recycles the shorter vector when two vectors differ in length.
    + Notice we got NO error, but the result wasn't what we wanted.
    + To test membership in a set of values, use `%in%` instead of `==`.

```r
library(tidyverse)
dat %>%
  filter(x %in% c(1, 2))
```

```
##   x y
## 1 1 8
## 2 2 7
## 3 2 6
## 4 1 8
```
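
The same comparison outside of `filter()`, as a base `R` sketch, shows exactly which rows each condition keeps. Here `x` is the `x` column of `dat` written out by hand.

```r
x <- c(1, 2, 2, 4, 1, 3)

# == compares position by position, recycling c(1, 2) as 1 2 1 2 1 2
x == c(1, 2)      # TRUE TRUE FALSE FALSE TRUE FALSE

# %in% asks, for each element of x, whether it appears anywhere in the set
x %in% c(1, 2)    # TRUE TRUE TRUE FALSE TRUE FALSE

# R only warns when the longer length is not a multiple of the shorter;
# the warning is suppressed here to show the recycled result
suppressWarnings(c(1, 2, 3) + c(10, 20))   # 11 22 13
```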

---

## Indexing a Vector

```r
x <- c(4, 6, 9)
x[1]
```

```
## [1] 4
```

```r
x[-1]
```

```
## [1] 6 9
```

```r
x[c(2, 4)]
```

```
## [1]  6 NA
```

```r
a <- c(2, 4)
x[a]
```

```
## [1]  6 NA
```
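
Positions are not the only way to index. A brief sketch of logical and name-based indexing in base `R`; the names `"a"`, `"b"`, `"c"` are made up for illustration.

```r
x <- c(4, 6, 9)

x[c(TRUE, FALSE, TRUE)]   # a logical index keeps the TRUE positions: 4 9
x[x > 5]                  # a comparison builds that logical index: 6 9
x[-c(1, 2)]               # negative indices drop positions: 9

# Named elements can be pulled out by name
names(x) <- c("a", "b", "c")
x[c("c", "a")]            # c = 9, a = 4
```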

---

##  So what about **generic vectors/lists**?

#### Recall they are heterogeneous collections of any type of `R` objects.

---

### Lists

* Think of these as the most general way to store things.

```r
groceries <- list()
groceries$whole_foods <- c("apples", "chocolate", "kale", "garlic")
groceries$star <- c("vinegar", "soap")
groceries$lizzys <- c("french vanilla", "black cherry", "mint chip")
groceries$budget <- data.frame(stores = c("whole_foods", "star", "lizzys"),
                               fund = c(100, 25, 200))
class(groceries)
```

```
## [1] "list"
```
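
A list can also be built in a single `list()` call rather than element by element. The sketch below uses a shortened version of the grocery list, then inspects it with `names()`, `length()`, and `sapply()`.

```r
groceries <- list(
  whole_foods = c("apples", "chocolate", "kale", "garlic"),
  star        = c("vinegar", "soap"),
  budget      = data.frame(stores = c("whole_foods", "star"),
                           fund = c(100, 25))
)

names(groceries)           # "whole_foods" "star" "budget"
length(groceries)          # 3 top-level elements
sapply(groceries, class)   # "character" "character" "data.frame"
```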

---

### Lists

* Notice the nested structure

```r
groceries
```

```
## $whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"   
## 
## $star
## [1] "vinegar" "soap"   
## 
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"     
## 
## $budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
```

---

```r
holidays <- c("Valentine's", "President's")
feb <- list(groceries = groceries, holidays = holidays)
feb
```

```
## $groceries
## $groceries$whole_foods
## [1] "apples"    "chocolate" "kale"      "garlic"   
## 
## $groceries$star
## [1] "vinegar" "soap"   
## 
## $groceries$lizzys
## [1] "french vanilla" "black cherry"   "mint chip"     
## 
## $groceries$budget
##        stores fund
## 1 whole_foods  100
## 2        star   25
## 3      lizzys  200
## 
## 
## $holidays
## [1] "Valentine's" "President's"
```

---

### Indexing lists

* Distinguishing between `[ ]` and `[[ ]]`: `[ ]` returns a smaller list, while `[[ ]]` extracts the element itself.

```r
thing1 <- feb$groceries[3]
thing1
```

```
## $lizzys
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing1)
```

```
## [1] "list"
```

```r
thing2 <- feb$groceries[[3]]
thing2
```

```
## [1] "french vanilla" "black cherry"   "mint chip"
```

```r
class(thing2)
```

```
## [1] "character"
```
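
The same distinction holds when indexing by name, and extractions can be chained to reach inside nested objects. A self-contained sketch with a shortened, illustrative `groceries` list:

```r
groceries <- list(
  star   = c("vinegar", "soap"),
  budget = data.frame(stores = c("whole_foods", "star"), fund = c(100, 25))
)

class(groceries["star"])      # "list": [ ] keeps the list wrapper
class(groceries[["star"]])    # "character": [[ ]] unwraps the element
identical(groceries$star, groceries[["star"]])  # TRUE

# Chain extractions to reach inside nested objects
groceries[["star"]][2]        # "soap"
groceries$budget$fund[1]      # 100
```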

---