background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Data Wrangling]

<br>
<br>

### .purple[Kelly McConville]

#### .purple[Stat 100 | Week 4 | Fall 2022]

]

---
## Announcements

* Don't forget to come by [office hours](https://docs.google.com/spreadsheets/d/1eHOdLQGw3mEEvahOM1cKi9p0eu_2TdI6G8b4vdb9hqE/edit?usp=sharing) to ask questions or to work in a collaborative setting.

****************************

--

## Goals for Today

.pull-left[
* Discuss the Project and Project Assignment 1.
]

--

.pull-right[
* Go through the main data wrangling operations in `dplyr`.
]

---
## Stat 100 Group Project

* Goal: Practice working through the data analysis process with a real dataset and research question(s).

--

* Structure:
    + Groups of 2-3 people
    + Key due dates [here](https://docs.google.com/spreadsheets/d/19YaBdfSqF_vsMPidiyRQ349gfUMxMpV8OM29xokG3Jw/edit?usp=sharing)
    + Three intermediate assignments and a final 10-minute video presentation

--

* Project Assignment 1:
    + Pick a dataset and research questions.
    + Explore and visualize the data to start answering the research questions.
    + Due on Gradescope: Friday, October 7th at 5pm.
    + Only one person from the group needs to turn it in.

---
## dplyr for Data Wrangling

* Six common wrangling verbs:
    + `select()`
    + `mutate()`
    + `filter()`
    + `arrange()`
    + `summarize()`
    + `*_join()`

* One action:
    + `group_by()`

---
## Data Setting: [BLS Consumer Expenditure Survey](https://www.bls.gov/cex/)

Last quarter of the 2016 Bureau of Labor Statistics Consumer Expenditure Survey.
```r
library(tidyverse)
ce <- read_csv("~/shared_data/stat100/data/ce.csv")
glimpse(ce)
```

```
## Rows: 6,301
## Columns: 49
## $ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3…
## $ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10…
## $ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1…
## $ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "…
## $ EDUCA2   <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1…
## $ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4…
## $ AGE2     <chr> "50", "47", "46", ".", ".", ".", ".", "36", "53", "67", "44",…
## $ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1…
## $ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,…
## $ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,…
## $ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,…
## $ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2…
## $ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1…
## $ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3…
## $ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "…
## $ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ STUDFINX <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", "…
## $ IRAX     <chr> "1000000", "10000", "0", ".", ".", "0", "0", "15000", ".", "4…
## $ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4…
## $ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3…
## $ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3…
## $ ROOMSQ   <chr> "8", "5", "6", "4", "4", "4", "7", "5", "4", "9", "6", "10", …
## $ INC_HRS1 <chr> "40", "40", "40", "44", "40", ".", ".", "40", "40", ".", "40"…
## $ INC_HRS2 <chr> "30", "40", "52", ".", ".", ".", ".", "40", "40", ".", "65", …
## $ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1…
## $ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1…
## $ OCCUCOD1 <chr> "03", "03", "05", "03", "04", NA, NA, "12", "04", NA, "01", N…
## $ OCCUCOD2 <chr> "04", "02", "01", NA, NA, NA, NA, "02", "03", NA, "11", NA, N…
## $ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", NA, "27", "25…
## $ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, …
## $ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141…
## $ CREDFINX <chr> "0", ".", "0", ".", "5", ".", ".", ".", ".", "0", ".", "0", "…
## $ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CREDITX  <chr> "4000", "5000", "2000", ".", "7000", "1800", ".", "6000", "."…
## $ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "…
## $ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
```

---
class: middle, center

## Now let's go through the Data Wrangling handout!

---
## Naming Wrangled Data

--

Should I name my new dataframe `ce` or `ce1`?

+ *My* answer:
    + Is your new dataset structurally different? If so, give it a **new name**.
    + Are you removing values you will need for a future analysis within the same document? If so, give it a **new name**.
    + Are you just adding to or cleaning the data? If so, then **write over** the original.

---
### Sage Advice from ModernDive

> "Crucial: Unless you are very confident in what you are doing, it is worthwhile not starting to code right away. Rather, first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won't confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code)."

--

We'll practice writing pseudocode in section this week!
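---
## From Pseudocode to dplyr

To make the pseudocode-first advice concrete, here is a minimal sketch of the workflow on a made-up toy tibble (the column names echo the `ce` data, but the values are invented for illustration, so the numbers mean nothing):

```r
library(dplyr)

# Pseudocode first:
#   1. Keep rows with positive income       -> filter()
#   2. Group households by region           -> group_by()
#   3. Compute mean income within each group -> summarize()

# Hypothetical toy data (NOT the real ce dataset)
toy <- tibble(
  REGION   = c(1, 1, 2, 2),
  FINCBTAX = c(50000, 70000, 0, 40000)
)

by_region <- toy %>%
  filter(FINCBTAX > 0) %>%
  group_by(REGION) %>%
  summarize(mean_income = mean(FINCBTAX))

by_region
# REGION 1 -> 60000; REGION 2 -> 40000
```

Writing the numbered steps before touching the keyboard keeps the algorithm (what you want) separate from the `dplyr` syntax (how you get it).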