Tidymodels and Wrap-up

background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center,

<br>

## .base_color[Modeling with `tidymodels` and Wrap-Up]
 
<br>
 
<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 13 | Spring 2023]


---

background-image: url("img/ggparty_s23.001.jpeg")
background-size: 80%
class: bottom, center,

### If you are able to attend, please RSVP: [https://bit.ly/ggpartys23](https://bit.ly/ggpartys23)

---

### Announcements

* Extra credit lecture quiz on Gradescope
* Updated [OH Schedule](https://docs.google.com/spreadsheets/d/1HqEmr4tEtFPWRrF5TJHd1VgtVD030w6aAYySoFTnhBw/edit?usp=sharing)
* [Final Presentations + Food](https://docs.google.com/spreadsheets/d/1xU3w4sXQSWkU678YjtdnjyBZOuB6RtFiZeB4fbsZozo/edit?usp=sharing): SC 316
    + Monday, May 8th noon - 2pm: Groups 3, 4, 6, 7, 9, 11, 12, 14, 16, 19, 21
    + Wednesday, May 10th 9 - 11am: Groups 1, 2, 5, 8, 10, 13, 15, 17, 18, 20

************************

### Week's Goals

**Mon Lecture**

* Database querying with SQL


**Wed Lecture**

* Modeling with `tidymodels`


---

### Disclaimer: This is not a modeling class.

* I am going to expose you to some best practices in model building and some model building code via `tidymodels`.

* But I am not really teaching you about the models we are fitting.

* Take a modeling class! 🎉

---

### Example

```r
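# dplyr (%>%, select, mutate, glimpse) and ggplot2 are used throughout
library(tidyverse)

# Read the Idaho forest inventory data directly from the course site
dat <- readRDS(url("https://mcconvil.github.io/fia_workshop_2021/data/IDdata.rds","rb"))
idaho_plots <- dat$pltassgn %>%
  select(VOLCFNET_TPA_ADJ, tcc, tnt, ppt) %>%
  mutate(tnt = factor(tnt))  # treat tnt as categorical
glimpse(idaho_plots)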
```

```
## Rows: 3,753
## Columns: 4
## $ VOLCFNET_TPA_ADJ <dbl> 2471.6238, 108.4613, 399.4410, 5450.5850, 1119.4392, …
## $ tcc              <int> 16, 34, 19, 74, 28, 18, 23, 51, 12, 29, 60, 39, 18, 9…
## $ tnt              <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2,…
## $ ppt              <int> 893, 981, 450, 819, 870, 869, 1143, 696, 1305, 1344, …
```

---

### Exploratory Data Analysis

* Make sure to thoroughly explore your data **before** building models!

---

```r
ggplot(data = idaho_plots, 
       mapping = aes(x = tcc, y = VOLCFNET_TPA_ADJ,
                     color = tnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = lm)
```

---

```r
ggplot(data = idaho_plots, 
       mapping = aes(x = ppt, y = VOLCFNET_TPA_ADJ,
                     color = tnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = lm)
```

---

### Selecting a Modeling Approach

Consider your modeling goals:

* **To describe**

* **To explain**

* **To predict**

Consider the structure of your data:

* Is your response variable **categorical** or **quantitative**?

* How do your predictors relate to the response variable?

Determine how you will **assess** your model:

* Do you care about predictive accuracy?
* Do you care about interpretability?

Be prepared to iterate.

---

### Splitting the Data

We commonly split our data into two sets:

* **Training set**: For developing and optimizing the model
* **Test set**: For determining the quality of the estimated models

The `rsample` package provides functions for creating these splits.

```r
# Split the data using an 80/20 split
idaho_split <- rsample::initial_split(idaho_plots,  prop = 0.8)
idaho_split
```

```
## <Training/Testing/Total>
## <3002/751/3753>
```

```r
idaho_train <- rsample::training(idaho_split)
idaho_test <- rsample::testing(idaho_split)
```
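
Because the split is random, the rows in each set change from run to run unless a seed is set first. Here is a minimal sketch of a reproducible split, assuming the built-in `mtcars` data as a stand-in for `idaho_plots`; the seed value and the `car_*` names are arbitrary choices for illustration.

```r
library(rsample)

# Fix the random seed so the same rows land in each set on every run
set.seed(108)

# mtcars is only a stand-in for idaho_plots
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Every row ends up in exactly one of the two sets
nrow(car_train) + nrow(car_test) == nrow(mtcars)
```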

---

### Fitting the Model

* The `parsnip` package provides a way to interface with a wide range of models.

* The `workflows` package allows you to combine the model building and any data preprocessing together.

```r
lin_reg <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

lin_reg_flow <- workflows::workflow() %>%
  workflows::add_model(lin_reg) %>%
  workflows::add_formula(VOLCFNET_TPA_ADJ ~ tnt*ppt + tnt*tcc)

lin_reg_flow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## VOLCFNET_TPA_ADJ ~ tnt * ppt + tnt * tcc
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---

### Fitting the Model

```r
lin_reg_fit <- parsnip::fit(lin_reg_flow, idaho_train)
lin_reg_fit
```

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## VOLCFNET_TPA_ADJ ~ tnt * ppt + tnt * tcc
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)         tnt2          ppt          tcc   `tnt2:ppt`   `tnt2:tcc`  
##   -795.2268    1443.9284       0.8998      49.0499      -0.8050     -25.1351
```
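
The printed coefficients can also be pulled out as a data frame. With R's default coding for factors, an interaction estimate such as `tnt2:tcc` is the difference in the `tcc` slope between the two `tnt` levels. Below is a minimal sketch using `mtcars` as a stand-in; the `cars`, `car_fit`, and `car_coefs` names and the formula are assumptions for illustration, not part of the Idaho example.

```r
library(dplyr)
library(parsnip)
library(workflows)
library(broom)

# Stand-in data: am (transmission type) plays the role of tnt
cars <- mtcars %>%
  mutate(am = factor(am))

car_fit <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ am * wt) %>%
  fit(data = cars)

# Extract the underlying parsnip fit, then tidy it into one row per term
car_coefs <- car_fit %>%
  extract_fit_parsnip() %>%
  tidy()

car_coefs
```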

---

### Predicting from the Model

```r
predict(lin_reg_fit, idaho_test)
```

```
## # A tibble: 751 × 1
##    .pred
##    <dbl>
##  1 3571.
##  2  870.
##  3 1836.
##  4 2689.
##  5  932.
##  6  962.
##  7 1705.
##  8 1222.
##  9 1874.
## 10 1849.
## # ℹ 741 more rows
```
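
`predict()` returns only the `.pred` column. To see predictions next to the observed response, one option is `augment()`, which attaches the predictions to the data it is given. This is a sketch on `mtcars`, with illustrative names and an illustrative formula:

```r
library(dplyr)
library(rsample)
library(parsnip)
library(workflows)
library(broom)

set.seed(108)
car_split <- initial_split(mtcars, prop = 0.8)

# Fit on the training set only
car_fit <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ wt + hp) %>%
  fit(data = training(car_split))

# Attach test-set predictions to the test data, keeping truth and prediction
car_preds <- augment(car_fit, new_data = testing(car_split)) %>%
  select(mpg, .pred)

car_preds
```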

---

### Evaluating the Model

```r
lin_reg_final <- tune::last_fit(lin_reg_flow, idaho_split)
tune::collect_metrics(lin_reg_final)
```

```
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard    2040.    Preprocessor1_Model1
## 2 rsq     standard       0.273 Preprocessor1_Model1
```
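
`last_fit()` fits the workflow on the training set and computes these metrics on the test set. To make the two metrics concrete, here is a small sketch that computes them by hand and compares with `yardstick`; the `truth` and `estimate` vectors are made-up numbers for illustration only.

```r
library(yardstick)

# Made-up observed values and predictions
truth    <- c(10, 20, 30, 40)
estimate <- c(12, 18, 33, 37)

# RMSE: square root of the mean squared prediction error
rmse_by_hand <- sqrt(mean((truth - estimate)^2))

# rsq: squared correlation between observed and predicted values
rsq_by_hand <- cor(truth, estimate)^2

# Both should agree with yardstick's vector versions
all.equal(rmse_by_hand, rmse_vec(truth, estimate))
all.equal(rsq_by_hand, rsq_vec(truth, estimate))
```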

---

### Why `tidymodels`?

Note: In practice, we should tune the hyperparameters of the random forest with the `tune` package.

```r
rf <- parsnip::rand_forest(trees = 1000) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("ranger")

rf_flow <- workflows::workflow() %>%
  workflows::add_model(rf) %>%
  workflows::add_formula(VOLCFNET_TPA_ADJ ~ tnt + ppt + tcc)

rf_final <- tune::last_fit(rf_flow, idaho_split)
tune::collect_metrics(rf_final)
```

```
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard    2047.    Preprocessor1_Model1
## 2 rsq     standard       0.288 Preprocessor1_Model1
```
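
The tuning mentioned in the note might look like the sketch below. It assumes `mtcars` as stand-in data, tunes only `min_n`, and uses a small hand-picked grid with 5-fold cross-validation; all of these choices, and the object names, are illustrative rather than recommendations.

```r
library(dplyr)
library(rsample)
library(parsnip)
library(workflows)
library(tune)

set.seed(108)

# Cross-validation folds (in a real analysis, build these from the training set)
car_folds <- vfold_cv(mtcars, v = 5)

# Mark min_n as a hyperparameter to be tuned
rf_spec <- rand_forest(trees = 500, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(mpg ~ wt + hp + disp)

# Fit and evaluate each candidate value of min_n on every fold
rf_tuned <- tune_grid(
  rf_wf,
  resamples = car_folds,
  grid = tibble(min_n = c(2, 5, 10))
)

# Keep the candidate with the lowest cross-validated RMSE
best_min_n <- select_best(rf_tuned, metric = "rmse")
rf_wf_final <- finalize_workflow(rf_wf, best_min_n)
```

The finalized workflow can then be passed to `tune::last_fit()` together with a train/test split, just as `rf_flow` is above.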

---

## Learning Goals

---

### Goal: Learn to use several `R` packages.


### Goal: Learn to create our own functions in `R`.


---

### Goal: Wrangle and interact with a variety of data types.

Can now wrangle:

* `data.frame()`s
* Vectors:
    + Numeric
    + Logical
    + Characters with `stringr`
    + Factors with `forcats`
    + Dates and times with `lubridate`
* `list()`s
* Spatial data
* Text data with `tidytext`


---

### Goal: Develop our statistical programming skills.

* Implement code that reflects core ideas of statistical programming, including functions, iteration, control flow, vectorization, debugging, refactoring, and abstraction.

```r
for (i in 1:20) {
  # Iterate over something awesome
}
```


```r
if(magic_eight_ball("Should I take Stat 108?")
  == "Without a doubt") {
    print("Stay seated")
  } else {
    print("Leave and buy a coffee.")
  }
```


* Apply coding habits and a coding style that align with best practices in the field.

---

### Goal: Create data visualizations of multivariate data.

Can now create:

* Static graphs with `ggplot2`.
* Interactive graphs with `plotly` and `leaflet`.
* Reactive graphs with `shiny`.
* Animated graphs with `gganimate`.


---

### Goal: Learn to collaborate and share `R` code and output.

Learned to use and create:

* `R` Markdown documents
* GitHub repositories
* RStudio Projects
* `R` packages


---

## Can't wait to see and to celebrate your Project 2 `R` packages!

## Thanks for a wonderful semester!

### Hope to see you at the `ggparty` on Friday!