background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center,

.pull-right[

<br>

## .base_color[Modeling with `tidymodels` and Wrap-Up]

<br>

<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 13 | Spring 2023]

]

---

background-image: url("img/ggparty_s23.001.jpeg")
background-size: 80%
class: bottom, center,

### If you are able to attend, please RSVP: [https://bit.ly/ggpartys23](https://bit.ly/ggpartys23)

---

### Announcements

* Extra credit lecture quiz on Gradescope
* Updated [OH Schedule](https://docs.google.com/spreadsheets/d/1HqEmr4tEtFPWRrF5TJHd1VgtVD030w6aAYySoFTnhBw/edit?usp=sharing)
* [Final Presentations + Food](https://docs.google.com/spreadsheets/d/1xU3w4sXQSWkU678YjtdnjyBZOuB6RtFiZeB4fbsZozo/edit?usp=sharing): SC 316
    + Monday, May 8th noon - 2pm: Groups 3, 4, 6, 7, 9, 11, 12, 14, 16, 19, 21
    + Wednesday, May 10th 9 - 11am: Groups 1, 2, 5, 8, 10, 13, 15, 17, 18, 20

************************

### Week's Goals

.pull-left[

**Mon Lecture**

* Database querying with SQL

]

.pull-right[

**Wed Lecture**

* Modeling with `tidymodels`

]

---

### Disclaimer: This is not a modeling class.

* I am going to expose you to some best practices in model building and some model building code via `tidymodels`.

* But I am not really teaching you about the models we are fitting.

* Take a modeling class!
🎉

---

### Example

```r
# Assumption: the tidyverse is loaded (e.g., in a hidden setup chunk);
# it provides %>%, select(), mutate(), glimpse(), and ggplot().
library(tidyverse)

dat <- readRDS(url("https://mcconvil.github.io/fia_workshop_2021/data/IDdata.rds","rb"))
idaho_plots <- dat$pltassgn %>%
  select(VOLCFNET_TPA_ADJ, tcc, tnt, ppt) %>%
  mutate(tnt = factor(tnt))
glimpse(idaho_plots)
```

```
## Rows: 3,753
## Columns: 4
## $ VOLCFNET_TPA_ADJ <dbl> 2471.6238, 108.4613, 399.4410, 5450.5850, 1119.4392, …
## $ tcc              <int> 16, 34, 19, 74, 28, 18, 23, 51, 12, 29, 60, 39, 18, 9…
## $ tnt              <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2,…
## $ ppt              <int> 893, 981, 450, 819, 870, 869, 1143, 696, 1305, 1344, …
```

---

### Exploratory Data Analysis

* Make sure to thoroughly explore your data **before** building models!

---

```r
ggplot(data = idaho_plots,
       mapping = aes(x = tcc, y = VOLCFNET_TPA_ADJ, color = tnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = lm)
```

<img src="stat108_wk13wed_files/figure-html/unnamed-chunk-2-1.png" width="576" style="display: block; margin: auto;" />

---

```r
ggplot(data = idaho_plots,
       mapping = aes(x = ppt, y = VOLCFNET_TPA_ADJ, color = tnt)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = lm)
```

<img src="stat108_wk13wed_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

---

### Selecting a Modeling Approach

Consider your modeling goals:

* **To describe**
* **To explain**
* **To predict**

Consider the structure of your data:

* Is your response variable **categorical** or **quantitative**?
* How do your predictors relate to the response variable?

Determine how you will **assess** your model:

* Do you care about predictive accuracy?
* Do you care about interpretability?

Be prepared to iterate.

---

### Splitting the Data

Commonly we will split our data into two sets:

* **Training set**: For developing and optimizing the model
* **Test set**: For determining the quality of the estimated models

The `rsample` package provides functions for creating these splits.
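To make the idea of a training/test split concrete, here is a minimal base-`R` sketch done by hand on the built-in `mtcars` data. It only illustrates the concept and is not how `rsample` is implemented; the seed value and the object names (`cars_train`, `cars_test`) are assumptions chosen for this example.

```r
# Minimal base-R sketch of an 80/20 train/test split (illustrative only;
# not how rsample works internally).
set.seed(108)                        # assumed seed, used only for reproducibility

n_rows    <- nrow(mtcars)            # mtcars ships with R and has 32 rows
n_train   <- floor(0.8 * n_rows)     # size of the training set
train_idx <- sample(seq_len(n_rows), size = n_train)  # sample rows without replacement

cars_train <- mtcars[train_idx, ]    # rows for developing the model
cars_test  <- mtcars[-train_idx, ]   # held-out rows for assessing it

# Every row lands in exactly one of the two sets
c(train = nrow(cars_train), test = nrow(cars_test))
```

Setting a seed before a random split makes the split, and every metric computed from it, reproducible across runs.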
```r
# Split the data using an 80/20 split
idaho_split <- rsample::initial_split(idaho_plots, prop = 0.8)
idaho_split
```

```
## <Training/Testing/Total>
## <3002/751/3753>
```

```r
idaho_train <- rsample::training(idaho_split)
idaho_test <- rsample::testing(idaho_split)
```

---

### Fitting the Model

* The `parsnip` package provides a way to interface with a wide range of models.

* The `workflows` package allows you to combine the model building and any data preprocessing together.

```r
lin_reg <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

lin_reg_flow <- workflows::workflow() %>%
  workflows::add_model(lin_reg) %>%
  workflows::add_formula(VOLCFNET_TPA_ADJ ~ tnt*ppt + tnt*tcc)

lin_reg_flow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## VOLCFNET_TPA_ADJ ~ tnt * ppt + tnt * tcc
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---

### Fitting the Model

```r
lin_reg_fit <- parsnip::fit(lin_reg_flow, idaho_train)
lin_reg_fit
```

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## VOLCFNET_TPA_ADJ ~ tnt * ppt + tnt * tcc
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)         tnt2          ppt          tcc   `tnt2:ppt`   `tnt2:tcc`
##   -795.2268    1443.9284       0.8998      49.0499      -0.8050     -25.1351
```

---

### Predicting from the Model

```r
predict(lin_reg_fit, idaho_test)
```

```
## # A tibble: 751 × 1
##    .pred
##    <dbl>
##  1 3571.
##  2  870.
##  3 1836.
##  4 2689.
##  5  932.
##  6  962.
##  7 1705.
##  8 1222.
##  9 1874.
## 10 1849.
## # ℹ 741 more rows
```

---

### Evaluating the Model

```r
lin_reg_final <- tune::last_fit(lin_reg_flow, idaho_split)

tune::collect_metrics(lin_reg_final)
```

```
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config
##   <chr>   <chr>          <dbl> <chr>
## 1 rmse    standard    2040.    Preprocessor1_Model1
## 2 rsq     standard       0.273 Preprocessor1_Model1
```

---

### Why `tidymodels`?

Note: In practice, the hyperparameters of the random forest should be tuned with `tune`.

```r
rf <- parsnip::rand_forest(trees = 1000) %>%
  parsnip::set_mode("regression") %>%
  parsnip::set_engine("ranger")

rf_flow <- workflows::workflow() %>%
  workflows::add_model(rf) %>%
  workflows::add_formula(VOLCFNET_TPA_ADJ ~ tnt + ppt + tcc)

rf_final <- tune::last_fit(rf_flow, idaho_split)

tune::collect_metrics(rf_final)
```

```
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config
##   <chr>   <chr>          <dbl> <chr>
## 1 rmse    standard    2047.    Preprocessor1_Model1
## 2 rsq     standard       0.288 Preprocessor1_Model1
```

---

class: middle, center

## Learning Goals

<img src="img/logo_padded.001.jpeg" width="40%" style="display: block; margin: auto;" />

---

class: center

.pull-left[

### Goal: Learn to use several `R` packages.

<img src="img/box.png" width="45%" style="display: block; margin: auto;" />

]

--

.pull-left[

### Goal: Learn to create our own functions in `R`.

<img src="img/homemade.png" width="70%" style="display: block; margin: auto;" />

]

---

### Goal: Wrangle and interact with a variety of data types.

.pull-left[

Can now wrangle:

* `data.frame()`s
* Vectors:
    + Numeric
    + Logical
    + Characters with `stringr`
    + Factors with `forcats`
    + Dates and times with `lubridate`
* `list()`s
* Spatial data
* Text data with `tidytext`

]

.pull-right[

<img src="img/STAT108Logo_Wrangling.png" width="60%" style="display: block; margin: auto;" />

]

---

### Goal: Develop our statistical programming skills.
* Implement code that reflects core ideas of statistical programming, including functions, iteration, control flow, vectorization, debugging, refactoring, and abstraction.

.pull-left[

```r
for(i in 1:20){
  Iterate_over_something_awesome
}
```

]

.pull-right[

```r
if(magic_eight_ball("Should I take Stat 108?") == "Without a doubt") {
  print("Stay seated")
} else {
  print("Leave and buy a coffee.")
}
```

]

* Apply coding habits and a coding style that align with best practices in the field.

<img src="img/STAT108Logo_Programming.png" width="20%" style="display: block; margin: auto;" />

---

### Goal: Create data visualizations of multivariate data.

.pull-left[

Can now create:

* Static graphs with `ggplot2`.
* Interactive graphs with `plotly` and `leaflet`.
* Reactive graphs with `shiny`.
* Animated graphs with `gganimate`.

]

.pull-right[

<img src="img/STAT108Logo_Viz.png" width="60%" style="display: block; margin: auto;" />

]

---

### Goal: Learn to collaborate and share `R` code and output.

.pull-left[

Learned to use and create:

* `R` Markdown documents
* GitHub repositories
* RStudio Projects
* `R` packages

]

.pull-right[

<img src="img/STAT108Logo_Sharing.png" width="60%" style="display: block; margin: auto;" />

]

---

class: middle, center

## Can't wait to see and to celebrate your Project 2 `R` packages!

## Thanks for a wonderful semester!

### Hope to see you at the `ggparty` on Friday!