background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

.pull-right[

## .base-blue[Introduction to Modeling]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 5 | Fall 2022]

]

---

## Announcements

* P-Set 3 due Friday at 5pm this week.

****************************

--

## Goals for Today

.pull-left[

* Introduce statistical modeling

* Simple linear regression model

]

.pull-right[

* Measuring correlation

]

---

## Conclusions, Conclusions

<img src="img/week4.005.jpeg" width="80%" style="display: block; margin: auto;" />

---

.left-column[

## Recap

]

.right-column[

<img src="img/DAW.jpeg" width="70%" style="display: block; margin: auto;" />

]

---

### Typical Analysis Goals

**Descriptive**: Want to estimate quantities related to the population.

→ How many trees are in Alaska?

**Predictive**: Want to predict the value of a variable.

→ Can I use remotely sensed data to predict forest types in Alaska?

**Causal**: Want to determine if changes in a variable cause changes in another variable.

→ Are insects causing the increased mortality rates for pinyon-juniper woodlands?

--

We will focus mainly on **descriptive/causal modeling** in this course. If you want to learn more about **predictive modeling**, take Stat 121A: Data Science 1 + Stat 121B: Data Science 2.

---

## Form of the Model

<br><br><br>

--

$$
y = f(x) + \epsilon
$$

<br><br><br>

--

**Goal:**

→ Determine a reasonable form for `\(f()\)`. (Ex: Line, curve, ...)

--

→ Estimate `\(f()\)` with `\(\hat{f}()\)` using the data.

--

→ Generate predicted values: `\(\hat{y} = \hat{f}(x)\)`.

---

### Simple Linear Regression Model

Consider this model when:

--

* Response variable `\((y)\)`: quantitative

--

* Explanatory variable `\((x)\)`: quantitative
    + Have only ONE explanatory variable.

--

* AND, `\(f()\)` can be approximated by a line.
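--

When `\(f()\)` is a line, the general form `\(y = f(x) + \epsilon\)` can be written with an intercept and a slope. The symbols `\(\beta_0\)` and `\(\beta_1\)` are conventional notation assumed here for illustration; they have not been introduced on these slides:

$$
y = \beta_0 + \beta_1 x + \epsilon
$$

Estimating the two coefficients from the data would then give predicted values `\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)`, which is one instance of `\(\hat{y} = \hat{f}(x)\)`.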
---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

"The social contract of Halloween is simple: Provide adequate treats to costumed masses, or be prepared for late-night tricks from those dissatisfied with your offer. To help you avoid that type of vengeance, and to help you make good decisions at the supermarket this weekend, we wanted to figure out what Halloween candy people most prefer. So we devised an experiment: Pit dozens of fun-sized candy varietals against one another, and let the wisdom of the crowd decide which one was best."

-- Walt Hickey

--

"While we don’t know who exactly voted, we do know this: 8,371 different IP addresses voted on about 269,000 randomly generated matchups. So, not a scientific survey or anything, but a good sample of what candy people like."

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

<img src="img/candy_ex.png" width="80%" style="display: block; margin: auto;" />

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

```r
library(tidyverse)  # provides read_csv(), glimpse(), and %>%
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
glimpse(candy)
```

```
## Rows: 85
## Columns: 13
## $ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
## $ chocolate        <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ fruity           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ caramel          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ peanutyalmondy   <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nougat           <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ hard             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
## $ bar              <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ pluribus         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
## $ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
## $ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
```

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

.pull-left[

* Linear trend?

* Direction of trend?

]

.pull-right[

<img src="stat100_wk05wed_files/figure-html/candy-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

.pull-left[

**A simple linear regression model would be suitable for these data.**

* But first, let's describe more plots!

]

.pull-right[

<img src="stat100_wk05wed_files/figure-html/candy2-1.png" width="768" style="display: block; margin: auto;" />

]

---

<img src="stat100_wk05wed_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />

--

#### Need a summary statistic that quantifies the strength and direction of the linear trend!

---

## (Sample) Correlation Coefficient

* Measures the **strength** and **direction** of the **linear** relationship between two quantitative variables

--

* Symbol: `\(r\)`

--

* Always between -1 and 1

--

* Sign indicates the direction of the relationship

--

* Magnitude indicates the strength of the linear relationship

--

```r
candy %>%
  summarize(cor = cor(pricepercent, winpercent))
```

```
## # A tibble: 1 × 1
##     cor
##   <dbl>
## 1 0.345
```

---

.pull-left[

<img src="stat100_wk05wed_files/figure-html/unnamed-chunk-9-1.png" width="540" style="display: block; margin: auto;" />

]

.pull-right[

Any guesses on the correlations for A, B, C, or D?
]

--

.pull-right[

```r
dat %>%
  summarize(A = cor(x, y1), B = cor(x, y2),
            C = cor(x, y3), D = cor(x, y4))
```

```
## # A tibble: 1 × 4
##       A      B      C      D
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1 0.695 -0.217 -0.815 -0.113
```

]

---

## New Example

.pull-left[

```r
# Correlation coefficients
dat2 %>%
  group_by(dataset) %>%
  summarize(cor = cor(x, y))
```

```
## # A tibble: 13 × 2
##    dataset        cor
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
```

]

--

.pull-right[

* Should we conclude that `\(x\)` and `\(y\)` have the same relationship across these different datasets because the correlations are nearly identical?

]

---

### Always graph the data when exploring relationships!

<img src="stat100_wk05wed_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />
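---

A minimal sketch of why the graph matters, using made-up data rather than any dataset from these slides (`x`, `y_line`, and `y_curve` are hypothetical names). Both relationships below are exact, but the correlation coefficient only picks up the linear one:

```r
# Hypothetical illustration data: x values symmetric around 0
x <- seq(-3, 3, by = 0.5)

# One exactly linear relationship and one exactly quadratic relationship
y_line  <- 2 * x + 1
y_curve <- x^2

# The correlation coefficient measures only the linear part of a relationship
r_line  <- cor(x, y_line)   # equals 1 (up to floating-point rounding)
r_curve <- cor(x, y_curve)  # equals 0 (up to floating-point rounding)

print(round(c(line = r_line, curve = r_curve), 3))

# Plot both relationships side by side to see what r alone hides
par(mfrow = c(1, 2))
plot(x, y_line,  main = "Linear")
plot(x, y_curve, main = "Quadratic")
```

The quadratic panel shows a strong pattern even though its correlation is zero, so a small `\(r\)` by itself does not mean the variables are unrelated.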