background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

.pull-right[

## .base-blue[Inference]
## .base-blue[for]
## .base-blue[Linear Regression]

<br>

### .purple[Kelly McConville]
#### .purple[ Stat 100 | Week 14 | Fall 2022]

]

---
class: middle, center

### If you are able to attend, please RSVP here:

### [bit.ly/stat100-ggparty](https://bit.ly/stat100-ggparty)

---

### Announcements

* Extra Credit Lecture Quiz due Dec 1st.
* P-Set 9 (the LAST p-set) is due Th, Dec 1st!
* Project Assignment 3 is due Tu, Dec 6th.
    + Also provide [feedback on work distribution](https://forms.gle/EJ1zgcC3c5rNrn9f7).
* Will get review sheet on Wed.

****************************

--

### Goals for Today

.pull-left[

* **Sim-based** versus **theory-based** inference

* Recap **multiple linear regression**

* Check **assumptions** for linear regression

]

.pull-right[

* **Hypothesis testing** for linear regression

* **Estimation** and **prediction** inference for linear regression

]

---
class: , middle, center

## Please make sure to fill out the Stat 100 Course Evaluations.

### We appreciate constructive feedback.

### For all of your course evaluations, be mindful of [unconscious and unintentional biases](https://benschmidt.org/profGender).

---
background-image: url("img/hyp_testing_diagram.png")
background-position: contain
background-size: 80%

### We Have Learned Two Routes to Statistical Inference

--

Which is **better**?

---

## Is Simulation-Based Inference or Theory-Based Inference Better?

--

Depends on how you define **better**.

.pull-left[
* If **better** = Leads to better understanding:
]

--

.pull-right[
→ Research tends to show students have a better understanding of **p-values** and **confidence intervals** from learning simulation-based methods.
]

--

.pull-left[
* If **better** = More flexible/robust to assumptions:
]

--

.pull-right[
→ The simulation-based methods tend to be more flexible, but that generally requires learning extensions beyond what we've seen in Stat 100.
]

--

.pull-left[
* If **better** = More commonly used:
]

--

.pull-right[
→ Definitely the theory-based methods, but the simulation-based methods are becoming more common.
]

Good to be comfortable with both, as you will find both approaches used in journal and news articles!

---
class: center, middle

## What does statistical inference (estimation and hypothesis testing) look like when I have more than 0 or 1 explanatory variables?

--

### One route: Multiple Linear Regression!

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

* Multiple explanatory variables.

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

**In P-Set 9, Problem 5** you are exploring the importance of controlling for other explanatory variables when you try to make inferences about relationships.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?
Also often asked as "Controlling for `\(x_1\)` and `\(x_3\)`, is there evidence that `\(x_2\)` has a relationship with `\(y\)`?" $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ In other words, should `\(\beta_2 = 0\)`? --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Fitted Model:** Using the Method of Least Squares, $$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$ #### Typical Inferential Questions: (2) Can we estimate `\(\beta_3\)` with a confidence interval? $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Fitted Model:** Using the Method of Least Squares, $$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$ #### Typical Inferential Questions: (3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`? $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ -- To answer these questions, we need to add some **assumptions** to our linear regression model. --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Additional Assumptions:** $$ \epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon}) $$ `\(\sigma_{\epsilon}\)` = typical deviations from the model -- Let's unpack these assumptions! --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The cases are independent of each other. -- **Question**: How do we check this assumption? -- Look at how the data were collected. My collaborator, Paolo Maranzano, talked to us about models that assume spatial dependence and temporal dependence. --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The errors are normally distributed. -- .pull-left[ **Question**: How do we check this assumption? ] -- .pull-right[ Recall the residual: `\(e = y - \hat{y}\)` ] -- **QQ-plot:** Plot the residuals against the quantiles of a normal distribution! .pull-left[ <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" /> ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" /> ] --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The points will, on average, fall on the line. 
--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check. It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

**Assumption**: The variability in the errors is constant.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

**Assumption**: The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The model form is appropriate.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot(s)

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

**Assumption**: The model form is appropriate.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

--

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ Can now start answering your inference questions!

***********************************

### Let's now look at an example and learn how to create qq-plots and residual plots in `R`.
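---

### Aside: Building These Plots Directly with `ggplot2`

Before turning to the example, here is a minimal sketch of how these two diagnostic plots can be built directly with `ggplot2`, assuming you already have a fitted `lm` object called `mod` (a placeholder name here; we fit the actual model in the example that follows):

```r
library(ggplot2)

# Collect the fitted values and residuals from the (assumed) fitted model `mod`
diagnostics <- data.frame(
  fitted    = fitted(mod),
  residuals = resid(mod)
)

# qq-plot: compare the residuals to the quantiles of a normal distribution
ggplot(diagnostics, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line()

# Residual plot: look for constant spread around 0 and no leftover pattern
ggplot(diagnostics, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed")
```

The `gglm` package on the next slides wraps these same ideas into one-line helpers.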
--- ### Example: COVID and Candle Ratings [Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that made the rounds on Twitter: <img src="img/kate_petrova_candles.jpg" width="600px"/> --- ### COVID and Candle Ratings She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question: → Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? -- In other words, do we have evidence that we should allow the slopes to vary? .pull-left[ ```r ggplot(data = all, mapping = aes(x = Date, y = Rating, color = Type)) + geom_point(alpha = 0.4) + geom_smooth(method = lm) + theme(legend.position = "bottom") ``` #### Checking assumptions: **Assumption**: The cases are independent of each other. ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/candles-1.png" width="768" style="display: block; margin: auto;" /> ] --- ## Assumption Checking in `R` The `R` package we will use to check model assumptions is called `gglm` and was written by one of my former Reed students, Grayson White. <img src="https://raw.githubusercontent.com/graysonwhite/gglm/master/figs/gglm.gif" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" /> ```r library(gglm) ``` First need to fit the model: ```r glimpse(all) ``` ``` ## Rows: 612 ## Columns: 3 ## $ Date <date> 2020-01-20, 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 20… ## $ Rating <dbl> 4.500000, 4.500000, 3.909091, 4.857143, 4.461538, 4.800000, 4.4… ## $ Type <chr> "scented", "scented", "scented", "scented", "scented", "scented… ``` ```r mod <- lm(Rating ~ Date * Type, data = all) ``` --- ### qq-plot **Assumption**: The errors are normally distributed. .pull-left[ ```r # Normal qq plot ggplot(data = mod) + stat_normal_qq() ``` ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/qqplot-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Residual Plot **Assumption**: The variability in the errors is constant. **Assumption**: The model form is appropriate. .pull-left[ ```r # Residual plot ggplot(data = mod) + stat_fitted_resid() ``` ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/residual-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? For the moment, let's focus on the equal slopes model. ```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` -- **In General**: $$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$ $$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? 
```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **For our Example**: **Row 2**: $$ H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model} $$ $$ H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? ```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **For our Example**: **Row 3**: $$ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} $$ $$ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? **In General**: $$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$ $$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$ Test Statistic: -- $$ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors}) $$ when `\(H_o\)` is true and the model assumptions are met. --- **For our Example**: ```r get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **Row 3**: $$ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} $$ $$ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} $$ Test Statistic: -- $$ t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2 $$ with p-value `\(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)` -- There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date. --- ### Example Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-22-1.png" width="576" style="display: block; margin: auto;" /> --- ### Example Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? ```r mod <- lm(Rating ~ Date * Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 4 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 52.7 9.09 5.80 0 34.9 70.6 ## 2 Date -0.003 0 -5.40 0 -0.004 -0.002 ## 3 Type: unscented -32.6 12.9 -2.52 0.012 -58.0 -7.24 ## 4 Date:Typeunscented 0.002 0.001 2.59 0.01 0 0.003 ``` --- class: , middle, center ## Now let's shift our focus to estimation and prediction! 
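---

### Connecting Testing and Estimation

The tests we just did and the interval estimates coming up are built from the same ingredients: an estimate, its standard error, and a t distribution. Here is a small sketch that reproduces the Row 3 results by hand, using the equal-slopes values printed above (estimate 0.831, standard error 0.063) and the `n = 612` candles from the `glimpse()` output; the degrees of freedom below are an assumption based on subtracting the 3 estimated coefficients:

```r
beta_hat <- 0.831    # estimate for the Type coefficient (from the table above)
se_beta  <- 0.063    # its standard error (from the table above)
df       <- 612 - 3  # n minus the number of estimated coefficients

# Test statistic and two-sided p-value for H_o: beta_2 = 0
t_stat  <- (beta_hat - 0) / se_beta                          # about 13.2
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # essentially 0

# 95% confidence interval: estimate +/- t* x SE
t_star <- qt(0.975, df = df)
c(lower = beta_hat - t_star * se_beta,
  upper = beta_hat + t_star * se_beta)                       # about (0.707, 0.955)
```

The interval endpoints match the `lower_ci` and `upper_ci` columns that `get_regression_table()` reports.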
---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?

--

Confidence Interval Formula:

--

`\begin{align*}
\mbox{statistic} & \pm ME \\
\hat{\beta}_j & \pm t^* SE(\hat{\beta}_j)
\end{align*}`

--


```r
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012  -58.0     -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single, </span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95% CI for the average rating of unscented candles on Oct 31st, 2020.

--


```r
new <- data.frame(Type = "unscented", Date = as.Date("2020-10-31"))
predict(mod, new, interval = "confidence", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 4.286152 4.598604
```

--

**Interpretation**: We are 95% confident that the average rating of an unscented candle on Amazon on Halloween 2020 was between 4.29 and 4.60.

---

### PI for a new Y at a given level of X:

Say we want to construct a 95% PI for the rating of an **individual** unscented candle on Oct 31st, 2020.

* Predicting for a new observation, not the mean!


```r
predict(mod, new, interval = "prediction", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 2.911587 5.973169
```

**Interpretation**: For unscented candles on Amazon, we expect 95% of the Halloween 2020 ratings to be between 2.91 and 5.97.

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

* Big question! Take **Stat 139: Linear Models** to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*}
R^2 &= \mbox{Percent of total variation in y explained by the model}\\
&= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.


```r
library(broom)
mod1 <- lm(Rating ~ Date, data = all)
mod2 <- lm(Rating ~ Type, data = all)
mod3 <- lm(Rating ~ Date + Type, data = all)
mod4 <- lm(Rating ~ Date * Type, data = all)
```

---

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.
```r glance(mod1) ``` ``` ## # A tibble: 1 × 12 ## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 0.0326 0.0310 0.883 20.6 6.92e-6 1 -791. 1588. 1601. 475. 610 ## # … with 1 more variable: nobs <int>, and abbreviated variable names ## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual ``` ```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` --- ```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod4) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.256 0.252 0.775 69.7 9.42e-39 3 -711. 1431. 1454. 366. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` **Problem:** As we add predictors, the `\(R^2\)` value will only increase. --- ### Comparing Models with `\(R^2\)` **Problem:** As we add predictors, the `\(R^2\)` value will only increase. And in [Week 6, we said](https://mcconvil.github.io/stat100f22/stat100_wk07wed.html#34): **Guiding Principle**: Occam's Razor for Modeling > "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive --- ### Comparing Models with the Adjusted `\(R^2\)` **New Measure of quality**: Adjusted `\(R^2\)` (Coefficient of Determination) `\begin{align*} \mbox{adj} R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right) \end{align*}` where `\(p\)` is the number of explanatory variables in the model. -- Now we will penalize larger models. -- **Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`. --- **Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`. 
```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod4) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.256 0.252 0.775 69.7 9.42e-39 3 -711. 1431. 1454. 366. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ```
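---

### Comparing Models with the Adjusted `\(R^2\)`

Reading four separate `glance()` printouts makes the comparison harder than it needs to be. Here is a compact sketch that collects the `\(R^2\)` and adjusted `\(R^2\)` values into one table, assuming `mod1` through `mod4` from the chunks above are still in your environment:

```r
library(broom)
library(dplyr)
library(purrr)

# One row per candidate model, keeping only the fit measures compared here
model_fits <- list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4)

map_dfr(model_fits, glance, .id = "model") %>%
  select(model, r.squared, adj.r.squared)
```

Based on the output shown above, `mod4` (the varying-slopes model) has the highest adjusted `\(R^2\)`, even after the penalty for its extra term.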