background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Inference]
## .whitish[for]
## .whitish[Linear Regression]

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 12 | Spring 2022]

]

---

### Announcements

* Project Assignment 3 is due Friday, April 22nd at 5pm
* Final Project Assignment
    + Create 5-10 minute video summary of your analyses
    + Due Friday, May 6th

****************************

--

### Goals for Today

.pull-left[
* Recap linear regression
* Checking **assumptions** for linear regression
]

.pull-right[
* Hypothesis testing for linear regression
* Estimation and prediction inference for linear regression
]

---
class: inverse, middle, center

## Please make sure to fill out the Stat 100 Course Evaluations.

### We appreciate constructive feedback.

### For all of your course evaluations be mindful of [unconscious and unintentional biases](https://benschmidt.org/profGender).

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.
* Multiple explanatory variables.
* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

In other words, should `\(\beta_2 = 0\)`?
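A quick sketch of how question (1) plays out in R: fit the full model and read off the t-test for the `x2` coefficient. The data below are simulated, and all variable names (`y`, `x1`, `x2`, `x3`) are placeholders, not the candle data used later in these slides.

```r
# Simulated illustration of question (1): should x2 stay in a model
# that already contains x1 and x3? (All names here are placeholders.)
set.seed(100)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)  # data generated with beta_2 = 0

mod <- lm(y ~ x1 + x2 + x3)

# The x2 row reports the t statistic and p-value for H_o: beta_2 = 0,
# assuming x1 and x3 remain in the model.
summary(mod)$coefficients["x2", ]
```

The same test shows up in the `x2` row of `get_regression_table()` from the moderndive package.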
---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

--

To answer these questions, we need to add some **assumptions** to our linear regression model.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Additional Assumptions:**

$$
\epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon})
$$

`\(\sigma_{\epsilon}\)` = typical deviations from the model

--

Let's unpack these assumptions!

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The cases are independent of each other.

--

**Question**: How do we check this assumption?
--

Look at how the data were collected. Checking this assumption generally relies on random sampling or random assignment having been used.

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The errors are normally distributed.

--

.pull-left[
**Question**: How do we check this assumption?
]

--

.pull-right[
Recall the residual: `\(e = y - \hat{y}\)`
]

--

**QQ-plot:** Plot the residuals against the quantiles of a normal distribution!

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The points will, on average, fall on the line.

--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check. It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

**Assumption**: The variability in the errors is constant.

--

**Question**: How do we check this assumption?
--

**One option**: Scatterplot

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

**Assumption**: The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The model form is appropriate.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot(s)

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

**Assumption**: The model form is appropriate.
**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

--

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ Can now start answering your inference questions!

---

### Example: COVID and Candle Ratings

[Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that has been making the rounds on Twitter:

<img src="img/kate_petrova_candles.jpg" width="600px"/>

---

### COVID and Candle Ratings

She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question:

→ Do we have evidence that we should allow the slopes to vary?

--

.pull-left[

```r
ggplot(data = all,
       mapping = aes(x = Date, y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = lm) +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/candles-1.png" width="768" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

### But before we try to answer that question formally, let's practice checking linear regression assumptions with the "inferenceModeling.Rmd" handout!

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

**For our Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

**For our Example**:

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors} - 1)
$$

when `\(H_o\)` is true and the model assumptions are met. (The extra 1 in the degrees of freedom accounts for the estimated intercept.)

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For our Example**:

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2
$$

with p-value `\(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)`

--

There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date.

---

### Example

Do we have evidence that we should allow the slopes to vary?

<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-16-1.png" width="576" style="display: block; margin: auto;" />

---

### Example

Do we have evidence that we should allow the slopes to vary?

```r
mod <- lm(Rating ~ Date*Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012   -58.0    -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---
class: inverse, middle, center

## Now let's shift our focus to estimation and prediction!

---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?
--

Confidence Interval Formula:

--

`\begin{align*}
\mbox{statistic} & \pm ME \\
\hat{\beta}_j & \pm t^* SE(\hat{\beta}_j)
\end{align*}`

--

```r
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012   -58.0    -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single, </span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95\% CI for the average rating of unscented candles on Oct 31st, 2020.

--

```r
new <- data.frame(Type = "unscented", Date = as.Date("2020-10-31"))
predict(mod, new, interval = "confidence", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 4.286152 4.598604
```

--

**Interpretation**: We are 95\% confident that the average rating of an unscented candle on Amazon on Halloween 2020 was between 4.29 and 4.60.

---

### PI for a new Y at a given level of X:

Say we want to construct a 95\% PI for the rating of an **individual** unscented candle on Oct 31st, 2020.

* Predicting for a new observation, not the mean!
```r
predict(mod, new, interval = "prediction", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 2.911587 5.973169
```

**Interpretation**: For unscented candles on Amazon, we expect 95\% of the Halloween 2020 ratings to be between 2.91 and 5.97.

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

* Big question! Take [Stat 139: Linear Models](https://canvas.harvard.edu/courses/89311) to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*}
R^2 &= \mbox{Percent of total variation in y explained by the model}\\
&= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
library(broom)
mod1 <- lm(Rating ~ Date, data = all)
mod2 <- lm(Rating ~ Type, data = all)
mod3 <- lm(Rating ~ Date + Type, data = all)
mod4 <- lm(Rating ~ Date * Type, data = all)
```

---

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
glance(mod1)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0326        0.0310 0.883      20.6 0.00000692     1  -791. 1588. 1601.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.256         0.252 0.775      69.7 9.42e-39     3  -711. 1431. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

---

### Comparing Models with `\(R^2\)`

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

And in [Week 6, we said](https://mcconvil.github.io/stat100s22/stat100_wk06mon.html#55):

**Guiding Principle**: Occam's Razor for Modeling

> "All other things being equal, simpler models are to be preferred over complex ones."
--

ModernDive

---

### Comparing Models with the Adjusted `\(R^2\)`

**New Measure of quality**: Adjusted `\(R^2\)`

`\begin{align*}
\mbox{adj} R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right)
\end{align*}`

where `\(p\)` is the number of explanatory variables in the model.

--

Now we will penalize larger models.

--

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

---

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.256         0.252 0.775      69.7 9.42e-39     3  -711. 1431. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
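To make the adjusted `\(R^2\)` formula concrete, here is a small sketch that computes it directly from the residual and total sums of squares and checks it against the value reported by `summary()` (the same `adj.r.squared` that `glance()` shows). The data are simulated and the variable names are placeholders.

```r
# Compute adjusted R^2 by hand from its formula and compare to R's value.
# Simulated data; x and y are placeholders, not the candle data.
set.seed(100)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
mod <- lm(y ~ x)

p      <- 1                                   # number of explanatory variables
ss_res <- sum(residuals(mod)^2)               # sum of (y - y-hat)^2
ss_tot <- sum((y - mean(y))^2)                # sum of (y - y-bar)^2
adj_r2 <- 1 - (ss_res / ss_tot) * ((n - 1) / (n - p - 1))

all.equal(adj_r2, summary(mod)$adj.r.squared)  # the two should agree
```

Because the penalty factor `\((n-1)/(n-p-1)\)` is greater than 1, the adjusted value is always below the plain `\(R^2\)`, which is what lets it penalize larger models.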