background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Inference]
## .whitish[for]
## .whitish[Linear Regression]

<br>

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 12 | Spring 2022]

]

---

### Announcements

* Project Assignment 3 is due Friday, April 22nd at 5pm
* Final Project Assignment
    + Create 5-10 minute video summary of your analyses
    + Due Friday, May 6th

****************************

--

### Goals for Today

.pull-left[
* Recap linear regression
* Checking **assumptions** for linear regression
]

.pull-right[
* Hypothesis testing for linear regression
* Estimation and prediction inference for linear regression
]

---
class: inverse, middle, center

## Please make sure to fill out the Stat 100 Course Evaluations.

### We appreciate constructive feedback.

### For all of your course evaluations be mindful of [unconscious and unintentional biases](https://benschmidt.org/profGender).

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.
* Multiple explanatory variables.
* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

In other words, should `\(\beta_2 = 0\)`?
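A quick sketch of how question (1) plays out in R: fit the full model and read off the t-test for the `x2` coefficient. The data below are simulated, and all variable names (`y`, `x1`, `x2`, `x3`) are placeholders, not the candle data used later in these slides.

```r
# Simulated illustration of question (1): should x2 stay in a model
# that already contains x1 and x3? (All names here are placeholders.)
set.seed(100)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)  # data generated with beta_2 = 0

mod <- lm(y ~ x1 + x2 + x3)

# The x2 row reports the t statistic and p-value for H_o: beta_2 = 0,
# assuming x1 and x3 remain in the model.
summary(mod)$coefficients["x2", ]
```

The same test shows up in the `x2` row of `get_regression_table()` from the moderndive package.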
---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_3\)` with a confidence interval?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon
\end{align}`
$$

--

To answer these questions, we need to add some **assumptions** to our linear regression model.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

**Additional Assumptions:**

$$
\epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon})
$$

`\(\sigma_{\epsilon}\)` = typical deviations from the model

--

Let's unpack these assumptions!

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The cases are independent of each other.

--

**Question**: How do we check this assumption?
--

Look at how the data were collected. Checking this assumption generally relies on random sampling or random assignment having been used.

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The errors are normally distributed.

--

.pull-left[
**Question**: How do we check this assumption?
]

--

.pull-right[
Recall the residual: `\(e = y - \hat{y}\)`
]

--

**QQ-plot:** Plot the residuals against the quantiles of a normal distribution!

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The points will, on average, fall on the line.

--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check. It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

**Assumption**: The variability in the errors is constant.

--

**Question**: How do we check this assumption?
--

**One option**: Scatterplot

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

**Assumption**: The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The model form is appropriate.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot(s)

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

**Assumption**: The model form is appropriate.
**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

--

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ Can now start answering your inference questions!

---

### Example: COVID and Candle Ratings

[Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that has been making the rounds on Twitter:

<img src="img/kate_petrova_candles.jpg" width="600px"/>

---

### COVID and Candle Ratings

She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question:

→ Do we have evidence that we should allow the slopes to vary?

--

.pull-left[

```r
ggplot(data = all,
       mapping = aes(x = Date, y = Rating, color = Type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = lm) +
  theme(legend.position = "bottom")
```
]

.pull-right[
<img src="stat100_wk12wed_files/figure-html/candles-1.png" width="768" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

### But before we try to answer that question formally, let's practice checking linear regression assumptions with the "inferenceModeling.Rmd" handout!

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

--

**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

**For our Example**:

**Row 2**:

$$
H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model}
$$

$$
H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

```r
mod <- lm(Rating ~ Date + Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept         36.2       6.50       5.58       0   23.5     49.0
## 2 Date              -0.002     0         -5.00       0   -0.002   -0.001
## 3 Type: unscented    0.831     0.063     13.2        0    0.707    0.955
```

**For our Example**:

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model}
$$

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?
**In General**:

$$
H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model}
$$

$$
H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors} - 1)
$$

when `\(H_o\)` is true and the model assumptions are met. (The extra 1 in the degrees of freedom accounts for the estimated intercept.)

---

### Hypothesis Testing

**Question**: What tests is `get_regression_table()` conducting?

--

**For our Example**:

**Row 3**:

$$
H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model}
$$

$$
H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model}
$$

Test Statistic:

--

$$
t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2
$$

with p-value `\(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)`

--

There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date.

---

### Example

Do we have evidence that we should allow the slopes to vary?

<img src="stat100_wk12wed_files/figure-html/unnamed-chunk-16-1.png" width="576" style="display: block; margin: auto;" />

---

### Example

Do we have evidence that we should allow the slopes to vary?

```r
mod <- lm(Rating ~ Date*Type, data = all)
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012   -58.0    -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---
class: inverse, middle, center

## Now let's shift our focus to estimation and prediction!

---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?
--

Confidence Interval Formula:

--

`\begin{align*}
\mbox{statistic} & \pm ME \\
\hat{\beta}_j & \pm t^* SE(\hat{\beta}_j)
\end{align*}`

--

```r
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012   -58.0    -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single, </span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95\% CI for the average rating of unscented candles on Oct 31st, 2020.

--

```r
new <- data.frame(Type = "unscented", Date = as.Date("2020-10-31"))
predict(mod, new, interval = "confidence", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 4.286152 4.598604
```

--

**Interpretation**: We are 95\% confident that the average rating of an unscented candle on Amazon on Halloween 2020 was between 4.29 and 4.60.

---

### PI for a new Y at a given level of X:

Say we want to construct a 95\% PI for the rating of an **individual** unscented candle on Oct 31st, 2020.

* Predicting for a new observation, not the mean!
```r
predict(mod, new, interval = "prediction", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 2.911587 5.973169
```

**Interpretation**: For unscented candles on Amazon, we expect 95\% of the Halloween 2020 ratings to be between 2.91 and 5.97.

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

* Big question! Take [Stat 139: Linear Models](https://canvas.harvard.edu/courses/89311) to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*}
R^2 &= \mbox{Percent of total variation in y explained by the model}\\
&= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
library(broom)
mod1 <- lm(Rating ~ Date, data = all)
mod2 <- lm(Rating ~ Type, data = all)
mod3 <- lm(Rating ~ Date + Type, data = all)
mod4 <- lm(Rating ~ Date * Type, data = all)
```

---

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

```r
glance(mod1)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1    0.0326        0.0310 0.883      20.6 0.00000692     1  -791. 1588. 1601.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

---

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.256         0.252 0.775      69.7 9.42e-39     3  -711. 1431. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

---

### Comparing Models with `\(R^2\)`

**Problem:** As we add predictors, the `\(R^2\)` value will only increase.

And in [Week 6, we said](https://mcconvil.github.io/stat100s22/stat100_wk06mon.html#55):

**Guiding Principle**: Occam's Razor for Modeling

> "All other things being equal, simpler models are to be preferred over complex ones."
--

ModernDive

---

### Comparing Models with the Adjusted `\(R^2\)`

**New Measure of quality**: Adjusted `\(R^2\)`

`\begin{align*}
\mbox{adj} R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right)
\end{align*}`

where `\(p\)` is the number of explanatory variables in the model.

--

Now we will penalize larger models.

--

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

---

**Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`.

```r
glance(mod2)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.217         0.216 0.794      169. 2.95e-34     1  -726. 1459. 1472.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod3)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.248         0.245 0.779      100. 2.28e-38     2  -714. 1436. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(mod4)
```

```
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.256         0.252 0.775      69.7 9.42e-39     3  -711. 1431. 1454.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```
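To make the adjusted `\(R^2\)` formula concrete, here is a small sketch that computes it directly from the residual and total sums of squares and checks it against the value reported by `summary()` (the same `adj.r.squared` that `glance()` shows). The data are simulated and the variable names are placeholders.

```r
# Compute adjusted R^2 by hand from its formula and compare to R's value.
# Simulated data; x and y are placeholders, not the candle data.
set.seed(100)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
mod <- lm(y ~ x)

p      <- 1                                   # number of explanatory variables
ss_res <- sum(residuals(mod)^2)               # sum of (y - y-hat)^2
ss_tot <- sum((y - mean(y))^2)                # sum of (y - y-bar)^2
adj_r2 <- 1 - (ss_res / ss_tot) * ((n - 1) / (n - p - 1))

all.equal(adj_r2, summary(mod)$adj.r.squared)  # the two should agree
```

Because the penalty factor `\((n-1)/(n-p-1)\)` is greater than 1, the adjusted value is always below the plain `\(R^2\)`, which is what lets it penalize larger models.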