background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center,

.pull-right[

## .base-blue[Inference]
## .base-blue[for]
## .base-blue[Linear Regression]

<br>

### .purple[Kelly McConville]
#### .purple[ Stat 100 | Week 14 | Fall 2022]

]

---
class: middle, center

### If you are able to attend, please RSVP here:

### [bit.ly/stat100-ggparty](https://bit.ly/stat100-ggparty)

---

### Announcements

* Extra Credit Lecture Quiz due Dec 1st.
* P-Set 9 (the LAST p-set) is due Th, Dec 1st!
* Project Assignment 3 is due Tu, Dec 6th.
    + Also provide [feedback on work distribution](https://forms.gle/EJ1zgcC3c5rNrn9f7).
* Will get review sheet on Wed.

****************************

--

### Goals for Today

.pull-left[

* **Sim-based** versus **theory-based** inference

* Recap **multiple linear regression**

* Check **assumptions** for linear regression

]

.pull-right[

* **Hypothesis testing** for linear regression

* **Estimation** and **prediction** inference for linear regression

]

---
class: , middle, center

## Please make sure to fill out the Stat 100 Course Evaluations.

### We appreciate constructive feedback.

### For all of your course evaluations, be mindful of [unconscious and unintentional biases](https://benschmidt.org/profGender).

---
background-image: url("img/hyp_testing_diagram.png")
background-position: contain
background-size: 80%

### We Have Learned Two Routes to Statistical Inference

--

Which is **better**?

---

## Is Simulation-Based Inference or Theory-Based Inference Better?

--

Depends on how you define **better**.

.pull-left[
* If **better** = Leads to better understanding:
]

--

.pull-right[
→ Research tends to show students have a better understanding of **p-values** and **confidence intervals** from learning simulation-based methods.
]

--

.pull-left[
* If **better** = More flexible/robust to assumptions:
]

--

.pull-right[
→ The simulation-based methods tend to be more flexible, but that generally requires learning extensions beyond what we've seen in Stat 100.
]

--

.pull-left[
* If **better** = More commonly used:
]

--

.pull-right[
→ Definitely the theory-based methods, but the simulation-based methods are becoming more common.
]

Good to be comfortable with both, as you will find both approaches used in journal and news articles!

---
class: center, middle

## What does statistical inference (estimation and hypothesis testing) look like when I have more than 0 or 1 explanatory variables?

--

### One route: Multiple Linear Regression!

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

* Multiple explanatory variables.

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is quantitative.

**In P-Set 9, Problem 5** you are exploring the importance of controlling for other explanatory variables when you try to make inferences about relationships.

---

### Multiple Linear Regression

**Form of the Model:**

$$
`\begin{align}
y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

**Fitted Model:** Using the Method of Least Squares,

$$
`\begin{align}
\hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\end{align}`
$$

--

#### Typical Inferential Questions:

(1) Should `\(x_2\)` be in the model that already contains `\(x_1\)` and `\(x_3\)`?
Also often asked as "Controlling for `\(x_1\)` and `\(x_3\)`, is there evidence that `\(x_2\)` has a relationship with `\(y\)`?" $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ In other words, should `\(\beta_2 = 0\)`? --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Fitted Model:** Using the Method of Least Squares, $$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$ #### Typical Inferential Questions: (2) Can we estimate `\(\beta_3\)` with a confidence interval? $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Fitted Model:** Using the Method of Least Squares, $$ `\begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align}` $$ #### Typical Inferential Questions: (3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`? $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align}` $$ -- To answer these questions, we need to add some **assumptions** to our linear regression model. --- ### Multiple Linear Regression **Form of the Model:** $$ `\begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ **Additional Assumptions:** $$ \epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon}) $$ `\(\sigma_{\epsilon}\)` = typical deviations from the model -- Let's unpack these assumptions! --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\color{orange}{\mbox{ind}}}{\sim}N\left(0, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The cases are independent of each other. -- **Question**: How do we check this assumption? -- Look at how the data were collected. My collaborator, Paolo Maranzano, talked to us about models that assume spatial dependence and temporal dependence. --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{orange}{N}\left(0, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The errors are normally distributed. -- .pull-left[ **Question**: How do we check this assumption? ] -- .pull-right[ Recall the residual: `\(e = y - \hat{y}\)` ] -- **QQ-plot:** Plot the residuals against the quantiles of a normal distribution! .pull-left[ <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-1-1.png" width="432" style="display: block; margin: auto;" /> ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" /> ] --- ### Assumptions For ease of visualization, let's assume a simple linear model: `\begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(\color{orange}{0}, \sigma_{\epsilon} \right) \end{align*}` -- **Assumption**: The points will, on average, fall on the line. 
--

**Question**: How do we check this assumption?

--

If you use the Method of Least Squares, then you don't have to check. It will be true by construction:

$$
\sum e = 0
$$

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

--

**Assumption**: The variability in the errors is constant.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-4-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \color{orange}{\sigma_{\epsilon}} \right)
\end{align*}`

**Assumption**: The variability in the errors is constant.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-5-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

--

**Assumption**: The model form is appropriate.

--

**Question**: How do we check this assumption?

--

**One option**: Scatterplot(s)

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

For ease of visualization, let's assume a simple linear model:

`\begin{align*}
y = \color{orange}{\beta_o + \beta_1 x_1} + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}N\left(0, \sigma_{\epsilon} \right)
\end{align*}`

**Assumption**: The model form is appropriate.

**Question**: How do we check this assumption?

**Better option** (especially when you have more than one explanatory variable): **Residual Plot**

.pull-left[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-9-1.png" width="432" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stat100_wk14mon_files/figure-html/unnamed-chunk-10-1.png" width="432" style="display: block; margin: auto;" />
]

---

### Assumptions

**Question**: What if the assumptions aren't all satisfied?

--

→ Try transforming the data and building the model again.

--

→ Use a modeling technique beyond linear regression.

--

**Question**: What if the assumptions are all (roughly) satisfied?

--

→ Can now start answering your inference questions!

***********************************

### Let's now look at an example and learn how to create qq-plots and residual plots in `R`.
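---

### Aside: Building These Plots Directly with `ggplot2`

Before turning to the example, here is a minimal sketch of how these two diagnostic plots can be built directly with `ggplot2`, assuming you already have a fitted `lm` object called `mod` (a placeholder name here; we fit the actual model in the example that follows):

```r
library(ggplot2)

# Collect the fitted values and residuals from the (assumed) fitted model `mod`
diagnostics <- data.frame(
  fitted    = fitted(mod),
  residuals = resid(mod)
)

# qq-plot: compare the residuals to the quantiles of a normal distribution
ggplot(diagnostics, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line()

# Residual plot: look for constant spread around 0 and no leftover pattern
ggplot(diagnostics, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed")
```

The `gglm` package on the next slides wraps these same ideas into one-line helpers.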
--- ### Example: COVID and Candle Ratings [Kate Petrova created a dataset](https://twitter.com/kate_ptrv/status/1332398768659050496) that made the rounds on Twitter: <img src="img/kate_petrova_candles.jpg" width="600px"/> --- ### COVID and Candle Ratings She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question: → Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? -- In other words, do we have evidence that we should allow the slopes to vary? .pull-left[ ```r ggplot(data = all, mapping = aes(x = Date, y = Rating, color = Type)) + geom_point(alpha = 0.4) + geom_smooth(method = lm) + theme(legend.position = "bottom") ``` #### Checking assumptions: **Assumption**: The cases are independent of each other. ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/candles-1.png" width="768" style="display: block; margin: auto;" /> ] --- ## Assumption Checking in `R` The `R` package we will use to check model assumptions is called `gglm` and was written by one of my former Reed students, Grayson White. <img src="https://raw.githubusercontent.com/graysonwhite/gglm/master/figs/gglm.gif" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" /> ```r library(gglm) ``` First need to fit the model: ```r glimpse(all) ``` ``` ## Rows: 612 ## Columns: 3 ## $ Date <date> 2020-01-20, 2020-01-21, 2020-01-22, 2020-01-23, 2020-01-24, 20… ## $ Rating <dbl> 4.500000, 4.500000, 3.909091, 4.857143, 4.461538, 4.800000, 4.4… ## $ Type <chr> "scented", "scented", "scented", "scented", "scented", "scented… ``` ```r mod <- lm(Rating ~ Date * Type, data = all) ``` --- ### qq-plot **Assumption**: The errors are normally distributed. .pull-left[ ```r # Normal qq plot ggplot(data = mod) + stat_normal_qq() ``` ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/qqplot-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Residual Plot **Assumption**: The variability in the errors is constant. **Assumption**: The model form is appropriate. .pull-left[ ```r # Residual plot ggplot(data = mod) + stat_fitted_resid() ``` ] .pull-right[ <img src="stat100_wk14mon_files/figure-html/residual-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? For the moment, let's focus on the equal slopes model. ```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` -- **In General**: $$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$ $$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? 
```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **For our Example**: **Row 2**: $$ H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model} $$ $$ H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? ```r mod <- lm(Rating ~ Date + Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **For our Example**: **Row 3**: $$ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} $$ $$ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} $$ --- ### Hypothesis Testing **Question**: What tests is `get_regression_table()` conducting? **In General**: $$ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} $$ $$ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} $$ Test Statistic: -- $$ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - \mbox{# of predictors}) $$ when `\(H_o\)` is true and the model assumptions are met. --- **For our Example**: ```r get_regression_table(mod) ``` ``` ## # A tibble: 3 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 36.2 6.50 5.58 0 23.5 49.0 ## 2 Date -0.002 0 -5.00 0 -0.002 -0.001 ## 3 Type: unscented 0.831 0.063 13.2 0 0.707 0.955 ``` **Row 3**: $$ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} $$ $$ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} $$ Test Statistic: -- $$ t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2 $$ with p-value `\(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)` -- There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date. --- ### Example Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? <img src="stat100_wk14mon_files/figure-html/unnamed-chunk-22-1.png" width="576" style="display: block; margin: auto;" /> --- ### Example Early on in the pandemic, do we have evidence that the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time? ```r mod <- lm(Rating ~ Date * Type, data = all) get_regression_table(mod) ``` ``` ## # A tibble: 4 × 7 ## term estimate std_error statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept 52.7 9.09 5.80 0 34.9 70.6 ## 2 Date -0.003 0 -5.40 0 -0.004 -0.002 ## 3 Type: unscented -32.6 12.9 -2.52 0.012 -58.0 -7.24 ## 4 Date:Typeunscented 0.002 0.001 2.59 0.01 0 0.003 ``` --- class: , middle, center ## Now let's shift our focus to estimation and prediction! 
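---

### Connecting Testing and Estimation

The tests we just did and the interval estimates coming up are built from the same ingredients: an estimate, its standard error, and a t distribution. Here is a small sketch that reproduces the Row 3 results by hand, using the equal-slopes values printed above (estimate 0.831, standard error 0.063) and the `n = 612` candles from the `glimpse()` output; the degrees of freedom below are an assumption based on subtracting the 3 estimated coefficients:

```r
beta_hat <- 0.831    # estimate for the Type coefficient (from the table above)
se_beta  <- 0.063    # its standard error (from the table above)
df       <- 612 - 3  # n minus the number of estimated coefficients

# Test statistic and two-sided p-value for H_o: beta_2 = 0
t_stat  <- (beta_hat - 0) / se_beta                          # about 13.2
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # essentially 0

# 95% confidence interval: estimate +/- t* x SE
t_star <- qt(0.975, df = df)
c(lower = beta_hat - t_star * se_beta,
  upper = beta_hat + t_star * se_beta)                       # about (0.707, 0.955)
```

The interval endpoints match the `lower_ci` and `upper_ci` columns that `get_regression_table()` reports.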
---

### Estimation

#### Typical Inferential Questions:

(2) Can we estimate `\(\beta_j\)` with a confidence interval?

--

Confidence Interval Formula:

--

`\begin{align*}
\mbox{statistic} & \pm ME \\
\hat{\beta}_j & \pm t^* SE(\hat{\beta}_j)
\end{align*}`

--


```r
get_regression_table(mod)
```

```
## # A tibble: 4 × 7
##   term               estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept            52.7       9.09       5.80   0       34.9     70.6
## 2 Date                 -0.003     0         -5.40   0       -0.004   -0.002
## 3 Type: unscented     -32.6      12.9       -2.52   0.012  -58.0     -7.24
## 4 Date:Typeunscented    0.002     0.001      2.59   0.01     0        0.003
```

---

### Prediction

#### Typical Inferential Questions:

(3) While `\(\hat{y}\)` is a point estimate for `\(y\)`, can we also get an interval estimate for `\(y\)`?

#### Two Types:

--

.pull-left[
**Confidence Interval for the Mean Response**

→ Defined at given values of the explanatory variables

→ Estimates the <span style="color: orange;">average</span> response

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Smaller</span> SE
]

.pull-right[
**Prediction Interval for an Individual Response**

→ Defined at given values of the explanatory variables

→ Predicts the response of a <span style="color: orange;">single, </span> new observation

→ Centered at `\(\hat{y}\)`

→ <span style="color: orange;">Larger</span> SE
]

---

### CI for mean response at a given level of X:

We want to construct a 95% CI for the average rating of unscented candles on Oct 31st, 2020.

--


```r
new <- data.frame(Type = "unscented", Date = as.Date("2020-10-31"))
predict(mod, new, interval = "confidence", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 4.286152 4.598604
```

--

**Interpretation**: We are 95% confident that the average rating of an unscented candle on Amazon on Halloween 2020 was between 4.29 and 4.60.

---

### PI for a new Y at a given level of X:

Say we want to construct a 95% PI for the rating of an **individual** unscented candle on Oct 31st, 2020.

* Predicting for a new observation, not the mean!


```r
predict(mod, new, interval = "prediction", level = 0.95)
```

```
##        fit      lwr      upr
## 1 4.442378 2.911587 5.973169
```

**Interpretation**: For unscented candles on Amazon, we expect 95% of the Halloween 2020 ratings to be between 2.91 and 5.97.

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

* Big question! Take **Stat 139: Linear Models** to learn systematic model selection techniques.

--

* We will explore one approach. (But there are many possible approaches!)

---

### Comparing Models

Suppose I built 4 different models. **Which is best?**

--

→ Pick the best model based on some measure of quality.

--

**Measure of quality**: `\(R^2\)` (Coefficient of Determination)

`\begin{align*}
R^2 &= \mbox{Percent of total variation in y explained by the model}\\
&= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\end{align*}`

--

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.

---

### Comparing Models with `\(R^2\)`

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.


```r
library(broom)
mod1 <- lm(Rating ~ Date, data = all)
mod2 <- lm(Rating ~ Type, data = all)
mod3 <- lm(Rating ~ Date + Type, data = all)
mod4 <- lm(Rating ~ Date * Type, data = all)
```

---

**Strategy**: Compute the `\(R^2\)` value for each model and pick the one with the highest `\(R^2\)`.
```r glance(mod1) ``` ``` ## # A tibble: 1 × 12 ## r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 0.0326 0.0310 0.883 20.6 6.92e-6 1 -791. 1588. 1601. 475. 610 ## # … with 1 more variable: nobs <int>, and abbreviated variable names ## # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual ``` ```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` --- ```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod4) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.256 0.252 0.775 69.7 9.42e-39 3 -711. 1431. 1454. 366. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` **Problem:** As we add predictors, the `\(R^2\)` value will only increase. --- ### Comparing Models with `\(R^2\)` **Problem:** As we add predictors, the `\(R^2\)` value will only increase. And in [Week 6, we said](https://mcconvil.github.io/stat100f22/stat100_wk07wed.html#34): **Guiding Principle**: Occam's Razor for Modeling > "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive --- ### Comparing Models with the Adjusted `\(R^2\)` **New Measure of quality**: Adjusted `\(R^2\)` (Coefficient of Determination) `\begin{align*} \mbox{adj} R^2 &= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} \left(\frac{n - 1}{n - p - 1} \right) \end{align*}` where `\(p\)` is the number of explanatory variables in the model. -- Now we will penalize larger models. -- **Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`. --- **Strategy**: Compute the adjusted `\(R^2\)` value for each model and pick the one with the highest adjusted `\(R^2\)`. 
```r glance(mod2) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.217 0.216 0.794 169. 2.95e-34 1 -726. 1459. 1472. 385. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod3) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.248 0.245 0.779 100. 2.28e-38 2 -714. 1436. 1454. 370. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ``` ```r glance(mod4) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.256 0.252 0.775 69.7 9.42e-39 3 -711. 1431. 1454. 366. ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance ```
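---

### Comparing Models with the Adjusted `\(R^2\)`

Reading four separate `glance()` printouts makes the comparison harder than it needs to be. Here is a compact sketch that collects the `\(R^2\)` and adjusted `\(R^2\)` values into one table, assuming `mod1` through `mod4` from the chunks above are still in your environment:

```r
library(broom)
library(dplyr)
library(purrr)

# One row per candidate model, keeping only the fit measures compared here
model_fits <- list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4)

map_dfr(model_fits, glance, .id = "model") %>%
  select(model, r.squared, adj.r.squared)
```

Based on the output shown above, `mod4` (the varying-slopes model) has the highest adjusted `\(R^2\)`, even after the penalty for its extra term.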