background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Multiple Linear Regression]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 6 | Fall 2022]

]

---

## Announcements

* No lecture on Monday, October 10th!
* Project Assignment 1 due on Friday at 5pm
* Mid-Term Exam: Wednesday, March 9th - Friday, March 11th
* After turning in Project Assignment 1, please fill out [this feedback survey](https://forms.gle/PmV9QxNoxFztXcUk9) (also on Canvas and in the announcements channel) by Friday, March 4th.
    + We will give groups one set of feedback on Gradescope but will take your feedback into account when doing our final assessments.

****************************

--

## Goals for Today

.pull-left[

* Broaden our idea of linear regression models

]

.pull-right[

* Discuss the **multiple** linear regression model

* Explore interaction terms

]

---

### Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the **response variable is quantitative**.

********************

--

### What About A Categorical Explanatory Variable?

--

* Response variable `\((y)\)`: quantitative

--

* Have 1 categorical explanatory variable `\((x)\)` with two categories.

---

### Example: Halloween Candy

```r
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
glimpse(candy)
```

```
## Rows: 85
## Columns: 13
## $ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
## $ chocolate        <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ fruity           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ caramel          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ peanutyalmondy   <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nougat           <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ hard             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
## $ bar              <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ pluribus         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
## $ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
## $ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
```

What might be a good categorical explanatory variable of `winpercent`?

---

### Model Form

$$
`\begin{align}
y &= \beta_0 + \beta_1 x + \epsilon
\end{align}`
$$

--

First, need to convert the categories of `\(x\)` to numbers (see the aside on the next slide).

--

Before building the model, let's explore and visualize the data!

* What `dplyr` functions should I use to find the mean and sd of `winpercent` by the categories of `chocolate`?

--

* What graph should we use to visualize the `winpercent` scores by `chocolate`?
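---

### Aside: How R Codes a Categorical `\(x\)`

A minimal sketch: `lm()` turns a two-category variable into a 0/1 **indicator variable** behind the scenes, and `model.matrix()` lets us peek at that coding. Here `chocolate` is already stored as 0/1, so we wrap it in `factor()` just to see what R would build.

```r
# Peek at the design matrix lm() builds: the factor(chocolate)1 column
# is 1 when a candy contains chocolate and 0 otherwise.
head(model.matrix(~ factor(chocolate), data = candy))
```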
---

### Example: Halloween Candy

```r
# Summarize
candy %>%
  group_by(chocolate) %>%
  summarize(count = n(),
            mean_win = mean(winpercent),
            sd_win = sd(winpercent))
```

```
## # A tibble: 2 × 4
##   chocolate count mean_win sd_win
##       <dbl> <int>    <dbl>  <dbl>
## 1         0    48     42.1   10.2
## 2         1    37     60.9   12.8
```

---

### Example: Halloween Candy

.pull-left[

```r
# Visualize
ggplot(candy, aes(x = factor(chocolate), y = winpercent,
                  fill = factor(chocolate))) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point",
               color = "yellow", size = 4) +
  guides(fill = "none") +
  scale_fill_manual(values = c("0" = "deeppink", "1" = "chocolate4")) +
  scale_x_discrete(labels = c("No", "Yes"),
                   name = "Does the candy contain chocolate?")
```

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/box-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Fit the Linear Regression Model

Model Form:

$$
`\begin{align}
y &= \beta_0 + \beta_1 x + \epsilon
\end{align}`
$$

--

When `\(x = 0\)`:

<br>

<br>

When `\(x = 1\)`:

--

```r
mod <- lm(winpercent ~ chocolate, data = candy)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     42.1      1.65     25.6        0     38.9     45.4
## 2 chocolate     18.8      2.50      7.52       0     13.8     23.7
```

---

### Notes

1. When the explanatory variable is categorical, `\(\beta_0\)` and `\(\beta_1\)` no longer represent the intercept and slope.

--

2. Now `\(\beta_0\)` represents the (population) mean of the response variable when `\(x = 0\)`.

--

3. And `\(\beta_1\)` represents the change in the (population) mean response going from `\(x = 0\)` to `\(x = 1\)`.

--

4. Can also do prediction:

```r
new_candy <- data.frame(chocolate = c(0, 1))
predict(mod, newdata = new_candy)
```

```
##        1        2 
## 42.14226 60.92153
```

---

## And, As You Start Planning For Halloween...

<img src="stat100_wk06wed_files/figure-html/unnamed-chunk-6-1.png" width="576" style="display: block; margin: auto;" />

---

## Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the response variable is **quantitative**.

---

### Multiple Linear Regression

**Form of the Model:**

--

$$
`\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon
\end{align}`
$$

--

#### How does extending to more predictors change our process?

--

**What doesn't change:**

--

→ Still use the **Method of Least Squares** to estimate the coefficients

--

→ Still use `lm()` to fit the model and `predict()` for prediction (see the sketch on the next slide)

--

**What does change:**

--

→ The meaning of each coefficient is more complicated and depends on the other variables in the model

--

→ Need to decide which variables to include and how (linear term, squared term...)
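---

### A Quick Sketch with Multiple Predictors

A minimal sketch using R's built-in `mtcars` data (not one of our course datasets): the formula simply gains terms, and fitting and prediction work exactly as with one predictor.

```r
# One quantitative predictor (hp) plus one categorical predictor (cyl):
mod2 <- lm(mpg ~ hp + factor(cyl), data = mtcars)
coef(mod2)  # least-squares estimates, one coefficient per term

# Prediction works the same way as before:
predict(mod2, newdata = data.frame(hp = 110, cyl = 6))
```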
---

### Multiple Linear Regression

* We are going to see a few examples today and next lecture.

--

* We will need to return to modeling later in the course to more definitively answer questions about **model selection**.

---

## Example

Meadowfoam is a plant that grows in the Pacific Northwest and is harvested for its seed oil. In a randomized experiment, researchers at Oregon State University looked at how two light-related factors influenced the number of flowers per meadowfoam plant, the primary measure of productivity for this plant. The two light measures were light intensity (in mmol/`\(m^2\)`/sec) and the timing of the onset of the light (early or late in terms of photoperiodic floral induction).

<br>

**Response variable**:

<br>

**Explanatory variables**:

<br>

<br>

<br>

**Model Form:**

---

### Data Loading and Wrangling

```r
library(tidyverse)
library(Sleuth3)
data(case0901)

# Recode the timing variable
case0901 <- case0901 %>%
  mutate(TimeCat = case_when(
    Time == 1 ~ "Late",
    Time == 2 ~ "Early"
  ))
```

---

### Visualizing the Data

.pull-left[

```r
ggplot(case0901, aes(x = Intensity, y = Flowers,
                     color = TimeCat)) +
  geom_point(size = 4)
```

Why don't I have to include `data = ` and `mapping = ` in my `ggplot()` layer?

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/meadowfoam-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Building the Linear Regression Model

```r
modFlowers <- lm(Flowers ~ Intensity + TimeCat, data = case0901)
library(moderndive)
get_regression_table(modFlowers)
```

```
## # A tibble: 3 × 7
##   term          estimate std_error statistic p_value lower_ci upper_ci
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       83.5       3.27      25.5        0   76.7     90.3
## 2 Intensity       -0.04      0.005     -7.89       0   -0.051   -0.03
## 3 TimeCat: Late  -12.2       2.63      -4.62       0  -17.6     -6.69
```

---

### Interpreting the Coefficients

.pull-left[

```r
ggplot(case0901, aes(x = Intensity, y = Flowers,
                     color = TimeCat)) +
  geom_point(size = 4) +
  geom_smooth(method = "lm", se = FALSE)
```

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/meadowfoam2-1.png" width="768" style="display: block; margin: auto;" />

]

<br>

--

Is the assumption of **equal slopes** reasonable here?

---

### Prediction

```r
flowersNew <- data.frame(Intensity = 700, TimeCat = "Early")
predict(modFlowers, newdata = flowersNew)
```

```
##        1 
## 55.13417
```
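--

As a sanity check, we can rebuild this prediction by hand from the fitted coefficients (a minimal sketch; "Early" is the reference category, so the `TimeCat: Late` term contributes nothing here):

```r
# Apply the fitted equation directly: intercept + slope * Intensity.
b <- coef(modFlowers)
b[["(Intercept)"]] + b[["Intensity"]] * 700
```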
---

### New Example

For this example, we will use data collected by the website pollster.com, which aggregated 102 presidential polls from August 29th, 2008 through the end of September. We want to determine the best model to explain the variable `Margin`, measured as the difference in support between Barack Obama and John McCain (Obama minus McCain). Our potential predictors are `Days` (the number of days after the Democratic Convention) and `Charlie` (an indicator variable for whether the poll was conducted before or after the first ABC interview of Sarah Palin with Charlie Gibson).

--

.pull-left[

```r
Pollster08 <- read_csv("~/shared_data/stat100/data/Pollster08.csv")
glimpse(Pollster08)
```

```
## Rows: 102
## Columns: 11
## $ PollTaker <chr> "Rasmussen", "Zogby", "Diageo/Hotline", "CBS", "CNN", "Rasmu…
## $ PollDates <chr> "8/28-30/08", "8/29-30/08", "8/29-31/08", "8/29-31/08", "8/2…
## $ MidDate   <chr> "8/29", "8/30", "8/30", "8/30", "8/30", "8/31", "8/31", "9/1…
## $ Days      <dbl> 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 8, 8, 9, 9, 9, 9, …
## $ n         <dbl> 3000, 2020, 805, 781, 927, 3000, 1200, 1728, 2771, 1000, 734…
## $ Pop       <chr> "LV", "LV", "RV", "RV", "RV", "LV", "LV", "RV", "RV", "A", "…
## $ McCain    <dbl> 46, 47, 39, 40, 48, 45, 43, 36, 42, 39, 42, 44, 46, 40, 48, …
## $ Obama     <dbl> 49, 45, 48, 48, 49, 51, 49, 40, 49, 42, 42, 49, 48, 46, 45, …
## $ Margin    <dbl> 3, -2, 9, 8, 1, 6, 6, 4, 7, 3, 0, 5, 2, 6, -3, 5, -4, -1, -2…
## $ Charlie   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Meltdown  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
```

]

.pull-right[

**Response variable**:

<br>

**Explanatory variables**:

]

---

### Loading and Visualizing the Data

.pull-left[

```r
ggplot(Pollster08, aes(x = Days, y = Margin,
                       color = Charlie)) +
  geom_point(size = 3)
```

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/polls-1.png" width="768" style="display: block; margin: auto;" />

]

What is wrong with how one of the variables is mapped in the graph?

---

### Loading and Visualizing the Data

.pull-left[

```r
ggplot(Pollster08, aes(x = Days, y = Margin,
                       color = factor(Charlie))) +
  geom_point(size = 3)
```

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/polls2-1.png" width="768" style="display: block; margin: auto;" />

]

--

Is the assumption of **equal slopes** reasonable here?

---

### Model Forms

**Same Slopes Model:**

<br>

<br>

<br>

**Different Slopes Model:**

* Line for `\(x_2 = 1\)`:

<br>

<br>

<br>

* Line for `\(x_2 = 0\)`:

---

### Fit the Linear Regression Model

```r
modPoll <- lm(Margin ~ Days*factor(Charlie), data = Pollster08)
get_regression_table(modPoll)
```

```
## # A tibble: 4 × 7
##   term                  estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept                5.57      1.09       5.11       0    3.40     7.73
## 2 Days                    -0.598     0.121     -4.96       0   -0.838   -0.359
## 3 factor(Charlie): 1     -10.1       1.92      -5.25       0  -13.9     -6.29
## 4 Days:factor(Charlie)1    0.921     0.136      6.75       0    0.65     1.19
```

(The appendix slide at the end shows how these estimates combine into two fitted lines.)

---

### Adding the Regression Model to the Plot

.pull-left[

```r
ggplot(Pollster08, aes(x = Days, y = Margin,
                       color = factor(Charlie))) +
  geom_point(size = 3) +
  stat_smooth(method = lm, se = FALSE)
```

]

.pull-right[

<img src="stat100_wk06wed_files/figure-html/polls3-1.png" width="768" style="display: block; margin: auto;" />

]

--

Is our modeling goal here **predictive** or **descriptive**?

---

## Reminders:

* No lecture on Monday, October 10th!
* Project Assignment 1 due on Friday at 5pm
* Mid-Term Exam: Wednesday, March 9th - Friday, March 11th
* After turning in Project Assignment 1, please fill out [this feedback survey](https://forms.gle/PmV9QxNoxFztXcUk9) (also on Canvas and in the announcements channel) by Friday, March 4th.
    + We will give groups one set of feedback on Gradescope but will take your feedback into account when doing our final assessments.
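---

### Appendix: Reading Off the Two Fitted Lines

A minimal sketch of how the interaction coefficients combine (assuming the fitted `modPoll` from earlier): the `Charlie = 1` line gets both an intercept offset and a slope offset.

```r
b <- coef(modPoll)

# Before the Gibson interview (Charlie = 0): baseline intercept and slope.
c(intercept = b[["(Intercept)"]], slope = b[["Days"]])

# After the interview (Charlie = 1): add the two interaction offsets.
c(intercept = b[["(Intercept)"]] + b[["factor(Charlie)1"]],
  slope = b[["Days"]] + b[["Days:factor(Charlie)1"]])
```

Plugging in the estimates from the table: roughly `\(\widehat{Margin} = 5.57 - 0.598 \, Days\)` before the interview and `\(\widehat{Margin} = -4.53 + 0.323 \, Days\)` after.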