background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Simple Linear Regression]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 6 | Fall 2022]

]

---

## Announcements

* Final Exam:
    + Take-home will be released at noon on Wed, Dec 7th. You must complete it before 4pm on Dec 10th over a 3-hour period of your choice.
    + Oral exams will start on the afternoon of the 8th and go through 4:30pm on the 10th. They are 10 minutes long and should be done AFTER you complete the take-home.
* P-Set 4 due tomorrow at 5pm.
* Project Assignment 1 due Friday at 5pm.

****************************

--

## Goals for Today

.pull-left[

* Simple linear regression model
    + Estimating the slope and intercept terms
    + Prediction
    + Consider one quantitative predictor
    + Consider one categorical predictor

]

---

class: center, middle

## It's Time for Trend Stretches!

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-1-1.png" width="540" style="display: block; margin: auto;" />

---

### Simple Linear Regression

.pull-left[

<img src="stat100_wk06mon_files/figure-html/candy2-1.png" width="768" style="display: block; margin: auto;" />

]

.pull-right[

Let's return to the Candy Example.

* A line is a reasonable model form.
* Where should the line be?
    + Slope? Intercept?
]

---

### Form of the SLR Model

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

**Need to determine the best estimates of `\(\beta_o\)` and `\(\beta_1\)`.**

--

*****************************

#### Distinguishing between the population and the sample

--

* Parameters:
    + Based on the population
    + Unknown if we don't have data on the whole population
    + EX: `\(\beta_o\)` and `\(\beta_1\)`

--

* Statistics:
    + Based on the sample data
    + Known
    + Usually estimate a population parameter
    + EX: `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`

---

### Method of Least Squares

Need two key definitions:

--

* Fitted value: The *estimated* value of the `\(i\)`-th case

$$
\hat{y}_i = \hat{\beta}_o + \hat{\beta}_1 x_i
$$

--

* Residual: The *observed* error term for the `\(i\)`-th case

$$
e_i = y_i - \hat{y}_i
$$

**Goal**: Pick values for `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)` so that the residuals are small!

---

### Method of Least Squares

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

--

* Want residuals to be small.

--

* Minimize some function of the residuals.

---

### Method of Least Squares

Minimize:

$$
\sum_{i = 1}^n e^2_i
$$

--

Get the following equations:

$$
`\begin{align}
\hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
\hat{\beta}_o &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}`
$$

where

$$
`\begin{align}
\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i
\end{align}`
$$

---

## Method of Least Squares

Once we have the estimated intercept `\((\hat{\beta}_o)\)` and the estimated slope `\((\hat{\beta}_1)\)`, we can estimate the whole function:

--

$$
\hat{y} = \hat{\beta}_o + \hat{\beta}_1 x
$$

Called the **least squares line** or the **line of best fit**.
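---

## Method of Least Squares

The slope and intercept formulas can also be checked by hand. A minimal sketch, using a small made-up data set (the `x` and `y` values below are illustrative, not the candy data):

```r
# Small illustrative data set (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations over sum of squared x-deviations
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

# lm() minimizes the same sum of squared residuals,
# so it returns the same two estimates
coef(lm(y ~ x))
```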
---

### Method of Least Squares

.pull-left[

`ggplot2` will compute the line and add it to your plot using `geom_smooth(method = "lm")`.

But what are the **exact** values of `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`?

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/candy3-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Constructing the Simple Linear Regression Model in R

```r
mod <- lm(winpercent ~ pricepercent, data = candy)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
##   term         estimate std_error statistic p_value lower_ci upper_ci
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept        42.0      2.91     14.4    0        36.2      47.8
## 2 pricepercent     17.8      5.30      3.35   0.001     7.23     28.3
```

---

### Interpretation

Slope: 17.8

<br><br><br>

<br>

<br>

<br>

Intercept: 42.0

---

### Prediction

```r
new_cases <- data.frame(pricepercent = c(0.25, 0.85, 1.5))
predict(mod, newdata = new_cases)
```

```
##        1        2        3 
## 46.42443 57.09409 68.65289
```

We didn't have any treats in our sample with a price percentage of 85%. Can we still make this prediction?

--

→ Called **interpolation**

We didn't have any treats in our sample with a price percentage of 150%. Can we still make this prediction?

--

→ Called **extrapolation**

---

### Cautions

.pull-left[

* Be careful to only predict values within the range of `\(x\)` values in the sample.

* Make sure to investigate **influential points**.

**Q:** What is an **outlier**?

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />

]

---

### Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the **response variable is quantitative**.

********************

--

### What About A Categorical Explanatory Variable?
--

* Response variable `\((y)\)`: quantitative

--

* Have 1 categorical explanatory variable `\((x)\)` with two categories.

---

### Example: The Smile-Leniency Effect

**Can a simple smile have an effect on punishment assigned following an infraction?**

In a 1995 study, Hecht and LaFrance examined the effect of a smile on the leniency of disciplinary action for wrongdoers. Participants in the experiment took on the role of members of a college disciplinary panel judging students accused of cheating. For each suspect, along with a description of the offense, a picture was provided with either a smiling or a neutral facial expression. A leniency score was calculated based on the disciplinary decisions made by the participants.

**Response variable?**

**Explanatory variable?**

---

### Model Form

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

First, we need to convert the categories of `\(x\)` to numbers.

--

Before building the model, let's explore and visualize the data!

```r
library(tidyverse)
library(Lock5Data)

# Load data
data(Smiles)
smiles <- Smiles
glimpse(smiles)
```

```
## Rows: 68
## Columns: 2
## $ Leniency <dbl> 7.0, 3.0, 6.0, 4.5, 3.5, 4.0, 3.0, 3.0, 3.5, 4.5, 7.0, 5.0, 5…
## $ Group    <fct> smile, smile, smile, smile, smile, smile, smile, smile, smile…
```

--

* What `dplyr` functions should I use to find the mean and sd of `Leniency` by the categories of `Group`?

--

* What graph should we use to visualize the `Leniency` scores by `Group`?
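---

### Model Form

Converting the categories of `\(x\)` to numbers can be done by hand with a 0/1 indicator variable. A minimal sketch, using a hypothetical toy data set (not the real `Smiles` data):

```r
# Hypothetical toy data set, for illustration only
toy <- data.frame(
  Group    = factor(c("neutral", "neutral", "smile", "smile")),
  Leniency = c(4, 5, 6, 7)
)

# Hand-code the indicator: smile -> 1, neutral -> 0
toy$x <- ifelse(toy$Group == "smile", 1, 0)

# Intercept = mean of the neutral group (4.5);
# slope = smile mean minus neutral mean (6.5 - 4.5 = 2)
coef(lm(Leniency ~ x, data = toy))
```

This is the same coding `lm()` applies automatically when it is handed a two-category factor like `Group`.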
---

### Example: The Smile-Leniency Effect

```r
# Summarize
smiles %>%
  group_by(Group) %>%
  summarize(count = n(),
            mean_len = mean(Leniency),
            sd_len = sd(Leniency))
```

```
## # A tibble: 2 × 4
##   Group   count mean_len sd_len
##   <fct>   <int>    <dbl>  <dbl>
## 1 neutral    34     4.12   1.52
## 2 smile      34     4.91   1.68
```

---

### Example: The Smile-Leniency Effect

.pull-left[

```r
# Visualize
ggplot(smiles, aes(x = Group, y = Leniency)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point",
               color = "purple", size = 4)
```

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/box-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Fit the Linear Regression Model

Model Form:

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

When `\(x = 0\)`:

<br>

<br>

When `\(x = 1\)`:

--

```r
mod <- lm(Leniency ~ Group, data = smiles)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
##   term         estimate std_error statistic p_value lower_ci upper_ci
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       4.12      0.275     15.0    0        3.57     4.67
## 2 Group: smile    0.794     0.389      2.04   0.045    0.017    1.57
```

---

### Notes

1. When the explanatory variable is categorical, `\(\beta_o\)` and `\(\beta_1\)` no longer represent the intercept and slope.

--

2. Now `\(\beta_o\)` represents the (population) mean of the response variable when `\(x = 0\)`.

--

3. And, `\(\beta_1\)` represents the change in the (population) mean response going from `\(x = 0\)` to `\(x = 1\)`.

--

4. Can also do prediction:

```r
new <- data.frame(Group = c("smile", "neutral"))
predict(mod, newdata = new)
```

```
##        1        2 
## 4.911765 4.117647
```

---

## And, As You Start Planning For Halloween...

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />

---

class: center, middle

## Survey Time

Someone from the teaching staff is going to give you a card with a number that was randomly generated for you.
Please use that number to take the following anonymous survey:

## [https://bit.ly/stat100trees](https://bit.ly/stat100trees)