Modeling

background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center, inverse

.pull-right[

## .whitish[Simple Linear Regression]

### .whitish[Kelly McConville]

#### .yellow[ Stat 100 | Week 4 | Spring 2022]

]

---

## Announcements

* P-Set 3 due Friday this week.
* Moving all project assignment due dates to Fridays at 5pm.

****************************

## Goals for Today

.pull-left[

* Introduce statistical modeling

* Simple linear regression model
    + Estimating the slope and intercept terms
    + Prediction
    + Consider one quantitative predictor
    + Consider one categorical predictor

]

.pull-right[

* Measuring correlation

]

---

###  When to Get Help

> "I have no idea how to do this problem."

&rarr; Ask us to point you to an similar example from the lecture or handouts.

&rarr; Talk it through with one of us so we can verbalize the process of going from Q to A.

> "I am getting a weird error but really think my code is correct/on the right track/matches the examples from class."

&rarr; It is time for a second pair of eyes.  Don't stare that the error for over 10 minutes.

> And lots of other times too!

---

###  When to Get Help

Remember:

&rarr; Struggling is part of learning.

&rarr; But let us help you ensure it is a **productive** struggle.

&rarr; Struggling does NOT mean you are bad at stats!

---

## Thoughts on Data Collection

**Random Sampling**

Random sampling is important to ensure the sample is representative of the population.

Representativeness isn't about size.

+ Small random samples will tend to be more representative than large non-random samples.

How do we draw conclusions about the population from non-random samples?

&rarr; Investigate how your sampled cases (and respondents) are systematically different from the non-sampled cases (and non-respondents).

---

## Thoughts on Data Collection

**Random Assignment**

Random assignment allows you to explore **causal** relationships between your explanatory variables and the predictor variables.

How do we draw causal conclusions from studies without random assignment?

&rarr; With extreme care!  Try to control for all possible confounding variables.

&rarr; Discuss the associations/correlations you found.  Use domain knowledge to address potentially causal links.

&rarr; Take more stats to learn more about causal inference.

**Bottom Line:** We often have to use imperfect data to make decisions.

---

## Two Key Random Mechanisms

---

.left-column[

## Recap

]

.right-column[

]

---

## Statistical Models

#### Two Main Motivations

You can often tell the modeling motivation from the research question.  We will look at studies that ask the following questions:

> "Can I use remotely sensed data to predict forest types in Alaska?"

&rarr; Motivation: Predict new observations

> "Do left-handed people live shorter lives than right-handed people?"

&rarr; Motivation: Describe relationships

We will focus mainly on **descriptive/explanatory modeling** in this course.  If you want to learn more about **predictive modeling**, take Stat 121A: Data Science 1 + Stat 121B: Data Science 2.

---

## Form of the Model

$$
y = f(x) + \epsilon
$$

**Goal:**

&rarr; Determine a reasonable form for `$f()$`. (Ex: Line, curve, ...)

&rarr; Estimate `$f()$` with `$\hat{f}()$` using the data.

&rarr; Generate predicted values: `$\hat{y} = \hat{f}(x)$`.

---

### Simple Linear Regression Model

Consider this model when:

* Response variable `$(y)$`: quantitative

* Explanatory variable `$(x)$`: quantitative
    + Have only ONE explanatory variable.

--
    
* AND, `$f()$` can be approximated by a line.

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

> "The social contract of Halloween is simple: Provide adequate treats to costumed masses, or be prepared for late-night tricks from those dissatisfied with your offer. To help you avoid that type of vengeance, and to help you make good decisions at the supermarket this weekend, we wanted to figure out what Halloween candy people most prefer. So we devised an experiment: Pit dozens of fun-sized candy varietals against one another, and let the wisdom of the crowd decide which one was best." -- Walt Hickey

> "While we don’t know who exactly voted, we do know this: 8,371 different IP addresses voted on about 269,000 randomly generated matchups.2 So, not a scientific survey or anything, but a good sample of what candy people like. "

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

```r
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")

glimpse(candy)
```

```
## Rows: 85
## Columns: 13
## $ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
## $ chocolate <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ fruity <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ caramel <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ peanutyalmondy <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nougat <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ hard <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
## $ bar <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ pluribus <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
## $ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
## $ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
```

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

.pull-left[

* Linear trend?

* Direction of trend?

]

.pull-right[

]

---

### Example: [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/)

.pull-left[

**A simple linear regression model would be suitable for these data.**

* But first, let's describe more plots!

]

.pull-right[

]

---

#### Need a summary statistics that quantifies the strength and relationship of the linear trend!

---

## (Sample) Correlation Coefficient

* Measures the **strength** and **direction** of **linear** relationship between two quantitative variables

* Symbol: `$r$`

* Always between -1 and 1

* Sign indicates the direction of the relationship

* Magnitude indicates the strength of the linear relationship

```r
candy %>%
  summarize(cor = cor(pricepercent, winpercent))
```

```
## # A tibble: 1 × 1
## cor
## <dbl>
## 1 0.345
```

---

.pull-left[

]

.pull-right[

Any guesses on the correlations for A, B, C, or D?

]

.pull-right[

```r
dat %>%
  summarize(A = cor(x, y1), B = cor(x, y2),
            C = cor(x, y3), D = cor(x, y4))
```

```
## # A tibble: 1 × 4
## A B C D
## <dbl> <dbl> <dbl> <dbl>
## 1 0.695 -0.217 -0.815 -0.113
```

]

---

## New Example

.pull-left[

```r
# Correlation coefficients
dat2 %>%
  group_by(dataset) %>%
  summarize(cor = cor(x, y))
```

```
## # A tibble: 13 × 2
## dataset cor
## <chr> <dbl>
## 1 away -0.0641
## 2 bullseye -0.0686
## 3 circle -0.0683
## 4 dino -0.0645
## 5 dots -0.0603
## 6 h_lines -0.0617
## 7 high_lines -0.0685
## 8 slant_down -0.0690
## 9 slant_up -0.0686
## 10 star -0.0630
## 11 v_lines -0.0694
## 12 wide_lines -0.0666
## 13 x_shape -0.0656
```

]

.pull-right[

* Conclude that `$x$` and `$y$` have the same relationship across these different datasets because the correlation is the same?

]

---

###  Always graph the data when exploring relationships!

---

class: center, middle, inverse

# It's Time for Trend Stretches!

---

### Simple Linear Regression

.pull-left[

]

.pull-right[

Let's return to the Candy Example.

* A line is a reasonable model form.

* Where should the line be?
    + Slope? Intercept?

]

---

###  Form of the SLR Model

$$ 
`\begin{align}
y &= f(x) + \epsilon \\
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

**Need to determine the best estimates of `$\beta_o$` and `$\beta_1$`.**

*****************************

#### Distinguishing between the population and the sample

* Parameters: 
    + Based on the population
    + Unknown then if don't have data on the whole population
    + EX: `$\beta_o$` and `$\beta_1$`

* Statistics: 
    + Based on the sample data
    + Known
    + Usually estimate a population parameter
    + EX: `$\hat{\beta}_o$` and `$\hat{\beta}_1$`

---

### Method of Least Squares

Need two key definitions:

* Fitted value: The *estimated* value of the `$i$`-th case

$$
\hat{y}_i = \hat{\beta}_o + \hat{\beta}_1 x_i
$$
--

* Residuals: The *observed* error term for the `$i$`-th case

$$
e_i = y_i - \hat{y}_i
$$

**Goal**: Pick values for `$\hat{\beta}_o$` and `$\hat{\beta}_1$`  so that the residuals are small!

---

### Method of Least Squares

* Want residuals to be small.

* Minimize some function of the residuals.

---

### Method of Least Squares

Minimize:

$$
\sum_{i = 1}^n e^2_i
$$

Get the following equations:

$$ 
`\begin{align}
\hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
\hat{\beta}_o &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}`
$$
where

$$
`\begin{align}
\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i
\end{align}`
$$

---

## Method of Least Squares

Once we have the estimated intercept `$(\hat{\beta}_o)$` and the estimated slope `$(\hat{\beta}_1)$`, we can estimate the whole function:

$$
\hat{y} = \hat{\beta}_o + \hat{\beta}_1 x
$$

Called the **least squares line** or the **line of best fit**.

---

### Method of Least Squares

.pull-left[

`ggplot2` will compute the line and add it to your plot using `geom_smooth(method = "lm")`

But what are the **exact** values of `$\hat{\beta}_o$` and `$\hat{\beta}_1$`?

]

.pull-right[

]

---

### Constructing the Simple Linear Regression Model in R

```r
mod <- lm(winpercent ~ pricepercent, data = candy)

library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 42.0 2.91 14.4 0 36.2 47.8
## 2 pricepercent 17.8 5.30 3.35 0.001 7.23 28.3
```

---

### Interpretation

Slope: 17.8

Intercept: 42.0

---

### Prediction

```r
new_cases <- data.frame(pricepercent = c(0.25, 0.85, 1.5))
predict(mod, newdata = new_cases)
```

```
##        1        2        3 
## 46.42443 57.09409 68.65289
```

We didn't have any treats in our sample with a price percentage of 85%.  Can we still make this prediction?

&rarr;  Called **interpolation**

We didn't have any treats in our sample with a price percentage of 150%.  Can we still make this prediction?

&rarr;  Called **extrapolation**

---

### Cautions

.pull-left[

* Careful to only predict values within the range of `$x$` values in the sample.

* Make sure to investigate **influential points**.

**Q:** What is an **outlier**?

]

.pull-right[

]

---

### Linear Regression

Linear regression is a flexible class of models that allow for:

* Both quantitative and categorical explanatory variables.

* Multiple explanatory variables.

* Curved relationships between the response variable and the explanatory variable.

* BUT the **response variable is quantitative**.

********************

### What About A Categorical Explanatory Variable?

* Response variable `$(y)$`: quantitative

* Have 1 categorical explanatory variable `$(x)$` with two categories.

---

### Example: The Smile-Leniency Effect

**Can a simple smile have an effect on punishment assigned following an infraction?** In a 1995 study, Hecht and LeFrance examined the effect of a smile on the leniency of disciplinary action for wrongdoers. Participants in the experiment took on the role of members of a college disciplinary panel judging students accused of cheating. For each suspect, along with a description of the offense, a picture was provided with either a smile or neutral facial expression. A leniency score was calculated based on the disciplinary decisions made by the participants.

**Response variable?**

**Explanatory variable?**

---

###  Model Form

$$ 
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

First, need to convert the categories of `$x$` to numbers.

Before building the model, let's explore and visualize the data!

```r
library(tidyverse)
library(Lock5Data)
# Load data
data(Smiles)
smiles <- Smiles
glimpse(smiles)
```

```
## Rows: 68
## Columns: 2
## $ Leniency <dbl> 7.0, 3.0, 6.0, 4.5, 3.5, 4.0, 3.0, 3.0, 3.5, 4.5, 7.0, 5.0, 5…
## $ Group <fct> smile, smile, smile, smile, smile, smile, smile, smile, smile…
```

* What `dplyr` functions should I use to find the mean and sd of `Leniency` by the categories of `Group`?

* What graph should we use to visualize the `Leniency` scores by `Group`?

---

### Example: The Smile-Leniency Effect

```r
# Summarize
smiles %>%
  group_by(Group) %>%
  summarize(count = n(), mean_len = mean(Leniency), 
            sd_len = sd(Leniency))
```

```
## # A tibble: 2 × 4
## Group count mean_len sd_len
## <fct> <int> <dbl> <dbl>
## 1 neutral 34 4.12 1.52
## 2 smile 34 4.91 1.68
```

---

### Example: The Smile-Leniency Effect

.pull-left[

```r
# Visualize
ggplot(smiles, aes(x = Group, 
                   y = Leniency)) +
  geom_boxplot() +
    stat_summary(fun = mean,
                 geom = "point",
                 color = "purple",
                 size = 4)
```

]

.pull-right[

]

---

### Fit the Linear Regression Model

Model Form:

$$ 
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

When `$x = 0$`:

When `$x = 1$`:

```r
mod <- lm(Leniency ~ Group, data = smiles)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 4.12 0.275 15.0 0 3.57 4.67
## 2 Group: smile 0.794 0.389 2.04 0.045 0.017 1.57
```

---

### Notes

1. When the explanatory variable is categorical, `$\beta_o$` and `$\beta_1$` no longer represent the interceopt and slope.

2. Now `$\beta_o$` represents the (population) mean of the response variable when `$x = 0$`.

3. And, `$\beta_1$` represents the change in the (population) mean response going from `$x = 0$` to `$x = 1$`.

4. Can also do prediction:

```r
new <- data.frame(Group = c("smile", "neutral"))
predict(mod, newdata = new)
```

```
##        1        2 
## 4.911765 4.117647
```

---

## And, For Those Buying Late Valentine Gifts...