background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .base-blue[Simple Linear Regression]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 6 | Fall 2022]

]

---

## Announcements

* Final Exam:
    + Take-home will be released at noon on Wed, Dec 7th. You must complete it before 4pm on Dec 10th over a 3-hour period of your choice.
    + Oral exams will start on the afternoon of the 8th and go through 4:30pm on the 10th. They are 10 minutes long and should be done AFTER you complete the take-home.
* P-Set 4 due tomorrow at 5pm.
* Project Assignment 1 due Friday at 5pm.

****************************

--

## Goals for Today

.pull-left[

* Simple linear regression model
    + Estimating the slope and intercept terms
    + Prediction
    + Consider one quantitative predictor
    + Consider one categorical predictor

]

---

class: center, middle

## It's Time for Trend Stretches!

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-1-1.png" width="540" style="display: block; margin: auto;" />

---

### Simple Linear Regression

.pull-left[

<img src="stat100_wk06mon_files/figure-html/candy2-1.png" width="768" style="display: block; margin: auto;" />

]

.pull-right[

Let's return to the Candy Example.

* A line is a reasonable model form.
* Where should the line be?
    + Slope? Intercept?
]

---

### Form of the SLR Model

$$
`\begin{align}
y &= f(x) + \epsilon \\
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

**Need to determine the best estimates of `\(\beta_o\)` and `\(\beta_1\)`.**

--

*****************************

#### Distinguishing between the population and the sample

--

* Parameters:
    + Based on the population
    + Unknown if we don't have data on the whole population
    + EX: `\(\beta_o\)` and `\(\beta_1\)`

--

* Statistics:
    + Based on the sample data
    + Known
    + Usually estimate a population parameter
    + EX: `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`

---

### Method of Least Squares

Need two key definitions:

--

* Fitted value: The *estimated* value of the `\(i\)`-th case

$$
\hat{y}_i = \hat{\beta}_o + \hat{\beta}_1 x_i
$$

--

* Residual: The *observed* error term for the `\(i\)`-th case

$$
e_i = y_i - \hat{y}_i
$$

**Goal**: Pick values for `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)` so that the residuals are small!

---

### Method of Least Squares

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

--

* Want residuals to be small.

--

* Minimize some function of the residuals.

---

### Method of Least Squares

Minimize:

$$
\sum_{i = 1}^n e^2_i
$$

--

Get the following equations:

$$
`\begin{align}
\hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
\hat{\beta}_o &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align}`
$$

where

$$
`\begin{align}
\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i
\end{align}`
$$

---

## Method of Least Squares

Once we have the estimated intercept `\((\hat{\beta}_o)\)` and the estimated slope `\((\hat{\beta}_1)\)`, we can estimate the whole function:

--

$$
\hat{y} = \hat{\beta}_o + \hat{\beta}_1 x
$$

Called the **least squares line** or the **line of best fit**.
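---

## Method of Least Squares

The slope and intercept formulas can also be checked by hand. A minimal sketch, using a small made-up data set (the `x` and `y` values below are illustrative, not the candy data):

```r
# Small illustrative data set (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

x_bar <- mean(x)
y_bar <- mean(y)

# Slope: sum of cross-deviations over sum of squared x-deviations
beta1_hat <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)

# Intercept: forces the line through the point (x_bar, y_bar)
beta0_hat <- y_bar - beta1_hat * x_bar

# lm() minimizes the same sum of squared residuals,
# so it returns the same two estimates
coef(lm(y ~ x))
```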
---

### Method of Least Squares

.pull-left[

`ggplot2` will compute the line and add it to your plot using `geom_smooth(method = "lm")`.

But what are the **exact** values of `\(\hat{\beta}_o\)` and `\(\hat{\beta}_1\)`?

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/candy3-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Constructing the Simple Linear Regression Model in R

```r
mod <- lm(winpercent ~ pricepercent, data = candy)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
##   term         estimate std_error statistic p_value lower_ci upper_ci
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept        42.0      2.91     14.4    0        36.2      47.8
## 2 pricepercent     17.8      5.30      3.35   0.001     7.23     28.3
```

---

### Interpretation

Slope: 17.8

<br><br><br>

<br>

<br>

<br>

Intercept: 42.0

---

### Prediction

```r
new_cases <- data.frame(pricepercent = c(0.25, 0.85, 1.5))
predict(mod, newdata = new_cases)
```

```
##        1        2        3 
## 46.42443 57.09409 68.65289
```

We didn't have any treats in our sample with a price percentage of 85%. Can we still make this prediction?

--

→ Called **interpolation**

We didn't have any treats in our sample with a price percentage of 150%. Can we still make this prediction?

--

→ Called **extrapolation**

---

### Cautions

.pull-left[

* Be careful to only predict values within the range of `\(x\)` values in the sample.

* Make sure to investigate **influential points**.

**Q:** What is an **outlier**?

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" />

]

---

### Linear Regression

Linear regression is a flexible class of models that allows for:

* Both quantitative and categorical explanatory variables.

--

* Multiple explanatory variables.

--

* Curved relationships between the response variable and the explanatory variable.

--

* BUT the **response variable is quantitative**.

********************

--

### What About A Categorical Explanatory Variable?
--

* Response variable `\((y)\)`: quantitative

--

* Have 1 categorical explanatory variable `\((x)\)` with two categories.

---

### Example: The Smile-Leniency Effect

**Can a simple smile have an effect on punishment assigned following an infraction?**

In a 1995 study, Hecht and LaFrance examined the effect of a smile on the leniency of disciplinary action for wrongdoers. Participants in the experiment took on the role of members of a college disciplinary panel judging students accused of cheating. For each suspect, along with a description of the offense, a picture was provided with either a smiling or a neutral facial expression. A leniency score was calculated based on the disciplinary decisions made by the participants.

**Response variable?**

**Explanatory variable?**

---

### Model Form

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

First, we need to convert the categories of `\(x\)` to numbers.

--

Before building the model, let's explore and visualize the data!

```r
library(tidyverse)
library(Lock5Data)

# Load data
data(Smiles)
smiles <- Smiles
glimpse(smiles)
```

```
## Rows: 68
## Columns: 2
## $ Leniency <dbl> 7.0, 3.0, 6.0, 4.5, 3.5, 4.0, 3.0, 3.0, 3.5, 4.5, 7.0, 5.0, 5…
## $ Group    <fct> smile, smile, smile, smile, smile, smile, smile, smile, smile…
```

--

* What `dplyr` functions should I use to find the mean and sd of `Leniency` by the categories of `Group`?

--

* What graph should we use to visualize the `Leniency` scores by `Group`?
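---

### Model Form

Converting the categories of `\(x\)` to numbers can be done by hand with a 0/1 indicator variable. A minimal sketch, using a hypothetical toy data set (not the real `Smiles` data):

```r
# Hypothetical toy data set, for illustration only
toy <- data.frame(
  Group    = factor(c("neutral", "neutral", "smile", "smile")),
  Leniency = c(4, 5, 6, 7)
)

# Hand-code the indicator: smile -> 1, neutral -> 0
toy$x <- ifelse(toy$Group == "smile", 1, 0)

# Intercept = mean of the neutral group (4.5);
# slope = smile mean minus neutral mean (6.5 - 4.5 = 2)
coef(lm(Leniency ~ x, data = toy))
```

This is the same coding `lm()` applies automatically when it is handed a two-category factor like `Group`.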
---

### Example: The Smile-Leniency Effect

```r
# Summarize
smiles %>%
  group_by(Group) %>%
  summarize(count = n(),
            mean_len = mean(Leniency),
            sd_len = sd(Leniency))
```

```
## # A tibble: 2 × 4
##   Group   count mean_len sd_len
##   <fct>   <int>    <dbl>  <dbl>
## 1 neutral    34     4.12   1.52
## 2 smile      34     4.91   1.68
```

---

### Example: The Smile-Leniency Effect

.pull-left[

```r
# Visualize
ggplot(smiles, aes(x = Group, y = Leniency)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point",
               color = "purple", size = 4)
```

]

.pull-right[

<img src="stat100_wk06mon_files/figure-html/box-1.png" width="768" style="display: block; margin: auto;" />

]

---

### Fit the Linear Regression Model

Model Form:

$$
`\begin{align}
y &= \beta_o + \beta_1 x + \epsilon
\end{align}`
$$

--

When `\(x = 0\)`:

<br>

<br>

When `\(x = 1\)`:

--

```r
mod <- lm(Leniency ~ Group, data = smiles)
library(moderndive)
get_regression_table(mod)
```

```
## # A tibble: 2 × 7
##   term         estimate std_error statistic p_value lower_ci upper_ci
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept       4.12      0.275     15.0    0        3.57     4.67
## 2 Group: smile    0.794     0.389      2.04   0.045    0.017    1.57
```

---

### Notes

1. When the explanatory variable is categorical, `\(\beta_o\)` and `\(\beta_1\)` no longer represent the intercept and slope.

--

2. Now `\(\beta_o\)` represents the (population) mean of the response variable when `\(x = 0\)`.

--

3. And, `\(\beta_1\)` represents the change in the (population) mean response going from `\(x = 0\)` to `\(x = 1\)`.

--

4. Can also do prediction:

```r
new <- data.frame(Group = c("smile", "neutral"))
predict(mod, newdata = new)
```

```
##        1        2 
## 4.911765 4.117647
```

---

## And, As You Start Planning For Halloween...

<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />

---

class: center, middle

## Survey Time

Someone from the teaching staff is going to give you a card with a number that was randomly generated for you.
Please use that number to take the following anonymous survey:

## [https://bit.ly/stat100trees](https://bit.ly/stat100trees)