background-image: url("img/DAW.png") background-position: left background-size: 50% class: middle, center, inverse .pull-right[ ## .whitish[More Linear Regression] <br> <br> ### .whitish[Kelly McConville] #### .yellow[ Stat 100 | Week 6 | Spring 2022] ] --- ### Announcements * Mid-Term Exam: Wednesday, March 9th - Friday, March 11th + Can now sign up for your oral exam slot [here](https://docs.google.com/document/d/1fL0ge4JW_-KYqLzxD_oQeMXt3sLQvTdPW9N67OVbjAY/edit?usp=sharing). * For Project Assignment 1, please fill out [this feedback survey](https://forms.gle/ytMfBBi2iBJKby8w6) (also on Canvas and in the announcements channel) by Friday, March 4th. **************************** -- ### Goals for Today .pull-left[ * Discuss upcoming exam. * Receive Project Assignment 2. * Explore ethics in modeling: algorithmic bias. * Consider approaches to handling missingness. ] .pull-right[ * Practice interpreting model coefficients. * Continue discussing multiple linear regression models. * Explore polynomial terms. * Consider categorical explanatory variables with more than 2 categories. * Discuss guiding principles for model building. ] --- class: middle, center, inverse # Midterm Exam ## Let's go through "exam1Review.Rmd" in the "stat100/handouts"! --- class: middle, center, inverse # Project Assignment 2 ## Let's go through "pa2.Rmd" in the "stat100/project"! 
--- class: middle, center, inverse # Data Ethics: Algorithmic Bias --- class: middle, center, inverse # Data Ethics ## Return to the American Statistical Association's ["Ethical Guidelines for Statistical Practice"](https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx) --- class: inverse, center, middle ## Integrity of Data and Methods > "The ethical statistical practitioner seeks to understand and mitigate known or suspected limitations, defects, or biases in the data or methods and communicates potential impacts on the interpretation, conclusions, recommendations, decisions, or other results of statistical practices." --- class: inverse, center, middle ## Integrity of Data and Methods > "For models and algorithms designed to inform or implement decisions repeatedly, develops and/or implements plans to validate assumptions and assess performance over time, as needed. Considers criteria and mitigation plans for model or algorithm failure and retirement." --- ## Algorithmic Bias **Algorithmic bias**: when the model systematically creates unfair outcomes, such as privileging one group over another. **Example**: [The Coded Gaze](https://www.youtube.com/watch?v=162VzSzzoPs) <div class="figure" style="text-align: center"> <img src="img/coded_gaze.png" alt="Joy Buolamwini" width="35%" /> <p class="caption">Joy Buolamwini</p> </div> -- * Facial recognition software struggles to see faces of color. -- * Algorithms built on a non-diverse, biased dataset. 
--- **Example**: [COMPAS model](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) used throughout the country to predict recidivism -- * Differences in predictions across race and gender <div class="figure" style="text-align: center"> <img src="img/compas.png" alt="ProPublica Analysis" width="55%" /> <p class="caption">ProPublica Analysis</p> </div> -- > "For models and algorithms designed to inform or implement decisions repeatedly, develops and/or implements plans to validate assumptions and assess performance over time, as needed. Considers criteria and mitigation plans for model or algorithm failure and retirement." -- ASA Guidelines --- class: inverse, center, middle ## What to do about missing values? --- ### Missingness **Unit non-response**: Complete absence of responses for a subset of the sampled units <img src="img/sampling.003.jpeg" width="80%" style="display: block; margin: auto;" /> --- ### Missingness **Item non-response**: When you have no response for a subset of the measured variables for a sampled unit. ```r library(NHANES) library(tidyverse) select(NHANES, Height, Depressed, SleepHrsNight) %>% glimpse() ``` ``` ## Rows: 10,000 ## Columns: 3 ## $ Height <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.7, … ## $ Depressed <fct> Several, Several, Several, NA, Several, NA, NA, None, No… ## $ SleepHrsNight <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, NA, … ``` What should we do about the item non-response? --- ### Missingness What should we do about the item non-response? 
```r newdat <- na.omit(NHANES) glimpse(newdat) ``` ``` ## Rows: 0 ## Columns: 76 ## $ ID <int> ## $ SurveyYr <fct> ## $ Gender <fct> ## $ Age <int> ## $ AgeDecade <fct> ## $ AgeMonths <int> ## $ Race1 <fct> ## $ Race3 <fct> ## $ Education <fct> ## $ MaritalStatus <fct> ## $ HHIncome <fct> ## $ HHIncomeMid <int> ## $ Poverty <dbl> ## $ HomeRooms <int> ## $ HomeOwn <fct> ## $ Work <fct> ## $ Weight <dbl> ## $ Length <dbl> ## $ HeadCirc <dbl> ## $ Height <dbl> ## $ BMI <dbl> ## $ BMICatUnder20yrs <fct> ## $ BMI_WHO <fct> ## $ Pulse <int> ## $ BPSysAve <int> ## $ BPDiaAve <int> ## $ BPSys1 <int> ## $ BPDia1 <int> ## $ BPSys2 <int> ## $ BPDia2 <int> ## $ BPSys3 <int> ## $ BPDia3 <int> ## $ Testosterone <dbl> ## $ DirectChol <dbl> ## $ TotChol <dbl> ## $ UrineVol1 <int> ## $ UrineFlow1 <dbl> ## $ UrineVol2 <int> ## $ UrineFlow2 <dbl> ## $ Diabetes <fct> ## $ DiabetesAge <int> ## $ HealthGen <fct> ## $ DaysPhysHlthBad <int> ## $ DaysMentHlthBad <int> ## $ LittleInterest <fct> ## $ Depressed <fct> ## $ nPregnancies <int> ## $ nBabies <int> ## $ Age1stBaby <int> ## $ SleepHrsNight <int> ## $ SleepTrouble <fct> ## $ PhysActive <fct> ## $ PhysActiveDays <int> ## $ TVHrsDay <fct> ## $ CompHrsDay <fct> ## $ TVHrsDayChild <int> ## $ CompHrsDayChild <int> ## $ Alcohol12PlusYr <fct> ## $ AlcoholDay <int> ## $ AlcoholYear <int> ## $ SmokeNow <fct> ## $ Smoke100 <fct> ## $ Smoke100n <fct> ## $ SmokeAge <int> ## $ Marijuana <fct> ## $ AgeFirstMarij <int> ## $ RegularMarij <fct> ## $ AgeRegMarij <int> ## $ HardDrugs <fct> ## $ SexEver <fct> ## $ SexAge <int> ## $ SexNumPartnLife <int> ## $ SexNumPartYear <int> ## $ SameSex <fct> ## $ SexOrientation <fct> ## $ PregnantNow <fct> ``` --- ### Missingness What should we do about the item non-response? 
```r newdat <- drop_na(NHANES, Height, Depressed, SleepHrsNight) glimpse(newdat) ``` ``` ## Rows: 6,619 ## Columns: 76 ## $ ID <int> 51624, 51624, 51624, 51630, 51647, 51647, 51647, 5165… ## $ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,… ## $ Gender <fct> male, male, male, female, female, female, female, mal… ## $ Age <int> 34, 34, 34, 49, 45, 45, 45, 66, 58, 54, 50, 33, 60, 5… ## $ AgeDecade <fct> 30-39, 30-39, 30-39, 40-49, 40-49, 40-49, 40-4… ## $ AgeMonths <int> 409, 409, 409, 596, 541, 541, 541, 795, 707, 654, 603… ## $ Race1 <fct> White, White, White, White, White, White, White, Whit… ## $ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ Education <fct> High School, High School, High School, Some College, … ## $ MaritalStatus <fct> Married, Married, Married, LivePartner, Married, Marr… ## $ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 35000-44999, 7… ## $ HHIncomeMid <int> 30000, 30000, 30000, 40000, 87500, 87500, 87500, 3000… ## $ Poverty <dbl> 1.36, 1.36, 1.36, 1.91, 5.00, 5.00, 5.00, 2.20, 5.00,… ## $ HomeRooms <int> 6, 6, 6, 5, 6, 6, 6, 5, 10, 6, 4, 11, 5, 10, 10, 9, 3… ## $ HomeOwn <fct> Own, Own, Own, Rent, Own, Own, Own, Own, Rent, Rent, … ## $ Work <fct> NotWorking, NotWorking, NotWorking, NotWorking, Worki… ## $ Weight <dbl> 87.4, 87.4, 87.4, 86.7, 75.7, 75.7, 75.7, 68.0, 78.4,… ## $ Length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ Height <dbl> 164.7, 164.7, 164.7, 168.4, 166.7, 166.7, 166.7, 169.… ## $ BMI <dbl> 32.22, 32.22, 32.22, 30.57, 27.24, 27.24, 27.24, 23.6… ## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ BMI_WHO <fct> 30.0_plus, 30.0_plus, 30.0_plus, 30.0_plus, 25.0_to_2… ## $ Pulse <int> 70, 70, 70, 86, 62, 62, 62, 60, 62, 76, 74, 96, 84, 6… ## $ BPSysAve <int> 113, 113, 113, 112, 118, 118, 118, 111, 104, 134, 142… ## $ BPDiaAve <int> 85, 85, 85, 
75, 64, 64, 64, 63, 74, 85, 68, 74, 100, … ## $ BPSys1 <int> 114, 114, 114, 118, 106, 106, 106, 124, 108, 136, 138… ## $ BPDia1 <int> 88, 88, 88, 82, 62, 62, 62, 64, 76, 86, 66, 80, 98, 7… ## $ BPSys2 <int> 114, 114, 114, 108, 118, 118, 118, 108, 104, 132, 142… ## $ BPDia2 <int> 88, 88, 88, 74, 68, 68, 68, 62, 72, 88, 74, 74, 98, 7… ## $ BPSys3 <int> 112, 112, 112, 116, 118, 118, 118, 114, 104, 136, 142… ## $ BPDia3 <int> 82, 82, 82, 76, 60, 60, 60, 64, 76, 82, 62, NA, 102, … ## $ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ DirectChol <dbl> 1.29, 1.29, 1.29, 1.16, 2.12, 2.12, 2.12, 0.67, 0.96,… ## $ TotChol <dbl> 3.49, 3.49, 3.49, 6.70, 5.82, 5.82, 5.82, 4.99, 4.24,… ## $ UrineVol1 <int> 352, 352, 352, 77, 106, 106, 106, 113, 163, 215, 64, … ## $ UrineFlow1 <dbl> NA, NA, NA, 0.094, 1.116, 1.116, 1.116, 0.489, NA, 0.… ## $ UrineVol2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8… ## $ UrineFlow2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0… ## $ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N… ## $ DiabetesAge <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ HealthGen <fct> Good, Good, Good, Good, Vgood, Vgood, Vgood, Vgood, V… ## $ DaysPhysHlthBad <int> 0, 0, 0, 0, 0, 0, 0, 10, 0, 4, 0, 3, 7, 3, 3, 0, 0, 0… ## $ DaysMentHlthBad <int> 15, 15, 15, 10, 3, 3, 3, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0… ## $ LittleInterest <fct> Most, Most, Most, Several, None, None, None, None, No… ## $ Depressed <fct> Several, Several, Several, Several, None, None, None,… ## $ nPregnancies <int> NA, NA, NA, 2, 1, 1, 1, NA, NA, NA, NA, NA, NA, 4, 4,… ## $ nBabies <int> NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3,… ## $ Age1stBaby <int> NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2… ## $ SleepHrsNight <int> 4, 4, 4, 8, 8, 8, 8, 7, 5, 4, 7, 6, 6, 7, 7, 8, 6, 6,… ## $ SleepTrouble <fct> Yes, Yes, Yes, Yes, No, No, No, No, No, Yes, No, No, … ## $ PhysActive <fct> No, No, No, No, Yes, 
Yes, Yes, Yes, Yes, Yes, Yes, No… ## $ PhysActiveDays <int> NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, 7, NA, NA, 7, 7, 3,… ## $ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ TVHrsDayChild <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ CompHrsDayChild <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ Alcohol12PlusYr <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No,… ## $ AlcoholDay <int> NA, NA, NA, 2, 3, 3, 3, 1, 2, 6, NA, 3, 6, 1, 1, 1, 2… ## $ AlcoholYear <int> 0, 0, 0, 20, 52, 52, 52, 100, 104, 364, 0, 104, 36, 1… ## $ SmokeNow <fct> No, No, No, Yes, NA, NA, NA, No, NA, NA, NA, No, No, … ## $ Smoke100 <fct> Yes, Yes, Yes, Yes, No, No, No, Yes, No, No, No, Yes,… ## $ Smoke100n <fct> Smoker, Smoker, Smoker, Smoker, Non-Smoker, Non-Smoke… ## $ SmokeAge <int> 18, 18, 18, 38, NA, NA, NA, 13, NA, NA, NA, NA, 16, N… ## $ Marijuana <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes, No, … ## $ AgeFirstMarij <int> 17, 17, 17, 18, 13, 13, 13, NA, 19, 15, NA, NA, NA, N… ## $ RegularMarij <fct> No, No, No, No, No, No, No, NA, Yes, Yes, No, No, NA,… ## $ AgeRegMarij <int> NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, NA, NA, NA, N… ## $ HardDrugs <fct> Yes, Yes, Yes, Yes, No, No, No, No, Yes, Yes, No, No,… ## $ SexEver <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes… ## $ SexAge <int> 16, 16, 16, 12, 13, 13, 13, 17, 22, 12, NA, 27, 20, 2… ## $ SexNumPartnLife <int> 8, 8, 8, 10, 20, 20, 20, 15, 7, 100, 9, 1, 1, 2, 2, 5… ## $ SexNumPartYear <int> 1, 1, 1, 1, 0, 0, 0, NA, 1, 1, 1, 1, NA, 1, 1, 1, 2, … ## $ SameSex <fct> No, No, No, Yes, Yes, Yes, Yes, No, No, No, No, No, N… ## $ SexOrientation <fct> Heterosexual, Heterosexual, Heterosexual, Heterosexua… ## $ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ``` --- ### Missingness .pull-left[ <img 
src="stat100_wk06mon_files/figure-html/unnamed-chunk-7-1.png" width="576" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="stat100_wk06mon_files/figure-html/unnamed-chunk-8-1.png" width="576" style="display: block; margin: auto;" /> ] **WHAT HAPPENED?!** --- ### Missingness What should we do about the item non-response? ```r NHANES %>% group_by(Depressed) %>% summarize(MeanSleep = mean(SleepHrsNight)) ``` ``` ## # A tibble: 4 × 2 ## Depressed MeanSleep ## <fct> <dbl> ## 1 None NA ## 2 Several NA ## 3 Most NA ## 4 <NA> NA ``` --- ### Missingness What should we do about the item non-response? ```r NHANES %>% group_by(Depressed) %>% summarize(MeanSleep = mean(SleepHrsNight, na.rm = TRUE)) ``` ``` ## # A tibble: 4 × 2 ## Depressed MeanSleep ## <fct> <dbl> ## 1 None 7.00 ## 2 Several 6.74 ## 3 Most 6.35 ## 4 <NA> 6.99 ``` --- ### Missingness What should we do about the item non-response? ```r NHANES %>% drop_na(Depressed) %>% group_by(Depressed) %>% summarize(MeanSleep = mean(SleepHrsNight)) ``` ``` ## # A tibble: 3 × 2 ## Depressed MeanSleep ## <fct> <dbl> ## 1 None NA ## 2 Several NA ## 3 Most NA ``` --- ### Missingness What should we do about the item non-response? Beyond Stat 100 Answers: -- * Build an **imputation model** and fill in the missing values. -- * Give different **weights** to the respondents when constructing statistics. -- Take **Stat 160: Sample Surveys** to learn about these approaches! -- **Bottom line**: Be transparent about how you handle the missingness and if that impacts the group you are generalizing to! --- class: inverse, center, middle ## Let's return to Multiple Linear Regression --- ## Multiple Linear Regression Linear regression is a flexible class of models that allow for: * Both quantitative and categorical explanatory variables. * Multiple explanatory variables. * Curved relationships between the response variable and the explanatory variable. * BUT the response variable is quantitative. 
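The capabilities listed above can be sketched with a quick toy fit (not from the slides): a model with one quantitative and one categorical explanatory variable, using the built-in `mtcars` data rather than the course datasets.

```r
# A minimal sketch (toy data, not the movies example below):
# quantitative response (mpg), quantitative explanatory variable (wt),
# and a categorical explanatory variable (cyl, coerced to a factor).
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# One intercept, one slope for wt, and one offset each for cyl = 6 and cyl = 8
coef(fit)
```

R builds the indicator (dummy) variables for `factor(cyl)` automatically, which is why a 3-category variable contributes two coefficients here.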
**Form of the Model:** $$ `\begin{align} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align}` $$ --- ### New Example: Movies Let's model a movie's critic rating using the audience rating and the movie's genre. ```r library(tidyverse) movies <- read_csv("https://www.lock5stat.com/datasets2e/HollywoodMovies.csv") # Restrict our attention to drama, horror, and action movies movies2 <- movies %>% filter(Genre %in% c("Drama", "Horror", "Action")) %>% drop_na(Genre, AudienceScore, RottenTomatoes) glimpse(movies2) ``` ``` ## Rows: 313 ## Columns: 16 ## $ Movie <chr> "Spider-Man 3", "Transformers", "Pirates of the Carib… ## $ LeadStudio <chr> "Sony", "Paramount", "Disney", "Warner Bros", "Warner… ## $ RottenTomatoes <dbl> 61, 57, 45, 60, 20, 79, 35, 28, 41, 71, 95, 42, 18, 2… ## $ AudienceScore <dbl> 54, 89, 74, 90, 68, 86, 55, 56, 81, 52, 84, 55, 70, 6… ## $ Story <chr> "Metamorphosis", "Monster Force", "Rescue", "Sacrific… ## $ Genre <chr> "Action", "Action", "Action", "Action", "Action", "Ac… ## $ TheatersOpenWeek <dbl> 4252, 4011, 4362, 3103, 3778, 3408, 3959, 3619, 2911,… ## $ OpeningWeekend <dbl> 151.1, 70.5, 114.7, 70.9, 49.1, 33.4, 58.0, 45.3, 19.… ## $ BOAvgOpenWeekend <dbl> 35540, 17577, 26302, 22844, 12996, 9791, 14663, 12541… ## $ DomesticGross <dbl> 336.53, 319.25, 309.42, 210.61, 140.13, 134.53, 131.9… ## $ ForeignGross <dbl> 554.34, 390.46, 654.00, 245.45, 117.90, 249.00, 157.1… ## $ WorldGross <dbl> 890.87, 709.71, 963.42, 456.07, 258.02, 383.53, 289.0… ## $ Budget <dbl> 258.0, 150.0, 300.0, 65.0, 140.0, 110.0, 130.0, 110.0… ## $ Profitability <dbl> 345.30, 473.14, 321.14, 701.64, 184.30, 348.66, 222.3… ## $ OpenProfit <dbl> 58.57, 47.00, 38.23, 109.08, 35.07, 30.36, 44.62, 41.… ## $ Year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007,… ``` * **Response variable:** * **Explanatory variables:** --- #### How should we encode a categorical variable with more than 2 categories? 
<br> <br> **Equal Slopes Model:** <br> <br> <br> <br> <br> <br> **Different Slopes Model:** --- ### Exploring the Data .pull-left[ ```r ggplot(data = movies2, mapping = aes(x = AudienceScore, y = RottenTomatoes, color = Genre)) + geom_point(alpha = 0.5) + stat_smooth(method = lm, se = FALSE) + geom_abline(slope = 1, intercept = 0) ``` * Trends? * Should we include interaction terms in the model? ] .pull-right[ <img src="stat100_wk06mon_files/figure-html/mscat-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Side-bar: Identify Outliers on a Graph ```r outliers <- movies2 %>% mutate(DiffScore = AudienceScore - RottenTomatoes) %>% filter(DiffScore > 50 | DiffScore < -30) %>% select(Movie, DiffScore, AudienceScore, RottenTomatoes, Genre) outliers ``` ``` ## # A tibble: 9 × 5 ## Movie DiffScore AudienceScore RottenTomatoes Genre ## <chr> <dbl> <dbl> <dbl> <chr> ## 1 Saw IV 52 70 18 Horr… ## 2 Step Up 2: The Streets 55 81 26 Drama ## 3 Kit Kittredge: An American Girl -52 26 78 Drama ## 4 Stop-Loss -38 27 65 Drama ## 5 Transformers: Revenge of the Fal… 56 76 20 Acti… ## 6 The Twilight Saga: New Moon 51 78 27 Drama ## 7 Drag Me to Hell -31 61 92 Horr… ## 8 The Last Exorcism -41 32 73 Drama ## 9 Haywire -40 40 80 Acti… ``` --- ### Side-bar: Identify Outliers on a Graph .pull-left[ ```r library(ggrepel) ggplot(data = movies2, mapping = aes(x = AudienceScore, y = RottenTomatoes, color = Genre)) + geom_point(alpha = 0.5) + stat_smooth(method = lm, se = FALSE) + geom_abline(slope = 1, intercept = 0) + geom_text_repel(data = outliers, mapping = aes(label = Movie), force = 10, show.legend = FALSE, size = 6) ``` ] .pull-right[ <img src="stat100_wk06mon_files/figure-html/outliers-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Building the Model: Full model form: ```r mod <- lm(RottenTomatoes ~ AudienceScore*Genre, data = movies2) library(moderndive) get_regression_table(mod) ``` ``` ## # A tibble: 6 × 7 ## term estimate std_error 
statistic p_value lower_ci upper_ci ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 intercept -15.0 5.27 -2.85 0.005 -25.4 -4.67 ## 2 AudienceScore 1.01 0.085 11.8 0 0.84 1.18 ## 3 Genre: Drama 22.8 8.94 2.55 0.011 5.23 40.4 ## 4 Genre: Horror -15.2 11.0 -1.39 0.165 -36.8 6.32 ## 5 AudienceScore:GenreDra… -0.253 0.136 -1.86 0.065 -0.522 0.015 ## 6 AudienceScore:GenreHor… 0.365 0.206 1.77 0.078 -0.04 0.771 ``` --- ### Exploring the Data .pull-left[ ```r library(ggrepel) ggplot(data = movies2, mapping = aes(x = AudienceScore, y = RottenTomatoes, color = Genre)) + geom_point(alpha = 0.5) ``` Evidence of **curvature**? ] .pull-right[ <img src="stat100_wk06mon_files/figure-html/mscat2-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Exploring the Data .pull-left[ ```r library(ggrepel) ggplot(data = movies2, mapping = aes(x = AudienceScore, y = RottenTomatoes, color = Genre)) + geom_point(alpha = 0.5) + stat_smooth(method = lm, se = FALSE, formula = y ~ poly(x, degree = 2)) ``` Evidence of **curvature**? 
] .pull-right[ <img src="stat100_wk06mon_files/figure-html/curve-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Fitting the New Model ```r mod2 <- lm(RottenTomatoes ~ poly(AudienceScore, degree = 2, raw = TRUE)*Genre, data = movies2) get_regression_table(mod2, print = TRUE) ``` |term | estimate| std_error| statistic| p_value| lower_ci| upper_ci| |:--------------------------------------------------------|--------:|---------:|---------:|-------:|--------:|--------:| |intercept | 9.922| 14.851| 0.668| 0.505| -19.301| 39.145| |poly(AudienceScore, degree = 2, raw = TRUE)1 | 0.098| 0.515| 0.191| 0.849| -0.916| 1.113| |poly(AudienceScore, degree = 2, raw = TRUE)2 | 0.008| 0.004| 1.788| 0.075| -0.001| 0.016| |Genre: Drama | 88.923| 24.538| 3.624| 0.000| 40.638| 137.208| |Genre: Horror | -23.767| 31.054| -0.765| 0.445| -84.876| 37.342| |poly(AudienceScore, degree = 2, raw = TRUE)1:GenreDrama | -2.608| 0.840| -3.107| 0.002| -4.260| -0.956| |poly(AudienceScore, degree = 2, raw = TRUE)2:GenreDrama | 0.019| 0.007| 2.785| 0.006| 0.006| 0.032| |poly(AudienceScore, degree = 2, raw = TRUE)1:GenreHorror | 0.574| 1.223| 0.469| 0.639| -1.833| 2.981| |poly(AudienceScore, degree = 2, raw = TRUE)2:GenreHorror | -0.001| 0.012| -0.061| 0.951| -0.024| 0.022| --- ### Considering Other Explanatory Variables ```r movies2 %>% select(RottenTomatoes, AudienceScore, OpeningWeekend, DomesticGross, TheatersOpenWeek) %>% na.omit() %>% cor() ``` ``` ## RottenTomatoes AudienceScore OpeningWeekend DomesticGross ## RottenTomatoes 1.0000000 0.67904126 0.1482138 0.2477617 ## AudienceScore 0.6790413 1.00000000 0.3368534 0.4421614 ## OpeningWeekend 0.1482138 0.33685340 1.0000000 0.8823562 ## DomesticGross 0.2477617 0.44216141 0.8823562 1.0000000 ## TheatersOpenWeek -0.1892605 0.03874516 0.6520931 0.5681251 ## TheatersOpenWeek ## RottenTomatoes -0.18926052 ## AudienceScore 0.03874516 ## OpeningWeekend 0.65209307 ## DomesticGross 0.56812510 ## TheatersOpenWeek 1.00000000 ``` Also look at 
scatterplots! --- ### Considering Other Explanatory Variables ```r mod3 <- lm(RottenTomatoes ~ AudienceScore + TheatersOpenWeek, data = movies2) get_regression_table(mod3, print = TRUE) ``` |term | estimate| std_error| statistic| p_value| lower_ci| upper_ci| |:----------------|--------:|---------:|---------:|-------:|--------:|--------:| |intercept | 2.299| 4.600| 0.500| 0.618| -6.754| 11.351| |AudienceScore | 1.012| 0.060| 16.955| 0.000| 0.894| 1.129| |TheatersOpenWeek | -0.006| 0.001| -5.325| 0.000| -0.008| -0.004| --- class: center, middle, inverse ### Let's Practice! --- ### Let's Practice with the `palmerpenguins`! .pull-left[ ```r library(palmerpenguins) ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = species)) + geom_point(alpha = 0.7) + stat_smooth(method = "lm", se = FALSE) ``` ] .pull-right[ <img src="stat100_wk06mon_files/figure-html/penguins-1.png" width="768" style="display: block; margin: auto;" /> ] --- class: center, middle ```r mod1 <- lm(bill_length_mm ~ flipper_length_mm + species, data = penguins) get_regression_table(mod1, print = TRUE) ``` |term | estimate| std_error| statistic| p_value| lower_ci| upper_ci| |:------------------|--------:|---------:|---------:|-------:|--------:|--------:| |intercept | -2.059| 4.039| -0.510| 0.611| -10.002| 5.885| |flipper_length_mm | 0.215| 0.021| 10.129| 0.000| 0.173| 0.257| |species: Chinstrap | 8.780| 0.399| 21.998| 0.000| 7.995| 9.565| |species: Gentoo | 2.857| 0.659| 4.338| 0.000| 1.561| 4.152| --- class: center, middle ```r mod2 <- lm(bill_length_mm ~ flipper_length_mm * species, data = penguins) get_regression_table(mod2, print = TRUE) ``` |term | estimate| std_error| statistic| p_value| lower_ci| upper_ci| |:----------------------------------|--------:|---------:|---------:|-------:|--------:|--------:| |intercept | 13.587| 6.051| 2.246| 0.025| 1.685| 25.489| |flipper_length_mm | 0.133| 0.032| 4.168| 0.000| 0.070| 0.195| |species: Chinstrap | -7.994| 10.481| -0.763| 0.446| 
-28.611| 12.623| |species: Gentoo | -34.323| 9.820| -3.495| 0.001| -53.639| -15.007| |flipper_length_mm:speciesChinstrap | 0.088| 0.054| 1.631| 0.104| -0.018| 0.194| |flipper_length_mm:speciesGentoo | 0.182| 0.048| 3.801| 0.000| 0.088| 0.275| --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? -- .pull-left[ **Guiding Principle**: Capture the general trend, not the noise. $$ `\begin{align} y &= f(x) + \epsilon \\ y &= \mbox{TREND} + \mbox{NOISE} \end{align}` $$ ] -- .pull-right[ Returning to the 2008 Election Example: <img src="stat100_wk06mon_files/figure-html/unnamed-chunk-25-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? .pull-left[ **Guiding Principle**: Capture the general trend, not the noise. $$ `\begin{align} y &= f(x) + \epsilon \\ y &= \mbox{TREND} + \mbox{NOISE} \end{align}` $$ ] .pull-right[ Returning to the 2008 Election Example: <img src="stat100_wk06mon_files/figure-html/unnamed-chunk-26-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? .pull-left[ **Guiding Principle**: Capture the general trend, not the noise. $$ `\begin{align} y &= f(x) + \epsilon \\ y &= \mbox{TREND} + \mbox{NOISE} \end{align}` $$ ] .pull-right[ Returning to the 2008 Election Example: <img src="stat100_wk06mon_files/figure-html/unnamed-chunk-27-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? .pull-left[ **Guiding Principle**: Capture the general trend, not the noise. 
$$ `\begin{align} y &= f(x) + \epsilon \\ y &= \mbox{TREND} + \mbox{NOISE} \end{align}` $$ ] .pull-right[ Returning to the 2008 Election Example: <img src="stat100_wk06mon_files/figure-html/unnamed-chunk-28-1.png" width="576" style="display: block; margin: auto;" /> ] --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? -- .pull-left[ **Guiding Principle**: Occam's Razor for Modeling > "All other things being equal, simpler models are to be preferred over complex ones." -- ModernDive ] -- .pull-right[ <img src="stat100_wk06mon_files/figure-html/mf-1.png" width="768" style="display: block; margin: auto;" /> ] --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? -- **Guiding Principle**: Include explanatory variables that attempt to explain **different** aspects of the variation in the response variable. ```r library(mosaicData) SaratogaHouses %>% select(price, livingArea, age, rooms, bathrooms) %>% na.omit() %>% cor() ``` ``` ## price livingArea age rooms bathrooms ## price 1.0000000 0.7123902 -0.18879259 0.53117012 0.5972498 ## livingArea 0.7123902 1.0000000 -0.17424195 0.73366584 0.7185637 ## age -0.1887926 -0.1742420 1.00000000 -0.08226402 -0.3618973 ## rooms 0.5311701 0.7336658 -0.08226402 1.00000000 0.5175847 ## bathrooms 0.5972498 0.7185637 -0.36189725 0.51758469 1.0000000 ``` --- ### Model Building Guidance We often have several potential explanatory variables. How do we determine which to include in the model and in what form? -- **Guiding Principle**: Use your modeling motivation to determine how much you weight **interpretability** versus **prediction accuracy** when choosing the model. 
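One hedged way to see this trade-off in practice (a sketch on the built-in `mtcars` data, not the election example) is to compare a simple model against a much more flexible one on held-out observations:

```r
# A rough sketch: split the data, fit a simple and a flexible model,
# and compare prediction error on the held-out rows.
set.seed(1)
train_rows <- sample(nrow(mtcars), 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

simple  <- lm(mpg ~ wt, data = train)                # interpretable
complex <- lm(mpg ~ poly(wt, 5), data = train)       # flexible, harder to read

# Root mean squared prediction error on a given data set
rmse <- function(model, data) {
  sqrt(mean((data$mpg - predict(model, newdata = data))^2))
}
rmse(simple, test)
rmse(complex, test)
```

The flexible model always fits the training rows at least as well, but its out-of-sample error may not improve; when it doesn't, the interpretable model is the better choice.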
<img src="stat100_wk06mon_files/figure-html/unnamed-chunk-31-1.png" width="936" style="display: block; margin: auto;" /> --- ### Model Building * We will come back to methods for model selection. * Key ideas: + Determining the **response** variable and the potential **explanatory** variable(s) + Writing out the **model form** and understanding the terms + **Building** and **visualizing** linear regression models in `R` + **Comparing** different potential models --- class: inverse, center, middle # Conceptual Review Time! --- class: inverse, center, middle # One can conclude a causal relationship between two variables if the study uses random sampling. -- ## TRUE or FALSE ## (1) or (2) --- class: inverse, center, middle # The larger the value of the correlation coefficient, the stronger the linear relationship between two variables. -- ## TRUE or FALSE ## (1) or (2) --- class: inverse, center, middle # The mean tends to be larger than the median for right-skewed data. -- ## TRUE or FALSE ## (1) or (2) --- class: inverse, center, middle # A residual can only be non-negative. -- ## TRUE or FALSE ## (1) or (2) --- class: inverse, center, middle # A non-random sample of 10,000 individuals is better than a random sample of 1000. -- ## TRUE or FALSE ## (1) or (2) --- class: inverse, center, middle # Bar graphs and histograms are the same graph since they both contain bars. -- ## TRUE or FALSE ## (1) or (2) --- ## Reminders: * Mid-Term Exam: Wednesday, March 9th - Friday, March 11th + Can now sign up for your oral exam slot [here](https://docs.google.com/document/d/1fL0ge4JW_-KYqLzxD_oQeMXt3sLQvTdPW9N67OVbjAY/edit?usp=sharing). * For Project Assignment 1, please fill out [this feedback survey](https://forms.gle/ytMfBBi2iBJKby8w6) (also on Canvas and in the announcements channel) by Friday, March 4th.