Linear regression is a flexible class of models that allow for:
Both quantitative and categorical explanatory variables.
Multiple explanatory variables.
Curved relationships between the response variable and the explanatory variable.
BUT the response variable must be quantitative.
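For example, a single fit can combine all of these features (a minimal sketch using the SaratogaHouses data introduced below; the squared term gives a curved fit):

library(mosaicData)
# Categorical (centralAir) and quantitative (livingArea) predictors,
# plus a squared term to capture a curved relationship
lm(price ~ centralAir + livingArea + I(livingArea^2), data = SaratogaHouses)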
How do I pick the BEST model?
Comparing Models
Suppose I built 4 different models to predict the price of a Saratoga Springs house. Which is best?
library(mosaicData)
mod1 <- lm(price ~ centralAir, data = SaratogaHouses)
mod2 <- lm(price ~ centralAir + waterfront, data = SaratogaHouses)
mod3 <- lm(price ~ centralAir + waterfront + age + livingArea + bathrooms, data = SaratogaHouses)
mod4 <- lm(price ~ centralAir + waterfront + age + livingArea + bathrooms + sewer, data = SaratogaHouses)
Big question! Take Stat 139: Linear Models to learn systematic model selection techniques.
We will explore one approach. (But there are many possible approaches!)
Comparing Models
Suppose I built 4 different models. Which is best?
Pick the best model based on some measure of quality.
Measure of quality: \(R^2\) (Coefficient of Determination)
\[\begin{align*}
R^2 &= \mbox{Percent of total variation in y explained by the model}\\
&= 1- \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
\end{align*}\]
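To see the formula in action, \(R^2\) can be computed by hand in R (a minimal sketch using mod3 from the models above; fitted() extracts the \(\hat{y}\) values):

# Hand-computed R^2 for mod3, matching the formula above
y    <- SaratogaHouses$price
yhat <- fitted(mod3)
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)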
Strategy: Compute the \(R^2\) value for each model and pick the one with the highest \(R^2\).
Comparing Models with \(R^2\)
Strategy: Compute the \(R^2\) value for each model and pick the one with the highest \(R^2\).
library(mosaicData)
mod1 <- lm(price ~ centralAir, data = SaratogaHouses)
mod2 <- lm(price ~ centralAir + waterfront, data = SaratogaHouses)
mod3 <- lm(price ~ centralAir + waterfront + age + livingArea + bathrooms, data = SaratogaHouses)
mod4 <- lm(price ~ centralAir + waterfront + age + livingArea + bathrooms + sewer, data = SaratogaHouses)
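One way to carry this out (a sketch using base R; summary() reports the \(R^2\) of an lm fit):

# Extract R^2 from each of the four fitted models
sapply(list(mod1, mod2, mod3, mod4), function(m) summary(m)$r.squared)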
What data structures have we not tackled in Stat 100?
What Else?
Which data structures/variable types are we missing in this table?
| Response | Explanatory | Numerical Quantity | Parameter | Statistic |
|----------|-------------|--------------------|-----------|-----------|
| quantitative | - | mean | \(\mu\) | \(\bar{x}\) |
| categorical | - | proportion | \(p\) | \(\hat{p}\) |
| quantitative | categorical | difference in means | \(\mu_1 - \mu_2\) | \(\bar{x}_1 - \bar{x}_2\) |
| categorical | categorical | difference in proportions | \(p_1 - p_2\) | \(\hat{p}_1 - \hat{p}_2\) |
| quantitative | quantitative | correlation | \(\rho\) | \(r\) |
| quantitative | mix | model coefficients | \(\beta_i\)s | \(\hat{\beta}_i\)s |
Inference for Categorical Variables
Consider the situation where:
Response variable: categorical
Explanatory variable: categorical
Parameter of interest: \(p_1 - p_2\)
This parameter only makes sense when each variable has exactly two categories.
It is time to learn how to study the relationship between two categorical variables when at least one has more than two categories.
Hypotheses
\(H_0\): The two variables are independent.
\(H_a\): The two variables are dependent.
Example
Near-sightedness typically develops during the childhood years. Quinn, Shin, Maguire, and Stone (1999) explored whether there is a relationship between the type of light children were exposed to and their eye health, based on questionnaires filled out by the children’s parents at a university pediatric ophthalmology clinic.
# A tibble: 9 × 3
Lighting Eye n
<chr> <chr> <int>
1 dark Far 40
2 dark Near 18
3 dark Normal 114
4 night Far 39
5 night Near 78
6 night Normal 115
7 room Far 12
8 room Near 41
9 room Normal 22
Eyesight Example
Does there appear to be a relationship/dependence?
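The p-value computation below assumes a null distribution and an observed statistic have already been built. A minimal sketch with the infer package that constructs both (eye_counts is a hypothetical name for the count table shown above; stat = "Chisq" computes \(\sum (\text{observed} - \text{expected})^2 / \text{expected}\)):

library(infer)
library(tidyr)

# Expand the count table into one row per child
eye_data <- eye_counts %>% uncount(n)

# Observed chi-squared statistic
test_stat <- eye_data %>%
  specify(Eye ~ Lighting) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

# Null distribution by permuting the Lighting labels
null_dist <- eye_data %>%
  specify(Eye ~ Lighting) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")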
# Compute the p-value
null_dist %>%
  get_p_value(obs_stat = test_stat, direction = "greater")
# A tibble: 1 × 1
p_value
<dbl>
1 0
Approximating the Null Distribution
If the expected count in each cell is at least 5, then
\[
\mbox{test statistic} \sim \chi^2(df = (k - 1)(j - 1))
\] where \(k\) is the number of categories in the response variable and \(j\) is the number of categories in the explanatory variable.
The \(df\) controls the center and spread of the distribution.
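For comparison, base R provides the theory-based version of this test (again assuming the count table is stored as eye_counts, a hypothetical name). Here the response and explanatory variables each have 3 categories, so \(df = (3-1)(3-1) = 4\):

# Rebuild the 3x3 contingency table from the counts, then run the
# chi-squared test of independence (df = 4 here)
eye_table <- xtabs(n ~ Lighting + Eye, data = eye_counts)
chisq.test(eye_table)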