Decisions, Decisions
Kelly McConville
Stat 100
Week 10 | Fall 2023
But first… Common Question: How should I describe my post-Stat 100 coding abilities?
Potential Answer: You have learned how to write code to analyze data. This includes visualization (ggplot2), data wrangling (dplyr), data importation (readr), modeling, inference (infer), and communication (with Quarto).
Follow-up Question: So what coding is there left to learn?
Answer: Learning how to program. This includes topics such as control flow, iteration, creating functions, and vectorization.
Let’s return to the penguins data and ask if flipper length varies, on average, by the sex of the penguin.
Research Question: Does flipper length differ by sex?
Response Variable:
Explanatory Variable:
Statistical Hypotheses:
Response: flipper_length_mm (numeric)
Explanatory: sex (factor)
Observed difference in sample means (mm):
# A tibble: 1 × 1
1 -7.14
p-value:
# A tibble: 1 × 1
1 0
Interpretation of \(p\)-value: If the mean flipper length does not differ by sex in the population, the probability of observing a difference in the sample means of at least 7.142316 mm (in magnitude) is approximately 0.
Conclusion: These data represent evidence that flipper length does vary by sex.
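The code behind this output is not shown, so here is a minimal sketch of how a test like this could be run with the infer package. Several pieces are assumptions rather than the original code: the data are taken from the palmerpenguins package, the difference is computed as female minus male (which matches the negative sign of the statistic), the null distribution uses 1000 permutations, and the seed is arbitrary.

```r
library(dplyr)
library(infer)
library(palmerpenguins)  # assumed source of the penguins data

# Keep only penguins with both variables recorded
penguins_sex <- penguins %>%
  filter(!is.na(sex), !is.na(flipper_length_mm))

# Observed difference in mean flipper length (female minus male; assumed order)
test_stat <- penguins_sex %>%
  specify(flipper_length_mm ~ sex) %>%
  calculate(stat = "diff in means", order = c("female", "male"))

# Null distribution: shuffle the sex labels so that H_o (no difference) holds
set.seed(100)  # arbitrary seed, for reproducibility
null_dist <- penguins_sex %>%
  specify(flipper_length_mm ~ sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("female", "male"))

# Two-sided p-value: share of shuffled differences at least as extreme
p_value <- get_p_value(null_dist, obs_stat = test_stat, direction = "both")

test_stat
p_value
```

A reported p-value of 0 from a simulation means that none of the generated differences were as extreme as the observed one. The true p-value is very small, not exactly zero.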
Once you get to the end of a hypothesis test you make one of two decisions: reject \(H_o\) or fail to reject \(H_o\).
Sometimes we make the correct decision. Sometimes we make a mistake.
Let’s create a table of potential outcomes.
\(\alpha\) = prob of Type I error under repeated sampling = prob reject \(H_o\) when it is true
\(\beta\) = prob of Type II error under repeated sampling = prob fail to reject \(H_o\) when \(H_a\) is true.
Typically set \(\alpha\) level beforehand.
Use \(\alpha\) to determine “small” for a p-value.
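The phrase "under repeated sampling" can be made concrete with a small simulation. The sketch below is an illustration under assumptions, not part of the original slides: it uses made-up normal populations, a two-sample t-test from base R in place of a simulation-based test, and arbitrary choices for the group size, the number of repetitions, and the seed. Because both groups come from the same population, \(H_o\) is true in every repetition, so every rejection is a Type I error.

```r
set.seed(100)     # arbitrary seed, for reproducibility
alpha  <- 0.05
n_sims <- 2000    # hypothetical number of repeated samples

p_values <- replicate(n_sims, {
  # Both groups share one population mean, so H_o (no difference) is true
  group_1 <- rnorm(30, mean = 200, sd = 14)
  group_2 <- rnorm(30, mean = 200, sd = 14)
  t.test(group_1, group_2)$p.value
})

# Proportion of repetitions in which we (wrongly) reject H_o
type_1_rate <- mean(p_values < alpha)
type_1_rate
```

The proportion of rejections should land close to the chosen \(\alpha\), which is what it means for \(\alpha\) to be the probability of a Type I error.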
Question: How do I select \(\alpha\)?
Will depend on the convention in your field.
Want a small \(\alpha\) and a small \(\beta\). But they are related: for a fixed sample size, lowering one raises the other.
Choose a lower \(\alpha\) (e.g., 0.01, 0.001) when the Type I error is worse and a higher \(\alpha\) (e.g., 0.1) when the Type II error is worse.
Can’t easily compute \(\beta\). Why?
One more important term: power = \(1 - \beta\) = prob reject \(H_o\) when \(H_a\) is true.
Suppose we have a baseball player who has been a 0.250 career hitter who suddenly improves to be a 0.333 hitter. He wants a raise but needs to convince his manager that he has genuinely improved. The manager offers to examine his performance in 20 at-bats.
When \(\alpha\) is set to \(0.05\), he needs to hit 9 or more to get a small enough p-value to reject \(H_o\).
When \(\alpha\) is set to \(0.05\), the power of this test is 0.204.
Why is the power so low?
What aspects of the test could the baseball player change to increase the power of the test?
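The cutoff of 9 hits and the low power quoted above can be checked with an exact binomial calculation. This is a minimal sketch in base R, assuming a one-sided test in which each at-bat is an independent trial. The variable names are illustrative. Simulation-based values vary from run to run, so they need not match the exact calculation to the third decimal.

```r
n      <- 20     # at-bats the manager will observe
p_null <- 0.25   # career hit probability under H_o
p_alt  <- 0.333  # hit probability if the player has improved
alpha  <- 0.05

# Smallest number of hits k such that P(X >= k | p_null) <= alpha
critical_hits <- qbinom(1 - alpha, size = n, prob = p_null) + 1

# Actual Type I error rate at that cutoff (at most alpha because X is discrete)
actual_alpha <- pbinom(critical_hits - 1, size = n, prob = p_null,
                       lower.tail = FALSE)

# Power: probability of reaching the cutoff when the true probability is p_alt
power <- pbinom(critical_hits - 1, size = n, prob = p_alt,
                lower.tail = FALSE)

critical_hits  # 9
actual_alpha
power          # roughly 0.19
```

With only 20 at-bats, the null and alternative distributions of the hit count overlap heavily, which is why the probability of reaching the cutoff stays low even when the player has improved.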
Suppose we have a baseball player who has been a 0.250 career hitter who suddenly improves to be a 0.333 hitter. He wants a raise but needs to convince his manager that he has genuinely improved. The manager offers to examine his performance in 100 at-bats.
What will happen to the power of the test if we increase the sample size?
Increasing the sample size increases the power.
When \(\alpha\) is set to \(0.05\) and the sample size is now 100, the power of this test is 0.57.
Suppose we have a baseball player who has been a 0.250 career hitter who suddenly improves to be a 0.333 hitter. He wants a raise but needs to convince his manager that he has genuinely improved. The manager offers to examine his performance in 100 at-bats.
What will happen to the power of the test if we increase \(\alpha\) to 0.1?
Suppose we have a baseball player who has been a 0.250 career hitter who suddenly improves to be a 0.400 hitter. He wants a raise but needs to convince his manager that he has genuinely improved. The manager offers to examine his performance in 100 at-bats.
What will happen to the power of the test if he is an even better player?
Effect size: Difference between true value of the parameter and null value.
Increasing the effect size increases the power.
When \(\alpha\) is set to \(0.1\), the sample size is 100, and the true probability of hitting the ball is 0.4, the power of this test is 0.97.
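To see the three levers side by side (sample size, \(\alpha\), and effect size), the scenarios from this example can be run through one small helper. This is a sketch under assumptions: `exact_power()` is a hypothetical helper written for this illustration, it uses an exact one-sided binomial test, and its results will differ somewhat from simulation-based power values.

```r
# Hypothetical helper: exact power of a one-sided binomial test
exact_power <- function(n, p_null, p_alt, alpha) {
  # Smallest hit count whose upper-tail probability under H_o is <= alpha
  critical_hits <- qbinom(1 - alpha, size = n, prob = p_null) + 1
  # Probability of reaching that count when the true probability is p_alt
  pbinom(critical_hits - 1, size = n, prob = p_alt, lower.tail = FALSE)
}

# One row per scenario; each row changes one setting from the row above
scenarios <- data.frame(
  n     = c(20, 100, 100, 100),
  alpha = c(0.05, 0.05, 0.10, 0.10),
  p_alt = c(0.333, 0.333, 0.333, 0.400)
)

scenarios$power <- mapply(
  exact_power,
  n     = scenarios$n,
  alpha = scenarios$alpha,
  p_alt = scenarios$p_alt,
  MoreArgs = list(p_null = 0.25)
)

scenarios
```

Reading down the `power` column, each change (more at-bats, a larger \(\alpha\), a larger effect size) raises the power.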
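# Load the packages used below
library(tidyverse)
library(infer)

# Create a dummy dataset with the correct sample size (100 at-bats);
# only the number of rows matters because generate() draws new outcomes
dat <- data.frame(at_bats = c(rep("hit", 80),
                              rep("miss", 20)))

# Null distribution: sample proportions when the hit probability is 0.25
null <- dat %>%
  specify(response = at_bats, success = "hit") %>%
  hypothesize(null = "point", p = 0.25) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop")

ggplot(data = null, mapping = aes(x = stat)) +
  geom_histogram(bins = 27, color = "white")

# Alternative distribution: same simulation with hit probability 0.333
alt <- dat %>%
  specify(response = at_bats, success = "hit") %>%
  hypothesize(null = "point", p = 0.333) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "prop") %>%
  mutate(dist = "Alternative")

# Mark the rejection cutoff (95th percentile of the null distribution,
# i.e. alpha = 0.05); linewidth replaces the deprecated size argument
ggplot(data = alt, mapping = aes(x = stat)) +
  geom_histogram(bins = 27, color = "white") +
  geom_vline(xintercept = quantile(null$stat, 0.95),
             linewidth = 2,
             color = "turquoise4")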
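The plots show the cutoff, and the power itself is the share of the alternative distribution that falls beyond it. The sketch below computes that share directly. It is an illustration under assumptions: it uses base R's `rbinom()` in place of infer's `generate()`, 10,000 repetitions instead of 1000 to reduce simulation noise, and an arbitrary seed.

```r
set.seed(100)   # arbitrary seed, for reproducibility
n     <- 100    # at-bats
reps  <- 10000  # simulated samples per distribution
alpha <- 0.05

# Simulated sample proportions under the null and the alternative
null_stats <- rbinom(reps, size = n, prob = 0.25) / n
alt_stats  <- rbinom(reps, size = n, prob = 0.333) / n

# Cutoff: 95th percentile of the null distribution
cutoff <- quantile(null_stats, 1 - alpha)

# Reject only when the proportion strictly exceeds the cutoff, which keeps
# the simulated Type I error rate at or below alpha
type_1_rate <- mean(null_stats > cutoff)
power       <- mean(alt_stats > cutoff)

cutoff
type_1_rate
power
```

Because the sample proportion is discrete, using `>` rather than `>=` matters here: including the cutoff value itself in the rejection region would push the Type I error rate above \(\alpha\).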
What aspects of the test did the player actually have control over?
Why is it easier to set \(\alpha\) than to set \(\beta\) or power?
Considering power before collecting data is very important!
The danger of under-powered studies