What words or phrases do you think of when you hear the word “Harvard”?
This being a data class, I’d like to collect some data related to “statistical thinking.”
Go to bit.ly/stat-100-think to provide the words or phrases you think of when you hear “statistical thinking.”
Statistical Thinking
Kelly McConville
Stat 100
Week 1 | Fall 2023
Lecture slide decks will always be posted and linked to a Canvas Module the day before lecture.
No section and no lecture quiz this week.
Only I will be running office hours this week at the following time:
The regular office hour schedule will be posted later this week and will start next week.
If able, please bring a laptop or tablet to Monday’s lecture.
Start engaging in statistical thinking
Introduce data
Consider hand-drawn visualizations as a way to tell stories with data
Hop into the RStudio Server using Posit Cloud
Discuss course structure (lecture, section, wrap-ups, office hours, assessments…)
Present important course policies (engagement, code of conduct, ChatGPT, …)
Get started in RStudio and with Quarto documents
What is statistical thinking?
It is not the same as mathematical thinking.
Let’s discover what statistical thinking is through some examples.
We will use a wide range of real and relevant data examples.
I understand that some of these topics have likely had profound impacts on your lives.
We will focus class time on the key course objectives but will use these current topics to empower ourselves and to see how we can productively participate with data.
At a quick first glance, what story does the Georgia Department of Public Health graph appear to be telling?
What is misleading about the Georgia Department of Public Health graph? How could we fix this issue?
Alberto Cairo, a journalist and designer, created the second graph of the Georgia COVID-19 data:
A key principle of data visualization is to “help the viewer make meaningful comparisons”.
What comparisons are made easy by the left-hand graph? What about by the right-hand graph?
From these graphs, can we get an accurate estimate of the COVID prevalence in these Georgia counties over this two-week period?
What are the pros of using wastewater over nasal swabs to assess COVID prevalence? What are the cons?
One more note: The graph also incorporates uncertainty measures, a key statistical thinking idea that we will learn more about later in the semester!
Context explains the Monday jumps in the COVID counts.
Design choices impact the conclusions the viewer draws.
Voluntary COVID test results likely don’t provide good estimates of COVID prevalence.
We will learn to compute and interpret uncertainty estimates (like those in the wastewater graph) later in the course!
Statistical thinking is about developing reasoning (not just learning definitions and formulae).
Developing our statistical thinking skills will allow us to soundly extract knowledge from data!
Statistical thinking requires judgment that takes time to develop.
“‘Raw data’ is an oxymoron.” – Lisa Gitelman
“Data … is information made tractable.” – Catherine D’Ignazio and Lauren Klein
Data in spreadsheet-like format where:
Rows = Observations/cases
Columns = Variables
ID | kind | .pred_AI | .pred_class | detector | native | name | model |
---|---|---|---|---|---|---|---|
1 | Human | 0.9999942 | AI | Sapling | No | Real TOEFL | Human |
2 | Human | 0.8281448 | AI | Crossplag | No | Real TOEFL | Human |
3 | Human | 0.0002137 | Human | Crossplag | Yes | Real College Essays | Human |
4 | AI | 0.0000000 | Human | ZeroGPT | NA | Fake CS224N - GPT3 | GPT3 |
5 | AI | 0.0017841 | Human | OriginalityAI | NA | Fake CS224N - GPT3, PE | GPT4 |
6 | Human | 0.0001783 | Human | HFOpenAI | Yes | Real CS224N | Human |
Data source: the R package detectors.
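To make the "rows = cases, columns = variables" layout concrete, here is a minimal sketch that stores the six rows shown above as labelled records. The course itself works in R; this illustration uses plain Python, and the names `columns`, `rows`, and `cases` are ours, not part of the detectors package.

```python
# Variable names (columns), copied from the table above.
columns = ["ID", "kind", ".pred_AI", ".pred_class",
           "detector", "native", "name", "model"]

# Each tuple is one row (one case) from the table above; NA is stored as None.
rows = [
    (1, "Human", 0.9999942, "AI", "Sapling", "No", "Real TOEFL", "Human"),
    (2, "Human", 0.8281448, "AI", "Crossplag", "No", "Real TOEFL", "Human"),
    (3, "Human", 0.0002137, "Human", "Crossplag", "Yes", "Real College Essays", "Human"),
    (4, "AI", 0.0000000, "Human", "ZeroGPT", None, "Fake CS224N - GPT3", "GPT3"),
    (5, "AI", 0.0017841, "Human", "OriginalityAI", None, "Fake CS224N - GPT3, PE", "GPT4"),
    (6, "Human", 0.0001783, "Human", "HFOpenAI", "Yes", "Real CS224N", "Human"),
]

# Pair each value with its variable name so every case is a labelled record.
cases = [dict(zip(columns, row)) for row in rows]

n_cases = len(cases)          # number of rows
n_variables = len(columns)    # number of columns
print(f"{n_cases} cases, {n_variables} variables")
print(cases[0]["detector"], cases[0][".pred_class"])
```

Each record answers "what does one row represent?": one case, described by one value for each variable.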
Rows = Observations/cases
What are the cases? What does each row represent?
Columns = Variables
Variables: Describe characteristics of the observations
Quantitative: Numerical in nature
Categorical: Values are categories
Identification: Uniquely identify each case
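The three variable types above can be sketched as a rough labelling rule. This is only an illustrative heuristic in Python (the course uses R): the function `guess_variable_type` and its `id_names` parameter are hypothetical, and the values are copied from four columns of the example table.

```python
# A few variables from the example table (NA stored as None).
table = {
    "ID": [1, 2, 3, 4, 5, 6],
    ".pred_AI": [0.9999942, 0.8281448, 0.0002137, 0.0, 0.0017841, 0.0001783],
    "detector": ["Sapling", "Crossplag", "Crossplag", "ZeroGPT",
                 "OriginalityAI", "HFOpenAI"],
    "native": ["No", "No", "Yes", None, None, "Yes"],
}

def guess_variable_type(name, values, id_names=("ID",)):
    """Rough heuristic (an assumption, not a rule) for labelling a variable."""
    observed = [v for v in values if v is not None]
    # Identification: a designated ID column whose values are all distinct.
    if name in id_names and len(set(observed)) == len(observed):
        return "identification"
    # Quantitative: every observed value is a number.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in observed):
        return "quantitative"
    # Otherwise treat the values as category labels.
    return "categorical"

types = {name: guess_variable_type(name, values)
         for name, values in table.items()}
for name, kind in types.items():
    print(f"{name}: {kind}")
```

A rule like this can be fooled, for example by numeric codes that actually stand for categories, so the final call still requires judgment about what the variable measures.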
Every time you get a new dataset, spend time exploring the variables.
Example questions:
Is the variable capturing what I want?
For categorical variables, what are the categories? Do those categories adequately capture what the variable is meant to represent?
For quantitative variables, what values are possible? Were the data rounded or binned? Are those values actually encoding categories? What are the units of measurement?
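A first pass at these questions can be sketched in a few lines. The example below is in Python rather than the course's R, uses values copied from the example table, and assumes (from the column name only) that `.pred_AI` is a predicted probability that should lie between 0 and 1.

```python
from collections import Counter

# Three variables from the example table (NA stored as None).
detector = ["Sapling", "Crossplag", "Crossplag", "ZeroGPT",
            "OriginalityAI", "HFOpenAI"]
native = ["No", "No", "Yes", None, None, "Yes"]
pred_ai = [0.9999942, 0.8281448, 0.0002137, 0.0, 0.0017841, 0.0001783]

# Categorical: which categories appear, and how often?
category_counts = Counter(detector)

# Categorical: how many values are missing?
missing_native = sum(v is None for v in native)

# Quantitative: what range of values do we actually see?
value_range = (min(pred_ai), max(pred_ai))

# Assumption: .pred_AI is a probability, so every value should be in [0, 1].
looks_like_probability = all(0 <= p <= 1 for p in pred_ai)

print(category_counts)
print("missing native:", missing_native)
print("range of .pred_AI:", value_range)
print("all values in [0, 1]:", looks_like_probability)
```

Here the missing `native` values occur only in rows where `kind` is AI in the table above, a reminder that a missing value can itself carry information about the case.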
Once we have collected data, a common next step is to visualize it.
Two key aspects of data visualization:
Determining how you want to display the data.
Figuring out how to tell the computer to do that mapping.
Hand-drawn data visualizations allow us to focus on the first part with full control over the creative process!
“Each week, and for a year, we collected and measured a particular type of data about our lives, used this data to make a drawing on a postcard-sized sheet of paper, and then dropped the postcard in an English ‘postbox’ (Stefanie) or an American ‘mailbox’ (Giorgia)!”
Store the data in your favorite spreadsheet program (Google Sheets, Numbers, Excel).
Determine what your cases/observations will be.
Collect data on more variables than you will likely visualize. It is hard to know beforehand what the interesting relationships will be.
Demo of accessing the RStudio Server on Posit Cloud
Try to access the RStudio Server between now and next lecture.
Come back to the recording if you need help with the steps.
If able, please bring a laptop or tablet to Monday’s lecture.
No section, no wrap-ups, and no lecture quiz this week.
Make sure to go through the syllabus, which can be found on Canvas.
Only I will be running office hours this week at the following time:
The regular office hour schedule will be posted later this week and will start next week.
Be on the lookout for the section preference form.