
Before we get started…
Go to bit.ly/stat-100-think to provide the words or phrases you think of when you hear “statistical thinking.”
Waitlisters: Go to bit.ly/stat-100-wait to sign in.

Statistical Thinking
Kelly McConville
Stat 100
Week 1 | Spring 2024


Lecture slide decks will be posted and linked to a Canvas Module the day before lecture.
Section starts this week! Make sure you are enrolled in a specific section.
Office hours start on Wednesday.
If able, please bring a laptop or tablet to the next lecture.
Start engaging in statistical thinking
Introduce data
Consider hand-drawn visualizations as a way to tell stories with data
Hop into the RStudio Server using Posit Cloud
Discuss course structure (lecture, section, wrap-ups, office hours, assessments…)
Present important course policies (engagement, code of conduct, ChatGPT, …)
Get started in RStudio and with Quarto documents






What is statistical thinking?
It is not the same as mathematical thinking.
Let’s discover what statistical thinking is through some examples.
We will use a wide range of real and relevant data examples.




I understand that some of these topics have likely had profound impacts on your lives.
We will focus class time on the key course objectives but will use these current topics to empower ourselves and to see how we can productively participate with data.

At a quick first glance, what story does the Georgia Department of Public Health graph appear to be telling?
What is misleading about the Georgia Department of Public Health graph? How could we fix this issue?

Alberto Cairo, a journalist and designer, created the second graph of the Georgia COVID-19 data:

A key principle of data visualization is to “help the viewer make meaningful comparisons”.
What comparisons are made easy by the left-hand graph? What about by the right-hand graph?
From these graphs, can we get an accurate estimate of the COVID prevalence in these Georgian counties over this two-week period?

What are the pros of using wastewater over nasal swabs to assess COVID prevalence? What are the cons?
One more note: The graph also incorporates uncertainty measures, a key statistical thinking idea that we will learn more about later in the semester!
Context explains the Monday jumps in the COVID counts.
Design choices impact the conclusions the viewer draws.
Voluntary COVID test results likely don’t provide good estimates of COVID prevalence.
We will learn to compute and interpret uncertainty estimates (like those in the wastewater graph) later in the course!
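To see why voluntary test results can give a poor estimate of prevalence, here is a minimal R sketch. Every number in it is a made-up assumption chosen for illustration (it is not Georgia data), and it assumes a perfectly accurate test. It compares the true share of infected people with the share of voluntary tests that come back positive when infected people are more likely to seek a test.

```r
# Hypothetical numbers, chosen only for illustration (not real COVID data)
population      <- 100000
true_prevalence <- 0.05   # assumed share of people currently infected
p_test_infected <- 0.40   # assumed chance an infected person seeks a test
p_test_healthy  <- 0.05   # assumed chance an uninfected person seeks a test

infected   <- population * true_prevalence
uninfected <- population - infected

# Expected number of people in each group who volunteer to be tested
tested_infected   <- infected * p_test_infected
tested_uninfected <- uninfected * p_test_healthy

# Share of voluntary tests that are positive (assuming a perfect test)
test_positivity <- tested_infected / (tested_infected + tested_uninfected)

# Compare the two quantities side by side
round(c(true_prevalence = true_prevalence,
        test_positivity = test_positivity), 3)
```

Under these assumptions the share of positive tests is far larger than the true prevalence, because the people who choose to get tested are not representative of everyone.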
Statistical thinking is about developing reasoning (not just learning definitions and formulae).
Statistical thinking requires judgment that takes time to develop.
Developing our statistical thinking skills will allow us to soundly extract knowledge from data!
“‘Raw data’ is an oxymoron.” – Lisa Gitelman
“Data … is information made tractable.” – Catherine D’Ignazio and Lauren Klein
Data in spreadsheet-like format where:
Rows = Observations/cases
Columns = Variables
| ID | kind | .pred_AI | .pred_class | detector | native | name | model |
|---|---|---|---|---|---|---|---|
| 1 | Human | 0.9999942 | AI | Sapling | No | Real TOEFL | Human |
| 2 | Human | 0.8281448 | AI | Crossplag | No | Real TOEFL | Human |
| 3 | Human | 0.0002137 | Human | Crossplag | Yes | Real College Essays | Human |
| 4 | AI | 0.0000000 | Human | ZeroGPT | NA | Fake CS224N - GPT3 | GPT3 |
| 5 | AI | 0.0017841 | Human | OriginalityAI | NA | Fake CS224N - GPT3, PE | GPT4 |
| 6 | Human | 0.0001783 | Human | HFOpenAI | Yes | Real CS224N | Human |
Source: the R package detectors.
Rows = Observations/cases
What are the cases? What does each row represent?
Columns = Variables
Variables: Describe characteristics of the observations
Quantitative: Numerical in nature
Categorical: Values are categories
Identification: Uniquely identify each case
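As a rough sketch of how rows and columns look in R, the six rows from the table can be typed in by hand as a base R data frame. The object name `detectors_sample` is our own choice for illustration; this is not how the detectors package itself loads the data.

```r
# The six rows shown in the table, entered by hand (illustration only)
detectors_sample <- data.frame(
  ID          = 1:6,
  kind        = c("Human", "Human", "Human", "AI", "AI", "Human"),
  .pred_AI    = c(0.9999942, 0.8281448, 0.0002137,
                  0.0000000, 0.0017841, 0.0001783),
  .pred_class = c("AI", "AI", "Human", "Human", "Human", "Human"),
  detector    = c("Sapling", "Crossplag", "Crossplag",
                  "ZeroGPT", "OriginalityAI", "HFOpenAI"),
  native      = c("No", "No", "Yes", NA, NA, "Yes"),
  name        = c("Real TOEFL", "Real TOEFL", "Real College Essays",
                  "Fake CS224N - GPT3", "Fake CS224N - GPT3, PE",
                  "Real CS224N"),
  model       = c("Human", "Human", "Human", "GPT3", "GPT4", "Human"),
  stringsAsFactors = FALSE
)

# Rows = observations/cases, columns = variables
n_cases     <- nrow(detectors_sample)
n_variables <- ncol(detectors_sample)

# Which columns does R store as numbers?
is_numeric   <- sapply(detectors_sample, is.numeric)
numeric_vars <- names(detectors_sample)[is_numeric]
numeric_vars
```

R stores both `ID` and `.pred_AI` as numbers, but only `.pred_AI` is quantitative. `ID` is an identification variable, so how R stores a column does not settle what type of variable it is.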
Every time you get a new dataset, spend time exploring the variables.
Example questions:
Is each variable capturing the desired information?
For categorical variables, what are the categories? Do those categories adequately represent that variable?
For quantitative variables, what values are possible? Were the data rounded or binned? Are those values actually encoding categories? What are the units of measurement?
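A few of these questions can be explored with base R. The sketch below re-enters three columns from the six rows in the table by hand; the object name `essays` is hypothetical and used only for illustration.

```r
# Three columns from the six rows in the table, entered by hand
essays <- data.frame(
  kind     = c("Human", "Human", "Human", "AI", "AI", "Human"),
  .pred_AI = c(0.9999942, 0.8281448, 0.0002137,
               0.0000000, 0.0017841, 0.0001783),
  native   = c("No", "No", "Yes", NA, NA, "Yes"),
  stringsAsFactors = FALSE
)

# Categorical variable: what are the categories, and how often does each
# occur? useNA = "ifany" also counts missing values.
native_counts <- table(essays$native, useNA = "ifany")
native_counts

# Quantitative variable: what values appear in the data?
pred_range <- range(essays$.pred_AI)
pred_range
summary(essays$.pred_AI)

# How many missing values does each variable have?
missing_per_variable <- colSums(is.na(essays))
missing_per_variable
```

In these six rows, `native` is missing exactly for the essays whose `kind` is AI. That prompts the question of whether the categories adequately represent the variable for every case.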
Once we have collected data, a common next step is to visualize it.
Two key aspects of data visualization:
Determining how you want to display the data.
Figuring out how to tell the computer to do that mapping.
Hand-drawn data visualizations allow us to focus on the first part with full control over the creative process!
“Each week, and for a year, we collected and measured a particular type of data about our lives, used this data to make a drawing on a postcard-sized sheet of paper, and then dropped the postcard in an English “postbox” (Stefanie) or an American “mailbox” (Giorgia)!”





Store the data in your favorite spreadsheet program (Google Sheets, Numbers, Excel).
Determine what your cases/observations will be.
Collect data on more variables than you will likely visualize. It is hard to know beforehand what the interesting relationships will be.
Demo of accessing the RStudio Server on Posit Cloud
Try to access the RStudio Server between now and next lecture.
Come back to the recording if you need help with the steps.
If able, please bring a laptop or tablet to the next lecture.
Make sure to go through the syllabus, which can be found on Canvas.
Section starts this week! Make sure you are enrolled in a specific section.
Office hours start on Wednesday.