class: middle, center

## What words or phrases do you think of when you hear the word .orange["Harvard"]?

--

### This being a .orange[data] class, I'd like to collect some data related to .orange["statistical thinking."]

--

### Go to [bit.ly/stat-100-think](https://bit.ly/stat-100-think) to provide the words or phrases you think of when you hear .orange["statistical thinking."]

---
background-image: url("img/DAW.png")
background-position: left
background-size: 50%
class: middle, center

.pull-right[

## .blue[Statistical Thinking]

<br>

<br>

### .purple[Kelly McConville]

#### .purple[ Stat 100 | Week 1 | Fall 2022]

]

---

## Getting Started in Stat 100

#### Step 1: Getting Started Module in Canvas

<img src="img/Canvas_getting_started.png" width="85%" style="display: block; margin: auto;" />

---

## Announcements

* Lecture slide decks will always be posted and linked to a Canvas Module the day before lecture.
    + These are HTML files but you can print them to PDF.
    + Looking into printed options.

* No section, no lecture quiz, no p-set this week.

* Only I will be running office hours this week at the following times:
    + Today: 1:30 - 2:30pm Science Center 316.04
    + Thursday: 10 - 11am Science Center 316.04

* No lecture or office hours next Monday!

* Be on the lookout for the section preference form.

---

## Week 1 + Week 2 Goals

.pull-left[

**Week 1 Lecture**

* Statistical thinking
* Introduction to data
* Hand-drawn data visualizations
* Hop into the RStudio Server using FAS OnDemand

]

--

.pull-right[

**Week 2 Lecture**

* Discuss course structure
* Getting up and running in `RStudio`
* Working with `RMarkdown` documents

]

---
class: center, middle

## But first, let me quickly introduce myself...

---
class: center, middle

### Let's start with my path to Harvard...
<img src="stat100_wk01wed_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" />

---
class: center

## Research Interests

### Survey statistics, in collaboration with

<img src="img/logos.jpeg" width="50%" style="display: block; margin: auto;" />

---
class: center

## Research Interests

### Where survey statistics meets data science

--

<img src="img/data.jpeg" width="1414" style="display: block; margin: auto;" />

---
class: center

### Advising Undergraduate Forestry Data Science Research

<img src="img/Forest_and_IceCream_Lovers.jpg" width="59%" height="25%" style="display: block; margin: auto;" />

---
class: middle
background-image: url("img/seedlings.jpg")
background-position: left
background-size: contain

.pull-right[

* I **love** teaching stats and coding.

]

--

.pull-right[

* But, learning stats and coding is **hard**.

]

--

.pull-right[

* With the **right scaffolding**, **good strategies**, and **sustained effort**, you can excel at both!

]

--

.pull-right[

* And mistakes are part of the learning process. They don't imply that you are bad at stats.

]

---
class: center, middle

.pull-left[

## The Rest of the Teaching Team

]

.pull-right[

<img src="img/team.png" width="95%" style="display: block; margin: auto;" />

]

---
background-image: url("img/structures.001.jpeg")
background-position: contain
background-size: 65%

## Stat 100 Tech & Materials

---
class: middle, center

## Stat 100 is about developing our .orange[statistical thinking] skills.

### What is .orange[statistical thinking]?

--

### It is not the same as mathematical thinking.

--

### Let's discover what .orange[statistical thinking] is through some examples.
---

## Data in Stat 100

We will use a wide range of **real** and **relevant** data examples.

.pull-left[

<img src="img/nytimes_access.png" width="110%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="img/nytimes_labor_force.png" width="50%" style="display: block; margin: auto;" />

]

---

## Data in Stat 100

.pull-left[

<img src="img/covid_govt.png" width="100%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="img/traffic_stops.jpg" width="100%" style="display: block; margin: auto;" />

]

--

* I understand that some of these topics have likely had profound impacts on your lives.

--

* We will focus class time on the key course objectives but will use these current topics to empower ourselves and to see how we can productively participate with data.

---
class: middle, center

## Example: Visualizing COVID Prevalence

---

### Example: Visualizing COVID Prevalence

* In May of 2020, the Georgia Department of Public Health posted the following graph:

<img src="img/GAcovid.jpg" width="60%" style="display: block; margin: auto;" />

* At a quick first glance, what story does the Georgia Department of Public Health graph appear to be telling?

--

* What is misleading about the Georgia Department of Public Health graph? How could we fix this issue?

---

### Example: Visualizing COVID Prevalence

* After public outcry, the Georgia Department of Public Health said they made a mistake and posted the following updated graph:

<img src="img/GAcovid2.jpg" width="55%" style="display: block; margin: auto;" />

* How do your conclusions about COVID-19 cases in Georgia change when you interpret this new graph?

---

### Example: Visualizing COVID Prevalence

* Alberto Cairo, a journalist and designer, created the second graph of the Georgia COVID-19 data:

<img width="45%" src="img/GAcovid2.jpg"/> <img width="49%" src="img/GAcovid_cairo.png"/>

--

* A key principle of data visualization is to “help the viewer make meaningful comparisons”.
--

* What comparisons are made easy by the left-hand graph? What about by the right-hand graph?

--

* From these graphs, can we get an accurate estimate of the COVID prevalence in these Georgia counties over this two-week period?

---

### Example: Visualizing COVID Prevalence

* The [Massachusetts Water Resources Authority (MWRA) graph](https://www.mwra.com/biobot/biobotdata.htm) tracks the presence of COVID-19 in the Boston-area wastewater.

<img src="stat100_wk01wed_files/figure-html/unnamed-chunk-13-1.png" width="648" style="display: block; margin: auto;" />

* What are the pros of using wastewater over nasal swabs to assess COVID prevalence? What are the cons?

--

* One more note: The graph also incorporates **uncertainty measures**, a key statistical thinking idea that we will learn more about later in the semester!

---
class: middle, center

## What is "Statistical Thinking?"

---

## Statistical Thinking

.pull-left[

* Understanding the importance of **context**.

]

--

.pull-right[

→ Context explained the Monday jumps in the COVID counts.

]

--

.pull-left[

* How we **encode** information in a graph should be driven by our research question.

]

--

.pull-right[

→ **Design choices** impact the conclusions the viewer draws.

]

--

.pull-left[

* How the data were **collected** impacts the conclusions we can make.

]

--

.pull-right[

→ Voluntary COVID test results likely don't provide good estimates of COVID prevalence.

]

--

.pull-left[

* Often we are using a **sample** of data to say something about a larger group. In this case, we should measure how certain our estimates are!

]

--

.pull-right[

→ We will learn to **compute** and **interpret** uncertainty measures like those in the wastewater graph later in the course!

]

--

Developing our statistical thinking skills will allow us to soundly **extract knowledge from data**!

---
class: middle, center

## What are data?
---

* The dictionary definition:

> "data: factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation" -- Merriam-Webster

--

* Wikipedia:

> "Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable."

---

* Our textbook definition:

> "Data comes to us in a variety of formats, from pictures to text to numbers." -- ModernDive

--

* Data Feminism:

> "... by the time that information becomes data, it's already been classified in some way. Data after all, is information made *tractable*." -- D'Ignazio and Klein

---

## Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> UserID </th> <th style="text-align:right;"> Tree_Height </th> <th style="text-align:left;"> Common_Name </th> <th style="text-align:left;"> Park </th> <th style="text-align:right;"> DBH </th> <th style="text-align:left;"> Species_Factoid </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 37.4 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 94 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.5 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
</td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Lavalle Hawthorn </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 9.7 </td> <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Northern Red Oak </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 102 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 33.2 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.1 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. 
</td> </tr> </tbody> </table>

Data in spreadsheet-like format where:

--

* Rows = Observations/cases

--

* Columns = Variables

---

## Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> UserID </th> <th style="text-align:right;"> Tree_Height </th> <th style="text-align:left;"> Common_Name </th> <th style="text-align:left;"> Park </th> <th style="text-align:right;"> DBH </th> <th style="text-align:left;"> Species_Factoid </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 37.4 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 94 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.5 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Lavalle Hawthorn </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 9.7 </td> <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Northern Red Oak </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife.
</td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 102 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 33.2 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.1 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> </tbody> </table>

Rows = Observations/cases

**What are the cases? What does each row represent?**

---

## Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> UserID </th> <th style="text-align:right;"> Tree_Height </th> <th style="text-align:left;"> Common_Name </th> <th style="text-align:left;"> Park </th> <th style="text-align:right;"> DBH </th> <th style="text-align:left;"> Species_Factoid </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 37.4 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 94 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.5 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail.
</td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Lavalle Hawthorn </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 9.7 </td> <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long. </td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Northern Red Oak </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 102 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 33.2 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.1 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. 
</td> </tr> </tbody> </table>

Columns = Variables

**Variables**: Describe characteristics of the observations

--

* **Quantitative**: Numerical in nature

--

* **Categorical**: Values are categories

--

* **Identification**: Uniquely identify each case

---

## Data Frames

<table class="table table-responsive table-bordered table-striped" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> UserID </th> <th style="text-align:right;"> Tree_Height </th> <th style="text-align:left;"> Common_Name </th> <th style="text-align:left;"> Park </th> <th style="text-align:right;"> DBH </th> <th style="text-align:left;"> Species_Factoid </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 37.4 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 94 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.5 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> Lavalle Hawthorn </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 9.7 </td> <td style="text-align:left;"> Like most hawthorns, the tree has stout thorns up to 2" long.
</td> </tr> <tr> <td style="text-align:left;"> 4 </td> <td style="text-align:right;"> 28 </td> <td style="text-align:left;"> Northern Red Oak </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:left;"> Acorns take two years to mature and are an important food source for wildlife. </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 102 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 33.2 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> <tr> <td style="text-align:left;"> 6 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:left;"> Douglas-Fir </td> <td style="text-align:left;"> Gammans Park </td> <td style="text-align:right;"> 32.1 </td> <td style="text-align:left;"> Bracts on cones look like a mouse's feet and tail. </td> </tr> </tbody> </table>

**It is important to understand what each variable represents and its units of measurement.**

--

Example questions:

* For categorical variables, what are the categories? Do those categories adequately represent the information that variable is meant to capture?

--

* For quantitative variables, what values are possible? Were the data rounded or binned? Are those values actually encoding categories?

---
class: middle, center

## Goal: Over the next week, collect data on your world so that you can visualize it on P-Set 1.

---

## Hand-Drawn Data Viz

* Two key aspects of data visualization:
    + Determining how you want to display the data.
    + Figuring out how to tell the computer to do that mapping.

--

* Hand-drawn data visualizations allow us to focus on the first part, with full control over the creative process!
---

## Hand-Drawn Data Viz Examples

* [Dear Data](http://www.dear-data.com/theproject)

> "Each week, and for a year, we collected and measured a particular type of data about our lives, used this data to make a drawing on a postcard-sized sheet of paper, and then dropped the postcard in an English “postbox” (Stefanie) or an American “mailbox” (Giorgia)!"

---

### Dear Data Examples

<img src="img/complaints.png" width="100%" style="display: block; margin: auto;" />

* What would the data frame for this visualization look like?

---

### Dear Data Examples

<img src="img/time.png" width="100%" style="display: block; margin: auto;" />

* What would the data frame for this visualization look like?

---

### Mapping Manhattan

* Becky Cooper handed out hand-drawn maps of Manhattan to strangers and asked them to ["map their Manhattan."](https://www.goodreads.com/book/show/15842664-mapping-manhattan?from_search=true)

<div class="figure" style="text-align: center">
<img src="img/mapmanhattan.png" alt="Map drawn by New Yorker staff writer Patricia Marx" width="95%" />
<p class="caption">Map drawn by New Yorker staff writer Patricia Marx</p>
</div>

* What would the data frame for this visualization look like?

---

### More Dear Data Examples

<img src="img/postcards_stat100s22.001.jpeg" width="90%" style="display: block; margin: auto;" />

* What would the data frame for this visualization look like?

---

### More Dear Data Examples

<img src="img/postcards_stat100s22.002.jpeg" width="80%" style="display: block; margin: auto;" />

* What would the data frame for this visualization look like?

---

## Goal: Over the next week, collect data on your world so that you can visualize it on P-Set 1.

#### Recommendations

* Store the data in your favorite spreadsheet program (Google Sheets, Numbers, Excel).

* Determine what your cases/observations will be.

* Collect data on **more** variables than you will likely graph. It is hard to know at the outset what the interesting relationships will be.
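One way to picture these recommendations is a minimal sketch, written here in Python with only the standard library, of a small data set stored with one row per case and one column per variable. The values are copied from the first three rows of the tree table earlier in these slides. The `variable_types` labels, the helper names `to_csv_text` and `categories`, and the choice of Python (the course itself works in RStudio) are assumptions made for illustration.

```python
import csv
import io

# Each dict is one case (row); each key is one variable (column).
# Values come from the first three rows of the tree table in these slides.
trees = [
    {"UserID": 1, "Tree_Height": 105, "Common_Name": "Douglas-Fir",
     "Park": "Gammans Park", "DBH": 37.4},
    {"UserID": 2, "Tree_Height": 94, "Common_Name": "Douglas-Fir",
     "Park": "Gammans Park", "DBH": 32.5},
    {"UserID": 3, "Tree_Height": 23, "Common_Name": "Lavalle Hawthorn",
     "Park": "Gammans Park", "DBH": 9.7},
]

# Assumed labels (one reading of the table, for illustration only)
# recording which kind of variable each column holds.
variable_types = {
    "UserID": "identification",
    "Tree_Height": "quantitative",
    "Common_Name": "categorical",
    "Park": "categorical",
    "DBH": "quantitative",
}


def to_csv_text(rows):
    """Return the rows as CSV text: one header line, then one line per case."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def categories(rows, variable):
    """Return the sorted distinct values that a categorical variable takes."""
    return sorted({row[variable] for row in rows})


csv_text = to_csv_text(trees)
print(csv_text)
print(categories(trees, "Common_Name"))
```

Saving `csv_text` to a file ending in `.csv` gives something that Google Sheets, Numbers, or Excel can open, and listing the categories of a variable is a quick way to check what was actually recorded.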
--

#### Next Week

* You will get a blank postcard and further guidance on the visualization with P-Set 1.

---
class: middle, center

## To Try Before Next Lecture:

### Accessing the RStudio Server

---

## Accessing the RStudio Server

<img src="img/raccess.001.jpeg" width="95%" style="display: block; margin: auto;" />

* The first time may take a while.

---

## Reminders

* No section, no lecture quiz, no p-set this week.

* Make sure to go through the syllabus, which can be found in the Getting Started Module on Canvas.
    + Will discuss assessments (p-sets, project, exams, quizzes, engagement) next week.

* Only I will be running office hours this week at the following times:
    + Today: 1:30 - 2:30pm Science Center 316.04
    + Thursday: 10 - 11am Science Center 316.04

* Try to access the RStudio Server through FAS OnDemand between now and next lecture.
    + I will demo now; come back to the recording if you need help with the steps.

* Be on the lookout for the section preference form.