background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center,

.pull-right[

<br>

## .base_color[Intro to Statistical]

## .base_color[Computing in R]

<br>

<br>

### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 1 | Spring 2023]

]

---

## Announcements

* Lecture slide decks will always be posted and linked in a Canvas Module the day before lecture.
    + These are HTML files, but you can print them to PDF.
    + I will also bring printed versions for those who prefer paper copies.
* No section, no lecture quiz, no p-set this week.
* Only I will be running office hours this week at the following times:
    + Tuesday 10:30 am - noon in Science Center 316
    + Wednesday 1:30 - 3:00 pm in Science Center 316
    + Thursday 10:30 - 11:30 am in Science Center 316 (this week only)
* Note: HUIT is still setting up our RStudio Server.

--

***************************************

## Week 1 Goals

.pull-left[

**Day 1 Lecture**

* Course overview

]

--

.pull-right[

**Day 2 Lecture**

* Data Viz Principles

]

---
class: center, middle,

## But first, let me quickly introduce myself...

---
class: center, middle,

### Let's start with my path to Harvard...

<img src="stat108_wk01mon_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />

---
class: center,

## Research Interests

### Survey statistics, in collaboration with

<img src="img/logos.jpeg" width="50%" style="display: block; margin: auto;" />

---
class: center,

## Research Interests

### Where survey statistics meets data science

--

<img src="img/data.jpeg" width="1414" style="display: block; margin: auto;" />

---
class: center,

### Advising Undergraduate Forestry Data Science Research

<img src="img/Forest_and_IceCream_Lovers.jpg" width="59%" height="25%" style="display: block; margin: auto;" />

---
class: center, middle

## What is the point of this class?

--

### Let's talk learning goals!
<img src="img/STAT108Logo.png" width="30%" style="display: block; margin: auto;" />

---
class: middle, center

## Learning Goals

<img src="img/STAT108Logo_Programming.png" width="20%" style="display: block; margin: auto;" />

--

### But first...

---
class: center

.pull-left[

### Box Mac and Cheese

<img src="img/box.png" width="45%" style="display: block; margin: auto;" />

]

--

.pull-left[

### versus Homemade Mac and Cheese?

<img src="img/homemade.png" width="70%" style="display: block; margin: auto;" />

]

---

### Goal: Learn to use several `R` packages.

.pull-left[

<img src="img/box.png" width="45%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="img/packages_wall.001.jpeg" width="85%" style="display: block; margin: auto;" />

]

---

### Goal: Learn to use several `R` packages.

.pull-left[

* Focus on robust, well-documented packages.
* Understand their strengths and limitations.
* And if there isn't an existing `R` function that does what we want...

]

.pull-right[

<img src="img/packages_wall.001.jpeg" width="85%" style="display: block; margin: auto;" />

]

---

### Goal: Learn to create our own functions in `R`.

.pull-left[

<img src="img/homemade.png" width="70%" style="display: block; margin: auto;" />

]

.pull-right[

```r
magic_eight_ball <- function(
    question = "Will it rain today?"){
  answers <- c("Without a doubt",
               "Concentrate and ask again",
               "My sources say no")
  return(sample(x = answers, size = 1))
}

magic_eight_ball(question = "Should I take Stat 108?")
```

```
## [1] "My sources say no"
```

]

---

### Goal: Learn to create our own functions in `R`.

.pull-left[

* Learn the appropriate syntax and how to structure the arguments, body, and output.
* Create tests for our functions to make them more robust.
* Come full circle and bundle them up into their own `R` package in Project 2!
]

.pull-right[

```r
magic_eight_ball <- function(
    question = "Will it rain today?"){
  answers <- c("Without a doubt",
               "Concentrate and ask again",
               "My sources say no")
  return(sample(x = answers, size = 1))
}

magic_eight_ball(question = "Should I take Stat 108?")
```

```
## [1] "Concentrate and ask again"
```

]

---

### Goal: Develop our statistical programming skills.

* Implement code that reflects core ideas of statistical programming, including functions, iteration, control flow, vectorization, debugging, refactoring, and abstraction.

.pull-left[

```r
for(i in 1:20){
  # Iterate over something awesome
}
```

]

.pull-right[

```r
if(magic_eight_ball("Should I take Stat 108?") == "Without a doubt") {
  print("Stay seated")
} else {
  print("Leave and buy a coffee.")
}
```

]

--

* Apply coding habits and a coding style that align with best practices in the field.

.pull-left[

```r
messy4324 <-select(dataset,thing1,thing2)%>%filter(thing1=="uncool")
```

]

.pull-right[

```r
clean <- select(dataset, thing1, thing2) %>%
  filter(thing1 == "uncool")
```

]

---
class: middle, center

## Learning Goals

<img src="img/STAT108Logo_Wrangling.png" width="20%" style="display: block; margin: auto;" />

---

### Goal: Wrangle and interact with a variety of data types.

--

#### Requires expanding our understanding of **data structures** in `R`.
--

* Most common data structure = `data.frame`/data set/spreadsheet
* Example: [Boston-area Bluebikes check-out data](https://www.bluebikes.com/system-data)

```
##   bikeid tripduration   usertype           starttime            stoptime
## 1   3534         1322 Subscriber 2022-07-01 00:00:01 2022-07-01 00:22:03
## 2   5355          353   Customer 2022-07-01 00:00:02 2022-07-01 00:05:56
## 3   5834          413 Subscriber 2022-07-01 00:00:07 2022-07-01 00:07:01
## 4   5010          175 Subscriber 2022-07-01 00:00:13 2022-07-01 00:03:09
## 5   7398          751 Subscriber 2022-07-01 00:00:19 2022-07-01 00:12:50
## 6   7462          499   Customer 2022-07-01 00:00:19 2022-07-01 00:08:39
```

--

* Rows = observations
* Columns = variables
* Two types: categorical and quantitative

--

.pull-left[

* #### What variables **don't** fit neatly into these two types?

]

--

.pull-right[

* #### What other **data structures** are there?

]

---

### Variable types: Dates and Times

.pull-left[

* Are dates and times categorical or quantitative?

]

--

.pull-right[

* What makes dates and times different?

]

```r
bluebikes %>%
  select(starttime, stoptime) %>%
  as.data.frame()
```

```
##             starttime            stoptime
## 1 2022-07-01 00:00:01 2022-07-01 00:22:03
## 2 2022-07-01 00:00:02 2022-07-01 00:05:56
## 3 2022-07-01 00:00:07 2022-07-01 00:07:01
## 4 2022-07-01 00:00:13 2022-07-01 00:03:09
## 5 2022-07-01 00:00:19 2022-07-01 00:12:50
## 6 2022-07-01 00:00:19 2022-07-01 00:08:39
```

--

<img src="img/lubridate.png" width="10%" style="float:left; padding:25px" style="display: block; margin: auto;" />

<br>

<br>

* Will learn to use the `lubridate` package to wrangle dates and times.

---

## Variable types: Factors and Characters

* `R` typically stores categorical variables as one of two types: `factor` or `character`.

```r
bluebikes$usertype
```

```
## [1] "Subscriber" "Customer"   "Subscriber" "Subscriber" "Subscriber"
## [6] "Customer"
```

```r
class(bluebikes$usertype)
```

```
## [1] "character"
```

--

* Why do we need two different types??
--

.pull-left[

<img src="img/forcats.png" width="20%" style="float:left; padding:25px" style="display: block; margin: auto;" />

<br>

<br>

* Will learn to use the `forcats` package to wrangle `factor`s.

]

--

.pull-right[

<img src="img/stringr.png" width="20%" style="float:left; padding:25px" style="display: block; margin: auto;" />

<br>

<br>

* Will learn to use the `stringr` package to wrangle `character`s/strings/text.
    + Note: A character vector stores multiple strings/text.

]

---

## Wrangling text data

* Semi-structured text is data too!
* Example: [Taylor Swift lyrics](https://github.com/shaynak/taylor-swift-lyrics)

```
## # A tibble: 6 × 3
##   Song                  Album                  Lyric
##   <chr>                 <chr>                  <chr>
## 1 22 (Taylor’s Version) Red (Taylor’s Version) It feels like a perfect night
## 2 22 (Taylor’s Version) Red (Taylor’s Version) To dress up like hipsters
## 3 22 (Taylor’s Version) Red (Taylor’s Version) And make fun of our exes
## 4 22 (Taylor’s Version) Red (Taylor’s Version) Uh-uh, uh-uh
## 5 22 (Taylor’s Version) Red (Taylor’s Version) It feels like a perfect night
## 6 22 (Taylor’s Version) Red (Taylor’s Version) For breakfast at midnight
```

--

* But extracting useful information takes a lot of wrangling! Will learn to use **regular expressions** to identify patterns in the text.

<img src="img/tidytext.png" width="10%" style="float:left; padding:25px" style="display: block; margin: auto;" />

<br>

<br>

* Will also learn to analyze text with the `tidytext` package.

---
class: middle, center

## What other data structures are there beyond `data.frame`s?
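---

## Data Structures: A Quick Look in Base `R`

A minimal sketch of how a vector, a `list`, and a `data.frame` relate, using base `R` only. The values are toy data for illustration, and the `stations` field is made up for this example; it is not a column in the Bluebikes data.

```r
# Toy values for illustration only; `stations` is a made-up field.

# Atomic vector: every element has the same type.
usertype <- c("Subscriber", "Customer", "Subscriber")

# List: elements can differ in type and in length.
trip <- list(bikeid = 3534,
             usertype = "Subscriber",
             stations = c("A", "B"))

# data.frame: a list of equal-length columns, printed as a table.
trips <- data.frame(bikeid = c(3534, 5355, 5834),
                    usertype = usertype)

class(usertype)   # "character"
lengths(trip)     # bikeid 1, usertype 1, stations 2
is.list(trips)    # TRUE: a data.frame is built on top of a list
```

Because a `data.frame` is itself a `list` of columns, the other structures in this part of the course can be seen as variations on how those list elements are arranged.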
---

## Data Structures: Spatial Data Frames

* [Example: American Community Survey Data](https://walker-data.com/tidycensus/) on median household income for counties in MA

```
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.06851 ymin: 41.4811 xmax: -70.52557 ymax: 42.73664
## Geodetic CRS:  NAD83
##   GEOID                            NAME   variable estimate  moe
## 1 25017 Middlesex County, Massachusetts B19013_001   111790 1032
## 2 25005   Bristol County, Massachusetts B19013_001    74290 1371
## 3 25015 Hampshire County, Massachusetts B19013_001    76959 2504
## 4 25025   Suffolk County, Massachusetts B19013_001    80260 1586
## 5 25023  Plymouth County, Massachusetts B19013_001    98190 1711
## 6 25027 Worcester County, Massachusetts B19013_001    81660  987
##                         geometry
## 1 MULTIPOLYGON (((-71.89877 4...
## 2 MULTIPOLYGON (((-70.83595 4...
## 3 MULTIPOLYGON (((-73.06577 4...
## 4 MULTIPOLYGON (((-70.93091 4...
## 5 MULTIPOLYGON (((-70.88335 4...
## 6 MULTIPOLYGON (((-72.31363 4...
```

---

## Data Structures: Spatial Data Frames

* [Example: American Community Survey Data](https://walker-data.com/tidycensus/)

<img src="stat108_wk01mon_files/figure-html/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" />

* Will learn to use the `sf` package for wrangling spatial data.

---

## Data Structures: Lists

* `list`s are a more flexible structure for storing data!

.pull-left[

```r
got
```

```
## [[1]]
## [[1]]$name
## [1] "Theon Greyjoy"
##
## [[1]]$gender
## [1] "Male"
##
## [[1]]$culture
## [1] "Ironborn"
##
##
## [[2]]
## [[2]]$name
## [1] "Tyrion Lannister"
##
## [[2]]$gender
## [1] "Male"
##
## [[2]]$culture
## [1] ""
```

]

--

.pull-right[

* From a data analysis perspective, they are hard to work with!

<img src="img/purrr.png" width="20%" style="float:left; padding:25px" style="display: block; margin: auto;" />

<br>

* Will learn to use the `purrr` package to convert nested `list`s into `data.frame`s.
<br>

<br>

<br>

```
##                name gender  culture
## 1     Theon Greyjoy   Male Ironborn
## 2  Tyrion Lannister   Male
## 3 Victarion Greyjoy   Male Ironborn
## 4              Will   Male
## 5        Areo Hotah   Male Norvoshi
## 6             Chett   Male
```

]

---
class: middle, center

## Learning Goals

<img src="img/STAT108Logo_Viz.png" width="20%" style="display: block; margin: auto;" />

---

<img src="img/ggplot2.png" width="10%" style="float:left; padding:25px" style="display: block; margin: auto;" />

### Goal: Create **static** data visualizations of multivariate data with `ggplot2`.

<img src="stat108_wk01mon_files/figure-html/unnamed-chunk-37-1.png" width="648" style="display: block; margin: auto;" />

---

### Goal: Create **animated** data visualizations of multivariate data with `gganimate`.

<img src="stat108_wk01mon_files/figure-html/unnamed-chunk-38-1.gif" style="display: block; margin: auto;" />

---

<img src="img/plotlyshiny.001.jpeg" width="30%" style="float:left; padding:25px" style="display: block; margin: auto;" />

### Goal: Create **interactive** data visualizations of multivariate data with `plotly` and `shiny`.
#### Also follow **best practices** in data viz, which is Wednesday's lecture topic!

---
class: middle, center

## Learning Goals

<img src="img/STAT108Logo_Sharing.png" width="20%" style="display: block; margin: auto;" />

--

### What forms do you use to share the results of your data work?

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

### Example: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

> Mission: "Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US."

--

.pull-left[

* Every 10 years, each state produces a book of estimates: lots of tables and some graphs.
* I wondered if dashboards would allow for more frequent updates and for more questions to be answered.

]

.pull-right[

<img src="img/fia_sharing.001.jpeg" width="100%" style="display: block; margin: auto;" />

]

---

### And then the pandemic happened and Johns Hopkins made their [COVID dashboard](https://coronavirus.jhu.edu/map.html):

<img src="img/jhu_covid.png" width="70%" style="display: block; margin: auto;" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

### Example: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

<br>

<img src="img/fia_dash.png" width="50%" style="display: block; margin: auto;" />

* It is now very common for any FIA work I do to include a [corresponding dashboard](https://ncasi-shiny-tools.shinyapps.io/Counties/).

---

<img src="img/flexdashboard.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

### Goal: Create interactive dashboards with `shinydashboard` and `flexdashboard`.
<br>

<img src="img/fia_dash.png" width="50%" style="display: block; margin: auto;" />

---

<img src="img/devtools.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

<br>

### Goal: Learn to create an `R` package with `devtools` for Project 2.

<br>

--

I have two `R` packages on the [Comprehensive R Archive Network](https://cran.r-project.org/):

* `mase`: Contains a collection of survey estimators that combine complex survey data with auxiliary data sources (e.g., satellite imagery, administrative records)
* `pdxTrees`: Contains tree data from Portland, OR's Parks and Rec Department

<img src="img/my_hexes.001.jpeg" width="40%" style="display: block; margin: auto;" />

---

<img src="img/octocat.001.jpeg" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

### Goal: Develop a collaborative, version-controlled workflow using git and GitHub.

<br>

.pull-left[

* Will use git and GitHub for both of the projects.
* Will also be given a personal GitHub repository for practice and storing course materials.
* GitHub is a great place to share code, R packages, and data work!

]

.pull-right[

<img src="img/git_2x.png" width="50%" style="display: block; margin: auto;" />

]

---

<img src="img/distill.png" width="15%" style="float:left; padding:10px" style="display: block; margin: auto;" />

<br>

### Goal: Learn how to create a website with `distill`.

<br>

<img src="img/my_site.png" width="60%" style="display: block; margin: auto;" />

---

### Goal: Utilize a reproducible workflow for all `R` work!

--

.pull-left[

* Store `R` work in `R` scripts and `RMarkdown` documents.

<img src="img/rmarkdown.png" width="50%" style="display: block; margin: auto;" />

]

--

.pull-right[

* Create reproducible coding examples so it is easy to ask and answer coding questions.

<img src="img/reprex.png" width="50%" style="display: block; margin: auto;" />

]

--

#### Now let's move on to the structure of the course!
---

## Stat 108 Structure

.pull-left[

<img src="img/week_stat108s23.png" width="100%" style="display: block; margin: auto;" />

]

--

.pull-right[

#### Meetings

* Lectures Mon and Wed at 9am.
* Starting in Week 2, optional Sections.
    + Mix of review, problem set work time, and project feedback activities.
* Lots of office hours.
    + Come and collaborate!

]

---

## Stat 108 Structure

.pull-left[

<img src="img/week_stat108s23.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

#### Assessments

* **Weekly problem sets**
    + Start in Week 2.
    + Released on Thur and due the following Wed at 5:00pm.
    + Get 1 drop and 4 extension days.
* **Weekly lecture quizzes**
    + Open notes.
    + Start in Week 2.
    + Released at noon on Wed and due 48 hours later.
    + Get 1 drop.
* **Projects**
    + Project 1: Create an interactive dashboard.
    + Project 2: Create an `R` package.
* **Engagement**

]

---

### Stat 108 Code of Conduct

I expect everyone in this class to strive to foster a learning environment that is equitable, inclusive, and welcoming. If you experience any barriers to learning, please come to me or a college administrator with your concerns.

#### Code of Conduct

I expect all members of Stat 108 to make participation a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

I expect everyone to act and interact in ways that contribute to an open, welcoming, inclusive, and healthy community of learners.

You can contribute to a positive learning environment by demonstrating empathy and kindness, being respectful of differing viewpoints and experiences, and giving and gracefully accepting constructive feedback.
This Code of Conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org/version/2/0/code_of_conduct.html), version 2.0.

****************************

Please read the syllabus for other policies (e.g., academic honesty, regrades).

---

## Learning to be a Coder

Many of the assignments will provide opportunities to stretch yourself and learn code **not** covered directly in class. Why?

--

### (Another) Goal: Develop our abilities to effectively search for and evaluate potential solutions and then adapt the code to our situation.

<img src="img/internet_help.001.jpeg" width="80%" style="display: block; margin: auto;" />

--

😾 **Potential Erroneous Side Effect:** Concluding that you are bad at coding because you can't solve the problem "on your own" or because you find the answers on Stack Overflow confusing/unhelpful.

---

### (Another) Goal: Develop our abilities to effectively search for and evaluate potential solutions and then adapt the code to our situation.

I encourage employing the following strategy for solving coding questions:

→ Try the problem.

--

→ If and when you get stuck, look to the internet for potential solutions.

--

→ If you find some promising ideas, try them out.

--

→ For the code that seems most helpful, spend time figuring out what each line does.

--

→ If the code doesn't exactly solve your problem, adapt it. Even if it does, still consider whether or not modifications should be made.

--

→ If still stuck, post your question on our class Slack and/or come to office hours. **Don't just stay stuck.**

--

**Key:** Try to find the right balance between independent learning and supported learning.

+ And, get help **before** the frustration sets in!

---

### Reminders

* No section and no lecture quiz this week.
* Make sure to go through the syllabus, which can be found in the Getting Started Module on Canvas.
* Only I will be running office hours this week at the following times:
    + Tuesday 10:30 am - noon in Science Center 316
    + Wednesday 1:30 - 3:00 pm in Science Center 316
    + Thursday 10:30 - 11:30 am in Science Center 316 (this week only)
* Note: HUIT is still setting up our RStudio Server.