background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[
<br>

## .base_color[Ingesting Web Data]

<br>
<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 12 | Spring 2023]

]

---
background-image: url("img/ggparty_s23.001.jpeg")
background-size: 80%
class: bottom, center

### If you are able to attend, please RSVP: [https://bit.ly/ggpartys23](https://bit.ly/ggpartys23)

---
### Announcements

* No lecture on Wed, April 19th.  Project 2 OHs 9-10:15am **with pastries** in 316 instead.

************************

### Week's Goals

.pull-left[
**Mon Lecture**

* Ingesting web data

]

.pull-right[
**Wed Lecture/Section**

* Project 2 OHs

* Project 2 peer feedback on package name
    + Make sure you have a clear idea of what you want your project to do before Thursday!

]

---
### Returning to the Project 2 Timeline

* 4/12: Receive project instructions, [group assignment](https://docs.google.com/spreadsheets/d/1xU3w4sXQSWkU678YjtdnjyBZOuB6RtFiZeB4fbsZozo/edit?usp=sharing), and invite to your group's GitHub repo.
    + Please use your assigned Stat 108 GitHub repo for this project.
* **4/19:** 9 - 10:15am: Lecture is cancelled.  Package office hours and pastries in SC 316.
* **4/19:** As a group, come up with three potential names and a short description of your package.  Add this information to the [potential package names spreadsheet](https://docs.google.com/spreadsheets/d/1N63gjVFeHMdkUCENOsfVFKFE0aIT4jYVGcqQ81ZVXh0/edit?usp=sharing).
* **4/20 - 4/22:** Package naming peer feedback activity
    + Will post instructions for the activity at noon on the 20th.
    + Part of section time that week will be devoted to the feedback activity.
* 5/8 - 5/10: Give a 5 minute presentation of your package in one of the following time slots:
    + Monday, May 8th noon - 2pm
    + Wednesday, May 10th 9 - 11am
* 5/10 (noon): Make sure the final version of your package is in your GitHub repo.
* 5/10: Decide as a group whether or not you want to make your GitHub repo (and so `R` package) public.
    + If at least one group member does not want to make the repo public, please leave it private.
    + Your grade on this project does not at all depend on this decision.

---
### Grabbing Data From The Web

Four main categories (listed from easiest to hardest):

* **Download and Go**: Flat files, such as csvs, that you can download and then read in via something like `readr`.

--

* **Package API Wrapper**: R packages that talk to APIs.

--

* **API**: Talking to the APIs directly.

--

* **Scrape**: Scraping directly from a website.

---
### Download and Go

* You are (likely) doing this for your project.

* We did this for the R Data Package demo:

```r
dat <- readr::read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")
```

---
### Data Ingestion

```r
dat <- readr::read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")
```

* Lots of useful arguments in `read_csv()`
    + `skip`
    + `col_names` and `col_types`
    + `na`
* Other packages for different file formats
    + `haven`: SPSS, Stata, and SAS files
    + `readxl`: Excel files
* For large files, consider `fread()` in `data.table`.
* A sketch of these arguments in action is on the next slide.
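
---
### Data Ingestion

A minimal sketch of those `read_csv()` arguments together; the file path, column names, and `NA` codes here are hypothetical:

```r
library(readr)

# Skip two lines of metadata at the top of the file,
# supply our own column names and types, and treat "-999" as missing
dat <- read_csv(
  "data/messy_file.csv",         # hypothetical file
  skip = 2,                      # header junk to ignore
  col_names = c("id", "score"),  # our own column names
  col_types = "cd",              # id = character, score = double
  na = c("", "NA", "-999")       # strings to read in as NA
)
```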

---
### If `fread()` from `data.table` is faster, why not always use `fread()`?

--

* If using `tidyverse` packages, it is often easiest to stay within the family because...
    + Functions play nicely together.
    + Easier to read your code.

--

* Good to load a minimal number of packages.
    + Can also be an argument for using `read.csv()`.

--

* Check if loading a new package is worth it.

---
### Example: Is `fread()` worth it?

* Main differences between base, `readr`, and `data.table`:
    + All read a flat file into a data frame.
    + `readr` reads in as a [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html).
    + `data.table` reads in as a [data.table](https://www.rdocumentation.org/packages/data.table/versions/1.10.4/topics/data.table-package).
    + `readr` and `data.table` don't automatically convert characters to factors.
        + [Why?](https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/)
    + Read time (slowest to fastest): base > `readr` > `data.table`

---
### Example: Is `fread()` worth it?

```r
system.time(dat1 <- read.csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD"))
```

```
##    user  system elapsed 
##   0.157   0.001   0.524
```

```r
system.time({
  library(readr)
  dat2 <- read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")})
```

```
##    user  system elapsed 
##   0.151   0.022   0.466
```

```r
system.time({
  library(data.table)
  dat3 <- fread("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")})
```

```
##    user  system elapsed 
##   0.157   0.004   0.421
```

---
### Example: Is `fread()` worth it?

```r
system.time(dat1 <- read.csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv"))
```

```
##    user  system elapsed 
##  10.053   0.244  12.540
```

```r
system.time({
  library(readr)
  dat2 <- read_csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv")})
```

```
##    user  system elapsed 
##   4.858   0.673  10.410
```

```r
system.time({
  library(data.table)
  dat3 <- fread("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv")})
```

```
##    user  system elapsed 
##   2.461   0.492   9.854
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv")

# Determine its type
class(cities)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

```
## Error in rpms(rp_equ = life_expectancy ~ walkability_score + city_type +  : 
##   RPMS works only on numeric or factor data types.
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv") %>%
  as.data.frame()

# Determine its type
class(cities)
```

```
## [1] "data.frame"
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

```
## Error in rpms(rp_equ = life_expectancy ~ walkability_score + city_type +  : 
##   RPMS works only on numeric or factor data types.
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv") %>%
  as.data.frame() %>%
  mutate(across(where(is.character), as.factor))

# Determine its type
class(cities)
```

```
## [1] "data.frame"
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

---
### Package API Wrapper

Step back: What is an API?

--

API = Application Programming Interface

--

Web API: Allows access to an organization's assets (e.g., its data) via a defined set of messages.

In R:

* (Client) Give R a URL to request information (data) from.
* (Server) API sends back a response.

--

* There are over 10,000 APIs on the web.

---
### APIs

* Many organizations have made their data available via APIs over the internet.

* Other people have written R packages that wrap the API.
    + R functions that will make the query and format the response for you.
    + Output is then often a data frame!

* (Often) Need an API key.
    + So they know who is requesting their data and when to limit the amount of data given.

* Let's look at an example of an API wrapper package.

---
### Example API wrapper: `rebird`

* [eBird](https://ebird.org/home) is an online database of bird sightings.

* [rebird](https://github.com/ropensci/rebird) is an R interface to the eBird API.

```r
devtools::install_github("ropensci/rebird")
```

* Need an API key
    + Link to your account

```r
ebird_key <- "Insert key"
```

---
### Example API wrapper: `rebird`

* Search for bird occurrences by latitude and longitude point

```r
library(rebird)
species_code("Cardinalis cardinalis")
```

```
## [1] "norcar"
```

---

```r
ebirdgeo(species = "norcar", lng = -71.11, lat = 42.38, key = ebird_key)
```

```
## # A tibble: 604 × 13
##    speciesCode comName  sciName locId locName obsDt howMany   lat   lng obsValid
##    <chr>       <chr>    <chr>   <chr> <chr>   <chr>   <int> <dbl> <dbl> <lgl>   
##  1 norcar      Norther… Cardin… L205… 270 Tr… 2023…       1  42.6 -71.3 TRUE    
##  2 norcar      Norther… Cardin… L229… Charle… 2023…       1  42.4 -71.1 TRUE    
##  3 norcar      Norther… Cardin… L208… Great … 2023…       1  42.5 -71.3 TRUE    
##  4 norcar      Norther… Cardin… L659… Amelia… 2023…       6  42.4 -71.1 TRUE    
##  5 norcar      Norther… Cardin… L209… Hall's… 2023…       1  42.3 -71.1 TRUE    
##  6 norcar      Norther… Cardin… L358… Arling… 2023…       5  42.4 -71.2 TRUE    
##  7 norcar      Norther… Cardin… L185… Eaton'… 2023…       5  42.2 -71.0 TRUE    
##  8 norcar      Norther… Cardin… L449… Upper … 2023…       5  42.4 -71.2 TRUE    
##  9 norcar      Norther… Cardin… L159… Middle… 2023…       9  42.4 -71.1 TRUE    
## 10 norcar      Norther… Cardin… L241… Barret… 2023…       2  42.5 -71.4 TRUE    
## # ℹ 594 more rows
## # ℹ 3 more variables: obsReviewed <lgl>, locationPrivate <lgl>, subId <chr>
```

---
### Example API wrapper: `rebird`

Recent notable sightings

.pull-left[

```r
ebirdnotable(lng = -71.11, lat = 42.38, key = ebird_key)
```

```
## # A tibble: 3,261 × 14
##    speciesCode comName  sciName locId locName obsDt howMany   lat   lng obsValid
##    <chr>       <chr>    <chr>   <chr> <chr>   <chr>   <int> <dbl> <dbl> <lgl>   
##  1 chiswi      Chimney… Chaetu… L239… Wheato… 2023…       1  42.0 -71.2 FALSE   
##  2 chiswi      Chimney… Chaetu… L239… Wheato… 2023…       1  42.0 -71.2 FALSE   
##  3 balori      Baltimo… Icteru… L883… Darrel… 2023…       1  42.3 -72.5 FALSE   
##  4 sancra      Sandhil… Antigo… L298… Sandhi… 2023…       2  44.3 -72.1 FALSE   
##  5 botgra      Boat-ta… Quisca… L387… Stewar… 2023…       3  41.2 -73.2 FALSE   
##  6 gwcspa      White-c… Zonotr… L387… Stewar… 2023…       1  41.2 -73.2 FALSE   
##  7 balori      Baltimo… Icteru… L577… 9 Darr… 2023…       1  42.3 -72.5 FALSE   
##  8 saypho      Say's P… Sayorn… L823… Thomps… 2023…       1  41.7 -70.1 FALSE   
##  9 whfibi      White-f… Plegad… L581… Scarbo… 2023…       1  43.6 -70.4 FALSE   
## 10 comyel      Common … Geothl… L236… Steve's 2023…       1  43.7 -72.3 FALSE   
## # ℹ 3,251 more rows
## # ℹ 4 more variables: obsReviewed <lgl>, locationPrivate <lgl>, subId <chr>,
## #   exoticCategory <chr>
```

]

.pull-right[

<img src="img/red_winged_blackbird.jpg" width="800" style="display: block; margin: auto;" />

]

---
### Data from APIs: What To Do

* Ask the internet if there is an R package for a particular API.

* If so, read the vignette/help files.

* If not, you must talk to the API directly.

---
### Other APIs to Play With!

* [`rscorecard`](https://cran.r-project.org/web/packages/rscorecard/)
* [ieugwasr](https://mrcieu.github.io/ieugwasr/index.html)
* [VancouvR](https://mountainmath.github.io/VancouvR/index.html)
* [traveltime](https://tlorusso.github.io/traveltime/vignette.html)
* [nbastatR](https://github.com/abresler/nbastatR)
* [eia](https://docs.ropensci.org/eia/)
* [tradestatistics](https://docs.ropensci.org/tradestatistics/)
* [fbicrime](https://github.com/SUN-Wenjun/fbicrime)
* [wbstats](https://github.com/nset-ornl/wbstats)
* [rtweet](https://docs.ropensci.org/rtweet/)

* If you are thinking about writing an API wrapper package for Project 2, check out their code (via their GitHub repos) for inspiration.
    + Note: Don't create an API wrapper package if one already exists.

---
### Web Data

* Two common languages of web services:
    + JavaScript Object Notation (JSON)
    + eXtensible Markup Language (XML)

* We won't be deep diving into JSON/XML today.
    + Will use functions to convert to `R` objects (`list()`s!).
    + A small example of that conversion is on the next slide.
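
---
### Web Data

To make the JSON-to-list idea concrete, here is a minimal sketch using `jsonlite` (the package `httr` uses under the hood to parse JSON responses); the JSON string itself is made up:

```r
library(jsonlite)

# A tiny, made-up JSON snippet
bird_json <- '{"comName": "Northern Cardinal", "howMany": 2, "obsValid": true}'

# fromJSON() converts the JSON into an R list
bird <- fromJSON(bird_json)

# bird is now a named list:
# bird$comName, bird$howMany, bird$obsValid
```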

---
### APIs

* We will use `httr` to access data via APIs.
    + `tidyverse` adjacent
    + Play on HTTP: Hyper-Text Transfer Protocol

* You used `httr` in your last p-set!

```r
library(httr)
```

---
### APIs

* For our example, let's grab the College Scorecard data from [data.gov](https://www.data.gov/).
    + You will need to first sign up for an API key [here](https://api.data.gov/signup/).

```r
# Store API key (Change to your personal key!)
my_key <- "insert key"
```

```r
# URL of interest
url <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# Download available data for Harvard
harvard <- GET(url, query = list(api_key = my_key,
                                 school.name = "Harvard University"))
```

---

```r
# Look at type
http_type(harvard)
```

```
## [1] "application/json"
```

```r
# Examine components
names(harvard)
```

```
##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"     
```

---
### Let's start with `status_code`

Key:

* 2xx: Success
* 3xx: Redirection (your request was sent somewhere else)
* 4xx: Client error (something's not right on your end)
* 5xx: Server error (something's not right on their end)

```r
status_code(harvard)
```

```
## [1] 200
```

* If not 200, check that you got the correct URL.

---
### Want to pull out the `content`

```r
# Convert data into an R object
# JSON automatically parsed into named list
dat <- content(harvard, as = "parsed", type = "application/json")

# Look at structure
class(dat)
```

```
## [1] "list"
```

---

```r
# Continue looking at structure
names(dat)
```

```
## [1] "metadata" "results" 
```

```r
glimpse(dat)
```

```
## List of 2
##  $ metadata:List of 3
##   ..$ page    : int 0
##   ..$ total   : int 1
##   ..$ per_page: int 20
##  $ results :List of 1
##   ..$ :List of 7
##   .. ..$ latest    :List of 10
##   .. ..$ school    :List of 37
##   .. ..$ location  :List of 2
##   .. ..$ id        : int 166027
##   .. ..$ ope6_id   : chr "002155"
##   .. ..$ ope8_id   : chr "00215500"
##   .. ..$ fed_sch_cd: chr "002155"
```
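
---
### Navigating the parsed list

The parsed response is a nested list, so base subsetting (or `purrr::pluck()`) gets you down to individual fields. A sketch, assuming `dat` from the previous slide; the `"school", "name"` path is illustrative:

```r
library(purrr)

# First (and only) result in the list of results
result <- dat$results[[1]]

# Pull out individual fields; pluck() returns NULL
# instead of erroring if a level is missing
result$id                        # 166027
pluck(result, "school", "name")  # school name, if that field is present
```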

---
### Interacting with APIs

* Spend time cleaning the data and then...

* **Important last step:** Save the data as a `.RData` or a `.csv`.
    + Don't have R run `GET()` each time you knit!  The API might stop talking to you.

```r
write_csv(clean_dat, "clean_dat.csv")
```

---
### Web Scraping

* Found data on a website but there isn't an API.

--

* How easy it is to grab that data depends on the quality of the website!
    + And be prepared to do a LOT of cleaning once you have the data.

---
### HyperText Markup Language (HTML)

* Most of the data on the web is available as HTML.

* It is structured (hierarchical) but often not available in a useful, tidy format.

```
<div class="remark-slide-content hljs-github">
  <h3>HyperText Markup Language (HTML)</h3>
  <ul>
    <li>Most of the data on the web is available as HTML.</li>
    <li>It is structured (hierarchical) but often not available in a useful, tidy format.</li>
  </ul>
</div>
```

---
### Web Scraping

`rvest`: package for basic processing and manipulation of HTML data

* Designed to work with `%>%`

<img src="img/rvest.png" width="20%" style="display: block; margin: auto;" />

---
### Key `rvest` functions

* `read_html()`: Read in HTML data from a URL

* `html_node()`: Select a specified node from the HTML document

* `html_nodes()`: Select specified nodes from the HTML document

* `html_table()`: Parse an HTML table into a data frame

---
### Web Scraping

**Steps:**

0. Check for permission to scrape the data with `robotstxt::paths_allowed()`.

1. Read the HTML page into R with `read_html()`.

2. Extract the nodes of the page that correspond to elements of interest.
    + Use web tools to help identify these nodes.

3. Clean up the extracted text fields.

4. Write the data to a csv so that you are not scraping it every time, in case the website goes down, or if the data changes.

A sketch stringing steps 0-4 together is on the next slide.
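
---
### Web Scraping

A compact sketch of the steps, assuming a hypothetical page with a single HTML table at `example.com/table-page`:

```r
library(rvest)
library(robotstxt)
library(readr)

url <- "https://example.com/table-page"  # hypothetical URL

# Step 0: check that scraping this path is allowed
paths_allowed(url)

# Steps 1-2: read the page, then select the node(s) of interest
page <- read_html(url)
tbl_list <- page %>%
  html_nodes("table") %>%  # identify the right nodes with web tools
  html_table()             # parse HTML tables into data frames

tbl <- tbl_list[[1]]       # keep the first table

# Step 3 (cleaning the extracted fields) would happen here, then...
# Step 4: save so you are not re-scraping on every knit
write_csv(tbl, "scraped_table.csv")
```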

---
class: center, middle

## Let's go through the webScraping.Rmd handout in the Handouts folder!

---
#### Scraping Challenges

1. Reproducibility
    + The data are not static.
    + Websites change their structure over time.

--

2. Some website structures are much harder to scrape, especially when the data you want is spread over many nodes.

--

3. Making many requests to a web server may cause it to ban your requests or slow down the speed of information retrieval.

--

4. The quality of the data
    + Is it provided by users of the website or staff affiliated with the website?

--

5. Privacy and consent considerations
    + Ex: OkCupid data scraped and provided on PsychNet

--

6. Legality issues
    + Ex: Dispute between eBay and Bidder's Edge (2000): Court banned Bidder's Edge from scraping data from eBay
    + Ex: Dispute between LinkedIn and HiQ (2019): scraping publicly available info `\(\neq\)` hacking but may involve copyright infringement
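
---
### Appendix: Softening challenges 1 and 3

A hedged sketch of the "save it once" advice from earlier: only hit the server if a local copy doesn't already exist. The file name and `get_data()` helper are hypothetical stand-ins for your own ingestion code.

```r
library(readr)

local_copy <- "clean_dat.csv"   # hypothetical cached file

if (!file.exists(local_copy)) {
  # Hit the API / scrape only on the first run
  clean_dat <- get_data()       # hypothetical helper that calls GET(), etc.
  write_csv(clean_dat, local_copy)
} else {
  # Every later knit reads the local copy instead
  clean_dat <- read_csv(local_copy)
}
```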