background-image: url("img/logo_padded.001.jpeg")
background-position: left
background-size: 60%
class: middle, center

.pull-right[
<br>

## .base_color[Ingesting Web Data]

<br>
<br>

#### .navy[Kelly McConville]

#### .navy[ Stat 108 | Week 12 | Spring 2023]

]

---
background-image: url("img/ggparty_s23.001.jpeg")
background-size: 80%
class: bottom, center

### If you are able to attend, please RSVP: [https://bit.ly/ggpartys23](https://bit.ly/ggpartys23)

---
### Announcements

* No lecture on Wed, April 19th.  Project 2 OHs 9-10:15am **with pastries** in 316 instead.

************************

### Week's Goals

.pull-left[
**Mon Lecture**

* Ingesting web data

]

.pull-right[
**Wed Lecture/Section**

* Project 2 OHs

* Project 2 peer feedback on package name
    + Make sure you have a clear idea of what you want your project to do before Thursday!

]

---
### Returning to the Project 2 Timeline

* 4/12: Receive project instructions, [group assignment](https://docs.google.com/spreadsheets/d/1xU3w4sXQSWkU678YjtdnjyBZOuB6RtFiZeB4fbsZozo/edit?usp=sharing), and invite to your group's GitHub repo.
    + Please use your assigned Stat 108 GitHub repo for this project.
* **4/19:** 9 - 10:15am: Lecture is cancelled.  Package office hours and pastries in SC 316.
* **4/19:** As a group, come up with three potential names and a short description of your package.  Add this information to the [potential package names spreadsheet](https://docs.google.com/spreadsheets/d/1N63gjVFeHMdkUCENOsfVFKFE0aIT4jYVGcqQ81ZVXh0/edit?usp=sharing).
* **4/20 - 4/22:** Package naming peer feedback activity
    + Will post instructions for the activity at noon on the 20th.
    + Part of section time that week will be devoted to the feedback activity.
* 5/8 - 5/10: Give a 5 minute presentation of your package in one of the following time slots:
    + Monday, May 8th noon - 2pm
    + Wednesday, May 10th 9 - 11am
* 5/10 (noon): Make sure the final version of your package is in your GitHub repo.
* 5/10: Decide as a group whether or not you want to make your GitHub repo (and so `R` package) public.
    + If at least one group member does not want to make the repo public, please leave it private.
    + Your grade on this project does not at all depend on this decision.

---
### Grabbing Data From The Web

Four main categories (listed from easiest to hardest):

* **Download and Go**: Flat files, such as csvs, that you can download and then read in via something like `readr`.

--

* **Package API Wrapper**: R packages that talk to APIs.

--

* **API**: Talking to the APIs directly.

--

* **Scrape**: Scraping directly from a website.

---
### Download and Go

* You are (likely) doing this for your project.

* We did this for the R Data Package demo:

```r
dat <- readr::read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")
```

---
### Data Ingestion

```r
dat <- readr::read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")
```

* Lots of useful arguments in `read_csv()`
    + `skip`
    + `col_names` and `col_types`
    + `na`
* Other packages for different file formats
    + `haven`: SPSS, Stata, and SAS files
    + `readxl`: Excel files
* For large files, consider `fread()` in `data.table`.
* A sketch of these arguments in action is on the next slide.
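
---
### Data Ingestion

A minimal sketch of those `read_csv()` arguments together; the file path, column names, and `NA` codes here are hypothetical:

```r
library(readr)

# Skip two lines of metadata at the top of the file,
# supply our own column names and types, and treat "-999" as missing
dat <- read_csv(
  "data/messy_file.csv",         # hypothetical file
  skip = 2,                      # header junk to ignore
  col_names = c("id", "score"),  # our own column names
  col_types = "cd",              # id = character, score = double
  na = c("", "NA", "-999")       # strings to read in as NA
)
```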

---
### If `fread()` from `data.table` is faster, why not always use `fread()`?

--

* If using `tidyverse` packages, it is often easiest to stay within the family because...
    + Functions play nicely together.
    + Easier to read your code.

--

* Good to load a minimal number of packages.
    + Can also be an argument for using `read.csv()`.

--

* Check if loading a new package is worth it.

---
### Example: Is `fread()` worth it?

* Main differences between base, `readr`, and `data.table`:
    + All read a flat file into a data frame.
    + `readr` reads in as a [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html).
    + `data.table` reads in as a [data.table](https://www.rdocumentation.org/packages/data.table/versions/1.10.4/topics/data.table-package).
    + `readr` and `data.table` don't automatically convert characters to factors.
        + [Why?](https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/)
    + Read time (slowest to fastest): base > `readr` > `data.table`

---
### Example: Is `fread()` worth it?

```r
system.time(dat1 <- read.csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD"))
```

```
##    user  system elapsed 
##   0.157   0.001   0.524
```

```r
system.time({
  library(readr)
  dat2 <- read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")})
```

```
##    user  system elapsed 
##   0.151   0.022   0.466
```

```r
system.time({
  library(data.table)
  dat3 <- fread("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv?accessType=DOWNLOAD")})
```

```
##    user  system elapsed 
##   0.157   0.004   0.421
```

---
### Example: Is `fread()` worth it?

```r
system.time(dat1 <- read.csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv"))
```

```
##    user  system elapsed 
##  10.053   0.244  12.540
```

```r
system.time({
  library(readr)
  dat2 <- read_csv("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv")})
```

```
##    user  system elapsed 
##   4.858   0.673  10.410
```

```r
system.time({
  library(data.table)
  dat3 <- fread("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv")})
```

```
##    user  system elapsed 
##   2.461   0.492   9.854
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv")

# Determine its type
class(cities)
```

```
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

```
## Error in rpms(rp_equ = life_expectancy ~ walkability_score + city_type +  : 
##   RPMS works only on numeric or factor data types.
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv") %>%
  as.data.frame()

# Determine its type
class(cities)
```

```
## [1] "data.frame"
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

```
## Error in rpms(rp_equ = life_expectancy ~ walkability_score + city_type +  : 
##   RPMS works only on numeric or factor data types.
```

---
### But careful: Not all functions play nice with `tidyverse` objects!

```r
# Load data
library(tidyverse)
cities <- read_csv("https://raw.githubusercontent.com/harvard-stat108s23/materials/main/psets/data/cities_UI.csv") %>%
  as.data.frame() %>%
  mutate(across(where(is.character), as.factor))

# Determine its type
class(cities)
```

```
## [1] "data.frame"
```

```r
# Create a regression tree
library(rpms)
tree <- rpms(rp_equ = life_expectancy ~ walkability_score + city_type + region,
             data = cities)
```

---
### Package API Wrapper

Step back: What is an API?

--

API = Application Programming Interface

--

Web API: Allows access to an organization's assets (e.g., its data) via a defined set of messages.

In R:

* (Client) Give R a URL to request information (data) from.
* (Server) API sends back a response.

--

* There are over 10,000 APIs on the web.

---
### APIs

* Many organizations have made their data available via APIs over the internet.

* Other people have written R packages that wrap the API.
    + R functions that will make the query and format the response for you.
    + Output is then often a data frame!

* (Often) Need an API key.
    + So they know who is requesting their data and when to limit the amount of data given.

* Let's look at an example of an API wrapper package.

---
### Example API wrapper: `rebird`

* [eBird](https://ebird.org/home) is an online database of bird sightings.

* [rebird](https://github.com/ropensci/rebird) is an R interface to the eBird API.

```r
devtools::install_github("ropensci/rebird")
```

* Need an API key
    + Link to your account

```r
ebird_key <- "Insert key"
```

---
### Example API wrapper: `rebird`

* Search for bird occurrences by latitude and longitude point

```r
library(rebird)
species_code("Cardinalis cardinalis")
```

```
## [1] "norcar"
```

---

```r
ebirdgeo(species = "norcar", lng = -71.11, lat = 42.38, key = ebird_key)
```

```
## # A tibble: 604 × 13
##    speciesCode comName  sciName locId locName obsDt howMany   lat   lng obsValid
##    <chr>       <chr>    <chr>   <chr> <chr>   <chr>   <int> <dbl> <dbl> <lgl>   
##  1 norcar      Norther… Cardin… L205… 270 Tr… 2023…       1  42.6 -71.3 TRUE    
##  2 norcar      Norther… Cardin… L229… Charle… 2023…       1  42.4 -71.1 TRUE    
##  3 norcar      Norther… Cardin… L208… Great … 2023…       1  42.5 -71.3 TRUE    
##  4 norcar      Norther… Cardin… L659… Amelia… 2023…       6  42.4 -71.1 TRUE    
##  5 norcar      Norther… Cardin… L209… Hall's… 2023…       1  42.3 -71.1 TRUE    
##  6 norcar      Norther… Cardin… L358… Arling… 2023…       5  42.4 -71.2 TRUE    
##  7 norcar      Norther… Cardin… L185… Eaton'… 2023…       5  42.2 -71.0 TRUE    
##  8 norcar      Norther… Cardin… L449… Upper … 2023…       5  42.4 -71.2 TRUE    
##  9 norcar      Norther… Cardin… L159… Middle… 2023…       9  42.4 -71.1 TRUE    
## 10 norcar      Norther… Cardin… L241… Barret… 2023…       2  42.5 -71.4 TRUE    
## # ℹ 594 more rows
## # ℹ 3 more variables: obsReviewed <lgl>, locationPrivate <lgl>, subId <chr>
```

---
### Example API wrapper: `rebird`

Recent notable sightings

.pull-left[

```r
ebirdnotable(lng = -71.11, lat = 42.38, key = ebird_key)
```

```
## # A tibble: 3,261 × 14
##    speciesCode comName  sciName locId locName obsDt howMany   lat   lng obsValid
##    <chr>       <chr>    <chr>   <chr> <chr>   <chr>   <int> <dbl> <dbl> <lgl>   
##  1 chiswi      Chimney… Chaetu… L239… Wheato… 2023…       1  42.0 -71.2 FALSE   
##  2 chiswi      Chimney… Chaetu… L239… Wheato… 2023…       1  42.0 -71.2 FALSE   
##  3 balori      Baltimo… Icteru… L883… Darrel… 2023…       1  42.3 -72.5 FALSE   
##  4 sancra      Sandhil… Antigo… L298… Sandhi… 2023…       2  44.3 -72.1 FALSE   
##  5 botgra      Boat-ta… Quisca… L387… Stewar… 2023…       3  41.2 -73.2 FALSE   
##  6 gwcspa      White-c… Zonotr… L387… Stewar… 2023…       1  41.2 -73.2 FALSE   
##  7 balori      Baltimo… Icteru… L577… 9 Darr… 2023…       1  42.3 -72.5 FALSE   
##  8 saypho      Say's P… Sayorn… L823… Thomps… 2023…       1  41.7 -70.1 FALSE   
##  9 whfibi      White-f… Plegad… L581… Scarbo… 2023…       1  43.6 -70.4 FALSE   
## 10 comyel      Common … Geothl… L236… Steve's 2023…       1  43.7 -72.3 FALSE   
## # ℹ 3,251 more rows
## # ℹ 4 more variables: obsReviewed <lgl>, locationPrivate <lgl>, subId <chr>,
## #   exoticCategory <chr>
```

]

.pull-right[

<img src="img/red_winged_blackbird.jpg" width="800" style="display: block; margin: auto;" />

]

---
### Data from APIs: What To Do

* Ask the internet if there is an R package for a particular API.

* If so, read the vignette/help files.

* If not, you must talk to the API directly.

---
### Other APIs to Play With!

* [`rscorecard`](https://cran.r-project.org/web/packages/rscorecard/)
* [ieugwasr](https://mrcieu.github.io/ieugwasr/index.html)
* [VancouvR](https://mountainmath.github.io/VancouvR/index.html)
* [traveltime](https://tlorusso.github.io/traveltime/vignette.html)
* [nbastatR](https://github.com/abresler/nbastatR)
* [eia](https://docs.ropensci.org/eia/)
* [tradestatistics](https://docs.ropensci.org/tradestatistics/)
* [fbicrime](https://github.com/SUN-Wenjun/fbicrime)
* [wbstats](https://github.com/nset-ornl/wbstats)
* [rtweet](https://docs.ropensci.org/rtweet/)

* If you are thinking about writing an API wrapper package for Project 2, check out their code (via their GitHub repos) for inspiration.
    + Note: Don't create an API wrapper package if one already exists.

---
### Web Data

* Two common languages of web services:
    + JavaScript Object Notation (JSON)
    + eXtensible Markup Language (XML)

* We won't be deep diving into JSON/XML today.
    + Will use functions to convert to `R` objects (`list()`s!).
    + A small example of that conversion is on the next slide.
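
---
### Web Data

To make the JSON-to-list idea concrete, here is a minimal sketch using `jsonlite` (the package `httr` uses under the hood to parse JSON responses); the JSON string itself is made up:

```r
library(jsonlite)

# A tiny, made-up JSON snippet
bird_json <- '{"comName": "Northern Cardinal", "howMany": 2, "obsValid": true}'

# fromJSON() converts the JSON into an R list
bird <- fromJSON(bird_json)

# bird is now a named list:
# bird$comName, bird$howMany, bird$obsValid
```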

---
### APIs

* We will use `httr` to access data via APIs.
    + `tidyverse` adjacent
    + Play on HTTP: Hyper-Text Transfer Protocol

* You used `httr` in your last p-set!

```r
library(httr)
```

---
### APIs

* For our example, let's grab the College Scorecard data from [data.gov](https://www.data.gov/).
    + You will need to first sign up for an API key [here](https://api.data.gov/signup/).

```r
# Store API key (Change to your personal key!)
my_key <- "insert key"
```

```r
# URL of interest
url <- "https://api.data.gov/ed/collegescorecard/v1/schools?"

# Download available data for Harvard
harvard <- GET(url, query = list(api_key = my_key,
                                 school.name = "Harvard University"))
```

---

```r
# Look at type
http_type(harvard)
```

```
## [1] "application/json"
```

```r
# Examine components
names(harvard)
```

```
##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"     
```

---
### Let's start with `status_code`

Key:

* 2xx: Success
* 3xx: Redirection (your request was sent somewhere else)
* 4xx: Client error (something's not right on your end)
* 5xx: Server error (something's not right on their end)

```r
status_code(harvard)
```

```
## [1] 200
```

* If not 200, check that you got the correct URL.

---
### Want to pull out the `content`

```r
# Convert data into an R object
# JSON automatically parsed into named list
dat <- content(harvard, as = "parsed", type = "application/json")

# Look at structure
class(dat)
```

```
## [1] "list"
```

---

```r
# Continue looking at structure
names(dat)
```

```
## [1] "metadata" "results" 
```

```r
glimpse(dat)
```

```
## List of 2
##  $ metadata:List of 3
##   ..$ page    : int 0
##   ..$ total   : int 1
##   ..$ per_page: int 20
##  $ results :List of 1
##   ..$ :List of 7
##   .. ..$ latest    :List of 10
##   .. ..$ school    :List of 37
##   .. ..$ location  :List of 2
##   .. ..$ id        : int 166027
##   .. ..$ ope6_id   : chr "002155"
##   .. ..$ ope8_id   : chr "00215500"
##   .. ..$ fed_sch_cd: chr "002155"
```
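
---
### Navigating the parsed list

The parsed response is a nested list, so base subsetting (or `purrr::pluck()`) gets you down to individual fields. A sketch, assuming `dat` from the previous slide; the `"school", "name"` path is illustrative:

```r
library(purrr)

# First (and only) result in the list of results
result <- dat$results[[1]]

# Pull out individual fields; pluck() returns NULL
# instead of erroring if a level is missing
result$id                        # 166027
pluck(result, "school", "name")  # school name, if that field is present
```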

---
### Interacting with APIs

* Spend time cleaning the data and then...

* **Important last step:** Save the data as a `.RData` or a `.csv`.
    + Don't have R run `GET()` each time you knit!  The API might stop talking to you.

```r
write_csv(clean_dat, "clean_dat.csv")
```

---
### Web Scraping

* Found data on a website but there isn't an API.

--

* How easy it is to grab that data depends on the quality of the website!
    + And be prepared to do a LOT of cleaning once you have the data.

---
### HyperText Markup Language (HTML)

* Most of the data on the web is available as HTML.

* It is structured (hierarchical) but often not available in a useful, tidy format.

```
<div class="remark-slide-content hljs-github">
  <h3>HyperText Markup Language (HTML)</h3>
  <ul>
    <li>Most of the data on the web is available as HTML.</li>
    <li>It is structured (hierarchical) but often not available in a useful, tidy format.</li>
  </ul>
</div>
```

---
### Web Scraping

`rvest`: package for basic processing and manipulation of HTML data

* Designed to work with `%>%`

<img src="img/rvest.png" width="20%" style="display: block; margin: auto;" />

---
### Key `rvest` functions

* `read_html()`: Read in HTML data from a URL

* `html_node()`: Select a specified node from the HTML document

* `html_nodes()`: Select specified nodes from the HTML document

* `html_table()`: Parse an HTML table into a data frame

---
### Web Scraping

**Steps:**

0. Check for permission to scrape the data with `robotstxt::paths_allowed()`.

1. Read the HTML page into R with `read_html()`.

2. Extract the nodes of the page that correspond to elements of interest.
    + Use web tools to help identify these nodes.

3. Clean up the extracted text fields.

4. Write the data to a csv so that you are not scraping it every time, in case the website goes down, or if the data changes.

A sketch stringing steps 0-4 together is on the next slide.
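
---
### Web Scraping

A compact sketch of the steps, assuming a hypothetical page with a single HTML table at `example.com/table-page`:

```r
library(rvest)
library(robotstxt)
library(readr)

url <- "https://example.com/table-page"  # hypothetical URL

# Step 0: check that scraping this path is allowed
paths_allowed(url)

# Steps 1-2: read the page, then select the node(s) of interest
page <- read_html(url)
tbl_list <- page %>%
  html_nodes("table") %>%  # identify the right nodes with web tools
  html_table()             # parse HTML tables into data frames

tbl <- tbl_list[[1]]       # keep the first table

# Step 3 (cleaning the extracted fields) would happen here, then...
# Step 4: save so you are not re-scraping on every knit
write_csv(tbl, "scraped_table.csv")
```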

---
class: center, middle

## Let's go through the webScraping.Rmd handout in the Handouts folder!

---
#### Scraping Challenges

1. Reproducibility
    + The data are not static.
    + Websites change their structure over time.

--

2. Some website structures are much harder to scrape, especially when the data you want is spread over many nodes.

--

3. Making many requests to a web server may cause it to ban your requests or slow down the speed of information retrieval.

--

4. The quality of the data
    + Is it provided by users of the website or staff affiliated with the website?

--

5. Privacy and consent considerations
    + Ex: OkCupid data scraped and provided on PsychNet

--

6. Legality issues
    + Ex: Dispute between eBay and Bidder's Edge (2000): Court banned Bidder's Edge from scraping data from eBay
    + Ex: Dispute between LinkedIn and HiQ (2019): scraping publicly available info `\(\neq\)` hacking but may involve copyright infringement
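
---
### Appendix: Softening challenges 1 and 3

A hedged sketch of the "save it once" advice from earlier: only hit the server if a local copy doesn't already exist. The file name and `get_data()` helper are hypothetical stand-ins for your own ingestion code.

```r
library(readr)

local_copy <- "clean_dat.csv"   # hypothetical cached file

if (!file.exists(local_copy)) {
  # Hit the API / scrape only on the first run
  clean_dat <- get_data()       # hypothetical helper that calls GET(), etc.
  write_csv(clean_dat, local_copy)
} else {
  # Every later knit reads the local copy instead
  clean_dat <- read_csv(local_copy)
}
```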