Teaching Data Sharing through R Data Packages

Kelly McConville

Dominguez Center for Data Science

Bucknell University

A Typical Data Problem in Kelly’s Stat 100 Course

Here’s a dataset on the inventoried trees in Portland, OR’s parks. Conduct an exploratory data analysis on these trees.

library(tidyverse)
trees <- read_csv("https://raw.githubusercontent.com/mcconvil/pdxTrees/refs/heads/master/pdxTrees_parks.csv")
trees

# A tibble: 25,534 × 34
   Longitude Latitude UserID Genus      Family   DBH Inventory_Date      Species
       <dbl>    <dbl>  <dbl> <chr>      <chr>  <dbl> <dttm>              <chr>  
 1     -123.     45.6      1 Pseudotsu… Pinac…  37.4 2017-05-09 00:00:00 PSME   
 2     -123.     45.6      2 Pseudotsu… Pinac…  32.5 2017-05-09 00:00:00 PSME   
 3     -123.     45.6      3 Crataegus  Rosac…   9.7 2017-05-09 00:00:00 CRLA   
 4     -123.     45.6      4 Quercus    Fagac…  10.3 2017-05-09 00:00:00 QURU   
 5     -123.     45.6      5 Pseudotsu… Pinac…  33.2 2017-05-09 00:00:00 PSME   
 6     -123.     45.6      6 Pseudotsu… Pinac…  32.1 2017-05-09 00:00:00 PSME   
 7     -123.     45.6      7 Pseudotsu… Pinac…  28.4 2017-05-09 00:00:00 PSME   
 8     -123.     45.6      8 Pseudotsu… Pinac…  27.2 2017-05-09 00:00:00 PSME   
 9     -123.     45.6      9 Pseudotsu… Pinac…  35.2 2017-05-09 00:00:00 PSME   
10     -123.     45.6     10 Pseudotsu… Pinac…  32.4 2017-05-09 00:00:00 PSME   
# ℹ 25,524 more rows
# ℹ 26 more variables: Common_Name <chr>, Condition <chr>, Tree_Height <dbl>,
#   Crown_Width_NS <dbl>, Crown_Width_EW <dbl>, Crown_Base_Height <dbl>,
#   Collected_By <chr>, Park <chr>, Scientific_Name <chr>,
#   Functional_Type <chr>, Mature_Size <chr>, Native <chr>, Edible <chr>,
#   Nuisance <chr>, Structural_Value <dbl>, Carbon_Storage_lb <dbl>,
#   Carbon_Storage_value <dbl>, Carbon_Sequestration_lb <dbl>, …

Ready to explore these data?

Typical Student Submission

“Here’s a graph of the DBH and crown width EW. We can see that as DBH increases, so does crown width EW.”

ggplot(data = trees,
       mapping = aes(x = DBH, y = Crown_Width_EW)) +
  geom_point(alpha = 0.4, color = "#6EBBC5") +
  labs(title = "Scatterplot of DBH and Crown_Width_EW")

Another Typical Student Submission

“Portland has a lot of PSME trees.”

count(trees, Species) %>%
  slice_max(n = 5, order_by = n) %>%
  mutate(Species = fct_reorder(Species, -n)) %>%
ggplot(mapping = aes(x = Species, y = n)) +
  geom_col(fill = "#D37C9C") +
  labs(title = "Counts of Most Common Species")

Context is key to doing good data work.

But always providing good context felt overwhelming, so I (mostly) ignored the problem. 🙈

Until…

Co-Led Workshop on Teaching Intro Stats with `R`

Participants were mostly new to R.
We didn’t want to waste precious workshop time on file paths:

dat <- read_csv("ugly/file/path/cool_data.csv")

Error: 'ugly/file/path/cool_data.csv' does not exist in current working directory ('/Users/kstats/Documents/Courses/Fall 2025/r-data-package-talk-f25').

“I bet you could turn the Portland tree data into an R Data Package on the train ride from Portland to Seattle.” – Nick Horton, Co-Leader

My First R Data Package: `pdxTrees`

Created a GitHub repository: https://github.com/mcconvil/pdxTrees

And a hex sticker inspired by the PDX airport carpet.

Workshop Problem Solved

Easy to install:

devtools::install_github("mcconvil/pdxTrees")

Easy to access the data (without learning file paths!):

library(pdxTrees)
get_pdxTrees_parks()

# A tibble: 25,534 × 34
   Longitude Latitude UserID Genus      Family   DBH Inventory_Date      Species
       <dbl>    <dbl> <chr>  <chr>      <chr>  <dbl> <dttm>              <chr>  
 1     -123.     45.6 1      Pseudotsu… Pinac…  37.4 2017-05-09 00:00:00 PSME   
 2     -123.     45.6 2      Pseudotsu… Pinac…  32.5 2017-05-09 00:00:00 PSME   
 3     -123.     45.6 3      Crataegus  Rosac…   9.7 2017-05-09 00:00:00 CRLA   
 4     -123.     45.6 4      Quercus    Fagac…  10.3 2017-05-09 00:00:00 QURU   
 5     -123.     45.6 5      Pseudotsu… Pinac…  33.2 2017-05-09 00:00:00 PSME   
 6     -123.     45.6 6      Pseudotsu… Pinac…  32.1 2017-05-09 00:00:00 PSME   
 7     -123.     45.6 7      Pseudotsu… Pinac…  28.4 2017-05-09 00:00:00 PSME   
 8     -123.     45.6 8      Pseudotsu… Pinac…  27.2 2017-05-09 00:00:00 PSME   
 9     -123.     45.6 9      Pseudotsu… Pinac…  35.2 2017-05-09 00:00:00 PSME   
10     -123.     45.6 10     Pseudotsu… Pinac…  32.4 2017-05-09 00:00:00 PSME   
# ℹ 25,524 more rows
# ℹ 26 more variables: Common_Name <chr>, Condition <chr>, Tree_Height <dbl>,
#   Crown_Width_NS <dbl>, Crown_Width_EW <dbl>, Crown_Base_Height <dbl>,
#   Collected_By <chr>, Park <chr>, Scientific_Name <chr>,
#   Functional_Type <chr>, Mature_Size <fct>, Native <chr>, Edible <chr>,
#   Nuisance <chr>, Structural_Value <dbl>, Carbon_Storage_lb <dbl>,
#   Carbon_Storage_value <dbl>, Carbon_Sequestration_lb <dbl>, …

Context in `pdxTrees`

?get_pdxTrees_parks

New Student Submission

“Trees with larger diameters at breast height tend to also have larger canopies.”

ggplot(data = trees,
       mapping = aes(x = DBH, y = Crown_Width_EW)) +
  geom_point(alpha = 0.3, color = "#6EBBC5") +
  labs(title = "Scatterplot of Diameter at Breast Height and Canopy Width",
       x = "Diameter at Breast Height (Inches)",
       y = "Canopy Width from East to West (Feet)",
       subtitle = "Inventoried Park Trees in Portland, OR",
       caption = "Data collected by Portland Parks and Rec")

Another New Student Submission

“Portland has a lot of Douglas-Fir, which makes sense as it is the state tree!”

count(trees, Common_Name) %>%
  slice_max(n = 5, order_by = n) %>%
  mutate(Common_Name = fct_reorder(Common_Name, -n)) %>%
ggplot(mapping = aes(x = Common_Name, y = n)) +
  geom_col(fill = "#D37C9C") +
  labs(title = "The Most Common Tree Species in Portland Parks",
       subtitle = "Inventoried Park Trees in Portland, OR",
       caption = "Data collected by Portland Parks and Rec",
       x = "Tree Species", y = "Number of Trees")

Context is key to doing good data work.

And R Data Packages are a great way to provide that context. 🐵

💡 I should teach my students to share data via R Data Packages.

How To Make an R Data Package

Have a dataset you want to share!

Suppose you need a new pair of jeans, so you data-ified the jeans on Anthropologie’s website.

   brand            name cost        style                   wash
1 Mother     The Kick It  248 straight-leg        tequila sunrise
2  Frame         Le High  228       skinny                majesty
3 Pilcro       The Izzie  118       barrel              eyck wash
4  Maeve     The Colette  130     wide-leg             dark denim
5 Mother     The Dazzler  248 straight-leg when the time is right
6 Agolde          Harper  238 straight-leg                  forum
7  Paige          Aneesa  249     wide-leg              mesmerize
8  Frame Le Slim Palazzo  288     wide-leg                  loner
  sizes_available made_in_USA  rise inseam leg_opening
1           24-31         yes 10.50   32.0       17.75
2    23-26, 28-32          no  9.75   29.0        9.50
3         XS - XL          no 12.50   25.2       11.00
4   23, 25-32, 34          no 12.50   26.0       10.75
5           23-34         yes  9.75   27.5       13.00
6  25, 26, 31, 32          no 11.25   32.0       19.50
7           23-34          no 11.75   31.0       22.50
8      23 -31, 33          no 11.00   31.0       22.50

It’s time to make jean shopping less painful for everyone via an R data package!

How To Make an R Data Package

Learn from the pros, Hadley Wickham and Jennifer Bryan, by following their R Packages book.

Starter Tips

Lean heavily on:

Helper packages:
- devtools: supports the development and dissemination of packages
- usethis: automates many steps
- roxygen2: simplifies writing documentation

Existing R data packages:
- palmerpenguins: mimic the beautiful source code
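
As a rough sketch of how these helpers fit together, the steps for getting a dataset into a new package might look like the following. The helper name `scaffold_jeans_package()` and its `path` argument are hypothetical (they are not part of any package), and only the first three rows and four columns of the jeans data are re-typed here. The `usethis` calls are wrapped in a function so nothing is written to disk until you call it.

```r
# Re-type a small piece of the jeans data shown earlier (first three rows).
jeans <- data.frame(
  brand = c("Mother", "Frame", "Pilcro"),
  name  = c("The Kick It", "Le High", "The Izzie"),
  cost  = c(248, 228, 118),
  style = c("straight-leg", "skinny", "barrel"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: scaffold a data package at `path` and store `jeans` in it.
# The argument is named `jeans` because use_data() names the saved file
# after the object it is given.
scaffold_jeans_package <- function(jeans, path) {
  # Create the package skeleton (DESCRIPTION, NAMESPACE, R/) without opening it.
  usethis::create_package(path, open = FALSE)

  # Make the new package the active usethis project until this function exits.
  usethis::local_project(path)

  # Add data-raw/jeans.R, a home for the script that cleaned the data.
  usethis::use_data_raw("jeans", open = FALSE)

  # Save the data frame as data/jeans.rda so users can access it by name.
  usethis::use_data(jeans)

  invisible(path)
}
```

Calling `scaffold_jeans_package(jeans, file.path(tempdir(), "jeans"))` would build the skeleton in a temporary folder. Keeping the cleaning script in `data-raw/` is part of the context worth sharing, since it records how the data got into their final form.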

I want to focus on creating good documentation.

The ReadMe as the Quick View

Generate a ReadMe file:

usethis::use_readme_rmd()

The Help File as the Product Details

Create the help file:

usethis::use_r("jeans")
devtools::document()

?jeans
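
A minimal sketch of what `R/jeans.R` might contain so that `devtools::document()` can build the `?jeans` help page. The variable names come from the dataset shown earlier; the descriptions and units are assumptions for illustration and should be replaced with what the data collector actually knows.

```r
# Contents of R/jeans.R. roxygen2 turns these comments into man/jeans.Rd
# when devtools::document() is run. Descriptions and units marked
# "assumed" are illustrative guesses, not facts from the source.

#' Jeans listed on Anthropologie's website
#'
#' A small dataset of jeans, data-ified to make jean shopping less painful.
#'
#' @format A data frame with 8 rows and 10 variables:
#' \describe{
#'   \item{brand}{Brand of the jeans}
#'   \item{name}{Name of the jeans}
#'   \item{cost}{Listed price (assumed to be in US dollars)}
#'   \item{style}{Leg style, such as straight-leg or wide-leg}
#'   \item{wash}{Name of the denim wash}
#'   \item{sizes_available}{Sizes listed as available}
#'   \item{made_in_USA}{Whether the jeans are made in the USA: yes or no}
#'   \item{rise}{Rise (assumed to be in inches)}
#'   \item{inseam}{Inseam length (assumed to be in inches)}
#'   \item{leg_opening}{Leg opening (assumed to be in inches)}
#' }
#' @source Data-ified from Anthropologie's website.
"jeans"
```

The quoted name on the last line is how roxygen2 attaches the documentation to a dataset rather than to a function.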

The Vignette as the Ways to Wear

Generate the vignette file:

usethis::use_vignette("jeans-shopping")
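
A vignette typically walks through a task with the data. As a sketch, here is the kind of chunk the `jeans-shopping` vignette might include. It assumes the package exposes a data frame named `jeans`; to keep the example runnable on its own, a few columns from the table shown earlier are re-typed instead of loaded from the package, and the shopping question itself is made up for illustration.

```r
# Stand-in for library(jeans): re-type part of the dataset shown earlier
# so the example runs without the package installed.
jeans <- data.frame(
  brand  = c("Mother", "Frame", "Pilcro", "Maeve",
             "Mother", "Agolde", "Paige", "Frame"),
  name   = c("The Kick It", "Le High", "The Izzie", "The Colette",
             "The Dazzler", "Harper", "Aneesa", "Le Slim Palazzo"),
  cost   = c(248, 228, 118, 130, 248, 238, 249, 288),
  style  = c("straight-leg", "skinny", "barrel", "wide-leg",
             "straight-leg", "straight-leg", "wide-leg", "wide-leg"),
  inseam = c(32, 29, 25.2, 26, 27.5, 32, 31, 31),
  stringsAsFactors = FALSE
)

# Illustrative shopping question: which wide-leg jeans cost less than 250?
budget_picks <- subset(
  jeans,
  style == "wide-leg" & cost < 250,
  select = c(brand, name, cost, inseam)
)

# Show the picks from least to most expensive.
budget_picks[order(budget_picks$cost), ]
```

A worked question like this gives readers a starting point they can adapt, which is the "ways to wear" role of the vignette.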

Final Thoughts

It did take me longer than a three-hour train ride to create my first R data package.
- The process goes much faster after the first time!
- Get support from someone who has already built a package.

We should invest time into helping our students, our collaborators, and ourselves understand the context of the data.

Context is key to doing good data work.

Thank you for coming!

Demo package repo:
github.com/mcconvil/jeans

I have hex stickers and would love to connect.

k.mcconville@bucknell.edu
