Applications in R with mase

class: center, middle, hide_logo
background-image: url("img/mases.001.png")
background-position: left
background-size: contain
background-color: #e9ecef

.pull-right[

## Applications in `R`

#### Data Science in Official Statistics

#### 2024 International Conference on Establishment Statistics

#### Kelly McConville & Wesley Yung

####  Monday, June 17th, 2024

]

---

## Official Statistics in `R`

* CRAN Task View: [https://cran.r-project.org/web//views/OfficialStatistics.html](https://cran.r-project.org/web//views/OfficialStatistics.html)

* Packages on survey planning, sampling, data collection, data processing, visualization, statistical disclosure control, and estimation.

* We will focus on `R` code/packages for model-assisted **estimation**.

---

## `survey`: The Multi-Purpose Survey Analysis Tool

* Created by Thomas Lumley.

* Can handle complex sampling designs.

* Computes a wide range of estimators: post-stratification, raking, generalized regression estimator.

* For `dplyr`-like syntax, check out `srvyr`.

---

background-image: url("img/mases.001.png")
background-position: left
background-size: contain

.pull-right[

## Focus here on `mase`

* [CRAN](https://cran.r-project.org/web/packages/mase/)

```r
install.packages("mase")
```

* [GitHub](https://github.com/mcconvil/mase)
    + Recommend installing the CRAN version

]

---

## Model-Assisted Survey Estimation

**Goal**: Estimate `$t_y = \sum_{i \in U} y_i$`.

Data we have:

* Along with first-order inclusion probabilities

---

## Estimators in mase

`$$\hat{t}_y = \sum_{i \in U} \hat{m}(x_i) + \sum_{i \in s} \frac{y_i - \hat{m}(x_i)}{\pi_i}$$`

Where `$m(\cdot)$` is:

* Linear regression through the origin: `ratioEstimator()`
* Linear regression: `greg()` (Generalized Regression Estimator)
* Elastic-Net regression: `gregElasticNet()`
* Regression trees: `gregTree()`
    + Uses `rpms`

---

## Example Dataset: `survey::api`

* Establishments of interest: All schools in California

* Datasets:
    + Full population dataset: `apipop`
    + Sample dataset: `apisrs`
    + There are other sample datasets: `apistrat`, `apiclus1`

```r
library(survey)
library(tidyverse)
data(api)
```

---

## Example Dataset: `api`

**Want to estimate the average Academic Performance Index (API) score for 2000**

* Y = `api00`

* Auxiliary data: X = `meals`, `stype`
    + Known for every school
    
* Survey base weights: `pw`

---

```r
glimpse(apisrs)
```

```
## Rows: 200
## Columns: 39
## $ cds      <chr> "15739081534155", "19642126066716", "30664493030640", "196445…
## $ stype    <fct> H, E, H, E, E, E, M, E, E, E, E, H, M, E, E, E, M, M, H, E, E…
## $ name     <chr> "McFarland High", "Stowers (Cecil ", "Brea-Olinda Hig", "Alam…
## $ sname    <chr> "McFarland High", "Stowers (Cecil B.) Elementary", "Brea-Olin…
## $ snum     <dbl> 1039, 1124, 2868, 1273, 4926, 2463, 2031, 1736, 2142, 4754, 1…
## $ dname    <chr> "McFarland Unified", "ABC Unified", "Brea-Olinda Unified", "D…
## $ dnum     <int> 432, 1, 79, 187, 640, 284, 401, 401, 470, 632, 401, 753, 784,…
## $ cname    <chr> "Kern", "Los Angeles", "Orange", "Los Angeles", "San Luis Obi…
## $ cnum     <int> 14, 18, 29, 18, 39, 18, 18, 18, 18, 37, 18, 24, 14, 1, 47, 18…
## $ flag     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcttest  <int> 98, 100, 98, 99, 99, 93, 98, 99, 100, 90, 95, 100, 97, 99, 98…
## $ api00    <int> 462, 878, 734, 772, 739, 835, 456, 506, 543, 649, 556, 671, 5…
## $ api99    <int> 448, 831, 742, 657, 719, 822, 472, 474, 458, 604, 575, 620, 5…
## $ target   <int> 18, NA, 3, 7, 4, NA, 16, 16, 17, 10, 11, 9, 14, 5, 15, 10, 6,…
## $ growth   <int> 14, 47, -8, 115, 20, 13, -16, 32, 85, 45, -19, 51, 4, 51, 50,…
## $ sch.wide <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, No, Yes, No, Y…
## $ comp.imp <fct> Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Y…
## $ both     <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Ye…
## $ awards   <fct> No, Yes, No, Yes, Yes, No, No, Yes, Yes, No, No, Yes, No, Yes…
## $ meals    <int> 44, 8, 10, 70, 43, 16, 81, 98, 94, 85, 81, 67, 77, 20, 70, 75…
## $ ell      <int> 31, 25, 10, 25, 12, 19, 40, 65, 65, 57, 4, 25, 32, 16, 23, 18…
## $ yr.rnd   <fct> NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA, NA, NA, NA, No, N…
## $ mobility <int> 6, 15, 7, 23, 12, 13, 22, 43, 15, 10, 20, 12, 4, 32, 17, 9, 1…
## $ acs.k3   <int> NA, 19, NA, 23, 20, 19, NA, 18, 19, 16, 16, NA, NA, 19, 21, 2…
## $ acs.46   <int> NA, 30, NA, NA, 29, 29, 30, 29, 32, 25, 27, NA, NA, 29, 30, 2…
## $ acs.core <int> 24, NA, 28, NA, NA, NA, 27, NA, NA, 30, NA, 17, 27, NA, NA, N…
## $ pct.resp <int> 82, 97, 95, 100, 91, 71, 49, 75, 99, 49, 62, 96, 77, 96, 39, …
## $ not.hsg  <int> 44, 4, 5, 37, 8, 1, 30, 49, 48, 23, 5, 44, 40, 4, 14, 18, 15,…
## $ hsg      <int> 34, 10, 9, 40, 21, 8, 27, 31, 34, 36, 38, 19, 34, 14, 57, 28,…
## $ some.col <int> 12, 23, 21, 14, 27, 20, 18, 15, 14, 14, 29, 17, 16, 25, 18, 2…
## $ col.grad <int> 7, 43, 41, 8, 34, 38, 22, 2, 4, 21, 24, 19, 8, 37, 10, 23, 28…
## $ grad.sch <int> 3, 21, 24, 1, 10, 34, 2, 3, 1, 6, 5, 2, 2, 19, 1, 3, 10, 32, …
## $ avg.ed   <dbl> 1.9, 3.7, 3.7, 2.0, 3.2, 4.0, 2.4, 1.8, 1.8, 2.5, 2.8, 2.2, 2…
## $ full     <int> 71, 90, 83, 85, 100, 75, 72, 69, 68, 81, 84, 100, 89, 95, 96,…
## $ emer     <int> 35, 10, 18, 18, 0, 20, 25, 22, 29, 7, 16, 0, 11, 5, 6, 10, 8,…
## $ enroll   <int> 477, 478, 1410, 342, 217, 258, 1274, 566, 645, 311, 328, 210,…
## $ api.stu  <int> 429, 420, 1287, 291, 189, 211, 1090, 353, 563, 258, 253, 166,…
## $ pw       <dbl> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3…
## $ fpc      <dbl> 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6…
```

---

## Exploratory Data Analysis

```r
ggplot(data = apisrs, 
       mapping = aes(x = meals, 
                     y = api00, 
                     color = stype)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)
```

---

### Horvitz-Thompson Estimator

```r
library(mase)

horvitzThompson(y = apisrs$api00,
          pi = 1/apisrs$pw, var_est = TRUE)
```

```
## $pop_total
## [1] 4066887
## 
## $pop_mean
## [1] 657
## 
## $pop_total_var
## [1] 3282462447
## 
## $pop_mean_var
## [1] 86
```

---

### Generalized Regression Estimator

```r
xsample <- select(apisrs, meals, stype) %>%
  as.data.frame()

xpop <- select(apipop, names(xsample))

greg(y = apisrs$api00, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw")
```

```
## $pop_total
## [1] 4116168
## 
## $pop_mean
## [1] 665
## 
## $pop_total_var
## [1] 922697909
## 
## $pop_mean_var
## [1] 24
## 
## $weights
##   [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27
##  [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28
##  [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30
##  [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30
## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33
## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32
## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29
## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29
## 
## $coefficients
## (Intercept)       meals      stypeH      stypeM 
##       873.9        -3.8      -127.7       -66.7
```

---

### Generalized Regression Estimator

```r
greg(y = apisrs$api00, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE,
     datatype = "raw")$coefficients
```

```
## (Intercept)       meals      stypeH      stypeM 
##       873.9        -3.8      -127.7       -66.7
```

---

### Generalized Regression Estimator with Elastic Net

```r
gregElasticNet(y = apisrs$api00, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw")
```

```
## $pop_total
## [1] 4108860
## 
## $pop_mean
## [1] 663
## 
## $pop_total_var
## [1] 1283716175
## 
## $pop_mean_var
## [1] 33
## 
## $weights
##   [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27
##  [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28
##  [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30
##  [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30
## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33
## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32
## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29
## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29
## 
## $coefficients
## (Intercept)       meals      stypeH      stypeM 
##       828.2        -3.4         0.0         0.0
```

---

### Generalized Regression Estimator with Elastic Net

```r
gregElasticNet(y = apisrs$api00, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE,
          datatype = "raw")$coefficients
```

```
## (Intercept)       meals      stypeH      stypeM 
##       828.2        -3.4         0.0         0.0
```

---

### Generalized Regression Estimator with Regression Trees

```r
est <- gregTree(y = apisrs$api00, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE)
est
```

```
## $pop_total
## [1] 4129548
## 
## $pop_mean
## [1] 667
## 
## $pop_total_var
## [1] 1059103817
## 
## $pop_mean_var
## [1] 28
## 
## $weights
##   [1] 32 44 32 27 30 44 27 30 30 27 27 31 27 26 27 27 32 32 27 27 30 31 30 30 30
##  [26] 44 26 27 31 32 30 30 31 32 30 32 30 26 32 27 31 31 44 44 26 27 27 30 26 30
##  [51] 26 27 30 30 32 32 32 30 27 27 32 30 31 30 30 26 31 30 27 32 44 26 27 30 27
##  [76] 30 30 32 32 32 44 31 27 32 26 30 26 31 30 27 30 31 44 31 27 26 31 31 32 27
## [101] 30 32 31 31 30 27 30 32 30 30 32 30 30 31 32 44 27 32 30 32 44 32 26 27 30
## [126] 26 32 31 26 44 26 44 32 31 31 44 26 30 44 30 30 32 30 32 26 32 32 31 31 30
## [151] 31 30 44 27 44 44 26 31 26 30 32 32 30 27 44 30 44 30 30 31 26 44 27 30 27
## [176] 30 27 31 27 32 27 44 31 32 30 30 31 26 30 30 27 30 32 31 32 32 30 27 31 27
## 
## $tree
## 
## RPMS Recursive Partitioning Equation 
## y ~ meals + stype
## 
## Estimating Equation 
## y ~ 1
## 
## 
## [1] "unequal probability of selection, sample design"
## 
## ===================== Tree Model =================== 
##  
##    Splits                                                             
## sp   meals <= 50.5                                                    
## sp   meals <= 50.5 & stype %in% c('H','M')                            
## sp   meals <= 50.5 & stype %in%  c('E') & meals <= 19.5               
## sp   meals <= 50.5 & stype %in%  c('E') & meals > 19.5 & meals <= 33.5
## sp   meals > 50.5 & meals <= 85.5                                     
## sp   meals > 50.5 & meals <= 85.5 & meals <= 67.5
```

---

### Generalized Regression Estimator with Regression Trees

```r
plot(est$tree)
```

---

### Aggregated Auxiliary Information

* Don't have unit level data for the non-sampled units
* Just have aggregated data

```r
xpop_agg
```

```
## # A tibble: 1 × 3
##   meals stype_H stype_M
##   <dbl>   <dbl>   <dbl>
## 1  48.0   0.122   0.164
```

```r
library(recipes)

xsample <- recipe(~ meals + stype, data = apisrs) %>%
  step_dummy(stype) %>%
  prep(training = apisrs) %>%
  bake(new_data = NULL) %>%
  as.data.frame()

xsample %>%
  summarise_all(mean)
```

```
##   meals stype_H stype_M
## 1    50    0.12    0.16
```

---

### Aggregated Auxiliary Information

```r
greg(y = apisrs$api00, xsample = xsample, xpop = xpop_agg,
          pi = 1/apisrs$pw, var_est = TRUE, datatype = "means")
```

```
## $pop_total
## [1] 4116168
## 
## $pop_mean
## [1] 665
## 
## $pop_total_var
## [1] 922697909
## 
## $pop_mean_var
## [1] 24
## 
## $weights
##   [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27
##  [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28
##  [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30
##  [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30
## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33
## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32
## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29
## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29
## 
## $coefficients
## (Intercept)       meals     stype_H     stype_M 
##       873.9        -3.8      -127.7       -66.7
```

---

### Categorical Response Variable

```r
apisrs <- mutate(apisrs, both_num = if_else(both == "Yes",  1, 0))

xsample <- select(apisrs, meals, stype) %>%
  as.data.frame()

greg(y = apisrs$both_num, xsample = xsample, xpop = xpop,
          pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw",
     model = "logistic")
```

```
## $pop_total
## [1] 4072
## 
## $pop_mean
## [1] 0.66
## 
## $pop_total_var
## [1] 37635
## 
## $pop_mean_var
## [1] 0.00098
## 
## $coefficients
## (Intercept)       meals      stypeH      stypeM 
##      1.2747     -0.0028     -1.5754     -1.4662
```

---

background-image: url("img/mases.001.png")
background-position: left
background-size: contain

.pull-right[

## Feedback Wanted

* Currently geared more toward use with US Forest Inventory and Analysis data

* What's on your wish list?

]

---

background-image: url("img/nadine-shaabana-qAcqRgWQemM-unsplash.jpg")
background-position: left
background-size: contain

.pull-right[

#### Useful Resources

[`survey` package](http://r-survey.r-forge.r-project.org/survey/): Very extensive package for analyzing complex survey data!

[Model-Assisted Survey Estimation with Modern Prediction Techniques](https://projecteuclid.org/journals/statistical-science/volume-32/issue-2/Model-Assisted-Survey-Estimation-with-Modern-Prediction-Techniques/10.1214/16-STS589.full) by Jay Breidt and Jean Opsomer: Wonderful overview article

[Model Assisted Survey Sampling](https://www.springer.com/us/book/9780387406206) by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman: Seminar textbook

]

---

class: middle, center

## Closing Thoughts

---

class: smaller-font

#### References

* Breidt, F. J., G. Claeskens, and J. D. Opsomer. Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92:831–846, 2005.

* Breidt, F. J. and J. D. Opsomer. Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32:190-205, 2017.

* Breidt, F. J. and J. D. Opsomer. Local polynomial regression estimators in survey sampling. Annals of Statistics, 28:1026–1053, 2000.

* Goga, C. Variance reduction in surveys with auxiliary information: a nonparametric approach involving regression splines. The Canadian Journal of Statistics/La revue canadienne de statistique, 33(2), 163-180, 2005.

* Horvitz, D. G., and Thompson, D. J.  A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260), 663-685, 1952.

* Lehtonen, R. and Veijanen, A. Logistic generalized regression estimators. Survey Methodology 24, 51-55, 1998.

* McConville, Moisen, G. G, Frescino, T. S. A Tutorial in Model-Assisted Estimation with Application to Forest Inventory. Forests. 11:2, 244, 2020.

* McConville, K. S., Breidt, F. J., Lee, T., & Moisen, G. G.  Model-assisted survey regression estimation with the lasso. Journal of Survey Statistics and Methodology, 5(2), 131-158, 2017.

* McConville, K. S. and F. J. Breidt. Survey design asymptotics for the model-assisted penalised spline regression estimator. Journal of Nonparametric Statistics, 25:745–763, 2013.

* McConville, K.S., Tang, B., Zhu, G., 
  Cheung, S. and S. Li. mase:
  Model-Assisted Survey Estimation. R package
  version 0.1.2
  https://cran.r-project.org/package=mase, 2018.

* McConville, K. S. and Toth, D. Automated selection of post‐strata using a model‐assisted regression tree estimator. Scandinavian Journal of Statistics, 42, 2: 389–413, 2019.

* Montanari, G. E. and M. G. Ranalli. Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100(472):1429–1442, 2005.

* Sarndal, C. E., B. Swensson, and J. Wretman. Model-Assisted Survey Sampling. Springer-Verlag, New York, 1992.

* Tibshirani, R. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 267-288, 1996.

* Toth, D. and J. Eltinge. Building consistent regression trees from complex sample data. Journal of the American Statistical Association, 106:1626–1636, 2011.

* Toth, D. rpms: Recursive
  Partitioning for Modeling Survey Data. R
  package version 0.5.1.
  https://CRAN.R-project.org/package=rpms, 2021.