class: center, middle, hide_logo background-image: url("img/mases.001.png") background-position: left background-size: contain background-color: #e9ecef .pull-right[ ## Applications in `R` #### Data Science in Official Statistics #### 2024 International Conference on Establishment Statistics #### Kelly McConville & Wesley Yung #### Monday, June 17th, 2024 ] --- ## Official Statistics in `R` * CRAN Task View: [https://cran.r-project.org/web//views/OfficialStatistics.html](https://cran.r-project.org/web//views/OfficialStatistics.html) -- * Packages on survey planning, sampling, data collection, data processing, visualization, statistical disclosure control, and estimation. -- * We will focus on `R` code/packages for model-assisted **estimation**. --- ## `survey`: The Multi-Purpose Survey Analysis Tool * Created by Thomas Lumley. -- * Can handle complex sampling designs. -- * Computes a wide range of estimators: post-stratification, raking, generalized regression estimator. -- * For `dplyr`-like syntax, check out `srvyr`. --- background-image: url("img/mases.001.png") background-position: left background-size: contain .pull-right[ ## Focus here on `mase` * [CRAN](https://cran.r-project.org/web/packages/mase/) ```r install.packages("mase") ``` * [GitHub](https://github.com/mcconvil/mase) + Recommend installing the CRAN version ] --- ## Model-Assisted Survey Estimation **Goal**: Estimate `\(t_y = \sum_{i \in U} y_i\)`. Data we have: <img src="img/setup.png" width="1186" /> * Along with first-order inclusion probabilities --- ## Estimators in mase `$$\hat{t}_y = \sum_{i \in U} \hat{m}(x_i) + \sum_{i \in s} \frac{y_i - \hat{m}(x_i)}{\pi_i}$$` Where `\(m(\cdot)\)` is: * Linear regression through the origin: `ratioEstimator()` * Linear regression: `greg()` (Generalized Regression Estimator) * Elastic-Net regression: `gregElasticNet()` * Regression trees: `gregTree()` + Uses `rpms` --- ## Example Dataset: `survey::api` * Establishments of interest: All schools in California * Datasets: + Full population dataset: `apipop` + Sample dataset: `apisrs` + There are other sample datasets: `apistrat`, `apiclus1` ```r library(survey) library(tidyverse) data(api) ``` --- ## Example Dataset: `api` **Want to estimate the average Academic Performance Index (API) score for 2000** * Y = `api00` * Auxiliary data: X = `meals`, `stype` + Known for every school * Survey base weights: `pw` --- ```r glimpse(apisrs) ``` ``` ## Rows: 200 ## Columns: 39 ## $ cds <chr> "15739081534155", "19642126066716", "30664493030640", "196445… ## $ stype <fct> H, E, H, E, E, E, M, E, E, E, E, H, M, E, E, E, M, M, H, E, E… ## $ name <chr> "McFarland High", "Stowers (Cecil ", "Brea-Olinda Hig", "Alam… ## $ sname <chr> "McFarland High", "Stowers (Cecil B.) Elementary", "Brea-Olin… ## $ snum <dbl> 1039, 1124, 2868, 1273, 4926, 2463, 2031, 1736, 2142, 4754, 1… ## $ dname <chr> "McFarland Unified", "ABC Unified", "Brea-Olinda Unified", "D… ## $ dnum <int> 432, 1, 79, 187, 640, 284, 401, 401, 470, 632, 401, 753, 784,… ## $ cname <chr> "Kern", "Los Angeles", "Orange", "Los Angeles", "San Luis Obi… ## $ cnum <int> 14, 18, 29, 18, 39, 18, 18, 18, 18, 37, 18, 24, 14, 1, 47, 18… ## $ flag <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ pcttest <int> 98, 100, 98, 99, 99, 93, 98, 99, 100, 90, 95, 100, 97, 99, 98… ## $ api00 <int> 462, 878, 734, 772, 739, 835, 456, 506, 543, 649, 556, 671, 5… ## $ api99 <int> 448, 831, 742, 657, 719, 822, 472, 474, 458, 604, 575, 620, 5… ## $ target <int> 18, NA, 3, 7, 4, NA, 16, 16, 17, 10, 11, 9, 14, 5, 15, 10, 6,… ## $ growth <int> 14, 47, -8, 115, 20, 13, -16, 32, 85, 45, -19, 51, 4, 51, 50,… ## $ sch.wide <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, Yes, No, Yes, No, Y… ## $ comp.imp <fct> Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Y… ## $ both <fct> No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, No, Yes, No, Ye… ## $ awards <fct> No, Yes, No, Yes, Yes, No, No, Yes, Yes, No, No, Yes, No, Yes… ## $ meals <int> 44, 8, 10, 70, 43, 16, 81, 98, 94, 85, 81, 67, 77, 20, 70, 75… ## $ ell <int> 31, 25, 10, 25, 12, 19, 40, 65, 65, 57, 4, 25, 32, 16, 23, 18… ## $ yr.rnd <fct> NA, NA, NA, NA, NA, NA, NA, No, NA, NA, NA, NA, NA, NA, No, N… ## $ mobility <int> 6, 15, 7, 23, 12, 13, 22, 43, 15, 10, 20, 12, 4, 32, 17, 9, 1… ## $ acs.k3 <int> NA, 19, NA, 23, 20, 19, NA, 18, 19, 16, 16, NA, NA, 19, 21, 2… ## $ acs.46 <int> NA, 30, NA, NA, 29, 29, 30, 29, 32, 25, 27, NA, NA, 29, 30, 2… ## $ acs.core <int> 24, NA, 28, NA, NA, NA, 27, NA, NA, 30, NA, 17, 27, NA, NA, N… ## $ pct.resp <int> 82, 97, 95, 100, 91, 71, 49, 75, 99, 49, 62, 96, 77, 96, 39, … ## $ not.hsg <int> 44, 4, 5, 37, 8, 1, 30, 49, 48, 23, 5, 44, 40, 4, 14, 18, 15,… ## $ hsg <int> 34, 10, 9, 40, 21, 8, 27, 31, 34, 36, 38, 19, 34, 14, 57, 28,… ## $ some.col <int> 12, 23, 21, 14, 27, 20, 18, 15, 14, 14, 29, 17, 16, 25, 18, 2… ## $ col.grad <int> 7, 43, 41, 8, 34, 38, 22, 2, 4, 21, 24, 19, 8, 37, 10, 23, 28… ## $ grad.sch <int> 3, 21, 24, 1, 10, 34, 2, 3, 1, 6, 5, 2, 2, 19, 1, 3, 10, 32, … ## $ avg.ed <dbl> 1.9, 3.7, 3.7, 2.0, 3.2, 4.0, 2.4, 1.8, 1.8, 2.5, 2.8, 2.2, 2… ## $ full <int> 71, 90, 83, 85, 100, 75, 72, 69, 68, 81, 84, 100, 89, 95, 96,… ## $ emer <int> 35, 10, 18, 18, 0, 20, 25, 22, 29, 7, 16, 0, 11, 5, 6, 10, 8,… ## $ enroll <int> 477, 478, 1410, 342, 217, 258, 1274, 566, 645, 311, 328, 210,… ## $ api.stu <int> 429, 420, 1287, 291, 189, 211, 1090, 353, 563, 258, 253, 166,… ## $ pw <dbl> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3… ## $ fpc <dbl> 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6194, 6… ``` --- ## Exploratory Data Analysis ```r ggplot(data = apisrs, mapping = aes(x = meals, y = api00, color = stype)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se = FALSE) ``` <img src="r_applications_files/figure-html/unnamed-chunk-5-1.png" width="432" /> --- ### Horvitz-Thompson Estimator ```r library(mase) horvitzThompson(y = apisrs$api00, pi = 1/apisrs$pw, var_est = TRUE) ``` ``` ## $pop_total ## [1] 4066887 ## ## $pop_mean ## [1] 657 ## ## $pop_total_var ## [1] 3282462447 ## ## $pop_mean_var ## [1] 86 ``` --- ### Generalized Regression Estimator ```r xsample <- select(apisrs, meals, stype) %>% as.data.frame() xpop <- select(apipop, names(xsample)) greg(y = apisrs$api00, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw") ``` ``` ## $pop_total ## [1] 4116168 ## ## $pop_mean ## [1] 665 ## ## $pop_total_var ## [1] 922697909 ## ## $pop_mean_var ## [1] 24 ## ## $weights ## [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27 ## [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28 ## [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30 ## [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30 ## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33 ## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32 ## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29 ## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29 ## ## $coefficients ## (Intercept) meals stypeH stypeM ## 873.9 -3.8 -127.7 -66.7 ``` --- ### Generalized Regression Estimator ```r greg(y = apisrs$api00, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw")$coefficients ``` ``` ## (Intercept) meals stypeH stypeM ## 873.9 -3.8 -127.7 -66.7 ``` --- ### Generalized Regression Estimator with Elastic Net ```r gregElasticNet(y = apisrs$api00, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw") ``` ``` ## $pop_total ## [1] 4108860 ## ## $pop_mean ## [1] 663 ## ## $pop_total_var ## [1] 1283716175 ## ## $pop_mean_var ## [1] 33 ## ## $weights ## [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27 ## [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28 ## [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30 ## [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30 ## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33 ## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32 ## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29 ## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29 ## ## $coefficients ## (Intercept) meals stypeH stypeM ## 828.2 -3.4 0.0 0.0 ``` --- ### Generalized Regression Estimator with Elastic Net ```r gregElasticNet(y = apisrs$api00, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw")$coefficients ``` ``` ## (Intercept) meals stypeH stypeM ## 828.2 -3.4 0.0 0.0 ``` --- ### Generalized Regression Estimator with Regression Trees ```r est <- gregTree(y = apisrs$api00, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE) est ``` ``` ## $pop_total ## [1] 4129548 ## ## $pop_mean ## [1] 667 ## ## $pop_total_var ## [1] 1059103817 ## ## $pop_mean_var ## [1] 28 ## ## $weights ## [1] 32 44 32 27 30 44 27 30 30 27 27 31 27 26 27 27 32 32 27 27 30 31 30 30 30 ## [26] 44 26 27 31 32 30 30 31 32 30 32 30 26 32 27 31 31 44 44 26 27 27 30 26 30 ## [51] 26 27 30 30 32 32 32 30 27 27 32 30 31 30 30 26 31 30 27 32 44 26 27 30 27 ## [76] 30 30 32 32 32 44 31 27 32 26 30 26 31 30 27 30 31 44 31 27 26 31 31 32 27 ## [101] 30 32 31 31 30 27 30 32 30 30 32 30 30 31 32 44 27 32 30 32 44 32 26 27 30 ## [126] 26 32 31 26 44 26 44 32 31 31 44 26 30 44 30 30 32 30 32 26 32 32 31 31 30 ## [151] 31 30 44 27 44 44 26 31 26 30 32 32 30 27 44 30 44 30 30 31 26 44 27 30 27 ## [176] 30 27 31 27 32 27 44 31 32 30 30 31 26 30 30 27 30 32 31 32 32 30 27 31 27 ## ## $tree ## ## RPMS Recursive Partitioning Equation ## y ~ meals + stype ## ## Estimating Equation ## y ~ 1 ## ## ## [1] "unequal probability of selection, sample design" ## ## ===================== Tree Model =================== ## ## Splits ## sp meals <= 50.5 ## sp meals <= 50.5 & stype %in% c('H','M') ## sp meals <= 50.5 & stype %in% c('E') & meals <= 19.5 ## sp meals <= 50.5 & stype %in% c('E') & meals > 19.5 & meals <= 33.5 ## sp meals > 50.5 & meals <= 85.5 ## sp meals > 50.5 & meals <= 85.5 & meals <= 67.5 ``` --- ### Generalized Regression Estimator with Regression Trees ```r plot(est$tree) ``` <img src="r_applications_files/figure-html/unnamed-chunk-12-1.png" width="432" /> --- ### Aggregated Auxiliary Information * Don't have unit level data for the non-sampled units * Just have aggregated data ```r xpop_agg ``` ``` ## # A tibble: 1 × 3 ## meals stype_H stype_M ## <dbl> <dbl> <dbl> ## 1 48.0 0.122 0.164 ``` ```r library(recipes) xsample <- recipe(~ meals + stype, data = apisrs) %>% step_dummy(stype) %>% prep(training = apisrs) %>% bake(new_data = NULL) %>% as.data.frame() xsample %>% summarise_all(mean) ``` ``` ## meals stype_H stype_M ## 1 50 0.12 0.16 ``` --- ### Aggregated Auxiliary Information ```r greg(y = apisrs$api00, xsample = xsample, xpop = xpop_agg, pi = 1/apisrs$pw, var_est = TRUE, datatype = "means") ``` ``` ## $pop_total ## [1] 4116168 ## ## $pop_mean ## [1] 665 ## ## $pop_total_var ## [1] 922697909 ## ## $pop_mean_var ## [1] 24 ## ## $weights ## [1] 30 35 32 30 32 34 28 28 28 29 29 28 28 34 30 30 32 33 28 29 29 31 28 32 27 ## [26] 34 33 30 29 31 28 32 31 30 32 33 28 33 33 29 31 31 34 34 34 28 29 28 33 28 ## [51] 34 29 26 32 32 34 34 32 29 28 30 27 30 33 29 34 30 28 29 32 35 34 30 31 30 ## [76] 31 28 32 31 31 35 31 28 32 33 29 33 31 32 30 28 31 35 29 29 33 31 30 31 30 ## [101] 32 32 31 31 32 30 32 31 32 28 30 28 28 31 32 34 29 31 32 32 35 32 34 30 33 ## [126] 33 33 31 33 35 34 35 31 29 31 34 33 32 35 32 28 30 28 33 33 34 32 31 28 32 ## [151] 31 28 35 28 35 35 34 29 33 28 31 31 29 29 35 28 34 28 32 30 33 34 29 32 29 ## [176] 28 30 31 29 31 30 35 30 32 28 28 31 33 32 28 29 29 31 31 32 30 28 29 30 29 ## ## $coefficients ## (Intercept) meals stype_H stype_M ## 873.9 -3.8 -127.7 -66.7 ``` --- ### Categorical Response Variable ```r apisrs <- mutate(apisrs, both_num = if_else(both == "Yes", 1, 0)) xsample <- select(apisrs, meals, stype) %>% as.data.frame() greg(y = apisrs$both_num, xsample = xsample, xpop = xpop, pi = 1/apisrs$pw, var_est = TRUE, datatype = "raw", model = "logistic") ``` ``` ## $pop_total ## [1] 4072 ## ## $pop_mean ## [1] 0.66 ## ## $pop_total_var ## [1] 37635 ## ## $pop_mean_var ## [1] 0.00098 ## ## $coefficients ## (Intercept) meals stypeH stypeM ## 1.2747 -0.0028 -1.5754 -1.4662 ``` --- background-image: url("img/mases.001.png") background-position: left background-size: contain .pull-right[ ## Feedback Wanted * Currently geared more toward use with US Forest Inventory and Analysis data * What's on your wish list? ] --- background-image: url("img/nadine-shaabana-qAcqRgWQemM-unsplash.jpg") background-position: left background-size: contain .pull-right[ #### Useful Resources [`survey` package](http://r-survey.r-forge.r-project.org/survey/): Very extensive package for analyzing complex survey data! [Model-Assisted Survey Estimation with Modern Prediction Techniques](https://projecteuclid.org/journals/statistical-science/volume-32/issue-2/Model-Assisted-Survey-Estimation-with-Modern-Prediction-Techniques/10.1214/16-STS589.full) by Jay Breidt and Jean Opsomer: Wonderful overview article [Model Assisted Survey Sampling](https://www.springer.com/us/book/9780387406206) by Carl-Erik Särndal, Bengt Swensson, and Jan Wretman: Seminar textbook ] --- class: middle, center ## Closing Thoughts --- class: smaller-font #### References * Breidt, F. J., G. Claeskens, and J. D. Opsomer. Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92:831–846, 2005. * Breidt, F. J. and J. D. Opsomer. Model-assisted survey estimation with modern prediction techniques. Statistical Science, 32:190-205, 2017. * Breidt, F. J. and J. D. Opsomer. Local polynomial regression estimators in survey sampling. Annals of Statistics, 28:1026–1053, 2000. * Goga, C. Variance reduction in surveys with auxiliary information: a nonparametric approach involving regression splines. The Canadian Journal of Statistics/La revue canadienne de statistique, 33(2), 163-180, 2005. * Horvitz, D. G., and Thompson, D. J. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260), 663-685, 1952. * Lehtonen, R. and Veijanen, A. Logistic generalized regression estimators. Survey Methodology 24, 51-55, 1998. * McConville, Moisen, G. G, Frescino, T. S. A Tutorial in Model-Assisted Estimation with Application to Forest Inventory. Forests. 11:2, 244, 2020. * McConville, K. S., Breidt, F. J., Lee, T., & Moisen, G. G. Model-assisted survey regression estimation with the lasso. Journal of Survey Statistics and Methodology, 5(2), 131-158, 2017. * McConville, K. S. and F. J. Breidt. Survey design asymptotics for the model-assisted penalised spline regression estimator. Journal of Nonparametric Statistics, 25:745–763, 2013. * McConville, K.S., Tang, B., Zhu, G., Cheung, S. and S. Li. mase: Model-Assisted Survey Estimation. R package version 0.1.2 https://cran.r-project.org/package=mase, 2018. * McConville, K. S. and Toth, D. Automated selection of post‐strata using a model‐assisted regression tree estimator. Scandinavian Journal of Statistics, 42, 2: 389–413, 2019. * Montanari, G. E. and M. G. Ranalli. Nonparametric model calibration estimation in survey sampling. Journal of the American Statistical Association, 100(472):1429–1442, 2005. * Sarndal, C. E., B. Swensson, and J. Wretman. Model-Assisted Survey Sampling. Springer-Verlag, New York, 1992. * Tibshirani, R. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) 58.1 267-288, 1996. * Toth, D. and J. Eltinge. Building consistent regression trees from complex sample data. Journal of the American Statistical Association, 106:1626–1636, 2011. * Toth, D. rpms: Recursive Partitioning for Modeling Survey Data. R package version 0.5.1. https://CRAN.R-project.org/package=rpms, 2021.