class: center, middle, hide_logo
background-image: url("img/daggett6.001.jpeg")
background-position: left
background-size: contain
background-color: #e9ecef

.pull-right[

## Model-Assisted Estimation

#### Data Science in Official Statistics

#### 2024 International Conference on Establishment Statistics

#### Kelly McConville & Wesley Yung

#### Monday, June 17th, 2024

]

---
class: center, middle

# Recap: Data Science Opportunities in Official Statistics

---
class: center, middle

# Focus on Estimation Stage!

---
class: center

### Model-Assisted Survey Estimation

--

#### **Goal**: Estimate finite population quantities, such as means and totals.

--

#### Data Needed:

<img src="img/data_needs2.004.jpeg" width="1402" />

---
class: center

### Estimation Set-up

--

Enumerate the finite population of interest.

<img src="img/enumerate.jpeg" width="1481" />

`\(U = \{1, 2, \ldots, N\}\)`

---
class: center

### Estimation Set-up

--

**Goal:** Estimate a population quantity, such as the total of a response variable, `\(y\)`.

<img src="img/response.jpeg" width="1482" />

`\(t_y = \sum_{i \in U} y_i\)`

---
class: center

### Estimation Set-up

--

Have auxiliary data for the entire population or summary auxiliary data.

<img src="img/aux_data.jpeg" width="1484" />

Have `\(\{\boldsymbol{x}_i\}_{i\in U}\)` or `\(\bar{\boldsymbol{x}}_U\)`.

---
class: center

### Estimation Set-up

--

Construct a sampling design. Assume a design-based approach to inference.

<img src="img/sampling_design.jpeg" width="1491" />

Have `\(\{\pi_i\}_{i\in U}\)` where `\(\pi_i = P(i \in s)\)`.

---
class: center

### Estimation Set-up

A sample is taken and you have access to the following data for estimation.

<img src="img/all_data.jpeg" width="1489" />

--

One option: `\(\hat{t}_y = \sum_{i \in s} \frac{y_i}{\pi_i}\)` (Horvitz-Thompson Estimator)

---
class: center

### Estimation Set-up

Another option: Use the sample to .orange2[**build**] a model and then .orange2[**predict**] `\(y\)` over all the population units.
<img src="img/model_fits.jpeg" width="1484" />

`\(\hat{t}_y = \sum_{i \in s} y_i + \sum_{i \in U-s} \hat{m}(\boldsymbol{x}_i)\)`

---
class: center

### Estimation Set-up

Estimator: `\(\hat{t}_y = \sum_{i \in s} y_i + \sum_{i \in U-s} \hat{m}(\boldsymbol{x}_i)\)`

<img src="img/model_fits.jpeg" width="1484" />

Estimator can be rather .orange2[biased] when the model is misspecified.

---
class: center

### Estimation Set-up

**Another approach:** Use a bias-correction term.

<img src="img/model_fits.jpeg" width="1484" />

Model-assisted estimator:

`$$\hat{t}_y = \sum_{i \in U} \hat{m}(\boldsymbol{x}_i) + \sum_{i \in s} \frac{y_i - \hat{m}(\boldsymbol{x}_i)}{\pi_i}$$`

---
class: center, middle
background-color: #CFF09E

### Come back to the specifics of the estimator soon...

--

## But first, do we even have the necessary data?

---
class: center

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

> **Mission**: "Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US."

--

Focusing on Daggett County, Utah:

<img src="img/data_needs2.004.jpeg" width="1402" />

.pull-left[

Quasi-systematic sample of ground plots

]

--

.pull-right[

**Many** layers of remotely sensed data!

]

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."
.pull-left[

Maintain many different surveys:

* Job Openings and Labor Turnover Survey
* Occupational Employment and Wage Statistics

]

--

.pull-right[

Quarterly Census of Employment and Wages:

* Size class
* Geographic information
* Industry type
* Whether or not it's a multi-establishment firm

]

---
class: center, middle
background-color: #CFF09E

## Discussion Questions

--

### What survey data do you have access to?

--

### What auxiliary data do you have access to?

--

### Are you currently combining these data sources for estimation?

---
class: center

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

> **Mission**: "Make and keep current a comprehensive inventory and analysis of the present and prospective conditions of and requirements for the renewable resources of the forest and rangelands of the US."

Focusing on Daggett County, Utah:

<img src="img/data_needs2.004.jpeg" width="1402" />

.pull-left[

Quasi-systematic sample of ground plots

]

.pull-right[

**Many** layers of remotely sensed data!

]

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

--

* Want a simple model.

--

* Have lots of auxiliary data layers.

<img src="img/FIA_EDA.jpeg" width="1736" />

<!-- ### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/) -->

<!-- **Goal**: Estimate .orange2[MANY] finite population quantities!
-->

<!-- Use linear assisting model but estimate coefficients with the .orange2[elastic net]: -->

<!-- \begin{aligned} -->
<!-- \boldsymbol{\hat{\beta}} &= \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \sum_{i \in s} \frac{(y_i - \boldsymbol{x}_i^T \boldsymbol{\beta})^2}{ \pi_i} + \lambda \left[ \alpha \sum_{j=1}^p \left|\beta_j\right| + (1-\alpha) \sum_{j=1}^p \beta_j^2\right] \right\} -->
<!-- \end{aligned} -->

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

Use linear assisting model but estimate coefficients with the .orange2[elastic net]:

<img src="img/elasticnet.png" width="80%" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

Use linear assisting model but estimate coefficients with the .orange2[elastic net]:

<img src="img/penalty.png" width="80%" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

Use linear assisting model but estimate coefficients with the .orange2[elastic net]:

<img src="img/lasso.png" width="80%" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!
Use linear assisting model but estimate coefficients with the .orange2[elastic net]:

<img src="img/ridge.png" width="80%" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

Use linear assisting model but estimate coefficients with the .orange2[elastic net]:

<img src="img/alpha.png" width="80%" />

---

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

--

.center[**For total trees per acre**]

--

.pull-left[

**Selected Predictors**:

* Normalized Difference Vegetation Index
* Slope
* Normalized Burn Ratio
* Elevation
* Slope : Forest/Non-Forest

]

--

.pull-right[

**Non-Selected Predictors**:

* Forest Probability
* Eastness
* Forest Probability : Forest/Non-Forest
* Normalized Difference Vegetation Index : Forest/Non-Forest
* Normalized Burn Ratio : Forest/Non-Forest
* Elevation : Forest/Non-Forest
* Eastness : Forest/Non-Forest

]

---
class: center

<img src="img/fs.png" width="15%" style="float:left; padding:10px" />

### Case Study 1: [US Forest Inventory and Analysis Program](https://www.fia.fs.fed.us/about/about_us/)

**Goal**: Estimate .orange2[MANY] finite population quantities!

<img src="case_studies_files/figure-html/unnamed-chunk-29-1.png" width="432" />

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

**Goal**: Estimate the total number of certain job types.
.pull-left[

Complex survey data from the BLS Occupational Employment Statistics Survey

]

.pull-right[

Quarterly Census of Employment and Wages:

* Size class
* Geographic information
* Industry type
* Whether or not it's a multi-establishment firm

]

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

**Goal**: Estimate the total number of certain job types.

--

.orange2[What assisting model should we use?]

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

**Goal**: Estimate the total number of certain job types.

.orange2[Why should we not use linear regression?]

--

* We have mostly categorical variables with many categories.

--

* The industry code is likely important, and most job types probably fall into only a few of the industries.

--

* An additive model is too simple, and a fully interactive model is too complex, with too little data in many sub-groups.

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

Use regression trees instead!

<img src="img/trees.001.jpeg" width="50%" style="display: block; margin: auto;" />

* Recursively split sample into two groups based on an auxiliary variable.
* Industry has 24 categories.
    + Industry #71: Arts, Entertainment, and Recreation
    + Industry #72: Accommodation and Food Services

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

<img src="img/trees.002.jpeg" width="60%" style="display: block; margin: auto;" />

* .orange2[Recursively] split sample into two groups based on an auxiliary variable.

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

<img src="img/trees.004.jpeg" width="100%" style="display: block; margin: auto;" />

* .orange2[Recursively] split sample into two groups based on an auxiliary variable.

---

<img src="img/bls2.png" width="15%" style="float:left; padding:10px" />

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

<img src="img/trees.003.jpeg" width="60%" style="display: block; margin: auto;" />

At each end node, the estimator is given by the .orange2[survey-weighted mean].

---

### The Regression Tree Estimator

`$$\hat{t}_y = \sum_{i \in U} \hat{m}(\boldsymbol{x}_i) + \sum_{i \in s} \frac{y_i - \hat{m}(\boldsymbol{x}_i)}{\pi_i}$$`

Let `\(Q = \{ B_{1}, B_{2}, \ldots, B_{q} \}\)` be the set of partitioning boxes from the tree.
--

For `\(\boldsymbol{x} \in B_k\)`:

`\begin{align*} \hat{m}({\boldsymbol x})= \widehat{\#}(B_{k})^{-1} \sum_{i \in s} \pi_{i}^{-1} y_i I ({\boldsymbol x}_i \in B_{k}) \end{align*}`

where

`\begin{align*} \widehat{\#}(B_{k}) = \sum_{i \in s} \pi_{i}^{-1} I ({\boldsymbol x}_i \in B_{k}). \end{align*}`

---

### The Regression Tree Estimator

* Splits are selected based on a permutation test.

* Tree stops splitting based on a p-value threshold or minimum end node size.

* R package for regression trees built on complex survey data: [rpms](https://cran.r-project.org/web/packages/rpms/index.html)
    + Written by [Daniell Toth](https://scholar.google.com/citations?user=1lUPZikAAAAJ)
    + Software demo: Wed at 10am in the Sir Alex Ferguson Library

---

### The Regression Tree Estimator

#### Calibration

Can linearize the trees:

`$$\hat{m}({\bf x}) = \tilde{\mu}_{n1} I({\bf x} \in B_{n1}) + \tilde{\mu}_{n2} I({\bf x} \in B_{n2}) + \ldots + \tilde{\mu}_{nq} I({\bf x} \in B_{nq})$$`

--

* Regression tree is .orange2[calibrated] to population totals of the end nodes:

`$$\sum_{j \in U} I({ \bf x}_j \in B_{nk}) = \sum_{j \in s} \tilde{w}_j I({ \bf x}_j \in B_{nk})$$`

----

--

#### Post-stratification

* Post-stratified estimator: Generalized regression estimator where the assisting model is the group mean model

--

* Regression tree estimator: End nodes of the tree are a set of .orange2[predictive] post-strata!

---

### Case Study 2: [US Bureau of Labor Statistics](https://www.bls.gov/bls/blsmissn.htm)

> **Mission**: "Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making."

**Goal**: Estimate the total number of certain job types.

<img src="img/bls_results.001.jpeg" width="90%" />

---
class: center, middle
background-color: #CFF09E

### General Thoughts on Selecting the Model

--

#### Consider your audience.

--

#### Figure out what data you have access to.

--

#### Do lots of exploratory data analysis.
--

#### Analyze the resulting survey weights.

---
class: center, middle
background-color: #CFF09E

## Questions?

---
class: center, middle
background-color: #CFF09E

## Up next:

### Break

### Implementation in `R`!
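---

### Sketch: The Estimators in Code

The Horvitz-Thompson and model-assisted estimators from the set-up slides can be sketched in a few lines. This is only an illustration under stated assumptions, not the workshop's R implementation: it uses plain Python, a simulated population, simple random sampling without replacement (so every `\(\pi_i = n/N\)`), and two hypothetical boxes `"A"` and `"B"` standing in for tree end nodes. All function and variable names are made up for this sketch.

```python
import random

def horvitz_thompson(y_s, pi_s):
    """Horvitz-Thompson total: sum of y_i / pi_i over sampled units."""
    return sum(y / p for y, p in zip(y_s, pi_s))

def box_means(y_s, pi_s, box_s):
    """Survey-weighted mean of y within each box (tree end node)."""
    num, den = {}, {}
    for y, p, b in zip(y_s, pi_s, box_s):
        num[b] = num.get(b, 0.0) + y / p    # weighted sum of y in box b
        den[b] = den.get(b, 0.0) + 1.0 / p  # estimated size of box b
    return {b: num[b] / den[b] for b in num}

def model_assisted_total(y_s, pi_s, box_s, box_U):
    """Predictions summed over U plus inverse-probability-weighted residuals."""
    m_hat = box_means(y_s, pi_s, box_s)
    # Raises KeyError if some population box has no sampled unit.
    prediction = sum(m_hat[b] for b in box_U)
    correction = sum((y - m_hat[b]) / p for y, p, b in zip(y_s, pi_s, box_s))
    return prediction + correction, correction

# Hypothetical finite population (assumption): N = 200 units in two boxes.
random.seed(2024)
N, n = 200, 40
box_U = ["A" if i < 100 else "B" for i in range(N)]
y_U = [(10.0 if b == "A" else 50.0) + random.gauss(0, 2) for b in box_U]

# Simple random sample without replacement, so pi_i = n / N for every unit.
s = random.sample(range(N), n)
y_s = [y_U[i] for i in s]
box_s = [box_U[i] for i in s]
pi_s = [n / N] * n

t_ht = horvitz_thompson(y_s, pi_s)
t_ma, correction = model_assisted_total(y_s, pi_s, box_s, box_U)
t_true = sum(y_U)
print(f"true total {t_true:.1f} | HT {t_ht:.1f} | model-assisted {t_ma:.1f}")
```

Because the assisting model here is the survey-weighted box mean, the bias-correction term sums to zero within each box (up to floating-point error), so the model-assisted estimate reduces to the sum of box sizes times box means. That matches the post-stratification view of the regression tree estimator.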