Missing Data & Observational Data Modeling

Motivation:

Missing data problems are endemic in statistical experiments and data collection operations. Investigators almost never observe all the outcomes they set out to record. In sample surveys, this means that individuals or entities in the survey do not respond at all or give only part of the information they are asked to provide. Even if a response is obtained, the information provided may be logically inconsistent, making such responses in effect missing. To compute reliable official statistics, statistical agencies compensate for these types of missing data using methods such as imputation and survey weight adjustment. Such techniques use non-missing survey information to methodically “fill in” the missing data. As data collection becomes more expensive and response rates decrease, observational data sources such as administrative records and commercial data, which provide alternate information on individuals or entities, are becoming more available. Deeper model-based imputation and survey weight adjustment methods are useful for improving and/or evaluating how sample survey or census data can be supplemented with information obtained from quality observational data. All these missing data problems and associated techniques involve statistical modeling along with subject matter experience.

 

Research Problems:

·   Simultaneous imputation of multiple survey variables to maintain joint properties, and related methods for evaluating model-based imputation.

·   Integrating editing and imputation of sample survey and census responses via multiple imputation and latent variable models.

·   Nonresponse adjustment and imputation using administrative records, based on response propensity and/or multiple imputation statistical and machine learning models.

·   Development of joint modeling and imputation of categorical variables using log-linear models for (sometimes sparse) contingency tables.

·   Statistical modeling (e.g., latent class models) for combining sample survey, census and/or alternative source data.

·   Statistical techniques (e.g., classification methods, multiple imputation models) for using alternative data sources to supplement field data collection.

·   Evaluation and visualization of nonresponse bias and nonresponse adjustments for geographic and socioeconomic subpopulations.
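
As a rough illustration of the response-propensity style of nonresponse adjustment listed above, the following sketch applies a simple weighting-class adjustment: within each adjustment cell, respondents' base weights are divided by the cell's weighted response rate, which acts as an estimated response propensity. The cells are assumed here to come from an auxiliary variable such as an administrative-records category. The field names (`cell`, `base_weight`, `responded`) and the toy data are hypothetical, and this is not a description of any production procedure.

```python
from collections import defaultdict


def weighting_class_adjustment(units):
    """Return nonresponse-adjusted weights, one per unit.

    Each unit is a dict with hypothetical keys:
      'cell'        - adjustment cell label (e.g., from an auxiliary source)
      'base_weight' - design (base) weight
      'responded'   - True if the unit responded
    Nonrespondents receive weight 0.0. A cell with no respondents
    contributes nothing, so its weight total is lost; in practice such
    cells would usually be collapsed with neighboring cells first.
    """
    total_weight = defaultdict(float)
    respondent_weight = defaultdict(float)
    for unit in units:
        total_weight[unit["cell"]] += unit["base_weight"]
        if unit["responded"]:
            respondent_weight[unit["cell"]] += unit["base_weight"]

    adjusted = []
    for unit in units:
        if unit["responded"]:
            # Weighted response rate in the cell = estimated propensity.
            propensity = respondent_weight[unit["cell"]] / total_weight[unit["cell"]]
            adjusted.append(unit["base_weight"] / propensity)
        else:
            adjusted.append(0.0)
    return adjusted


# Toy data, purely illustrative.
sample_units = [
    {"cell": "A", "base_weight": 10.0, "responded": True},
    {"cell": "A", "base_weight": 10.0, "responded": True},
    {"cell": "A", "base_weight": 10.0, "responded": False},
    {"cell": "B", "base_weight": 5.0, "responded": True},
    {"cell": "B", "base_weight": 5.0, "responded": False},
]
adjusted_weights = weighting_class_adjustment(sample_units)
print(adjusted_weights)
```

When every cell has at least one respondent, the adjusted weights sum to the original base-weight total within each cell, which is the property this kind of adjustment is designed to preserve. Model-based propensities (for example, from a regression or machine learning model) would replace the cell response rate in the same formula.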

 

Current Subprojects:

·   Data Editing, Imputation, and Weighting for Nonresponse (Morris, Thibaudeau, Kang, Ben-David, Chen, Shao)

·   Imputation and Weighting Models using Observational/Alternative Data Sources (Morris, Kang, Thibaudeau, Dompreh, Joyce)

 

Potential Applications:

·   Study flexible and data-driven nonresponse weight adjustments using administrative records for surveys experiencing data collection interruptions such as the ACS during the COVID-19 pandemic.

·   Measure sensitivity of estimates, impact of nonresponse on representativeness, and weight distributions in low-response surveys such as the Household Trends and Outlook Survey (formerly Household Pulse Survey).

·   Revisit traditional missing data modeling techniques (e.g., imputation and response propensity models) using machine learning algorithms with alternate data sources for household surveys such as the ACS.

·   Produce multiply imputed, synthetic, and/or composite estimates of more geographically granular and timely economic activity based on third-party data.

·   Study joint multiple imputation of categorical characteristic data in the Decennial Census using models that account for household hierarchical structure and produce plausible values that do not violate edit constraints.
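
For readers unfamiliar with how multiply imputed estimates such as those mentioned above are typically summarized, the sketch below applies the standard combining rules for multiple imputation (Rubin's rules): the point estimate is the average across the m completed data sets, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the example numbers are hypothetical and serve only as an illustration.

```python
from statistics import mean


def combine_multiple_imputations(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates - point estimates, one per completed (imputed) data set
    variances - their estimated sampling variances, in the same order
    Returns (combined_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    q_bar = mean(estimates)   # combined point estimate
    u_bar = mean(variances)   # average within-imputation variance
    # Between-imputation variance of the point estimates.
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total_variance = u_bar + (1 + 1 / m) * b
    return q_bar, total_variance


# Illustrative numbers only: three imputations of the same estimate.
estimate, variance = combine_multiple_imputations(
    [10.0, 12.0, 14.0], [4.0, 4.0, 4.0]
)
print(estimate, variance)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; using a single imputation and ignoring it would generally understate the variance.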

 

Accomplishments (October 2020-September 2024):

·   Developed a novel two-stage weighting approach using machine learning for ACS data products and evaluated performance compared to existing methods through simulations.

·   Developed experimental weighting techniques based on inverse probability weighting to address nonresponse issues in the American Community Survey caused by data collection disruptions and assessed traditional and machine learning response propensity model performance.

·   Collaborated to develop Bayesian multiple imputation models that use third-party data to produce geographically granular and timely experimental retail sales estimates used in the Monthly State Retail Sales program, which serves as a case study for future economic estimates.

·   Developed a framework for evaluating a short time series of experimental survey estimates to serve as an evaluation measure to compare a variety of nonresponse adjustment methods.

·   Collaborated to implement joint imputation of characteristic data in the Decennial Census using latent class models that account for household hierarchical structure and correlation between characteristics of an observational unit.

·   Provided a series of four lectures that presented introductory missing data methods: concepts, definitions, theory, and applications of statistical methodology. Participants were identified across the Census Bureau as follows: Center for Economic Studies, Center for Statistical Research & Methodology, Demographic Statistical Methods Division, Decennial Statistical Studies Division, Economic Statistical Methods Division, Research & Methodology Directorate, and Social, Economic, & Housing Statistics Division.

·   Developed and assessed a novel distribution for categorical data in the presence of underlying clustering, and developed a generalized linear mixed model for count data in the presence of clustering, as published in Stats.

·   Empirically studied sensitivity of Household Pulse Survey estimates to nonresponse weight adjustments and representativeness of respondents, in order to guide collaboration on improving the nonresponse procedure used in production.

 

Short-Term Activities (FY 2025 - FY 2027):

·   Research methods for smoothing weights particularly in low-response sample surveys with potentially significant nonresponse bias and substantial nonresponse weight adjustments.

·   Continue research on latent variable models for joint imputation of categorical data that satisfies edit constraints.

·   Research novel categorical distributions for contingency table modeling and joint imputation of categorical variables particularly for clustered data.

·   Continue research on accounting for observed zero cells in loglinear models for sparse contingency tables.

·   Knowledge-share practical assessments and solutions for nonresponse bias analyses across economic and demographic surveys.

·   Develop practical examples and guide usage of visualization of geographically-differentiated response patterns and sensitivity of survey outcomes.

·   Continue developing measures for evaluating competing nonresponse adjustment procedures on real data when a trusted benchmark does not exist.

 

Longer-Term Activities (beyond FY 2027):

·   Pursue further novel applications and case studies of predictive latent class models for joint categorical variables with nested observational structures and sparse variable distributions.

·   Develop a principled framework for incorporating uncertainty from model-based weight adjustments (particularly when based on machine learning algorithms) in variance estimation for survey outcomes.

·   Research methodology for principled selection of variables and model complexity in calibration or inverse probability weight models to achieve an optimal balance of bias and efficiency.

·   Devise models, visualizations, statistical quantities, etc. for empirically comparing nonresponse methods on real data in the absence of a reliable benchmark – either through assessments over time or across geographies or for sets of demographic or economic variables.

·   Research flexible categorical distributions with reasonable sampling properties for use in imputation of complex characteristic structure and correlation.

·   Joint modeling of response propensity and administrative source accuracy.

·   Research practical ways to apply decision theoretic concepts to the use of administrative records (versus personal contact or proxy response).

 

Selected Publications (Journal Articles, Peer Review):

Ibrahim, S., Mazumder, R., Radchenko, P., and Ben-David, E. (In Press). “Predicting Census Survey Response Rates with Parsimonious Additive Models and Structured Interactions,” The Annals of Applied Statistics.

Kaputa, S., Morris, D.S., and Holan, S. (2024). “Bayesian Multi-Source Hierarchical Models with Applications to the Monthly Retail Trade Survey,” Journal of Survey Statistics and Methodology.

Kang, J., Morris, D.S., Joyce, P., and Dompreh, I. (2023). “On Calibrated Inverse Probability Weighting and Generalized Boosting Propensity Score Models for Mean Estimation with Incomplete Survey Data,” Wiley Interdisciplinary Reviews (WIREs) Computational Statistics.

Morris, D.S. and Sellers, K.F. (2022). “A Flexible Mixed Model for Clustered Count Data,” Stats: Special Issue on Statistics, Data Analytics, and Inferences for Discrete Data, 5(1): 52–69. https://doi.org/10.3390/stats5010004.

Morris, D.S., Raim, A.M., and Sellers, K.F. (2020). “Conway-Maxwell-Multinomial Distribution for Flexible Modeling of Clustered Categorical Data,” Journal of Multivariate Analysis, 179.

Dumbacher, B., Morris, D.S., and Hogue, C. (2019). “Using Electronic Transaction Data to Add Geographic Granularity to Official Estimates of Retail Sales,” Journal of Big Data, 6(80).

Keller, A., Mule, V.T., Morris, D.S., and Konicki, S. (2018). “A Distance Metric for Modeling the Quality of Administrative Records for Use in the 2020 Census,” Journal of Official Statistics, 34(3): 1-27.

Morris, D. S. (2017). “A Modeling Approach for Administrative Record Enumeration in the Decennial Census,” Public Opinion Quarterly: Special Issue on Survey Research, Today and Tomorrow, 81(S1): 357-384.

Thibaudeau, Y., Slud, E., and Gottschalck, A. O. (2017). “Modeling Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics, 11(2): 680-697.

Morris, D.S., Keller, A., and Clark, B. (2016). “An Approach for Using Administrative Records to Reduce Contacts in the 2020 Census,” Statistical Journal of the International Association for Official Statistics, 32(2): 177-188.

Thibaudeau, Y. (2002). “Model Explicit Item Imputation for Demographic Categories,” Survey Methodology, 28(2), 135-143.

 

Selected Publications (CSRM Research Reports, CSRM Studies, Proceedings Papers, and Other):

Powers, R., Eltinge, J., Martinez, W., and Morris, D.S. (2024). “Using Linked Micromaps for Evidence-Based Policy,” In JSM Proceedings, Section on Statistical Graphics. Alexandria, VA: American Statistical Association.

Morris, D.S. and Raim, A.M. (2023). “Comparing Trial and Variable Association in Contingency Table Data Using Multinomial Models for Clustered Data,” in Proceedings of the 37th International Workshop on Statistical Modelling. Dortmund, Germany: Statistical Modelling Society, 536-542.

Winkler, W. E. (2018). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Edit/Imputation,” Research Report Series (Statistics #2018-05), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Thibaudeau, Y. and Morris, D.S. (2016). “Bayesian Decision Theory to Optimize the Use of Administrative Records in Census NRFU,” Proceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical Association.

Bechtel, L., Morris, D.S., and Thompson, K.J. (2015). “Using Classification Trees to Recommend Hot Deck Imputation Methods: A Case Study,” in FCSM Proceedings. Washington, D.C.: Federal Committee on Statistical Methodology.

Garcia, M., Morris, D.S., and Diamond, L.K. (2015). “Implementation of Ratio Imputation and Sequential Regression Multivariate Imputation on Economic Census Products,” Proceedings of the Joint Statistical Meetings.

Winkler, W. and Garcia, M. (2009). “Determining a Set of Edits,” Research Report Series (Statistics #2009-05), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

Winkler, W. E. (2008). “General Methods and Algorithms for Imputing Discrete Data under a Variety of Constraints,” Research Report Series (Statistics #2008-08), Statistical Research Division, U.S. Census Bureau, Washington, D.C.

 

Contact:

Darcy Morris, Joseph Kang, Isaac Dompreh, Yves Thibaudeau, Jun Shao, Emanuel Ben-David, Sixia Chen (ASA/NSF/Census Research Fellow/University of Oklahoma Health Sciences)

 

Funding Sources for FY 2025-2030:

0331 – Working Capital Fund / General Research Project

Various Decennial, Demographic, and Economic Projects


Page Last Revised - July 16, 2025