Missing Data, Edit, and Imputation
Motivation: Missing data problems are endemic to the conduct of statistical experiments and data collection projects. Investigators almost never observe all of the outcomes they set out to record.
In sample surveys and censuses, this means that individuals or entities fail to respond, or provide only part of the information they are asked to supply. In addition, the information provided may be logically inconsistent, which is effectively equivalent to its being missing.
To compute official statistics, agencies need to compensate for missing data. Available techniques for compensation include cell adjustments, imputation, and editing, possibly aided by administrative information. All of these techniques involve mathematical modeling along with subject-matter expertise.
- Compensating for missing data typically involves explicit or implicit modeling. Explicit methods include Bayesian multiple imputation, propensity score matching, and direct substitution of information extracted from administrative records. Implicit methods revolve around donor-based techniques such as hot-deck imputation and predictive mean matching.
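As a minimal illustration of an implicit, donor-based method, the following sketch performs random hot-deck imputation within imputation cells. All record and variable names here are hypothetical; this is not the Bureau's production implementation.

```python
import random

# Hypothetical records: 'cell' is an imputation cell (e.g., a demographic
# group) and 'income' is the survey item, stored as None when missing.
records = [
    {"cell": "A", "income": 52000}, {"cell": "A", "income": None},
    {"cell": "A", "income": 48000}, {"cell": "B", "income": 31000},
    {"cell": "B", "income": None},  {"cell": "B", "income": 35000},
]

def hot_deck_impute(recs, item, cell_key, rng=None):
    """Fill each missing value of `item` with a donor value drawn at
    random from respondents in the same imputation cell."""
    rng = rng or random.Random(0)
    donors = {}
    for r in recs:
        if r[item] is not None:
            donors.setdefault(r[cell_key], []).append(r[item])
    for r in recs:
        if r[item] is None:
            r[item] = rng.choice(donors[r[cell_key]])
    return recs

completed = hot_deck_impute(records, "income", "cell")
```

Because donors are restricted to the same cell, imputed values are always actually observed responses from similar units, which is what makes the method "implicit": no parametric model for the item is written down.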
All of these techniques are subject to edit rules that ensure the logical consistency of the completed data product. Research on integrating statistical validity and logical consistency requirements into the imputation process remains challenging. Another important problem is correctly quantifying the reliability of predictors constructed in part through imputation, as their variance can be substantially greater than the nominally computed variance.
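The variance-understatement problem above is the motivation for Rubin's combining rules in multiple imputation: the total variance adds a between-imputation component to the average within-imputation ("nominal") variance. A minimal sketch with toy numbers:

```python
from statistics import mean, variance

def rubin_combine(estimates, within_vars):
    """Combine point estimates and within-imputation variances from m
    completed datasets using Rubin's rules for multiple imputation."""
    m = len(estimates)
    q_bar = mean(estimates)               # combined point estimate
    u_bar = mean(within_vars)             # average within-imputation variance
    b = variance(estimates)               # between-imputation (sample) variance
    total_var = u_bar + (1 + 1 / m) * b   # Rubin's total variance
    return q_bar, total_var

# Toy values for m = 3 completed datasets: whenever the imputations
# disagree (b > 0), the total variance exceeds the nominal variance.
q, t = rubin_combine([10.0, 10.4, 9.8], [0.50, 0.52, 0.48])
```

The gap between `t` and `mean(within_vars)` is exactly the extra uncertainty created by the missing data, which single imputation with a nominal variance estimate would miss.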
Specific projects consider (1) nonresponse adjustment and imputation using administrative records, based on propensity and/or multiple imputation models, and (2) simultaneous imputation of multiple survey variables to maintain joint properties, together with related methods for evaluating model-based imputation.
- Research on missing data leads to improved overall data quality and predictor accuracy for any census or sample survey with a substantial frequency of missing data. It also leads to methods for adjusting the variance to reflect the additional uncertainty created by the missing data. Given the continuously rising cost of conducting censuses and sample surveys, imputation and other missing-data compensation methods aided by administrative records may come to augment actual data collection in the future.
Accomplishments (October 2017 - September 2018):
- Showed how to use log-linear models coupled with complementary logistic regression to improve the efficiency (reducing the sampling error) of estimates of gross flows and of gross-flow proportions from month to month, classified by demographic variables. Showed how these estimators can be implemented for labor force measurements and gross flows estimated from the Current Population Survey (to appear in Methodology of Longitudinal Surveys 2, P. Lynn, ed.).
- Investigated the feasibility of using third-party ("big") data from First Data (a large payment processor) to supplement and/or enhance retail sales estimates in the Monthly/Annual Retail Trade Survey (MRTS and ARTS).
- Completed research and development of optimization methods as an alternative to raking balance complexes when detail items are allowed to be negative or there is subtraction in the balance complexes.
- Applied and completed evaluation of optimization methods for raking balance complexes in the Quarterly Financial Report (QFR) when items can take negative values.
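The classic raking adjustment for a balance complex scales detail items by a single factor so that they sum to the edited total. The sketch below (illustrative values only, not the StEPS or QFR implementation) shows the basic adjustment, and the comment notes why negative detail items motivate the optimization-based alternatives studied here:

```python
def rake_details(details, total):
    """Proportionally scale detail items so they sum to a fixed total
    (the classic raking adjustment for a balance complex)."""
    s = sum(details)
    if s == 0:
        raise ValueError("cannot rake: details sum to zero")
    factor = total / s
    return [d * factor for d in details]

# Example: details reported as 40 + 35 + 30 = 105, but the edited total is 100.
adjusted = rake_details([40.0, 35.0, 30.0], 100.0)

# When a detail item can be negative, or the complex involves subtraction,
# a single multiplicative factor can push items in the wrong direction or
# fail entirely (e.g., details summing to zero) -- hence the interest in
# optimization methods as an alternative to raking.
```

Proportional raking preserves the relative shares of the details; an optimization formulation instead minimizes a distance to the reported values subject to the balance constraint, which remains well-defined with negative items.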
- Researched non-parametric Bayesian editing and imputation methods developed by Kim et al. (2017) as an alternative to Fellegi-Holt editing for Economic Census data. These methods have the added advantage of integrating data synthesis with editing/imputation processing.
- Completed research, proposed, implemented, and evaluated several Industry Characteristic Classification measures at the industry and imputation cell levels necessary for applying Bayesian editing methods to Economic Census data.
Short-Term Activities (FY 2019):
- Reverse engineer the record-linkage package “BigMatch” in order to integrate it into the “PVS” process.
- Extend the analysis and estimation of changes in the labor force status using log-linear models coupled with matching logistic regression methods to the Current Population Survey.
- Continue researching modeling approaches for using administrative records in lieu of Decennial Census field visits due to imminent design decisions.
- Continue to investigate the feasibility of using third party (“big”) data from various available sources to supplement and/or enhance retail sales estimates in the Monthly/Annual Retail Trade Survey (MRTS and ARTS).
- Continue research, implementation, and resolution of editing and data issues when applying non-parametric Bayesian editing methods to edit and multiply impute Economic Census data.
- Continue research on integration of Bayesian editing and multiple imputation processing with disclosure avoidance and data synthesis processing.
- Continue research on applying Bayesian editing methods developed by Hang Kim et al. (2015) to developing synthetic economic census data.
- Continue work on heuristic methods for edit generation.
Longer-Term Activities (beyond FY 2019):
- Maintain a functional inventory of record-linkage software packages (BigMatch, SRD Matcher, PVS Matcher, Python Tool Kit) for various uses at the Census Bureau.
- Extend small area estimation modeling for longitudinal data (survey and/or third party) in the presence of attrition and/or other types of missing data, using log-linear models in tandem with logistic regression.
- Extend the modeling of propensity jointly with the accuracy of administrative sources.
- Continue researching modeling approaches for using administrative records in lieu of Decennial Census field visits to support future design decisions.
- Research practical ways to apply decision theoretic concepts to the use of administrative records (versus personal contact or proxy response) in the Decennial Census.
- Research joint models for longitudinal count data and missing data (e.g., dropout) using shared random effects to measure the association between the propensity for nonresponse and the count outcome of interest.
- Research imputation methods for a Decennial Census design that incorporates adaptive design and administrative records to reduce contacts, which consequently increases proxy response and nonresponse.
- Research macro and selective editing in the context of large sets of administrative records and high-bandwidth data streams (Big Data).
- Continue collaboration on researching methods for data integration of the exports and patents data files with the Business Register (BR).
- Evaluate the results of data corrections in the Standard Economic Processing System (StEPS) using new raking algorithms for adjusting balance complexes.
- Continue research on edit procedures.
- Investigate why some of the newly developed alternative methods for raking lead to lower weighted totals than the existing StEPS raking method, apply the methodology to additional balance complexes from the QFR, and research the application to balance complexes from other Economic Census surveys.
Selected Publications:
Bechtel, L., Morris, D.S., and Thompson, K.J. (2015). "Using Classification Trees to Recommend Hot Deck Imputation Methods: A Case Study." In FCSM Proceedings. Washington, DC: Federal Committee on Statistical Methodology.
Garcia, M., Morris, D.S., and Diamond, L.K. (2015). “Implementation of Ratio Imputation and Sequential Regression Multivariate Imputation on Economic Census Products.” Proceedings of the Joint Statistical Meetings.
Keller, A., Mule, V.T., Morris, D.S. and Konicki, S. (2018). "A Distance Metric for Modeling the Quality of Administrative Records for Use in the 2020 Census." Journal of Official Statistics, 34(3): 1-27.
Morris, D.S., Keller, A., and Clark, B. (2016). "An Approach for Using Administrative Records to Reduce Contacts in the 2020 Census." Statistical Journal of the International Association for Official Statistics, 32(2): 177-188.
Morris, D. S. (2017). “A Modeling Approach for Administrative Record Enumeration in the Decennial Census,” Public Opinion Quarterly: Special Issue on Survey Research, Today and Tomorrow, 81(S1): 357-384.
Thibaudeau Y., Slud, E., and Gottschalck, A. O. (2017). “Modeling Log-Linear Conditional Probabilities for Estimation in Surveys,” Annals of Applied Statistics 11(2), 680-697.
Thibaudeau, Y. (2002). "Model Explicit Item Imputation for Demographic Categories," Survey Methodology, 28(2), 135-143.
Winkler, W. E. (2008). "General Methods and Algorithms for Imputing Discrete Data under a Variety of Constraints," Research Report Series (Statistics #2008-08), Statistical Research Division, U.S. Census Bureau, Washington, DC.
Winkler, W. E. (2018). “Cleaning and Using Administrative Lists: Enhanced Practices and Computational Algorithms for Record Linkage and Modeling/Edit/Imputation,” Research Report Series (Statistics #2018-05), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.
Winkler, W. and Garcia, M. (2009). “Determining a Set of Edits,” Research Report Series (Statistics #2009-05), Statistical Research Division, U.S. Census Bureau, Washington, DC.
Contact: Yves Thibaudeau, Maria Garcia, Martin Klein, Darcy Morris, Jun Shao, Eric Slud, William Winkler, Xiaoyun Lu
Funding Sources for FY 2019:
- 0331 – Working Capital Fund / General Research Project
Various Decennial, Demographic, and Economic Projects