Simulation and Statistical Modeling
Simulation studies that are carefully designed under realistic survey conditions can be used to evaluate the quality of new statistical methodology for Census Bureau data. Furthermore, new computationally intensive statistical methodology is often beneficial because it can require weaker assumptions, offer more flexibility in sampling or modeling, accommodate complex features in the data, and enable valid inference where other methods might fail.
Statistical modeling is at the core of the design of realistic simulation studies and the development of computationally intensive statistical methods. Modeling also enables one to efficiently use all available information when producing estimates.
Such studies can benefit from software for data processing. Statistical disclosure avoidance methods are also developed, and their properties are studied.
- Systematically develop an environment for simulating complex surveys that can be used as a test-bed for new data analysis methods.
- Develop flexible model-based estimation methods for survey data.
- Develop new methods for statistical disclosure control that simultaneously protect confidential data from disclosure while enabling valid inferences to be drawn on relevant population parameters.
- Investigate the bootstrap for analyzing data from complex sample surveys.
- Develop models for the analysis of measurement errors in Demographic sample surveys (e.g., the Current Population Survey or the Survey of Income and Program Participation).
- Identify and develop statistical models (e.g., loglinear models, mixture models, and mixed-effects models) to characterize relationships between variables measured in censuses, sample surveys, and administrative records.
- Investigate noise infusion and synthetic data for statistical disclosure control.
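As a rough illustration of noise infusion as an alternative to top coding, the following minimal sketch compares the two on simulated skewed data. The lognormal "income" data, the 95th-percentile threshold, and the Uniform(0.9, 1.1) noise distribution are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def top_code(values, threshold):
    """Replace every value above the threshold with the threshold itself."""
    return [min(v, threshold) for v in values]


def noise_multiply(values, threshold, rng, half_width=0.1):
    """Multiply each value above the threshold by independent mean-1 noise.

    The Uniform(1 - half_width, 1 + half_width) noise distribution is an
    illustrative assumption, not a recommended choice.
    """
    return [
        v * rng.uniform(1.0 - half_width, 1.0 + half_width) if v > threshold else v
        for v in values
    ]


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2018)
# Hypothetical skewed "income" data drawn from a lognormal distribution.
incomes = [rng.lognormvariate(10.0, 1.0) for _ in range(10_000)]
# Treat roughly the top 5% as sensitive: use the empirical 95th percentile.
threshold = sorted(incomes)[int(0.95 * len(incomes))]

top_coded = top_code(incomes, threshold)
noisy = noise_multiply(incomes, threshold, rng)

print(f"original mean:         {mean(incomes):,.0f}")
print(f"top-coded mean:        {mean(top_coded):,.0f}")
print(f"noise-multiplied mean: {mean(noisy):,.0f}")
```

Because the noise has mean 1, the noise-multiplied mean stays close to the original mean, while top coding pulls the mean downward. Other quantities, such as variances or quantiles, are distorted by the noise and generally need an analysis method that accounts for it.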
- Monte Carlo simulation of data collection operations can help the Census Bureau make operational changes more efficiently.
- Use noise multiplication or synthetic data as an alternative to top coding for statistical disclosure control in publicly released data. Both noise multiplication and synthetic data have the potential to preserve more information in the released data than top coding does.
- Rigorous statistical disclosure control methods allow for the release of new microdata products.
- An environment for simulating complex surveys can be used to evaluate the statistical properties of new methods for missing data imputation, model-based estimation, small area estimation, and similar problems.
- Model-based estimation procedures enable efficient use of auxiliary information (for example, Economic Census information in business surveys), and can be applied in situations where variables are highly skewed and sample sizes are not sufficiently large to justify normal approximations. These methods may also be applicable to the analysis of data arising from a mechanism other than random sampling.
- Variance estimates and confidence intervals in complex surveys can be obtained via the bootstrap.
- Modeling approaches with administrative records can help enhance the information obtained from various sample surveys.
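The bootstrap item above can be sketched with a naive within-stratum bootstrap for a weighted mean. This is a simplified illustration under stated assumptions: the strata, values, and weights are simulated, finite population corrections are ignored, and resampling all n_h units with replacement is known to understate variance somewhat when strata are small (rescaled variants exist to correct for this). The function names are hypothetical.

```python
import random
import statistics


def weighted_mean(units):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in units)
    return sum(y * w for y, w in units) / total_weight


def stratified_bootstrap(strata, n_reps=1000, seed=0):
    """Resample units with replacement independently within each stratum."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_reps):
        resample = []
        for units in strata.values():
            resample.extend(rng.choices(units, k=len(units)))
        replicates.append(weighted_mean(resample))
    return replicates


# Hypothetical stratified sample: each unit is (observed value, sampling weight).
data_rng = random.Random(42)
strata = {
    "small": [(data_rng.gauss(20.0, 5.0), 50.0) for _ in range(40)],
    "medium": [(data_rng.gauss(60.0, 15.0), 20.0) for _ in range(30)],
    "large": [(data_rng.gauss(200.0, 60.0), 5.0) for _ in range(20)],
}

all_units = [unit for units in strata.values() for unit in units]
point_estimate = weighted_mean(all_units)

replicates = sorted(stratified_bootstrap(strata))
n = len(replicates)
std_error = statistics.stdev(replicates)
# Percentile interval from the 2.5th and 97.5th percentiles of the replicates.
lower = replicates[(25 * n) // 1000]
upper = replicates[(975 * n) // 1000 - 1]

print(f"estimate = {point_estimate:.2f}, bootstrap SE = {std_error:.2f}")
print(f"95% percentile interval = ({lower:.2f}, {upper:.2f})")
```

Resampling is done separately in each stratum so that every replicate keeps the stratum sample sizes of the original design.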
Accomplishments (October 2017 - September 2018):
- Developed new methodology using principles of multiple imputation to analyze data under a differentially private Laplace mechanism.
- Developed new methodology for using survey estimates to construct a joint confidence region for a ranking of populations.
- Continued developing finite sample methodology for drawing inference based on singly and multiply imputed synthetic data under the linear regression model, including theoretical and empirical evaluations of the proposed methods when certain assumptions used to derive the methodology do not hold.
- Applied model selection and model validation methodology to develop models for producing estimates based on the Tobacco Use Supplement to the Current Population Survey using small area estimation methodology.
- Evaluated bootstrap methodology for propensity score estimation.
- Developed new visualizations for displaying uncertainty in estimated rankings.
- Continued improving a synthetic population designed for simulating Monthly Wholesale Trade Survey data for a period representative of over four years; used the synthetic population to evaluate new flexible modeling strategies being developed for multivariate data with missing values.
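For readers unfamiliar with the Laplace mechanism mentioned in the first accomplishment, here is a minimal sketch of releasing a bounded mean with Laplace noise. It illustrates only the release mechanism, not the multiple-imputation analysis methodology developed in the project. The simulated ages, the clipping bounds, the value of epsilon, and the treatment of the sample size as public are assumptions, and the function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of values clipped to [lower, upper] with Laplace noise.

    Changing one record moves the clipped mean by at most (upper - lower) / n,
    so that quantity is the sensitivity and sensitivity / epsilon is the scale.
    The sample size n is treated as public here, which is an assumption.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    clipped_mean = sum(clipped) / len(clipped)
    return clipped_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(7)
# Hypothetical confidential data: 1,000 ages between 18 and 90.
ages = [rng.randint(18, 90) for _ in range(1_000)]
true_mean = sum(ages) / len(ages)
released = private_mean(ages, lower=0.0, upper=100.0, epsilon=1.0, rng=rng)

print(f"true mean = {true_mean:.3f}, released mean = {released:.3f}")
```

A smaller epsilon gives stronger protection but a larger noise scale, which is why analysis methods that account for the added noise are needed.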
Short-Term Activities (FY 2019):
- Continue development of methodology for using multiple imputation to analyze data under a differentially private data release mechanism.
- Continue developing finite sample methodology for drawing inference based on singly and multiply imputed synthetic data, and extend some standard multivariate statistical methods for application to synthetic data.
- Continue developing new methodology for constructing a joint confidence region for a ranking based on sample survey data, and develop accompanying visualizations.
- Develop new flexible modeling strategies for multivariate data with missing values, and evaluate these models using the synthetic population constructed for simulating Monthly Wholesale Trade Survey data; refine synthetic population as needed.
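To make the idea of a joint confidence region for a ranking concrete, the sketch below shows one simple way such a region could be formed: build simultaneous confidence intervals for the population parameters, then bound each rank by counting the populations whose intervals lie entirely below or entirely above. The Bonferroni adjustment, the normal approximation, and the example numbers are assumptions, and this is not necessarily the methodology developed in the project. The function name is hypothetical.

```python
from statistics import NormalDist


def rank_intervals(estimates, std_errors, alpha=0.10):
    """Return (lowest, highest) possible rank for each population.

    Rank 1 is the smallest parameter. Intervals are Bonferroni-adjusted so
    that all of them hold jointly with probability at least 1 - alpha,
    assuming approximately normal estimators.
    """
    k = len(estimates)
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * k))
    lows = [e - z * s for e, s in zip(estimates, std_errors)]
    highs = [e + z * s for e, s in zip(estimates, std_errors)]

    bounds = []
    for i in range(k):
        # Populations whose whole interval lies below (above) interval i.
        surely_smaller = sum(1 for j in range(k) if j != i and highs[j] < lows[i])
        surely_larger = sum(1 for j in range(k) if j != i and lows[j] > highs[i])
        bounds.append((1 + surely_smaller, k - surely_larger))
    return bounds


# Hypothetical survey estimates and standard errors for three populations.
print(rank_intervals([10.0, 11.0, 30.0], [1.0, 1.0, 1.0]))
```

In this example the first two intervals overlap, so either population could hold rank 1 or rank 2, while the third population is ranked last with no ambiguity.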
Longer-Term Activities (beyond FY 2019):
- Develop methodology for analyzing singly and multiply imputed synthetic data under various realistic scenarios.
- Develop noise infusion methodology for statistical disclosure control.
- Study ways of quantifying the privacy protection/data utility tradeoff in statistical disclosure control.
- Develop and study bootstrap methods for sample survey data.
- Create an environment for simulating complex aspects of economic/demographic surveys.
- Develop methodology for quantifying uncertainty in statistical rankings, and refine visualizations.
Moura, R., Klein, M., Coelho, C., and Sinha, B. (2017). "Inference for Multivariate Regression Model based on Synthetic Data generated under Fixed-Posterior Predictive Sampling: Comparison with Plug-in Sampling," REVSTAT - Statistical Journal, 15(2), 155-186.
Klein, M., and Datta, G. (2017). "Statistical Disclosure Control Via Sufficiency Under the Multiple Linear Regression Model," Journal of Statistical Theory and Practice.
Klein, M., and Sinha, B. (2016). "Likelihood Based Finite Sample Inference for Singly Imputed Synthetic Data Under the Multivariate Normal and Multiple Linear Regression Models," Journal of Privacy and Confidentiality, 7, 43-98.
Klein, M., and Sinha, B. (2015). "Inference for Singly Imputed Synthetic Data Based on Posterior Predictive Sampling under Multivariate Normal and Multiple Linear Regression Models," Sankhya B: The Indian Journal of Statistics, 77-B, 293-311.
Klein, M., and Sinha, B. (2015). "Likelihood-Based Inference for Singly and Multiply Imputed Synthetic Data under a Normal Model," Statistics and Probability Letters, 105, 168-175.
Klein, M., and Sinha, B. (2015). "Likelihood-Based Finite Sample Inference for Synthetic Data Based on Exponential Model," Thailand Statistician: Journal of The Thai Statistical Association, 13, 33-47.
Wright, T., Klein, M., and Wieczorek, J. (2014). "Ranking Populations Based on Sample Survey Data," Center for Statistical Research and Methodology, Research and Methodology Directorate Research Report Series (Statistics #2014-12). U.S. Census Bureau. Available online.
Klein, M., Lineback, J.F., and Schafer, J. (2014). "Evaluating Imputation Techniques in the Monthly Wholesale Trade Survey," Proceedings of the Joint Statistical Meetings, Alexandria, VA: American Statistical Association.
Klein, M., Mathew, T., and Sinha, B. (2014). "Noise Multiplication for Statistical Disclosure Control of Extreme Values in Log-normal Regression Samples," Journal of Privacy and Confidentiality, 6, 77-125.
Klein, M., Mathew, T., and Sinha, B. (2014). "Likelihood Based Inference Under Noise Multiplication," Thailand Statistician: Journal of The Thai Statistical Association, 12, 1-23.
Wright, T., Klein, M., and Wieczorek, J. (2013). "An Overview of Some Concepts for Potential Use in Ranking Populations Based on Sample Survey Data," The 59th International Statistical Institute World Statistics Congress, Hong Kong, China.
Klein, M., and Sinha, B. (2013). "Statistical Analysis of Noise Multiplied Data Using Multiple Imputation," Journal of Official Statistics, 29, 425-465.
Klein, M., and Linton, P. (2013). "On a Comparison of Tests of Homogeneity of Binomial Proportions," Journal of Statistical Theory and Applications, 12, 208-224.
Klein, M., Mathew, T., and Sinha, B. (2013). "A Comparison of Statistical Disclosure Control Methods: Multiple Imputation Versus Noise Multiplication," Center for Statistical Research and Methodology, Research and Methodology Directorate Research Report Series (Statistics #2013-02). U.S. Census Bureau. Available online.
Shao, J., Klein, M., and Xu, J. (2012). "Imputation for Nonmonotone Nonresponse in the Survey of Industrial Research and Development," Survey Methodology, 38, 143-155.
Klein, M., and Wright, T. (2011). "Ranking Procedures for Several Normal Populations: An Empirical Investigation," International Journal of Statistical Sciences, 11, 37-58.
Klein, M., and Creecy, R. (2010). "Steps Toward Creating a Fully Synthetic Decennial Census Microdata File," Proceedings of the Joint Statistical Meetings, Alexandria, VA: American Statistical Association.
Contact: Martin Klein, Isaac Dompreh, Brett Moran, Bimal Sinha
Funding Sources for FY 2018:
- 0331 - Working Capital Fund / General Research Project
- Various Decennial, Demographic, and Economic Projects