Experimentation, Prediction, & Modeling

Motivation:

Experiments at the Census Bureau are used to answer many research questions, especially those related to testing, evaluating, and advancing survey sampling methods. A properly designed experiment provides a valid, cost-effective framework that ensures that the right type of data is collected and that sample sizes and power are sufficient to address the questions of interest. The use of valid statistical models is vital both to analyzing results from designed experiments and to characterizing relationships between variables in the vast data sources available to the Census Bureau. Statistical modeling is an essential component for wisely integrating data from previous sources (e.g., censuses, sample surveys, and administrative records) in order to maximize the information that they can provide. In particular, linear mixed effects models are ubiquitous at the Census Bureau through applications of small area estimation. Models can also identify errors in data, e.g., by computing valid tolerance bounds and flagging data outside the bounds for further review.

Research Problems and Potential Applications:

1. Investigate established methods and novel extensions to support design (e.g., factorial designs), analysis, and sample size determination for Census Bureau experiments.

· Sample sizes can be determined to achieve desired power under planned designs and statistical procedures.

· Experimental design can help guide and validate testing procedures proposed for censuses and surveys.

2. Investigate methodology for experimental designs embedded in sample surveys, including large-scale field experiments embedded in ongoing surveys.

· This includes design-based and model-based analysis and variance estimation incorporating the sampling design and the experimental design (van den Brakel, Survey Methodology, 2005).

· Embedded experiments can be used to evaluate the effectiveness of alternative contact strategies, especially for improving response rates.

· Of particular interest to the Census Bureau is where systematic sampling is used both for the sampling design and the experimental design.

· A potential application area is to expand the collection of experimental design procedures utilized with the American Community Survey.

3. Identify and develop statistical models (e.g., loglinear models, mixture models, and mixed-effects models), associated methodologies, and computational tools for problems relevant to the Census Bureau.

· Modeling can help to characterize relationships between variables measured in censuses, sample surveys, and administrative records.

· Modeling can help to study response rates in a census or survey operation and their relationships to associated variables. It can also be used to predict volumes of incoming responses with appropriate measures of uncertainty.

· Models can be used to provide principled measures of statistical variability for constructs like the POP Division's Population Estimates.

· Modeling can enhance information obtained from various sample surveys using auxiliary data sources, such as administrative records.

· Fiducial prediction intervals of random effects can be applied to mixed effects models such as those used in small area estimation.

4. Construct rectangular nonparametric tolerance regions for multivariate data, focusing on multivariate ratio edits.

· This can be applied to multivariate economic data and aid in the editing process by identifying observations that are outlying in one or more attributes and which subsequently should undergo further review.

· The importance of ratio edits and multivariate/multiple edits is noted in the work of Thompson and Sigman (Journal of Official Statistics, 1999), de Waal, Pannekoek and Scholtus (Handbook of Statistical Data Editing and Imputation, 2011), and Ghosh-Dastidar and Schafer (JASA, 2003 and Journal of Official Statistics, 2006).

5. Develop a technique to account for misreporting via the COM-Poisson distribution in order to obtain more accurate count estimates.

· This could be used to assess the amount of misreporting in historical Census datasets and to aid in developing models that yield more accurate survey count estimates.

6. Develop a disclosure policy motivated by the COM-Poisson and related distributions that allows one to protect individual information reported in two-way and multi-way tables.

· This would allow the Census Bureau to release statistical measures associated with a general distributional form while protecting individual privacy.

· This would allow one to estimate the form of multi-way tables of interest while masking the true response data.
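
For readers unfamiliar with the COM-Poisson distribution referenced in items 5 and 6, the following is a minimal Python sketch of its probability mass function, P(Y = y) proportional to lambda^y / (y!)^nu. It is an illustration only and not the implementation used in COMPoissonReg or in any Census Bureau software. The function names and the truncation point max_count are assumptions introduced here. Setting nu = 1 recovers the Poisson distribution, nu > 1 gives under-dispersion, and nu < 1 gives over-dispersion.

```python
import math


def com_poisson_pmf(lam, nu, max_count=200):
    """Return [P(Y=0), ..., P(Y=max_count)] for a COM-Poisson(lam, nu).

    Assumption: the infinite normalizing sum is truncated at max_count,
    which is adequate only when the tail mass beyond it is negligible.
    """
    # Work on the log scale to avoid overflow in lam**j and factorials.
    log_terms = [j * math.log(lam) - nu * math.lgamma(j + 1)
                 for j in range(max_count + 1)]
    peak = max(log_terms)
    log_norm = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return [math.exp(t - log_norm) for t in log_terms]


def mean_and_variance(pmf):
    """Mean and variance of a count distribution given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(pmf))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, variance


# Compare dispersion for three illustrative values of nu at lam = 3.
for nu in (0.5, 1.0, 2.0):
    mean, variance = mean_and_variance(com_poisson_pmf(3.0, nu))
    print(f"nu={nu}: mean={mean:.3f}, variance={variance:.3f}")
```

Comparing the printed variance with the mean for each value of nu shows how a single extra parameter moves the model between over-dispersed, equi-dispersed, and under-dispersed counts.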

Current Subprojects:

· Developing Flexible Distributions and Statistical Modeling for Count Data Containing Dispersion (Sellers, Morris, Raim)

· Design and Analysis of Embedded Experiments (Mathew, Raim, Sellers)

· Randomization, Re-randomization and Covariate Balance in Treatment-control Comparisons (Ben-David, Mathew)

· Ratio Edits for Multivariate Data Based on Tolerance Rectangles (Mathew)

· Generation of Random Variates for Weighted Distributions (Raim, Livsey, Irimata)

Accomplishments (October 2020-September 2024):

· Completed manuscripts regarding the development of one-step autoregressive and moving average models, respectively, for count data motivated by the COM-Poisson distribution.

· Completed a manuscript on developing a flexible model to analyze clustered categorical data.

· Completed manuscript describing the development of a flexible bivariate distribution motivated by the Conway-Maxwell-Poisson distribution and established via the trivariate reduction method.

· Completed manuscript discussing the development of a flexible multivariate discrete distribution.

· Completed paper describing initial developments of a flexible mixed effects model for clustered count data.

· Completed paper on model-based ACS special tabulations as a precursor to considering more formal privacy protection. The paper considers a hierarchical Bayesian model with a Dirichlet process mixture and spatial random effects.

· Completed paper on direct sampler methodology with rejection sampling, using step function as an envelope.

· Completed technical report on direct sampling methodology with application to privacy protected data.

· Completed manuscript on vertical weighted strips method. This is a framework to construct proposal distributions for rejection sampling using the form of weighted distributions.

· Completed paper on Bayesian hierarchical modeling of privacy protected data. Several standard methods of protection are considered, and a more convenient Gaussian approximation is evaluated for accuracy.

· Completed ‘fntl’ R package with detailed vignette describing the API. This package provides a straightforward interface to numerical tools in the R API (and several additional implementations) where functional arguments are specified as C++ lambda functions.

· Completed a technical report using continuation-ratio logit model to analyze the effect of a new training module for Spanish-speaking enumerators on response rates of Spanish-speaking households in the 2020 Census.
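
As background for the rejection sampling work listed above, the following Python sketch shows generic rejection sampling with a piecewise-constant (step function) envelope. It is not the direct sampler of Raim (2023) or the vertical weighted strips method, whose constructions are not described in this document. The target (an unnormalized Beta(2, 3) density), the equal-width cells, and all function names are assumptions chosen for illustration. Because the target is unimodal with a known mode, its maximum on each cell is available in closed form, which yields a valid upper bound.

```python
import random

MODE = 1.0 / 3.0  # mode of the illustrative target below


def target(x):
    """Unnormalized Beta(2, 3) density on [0, 1] (illustrative target)."""
    return x * (1.0 - x) ** 2


def build_step_envelope(num_cells):
    """Return cell edges and, for each cell, an upper bound on the target."""
    edges = [i / num_cells for i in range(num_cells + 1)]
    heights = []
    for a, b in zip(edges[:-1], edges[1:]):
        # For a unimodal target the cell maximum is at the mode if the
        # cell contains it, and otherwise at one of the cell endpoints.
        heights.append(target(MODE) if a <= MODE <= b
                       else max(target(a), target(b)))
    return edges, heights


def sample(num_draws, num_cells=10, seed=1):
    """Draw from the target by rejection; also return the proposal count."""
    rng = random.Random(seed)
    edges, heights = build_step_envelope(num_cells)
    areas = [h * (b - a) for h, a, b in zip(heights, edges[:-1], edges[1:])]
    draws, proposals = [], 0
    while len(draws) < num_draws:
        # Propose from the envelope: pick a cell by area, then a uniform point.
        k = rng.choices(range(num_cells), weights=areas)[0]
        x = rng.uniform(edges[k], edges[k + 1])
        proposals += 1
        # Accept with probability target(x) / envelope height.
        if rng.random() * heights[k] <= target(x):
            draws.append(x)
    return draws, proposals


draws, proposals = sample(5000)
print("acceptance rate:", len(draws) / proposals)
print("sample mean:", sum(draws) / len(draws))
```

Increasing num_cells tightens the envelope around the target, which raises the acceptance rate at the cost of a larger envelope to store and search.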

Short-Term Activities (FY 2025 - FY 2027):

· Develop a COM-Poisson regression model that allows for excess zeros and censored outcomes.

· Complete R package and vignette to support the vertical weighted strips sampling methodology.

· Explore panel count models for response count data observed over the time span of a census operation.

· Apply vertical weighted strips methodology to rejection sampling in Bayesian small area estimation, especially in joint modeling of direct estimates and associated variance estimates.

· Develop multivariate rectangular regions that can be used to address the multivariate ratio edit problem.
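
To make the idea of a tolerance-based ratio edit concrete, the following Python sketch treats the simplest univariate case: a nonparametric tolerance interval formed by the smallest and largest of n reference ratios. It does not implement the multivariate rectangular regions proposed above. For n independent draws from a continuous distribution, the interval [min, max] contains at least a proportion p of the population with confidence 1 - n p^(n-1) + (n-1) p^n, a standard order-statistics result. The function names and the example ratios are hypothetical.

```python
def minmax_confidence(n, p):
    """Confidence that [min, max] of n i.i.d. continuous draws covers
    at least a proportion p of the population."""
    return 1.0 - n * p ** (n - 1) + (n - 1) * p ** n


def smallest_sample_size(p, confidence):
    """Smallest n for which [min, max] is a (p, confidence) tolerance interval."""
    n = 2
    while minmax_confidence(n, p) < confidence:
        n += 1
    return n


def flag_outside(reference_ratios, new_ratios):
    """Return the new ratios that fall outside the reference [min, max]."""
    lower, upper = min(reference_ratios), max(reference_ratios)
    return [r for r in new_ratios if r < lower or r > upper]


# Hypothetical ratios (e.g., payroll / employment), made up for illustration.
reference = [0.92, 1.05, 0.98, 1.10, 1.01, 0.95, 1.07]
incoming = [1.02, 0.40, 1.09, 1.85]

print("flagged for review:", flag_outside(reference, incoming))
print("reference values needed for 95% content, 95% confidence:",
      smallest_sample_size(0.95, 0.95))
```

With only a handful of reference ratios the confidence attached to [min, max] is low, which is one reason the required sample size is worth computing before such limits are used as edits.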

Longer-Term Activities (beyond FY 2027):

· Develop generalized/flexible spatial and time series models motivated by the COM-Poisson distribution.

· Significant progress has been made recently on randomization-based causal inference for complex experiments: Ding (Statistical Science, 2017), Dasgupta, Pillai and Rubin (Journal of the Royal Statistical Society, Series B, 2015), Ding and Dasgupta (Journal of the American Statistical Association, 2016), Mukerjee, Dasgupta and Rubin (Journal of the American Statistical Association, 2018), and Branson and Dasgupta (International Statistical Review, 2020). It is proposed to adopt these methodologies for analyzing complex embedded experiments, by taking into account the features of embedded experiments (for example, random interviewer effects and different sampling designs).

· Generalize the Kadane et al. (2006) COM-Poisson motivated data disclosure limitation procedure for one-way tables to handle two-way and multi-way tables. Determine the associated sufficient statistics of the bivariate (or multivariate) COM-Poisson distribution and use them to describe the space of feasible tables that can be used to substitute the true contingency table.

· Consider generalizations of the frequentist and Bayesian approaches to address under-reporting described in Winkelmann (1996), Fader and Hardie (2000), Neubauer and Djuras (2009), and Neubauer et al. (2009) to allow for data dispersion via the COM-Poisson distribution.

· Review literature on causal inference and consider problems and applications relevant to the Census Bureau.

· Investigate the role of fiducial inference and approximate fiducial inference in mixed and random effects models, linear as well as nonlinear (including generalized linear models), with an emphasis on problems of interest to the Census Bureau; for example, to address prediction problems relevant in small area estimation.

· Consider extensions to sample size determination in Raim et al. (JOS, 2023). This includes variations to the statistic and hypothesis for the planned test procedure, accounting for varying costs of fieldwork in allocation, presence of mixed effects, and models that more holistically capture response mechanisms.
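
As a point of reference for the sample size determination discussed above, the following Python sketch gives the textbook normal-approximation calculation for a two-sided test comparing two response rates under simple random assignment. It is not the sequential regression approach of Raim et al. (2023), and it ignores the survey design, fieldwork costs, and mixed effects that the proposed extensions would address. The function name and the example rates are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided z-test of two proportions.

    Uses the unpooled-variance formula
        n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    and rounds up to the next whole unit.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = (p_control * (1.0 - p_control)
                    + p_treatment * (1.0 - p_treatment))
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)


# Hypothetical scenario: detect a rise in response rate from 50% to 55%.
print(per_arm_sample_size(0.50, 0.55))
```

Because the required size grows with the inverse square of the difference in rates, halving the detectable effect roughly quadruples the number of cases needed per arm.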

Selected Publications (Journal Articles, Peer Review):

Raim, A.M., Nichols, E., and Mathew, T. (2023). “A Statistical Comparison of Call Volume Uniformity Due to Mailing Strategy,” Journal of Official Statistics, 39, 103-121.

Raim, A.M., Mathew, T., Sellers, K. F., Ellis, R., and Meyers, M. (2023). “Design and Sample Size Determination for Experiments on Nonresponse Follow-up using a Sequential Regression Model,” Journal of Official Statistics, 39(2), 173-202.

Raim, A.M. (2023). “Direct Sampling with a Step Function,” Statistics and Computing, 33(22). https://doi.org/10.1007/s11222-022-10188.

Lucagbo, M., Mathew, T., and Young, D. (2023). “Rectangular Multivariate Normal Prediction Regions for Setting Reference Regions in Laboratory Medicine,” Journal of Biopharmaceutical Statistics, 33(2), 191-209.

Lucagbo, M. and Mathew, T. (2023). “Rectangular Tolerance Regions and Multivariate Normal Reference Regions in Laboratory Medicine,” Biometrical Journal, 65(3).

Arsham, A., Bebu, I., and Mathew, T. (2023). “Cost-Effectiveness Analysis Under Multiple Effectiveness Outcomes: A Probabilistic Approach,” Statistics in Medicine, 42, 3936-3955.

Arsham, A., Bebu, I., and Mathew, T. (2022). “A Bivariate Regression-Based Cost-Effectiveness Analysis,” Journal of Statistical Theory and Practice, 16, Article No. 27.

Janicki, R., Raim, A.M., Holan, S.H., and Maples, J. (2022). “Bayesian Nonparametric Multivariate Spatial Mixture Mixed Effects Models with Application to American Community Survey Special Tabulations,” The Annals of Applied Statistics, 16(1), 144-168.

Lucagbo, M. and Mathew, T. (2022). “Rectangular Confidence Regions and Prediction Regions in Multivariate Calibration,” Journal of the Indian Society for Probability and Statistics, 23, 155–171.

Morris, D.S. and Sellers, K.F. (2022). “A Flexible Mixed Model for Clustered Count Data,” Stats: Special Issue on Statistics, Data Analytics, and Inferences for Discrete Data, 5(1): 52–69. https://doi.org/10.3390/stats5010004.

Rivas, A., Antoun, C., Feuer, S., Mathew, T., Nichols, E., Olmsted-Hawala, E., and Wang, L. (2022). “Comparison of Three Navigation Button Designs in Mobile Survey for Older Adults,” Survey Practice, 15(1).

Weems, K.S., Sellers, K.F., and Li, T. (2021). “A Flexible Bivariate Distribution for Count Data Expressing Data Dispersion,” Communications in Statistics - Theory and Methods, https://doi.org/10.1080/03610926.2021.1999474.

Feng, X., Mathew, T., and Adragni, K. (2021). “Interval Estimation of the Intra-class Correlation in General Linear Mixed Effects Models,” Journal of Statistical Theory and Practice, 15, Article 65.

Sellers, K.F., Arab, A., Melville, S., and Cui, F. (2021). “A Flexible Univariate Moving Average Time-Series Model for Dispersed Count Data,” Journal of Statistical Distributions and Applications 8 (1). https://doi.org/10.1186/s40488-021-00115-2

Sellers, K.F., Li, T., Wu, Y., and Balakrishnan, N. (2021). “A Flexible Multivariate Distribution for Correlated Count Data,” Stats, 4(2), 308-326, https://doi.org/10.3390/stats4020021.

Zhao, J., Mathew, T., and Bebu, I. (2021). “Accurate Confidence Intervals for Inter-Laboratory Calibration and Common Mean Estimation,” Chemometrics and Intelligent Laboratory Systems, 208. DOI: 10.1016/j.chemolab.2020.104218.

Zimmer, Z., Park, D., and Mathew, T. (2021). “Tolerance Limits under Zero-Inflated Lognormal and Gamma Distributions,” Computational and Mathematical Methods, Special Issue on Statistics, 3. DOI: 10.1002/cmm4.1113.

Morris, D.S., Raim, A.M., and Sellers, K.F. (2020). “A Conway-Maxwell-multinomial Distribution for Flexible Modeling of Clustered Categorical Data,” Journal of Multivariate Analysis. DOI: https://doi.org/10.1016/j.jmva.2020.104651.

Sellers, K.F., Peng, S.J., and Arab, A. (2020). “A Flexible Univariate Autoregressive Time-series Model for Dispersed Count Data,” Journal of Time Series Analysis, 41(3): 436-453.

Sellers, K.F. and Young, D. (2019). “Zero-inflated Sum of Conway-Maxwell-Poissons (ZISCMP) Regression with Application to Shark Distributions,” Journal of Statistical Computation and Simulation, 89 (9): 1649-1673.

Sellers, K.F., and Morris, D. (2017). “Under-dispersion Models: Models That Are ‘Under The Radar’,” Communications in Statistics – Theory and Methods, 46 (24): 12075-12086.

Sellers, K.F., Morris, D.S., Shmueli, G., and Zhu, L. (2017). “Reply: Models for Count Data (A Response to a Letter to the Editor),” The American Statistician.

Young, D.S., Raim, A.M., and Johnson, N.R. (2017). “Zero-inflated Modelling for Characterizing Coverage Errors of Extracts from the U.S. Census Bureau's Master Address File,” Journal of the Royal Statistical Society: Series A. 180(1):73-97.

Zhu, L., Sellers, K.F., Morris, D.S., and Shmueli, G. (2017). “Bridging the Gap: A Generalized Stochastic Process for Count Data,” The American Statistician, 71 (1): 71-80.

Mathew, T., Menon, S., Perevozskaya, I., and Weerahandi, S. (2016). “Improved Prediction Intervals in Heteroscedastic Mixed-Effects Models,” Statistics & Probability Letters, 114, 48-53.

Sellers, K.F., Morris, D.S., and Balakrishnan, N. (2016). “Bivariate Conway-Maxwell-Poisson Distribution: Formulation, Properties, and Inference,” Journal of Multivariate Analysis, 150:152-168.

Sellers, K.F. and Raim, A.M. (2016). “A Flexible Zero-inflated Model to Address Data Dispersion,” Computational Statistics and Data Analysis, 99: 68-80.

Young, D. and Mathew, T. (2015). “Ratio Edits Based on Statistical Tolerance Intervals,” Journal of Official Statistics, 31, 77-100.

Klein, M., Mathew, T., and Sinha, B.K. (2014). “Likelihood Based Inference under Noise Multiplication,” Thailand Statistician. 12(1), pp.1-23. URL: http://www.tci-thaijo.org/index.php/thaistat/article/view/34199/28686.

Young, D.S. (2014). “A Procedure for Approximate Negative Binomial Tolerance Intervals,” Journal of Statistical Computation and Simulation, 84(2), pp.438-450. URL: http://dx.doi.org/10.1080/00949655.2012.715649

Gamage, G., Mathew, T., and Weerahandi, S. (2013). “Generalized Prediction Intervals for BLUPs in Mixed Models,” Journal of Multivariate Analysis, 120, 226-233.

Mathew, T. and Young, D.S. (2013). “Fiducial-Based Tolerance Intervals for Some Discrete Distributions,” Computational Statistics and Data Analysis, 61, 38-49.

Young, D.S. (2013). “Regression Tolerance Intervals,” Communications in Statistics – Simulation and Computation, 42(9), 2040-2055.

Selected Publications (CSRM Research Reports, CSRM Studies, Proceedings Papers, and Other):

Raim, A.M., Livsey, J.A., and Irimata, K.M. (2025+). "Rejection Sampling with Vertical Weighted Strips," https://arxiv.org/abs/2401.09696.

Raim, A.M. (2024). "fntl: Numerical Tools for Rcpp and Lambda Functions," Research Report Series (Computing #2024-01), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Raim, A.M., Ellis, R., and Meyers, M. (2024). "A Multinomial Analysis of Bilingual Training and Nonresponse Follow-up Contact Rates in the 2020 Decennial Census," Research Report Series (Statistics #2024-01), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Raim, A.M. and Nichols, E. (2023). "A Comparison of Map Usability via Bivariate Ordinal Analysis," Research Study Series (Statistics #2023-01), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Raim, A.M. and Sellers, K.F. (2022). "COMPoissonReg: Usage, the Normalizing Constant, and Other Computational Details," Research Report Series (Computing #2022-01), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Irimata, K.M., Raim, A.M., Janicki, R., Livsey, J.A., and Holan, S.H. (2022). "Evaluation of Bayesian Hierarchical Models of Differentially Private Data Based on an Approximate Data Model," Research Report Series (Statistics #2022-05), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Raim, A.M. (2021). "Direct Sampling in Bayesian Regression Models with Additive Disclosure Avoidance Noise," Research Report Series (Statistics #2021-01), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2020). stcos: “Space-Time Change of Support, version 0.3.0,” https://cran.r-project.org/package=stcos.

Zhu, L., Sellers, K., Morris, D., Shmueli, G., and Davenport, D. (2020). cmpprocess: “Flexible Modeling of Count Processes,” version 1.1, https://cran.r-project.org/package=cmpprocess

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2019). “Spatio-Temporal Change of Support Modeling for the American Community Survey with R,” URL: https://arxiv.org/abs/1904.12092.

Sellers, K., Lotze, T., and Raim, A. (2019). COMPoissonReg: “Conway-Maxwell-Poisson Regression, version 0.7.0,” https://cran.r-project.org/package=COMPoissonReg

Sellers, K., Morris, D., Balakrishnan, N., and Davenport, D. (2018). multicmp: “Flexible Modeling of Multivariate Count Data via the Multivariate Conway-Maxwell-Poisson Distribution,” version 1.1, https://cran.r-project.org/package=multicmp

Morris, D.S., Sellers, K.F., and Menger, A. (2017). “Fitting a Flexible Model for Longitudinal Count Data Using the NLMIXED Procedure,” SAS Global Forum Proceedings Paper 202-2017, SAS Institute: Cary, NC.

Raim, A.M., Holan, S.H., Bradley, J.R., and Wikle, C.K. (2017). “A Model Selection Study for Spatio-Temporal Change of Support,” in Proceedings, Government Statistics Section of the American Statistical Association, Alexandria, VA: American Statistical Association.

Heim, K. and Raim, A.M. (2016). “Predicting Coverage Error on the Master Address File Using Spatial Modeling Methods at the Block Level,” In JSM Proceedings, Survey Research Methods Section, Alexandria, VA: American Statistical Association.

Raim, A.M. (2016). “Informing Maintenance to the U.S. Census Bureau's Master Address File with Statistical Decision Theory,” In JSM Proceedings, Government Statistics Section. Alexandria, VA: American Statistical Association.

Raim, A.M. and Gargano, M.N. (2015). “Selection of Predictors to Model Coverage Errors in the Master Address File,” Research Report Series (Statistics #2015-04), Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, D.C.