The U.S. Census Bureau continuously strives to fulfill its mission to provide quality data about the nation’s people and economy.
Recently, the agency has been exploring ways that artificial intelligence may enhance its ability to create high-quality data products in the face of declining survey response rates, which impact the cost and reliability of survey estimates. This effort has led to the use of machine learning techniques to create a new process that we refer to as cross-survey modeling. This process allows the Census Bureau to enhance the usefulness of federal data products by bridging the gaps between surveys.
Cross-survey modeling involves taking data from a content-rich survey with fewer respondents and applying them to a survey or data source with broader geographic coverage, which allows for estimates at lower levels of geography.
The Census Bureau first used cross-survey modeling to add information about the likelihood that a housing unit had air conditioning to the 2022 Community Resilience Estimates (CRE) for Heat. This was done after an earlier version of the CRE for Heat was published as part of the experimental data product series. The experimental CRE for Heat builds on the Census Bureau’s standard CRE, a modeled data product that measures social vulnerability at the census tract, county, and state levels. Social vulnerability includes characteristics such as poverty and lack of vehicle access, which can make it harder for communities to respond to and recover from natural disasters and emergencies. These vulnerability components are defined using data from the American Community Survey (ACS). Although the vulnerability components come from the ACS, the CRE is not based on ACS data alone. It uses statistical modeling to combine survey data with other sources, improving the precision of the estimates, especially for small geographic areas.
The CRE for Heat focuses specifically on vulnerability to extreme heat. In its initial version, all vulnerability components came from the ACS, but a key factor — whether a home has air conditioning — is not included in the ACS.
To address this, the Census Bureau used cross-survey modeling to bring in air conditioning data from the American Housing Survey (AHS) and align it with ACS records at the household level.
This addition allowed for a more complete measure of heat-related vulnerability, which has been crucial for stakeholders such as the State of Arizona, which uses the CRE for Heat in its preparedness plan for extreme heat.
Cross-survey modeling includes four stages — harmonization, model training, calibration and spatial smoothing.
Harmonization is the process of aligning variables from different data sources to ensure they are comparable. Surveys may categorize similar concepts, such as education level or household structure, differently, and these discrepancies must be resolved before a machine learning model can be trained. With cross-survey modeling for the CRE for Heat, harmonization involves recoding variables from the source (AHS) and target (ACS) into a shared format. For example, detailed education levels in the ACS were collapsed to match the broader categories used in the AHS. This step is critical for creating a unified, harmonized dataset in which predictors have consistent meanings across sources, so that the model’s insights remain valid when transferred from one survey to another.
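The recoding step described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the category labels, the mapping, and the record layout are hypothetical and do not reproduce the actual ACS or AHS codebooks.

```python
# Hypothetical mapping from detailed (ACS-style) education labels to
# broader shared categories. The real survey codebooks differ.
ACS_TO_SHARED_EDUCATION = {
    "no_schooling": "less_than_high_school",
    "grade_9_to_12_no_diploma": "less_than_high_school",
    "high_school_diploma": "high_school",
    "ged": "high_school",
    "some_college": "some_college",
    "associates_degree": "some_college",
    "bachelors_degree": "bachelors_or_higher",
    "masters_degree": "bachelors_or_higher",
    "doctorate": "bachelors_or_higher",
}


def harmonize_education(records, mapping):
    """Return copies of the records with 'education' recoded to shared categories."""
    harmonized = []
    for record in records:
        detailed = record["education"]
        if detailed not in mapping:
            # Fail loudly rather than silently passing through an unmapped code.
            raise ValueError(f"Unmapped education category: {detailed!r}")
        harmonized.append({**record, "education": mapping[detailed]})
    return harmonized


# Hypothetical target-survey records.
acs_records = [
    {"household_id": 1, "education": "ged"},
    {"household_id": 2, "education": "masters_degree"},
]
shared = harmonize_education(acs_records, ACS_TO_SHARED_EDUCATION)
```

Raising an error on unmapped categories is a design choice for the sketch: it surfaces gaps in the crosswalk before any model sees the data.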
After the data sources are harmonized, we identify the most important characteristics and the best type of model to use. After multiple runs and experimentation, we decided to use extreme gradient boosting (XGBoost) to train the machine learning model for the air conditioning predictions because XGBoost models consistently produced higher-quality predictions than the other models we tested. XGBoost constructs decision trees sequentially, with each tree trained to reduce the residual errors of the trees before it.
This modeling technique also helps us identify which characteristics are important in predicting whether a housing unit has an air conditioning unit. In this case, average July wet bulb temperature, educational attainment, average residential energy cost and living on the coastline were key indicators.
The final model built with this methodology and these features predicted with 84% accuracy whether a household in the AHS had an air conditioning unit. This model was then applied to the ACS responses.
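The idea of each tree correcting the errors of the previous ones can be shown with a deliberately simplified sketch. It is not XGBoost itself, which adds regularization and second-order gradient information. It fits one-split trees (stumps) to squared-error residuals on a single made-up feature. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split that minimizes squared error on the residuals."""
    best = None
    # Dropping the largest value keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    correction = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * correction


# Toy data (made up): a temperature-like feature and a 0/1 air conditioning flag.
xs = [60, 65, 70, 75, 80, 85]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
```

In practice a library such as XGBoost would replace this loop; the sketch only shows why adding trees steadily shrinks the remaining error.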
Calibration ensures the modeling results line up with expectations at higher levels of geography. This is done after applying the model parameters to the target survey (in this case, the ACS). As part of the air conditioning model, we used data from the 2020 Residential Energy Consumption Survey (RECS) to ensure our estimates aligned with the state-level estimates published in that survey.
While this third-party benchmark is helpful in determining a reasonable estimate, it is not a required step or the only measure that could have been used. The calibration step could be switched out with other sources in different models. AHS data could also have been used to determine suitable census division estimates. This step keeps estimates consistent with those from higher levels of geography, which are considered more accurate and reliable because they are based on more data.
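One way to picture calibration is as an adjustment that nudges every household-level probability until the weighted state share matches the benchmark. The sketch below is an assumption about how such a step could work, not a description of the Census Bureau's actual procedure: it shifts all probabilities by a common amount on the logit scale and finds that amount by bisection. The probabilities, weights, and benchmark value are made up.

```python
import math


def weighted_share(probs, weights):
    """Weighted average probability, i.e. the implied share of households with AC."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def shift_logit(p, delta):
    """Move a probability by delta on the logit scale (requires 0 < p < 1)."""
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-(logit + delta)))


def calibrate(probs, weights, benchmark_share, tol=1e-9):
    """Find one common logit shift so the weighted share hits the benchmark."""
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        share = weighted_share([shift_logit(p, mid) for p in probs], weights)
        if share < benchmark_share:
            lo = mid  # share too low: shift probabilities upward
        else:
            hi = mid  # share too high: shift probabilities downward
    delta = (lo + hi) / 2
    return [shift_logit(p, delta) for p in probs]


# Hypothetical model outputs and survey weights for one state.
probs = [0.9, 0.6, 0.3, 0.8]
weights = [100, 150, 120, 130]
calibrated = calibrate(probs, weights, benchmark_share=0.75)
```

Working on the logit scale keeps every adjusted value between 0 and 1 and preserves the ranking of households, which a simple multiplicative scaling would not guarantee.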
The final step in creating our estimates is spatial smoothing, a statistical technique that uses data from nearby counties to reduce geographic noise in small-area estimates. It assumes neighboring areas are similar and dampens sudden changes or random noise to reveal clearer patterns across space. In the cross-survey modeling project, spatial smoothing was applied to stabilize county-level predictions of air conditioning prevalence. Although the machine learning model produced accurate household-level probabilities, those predictions could vary sharply across neighboring counties due to sampling variability. To address this, a spatial autoregressive (SAR) model was used to “borrow strength” from surrounding areas by incorporating information from the five nearest neighboring counties within each state, weighted by population size. This approach preserved meaningful spatial patterns while reducing the risk of extreme or implausible values in smaller counties, resulting in more reliable and geographically coherent estimates.
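The notion of borrowing strength from neighbors can be illustrated with a much simpler stand-in for the SAR model: a population-weighted average of each county with its nearest same-state neighbors. This is a sketch of the intuition only and does not estimate a spatial autoregressive model. The county identifiers, coordinates, populations, and estimates are invented.

```python
import math


def smooth_estimates(counties, k=5):
    """Population-weighted average of each county with its k nearest same-state counties."""
    smoothed = {}
    for county in counties:
        same_state = [
            c for c in counties
            if c["state"] == county["state"] and c["id"] != county["id"]
        ]
        # Rank candidate neighbors by straight-line distance between centroids.
        neighbors = sorted(
            same_state,
            key=lambda c: math.dist((c["x"], c["y"]), (county["x"], county["y"])),
        )[:k]
        group = [county] + neighbors
        total_population = sum(c["population"] for c in group)
        smoothed[county["id"]] = (
            sum(c["estimate"] * c["population"] for c in group) / total_population
        )
    return smoothed


# Hypothetical counties: centroid coordinates, population, and raw AC share.
counties = [
    {"id": "A", "state": "AA", "x": 0, "y": 0, "population": 1000, "estimate": 0.95},
    {"id": "B", "state": "AA", "x": 1, "y": 0, "population": 9000, "estimate": 0.70},
    {"id": "C", "state": "AA", "x": 5, "y": 0, "population": 5000, "estimate": 0.60},
    {"id": "D", "state": "BB", "x": 0, "y": 1, "population": 2000, "estimate": 0.40},
]
# k=1 keeps the toy example readable; the text above describes five neighbors.
smoothed = smooth_estimates(counties, k=1)
```

Here the small county A is pulled strongly toward its large neighbor B, while county D, which has no same-state neighbors in the toy data, keeps its original value.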
Cross-survey modeling has broader implications for additional data products at the Census Bureau. In this case, cross-survey modeling was used as part of a multidimensional measure, but it could also be used to complete standalone estimates of household air conditioning availability.
Researchers are also pursuing new measures of other topics (that don’t require additional data collection or interviews) to better understand the nation’s people and economy. Among them: wealth, social connectedness, underemployment and opioid addiction.
In summary, it’s becoming increasingly important to provide local governments and stakeholders with the data they need to make informed decisions. Methodologies like cross-survey modeling and data products like the CRE help provide these entities with the information they need without new data collections. The further use and development of these products and methodologies will allow the Census Bureau to continue to fulfill its mission in new and innovative ways. More information about cross-survey modeling can be found in “Cross-Survey Modeling: Fusing Data from Multiple Data Sources to Enhance Multi-Dimensional Measures.”