
Source Data Innovation at the Census Bureau: Improving the Quality and Sustainability of Our Statistics


Our statistical products – like median household income or monthly retail sales – don't just appear out of thin air. Underlying all our data products are one or more sources of more detailed data from which we compute the statistics you see on our website or in the news media. The sources are typically responses to our household and business censuses and surveys, or records in government or private-sector databases.

Surveys Under Stress

The most important factor underlying the quality of the statistics we produce is the quality of the source data from which they're computed. For decades, the primary method for gathering source data at federal statistical agencies has been the sample survey. Such surveys yield statistics about a population by randomly sampling a much smaller subset of units, unlike a census, which measures all units in the population. Pioneered at the U.S. Census Bureau by Morris Hansen, W. Edwards Deming and others in the middle of the 20th century, sample surveys revolutionized federal statistics by giving agencies scientifically sound and economical methods to compute statistics of interest across many topics.
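The core design-based idea – a small random sample can stand in for the whole population – can be sketched with a Horvitz-Thompson estimator of a population total. The population below is simulated, not Census data, and `horvitz_thompson_total` is an illustrative helper, not Bureau code:

```python
import random

def horvitz_thompson_total(sample_values, inclusion_probs):
    """Estimate a population total by weighting each sampled unit
    by the inverse of its probability of being selected."""
    return sum(y / p for y, p in zip(sample_values, inclusion_probs))

# Simulated population of 10,000 "business revenues" (made-up data).
random.seed(42)
population = [random.lognormvariate(10, 1) for _ in range(10_000)]

# Simple random sample of 500 units: every unit has inclusion
# probability n / N, so each sampled unit "stands in" for N / n units.
n, N = 500, len(population)
sample = random.sample(population, n)
estimate = horvitz_thompson_total(sample, [n / N] * n)

true_total = sum(population)
print(f"true total:  {true_total:,.0f}")
print(f"HT estimate: {estimate:,.0f}")
```

With a well-designed sample, the estimate lands close to the true total at a fraction of the cost of measuring all 10,000 units – which is exactly why sample surveys were revolutionary.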

However, the workhorse surveys that served the country well for decades are showing signs of stress. Households and businesses are becoming more reluctant to fill out government (and other) surveys. This trend has been underway for some time, we see it across many countries, and it appears to have accelerated since the pandemic. Declining response rates reduce the quality of statistics computed from surveys and increase survey costs. So far in the United States, agencies like the Census Bureau have been able to maintain high quality by incorporating additional data, adjusting survey weights, oversampling targeted groups, or using other methods. Statistical agencies in some other countries, however, have been forced to withhold publishing data due to quality concerns arising from low response rates.
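To make "adjusting survey weights" concrete, one standard textbook technique is a weighting-class nonresponse adjustment: within each class of similar units, respondents' weights are inflated so they also represent the nonrespondents. This is a minimal sketch with made-up units, not the Bureau's production methodology:

```python
from collections import defaultdict

def adjust_weights_for_nonresponse(units):
    """Weighting-class adjustment: within each class, multiply each
    respondent's base weight by (total weight in class) /
    (respondent weight in class), preserving the class totals."""
    total_w = defaultdict(float)
    resp_w = defaultdict(float)
    for u in units:
        total_w[u["class"]] += u["weight"]
        if u["responded"]:
            resp_w[u["class"]] += u["weight"]
    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total_w[u["class"]] / resp_w[u["class"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted

# Hypothetical sampled units (illustrative only).
sample = [
    {"class": "urban", "weight": 10.0, "responded": True},
    {"class": "urban", "weight": 10.0, "responded": False},
    {"class": "rural", "weight": 20.0, "responded": True},
    {"class": "rural", "weight": 20.0, "responded": True},
]
adj = adjust_weights_for_nonresponse(sample)

# Total weight is preserved (10 + 10 + 20 + 20 = 60): the urban
# respondent's weight doubles to cover the urban nonrespondent.
print(sum(u["weight"] for u in adj))  # 60.0
```

The adjustment only works well when respondents and nonrespondents within a class are actually similar, which is one reason low response rates still threaten quality.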

Big Data to the Rescue?

Over the same period that we’ve been experiencing declining survey response rates, we’ve seen an explosion of digital data that can be used to supplement or even replace existing survey collections. I can’t delve into all the possible benefits and pitfalls of using “big data” for official statistics in this blog, but good discussions can be found in Groves (2011), Jarmin (2019), Abraham et al. (2022), and NASEM (2023). Noting both the costs and benefits, the Census Bureau is carefully exploring several alternative data sources that can help us in a variety of ways across our broad economic and social measurement mission.

Before I describe some of the ways we're tapping new data sources, it's important to note that the Census Bureau has been using nonsurvey data in our work for decades. Administrative data from other government agencies such as the Internal Revenue Service, the Social Security Administration, the Postal Service, and state unemployment insurance offices provide crucial information that underlies our Frames, from which we draw samples for our household and business surveys. These Frames contain high-quality information covering the universes of the primary units that are the subjects of our statistical products: locations (business, residential, and government), organizations (businesses and governments), and households and individuals. They are thus invaluable when we're evaluating alternative data sources, as they help us understand the coverage and characteristics of the units included – for example, the number and locations of retail stores in a third-party dataset. We also use administrative data directly in estimation, as in products like our Small Area Income and Poverty Estimates, Business Dynamics Statistics, Business Formation Statistics, OnTheMap, and the 2020 Census.

Exploring Alternative Data Sources

Let me now turn to some exciting examples of how we’re using novel nonsurvey data sources to make our statistics more accurate, more timely, and more sustainable.

  • Retail Sales – Every month, the Census Bureau releases Advance Monthly Sales for Retail and Food Services, a Principal Federal Economic Indicator. The Monthly Advance Retail Trade Survey (MARTS) is a subsample of the Monthly Retail Trade Survey (MRTS). Both surveys are voluntary and, as with many surveys, we’ve seen decreasing willingness by retail firms to participate. Fortunately, we’ve been able to utilize data on retail sales from third parties like Circana to supplement survey responses. Important for this application is the timeliness of the third-party data. But also important is its accuracy and fitness for purpose. Our data scientists worked closely with the Circana team to carefully evaluate and learn about the data so we could address issues before we used them in production.
  • Monthly State-Level Retail Sales – The MARTS and MRTS are both firm-level surveys where large multistore retailers report only at the national level. As such, the MARTS and MRTS only support national estimates. But users want more detailed data, so we’re leveraging third-party data – in particular, store-level detail from Circana, and state-level totals from NielsenIQ – along with our survey data to produce experimental Monthly State Retail Sales estimates.
  • Retail Prices and Quantities – Our monthly retail surveys collect sales data (i.e., prices × quantities), and separately, the Bureau of Labor Statistics (BLS) collects prices for a sample of products. Some academic colleagues and I have been using data from Circana and from NielsenIQ to prototype retail price and quantity indices directly from transaction-level data in what we call the Re-Engineering Statistics with Economic Transactions (RESET) project. This ambitious research project envisions largely replacing existing monthly Census Bureau and BLS retail surveys with a fully integrated approach that better addresses fundamental issues in price measurement, such as substitution bias, quality change, and product turnover. The price and quantity data will be measured simultaneously for identical items – rather than independently estimated from different samples – allowing more accurate measures of both inflation and expenditure. The project has conducted methodological research, and we are now moving to a demonstration phase in partnership with Circana, where we will produce monthly indices for consumer goods at a cadence similar to that of official statistics. We hope to begin releasing indices for subsets of products in early 2026. If successful, we'll provide the agencies with code and documentation for possible implementation.
  • Better Address Information Using Satellite Imagery – The Census Bureau’s Master Address File (MAF), part of our Geospatial Frame, is the continuously updated address list we employ to mail the Decennial Census, the American Community Survey and our other household surveys. Prior to 2020, the Census Bureau would deploy temporary listers before the census (nearly 150,000 in 2010) to verify all addresses on the MAF, adding new ones and deleting those that were no longer residential housing units. Before the 2020 Census, the Census Bureau leveraged satellite imagery to conduct in-office address canvassing for about 68% of the addresses on the MAF, greatly reducing the need to hire temporary staff and saving over $674 million. Since then, our Geographic Support Program has built on that success by using machine learning for automated change detection in satellite imagery, along with parcel-level administrative data, to keep the MAF up to date throughout the decade. Given the increased accuracy of the MAF from using these new source data and tools, we anticipate only needing to deploy listers in a small number of unusual cases (e.g., in extremely remote areas) before the 2030 Census.
  • Experimental Single-Family Housing Starts From Satellite Imagery – Each month, the Census Bureau produces estimates of Single-Family Housing Starts and Completions using data from our Survey of Construction, which gathers information on a small subset of building permits via interviews conducted by Census Bureau field representatives. Following the success of using imagery for the 2020 Census, the Census Bureau has been exploring the use of imagery and computer vision to supplement survey data in blended experimental estimates for select geographies. Scaling this technique could allow the agency to greatly expand the coverage of starts and completions. Given the expense of detailed imagery, however, a careful comparison of the cost-quality tradeoff vs. traditional personal-visit methods is required before expanding nationwide.
  • Direct Feeds From Businesses – When the Census Bureau sends a business a survey, one or more people there will likely need to query a company database to respond. One of their tasks is translating data stored and organized under the business’s own schema into the concepts we’re requesting. This is burdensome, and, of course, businesses would like to minimize the cost of complying with our surveys. That desire stands in direct contrast to our need for more detailed and timely information from businesses to improve our statistics. We’ve been collaborating with several larger companies to develop automated direct data feeds, which have proven to be a win-win. We get the more timely and detailed data we need, and we dramatically reduce the reporting burden on the companies. For example, one company required 600 person-hours a year to comply with our monthly, quarterly and annual surveys. Engineering the automated feed reduced this to 15 minutes a quarter. To scale this approach economywide, we first need the voluntary cooperation of companies to establish direct feeds. But we also need a way to more efficiently solve the schema mapping problem. When it’s just a handful of companies, our data scientists and subject matter experts (SMEs) can accomplish this with substantial human review. We simply don’t have enough staff to do this for more than a few dozen companies, so we are training a Large Language Model (LLM) to map the idiosyncratic data structures and schemas used by companies to the schema our SMEs need to turn the raw data into official statistics. We’ve made substantial improvements to what our engineers and data scientists are dubbing Valhalla, which will greatly assist in onboarding direct data feeds at scale.
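To make the schema mapping problem in the last bullet concrete: before reaching for an LLM, a naive baseline simply fuzzy-matches a company's column names against the target concepts and escalates misses to SME review. All column and concept names below are hypothetical, and this sketch is emphatically not the Valhalla system:

```python
import difflib

# Hypothetical target concepts the surveys need (illustrative only).
CENSUS_CONCEPTS = ["total_sales", "e_commerce_sales",
                   "inventory_end_of_month", "payroll",
                   "employment_count"]

def map_schema(company_columns, concepts=CENSUS_CONCEPTS, cutoff=0.6):
    """Propose a mapping from a company's column names to target
    concepts by fuzzy string similarity; columns with no close
    match are flagged for human (SME) review."""
    mapping, needs_review = {}, []
    for col in company_columns:
        normalized = col.lower().replace(" ", "_")
        matches = difflib.get_close_matches(
            normalized, concepts, n=1, cutoff=cutoff)
        if matches:
            mapping[col] = matches[0]
        else:
            needs_review.append(col)
    return mapping, needs_review

cols = ["Total Sales", "eCommerce Sales", "EOM Inventory", "GrossPayroll"]
mapping, review = map_schema(cols)
print(mapping)
print("needs review:", review)
```

String similarity handles near-matches like "GrossPayroll", but abbreviations like "EOM Inventory" fall through to human review, which is exactly the kind of idiosyncratic naming that motivates training an LLM for the task.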
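To illustrate the kind of computation the RESET project envisions: with matched prices and quantities for identical items in adjacent periods, a superlative index such as the Törnqvist can be computed directly from the transactions, with expenditure shares from both periods capturing substitution. The items and numbers below are invented for illustration; the actual RESET methodology is far more involved:

```python
import math

def tornqvist_index(prices0, prices1, qty0, qty1):
    """Törnqvist price index over items present in both periods:
    a geometric mean of price relatives, each weighted by the
    item's average expenditure share across the two periods."""
    exp0 = {i: prices0[i] * qty0[i] for i in prices0}
    exp1 = {i: prices1[i] * qty1[i] for i in prices1}
    tot0, tot1 = sum(exp0.values()), sum(exp1.values())
    log_index = 0.0
    for i in prices0:
        share = 0.5 * (exp0[i] / tot0 + exp1[i] / tot1)
        log_index += share * math.log(prices1[i] / prices0[i])
    return math.exp(log_index)

# Hypothetical items and transactions (not Circana data).
p0 = {"milk": 3.50, "bread": 2.00, "eggs": 4.00}
p1 = {"milk": 3.85, "bread": 2.10, "eggs": 3.60}
q0 = {"milk": 100, "bread": 80, "eggs": 60}
q1 = {"milk": 95, "bread": 85, "eggs": 70}

idx = tornqvist_index(p0, p1, q0, q1)
print(f"Törnqvist index: {idx:.4f}")
```

Because prices and quantities come from the same transactions, the index reflects that consumers shifted toward eggs as their price fell, which fixed-basket indices estimated from separate samples miss.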

Wrapping Up

These short descriptions can’t do justice to all the work our teams have accomplished on these and similar efforts across the Census Bureau. For example, we’ve been working with Circana for over 10 years now. New data sources are never “shovel-ready,” as they say. It takes time to evaluate and understand new data sources and how their characteristics impact the statistics we publish. My last blog post described our work building a modern Business Ecosystem at the Census Bureau. This is crucial as we expand our uses of alternative data sources, both due to the scale of the data (e.g., retail transactions or satellite imagery) and the complexity of statistical computations that utilize AI and blended data. Tools like Valhalla will help us more efficiently incorporate new data sources and make them ready for use. The goal of all this is to provide you with higher quality and more sustainable statistics. We’d love your feedback and will keep you informed as we move forward.


Page Last Revised - August 21, 2025