We’re in the midst of data processing for the 2020 Census. As Acting Census Bureau Director Ron Jarmin acknowledged in a recent blog, we’ve discovered some “anomalies” along the way that we’re looking into and resolving.
Today, I’d like to unpack what that means. The word “anomaly” can sound alarming. In fact, the scientific advisory group JASON recently recommended that we consider avoiding the word because of the unwarranted alarm it causes, especially when used without context.
Instead of potentially causing confusion by introducing another word, this blog endeavors to explain this technical term and provide the needed context.
“Anomaly” just means that we’ve found something in our quality review process that doesn’t look quite right. Anomalies found in processing are not errors in the census, but they can turn into errors if we don’t review and resolve them. It’s a feature of our quality check process to find them, and it gives us the opportunity to fix any issues we confirm.
No matter what they are called, these anomalies are a signal that the quality checks on the census are working. Let’s dive deeper into what they are and how we are addressing them.
With an accurate census count as the primary goal, our subject matter experts meticulously go through the response data, comparing population totals against other data sources, such as the 2010 Census, the 2020 population estimates, and the American Community Survey. They also ensure that processing ran as designed. As we review the data, we look for outliers — numbers that don’t fit what we might reasonably expect.
Where we find outliers, we dig deeper to find out what’s going on. If we determine that a fix is needed to correct an error, we fix it.
Examining outliers is a normal part of data processing and the quality checks we do for any census or survey.
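To make the outlier screen concrete, here is a minimal sketch of the kind of comparison described above: tallied counts checked against benchmark sources such as a prior census and population estimates, with large relative deviations flagged for review. The data, threshold, and flagging rule are illustrative assumptions, not the Census Bureau's actual methodology.

```python
# Hypothetical outlier screen: flag areas whose tallied count deviates
# from every available benchmark by more than a chosen fraction.
# All names, numbers, and the 10% threshold are illustrative only.

def flag_outliers(counts, benchmarks, threshold=0.10):
    """Return areas whose count differs from each benchmark that covers
    them by more than `threshold` (as a fraction of the benchmark)."""
    flagged = []
    for area, count in counts.items():
        refs = [b[area] for b in benchmarks if area in b]
        if refs and all(abs(count - r) / r > threshold for r in refs):
            flagged.append(area)
    return flagged

# Illustrative inputs: one tally and two benchmark sources.
tally_2020 = {"County A": 50_000, "County B": 120_000, "County C": 9_000}
census_2010 = {"County A": 48_500, "County B": 118_000, "County C": 15_000}
pop_estimates = {"County A": 49_900, "County B": 121_500, "County C": 14_800}

# County C's tally sits far below both benchmarks, so it gets flagged
# for a deeper look; the other counties track their benchmarks closely.
print(flag_outliers(tally_2020, [census_2010, pop_estimates]))
```

Requiring disagreement with every benchmark, rather than just one, mirrors the idea that a number is suspicious when it fails to fit any of the sources we might reasonably compare it against.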
To date, we have encountered 33 anomalies, which fall into three main categories. In his last blog, Acting Director Jarmin gave some high-level examples of their causes. In this blog, I’ll explain more specifically what kinds we’ve seen, how many, and what we have done to fix them.

The biggest category is “standard” anomalies. These arise in processing any census or survey.
These routine anomalies relate to coding — how the response data appear and are processed in our data files and in the resulting tallies.
First, let me give some background on how coding works. We spend a lot of time ahead of the census thinking through how to deal with both responses that come in and missing information.
Having worked on a number of decennial censuses, the career processing staff at the Census Bureau understand that, for any large, complex data collection, coding can never anticipate every data situation, no matter how well we test it. Knowing this, we meticulously run quality checks looking for outliers in the data.
Where we spot an outlier, we go back and look at the specifications and code to identify where things may have gone wrong. If the code or specifications were incorrect, we write a fix and then test that fix. When we confirm the fix works, we implement it to make sure the data display correctly.
So far in 2020 Census processing, 27 of the 33 anomalies we’ve found are of this type.
Another category of anomalies results from respondent actions that we did not anticipate. The COVID-19 pandemic seemed to exacerbate these.
So far in 2020 we have encountered five anomalies of this type. The most notable example is related to the count of students living in college dorms at some universities.
When the pandemic hit, we strongly encouraged colleges and universities to provide responses for their residents electronically instead of through one of the in-person options we offered. A small number of colleges and universities mistakenly reported the total student population of all their dorms for each individual dorm. If not fixed, this could have inflated the population count on those campuses.
We spotted this error because months before we asked colleges and universities to respond for their students, we asked them to estimate how many students lived on their campuses. We compared these estimates to the response data they ultimately provided, and the numbers stood out as outliers. We confirmed what had happened and then implemented a code fix to correctly distribute the population among those dorms.
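The dorm anomaly and its repair can be sketched in a few lines. Everything here is a hypothetical illustration: the dorm names, counts, capacities, the 25% tolerance, and the proportional redistribution are my assumptions, not the Bureau's actual code.

```python
# Hypothetical illustration of the dorm-count anomaly:
# a campus-wide total mistakenly repeated for each dorm,
# caught by comparison against a pre-collected estimate.

def campus_total(dorm_counts):
    return sum(dorm_counts.values())

# Correct reporting: each dorm reports only its own residents.
correct = {"Hall 1": 300, "Hall 2": 450, "Hall 3": 250}

# Anomalous reporting: the campus-wide total (1,000) entered per dorm.
anomalous = {"Hall 1": 1000, "Hall 2": 1000, "Hall 3": 1000}

# The advance estimate the school gave of students living on campus.
advance_estimate = 1000

def looks_inflated(dorm_counts, estimate, tolerance=0.25):
    """Flag a campus whose summed dorm counts far exceed its estimate."""
    return campus_total(dorm_counts) > estimate * (1 + tolerance)

# One hypothetical repair: spread the campus total across dorms in
# proportion to their share of beds (capacities are illustrative).
capacities = {"Hall 1": 320, "Hall 2": 460, "Hall 3": 260}

def redistribute(total, capacities):
    cap_sum = sum(capacities.values())
    return {d: round(total * c / cap_sum) for d, c in capacities.items()}

print(looks_inflated(correct, advance_estimate))
print(looks_inflated(anomalous, advance_estimate))
print(redistribute(1000, capacities))
```

The advance estimates matter here: without a number to compare against, the tripled campus total would look like just another large campus rather than an outlier.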
The last category of anomalies results from unanticipated census taker action. So far in 2020 we have encountered only one anomaly of this type.
We had a census taker who incorrectly flagged a group quarters facility, which wiped out the response data for the entire facility. We identified the issue during our quality checks and were able to reinstate the response data for all the residents.
I am pleased to report that we have not found any anomalies that are impossible to fix. We have fixed or are fixing every anomaly that our systems and processes have identified so far, and we will continue to look for and fix any that arise as we continue processing the data.
In fact, we completed the second phase of our data processing (validation of the Decennial Response File 2) on Feb. 24. In this phase, we removed duplicate responses and addressed the anomalies that needed correction. We have now begun work on the third phase (Census Unedited File processing).
Finding these anomalies illustrates that our quality checks are working — ensuring we can count everyone once, only once, and in the right place.