Geographic Coding of Administrative Records--Past Experience and Current Research

Douglas K. Sater

Population Division
U.S. Bureau of the Census
Washington, D.C.

April 1993

Population Division Working Paper No. 2


This working paper was presented at the 1993 annual meeting of the Population Association of America, Cincinnati, Ohio, April 1-3, 1993, and at the annual meeting of the American Statistical Association, San Francisco, California, August 8-12, 1993.

The views expressed are attributable to the author and do not necessarily reflect the view of the U.S. Bureau of the Census.
Standard errors are shown in brackets "[ ]" in the text following the figure cited. For percents, they are derived from the square root of P*(1-P)/N, where P is the percent and N is the base number from which the percent is derived.
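As a quick illustration of the standard error formula above, the following Python sketch computes it for a proportion. The function name and the sample figures (94 percent on a base of 10,000 records) are hypothetical and are used only to show the arithmetic.

```python
import math

def percent_standard_error(p, n):
    """Standard error of a proportion p (between 0 and 1) on base n.

    Implements sqrt(P * (1 - P) / N). The result is a proportion, so
    multiply by 100 to express it in percentage points.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical example: 94 percent observed on a base of 10,000 records.
se = percent_standard_error(0.94, 10_000)
print(f"standard error = {100 * se:.2f} percentage points")
```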



The Population Estimates Branch of the Bureau of the Census annually produces estimates of the population for states and counties, and biennially produces estimates of the population for the 36,000 general purpose units of local government. These include all incorporated places and all functioning minor civil divisions (MCDs). Where places are split by county and/or MCDs, the estimates are made for each place/county/MCD piece and then aggregated into place totals.

There are two basic approaches to the estimate process 1/, which have different requirements for the geographic coding of the administrative records. The first approach is to use the aggregate numbers from the administrative records or change in the aggregate numbers as symptomatic indicators of population change (or to use the change in each subarea's share of the parent area's aggregate numbers as symptomatic indicators of population change). This approach places higher priority on the number of returns coded (coding rate) and annual stability in the coding rate.

The second approach is the Administrative Records method, which estimates each of the components of population change -- births, deaths, international migration and internal migration. To measure internal "migration", the Census Bureau geographically codes the current year individual income tax returns and matches them to the prior year's geographically-coded file. A comparison of the geographic codes on the returns matched between the two years determines the in-migrants, out-migrants and non-migrants. The quality and consistency of the geographic coding has a direct effect on the quality of the "migration" data, and hence, on the quality of the population estimates. Note also that the IRS data is not migration data, per se, but is a measure of the movement of people between geographic areas.

The individual income tax returns do not contain any geographic information, but they do contain a complete mailing address. Specifically, street address, post office name, 9-digit ZIP Code and mailing state abbreviation are included. The mailing address represents the location at which the taxpayer wants contact with the IRS to occur. For most taxpayers, the mailing address is the same as the residence address. For others, it could be a place of business, a tax preparer or accountant, a post office box, a second residence (for dual residents), parents' address, etc. Geographic coding to mailing address rather than to residence will affect the migration data in two ways. First, some residence movers will be missed because the mailing address did not change, and false movers will be created solely because the mailing address was changed. Second, measured migration may be from/to incorrect geography.

This paper discusses current methods of assigning geographic codes to the Federal individual income tax returns and test results of potential new methods. Ideally, we would want to evaluate the geographic coding process in terms of the impact on the reliability of the resultant migration data. However, there is no good benchmark for evaluating the migration data, or the resources to create such. Thus, the evaluations will focus on the coding rates and on the quality of the assigned geographic codes. The paper is broken into five basic sections as described below.

Section I -- Direct Data Collection and Probabilistic Coding. This section covers the collection of residence information in the individual income tax returns, the development of the probability coding files, geographic coding based on these files, coding rates, and the annual decline in the quality of the coding.

Section II -- Coding by ZIP Code. Areas covered include a discussion of the 5-digit ZIP Code and its relationship to geographic areas, the post office's ZIP+4 assignment process, the post office's ZIP+4-to-county cross reference file, the editing of the file, the development of a ZIP/sector-to-county coding file, an assessment of coding rates using the coding file, and a discussion of the quality of the assigned county codes.

Section III -- Address Types on Income Tax Returns. This section outlines the post office's addressing guidelines, shows the address types contained in the 1990 decennial census address control file, and discusses the various address types and the quality of the address information in the individual income tax returns.

Section IV -- Address Based Geographic Coding. Included in this section is a discussion of the geographic coding procedures in the Census Bureau's TIGER system, a report on results of TIGER coding of a sample of individual income tax return addresses, and an assessment of coding rates if the TIGER is expanded to include all city type addresses.

Section V -- Future Work. The paper closes with a brief outline of other research activities that need to be conducted.

The research presented in this paper relates specifically to the geographic coding of addresses in the individual income tax returns. However, it is applicable to analogous addresses in other administrative record systems.

Section I -- Direct Data Collection and Probabilistic Coding

This section covers the collection of residence information in the individual income tax returns, the development of the probability coding files, geographic coding based on these files, coding rates, and the annual decline in the quality of the coding.

A. Direct data collection

One approach to geographically coding the individual income tax returns would be to ask the taxpayers directly about the geography in which they live. However, program agencies are prohibited from collecting information that is not directly connected with the administration of their program. A special legislative change to the IRS code was required to permit the direct collection of the residence information in the individual income tax returns. The Census Bureau asked for residence information on the 1972, 1974 and 1980 income tax returns. The following residence questions were included in the 1980 income tax returns.

A. Where do you live?
   State / City, village, borough, etc.
B. Do you live within the legal limits of a city, village, etc.?
   Yes / No
C. In what county do you live?
D. In what do you

A clerical staff at IRS used reference materials provided by the Census Bureau to manually code these responses. The coding procedures were designed to code as many returns as possible, even cases in which "incomplete" responses were supplied. The 6-digit Geographic Reference Identification Number (GRIN) code assigned to each set of responses was posted and then data-keyed. This GRIN code was converted to a unique set of reported state, county, MCD and place codes.

The first obvious limitation of this approach is cost. The manual coding and keying of the responses is a very expensive process. The 1980 process cost about 10 cents per case, or about $9 million.

A second limitation in the process is that the questions can only ask about geography that the taxpayers should (mostly) know, such as state, county, MCD and place. Unorganized territories, census statistical entities, tracts, and blocks do not fall in this category. States and most counties are fairly well reported. However, there are some counties that pose some problems. For example, Baltimore City, MD is an independent city (which is treated as a county equivalent), but there is also a Baltimore County. The difference between Baltimore City and Baltimore County caused some misreporting and miscoding.

A third limitation in taxpayer reporting is that the "city name" reported in Question A was not always the name of an actual city (or town). Often the name reported in this item was a post office name, so procedures were needed to account for these cases. The coding materials were designed to contain post office names as well as actual city (or town) names; thus, it was possible to assign GRIN codes to these names.

If the name was a post office name within an unincorporated area, the code for that area was assigned. If the name was for an incorporated area, more editing needed to be done. For most persons who lived in an incorporated area, there was a fairly unambiguous relationship among city name, post office name, and ZIP Code area. Generally, the city name was used as the post office name and the ZIP Code area was entirely included in the city limits. In other cases, the city and post office name differed, but the ZIP Code area was entirely within the city limits. These were coded to the incorporated area. However, there were cases in which the city name and post office name were the same, but the corporate limits of the city did not coincide with the boundaries of the ZIP Code area. In order to assign the correct state, county, MCD and place codes for these returns, it was necessary to use the tax filer's response to Question B. Filers who indicated "YES" to Question B were coded to the incorporated place. Those who indicated "NO" were assigned to the balance of the county or MCD. Responses to Question B, however, also contained response error. For the ZIP Codes in 1980 that were split between Wilmington City, DE and the balance of New Castle County, 5 percent of those living inside the city limits erroneously reported "no" and 15 percent of those living outside the city limits erroneously reported "yes".
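The Question B rule for split ZIP Codes, and the effect of the response error reported for Wilmington, can be sketched as follows. This is an illustrative sketch only: the function names and the filer counts of 1,000 are hypothetical, and only the 5 percent and 15 percent error rates come from the text.

```python
def code_split_zip(answer_b, place_code, balance_code):
    """Choose between the incorporated place and the balance of the
    county or MCD, using the filer's answer to Question B."""
    answer = (answer_b or "").strip().upper()
    if answer == "YES":
        return place_code
    if answer == "NO":
        return balance_code
    return None  # unanswered: the return needs some other coding step

def reported_place_counts(true_inside, true_outside,
                          inside_says_no=0.05, outside_says_yes=0.15):
    """Expected counts coded to the place and to the balance when the
    Question B answers are taken as reported.

    The default error rates are the 1980 Wilmington, DE figures.
    """
    coded_inside = (true_inside * (1 - inside_says_no)
                    + true_outside * outside_says_yes)
    coded_outside = (true_inside * inside_says_no
                     + true_outside * (1 - outside_says_yes))
    return coded_inside, coded_outside

# Hypothetical split ZIP Code: 1,000 filers inside and 1,000 outside the city.
inside, outside = reported_place_counts(1000, 1000)
print(f"coded inside: {inside:.0f}, coded outside: {outside:.0f}")
```

With equal true counts, the two error rates do not cancel: the place is overstated by roughly 10 percent in this hypothetical case.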

Another problem to handle was nonreporting and incomplete reporting of all the parts of geographic information needed to completely code the record. One type of partially reported code was where a place was reported that was split by county and/or MCD, but the county or MCD was not reported. In this case, the code was salvaged by keeping the reported place code and allocating a value to the missing county and/or MCD code. Another type of partial code is where the county code was reported, but not the place and/or MCD. Overall, 70 percent of the responses were completely reported and data captured, 18 percent were partial responses, 5 percent were not codable at all, 4 percent were not reported or were reported only at the state level, and 3 percent were unrecoverable keying errors.

B. Probability Coding Guide

The direct data collection effort did not provide adequate coding rates on its own, and the process was too expensive to do annually. Therefore, we built in the probability coding process. The Probability Coding Guide was developed using mailing state code, the ZIP Code, post office name, the address type, and the reported geographic codes. Essentially, the coding guide is a summary tabulation of the number of returns for each unique geographic code within a specific Key-4 group. The Key-4 parameters are: mailing state code, 5-digit ZIP Code, first 9 characters of post office name, and address type. An example of an entry in the coding guide is:

KEY-4 = 24-20772-UPPERMARL-1    
Codes = 24-033-000-0995

In addition to the primary coding guide, two secondary coding guides were developed to provide a code for cases that were not coded by the primary guide. These additional coding guides are similar to the primary probability guide, except for the keys that are used to enter each. One secondary coding guide has a key defined as ZIP Code/address type. The other secondary coding guide has a key using state/first 9 characters of post office name/address type. The secondary guides were obtained by collapsing the primary guide for the appropriate keys.
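The tabulation and collapsing described above can be sketched in Python. This is a minimal sketch, not the production procedure: the function names, the record layout, the post office spelling "UPPERMARLBORO", the second geographic code, and all counts are assumptions made for illustration. Only the Key-4 structure and the first geographic code come from the example entry.

```python
from collections import Counter, defaultdict

def build_primary_guide(records):
    """Tabulate returns by Key-4 and reported geographic code.

    Each record is (state, zip5, post_office, address_type, geo_code).
    Returns {key4: Counter({geo_code: number_of_returns})}.
    """
    guide = defaultdict(Counter)
    for state, zip5, post_office, address_type, geo_code in records:
        # Key-4 uses only the first 9 characters of the post office name.
        key4 = (state, zip5, post_office[:9], address_type)
        guide[key4][geo_code] += 1
    return dict(guide)

def collapse_guide(primary, key_func):
    """Collapse the primary guide onto a coarser key by summing counts."""
    collapsed = defaultdict(Counter)
    for key4, counts in primary.items():
        collapsed[key_func(key4)].update(counts)
    return dict(collapsed)

# Hypothetical records (state, ZIP, post office, address type, geo code).
records = [
    ("24", "20772", "UPPERMARLBORO", "1", "24-033-000-0995"),
    ("24", "20772", "UPPERMARLBORO", "1", "24-033-000-0995"),
    ("24", "20772", "UPPERMARLBORO", "1", "24-033-000-0000"),
    ("24", "20772", "UPPERMARLBORO", "2", "24-033-000-0000"),
]
primary = build_primary_guide(records)
# Secondary guide keyed on ZIP Code / address type.
by_zip = collapse_guide(primary, lambda k: (k[1], k[3]))
# Secondary guide keyed on state / post office name / address type.
by_post_office = collapse_guide(primary, lambda k: (k[0], k[2], k[3]))
print(primary[("24", "20772", "UPPERMARL", "1")])
```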

C. Probability Coding

The coding guides were used for two purposes. First, they were used to assign codes to those returns in the 1980 file without a complete reported code. Second, they were used annually thereafter to assign codes to movers and new filers.

An address on the tax record is geographically coded in the following manner: (1) a Key-4 (state, ZIP Code, 1st 9 characters of post office name, and address type) is defined from the mailing address of the record needing to be coded; (2) this key is then located in the probability coding guide; (3) a random number generator is called; and (4) a geographic code (state, county, MCD and place) is assigned from the appropriate distribution based upon the random number that was generated. If the return cannot be assigned to a Key-4 (for example, no ZIP Code) or the assigned Key-4 is not in the Probability Coding Guide, attempts are made to code the return with a secondary coding guide. If the primary and both secondary coding guides fail, the return is considered to have an uncoded address. Note that this process is based on a statistical approach. For this coding method, misclassification is not of great concern, but rather the importance is placed on the overall level of coding to appropriate geographic areas.
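The four coding steps and the fallback to the secondary guides can be sketched as below. This is a sketch under stated assumptions: the function name, the guide layout (a dictionary of counts per key), the order in which the two secondary guides are tried, and the sample counts are all hypothetical, since the text does not specify them.

```python
import random

def assign_geocode(key4, primary, by_zip, by_post_office, rng=random):
    """Assign a geographic code by sampling a coding guide distribution.

    key4 is (state, zip5, post_office_9, address_type). Each guide maps a
    key to {geo_code: number_of_returns}. Returns (geo_code, source), or
    (None, "uncoded") when no guide contains a usable key.
    """
    state, zip5, post_office, address_type = key4
    attempts = [
        ("primary", primary, key4),
        ("secondary-zip", by_zip, (zip5, address_type)),
        ("secondary-post-office", by_post_office,
         (state, post_office, address_type)),
    ]
    for source, guide, key in attempts:
        if None in key:
            continue  # a key part is missing, e.g. no ZIP Code on the return
        counts = guide.get(key)
        if not counts:
            continue
        codes = list(counts)
        weights = [counts[code] for code in codes]
        # Draw one code with probability proportional to its tabulated count.
        return rng.choices(codes, weights=weights, k=1)[0], source
    return None, "uncoded"

# Hypothetical guide entries; the counts are invented for illustration.
dist = {"24-033-000-0995": 3, "24-033-000-0000": 1}
primary = {("24", "20772", "UPPERMARL", "1"): dist}
by_zip = {("20772", "1"): dist}
by_post_office = {("24", "UPPERMARL", "1"): dist}

rng = random.Random(1993)  # seeded only so the sketch is repeatable
print(assign_geocode(("24", "20772", "UPPERMARL", "1"),
                     primary, by_zip, by_post_office, rng))
print(assign_geocode(("24", None, "UPPERMARL", "1"),
                     primary, by_zip, by_post_office, rng))
```

Because each code is drawn at random, an individual return may be misclassified, but the coded totals should approximate the tabulated distribution for the key.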

The basic advantage of the probability coding system is that it is designed to handle two types of coding situations. First, it codes to the subcounty level for ZIP Codes that are split at the subcounty level. Second, it includes a residence adjustment factor for the cases where the mailing address does not reflect the actual residence. The other advantages are the efficiency of the annual processing (it is quick and inexpensive) and the high coding rate.

To examine the coding rates for the probability coding system we selected a sample of the 1988 individual income tax return file as a test file, and tabulated the number of returns by source of code (primary coding guide, secondary coding guides, or uncoded) by state. This coding system was able to code virtually all addresses (99.96 percent) to the county and subcounty level. It was even able to code 99.6 percent of the returns with no street address (based on ZIP Code, post office name). Most returns (96.6 percent) were coded using the full Key-4 (that is, the primary coding guide). The coding rates are not shown by state in this paper, but all coding rates were in excess of 98.0 percent; the state with the lowest coding rate was Alaska, where 98.3 percent of all addresses were coded.

D. Annual Decline in Probability Coding Quality

The quality of the coding deteriorates over time. The coding guides are based on ZIP to geographic relations that existed in 1980. However, these relationships change over time. The post office changes ZIP Code boundaries, creates new ZIP Codes, deletes old ZIP Codes, converts rural route type addresses to city type deliveries, cities annex and de-annex areas, new cities incorporate, etc. This is why updated residence information was obtained during the 1970s.

In the 1980s, however, it was no longer feasible to do so. Thus, we manually updated the 1980 probability coding guides to the extent possible. First, in 1987, post office ZIP Code editing materials were used to update/correct the ZIP Codes in the 1980 file. The coding guides were then rebuilt. Improved editing procedures also were implemented. For the most part, adjustments to the coding guides to account for governmental unit changes (annexations, new incorporations, dis-incorporations, etc.) also were made. This was a tedious, labor-intensive manual review/update process involving a host of source materials, including old and new ZIP Code maps, original and updated decennial census block maps and housing unit counts, atlases, and ZIP Code directories. Where the adjustment to the coding guide was a substantial improvement, the changes were incorporated. Adjustments could not be made for some areas, however, because information was insufficient or the accuracy of the adjustment was questionable. Also, adjustments due to address conversions could not be made.

Even if the adjustments were complete and accurate, the quality of the coding drops over time simply by the nature of the coding process. As tax filers move in years subsequent to the base year (1980), they are assigned a "new" set of geographic codes according to the appropriate 1980 probability distribution. Implicit in the coding process is the assumption that the geographic distribution is the same for filers who move into the ZIP Code area, filers who move out of the ZIP Code area, and filers who do not move, and that the relationship does not change over time. To the extent that this is not true and as the cumulative percentage of tax filers who move at least once between 1980 and the current tax year increases, the overall error rate increases.

Clearly, the Census Bureau cannot continue this process well into the 1990s, given that new methods can be developed based on hardware, software, and basic data files that did not exist in the 1970s and 1980s. Ideally, we need to develop a new address-based geographic coding system that is quick, efficient, and accurate, that codes most addresses to the required geographic levels and that can be easily updated. The coding process is also not limited to one method and coding data source; it can be a hybrid of several, taking the best of each.

Section II -- Coding by ZIP Code

Areas covered in this section include a discussion of the 5-digit ZIP Code and its relationship to geographic areas, the post office's ZIP+4 assignment process, the post office's ZIP+4-to-county cross reference file, the editing of the file, the development of a ZIP/sector-to-county coding file, an assessment of coding rates using the coding file, and a discussion of the quality of the assigned county codes.

A. County Coding by ZIP Code

One specific goal of our research efforts is to be able to accurately code to county quickly and efficiently so that the state and county population estimates can be produced in an integrated process. Given the current production schedule and methods of processing, that leaves 1 to 2 weeks for the county coding and migration production process. This production schedule precludes the use of address-based coding systems for county coding. However, coding to county by using ZIP-to-county cross reference files is a promising avenue.

Coding to county by using the ZIP Code in the mailing address does not assume that the mailing address is the same as the residence address, but it does implicitly assume that they are in the same county.

B. ZIP Code Assignment

ZIP Codes are designed to deliver mail. The ZIP Codes and areas of responsibility are assigned to handle the mail as efficiently as possible and (mostly) without regard to geographic boundaries. In a technical sense, ZIP Codes are not area based, but a collection of delivery points. However, the delivery points of each ZIP Code usually can be assembled into an area (with boundaries). A ZIP Code can also be assigned to a unique delivery point such as a university, government building, business, or a group of post office boxes.

At the state level, most ZIP Codes deliver wholly within the state, but a few do deliver to out-of-state areas. At the county level, some ZIP Codes cross county boundaries, but most deliver wholly within the county. The ZIP Codes that are split by state or county, however, pose problems for coding by ZIP Code.

In selected parts of the country, there are also postal delivery processes that pose special problems. In Alaska, for example, there are post offices that serve as intermediate drop off points, where mail is held in pouches for later delivery to a remote area such as a logging camp, fishery, etc. These are now being changed to post office boxes, with a three-character alphabetic code as part of the box number, but they still pose problems for geographic coding by ZIP Code. Also, there are areas that have no house-by-house delivery and individuals have to pick up their mail from the post office. Such individuals may also have a choice of post offices. In such cases, direct coding by ZIP Code may be problematic.

One option is to create a ZIP to county cross-reference file by collapsing the 1980 primary coding guide to ZIP/state/county, using only one possible county. This incorporates the 1980 mailing to residence adjustment. However, it is an old adjustment, and only the 5-digit ZIP Code is available. Based on our experience with the 1980 coding guide, we estimate we could code to the county level using 5-digit ZIP Code (only) with about 96 percent accuracy overall. However, quality of coding will vary dramatically by county. For many of the large counties, the coding will be good, but for most of the small counties, the coding will be very poor. Some counties will not be coded at all. Additionally, independent cities, such as Baltimore City, MD or Manassas Park City, VA and the surrounding counties will have substantial problems in the coding.
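Collapsing a coding guide to a single county per 5-digit ZIP Code might look like the following sketch. It assumes that "only one possible county" means the county with the most returns in the ZIP Code; that reading, the function name, the guide layout, and the sample counts are assumptions, not details given in the text.

```python
from collections import Counter, defaultdict

def zip_to_dominant_county(primary_guide):
    """Collapse a Key-4 guide to one (state, county) per 5-digit ZIP Code.

    primary_guide maps (state, zip5, post_office_9, address_type) to
    {geo_code: returns}, where geo_code is "SS-CCC-MMM-PPPP".
    Returns {zip5: ((state, county), share_of_returns_in_that_county)}.
    """
    totals = defaultdict(Counter)
    for (_, zip5, _, _), counts in primary_guide.items():
        for geo_code, returns in counts.items():
            state, county = geo_code.split("-")[:2]
            totals[zip5][(state, county)] += returns
    result = {}
    for zip5, counts in totals.items():
        county, returns = counts.most_common(1)[0]
        result[zip5] = (county, returns / sum(counts.values()))
    return result

# Hypothetical guide: one ZIP Code split between two counties.
guide = {
    ("24", "20772", "UPPERMARL", "1"): {"24-033-000-0995": 6,
                                        "24-003-000-0000": 2},
    ("24", "20772", "UPPERMARL", "2"): {"24-033-000-0000": 2},
}
for zip5, (county, share) in zip_to_dominant_county(guide).items():
    print(zip5, county, f"{share:.0%} of returns in the dominant county")
```

The share returned for each ZIP Code gives a rough view of where 5-digit coding is likely to be weak: every return in the minority counties of a split ZIP Code would be coded to the wrong county.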

For many large cities (excluding the independent cities), most of the ZIP Codes are wholly contained within the city. Geographic coding to large cities using 5-digit ZIP Code (only) may be feasible. For small places and the more sparsely populated areas, the ZIP Codes tend to cover several subcounty areas. Geographic coding to such subcounty areas using 5-digit ZIP Codes would not be very good at all.

C. ZIP + 4 Code

A few years ago, the post office assigned an additional 4 digits to the existing 5-digit ZIP Code to make mail handling and delivery more efficient. The +4 code is actually two codes in one -- the first 2 digits are the sector and the last 2 digits are the segment within the sector. The following description of the ZIP+4 assignment process was prepared by Suzanne Shepherd 2/. But first, a cautionary note: these are guidelines established by the post office, and individual postmasters have flexibility in implementing them.

"The U.S.P.S. perceives ZIP+4 codes in city-style address areas as essentially geographic in nature. A city-style address typically is an address in structure number-street name form, such as "4320 Huntingtown Road." The first two digits of the +4 add-on, which is referred to as the "sector" component, typically represents a block group (but is not coincident with Census Bureau-defined block groups). The last two digits of the +4 add-on, which is referred to as the "segment" component, typically represents a block side, a company, a unit within a company, a building, or a floor within a building.

To establish ZIP+4 Codes, the U.S.P.S. plots a 5-digit ZIP Code boundary on a street map and uses main thoroughfares to cut the 5-digit ZIP Code area into preliminary sectors. The U.S.P.S. then counts the number of block sides and the number of companies that receive 10 or more mailing pieces. If these two numbers total more than 50 in a primarily commercial area, the preliminary sector usually is further divided. If these two numbers total more than 70 in a primarily residential area, the preliminary sector usually is further divided. These thresholds are merely guidelines that change somewhat due to a preliminary sector's growth potential. For example, if a preliminary sector contains a lot of open area, the U.S.P.S. will lower the number, but if a preliminary sector is already quite congested, the U.S.P.S. will raise the number." 2/
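The subdivision guideline quoted above reduces to a simple threshold test, sketched here. The function and parameter names are hypothetical. The thresholds of 50 and 70 come from the quoted guidelines, which also note that the numbers shift with a sector's growth potential, so the test is indicative only.

```python
def sector_usually_divided(block_sides, large_mail_companies, area_type):
    """Apply the quoted guideline thresholds to a preliminary sector.

    block_sides: number of block sides in the preliminary sector.
    large_mail_companies: companies receiving 10 or more mailing pieces.
    area_type: "commercial" (threshold 50) or "residential" (threshold 70).
    Returns True when the guideline says the sector is usually divided.
    """
    thresholds = {"commercial": 50, "residential": 70}
    if area_type not in thresholds:
        raise ValueError("area_type must be 'commercial' or 'residential'")
    return block_sides + large_mail_companies > thresholds[area_type]

# Hypothetical preliminary sector: 40 block sides and 15 large-mail companies.
print(sector_usually_divided(40, 15, "commercial"))
print(sector_usually_divided(40, 15, "residential"))
```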

The sector map of the City of Cambridge, Ohio, shows the city and a small portion of the surrounding area. The city and the surrounding area are covered by a single 5-digit ZIP Code. The sectors for the city style deliveries have been overlaid on the map. These boundaries have been derived from an examination of the ZIP+4 Codes on residential address lists. For exposition purposes, the boundaries have been expanded to the nearest physical feature (river, interstate highway, etc.), to include uninhabited area (such as city parks, cemeteries, etc.). Also, some sectors that have only business deliveries may not be shown on the map.

From the map, we can see that the sectors are formed by adjacent blocks and block faces, and can be bounded by a polygon. The polygons are mutually exclusive and encompass the entire city style delivery area. We can also see from the map that a sector includes deliveries on both sides of a street at a sector boundary. Other Post Offices may choose to have the sector boundary in the middle of the street, with even numbered addresses in one sector and odd numbered addresses in another sector.

The shaded areas to the north and to the southeast of the delineated sectors show area inside the city limits that does not have city style deliveries. These areas are covered by the rural route style deliveries, even though the addresses are of the house number/street name format.

"When segment numbers are depleted within a particular sector area, which we may also refer to as a ZIP+2 area, the U.S.P.S. inserts another sector area within the original sector area. This additional sector area may split the original sector area, creating two discontiguous sector areas with the same sector number. Segment numbers are unique for a sector number. The number assigned to the new, inserted sector area is previously unused within the particular 5-digit ZIP Code area. Residential-to-commercial rezoning typically causes segment number depletion." 2/

Sector map of the City of Cambridge, Ohio

"In areas that have rural-style addresses, the U.S.P.S. assigns +4 add-ons according to a letter carrier's line of travel. Therefore, ZIP+4 Codes in these areas do not refer to geographic areas.

In areas that have rural-style addresses, a street segment receives a +4 add-on only if it is part of a letter carrier's route. The U.S.P.S. differentiates between block sides only if a carrier stops on both sides of the street to deliver mail. The first rural route for a 5-digit ZIP Code usually has a sector number of "97", the second rural route has a sector number of "96", and so forth. The +4 add-ons for a rural route typically go from "9701" to "97nn", with "9701" being the first street segment on which the carrier delivers mail and "97nn" being the last." 2/

The rural route map of Cambridge, Ohio, shows the delivery path of two of the 9 rural route sectors from the Cambridge Post Office. The solid line is sector 94 and the dashed line is sector 97. It is obvious from the map that these sectors are not geographically based. They deliver to a few addresses in the city limits, and to addresses in several townships outside the city limits. In short, these sectors wind all over the countryside. They do not, however, cross into another county.

There are two other interesting facets of the rural route deliveries for the Cambridge Post Office. Most of the area has been converted to house number/street name format and is covered by sectors 90 to 97. Sectors 91 to 97 cover most of the area in a linear fashion. Sector 90 is comprised of scattered street segments not covered by sectors 92 to 97. Also, the few areas that have not been converted to house number/street name format are all lumped together in sector 98.

"If a rural route crosses a county boundary, the sector number changes, typically to another number in the nineties, and the U.S.P.S. numbers the segments in sequence beginning with "01". If the rural route crosses back into the original county, the +4 numbering resumes where the original +4 numbering left off. For example, if "9718" was the last +4 number assigned before the rural route crossed into another county, then "9719" is the first +4 assigned when the rural route crosses back into the original county.

When a group of rural mail boxes receive mail from different letter carriers, their sector numbers are different and there may be no pattern to the +4 add-ons. For example, the +4 add-ons for a group of rural mail boxes may be "9601", "9622", "9705", and "9601" again, because the mail boxes are not only on different rural routes, but on routes coming out of different 5-digit ZIP Codes. If a structure receives mail via a rural route, its mail box does not need to be anywhere near the structure." 2/

Map of Cambridge, Ohio, showing two of the nine rural route sectors

"If a jurisdiction establishes city-style addresses and the U.S.P.S. adopts them for mail delivery, the U.S.P.S. reassigns the +4 numbers." 2/

Additionally, sectors 00 through 09 are usually reserved for the P.O. boxes. Sectors 98 and 99 are usually reserved for the postmaster and for "business mail reply".
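For readers working with these codes, the sketch below splits a ZIP+4 Code into its 5-digit ZIP Code, sector, and segment, and attaches a rough label to the sector based on the usual conventions described in this section. The function names and labels are hypothetical, and the labels are hints only: the Cambridge example shows that local practice can differ (sector 98 there holds unconverted rural addresses), and rural route sectors may fall below 90 where a post office has many routes.

```python
import re

# Five digits, an optional hyphen, then the 2-digit sector and 2-digit segment.
ZIP4_PATTERN = re.compile(r"^(\d{5})-?(\d{2})(\d{2})$")

def parse_zip4(zip_plus_4):
    """Split a ZIP+4 Code into (zip5, sector, segment) strings."""
    match = ZIP4_PATTERN.match(zip_plus_4.strip())
    if match is None:
        raise ValueError(f"not a ZIP+4 Code: {zip_plus_4!r}")
    return match.groups()

def sector_hint(sector):
    """Return a rough, convention-based label for a 2-digit sector."""
    number = int(sector)
    if number <= 9:
        return "usually post office boxes"
    if number >= 98:
        return "usually postmaster or business reply mail"
    if number >= 90:
        return "often a rural route"
    return "other (often city-style delivery)"

# Hypothetical ZIP+4 Code used only to show the split.
zip5, sector, segment = parse_zip4("20772-9701")
print(zip5, sector, segment, "->", sector_hint(sector))
```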

The +4 codes are used by the IRS in the mailing address. For the 1988 IRS 1-percent sample file, 94 percent of all addresses had the +4 codes, 98 percent for house number/street name type addresses, 91 percent for rural routes and 98 percent for P.O. boxes.

D. ZIP+4 to County Cross Reference File

The post office has created a ZIP+4 to county cross reference file which could serve as the basis for the county coding process. The file is a quarterly product and is updated to reflect changes occurring since the prior release. That is, new ZIP Codes are added, discontinued ZIP Codes are deleted, and changes to ZIP Codes or +4 codes are incorporated.

The ZIP+4 to county cross reference file contains a record for each unique ZIP+4 Code, or about 24 million records. Two exceptions to this are as follows: (a) If a business (or government agency) has more than one +4 code assigned to it, the file will have only one record with the data on the record showing the range of +4 codes assigned; (b) the same may be true for post office boxes.

The file contains the following data items: ZIP Code, lowest sector/segment of the range, highest sector/segment of the range, 2-character state abbreviation, county code and county name. Note that the state abbreviation identifies the state in which the post office is located, and the county represents the county in which the mail is delivered. That is, in a few cases, the county may be in a different state than the state identified. There are no street names or address range information contained in this file.
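Because each record carries a low and a high sector/segment value, a county lookup is a range search within the ZIP Code. The following is a minimal sketch assuming the ranges within a ZIP Code do not overlap. The class name, the row layout (county name omitted), and the sample rows with ZIP Code "12345" and state "XX" are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

class Zip4CountyIndex:
    """Look up a county from (zip5, +4) using low/high +4 ranges."""

    def __init__(self, rows):
        # rows: iterable of (zip5, low_plus4, high_plus4, state, county_code)
        self._ranges = defaultdict(list)
        for zip5, low, high, state, county in rows:
            self._ranges[zip5].append((int(low), int(high), state, county))
        for ranges in self._ranges.values():
            ranges.sort()
        # Sorted range starts per ZIP Code, used for the binary search.
        self._lows = {z: [r[0] for r in rs] for z, rs in self._ranges.items()}

    def lookup(self, zip5, plus4):
        """Return (state, county_code), or None if no range covers plus4."""
        ranges = self._ranges.get(zip5)
        if not ranges:
            return None
        value = int(plus4)
        # Find the last range whose low value does not exceed the +4 code.
        i = bisect_right(self._lows[zip5], value) - 1
        if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
            return ranges[i][2], ranges[i][3]
        return None

# Hypothetical rows: a box range, a single business code, a rural route.
index = Zip4CountyIndex([
    ("12345", "0001", "0099", "XX", "001"),
    ("12345", "1101", "1101", "XX", "001"),
    ("12345", "9701", "9745", "XX", "003"),
])
print(index.lookup("12345", "9710"))
print(index.lookup("12345", "5000"))
```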

The file should cover all ZIP Codes in the U.S., all ZIP Codes for U.S. possessions (Puerto Rico, Virgin Islands, etc.), and all APO/FPO ZIP Codes. All counties and county equivalents in the U.S. and U.S. possessions are represented in the file with the exception of Yellowstone National Park, MT (30-133), and, for the 1991 file, Denali Borough, AK (02-068).

The county should represent the county in which the mail is delivered. For post office boxes, it is the county in which the boxes are located. The APO/FPO ZIP Codes are assigned to the county the mail is delivered from, with the exception of APO/FPO ZIP Codes for military bases in Alaska and Hawaii. These are assigned appropriate county codes in Alaska or Hawaii.

E. Coverage Edit

The ZIP+4 to county cross reference file may not include all ZIP Codes. Some omissions are post office errors. Some are ZIP Codes actually used by local areas that are not known by the office assembling the file. Some may be discontinued ZIP Codes. However, because of lags in implementing ZIP Code changes, administrative record systems are likely to include outdated ZIP Codes. Also, some people continue to use the old ZIP Code even though it has been changed. The first step was to compare the ZIP Codes in the file with those actually used in the IRS file and with those listed in recent ZIP Code directories. Where needed, additional ZIP Codes were incorporated into the file.
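The comparison step of the coverage edit amounts to set differences between the three lists of ZIP Codes. The sketch below is illustrative only: the function name, the grouping of results, and the sample ZIP Codes are hypothetical, and deciding which missing ZIP Codes to add remains a review step that is not automated here.

```python
def coverage_gaps(xref_zips, irs_zips, directory_zips):
    """Find ZIP Codes that are missing from the cross reference file.

    Returns the missing ZIP Codes grouped by where they were seen, so
    that each group can be reviewed before anything is added to the file.
    """
    xref = set(xref_zips)
    irs_missing = set(irs_zips) - xref
    directory_missing = set(directory_zips) - xref
    return {
        "in_both_sources": sorted(irs_missing & directory_missing),
        "irs_only": sorted(irs_missing - directory_missing),
        "directory_only": sorted(directory_missing - irs_missing),
    }

# Hypothetical ZIP Code lists.
gaps = coverage_gaps(
    xref_zips=["10001", "10002"],
    irs_zips=["10001", "10003", "10004"],
    directory_zips=["10002", "10003", "10005"],
)
print(gaps)
```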

F. APO/FPO County Code Update

The complete list of APO/FPO ZIP Codes was reviewed to make sure appropriate codes were used. The state/county code was changed to a foreign category, except for the APO/FPO ZIP Codes for Alaska and Hawaii. The codes for the Trust Territories were also reviewed and modified, as necessary, to reflect the FIPS state and county equivalent codes.

G. Illegal County Code Edit

The file contains some illegal county codes. A county code of 999 was occasionally used and there were other non-existent county codes. All records in a ZIP Code that contained an illegal county code were examined and a correct county code determined.

The 999s were cases where the ZIP Code crossed into another state and the person assembling the data did not know what county to code. This occurred most often in North Dakota and South Dakota. These were recoded to a contiguous county in an adjoining state where it seemed reasonable to do so (by looking at the ZIP Code map, the ZIP Code directory, and an atlas). Most of the other illegal county codes were obvious typographic errors (such as digit transposition). However, some arose because the state code is the ZIP state and the county code is in another state. These were reviewed and the state code changed. A few of the illegal county codes were cases where the person preparing the county codes simply made up a new code to represent some special case in their area. It was not possible to tell what these were. For these, and the remainder of the illegal county codes, a county code was assigned (frequently the dominant county code for the sector). Thus, all illegal state/county codes were changed to legal state/county codes.
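
The final fallback described above (assigning the dominant county code for the sector) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function name and the sample codes are hypothetical, and the manual review against maps and directories is not modeled.

```python
from collections import Counter


def repair_illegal_codes(records, legal_codes):
    """records: list of (zip_code, sector, state_county) tuples.
    An illegal code is replaced by the most common legal code found in
    the same ZIP/sector. If the sector has no legal code at all, the
    record is returned unchanged so it can be reviewed by hand."""
    legal_by_sector = {}
    for zip_code, sector, code in records:
        if code in legal_codes:
            legal_by_sector.setdefault((zip_code, sector), Counter())[code] += 1

    repaired = []
    for zip_code, sector, code in records:
        counts = legal_by_sector.get((zip_code, sector))
        if code not in legal_codes and counts:
            code = counts.most_common(1)[0][0]  # dominant legal code
        repaired.append((zip_code, sector, code))
    return repaired


# Hypothetical illustration values.
legal = {"38015", "38059"}
sample = [
    ("11111", "12", "38015"),
    ("11111", "12", "38015"),
    ("11111", "12", "38999"),  # illegal, sector has a dominant legal code
    ("11111", "13", "38999"),  # illegal, no legal code in this sector
]
print(repair_illegal_codes(sample, legal))
```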

Also, in ZIP Codes that had more than one county listed, there were some that contained at least one county that was not contiguous to the other(s). A few of these were plausible (e.g. where counties are very close but not contiguous) and were not changed, some of these were actually for a contiguous county across the state line (the state code was repaired), some were typographic errors not caught in previous edits (and were fixed), but a significant number were inexplicable. These were replaced with the dominant code for the sector.

These reviews and corrections are based primarily on educated guesses and "most likely" corrections. We simply did not have resources to do a thorough review/correction to obtain exact information (say for example by calling the local post office). Still, a substantial amount of effort was expended to clean up the file. It is reasonable to expect that there are still some errors in the file that were not caught by the edits, and some errors introduced by the review/correction process. However, since remaining errors will be at the sector level, they will have less impact on quality (than if the error is at the 5-digit level).

The above discussion focused on "bad" codes within ZIP/Sectors but did not give a feel for how many there were. There were 1,494 ZIP Codes with a change, and 3,438 (out of 857,400) ZIP/Sector records with a change. There were 17,539 ZIP+4 records (out of about 24,000,000) with a change.

As mentioned earlier, there are a few ZIP Codes that deliver across state lines, and there are a few ZIP/sectors that cross county lines. There are 153 ZIP Codes in more than one state, mostly in Minnesota, Nebraska, North Dakota and South Dakota. There are 9,000 ZIP Codes in more than one county. There were 11,331 (out of the total 857,400) ZIP/sectors that were split by county. All states had some split sectors, with Virginia, Michigan and Ohio having especially large numbers. The rural route sectors, as expected, contained the largest share of split sectors relative to their number. Most of the other cases are in the lower sector range (reserved for post office boxes) and in Sector 99 (reserved for the postmaster and business mail return). There must be some non-standard county code assignment occurring for these selected cases. We will have to investigate these further at a later date.

H. ZIP/Sector to County Coding Guide

Most ZIP Codes are entirely within one county. For those that are split by counties, most of the ZIP/sectors are entirely within one county. Therefore, the file could be collapsed without loss of information, and the collapsed version would provide a fast and efficient method of coding. We collapsed the file to one containing the ZIP Code and the sector range for each string of sectors in the same county. For 77 percent of the ZIP Codes, the sector range will be 00 to 99 (as the ZIP delivers within one county). Split sectors were assigned the dominant county. Where a ZIP Code was split by county, an auxiliary coding guide was created which contains the dominant county code in the ZIP Code.
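
The collapsing step can be illustrated with a short sketch. It assumes each ZIP/sector already carries a single county code (split sectors having been assigned the dominant county) and that adjacent listed sectors in the same county are merged even when unused sector numbers lie between them. The function name, data layout and sample values are hypothetical.

```python
def collapse_sectors(sector_county):
    """sector_county: dict mapping (zip_code, sector) -> county code,
    with sector given as an integer from 0 to 99.
    Returns a list of (zip_code, low_sector, high_sector, county), one
    entry per run of adjacent listed sectors in the same county."""
    ranges = []
    for (zip_code, sector), county in sorted(sector_county.items()):
        if ranges and ranges[-1][0] == zip_code and ranges[-1][3] == county:
            # Same ZIP Code and county as the previous run: extend it.
            z, low, _, c = ranges[-1]
            ranges[-1] = (z, low, sector, c)
        else:
            ranges.append((zip_code, sector, sector, county))
    return ranges


# Hypothetical illustration values.
detail = {
    ("11111", 0): "001", ("11111", 1): "001",
    ("11111", 5): "003", ("11111", 6): "003",
    ("22222", 0): "007", ("22222", 99): "007",
}
for row in collapse_sectors(detail):
    print(row)
```

In this sketch a ZIP Code that delivers within one county reduces to a single range, which is what makes the collapsed guide small enough for fast lookup.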

I. The Sample Test File and County Coding Results

The sample test file is a 1-in-10 subsample of all tax year 1988 individual income tax returns where the 8th and 9th digits of the filer's social security number are 05, 20, 45, 70 or 95 and the 6th digit is 2 or 7. This sample is roughly a nationally representative random sample. There are 105,239 records in the file. There are 823 addresses outside the United States (APO/FPO addresses, Puerto Rico, Virgin Islands, other Trust Territories, and foreign countries), and 104,416 are in the U.S. There are 401 on military bases, 376 are of the relative position type (e.g. N 502 E 353), and 228 are blank. There are 9,442 that are rural routes or have a highway route number (e.g. RT 40). There are 8,005 post office boxes, and 84,866 are city type, leaving 1,098 miscellaneous addresses. (Note that these classifications were developed using a relatively simple algorithm that accounts for most records, but they are still conservative. Later sections of the paper discuss the addresses contained on the income tax returns in greater detail.) The file includes the street address, post office name, ZIP+4 and state. This file was coded to county using the ZIP/sector to county and the dominant auxiliary coding guides.
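
The digit rule used to select returns can be written as a small predicate. This is a sketch of the digit test only; the function name is hypothetical, and any further subsampling beyond the digit rule is not modeled.

```python
ENDINGS = {"05", "20", "45", "70", "95"}   # allowed 8th and 9th digits
SIXTH_DIGITS = {"2", "7"}                  # allowed 6th digit


def matches_digit_rule(ssn):
    """Return True if a social security number satisfies the digit rule.
    Hyphens and spaces are ignored; anything that does not reduce to
    nine digits is rejected."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        return False
    return digits[7:9] in ENDINGS and digits[5] in SIXTH_DIGITS


# Made-up numbers for illustration only.
print(matches_digit_rule("123-45-2705"))
print(matches_digit_rule("123-45-6789"))
```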

J. Coding Rates

The table shown below provides summaries of the number of returns in the sample test file by level and source of county code by address type. The following notes clarify the definitions of the data columns:

  1. Column 3: "100 Percent ZIP" - of the returns coded to county via ZIP/sector, these were coded using 5-digit ZIP Code only (and not sector) because the ZIP Code was entirely within a single county.

  2. Column 4: "ZIP/sector" - of the returns coded to county via ZIP/sector, these needed both the ZIP and sector codes because the ZIP Code was in more than one county.

  3. Column 6: "Zero ZIP" - these returns were not coded to county because there was no ZIP Code on the IRS record (all of these are for addresses outside the U.S. so they should have no ZIP Code).

  4. Column 7: "Bad ZIP" - these are records where ZIP Code did not match a ZIP Code in the coding file. The presumption is that these are bad ZIP Codes.

  5. Column 8: "No +4" - these were classified as "not coded" because the ZIP matched a split ZIP Code in the coding file (requiring a sector code) but the IRS record did not have a +4 code. (In fact, these were actually coded to county using the dominant county for the whole ZIP; but, because they may be of lesser quality, they were separately identified in the tables.)
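
The column definitions above imply a simple decision sequence for each return. The sketch below illustrates that sequence under stated assumptions: the guide is a dictionary of sector ranges per ZIP Code, the auxiliary guide is a dictionary of dominant counties, and the function name, data layout, sample codes and the "Sector not in guide" label are all hypothetical.

```python
def code_to_county(zip_code, plus4, guide, dominant):
    """guide: dict zip_code -> list of (low_sector, high_sector, county).
    dominant: dict zip_code -> dominant county, for split ZIP Codes.
    Returns (county or None, category)."""
    if not zip_code:
        return None, "Zero ZIP"
    ranges = guide.get(zip_code)
    if ranges is None:
        return None, "Bad ZIP"
    if len(ranges) == 1:
        # The ZIP Code lies in one county, so the sector is not needed.
        return ranges[0][2], "100 Percent ZIP"
    if not (plus4 and len(plus4) == 4 and plus4.isdigit()):
        # Tabulated as not coded, although a dominant county is assigned.
        return dominant[zip_code], "No +4"
    sector = int(plus4[:2])
    for low, high, county in ranges:
        if low <= sector <= high:
            return county, "ZIP/sector"
    # Hypothetical extra category: the sector falls outside every range.
    return dominant[zip_code], "Sector not in guide"


# Hypothetical illustration values.
guide = {
    "11111": [(0, 99, "24510")],
    "22222": [(0, 49, "24005"), (50, 99, "24510")],
}
dominant = {"22222": "24005"}
print(code_to_county("22222", "6012", guide, dominant))
```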

The text shows percents estimated by the sample test file. The standard error of the estimated percent is shown in brackets "[ ]" following the percent. If the estimated standard error rounds to 0.0 then it is not shown.
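
As a worked illustration of the bracketed standard errors, the sketch below applies the formula stated at the front of the paper, the square root of P*(1-P)/N, to two rows of the coding table in this section (391 of 401 military returns coded, and 358 of 376 relative position returns coded). The function name is hypothetical.

```python
import math


def percent_with_se(count, base):
    """Return (percent, standard error of the percent), each rounded to
    one decimal place, using sqrt(p * (1 - p) / n)."""
    p = count / base
    se = math.sqrt(p * (1 - p) / base)
    return round(100 * p, 1), round(100 * se, 1)


print(percent_with_se(391, 401))  # military returns coded to county
print(percent_with_se(358, 376))  # relative position returns coded to county
```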

Of all tax returns in the sample, we were able to code 98.7 percent to county using the ZIP/sector coding file: 81.9 percent were coded from 100 percent ZIP Codes and 16.8 percent were coded from split ZIP Codes. For sample returns in the U.S., we were able to code 98.8 percent of the returns to county. The coding rate was 97.5 percent [0.8] for military, 95.2 percent [1.1] for relative position types, and 99.0 percent for city type addresses. The coding rate was 98.9 percent [0.1] for rural route or highway route number, 99.5 percent [0.1] for post office boxes, and 86.1 percent [1.0] for others. The coding rate was 71.5 percent [3.0] for returns with a blank address.

Number of Returns Coded by ZIP/Sector

                                      Coded to County             Not Coded to County
Address type             Total    Total 100% ZIP ZIP/sector   Total Zero ZIP Bad ZIP  No +4

Number
Total                  105,240  103,842   86,195     17,647   1,398      180      66  1,152
Foreign                    824      644      631         13     180      180       0      0
Total in U.S.          104,416  103,198   85,564     17,634   1,218        0      66  1,152
Military                   401      391      383          8      10        0       0     10
Relative position          376      358      311         47      18        0       0     18
City type addresses     84,865   84,037   71,921     12,116     828        0      44    784
Blank addresses            228      163      163          0      65        0       0     65
Rural routes             9,443    9,343    5,528      3,815     100        0       3     97
P.O. boxes               8,005    7,961    6,386      1,575      44        0      13     31
Others                   1,098      945      872         73     153        0       6    147

Percent
Total                    100.0     98.7     81.9       16.8     1.3      0.2     0.1    1.1
Foreign                  100.0     78.3     76.7        1.6    21.7     21.1     0.2    0.4
Total in U.S.            100.0     98.8     81.9       16.9     1.2      0.0     0.1    1.1
Military                 100.0     97.5     95.5        2.0     2.5      0.0     0.0    2.5
Relative position        100.0     95.2     82.7       12.5     4.8      0.0     0.0    4.8
City type addresses      100.0     99.0     84.7       14.3     1.0      0.0     0.1    0.9
Blank addresses          100.0     71.5     71.1        0.4    28.5      0.0     0.0   28.5
Rural routes             100.0     98.9     58.6       40.4     1.0      0.0     0.0    1.0
P.O. boxes               100.0     99.5     79.8       19.7     0.5      0.0     0.2    0.4
Others                   100.0     86.1     79.4        6.6    13.9      0.0     0.5   13.4

The coding rates varied somewhat by state. For all address types in the U.S., the coding rates for all states were 97.0 percent or higher, except for North Dakota (93.2 percent [1.5]) and South Dakota (94.4 percent [1.4]). For city type deliveries, the coding rate was 96.5 percent [1.5] or higher in every state. The classification of uncodeds includes those cases that do not have a +4 code and were in a ZIP Code that was split by county. For the U.S., there were 1,218 uncoded cases (or 1.2 percent); 66 were because of bad ZIP Codes and 1,152 were because the +4 was missing.

If the 1,152 returns left uncoded because the +4 was missing are instead counted as coded via the dominant coding guide, then the uncoded rate drops to 0.1 percent overall.

K. Quality of Coding

To assess the quality of the ZIP/sector assigned county code, we first compared the state code assigned by the ZIP+4 against the mailing state code. Out of 104,416 cases, there were 7 errors (0.01 percent) because of a bad mailing state code and 2 errors caused by the author's mistakes in keying updates to the ZIP+4 to county coding file.

There was no benchmark of known county codes against which to compare the ZIP/sector derived county codes. However, the probability county codes were compared with the ZIP/sector codes. The agreement rate should give an assessment of the lower limit of the coding accuracy in either source.

For all sample cases, the county codes assigned by the ZIP/sector and probability agreed on 96.3 percent of the records. The agreement rate varies by state. The states with the lowest agreement percentage rates are:

Virginia - 87.1  [0.7]
Maryland - 92.2  [0.6]
Colorado - 92.4  [0.7]
Georgia - 92.8  [0.5]
Minnesota - 93.7  [0.6]
West Virginia - 94.1  [0.9]
South Carolina - 94.2  [0.6]
Michigan - 94.3  [0.4]
Mississippi - 94.7  [0.7]
Indiana - 94.7  [0.5]
South Dakota - 94.8  [1.3]
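
The state-level agreement rates above are simple tallies of matching county codes. A minimal sketch of that tally follows; the record layout, function name and sample values are hypothetical and do not reproduce the actual test file.

```python
from collections import defaultdict


def agreement_by_state(records):
    """records: iterable of (state, zip_sector_county, probability_county).
    Returns dict state -> percent of records where the two codes agree,
    rounded to one decimal place."""
    totals = defaultdict(int)
    agreements = defaultdict(int)
    for state, zip_sector_code, probability_code in records:
        totals[state] += 1
        if zip_sector_code == probability_code:
            agreements[state] += 1
    return {s: round(100.0 * agreements[s] / n, 1) for s, n in totals.items()}


# Hypothetical illustration values.
pairs = [
    ("MD", "24005", "24005"),
    ("MD", "24005", "24510"),
    ("MD", "24510", "24510"),
    ("MD", "24510", "24510"),
    ("VA", "51059", "51059"),
]
print(agreement_by_state(pairs))
```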

In Maryland, there were 170 cases where the county codes disagreed. Most of the disagreement occurred at the fringes of Baltimore City, where the 5-digit ZIP Code was split between Baltimore City and Baltimore County. Because of resource limitations, we were only able to review and reconcile 37 cases. The probability county code was wrong for 33 cases and the ZIP/sector derived county code was wrong for 4 cases. This review was intended to be expository rather than empirical, so the results cannot be generalized.

The agreement rates will also vary by county, and we need to know the performance at the county level. The sample test file is too small to measure this; a full-scale test is needed. Moreover, the real question is not accuracy in county coding per se, but the reliability of the migration data that are based on these codes. Given the promise of the ZIP/sector coding approach, the small size of the coding file, and the speed with which the coding can be done, we are going to conduct a test that codes all 110 million individual income tax returns, produces county level migration data and evaluates those data. Additionally, we are going to move this work from a mainframe environment to a SUN workstation environment to determine if the work can, in fact, be done in this environment and if the tight timing schedules can be met.

Section III -- Address Types on Income Tax Returns

Before embarking on the development of new methods of geographically coding mailing addresses to the subcounty level, an assessment needs to be done of the numbers and types of mailing addresses that are actually used on the income tax returns. This section outlines the post office's addressing guidelines, shows the address types contained in the 1990 decennial census address control file, and discusses the various address types and the quality of the address information in the individual income tax returns.

A. Post Office Addressing Guidelines

Local planning authorities have the responsibility for deciding on addressing conventions as well as assigning individual addresses. This could be any of the local groups responsible for assigning city-type addresses, such as local planning boards, developers, municipalities, utility companies, etc.

The post office maintains ongoing relationships with the local address planning authority, primarily in a consulting or advisory capacity. The post office has no actual authority over the address conventions or actual addresses assigned by the local planning authority. It does, however, have the right to refuse to adopt new city-style addresses prepared by the local planning authority if it determines that it will be unable to provide efficient, cost-effective service. For all other units requiring individual delivery, the post office has assigned rural routes or has general delivery (by customer's name). There are still some areas that do not have delivery services, and individuals there must pick up their mail from the post office.

Also, the post office actively encourages local address planning authorities to assign city style addresses in areas that currently do not have them, as do local emergency services and utilities.

The post office has defined the standard parts of a city style address. This is the usual house number/street name addressing convention. It can be composed of many parts in the primary and secondary addresses, but the minimum content is the primary number, primary street name and suffix such as 124 Maple St. Content can include:

  1. primary address number
  2. directional prefix
  3. primary street name
  4. suffix (e.g. RD, ST, LA, AVE, etc.)
  5. directional suffix
  6. secondary address (APT, SUITE, etc.)
  7. secondary number (Note that "number" may be alphabetic, numeric or mixed such as APT 401, SUITE G or SUITE 401 G)

Further, the post office recommends the following conventions:

  1. Each noncontinuous street should have one correct name and should be uniquely identifiable without directionals or suffixes. For example, avoid Palm Court, Palm Avenue and Palm Street as names for separate streets.

  2. Names that are also suffixes or directionals, such as Court Street, Southeast Boulevard or East West Hwy, should not be used.

  3. Sound-alike names such as Beech and Beach, Main and Maine should be avoided.

  4. Street names should not be longer than 15 characters.

  5. Special characters (hyphens, apostrophes, periods, etc) such as St. Lawrence St, O'Connor Boulevard, etc., should be avoided.

  6. The use of nonspecific addresses (such as corner of 5th and Main Streets) should be avoided. Use a specific address (such as 501 Main Street).

  7. The house number assignment should proceed from a logical point of origin and be in proper numerical sequence in relation to other lots with frontage on the same street. Odd numbers should be assigned to one side, even numbers assigned to the other. Also, numeric assignment should be sufficiently flexible to accommodate maximum density permitted by zoning regulations (have room for growth).

  8. Street numbers should be no longer than 6 digits.

  9. Do not use fractionals (such as "101-1/2 Main St"), alphas (such as "A101 Main St" or "101G Main St"), or hyphenated numbers (such as 10-101 Main St) in the house number field.

  10. Use individually addressed primary numbers rather than secondaries wherever possible (such as "101 Main" and "103 Main" vs. "101 Main Apt. A" and "101 Main Apt. B").

  11. Maintain addressing continuity through the municipality (and municipality to municipality where possible).

  12. The rural box numbers should be sequential along the carrier's line of travel, with sufficient flexibility to accommodate growth, and avoiding hyphenation or alphabetics (e.g. Box 124A, Box 124B, etc.).

  13. Rural route name conventions are "RR" or "HC" and do not include "Rural", "Route", "#", "No", "Number", "RD", "RFD", "Star Route", etc.

  14. Do not use both a rural route and a street name, and do not use a secondary address with a rural route (for example, avoid "RR 1 Box 124 Boden RD" and "RR1 Box 124 Apt A").

  15. Post office box format should be "PO Box" followed by the box number. Excluded are other names such as "caller", "drawer", "lockbox", "PB", etc. Also, the box numbers should not include alphabetics or hyphens.

Again, note that these are guidelines and may or may not be followed by the local planning authority. Also, there may be addresses previously defined that do not conform to the current guidelines. In fact, we have seen examples of violations of all of these guidelines in the addresses on the individual income tax returns.
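
Several of the guidelines listed above are mechanical enough to check by program. The sketch below tests four of them (guidelines 4, 5, 8 and 9). It assumes the house number and street name have already been separated and that the 15-character limit applies to the street name without its suffix; the function name and message wording are hypothetical.

```python
import re


def guideline_violations(house_number, street_name):
    """Return a list of guideline violations for one city style address."""
    problems = []
    if not house_number.isdigit():
        # Fractionals, alphabetic characters and hyphens all fail here.
        problems.append("house number is not purely numeric (guideline 9)")
    elif len(house_number) > 6:
        problems.append("house number is longer than 6 digits (guideline 8)")
    if len(street_name) > 15:
        problems.append("street name is longer than 15 characters (guideline 4)")
    if re.search(r"[-'.]", street_name):
        problems.append("street name contains special characters (guideline 5)")
    return problems


print(guideline_violations("124", "Maple"))
print(guideline_violations("101-1/2", "Main"))
print(guideline_violations("501", "O'Connor"))
```

A check of this kind only flags departures from the guidelines; it says nothing about whether the address is deliverable.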

B. Address Types in The 1990 Census Address Control File (ACF)

The attached Table 1 shows the number of addresses contained in the ACF 3/. The data are shown by type of enumeration area (TEA), address type and state. Tape address register (TAR) areas are the 384 urban areas in Metropolitan Statistical Areas for which there was city type delivery, and commercially available mailing lists that could be geographically coded by the TIGER system. (These are essentially areas that were covered by the 1980 GBF/DIME files.) Prelist addresses are those addresses outside TAR areas that were manually compiled by Census Bureau personnel prior to the 1990 census. List/enumerate are those addresses listed at the time of the actual enumeration.

Overall, 83.1 percent of the addresses in the ACF were city types. The percent varies considerably by state. There are 12 states with a city type usage rate of 90 percent or more, and nine states with the usage rate in the 80 percent range. There were 14 states in the 70 percent range, eight states in the 60 percent range, and five states in the 50 percent range. Finally, there were two states (Maine and West Virginia) in the 40 percent range and one state (Vermont) in the 30 percent range.

Overall, 8.2 percent of the addresses in the ACF were rural routes. Sixteen states had rural route usage rates of 10 to 19 percent, and nine states had usage rates of 20 percent or more. These were Alabama, Arizona, Maine, Mississippi, North Carolina, North Dakota, South Dakota, Vermont and West Virginia.

Post office boxes were 3.7 percent of the addresses in the ACF, with nine states having a usage rate in the 10 to 19 percent range. These were Idaho, Maine, Montana, New Mexico, North Dakota, South Dakota, Vermont, West Virginia and Wyoming. Alaska had the highest usage rate, at 22.9 percent.

The usage rate for general delivery was 0.2 percent, with Alaska at 12.5 percent and West Virginia at 2.5 percent.

The percent of all other addresses was 4.7 percent for the U.S. with the largest "other" address prevalence occurring in the following states:

Arizona - 11.9
Missouri - 10.1
North Dakota - 10.2
Vermont - 21.7
Kentucky - 10.0
Montana - 11.7
Oklahoma - 12.0
West Virginia - 14.7
Maine - 16.0
New Hampshire - 14.3
Tennessee - 12.1

C. Address Formats Actually Used on Income Tax Returns

It is important to note the very real difference in address types and quality of address information between a controlled environment, such as the decennial address list, and in an uncontrolled environment, such as the taxpayer supplied addresses. The tabulation of address types contained in the ACF does not account for many of the nuances in address formats, and there are numerous mailing address formats used that are not residential. We know that there are a host of nonstandard postal delivery methods and associated address types (such as house number/street name, rural route, post office box, etc.). There are also different address descriptions, formats, abbreviations used by the taxpayer. Also, the address format may not be standard (e.g. "Box 17B Route 7" vs. "Route 7 Box 17B"), and abbreviations may not be standard (RT vs. RR vs. Rural Route), etc. The address could be a host of other locations, such as accountants, financial institutions, place of work or business, parents' address, post office box, etc.

D. Address Classification Process

The first step was to pore through addresses in selected parts of the country to find various nonstandard addresses. The second step was to develop a simple classification system, apply it to the sample test file and tabulate the results. Addresses not falling into any of the classifications were reviewed, refinements made to the classifications, and the process repeated. After this work was done, we found some more nuances that did not show up in the test. The following describes the various nonstandard address formats that we found.

Military -- There is a standard on-base mail delivery system, akin to P.O. boxes. The format is "PSC #", such as PSC 125. Also, the returns may be addressed by the military unit. The following character strings also were used to recode such military addresses if they occurred at the beginning of the address: HQ, HHB, HHC, HHD, HHT, MSSG, MACS, VF, WPNS, MWSS, VMF, VMFA, VFMAT, HMH, MALS, USS, QTRS, QUARTERS, BEQ, CO A, CO B, ... CO Z, A CO, B CO, ... Z CO, ACO, BCO, ... ZCO, MAG, NTTC, A BTRY, B BTRY,... Z BTRY, ALPHA, BRAVO, CHARLIE, DELTA, ECHO, FOX, GULF, HOTEL, and NAS. Also, another recode was assigned if the ZIP or the ZIP+4 was for a military installation, or the post office name was for a military base.

Relative position -- these are formats that have two directionals and two numbers as the address, such as "N 123 E 234". The recode looked for "N # E #" or "# N # E" for any of the eight possible combinations of directionals. These are prevalent in selected areas of Utah and Wisconsin. These are mislabeled in tables 5 and 8 as "latitude/longitude".

Directional prefixes -- Instead of a standard format such as "123 N Main St", the format is "N 123 Main St". The recode was for any of the eight directionals (N, E, S, W, NE, SE, NW, and SW). These were mostly in Washington state.

Mile -- This is where the street name is a number followed by "Mile", such as "26 Mile Road". An example of an address might be "123 26 Mile Road". We found these in Michigan.

Address number is by mile -- An example would be "Mile 23 Main Hwy". To be recoded as such, the first four characters of the address had to be "MILE". These were found in Alaska.

Blank address -- In some cases, the address field is blank, where delivery is identified by addressee's name.

General delivery -- In some cases, the address is "General Delivery", "GEN DEL" or "LOCAL", where delivery is identified by addressee's name.

Rural Routes -- The following character strings were used to identify the rural route type (yes, all of these were actually used in the income tax addresses) if it occurred at the beginning of the address field: RTE, RR, R RT, R R, R T, R #, ROUTE, RURAL ROUTE, RURAL, RD, R D, RFD, R F D, STAR, MTD RT, MTD RTE, SUB RT, SUB RTE, SUBURBAN RT, KEYSTONE RT, SEARING RT, SKAAR RT, HCR, H C, HCO, HC, HWC, and R NO.

Route Number -- This represents the highway number, such as "RT 40", rather than the post office's rural route. There can be abbreviations for U.S. routes, state routes, county routes and even township routes. The recode included addresses that had " RT " anywhere in the address, and selected character strings if they occurred at the beginning of the address (such as "SR 662 BOX 125") or were preceded by a numeric (such as "1234 SR 662"). The character strings included: RT, ST RTE, ST R, SR, CR, TR, HWY, etc. Although each of the character strings was recoded separately, they are included with other categories in the tabulations.

  1. If the highway route designation was preceded by a numeric (such as "1234 ST RT 662"), then it was included in the category "Old Type 1's". There were 188,000 (out of 84,540,000) of this highway route classification.

  2. If the highway route designation occurred at the beginning of the address (such as "ST RT 662"), then it was included in the category "Other". There were 14,000 (out of 1,098,000) of this highway route classification.

  3. If the highway route designation was not at the beginning and was not preceded by a numeric and contained "RT" somewhere in the address, then it was included in the category "Other RT in address". There were 6,000 (out of 54,000) of this highway route classification.

Post Office Boxes -- The following character strings were used to identify post office boxes if they occurred at the beginning of the address: BOX, BX, P O B, P O BX, P O BOX, POB, P BOX, P O, PO, PO B, and PO BOX. If the string "DRAWER" or "POUCH" occurred anywhere in the address, the address was also recoded as a post office box.

Trailer parks -- Addresses in trailer parks sometimes include the character string "LOT" in the address. These were recoded (if not recoded in above categories).

Buildings -- A recode was created if the address contained the character string "BLDG" anywhere in the address and had not been recoded in the above categories.

City Type addresses -- These were recoded if the first character was numeric and not recoded in any of the above categories. They are shown in the tables as "Old Type 1's".

Alpha prefix -- In some areas, the house number may be preceded by a nondirectional prefix, such as "G 123 Elm St". For the cases in Michigan, apparently the prefix is the first character of the county name. These were not recoded separately, but were left in the "other" category.

VIA -- In Alaska, there are addresses such as "Red Mountain VIA Manly", where Manly is a legitimate post office name. The character string "VIA" is also used in Puerto Rico, and stands for "road". These were also left in the "other" category.

Other -- This is everything not recoded above. Most of these are street names without a house number. They may or may not also have a box number. Examples might be "Boden Rd" or "Boden Rd Bx 123". Also included are building or business names, apartment names, community or trailer park names, names of group quarters (such as a monastery, hospital, fraternity, etc.) and address types noted above that were not recoded by the above algorithm.
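
The recode rules described above can be sketched as an ordered series of pattern tests. The following is an illustration only, not the classification program actually used: it covers a subset of the categories, uses shortened prefix lists, and the function names and category labels are hypothetical.

```python
import re

DIRECTIONAL = r"(?:NE|NW|SE|SW|N|S|E|W)"
# Shortened prefix lists; the lists in the text are much longer.
RURAL_PREFIXES = ("RURAL ROUTE", "ROUTE", "RTE", "RR", "R R", "RFD", "STAR", "HCR", "HC")
PO_BOX_PREFIXES = ("PO BOX", "P O BOX", "POB", "P O B", "BOX", "BX")


def starts_with_token(address, prefixes):
    """True if the address begins with a prefix that is followed by a
    space, a digit, or the end of the string."""
    return any(re.match(re.escape(p) + r"(?=\s|\d|$)", address) for p in prefixes)


def classify(address):
    a = " ".join(address.upper().split())  # normalize case and spacing
    if not a:
        return "blank"
    if a in ("GENERAL DELIVERY", "GEN DEL", "LOCAL"):
        return "general delivery"
    if re.match(r"PSC \d+", a):
        return "military"
    if (re.match(rf"{DIRECTIONAL} \d+ {DIRECTIONAL} \d+$", a)
            or re.match(rf"\d+ {DIRECTIONAL} \d+ {DIRECTIONAL}$", a)):
        return "relative position"
    if re.match(rf"{DIRECTIONAL} \d+ ", a):
        return "directional prefix"
    if re.match(r"(\d+ )?\d+ MILE\b", a):
        return "mile street name"
    if a.startswith("MILE"):
        return "mile number"
    if starts_with_token(a, RURAL_PREFIXES):
        return "rural route"
    if starts_with_token(a, PO_BOX_PREFIXES) or "DRAWER" in a or "POUCH" in a:
        return "post office box"
    if a[0].isdigit():
        return "city type"
    return "other"


for example in ("N 123 E 234", "N 123 Main St", "123 26 Mile Road",
                "RR 1 Box 124", "PO Box 12", "124 Maple St", "Boden Rd"):
    print(example, "->", classify(example))
```

The order of the tests matters: city type is tested last among the specific types, so an address such as "123 26 Mile Road" is caught by the mile rule before its leading digit is considered.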

E. Tabulation results

Table 2 shows a tabulation of the sample test file by the summary recode and by mailing state code for AL-WY, PR (Puerto Rico, Virgin Islands, etc.), FR (other Foreign) and the U.S. total.

Of the 104,416 U.S. returns in the test file, 81.3 percent [0.1] were city type, 9.0 percent [0.1] were rural routes and 7.7 percent [0.1] were post office boxes. There were 0.4 percent military, 0.4 percent relative position, 0.1 percent "# mile", 0.2 percent directional prefix, 0.2 percent with a blank address and 1.1 percent were all others.

The usage of post office boxes in the IRS (7.7 percent [0.1]) is more than double the usage in the Census (3.7 percent). There are two influences in this difference. First, in the more rural areas, it may be more convenient to use a post office box or there may be no house delivery. Second, some people use post office boxes for various pieces of mail even though they have regular delivery to the house (with a city-type address).

The relative usage of addresses classified as other is also striking -- 1.0 percent in the IRS vs. 4.7 percent in the census. In D.C., the pattern is reversed: there are 6.1 percent [1.4] "others" in the IRS vs. 0.0 percent in the census. These 6.1 percent include examples of mail deliveries to place of work.

The composition varies dramatically by state. There were nine states with 90 percent or greater city type addresses, and 10 states with 80 to 89 percent. There were 12 states with 70 to 79 percent (under the national average), nine states with 60 to 69 percent, and five states with 50 to 59 percent. There were six states with less than 50 percent.

The rates of rural route and post office box usage also vary by state. For the six states with less than 50 percent city type, the percent rural route and post office box usage is as follows:

State            Rural Route     P.O. Box
Alaska             6  [1.4]     41  [2.8]
Maine             21  [1.7]     25  [1.8]
Mississippi       35  [1.6]     15  [1.2]
West Virginia     29  [1.8]     21  [1.6]
Vermont           34  [3.0]     24  [2.7]
Utah               2  [0.6]      8  [1.1]

Also, Alaska has 4.3 percent [1.2] military, Michigan has 1.2 percent [0.2] directional prefix, Connecticut has 2.2 percent [0.4] other, the District of Columbia has 5.8 percent [1.4] other, Maine has 6.8 percent [1.1] other, New Hampshire has 8.4 percent [1.2] other, and Vermont has 5.2 percent [1.4] other. Washington has 5.6 percent [0.5] directional prefix, Wisconsin has 2.3 percent [0.3] relative position, and Utah has 44.9 percent [2.1] relative position.

Relatively speaking, the problem areas are a small proportion overall. However, they are concentrated in localities and failure to account for them will preclude development of reliable migration data for these areas.

Using a KEY-4 to probability geographic code cross-tabulation of all tax returns, we were able to put tables 3 and 4 together. These tables show the number of counties and number of places by population size and the percent of each of four address types.

Counties - Less than half of the counties (1,305 out of 3,023) had more than 50 percent city type addresses; 553 had more than 50 percent rural routes; 318 had more than 50 percent P.O. box usage; and 10 had more than 50 percent "others".

Places - As expected, places had a much larger concentration of city type addresses, especially the larger places. Of all places sized 25,000 to 50,000, 85 percent had a city type usage rate in excess of 90 percent, as did 97 percent of all places sized 50,000 or more. For smaller places, the percent with a high concentration of city type deliveries drops off dramatically. These smaller places typically had higher post office box usage (especially), somewhat higher rural route usage and, to some extent, more of the other address types.

F. Quality of Addresses

In addition to the variant address types, there are questions about the quality of the address information that will affect the ability to geographically code. First, the address is supplied by the taxpayer. It can contain address parts in any combination, in any order, and can contain bits of different types of addresses (especially prevalent in rural areas). It can contain numerous variations of name spellings and a plethora of nonstandard abbreviations. Second, the address information is handwritten, then read by a data entry person and keyed. Even carefully written or printed addresses can be easily misread or miskeyed. The script form of "Ct" (for court) can easily be misread as "Cl". Another example is misreading a printed "M" (as in Mill) as an "H". The data entry persons work under strict production schedules, and address keying quality is of less importance than the quality of the other information on the form, such as income and tax amounts. (If the address is deliverable, then it is good enough.)

Section IV -- TIGER Address Coding Process and Results

Addresses from the sample test file were extracted and provided to the Geography Division for TIGER coding. The universe of mailing addresses provided to Geography Division excluded foreign addresses (i.e., APO/FPO, Puerto Rico, Virgin Islands, other Trust Territories, and other countries). The addresses were coded to place using the 1990 place of work header coding. Areas in the TIGER universe (TAR areas) were then selected and coded to block via TIGER. Records assigned a block code had the place level code from header coding replaced with the place level code appropriate for the coded block. This section is a discussion of the geographic coding procedures in the Census Bureau's TIGER system, a report on results of TIGER coding of a sample of individual income tax return addresses, and an assessment of coding rates if TIGER is expanded to include all city type addresses. Assessing the quality of the TIGER assigned block codes is outside the scope of this research effort.

A. TIGER Coding Process

The Geography Division maintains various files and software systems for geographically coding addresses and developed several special files for coding place of work in the 1990 census.

"The TIGER is an integrated cartographic data base which automates the Census Bureau's mapping and geographic coding activities for the 1990 decennial census as well as for census activities (such as current population surveys). The TIGER contains the cartographic data base for producing census maps, including boundaries for states, counties, MCDs/CCDs, places, tracts and blocks for the entire United States. ZIP Code boundaries are not, however, included. The TIGER does include street range information for block faces, but only for TAR areas. This comprises about 80 percent of the housing units but only about 400 areas." 4/

The Work Place File provides supplementary coverage for work places that are not covered by TIGER address range data.

The coding done for the sample test file was solely machine coding. The machine geocoding strategy codes exact match responses and compensates for minor variations in the spelling of reported addresses.

The TIGER geographic coding of addresses is a very complex and detailed process. The description provided here is a simplified overview. The coding process involves 4 logical steps. First, a place level header code is assigned. Then the records are partitioned into TIGER coding areas. TIGER standardizes the addresses. Finally, the records are matched to the address range information and a block level code assigned. If necessary, auxiliary information sources, such as the Workplace File, are used to code the address.

The first step is to create a place level header geographic code. The state code, post office name, and the 5-digit ZIP Code are first compared to the City Reference File (CRF). In the event that the three address components do not exactly match a record in the CRF, additional searching is done with two different blocking factors: (1) post office name within 3-digit ZIP Code, and (2) post office name within state. An exhaustive search from both perspectives is attempted. A "best fit" record is chosen, based on a ranking of the combinations of matching and nonmatching components. All place level records associated with the chosen CRF record are selected. If there is more than one geographic code, then one code needs to be selected in a decision process based on additional "tie breaker" information. The potential list of tie breaker information includes: county, historic geographic information, the ZIP+4 Code, a flag distinguishing inside/outside corporate limits, place type, and a flag representing rural route address type. In the absence of any tie breaker information, the first (alphabetically) place level code is chosen. For purposes of blocking the file into the 384 TIGER areas, the header coding process worked quite well, even without the "tie breaker" information. However, using the header codes at the county and place level for other purposes requires caution.
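
The header coding logic can be illustrated with a minimal Python sketch. It is not the Geography Division's software. The record layout (`CrfRecord`), the one-point-per-component ranking, and the function name `header_code` are assumptions made for illustration, and the tie breaker step is reduced to the "first alphabetically" fallback described above.

```python
from collections import namedtuple

# Hypothetical, simplified CRF record; the real file layout is not described here.
# `places` holds (place_name, place_code) pairs tied to this post office and ZIP.
CrfRecord = namedtuple("CrfRecord", "state post_office zip5 places")


def header_code(state, post_office, zip5, crf):
    """Return a place level code for an address, or None if nothing matches."""
    name = post_office.strip().upper()

    def score(rec):
        # Assumed ranking: one point per matching component.
        return (rec.state == state) + (rec.post_office == name) + (rec.zip5 == zip5)

    # Pass 1: exact match on state, post office name and 5-digit ZIP Code.
    candidates = [r for r in crf if score(r) == 3]
    if not candidates:
        # Pass 2: search both blocking factors (post office name within
        # 3-digit ZIP Code, and post office name within state).
        candidates = [r for r in crf if r.post_office == name
                      and (r.zip5[:3] == zip5[:3] or r.state == state)]
    if not candidates:
        return None
    best = max(candidates, key=score)
    # No tie breaker information: take the alphabetically first place.
    return min(best.places)[1]
```

Because the fallback simply takes the alphabetically first place, a record tied to several places can receive an arbitrary place code, which is one reason the header codes need caution when used at the county and place level.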

The addresses are then standardized.

"The Geography Division utilizes a street and building "address standardizer" to "prep" both the TIGER file address images and the detail address responses prior to matching. The standardizer ensures that the images in both the standardized coding files and the records to be coded contain the same data and conform to the same name and format conventions. The standardizer recognizes street, building, intersection, rural route and post office box types of addresses. It allows only building type and street type addresses that have house numbers to match TIGER, because TIGER does not currently maintain such data items as rural routes and post office boxes. It replaces variant abbreviations with standard abbreviations, and it removes such address components as apartment numbers and suite designators from the match fields." 4/

The standardized addresses are then matched to the address range information in TIGER. The matching process uses the same basic blocking and "best fit" matching strategy as in the header coding process. The two blocking criteria are: (1) street name within TIGER coding area (defined from the header geographic codes), and (2) street name within 3-digit ZIP Code.
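
Before this match can run, both the reference file and the addresses to be coded must be in the same form. The standardization step quoted above can be illustrated with a minimal sketch. The abbreviation table, the regular expressions, and the function name `standardize` are assumptions for illustration; the actual standardizer handles many more cases (buildings, intersections, directionals).

```python
import re

# Assumed, deliberately tiny abbreviation table; the real tables are not given.
STREET_TYPES = {"STREET": "ST", "STR": "ST", "AVENUE": "AVE", "AV": "AVE",
                "COURT": "CT", "DRIVE": "DR", "ROAD": "RD", "LANE": "LN"}
# Trailing apartment or suite designators, which are dropped from the match fields.
UNIT_RE = re.compile(r"\s+(?:(?:APT|APARTMENT|SUITE|STE|UNIT)\b|#)\s*\S+$")


def standardize(address):
    """Classify an address and, for street types, return its match fields."""
    text = re.sub(r"[.,]", "", address.upper())
    text = re.sub(r"\s+", " ", text).strip()
    if re.match(r"(?:PO|POST OFFICE) BOX\b", text):
        return {"type": "po_box", "matchable": False}
    if re.match(r"(?:RR|RURAL ROUTE)\s*\d+", text):
        return {"type": "rural_route", "matchable": False}
    text = UNIT_RE.sub("", text)
    m = re.match(r"(\d+) (.+)$", text)
    if m is None:
        # No house number: cannot be matched to an address range.
        return {"type": "street", "matchable": False}
    words = m.group(2).split()
    # Replace a variant street type (last word only, a simplification).
    words[-1] = STREET_TYPES.get(words[-1], words[-1])
    return {"type": "street", "matchable": True,
            "house_number": int(m.group(1)), "street": " ".join(words)}
```

For example, `standardize("123 Main Street, Apt. 4B")` yields house number 123 on "MAIN ST", while a post office box or rural route address is recognized but flagged as not matchable.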

"The Geography Division utilizes a "character match string comparator" to score similarities between unmatched name strings and the corresponding name strings in a reference file. An empirical routine that has evolved over fifteen years, the comparator matches two name strings character by character. When characters mismatch, it looks ahead one or two characters to determine the reason for mismatch. It recognizes transposed characters, dropped characters, and matched characters, and it attempts to realign to matching characters following each mismatch." 4/

In addition to matching names with minor differences (equivocation), it will accept nonmatches to street type and directionals. It scores combinations of matching and nonmatching address components using a ranking scheme. It selects the "best fit", providing that it is the only "best fit" and that the ranking value exceeds a certain level. Note that the coding system does not accept mismatches to address range or parity (whether the address is odd or even). Matched records are assigned block codes. For our work, higher level codes (state, county, MCD and place) were also assigned based on the block codes.
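
The strict treatment of address range and parity, and the requirement that the best fit be unique, can be sketched as follows. The tuple layout of the range records and the name `find_block` are hypothetical; the sketch assumes the street name has already been standardized and matched.

```python
def find_block(house_number, street, ranges):
    """Return the block code for an address, or None if there is no unique hit.

    `ranges` is a list of (street, low, high, parity, block) tuples, where
    parity is "E" (even side) or "O" (odd side). This layout is an assumption.
    """
    parity = "E" if house_number % 2 == 0 else "O"
    # Range and parity must both agree; no mismatch is tolerated on either.
    blocks = {block for (name, low, high, side, block) in ranges
              if name == street and side == parity and low <= house_number <= high}
    # Accept the match only when exactly one block qualifies.
    return blocks.pop() if len(blocks) == 1 else None


# Hypothetical address ranges for one street.
ranges = [
    ("MAIN ST", 100, 198, "E", "101"),
    ("MAIN ST", 101, 199, "O", "102"),
]
print(find_block(124, "MAIN ST", ranges))
print(find_block(250, "MAIN ST", ranges))
```

An address whose house number falls outside every range stays uncoded at the block level, which matches the "nonmatch on house number" outcome reported later in this section.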

B. Table Definitions

Table 5 shows the TIGER coding rates by source of code and address type. Table 6 shows the table of TIGER coding rates by state. Table 7 shows three selected summaries ranked by state. The three maps following Table 7 represent these summary rates.

Column 2 of Tables 5 and 6 shows the total number of records coded to place level via the header coding. Column 12 shows the number of records outside of TIGER areas that were coded to place via the header coding. Column 4 shows the total number of records inside TIGER areas. Column 5 shows the number of records inside TIGER areas not coded via TIGER; that is, they are coded to place (only) via header coding. Column 6 shows the number that were coded to DO/ARA only, or to DO/ARA/Block. Column 8 shows the total coded to DO/ARA/Block, and Columns 9, 10, and 11 show the number coded by source of code: via address match, via workplace file, or via employer name file.

C. TIGER Coding Rates

The TIGER header coding process coded 99.8 percent of all addresses in the U.S. to the place level. About 67.4 percent [0.1] of all addresses are coded to places inside the 384 metropolitan areas (TIGER areas), and 32.4 percent [0.1] are coded to places outside the TIGER areas.

The TIGER header coding process coded 99.8 percent of all city type addresses in the U.S. to the place level. About 78.8 percent [0.1] of the city type addresses are coded to places inside the 384 metropolitan areas (TIGER areas), and 21.0 percent [0.1] of the city type addresses are coded to places outside the TIGER areas.

Only those city type addresses coded to places inside the 384 metropolitan areas (TIGER areas) are eligible for coding to block via address range information or supplemental work place files. For city type addresses inside the TIGER areas, TIGER was able to code 77.7 percent [0.2] to the block level.

It is also useful to look at the percent of the records coded to block by TIGER based on two different universes (denominators): (1) all city type addresses; and (2) all addresses. TIGER coded 61.2 percent [0.2] of all city type addresses in the U.S. (both inside and outside the TIGER areas) to the block level. For all addresses in the U.S., TIGER coded 49.8 percent [0.2] to the block level.
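
The city type figures appear to chain multiplicatively: the share of city type addresses that fall inside TIGER areas, times the block coding rate among those eligible addresses, gives the rate over all city type addresses. A short check, using only the percentages quoted above (the function name is illustrative):

```python
def rate_in_wider_universe(share_eligible, rate_among_eligible):
    """Coding rate over a wider universe, given the eligible share of it."""
    return share_eligible * rate_among_eligible


# Figures quoted in the text for city type addresses.
share_in_tiger_areas = 0.788    # header coded to places inside TIGER areas
block_rate_if_eligible = 0.777  # of those, coded to block by TIGER

all_city_rate = rate_in_wider_universe(share_in_tiger_areas, block_rate_if_eligible)
print(f"{100 * all_city_rate:.1f} percent of all city type addresses")  # 61.2
```

The same decomposition explains why state rates over the wider universes mostly track the share of addresses in TAR areas.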

The coding rates did vary by state. The variation by state of the percent of all addresses coded to block primarily reflects the number of city type addresses in TAR areas in the state. The rates vary from 8.8 percent [1.5] in Vermont to 81.6 percent [2.3] in the District of Columbia. The percent of all city type addresses that were coded to block via TIGER varies from 25.6 percent [4.2] in Vermont to 87.2 percent [2.0] in the District of Columbia. (Variation is primarily a reflection of the percent of addresses in TAR areas.)

The percent of eligible addresses (city type in TAR areas) that were coded by TIGER is also shown. The coding rate varies from 87.2 percent [2.0] in the District of Columbia to 56.1 percent [4.1] in West Virginia. There are 11 states with rates under 70 percent -- Alabama, Alaska, Florida, Georgia, Mississippi, New Hampshire, North Carolina, South Carolina, Utah, Vermont, and West Virginia. Most of the variation by state is due to variations in quality of address range information or variations in quality of IRS report address information.

Utah, however, is a special case -- 44.9 percent of the addresses are of the relative position type, most of which are inside TIGER areas. However, TIGER does not seem to handle this format -- only 1 percent [0.7] are coded by TIGER.

D. Causes of Uncodeds

We also were interested in identifying causes of non-coding, so we displayed all TIGER uncoded records in the ZIP codes for selected counties in Maryland. These included Prince George's, Anne Arundel, and Baltimore counties, Baltimore City, and some of Harford, Montgomery and Carroll counties. These were selected because we had independent street name listings and maps (the ADC street maps). The quality of the address coding information in TIGER can vary by individual area, so the results cannot be generalized.

We looked up the uncoded addresses in the ADC street maps to try to find the cause. Causes are shown in the following table. Most of these (75 percent) were because of a bad house number in the mailing address or because of missing or incorrect TIGER address information (either name or house number range). Unfortunately, we did not have the time to examine the TIGER address information in more depth.

Note that the IRS addresses contain numerous misspellings. There are at least three types of misspellings -- phonetic misspelling by the taxpayer, misreading of handwritten addresses by the IRS data keyer, and the usual types of data keying errors. The TIGER matching software does deal with minor name misspellings that are phonetic in nature. However, there was some noncoding (26 cases) because of other misspellings or miskeying. Four of the 26 misspellings were because of miskeying "CL" instead of "CT". Eighteen of the 26 differed on one character only.

There were also 21 cases with address parsing failure. The following table shows the number of uncoded cases by cause of noncoding:

                                       Number  Percent
TIGER address information.............    198     75.0
Street name misspelling, bad
      abbreviation or miskeying.......     26      9.9
Address parsing failure...............     21      7.9
Name not in ADC either................      7      2.7
Address is for a GQ...................      2       .8
Military unit type address............      3      1.1
Incorrect ZIP.........................      1       .4
Incorrect street name.................      1       .4
Incorrect street designation..........      2       .8
Bad TIGER header county code..........      3      1.1
Total.................................    264    100.0

Improvements in coding will rely primarily on improvements in the TIGER address information. However, other improvements to coding could be made by: (1) address standardization and parsing improvements; and (2) expanded handling of address misspellings and miskeying in the name matching software.
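
To make point (2) concrete, the sketch below compares two name strings character by character and looks ahead one or two characters on a mismatch, in the spirit of the comparator quoted in part A. It is not the Geography Division's routine: the scoring (one point per matched character, one point for a transposed pair, divided by the longer length) and the name `compare_names` are assumptions.

```python
def compare_names(a, b, max_lookahead=2):
    """Return a similarity score between 0 and 1 for two name strings."""
    a, b = a.upper(), b.upper()
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matched += 1
            i += 1
            j += 1
            continue
        # Transposed pair, e.g. "MIAN" vs "MAIN": partial credit, skip both.
        if len(a) - i >= 2 and a[i:i + 2] == b[j:j + 2][::-1]:
            matched += 1
            i += 2
            j += 2
            continue
        # Dropped character(s): look ahead one or two characters to realign.
        for k in range(1, max_lookahead + 1):
            if i + k < len(a) and a[i + k] == b[j]:
                i += k  # extra characters in a
                break
            if j + k < len(b) and b[j + k] == a[i]:
                j += k  # extra characters in b
                break
        else:
            # Plain substitution: step past the mismatch on both sides.
            i += 1
            j += 1
    return matched / max(len(a), len(b), 1)


print(compare_names("MILL", "HILL"))  # 0.75
print(compare_names("CT", "CL"))      # 0.5
```

Under this scoring a single miskeyed character in a short field such as "CT" versus "CL" costs far more than the same error in a long street name, so any acceptance threshold would probably need to depend on string length or field type.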

E. ACF Coding Results

The results of the TIGER coding of the IRS addresses are based on the 1990 software and address segment and range information. However, the address information was limited to the TAR areas. The following sections assess the increase in coding rates that can be expected from expanding the TIGER address range information to include all city type addresses.

We extracted the IRS address information from the sample test file for cases in the U.S. that were not coded by TIGER. These could be because they were outside of the TAR areas or because they were not coded for other reasons. There were 35,351 such addresses. Staff from the Geography Division standardized these and matched them to an ACF extract file of street segment/address range information. There were 1,185 cases with no house number, leaving a potential coding universe of 34,166 cases. Of these, 46.1 percent [0.3] were coded, 32.8 percent [0.3] did not match on street name, 15.9 percent [0.2] matched on street name but not on house number, and 5.0 percent [0.1] had a feature identification (FID) error (matched on street name but not on other street descriptors such as Road, Lane, Drive, etc., or on directionals such as N, E, NE, etc.).

                                       Number  Percent
Total cases...........................  35,351
      No house number.................   1,185
Universe for matching.................  34,166    100.0
      Coded...........................  15,750     46.1
      Not coded.......................  18,416     53.9
            Nonmatch on street name...  11,218     32.8
            Nonmatch on FID...........   1,697      5.0
            Nonmatch on house number..   5,441     15.9
            Other.....................      60       .2
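
The percents and bracketed standard errors for this table can be reproduced from the counts with the formula given at the start of the paper, the square root of P*(1-P)/N. The sketch below assumes that N is the matching universe of 34,166 cases; the function name is illustrative.

```python
from math import sqrt


def percent_and_se(count, base):
    """Return (percent, standard error in percentage points), rounded to 0.1."""
    p = count / base
    return round(100 * p, 1), round(100 * sqrt(p * (1 - p) / base), 1)


universe = 34166
rows = [
    ("Coded", 15750),
    ("Nonmatch on street name", 11218),
    ("Nonmatch on house number", 5441),
    ("Nonmatch on FID", 1697),
]
for label, count in rows:
    pct, se = percent_and_se(count, universe)
    print(f"{label}: {pct} percent [{se}]")
```

Run on these counts, the sketch returns 46.1 [0.3], 32.8 [0.3], 15.9 [0.2] and 5.0 [0.1], which agree with the figures cited in the text.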

In hindsight, it would have been useful to separate the above results by TIGER/nonTIGER areas and to relate the results to the uncoded cases review we did for the selected counties in Maryland.

F. A Note on the FID

Obviously, street name, house number and directionals are critical to correct matching and coding, but the street descriptor is also critical if the street name is not unique within the "blocking factor". In Prince George's County, MD, for example, there are 269 different street names beginning with the letter "A" (source: ADC's street map). Of these, 202 are unique without the street descriptor, but 67 need the street descriptor to form a unique name. For example, there are the following streets in Cheltenham: Angora Ct, Angora Dr, Angora Terr, and Angora Way.
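
A count like the 202 versus 67 split can be produced by grouping street names within a blocking area and checking which base names occur with more than one descriptor. The sketch below uses the Angora streets from the text plus one made-up entry ("ELM ST"); the function name and the input layout are assumptions.

```python
from collections import defaultdict


def names_needing_descriptor(streets):
    """Map each base name used with more than one descriptor to its descriptors.

    `streets` is an iterable of (base_name, descriptor) pairs drawn from a
    single blocking area.
    """
    descriptors = defaultdict(set)
    for base, descriptor in streets:
        descriptors[base].add(descriptor)
    return {base: sorted(d) for base, d in descriptors.items() if len(d) > 1}


streets = [
    ("ANGORA", "CT"), ("ANGORA", "DR"), ("ANGORA", "TERR"), ("ANGORA", "WAY"),
    ("ELM", "ST"),  # hypothetical street with a unique base name
]
print(names_needing_descriptor(streets))
```

For any base name returned by this check, accepting a nonmatch on street type would make the match ambiguous, so the FID has to agree before a block code can be assigned.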

G. Estimated TIGER/ACF Coding Rates

Using the TIGER coding results data and the above data, we estimated an expected coding rate for TIGER with the ACF information incorporated. The universe described above is based on TIGER defined address types, and is slightly different from the universe of type addresses we have been using in our tables. The classification "not codable by TIGER" includes: (1) those addresses not assigned a place level header code; (2) those addresses classified as city type that the TIGER software did not recognize as city type (post office boxes, rural routes, other); and (3) an estimated number of city type addresses that are not codable because the TIGER software did not recognize a house number (using results shown in part E, decomposed by address type).

We also modified the data shown in part E to account for the universe differences, resulting in a new estimated coding rate of 47.5 percent. The number of cases not coded by TIGER that could be coded by the incorporation of the ACF into TIGER is estimated as the number of potentially codable cases that were not coded by the 1990 TIGER times an estimated coding rate for the ACF. The modified rate of 47.5 percent was used for the "Type 1's" category; a rate of 0.0 percent was used for the categories "Blank address", "Rural Routes", and "Post Office Boxes". The actual TIGER coding rate was used for the other miscellaneous types (because the low rates represent inherent deficiencies in the software or in the actual IRS addresses). The number of cases "codable by TIGER or the ACF" includes the above estimate plus the number actually coded by TIGER.
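
The estimation just described can be sketched as a small calculation. The rates for the named categories come from the text. The counts are hypothetical and are not taken from Table 8. Treating the "actual TIGER coding rate" for a miscellaneous type as its TIGER-coded count divided by its total is an assumption.

```python
# ACF coding rates by address category, as described in the text.
ACF_RATE = {"type_1": 0.475, "blank": 0.0, "rural_route": 0.0, "po_box": 0.0}


def estimate_combined_rate(categories):
    """Estimate the share of addresses codable by TIGER or the ACF.

    `categories` maps a name to a tuple of
    (total, coded_by_tiger, potentially_codable_but_uncoded).
    """
    total = codable = 0.0
    for name, (n, tiger_coded, potential) in categories.items():
        # Other miscellaneous types reuse their observed TIGER coding rate.
        tiger_rate = tiger_coded / n if n else 0.0
        acf_rate = ACF_RATE.get(name, tiger_rate)
        codable += tiger_coded + potential * acf_rate
        total += n
    return codable / total


# Hypothetical counts, for illustration only.
sample = {
    "type_1": (1000, 600, 380),
    "po_box": (100, 0, 0),
    "other_misc": (50, 5, 40),
}
print(round(100 * estimate_combined_rate(sample), 1))
```

Because the result is a weighted sum, it is most sensitive to the rate applied to the largest category, here the "Type 1's".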

These estimates represent our best guess as to the expected coding rates. These estimates are subject to sampling errors, and are sensitive to the implicit assumptions used to compile the estimates from the several sources. That is, the coding rate will be affected by how the TIGER/ACF integration is actually done.

Table 8 shows results of the estimation process. The new TIGER/ACF coding files should be able to code 64.4 percent of all IRS addresses in the U.S. About 9 percent are uncodable because they are rural routes, and about 8 percent are uncodable because they are post office boxes. About 1 percent are uncodable because they are missing a house number, 1 percent are uncodable for other reasons, and there are 17 percent that can potentially be coded but are not coded.

One last point to keep in mind is that the addresses used for this test are IRS mailing addresses. These may include non-residential (business, etc) addresses that are not covered by TIGER and are not in the ACF.

Section V -- Future Work

There is still a lot of work to be done in order to design a complete geographic coding system. One activity is to evaluate address standardization and parsing software, especially for the nonstandard addresses. If changes for the nonstandard addresses cannot be made effectively to the software without disrupting the standardization of other addresses, can changes be made that are area specific, or do we need to develop special preprocessors? If TIGER is not able to match an address to a specific street name and address range, what higher level matching criteria would be acceptable? For example, suppose the street name matches and the address range does not match, but the street is entirely within a block -- would coding to the block be acceptable? What higher levels of geography would be acceptable: tract, place, etc.? Would this also apply to the coding of street corners (where 4 different block codes could possibly be assigned)? We also need to investigate the rural route/highway route number/street name usages where they overlap, and investigate methods for coding these. We need to look into the handling of military bases for the migration data and its impact on the coding process. An assessment of the differences between mailing and residence addresses needs to be done, as well as deciding on how these should be handled in the migration data. The impact of address coding improvements on the migration data also needs to be assessed.

Another process to investigate would be the incorporation of digitized 5-digit ZIP Code maps (with possible extensions to the ZIP/sector and/or ZIP+4), TIGER feature and boundary information, geographic coding information, and decennial census data (such as block statistics) into a geographic information system. Finally, an assessment of methods of updating the coding files needs to be done, as well as looking at longer range coding systems that can be developed from information being developed for the 2000 census.

1/ van der Vate, Barbara J., "Methods Used in Estimating the Population of Substate Areas in the United States", August 1988

Batutis, Michael J. Jr. and Ronald C. Prevost, "Computers, Data Base Structure and Population Estimates Methodology: Directions for the Coming Decade", March 1991

2/ Shepherd, Suzanne B., "Meeting With Gary West, Address Programs Support Manager of the United States Postal Service, Louisville, KY Division", Census Bureau internal memorandum, October 2, 1991

3/ Walsh, Thomas C., "Results of Processing List/Enumerate Addresses", Census Bureau internal memorandum, February 25, 1992

4/ Yergen, Walter, "Presentation at Workshop on Computer Matching", November 1991

Table 1 -- Number of Housing Units in the 1990 Census by Address Type and State
Table 2 -- Number of 1988 Income Tax Returns by Address Type and State
Table 3 -- Number of Counties by Percent Usage of Address Types and Population Size
Table 4 -- Number of Places by Percent Usage of Address Types and Population Size
Table 5 -- Number of Income Tax Returns by TIGER Coding Status and Address Type
Table 6 -- Number of Income Tax Returns by TIGER Coding Status and State, for All Addresses and City Type Addresses
Table 7 -- State Ranking of TIGER Coding Rates
  Percent of All Returns in U.S.A. Coded to Block
  Percent of All City Type Addresses Coded to Block
  Percent of City Type Addresses in TAR Areas Coded to Block
Table 8 -- Potential Increase in TIGER Coding Rates From ACF Augmentation
