Geographic Coding of Administrative Records -- Current Research in Zip/Sector-To-County Coding Process

June 1994

Written by:

Douglas K. Sater

Working Paper Number: POP-WP007

Disclaimer

This paper reports the general results of research undertaken by Census Bureau Staff. The views expressed are attributable to the authors and do not necessarily reflect those of the Census Bureau.

Introduction

The Population Estimates Branch of the Bureau of the Census annually produces estimates of the population for states and counties, and biennially produces estimates of the population for the 36,000 general purpose units of local government. These include all incorporated places and all functioning minor civil divisions (MCDs). Where places are split by county and/or MCDs, the estimates are made for each place/county/MCD piece and then aggregated into place totals. There are two basic approaches to the estimate process 1/, which have different requirements for the geographic coding of the administrative records.

The first approach is to use the aggregate numbers from the administrative records or change in the aggregate numbers as symptomatic indicators of population change (or to use the change in each subarea's share of the parent area's aggregate numbers as symptomatic indicators of population change). This approach places higher priority on the number of returns coded (coding rate) and annual stability in the coding rate.

The second approach is the Tax Return method, which estimates each of the components of population change -- births, deaths, international migration and internal migration. To measure internal "migration", the Census Bureau geographically codes the current year individual income tax returns and matches them to the prior year's geographically-coded file. A comparison of the geographic codes on the returns matched between the two years determines the in-migrants, out-migrants and non-migrants. The quality and consistency of the geographic coding have a direct effect on the quality of the "migration" data, and hence, on the quality of the population estimates. Note that the IRS data measures the movement of tax filers and their exemptions, and is not migration data, per se. It is an indicator of the movement of people between geographic areas.
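The comparison of matched returns can be sketched in a few lines of Python. This is only an illustration, not the production processing: the record layout (a return ID mapped to a county code and an exemption count), the function name, the county codes, and the choice to count year-2 exemptions are all assumptions made for the example.

```python
from collections import defaultdict

def tabulate_migration(prior, current):
    """Classify matched returns for each county.

    prior, current: dicts mapping a hypothetical return ID to a
    (county_code, exemptions) pair for year 1 and year 2.
    Returns {county: {"in": n, "out": n, "non": n}}, counted in
    year-2 exemptions (an assumption for this sketch).
    """
    flows = defaultdict(lambda: {"in": 0, "out": 0, "non": 0})
    for return_id, (county_2, exemptions) in current.items():
        if return_id not in prior:
            continue  # unmatched returns cannot be classified
        county_1, _ = prior[return_id]
        if county_1 == county_2:
            flows[county_2]["non"] += exemptions   # non-migrant
        else:
            flows[county_1]["out"] += exemptions   # out-migrant from year-1 county
            flows[county_2]["in"] += exemptions    # in-migrant to year-2 county
    return dict(flows)

# Illustrative data only.
prior = {"r1": ("51081", 3), "r2": ("51081", 2), "r3": ("51595", 1)}
current = {"r1": ("51081", 3), "r2": ("51595", 2), "r4": ("51595", 4)}
flows = tabulate_migration(prior, current)
```

Because the classification depends entirely on the two county codes, any change in how a mailing address is coded between years shows up as a "move" even when the filer did not relocate.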

The individual income tax returns do not contain geographic codes for states, counties, MCDs or places, but they do contain a complete mailing address. Specifically, street address, post office name, 9-digit ZIP Code and mailing state abbreviation are included. The mailing address represents the location at which the taxpayer wants contact with the IRS to occur. For most taxpayers, the mailing address is the same as the residence address. For others, it could be a place of business, a tax preparer or accountant, a post office box, a second residence (for dual residents), parents' address, etc. Geographic coding to mailing address rather than to residence will affect the migration data in two ways. First, some residence movers will be missed because the mailing address did not change, and false movers will be created solely because the mailing address was changed. Second, measured migration may be from/to incorrect geography.

Technical Working Paper No. 2 2/ discussed current methods of assigning geographic codes to the federal individual income tax returns and initial test results of potential new methods. One specific goal of our research efforts is to be able to accurately code to county quickly and efficiently so that the state and county population estimates can be produced in an integrated process. Given the current production schedule and methods of processing, that leaves 1 to 2 weeks for the county coding and migration production and review processes. This production schedule precludes the use of address-based coding systems for county coding.

This report focuses on results of coding all 110 million records in the tax year 1991 and 1992 files using a ZIP+4-to-county cross reference file (referred to in this document as CCRS) and the resultant county migration data. This paper is broken into eight basic sections as described below:

Section I -- This section is a description of the ZIP/sector-to-county coding files. Areas covered include: (1) a discussion of the 5-digit ZIP Code and its relationship to geographic areas; (2) the post office's ZIP+4 assignment process; (3) the post office's ZIP+4-to-county cross reference file; (4) the editing of the cross reference file; and (5) the development of a ZIP/sector-to-county coding file (CCRS).

Section II -- This section briefly discusses the process of moving the data to a work station environment, the process of coding, and the preparation of the migration data. Consideration is also given to work station environment issues, such as file sizes, processing software, production times, etc.

Section III -- This section briefly discusses the definitional differences between: (1) the mailing state migration data; (2) the probability state and county migration data; and (3) the original and revised CCRS migration data. Subsequent sections explore selected data differences in greater detail.

Section IV -- This section compares the number and geographic distribution of the uncoded tax returns between the CCRS and the probability data products. The number of uncoded tax returns varies between the mailing state migration data, the CCRS migration data and the probability migration data. Also, the geographic distribution of the uncoded returns is different between the CCRS coding and the probability coding.

Section V -- This section discusses state level differences between the original CCRS and the revised CCRS migration data, the CCRS migration data vs the mailing state migration data, and the CCRS migration data vs the probability migration data. The section also looks at causes of the differences found.

Section VI -- This section compares the original and the revised CCRS migration data at the county level and compares the CCRS migration data to the probability migration data. The specific case of Emporia, VA is examined in detail.

Section VII -- Even though the CCRS migration data may appear to be of good quality for the counties, there may still be errors in the CCRS county coding. This section compares the total number of returns and exemptions between the original CCRS data, the revised CCRS data, and the probability data. Specific cases of Catron Co. NM, Grant Co. NM, Blanco Co. TX, and Burnet Co. TX are examined in detail.

Section VIII -- This section briefly outlines the recommendations for the use of the CCRS migration data, revisions needed in the design of the CCRS data, and future work needed.

The main tables and plots are shown at the back of the report, beginning on page 52. They are numbered sequentially regardless of whether they are tables or plots. Each plot is generally derived from the table immediately preceding it. Note that the plots are actually 3-D histograms, because they are shown in discrete cells, and the cell "height" is represented by an alpha character. An "A" represents one tally in the cell (such as a state or county), a "B" represents 2 tallies, ... , and a "Z" represents 26 or more tallies. Where a "Z" represents more than 26 tallies, the sum total of all tallies in excess of 26 is noted at the bottom of the plot.
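The letter coding of cell heights can be expressed as a small Python sketch. The function names are hypothetical, blank output for empty cells is an assumption, and the overflow note is read here as the sum of the amounts by which cells exceed 26.

```python
def tally_letter(count):
    """Return the plot character for a cell holding `count` tallies."""
    if count <= 0:
        return " "  # assumption: empty cells are left blank
    # 1 -> "A", 2 -> "B", ..., 26 or more -> "Z"
    return chr(ord("A") + min(count, 26) - 1)

def overflow_total(counts):
    """Sum of the tallies beyond 26 across all cells of a plot."""
    return sum(c - 26 for c in counts if c > 26)
```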

The analysis is conducted for the 3141 counties and county equivalents in the U.S. as of 1990 (that is, Denali Borough, Alaska is not included). Tables 7a and 7b also exclude Yukon-Koyukuk Co., Alaska. Tables 8a to 13b exclude Yukon-Koyukuk Co., Alaska, and Yellowstone National Park, Montana.

The research presented in this paper relates specifically to the geographic coding of addresses in the individual income tax returns. However, it is applicable to analogous addresses in other administrative record systems.

SECTION I -- CODING BY ZIP CODE

Areas covered in this section include:

a discussion of the 5-digit ZIP Code and its relationship to geographic areas;
the post office's ZIP+4 assignment process;
the post office's ZIP+4-to-county cross reference file;
the editing of the file; and
the development of a ZIP/sector-to-county coding file.

A. Relationship of 5-digit ZIP Code to County

One specific goal of our research efforts is to be able to accurately code to county quickly and efficiently so that the state and county population estimates can be produced in an integrated process. Given the current production schedule and methods of processing, that leaves 1 to 2 weeks for the county coding and migration production process. This production schedule precludes the use of address-based coding systems for county coding. However, coding to county by using ZIP+4-to-county cross reference files is a promising avenue.

Coding to county by using the ZIP Code in the mailing address does not assume that the mailing address is the same as the residence address, but it does implicitly assume that they are in the same county.

B. ZIP Code Assignment

ZIP Codes are designed to deliver mail. The ZIP Codes and their areas of responsibility are assigned to handle the mail as efficiently as possible and (mostly) without regard to geographic boundaries. In a technical sense, ZIP Codes are not area based, but a collection of delivery points. However, the delivery points of each ZIP Code can usually be assembled into an area with boundaries. A ZIP Code can also be assigned to a unique delivery point such as a university, government building, business, or a group of post office boxes.

At the state level, most ZIP Codes deliver wholly within the state, but a few do deliver to out-of-state areas. At the county level, some ZIP Codes cross county boundaries, but most deliver wholly within the county. The ZIP Codes that are split by state or county, however, pose problems for coding by ZIP Code.

In selected parts of the country, there are also postal delivery processes that pose special problems. In Alaska, for example, there are post offices that serve as an intermediate drop off point, where mail is held in pouches for later delivery to a remote area such as a logging camp, fishery, etc. These are now being changed to post office boxes, with a three character alpha as part of the box number, but they still pose problems for geographic coding by ZIP Code. Also, there are areas that have no house-by-house delivery and individuals have to pick up their mail from the post office. Such individuals may also have a choice of post offices. In such cases, direct coding by ZIP Code may be problematic.

One option is to create a ZIP to county cross-reference file by collapsing the 1980 primary coding guide to ZIP/state/county, using only one possible county. This incorporates the 1980 mailing to residence adjustment. However, it is an old adjustment, and only the 5-digit ZIP Code is available. Based on our experience with the 1980 coding guide, we estimate we could code to the county level using 5-digit ZIP Code (only) with about 96 percent accuracy overall. However, quality of coding will vary dramatically by county. For many of the large counties, the coding will be good, but for most of the small counties, the coding will be very poor. Some counties will not be coded at all. Additionally, independent cities, such as Baltimore city, MD or Manassas Park City, VA and the surrounding counties will have substantial problems in the coding.

For many large cities (excluding the independent cities), most of the ZIP Codes are wholly contained within the city. Geographic coding to large cities using 5-digit ZIP Code (only) may be feasible. For small places and the more sparsely populated areas, the ZIP Codes tend to cover several subcounty areas. Geographic coding to such subcounty areas using 5-digit ZIP Codes would be poor.

C. ZIP + 4 Code

A few years ago, the post office assigned an additional 4 digits to the existing 5-digit ZIP Code to make mail handling and delivery more efficient. The +4 code is actually two codes in one -- the first 2 digits are the sector and the last 2 digits are the segment within the sector. The following description of the ZIP+4 assignment process was prepared by Suzanne Shepherd 3/. But first, a cautionary note: these are guidelines established by the post office, and individual postmasters have flexibility in implementing them.

"The U.S.P.S. perceives ZIP+4 codes in city-style address areas as essentially geographic in nature. A city-style address typically is an address in structure number-street name form, such as "4320 Huntingtown Road." The first two digits of the +4 add-on, which is referred to as the "sector" component, typically represents a block group (but is not coincident with Census Bureau-defined block groups). The last two digits of the +4 add-on, which is referred to as the "segment" component, typically represents a block side, a company, a unit within a company, a building, or a floor within a building.

To establish ZIP+4 Codes, the U.S.P.S. plots a 5-digit ZIP Code boundary on a street map and uses main thoroughfares to cut the 5-digit ZIP Code area into preliminary sectors. The U.S.P.S. then counts the number of block sides and the number of companies that receive 10 or more mailing pieces. If these two numbers total more than 50 in a primarily commercial area, the preliminary sector usually is further divided. If these two numbers total more than 70 in a primarily residential area, the preliminary sector usually is further divided. These thresholds are merely guidelines that change somewhat due to a preliminary sector's growth potential. For example, if a preliminary sector contains a lot of open area, the U.S.P.S. will lower the number, but if a preliminary sector is already quite congested, the U.S.P.S. will raise the number." 3/
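The subdivision guideline quoted above can be written as a simple rule. The Python sketch below is an illustration only: the function and parameter names are hypothetical, and the fixed thresholds ignore the adjustment for growth potential that the quotation describes.

```python
def should_subdivide(block_sides, large_mail_companies, primarily_commercial):
    """Apply the quoted guideline to a preliminary sector.

    block_sides: number of block sides in the preliminary sector
    large_mail_companies: companies receiving 10 or more mailing pieces
    primarily_commercial: True for commercial areas, False for residential
    """
    # Guideline thresholds from the quotation; local postmasters may vary them.
    threshold = 50 if primarily_commercial else 70
    return block_sides + large_mail_companies > threshold
```

For example, a preliminary sector with 40 block sides and 15 such companies would usually be divided further in a commercial area (55 exceeds 50) but not in a residential one (55 does not exceed 70).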

The map on page 8 shows the City of Cambridge, Ohio and a small portion of the surrounding area. The city and the surrounding area are covered by a single 5-digit ZIP Code. The sectors for the city style deliveries have been overlaid on the map. These boundaries have been derived from an examination of the ZIP+4 Codes on residential address lists. For exposition purposes, the boundaries have been expanded to the nearest physical feature (river, interstate highway, etc.), to include uninhabited area (such as city parks, cemeteries, etc.). Also, some sectors that have only business deliveries may not be shown on the map.

From the map, we can see that the sectors are formed by adjacent blocks and block faces, and can be bounded by a polygon. The polygons are mutually exclusive and encompass the entire city style delivery area. We can also see from the map that a sector includes deliveries on both sides of a street at a sector boundary. Other Post Offices may choose to have the sector boundary in the middle of the street, with even numbered addresses in one sector and odd numbered addresses in another sector.

The shaded areas to the north and to the southeast of the delineated sectors show areas inside the city limits that do not have city style deliveries. These areas are covered by the rural route style deliveries, even though the addresses are of the house number/ street name format.

"When segment numbers are depleted within a particular sector area, which we may also refer to as a ZIP+2 area, the U.S.P.S. inserts another sector area within the original sector area. This additional sector area may split the original sector area, creating two discontiguous sector areas with the same sector number. Segment numbers are unique for a sector number. The number assigned to the new, inserted sector area is previously unused within the particular 5-digit ZIP Code area. Residential-to-commercial rezoning typically causes segment number depletion." 3/

Text Chart A -- City Style Sectors in the Cambridge, Ohio Post Office

"In areas that have rural-style addresses, the U.S.P.S. assigns +4 add-ons according to a letter carrier's line of travel. Therefore, ZIP+4 Codes in these areas do not refer to geographic areas.

In areas that have rural-style addresses, a street segment receives a +4 add-on only if it is part of a letter carrier's route. The U.S.P.S. differentiates between block sides only if a carrier stops on both sides of the street to deliver mail. The first rural route for a 5-digit ZIP Code usually has a sector number of "97", the second rural route has a sector number of "96", and so forth. The +4 add-ons for a rural route typically go from "9701" to "97nn", with "9701" being the first street segment on which the carrier delivers mail and "97nn" being the last." 3/

The map on page 10 shows the delivery path of two of the 9 rural route sectors from the Cambridge, Ohio Post Office. The dotted line is sector 94 and the dashed line is sector 97. It is obvious from the map that these sectors are not geographically based. They deliver to a few addresses in the city limits, and to addresses in several townships outside the city limits. In short, these sectors wind all over the countryside. They do not, however, cross into another county.

There are two other interesting facets of the rural route deliveries for the Cambridge, Ohio Post Office. Most of the area has been converted to house number/ street name format and is covered by sectors 90 to 97. Sectors 91 to 97 cover most of the area in a linear fashion. Sector 90 consists of scattered street segments not covered by sectors 92 to 97. Also, the few areas that have not been converted to house number/ street name format are all lumped together in sector 98.

"If a rural route crosses a county boundary, the sector number changes, typically to another number in the nineties, and the U.S.P.S. numbers the segments in sequence beginning with "01". If the rural route crosses back into the original county, the +4 numbering resumes where the original +4 numbering left off. For example, if "9718" was the last +4 number assigned before the rural route crossed into another county, then "9719" is the first +4 assigned when the rural route crosses back into the original county.

When a group of rural mail boxes receive mail from different letter carriers, their sector numbers are different and there may be no pattern to the +4 add-ons. For example, the +4 add-ons for a group of rural mail boxes may be "9601", "9622", "9705", and "9601" again, because the mail boxes are not only on different rural routes, but on routes coming out of different 5-digit ZIP Codes. If a structure receives mail via a rural route, its mail box does not need to be anywhere near the structure." 3/

Text Chart B -- Two of the 9 Rural Route Sectors in the Cambridge, Ohio Post Office

"If a jurisdiction establishes city-style addresses and the U.S.P.S. adopts them for mail delivery, the U.S.P.S. reassigns the +4 numbers." 3/

Additionally, sectors 00 through 09 are usually reserved for the P.O. boxes. Sectors 98 and 99 are usually reserved for the postmaster and for "business reply mail".
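The structure of the ZIP+4 Code and the usual sector conventions described in this section can be summarized in a short Python sketch. The function names and the sample value are hypothetical. The sector categories reflect only the "usual" assignments; as the Cambridge, Ohio example shows (sector 98 holding unconverted rural addresses), local practice can differ.

```python
def split_zip4(zip9):
    """Split a 9-digit ZIP Code into (zip5, sector, segment)."""
    digits = zip9.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit ZIP Code: {zip9!r}")
    # First 5 digits: ZIP Code; next 2: sector; last 2: segment.
    return digits[:5], digits[5:7], digits[7:9]

def usual_sector_use(sector):
    """Typical use of a sector number under the guidelines described above."""
    n = int(sector)
    if n <= 9:
        return "post office boxes"
    if n >= 98:
        return "postmaster / business reply mail"
    if n >= 90:
        # Assumption: rural routes usually count down from 97.
        return "rural route"
    return "city-style delivery"  # assumed default for the remaining sectors

zip5, sector, segment = split_zip4("43725-9701")  # illustrative value
```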

The +4 codes are used by the IRS in the mailing address. For the 1988 IRS 1-percent sample file, 94 percent of all addresses had the +4 codes. Also, 98 percent of the house number/street name type addresses had a +4 code, 91 percent of the rural route type addresses had a +4 code, and 98 percent of the P.O. box addresses had a +4 code.

D. ZIP+4 to County Cross Reference File

The post office has created a ZIP+4-to-county cross reference file which could serve as the basis for the county coding process. The file is a quarterly product and is updated to reflect changes occurring since the prior release. That is, new ZIP Codes are added, discontinued ZIP Codes are deleted, and changes to ZIP Codes or +4 codes are incorporated.

The ZIP+4 to county cross reference file contains a record for each unique ZIP+4 Code, or about 24 million records. Two exceptions to this are as follows: (a) If a business (or government agency) has more than one +4 code assigned to it, the file will have only one record with the data on the record showing the range of +4 codes assigned; (b) the same may be true for post office boxes.

The file contains the following data items:

ZIP Code;
sector/segment for lowest of the sector/segment range;
sector/segment for highest of the sector/segment range;
a 2-character state abbreviation;
county code; and
county name.

Note that the 2-character state abbreviation is the state in which the post office is located and the county represents the county in which the mail is delivered. That is, in a few cases, the county may be in a different state than the state name identified. There is no street name or address range information contained in this file.
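The data items listed above can be pictured as a simple record type. The Python sketch below is illustrative: the field names, field types, and sample values are assumptions, since the physical layout of the post office file is not given here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrossRefRecord:
    zip5: str         # 5-digit ZIP Code
    low_plus4: str    # lowest sector/segment in the range, e.g. "9701"
    high_plus4: str   # highest sector/segment in the range
    state_abbr: str   # state of the post office, not necessarily of the county
    county_code: str  # county in which the mail is delivered
    county_name: str

    def covers(self, zip5, plus4):
        """True if this record's range includes the given ZIP+4 Code."""
        # Zero-padded 4-digit strings compare correctly as text.
        return self.zip5 == zip5 and self.low_plus4 <= plus4 <= self.high_plus4

# Made-up example of a record holding a range of +4 codes.
rec = CrossRefRecord("43725", "0001", "0025", "OH", "059", "GUERNSEY")
```

Most records would have the same low and high value; only the exceptions noted above (businesses, government agencies, and possibly post office boxes) carry a true range.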

The file should cover all ZIP Codes in the U.S., all ZIP Codes for U.S. possessions (Puerto Rico, Virgin Islands, etc.), and all APO/FPO ZIP Codes. All counties and county equivalents in the U.S. and U.S. possessions are represented in the file with the exception of Yellowstone National Park, MT (30-133), and, for the 1991 file, Denali Borough, AK (02-068).

The county should represent the county in which the mail is delivered. For post office boxes, it is the county in which the boxes are located. The APO/FPO ZIP Codes are assigned to the county the mail is delivered from, with the exception of APO/FPO ZIP Codes for military bases in Alaska and Hawaii. These are assigned appropriate county codes in Alaska or Hawaii.

I was not able to determine exactly how the ZIP+4-to-county cross reference file was prepared, but my understanding of the process is as follows. The data file was manually prepared at the local post office level, under general guidelines provided from "headquarters" USPS. The posted work sheets were data keyed and the file compiled by the regional or national information centers. Thus, it is reasonable to expect errors in the posting and in the data keying of the county codes. Also, it is important to note that the local post offices are relatively autonomous. They usually try to adhere to the guidelines provided by "headquarters", but one should expect variations to occur. Further, there will be no documentation of such variation. In short, the ZIP+4-to-county cross reference needs to be thoroughly edited. Sections E, F, and G describe the edits we performed on the cross reference file.

E. Coverage Edit

The ZIP+4-to-county cross reference file may not include all ZIP Codes. Some omissions are post office errors. Some are ZIP Codes actually used by local areas that are not known by the office assembling the file. Some may be discontinued ZIP Codes; because of lags in implementing ZIP Code changes, administrative record systems are likely to include outdated ZIP Codes. Also, some people continue to use the old ZIP Code even though it has been changed.

The first step was to compare the ZIP Codes in the file with those actually used in the IRS file and with those listed in recent ZIP Code directories. Where needed, additional ZIP Codes were incorporated into the file. Also, when the tax year 1990-1991 and 1991-1992 1-percent test files were processed, we examined all ZIP Codes that had at least 5 uncoded returns. We assigned a county code to these ZIP Codes and incorporated them into the coding files.
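The two coverage checks can be sketched as set and count operations. This Python fragment is a minimal illustration with hypothetical function names; it finds the candidate ZIP Codes only, and the assignment of a county code to each remains a manual review step.

```python
from collections import Counter

def missing_zip_codes(xref_zips, observed_zips):
    """ZIP Codes seen in the tax records or directories but absent from the file."""
    return sorted(set(observed_zips) - set(xref_zips))

def zips_needing_review(uncoded_return_zips, minimum=5):
    """ZIP Codes with at least `minimum` uncoded returns, most frequent first.

    uncoded_return_zips: one ZIP Code per uncoded return.
    """
    counts = Counter(uncoded_return_zips)
    return [(z, n) for z, n in counts.most_common() if n >= minimum]
```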

F. APO/FPO County Code Update

Post Offices for the U.S. military overseas (APO, FPO) are handled out of 4 cities in the U.S. -- New York, Miami, San Francisco and Seattle. The county codes assigned to the APO/FPO ZIP Codes reflected these cities. First, these state/county codes needed to be changed to a separate category denoting APO/FPO, with one exception: the APO/FPO ZIP Codes for military bases in Alaska and Hawaii are assigned the appropriate county in Alaska or Hawaii. Second, the complete list of APO/FPO ZIP Codes was reviewed to make sure that all appropriate ZIP Codes were included. Additions were made where necessary.

The state/county codes for the Trust Territories were also reviewed and modified, as necessary, to reflect the FIPS state and county equivalent codes.

G. Illegal County Code Edit

The ZIP+4-to-county cross reference file contains some illegal county codes. A county code of 999 was occasionally used and there were other non-existent county codes. All records in a ZIP Code that contained an illegal county code were examined and a correct county code determined.

The 999s were cases where the ZIP Code crossed into another state and the person assembling the data did not know what county to code. This occurred most often in North Dakota and South Dakota. These were recoded to a contiguous county in an adjoining state where it seemed reasonable to do so (by looking at the ZIP Code map, the ZIP Code directory, and an atlas).

Most of the other illegal county codes were obvious typographic errors from posting and data keying (such as digit transposition). However, some arose because the state code is the ZIP state and the county is in another state. These were reviewed and the state code changed.

A few (but not many) of the illegal county codes were cases where the person preparing the county codes simply made up a new code to represent some special case in their area. It was not possible to tell what these were. For these, and the remainder of the illegal county codes, a county code was assigned (frequently the dominant county code for the sector). Thus, all illegal state/county codes were changed to legal state/county codes.
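The fallback of assigning the dominant county code for the sector can be sketched as follows. This is an illustration only: most of the edits described above were made by manual review, the record layout and function name are assumptions, and "dominant" is taken here to mean the legal code carried by the most ZIP+4 records in the same ZIP/sector.

```python
from collections import Counter

def repair_illegal_codes(records, legal_codes):
    """Replace illegal state/county codes with the sector's dominant legal code.

    records: list of (zip5, sector, state_county) tuples, one per ZIP+4 record.
    Records whose ZIP/sector has no legal code at all are left unchanged
    so that they can be reviewed by hand.
    """
    legal_by_sector = {}
    for zip5, sector, code in records:
        if code in legal_codes:
            legal_by_sector.setdefault((zip5, sector), Counter())[code] += 1

    repaired = []
    for zip5, sector, code in records:
        counts = legal_by_sector.get((zip5, sector))
        if code not in legal_codes and counts:
            code = counts.most_common(1)[0][0]  # dominant legal code
        repaired.append((zip5, sector, code))
    return repaired
```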

Also, in ZIP Codes that had more than one county listed, there were some that contained at least one county that was not contiguous to the other(s).

A few of these were plausible (e.g. where counties are very close but not contiguous) and were not changed.
Some of these were actually for a contiguous county across the state line (the state code was repaired).
Some were typographic errors not caught in previous edits (and were fixed).
A significant number were inexplicable. These were replaced with the dominant code for the sector.

These reviews and corrections are based primarily on educated guesses and "most likely" corrections. We simply did not have resources to do a thorough review/correction to obtain exact information (for example, by calling the local post office). Still, a substantial amount of effort was expended to clean up the file. It is reasonable to expect that there are still some errors in the file that were not caught by the edits, and some errors introduced by the review/correction process.

The above discussion focused on "bad" codes within ZIP/Sectors but did not give a feel for how many there were. There were 1,494 ZIP Codes with a change, and 3,438 (out of 857,400) ZIP/Sector records with a change. There were 17,539 ZIP+4 records (out of about 24,000,000) with a change.
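Expressed as shares, these counts are small. The short Python calculation below uses only the figures quoted above (the ZIP+4 total is approximate) and shows that roughly 0.4 percent of the ZIP/sector records and well under 0.1 percent of the ZIP+4 records were changed.

```python
# (records changed, total records) as quoted in the text
changed = {
    "ZIP/sector records": (3_438, 857_400),
    "ZIP+4 records": (17_539, 24_000_000),  # denominator is approximate
}

shares = {name: 100 * n / total for name, (n, total) in changed.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.3f} percent changed")
```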

As mentioned earlier, there are a few ZIP Codes that deliver across state lines, and there are a few ZIP/sectors that cross county lines. There are 153 ZIP Codes in more than one state. There are 9,000 ZIP Codes in more than one county. There were 11,331 (out of the total 857,400) ZIP/sectors that were split by county. All states had some split sectors, with Virginia, Michigan and Ohio having an especially large share. The rural route sectors, as expected, contained (relatively) the lion's share of split sectors. Most of the other cases are in the lower sector range (reserved for post office boxes) and in Sector 99 (reserved for the postmaster and business reply mail). There must be some non-standard county code assignment occurring for these selected cases. We will have to further investigate these at a later date.

H. ZIP/Sector to County Coding Guide

Most ZIP Codes are entirely within one county. For those that are split by counties, most of the ZIP/sectors are entirely within one county. Therefore, the file could be collapsed down without loss of information. The collapsed version would provide for a fast and efficient method of coding. We collapsed the file down to a file containing ZIP Code and sector range for the strings of sectors in the same county. This formed the basis for the CCRS coding file. For 77 percent of the ZIP Codes, the sector range will be 00 to 99 (as the ZIP Code delivers within one county). Split sectors were assigned the dominant county. Where a ZIP Code was split by county, an auxiliary coding guide was created which contains the dominant county code in the ZIP Code.
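The collapsing step can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the program actually used: the input layout and function name are hypothetical, "dominant" is measured by the number of ZIP+4 records, and sectors not listed in the file fall inside whichever range spans them.

```python
from collections import Counter, defaultdict
from itertools import groupby

def build_coding_guide(records):
    """Collapse ZIP+4 records into ZIP/sector ranges.

    records: iterable of (zip5, sector, county), one per ZIP+4 record.
    Returns (guide, auxiliary):
      guide     -- list of (zip5, low_sector, high_sector, county) ranges
      auxiliary -- {zip5: dominant county} for ZIP Codes split by county
    """
    by_sector = defaultdict(Counter)
    by_zip = defaultdict(Counter)
    for zip5, sector, county in records:
        by_sector[(zip5, sector)][county] += 1
        by_zip[zip5][county] += 1

    # A split sector is assigned its dominant (most frequent) county.
    dominant = {key: c.most_common(1)[0][0] for key, c in by_sector.items()}

    guide = []
    for zip5, keys in groupby(sorted(dominant), key=lambda k: k[0]):
        sectors = [s for _, s in keys]
        if len(by_zip[zip5]) == 1:
            # ZIP Code delivers within one county: one range covers 00-99.
            guide.append((zip5, "00", "99", dominant[(zip5, sectors[0])]))
            continue
        # Otherwise merge runs of consecutive listed sectors in the same county.
        for county, run in groupby(sectors, key=lambda s: dominant[(zip5, s)]):
            run = list(run)
            guide.append((zip5, run[0], run[-1], county))

    auxiliary = {z: c.most_common(1)[0][0] for z, c in by_zip.items() if len(c) > 1}
    return guide, auxiliary
```

The auxiliary guide gives a fallback county for returns in a split ZIP Code that carry no usable +4 code.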

SECTION II -- SUN CODING PROCESS

The project involves not only the evaluation of the county codes and the resultant migration data, but also full scale testing and evaluation of the computer processing in a work station environment. This section outlines the computer processing resources and requirements, testing of selected processes for a 1-percent sample file, the computer processing procedures, and results from processing the full file. Discussion focuses on processing time and storage requirements.

A. Computer Processing Environments

To do the computer processing for all activities within the Estimates and Projections area in the Population Division, including the processing of the income tax returns, we have access to 4 types of computing environments:

UNISYS 1100 mainframe -- Together with the mainframe, we have access to 32 tape cartridge drives, 16 computer tape drives, and 5.5 million tracks (roughly equivalent to 39.4 gigabytes) of mass storage on hard disk drives. There is also automated backup of all files on mass storage, and numerous other automated processes. The programs for processing the tax return information are written in Standard FORTRAN, with custom (nonstandard) software for efficient input and output. The UNISYS 1100 machine will be scrapped in December 1996.
VAX -- We have access to two VAX mini-computers (one 6700 series and one 7000 series), with 8.2 gigabytes of disk storage (only a small portion of which is allocated to the Estimates and Projections area), and 2 computer tape drives. SAS and VAX FORTRAN are available on the machines.
SUN workstation -- Networked to the basic server are 11 SPARC stations (9 have a 1-gigabyte local disk drive, one has two 2-gigabyte local disk drives), several CD-ROM readers, a tower containing six 2-gigabyte disk drives, an optical disk jukebox and an 8-mm tape drive. There is no software for true distributive processing or file handling. Any distributive processing and file handling would have to be done manually. (For example, I could copy the basic extract file for a single cut to a local disk on one of the SPARC stations, run the processing on the SPARC station, copy the output files from the local disk to the disk on the server, and then delete the files on the local disk. This could be done for several of the SPARC stations.)
PCs -- We have several 386 personal computers.

B. Computer Processing Tasks

Each year, we get the current year IRS file of about 110 million individual income tax returns. The file is in a record ID sort and broken into 136 "cuts" defined by ranges of record ID. The average cut is about 810,000 records, ranging in size from 320,000 to 1,093,000 records. Each input record is 250 characters in length. We process the file on the UNISYS on a flow basis by cut, usually 13 to 15 cuts per day, in a span of about 2 weeks. The first process done on the UNISYS involves, among other things, matching the prior year's file to the current year's file. At that time we create an ASCII extract file containing selected data items from both years for all records. Each extract record is 33 characters in length. There are 3 record types on the file: (1) year-1 only records; (2) year-1 to year-2 matched records; and (3) year-2 only records. There are about 120 million records on the file.
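The figures in this paragraph imply approximate file sizes, which matter for the storage discussion that follows. The Python arithmetic below reads the 250 and 33 character lengths as per-record lengths and assumes one byte per character, no record delimiters, and decimal gigabytes; all three are assumptions.

```python
cuts = 136
avg_records_per_cut = 810_000
input_record_len = 250           # characters per input record
extract_records = 120_000_000    # approximate count from the text
extract_record_len = 33          # characters per extract record

input_records = cuts * avg_records_per_cut
input_bytes = input_records * input_record_len
extract_bytes = extract_records * extract_record_len

print(f"input records: {input_records:,}")
print(f"input file:    {input_bytes / 1e9:.1f} GB")
print(f"extract file:  {extract_bytes / 1e9:.2f} GB")
```

Under these assumptions the full input file is on the order of 27.5 gigabytes and the uncompressed extract is about 4 gigabytes.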

From the SUN, we cannot directly access records on the UNISYS. We can, however, copy uncompressed ASCII files between the UNISYS and the VAX, and between the VAX and the SUN. The ASCII extract moved to the SUN needs to be compressed. The compressed files then need to be uncompressed, read, edited, coded to state/county, and the migration tabulation created. We also need to create a final coded output file, and an auxiliary coded file (for pass back to the UNISYS), and several other small extract files and data tabulations. This needs to be completed within 1 week of the receipt of the last cut from the UNISYS system.

To do the computer processing, we allocated some resources on the UNISYS, the VAX and the SUN workstation. There is sufficient storage space on the UNISYS for the basic extract file for several "batches" of cuts. We had sufficient temporary storage space on the VAX for a copy of the basic extract file for a few cuts. For this project, we allocated dedicated computer time at night and one disk drive on the SUN server. The disk can hold about 2 gigabytes of data. We also had access to about 10 platter sides in the optical disk jukebox. Each platter side can hold up to 220 megabytes of data. We also had access to one 8mm tape drive; a single tape can hold up to 2 gigabytes.

Because of issues of security and processing complexity, we decided to process only on the server and not to attempt any manual distributive processing on the SPARC stations. I mentioned the potential for the manual distributive processing to point out there is much more processing power and more file storage space for this project than we actually decided to use.

Before embarking on the prototype development and full scale testing, we needed to test alternative processes. To do this, we ran a 1-percent file through numerous programs and looked at CPU usage and file sizes. The 1-percent file contained about 1.2 million records. All tests described below were done on the SUN workstation server under the UNIX operating system. The test programs were written in SAS.

C. Initial Testing -- Coding Approach

The first step was to define a coding method. There are two viable approaches in SAS, given that the file is in record ID sort and not in ZIP/sector code sort.

The first approach is to collapse the ZIP/sector file to ranges of ZIP/sectors that code to the same state/county, to save on file size without loss of information. This file is used to create a SAS format. Essentially, a SAS format is an executable subprogram that provides a recoded value for a supplied value. The SAS format is used to code the input records. In this case, the supplied value is the ZIP/sector needing to be coded, and the recoded value is the state/county code.
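The format approach can be illustrated with a minimal sketch. The original programs were written in SAS; the Python below is not the author's code, and the function names, the 7-character ZIP/sector keys and the county codes are hypothetical. It collapses a sorted coding file into ranges that code to the same state/county and then looks up a supplied value by binary search, which is roughly what a range-based format does.

```python
from bisect import bisect_right

def collapse_to_ranges(pairs):
    """Collapse (zip_sector, county) pairs into (low, high, county) ranges.

    Consecutive keys that code to the same county are merged into one range.
    Keys are assumed to be fixed-width strings so that string order is key order.
    """
    ranges = []
    for key, county in sorted(pairs):
        if ranges and ranges[-1][2] == county:
            low = ranges[-1][0]
            ranges[-1] = (low, key, county)   # extend the current range
        else:
            ranges.append((key, key, county))  # start a new range
    return ranges

def make_coder(ranges):
    """Return a lookup function: supplied ZIP/sector -> county code or None."""
    lows = [low for low, _, _ in ranges]

    def code(zip_sector):
        i = bisect_right(lows, zip_sector) - 1
        if i >= 0 and ranges[i][0] <= zip_sector <= ranges[i][1]:
            return ranges[i][2]
        return None  # not in any range: the record stays uncoded

    return code

# Hypothetical coding file entries (illustrative values only).
coding_file = [
    ("2090100", "24031"),
    ("2090101", "24031"),
    ("2090102", "24033"),
    ("2090103", "24031"),
]
ranges = collapse_to_ranges(coding_file)
coder = make_coder(ranges)
```

Because the lookup table is small and held in memory, the input records can be coded in any order, which is why this approach does not depend on the file being sorted by ZIP/sector.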

The second approach is to create a direct access SAS file (accessed by ZIP/sector). This approach would be efficient if the input file is in sort by ZIP/sector. It becomes less efficient at lesser degrees of implicit sort. That is, the process codes fewer and fewer of the tax returns for each input of a record from the coding file.

The CPU usage times just for editing the data (consistency edits) and coding the data are shown below. Note that the times are for coding all year-1 and all year-2 records. The choice is obvious.

Format approach	10.27 minutes
Direct access	28.10 minutes

However, we do not need to code all year-2 records. That is, we do not need to code those matched records where the ZIP/sector codes are the same for year-1 and year-2. Excluding these from the coding universe brings the editing and coding time for the format approach to 7.40 minutes.

D. Initial Testing -- Data Compression

Obviously, compressed data sets save storage space, but the question of how much, and at what cost in CPU usage needed to be addressed. The trade off in time reading and writing the data sets also needed to be considered (time is needed to uncompress for reading, but time is saved in reading less data). To compress or uncompress the 1-percent file takes about 5 minutes. In SAS, however, it is possible to read a compressed ASCII file and uncompress it, by data block, "on the fly". It takes 1.44 minutes less to read the compressed 1-percent ASCII file and uncompress it "on the fly" than it does to read the uncompressed 1-percent ASCII file (because less data are read). Total savings for reading the compressed 1-percent ASCII file directly is then 6.44 minutes.
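As an illustration of reading a compressed ASCII file and uncompressing it "on the fly", here is a minimal Python sketch. It uses gzip as a stand-in; the paper does not say which compression utility was used, and the function names and demonstration records are hypothetical.

```python
import gzip
import os
import tempfile

def iter_records(path):
    """Yield one record per line from a gzip-compressed ASCII file.

    The file is uncompressed block by block as it is read, so no
    uncompressed copy is ever written to disk.
    """
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            yield line.rstrip("\n")

def write_demo_file(records):
    """Write records to a temporary gzip file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    with gzip.open(path, "wt", encoding="ascii") as handle:
        for record in records:
            handle.write(record + "\n")
    return path

# Two made-up 33-character records, matching the extract record length.
demo_records = ["A" * 33, "B" * 33]
demo_path = write_demo_file(demo_records)
read_back = list(iter_records(demo_path))
os.remove(demo_path)
```

The saving described in the text comes from two places: the separate uncompress step is skipped, and fewer bytes are read from disk.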

The other question is whether the output data set should be in ASCII or SAS, compressed or uncompressed. The following shows the file sizes, in megabytes, for the file formats:

	Compressed:	Uncompressed:
Input ASCII	13.3	38.8
Output ASCII	16.9	52.6
Output SAS	62.9	65.6

We decided that creating an ASCII output data set and then compressing it was the preferred option, even though it takes about 5 minutes more time. Note that we were unable to devise a process that creates the output data set and compresses it "on the fly".

The auxiliary coded output file (for pass back to the UNISYS) needs to be uncompressed ASCII. But, it needs to be held in compressed format until it can be uncompressed and copied to the UNISYS. All other extract files and data tabulations are sufficiently small that compression does not save much space. These files were created as uncompressed SAS data sets for convenience in access and use.

E. Initial Testing -- Sorting

To sort the 1-percent file SAS data set takes 2.14 minutes. However, it uses more memory than we were willing to allocate. Also, nowhere in the process do we have a SAS data set for the input or output coded file. Any sorts considered were only for the small extract files or summary data tabulations.

F. Initial Testing -- Summary of Requirements

The following shows the processing time and storage usages for the 1-percent file:

Total, all processes	28.90 minutes
Read compressed ASCII input, uncompress, edit, code to state/county, output coded file	11.92 minutes
Output auxiliary coded data file	1.74 minutes
Compress coded output file	5.00 minutes
Compress auxiliary output file	2.50 minutes

Create migration tally	5.39 minutes
Other tallies and files	2.35 minutes

Input compressed ASCII file	13.3 megabytes
Output compressed ASCII file	16.9 megabytes
Auxiliary coded ASCII file	7.9 megabytes
Other files and tallies	.3 megabytes
Total, all files	38.4 megabytes

For processing the full file, this translates to 3.840 gigabytes, and 48 hours CPU time. For processing a batch of 15 average sized cuts, this translates to 576 megabytes and 7.4 hours CPU processing. If we process in batches of about 15 cuts, we need adequate storage space for a minimum of 2 batches, preferably 3.
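The full-file figures follow from scaling the 1-percent results by 100. The short Python sketch below reproduces that arithmetic. The batch storage figure also matches 15 times the 1-percent storage total, which suggests (this is an inference, not something the paper states) that an average cut was treated as roughly the size of the 1-percent file.

```python
# Totals for the 1-percent file, taken from the summary of requirements.
ONE_PERCENT_MINUTES = 28.90
ONE_PERCENT_MEGABYTES = 38.4
SCALE = 100          # 1-percent file -> full file
CUTS_PER_BATCH = 15

full_file_hours = round(ONE_PERCENT_MINUTES * SCALE / 60, 1)
full_file_gigabytes = round(ONE_PERCENT_MEGABYTES * SCALE / 1000, 2)

# Assumption: one average cut is about the size of the 1-percent file.
batch_megabytes = round(ONE_PERCENT_MEGABYTES * CUTS_PER_BATCH, 1)
```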

G. Processing Outline for the Full File

We decided to process the full file on a cut-by-cut basis in daily batches of about 15 cuts as they are moved from the UNISYS to the SUN. Thus, we wanted to run the 15 cuts sequentially at night, without manual intervention. That is, we did not want a person to have to physically attend to (baby sit) the processing of each cut. Second, there is a fair amount of bookkeeping work involved in the checking of the processing results. Thus we designed a process control file (PCF) to serve 3 purposes: (1) to control the processing of the individual cuts; (2) to provide for computerized checking of the processing; and (3) to serve as final documentation of the processing. We also wrote the processing in SAS MACROs.

For example, when we had 15 new cuts moved to the SUN, we updated the PCF to show the cuts as "available for processing". The master SAS MACRO would iteratively access the PCF and start up the processing of the next available cut. When each cut was successfully run, the PCF for the cut would be updated via SAS MACROs, the process status would be set to "done", and the next available cut would be started. After all available cuts were run, the master SAS MACRO would stop execution. Manual intervention was then required to review the results and to begin the file back-up process.
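The control loop can be sketched as follows. This is a minimal Python illustration rather than the SAS MACROs actually used; the record layout, the function names and the "failed" status are assumptions made for the example (the paper mentions only "available for processing" and "done").

```python
def run_available_cuts(pcf, process_cut):
    """Process every cut marked 'available', one at a time, updating the PCF.

    pcf is a list of dicts with 'cut' and 'status' keys (hypothetical layout).
    process_cut is a callable that returns True when a cut runs successfully.
    """
    processed = []
    while True:
        # Find the next cut that is waiting to be processed.
        entry = next((e for e in pcf if e["status"] == "available"), None)
        if entry is None:
            break  # nothing left: stop and wait for manual review
        ok = process_cut(entry["cut"])
        entry["status"] = "done" if ok else "failed"
        processed.append(entry["cut"])
    return processed

# Hypothetical PCF contents for a short demonstration.
pcf = [
    {"cut": 1, "status": "done"},
    {"cut": 2, "status": "available"},
    {"cut": 3, "status": "available"},
    {"cut": 4, "status": "not received"},
]
processed = run_available_cuts(pcf, lambda cut: True)
```

Keeping the status in the control file, rather than in the running program, is what allows the same file to serve as both the scheduler and the final documentation of the run.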

To further complicate the process, we are handling different activities for different batches of cuts in the same day. For example, on a typical day, we could be deleting files for the 15 cuts backed-up on the previous day, backing-up files for the 15 cuts processed on the SUN the previous night, moving to the SUN 15 new cuts processed on the UNISYS the previous night, and processing 15 new cuts on the UNISYS.

The following shows the essence of the flow of the processing. It may not be easy to follow without a flowchart, but three points should be obvious: (1) the processing is not terribly simple; (2) it involves numerous data files for each of the 136 cuts; and (3) there is a substantial amount of CPU processing, data input and data output.

Create a batch of the extract files on the UNISYS
Copy the newly available cuts to the SUN
Compress the files
Update the PCF for the newly available cuts
Start up the master MACRO to process all newly available cuts. The processing will, for each cut:
1. Read the data, uncompress it on the fly, perform consistency edits, and code the records
2. Output the final coded data file
3. Compress the final coded data file
4. Output the auxiliary coded data file
5. Compress the auxiliary coded data file
6. Output a file of all year-1 or year-2 records that have been altered by the consistency edit, and append a tally of the edit changes to the PCF
7. Output a file of all records not coded in year-1 or year-2, tally by ZIP code, and append totals to the PCF
8. Output a file of all records excluded from the migration data, tally by state/county, and append totals to the PCF
9. Output a file of APO/FPO ZIP codes not coded by ZIP but coded by mailing state, tally by ZIP code, and append totals to the PCF
10. Output migration extract file, and tally the 45 cell migration tabulation by state/county
11. Check control counts against the PCF, and update the PCF
12. Delete all intermediate files
Review the data in the PCF, append file sizes to the PCF, and modify the PCF if any reruns are necessary
Review summary data files, if warranted
Copy final coded files to platters
Copy all data files to 8mm tape
Uncompress auxiliary coded files and move to the UNISYS
Delete original input files, final coded files, and auxiliary coded files.

All the above data files and summary tabulations are maintained for each cut. At the end of the processing for all cuts, selected data extracts and summary tabulations need to be merged into a total for all cuts.

H. Execution and Wall Time

We ran the full file for tax years 1990-1991 and 1991-1992. One night, we had communications problems which slowed the process to a virtual standstill. Other than that, the communications posed no major problem.

After the first two batches, we ceased compressing the auxiliary coded file, as we were able to move it back to the UNISYS in a timely fashion. The CPU time was an average of 25 minutes per cut. The PCF and the SAS MACROs ran flawlessly and made the computer processing and data review aspects of the process smooth and easy. The wall time closely reflected the CPU time.

In fact, we could have handled up to about 20 cuts per night. The limitation is in the availability of extract files from the UNISYS and in the work required to move the files between the UNISYS and the SUN, compress and uncompress files, back up files, etc.

The following shows some of the usages, assuming that the SUN is dedicated, that the network is up and unencumbered, that the jukebox has platters mounted and has unencumbered communications to the server.

It takes about 1/2 hour to set up the process control file for a batch of 15 cuts.
It takes about 25 minutes to process an average sized cut through all the processes described in steps 1 through 12 of the master MACRO processing in section G (above). That translates to 6 hours and 15 minutes for a batch of 15 average sized cuts (time is .029 minutes per 1,000 records).
It takes about 1 hour to check-in the batch of 15 processed cuts, review the data items in the PCF, etc.
It takes about 1/2 hour to set up the files for copying to platter (i.e., checking available space on platter, defining what cuts go to what platters, updating the PCF, etc.) for a batch of 15 cuts.
It takes about 8 minutes per average cut to copy the file to platter, or about 2 hours for a batch of 15 cuts.
Time is also spent moving files from UNISYS to the SUN and from the SUN to the UNISYS; however, we did not keep records on transmission times.
Other time is used in file deletions, file maintenance, checking on space availability, coordinating work, loading platters, etc.

SECTION III -- SUMMARY OF DATA DIFFERENCES BETWEEN CCRS MIGRATION, MAILING STATE MIGRATION AND PROBABILITY MIGRATION

This section briefly discusses the definitional differences between the CCRS migration, the mailing state migration and the probability migration data sources. The mailing state migration is a migration data set produced at the state level only, where the state code is obtained directly from the 2-character state abbreviation of the mailing address. The probability migration is a migration data product where the state/county/MCD/place geographic code is assigned on a probabilistic basis from information in the mailing address; specifically, the 2-character state abbreviation, the 5-digit ZIP code, the post office name, and a recode for the type of the address (city style, rural route, post office box, and other). See the Population Estimates and Projections Working Paper No. 2 for more details on the probability coding process. 2/

A. Year 2 only 1040NR

A 1040NR is an individual income tax return for non-resident aliens earning federally taxable income, and for aliens temporarily residing in the U.S. (non-immigrants such as college students on temporary visas) earning federally taxable income. If a year-2 only return was a 1040NR, then we counted it as an "in-migrant from foreign". This definition is the same for the mailing state migration, the probability state and county migration, and the CCRS state and county migration.

B. Uncoded returns

Uncoded returns are excluded from the migration data in all 3 sources, except for matched returns that are uncoded in only one year. These are included as year-1 or year-2 onlys. The handling is consistent for the three products. However, the number and geographic distribution of the uncoded returns is not the same for the three products. There were 58 uncoded records for tax year 1992 for the mailing state migration. There were about 58,000 uncoded records for the CCRS migration data and about 52,000 uncoded records for the probability migration data. Recall that the CCRS assigns a code at the county level or no code at all. That is, there are no "state onlys". The probability coding assigns codes at the subcounty primitive level or is uncoded. That is, there are no "state onlys" or "county onlys". Thus:

The number of returns in the mailing state migration will be higher than in the CCRS state migration or the probability state migration.
The number of returns for states and counties will be different between the CCRS migration and the probability migration, even though the overall number of uncoded returns is about the same. The differences will be more pronounced at the county level. For example, the numbers will be lower for the probability migration in Michigan, especially in Oakland county, MI.

More detailed analysis of the uncodeds is the subject of section IV.

C. County Coverage

There should be a record for each county and county equivalent in the U.S. This includes Kalawao Co., Hawaii (15-005), and Yellowstone National Park, Montana (30-113). There is no record for Denali Borough, Alaska (02-068). Also, there are county equivalent records for the trust territories and for various categories of foreign, such as APO/FPO.

There are 3 counties in the CCRS migration data files which have no returns coded to them. They are: Haines Borough, Alaska (02-100), Kalawao Co., Hawaii (15-005) and Yellowstone National Park, Montana (30-113). There is only one ZIP code for Haines, which serves Haines and selected areas in Skagway-Yakutat-Angoon (02-231). Most of the mail deliveries are in Haines. However, the CCRS file codes all returns in the ZIP code to Skagway-Yakutat-Angoon. For future work, it would be more appropriate to change the coding file so the Haines ZIP code codes to Haines. Residents of Kalawao may be getting their mail through the Kualapuu (96757) post office, which codes 100% to Maui under CCRS. Residents of Yellowstone National Park in Montana may be getting their mail from the Yellowstone National Park post office in Wyoming (82190), which codes 100% to Wyoming under CCRS.

D. APO/FPOs

The CCRS codes APO/FPOs and foreigns to several state and county equivalents. Aggregated, these should closely match the foreigns from the mailing state or the probability migration data, with one exception. Some APO/FPO ZIPs are coded to the area rather than to the category "APO/FPO". For example, ZIP 96530 was coded to 66 (Guam) instead of 82 (APO/FPO).

E. KEY-3 Movers

The universe of year-2 records to be coded by probability includes all year-2 onlys, but only those matched returns where the KEY-3 (mailing state/ ZIP code/ post office name) is different and the street address is different. After coding the movers, migration is determined by comparing the codes. The use of the mover check mitigates the creation of spurious migration due solely to ZIP code changes or conversion from rural route addresses to city style addresses. (The coding of these areas is, however, not correct).

For the original CCRS processing, mover status was not determined; migration was determined solely by comparing the codes. Thus, if the sectors within a ZIP code were reorganized and the old and new sectors code to different counties, then we can create spurious migration. Also, if the post office converts rural route addresses to city style addresses, then the sector code is also changed. If the state/county code for the rural route sector is different from the state/county code for the newly assigned sector, then spurious migration will result. Both of these situations have occurred in Emporia, VA (examined in detail in section VI).

We can, however, restrict the universe of records eligible to be migrants to exclude those whose key-3 (State/ZIP/P.O. name) is the same or whose address is the same. This eliminates the spurious migration noted above, but it also eliminates legitimate county-to-county migration that occurs within the same ZIP code. We reran the tax year 1991-1992 migration data to incorporate this universe change. This migration data is referred to as the revised CCRS migration. Thus, we have two alternate definitions of migrants.
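The two definitions of migrants for a matched return can be sketched as below. This is a minimal Python illustration under stated assumptions: the function name, argument names and the county codes in the example are hypothetical, and the logic is a reading of the rules described above rather than the production code.

```python
def classify_matched_return(year1_code, year2_code, same_key3, same_address,
                            revised=True):
    """Classify a matched return as 'migrant' or 'non-migrant'.

    Original CCRS: migration is determined solely by comparing the codes.
    Revised CCRS: a return is eligible to be a migrant only if both the
    key-3 and the street address differ between the two years.
    """
    if year1_code == year2_code:
        return "non-migrant"
    if revised and (same_key3 or same_address):
        return "non-migrant"  # code changed, but the filer did not move
    return "migrant"

# Hypothetical case: a sector was recoded to another county, same address.
original = classify_matched_return("00001", "00003", same_key3=True,
                                   same_address=True, revised=False)
revised = classify_matched_return("00001", "00003", same_key3=True,
                                  same_address=True, revised=True)
# A real move between counties is a migrant under either definition.
moved = classify_matched_return("00001", "00003", same_key3=False,
                                same_address=False, revised=True)
```

The trade-off noted in the text shows up directly: a filer who moves across a county line while keeping the same key-3 would be classed as a non-migrant under the revised rule.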

F. Miscoding

In spite of the editing of the ZIP/sector-to-county coding file, there are still errors in the assigned state/county codes. This occurred in the case of Catron Co, NM vs Grant Co, NM and in the case of Blanco Co, TX vs Burnet Co, TX (examined in detail in section VII). Such coding errors will show up as a difference in the number of returns and exemptions for the county when compared to the numbers for the probability data, and when compared to the 1990 population in the county. The miscoding will generally not be detectable by looking at migration rates, even though the migration data for the counties affected by miscoding will both be biased toward the average of the two counties. The amount of the bias will depend on the relative number of cases involved and on the migration differential between the miscoded cases and the rest of the county. Section VII discusses this further.

G. Mailing vs Residence

The mailing state migration data and the CCRS migration data are based on coding information from the mailing address only. No attempt is made to adjust for differences between mailing and residence addresses. The probability migration implicitly includes an adjustment for mailing vs residence addresses, although that relationship dates from 1980.

H. Probability coding

The coding by probability also has its faults. First, it is based on a ZIP-to-county relationship that existed in 1988. As ZIP codes are reorganized, split, deleted, added, etc., the relationship may no longer be valid.

Second, it is also based on address types that existed in 1980. The probability coding guide contains separate geographic distributions for each address type -- city style, post office boxes, rural routes and other. Since 1980, many of the rural route addresses have been converted to city style addresses. These may be coded using an incorrect geographic distribution. For example, suppose that in 1980 all city style addresses in ZIP 11111 coded to the City of Smallville in Kent County and all rural route addresses coded to Clark County. Also suppose that since 1980, all rural route addresses in ZIP 11111 were converted to city style addresses. Then all the addresses in ZIP 11111 that are actually in Clark County will be coded to Kent County. This problem occurs in rural areas where the nearest post office is in another county and in numerous independent cities in Virginia.

Third, the use of the post office name secondary coding guide will introduce some errors.

And, of course, the nature of probability coding introduces errors. Suppose that ZIP 22222 is half in Olsen County and half in Lane County. When the records in ZIP 22222 are coded by probability, then of all the records in the ZIP:

25 percent will be records that are actually in Olsen County but are coded to Lane County,
25 percent will be records that are actually in Lane County but are coded to Olsen County, and
50 percent will be coded correctly.

This means that the migration rates for the affected counties will be biased toward the average of the two.
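Reading the percentages above as shares of all records in the ZIP, they can be reproduced with a short calculation. The Python sketch below assumes that each record is assigned to a county at random, in proportion to the ZIP's geographic distribution and independently of where the record is actually located; the function name is hypothetical.

```python
def expected_outcomes(share_in_first_county):
    """Expected shares of all records in a two-county ZIP coded by probability.

    A record is truly in the first county with probability p and is coded
    to the first county with the same probability p, independently.
    """
    p = share_in_first_county
    q = 1 - p
    return {
        "first_coded_to_second": p * q,
        "second_coded_to_first": q * p,
        "coded_correctly": p * p + q * q,
    }

# ZIP 22222: half in Olsen County, half in Lane County.
even_split = expected_outcomes(0.5)
```

Under the same assumption, a more lopsided split produces fewer errors: with a 90/10 split the expected share coded correctly is 0.81 + 0.01 = 0.82.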

Thus, where we compare the CCRS migration data to the probability migration data, we should not presume that the probability data are correct, even if they have a stable time series.

I. Military Bases

There are a few military bases that are in more than one county. The CCRS coding file assigns each of these to only one county. The affected military bases are:

Lowry AFB -- Arapahoe and Denver Counties, CO
Ft. Benning -- Chattahoochee and Muscogee Counties, GA
Ft. Riley -- Geary and Riley Counties, KS
Ft. Campbell -- Christian Co., KY and Montgomery Co., TN
Ft. Knox -- Hardin and Meade Counties, KY
Ft. Bliss/McGregor Range -- Otero Co., NM and El Paso Co., TX
Ft. Hood -- Bell and Coryell Counties, TX
Marine Corps Combat Development Center, Quantico -- Prince William and Stafford Counties, VA

There is a possibility that other analogous situations may exist, such as National Parks, American Indian Reservations, etc.; but I have not investigated these.

SECTION IV -- UNCODEDS

Uncoded returns are excluded from the migration data in all 3 sources, except for matched returns that are uncoded in only one year. These are included as year-1 or year-2 onlys. The handling is consistent for the three products. However, the number and geographic distribution of the uncoded returns is not the same for the three products. This section: (1) notes differences in the coding processes that will impact on the uncoded rates; (2) looks at the number and size of ZIP codes that are uncoded; (3) looks at the number and percent uncoded by state; and finally (4) looks at the number of uncodeds by county.

A. Coding Processes

After the classification of "foreign", the mailing state data are tallied solely on the basis of the mailing state code. For a return to be uncoded in the mailing state data, the mailing state code must be missing and the ZIP code must be non-zero and in the range for the U.S. (The returns with no mailing state code and no ZIP code are coded to "other foreign"). Obviously, there are very few uncodeds in the mailing state data. There were 58 uncoded returns in the tax year 1991-1992 data.

The coding of foreign in the ZIP/sector-to-county cross reference process (CCRS) and in the probability follows a similar process. Once the foreign universe is defined, the CCRS coding process codes to state and county based on the ZIP/sector code. (Note that the universe for coding is all returns, so the data presented in this section are for all returns, including the zero exemption returns). The CCRS assigns a code at the county level or no code at all. That is, there are no "state onlys". A return is uncoded in the CCRS if the ZIP code on the return is not in the CCRS coding file, or if the ZIP code is zero and there is a mailing state code for the U.S. There were 57,662 uncoded returns for tax year 1991-1992.

The CCRS coding file that we used was circa April, 1990. When we processed the tax year 1990-1991 and the 1991-1992 1-percent test files, we tallied the number of returns in each uncoded ZIP code. We reviewed any uncoded ZIP code with more than 5 returns and were able to assign a state/county to that ZIP code. These were added into the CCRS coding file before processing the full file. Also, when we processed each cut of the full file, we tallied the number of returns in each uncoded ZIP code. We reviewed the tally, looking for any uncoded ZIP code with over 500 returns. This process was built-in so we could modify the CCRS coding file "on-the-fly", if necessary. There were no uncoded ZIP codes with more than 500 returns in any cut. However, as we shall see in a moment, the sum of all cuts did show some uncoded ZIP codes with more than 500 returns. When we did these uncoded ZIP code tallies, we did not tally ZIP by the mailing state. To get the data summaries by state, we built a ZIP-to-state coding file, applied that to the ZIP tallies and then summarized by state.
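The per-cut tally and review can be sketched as follows. This is a minimal Python illustration with hypothetical ZIP codes, counts and function names; it also shows how a ZIP code can stay under the 500-return threshold in every cut and still exceed it once the cuts are summed.

```python
from collections import Counter

REVIEW_THRESHOLD = 500  # review any uncoded ZIP code with more returns than this

def flag_large_uncoded(tally, threshold=REVIEW_THRESHOLD):
    """Return the uncoded ZIP codes whose return count exceeds the threshold."""
    return sorted(z for z, count in tally.items() if count > threshold)

# Hypothetical tallies of uncoded returns by ZIP code for two cuts.
cut_tallies = [
    Counter({"11111": 300, "22222": 4}),
    Counter({"11111": 250, "33333": 1}),
]

# No single cut has an uncoded ZIP code over the threshold...
per_cut_flags = [flag_large_uncoded(tally) for tally in cut_tallies]

# ...but the sum over all cuts does.
all_cuts = sum(cut_tallies, Counter())
overall_flags = flag_large_uncoded(all_cuts)
```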

Once the foreign universe is defined, the probability coding process assigns codes to state, county, MCD and place based on the primary coding guide (mailing state/ZIP code/Post office name/address type), the ZIP code secondary coding guide or the post office name secondary coding guide. In general, if a new ZIP code is added for an existing post office, the return will not code based on the ZIP code, but it will code based on the post office name. If a new ZIP code is added together with a new post office name, then the return will be uncoded. Note that the universe for coding includes only the year-2 onlys and those matched returns where the Key-3 and the address are different between year-1 and year-2. There were 51,595 uncoded returns for the tax year 1991-1992 probability data. We tallied the number of probability uncoded returns by ZIP code and state. The same ZIP-to-state coding file was applied to the probability uncoded returns to get the summary tallies by state.

B. Size of the Uncoded ZIP Codes

The following table shows the number of uncoded ZIP codes by the number of returns in that ZIP code. There are more uncoded ZIP codes with at least one return in the CCRS (6,146) than in the probability (4,488); most of the difference is for uncoded ZIP codes with only one return. The ZIP codes in the 1s category are probably keying errors. In the probability coding system, the post office name secondary coding guide does code some of the returns with erroneous ZIP codes.

Text Table C -- Number of Uncoded ZIP Codes by Number of Returns in the ZIP Code

Number of Returns in Uncoded ZIP	CCRS		Probability
Number of Returns in Uncoded ZIP	Number of ZIPS	Cumulative Distrib	Number of ZIPS	Cumulative Distrib
1	4,250	69.2	2,158	48.1
2	858	83.1	517	59.6
3	330	88.5	316	66.5
4	155	91.0	224	71.5
5-9	201	94.3	529	83.3
10-24	114	96.2	404	92.3
25-99	105	97.8	325	98.8
100-499	107	99.6	39	99.7
500-999	22	100.0	4	99.8
1000 or more	3	100.0	8	100.0
Total	6,146	100.0	4,488	100.0

Most of the uncoded ZIP codes contain few returns in either coding system. For the CCRS, 97.8 percent of the uncoded ZIP codes contain fewer than 100 returns. For probability, the rate is 98.8 percent. At the other end of the distribution are a few uncoded ZIP codes containing 1,000 or more returns. In the CCRS, one uncoded ZIP code has 1,079 returns, another has 1,167 returns and the third has 1,584 returns. The uncoded ZIP with 1,584 returns is for ZIP=00000 (missing). In the probability coding process, the 8 uncoded ZIP codes contain the following numbers of returns: 1,000, 1,131, 1,263, 1,425, 1,992, 2,024, 2,273, and 4,775.

C. Geographic Distribution by State

The number of uncoded returns is (relatively) very small in the CCRS and the probability. Tables 1 and 2 show selected data by state (where the state is coded from the ZIP-to-state coding file). Included are:

the total number of returns;
the number of returns in uncoded ZIP codes;
the percent uncoded;
the number of uncoded ZIP codes;
the average number of returns per uncoded ZIP code;
the number of uncoded ZIPs with at least 50 returns; and
the number of uncoded ZIP codes with at least 500 returns.

Table 1 shows the data for the CCRS and table 2 shows the data for probability.

All states have a very small number of returns in uncoded ZIP codes in both the CCRS and the probability. In the CCRS, Florida, Texas, California and New York have the largest number of uncodeds (9,237, 7,872, 6,626 and 3,845 respectively). However, the percent of uncoded returns is still less than 0.25 percent for these states. The largest percentages uncoded are in Idaho (0.30) and Maine (0.32), which are still very small. In the probability, Michigan (15,042), California (8,024) and Florida (1,063) have the largest number of returns in uncoded ZIP codes. For Michigan, the percent of uncoded returns is 0.38 percent, which is also very small. Most other states have similar uncoded rates in the CCRS and in the probability.

Perhaps the best way to view the relationships is through Plots 3a, 3b and 4. Plot 3a shows the number of uncoded returns in the CCRS by the number of uncoded returns in the probability for states. The names of selected states are also shown on the plot. Plot 3b is a plot of the percent of uncoded returns in the CCRS by the percent of uncoded returns in the probability. Plot 4 is a plot of the number of uncoded ZIP codes in the CCRS vs probability. Recall that the plot points are represented by an alphabetic character, where an "A" means that there is one case (state) at the plot point, a "B" means that there are two cases (states) at the plot point, etc.

D. Geographic Distribution by County

The percent of uncodeds is quite small at the state level; however, we do not expect the same to be the case at the county level for a few counties. For example, we expect most of the probability uncodeds in Michigan to be in Oakland County. This would not be cause for concern if: (1) the geographic distribution is the same for the CCRS and the probability; (2) the geographic distribution is proportional to the state size; and (3) the geographic distribution did not change over time.

To investigate this, we are manually coding all ZIP codes with at least 25 returns using an atlas or calling the local post office. Results are not yet available.

SECTION V -- DIFFERENCES BETWEEN THE MIGRATION DATA SETS AT THE STATE LEVEL

The differences in the number of returns and exemptions in each state are a function of the number and patterns of uncoded returns, and differences between the probability coding guide, the ZIP/sector-to-county cross reference (CCRS) coding materials and the mailing state code. Differences in the numbers and state distributions of the returns in uncoded ZIP codes were discussed in Section IV. One could add these into the number of coded returns by state to get a comparison of the differences in the numbers between mailing state, probability and CCRS. However, we decided to focus on the overall differences between: (1) the original and the revised CCRS migration data; and (2) between the revised CCRS and the probability migration data. Recall that the original CCRS migration is based solely on the state/county codes for year-1 vs year-2 address information; whereas the revised CCRS migration is restricted to the key-3 movers (discussed in detail in section III-E).

A. Original CCRS vs Revised CCRS

We compared the tax year 1991-1992 migration data for the original and the revised CCRS at the state level. The difference is, of course, the number of returns where the CCRS assigned state code is different, but either the key-3 (state/ZIP code/P.O. name) or the street address is the same. The differences for tax year 1991-1992 are shown in Text Table D. States with fewer than 3 returns different are not shown.

Text Table D -- Differences in State Level Migration Between the Original CCRS and the Revised CCRS

	Outs-to-Nons		Ins-to-Nons
State	Rtns	Exmp	Rtns	Exmp
AL	6	15	18	45
AR	10	18	34	90
CA	-	-	3	5
CO	2	7	3	9
CT	8	16	6	14
DE	18	40	3	6
DC	9	23	8	22
FL	25	67	53	129
GA	73	180	61	160
ID	-	-	3	15
IL	8	18	6	12
IN	31	84	60	148
IA	39	91	37	103
KS	29	64	23	56
KY	28	70	24	54
LA	-	-	4	8
MD	12	29	33	77
MI	4	7	4	10
MN	63	159	56	140
MO	11	32	21	48
MT	-	-	5	18
NE	33	80	22	48
NV	25	72	47	101
NC	19	55	13	24
ND	34	64	54	20
OH	66	159	38	97
OK	6	15	-	-
RI	7	15	8	16
SD	79	198	65	146
TN	32	81	4	9
TX	32	84	6	12
UT	47	101	23	66
VA	13	24	21	60
WV	26	63	22	59
WI	10	22	11	23
WY	7	26	14	42
TOTAL	817	1,998	817	1,998

Columns 1 and 2 show the number of returns and exemptions that were originally counted as out-migrants and became non-migrants in the revision. Columns 3 and 4 show the number of returns and exemptions that were originally counted as in-migrants and became non-migrants.

Overall, the differences are very small. The biggest differences are in North Dakota, South Dakota, Florida, Georgia, Ohio, Indiana and Minnesota. On a relative basis (not shown in the tables), Delaware is the leader. This is not terribly surprising, given the ZIP codes that are split by state and an effort by the post office to eliminate the state splits. However, it was intriguing, so we investigated further by reviewing the 817 cases where a state migrant became a state non-migrant to try to determine the cause. Text Table E shows the results of this review. Note that we did not have actual address information for the review, so the causes of the differences are deduced from the types of changes seen and the relative number of cases in the classifications.

Text Table E -- Review of Cases Where a State Migrant in Original CCRS Became a Non-Migrant in Revised CCRS

	Number	Percent
Total Cases.....................	817

Different ZIP...................	112	100.0

Errors that were fixed......	35	31.3
ZIP changes.................	69	61.6
New errors..................	8	7.1

Same ZIP, Different Sector......	705	100.0

'--' to P.O. Box or City....	11	1.6
P.O. Box or City to '--'....	0	0.0
'--' to Rural Route.........	44	6.1
Rural Route to '--'.........	0	0.0
98, 99 to Rural Route.......	184	26.1
Rural Route to 98, 99.......	64	9.1
98, 99 to P.O. Box or City..	50	7.1
P.O. Box, City to 98, 99....	28	4.0
Rural Route to City.........	27	3.8
City to Rural Route.........	2	0.3
Rural Route to P.O. Box.....	39	5.5
P.O. Box to Rural Route.....	21	3.0
Rural Route to Rural Route..	185	26.2
City to City................	25	3.5
Other, Unknown..............	25	3.5

B. ZIP Code Differs Between the Two Years

If the ZIP code is different between the two years, then we know that the street address is the same. The ZIP code can be different for several reasons. First, there can be errors in the 1991 ZIP code that have been corrected for 1992. Second, a correct ZIP code in 1991 can be wrong in 1992 (keying errors for those filers not using the mailing label). A third possibility is that the area of responsibility of the local post office has changed.

There are 112 cases (out of 817) where the ZIP code differs between the two tax years. Our review of these cases is limited because we did not have the street address to work with, or the resources to determine the correct ZIP code, even if we were able to retrieve the street address. What follows represents our best guess (educated, but a guess nonetheless) as to what occurred. Data summaries from the review are shown in Text Table E.

Of the 112 ZIP code differences, there are 35 cases (31%) where the 1991 ZIP code is in error and 8 cases (7%) where the 1992 ZIP code is in error. There are 69 cases (62%) where the responsibility for the delivery area has been changed. This could be a change occurring between 1991 and 1992, or a change occurring in a prior year. We looked at the locations of the post offices in a street atlas, and they are very close to the state line.

C. ZIP Code Is the Same Between the Two Years

The majority of cases, 705 (or 86%), are where the ZIP code is the same, but the sector code is different, and the two sectors code to different states. Unfortunately, we cannot tell whether the street address is the same or not. That is, these cases can be:

real movers within the ZIP, but across the state line;
non-movers where the street address is different (for a variety of reasons); or
non-movers where the street address is the same.

However, we can make some inferences from the information that we do have. Recall that the post office assigns post office boxes in sectors 00, 01, 02, etc.; rural routes in sectors 97, 96, 95, etc.; with city style addresses assigned to sectors in the middle. Sectors 98 and 99 are reserved for the postmaster's use. Frequently, rural route addresses in areas that no longer support rural routes (all rural routes have been converted to city style) are lumped into sector 98. Yes, some people refuse to convert and the post office still delivers their mail. Sector 98 is also used as a catchall for rural route addresses not codable to a legitimate rural route sector. In some cases, the sector code is missing (shown as '--' in the above table).
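The sector conventions described above can be sketched as a small classifier. This is only an illustration: the boundaries between post office box, city style and rural route sectors vary from ZIP code to ZIP code, so `po_box_max` and `rural_route_min` are hypothetical cut-offs introduced here, not values taken from the CCRS coding materials.

```python
def sector_type(sector, po_box_max=9, rural_route_min=90):
    """Classify a two-character sector code by address type.

    po_box_max and rural_route_min are hypothetical cut-offs for one
    ZIP code; the real boundaries differ from ZIP to ZIP.
    """
    if sector is None or not sector.isdigit():
        return "missing"            # shown as '--' in Text Table E
    n = int(sector)
    if n in (98, 99):
        return "postmaster"         # reserved; 98 is often a rural route catchall
    if n <= po_box_max:
        return "po_box"             # assigned upward from 00
    if rural_route_min <= n <= 97:
        return "rural_route"        # assigned downward from 97
    return "city"                   # city style addresses in the middle


def sector_change(year1_sector, year2_sector):
    """Label a year-1 to year-2 sector change for tallying."""
    return f"{sector_type(year1_sector)} to {sector_type(year2_sector)}"
```

With these assumed cut-offs, `sector_change("98", "96")` yields `"postmaster to rural_route"`, which corresponds to the "98, 99 to Rural Route" row of Text Table E.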

The most likely conclusion is that most of the differences in the state code involve a post office in one state delivering mail to rural route addresses in another state. The largest category ('RR to RR') has 185 cases (26%) that changed from one legitimate rural route sector to another legitimate rural route sector code. Most of these cases appear to result from sector reorganization, but some can result from errors in the 1991 or 1992 sector code or, of course, from real moves. The next largest category, with 184 cases (26%), is where sector 98 or 99 was changed to a sector for a rural route. Also, there were 44 cases changing from sector unknown ('--') to a rural route sector. Opposite of that are cases where a rural route sector was changed to 98 or 99. There were 64 (9%) such cases.

Some of the cases are errors in the year-1 ZIP/sector code on the tax return that were fixed for year-2. Some are new errors introduced in year-2.

The types of changes and the relative sizes of the changes point to other causes. For example, the table shows some evidence of conversion of rural route type addresses to city style addresses. There are 27 cases changing from a rural route sector to a city style sector, while there are 2 cases changing from a city style sector to a rural route sector. Also, there are 50 cases changing from sector 98 or 99 to a sector for city style addresses or for a post office box. There are 28 cases going in the other direction.

(Note that the migration tabulation program used the year-1 assigned state/county code for the non-movers. However, that choice needs consideration. The year-2 ZIP/sector code is probably more accurate, but if the coding materials are based on year-1 (and do not include changes between year-1 and year-2), then it may be more appropriate to use the year-1 assigned state/county code. If the coding materials are based on year-2, then the year-2 assigned state/county codes are preferable.)

D. Comparison of Migration Data From The Four Data Sources

There are four data sources for state level migration data:

Mailing state migration
Coded (probability) state level migration
Original CCRS state level migration data
Revised CCRS state level migration data

We ran a comparison of the sources currently available for tax years 1991-1992, showing the difference and the percent difference for the net migration rate. Table 5a shows the net migration rates. Table 5b shows differences in the net migration rate for the following 6 comparisons:

Probability (coded) minus mailing state
Original CCRS minus mailing state
Revised CCRS minus mailing state
Original CCRS minus probability
Revised CCRS minus probability
Original CCRS minus Revised CCRS

Note that the differences shown in table 5b are based on the exact migration rate and are then rounded. That is, the differences shown actually represent intervals. For example, a difference of -0.0 represents values from -0.049999999999 to -0.000000000001; a difference of 0.0 represents values from 0.000000000000 to 0.049999999999; etc.
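The interval behavior described above can be reproduced with a short sketch. It assumes the difference is taken on the unrounded rates and only then formatted to one decimal place; the function name and the sample rates are illustrative, not part of the tabulation program.

```python
def rounded_difference(rate_a, rate_b):
    """Difference of two exact rates, rounded to one decimal for display."""
    return f"{rate_a - rate_b:.1f}"


# Illustrative rates only: the exact differences are about -0.02 and +0.02.
small_negative = rounded_difference(1.23, 1.25)
small_positive = rounded_difference(1.25, 1.23)
print(small_negative, small_positive)
```

This prints `-0.0 0.0`: the sign of the exact difference survives the rounding, which is why a displayed -0.0 and a displayed 0.0 stand for different intervals.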

E. Original CCRS vs Revised CCRS

The net migration rates for the original CCRS and the revised CCRS are (as expected) very close for all states. The largest differences are for North Dakota (0.010) and South Dakota (-0.011).

F. Probability vs Mailing State

The net migration rates for probability closely parallel the net migration rates for mailing state for all states except the District of Columbia. Other than the District of Columbia, the largest difference is 0.041 for New Mexico. The difference for the District of Columbia is -0.233. These higher levels of out-migration from the District are picked up as in-migrants to Maryland and Virginia. This is due to the difference between mailing address and residence as reflected in the probability distributions.

Note that the net migration rates or differences in the net migration rates are not directly comparable across states, because of the different sizes in the base used for the rates. To do such comparisons requires the number of exemptions by components of migration (in-migrants, out-migrants and non-migrants), which is too voluminous for this report.

G. Revised CCRS vs Mailing State

The revised CCRS net migration is close to, but slightly higher than, the mailing state migration for all states. The only reasonable explanation is that there is a slight difference in the way the "year-2 only 1040NR" returns are tallied as "in-migrants from foreign" between the two data products, with more counted in the CCRS than in the mailing state. (See section III-A for a description of the year-2 only 1040NR returns.) The four states with the largest differences are Hawaii (0.157), Maryland (0.112), Massachusetts (0.190) and New York (0.150). These states have the largest concentrations (relative to population) of the "year-2 only 1040NR" returns.

H. Revised CCRS vs Probability

The differences in the net migration rates between the revised CCRS and the probability follow the same pattern as described in G (above), except for the District of Columbia (with a difference of 0.250). The probability net migration for the District of Columbia appears to be suspect.

SECTION VI -- COMPARISONS OF MIGRATION DATA BETWEEN THE PROBABILITY MIGRATION, THE ORIGINAL CCRS MIGRATION DATA AND THE REVISED CCRS MIGRATION DATA

For the original CCRS processing, mover status was not determined; migration was determined solely by comparing the codes. Thus, if the sectors within a ZIP code were reorganized and the old/new sectors code to a different county, then we can create spurious migration. Also, if the post office converts rural route addresses to city style addresses, then the sector code is also changed. If the state/county code for the rural route sector is different from the state/county code for the newly assigned sector, then spurious migration will result.

Under the revised ZIP/sector-to-county cross reference (CCRS) processing, we restricted the universe of records eligible to be migrants to exclude those whose key-3 (State/ZIP/P.O. name) is the same or the address is the same. This eliminates the spurious migration noted above, but it also eliminates legitimate county-to-county migration that occurs within the same ZIP code.
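The difference between the two definitions can be sketched as follows. The record layout, field names and sample values are hypothetical and introduced only for illustration; the actual mover-status rules are those described in section III-E.

```python
from dataclasses import dataclass


@dataclass
class MatchedReturn:
    """One return matched across two tax years (hypothetical layout)."""
    county_y1: str      # assigned state/county code, year-1
    county_y2: str      # assigned state/county code, year-2
    key3_y1: tuple      # (state, ZIP code, P.O. name), year-1
    key3_y2: tuple      # (state, ZIP code, P.O. name), year-2
    street_y1: str      # street address, year-1
    street_y2: str      # street address, year-2


def is_migrant_original(r):
    """Original CCRS: migration is determined solely by comparing the codes."""
    return r.county_y1 != r.county_y2


def is_migrant_revised(r):
    """Revised CCRS: a return with the same key-3 or the same address is a non-mover."""
    if r.key3_y1 == r.key3_y2 or r.street_y1 == r.street_y2:
        return False
    return r.county_y1 != r.county_y2


# A rural route address converted to a city style address: the sector,
# and therefore the assigned county, changes while the key-3 does not.
example = MatchedReturn(
    county_y1="Greensville Co., VA", county_y2="Emporia, VA",
    key3_y1=("VA", "23847", "EMPORIA"), key3_y2=("VA", "23847", "EMPORIA"),
    street_y1="RR 1 BOX 20", street_y2="120 EXAMPLE RD",
)
```

Here `is_migrant_original(example)` is true (spurious migration) while `is_migrant_revised(example)` is false. The same rule would also reclassify a real move between counties within one ZIP code as a non-move, which is the trade-off noted above.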

This section looks in detail at the specific case of the Emporia post office in Virginia. It also compares the migration data from the original CCRS to the revised CCRS for all counties, and compares the revised CCRS to the probability migration data for all counties.

A. The Case of Emporia, VA

Emporia is served by one ZIP code (23847), but that ZIP code also serves a portion of Greensville Co., a small piece of Brunswick Co., and a small piece of Sussex Co. Sectors for post office boxes code to Emporia under CCRS. Most sectors for city style addresses code to Emporia, but a few of these sectors code to Greensville Co. Most sectors for rural route addresses code to Greensville Co., but a few sectors for rural route addresses code to Brunswick and Sussex counties. No rural route sectors code to Emporia.

Map of Area around Emporia, VA

Text Table F -- Emporia, VA (1990 population=5,306)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
PROBABILITY:
TY 1979-1980	1,526	3,343	63.0	-0.59
TY 1980-1981	1,465	3,191	60.0	-2.83
TY 1981-1982	1,488	3,293	62.1	0.07
TY 1982-1983	1,609	3,499	66.0	5.09
TY 1983-1984	1,654	3,590	67.7	1.53
TY 1984-1985	1,716	3,677	69.3	-1.12
TY 1985-1986	1,756	3,717	70.0	-0.30
TY 1986-1987	1,838	3,621	68.3	0.73
TY 1987-1988	1,682	3,662	69.0	0.68
TY 1988-1989	1,850	3,674	69.3	-0.48
TY 1989-1990	1,823	3,794	71.5	2.72
TY 1990-1991	1,935	4,071	76.7	3.48
TY 1991-1992	2,031	4,354	82.1	2.10
ORIGINAL CCRS:
TY 1990-1991	2,669	5,679	107.0	5.88
TY 1991-1992	3,652	8,098	152.6	41.23
REVISED CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	2,783	6,059	114.2	2.02

Text Table G -- Greensville Co., VA (1990 population=8,853)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
PROBABILITY:
TY 1979-1980	4,011	9,706	110.3	-0.08
TY 1980-1981	4,002	9,678	109.3	-0.39
TY 1981-1982	3,873	9,465	106.9	-1.76
TY 1982-1983	3,803	9,268	104.7	-1.90
TY 1983-1984	3,869	9,259	104.6	-1.06
TY 1984-1985	3,900	9,124	103.1	-1.10
TY 1985-1986	3,923	8,977	101.4	-0.49
TY 1986-1987	3,948	8,717	98.5	-1.46
TY 1987-1988	3,987	8,741	98.7	-1.33
TY 1988-1989	4,146	8,845	99.9	-0.82
TY 1989-1990	3,864	8,511	96.1	-0.26
TY 1990-1991	3,715	8,273	93.5	-0.70
TY 1991-1992	3,621	8,142	92.0	-0.69
ORIGINAL CCRS:
TY 1990-1991	1,870	4,223	47.7	-2.99
TY 1991-1992	1,273	2,878	32.5	-32.07
REVISED CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	1,788	4,072	46.0	-1.99

The probability migration data, the original CCRS migration data and the revised CCRS migration data are shown in Text Tables F and G. The column called "coverage estimate" is the number of total exemptions divided by the 1990 population, times 100. It is not really a coverage estimate for any year other than tax year 1989. Data that are not yet available have "--" in the table.
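As a worked check of the "coverage estimate" definition, the sketch below recomputes the tax year 1991-1992 values for Emporia from the exemption counts and 1990 population in Text Table F. The function name is illustrative.

```python
def coverage_estimate(total_exemptions, population_1990):
    """Total exemptions per 100 persons in the 1990 population."""
    return round(total_exemptions / population_1990 * 100, 1)


# Emporia, VA, tax year 1991-1992 (values from Text Table F)
EMPORIA_POP_1990 = 5306
probability = coverage_estimate(4354, EMPORIA_POP_1990)
original_ccrs = coverage_estimate(8098, EMPORIA_POP_1990)
revised_ccrs = coverage_estimate(6059, EMPORIA_POP_1990)
print(probability, original_ccrs, revised_ccrs)
```

The results, 82.1, 152.6 and 114.2, match the coverage column of the table.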

There are two obvious problems with the original CCRS migration data for the independent city of Emporia, Virginia and Greensville Co., Virginia.

First, Emporia has a high net in-migration rate for tax year 1990-1991 and a very high net in-migration rate for tax year 1991-1992 in the original CCRS data. The net migration rate dropped to a reasonable level under the revised CCRS definition for tax year 1991-1992. Corresponding to this is a high net out-migration rate for Greensville Co. for tax year 1990-1991 and a very high net out-migration rate for tax year 1991-1992. The net out-migration rate drops to a reasonable level under the revised CCRS definition for tax year 1991-1992.

A second problem is the very high over coverage for Emporia and the very high under coverage for Greensville Co. in the original CCRS data product. Although the bad coverage rates are mitigated to a large extent under the revised CCRS definition, they are still not very good.

Although we are using the probability migration history as a comparator, it is no prize either. There is under coverage for Emporia and over coverage for Greensville Co. The probability net migration rates for Emporia and Greensville Co. are not very stable over time.

To investigate the situation, we looked at all cases for Emporia where the key-3 or the street address was the same between year-1 and year-2, but the assigned state/county code differed. There are 86 cases where an out-migrant became a non-mover (for various reasons), and 951 cases where an in-migrant became a non-mover. Unfortunately, we did not have the street addresses from the tax returns or a coded benchmark from which we could make accurate determinations. From the patterns of changes and the relative number of returns with those changes, we are generally able to deduce the causes. The following describes the main patterns of changes and the number of returns in each.

Out of the 951 cases that were originally in-migrants to Emporia and were non-movers under the revised definition:

388 had a year-1 sector for rural routes and had year-2 sector for city style addresses;
321 had a year-1 sector of 98 or 99 and had year-2 sector for post office boxes;
21 had a year-1 sector for rural routes and had year-2 sector for post office boxes;
195 had a year-1 sector for city style addresses and had a different year-2 sector for city style addresses;
26 had a year-1 sector for city style addresses and had year-2 sector for post office boxes.

From the above information, we concluded that the Emporia post office made two types of changes between the two tax years that caused our coding problems -- not only did they convert rural route addresses to city style addresses, but they also reorganized some of the city style sectors. These changes, in conjunction with the way the ZIP/sector codes were originally laid out, caused the huge spurious in-migration to Emporia under the original definition. The huge spurious net migration also affected the coverage rates.

B. Comparison of the Original CCRS and Revised CCRS Net Migration Data For All Counties

Table 6a shows the number of counties by the original and the revised CCRS net migration rates for the under age 65 universe. Plot 6b is a plot of the same data. The vertical zero axis and the diagonal are also plotted. There are 2,320 counties not clearly represented on the plot because they are in the 'or more' part of category 'Z'.

From the plot, it looks like the two measures track fairly well along the diagonal. That is, there is not much difference between the original CCRS and the revised CCRS for most counties. However, there are several counties with very high net migration rates (the cells for Emporia and Greensville are noted on the plot). For these other counties, there is clear movement from high net migration under the original CCRS to reasonable levels under the revised CCRS (closer to the vertical zero axis).

In terms of numbers, 2,657 of the 3,140 counties (about 84.6 percent) had a difference (+ or -) in the net migration rates of 0.9 or less. There were 296 counties (9.4 percent) with a difference of 1.0 to 1.9. There were 144 counties (4.6 percent) with a difference of 2.0 to 4.9. There were 43 counties (1.4 percent) with a difference of 5.0 or more.

Even though the change is small, there are a large number of counties with some change. Recall that there are 9,800 ZIP codes split by county, out of 41,746. I do not believe that sector reorganizations or conversions of rural route addresses can account for all these changes. Rather, I think they are accounted for by returns having a different sector code in year-1 and year-2 without any reorganization or actual moving. That is, some are errors in the year-1 sector code, and some are errors in year-2 sector code. There may also be some actual moving, where the before and the after address are within the same ZIP code.

Table 7a shows the number of counties by state and the difference in the original and the revised CCRS net migration rates. Plot 7b is a plot of the same data. There is considerable variation by state in the distribution of counties by difference in the net migration rates. The states with the largest number of the counties with a difference (+ or -) greater than 2.0 are: Georgia (with 26), Kentucky (with 15), Tennessee (with 14) and Virginia (with 20).

The basic conclusion is that the revised CCRS definition is preferable, even though the use of the revision also eliminates some true movers as well as the spurious migration. Note that this effect is also inherent in the probability migration data.

C. The Revised CCRS Migration Data Compared to the Probability Migration Data For All Counties

Table 8a shows the number of counties by the revised CCRS and the probability net migration rates for the under 65 universe. Table 8b is a plot of the same data.

From the plot, it appears that most counties fall on or near the diagonal. However, there are several counties that have moderate to large differences in the net migration rates. There are 2,925 counties (of the 3,139, or 93.2 percent) that have a difference (+ or -) of 0.9 or less. There are 144 counties (4.6 percent) where the difference is between 1.0 and 1.9. There are 51 counties (1.6 percent) where the difference is between 2.0 and 4.9, and 19 counties (0.6 percent) where the difference is 5.0 or larger. In some of the outlier cases, the CCRS will be better; in others, the probability may be better. Unfortunately, we have no benchmark to determine which is better.

We ran a least squares regression of the net migration rates for the revised CCRS vs the probability. If the rates for the two data sets were identical, we would find an intercept of 0.000, a slope of 1.000 and an R-square of 1.000. The estimated parameters are as follows (with the standard error of the estimate in parentheses):

intercept = 0.185 (.024)
slope = 0.779 (.011)
R square = .587
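For readers who want to repeat this kind of comparison, the following is a minimal sketch of a simple least squares fit. It assumes the revised CCRS rate is the dependent variable and the probability rate is the independent variable, which the text does not state explicitly. The sample rates are made up for illustration and are not county data.

```python
def least_squares(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_square).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_square


# Made-up net migration rates for five hypothetical counties.
probability_rates = [-2.0, -0.5, 0.0, 1.5, 3.0]
revised_ccrs_rates = [-1.4, -0.1, 0.3, 1.2, 2.6]
intercept, slope, r_square = least_squares(probability_rates, revised_ccrs_rates)
```

Fitting a series against itself returns an intercept of 0, a slope of 1 and an R-square of 1, the benchmark case of identical data sets described above.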

Table 9a shows the number of counties by state and the difference between the original CCRS and the probability net migration rates for the under 65 universe. Table 9b is a plot of the same data. Georgia has 6 counties with a difference (+ or -) greater than 1.9; Texas and Virginia each have 10 counties where the difference is greater than 1.9.

SECTION VII -- COMPARISONS OF TOTAL RETURNS AND EXEMPTIONS BETWEEN THE PROBABILITY MIGRATION, THE ORIGINAL CCRS MIGRATION DATA AND THE REVISED CCRS MIGRATION DATA

This section looks at the difference in the number of total exemptions between the original ZIP/sector-to-county cross reference (CCRS) data products and the revised CCRS data. It discusses coverage issues for the specific cases of Catron and Grant counties, New Mexico, and the cases of Blanco and Burnet counties, Texas. Differences in coverage between the revised CCRS and the probability for all counties are also shown.

The text and the tables refer to a "coverage estimate". This is defined as the total number of exemptions divided by the 1990 population times 100. Thus, it can only be considered a coverage measure for the tax year 1989 data. However, for lack of a better word, the text refers to this as coverage for tax years other than 1989.

A. Coverage of the Original CCRS Data Compared to the Revised CCRS Coverage For All Counties

Table 10a shows a tally of the number of counties by the original CCRS and the revised CCRS coverage rates for tax year 1991-1992. Plot 10b is a plot of the same data. Table 11a shows a tally of the number of counties by state and the difference in the coverage rates between the original CCRS and the revised CCRS data for tax year 1991-1992. Plot 11b is a plot of the same data.

For those counties that had migration data problems such as the case of Emporia, Virginia (discussed in detail in section VI-A), the revised definition will also dampen problems in coverage. This is not a truly startling observation, and there are not that many counties with such problems. We leave it to the reader to peruse these tables and plots.

B. The Cases of Catron Co., NM vs Grant Co., NM and Blanco Co., TX vs Burnet Co., TX

In spite of the editing of the ZIP/sector-to-county coding file, there may still be errors in the assigned state/county codes. Such coding errors will show up as a difference in the number of returns and exemptions for the county when compared to the numbers for the probability data, and compared to the 1990 population in the county. The miscoding will generally not be detectable by looking at migration rates, even though the migration data for the counties affected by miscoding will both be biased toward the average of the two counties. The amount of the bias will depend on the relative number of cases involved and on the migration differential between the miscoded cases and the rest of the county.

Text Table H -- Catron Co, NM (1990 population=2,563)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
Probability:
TY 1979-1980	886	2,286	89.2	2.86
TY 1980-1981	883	2,291	89.4	-3.42
TY 1981-1982	913	2,377	92.8	1.85
TY 1982-1983	922	2,423	94.5	1.30
TY 1983-1984	904	2,298	89.6	-3.08
TY 1984-1985	942	2,289	89.3	-3.39
TY 1985-1986	989	2,396	93.5	6.82
TY 1986-1987	988	2,399	93.6	0.41
TY 1987-1988	1,021	2,422	94.5	0.10
TY 1988-1989	1,009	2,320	90.5	0.15
TY 1989-1990	967	2,232	87.1	-1.80
TY 1990-1991	962	2,140	83.5	-1.51
TY 1991-1992	991	2,205	86.0	2.10
Original CCRS:
TY 1990-1991	4,692	10,996	429.0	2.27
TY 1991-1992	4,765	11,111	433.5	0.92
Revised CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	4,758	11,088	432.6	0.67
CCRS less ZIP 88061:
TY 1990-1991	657	1,759	68.6	--
TY 1991-1992	619	1,778	69.4	--

Text Table I -- Grant Co, NM (1990 population=27,676)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
Probability:
TY 1979-1980	9,464	25,214	91.4	1.65
TY 1980-1981	9,834	26,068	94.2	2.04
TY 1981-1982	9,207	24,658	89.1	-3.64
TY 1982-1983	8,873	23,786	86.0	-2.15
TY 1983-1984	8,997	23,878	86.3	-1.00
TY 1984-1985	9,155	23,760	85.9	-1.24
TY 1985-1986	9,278	23,843	86.2	-0.13
TY 1986-1987	9,517	24,581	88.8	0.78
TY 1987-1988	9,831	24,959	90.2	-0.52
TY 1988-1989	10,071	24,346	88.0	0.10
TY 1989-1990	9,552	23,668	85.5	0.13
TY 1990-1991	9,557	24,299	87.8	1.13
TY 1991-1992	9,826	23,657	85.5	-1.84
Original CCRS:
TY 1990-1991	7,497	18,487	66.8	-0.29
TY 1991-1992	7,308	17,789	64.3	-2.72
Revised CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	7,317	17,818	64.4	-2.55
CCRS plus ZIP 88061:
TY 1990-1991	10,183	23,472	84.8	--
TY 1991-1992	9,992	23,565	85.2	--

Text Tables H and I contain the probability migration history data and the original and revised CCRS migration data for Catron Co., NM and Grant Co., NM. The two counties are, of course, contiguous.

The "coverage" estimate for the tax year 1991-1992 probability data appears reasonable, at 86.0 for Catron Co. and 85.5 for Grant Co. There is obvious over coding to Catron Co. (coverage is 433.5) and under coding to Grant Co. (coverage is 64.3) in the tax year 1991-1992 CCRS data. We examined all ZIP codes coding to Catron and Grant counties and found that ZIP code 88061 (Silver City in Grant Co.) was erroneously coded to Catron Co. We did not have time to create 1991-1992 ZIP-to-ZIP tallies for the number of returns and exemptions in ZIP 88061. However, we did run a tally from the 1-percent test file. From that, we calculated the percent of returns and exemptions coded to Catron Co. from ZIP code 88061. We applied that percentage to the CCRS data to obtain a modified estimate of the number of returns and exemptions if ZIP 88061 were correctly coded. The coverage rate for the modified data for Catron Co. is 69.4.

Similarly, the coverage estimates for Grant Co. are: 85.5 for probability; 64.3 for the original CCRS; and 85.2 for the modified CCRS. The coverage estimates for tax year 1990-1991 follow a similar pattern. This is a good example of how one miscoded ZIP code can dramatically affect the coverage for the two counties.

Text Tables J and K show similar data for Blanco and Burnet Counties, Texas. We followed a similar review and estimation process for these counties. The probability coverage estimate for tax year 1991-1992 is 93.9 for Blanco Co. and 94.5 for Burnet Co. The coverage estimate for the original CCRS in tax year 1991-1992 is 234.6 for Blanco Co. and 52.2 for Burnet Co. In this case, ZIP code 78654 (Marble Falls) is miscoded. Using the same estimating procedure, the modified coverage estimate for the CCRS data is 111.4 for Blanco Co. and 83.0 for Burnet Co. That is an improvement, but it is still out of line. We were not able to determine why, but 3 possibilities spring to mind: (1) inadequacy of the 1-percent file for this work; (2) a ZIP code coding 100% to Blanco Co. may actually be split, where part should actually be coded to another county (but I do not think that is the case); or (3) there is a significant difference between the mailing and the residence address.

C. Coverage of the Revised CCRS Data Compared to the Probability Coverage For All Counties

Table 12a shows the number of counties by the revised CCRS and the probability coverage rates for tax year 1991-1992. Table 12b is a plot of the same data. The horizontal and vertical zero axes as well as the diagonal are plotted. There are 2,678 counties not clearly represented on the plot because they are in the 'or more' part of category 'Z'.

Text Table J -- Blanco Co, TX (1990 population=5,972)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
Probability:
TY 1979-1980	1,758	4,055	67.9	0.71
TY 1980-1981	1,813	4,147	69.4	3.60
TY 1981-1982	1,898	4,242	71.0	3.92
TY 1982-1983	1,982	4,509	75.5	6.48
TY 1983-1984	2,058	4,586	76.8	4.36
TY 1984-1985	2,275	4,921	82.4	5.94
TY 1985-1986	2,223	4,952	82.9	2.87
TY 1986-1987	2,292	5,029	84.2	0.74
TY 1987-1988	2,445	5,378	90.1	2.94
TY 1988-1989	2,452	5,291	88.6	-1.92
TY 1989-1990	2,327	5,221	87.4	0.71
TY 1990-1991	2,418	5,468	91.6	4.49
TY 1991-1992	2,480	5,607	93.9	2.99
Original CCRS:
TY 1990-1991	6,217	13,831	231.6	5.17
TY 1991-1992	6,311	14,010	234.6	3.00
Revised CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	6,311	14,010	234.6	3.00
CCRS less ZIP 78654:
TY 1990-1991	3,482	7,745	129.6	--
TY 1991-1992	3,247	6,651	111.4	--

Text Table K -- Burnet Co, TX (1990 population=22,677)

	Total Returns	Total Exemptions	Coverage Estimate	Net Migration Rate
Probability:
TY 1979-1980	6,685	15,448	68.3	5.17
TY 1980-1981	7,021	15,939	70.2	5.62
TY 1981-1982	7,379	16,784	74.0	8.60
TY 1982-1983	7,779	17,806	78.5	7.84
TY 1983-1984	8,344	18,873	83.2	6.64
TY 1984-1985	8,962	20,186	89.0	5.19
TY 1985-1986	9,032	20,450	90.2	1.13
TY 1986-1987	9,072	20,091	88.6	-2.24
TY 1987-1988	9,196	20,220	89.2	-0.98
TY 1988-1989	9,472	20,444	90.2	0.41
TY 1989-1990	9,061	20,275	89.4	1.77
TY 1990-1991	9,348	20,775	91.6	3.11
TY 1991-1992	9,565	21,429	94.5	4.92
Original CCRS:
TY 1990-1991	4,931	11,160	49.2	1.61
TY 1991-1992	5,165	11,829	52.2	7.33
Revised CCRS:
TY 1990-1991	--	--	--	--
TY 1991-1992	5,072	11,614	51.2	5.20
CCRS plus ZIP 78654:
TY 1990-1991	7,671	17,246	76.1	--
TY 1991-1992	8,162	18,820	83.0	--

There are 1,910 counties (out of 3,139, or 60.9 percent) on the diagonal. There are 2,838 counties (90.4 percent) on the diagonal or within one cell of the diagonal. There are a few counties that are well off of the diagonal. Also, there are a number of counties (221 or 7.0 percent) where the coverage rate is 100 percent or more in either the revised CCRS or the probability. The breakdown is as follows:

Text Table L -- Revised CCRS vs Probability Coverage Rates

		Probability
Revised CCRS	Total	Coverage < 100	Coverage > 100
Total.........	3,139	2,995	144
Coverage < 100	2,971	2,918	53
Coverage > 100	168	77	91
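The 221 counties cited in the text can be recovered from the four interior cells of Text Table L. The sketch below is only an arithmetic check; the labels "under" and "over" are shorthand introduced here for the two coverage classes.

```python
# Cell counts from Text Table L:
# (revised CCRS class, probability class) -> number of counties
cells = {
    ("under", "under"): 2918,
    ("under", "over"): 53,
    ("over", "under"): 77,
    ("over", "over"): 91,
}

total = sum(cells.values())
over_in_either = sum(n for (ccrs, prob), n in cells.items()
                     if ccrs == "over" or prob == "over")
over_ccrs = sum(n for (ccrs, _), n in cells.items() if ccrs == "over")
over_prob = sum(n for (_, prob), n in cells.items() if prob == "over")
share_either = round(100 * over_in_either / total, 1)
print(total, over_in_either, over_ccrs, over_prob, share_either)
```

This gives 3,139 counties in total, 221 (7.0 percent) over 100 in either data set, 168 over 100 in the revised CCRS and 144 over 100 in the probability, which agrees with the table margins.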

In some cases, the CCRS coverage looks better; in some cases, the probability looks better. In others, both look like they may have coding problems. The other end of the distribution shows counties with low coverage rates. Given that the coverage rates are dependent on filing requirements as well as coding problems, it is difficult to determine how many of those with low coverage rates have a coding problem. Further, there may be some counties with reasonable coverage rates that have coding problems.

We ran a least squares regression of the coverage rates for the revised CCRS vs the probability. If the rates for the two data sets were identical, we would find an intercept of 0.000, a slope of 1.000 and an R-square of 1.000. The estimated parameters are as follows (with the standard error of the estimate in parentheses):

intercept = 16.548 (1.182)
slope = 0.813 (.0211)
R square = .328
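
A regression of this kind could be computed along the following lines. This is a minimal sketch, not the program used for the paper: it assumes the revised CCRS rate is the dependent variable (y) and the probability rate is the independent variable (x), which the text does not state, and the function name `simple_ols` and the sample rates are hypothetical.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns the estimates, their standard errors, and R-square.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual and total sums of squares.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    resid_var = sse / (n - 2)  # two parameters are estimated
    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": math.sqrt(resid_var * (1.0 / n + x_mean ** 2 / sxx)),
        "se_slope": math.sqrt(resid_var / sxx),
        "r_square": 1.0 - sse / sst,
    }


# Hypothetical rates, not county data: identical inputs give the ideal fit
# (intercept 0, slope 1, R-square 1) described in the text.
rates = [82.0, 88.5, 91.0, 95.5, 101.0]
ideal = simple_ols(rates, rates)
```

Against that ideal, an intercept well above zero together with a slope below one and a low R-square indicates that the two sets of rates differ considerably county by county.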

Table 13a shows the number of counties by state and difference in the coverage rates between the revised CCRS and the probability. There are 221 counties (7.0 percent) where the difference (+ or -) is 10.0 or more. Virginia has 47 of these counties, Georgia has 17 and Texas has 16. Again, we have no benchmark to assess which of the coding processes is better, but we can draw a few basic conclusions.

The CCRS coding process looks like it does well for most counties. However, there are a few counties where there are errors in the CCRS coding file. If each county with a bad coverage rate is affected by only one ZIP code, then there are not that many poorly coded ZIP codes. If these ZIP codes are not split by county, then we can review and correct the CCRS coding file (just as we did for Catron, Grant, Blanco and Burnet counties). That process should be a manageable task.

Further, we can incorporate one more edit. We can match the probability coding guide to the CCRS coding file and review ZIP codes where there is a significant difference. Indeed, we have begun the process, and expect to rerun the tax year 1990-1991 and 1991-1992 migration data.
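
The edit described above might be sketched as follows. This is an illustration under stated assumptions, not the Bureau's procedure: the record layouts, the function name `flag_zip_differences`, and the `min_share` threshold (used here to stand for "ZIP code not split by county") are all hypothetical, as are the ZIP and county codes in the example.

```python
def flag_zip_differences(ccrs_file, prob_guide, min_share=0.95):
    """List CCRS records whose county disagrees with the dominant county
    for the same ZIP code in the probability coding guide.

    ccrs_file:  {(zip_code, sector): county_code}         (assumed layout)
    prob_guide: {zip_code: {county_code: share_of_zip}}   (assumed layout)
    min_share:  only flag ZIP codes where the dominant county holds at
                least this share, i.e. the ZIP is essentially not split.
    """
    flagged = []
    for (zip_code, sector), ccrs_county in sorted(ccrs_file.items()):
        shares = prob_guide.get(zip_code)
        if not shares:
            continue  # ZIP absent from the probability guide; nothing to compare
        dominant, share = max(shares.items(), key=lambda item: item[1])
        if ccrs_county != dominant and share >= min_share:
            flagged.append((zip_code, sector, ccrs_county, dominant, share))
    return flagged


# Hypothetical ZIP codes and county codes, for illustration only.
ccrs_file = {
    ("00001", "01"): "A",
    ("00001", "02"): "B",   # disagrees with the guide's only county for 00001
    ("00002", "01"): "D",   # 00002 is split by county, so it is not flagged
}
prob_guide = {
    "00001": {"A": 1.0},
    "00002": {"C": 0.6, "D": 0.4},
}
to_review = flag_zip_differences(ccrs_file, prob_guide)
```

Records in `to_review` are candidates for clerical review only; a disagreement does not by itself show which coding source is in error.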

Section VIII -- CONCLUSIONS AND RECOMMENDATIONS

The CCRS processing ran well on the SUN server, and was more than able to keep up with the flow of input data cuts from the UNISYS processing.
The original CCRS definition includes too much spurious migration to be considered for a final product. The revised definition is preferable, even though it may eliminate a small amount of real migration.
Overall, the migration data produced by the revised CCRS processing are reasonable for most counties, but there are some deficiencies.
The CCRS coding file does contain some errors. We need to further edit the file by comparing the county code to the dominant county for the ZIP in the probability coding guide and resolve differences. We should also repeat the analysis done for Catron and Grant counties for other counties with high degrees of over/under coverage.
We will rerun the migration data for tax years 1990-1991 and 1991-1992 based on the corrected CCRS file.
No matter what we do to the CCRS coding file, it will probably still contain a few errors, and we need some type of contingency plan for correction. Probably the best process to follow would be to redesign the migration matrix to include breaks for: same ZIP/same sector, same ZIP/different sector, and different ZIP. We need to do some design and experimentation on this idea.
On a production schedule, suspect data needs to be quickly identified and a resolution made. Under that schedule, we will not have time to do a thorough review to find, investigate and resolve more than a few individual counties that are suspect. That is, even if the incorporation of ZIP and sector into the migration matrix can be done, there may not be time to use it to revise problem data in a production schedule.
The post office will continue to reorganize the ZIP codes and the sector codes, and to convert rural route addresses to city style addresses. Such changes implemented between year-1 and year-2 will continue to cause problems in the migration data. In addition to the production of the revised CCRS migration data, the computation of the original CCRS migration data would be useful to help spot problem areas.
One needs to have a time series of migration data to help spot bad data. Even then, it is hard to tell what is bad and what is merely unusual (but good) data.
The CCRS coding file is vintage April 1990. We saw the effect of changes in ZIP and sector codes on the data. It is important that updates for ZIP and sector changes be incorporated into the CCRS coding file (while maintaining the corrections made in our reviews). We need to design and implement an annual update process for the CCRS coding file.
The CCRS coding process was done for the ZIP codes on the individual income tax returns. We can use the CCRS coding files and processes for other administrative record files as well, where county codes are missing or of poor quality.
Finally, and perhaps most important, the review and analysis we have done has focused on the bad data, seeking perfection. The real question is whether the CCRS is better than the probability, which we know is imperfect, and whether the 6-month time gain in using the CCRS is worth the effort.
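
The redesigned migration matrix suggested in the recommendations above, with breaks for same ZIP/same sector, same ZIP/different sector, and different ZIP, might be tabulated as in the following sketch. The record layout, the function names, and the sample codes are hypothetical; the design itself is described in the text only as an idea needing further experimentation.

```python
from collections import Counter


def zip_sector_break(prior_geo, current_geo):
    """Classify a matched return by how its (ZIP, sector) pair changed
    between year-1 and year-2."""
    prior_zip, prior_sector = prior_geo
    current_zip, current_sector = current_geo
    if prior_zip != current_zip:
        return "different ZIP"
    if prior_sector != current_sector:
        return "same ZIP/different sector"
    return "same ZIP/same sector"


def tabulate_breaks(matched_returns):
    """Count matched returns by (prior county, current county, break).

    matched_returns: iterable of
        (prior_county, current_county, (prior_zip, prior_sector),
         (current_zip, current_sector))                  (assumed layout)
    """
    counts = Counter()
    for prior_county, current_county, prior_geo, current_geo in matched_returns:
        key = (prior_county, current_county,
               zip_sector_break(prior_geo, current_geo))
        counts[key] += 1
    return counts


# Hypothetical matched returns, for illustration only.
matched = [
    ("A", "A", ("00001", "01"), ("00001", "01")),
    ("A", "B", ("00001", "01"), ("00001", "02")),
    ("A", "B", ("00001", "01"), ("00002", "01")),
]
matrix = tabulate_breaks(matched)
```

Keeping the break as a third dimension of each county-to-county cell would let a reviewer see which kind of ZIP or sector change accounts for a suspect flow without returning to the individual returns.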

1/ Batutis, Michael J. Jr, "Subnational Estimates of Total Population By The Tax Return Methodology", April 1994

van der Vate, Barbara J., "Methods Used in Estimating the Population of Substate Areas in the United States", August 1988

2/ Sater, Douglas K., "Geographic Coding of Administrative Records--Past Experience and Current Research", Population Estimates and Projections Technical Working Paper No. 2, April 1993.

3/ Shepherd, Suzanne B, "Meeting With Gary West, Address Programs Support Manager of the United States Postal Service, Louisville, KY Division", Census Bureau internal memorandum, October 2, 1991