Information Quality

Glossary


-A-

Accuracy of survey results refers to how closely the results from a sample can reproduce the results that would be obtained from a complete count (i.e., census) conducted using the same techniques at the same time. The difference between a sample result and the result from a complete census taken under the same conditions and at the same time is an indication of the precision of the sample result.

Administrative records and administrative record data refer to micro data records contained in files collected and maintained by administrative or program agencies and commercial entities. Government and commercial entities maintain these files for the purpose of administering programs and providing services. Administrative records (e.g., Title 26 data) are distinct from systems of information collected exclusively for statistical purposes, such as data from censuses and surveys that are collected under the authority of Titles 13 or 15 of the United States Code (U.S.C.). For the most part, the Census Bureau draws upon administrative records developed by federal agencies. To a lesser degree, it may use information from state, local, and tribal governments, as well as commercial entities. To obtain these data, the Census Bureau must adhere to a number of regulatory requirements.

The Administrative Records Tracking System (ARTS) is an electronic database on the Census Bureau’s Intranet. It tracks Census Bureau administrative records agreements, agreement commitments, administrative data projects, and relevant external contacts.

Administratively restricted information (as defined in Data Stewardship Policy DS007, Information Security Management Program) consists of agency documentation that is not intended as a public information product and other pre-release or embargoed public information. Examples of administratively restricted information include:

  • “For Official Use Only” (FOUO) information: Internal Census Bureau documentation consisting of program or operational materials (e.g., contracting, financial, budget, security, legal, policy documents) determined by management to be either protected under the Freedom of Information Act and/or of a nature that release could negatively impact the mission of the Census Bureau.
  • Embargoed data or reports that have not been released, but meet Disclosure Review Board requirements for public release.
  • Proprietary contractor information, such as its cost proposal and labor rates.
  • All information not otherwise protected by statutory authority, but that is subject to access and/or use restrictions, as provided in a valid Agreement with the government agency or other entity supplying the information.
  • All personally identifiable information (PII) not protected by an existing legal authority.
  • All business identifiable information (BII) not protected by an existing legal authority.

Allocation involves using statistical procedures, such as within-household or nearest neighbor matrices populated by donors, to impute for missing values.

American National Standards Institute codes (ANSI codes) are a standardized set of numeric or alphabetic codes issued by the American National Standards Institute (ANSI) to ensure uniform identification of geographic entities through all federal government agencies.

The autocorrelation function of a random process describes the correlation between values of the process at different points in time.
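
As an illustration, the sample autocorrelation of an observed series at a given lag can be computed as sketched below. The function name and the choice of the usual sample estimator (deviations from the overall mean, divided by the total sum of squares) are assumptions made for this example; the glossary does not prescribe an estimator.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag (illustrative helper)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Numerator: products of deviations that are `lag` time steps apart.
    numer = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numer / denom
```

At lag 0 the function returns 1 by construction; values near zero at larger lags suggest little linear dependence between points that far apart.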

Automated record linkage is the pairing of data, primarily via computer software.

An autoregressive integrated moving average (ARIMA) model is a generalization of an autoregressive moving average (ARMA) model for nonstationary time series. A nonstationary time series is a time series not in equilibrium about a constant mean level. In a nonstationary time series, the mean or variance of the series may not be the same at all time periods. The model is generally referred to as an ARIMA(p,d,q) model where p, d, and q are integers greater than or equal to zero and refer to the order of the autoregressive, integrated (differencing), and moving average parts of the model respectively.

An autoregressive moving average (ARMA) model is a stationary model of time series data where the current data point and current stochastic error are each modeled as finite linear regressions of previous data points or stochastic errors respectively. The regression for the data points is referred to as an autoregression. The regression for the stochastic errors is referred to as a moving average. Symbolically, the model is denoted as an ARMA (p,q) model where p and q are integers greater than or equal to zero and refer to the order of the autoregressive and moving average parts of the model respectively. A stationary time series is a time series in equilibrium about a constant mean level. These models are fitted to time series data either to better understand the data or to predict future points in the series.
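
In conventional notation, an ARMA(p,q) model can be written as shown below. The symbols are assumptions adopted for this illustration and do not appear in the glossary: X_t is the series value at time t, c is a constant, the φ_i are autoregressive coefficients, the θ_j are moving average coefficients, and ε_t is the stochastic error at time t.

```latex
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
```

The first sum is the autoregression on previous data points and the second sum is the moving average over previous stochastic errors.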

-B-

Behavior coding of respondent/interviewer interactions involves systematic coding of the interaction between interviewers and respondents from live or taped field or telephone interviews to collect quantitative information. When used for questionnaire assessment, the behaviors that are coded focus on behaviors indicative of a problem with the question, the response categories, or the respondent's ability to form an adequate response.

Bias is the difference between the expected value of an estimator and the actual population value.

Blocking is grouping the records of a set into mutually exclusive, exhaustive pieces by using a set of fields (e.g., state, last name, first initial). The term is usually used in the context of record linkage.
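
A minimal sketch of blocking is given below, assuming each record is a dictionary and the blocking fields are dictionary keys. The function name and the field names in the usage note are hypothetical.

```python
from collections import defaultdict


def block_records(records, fields):
    """Group records into blocks keyed by the values of the blocking fields."""
    blocks = defaultdict(list)
    for record in records:
        # Records sharing the same values on every blocking field land in the same block.
        key = tuple(record[field] for field in fields)
        blocks[key].append(record)
    return dict(blocks)
```

Calling `block_records(records, ["state", "last_name", "first_initial"])` places every record in exactly one block, so the blocks are mutually exclusive and exhaustive. In record linkage, candidate pairs would then be compared only within a block.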

Bonferroni correction is a method used to address the problem of multiple comparisons. It is based on the idea that if an experimenter is testing n dependent or independent hypotheses on a set of data, then one way of maintaining the family-wise error rate is to test each individual hypothesis at a statistical significance level of 1/n times what it would be if only one hypothesis were tested.
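
The rule can be sketched as follows, assuming the hypotheses are summarized by p-values and that the family-wise significance level is 0.10. The function name and the default level are choices made for this example.

```python
def bonferroni_reject(p_values, alpha=0.10):
    """Return one reject/fail-to-reject decision per hypothesis.

    Each hypothesis is tested at alpha / n, where n is the number of hypotheses.
    """
    n = len(p_values)
    if n == 0:
        return []
    threshold = alpha / n  # 1/n times the single-test significance level
    return [p <= threshold for p in p_values]
```

With three hypotheses and a family-wise level of 0.10, each p-value is compared against roughly 0.0333 rather than 0.10.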

Bottom-coding is a disclosure limitation technique that involves limiting the minimum value of a variable allowed on the file to prevent disclosure of individuals or other units with extreme values in a distribution.
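
A simple sketch of bottom-coding follows. It assumes that values below the chosen floor are replaced by the floor itself; actual programs may substitute another summary value, and the function name and floor are hypothetical.

```python
def bottom_code(values, floor):
    """Replace every value below `floor` with `floor`; leave other values unchanged."""
    return [max(value, floor) for value in values]
```

For example, `bottom_code([-50000, -10, 1200], -1000)` leaves the two values at or above -1000 unchanged and raises -50000 to -1000.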

A bridge study continues an existing methodology concurrent with a new methodology for the purpose of examining the relationship between the new and old estimates.

Business identifiable information is information defined in the Freedom of Information Act (FOIA) as trade secrets or commercial or financial information, that is obtained from a person representing a business entity, and which is privileged and confidential (e.g., Title 13) and exempt from automatic release under FOIA. Also included is commercial or other information that, although it may not be exempt from release under the FOIA, is exempt from disclosure by law (e.g., Title 13). Also see Personally identifiable information.

-C-

The calibration approach to estimation for finite populations consists of: (a) a computation of weights that incorporate specified auxiliary information and are restrained by calibration equation(s); (b) the use of these weights to compute linearly weighted estimates of totals and other finite population parameters: weight times variable value, summed over a set of observed units; (c) an objective to obtain nearly design unbiased estimates as long as nonresponse and other nonsampling errors are absent.

Cell suppression is a disclosure limitation technique where sensitive cells are generally deleted from a table and flags are inserted to indicate this condition.

A census is a data collection that seeks to obtain data directly from all eligible units in the entire target population. It can be considered a sample with a 100 percent sampling rate. The Economic Census may use administrative records data rather than interviews for some units.

Census Bureau publications are information products that are backed and released by the Census Bureau to the public. “Backed and released by the Census Bureau” means that the Census Bureau’s senior management officials (at least through the Associate Director responsible for the product) have reviewed and approved the product and the Census Bureau affirms its content. Because publications do not contain personal views, these information products do not include a disclaimer.

Clerical record linkage is record matching that is primarily performed manually.

A cluster is a set of units grouped together on the basis of some well-defined criteria. For example, the cluster may be an existing grouping of the population such as a city block, a hospital, or a household; or may be conceptual such as the area covered by a grid imposed on a map.

Coding is the process of categorizing response data using alphanumeric values so that the responses can be more easily analyzed.

Coefficient of variation (CV) is a measure of dispersion calculated by dividing the standard deviation of an estimate by its mean. It is also referred to as the relative standard error.
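
In practice the CV of an estimate is often computed from the estimate and its standard error, as sketched below. Using the estimate itself in place of its mean (expected value) is an assumption of this example, as are the function and parameter names.

```python
def coefficient_of_variation(estimate, standard_error):
    """Standard error divided by the absolute value of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined when the estimate is zero")
    return standard_error / abs(estimate)
```

An estimate of 200 with a standard error of 10 has a CV of 0.05, often reported as 5 percent.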

Cognitive interviews are used as a pretesting technique consisting of one-on-one interviews using a draft questionnaire to find out directly from respondents about their problems with the questionnaire. In a typical cognitive interview, respondents report aloud everything they are thinking as they attempt to answer a survey question.

Computer-assisted personal interviewing (CAPI) is an interviewing technique similar to computer-assisted telephone interviewing, except that the interview takes place in person instead of over the telephone. The interviewer sits in front of a computer terminal and enters the answers into the computer.

Computer-assisted telephone interviewing (CATI) is an interviewing technique, conducted using a telephone, in which the interviewer follows a script provided by a software application. The software is able to customize the flow of the questionnaire based on the answers provided, as well as information already known about the participant.

A confidence interval is a range of values determined in the process of estimating a population parameter. The likelihood that the true value of the parameter falls in that range is chosen in advance and determines the length of the interval. That likelihood is called the confidence level. Confidence intervals are displayed as (lower bound, upper bound) or as estimate ± MOE, where MOE = z-value * standard error of the associated estimate (when the confidence level = 90%, the z-value = 1.645).
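
The MOE formula above can be sketched as follows, using the 90 percent z-value of 1.645 as the default. The function names are chosen for this example, and the normal approximation is assumed to be appropriate for the estimate.

```python
def margin_of_error(standard_error, z_value=1.645):
    """MOE = z-value * standard error (1.645 corresponds to 90% confidence)."""
    return z_value * standard_error


def confidence_interval(estimate, standard_error, z_value=1.645):
    """Return (lower bound, upper bound) as estimate minus and plus the MOE."""
    moe = margin_of_error(standard_error, z_value)
    return (estimate - moe, estimate + moe)
```

An estimate of 100 with a standard error of 2 has an MOE of 3.29 at the 90 percent level, giving an interval of about (96.71, 103.29).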

Confidentiality involves the protection of personally identifiable information and business identifiable information from unauthorized release.

Controlled rounding is a form of random rounding, but it is constrained to have the sum of the published entries in each row and column equal the appropriate published marginal totals.

Controlled tabular adjustment is a perturbative method for statistical disclosure limitation in tabular data. This method perturbs sensitive cell values until they are considered safe and then rebalances the nonsensitive cell values to restore additivity.

A convenience sample is a nonprobability sample, from which inferences cannot be made. Convenience sampling involves selecting the sample from the part of the population that is convenient to reach. Convenience sampling is not allowed for Census Bureau information products.

Covariance is a characteristic that indicates the strength of relationship between two variables. It is the expected value of the product of the deviations of two random variables, x and y, from their respective means.
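
For a finite set of paired values, the definition can be sketched as below. The example treats each pair as equally likely, so the expected value becomes a simple average (division by n, not n - 1); that choice and the function name are assumptions of this illustration.

```python
def covariance(xs, ys):
    """Average product of deviations of paired values from their respective means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
```

A positive result indicates that the two variables tend to be above or below their means together; a negative result indicates that they tend to move in opposite directions.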

Coverage refers to the extent to which elements of the target population are listed on the sampling frame. Overcoverage refers to the extent that elements in the population are on the frame more than once and undercoverage refers to the extent that elements in the population are missing from the frame.

Coverage error, which includes both undercoverage and overcoverage, is the error in an estimate that results from (1) failure to include all units belonging to the target population or failure to include specified units in the conduct of the survey (undercoverage), and (2) inclusion of some units erroneously either because of a defective frame or because of inclusion of unspecified units or inclusion of specified units more than once in the actual survey (overcoverage).

A coverage ratio is the ratio of the population estimate of an area or group to the independent estimate for that area or group. The coverage ratio is sometimes referred to as a coverage rate and may be presented as a percentage.

Cross-sectional studies (also known as cross-sectional analysis) form a class of research methods that involve observation of some subset of a population of items all at the same time. The fundamental difference between cross-sectional and longitudinal studies is that cross-sectional studies take place at a single point in time and that a longitudinal study involves a series of measurements taken on the same units over a period of time. See Longitudinal survey.

Cross-validation is the statistical practice of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.
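
The partitioning step can be sketched as a single random split into an analysis subset and a retained validation subset. The function name, the default holdout fraction, and the fixed seed are assumptions of this example; k-fold schemes repeat the same idea over several partitions.

```python
import random


def holdout_split(data, holdout_fraction=0.5, seed=0):
    """Randomly partition `data` into (analysis subset, validation subset)."""
    rng = random.Random(seed)  # seeded so the partition is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - holdout_fraction)))
    # The first part is used for the initial analysis; the rest is retained for validation.
    return shuffled[:cut], shuffled[cut:]
```

The model would be fitted on the first subset only and then checked against the second.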

Custom tabulations are tables prepared by the Census Bureau at the request of a data user or program sponsor. This terminology does not apply to tables produced by Census Bureau software (e.g., FERRET or American FactFinder).

A cut-off sample is a nonprobability sample that consists of the units in the population that have the largest values of a key variable (frequently the variable of interest from a previous time period). For example, a 90 percent cut-off sample consists of the largest units accounting for at least 90 percent of the population total of the key variable. Sample selection is usually done by sorting the population in decreasing order by size, and including units in the sample until the percent coverage exceeds the established cut-off.
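
The selection procedure described above can be sketched as follows. The function name and the input layout (a mapping from unit identifier to key-variable value) are hypothetical, and the example stops as soon as coverage reaches the cut-off, consistent with "at least 90 percent."

```python
def cutoff_sample(sizes, coverage=0.90):
    """Select the largest units until they account for `coverage` of the total.

    `sizes` maps a unit identifier to its value of the key variable.
    """
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("population total of the key variable must be positive")
    selected = []
    running = 0.0
    # Sort the population in decreasing order by size.
    for unit, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        if running >= coverage * total:
            break
        selected.append(unit)
        running += size
    return selected
```

With key-variable values of 50, 30, 15, and 5, a 90 percent cut-off sample takes the three largest units, which together cover 95 percent of the total.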

-D-

Data capture is the conversion of information provided by a respondent into electronic format suitable for use by subsequent processes.

Data collection involves activities and processes that obtain data about the elements of a population, either directly by contacting respondents to provide the data or indirectly by using administrative records or other data sources. Respondents may be individuals or organizations.

Data collection instrument refers to the device used to collect data, such as a paper questionnaire or computer assisted interviewing system.

A data program is a program that generates information products, often on a regular schedule. These programs include efforts such as the censuses and surveys that collect data from respondents. Data programs also include operations that generate information products from administrative records and operations that combine data from multiple sources, such as various surveys, censuses, and administrative records. Specific examples of multiple source data programs include the Small Area Income and Poverty Estimates (SAIPE) program, the Population Division’s “Estimates and Projections” program, the National Longitudinal Mortality Study, and the Annual Survey of Manufactures (ASM). One-time surveys also are considered data programs.

Data-use agreements for administrative records are signed documents between the Census Bureau and other agencies to acquire restricted state or federal data or data from vendors. These are often called Memoranda of Understanding (MOU).

Derived statistics are calculated from other statistical measures. For example, population figures are statistical measures, but population-per-square-mile is a derived quantity.

The design effect is the ratio of the variance of a statistic, obtained from taking the complex sample design into account, to the variance of the statistic from a simple random sample with the same number of cases. Design effects differ for different subgroups and different statistics; no single design effect is universally applicable to any given survey or analysis.
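
As a small illustration, the ratio can be computed directly once both variances are available. The function and parameter names are assumptions of this example, and obtaining the two variance estimates is outside its scope.

```python
def design_effect(complex_design_variance, srs_variance):
    """Variance under the complex design divided by the variance under simple random sampling."""
    if srs_variance <= 0:
        raise ValueError("simple random sample variance must be positive")
    return complex_design_variance / srs_variance
```

A value above 1 indicates that the complex design yields a larger variance for that statistic than a simple random sample with the same number of cases.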

A direct comparison is a statement that explicitly points out a difference between estimates.

Direct estimates are estimates of the true values of the target populations, based on the sample design and resulting survey data collected on the variable of interest, only from the time period of interest and only from sample units in the domain of interest. Direct estimates may be adjusted using explicit or implicit models (e.g., ratio adjustment, hot or cold deck imputation, and non-response adjustment) to correct for nonresponse and coverage errors.

Disclosure is the release of personally identifiable information or business identifiable information outside the Census Bureau.

Dissemination means Census Bureau-initiated or sponsored distribution of information to the public (e.g., publishing information products on the Census Bureau Internet Web site). Dissemination does not include distribution limited to government employees or agency contractors or grantees; intra-agency or inter-agency use or sharing of government information; and response to requests for agency records under the Freedom of Information Act, the Privacy Act, or other similar law. This definition also does not include distribution limited to correspondence with individuals or persons, press releases, archival records, public filings, subpoenas, or adjudicative processes.

A dress rehearsal is a complete test of the data collection components on a small sample under conditions that mirror the full implementation. See Field test.

-E-

Editing is the process of identifying and examining missing, invalid, and inconsistent entries and changing these entries according to predetermined rules, other data sources, and recontacts with respondents with the intent to produce more accurate, cohesive, and comprehensive data. Some of the editing checks involve logical relationships that follow directly from the concepts and definitions. Others are more empirical in nature or are obtained through the application of statistical tests or procedures.

Equivalent quality data are data obtained from a source other than the respondent that have quality equivalent to data reported by the respondent. Equivalent quality data have three possible sources: 1) data directly substituted from another census or survey (for the same reporting unit, question wording, and time period); 2) data from administrative records; or 3) data obtained from some other equivalent source that has been validated by a study approved by the program manager in collaboration with the appropriate Research and Methodology area (e.g., company annual reports, Securities and Exchange Commission (SEC) filings, and trade association statistics).

An estimate is a numerical quantity for some characteristic or attribute calculated from sample data as an approximation of the true value of the characteristic in the entire population. An estimate can also be developed from models or algorithms that combine data from various sources, including administrative records.

Estimation is the process of using data from a survey or other sources to provide a value for an unknown population parameter (such as a mean, proportion, correlation, or effect size), or to provide a range of values in the form of a confidence interval.

Exploratory studies (also called Feasibility studies) are common methods for specifying and evaluating survey content relative to concepts. In economic surveys, these studies often take the form of company or site visits.

External users – see Users.

-F-

Fax imaging is properly called Paperless Fax Imaging Retrieval System (PFIRS). This collection method mails or faxes a paper instrument to respondents. The respondents fax it back to the Census Bureau, where it is automatically turned into an image file.

Feasibility studies (also called Exploratory studies) are common methods for specifying and evaluating survey content relative to concepts. In economic surveys, these studies often take the form of company or site visits.

Field follow-up is a data collection procedure involving personal visits by enumerators to housing units to perform operations such as resolving inconsistent and/or missing data items on returned questionnaires, conducting a vacant/delete check, obtaining information for blank or missing questionnaires, and visiting housing units for which no questionnaire was checked in.

A field test is a test of some of the procedures on a small scale that mirrors the planned full-scale implementation. See Dress rehearsal.

A focus group is a pretesting technique whereby respondents are interviewed in a group setting to guide the design of a questionnaire based on the respondents’ reactions to the subject matter and the issues raised during the discussion.

A frame consists of one or more lists of the units comprising the universe from which respondents can be selected (e.g., Census Bureau employee telephone directory). The frame may include elements not in the universe (e.g., retired employees). It may also miss elements that are in the universe (e.g., new employees).

The frame population is the set of elements that can be enumerated prior to the selection of a sample.

-G-

Geocoding is the conversion of spatial information into computer-readable form. As such, geocoding, both the process and the concepts involved, determines the type, scale, accuracy, and precision of digital maps.

A geographic entity is a spatial unit of any type, legal or statistical, such as a state, county, place, county subdivision, census tract, or census block.

A geographic entity code (geocode) is a code used to identify a specific geographic entity. For example, the geocodes needed to identify a census block for Census 2000 data are the state code, county code, census tract number, and block number. Every geographic entity recognized by the Census Bureau is assigned one or more geographic codes. "To geocode" means to assign an address, living quarters, establishment, etc., to one or more geographic codes that identify the geographic entity or entities in which it is located.

A generalized variance function is a mathematical model that describes the relationship between a statistic (such as a population total) and its corresponding variance. Generalized variance function models are used to approximate standard errors of a wide variety of characteristics of the target population.

Goodness-of-fit means how well a statistical model fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under a model. Such measures can be used in statistical hypothesis testing (e.g., to test for normality of residuals, to test whether two samples are drawn from identical distributions, or to test whether outcome frequencies follow a specified distribution).

A graphical user interface (GUI) emphasizes the use of pictures for output and a pointing device such as a mouse for input and control whereas a command line interface requires the user to type textual commands and input at a keyboard and produces a single stream of text as output.

-H-

Random variables are heteroscedastic if they have different variances. The complementary concept is called homoscedasticity.

Random variables are homoscedastic if they have the same variance. This is also known as homogeneity of variance. The complement is called heteroscedasticity.

A housing unit is a house, an apartment, a mobile home or trailer, a group of rooms or a single room occupied as separate living quarters or, if vacant, intended for occupancy as separate living quarters. The Census Bureau’s estimates program prepares estimates of housing units for places, counties, states, and the nation. 

Hypothesis testing draws a conclusion about the tenability of a stated value for a parameter. For example, sample data may be used to test whether an estimated value of a parameter (such as the difference between two population means) is sufficiently different from zero that the null hypothesis, designated H0 (no difference in the population means), can be rejected in favor of the alternative hypothesis, H1 (a difference between the two population means).

-I-

An implied comparison between two (or more) estimates is one that readers might infer, either because of proximity of the two estimates in the text of the report or because the discussion presents the estimates in a manner that makes it likely readers will compare them. For an implied comparison to exist between two estimates:

  • The estimates must be for similar subgroups that it makes sense to compare (e.g., two age subgroups, two race subgroups).
  • The estimates must be of the same type (e.g., percentages, rates, levels).
  • The subgroups must differ by only one characteristic (e.g., teenage males versus teenage females; adult males versus adult females; teenage males versus adult males). If they differ by more than one characteristic an implied comparison does not exist (e.g., teenage males versus adult females).
  • The estimates appear close enough to each other in the report that the reader would make a connection between them. Two estimates in the same paragraph that satisfy the first three criteria will always constitute an implied comparison. However, if the two estimates were in different sections of a report they would not constitute an implied comparison.

Estimates presented in tables do not constitute implied comparisons. However, if a table displays the difference between two estimates, it is a direct comparison.

Imputation is a procedure for entering a value for a specific data item where the response is missing or unusable.

Information products may be in print or electronic format and include news releases; Census Bureau publications; working papers (including technical papers or reports); professional papers (including journal articles, book chapters, conference papers, poster sessions, and written discussant comments); abstracts; research reports used to guide decisions about Census Bureau programs; presentations at public events (e.g., seminars or conferences); handouts for presentations; tabulations and custom tabulations; public-use data files; statistical graphs, figures, and maps; and the documentation disseminated with these information products.

Information quality is an encompassing term comprising utility, objectivity, and integrity.

Integration testing is the phase of software testing in which individual software modules are combined and tested as a group. The purpose of integration testing is to verify functional, performance and reliability requirements placed on major design items. Integration testing can expose problems with the interfaces among program components before trouble occurs in real-world program execution.

Integrity refers to the security of information – protection of the information from unauthorized access or revision, to ensure that the information is not compromised through corruption or falsification.

Internal users – see Users.

Interviewer debriefing has traditionally been the primary method used to evaluate field or pilot tests of interviewer-administered surveys. Interviewer debriefing consists of group discussions or structured questionnaires with the interviewers who conducted the test to obtain their views of questionnaire problems.

An item allocation rate is the proportion of the estimated (weighted) total (T) of item t that was imputed using statistical procedures, such as within-household or nearest neighbor matrices populated by donors, for that item.
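
The rate can be sketched as a ratio of weighted totals, assuming each record carries a weight, a value for the item, and a flag indicating whether that value was allocated. The record layout and function name are hypothetical.

```python
def item_allocation_rate(records):
    """Share of the weighted item total that comes from allocated (imputed) values.

    `records` is a list of (weight, value, was_allocated) tuples.
    """
    total = sum(weight * value for weight, value, _ in records)
    if total == 0:
        raise ValueError("weighted total of the item is zero")
    allocated = sum(weight * value for weight, value, flag in records if flag)
    return allocated / total
```

If the weighted total of an item is 100 and allocated values contribute 40 of it, the item allocation rate is 0.4.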

Item nonresponse occurs when a respondent provides some, but not all, of the requested information, or if the reported information is not useable.

-J-

Joint partners refers to projects where both the Census Bureau and another agency are collecting the data together, but for their own use. It is a collaborative effort to reduce overall costs to the government and increase efficiency.

-K-

Key from image (KFI) is an operation in which keyers enter questionnaire responses by referring to a scanned image of a questionnaire for which entries could not be recognized by optical character or optical mark recognition with sufficient confidence.

Key from paper (KFP) is an operation in which keyers enter information directly from a hard-copy questionnaire that could not be read by optical character or optical mark recognition with sufficient confidence.

Key variables are main classification variables (e.g., geography, demographic attributes, economic attributes, industry, etc.) of units to be studied.

-L-

Latent class analysis is a method for estimating one or more components of the mean squared error of an estimator.

Linear regression is a method that models a parametric relationship between a dependent variable Y, explanatory variables Xi, i = 1, ..., p, and a random term ε. This method is called "linear" because the relation of the response (the dependent variable Y) to the independent variables is assumed to be a linear function of the parameters.

Linking – see Record linkage.

Load testing is the process of putting demand on a system or device and measuring its response. Load testing generally refers to the practice of modeling the expected usage of a software program by simulating multiple users accessing the program concurrently.

Logistic regression is a model used for prediction of the probability of occurrence of an event. It models the logit of the probability as a linear function of the parameters using explanatory variables Xi, i = 1, ..., p.
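
To illustrate how the logit links the linear function to a probability, the sketch below computes a predicted probability from already-fitted parameters. The intercept, coefficients, and function name are hypothetical; fitting the model is not shown.

```python
import math


def predicted_probability(intercept, coefficients, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + b1*x1 + ... + bp*xp)))."""
    logit = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-logit))
```

A logit of zero corresponds to a probability of 0.5; larger logits map to probabilities closer to 1 and smaller logits to probabilities closer to 0.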

A longitudinal survey is a correlational research study that involves repeated observations of the same items over long periods of time, often many decades.

Longitudinal studies are often used in psychology to study developmental trends across the life span. The reason for this is that unlike cross-sectional studies, longitudinal studies track the same unit of observation, and therefore the differences observed in those people are less likely to be the result of cultural differences across generations.

-M-

Mail-out/mail-back is a method of data collection in which the U.S. Postal Service delivers addressed questionnaires to housing units. Residents are asked to complete and mail the questionnaires to a specified data capture center.

The margin of error (MOE) is a measure of the precision of an estimate at a given level of confidence (e.g., 90%). The larger the margin of error, the less confidence one should have that the reported results are close to the "true" figures; that is, the figures for the whole population.

Master Address File (MAF)/Topologically Integrated Geographic Encoding and Referencing (TIGER) is a topologically integrated geographic database in which the topological structures define the location, connection, and relative relationship of streets, rivers, railroads, and other features to each other, and to the numerous geographic entities for which the Census Bureau tabulates data for its censuses and sample surveys.

Matching – see Record linkage.

Measurement error is the difference between the true value of the measurement and the value obtained during the measurement process.

Metadata are data about data. Metadata are used to facilitate the understanding, use and management of data. An item of metadata may describe an individual datum or content item, or a collection of data including multiple content items.

Methodological expert reviews are independent evaluations of an information product conducted by one or more technical experts. These experts may be within the Census Bureau or outside the Census Bureau, such as advisory committees. See also Peer reviews.

A microdata file includes the detailed information about people or establishments. Microdata come from interviews and administrative records.

A model is a formal (e.g., mathematical) description of a natural system. The formal system is governed by rules of inference; the natural system consists of some collection of observable and latent variables. It is presumed that the rules of inference governing the formal system mimic in some important respect the causal relations that govern the natural system (e.g., the formal laws of arithmetic apply to counting persons).

Model validation involves testing a model’s predictive capabilities by comparing the model results to “known” sources of empirical data.

Monte Carlo simulation is a technique that converts uncertainties in input variables of a model into probability distributions. By combining the distributions and randomly selecting values from them, it recalculates the simulated model many times and brings out the probability of the output.
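
A minimal sketch of the idea: each uncertain input is given a probability distribution, values are drawn at random, and the model is recalculated many times to build up a distribution of outputs. The cost model, the input distributions, and all names here are hypothetical and chosen only for illustration.

```python
import random


def simulate(model, input_samplers, n_trials=20000, seed=12345):
    """Recalculate `model` n_trials times, drawing each input from its sampler."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    outputs = []
    for _ in range(n_trials):
        inputs = {name: draw(rng) for name, draw in input_samplers.items()}
        outputs.append(model(**inputs))
    return outputs


def cost_model(labor, materials):
    """Hypothetical model: total cost is the sum of two uncertain inputs."""
    return labor + materials


# Hypothetical input distributions standing in for the uncertain inputs.
samplers = {
    "labor": lambda rng: rng.uniform(10.0, 20.0),
    "materials": lambda rng: rng.gauss(5.0, 1.0),
}

outputs = simulate(cost_model, samplers)
mean_output = sum(outputs) / len(outputs)
# Estimated probability that the output exceeds a threshold of 22.
share_above_22 = sum(o > 22.0 for o in outputs) / len(outputs)
```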

In multi-stage sampling, a sample of clusters is selected and then a subsample of units is selected within each sample cluster. If the subsample of units is the last stage of sample selection, it is called a two-stage design. If the subsample is also a cluster from which units are again selected, it is called a three-stage design, etc.

Multicollinearity is a statistical term for the existence of a high degree of linear correlation amongst two or more explanatory variables in a multiple regression model. In the presence of multicollinearity, it is difficult to assess the effect of the independent variables on the dependent variable.

Multivariate analysis is a generic term for many methods of analysis that are used to investigate relationships among two or more variables.

-N-

Noise infusion is a method of disclosure avoidance in which values for each establishment are perturbed prior to table creation by applying a random noise multiplier to the magnitude data (e.g., characteristics such as first-quarter payroll, annual payroll, and number of employees) for each company.
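
The following sketch shows one way a random noise multiplier could be applied to an establishment's magnitude data before tabulation. The multiplier range (5% to 15% away from 1, in either direction), the field names, and the function names are assumptions made for illustration; the actual parameters are not given in this glossary.

```python
import random


def noise_multiplier(rng, low=0.05, high=0.15):
    """Draw a multiplier that moves a value up or down by between low and high."""
    magnitude = rng.uniform(low, high)
    direction = rng.choice((-1, 1))
    return 1.0 + direction * magnitude


def infuse_noise(establishments, magnitude_fields, seed=2024):
    """Return perturbed copies; one multiplier per establishment, applied to
    every magnitude field so that ratios between its fields are preserved."""
    rng = random.Random(seed)
    perturbed = []
    for record in establishments:
        m = noise_multiplier(rng)
        noisy = dict(record)
        for field in magnitude_fields:
            noisy[field] = record[field] * m
        perturbed.append(noisy)
    return perturbed
```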

Nonresponse means the failure to obtain information from a sample unit for any reason (e.g., no one home or refusal). There are two types of nonresponse – see Unit nonresponse and Item nonresponse.

Nonresponse bias is the deviation of the expected value of an estimate from the population parameter due to differences between respondents and nonrespondents. The impact of nonresponse on a given estimate is affected by both the degree of nonresponse and the degree that the respondents’ reported values differ from what the nonrespondents would have reported.

Nonresponse error is the overall error observed in estimates caused by differences between respondents and nonrespondents. It consists of a variance component and nonresponse bias.

Nonresponse follow-up is an operation whose objective is to obtain completed questionnaires from housing units for which the Census Bureau did not have a completed questionnaire in mail areas (mailout/mailback, update/leave, and urban update/leave).

Nonresponse subsampling is a method for reducing nonresponse bias in which new attempts are made to obtain responses from a subsample of sampling units that did not provide responses to the first attempt.

Nonsampling errors are survey errors caused by factors other than sampling (e.g., nonsampling errors include errors in coverage, response errors, non-response errors, faulty questionnaires, interviewer recording errors, and processing errors).

The North American Industry Classification System (NAICS) is the standard used by Federal statistical agencies in classifying business establishments for the purpose of collecting, analyzing, and publishing statistical data related to the U.S. business economy. Canada, Mexico, and the U.S. jointly developed the NAICS to provide new comparability in statistics about business activity across North America. NAICS coding has replaced the U.S. Standard Industrial Classification (SIC) system (for more information, see www.census.gov/epcd/www/naics.html).

-O-

Objectivity focuses on whether information is accurate, reliable, and unbiased, and is presented in an accurate, clear, complete, and unbiased manner.

Optical character recognition (OCR) is a technology that uses an optical scanner and computer software to “read” human handwriting and convert it into electronic form.

Optical mark recognition (OMR) is a technology that uses an optical scanner and computer software to recognize the presence of marks in predesignated areas and assign a value to the mark depending on its specific location and intensity on a page.

Outliers in a set of data are values that are so far removed from other values in the distribution that their presence cannot be attributed to the random combination of chance causes.

-P-

The p-value is the probability, calculated assuming the null hypothesis (H0) is true, of obtaining the observed value of the test statistic or a value that is more extreme in the direction of the alternative hypothesis.
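
As an illustration, the sketch below computes an upper-tail p-value for a test statistic that is assumed to follow a standard normal distribution when H0 is true, and compares it with a significance level. The choice of a z statistic, the one-sided alternative, and the default level of 0.10 are assumptions for the example.

```python
import math


def upper_tail_p_value(z):
    """P(Z >= z) for a standard normal Z: the chance, when H0 is true, of a
    statistic at least as large as the observed one (alternative: 'greater')."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def is_significant(p_value, alpha=0.10):
    """Reject H0 when the p-value is at or below the significance level."""
    return p_value <= alpha
```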

Parameters are unknown, quantitative measures (e.g., total revenue, mean revenue, total yield or number of unemployed people) for the entire population or for specified domains that are of interest. A parameter is a constant in the equation of a curve that can be varied to yield a family of similar curves or a quantity (such as the mean, regression coefficient, or variance) that characterizes a statistical population and that can be estimated by calculations from sample data.

Participation means that the employee takes an active role in the event.

A peer review is an independent evaluation of an information product conducted by one or more technical experts.

Personally identifiable information refers to any information about an individual maintained by the Census Bureau which can be used to distinguish or trace an individual’s identity, such as their name, social security number, date and place of birth, biometric records, etc., including any other personal information which is linked or linkable to an individual. Also see Business identifiable information.

Census Bureau information products must not contain policy views. The Census Bureau’s status as a statistical agency requires us to absolutely refrain from taking partisan political positions. Furthermore, there is an important distinction between producing data and using that data to advocate for program and policy changes. The Census Bureau’s duty is to produce high quality, relevant data that the nation’s policy makers can use to formulate public policy and programs. The Census Bureau should not, however, insert itself into a debate about the program or policy implications of the statistics it produces. We produce poverty statistics; we do not advocate for programs to alleviate poverty.

Population estimates (post-censal or intercensal estimates) are prepared for demographic groups and geographic areas. These estimates usually are developed from separate measures of the components of population change (births, deaths, domestic net migration, and net international migration) in each year but may be supplemented with other methodologies in the absence of current measures of components.

Post-stratification is applied to survey data by stratifying sample units after data collection using information collected in the survey and auxiliary information to adjust weights to population control totals or for nonresponse adjustment.
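
A minimal sketch of the weight adjustment, assuming each unit has already been assigned to a post-stratum and that a population control total is known for every post-stratum. The record layout and names are hypothetical.

```python
from collections import defaultdict


def post_stratify(units, control_totals):
    """Scale weights so each post-stratum's weighted count equals its control total.

    units: list of dicts with 'stratum' and 'weight' keys (hypothetical layout).
    control_totals: dict mapping stratum -> known population total.
    """
    weighted_counts = defaultdict(float)
    for unit in units:
        weighted_counts[unit["stratum"]] += unit["weight"]

    adjusted = []
    for unit in units:
        stratum = unit["stratum"]
        factor = control_totals[stratum] / weighted_counts[stratum]
        adjusted.append({**unit, "weight": unit["weight"] * factor})
    return adjusted
```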

Precision of survey results refers to how closely the results from a sample can be reproduced across repeated samples conducted using the same techniques from the same population at the same time. A precise estimate is stable over replications.

Pretesting is a broad term that incorporates many different techniques for identifying problems for both respondents and interviewers with regard to question content, order/context effects, skip instructions, and formatting.

Primary sampling units (PSU) are clusters of reporting units selected in the first stage of a multi-stage sample.

Probabilistic methods for survey sampling are any of a variety of methods for sampling that give a known, non-zero probability of selection to each member of the frame. The advantage of probabilistic sampling methods is that sampling error can be calculated without reference to a model assumption. Such methods include random sampling, systematic sampling, and stratified sampling.

The probability of selection is the probability that a population (frame) unit will be drawn in a sample. In a simple random selection, this probability is the number of elements drawn in the sample divided by the number of elements on the sampling frame.

Probability sampling is an approach to sample selection that satisfies certain conditions:

  1. We can define the set of samples that are possible to obtain with the sampling procedure.
  2. A known probability of selection is associated with each possible sample.
  3. The procedure gives every element in the population a nonzero probability of selection.
  4. We select one sample by a random mechanism under which each possible sample receives exactly its probability of selection.

A project is a temporary endeavor undertaken to create a unique product, service, or result.

A projection is an estimate of a future value of a characteristic based on trends.

Protected information (as defined in Data Stewardship Policy DS007, Information Security Management Program) includes information about individuals, businesses, and sensitive statistical methods that are protected by law or regulation. The Census Bureau classifies the following as protected information:

  • Individual census or survey responses.
  • Microdata or paradata, containing original census or survey respondent data and/or administrative records data that do not meet the disclosure avoidance requirements.
  • Address lists and frames, including the Master Address File (MAF).
  • Pre-release Principal Economic Indicators and Demographic Time-Sensitive Data.
  • Aggregate statistical information produced for internal use or research that do not meet the Disclosure Review Board disclosure avoidance requirements, or that have not been reviewed and approved for release.
  • Internal use methodological documentation in support of statistical products such as the primary selection algorithm, swapping rates, or Disclosure Review Board checklists.
  • All personally identifiable information (PII) protected by an existing legal authority (such as Title 13, Title 15, Title 5, and Title 26).
  • All business identifiable information (BII) protected by an existing legal authority.

A public event means that the event is open to the general public, including events that require a registration fee.

-Q-

A qualified user is a user with the experience and technical skills to meaningfully understand and analyze the data and results. For example, a qualified user of direct estimates produced from samples understands sampling, estimation, variance estimation, and hypothesis testing.

A quantity response rate is the proportion of the estimated (weighted) total (T) of data item t reported by tabulation units in the sample (expressed as a percentage). [Note: Because the value of economic data items can be negative (e.g., income), the absolute value must be used in the numerators and denominators in all calculations.]
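
A simplified sketch of the calculation, assuming each tabulation unit carries a weight, a final value for the data item (reported or otherwise), and a flag saying whether the value was reported. The tuple layout is hypothetical, and the official formula may include details not given in this glossary. Absolute values are used as the note above requires.

```python
def quantity_response_rate(units):
    """Percentage of the weighted total of a data item that was reported.

    units: iterable of (weight, value, reported) tuples (hypothetical layout).
    Absolute values are used because economic items can be negative.
    """
    units = list(units)
    reported_total = sum(w * abs(v) for w, v, reported in units if reported)
    overall_total = sum(w * abs(v) for w, v, _ in units)
    return 100.0 * reported_total / overall_total
```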

A questionnaire is a set of questions designed to collect information from a respondent. A questionnaire may be interviewer-administered or respondent-completed, using paper-and-pencil methods for data collection or computer-assisted modes of completion.

-R-

Raking is a method of adjusting sample estimates to known marginal totals from an independent source. For a two-dimensional case, the procedure uses the sample weights to proportionally adjust the weights so that the sample estimates agree with one set of marginal totals. Next, these adjusted weights are proportionally adjusted so that the sample estimates agree with the second set of marginal totals. This two-step adjustment process is repeated until the sample estimates converge simultaneously to both sets of marginal totals.
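
The two-dimensional procedure described above can be sketched as follows. The table holds weighted sample estimates by row and column category; it is assumed that all cells are positive and that the two sets of marginal totals sum to the same grand total. Function and parameter names are hypothetical.

```python
def rake(table, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Alternately scale rows and columns until both sets of margins are met.

    table: list of rows of positive weighted estimates.
    Assumes sum(row_targets) == sum(col_targets).
    """
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Step 1: proportionally adjust each row to its marginal total.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [value * target / total for value in cells[i]]
        # Step 2: proportionally adjust each column to its marginal total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
        # Columns now match; stop once the rows still match as well.
        row_gap = max(abs(sum(row) - t) for row, t in zip(cells, row_targets))
        if row_gap < tol:
            break
    return cells
```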

In random rounding, cell values are rounded, but instead of using standard rounding conventions a random decision is made as to whether they will be rounded up or down.
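
A minimal sketch of random rounding to a base of 5. The base and the rule that the probability of rounding up is proportional to the remainder (which keeps the rounded value unbiased in expectation) are assumptions for illustration; the glossary states only that the direction is decided at random.

```python
import random


def random_round(value, base=5, rng=random):
    """Round a non-negative integer cell value to a multiple of `base`,
    choosing the direction at random."""
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base
    lower = value - remainder
    # Assumption: round up with probability remainder / base.
    if rng.random() < remainder / base:
        return lower + base
    return lower
```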

Ratio estimation is a method of estimating from sample data. In ratio estimation, an auxiliary variate x_i, correlated with y_i, is obtained for each unit in the sample. The population total X of the x_i must be known. The goal is to obtain increased precision by taking advantage of the correlation between y_i and x_i. The ratio estimate of Y, the population total of y_i, is $\hat{Y}_R = (y / x)\,X$, where y and x are the sample totals of y_i and x_i respectively.
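
As a small worked sketch, the ratio estimate multiplies the known population total X by the ratio of the sample totals y and x. The function and argument names are hypothetical.

```python
def ratio_estimate(sample_y, sample_x, population_total_x):
    """Estimate the population total of y as (y / x) * X, where y and x are
    the sample totals and X is the known population total of the auxiliary."""
    y = sum(sample_y)
    x = sum(sample_x)
    return (y / x) * population_total_x
```

For instance, sample totals of y = 12 and x = 6 with a known X = 100 give an estimated total of 200.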

Readily accessible means that users can access the documentation when they need it, not that it is only available on request.

Recoding is a disclosure limitation technique that involves collapsing/regrouping detail categories of a variable so that the resulting categories are safe.

Record linkage is the process of linking or matching two or more records that are determined to refer to the same person or establishment.

Regression is a statistical method which tries to predict the value of a characteristic by studying its relationship with one or more other characteristics.

A regression model is a statistical model used to depict the relationship of a dependent variable to one or more independent variables.

Reimbursable projects are those for which the Census Bureau receives payment (in part or in total) from a customer for products or services rendered.

Reinterview is repeated measurement of the same unit intended to estimate measurement error (response error reinterview) or designed to detect and deter falsification (quality control reinterview).

A release phase refers to the point in the statistical process where you release the data. It may be to the public, the sponsor, or any other user for whom the data was created.

Releases of information products are the delivery or the dissemination of information products to government agencies, organizations, sponsors, or individuals outside the Census Bureau, including releases to the public.

Replication methods are variance estimation methods that take repeated subsamples, or replicates, from the data, re-compute the weighted estimate for each replicate, and then compute the variance based on the deviations of these replicate estimates from the full-sample estimate. The subsamples are generated to properly reflect the variability due to the sample design.
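
One simple member of this family is the delete-one jackknife, sketched below for a weighted mean. This is only an illustration: it treats the units as a simple, unstratified sample, whereas production replicates are built to reflect the actual sample design. Names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of the values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def jackknife_variance(values, weights):
    """Delete-one jackknife: drop each unit in turn, recompute the estimate,
    and sum squared deviations from the full-sample estimate."""
    n = len(values)
    full_estimate = weighted_mean(values, weights)
    squared_deviations = 0.0
    for drop in range(n):
        rep_values = values[:drop] + values[drop + 1:]
        rep_weights = weights[:drop] + weights[drop + 1:]
        replicate_estimate = weighted_mean(rep_values, rep_weights)
        squared_deviations += (replicate_estimate - full_estimate) ** 2
    # (n - 1) / n is the usual delete-one jackknife scaling factor.
    return (n - 1) / n * squared_deviations
```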

Reproducibility means that the information is capable of being substantially reproduced, subject to an acceptable degree of imprecision. For information judged to have more (less) important impacts, the degree of imprecision that is tolerated is reduced (increased). If the Census Bureau applies the reproducibility test to specific types of original or supporting data, the associated guidelines shall provide relevant definitions of reproducibility (e.g., standards for replication of laboratory data). With respect to analytic results, “capable of being substantially reproduced” means that independent analysis of the original or supporting data using identical methods would generate similar analytic results, subject to an acceptable degree of imprecision or error.

A residual is the observed value minus the predicted value.

Respondent burden is the estimated total time and financial resources expended by the respondent to generate, maintain, retain, and provide census or survey information.

Respondent debriefing is a pretesting technique that involves using a structured questionnaire following data collection to elicit information about respondents' interpretations of survey questions.

A response analysis survey is a technique for evaluating questionnaires from the perspective of the respondent. It is typically a respondent debriefing conducted after a respondent has completed the main survey.

Response error is the difference between the true answer to a question and the respondent's answer. It may be caused by the respondent, the interviewer, the questionnaire, the survey procedure or the interaction between the respondent and the interviewer.

A response rate measures the proportion of the selected sample that is represented by the responding units.

Revisions history is a stability diagnostic to compare regARIMA modeling and seasonal adjustment results over lengthening time spans. History analysis begins with a shortened series. Series values are added, one at a time, and the regARIMA model and seasonal adjustment are reestimated. Comparing different sets of adjustment options for the same series may indicate that one set of options is more stable. Among adjustment options whose other diagnostics indicate acceptable quality, options that result in fewer large revisions, that is, fewer large changes as data are added, usually are preferred.

-S-

The sample design describes the target population, frame, sample size, and the sample selection methods.

The sample size is the number of population units or elements selected for the sample, determined in relation to the required precision and available budget for observing the selected units.

A sample survey is a data collection that obtains data from a sample of the population.

The sampled population is the collection of all possible observation units (objects on which measurements are taken) that might have been chosen in the sample. For example, in a presidential poll taken to determine who people will vote for, the target population might be all persons who are registered to vote. The sampled population might be all registered voters who can be reached by telephone.

Sampling is the process of selecting a segment of a population to observe and facilitate the estimation and analysis of something of interest about the population. The set of sampling units selected is referred to as the sample. If all the units are selected, the sample is referred to as a census.

Sampling error is the uncertainty associated with an estimate that is based on data gathered from a sample of the population rather than the full population.

A sampling frame is any list or device that, for purposes of sampling, de-limits, identifies, and allows access to the sampling units, which contain elements of the frame population. The frame may be a listing of persons, housing units, businesses, records, land segments, etc. One sampling frame or a combination of frames may be used to cover the entire frame population.

Sampling units are the basic components of a sampling frame. The sampling unit may contain, for example, defined areas, houses, people, or businesses.

Sampling weight is a weight assigned to a given sampling unit that equals the inverse of the unit's probability of being included in the sample and is determined by the sample design. This weight may include a factor due to subsampling.
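
A short sketch of the relationship between selection probability and sampling weight, using simple random sampling as the example design. Function names are hypothetical.

```python
def srs_selection_probability(sample_size, frame_size):
    """Under simple random sampling: units drawn divided by units on the frame."""
    return sample_size / frame_size


def sampling_weight(selection_probability):
    """The sampling weight is the inverse of the probability of selection."""
    return 1.0 / selection_probability


def estimated_total(values, weights):
    """Weighted sum of sample values, an estimate of the population total."""
    return sum(w * v for v, w in zip(values, weights))
```

A sample of 50 units from a frame of 1,000 gives each unit a selection probability of 0.05 and a weight of 20, so each sampled unit stands in for roughly 20 frame units.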

Sanitized data, used for testing, may be totally fictitious or based on real data that have been altered to eliminate the ability to identify the information of any entity represented by the data.

Scheffé's method is a method for adjusting significance levels in a linear regression analysis to account for multiple comparisons. It is particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions. Scheffé's method is a single-step multiple comparison procedure which applies to the set of estimates of all possible contrasts among the factor level means, not just the pairwise differences considered by the Tukey method.

A scoring weight is the amount of value assigned when a pair of records agree or disagree on the same matching variable. Each matching variable is assigned two scoring weights: a positive weight for agreement and a negative weight for disagreement. After the pair is compared on each matching variable in turn, the assigned weights are added to get a total score for the record pair. Pairs of records with scores above a predetermined cut-off are classified as a match; pairs of records with scores below a second predetermined cut-off are classified as a non-match.
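
The scoring and classification steps can be sketched as follows. The matching variables, weight values, and the "undetermined" label for scores falling between the two cut-offs are hypothetical; the glossary does not say how such pairs are handled.

```python
def total_score(record_a, record_b, scoring_weights):
    """Sum agreement or disagreement weights over all matching variables.

    scoring_weights maps variable -> (agreement_weight, disagreement_weight).
    """
    score = 0.0
    for variable, (agree_w, disagree_w) in scoring_weights.items():
        if record_a[variable] == record_b[variable]:
            score += agree_w
        else:
            score += disagree_w
    return score


def classify(score, upper_cutoff, lower_cutoff):
    """Classify a record pair by comparing its score with the two cut-offs."""
    if score > upper_cutoff:
        return "match"
    if score < lower_cutoff:
        return "non-match"
    return "undetermined"  # hypothetical label for the in-between region
```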

Seasonal adjustment is a statistical technique that consists of estimating seasonal factors and applying them to a time series to remove the seasonal variations in the estimates.

Sensitivity analysis is designed to determine how the variation in the output of a model (numerical or otherwise) can be apportioned, qualitatively or quantitatively, to changes in input parameter values and assumptions. This type of analysis is useful in ascertaining the capability of a given model, as well as its robustness and reliability.

Sequential sampling is a sampling method in which samples are taken one at a time or in successive predetermined groups, until the cumulative result of their measurements (as assessed against predetermined limits) permits a decision to accept or reject the population or to continue sampling. The number of observations required is not determined in advance, but the decision to terminate the operation depends, at each stage, on the results of the previous observations. The plan may have a practical, automatic termination after a certain number of units have been examined.

Significance level refers to the probability of rejecting a true null hypothesis.

Simple random sampling (SRS) is a basic probability selection scheme that uses equal probability sampling with no strata.

A skip pattern in a data collection instrument is the process of skipping over non-applicable questions depending upon the answer to a prior question.

Sliding spans diagnostics are seasonal adjustment stability diagnostics for detecting adjustments that are too unstable. X-12-ARIMA creates up to four overlapping subspans of the time series, seasonally adjusts each span, then compares the adjustments of months (quarters with quarterly data) common to two or more spans. Months are flagged whose adjustments differ by more than a certain cutoff. (The default cutoff is 3% for most comparisons.) If too many months are flagged, the seasonal adjustment is rejected for being too unstable. The series should not be adjusted unless other software options are found that lead to an adjustment with an acceptable number of flagged months. Sliding spans diagnostics can include comparisons of seasonally adjusted values, seasonal factors, trading day factors, month-to-month changes and year-to-year changes. (Year-to-year change results are not used to accept or reject an adjustment.)

Small area estimation is a statistical technique involving the estimation of parameters for small sub-populations for which the survey sample is too small, or absent altogether, to support accurate direct estimates. The term “small area” may refer strictly to a small geographical area such as a county, but may also refer to a “small domain,” i.e., a particular demographic within an area. Small area estimation methods use models and additional data sources (such as census data) that exist for these small areas in order to improve estimates for them.

Special sworn status (SSS) is conferred upon individuals for whom the Census Bureau approves access to confidential Census Bureau data in furtherance of a Title 13 purpose. SSS individuals are subject to the same legal penalties for violation of confidentiality as employees.

Spectral graphs are diagnostic graphs that indicate the presence of seasonal or trading day effects. Visually significant peaks at the marked seasonal and/or trading day frequencies usually indicate the presence of these effects, in some cases as residual effects after an adjustment that is not fully successful for the span of data from which the spectrum is calculated. Spectral graphs are available for the prior-adjusted series (or original series if specified), regARIMA model residuals, seasonally adjusted series, and modified irregular.

Split panel tests refer to controlled experimental testing of questionnaire variants or data collection modes to determine which one is "better" or to measure differences between them.

Stakeholders include Congress, federal agencies, sponsors, state and local government officials, advisory committees, trade associations, or organizations that fund data programs, use the data, or are affected by the results of the data programs.

The standard deviation is the square root of the variance and measures the spread or dispersion around the mean of a data set.

The standard error is a measure of the variability of an estimate due to sampling.

The Standard Occupational Classification System (SOC) is used to classify workers into occupational categories for the purpose of collecting, calculating, or disseminating data (for more information, see www.bls.gov/soc/).

Statistical attribute matching consists of comparing two records, determining if they refer to “similar” entities (but not necessarily the same entity), and augmenting data from one record to the other.

Statistical inference is inference about a population from a random or representative sample drawn from it. It includes point estimation, interval estimation, and statistical significance testing.

A statistical model consists of a series of assumptions about a data generating process that explicitly involve probability distributions and functions on those distributions, in order to construct an estimate or a projection of one or more phenomena.

Statistical purposes refer to the description, estimation, or analysis of the characteristics of groups without identifying the individuals or organizations that compose such groups.

Statistical significance is attained when a statistical procedure applied to a set of observations yields a p-value that is at or below the level of probability at which it is agreed that the null hypothesis will be rejected.

Strata are created by partitioning the frame and are generally defined to include relatively homogeneous units within strata.

Stratification involves dividing the sampling frames into subsets (called strata) prior to the selection of a sample for statistical efficiency, for production of estimates by stratum, or for operational convenience. Stratification is done such that each stratum contains units that are relatively homogeneous with respect to variables that are believed to be highly correlated with the information requested in the survey.

Stratified sampling is a sampling procedure in which the population is divided into homogeneous subgroups or strata and the selection of samples is done independently in each stratum.

Sufficient data is determined for a survey by whether the respondent completes enough items for the case to be considered a completed response.

Supplemental reinterview allows the regional offices to select any field representative (FR) with an original interview assignment for reinterview. All assigned cases that are not selected for reinterview are available as inactive supplemental reinterview cases. The regional office may place a field representative in supplemental reinterview for various reasons: the FR was not selected for reinterview; the FR was hired during the assignment period; or the regional office needs to reinterview additional cases to investigate the FR for suspected falsification.

Swapping is a disclosure limitation technique that involves selecting a sample of records, finding a match in the database on a set of predetermined variables, and swapping all other variables.

Synthetic data are microdata records created to improve data utility while preventing disclosure of confidential respondent information. Synthetic data are created by statistically modeling original data and then using those models to generate new data values that reproduce the original data's statistical properties. Users are unable to identify the information of the entities that provided the original data.

Systematic sampling is a method of sample selection in which the sampling frame is listed in some order and every kth element is selected for the sample, beginning from a random start between 1 and k.
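
A minimal sketch of the selection rule, assuming the frame is already listed in the desired order. The function name and the use of Python's random module for the random start are illustrative choices.

```python
import random


def systematic_sample(frame, k, rng=random):
    """Select every kth element of an ordered frame after a random start.

    The start is drawn uniformly from positions 1..k (1-based, inclusive).
    """
    start = rng.randint(1, k)
    return frame[start - 1::k]
```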

A systems test is used to test the data collection instrument along with the data management systems.

-T-

The target population is the complete collection of observations under study. For example, in a presidential poll taken to determine who people will vote for, the target population might be all persons who are registered to vote. The sampled population might be all registered voters who can be reached by telephone.

A Taylor series is a representation of a function as an infinite sum of polynomial terms calculated from the values of its derivatives at a single point.

The Taylor series method for variance estimation is used to estimate variances for non-linear estimators such as ratio estimators. If the sample size is large enough so that the estimator can be closely approximated by the first order (linear) terms in the Taylor series, then the variances can be approximated by using variance methods appropriate for linear statistics. The Taylor series approximation to the ratio estimator is: $\hat{Y}_R = X\,(y / x) \approx Y + (y - R\,x)$, where $R = Y / X$. This approximation is linear in the survey sample totals x and y.

Testing is a process used to ensure that methods, systems or other components function as intended.

A time series is a sequence of data values obtained over a period of time, usually at uniform intervals.

Timeliness of information reflects the length of time between the information's availability and the event or phenomenon it describes.

Top-coding is a disclosure limitation technique that involves limiting the maximum value of a variable allowed on the file to prevent disclosure of individuals or other units with extreme values in a distribution.
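
In its simplest form, top-coding replaces any value above a chosen cap with the cap itself. The sketch below assumes that simple rule; the cap value in the example is hypothetical.

```python
def top_code(values, cap):
    """Limit every value to at most `cap`, leaving smaller values unchanged."""
    return [min(value, cap) for value in values]
```

With a hypothetical cap of 150,000, the incomes 25,000, 90,000, and 1,250,000 would be released as 25,000, 90,000, and 150,000.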

Topologically Integrated Geographic Encoding and Referencing (TIGER) – see definition for Master Address File (MAF)/Topologically Integrated Geographic Encoding and Referencing (TIGER).

A total quantity response rate is the proportion of the estimated (weighted) total (T) of data item t reported by tabulation units in the sample or from sources determined to be equivalent-quality-to-reported data (expressed as a percentage).

Touch-tone data entry (TDE) is a data collection method that uses an electronic instrument to collect and capture data by telephone.

Transparency refers to providing documentation about the assumptions, methods, and limitations of an information product to allow qualified third parties to reproduce the information, unless prevented by confidentiality or other legal constraints.

Truth decks are used to test imputation methods by comparing the imputed values to the original values for the items flagged as missing. The truth deck originates as a file of true responses. Certain responses are then blanked in a manner that reflects the probable nonresponse in the sample. The truth deck is then run through the imputation process in order to evaluate the accuracy of the imputed values.

Tukey’s method is a single-step multiple comparison procedure and statistical test generally used in conjunction with an ANOVA to find which means are significantly different from one another. Named after John Tukey, it compares all possible pairs of means, and is based on a studentized range distribution q (this distribution is similar to the distribution of t from the t-test).

-U-

Unduplication involves the process of deleting units that are erroneously in the frame more than once to correct for overcoverage.

Unit nonresponse occurs when a sampled unit fails to respond or a sampled unit response does not meet a minimum threshold and is classified as not having responded at all.

Usability testing in surveys is the process whereby a group of representative users are asked to interact and perform tasks with survey materials (e.g., computer-assisted forms) to determine if the intended users can carry out planned tasks efficiently, effectively, and satisfactorily.

A user interface consists of the aspects of a computer system or program that can be seen (or heard or otherwise perceived) by the human user, together with the commands and mechanisms the user employs to control its operation and input data.

Users are organizations, agencies, the public, or any others expected to use the information products. Census Bureau employees, contractors, and other Special Sworn Status individuals affiliated with the Census Bureau are internal users. Users outside of the Census Bureau, including Congress, federal agencies, sponsors, other Special Sworn Status individuals, and the public, are external users.

Utility refers to the usefulness of the information for its intended users.

-V-

Variance is a measurement of the error associated with nonobservation, that is, the error that occurs because not all members of the frame population are measured. The measurement is the average of the squared differences between data points and the mean.
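The arithmetic in the second sentence can be sketched directly. The function name and data are illustrative. Note that this divides by the number of data points, as the definition states; estimating a variance from a sample commonly divides by one less than that, and variance estimation for complex survey designs uses other methods not shown here.

```python
from statistics import mean, pvariance


def average_squared_deviation(data):
    """Average of the squared differences between data points and the mean."""
    m = mean(data)
    return mean([(x - m) ** 2 for x in data])


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean is 5.0
variance = average_squared_deviation(values)        # 32 / 8 -> 4.0

# The standard library's population variance gives the same result.
same_as_stdlib = variance == pvariance(values)
```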

Version control is the establishment and maintenance of baselines and the identification of changes to baselines that make it possible to return to the previous baseline. A baseline, in the context of documentation, is a document that has been formally reviewed and agreed on.

-W-

Weights are values associated with each sample unit that are intended to account for probabilities of selection for each unit and other errors such as nonresponse and frame undercoverage so that estimates using the weights represent the entire population. A weight can be viewed as an estimate of the number of units in the population that the sampled unit represents.
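One common way to build such weights, shown here only as a sketch, is to start from a base weight equal to the inverse of the selection probability and then inflate respondents' weights so that they also represent the nonrespondents. Treating the whole sample as a single adjustment cell is a simplifying assumption, and the function name and data are hypothetical.

```python
def nonresponse_adjusted_weights(units):
    """Return one weight per sampled unit.

    `units` is a list of (selection_probability, responded) pairs.
    Respondents receive their base weight times an adjustment factor;
    nonrespondents receive a weight of zero.
    """
    # Base weight: the number of population units each sampled unit represents.
    base = [1.0 / p for p, _ in units]
    total = sum(base)
    respondent_total = sum(w for w, (_, responded) in zip(base, units) if responded)
    if respondent_total == 0:
        raise ValueError("no respondents; the adjustment is undefined")
    # Single-cell adjustment: respondents absorb the nonrespondents' weight.
    factor = total / respondent_total
    return [w * factor if responded else 0.0
            for w, (_, responded) in zip(base, units)]


# Hypothetical sample: base weights are 10, 10, 4, 4; the second unit did not respond.
sample = [(0.1, True), (0.1, False), (0.25, True), (0.25, True)]
weights = nonresponse_adjusted_weights(sample)
```

After the adjustment, the respondents' weights sum to the same total as the base weights of the full sample (28 in this example), so weighted estimates still represent the entire population.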

Working papers are information products that are prepared by Census Bureau employees (or contractors), but the Census Bureau does not necessarily affirm their content. They include technical papers or reports, division reports, research reports, and similar documents that discuss analyses of subject matter topics or methodological, statistical, technical or operational issues. The Census Bureau releases working papers to the public, generally on the Census Bureau’s Web site. Working papers must include a disclaimer, unless the Associate Director responsible for the program determines that a disclaimer is not appropriate.



Source: U.S. Census Bureau | Methodology and Standards Council |  Last Revised: July 08, 2013