TECHNICAL DEVELOPMENT OF THE PROPOSED STATISTICAL METADATA STANDARD
Gregory J. Lestina, Jr.,
William P. LaPlant, Jr.,
Daniel W. Gillman,
Martin V. Appel
Abstract
The Bureau of the Census is developing a Statistical Metadata Content Standard to define the necessary metadata
to describe all aspects of survey design, processing, analysis, and data sets. The draft standards document must be
easily reviewed by subject matter experts. Our experience has shown that information displayed in the format of a
standard is not easily understood by experts outside the standards community. In order to facilitate review and
discussion, we chose to display the standard in a format similar to a textbook's Table of Contents (TOC), with
subsequent information displayed in an outline format. To further facilitate review and comment, the TOC was
published in HTML format on the Bureau's World Wide Web server. Thus a reviewer, using internet browsers such
as Mosaic or Netscape, can traverse the document, display context sensitive help, such as definitions, and leave
behind comments as he/she scrutinizes the document.
In addition to displaying the standard, the TOC provides a mechanism for navigating and identifying subjects of
interest about surveys for data dissemination, survey design and documentation, and process integration. The TOC
also has certain system design implications. For example, it serves as a mapping between tools for data
dissemination and integrated processing. In addition, the TOC is being used as a blueprint for building conceptual
and logical data models for metadata repositories.
This paper will present a description of the TOC, a description of its uses, details of the Web implementation, and a
demonstration of the system.
1 Introduction
The United States Bureau of the Census (BOC) is continually trying to improve timeliness and accuracy of the
statistics provided to its customers. With the rapid advancement of the Internet and electronic processing of data,
there are new and more efficient ways of managing and providing information to the Census Bureau's customers.
In fact, these new advancements are further emphasized by the efforts of Vice President Al Gore. One effort, the
National Performance Review (NPR), calls for government to improve service to the citizens, and a second effort is
the development of the telecommunications National Information Infrastructure (NII), which includes widespread
dissemination of government information over the national network. Therefore, our goal is to make data and
documentation easier to access, understand, and use. The development and use of metadata standards and systems
are essential to the success of this effort.
2 Statistical Metadata Definition
Statistical metadata is descriptive information or documentation about statistical data, i.e., microdata or macrodata.
Statistical metadata facilitates sharing, querying, and understanding of statistical data over the lifetime of the data.
The types of statistical data (electronic or otherwise) are described as follows:
Microdata - data on the characteristics of units of a population, such as individuals, households or
establishments, collected by a census, survey, or experiment;
Macrodata - data derived from microdata by statistics on groups or aggregates, such as counts, means, or
frequencies.
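The relationship between the two types of data can be illustrated with a minimal sketch. The household records, field names, and the `summarize` helper below are hypothetical and invented for illustration; the sketch only shows how counts, means, and frequencies (macrodata) can be derived from unit-level records (microdata).

```python
from collections import Counter
from statistics import mean

# Hypothetical microdata: one record per household (values invented for illustration).
microdata = [
    {"state": "MD", "household_size": 3, "income": 42000},
    {"state": "MD", "household_size": 1, "income": 28000},
    {"state": "VA", "household_size": 4, "income": 51000},
    {"state": "VA", "household_size": 2, "income": 39000},
    {"state": "VA", "household_size": 2, "income": 45000},
]


def summarize(records, group_key, value_key):
    """Derive macrodata (count and mean per group) from unit-level records."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record[value_key])
    return {g: {"count": len(v), "mean": mean(v)} for g, v in groups.items()}


# Macrodata: counts and means by state, and a frequency table of household sizes.
macrodata = summarize(microdata, "state", "income")
size_frequencies = Counter(r["household_size"] for r in microdata)
```

Here `macrodata` holds one aggregate row per state, while `size_frequencies` is a frequency distribution over a single variable.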
The extensive nature of statistical metadata lends itself to categorization into three components or levels:
Systems - the information about the physical characteristics of the application's data set(s), such as
location, record layout, database schemas, media, size, etc;
Applications - the descriptive information about the application's products and procedures, such as sample
designs, questionnaires, software, variable definitions, edit specifications, etc;
Administrative - the management information, such as budgets, costs, schedules, etc.
The systems, applications, and administrative components help to differentiate the sources and uses of statistical
metadata (Gillman, Appel, LaPlant, 1996).
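As a rough illustration of this categorization, the three components can be treated as an enumeration, with each kind of metadata item assigned to one component. The item kinds below are the examples listed in the definitions above; the names `Component`, `classify`, and `kinds_in` are assumptions made for this sketch and are not part of any Census Bureau system.

```python
from enum import Enum


class Component(Enum):
    SYSTEMS = "systems"
    APPLICATIONS = "applications"
    ADMINISTRATIVE = "administrative"


# Item kinds are the examples given in the three component definitions.
COMPONENT_OF = {
    "location": Component.SYSTEMS,
    "record layout": Component.SYSTEMS,
    "database schema": Component.SYSTEMS,
    "sample design": Component.APPLICATIONS,
    "questionnaire": Component.APPLICATIONS,
    "variable definition": Component.APPLICATIONS,
    "edit specification": Component.APPLICATIONS,
    "budget": Component.ADMINISTRATIVE,
    "schedule": Component.ADMINISTRATIVE,
}


def classify(kind: str) -> Component:
    """Return the component for a kind of metadata item (case-insensitive)."""
    return COMPONENT_OF[kind.strip().lower()]


def kinds_in(component: Component) -> list:
    """List the known item kinds that belong to one component, alphabetically."""
    return sorted(k for k, c in COMPONENT_OF.items() if c is component)
```

For instance, `classify("Questionnaire")` yields the applications component, and `kinds_in(Component.ADMINISTRATIVE)` lists the management items.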
3 Current and Proposed Data Services Using or Providing Metadata
The need to review the current data dissemination systems arose in the fall of 1992, after a customer survey found
that users of Census Bureau data were dissatisfied with the timeliness of data release and with the delivery of products and
services as scheduled. The Reinvention Lab at the Bureau of the Census was organized to review these problems.
The lab's initial focus was on post-data collection processing. After analysis of customer needs, members of the Lab
came to the conclusion that they needed to frame the problem more broadly than just post-data collection processing
and include all aspects of survey design and execution. The result was the development of the concept of the
"Integrated Processing System" (IPS) (Reinvention Lab of the Census Bureau, 1994).
Since 1992, the Census Bureau has developed automated systems to improve the efficiency of statistical
dissemination. The use of the World Wide Web for data dissemination has been a great success at the Census
Bureau. An Internet prototype was begun in January, 1994, making the Census Bureau one of the pioneer federal
agencies to disseminate data using the Internet. On average, the Bureau is now receiving more than 60,000
inquiries per day from customers who access this site. Through this Internet site, the BOC offers: population
estimates and projections; economic indicators; international trade data; research from the Center for Economic
Studies; news releases; state rankings and profiles; financial data for states, counties, cities, and school districts; and
job vacancies. This prototype allows users to get Census information in seconds by using a software browser such
as NCSA Mosaic or Netscape.
Building on this success, several programs are being developed that use the Internet as a data access tool. The following
four programs provide metadata, or plan to, and use or will use the Internet for data access:
a) The Integrated Processing System (IPS) - The IPS is envisioned to be the
"umbrella" for a compatible set of automated tools to design, conduct, and manage Census Bureau surveys and
censuses in an effort to improve cost effectiveness, timely reporting, data quality, and data access (Reinvention Lab
of the Census Bureau, 1994).
b) FERRET (Federal Electronic Research and Review Extraction Tool) - FERRET presently makes Current
Population Survey (CPS) data and documentation available over the Internet through World Wide Web pages. A
user is able to extract a dataset, review variable metadata, produce cross-tabulations, and display macrodata tables in
SAS or ASCII format (Appel, Gillman, LaPlant, Creecy, 1996).
c) DADS (Data Access and Dissemination System) - The BOC plans to develop and implement an Internet
based data access and dissemination system (DADS) initially focused on the 2000 Decennial Census and
Continuous Measurement data sets, but with the ability to accommodate other data sets. The goal is to provide
customers one general (electronic) system for all Census Bureau data access (Appel, Gillman, LaPlant, Creecy,
1996).
d) StEPS (Standard Economic Processing System) - The objective of StEPS is to eliminate redundant
processing by combining existing survey systems into one system. It is anticipated that 109 current surveys of the
Economic Directorate will migrate to StEPS by December 1999 (StEPS and the Economic Directorate Vision
for Current Survey Processing, 1995).
4 Metadata Repository
The metadata repository under development will contain the metadata for survey design, processing, analysis, and
data sets. Links to the data files themselves will eventually be made, creating a fully integrated data/metadata
system.
Because the Bureau of the Census manages data in a decentralized and non-uniform way, the metadata repository
will bridge the gap between the data and the users who wish to find them. The metadata repository will facilitate a
solution for the data users while allowing the survey data managers to find a smooth transition to standard data
management strategies.
There are many specific functions for which the metadata repository is being designed. Primarily, the metadata
repository will be a standard tool for researchers and analysts to locate data and descriptions of surveys. Data
dictionaries, record layouts, questionnaires, sample designs, and standard errors are examples of information that
will be directly available.
Links from subject types, e.g., income, race, age, and geography, to data sets will allow users to locate data sets by
subject. Less obviously, users can also compare designs of different surveys and find common information collected by
different surveys.
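The subject links and the cross-survey comparison described above can be sketched as a simple inverted index. The survey names and their subject sets below are hypothetical; only the subject types (income, race, age, geography) come from the text.

```python
# Hypothetical data sets and the subjects each one covers.
DATA_SETS = {
    "survey_a": {"income", "age", "race"},
    "survey_b": {"income", "geography"},
    "survey_c": {"age", "geography", "race"},
}


def build_subject_index(data_sets):
    """Invert the data-set -> subjects mapping into subject -> data sets."""
    index = {}
    for name, subjects in data_sets.items():
        for subject in subjects:
            index.setdefault(subject, set()).add(name)
    return index


def common_subjects(data_sets, first, second):
    """Subjects collected by both of two surveys."""
    return data_sets[first] & data_sets[second]


subject_index = build_subject_index(DATA_SETS)
```

A lookup such as `subject_index["income"]` locates data sets by subject, while `common_subjects(DATA_SETS, "survey_a", "survey_c")` finds the information two surveys share.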
There are a number of types of users of data who require different kinds of metadata. Programmers will be
interested in record layouts, data dictionaries, file storage medium, and other information needed to process data.
Data analysts and researchers will have more interest in sample design, questionnaire, standard errors, and other
similar information. Managers will be more interested in costs, schedules, and processes. A complete model of
statistical metadata will have to take all these needs into account.
5 The Survey Design and Statistical Methodology Metadata Standard
This section describes the Survey Design and Statistical Methodology Standard (SDSM) and its importance in
developing the metadata repository.
5.1 Elements and Structure of the Standard
The SDSM (Census Bureau, 1996, "Standard for Survey Design and Statistical Methodology Metadata") was
begun in January 1995 and is an extension of the Content Standard for Cultural and Demographic Data Metadata
[FGDC/SCDD-95] of the Federal Geographic Data Committee (FGDC 1994a and FGDC 1994b). It is a standard
intended to provide statistical metadata elements for administrative, planning, design, collection, analysis, and
processing of statistical data and was developed primarily by Census Bureau researchers in consultation with
subject matter experts and other researchers at the Census Bureau and Bureau of Labor Statistics. Textbooks and
other standards were used as additional sources of information. The standard provides a way of classifying the
metadata and classifying the meaning of the metadata. It is intended to provide a data user with the information to
interpret and use statistical data.
The SDSM is intended to be a comprehensive, hierarchical thesaurus of terms, an outline of all the concepts
contained in any documentation about the design, processing, analysis or data dissemination of surveys or censuses.
It is designed to help the contributors and users of metadata answer questions such as "who", "why", "what",
"when", "where", and "how" for issues related to surveys, systems, and products. This outline or thesaurus is really
a list of statistical metadata at the Census Bureau. From this list, we are able to define a logical business model for
the Census Bureau. And from this logical business model, we are able to build the central metadata repository.
The metadata items of the SDSM standard are organized into "chapters" or "sections". Each chapter or section
represents a logical set of metadata. An inclusion rule and a definition are provided for each metadata data element. The
inclusion rule is the status of a metadata item in the standard (mandatory or optional), assigned to ensure a complete
set of documentation about the planning, collection, production, and analysis of a data set. The following are the
top-level chapters, their inclusion rules, and their definitions:
0. Identification (mandatory). This chapter contains the minimal set of mandatory metadata items and is
applicable to the entire set of metadata. This section contains identifying information and any
documentation developed during the conceptualization phase of survey planning.
1. Content (optional). This chapter contains information about the nature of the data that is the subject of
the survey, i.e., the universe of interest and the specific data items to be gathered. Contains definitions, data
standardization rules, and coding information.
2. Planning (optional). Documentation related to the project planning for all phases of survey work. This
includes documentation related to budgeting, staffing, and training.
3. Design (optional). This chapter includes information on the development of the universe and frame;
sampling strategies; the design of the "measurement instrument" (questionnaire or equivalent); the
construction of the "observation register" including the check-in, check-out mechanism; and how non-response will be handled.
4. Implementation (optional). This chapter includes documentation related to implementation of the
survey, including: interviewer procedures, guidelines and training materials; distribution and collection of
forms or other measurement instruments; execution of the "observation register," i.e., check-in, check-out
and enumerator diaries; field edits and verification; follow-up procedures, training and tracking; sampling
mechanism for follow-up and quality assurance; data preparation procedures and training; and mechanisms
for creating and maintaining records on the process.
5. Analysis (optional). Documentation related to all statistical processes used to analyze the survey results
or those used for displaying or presenting the resultant information.
6. Data_Processing (optional). (Computer Systems) Documentation of all computer processes needed to
support survey activities or processes.
7. Data (optional). Documentation concerning all data sets retained related to the survey, and, possibly,
the data itself.
Each of these chapters is subdivided into sections containing an outline of concepts. For example, Chapter 2,
Planning, is subdivided into Section 2.0 Point of Contact, Section 2.1 Project Conceptualization, Section 2.2 Design
Proposal, and Section 2.3 Design Evaluation. Each of these sections may contain subsections. For example, Section
2.1 contains Section 2.1.1 Options and Section 2.1.2 Shell or Model Design. There are approximately 533 chapters,
sections, and subsections in the standard. Each section and subsection contains a definition, a citation, or
possibly another form of statistical metadata, such as a URL. These metadata data elements also provide a basis for
defining logical data models about how the Census Bureau does business. An example of an element in the
standard is as follows:
2.1.1 Options (optional). <IPS>. Previous and related data and surveys that are considered in planning this
project.
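The hierarchical structure described in this section can be sketched as a tree keyed by dotted section numbers. The sketch below is only an illustration under stated assumptions: the `Element` class and the helper functions are hypothetical, and the inclusion rule is left empty wherever the text above does not state it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    number: str                       # dotted section number, e.g. "2.1.1"
    title: str
    inclusion: Optional[str] = None   # "mandatory" or "optional" when known
    definition: str = ""
    children: List["Element"] = field(default_factory=list)


def parent_number(number: str) -> Optional[str]:
    """'2.1.1' -> '2.1'; a top-level chapter such as '2' has no parent."""
    head, sep, _ = number.rpartition(".")
    return head if sep else None


def build_tree(elements):
    """Attach each element to its parent; return (top-level chapters, lookup)."""
    by_number = {e.number: e for e in elements}
    roots = []
    ordered = sorted(elements, key=lambda e: [int(p) for p in e.number.split(".")])
    for element in ordered:
        parent = parent_number(element.number)
        if parent is None:
            roots.append(element)
        else:
            by_number[parent].children.append(element)
    return roots, by_number


# Titles are those named in the text for Chapter 2.
roots, by_number = build_tree([
    Element("2", "Planning", "optional"),
    Element("2.0", "Point of Contact"),
    Element("2.1", "Project Conceptualization"),
    Element("2.1.1", "Options", "optional"),
    Element("2.1.2", "Shell or Model Design"),
    Element("2.2", "Design Proposal"),
    Element("2.3", "Design Evaluation"),
])
```

Because the section number itself encodes the position in the outline, no separate parent field is needed; the tree can always be rebuilt from a flat list of elements.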
The SDSM standard does not specify the physical format of the content, the services to be provided, or the syntax to
be used for its metadata data elements. For this reason, the SDSM is called a "content standard" because it defines
which metadata items about statistical surveys and censuses are important. The SDSM supports labeling metadata
content by tags included in the metadata itself or by indexes provided through tools (LaPlant, Lestina, Gillman,
Appel, 1996).
5.2 Relationship to Other Standards
The Content Standard for Cultural and Demographic Data Metadata (CDDM) is mapped to the metadata portions of the "Spatial Data Transfer Standard [FIPS-173]" (SDTS) and
supports providing metadata for the "Government Information Locator Service [FIPS-192]" (GILS). The SDSM
assumes the existence of these other standards which define additional, related, metadata. The thematic content of a
data file is provided as specified by the CDDM while the physical layout is provided by either an SDTS mapped to
a "Data Descriptive File for Information Interchange [FIPS-123]" (DDF) specification or by a GILS specification
(LaPlant, Lestina, Gillman, Appel, 1996).
6 The Table of Contents (TOC) for the SDSM Standard
In addition to the SDSM standard, a Table of Contents (TOC) to the standard is available. The TOC provides a map
to the contents of the standard in a way that is similar to the table of contents of a book. The TOC provides a way
for "readers" to quickly go to areas of interest (Census Bureau, 1996, "Table of Contents for Survey Design and
Statistical Methodology Metadata").
The Table of Contents will be used with survey documentation tools by the user (the survey designer, subject matter
analyst, etc). If the user is working with existing documentation, the tools will assist in organizing and annotating
that documentation. If the user is designing a new survey, this tool will provide a ready-made structure for
developing the required documentation. This will ensure that the various aspects of survey design and analysis are
addressed, or at least that an explicit decision is made to defer addressing them.
The TOC hierarchy has allowed us to further develop high level conceptual models of existing systems and show
how they will interface with the proposed metadata repository. For example, the Integrated Processing System
(IPS) will not store metadata but will link to the metadata repository to get the location of available metadata. The
Standard Economic Processing System (StEPS) will need a separate data element registry for assigning definitions
to elements in its repository so that information can be standardized across different repositories. We are
currently developing the conceptual and logical models for the standard metadata repository. The TOC is being
used as the point of reference for these models.
The Table of Contents is an on-line summary of the Survey Design and Statistical Methodology Metadata Standard. The
on-line TOC was developed to help reviewers of the SDSM. It presents the standard in an easily accessible
way on the World Wide Web. The TOC is a combination of HTML pages and HTTP CGI scripts written in Perl.
The CGI scripts are used for displaying, navigating, and allowing users to enter comments about various elements of
the standard. The on-line TOC provides an easy method for users to become familiar with the SDSM and allows
users to enter their questions or comments on any of the elements in the standard.
The URL for the TOC is http://www.census.gov/ftp/pub/std/www/TOC.html. The site has received many
visitors but only a few comments on the definitions.
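The comment mechanism can be illustrated with a minimal sketch. The actual scripts are written in Perl and are not reproduced here; the Python below is a stand-in, and the form field names `section` and `comment`, the `record_comment` function, and the in-memory store are all assumptions made for illustration.

```python
from urllib.parse import parse_qs

# A few section numbers and titles from the standard, used to validate input.
KNOWN_SECTIONS = {
    "2": "Planning",
    "2.1": "Project Conceptualization",
    "2.1.1": "Options",
}


def record_comment(query_string, store):
    """Parse a form submission and file the comment under its section number."""
    fields = parse_qs(query_string)
    section = fields.get("section", [""])[0]
    comment = fields.get("comment", [""])[0].strip()
    if section not in KNOWN_SECTIONS:
        raise ValueError("unknown section: %r" % section)
    if not comment:
        raise ValueError("empty comment")
    store.setdefault(section, []).append(comment)
    return section


comments = {}
record_comment("section=2.1.1&comment=Define+%22options%22", comments)
```

Keying comments by section number ties each remark to the element of the standard that the reviewer was reading when it was entered.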
7 Applications
As mentioned earlier, the Table of Contents reflects the business processes of the Census Bureau and can be used to
develop a metadata repository. Because of the popularity of new technologies that implement the World Wide
Web, the Census Bureau must also re-examine the ways it collects, processes, and outputs its data and reports so
that the process is more efficient for the data user. This section explains the Census Bureau's effort to model a
metadata repository and explains the use of the TOC as an important tool in this model. When the model is
complete, users will have the ability to view large holdings of Census metadata with considerable efficiency.
7.1 Modeling the Repository
The first step in this development involves defining the elements or concepts of the repository. For example, the
Census Bureau is primarily concerned with concepts such as Design, Collection, Analysis, and Dissemination of
statistics. These concepts are reflected in the Table of Contents.
The second step is to develop a logical model from the conceptual model (Barker, 1990). A logical model is a
representation of how data is stored. A tool called Open Workgroup Repository (OWR) is used to electronically
create and implement the logical model, then build the repository. The repository uses a relational database for
storing repository instances. We are using Oracle because of its availability and its ability to work with OWR. The
OWR uses the Command Manipulation Language (CML) to translate logical schema to SQL. The OWR follows
the standard provided by the Information Resource Dictionary System (IRDS), a standard that is used for
implementing repositories.
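The step from a logical model to a relational schema can be sketched as follows. This is not the OWR or Oracle implementation: the sketch substitutes Python's built-in `sqlite3` module, and the table names, column names, and sample document are hypothetical.

```python
import sqlite3

# Hypothetical relational schema: TOC elements, and documents filed under them.
SCHEMA = """
CREATE TABLE element (
    number TEXT PRIMARY KEY,
    title  TEXT NOT NULL,
    parent TEXT REFERENCES element(number)
);
CREATE TABLE document (
    doc_id  INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    element TEXT NOT NULL REFERENCES element(number)
);
"""


def build_repository():
    """Create an in-memory repository holding a few elements and one document."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO element (number, title, parent) VALUES (?, ?, ?)",
        [
            ("2", "Planning", None),
            ("2.1", "Project Conceptualization", "2"),
            ("2.1.1", "Options", "2.1"),
            ("2.2", "Design Proposal", "2"),
        ],
    )
    conn.execute(
        "INSERT INTO document (title, element) VALUES (?, ?)",
        ("Hypothetical planning memo", "2.1.1"),
    )
    return conn


def documents_under(conn, number):
    """Titles of documents attached to a section or any of its subsections."""
    rows = conn.execute(
        "SELECT title FROM document WHERE element = ? OR element LIKE ? ORDER BY title",
        (number, number + ".%"),
    )
    return [row[0] for row in rows]
```

Querying `documents_under(conn, "2")` returns every document filed anywhere under Chapter 2, which mirrors how a reader would browse the outline from a chapter down to its subsections.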
7.2 Tool Development
As mentioned earlier, the Table of Contents is also a data dissemination and collection tool that is the center of
metadata holdings at the Census Bureau. It is the central and connecting point for statistical systems located in
many different physical locations at the Census Bureau. All documentation at the Census Bureau will be linked to
the Table of Contents.
Section 4 discussed what the repository does for the user. By means of a CGI or web broker interface on the
Internet, the Table of Contents is designed to link to the appropriate systems giving the user access to the
documentation requested. All of these transactions are transparent to the user. The Table of Contents, therefore,
provides access to all available documentation on the subject selected.
Not only is the Table of Contents useful in providing metadata to users, but it is useful for data providers in
updating their data. An office or person that creates metadata, for example, may want to add the recent
memorandum on a survey supplement to the repository. Or the same user may need to update a variable definition
or add a new variable to the repository. This sort of interface would be provided interactively or as a batch process.
Given the number of different Census Bureau programs, there could be numerous "front ends", or user interfaces, to
the Table of Contents, one for each statistical program. For example, a screen could be developed that offers keywords
or lets a user enter a word, so that a document can be filed under the appropriate section of the Table of
Contents. The keyword search can also be used as a documentation search tool, with the Table of Contents
providing references to documents in the repository. Other front ends may include a menu system that lists the various
subject areas of the Census Bureau and references the Table of Contents. Alternatively, a user may prefer
opening folders, as in Lotus Notes, to access Census Bureau files. These ideas will be developed at a later date.
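A keyword front end of this kind could be sketched as a small inverted index over section text. The section descriptions below are abridged from the standard's definitions; the functions `tokenize`, `build_index`, and `search` are hypothetical, and no such tool has been built yet.

```python
import re

# Section number -> abridged descriptive text from the standard.
SECTIONS = {
    "2.1.1": "Options: previous and related data and surveys considered in planning",
    "3": "Design: universe and frame, sampling strategies, questionnaire",
    "5": "Analysis: statistical processes used to analyze the survey results",
}


def tokenize(text):
    """Lower-case a string and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(sections):
    """Map each word to the set of section numbers whose text contains it."""
    index = {}
    for number, text in sections.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(number)
    return index


def search(index, query):
    """Return the section numbers whose text contains every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


keyword_index = build_index(SECTIONS)
```

The matching here is exact, so "survey" does not match "surveys"; a real front end would probably need stemming or a controlled keyword list.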
7.3 Unifying Statistical Systems and Repositories
The Census Bureau is an organization containing many different programs that respond to many different
customers. If data is processed differently for each program, it follows that the metadata for these programs are not
coordinated. Creating a single metadata repository for the Census Bureau requires an effort to somehow coordinate
the metadata for public and private access. The Table of Contents can be used to help coordinate these programs.
The statistical systems that collect, process, analyze, and disseminate statistical information, such as DADS,
FERRET, IPS, and StEPS, will be defined as Statistical Information Systems (SIS) (Gillman, Appel, LaPlant, 1996).
These SIS allow users to access the data and metadata for data dissemination or automated survey processing. Our
goal is to unify these SIS so that they are transparent to users looking for metadata from the central metadata
repository. Unification of the systems will depend on the separate tables of contents of each of the other systems.
These individual tables of contents would be used for mapping from one SIS to another and for mapping to the
central repository (see Figure 1) (Gillman, Appel, LaPlant, 1996). These tables of contents would also provide an
information outline for users and analysts.
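The mapping between individual tables of contents can be sketched as a join on shared tags from the standard. The local labels below are invented for illustration and do not describe the actual FERRET or DADS outlines; the tag values are top-level chapter numbers of the standard (1 Content, 3 Design, 7 Data).

```python
# Hypothetical local tables of contents: local label -> SDSM chapter number.
FERRET_TOC = {"Variable definitions": "1", "Sample design": "3"}
DADS_TOC = {"Data dictionary": "1", "Data files": "7"}


def map_toc(source, target):
    """Map each source label to the target labels that share its SDSM tag."""
    by_tag = {}
    for label, tag in target.items():
        by_tag.setdefault(tag, []).append(label)
    return {label: sorted(by_tag.get(tag, [])) for label, tag in source.items()}


ferret_to_dads = map_toc(FERRET_TOC, DADS_TOC)
```

An empty list in the result marks a section of one system that has no counterpart in the other, which is where a request would fall back to the central repository.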
A combination of user interface tools provides the user with access to the data, metadata,
and documentation. For example, a user can access and update metadata and a documentation library using SQL,
SAS, word processing software, or various Internet tools.
8 Conclusion
Work on the logical data model for the metadata repository was begun in May, 1996. The Survey Design and Statistical
Methodology Metadata Standard is due for final review in June, 1996. We are currently exploring ways to physically link
items from the metadata repository to IPS, DADS, FERRET, and StEPS. We will then need to design tools to populate
and manage the repository.
The task still facing us is to transform the logical interpretation of the Table of Contents to the physical access of
the various SIS. Unifying the SIS through a logically central repository is expected to provide greater functionality
than the sum of separate systems.
As mentioned in Section 5.1, the items in the individual tables of contents are "tags" and are used to associate items
in other tables of contents. The tags are defined in the SDSM standard. Thus, requests for information can be easily
transferred across systems.
9 References
Appel, M. V., Gillman, D.W., LaPlant, W.P., Creecy, R.H. (1996), "Towards Unified Metadata Systems and
Practices at the Census Bureau", Integrated Statistical Information Systems (ISIS), May 1996, Bratislava, Slovakia.
Barker, R. (1990), "CASE*Method: Entity Relationship Modelling", Addison-Wesley Publishing Company, Wokingham,
England.
Census Bureau (1996), "Standard for Survey Design and Statistical Methodology Metadata, Draft Standard", Census
Bureau internal document, in progress.
Census Bureau (1996), "Table of Contents for Survey Design and Statistical Methodology Metadata, Draft
Standard", Census Bureau internal document, in progress.
FGDC (1994a), Federal Geographic Data Committee, "Content Standards for Digital Geospatial Metadata", June 8,
1994.
FGDC (1994b), Federal Geographic Data Committee - Subcommittee on Cultural and Demographic Data, "Cultural
and Demographic Data Metadata", DRAFT, September 15, 1994.
[FGDC/SCDD-95] Federal Geographic Data Committee, Subcommittee on Cultural and Demographic Data
(FGDC/SCDD), "Cultural and Demographic Data Metadata." Draft of May 1995.
[FIPS-123] National Institute of Standards and Technology, Federal Information Processing Standard Publication
123: Data Descriptive File for Information Interchange (DDF). U.S. Department of Commerce, 1992. Adopts,
with modifications, International Standard 8211-1985.
[FIPS-173] National Institute of Standards and Technology, Federal Information Processing Standard Publication
173: Spatial Data Transfer Standard (SDTS). U.S. Department of Commerce, 1992.
[FIPS-192] National Institute of Standards and Technology, Federal Information Processing Standard Publication
192: Application Profile for the Government Information Locator Service (GILS). U.S. Department of Commerce,
1994.
Gillman, D.W., Appel, M.V., LaPlant, W.P. (1996), "Design Principles for a Unified Statistical Data/Metadata
System", 8th Scientific and Statistical Database Management Conference, June 1996, Stockholm, Sweden.
LaPlant, W.P., Lestina, G.J., Gillman, D.W., Appel, M.V. (1996), "Proposal for A Statistical Metadata Standard",
1996 U.S. Bureau of the Census Annual Research Conference, March 1996, Arlington, Virginia.
Reinvention Lab of the Census Bureau (1994), "Integrated Processing System", Systems Planning Document,
December 15, 1994.
"StEPS and the Economic Directorate Vision for Current Survey Processing", December 4, 1995.