Appendix A2: Questionnaire Testing and Evaluation Methods for Censuses and Surveys

Pretesting is critical to the identification of problems for both respondents and interviewers with regard to question content, order/context effects, skip instructions, and formatting. Problems with question content, for example, include confusion over the meaning of the question as well as misinterpretation of individual terms or concepts. Problems with skip instructions may result in missing data and frustration by interviewers and/or respondents. Formatting concerns are relevant to self-administered questionnaires and may lead to respondent confusion and a loss of information.

“Pretesting” is a broad term that applies to many different methods or combinations of methods that can be used to test and evaluate questionnaires. These methods are valuable for identifying problems with draft questionnaires, but they have different strengths and weaknesses, and may be most useful at different stages of questionnaire/instrument development. Typically, using several pretesting methods is more effective in identifying problem questions and suggesting solutions than using just a single method. This appendix briefly describes the different types of pretesting methods, their strengths and weaknesses, and situations where they are most beneficial.

The enumeration and description of potential pretesting and evaluation methods in this appendix is meant to cover all the available techniques; however, some techniques do not satisfy the pretesting requirement of Statistical Quality Standard A2: Developing Data Collection Instruments and Supporting Materials. Other methods satisfy the requirement only under special circumstances. The pretesting requirement of Standard A2 identifies the methods that must be used to pretest census and survey questions.

Although the pretesting requirement of Standard A2 must be satisfied, the appropriateness of the methods and the resources available to implement them should be considered in determining which pretesting methods to use.

Pretesting and evaluation techniques fall into two major categories – pre-field and field techniques. Generally, pre-field techniques are used during the preliminary stages of questionnaire development. Pre-field techniques include:

Respondent focus groups. (This method does not satisfy the pretesting requirement, unless the focus group completes and discusses a self-administered questionnaire.)
Exploratory or feasibility visits to companies or establishment sites. (This method does not satisfy the pretesting requirement.)
Cognitive interviews. (This method satisfies the pretesting requirement.)
Usability techniques. (This method does not satisfy the pretesting requirement unless it is focused on respondent understanding of a self-administered or interviewer-administered questionnaire.)
Methodological expert reviews. (This method does not satisfy the pretesting requirement.)

Field techniques are used to evaluate questionnaires tested under field conditions, either in conjunction with a field test or during production data collection. Using field techniques during production data collection would be appropriate only for ongoing or recurring surveys. Field techniques include:

Behavior coding of interviewer-respondent interactions. (This method satisfies the pretesting requirement.)
Respondent debriefings. (This method satisfies the pretesting requirement.)
Interviewer debriefings. (This method does not satisfy the pretesting requirement.)
Analysts’ feedback. (This method does not satisfy the pretesting requirement.)
Split panel tests. (This method satisfies the pretesting requirement.)
Analysis of item nonresponse rates, imputation rates, edit failures, or response distributions. (This method does not satisfy the pretesting requirement.)

Pre-field Techniques

Respondent Focus Groups are used early in the questionnaire development cycle and can be used in a variety of ways to assess the question-answering process. Generally, the focus group technique does not satisfy the pretesting requirement, because it does not expose respondents to a questionnaire.

The only use of focus groups that satisfies the pretesting requirement is to have the group complete a self-administered questionnaire, followed by a discussion of the experience. This provides information about the appearance and formatting of the questionnaire and reveals possible content problems.

Focus groups can be used before questionnaire construction begins to gather information about a topic, such as:

How potential respondents structure their thoughts about a topic.
How respondents understand general concepts or specific terminology.
Respondents’ opinions about the sensitivity or difficulty of the questions.
How much burden is associated with gathering the information necessary to answer a question.

Focus groups can also be used to identify variations in language, terminology, or the interpretation of questions and response options. Used in this way, they may provide quicker access to a larger number of people than is possible with cognitive interviews. One of the main advantages of focus groups is the opportunity to observe an increased amount of interaction on a topic in a short time. The group interaction is of central importance – it can result in information and insights that may be less accessible in other settings. However, precisely because of this group interaction, the focus group does not permit a good test of an individual’s response process when alone. Moreover, in focus groups the researcher does not have as much control over the process as with cognitive interviews or interviewer-administered questionnaires. One or two people in the group may dominate the discussion and restrict the input from other group members.

Exploratory or Feasibility Studies are another common method for evaluating survey content relative to concepts. Economic survey practitioners typically call these studies company or site visits because they carry out the studies at the site of the business or institution. Because these visits are conducted before the questionnaire has been developed, they do not satisfy the pretesting requirement.

Because economic surveys rely heavily on business or institutional records, the primary goal of these site visits is to determine the availability of the desired data in records, their periodicity, and the definition of the concept as used in company records. Other goals include assessment of response burden and quality and the identification of the appropriate respondent.

The design of these company or site visits tends to vary a great deal. Because they are exploratory in nature, the activity may continue, resources permitting, until the economic survey or program staff sufficiently understands the respondents’ views of the concepts. Purposive or convenience samples are selected that target key data providers. Sample sizes are small, perhaps as few as five and rarely more than thirty. Typically, several members of the survey or program staff, who may or may not include questionnaire design experts, conduct meetings with multiple company employees involved in government reporting. Information gained during these visits helps determine whether the survey concepts are measurable, what the specific questions should be, how to organize or structure the questions related to the concept of interest, and to whom the form should be sent.

Exploratory or feasibility studies may be multi-purpose. In addition to exploring data availability for the concept of interest, survey or program staff may also set up reporting arrangements and review operating units to ensure correct coverage. A common by-product of these visits is to solidify relationships between the companies and the survey or program staff.

Cognitive Interviews are used in the later part of the questionnaire development cycle, after a questionnaire has been constructed based on information from focus groups, site visits, or other sources. They consist of one-on-one interviews using a draft questionnaire in which respondents describe their thoughts while answering the survey questions. Cognitive interviews provide an important means of learning about respondents’ problems with the questionnaire directly from them. Because this technique tests the questionnaire with potential respondents, it satisfies the pretesting requirement.

In addition, small numbers of interviews (as few as fifteen) can yield information about major problems if respondents repeatedly identify the same questions and concepts as sources of confusion. Because sample sizes are small, iterative pretesting of an instrument is often possible. After one round of interviews is complete, researchers can diagnose problems, revise question wording to solve the problems, and conduct additional interviews to see if the new questions are successful.

Cognitive interviews may or may not be conducted in a laboratory setting. The advantage of the laboratory is that it offers a controlled environment for conducting the interview, and provides the opportunity for video as well as audio recording. However, laboratory interviews may be impractical or unsuitable. For example, economic surveys rarely conduct cognitive interviews in a laboratory setting. Rather, cognitive testing of economic surveys is usually conducted on-site at the offices or location of the business or institutional respondent. One reason for this approach is to enable business or institutional respondents to have access to records. Another is business respondents’ reluctance to meet outside their workplaces for these interviews. In many economic surveys, which tend to be relatively lengthy and require labor-intensive data retrieval from records, testing may be limited to a subset of questions or sections rather than the entire questionnaire. Thus, researchers must be careful to set the proper context for the target questions.

“Think aloud” interviews, as cognitive interviews have come to be called, can be conducted either concurrently or retrospectively – that is, the respondents’ verbalizations of their thought processes can occur either during or after the completion of the questionnaire. As the Census Bureau conducts them, cognitive interviews incorporate follow-up questions by the researcher in addition to the respondent’s statement of his or her thoughts.

Probing questions are used when the researcher wants to have the respondent focus on particular aspects of the question-response task. For example, the interviewer may ask how respondents chose among response choices, how they interpreted reference periods, or what a particular term meant. Paraphrasing (asking the respondents to repeat the question in their own words) permits the researcher to learn whether the respondent understands the question and interprets it in the manner intended, and it may reveal better wordings for questions.

In surveys of businesses or institutions, in which data retrieval often involves business records, probing and paraphrasing techniques are often augmented by questions asking respondents to describe those records and their contents or to show the records to the researcher. Since data retrieval tends to be a labor-intensive process for business respondents, frequently requiring the use of multiple sources or consultation with colleagues, it is often unrealistic for researchers to observe the process during a cognitive interview. Instead, hypothetical probes are often used to identify the sources of data, discover respondents’ knowledge of and access to records, recreate likely steps taken to retrieve data from records or to request information from colleagues, and suggest possible estimation strategies.

Usability Techniques are used to aid development of automated questionnaires. Objectives are to discover and eliminate barriers that keep respondents from completing an automated questionnaire accurately and efficiently with minimal burden. Usability tests that are focused on respondent understanding of the questionnaire satisfy the pretesting requirement. Usability tests that are focused on the interviewers’ ability to administer the instrument do not satisfy the pretesting requirement; however, they are recommended for interviewer-administered electronic questionnaires.

Aspects that deserve attention during usability testing include the language, fonts, icons, layout, organization, and interaction features, such as data entry, error recovery, and navigation. Typically, the focus is on instrument performance in addition to how respondents interpret survey questions. Problems identified during testing can then be eliminated before the instrument is finalized.

As with paper questionnaires, different usability techniques are available depending upon the stage of development. One common technique is called the usability test. These tests are similar to cognitive interviews – that is, one-on-one interviews that elicit information about the respondent’s thought process. Respondents are given a task, such as “Complete the questionnaire,” or smaller subtasks, such as “Send your data to the Census Bureau.” The think aloud, probing, and paraphrasing techniques are all used as respondents complete their assigned tasks. Early in the design phase, usability testing with respondents can be done using low fidelity questionnaire prototypes (i.e., mocked-up paper screens). As the design progresses, versions of the automated questionnaire can be tested to choose or evaluate basic navigation features, error correction strategies, etc.

Disability accommodation testing is a form of usability testing which evaluates the ability of a disabled user to access the questionnaire through different assistive technologies, such as a screen reader. Expert reviews (see below) are also part of the repertoire of usability techniques.

Research has shown that as few as three participants can uncover half of the major usability problems; four to five participants can uncover 80 percent of the problems; and ten participants can uncover 90 percent of the problems (Dumas and Redish, 1999).

Finally, in a heuristic review, an expert compares the electronic survey instrument with usability principles that should be followed by all user interfaces (Nielsen, 1993).

Methodological Expert Reviews, conducted by survey methodologists or questionnaire-design experts, evaluate any difficulties potential interviewers and respondents may have with the questionnaire. Seasoned survey researchers who have extensive exposure to either the theoretical or practical aspects of questionnaire design use their expertise to achieve this goal. Because respondents do not provide direct input in these reviews, in general they do not satisfy the pretesting requirement. Usually these reviews are conducted early in the questionnaire development process and in concert with other pretest methods.

Expert reviews may be used instead of respondent-based pretesting only as a last resort, when extreme time constraints prevent the use of other pretesting methods. In such instances, survey methodology experts must conduct the reviews and document the results in a written report. The decision to use expert reviews rather than respondent-based pretesting must be made by subject-matter areas in consultation with the methodological research areas in the Center for Statistical Research and Methodology and on the Response Improvement Research Staff.

The cognitive appraisal coding system (Forsyth and Lessler, 1991) is a tool providing a systematic approach to the methodological expert review process. Like methodological expert reviews, results are used to identify questions that have potential for reporting errors. This tool is particularly effective when used by questionnaire design experts who understand the link between the cognitive response process and measurement results. However, novice staff or subject-area staff also can use this tool as a guide in their reviews of questionnaires.

Methodological expert reviews also can be conducted as part of a usability evaluation. Typically, this review is performed with an automated version of the questionnaire, although it need not be fully functional. Experts evaluate the questionnaire for consistency and application of user-centered principles of user-control, error prevention and recovery, and ease of navigation, training, and recall.

Field Techniques

Field techniques may be used with pretests or pilot tests of questionnaires or instruments and survey processes. They may also be employed in ongoing periodic (or recurring) surveys. The value of testing draft questionnaires with potential survey respondents cannot be overstated, even if it simply involves observation and evaluation by questionnaire developers. However, the following pretesting methods can be used to maximize the benefits of field testing.

Behavior Coding of Respondent/Interviewer Interactions involves systematic coding of the interaction between interviewers and respondents from live or taped field or telephone interviews to collect quantitative information. Using this pretesting method satisfies the pretesting requirement.

The focus here is on specific aspects of how the interviewer asks the question and how the respondent reacts. When used for questionnaire assessment, the coding concentrates on behaviors that indicate problems with the question, the response categories, or the respondent’s ability to form an adequate response. For example, if a respondent asks for clarification after hearing the question, it is likely that some aspect of the question caused confusion. Likewise, if a respondent interrupts the question before the interviewer finishes reading it, then the respondent misses information that might be important to giving a correct answer. For interviewer-administered economic surveys, the coding scheme may need to be modified from traditional household applications, because interviewers for establishment surveys tend to be allowed greater flexibility.

In contrast to the pre-field techniques described earlier, the use of behavior coding requires a sample size sufficient to address analytic requirements. For example, if the questionnaire contains many skip patterns, it is necessary to select a large enough sample to permit observation of various paths through the questionnaire. In addition, the determination of sample sizes for behavior coding should take into account the relevant population groups for which separate analysis is desired.

Because behavior coding evaluates all questions on the questionnaire, it promotes systematic detection of questions that elicit large numbers of behaviors that reflect problems. However, it is not usually designed to identify the source of the problems. It also may not be able to distinguish which of several similar versions of a question is better.
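As a rough illustration of the tallying step, the following Python sketch computes, for each question, the share of administrations in which at least one problem-indicating behavior was coded, and lists the questions above a cutoff. The code labels, the sample records, and the 15 percent cutoff are all assumptions made for illustration; this appendix does not prescribe a coding scheme or a threshold.

```python
from collections import defaultdict

# Hypothetical behavior codes that signal a possible problem with a question.
PROBLEM_CODES = {
    "clarification_request",
    "interruption",
    "inadequate_answer",
    "major_wording_change",
}


def problem_rates(coded_interactions):
    """Return {question_id: share of administrations with a problem code}.

    coded_interactions: iterable of (question_id, codes) pairs, one pair per
    administration of a question to a respondent.
    """
    asked = defaultdict(int)
    flagged = defaultdict(int)
    for question_id, codes in coded_interactions:
        asked[question_id] += 1
        if PROBLEM_CODES & set(codes):
            flagged[question_id] += 1
    return {q: flagged[q] / asked[q] for q in asked}


def questions_to_review(rates, cutoff=0.15):
    """List questions whose problem rate meets or exceeds the assumed cutoff."""
    return sorted(q for q, rate in rates.items() if rate >= cutoff)


# Illustrative data only: four administrations each of Q1 and Q2.
sample = [
    ("Q1", {"exact_reading", "adequate_answer"}),
    ("Q1", {"exact_reading", "clarification_request"}),
    ("Q1", {"exact_reading", "interruption"}),
    ("Q1", {"exact_reading", "adequate_answer"}),
    ("Q2", {"exact_reading", "adequate_answer"}),
    ("Q2", {"exact_reading", "adequate_answer"}),
    ("Q2", {"exact_reading", "adequate_answer"}),
    ("Q2", {"exact_reading", "adequate_answer"}),
]
rates = problem_rates(sample)
```

A high rate in such a tally shows only that a question drew problem behaviors; it does not say why, which is the limitation discussed here.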

Finally, behavior coding does not always provide an accurate diagnosis of problems. It can only detect problems that are manifest in interviewer or respondent behavior. Some important problems, such as respondent misinterpretations, may remain hidden because both respondents and interviewers tend to be unaware of them. Behavior coding is not well-suited for identifying such problems.

Respondent Debriefing uses a structured questionnaire after data are collected to elicit information about respondents’ interpretations of survey questions. Use of this method satisfies the pretesting requirement.

The debriefing may be conducted by incorporating structured follow-up questions at the end of a field test interview or by re-contacting respondents after they return a completed self-administered questionnaire. In economic surveys, respondent debriefings sometimes are called “response analysis surveys” (“RAS”) or “content evaluations.” Respondent debriefings usually are interviewer-administered, but may be self-administered. Some Census Bureau economic surveys have conducted respondent debriefings by formulating them as self-administered questionnaires and enclosing them with survey forms during pilot tests or production data collections.

Sample sizes and designs for respondent debriefings vary. Sample sizes may be as small as 20 or as large as several hundred. Designs may be either random or purposive, such as conducting debriefings with respondents who exhibited higher error rates or errors on critical items. Since the debriefing instrument is structured, empirical summaries of results may be generated.

When used for testing purposes, the primary objective of respondent debriefing is to determine whether the respondents understand the concepts and questions in the same way that the survey designers intend. Sufficient information is obtained to evaluate the extent to which reported data are consistent with survey definitions. For instance, respondents may be asked whether they included or excluded particular items in their answers, per definitions. In economic surveys, the debriefings may ask about the use of records or estimation strategies. In addition, respondent debriefings can be useful in determining the reason for respondent misunderstandings. Sometimes results of respondent debriefings show that a question is superfluous and can be eliminated from the final questionnaire. Conversely, it may be discovered that additional questions need to be included in the final questionnaire to better operationalize the concept of interest. Finally, the data may show that the intended meaning of certain concepts or questions is unclear or cannot be understood.

A critical requirement to obtain a successful respondent debriefing is that question designers and researchers have a clear idea of potential problems so that good debriefing questions can be developed. Ideas about potential problems can come from pre-field techniques (e.g., cognitive interviews conducted prior to the field test), from analysis of data from a previous survey, from careful review of questionnaires, or from observation of earlier interviews.

Respondent debriefings may be able to supplement the information obtained from behavior coding. As noted above, behavior coding demonstrates the existence of problems but does not always identify the source of the problem. When designed properly, the results of respondent debriefings can provide information about the sources of problems. Respondent debriefings also may reveal problems not evident from the response behavior.

Interviewer Debriefing has traditionally been the primary method used to evaluate field or pilot tests of interviewer-administered surveys. It also may be used following production data collection prior to redesigning an ongoing periodic or recurring survey. Interviewer debriefing consists of holding group discussions or administering structured questionnaires with the interviewers to obtain their views of questionnaire problems. The objective is to use the interviewers’ direct contact with respondents to enrich the questionnaire designer’s understanding of questionnaire problems. Although it is a useful evaluation component, it is not sufficient as an evaluation method and does not satisfy the pretesting requirement.

Interviewers may not always be accurate reporters of certain types of questionnaire problems for several reasons. When interviewers report a problem, it is not always clear if the issue caused trouble for one respondent or for many. Interviewers’ reports of problem questions may reflect their own preference regarding a question, rather than respondent confusion. Finally, experienced interviewers sometimes change the wording of problem questions as a matter of course to make them work, and may not even realize they have done so.

Interviewer debriefings can be conducted in several different ways: in a group setting, through rating forms, or through standardized questionnaires. Group setting debriefings are the most common method. They essentially involve conducting a focus group with the field test interviewers to learn about their experiences in administering the questionnaire. Rating forms obtain more quantitative information by asking interviewers to rate each question in the pretest questionnaire on selected characteristics of interest to the researchers (e.g., whether the interviewer had trouble reading the question as written, whether the respondent understood the words or ideas in the question). Standardized interviewer debriefing questionnaires collect information about the interviewers’ perceptions of a problem, the prevalence of a problem, the reasons for a problem, and proposed solutions to a problem. Interviewer debriefings also can ask about the magnitude of specific kinds of problems, to test the interviewers’ knowledge of subject-matter concepts.

Analysts’ Feedback is a method of learning about problems with a questionnaire specific to the economic area. At the Census Bureau, most economic surveys are self-administered; so survey or program staff analysts in the individual subject areas, rather than interviewers, often have contact with respondents. While collecting feedback from analysts is a useful evaluation component, it does not satisfy the pretesting requirement.

Feedback from analysts about their interactions with respondents may serve as an informal evaluation of the questionnaire and the data collected. These interactions include “Help Desk” phone inquiries from respondents and follow-up phone calls to respondents by analysts investigating suspicious data flagged by edit failures. Analyst feedback is more useful when analysts systematically record comments from respondents in a log. The log enables qualitative evaluation of the relative severity of questionnaire problems, because strictly anecdotal feedback sometimes may be overstated.

Another way to obtain analyst feedback is for questionnaire design experts to conduct focus groups with the analysts who review data and resolve edit failures. These focus groups can identify questions that may need to be redesigned or evaluated by other methods. Regardless of how respondent feedback is captured, analysts should provide feedback early in the questionnaire development cycle of recurring surveys to identify problematic questions.

Split Panel Tests are controlled experimental tests of questionnaire variants or data collection modes to determine which one is “better” or to measure differences between them. Split panel testing satisfies the pretesting requirement.

Split panel experiments may be conducted within a field or pilot test or embedded within production data collection for an ongoing periodic or recurring survey. For pretesting draft versions of a questionnaire, the search for the “better” questionnaire requires that an a priori standard be determined by which the different versions can be judged. Split panel tests can incorporate a single question, a set of questions, or an entire questionnaire.

It is important to select adequate sample sizes when designing a split panel test so that differences of substantive interest can be measured. In addition, these tests must use randomized assignment within replicate sample designs so that differences can be attributed to the question or questionnaire and not to the effects of incomparable samples.
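The randomization step can be sketched in Python as follows. This is a minimal illustration, not a prescribed procedure: the replicate structure, unit identifiers, version labels, and seed are hypothetical, and a production design would also need to reflect the survey's actual sample design.

```python
import random


def assign_panels(replicates, versions=("A", "B"), seed=2021):
    """Randomly assign sample units to questionnaire versions within replicates.

    replicates: {replicate_id: [unit_id, ...]}
    Returns {unit_id: version}. Within each replicate, units are shuffled and
    then dealt to versions in rotation, so panel sizes differ by at most one.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    assignment = {}
    for replicate_id in sorted(replicates):
        units = list(replicates[replicate_id])
        rng.shuffle(units)
        for position, unit in enumerate(units):
            assignment[unit] = versions[position % len(versions)]
    return assignment


# Hypothetical sample: two replicates of six units each.
sample = {
    "rep1": [f"r1-{i}" for i in range(6)],
    "rep2": [f"r2-{i}" for i in range(6)],
}
panels = assign_panels(sample)
```

Randomizing separately within each replicate keeps the panels balanced across replicates, so a difference between versions is less likely to reflect incomparable samples.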

Another use of the split panel test is to calibrate the effect of changing questions. Although split panel tests are expensive, they are extremely valuable in the redesign and testing of surveys for which the comparability of the data collected over time is an issue. They provide an important measure of the extent to which different results following a major survey redesign are due to methodological changes, such as the survey instrument or interview mode, rather than changes over time in the subject-matter of interest. Split panel testing is recommended for data with important policy implications.

Comparing response distributions in split panel tests produces measures of differences but does not necessarily reveal whether one version of a question produces a better understanding of what is being asked than another. Other question evaluation methods, such as respondent debriefings, interviewer debriefings, and behavior coding, are useful to evaluate and interpret the differences observed in split panel tests.

Analysis of Item Nonresponse Rates, Imputation Rates, Edit Failures, or Response Distributions from the collected data can provide useful information about how well the questionnaire works. Use of this method in combination with a field test does not satisfy the pretesting requirement.

In household surveys, examination of item nonresponse rates can be informative in two ways. First, “don’t know” rates can determine the extent to which a task is too difficult for respondents. Second, refusal rates can determine the extent to which respondents find certain questions or versions of a question to be more sensitive than others.
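A minimal Python sketch of this calculation follows. The answer codes "DK" and "REF", the item names, and the sample records are hypothetical; real data files will use their own missing-value conventions.

```python
from collections import Counter

# Hypothetical codes for the two kinds of item nonresponse discussed above.
DONT_KNOW = "DK"
REFUSED = "REF"


def item_nonresponse_rates(responses):
    """Return {item: {"dont_know": rate, "refused": rate}}.

    responses: list of dicts mapping item name to recorded answer, one dict
    per respondent. Items a respondent was never asked (for example, because
    of a skip instruction) are absent and do not count in the denominator.
    """
    asked = Counter()
    dont_know = Counter()
    refused = Counter()
    for record in responses:
        for item, answer in record.items():
            asked[item] += 1
            if answer == DONT_KNOW:
                dont_know[item] += 1
            elif answer == REFUSED:
                refused[item] += 1
    return {
        item: {
            "dont_know": dont_know[item] / asked[item],
            "refused": refused[item] / asked[item],
        }
        for item in asked
    }


# Illustrative records only.
records = [
    {"income": "REF", "hours_worked": 40},
    {"income": 52000, "hours_worked": "DK"},
    {"income": "REF", "hours_worked": 35},
    {"income": 61000, "hours_worked": "DK"},
]
rates = item_nonresponse_rates(records)
```

Keeping the two rates separate matters because they point to different problems: a high “don’t know” rate suggests task difficulty, while a high refusal rate suggests sensitivity.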

In economic surveys, item nonresponse may be interpreted to have various meanings, depending on the context of the survey. In some institutional surveys (e.g., hospitals, prisons, schools) where data are abstracted from individual person-level records, high item nonresponse is considered to indicate data not routinely available in those records. Item nonresponse may be more difficult to detect in other economic surveys where questions may be left blank because they are not applicable to the responding business or the response value may be zero. In these cases, the data may not be considered missing at all.

Response distributions are the frequencies with which respondents provided answers during data collection. Evaluation of the response distributions for survey items can determine whether variation exists among the responses given by respondents or if different question wordings or question sequencings produce different response patterns. This type of analysis is most useful when pretesting either more than one version of a questionnaire or a single questionnaire for which some known distribution of characteristics exists for comparative purposes.
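One conventional way to compare the response distributions from two questionnaire versions is a Pearson chi-square test of homogeneity. The Python sketch below computes only the test statistic and its degrees of freedom. It is an illustration under simplifying assumptions: the appendix does not prescribe this test, the category labels and counts are invented, and the calculation ignores weights and complex sample design.

```python
def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic comparing two response distributions.

    counts_a, counts_b: {response_category: frequency} for two questionnaire
    versions. Returns (statistic, degrees_of_freedom).
    """
    # Keep only categories that were chosen at least once in either version.
    categories = [
        c for c in sorted(set(counts_a) | set(counts_b))
        if counts_a.get(c, 0) + counts_b.get(c, 0) > 0
    ]
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for category in categories:
        observed_a = counts_a.get(category, 0)
        observed_b = counts_b.get(category, 0)
        column_total = observed_a + observed_b
        # Expected counts if both versions shared one underlying distribution.
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (observed_a - expected_a) ** 2 / expected_a
        statistic += (observed_b - expected_b) ** 2 / expected_b
    return statistic, len(categories) - 1


# Illustrative counts only: 100 respondents per version.
version_a = {"yes": 60, "no": 40}
version_b = {"yes": 40, "no": 60}
statistic, dof = chi_square_homogeneity(version_a, version_b)
```

A large statistic relative to its degrees of freedom indicates that the two versions produced different response patterns; it does not indicate which version is better understood.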

The quality of collected data also may be evaluated by comparing, reconciling, or benchmarking to data from other sources. This is especially true for economic data, but benchmarking data are also available for some household surveys.

Conclusion

At least one of the following techniques must be used to satisfy the pretesting requirement:

Cognitive interviews.
Usability techniques focused on the respondent’s understanding of the questionnaire.
Focus groups involving the administration of questionnaires.
Behavior coding of respondent/interviewer interactions.
Respondent debriefings in conjunction with a field test or actual data collection.
Split panel tests.

However, pretesting typically is more effective when multiple methods are used. Additional pretesting techniques should be carefully considered to provide a thorough evaluation and documentation of questionnaire problems and solutions. The relative effectiveness of the various techniques for evaluating survey questions depends on the pretest objectives, sample size, questionnaire design, and mode of data collection. The Census Bureau advocates that both pre-field and field techniques be undertaken, as time and funds permit.

For continuing surveys that have a pre-existing questionnaire, cognitive interviews should be used to provide detailed insights into problems with the questionnaire whenever time permits or when a redesign is undertaken. Cognitive interviews may be more useful than focus groups with a pre-existing questionnaire because they mimic the question-response process. For one-time or new surveys, focus groups are useful tools for learning what respondents think about the concepts, terminology, and sequence of topics prior to drafting the questionnaire. In economic surveys, exploratory/feasibility studies, conducted as company or site visits, also provide information about structuring and wording the questionnaire relative to data available in business/institutional records. Usability techniques are increasingly important as surveys move to automated data collection.

Pre-field methods alone may not be sufficient to test a questionnaire. Some type of testing in the field is encouraged, even if it is evaluated based only on observation by questionnaire developers. More helpful is small-to-medium-scale field or pilot testing with more systematic evaluation techniques. The various methods described in this appendix complement each other in identifying problems, the sources of problems, and potential solutions.

References

Dumas, J. and Redish, J., 1999. A Practical Guide to Usability Testing, Portland, OR: Intellect.

Forsyth, B. H., and Lessler, J. T., 1991. “Cognitive Laboratory Methods: A Taxonomy,” in Measurement Errors in Surveys, Biemer, P. P., Groves, R. M., Lyberg, L. E., Mathiowetz, N. A., and Sudman, S. (eds.), New York: John Wiley and Sons, Inc., pp. 393-418.

Nielsen, J., 1993. Usability Engineering, New York, NY: Morgan Kaufmann.

Page Last Revised - October 8, 2021
