Sampling Error

This section discusses methods for estimating the sampling error of estimates derived from the Survey of Income and Program Participation (SIPP) panels. The sample selected for each SIPP panel is a stratified multistage probability sample. This complex sample design needs to be taken into account when estimating the variances of SIPP estimates. The SIPP data files contain variables, related to the sample design, that are created for the purpose of variance estimation. Several software packages are now available for computing variance estimates for a wide range of statistics based on complex sample designs. Using the variables that specify the design, these programs can calculate appropriate variances of survey estimates. The Census Bureau also provides generalized variance functions (GVFs) that can be used to obtain approximate estimates of sampling variance for SIPP estimates.

A common mistake in the estimation of sampling error for survey estimates is to ignore the complex survey design and treat the sample as a simple random sample (SRS) of the population. That mistake occurs because most standard software packages for data analysis assume simple random sampling for variance estimation. When applied to SIPP estimates, SRS formulas for variances typically underestimate the true variances. This section describes how appropriate variance estimates, which take into account the complex sample design, can be obtained for SIPP estimates.

Direct Variance Estimation

The primary sampling unit (PSU) plays a key role in variance estimation with a multistage sample design. SIPP PSUs are mostly counties, groups of counties, or independent cities (SIPP Quality Profile, 3rd Ed. [U.S. Census Bureau, 1998a, Chapter 3]), which are sampled with probability proportional to size within strata. The PSUs are sampled without replacement so that no PSU is selected more than once for the sample. Some PSUs are so large that they are included in the sample with certainty. Because no sampling is involved, those PSUs are, in fact, not PSUs but strata.

Although the SIPP PSUs are selected without replacement (as is the case with most multistage designs), for the purpose of variance estimation they are treated as if they were sampled with replacement. The with-replacement assumption greatly facilitates variance estimation since it means that variance estimates can be computed by taking into account only the PSUs and strata, without the need to consider the complexities of the subsequent stages of sample selection. This widely used simplifying assumption leads to an overestimation of variances, but the overestimation is not great.

Several software packages are available for computing variances of a wide range of survey estimates (e.g., means and proportions for the total sample and for subclasses, for differences in means and proportions between subclasses, and for regression and logistic regression coefficients) from complex sample designs. Many of these packages are listed on the Web: http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html. Lepkowski and Bowles (1996) examined eight of the packages.

These packages use a variety of methods for variance estimation. Some use an approach based on a Taylor-series approximation, or linearization, method. Others use a replication method, such as jackknife repeated replications or balanced repeated replications. Although some methods have advantages in some situations, there is generally little to recommend one method over another. The variance estimates they produce are not identical, but the differences are usually small. See Wolter (1985) and Rust (1985) for discussions of these methods.

Variance Units and Variance Strata, 1990-2008 Panels

For the 1990–2008 SIPP Panels, the sample member record contains information concerning the PSU and stratum within which the member was sampled. This information is needed as input for all of the specialized software packages. The original PSU and strata codes are not included in the SIPP public use data files, however, to avoid potential identification of small geographic areas and sampled individuals. Instead, sets of PSUs are combined across strata to produce variance units and variance strata, with two variance units in each variance stratum. Variance units and variance strata may be treated as PSUs and strata for variance estimation purposes. Their use does not give rise to any bias in the variance estimates. The variance estimates are somewhat less precise, however, than those obtained from the use of the PSUs and strata that have not been combined.
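As a rough illustration of how variance units and variance strata can stand in for PSUs and strata, the sketch below applies the standard with-replacement estimator for a weighted total when each stratum contains exactly two units. It is a minimal sketch, not a Census Bureau routine: the record layout `(variance_stratum, variance_unit, weight, value)` and the function name are hypothetical, and the actual SIPP variable names are not assumed here.

```python
from collections import defaultdict


def total_variance_paired(records):
    """Variance of a weighted total under the with-replacement assumption.

    records: iterable of (variance_stratum, variance_unit, weight, value).
    Assumes exactly two variance units in every variance stratum.
    """
    # Weighted total contributed by each (stratum, unit) pair.
    unit_totals = defaultdict(float)
    for stratum, unit, weight, value in records:
        unit_totals[(stratum, unit)] += weight * value

    # Group the unit totals by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _unit), total in unit_totals.items():
        by_stratum[stratum].append(total)

    variance = 0.0
    for stratum, totals in by_stratum.items():
        if len(totals) != 2:
            raise ValueError(f"stratum {stratum!r} does not have exactly two units")
        # With two units per stratum, n/(n - 1) times the sum of squared
        # deviations from the stratum mean reduces to the squared difference.
        variance += (totals[0] - totals[1]) ** 2
    return variance
```

Only the unit-level totals within each stratum enter the calculation, which is why the later stages of sample selection do not need to be modeled.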

Under the complex sample design, the number of degrees of freedom for variance estimation depends on the number of variance strata. The 1984 SIPP Panel consists of 142 variance units in 71 variance strata; the panels between 1985 and 1991 have 144 variance units and 72 variance strata; the 1992–1993 Panels have 198 variance units and 99 variance strata; the 1996–2001 Panels have 210 variance units and 105 variance strata; and the 2004–2008 Panels have 228 variance units and 114 variance strata. As a rough approximation, the number of degrees of freedom for a variance estimate is the number of variance strata. Thus, for national estimates, the variance estimates have about 71 degrees of freedom for the 1984 Panel, 72 degrees of freedom for the 1985–1991 Panels, 99 degrees of freedom for the 1992–1993 Panels, 105 degrees of freedom for the 1996–2001 Panels, and 114 degrees of freedom for the 2004–2008 Panels. Regional estimates will have fewer degrees of freedom because such estimates include only some of the variance strata.
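The counts above can be kept in a small lookup table so that an analysis script picks the matching approximate degrees of freedom for national estimates. This is only a convenience sketch: the names `VARIANCE_DESIGN` and `approx_degrees_of_freedom` are hypothetical, and the year ranges simply mirror the panel groupings in the text without checking that a panel actually began in a given year.

```python
# (first_year, last_year) -> (variance units, variance strata), as listed above.
VARIANCE_DESIGN = {
    (1984, 1984): (142, 71),
    (1985, 1991): (144, 72),
    (1992, 1993): (198, 99),
    (1996, 2001): (210, 105),
    (2004, 2008): (228, 114),
}


def approx_degrees_of_freedom(panel_year):
    """Approximate degrees of freedom for national estimates.

    Uses the rough rule that the degrees of freedom equal the number of
    variance strata; regional estimates would have fewer.
    """
    for (first, last), (_units, strata) in VARIANCE_DESIGN.items():
        if first <= panel_year <= last:
            return strata
    raise KeyError(f"no variance design listed for panel year {panel_year}")
```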

Replication Weights for the 1996 Panel

Analysts should use Fay’s method for estimating variances for the SIPP Panels. Fay’s method is a modified balanced repeated replication (BRR) method of variance estimation. The difference between the basic BRR method and Fay’s method is that the BRR method uses replicate factors of 0 and 2, whereas Fay’s method uses one factor, k, which is in the range (0, 1), with the other factor equal to 2 – k. In Fay’s method, the introduction of the perturbation factor (1 – k) allows the use of both halves of the sample. Thus, Fay’s method has the advantage that no subset of the sample units in a particular classification will be totally excluded. The variance formula for Fay’s method is

$$\mathrm{Var}(\theta_0) = \frac{1}{G(1-k)^2} \sum_{i=1}^{G} (\theta_i - \theta_0)^2 \qquad (7\text{-}1)$$

where

$G$ = number of replicates;
$1 - k$ = perturbation factor;
$i$ = replicate $i$, $i = 1$ to $G$;
$\theta_i$ = $i$th estimate of the parameter $\theta$ based on the observations included in the $i$th replicate; and
$\theta_0$ = survey estimate of the parameter $\theta$ based on the full sample.

The 1996 SIPP Panel uses 108 replicate weights, which are calculated on the basis of a perturbation factor of 0.5 (k = 0.5). Inserting those values into Equation (7-1) results in the 1996 SIPP Panel variance formula of

$$\mathrm{Var}(\theta_0) = \frac{1}{108\,(0.5)^2} \sum_{i=1}^{108} (\theta_i - \theta_0)^2$$

The 2004 and 2008 SIPP Panels use 120 replicate weights, which are calculated on the basis of a perturbation factor of 0.5 (k=0.5).
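The variance formula for Fay's method translates directly into a few lines of code. The sketch below assumes the analyst has already computed the full-sample estimate and one estimate per replicate weight; the function and argument names are hypothetical, and the default k = 0.5 reflects the perturbation factor stated for the 1996, 2004, and 2008 Panels.

```python
def fay_variance(theta_full, theta_replicates, k=0.5):
    """Fay's modified BRR variance of a full-sample estimate.

    theta_full: estimate computed with the full-sample weight.
    theta_replicates: estimates computed with each replicate weight.
    k: Fay factor; k = 0 corresponds to basic BRR (factors 0 and 2).
    """
    if not 0 <= k < 1:
        raise ValueError("k must satisfy 0 <= k < 1")
    g = len(theta_replicates)
    if g == 0:
        raise ValueError("at least one replicate estimate is required")
    # Sum of squared deviations of replicate estimates from the full-sample estimate.
    sum_sq = sum((theta_i - theta_full) ** 2 for theta_i in theta_replicates)
    return sum_sq / (g * (1 - k) ** 2)
```

For a SIPP panel with 108 replicate weights, `theta_replicates` would hold 108 values and the divisor becomes 108 × 0.25 = 27.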

The Census Bureau used VPLX and SAS software to compute the replicate weights that are available through DataFerrett and the SIPP FTP Site.

Approximate Variance Estimates

The Census Bureau provides two forms for approximate variance estimation: GVFs and tables of standard errors (the square root of the variance) for different estimated numbers and percentages. The generalized estimates provide indications of the magnitude of the sampling error in the survey estimates. They serve as convenient ways to summarize the sampling errors for a broad variety of estimates.

The GVFs for SIPP were derived by modeling the standard error behavior of groups of estimates with similar standard errors. The mathematical form of the function adopted is

$$s = \sqrt{a x^2 + b x}$$

where s represents the standard error and x the value of an estimate. The parameters a and b are derived on the basis of a selected group of estimates. They are updated annually and are included in the source and accuracy statement that accompanies each SIPP data file for a panel. It is essential to use the parameter estimates for a specific panel and to follow the instructions to apply necessary adjustments to obtain the correct estimates for subgroups. Besides GVFs, the Census Bureau provides summary tables of general standard errors. Those estimates are also available in the source and accuracy statements. The following examples show how to use GVFs to estimate the standard errors of estimated numbers and of sample means. The use of GVFs and tables of standard errors is described in the source and accuracy statements for each panel.

Before looking at the examples, the user should note that the generalized variance estimates for estimating the standard errors of other statistics may not be accurate for small subgroups. Using the 1984 SIPP Panel, Bye and Gallicchio (1989) developed variance functions for participants of Old-Age, Survivors, and Disability Insurance (OASDI) and Supplemental Security Income (SSI) programs. They found that for estimates of less than 10 million, the generalized standard error estimates provided by the Census Bureau were 1.20 to 1.75 times larger than those obtained from the variance functions developed specifically for that subgroup.

Using GVFs for Standard Errors of Estimated Numbers

The approximate standard error, s, of an estimated number of persons (or households or families) can be obtained by the formula

$$s = \sqrt{a x^2 + b x}$$

where a and b are the parameters associated with the estimate for the particular reference period, and x is the weighted estimate. This equation is appropriate for the standard errors of estimated numbers and should not be applied to estimates of dollar values.

Suppose that the number of females aged 25 to 44 with a monthly income above $6,000 in September 2008, estimated from Wave 1 of the 2008 Panel, is 2,000,000. The approximate values of a and b from Table 4 of the source and accuracy statement of the 2008 Panel are a = –0.00002917 and b = 3,584. Then, the standard error, s, of this estimated number is given by

$$s = \sqrt{(-0.00002917 \times 2{,}000{,}000^2) + (3{,}584 \times 2{,}000{,}000)} = 83{,}972 \text{ females}$$

The approximate 90 percent confidence interval for the estimated number can be computed as x ± 1.645 s, which ranges from 1,861,866 to 2,138,134. Therefore, a conclusion that the average estimate derived from all possible samples lies within an interval computed in this way would be correct for roughly 90 percent of all samples.
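The worked example above can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the a and b values are the ones quoted in the example, and the default multiplier of 1.645 is the conventional two-sided 90 percent normal value, which reproduces the interval bounds quoted above.

```python
import math


def gvf_standard_error(x, a, b):
    """Approximate standard error of an estimated number: sqrt(a*x^2 + b*x)."""
    radicand = a * x ** 2 + b * x
    if radicand < 0:
        raise ValueError("a*x^2 + b*x is negative; check a, b, and x")
    return math.sqrt(radicand)


def confidence_interval(estimate, standard_error, z=1.645):
    """Symmetric interval estimate +/- z * standard_error."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# Values quoted in the example for Wave 1 of the 2008 Panel.
se = gvf_standard_error(2_000_000, a=-0.00002917, b=3_584)
low, high = confidence_interval(2_000_000, se)
print(round(se), round(low), round(high))
```

Parameters for a different panel or subgroup would need to come from the matching source and accuracy statement.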

Using GVFs for Standard Errors of a Mean

A mean is defined here to be the average quantity of some characteristic (other than the number of persons or households) per person or household. For example, a mean could be the average monthly household income of females 25 to 54 years of age. The formula used to estimate the standard error of a mean, x̄, is

$$s_{\bar{x}} = \sqrt{\frac{b}{y}\, s^2}$$

where y is the size on which the estimate is based, s² is the estimated population variance of the characteristic, and b is the parameter associated with the particular type of characteristic. Because of the approximations used in developing this formula, an estimate of the standard error of the mean obtained from this formula will generally underestimate the true standard error.

The estimated population mean is computed with the formula

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

and the estimated population variance can be computed as

$$s^2 = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i} \quad \text{or} \quad s^2 = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\left(\sum_{i=1}^{n} w_i\right) - 1}$$

with the use of standard software for weighted data. Here there are $n$ units of the item of interest, $x_i$ is the value of the item for the $i$th unit, $\bar{x}$ is the estimated population mean, and $w_i$ is the final weight for the $i$th unit.

Suppose that, based on Wave 1 data of the 2008 Panel, the mean monthly cash household income for females aged 25 to 54 is $2,530, the weighted number of females in this age range is y = 39,851,000, and the population variance is estimated to be s² = 3,159,887. When the appropriate b parameter of 3,584 from the source and accuracy statement for the 2008 Panel is used, the estimated standard error of this mean is

$$s_{\bar{x}} = \sqrt{\frac{3{,}584}{39{,}851{,}000} \times 3{,}159{,}887} = \$16.86$$

Thus, the 90 percent confidence interval, computed as

$$\bar{x} \pm 1.64\, s_{\bar{x}}$$

ranges from $2,502 to $2,558. Therefore, a conclusion that the average estimate derived from all possible samples lies within an interval computed in this way would be correct for roughly 90 percent of all samples.
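The calculation for a mean can be sketched in the same way. The helper names below are hypothetical; the weighted mean and variance follow the formulas given above (with the optional "minus 1" denominator), and the standard error uses the b, y, and s² values quoted in the example.

```python
import math


def weighted_mean(values, weights):
    """Estimated population mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights, subtract_one=False):
    """Estimated population variance around the weighted mean.

    subtract_one=True uses (sum of weights - 1) as the denominator.
    """
    mean = weighted_mean(values, weights)
    numerator = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    denominator = sum(weights) - (1 if subtract_one else 0)
    return numerator / denominator


def gvf_se_of_mean(b, y, s2):
    """Approximate standard error of a mean: sqrt((b / y) * s2)."""
    return math.sqrt(b / y * s2)


# Values quoted in the example for Wave 1 of the 2008 Panel.
se = gvf_se_of_mean(b=3_584, y=39_851_000, s2=3_159_887)
print(round(se, 2), round(2_530 - 1.64 * se), round(2_530 + 1.64 * se))
```

In practice `values` and `weights` would be the item values and final weights for the units in the subgroup of interest, and `y` would be the sum of those weights.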

Variance Estimation with Imputed Data

Imputation methods are used to fill in several types of missing data in SIPP. They are used to complete some item nonresponse, person-level nonresponse within households (Type Z nonresponse), and some wave nonresponse (intermittent responses bounded by two responding waves). Imputation fills in gaps in the data set and makes data analyses easier. It also allows more people to be retained as panel members for longitudinal analyses. The concern, however, is that imputation fabricates data to some degree. Treating the imputed values as actual values in estimating the variance of survey estimates leads to an overstatement of the precision of the estimates (Brick and Kalton, 1996). It is important to recognize this fact when sizable proportions of values are imputed.
