Census Summary Files: Technical Issues
When using Census summary file data, users should be aware of some technical issues that influence what numbers are reported. These issues include sampling, summary levels, suppression, and imputation.
How these issues are handled can vary from one census year to the next and among the products within each year (print, ASCII text files, profiles, American FactFinder, CD/DVDs). Users should consult the technical documentation if they need to clarify the impact on the particular data they are using.
Since 1940, the Census Bureau has used sampling for some questions. This means that all respondents answered one set of questions and a percentage of respondents answered additional questions. The questionnaire containing only the questions asked of everyone is referred to as the short form, while the questionnaire with additional questions is called the long form.
The totals reported in data products from the short form are exact counts (100% count data), while the totals in products from the long form are estimates (sample data). Users of summary file data should always be aware of whether they are using exact count data or estimated data.
- Exact count data is more accurate
If your analysis requires only data derived from the questions on the short form (race, Hispanic origin, sex, household type, owner/renter status, or, earlier than 2000, marital status), use data products based on the short form.
- Do not use both estimates and exact count data in the same analysis.
If any part of your analysis requires data asked only on the long form, use long form data for all variables.
- Discrepancies among counts can occur.
The same total can differ between two products when one is an exact count based on the short form and the other is an estimate based on the long form. Such discrepancies tend to be larger for smaller geographies.
Note: in 1970 there were two sample sizes used for the long form, so discrepancies can occur between totals derived from different questions on the long form if one total was based on a question asked of 20% of the population, while the other question was asked of 15% of the population.
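To illustrate why discrepancies matter more for small geographies, the following is a rough sketch rather than the Census Bureau's actual method. It assumes a simple random sample without replacement and a hypothetical sampling rate of one in six; the function name, the population sizes, and the 10% share are all invented for illustration. Real long form estimates are weighted, so the documentation for the specific file remains the authority on sampling error.

```python
import math


def relative_standard_error(population, share, sampling_rate):
    """Approximate relative standard error of an estimated count.

    Assumes a simple random sample without replacement. Real long form
    estimates use weighting, so treat the result as a rough guide only.
    """
    n = population * sampling_rate       # expected sample size
    estimate = population * share        # expected estimated count
    # Variance of an estimated total under simple random sampling,
    # including the finite population correction.
    fpc = (population - n) / (population - 1)
    variance = population ** 2 * share * (1 - share) / n * fpc
    return math.sqrt(variance) / estimate


# Hypothetical areas of increasing size.
for name, pop in [("block group", 1_000), ("tract", 4_000), ("county", 1_000_000)]:
    rse = relative_standard_error(pop, share=0.10, sampling_rate=1 / 6)
    print(f"{name:12s} population {pop:>9,d}  relative SE {rse:.1%}")
```

Under these assumptions the relative error shrinks as the population grows, which is consistent with estimates and exact counts agreeing closely for a county while differing noticeably for a block group.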
Each record on a summary file contains totals for a single geographic location, along with the geographic codes that describe the location. For example, in 2000 the record with totals for Bronx County will have codes that indicate the state (36-NY), county (005-Bronx), MSA/CMSA (5602 NY-NJ-CT-PA) and PMSA (5600 NY-NY). Each record also carries a code, called the "summary level," that indicates the geographic level the totals represent. On SF3 in 2000, the record that reports county totals for the Bronx would have the geographic codes listed above plus a summary level code of 050 to indicate that the totals are for a county. A record with totals for a tract in the Bronx would contain the same geographic codes for state, county, MSA, etc., plus a code for the tract number and a summary level code equal to 140.
With graphical interfaces to summary file data, the codes are usually handled behind the scenes and users do not need to work with them directly. With American FactFinder, the summary levels are presented as choices in a list of named levels of geography. This is also largely true of the interfaces on CD-ROM and DVD versions, which have menu-based access to the data. On these products, data from one summary file can be divided among a number of physical discs, and the basis for the division is summary level.
In the ASCII versions, summary levels are also the basis for dividing a single summary file into parts A, B, C, etc. Within any one part, a user selects records for a particular geographic level by testing the values in the summary level variable. Users must consult the technical documentation to see what values to test for.
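Selecting records by summary level can be sketched as follows. This is a minimal illustration that assumes the file has already been parsed into one dictionary per record; the field names (sumlev, name, total_pop), the area names, and the totals are hypothetical. Only the codes 050 (county) and 140 (tract) come from the 2000 SF3 example above, and the codes for any other file must be taken from its technical documentation.

```python
# Summary level codes from the 2000 SF3 example in the text.
SUMLEV_COUNTY = "050"
SUMLEV_TRACT = "140"


def select_by_summary_level(records, summary_level):
    """Return only the records whose summary level matches the given code."""
    return [rec for rec in records if rec["sumlev"] == summary_level]


# Hypothetical, already-parsed records; names and totals are invented.
records = [
    {"sumlev": "050", "name": "County A", "total_pop": 1200},
    {"sumlev": "140", "name": "Tract 1 in County A", "total_pop": 300},
    {"sumlev": "140", "name": "Tract 2 in County A", "total_pop": 900},
]

tracts = select_by_summary_level(records, SUMLEV_TRACT)
for rec in tracts:
    print(rec["name"], rec["total_pop"])
```

The codes are compared as strings rather than integers so that leading zeros, as in 050, are preserved.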
The Census Bureau must ensure that the identity of individuals cannot be ascertained based on the socio-demographic profiles reported in the summary files. Suppression is the non-reporting of data when the total count for the selected population within a geographic area is so small that the members of the group could possibly be identified. Suppression is an issue when working with small levels of geography and/or small population subgroups, that is, populations selected on the basis of characteristics such as race, age, or Hispanic origin. Suppression affects the files in the following ways.
- Suppression of All Tables for Selected Geographic Areas
In 2000, most tables on Summary File 3 (all P and H tables) report data down to the block group level. In addition, a number of tables (the PT and HT tables) report only down to the tract level. Data in PT and HT tables are not published for any block group, regardless of its size.
In the 1970 ASCII files, rather than suppress data for an entire reporting level, suppression of all tables for a geographic level was done selectively, based on the size of the population group being reported. For example, in New York State there are data tables, based on the total population, for 3824 tracts, on the white population for 3814 tracts, and on the African American population for 3216 tracts.
- Suppression of Selected Tables within One Geographic Area
In 2000, if suppression occurs for one or more areas from among a group of selected geographies, the names of those suppressed areas will be listed without data.
The 1980 and 1990 ASCII files contain a set of variables called suppression flags that indicate whether the data in a table are suppressed. The documentation indicates, for each table, which suppression flag applies, if any. When a suppression flag is associated with a table, you must test the suppression flag to determine whether zero values indicate totals of zero or no data.
In the 1970 summary files, suppression of data in selected tables is indicated by a negative value in the first field of the table.
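The two ASCII conventions described above (a separate suppression flag in 1980 and 1990, a negative first field in 1970) can be sketched in one helper. This is an illustration under stated assumptions, not a reader for any actual file: the function name is hypothetical, and the rule that any nonzero flag means "suppressed" is an assumption, since the real flag values are file-specific and must come from the technical documentation.

```python
def table_values(fields, suppression_flag=None):
    """Return the table's values, or None when the table is suppressed.

    fields: numeric values of one table, in file order.
    suppression_flag: flag value for 1980/1990-style files, or None for
        1970-style files that mark suppression inside the table itself.
    Assumption: any nonzero flag means "suppressed".
    """
    if suppression_flag is not None:
        # 1980/1990 convention: a separate flag variable decides.
        return None if suppression_flag != 0 else list(fields)
    # 1970 convention: a negative first field marks the table as suppressed.
    if fields and fields[0] < 0:
        return None
    return list(fields)


print(table_values([0, 0, 0], suppression_flag=0))  # genuine zeros
print(table_values([0, 0, 0], suppression_flag=1))  # no data published
print(table_values([-1, 0, 0]))                     # 1970-style suppression
```

Returning None rather than zeros keeps suppressed tables from being summed into totals as if they were true counts of zero.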
Imputation (called Allocation in 1970 and 1980) is the assignment of acceptable entries in place of unacceptable or missing entries on a basic record. This is done to complete, and thus retain in the sample, questionnaires that were not entirely filled out or had questions filled out incorrectly.
In 1990 and 2000 users concerned about allocation can check the "imputation tables" that correspond to the subject variable they are interested in. The value in the allocation variable reports what part of the total count for a variable was imputed. For 1980, no specifics about imputation are reported on the file, and in 1970 allocation tables are reported in a separate record type.
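As a small worked example of reading an imputation table, the share of a total that was imputed is the imputed count divided by the total count. The function name and the figures below are hypothetical and serve only to show the arithmetic.

```python
def imputed_share(imputed_count, total_count):
    """Return the fraction of a total that was imputed, or None if the total is zero."""
    if total_count == 0:
        return None  # a share is undefined when nothing was counted
    return imputed_count / total_count


# Hypothetical figures: 150 of 1,000 responses to a question were imputed.
share = imputed_share(150, 1000)
print(f"{share:.1%} of the total was imputed")
```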