Chapter 1 -- Accessing the Digital Census

Last Modified 17 August 1998

A) About the Census

The framers of the United States Constitution had two major goals when they decided to conduct a decennial (every ten years) census of the population. First, they needed to know how many people were living in all areas of the country so that Congressional districts could be defined, with all members of the House of Representatives elected from districts of the same population size. Because population would grow faster in some areas than in others, representation would have to be adjusted periodically, and only a periodic census could pinpoint those changes. Second, the founding fathers needed to apportion taxes among the various states, with the total federal taxes to be received from each state proportional to the number of persons in each state. They reasoned that any state wanting to inflate its population figures for more representation would be deterred by the burden of additional taxes. Thus U.S. marshals and their assistants were ordered to canvass the country and ask the heads of households for the number of free white males and females over and under age 16, the number of other free persons, and the number of slaves.

While the above information was necessary, there were requests for additional data about the country. Thomas Jefferson and others, for example, wanted to know just what kinds of economic and occupational activities existed within the new nation. By 1840 questions about education, occupation, commerce, and industry were also included. Also, by this time the population had grown so large that the scope and cost of tabulating the data began to be a real concern in the Congress. Efforts were made to simplify the questions, the collection of data, and the tabulation.

In 1890 the census-takers made the first attempt at machine tabulation of the results, and by 1930 statistical sampling was introduced in order to reduce the length of time each enumerator had to spend interviewing at each household. In 1940 the census tract was introduced as a reporting unit within metropolitan areas. This geographic area of about 4,000 persons has proved to be an elemental unit for local-area statistics that has remained fairly stable in dimensions ever since. New tracts in later censuses were created by subdividing existing tracts and adding new tracts to new residential areas.

In all censuses before 1960 the Census Bureau hired enumerators to visit each household and collect the information directly from the resident. In 1960 a mail-out sample questionnaire was developed, using a form that could be electronically scanned, and respondents handed the completed form to enumerators. Not every question was asked of every person, and procedures were developed to find hard-to-enumerate groups. By 1970 a full mail-out and mail-back census form was used in most parts of the country. Results were published in numerous volumes and were also produced in digital form, with various software programs developed by the Census Bureau to assist in this.

Over the decades the actual questionnaire has undergone considerable modification. Changes have been made to its content, phrasing of questions, geographical units, and collection procedures. Fortunately, the last two censuses, 1980 and 1990, have been very similar in the questions, the geographical units, and the tabulation of results. The results of these censuses are widely available in digital form so that researchers can examine population and housing questions more readily than ever.

The 1980 and 1990 censuses made extensive use of sampling. A basic short list of questions about gender, age, marital status, and housing was asked of everyone. Additional details were asked of only a sample of one in six households. The results from the 100% count on the short form of the questionnaire were tabulated separately from those of the sample from the long form. Various digital data files called Summary Tape Files (STF) are numbered to help users keep the two types of data distinct.

To carry out these mail-out censuses, a monumental digital street file had to be developed so that addresses could be assigned to the appropriate geographic unit. In 1980 this was known as the DIME file. That file contained information on the address ranges, surrounding census areas, and locations for every street segment (the distance between two intersections). Respondents to the mail-out census could be assigned to a census block by first determining on which street they resided and then determining which segment contained the address range encompassing their address. Because of the census block information appended to each segment, the respondent's data could be combined with that of other people living in the same block.
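The two-step lookup described above (find the street, then the segment whose address range contains the house number) can be sketched in a few lines of Python. This is only an illustration: the SegmentSide layout, the street name, the house numbers, and the block codes are all made up and do not reproduce the actual DIME record format. The sketch also assumes that odd and even house numbers lie on opposite sides of a street.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SegmentSide:
    """One side of a street segment (hypothetical layout, not the DIME format)."""
    street: str
    low: int      # lowest house number on this side of the segment
    high: int     # highest house number on this side of the segment
    block: str    # census block bordering this side


def match_block(street: str, number: int,
                sides: Sequence[SegmentSide]) -> Optional[str]:
    """Return the census block for an address, or None if nothing matches."""
    wanted = street.strip().upper()
    for side in sides:
        same_street = side.street.upper() == wanted
        in_range = side.low <= number <= side.high
        # Assumption: odd and even numbers are on opposite sides of the street.
        same_parity = number % 2 == side.low % 2
        if same_street and in_range and same_parity:
            return side.block
    return None


# Made-up segment sides for illustration only.
SIDES = [
    SegmentSide("MAIN ST", 100, 198, "101"),
    SegmentSide("MAIN ST", 101, 199, "102"),
    SegmentSide("MAIN ST", 200, 298, "103"),
]

print(match_block("Main St", 150, SIDES))   # even side of the first segment
print(match_block("Main St", 151, SIDES))   # odd side of the first segment
print(match_block("Main St", 999, SIDES))   # no segment covers this number
```

An address that matches no segment returns None. In practice such unmatched addresses are one visible symptom of gaps or errors in the street file.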

The DIME file was the primary base for the revised TIGER files used for the 1990 census. The TIGER files contain name changes and additional information such as points to better describe the shape of street segments. Unfortunately, many of the data quality problems associated with the DIME files also found their way into the TIGER files and so the Census Bureau and many private vendors expended considerable effort cleaning up these files during the 1990s.

DIME and TIGER files are not particularly useful unless one has some software that can convert them into boundary files for mapping or incorporate them into address matching. Many users simply acquire boundary files of the various census areas from private vendors rather than work with the TIGER files.

B) Digital Census Data

The U.S. Census Bureau reports the 1980 and 1990 census information in two major digital formats. The first is the Summary Tape File, which contains population aggregations for selected variables for various geographic units. The second is the Public-Use Microdata Sample (PUMS) file. The PUMS file contains separate records for each household and individual. It is very useful because it enables researchers to measure interrelationships between variables directly rather than by geographical areas, thus avoiding the problem of the ecological fallacy. The researcher can create custom tabulations based on individuals. The PUMS data are available only for areas with at least 100,000 persons. This is sometimes a larger area than might be desired, and in rural areas a single such area, called a Public-Use Microdata Area (PUMA), can amount to several counties.

Readers wanting additional detail on the digital files should consult Appendix A. This appendix contains detailed explanations of the Summary Tape Files, census geography, the Social Sciences Database Archive (SSDBA) of census data, and information on accessing data through that resource. Readers familiar with digital census data and the SSDBA data may continue.

Summary Tape Files

The STF files contain tabulations and cross-tabulations that correspond to much of the census information in the published volumes. Data include items such as counts of persons and households, persons by race by sex by age category, housing type by tenure, and so on. Summary Tape Files come in four major types (1, 2, 3, and 4) and usually in at least three systems of geographic aggregation, indicated by the letter "a", "b", "c", or "d".

STF1 and STF2 contain information from the short-form questionnaire on gender, ethnicity, marital status, and a few housing variables. STF3 and STF4 contain information from the long-form questionnaire on education, occupation, income, migration, etc. Because the long form contains more questions, these files are much larger than STF1 or STF2.

C) Geography in Summary Tape Files

Coding Geographic Units - FIPS Codes

All geographic units have a standardized number identifier referred to as a FIPS (Federal Information Processing Standards) Code. Appendix E lists FIPS codes for all U.S. states and counties and Appendix F lists codes for all U.S. cities over 10,000 population.
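As a small illustration of how these identifiers are handled in practice, the sketch below joins a two-digit state code and a three-digit county code into the five-digit county identifier that is commonly used. The function names are hypothetical. The main point is that FIPS codes should be stored as zero-padded text rather than as numbers, because a leading zero (as in California, state code 06) is lost when the code is read as an integer.

```python
from typing import Tuple


def county_fips(state_code: int, county_code: int) -> str:
    """Join state and county codes into a zero-padded 5-character string."""
    if not (0 < state_code < 100 and 0 < county_code < 1000):
        raise ValueError("state must be 1-99 and county 1-999")
    return f"{state_code:02d}{county_code:03d}"


def split_county_fips(code: str) -> Tuple[str, str]:
    """Split a 5-digit county identifier into (state, county) strings."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected five digits, got {code!r}")
    return code[:2], code[2:]


# California is state 06 and Los Angeles County is county 037.
print(county_fips(6, 37))            # 06037
print(split_county_fips("06037"))    # ('06', '037')
```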

The different levels of geography within a version are specified through a Summary Level Code. Some care is needed in specifying geography in order to avoid tabulating unwanted records that have the same Summary Level Code. Also, one should consult the census documentation to make sure a desired data item is available for a desired level of geography. Note that in general there is a single file for each state. However, the "c" files contain geography for the entire nation, while multiple files are needed to handle the massive STF4 tabulation.
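The caution above can be made concrete with a short sketch. A summary file holds records for several levels of geography at once, so adding up every record counts the same people more than once, and selecting on the Summary Level Code alone still returns records from every county in the file. The record layout, field names, and population counts below are invented for illustration, and the codes "050" (county) and "140" (tract) should be checked against the technical documentation for the file in use.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[object]]

# Made-up records: one county-level row and its tract-level rows per county.
RECORDS: List[Record] = [
    {"sumlev": "050", "county": "037", "tract": None, "persons": 900},
    {"sumlev": "140", "county": "037", "tract": "3001", "persons": 400},
    {"sumlev": "140", "county": "037", "tract": "3002", "persons": 500},
    {"sumlev": "050", "county": "059", "tract": None, "persons": 300},
    {"sumlev": "140", "county": "059", "tract": "0001", "persons": 300},
]


def select(records: List[Record], sumlev: str,
           county: Optional[str] = None) -> List[Record]:
    """Keep records at one summary level, optionally within one county."""
    return [
        r for r in records
        if r["sumlev"] == sumlev and (county is None or r["county"] == county)
    ]


def total_persons(records: List[Record]) -> int:
    return sum(int(r["persons"]) for r in records)


# Summing every record mixes county and tract rows and double counts.
print(total_persons(RECORDS))                          # 2400
# The summary level alone still mixes tracts from both counties.
print(total_persons(select(RECORDS, "140")))           # 1200
# Summary level plus county gives the tracts actually wanted.
print(total_persons(select(RECORDS, "140", "037")))    # 900
```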

D) Public-Use Microdata Sample Files

There are two PUMS files, which contain data for either a 5% or a 1% sample of all the housing units in the United States. In 1980 an estimate of the total number of persons was obtained by multiplying the sample value by 20 (for the 5% sample) or by 100 (for the 1% sample), but in 1990 each person and housing unit received an individual weight that is used to estimate the total population. PUMS files provide considerable detail on a number of variables, and Appendix G lists the necessary codes to deal with these variables.
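The difference between the two estimation methods can be shown with a short sketch. Under the 1980 approach every sample record stands for the same number of people, so the estimate is the sample count times a fixed factor. Under the 1990 approach the estimate is the sum of the individual weights. The function names and the weights below are made up for illustration and are not taken from any PUMS file.

```python
from typing import Iterable

# Fixed expansion factors: a 5% sample is 1 in 20, a 1% sample is 1 in 100.
EXPANSION = {"5%": 20, "1%": 100}


def estimate_fixed(sample_count: int, sample: str) -> int:
    """1980-style estimate: every sample record counts the same."""
    return sample_count * EXPANSION[sample]


def estimate_weighted(weights: Iterable[int]) -> int:
    """1990-style estimate: add up the weight carried by each record."""
    return sum(weights)


# Four hypothetical person records from a 5% sample.
person_weights = [18, 22, 25, 17]

print(estimate_fixed(len(person_weights), "5%"))   # 80
print(estimate_weighted(person_weights))           # 82
```

The two results are close but not identical, which is the expected pattern: the individual weights average out near the fixed factor while varying from record to record.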

The 1990 PUMS files contain a number of geographic areas called PUMAs (Public-Use Microdata Areas). See Appendix C for a list of California PUMAs, each of which contains a minimum of 100,000 persons. In 1980 Los Angeles County had only three geographic units (Los Angeles City, Long Beach City, and the remainder of the county). However, in 1990 the county was divided into 52 PUMAs, which greatly expanded the geographic value of the PUMS data. In heavily populated places like the city of Los Angeles, PUMAs consist of aggregations of tracts, while in other areas they may be aggregations of incorporated places. Unfortunately, these places are sometimes not contiguous.

The PUMS data set has a different structure than the Summary Tape Files. It is arranged in a hierarchical structure in which both housing and person record types are found in the same file. Data for a household (all persons in a housing unit) appear first and then a person record follows for each person in the household. Each person record contains a household identifier and codes to indicate the position of that person in the household.
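Reading a hierarchical file of this kind usually means remembering the most recent housing record and attaching it to each person record that follows. The sketch below shows one way to do that and produce one combined row per person. The record layout is a simplified assumption made for this example (a record type letter, a household serial number, and a few space-separated fields). The actual PUMS files use a fixed-column layout that is described in their documentation.

```python
from typing import Dict, Iterable, Iterator, Optional


def flatten(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one row per person with the household fields attached."""
    household: Optional[Dict[str, str]] = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "H":
            # Housing record: remember it for the person records that follow.
            household = {"serial": fields[1], "tenure": fields[2]}
        elif fields[0] == "P":
            # Person record: must belong to the current household.
            if household is None or fields[1] != household["serial"]:
                raise ValueError(f"person record without household: {line!r}")
            yield {**household, "person": fields[2], "age": fields[3]}
        else:
            raise ValueError(f"unknown record type: {line!r}")


# Made-up records: two households with two persons and one person.
SAMPLE = [
    "H 0000001 owner",
    "P 0000001 01 34",
    "P 0000001 02 36",
    "H 0000002 renter",
    "P 0000002 01 71",
]

for row in flatten(SAMPLE):
    print(row)
```

A housing record with no person records after it (a vacant unit, for example) simply produces no rows in this sketch. Whether such units should be kept depends on the analysis.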

E) Variables in the Sample Files for This Module

Six data sets have been extracted and made available as SPSS portable files. Most of the data were extracted from Summary Tape Files from the Census Bureau, but some data are from the 1994 County-City Databook and the California Department of Finance.

Data at the census tract level are provided in two data sets for the cities of Glendale and Burbank, California. These two cities were chosen because they are adjacent and had relatively stable tract boundaries between 1980 and 1990. One file provides ethnic data for 1980 and 1990 and the other provides detailed age data for males and females.

A third file contains mostly ethnic data and some household data for all cities over 10,000 persons in the United States. A fourth file contains a wide range of data for all counties in the United States. The fifth file was derived from California Department of Finance statistics on population change for all California counties.

The last file contains all PUMS data for a single PUMA. This file was extracted from the California 5% PUMS file at the SSDBA and is in their modified format, in which a housing record is appended to every person record. This PUMA covers the cities of Burbank and San Fernando. The table below lists the names of the six files. All have codebooks as digital text files and as printed appendices at the end of this module.

Module Data Sets

SPSS File Name	Description	Appendix
bgtrsp.por	Burbank-Glendale Tracts, 1980-1990	H
bgatsp.por	Burbank-Glendale Tracts, Age Groups	H
USCIsp.por	U.S. Cities	I
USCOsp.por	U.S. Counties	I
CACOsp.por	California Counties, Population Growth	J
PUMSsp.por	Data for One PUMA	K