Appendix A -- Detailed Descriptions and Access to Major Files

CENS: Appendix A

Last Modified 17 August 1998

Summary Tape Files

Short-Form Files

STF1 contains
36 tables of population information and 45 tables of housing information.
Each of these 81 tables is repeated for each geographic unit within a file.
STF2 is much like STF1 except that it contains more table categories. Furthermore,
it contains "a" and "b" record types for both person and housing
data. The "a" record type is the tabulation for the entire population, while
the "b" record type is a tabulation for a particular ethnic group. One form
of STF2 contains tabulations for 9 ethnic groups while another contains tabulations
for up to 28 ethnic groups. STF2 contains 13 a-type person tables,
28 a-type housing tables, 27 b-type person tables, and 27 b-type
housing tables.

Long-Form Files

STF3 contains
170 person tables and 92 housing tables. Like STF2, STF4 contains "a" and "b"
record types, which make it by far the largest and most detailed of the Summary
Tape Files. It contains 122 a-type person tables, 76 a-type housing tables,
161 b-type person tables, and 74 b-type housing tables. One form of STF4 contains
"a" and "b" tabulations for 9 ethnic groups while another contains tabulations
for up to 48 groups. The table below indicates the relative sizes of the four
Summary Tape Files by the number of cells of data for each geographic unit.

File Cells

STF1 900+

STF2 2100+

STF3 3300+

STF4 8500+

Geography in Summary Tape Files

The table below indicates the smallest level of geography contained in each
of the data files.

File Minimum Units

STF1a Block Group

STF1b Block

STF1c County and Place > 10,000

STF1d Congressional District

STF2a Tract

STF2b County Subdivision and Place > 1000

STF2c County and Place > 10,000

STF3a Block Group

STF3b ZIP Code

STF3c County Subdivision and Place > 10,000

STF3d Congressional District

STF4a Tract

STF4b County Subdivision and Place > 2500

STF4c County Subdivision > 10,000

Appendix D lists
the Summary Level Codes and the Geographic Component Codes for the STF3
data files. Note, for example, that the State summary level may have as many
as five records. If only the state total record is desired, the Geographic
Component Code must be set to 00.

Appendix D also indicates
the geographical hierarchy followed to reach the final geographic unit. Of
particular importance is the hierarchy used to reach census tracts and block
groups. Many tracts are split by incorporated place boundaries.
This means that selecting Summary Level Code 080 or 090 will return many
more geographic units than selecting Summary Level Code 140 or 150. The latter
codes are for complete tracts and block groups, while the former codes are for
the parts created by those splits and might be useful when studying tracts
only within a specific city. Accidentally selecting the split tracts
will cause considerable difficulty with most mapping programs, since they typically
use only complete tract boundary files. Statistical computations may also
be affected because of numerous zero values in parts of the split tracts.
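
One way to catch an accidental split-tract selection is to count records by Summary Level Code before mapping. The Python sketch below is only an illustration: the field names SUMLEV and TRACT and the sample records are assumptions, not the layout of any SSDBA file.

```python
from collections import Counter

# Hypothetical extract; SUMLEV and TRACT are assumed field names.
records = [
    {"SUMLEV": "140", "TRACT": "1101.01"},  # complete tract
    {"SUMLEV": "140", "TRACT": "1101.02"},  # complete tract
    {"SUMLEV": "080", "TRACT": "1101.01"},  # part of a tract inside a place
    {"SUMLEV": "080", "TRACT": "1101.01"},  # part of the same tract outside it
    {"SUMLEV": "080", "TRACT": "1101.02"},
]


def count_by_summary_level(rows):
    """Count how many records carry each Summary Level Code."""
    return Counter(row["SUMLEV"] for row in rows)


def keep_summary_level(rows, code):
    """Keep only the records with the requested Summary Level Code."""
    return [row for row in rows if row["SUMLEV"] == code]


counts = count_by_summary_level(records)
complete = keep_summary_level(records, "140")
print(counts)
print(len(complete), "complete tracts")
```

If the count for 080 is much larger than the count for 140, the extract contains tract parts rather than complete tracts.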

Some Summary Level Codes
differ when accessing STF4. Complete
tracts in STF4a have the Summary Level Code 141, places of over 2,500 persons
in STF4b have the Summary Level Code 163, and places of over 10,000 in STF4c
have the Summary Level Code 161. Summary Level Codes are listed on page
6-1 of the U.S. Census Summary Tape File Codebooks. A discussion of census geography
can be found in Appendix A of the same documentation.

Coding Geographic Units - FIPS Codes

For various types
of features within a state or county, FIPS code numbers increase with
the alphabetical order of the location names. For example, Alameda County
in California is 001, Alpine County is 003, and Amador County is 005. A census
tract FIPS code consists
of six digits. The first four identify a tract and the last two serve as a
suffix. Tracts that are split in a later census because of increased population
receive suffixes such as .01 and .02, for example 1101.01 and 1101.02. In some cases
split tracts have been split a second time. The suffix may also have values
from .80 to .98, indicating that the tract was created by modifying an existing
boundary. A value of .99 indicates persons aboard a ship at the time the census
was taken.
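
The suffix rules above can be expressed as a small helper. This is a minimal Python sketch, not Census Bureau code; the function name `describe_tract` and the treatment of a missing suffix as 00 are assumptions made for illustration.

```python
def describe_tract(code):
    """Split a tract code such as '1101.02' or '110102' into base and suffix."""
    digits = code.replace(".", "")
    if len(digits) == 4:
        digits += "00"  # assumption: a tract written without a suffix is stored as 00
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"not a tract code: {code!r}")
    base, suffix = digits[:4], digits[4:]
    number = int(suffix)
    if number == 99:
        kind = "persons aboard ships"
    elif 80 <= number <= 98:
        kind = "modified boundary"
    elif number == 0:
        kind = "no suffix"
    else:
        kind = "numbered part, typically a split tract"
    return base, suffix, kind


print(describe_tract("1101.02"))
print(describe_tract("110199"))
```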

In most cases
a FIPS code must be specified to subset a desired geographic unit or units
from a census file. Occasionally more than one FIPS code must be used to create
a desired data set. For example, to get the tracts of Los Angeles County,
one would request a Summary Level Code of 140 and a county FIPS Code of 037.
To get the data for the state of California one might specify a Summary Level
Code of 040, a Geographic Component Code of 00, and a state FIPS Code of 06.
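
The two requests described above amount to filtering on several codes at once. The following Python sketch shows the idea under stated assumptions: the field names (SUMLEV, GEOCOMP, STATE, COUNTY, NAME) and the sample records are hypothetical, and codes are kept as strings so that leading zeros survive.

```python
# Hypothetical records; field names are assumptions, not an STF layout.
records = [
    {"SUMLEV": "040", "GEOCOMP": "00", "STATE": "06", "COUNTY": "", "NAME": "California"},
    {"SUMLEV": "040", "GEOCOMP": "01", "STATE": "06", "COUNTY": "", "NAME": "California, component 01"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATE": "06", "COUNTY": "037", "NAME": "Tract 1101.01"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATE": "06", "COUNTY": "059", "NAME": "Tract 0011.01"},
]


def select_units(rows, **criteria):
    """Return the rows whose fields equal every requested code."""
    return [row for row in rows
            if all(row.get(field) == code for field, code in criteria.items())]


# Tracts of Los Angeles County: Summary Level 140, county FIPS 037.
la_tracts = select_units(records, SUMLEV="140", COUNTY="037")

# State total for California: Summary Level 040, component 00, state FIPS 06.
state_total = select_units(records, SUMLEV="040", GEOCOMP="00", STATE="06")

print([row["NAME"] for row in la_tracts])
print([row["NAME"] for row in state_total])
```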

For most mapping
programs FIPS codes from the census must be joined to create a matching value
for the boundary codes in the program's mapping data. Thus, if one wanted
to map tracts in Los Angeles County, one would have to join the state, county,
and tract FIPS codes to create an identifying label. For example, 060371101.02
would specify tract 1101.02 in county 037 in state 06. The number for the
census file must match the number in the mapping file exactly or the census
data will not load into the mapping program. One convenient approach is to
export a list of geographic unit labels from the mapping program to check
on the format needed for the census data. Often this list can be pasted directly
into the census data table although some record checking is usually required
to account for unlabeled areas.
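
Building the label and checking it against the mapping program's list can be sketched as follows. This Python example is illustrative only: `tract_key` and `unmatched` are hypothetical helper names, and the exact label format a given mapping program expects must be taken from its own exported list.

```python
def tract_key(state, county, tract):
    """Join state, county, and tract codes into one label, e.g. 060371101.02."""
    # Pad state to 2 characters and county to 3 so leading zeros are not lost.
    return f"{state:0>2}{county:0>3}{tract}"


def unmatched(census_keys, map_keys):
    """Return labels found only in the census data and only in the map data."""
    census, mapped = set(census_keys), set(map_keys)
    return sorted(census - mapped), sorted(mapped - census)


census_keys = [tract_key("06", "037", t) for t in ("1101.01", "1101.02")]
map_keys = ["060371101.02", "060371102.00"]  # hypothetical export from a mapping program

only_census, only_map = unmatched(census_keys, map_keys)
print(only_census)
print(only_map)
```

Labels that appear in only one of the two lists are the records that will fail to load or the areas that will remain unlabeled.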

The Census Bureau
publishes a large list of FIPS codes in its Geographic Identification Coding
Scheme publication. Each census data record contains an area name (ANPSADPI)
so that names of locations can be directly located in the census files.

Considerable
care is needed in comparing data from the 1980 and 1990 censuses at the tract
level, not only because of split or aggregated tracts, but also because some
boundaries have been shifted contrary to policy. The California Department of Finance
maintains Tract Equivalency Files that can be used to locate where changes
have occurred.

Public-Use Microdata Sample Files

Many spreadsheet programs seem incapable of dealing with the hierarchical structure of
PUMS, in which household records and person records are stored separately.
For this reason a modified structure has been created for the files
stored at the Social Sciences Database Archive. In the SSDBA files, the household
record has been appended to each person record. This greatly
enlarges the database while making it easier to work with. It also means
that a user must restrict tabulations to head-of-household records
when tabulating housing data. Otherwise calculations will be based on duplicate
housing records appended to each person in the household.
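
The effect of the duplicated housing records can be seen in a small example. This Python sketch uses made-up person records; SERIALNO is an assumed household identifier, while RELAT1 = 0 as the head-of-household code follows the SPSS programs later in this appendix.

```python
from collections import Counter

# Made-up person records, each carrying its household's TENURE value.
persons = [
    {"SERIALNO": 1, "RELAT1": 0, "TENURE": 1},  # head of household 1
    {"SERIALNO": 1, "RELAT1": 1, "TENURE": 1},
    {"SERIALNO": 1, "RELAT1": 2, "TENURE": 1},
    {"SERIALNO": 2, "RELAT1": 0, "TENURE": 3},  # head of household 2
]

# Counting every record counts household 1 three times.
naive = Counter(p["TENURE"] for p in persons)

# Restricting to head-of-household records counts each housing unit once.
households = Counter(p["TENURE"] for p in persons if p["RELAT1"] == 0)

print(dict(naive))
print(dict(households))
```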

One also needs
to be particularly conscious of the population serving as the universe when
trying to replicate aggregations used by the U.S. Census. For example, employment
data should be tabulated only for persons over age 15 who are employed as
civilians. One could also limit the tabulations to those employed full-time.

Since customized
populations can be created from the PUMS files, some care is needed in dealing
with the significance of very small counts especially when PUMAs are being
used. Chapter 3 of the PUMS Codebook suggests procedures in dealing with this
issue.

Census Data in the Social Sciences Database Archive

The Social Sciences Database Archive contains a number of digital files from
the U.S. Census Bureau. These include the PUMS and Summary Tape Files for
1980 and 1990 as well as County-City Databooks and Current Population Estimates.
Not all STF and PUMS files are available through the SSDBA for all states.
Data files are in SPSS format, and in most cases additional files provide
dictionaries and codebooks necessary to extract information from the files.
A few of the files have SPSS programs for reading the data, and these may
be expanded to carry out various procedures. See Appendix B for a description
of the available data sets. The following table describes the current location
and availability of the 1990 STF and PUMS resources in the SSDBA.

1990 STF and PUMS Resources in the SSDBA

This table contains
the descriptions of the locations of various census resources within the SSDBA.
There are four types of files which are indicated with the following codes:

Data: the file
containing the database. All are SPSS system files. "By request" means that
the data are not directly accessible.

Cbk: the codebook describing the database

Dic: a data dictionary for the database

Prog: an SSDBA program that will describe the contents of the database. SPSS
statistics commands can be appended to it.

STF1a CA
Data: /usr/ssdba/ssdba46/c90stf1a-ca.sys
Cbk: /ssdba-data/docs/codebooks/c90stf1.cb
Dic: /ssdba-data/docs/codebooks/c90stf1a.dic

STF1b CA
Data: By request
Cbk: /ssdba-data/docs/codebooks/c90stf1.cb
Dic: /ssdba-data/docs/codebooks/c90stf1b.dic
Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.uspss
Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.usas
Prog: /ssdba-data/docs/programs/untested/c90stf1b-2.uspss
Prog: /ssdba-data/docs/programs/untested/c90stf1b-2.usas

STF1c U.S.
Data: /usr/ssdba/ssdba42/c90stf1c.sys
Cbk: /ssdba-data/docs/codebooks/c90stf1c.cb

STF2a U.S.
Data: By request

STF3a CA
Data (Los Angeles and Orange Counties only): /usr/ssdba/ssdba37/c90stf3a-ca1.sys
Data (Other CA Counties): /usr/ssdba/ssdba38/c90stf3a-ca2.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb
Prog: /ssdba-data/docs/programs/icp9782.spss

STF3b ZIP Codes beginning with 8 or 9
Data: /usr/ssdba/ssdba76/c90stf3b.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb

STF3c U.S.
Data (NE U.S.): /usr/ssdba/ssdba37/c90stf3c-a.sys
Data (Rest of U.S.): /usr/ssdba/ssdba47/c90stf3c-b.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb

STF4a CA
Data (B recs, All persons): /usr/ssdba/ssdba50/c90stf4a-t1.sys
Data (B recs, White): /usr/ssdba/ssdba48/c90stf4a-t2.sys
Data (B recs, Black): /usr/ssdba/ssdba65/c90stf4a-t3.sys
Data (B recs, Am Inds): /usr/ssdba/ssdba61/c90stf4a-t4.sys
Data (B recs, Asian): /usr/ssdba/ssdba49/c90stf4a-t5.sys
Data (B recs, Other Race): /usr/ssdba/ssdba51/c90stf4a-t6.sys
Data (B recs, Hispanic): /usr/ssdba/ssdba61/c90stf4a-t7.sys
Data (B recs, NonHisp Wh): /usr/ssdba/ssdba51/c90stf4a-t8.sys
Data (B recs, NonHisp Bl): /usr/ssdba/ssdba53/c90stf4a-t9.sys
Data (B recs, NonHisp Oth): /usr/ssdba/ssdba49/c90stf4a-t10.sys
Data (A recs, All persons): /usr/ssdba/ssdba53/c90stf4a-t11.sys
Cbk: None (see Census Docs.)

PUMS CA
Data (Person & Housing Recs): /c/census/c90pums-p5.sys
Data (Housing Recs only): /c/census/c90pums-hr
Cbk: /ssdba-data/docs/codebooks/c90pums-p1.frq

STF and PUMS SPSS Programs in the SSDBA

The programs
below are located in the following directory:

/ssdba-data/docs/programs

c10pums.spss

c40pums.spss

c50pums.spss

c60pums.spss

c70pums.spss

c80pums-us.spss

c80stf1a.spss

c80stf1c.spss

c80stf3a.spss

c80stf3c.spss

c80stf4a.spss

c90eeo.spss

c90geog.spss

c90pums-5.sas

c90pums-5.spss

c90pums-h.spss

c90pums-p.spss

c90pums.bak

c90stf1a.spss

c90stf1c.spss

c90stf3a.spss

c90stf3c.spss

c90stf4a-a.spss

c90stf4a-b.spss

c90tract.spss

SPSS Programs for Extracting Data from the Sample Databases and from the SSDBA Archive

In order to extract variables from one of the census files you need to know
the variable names. One easy way to get these is to execute a DISPLAY DICTIONARY
command. This command will list out the contents and formats of a database.
Note that the variable names have slightly different formats between the various
Summary Tape Files in the SSDBA and so such a listing is necessary.

Data Dictionaries

The following
program will create a dictionary of an STF4b file on the Unix version of SPSS
at the SSDBA.

finish.

Program to Read PUMS Extract and Crosstab Ethnicity by Occupation

The following
two programs were used to read the SSDBA PUMS file and create some crosstabulations.
Note that the resulting table is for all persons who were employed, not just
civilian employed. A non-Hispanic white category was created by using the
Hispanic and Race variables. These programs can be copied and pasted into
the PC version of SPSS.

recode RACE (2=2) (6,7=3) (8=5) (9=6) (11=7) into ETHNIC.

* Compute Non-Hispanic White Category

compute NH = 0.

if (HISPANIC eq 0 or HISPANIC = 199) NH = 1.

if (RACE eq 1 and NH eq 1) ETHNIC = 1.

recode HISPANIC (1,210 thru 220=8) (2,261=9) (3,271=10) (226=11) into ETHNIC.

recode OCCUP (0 thru 37=1) (38 thru 106, 164 thru 199=2) (107 thru 163=3)
(200 thru 208=4) (209 thru 353, 356 thru 389=5) (354,355=6) (403 thru 407=7)
(413 thru 427, 456 thru 469=8) (433 thru 444=9) (445 thru 447=10) (503 thru
549=11) (553 thru 599=12) (628 thru 699=13) (703 thru 799=14) (800 thru 859=15)
(866 thru 889=16) into OCC.

value labels ETHNIC 1 'NhW' 2 'Blk' 3 'Chi+Tai' 5 'Fil' 6 'Jap' 7 'Kor' 8
'Mex' 9 'PR' 10 'Cub' 11 'Sal'.

value labels OCC 1 'ExMgt' 2 'Prof' 3 'Teach' 4 'HlthTch' 5 'SaleTch' 6 'PostOf'
7 'PrHHS' 8 'Serv' 9 'FoodPr' 10 'HealSr' 11 'Mechan' 12 'Const' 13 'PrecProd'
14 'MachOp' 15 'Trans' 16 'Helper'.

* Create Occupations by Ethnic by Gender

crosstabs variables= ETHNIC(1,11) SEX(0,1) OCC(1,16)/ tables=OCC by ETHNIC
by SEX/ CELLS=COUNT ROW COLUMN.

FINISH.
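
For readers more familiar with general-purpose languages, the range recode of OCCUP above can be sketched in Python. This is an illustration of the logic only, not part of the SSDBA programs; the names `OCC_RANGES` and `recode` are hypothetical. Values that fall in none of the ranges come back as None, in the same way that a new target variable of an SPSS RECODE ... INTO stays missing for unlisted values.

```python
# (low, high) ranges and target categories copied from the OCCUP recode above.
OCC_RANGES = [
    ((0, 37), 1),
    ((38, 106), 2), ((164, 199), 2),
    ((107, 163), 3),
    ((200, 208), 4),
    ((209, 353), 5), ((356, 389), 5),
    ((354, 355), 6),
    ((403, 407), 7),
    ((413, 427), 8), ((456, 469), 8),
    ((433, 444), 9),
    ((445, 447), 10),
    ((503, 549), 11),
    ((553, 599), 12),
    ((628, 699), 13),
    ((703, 799), 14),
    ((800, 859), 15),
    ((866, 889), 16),
]


def recode(value, ranges):
    """Return the category whose inclusive range contains value, else None."""
    for (low, high), category in ranges:
        if low <= value <= high:
            return category
    return None


print(recode(150, OCC_RANGES))
print(recode(400, OCC_RANGES))
```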

Program to Crosstab Ethnicity by Income Categories

if (RACE eq 1 and NH eq 1) ETHNIC = 1.

recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
(222=5) (223=6) (224=7) (226=8) into HISP.

recode RHHINC (1 thru 4999=1) (5000 thru 9999=2) (10000 thru 14999=3) (15000
thru 19999=4) (20000 thru 24999=5) (25000 thru 29999=6) (30000 thru 34999=7)
(35000 thru 39999=8) (40000 thru 44999=9) (45000 thru 49999=10) (50000 thru
59999=11) (60000 thru 79999=12) (80000 thru 99999=13) (100000 thru 150000=14)
(150001 thru HIGHEST=15) into HHINC.

value labels HHINC 1 '<5' 2 '5-9' 3 '10-14' 4 '15-19' 5 '20-24' 6 '25-29'
7 '30-34' 8 '35-39' 9 '40-44' 10 '45-49' 11 '50-59' 12 '60-79' 13 '80-99'
14 '100-150' 15 '150+'.

value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi+Tai' 6 'Fil' 7 'Jap'
8 'AsInd' 9 'Kor'.

value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
8 'Sal'.

* For household income use persons 16+ years.

select if (AGE ge 16).

* Select head of household records.

select if RELAT1 = 0.

* Create Household Income by Ethnic and Hispanic Tables

crosstabs variables= ETHNIC(1,9) HISP(1,8) HHINC(1,15)/ tables=HHINC by
ETHNIC HISP/ CELLS=COUNT ROW COLUMN.

FINISH.
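
Because the income categories above are contiguous, the same binning can be expressed as a lookup on the lower bound of each category. The Python sketch below illustrates that idea; it assumes whole-dollar incomes, and `LOWER_BOUNDS` and `income_category` are hypothetical names. As in the SPSS recode, incomes below 1 receive no category.

```python
from bisect import bisect_right

# Lower bound of each HHINC category 1..15, taken from the RHHINC recode above.
LOWER_BOUNDS = [1, 5000, 10000, 15000, 20000, 25000, 30000, 35000,
                40000, 45000, 50000, 60000, 80000, 100000, 150001]


def income_category(income):
    """Return the HHINC category (1-15) for a whole-dollar income, else None."""
    if income < 1:
        return None  # zero and negative incomes are not covered by the recode
    # The number of lower bounds at or below the income is the category number.
    return bisect_right(LOWER_BOUNDS, income)


for amount in (4999, 5000, 150000, 150001):
    print(amount, income_category(amount))
```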

Program to Compute Several Summary Variables

The following
program reads the 5% PUMS file for California, selects the PUMAs in the
five-county Southern California area, computes several basic summary variables,
and produces frequencies of the values for selected variables.

get file '/c/census/c90pums-p5.sys'
/keep=PUMA AGE RACE HISPANIC ANCSTRY1 OCCUP CLASS ENGLISH POB YEARSCH CITIZEN
RELAT1 PERSONS TENURE/.

* Select PUMAS in the five-county Southern California area.

select if (PUMA ge 4200 and PUMA le 4808 or PUMA ge 5200 and PUMA le 7207).

recode RACE (2=2) (4,5,301 thru 327=3) (6=4) (7=5) (8=6) (9=7) (10=8) (11=9)
(12=10) (13=11) (15=12) (16=13) (19=14) (22=15) (25=16) (26=17) into ETHNIC.

compute NH = 0.

if (HISPANIC eq 0 or HISPANIC = 199) NH = 1.

if (RACE eq 1 and NH eq 1) ETHNIC = 1.

recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
(222=5) (223=6) (224=7) (226=8) (231 thru 249=9) into HISP.

recode ANCSTRY1 (15,22=1) (148 thru 150=2) (302=3) (308=4) (360=5) (416=6)
(419=7) (431=8) (434=9) (400 thru 415,417,418,421 thru 430,435 thru 481,490
thru 499=10) (522,523=11) (553 thru 558=12) (800 thru 802=13) into ANC.

value labels ANC 1 'Eng' 2 'Rus' 3 'Belz' 4 'Jam' 5 'Brz' 6 'Ira' 7 'Isr'
8 'Arm' 9 'Tur' 10 'Arab' 11 'Eth' 12 'Nig' 13 'Aus'.

value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi' 5 'Taiw' 6 'Fil'
7 'Jap' 8 'AsInd' 9 'Kor' 10 'Vie' 11 'Cam' 12 'Lao' 13 'Tha' 14 'Indo'
15 'Pak' 16 'Haw' 17 'Sam'.

value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
8 'Sal' 9 'SoAm'.

* Tabulate various stats for 5-co area by ethnic groups.

if (AGE ge 18) VAR01 = 1.

if (AGE ge 25) VAR02 = 1.

if ((ENGLISH eq 0 or ENGLISH eq 1) and AGE ge 18) VAR03 = 1.

if ((YEARSCH ge 14 and YEARSCH le 17) and AGE ge 25) VAR04 = 1.

if (CITIZEN eq 3 or CITIZEN eq 4) FB = 1.

if (FB eq 1 and AGE ge 25) VAR05 = 1.

if (OCCUP ge 3 and OCCUP le 199) OC = 1.

if (CLASS ge 1 and CLASS le 8) CL = 1.

if (OC eq 1 and CL eq 1 and AGE ge 25) VAR07 = 1.

if (PERSONS ge 1 and RELAT1 eq 0) VAR08 = 1.

if ((TENURE eq 1 or TENURE eq 2) and RELAT1 eq 0) VAR09 = 1.

if ((TENURE ge 1 and TENURE le 4) and RELAT1 eq 0) VAR10 = 1.

variable labels
VAR01 'Pers 18+' VAR02 'Pers 25+' VAR03 'SpkEngO/VW/18' VAR04 'ColEd/25'
VAR05 'ForB/25' VAR07 'AdmExProf/25' VAR08 'Hsehldrs' VAR09 'OwnOccHU' VAR10
'OccHU' FB 'ForBorn'.

value labels
VAR01 1 'Pers 18+'/ VAR02 1 'Pers 25+'/ VAR03 1 'SpkEng'/ VAR04 1 'ColEduc'/
VAR05 1 'ForBor25'/ VAR07 1 'AdmPrOcc'/ VAR08 1 'HsHlds'/ VAR09 1 'OwnOcH'/
VAR10 1 'OccHU'/ FB 1 'ForBorn'/ OC 1 'ProfOcc'/ CL 1 'Worker'.

frequencies
variables=ETHNIC (1,17) HISP (1,9) ANC (1,13) CLASS (0,9) CITIZEN (0,4)
FB (1,1) VAR01 to VAR10 (1,1)/ barchart/ format=condense.

finish.

Running SPSS on Venus at the SSDBA

Because the PUMS file is quite large and because some time is required to
access STF databases, it is quite possible that a program will take from 30
to 45 minutes to finish execution. An alternative to waiting at the terminal
is to submit your SPSS file as a batch job. This can be done by entering the
following statement at the Unix prompt.

The program file
contains the SPSS program statements, and the listing file receives the results
of the program execution. If a data output file is created, it will be saved
under the name given in the program. Note that successive runs will
fail if the program attempts to generate a new file with the same name as an
existing file.
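
One way to avoid such a failure is to check for an existing file before choosing the output name written into the program. The Python sketch below is a generic illustration, not an SSDBA utility; `safe_output_name` and the file name `extract.sys` are hypothetical.

```python
from pathlib import Path


def safe_output_name(path):
    """Return path if it is unused, else the first free name with a numeric suffix."""
    original = Path(path)
    candidate = original
    counter = 1
    while candidate.exists():
        # extract.sys -> extract-1.sys -> extract-2.sys ...
        candidate = original.with_name(f"{original.stem}-{counter}{original.suffix}")
        counter += 1
    return candidate


print(safe_output_name("extract.sys"))
```

The name it returns can then be placed in the program's save statement before the job is submitted.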

The SPSS program
on Venus also serves as a good editor for entering and updating programs. A number
of basic edit functions are available through Escape-key and number combinations.
Esc-1 can be used to view directory contents. Esc-2 switches
between the output and input screens. Esc-3 reads in an existing file, Esc-9 saves
an edited file, and Esc-0 runs a program from the editor and quits the editor.
Pressing Escape twice cancels a command selection.