- Summary Tape Files
- Geography in Summary Tape Files
- Public-Use Microdata Sample Files
- Census Data in the Social Sciences Database Archive
- SPSS Programs for Extracting Data from the Sample Databases and from the SSDBA Archive
Short-Form Files
STF1 contains 36 tables of population information and 45 tables of housing
information. Each of these 81 tables is repeated for each geographic unit
within a file.
STF2 is much like STF1 except that it contains more table categories. It also
contains "a" and "b" record types for both person and housing data. The a
record is the tabulation for the entire population, while the b record is a
tabulation for a particular ethnic group. One form of STF2 contains
tabulations for 9 ethnic groups, while another contains tabulations for up to
28 ethnic groups. STF2 contains 13 a-type person tables, 28 a-type housing
tables, 27 b-type person tables, and 27 b-type housing tables.
Long-Form Files
STF3 contains 170 person tables and 92 housing tables.
Like STF2, STF4 contains a and b record types, which make it by far the largest
and most detailed of the Summary Tape Files. It contains 122 a-type person
tables, 76 a-type housing tables, 161 b-type person tables, and 74 b-type
housing tables. One form of STF4 contains a and b tabulations for 9 ethnic
groups, while another contains tabulations of up to 48 groups. The table below
indicates the relative sizes of the four Summary Tape Files by the number of
cells of data for each geographic unit.
File   Cells
STF1   900+
STF2   2100+
STF3   3300+
STF4   8500+
The table below indicates the smallest level of geography contained in each
of the data files.
File Minimum Units
STF1a Block Group
STF1b Block
STF1c County and Place > 10,000
STF1d Congressional District
STF2a Tract
STF2b County Subdivision and Place > 1000
STF2c County and Place > 10,000
STF3a Block Group
STF3b ZIP Code
STF3c County Subdivision and Place > 10,000
STF3d Congressional District
STF4a Tract
STF4b County Subdivision and Place > 2500
STF4c County Subdivision > 10,000
Appendix D lists the Summary Level Codes and the Geographic Component Codes for
the STF3 data files. Note, for example, that the state summary level can have
as many as five records. If only the state total record is desired, the
Geographic Component Code must be set to 00.
Appendix D also indicates the geographical hierarchy followed to reach the
final geographic unit. Of particular importance is the hierarchy used to reach
census tracts and block groups. Many tracts are split by incorporated place
boundaries. This means that selecting Summary Level Code 080 or 090 will return
many more geographic units than selecting Summary Level Code 140 or 150. The
latter codes are for complete tracts and block groups, while the former might
be useful when studying tracts only within a specific city. Accidentally
selecting the split tracts will cause considerable difficulty with most mapping
programs, since they typically use only complete-tract boundary files.
Statistical computations may also be affected because of the numerous zero
values in parts of the split tracts.
Some of the Summary Level Codes differ when accessing STF4. Complete tracts in
STF4a have the Summary Level Code 141, places of over 2,500 persons in STF4b
have the code 163, and places of over 10,000 persons in STF4c have the code
161. Summary Level Codes are found on page 6-1 of the U.S. Census Summary Tape
File codebooks. A discussion of census geography can be found in Appendix A of
the same documentation.
Coding Geographic Units - FIPS Codes
For most types of features within a state or county, FIPS code numbers increase
with the alphabetical order of the location names. For example, in California
Alameda County is 001, Alpine County is 003, and Amador County is 005. A census
tract code consists of six digits. The first four identify a tract and the
last two serve as a suffix. Tracts that are split in a later census because of
increased population receive suffixes such as .01 and .02, giving 1101.01 and
1101.02. In some cases split tracts have been split a second time. The suffix
may also have values from .80 to .98, indicating that the tract was created by
modifying an existing boundary. A value of .99 indicates persons aboard a ship
at the time the census was taken.
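The suffix rules above can be sketched in a few lines of Python (used here only for illustration; the archive's own programs are in SPSS). The function name and the category labels are hypothetical, and two assumptions are made: a tract given with only four digits is treated as having suffix 00, and every suffix from 01 to 79 is treated as a split tract.

```python
def describe_tract(code):
    """Split a census tract code into its basic number and suffix.

    Accepts '110102', '1101.02', or a bare '1101'.
    """
    digits = str(code).strip().replace(".", "")
    if len(digits) == 4:
        digits += "00"  # assumption: an unsuffixed tract is treated as suffix 00
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a 4- or 6-digit tract code, got {code!r}")

    base, suffix = digits[:4], digits[4:]
    number = int(suffix)
    if number == 0:
        kind = "unsuffixed"
    elif number == 99:
        kind = "shipboard"          # persons aboard a ship
    elif 80 <= number <= 98:
        kind = "modified boundary"  # created by modifying an existing boundary
    else:
        kind = "split"              # assumption: 01-79 are all split tracts
    return {"base": base, "suffix": suffix,
            "label": f"{base}.{suffix}", "kind": kind}


for example in ("110102", "1101.85", "1101.99", "1101"):
    print(example, describe_tract(example))
```

Keeping the code as a string throughout avoids losing leading zeros, which would happen if the tract were read as a number.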
In most cases a FIPS code must be specified to subset a desired geographic unit
or units from a census file. Occasionally more than one FIPS code must be used
to create a desired data set. For example, to get the tracts of Los Angeles
County, one would request a Summary Level Code of 140 and a county FIPS Code of
037. To get the data for the state of California one might specify a Summary
Level Code of 040, a Geographic Component Code of 00, and a state FIPS Code of
06.
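The two selections just described can be illustrated with a small Python sketch. It is not how the SSDBA files are actually read: the field names (SUMLEV, GEOCOMP, STATEFP, CNTY, NAME) are assumed for the example, and the records are made up. Check the real variable names against a dictionary listing of the file you are using.

```python
def select_records(records, sumlev, geocomp=None, state=None, county=None):
    """Return the records that match every code supplied; None means 'any'."""
    wanted = {"SUMLEV": sumlev, "GEOCOMP": geocomp,
              "STATEFP": state, "CNTY": county}
    wanted = {field: code for field, code in wanted.items() if code is not None}
    return [r for r in records
            if all(r.get(field) == code for field, code in wanted.items())]


# Made-up records; codes are kept as strings so leading zeros survive.
records = [
    {"SUMLEV": "040", "GEOCOMP": "00", "STATEFP": "06", "CNTY": "",
     "NAME": "California (total)"},
    {"SUMLEV": "040", "GEOCOMP": "01", "STATEFP": "06", "CNTY": "",
     "NAME": "California (component 01)"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATEFP": "06", "CNTY": "037",
     "NAME": "Tract 1101.02"},
    {"SUMLEV": "140", "GEOCOMP": "00", "STATEFP": "06", "CNTY": "001",
     "NAME": "Tract 4001"},
    {"SUMLEV": "080", "GEOCOMP": "00", "STATEFP": "06", "CNTY": "037",
     "NAME": "Tract 1101.02 (part)"},
]

la_tracts = select_records(records, sumlev="140", county="037")
state_total = select_records(records, sumlev="040", geocomp="00", state="06")
print([r["NAME"] for r in la_tracts])
print([r["NAME"] for r in state_total])
```

Note that omitting the Geographic Component Code in the state request would return both state records, which is the situation the text warns about.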
For most mapping programs, the FIPS codes from the census must be joined to
create a value that matches the boundary codes in the program's mapping data.
Thus, if one wanted to map tracts in Los Angeles County, one would have to join
the state, county, and tract codes to create an identifying label. For example,
060371101.02 would specify tract 1101.02 in county 037 in state 06. The label
in the census file must match the label in the mapping file exactly or the
census data will not load into the mapping program. One convenient approach is
to export a list of geographic unit labels from the mapping program to check
the format needed for the census data. Often this list can be pasted directly
into the census data table, although some record checking is usually required
to account for unlabeled areas.
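A minimal Python sketch of the joining and checking steps follows. It assumes the mapping program expects the dotted form shown above (060371101.02); other programs may want a different layout, so compare against an exported label list first. The function names and the sample labels are hypothetical.

```python
def tract_join_key(state, county, tract):
    """Join state, county, and tract codes into a label like 060371101.02."""
    state = str(state).strip().zfill(2)    # 2-digit state FIPS code
    county = str(county).strip().zfill(3)  # 3-digit county FIPS code
    digits = str(tract).strip().replace(".", "")
    if len(digits) == 4:
        digits += "00"  # assumption: an unsuffixed tract is written with 00
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"unexpected tract code: {tract!r}")
    return f"{state}{county}{digits[:4]}.{digits[4:]}"


def compare_keys(census_keys, map_keys):
    """Return (labels missing from the map, labels missing from the census)."""
    census_keys, map_keys = set(census_keys), set(map_keys)
    return sorted(census_keys - map_keys), sorted(map_keys - census_keys)


# Made-up example: two census tracts and a label list exported from a map.
census_labels = [tract_join_key("06", "037", t) for t in ("110101", "110102")]
map_labels = ["060371101.02", "060371102.00"]

missing_from_map, missing_from_census = compare_keys(census_labels, map_labels)
print(missing_from_map)
print(missing_from_census)
```

Listing the mismatches in both directions makes it easier to spot unlabeled areas in the map as well as census records that have no boundary.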
The Census Bureau publishes a large list of FIPS codes in its Geographic
Identification Code Scheme publication. Each census data record contains an
area name field (ANPSADPI), so locations can be found by name directly in the
census files.
Considerable care is needed in comparing data from the 1980 and 1990 censuses
at the tract level, not only because of split or aggregated tracts, but also
because some boundaries have been shifted contrary to policy. The California
Department of Finance maintains Tract Equivalency Files that can be used to
locate where changes have occurred.
Public-Use Microdata Sample Files
Many spreadsheet programs seem incapable of dealing with the hierarchical
structure of PUMS, in which each household record is followed by the records
of the persons in that household. For this reason a modified structure has
been created for the files stored at the Social Sciences Database Archive. In
the SSDBA files, the household record has been appended to each person record.
This greatly enlarges the database while making it easier to work with. It also
means that a user must restrict tabulations to householder records when
tabulating housing data. Otherwise calculations will be based on duplicate
housing records appended to each person in the household.
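The effect of the duplicated housing records can be seen in a small Python sketch with made-up data. RELAT1 equal to 0 marks the householder and TENURE is a housing variable, as in the SPSS programs later in this document; the HH household identifier and the function name are hypothetical.

```python
from collections import Counter

HOUSEHOLDER = 0  # RELAT1 value for the householder record


def tenure_counts(person_records, householders_only=True):
    """Count housing units by TENURE, one record per household by default."""
    rows = person_records
    if householders_only:
        rows = [r for r in rows if r["RELAT1"] == HOUSEHOLDER]
    return Counter(r["TENURE"] for r in rows)


# Made-up flattened file: household 1 has three persons, household 2 has one.
records = [
    {"HH": 1, "RELAT1": 0, "TENURE": 1},
    {"HH": 1, "RELAT1": 1, "TENURE": 1},
    {"HH": 1, "RELAT1": 2, "TENURE": 1},
    {"HH": 2, "RELAT1": 0, "TENURE": 3},
]

print(tenure_counts(records))                           # one count per household
print(tenure_counts(records, householders_only=False))  # inflated by household size
```

Without the restriction, the three-person household is counted three times, so larger households are over-represented in any housing tabulation.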
One also needs to be particularly conscious of the population serving as the
universe when trying to replicate aggregations used by the U.S. Census. For
example, employment data should be tabulated only for persons age 16 and over
who are employed as civilians. One could also limit the tabulations to those
employed full-time.
Since customized populations can be created from the PUMS files, some care is
needed in dealing with the significance of very small counts, especially when
PUMAs are being used. Chapter 3 of the PUMS Codebook suggests procedures for
dealing with this issue.
Census Data in the Social Sciences Database Archive
The Social Sciences Database Archive contains a number of digital files from
the U.S. Census Bureau. These include the PUMS and Summary Tape Files for
1980 and 1990 as well as County-City Databooks and Current Population Estimates.
Not all STF and PUMS files are available through the SSDBA for all states.
Data files are in SPSS format, and in most cases additional files provide
the dictionaries and codebooks necessary to extract information from the files.
A few of the files have SPSS programs for reading the data, and these may
be expanded to carry out various procedures. See Appendix B for a description
of the available data sets. The following table describes the current location
and availability of the 1990 STF and PUMS resources in the SSDBA.
1990 STF and PUMS Resources in the SSDBA
This table describes the locations of various census resources within the
SSDBA. There are four types of files, which are indicated with the following
codes:
Data: the file containing the database. All are SPSS system files. "By request"
means that the data are not directly accessible.
Cbk: the codebook describing the database
Dic: a data dictionary for the database
Prog: an SSDBA program that will describe the contents of the database. SPSS
statistics commands can be appended to it.
STF1a CA
Data: /usr/ssdba/ssdba46/c90stf1a-ca.sys
Cbk: /ssdba-data/docs/codebooks/c90stf1.cb
Dic: /ssdba-data/docs/codebooks/c90stf1a.dic

STF1b CA
Data: By request
Cbk: /ssdba-data/docs/codebooks/c90stf1.cb
Dic: /ssdba-data/docs/codebooks/c90stf1b.dic
Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.uspss
Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.usas
Prog: /ssdba-data/docs/programs/untested/c90stf1b-2.uspss
Prog: /ssdba-data/docs/programs/untested/c90stf1b-2.usas

STF1c U.S.
Data: /usr/ssdba/ssdba42/c90stf1c.sys
Cbk: /ssdba-data/docs/codebooks/c90stf1c.cb

STF2a U.S.
Data: By request

STF3a
Los Angeles and Orange Counties only Data: /usr/ssdba/ssdba37/c90stf3a-ca1.sys
Other CA Counties Data: /usr/ssdba/ssdba38/c90stf3a-ca2.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb
Prog: /ssdba-data/docs/programs/icp9782.spss

STF3b
ZIPs beginning with 8 or 9 Data: /usr/ssdba/ssdba76/c90stf3b.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb

STF3c
NE U.S. Data: /usr/ssdba/ssdba37/c90stf3c-a.sys
Rest of U.S. Data: /usr/ssdba/ssdba47/c90stf3c-b.sys
Cbk: /ssdba-data/docs/codebooks/c90stf3.cb

STF4a CA
B recs, All persons Data: /usr/ssdba/ssdba50/c90stf4a-t1.sys
B recs, White Data: /usr/ssdba/ssdba48/c90stf4a-t2.sys
B recs, Black Data: /usr/ssdba/ssdba65/c90stf4a-t3.sys
B recs, Am Inds Data: /usr/ssdba/ssdba61/c90stf4a-t4.sys
B recs, Asian Data: /usr/ssdba/ssdba49/c90stf4a-t5.sys
B recs, Other Race Data: /usr/ssdba/ssdba51/c90stf4a-t6.sys
B recs, Hispanic Data: /usr/ssdba/ssdba61/c90stf4a-t7.sys
B recs, NonHisp Wh Data: /usr/ssdba/ssdba51/c90stf4a-t8.sys
B recs, NonHisp Bl Data: /usr/ssdba/ssdba53/c90stf4a-t9.sys
B recs, NonHisp Oth Data: /usr/ssdba/ssdba49/c90stf4a-t10.sys
A recs, All persons Data: /usr/ssdba/ssdba53/c90stf4a-t11.sys
Cbk: None. See Census Docs.

PUMS CA
Person & Housing Recs Data: /c/census/c90pums-p5.sys
Housing Recs only Data: /c/census/c90pums-hr
Cbk: /ssdba-data/docs/codebooks/c90pums-p1.frq
STF and PUMS SPSS Programs in the SSDBA
The programs below are located in the following directory:
/ssdba-data/docs/programs
c10pums.spss
c40pums.spss
c50pums.spss
c60pums.spss
c70pums.spss
c80pums-us.spss
c80stf1a.spss
c80stf1c.spss
c80stf3a.spss
c80stf3c.spss
c80stf4a.spss
c90eeo.spss
c90geog.spss
c90pums-5.sas
c90pums-5.spss
c90pums-h.spss
c90pums-p.spss
c90pums.bak
c90stf1a.spss
c90stf1c.spss
c90stf3a.spss
c90stf3c.spss
c90stf4a-a.spss
c90stf4a-b.spss
c90tract.spss
SPSS Programs for Extracting Data from the Sample Databases and from the SSDBA Archive
To extract variables from one of the census files you need to know the variable
names. One easy way to get these is to execute a DISPLAY DICTIONARY command,
which lists the contents and formats of a database. Such a listing is
necessary because the variable names have slightly different formats in the
various Summary Tape Files in the SSDBA.
Data Dictionaries
The following program will create a dictionary of an STF4a file on the Unix
version of SPSS at the SSDBA.
get file '/usr/ssdba/ssdba50/c90stf4a-t1.sys'
/keep=all.
display dictionary.
finish.
Program to Read PUMS Extract and Crosstab Ethnicity by Occupation
The following two programs were used to read the SSDBA PUMS file and create
some crosstabulations. Note that the table produced by the first program is for
all persons who were employed, not just civilian employed. A non-Hispanic white
category was created by using the Hispanic and Race variables. This program
could be copied and pasted into the PC version of SPSS.
get file 'cpuma5200'
  /keep=SEX RACE HISPANIC OCCUP.
* Recode race into ETHNIC; Hispanic-origin codes 8 to 11 are filled in below.
recode RACE (2=2) (6,7=3) (8=5) (9=6) (11=7) into ETHNIC.
* Compute Non-Hispanic White Category.
compute NH = 0.
if (HISPANIC eq 0 or HISPANIC eq 199) NH = 1.
if (RACE eq 1 and NH eq 1) ETHNIC = 1.
recode HISPANIC (1,210 thru 220=8) (2,261=9) (3,271=10) (226=11) into ETHNIC.
* Group detailed occupation codes into 16 categories.
recode OCCUP (0 thru 37=1) (38 thru 106, 164 thru 199=2) (107 thru 163=3)
  (200 thru 208=4) (209 thru 353, 356 thru 389=5) (354,355=6) (403 thru 407=7)
  (413 thru 427, 456 thru 469=8) (433 thru 444=9) (445 thru 447=10)
  (503 thru 549=11) (553 thru 599=12) (628 thru 699=13) (703 thru 799=14)
  (800 thru 859=15) (866 thru 889=16) into OCC.
value labels ETHNIC 1 'NhW' 2 'Blk' 3 'Chi+Tai' 5 'Fil' 6 'Jap' 7 'Kor'
  8 'Mex' 9 'PR' 10 'Cub' 11 'Sal'.
value labels OCC 1 'ExMgt' 2 'Prof' 3 'Teach' 4 'HlthTch' 5 'SaleTch'
  6 'PostOf' 7 'PrHHS' 8 'Serv' 9 'FoodPr' 10 'HealSr' 11 'Mechan' 12 'Const'
  13 'PrecProd' 14 'MachOp' 15 'Trans' 16 'Helper'.
* Create Occupations by Ethnic by Gender.
crosstabs variables= ETHNIC(1,11) SEX(0,1) OCC(1,16)/
  tables=OCC by ETHNIC by SEX/
  cells=COUNT ROW COLUMN.
finish.
Program to Crosstab Ethnicity by Income Categories
get file 'cpuma5200'
  /keep=PUMA AGE RACE HISPANIC RELAT1 RHHINC.
recode RACE (2=2) (4,5,301 thru 327=3) (6,7=4) (8=6) (9=7) (10=8) (11=9)
  into ETHNIC.
* Compute Non-Hispanic White Category.
compute NH = 0.
if (HISPANIC eq 0 or HISPANIC eq 199) NH = 1.
if (RACE eq 1 and NH eq 1) ETHNIC = 1.
recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
  (222=5) (223=6) (224=7) (226=8) into HISP.
* Group household income (dollars) into 15 categories.
recode RHHINC (1 thru 4999=1) (5000 thru 9999=2) (10000 thru 14999=3)
  (15000 thru 19999=4) (20000 thru 24999=5) (25000 thru 29999=6)
  (30000 thru 34999=7) (35000 thru 39999=8) (40000 thru 44999=9)
  (45000 thru 49999=10) (50000 thru 59999=11) (60000 thru 79999=12)
  (80000 thru 99999=13) (100000 thru 150000=14) (150001 thru HIGHEST=15)
  into HHINC.
value labels HHINC 1 '<5' 2 '5-9' 3 '10-14' 4 '15-19' 5 '20-24' 6 '25-29'
  7 '30-34' 8 '35-39' 9 '40-44' 10 '45-49' 11 '50-59' 12 '60-79' 13 '80-99'
  14 '100-150' 15 '150+'.
value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi+Tai' 6 'Fil' 7 'Jap'
  8 'AsInd' 9 'Kor'.
value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
  8 'Sal'.
* For household income use persons 16+ years.
select if (AGE ge 16).
* Select head of household records.
select if (RELAT1 eq 0).
* Create Household Income by Ethnic and Hispanic Tables.
crosstabs variables= ETHNIC(1,9) HISP(1,8) HHINC(1,15)/
  tables=HHINC by ETHNIC HISP/
  cells=COUNT ROW COLUMN.
finish.
Program to Compute Several Summary Variables
The following program reads the 5% PUMS file for California, computes several
basic summary variables for the five-county Southern California area, and
produces frequencies of the values for selected variables.
get file '/c/census/c90pums-p5.sys'
  /keep=PUMA AGE RACE HISPANIC ANCSTRY1 OCCUP CLASS ENGLISH POB YEARSCH CITIZEN
  RELAT1 PERSONS TENURE.
* Select PUMAS in the five-county Southern California area.
select if ((PUMA ge 4200 and PUMA le 4808) or (PUMA ge 5200 and PUMA le 7207)).
recode RACE (2=2) (4,5,301 thru 327=3) (6=4) (7=5) (8=6) (9=7) (10=8) (11=9)
  (12=10) (13=11) (15=12) (16=13) (19=14) (22=15) (25=16) (26=17) into ETHNIC.
* Compute Non-Hispanic White Category.
compute NH = 0.
if (HISPANIC eq 0 or HISPANIC eq 199) NH = 1.
if (RACE eq 1 and NH eq 1) ETHNIC = 1.
recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
  (222=5) (223=6) (224=7) (226=8) (231 thru 249=9) into HISP.
recode ANCSTRY1 (15,22=1) (148 thru 150=2) (302=3) (308=4) (360=5) (416=6)
  (419=7) (431=8) (434=9)
  (400 thru 415,417,418,421 thru 430,435 thru 481,490 thru 499=10)
  (522,523=11) (553 thru 558=12) (800 thru 802=13) into ANC.
value labels ANC 1 'Eng' 2 'Rus' 3 'Belz' 4 'Jam' 5 'Brz' 6 'Ira' 7 'Isr'
  8 'Arm' 9 'Tur' 10 'Arab' 11 'Eth' 12 'Nig' 13 'Aus'.
value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi' 5 'Taiw' 6 'Fil'
  7 'Jap' 8 'AsInd' 9 'Kor' 10 'Vie' 11 'Cam' 12 'Lao' 13 'Tha' 14 'Indo'
  15 'Pak' 16 'Haw' 17 'Sam'.
value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
  8 'Sal' 9 'SoAm'.
* Tabulate various stats for 5-co area by ethnic groups.
* Each flag is 1 when the condition holds and missing otherwise.
if (AGE ge 18) VAR01 = 1.
if (AGE ge 25) VAR02 = 1.
if ((ENGLISH eq 0 or ENGLISH eq 1) and AGE ge 18) VAR03 = 1.
if ((YEARSCH ge 14 and YEARSCH le 17) and AGE ge 25) VAR04 = 1.
if (CITIZEN eq 3 or CITIZEN eq 4) FB = 1.
if (FB eq 1 and AGE ge 25) VAR05 = 1.
if (OCCUP ge 3 and OCCUP le 199) OC = 1.
if (CLASS ge 1 and CLASS le 8) CL = 1.
if (OC eq 1 and CL eq 1 and AGE ge 25) VAR07 = 1.
* Housing measures use householder records only.
if (PERSONS ge 1 and RELAT1 eq 0) VAR08 = 1.
if ((TENURE eq 1 or TENURE eq 2) and RELAT1 eq 0) VAR09 = 1.
if ((TENURE ge 1 and TENURE le 4) and RELAT1 eq 0) VAR10 = 1.
variable labels
  VAR01 'Pers 18+' VAR02 'Pers 25+' VAR03 'SpkEngO/VW/18' VAR04 'ColEd/25'
  VAR05 'ForB/25' VAR07 'AdmExProf/25' VAR08 'Hsehldrs' VAR09 'OwnOccHU'
  VAR10 'OccHU' FB 'ForBorn'.
value labels
  VAR01 1 'Pers 18+'/ VAR02 1 'Pers 25+'/ VAR03 1 'SpkEng'/ VAR04 1 'ColEduc'/
  VAR05 1 'ForBor25'/ VAR07 1 'AdmPrOcc'/ VAR08 1 'HsHlds'/ VAR09 1 'OwnOcH'/
  VAR10 1 'OccHU'/ FB 1 'ForBorn'/ OC 1 'ProfOcc'/ CL 1 'Worker'.
frequencies
  variables=ETHNIC (1,17) HISP (1,9) ANC (1,13) CLASS (0,9) CITIZEN (0,4)
  FB (1,1) VAR01 to VAR10 (1,1)/ barchart/ format=condense.
finish.
SPSS on Venus at the SSDBA
Because the PUMS file is quite large and because some time is required to
access STF databases, a program may well take 30 to 45 minutes to finish
execution. An alternative to waiting at the terminal is to submit your SPSS
file as a batch job. This can be done by entering the following statement at
the Unix prompt.
nohup spss -m -s 48M < program_file > listing_file &
The program file contains the SPSS program statements and the listing file
receives the results of the program execution. If a data output file is
created, it will be saved under the name given in the program. Be aware that
successive runs will fail if the program attempts to create a new file with the
same name as an existing file.
The SPSS program on Venus also serves as a good editor for entering and
updating programs. A number of basic edit functions are available by pressing
the Escape key followed by a number. Esc-1 can be used to view directory
contents. Esc-2 switches between the output and input screens. Esc-3 reads in
an existing file, Esc-9 saves an edited file, and Esc-0 runs a program from the
editor and quits the editor. Pressing Escape twice cancels a command selection.