Appendix A -- Detailed Descriptions and Access to Major Files

CENS: Appendix A
Last Modified 17 August 1998
  1. Summary Tape Files

  2. Short-Form Files

    STF1 contains 36 tables of population information and 45 tables of housing
    information. Each of these 81 tables is repeated for each geographic unit
    within a file. STF2 is much like STF1 except that it contains more table
    categories. It also contains a and b record types for both person and
    housing data. The a record type is the tabulation for the entire population,
    while the b record type is a tabulation for a particular ethnic group. One
    form of STF2 contains tabulations for 9 ethnic groups while another contains
    tabulations for up to 28 ethnic groups. STF2 contains 13 a-type person
    tables, 28 a-type housing tables, 27 b-type person tables, and 27 b-type
    housing tables.


    STF3 contains 170 person tables and 92 housing tables. Like STF2, STF4
    contains a and b record types, which make it by far the largest and most
    detailed of the Summary Tape Files. It contains 122 a-type person tables,
    76 a-type housing tables, 161 b-type person tables, and 74 b-type housing
    tables. One form of STF4 contains a and b tabulations for 9 ethnic groups
    while another contains tabulations for up to 48 groups. The table below
    indicates the relative sizes of the four Summary Tape Files by the number
    of cells of data for each geographic unit.







  3. Geography in Summary Tape Files

  4. The table below indicates the smallest level of geography contained in each
    of the data files.

    File     Smallest level of geography
    -----    ----------------------------------------
    STF1a    Block Group
    STF1b    Block
    STF1c    County and Place > 10,000
    STF1d    Congressional District
    STF2a    Tract
    STF2b    County Subdivision and Place > 1000
    STF2c    County and Place > 10,000
    STF3a    Block Group
    STF3b    ZIP Code
    STF3c    County Subdivision and Place > 10,000
    STF3d    Congressional District
    STF4a    Tract
    STF4b    County Subdivision and Place > 2500
    STF4c    County Subdivision > 10,000

    Appendix D lists the Summary Level Codes and the Geographic Component Codes
    for the STF3 data files. Note, for example, that the State code has
    potentially five records. If only the state total record is desired, then
    the Geographic Component Code must be set to 00.

    Appendix D indicates the geographical hierarchy followed to reach the final
    geographic unit. Of particular importance is the hierarchy used to reach
    census tracts and block groups. Many tract boundaries are split by
    incorporated place boundaries. This means that selecting Summary Level Code
    080 or 090 will result in many more geographic units than selecting Summary
    Level Code 140 or 150. The latter codes are for complete tracts, while the
    former might be useful when studying tracts only within a specific city.
    Accidentally selecting the split tracts will cause considerable difficulty
    with most mapping programs, since they typically use only complete tract
    boundary files. Statistical computations may also be affected because of
    numerous zero values in parts of the split tracts.

    There are also differences in some of the Summary Level Codes when accessing
    STF4. Complete tracts in STF4a have the Summary Level Code of 141, places
    over 2500 persons in STF4b have the Summary Level Code of 163, and places
    over 10,000 in STF4c have the Summary Level Code of 161. Summary Level Codes
    are found on page 6-1 of U.S. Census Summary Tape File Codebooks. A
    discussion of census geography can be found in Appendix A of the same
    documentation.

    Coding Geographic Units - FIPS Codes

    For various types of features within a state or county, FIPS code numbers
    increase according to the name of a location in alphabetical order. For
    example, Alameda County in California is 001 and Amador County is 003. A
    census tract FIPS code consists of six digits. The first four identify a
    tract and the last two serve as a suffix. Tracts that are split in a later
    census because of increased population would have suffixes of .01 and .02,
    such as 1101.01 and 1101.02. In some cases split tracts have been split a
    second time. The suffix may also have values from .80 to .98, indicating
    that the tract was created by modifying an existing boundary. A value of
    .99 indicates persons aboard a ship at the time the census was taken.
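
    The suffix rules above can be sketched in a few lines of Python. This is
    only an illustration: the function name describe_tract is hypothetical, and
    treating a 00 suffix as an unsplit tract is an assumption that the text
    above does not state.

```python
def describe_tract(code: str) -> dict:
    """Split a six-digit tract code into its four-digit number and suffix."""
    digits = code.replace(".", "")  # accept "1101.02" or "110102"
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected six digits, got {code!r}")
    basic, suffix = digits[:4], int(digits[4:])
    if suffix == 0:
        kind = "unsplit tract"  # assumption, not stated in the text
    elif suffix == 99:
        kind = "persons aboard a ship"
    elif 80 <= suffix <= 98:
        kind = "created by modifying an existing boundary"
    else:
        kind = "split tract"
    return {"basic": basic, "suffix": f"{suffix:02d}", "kind": kind}


print(describe_tract("1101.02"))
```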

    In most cases a FIPS code must be specified to subset a desired geographic
    unit or units from a census file. Occasionally more than one FIPS code must
    be used to create a desired data set. For example, to get the tracts of Los
    Angeles County, one would request a Summary Level Code of 140 and a county
    FIPS Code of 037. To get the data for the state of California, one might
    specify a Summary Level Code of 040, a Geographic Component Code of 00, and
    a state FIPS Code of 06.
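
    As a rough sketch of this kind of subsetting, the Python fragment below
    filters a handful of invented records by the same codes. The field names
    (sumlev, geocomp, state, county, tract) are hypothetical stand-ins, not the
    variable names used in the STF codebooks, and the records are made up for
    the example.

```python
# Invented sample records; field names are hypothetical, not codebook names.
records = [
    {"sumlev": "040", "geocomp": "00", "state": "06", "county": None, "tract": None},
    {"sumlev": "040", "geocomp": "01", "state": "06", "county": None, "tract": None},
    {"sumlev": "140", "geocomp": "00", "state": "06", "county": "037", "tract": "110101"},
    {"sumlev": "140", "geocomp": "00", "state": "06", "county": "037", "tract": "110102"},
    {"sumlev": "140", "geocomp": "00", "state": "06", "county": "001", "tract": "400100"},
]


def subset(rows, **criteria):
    """Return the rows whose fields equal every keyword criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


# Tracts of Los Angeles County: Summary Level 140, county FIPS 037.
la_tracts = subset(records, sumlev="140", county="037")

# State total for California: Summary Level 040, component 00, state FIPS 06.
state_total = subset(records, sumlev="040", geocomp="00", state="06")

print(len(la_tracts), len(state_total))
```

    Leaving out the Geographic Component Code in the second call would return
    both state-level records rather than only the state total.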

    For most mapping programs, FIPS codes from the census must be joined to
    create a matching value for the boundary codes in the program's mapping
    data. Thus, if one wanted to map tracts in Los Angeles County, one would
    have to join the state, county, and tract FIPS codes to create an
    identifying label. For example, 060371101.02 would specify tract 1101.02 in
    county 037 in state 06. The number for the census file must match the number
    in the mapping file exactly or the census data will not load into the
    mapping program. One convenient approach is to export a list of geographic
    unit labels from the mapping program to check on the format needed for the
    census data. Often this list can be pasted directly into the census data
    table, although some record checking is usually required to account for
    unlabeled areas.
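
    A minimal Python sketch of building such a label and checking it against a
    list exported from a mapping program follows. The function name tract_key
    and the sample labels are hypothetical, and the label layout (state, county,
    then tract with a decimal point) simply follows the 060371101.02 example
    above; a particular mapping program may expect a different layout.

```python
def tract_key(state: str, county: str, tract: str) -> str:
    """Join state, county, and six-digit tract codes into one label."""
    digits = tract.replace(".", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit tract code, got {tract!r}")
    # Pad state to 2 digits and county to 3 so leading zeros are kept.
    return f"{state.zfill(2)}{county.zfill(3)}{digits[:4]}.{digits[4:]}"


# Hypothetical labels exported from a mapping program.
map_labels = {"060371101.01", "060371101.02"}

census_keys = [tract_key("06", "037", t) for t in ("110101", "110102", "110200")]
unmatched = [k for k in census_keys if k not in map_labels]

print(census_keys)
print(unmatched)
```

    Any label left in unmatched points to a record that would fail to load, so
    listing those first narrows the record checking described above.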

    The Census Bureau publishes a large list of FIPS codes in its Geographic
    Identification Coding Scheme publication. Each census data record contains
    an area name (ANPSADPI) so that names of locations can be directly located
    in the census files.

    Care is needed in comparing data from the 1980 and 1990 censuses at the tract
    level, not only because of split or aggregated tracts, but because some boundaries
    have been shifted contrary to policy. The California Department of Finance
    maintains Tract Equivalency Files that can be used to locate where changes
    have occurred.


  5. Public-Use Microdata Sample Files

  6. Many spreadsheet programs seem incapable of dealing with the structure of
    PUMS. For this reason a modified structure has been created for the files
    stored at the Social Sciences Database Archive. In the SSDBA files, the household
    record has been appended to each person record. This has the effect of greatly
    enlarging the database while making it easier to work with. It also means
    that a user must restrict tabulations to include only heads of household records
    when tabulating housing data. Otherwise calculations will be based on duplicate
    housing records appended to each person in the household.
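
    The effect of the duplicated housing records can be seen in a small Python
    sketch. The records are invented and the serialno field is a hypothetical
    household identifier. RELAT1 equal to 0 as the head-of-household marker and
    the TENURE variable follow their use in the SPSS programs later in this
    appendix.

```python
from collections import Counter

# Invented person records, each carrying its household's TENURE value.
persons = [
    {"serialno": 1, "relat1": 0, "tenure": 1},
    {"serialno": 1, "relat1": 1, "tenure": 1},
    {"serialno": 1, "relat1": 2, "tenure": 1},
    {"serialno": 2, "relat1": 0, "tenure": 3},
    {"serialno": 2, "relat1": 1, "tenure": 3},
]

# Naive tabulation counts each household once per person in it.
naive = Counter(p["tenure"] for p in persons)

# Restricting to head-of-household records counts each household once.
heads = [p for p in persons if p["relat1"] == 0]
by_household = Counter(p["tenure"] for p in heads)

print(dict(naive))
print(dict(by_household))
```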

    One also needs to be particularly conscious of the population serving as the
    universe when trying to replicate aggregations used by the U.S. Census. For
    example, employment data should be tabulated only for persons over age 15
    who are employed as civilians. One could also limit the tabulations to those
    employed full-time.

    Since customized populations can be created from the PUMS files, some care
    is needed in dealing with the significance of very small counts, especially
    when PUMAs are being used. Chapter 3 of the PUMS Codebook suggests
    procedures for dealing with this.


  7. Census Data in the Social Sciences Database Archive

  8. The Social Sciences Database Archive contains a number of digital files from
    the U.S. Census Bureau. These include the PUMS and Summary Tape Files for
    1980 and 1990 as well as County-City Databooks and Current Population Estimates.
    Not all STF and PUMS files are available through the SSDBA for all states.
    Data files are in SPSS format, and in most cases additional files provide
    dictionaries and codebooks necessary to extract information from the files.
    A few of the files have SPSS programs for reading the data, and these may
    be expanded to carry out various procedures. See Appendix B for a description
    of the available data sets. The following table describes the current location
    and availability of the 1990 STF and PUMS resources in the SSDBA.

    1990 STF and PUMS Resources in the SSDBA

    This table describes the locations of various census resources within the
    SSDBA. There are four types of files, which are indicated with the following
    codes:

    Data: the file containing the database. All are SPSS system files. "By
    request" means that the data are not directly accessible.

    Cbk: the codebook describing the database.

    Dic: a data dictionary for the database.

    Prog: an SSDBA program that will describe the contents of the database. SPSS
    statistics commands can be appended to it.

    STF1a CA
      Data:
      Cbk:  /ssdba-data/docs/codebooks/c90stf1.cb
      Dic:  /ssdba-data/docs/codebooks/c90stf1a.dic

    STF1b CA
      Data: By request
      Cbk:  /ssdba-data/docs/codebooks/c90stf1.cb
      Dic:  /ssdba-data/docs/codebooks/c90stf1b.dic
      Prog: /ssdba-data/docs/programs/untested/c90stf1b-1.uspss

    STF1c U.S.
      Data:
      Cbk:  /ssdba-data/docs/codebooks/c90stf1c.cb

    STF2a U.S.
      Data: By request

    STF3a
      Los Angeles and Orange Counties only  Data: /usr/ssdba/ssdba37/c90stf3a-ca1.sys
      Other CA Counties                     Data: /usr/ssdba/ssdba38/c90stf3a-ca2.sys
      Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb
      Prog: /ssdba-data/docs/programs/icp9782.spss

    STF3b ZIPs beginning with 8 or 9
      Data: /usr/ssdba/ssdba76/c90stf3b.sys
      Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb

    STF3c
      NE U.S.       Data: /usr/ssdba/ssdba37/c90stf3c-a.sys
      Rest of U.S.  Data: /usr/ssdba/ssdba47/c90stf3c-b.sys
      Cbk:  /ssdba-data/docs/codebooks/c90stf3.cb

    STF4a CA
      B recs, All persons   Data: /usr/ssdba/ssdba50/c90stf4a-t1.sys
      B recs, White         Data: /usr/ssdba/ssdba48/c90stf4a-t2.sys
      B recs, Black         Data: /usr/ssdba/ssdba65/c90stf4a-t3.sys
      B recs, Am Inds       Data: /usr/ssdba/ssdba61/c90stf4a-t4.sys
      B recs, Asian         Data: /usr/ssdba/ssdba49/c90stf4a-t5.sys
      B recs, Other Race    Data: /usr/ssdba/ssdba51/c90stf4a-t6.sys
      B recs, Hispanic      Data: /usr/ssdba/ssdba61/c90stf4a-t7.sys
      B recs, NonHisp Wh    Data: /usr/ssdba/ssdba51/c90stf4a-t8.sys
      B recs, NonHisp Bl    Data: /usr/ssdba/ssdba53/c90stf4a-t9.sys
      B recs, NonHisp Oth   Data: /usr/ssdba/ssdba49/c90stf4a-t10.sys
      A recs, All persons   Data: /usr/ssdba/ssdba53/c90stf4a-t11.sys
      Cbk:  None; see Census Docs.

    PUMS
      Person & Housing Recs  Data: /c/census/c90pums-p5.sys
      Housing Recs only      Data: /c/census/c90pums-hr
      Cbk:  /ssdba-data/docs/codebooks/c90pums-p1.frq

    STF and PUMS SPSS Programs in the SSDBA

    The programs below are located in the following directory:



























  9. SPSS Programs for Extracting Data from the Sample Databases and from the
    SSDBA Archive

  10. In order to extract variables from one of the census files you need to know
    the variable names. One easy way to get these is to execute a DISPLAY DICTIONARY
    command. This command will list out the contents and formats of a database.
    Note that the variable names have slightly different formats between the various
    Summary Tape Files in the SSDBA and so such a listing is necessary.

    Data Dictionaries

    The following program will create a dictionary of an STF4a file on the Unix
    version of SPSS at the SSDBA.

    get file '/usr/ssdba/ssdba50/c90stf4a-t1.sys'.

    display dictionary.



Program to read PUMS extract and crosstab ethnicity by occupation

The following two programs were used to read the SSDBA PUMS file and create
some crosstabulations. Note that the resulting table is for all persons who
were employed, not just civilian employed. A non-Hispanic white category was
created by using the Hispanic and Race variables. These programs could be
copied and pasted into the PC version of SPSS.

    get file 'cpuma5200'.

    recode RACE (2=2) (6,7=3) (8=4) (9=6) (11=7) into ETHNIC.

    * Compute Non-Hispanic White Category.

    compute NH = 0.

    if (HISPANIC eq 0 or HISPANIC = 199) NH = 1.

    if (RACE eq 1 and NH eq 1) ETHNIC = 1.

    recode HISPANIC (1,210 thru 220=8) (2,261=9) (3,271=10) (226=11) into ETHNIC.

    recode OCCUP (0 thru 37=1) (38 thru 106, 164 thru 199=2) (107 thru 163=3)
    (200 thru 208=4) (209 thru 353, 356 thru 389=5) (354,355=6) (403 thru 407=7)
    (413 thru 427, 456 thru 469=8) (433 thru 444=9) (445 thru 447=10) (503 thru
    549=11) (553 thru 599=12) (628 thru 699=13) (703 thru 799=14) (800 thru 859=15)
    (866 thru 889=16) into OCC.

    value labels ETHNIC 1 'NhW' 2 'Blk' 3 'Chi+Tai' 4 'Fil' 6 'Jap' 7 'Kor' 8
    'Mex' 9 'PR' 10 'Cub' 11 'Sal'.

    value labels OCC 1 'ExMgt' 2 'Prof' 3 'Teach' 4 'HlthTch' 5 'SaleTch' 6 'PostOf'
    7 'PrHHS' 8 'Serv' 9 'FoodPr' 10 'HealSr' 11 'Mechan' 12 'Const' 13 'PrecProd'
    14 'MachOp' 15 'Trans' 16 'Helper'.

    * Create Occupations by Ethnic by Gender.

    crosstabs variables= ETHNIC(1,11) SEX(0,1) OCC(1,16)/ tables=OCC by ETHNIC.



Program to Crosstab Ethnicity by Income Categories


      get file 'cpuma5200'.

      recode RACE (2=2) (4,5,301 thru 327=3) (6,7=4) (8=6) (9=7) (10=8) (11=9)
      into ETHNIC.

      compute NH = 0.

      if (HISPANIC eq 0 or HISPANIC = 199) NH = 1.

      if (RACE eq 1 and NH eq 1) ETHNIC = 1.

      recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
      (222=5) (223=6) (224=7) (226=8) into HISP.

      recode RHHINC (1 thru 4999=1) (5000 thru 9999=2) (10000 thru 14999=3) (15000
      thru 19999=4) (20000 thru 24999=5) (25000 thru 29999=6) (30000 thru 34999=7)
      (35000 thru 39999=8) (40000 thru 44999=9) (45000 thru 49999=10) (50000 thru
      59999=11) (60000 thru 79999=12) (80000 thru 99999=13) (100000 thru 150000=14)
      (150001 thru HIGHEST=15) into HHINC.

      value labels HHINC 1 '<5' 2 '5-9' 3 '10-14' 4 '15-19' 5 '20-24' 6 '25-29'
      7 '30-34' 8 '35-39' 9 '40-44' 10 '45-49' 11 '50-59' 12 '60-79' 13 '80-99'
      14 '100-150' 15 '150+'.

      value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi+Tai' 6 'Fil' 7 'Jap'
      8 'AsInd' 9 'Kor'.

      value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
      8 'Sal'.

      * For household income use persons 16+ years.

      select if (AGE ge 16).

      * Select head of household records.

      select if RELAT1 = 0.

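      * Create Household Income by Ethnic and Hispanic Tables.

      * The table list is cut off in the source after "by". ETHNIC and HISP
        are assumed here from the comment above.

      crosstabs variables= ETHNIC(1,9) HISP(1,8) HHINC(1,15)/ tables=HHINC by
      ETHNIC HISP.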



    Program to Compute Several Summary Variables

    The following program reads the 5% PUMS file for the state of California,
    selects the PUMAs in the five-county Southern California area, computes
    several basic summary variables, and produces frequencies of the values for
    selected variables.

    get file '/c/census/c90pums-p5.sys'.

    * Select PUMAS in the five-county Southern California area.

    select if ((PUMA ge 4200 and PUMA le 4808) or (PUMA ge 5200 and PUMA le 7207)).

    recode RACE (2=2) (4,5,301 thru 327=3) (6=4) (7=5) (8=6) (9=7) (10=8) (11=9)
    (12=10) (13=11) (15=12) (16=13) (19=14) (22=15) (25=16) (26=17) into ETHNIC.

    compute NH = 0.

    if (HISPANIC eq 0 or HISPANIC = 199) NH = 1.

    if (RACE eq 1 and NH eq 1) ETHNIC = 1.

    recode HISPANIC (1,210 thru 220=1) (2,261=2) (3,271=3) (221,225,227,228,229=4)
    (222=5) (223=6) (224=7) (226=8) (231 thru 249=9) into HISP.

    recode ANCSTRY1 (15,22=1) (148 thru 150=2) (302=3) (308=4) (360=5) (416=6)
    (419=7) (431=8) (434=9) (400 thru 415,417,418,421 thru 430,435 thru 481,490
    thru 499=10) (522,523=11) (553 thru 558=12) (800 thru 802=13) into ANC.

    value labels ANC 1 'Eng' 2 'Rus' 3 'Belz' 4 'Jam' 5 'Brz' 6 'Ira' 7 'Isr'
    8 'Arm' 9 'Tur' 10 'Arab' 11 'Eth' 12 'Nig' 13 'Aus'.

    value labels ETHNIC 1 'NhW' 2 'Blk' 3 'InAlEs' 4 'Chi' 5 'Taiw' 6 'Fil'
    7 'Jap' 8 'AsInd' 9 'Kor' 10 'Vie' 11 'Cam' 12 'Lao' 13 'Tha' 14 'Indo'
    15 'Pak' 16 'Haw' 17 'Sam'.

    value labels HISP 1 'Mex' 2 'PR' 3 'Cub' 4 'CenAm' 5 'Gua' 6 'Hon' 7 'Nic'
    8 'Sal' 9 'SoAm'.

    * Tabulate various stats for 5-co area by ethnic groups.

    if (AGE ge 18) VAR01 = 1.

    if (AGE ge 25) VAR02 = 1.

    if ((ENGLISH eq 0 or ENGLISH eq 1) and AGE ge 18) VAR03 = 1.

    if ((YEARSCH ge 14 and YEARSCH le 17) and AGE ge 25) VAR04 = 1.

    if (CITIZEN eq 3 or CITIZEN eq 4) FB = 1.

    if (FB eq 1 and AGE ge 25) VAR05 = 1.

    if (OCCUP ge 3 and OCCUP le 199) OC = 1.

    if (CLASS ge 1 and CLASS le 8) CL = 1.

    if (OC eq 1 and CL eq 1 and AGE ge 25) VAR07 = 1.

    if (PERSONS ge 1 and RELAT1 eq 0) VAR08 = 1.

    if ((TENURE eq 1 or TENURE eq 2) and RELAT1 eq 0) VAR09 = 1.

    if ((TENURE ge 1 and TENURE le 4) and RELAT1 eq 0) VAR10 = 1.

    variable labels
    VAR01 'Pers 18+' VAR02 'Pers 25+' VAR03 'SpkEngO/VW/18' VAR04 'ColEd/25'
    VAR05 'ForB/25' VAR07 'AdmExProf/25' VAR08 'Hsehldrs' VAR09 'OwnOccHU' VAR10
    'OccHU' FB 'ForBorn'.

    value labels
    VAR01 1 'Pers 18+'/ VAR02 1 'Pers 25+'/ VAR03 1 'SpkEng'/ VAR04 1 'ColEduc'/
    VAR05 1 'ForBor25'/ VAR07 1 'AdmPrOcc'/ VAR08 1 'HsHlds'/ VAR09 1 'OwnOcH'/
    VAR10 1 'OccHU'/ FB 1 'ForBorn'/ OC 1 'ProfOcc'/ CL 1 'Worker'.

    frequencies variables=ETHNIC (1,17) HISP (1,9) ANC (1,13) CLASS (0,9)
    CITIZEN (0,4) FB (1,1) VAR01 to VAR10 (1,1)/ barchart/ format=condense.



  • Running SPSS on Venus at the SSDBA.

  • Because the PUMS file is quite large and because some time is required to
    access STF databases, it is quite possible that a program will take from 30
    to 45 minutes to finish execution. An alternative to waiting at the terminal
    is to submit your SPSS file as a batch job. This can be done by entering the
    following statement at the Unix prompt.

      nohup spss -m -s 48M < program file > listing file &

    The program file contains the SPSS program statements, and the listing file
    receives the results of the program execution. If a data output file is
    created, it will be saved according to its name in the program. Note that
    successive runs will fail if the program attempts to generate a new file
    with the same name as an existing file.

    The SPSS program on Venus serves as a good editor for entering and updating
    programs. A number of basic edit functions are available through Escape-key
    and number keystrokes. Esc-1 can be used to view directory contents. Esc-2
    will allow you to switch between an output and input screen. Esc-3 inputs an
    existing file, Esc-9 saves an edited file, and Esc-0 runs a program from the
    editor and quits the editor. Two Escape keystrokes will terminate a command
    selection.