Chapter 3 -- Survey Research Design and Quantitative Methods of Analysis for Cross-sectional Data





Last Modified 15 August 1998

Almost everyone has had experience with surveys. Market surveys ask respondents whether
they recognize products and how they feel about them. Political polls ask questions
about candidates for political office or opinions related to political and social
issues. Needs assessments use surveys that identify the needs of groups. Evaluations
often use surveys to assess the extent to which programs achieve their goals.

Survey research
is a method of collecting information by asking questions. Sometimes interviews
are done face-to-face with people at home, in school, or at work. Other times
questions are sent in the mail for people to answer and mail back. Increasingly,
surveys are conducted by telephone.

SAMPLE SURVEYS

Although we want
to have information on all people, it is usually too expensive and time consuming
to question everyone. So we select only some of these individuals and question
them. It is important to select these people in ways that make it likely that
they represent the larger group.

The population
is all the individuals in whom we are interested. (A population does not always
consist of individuals. Sometimes, it may be geographical areas such as all
cities with populations of 100,000 or more. Or we may be interested in all
households in a particular area. In the data used in the exercises of this
module the population consists of individuals who are California residents.)
A sample is the subset of the population involved in a study. In other
words, a sample is part of the population. The process of selecting the sample
is called sampling. The idea of sampling is to select part of the population
to represent the entire population.

The United States
Census is a good example of sampling. The census tries to enumerate all residents
every ten years with a short questionnaire. Approximately every fifth household
is given a longer questionnaire. Information from this sample (i.e., every
fifth household) is used to make inferences about the population. Political
polls also use samples. To find out how potential voters feel about a particular
race, pollsters select a sample of potential voters. This module uses opinions
from three samples of California residents age 18 and over. The data were
collected during July, 1985, September, 1991, and February, 1995, by the Field
Research Corporation (The Field Institute 1985, 1991, 1995). The Field Research
Corporation is a widely-respected survey research firm and is used extensively
by the media, politicians, and academic researchers.

Since a survey
can be no better than the quality of the sample, it is essential to understand
the basic principles of sampling. There are two types of sampling: probability
and nonprobability. A probability sample is one in which each individual
in the population has a known, nonzero chance of being selected in the sample.
The most basic type is the simple random sample. In a simple
random sample, every individual (and every combination of individuals) has
the same chance of being selected in the sample. This is the equivalent of
writing each person's name on a piece of paper, putting them in plastic balls,
putting all the balls in a big bowl, mixing the balls thoroughly, and selecting
some predetermined number of balls from the bowl. This would produce a simple
random sample.
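
The bowl-of-balls procedure can be sketched in a few lines of Python. This is only an illustration: the population list and the function name simple_random_sample are hypothetical, and the standard library's random.sample plays the role of mixing the bowl and drawing without replacement.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical sampling frame: one entry per person in the population.
population = [f"person_{i}" for i in range(1, 10001)]

sample = simple_random_sample(population, 500, seed=1)
print(len(sample))   # 500 people drawn
print(sample[:3])    # the first few names drawn
```

Note that the sketch needs the full list of names, which is exactly the requirement discussed next.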

The simple random
sample assumes that we can list all the individuals in the population, but
often this is impossible. If our population were all the households or residents
of California, there would be no list of the households or residents available,
and it would be very expensive and time consuming to construct one. In this
type of situation, a multistage cluster sample would be used.
The idea is very simple. If we wanted to draw a sample of all residents of
California, we might start by dividing California into large geographical
areas such as counties and selecting a sample of these counties. Our sample
of counties could then be divided into smaller geographical areas such as
blocks and a sample of blocks would be selected. We could then construct a
list of all households for only those blocks in the sample. Finally, we would
go to these households and randomly select one member of each household for
our sample. Once the household and the member of that household have been
selected, substitution would not be allowed. This often means that we must
call back several times, but this is the price we must pay for a good sample.
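
A minimal sketch of the multistage idea follows. The nested dictionary (county, block, household, members) and the function name are hypothetical stand-ins for a real sampling frame, and the sketch picks counties and blocks with equal probability, which an actual survey design might not do.

```python
import random

def multistage_cluster_sample(frame, n_counties, n_blocks, seed=None):
    """frame maps county -> block -> household -> list of household members."""
    rng = random.Random(seed)
    selected = []
    # Stage 1: select a sample of counties.
    for county in rng.sample(sorted(frame), n_counties):
        blocks = frame[county]
        # Stage 2: select a sample of blocks within each sampled county.
        for block in rng.sample(sorted(blocks), n_blocks):
            # Stage 3: list every household on the sampled block.
            for household, members in sorted(blocks[block].items()):
                # Stage 4: randomly select one member; no substitution later.
                selected.append((county, block, household, rng.choice(members)))
    return selected

# Hypothetical frame: 8 counties, 10 blocks each, 5 households per block,
# 3 members per household.
frame = {
    f"county_{c}": {
        f"block_{c}_{b}": {
            f"hh_{c}_{b}_{h}": [f"member_{c}_{b}_{h}_{m}" for m in range(1, 4)]
            for h in range(1, 6)
        }
        for b in range(1, 11)
    }
    for c in range(1, 9)
}

sample = multistage_cluster_sample(frame, n_counties=3, n_blocks=4, seed=1)
print(len(sample))  # 3 counties x 4 blocks x 5 households = 60 respondents
```

Only the households on the sampled blocks ever have to be listed, which is what makes the design affordable.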

The Field Poll
used in this module is a telephone survey. It is a probability sample using
a technique called random-digit dialing. With random-digit dialing,
phone numbers are dialed randomly within working exchanges (i.e., the first
three digits of the telephone number). Numbers are selected in such a way
that all areas have the proper proportional chance of being selected in the
sample. Random-digit dialing makes it possible to include numbers that are
not listed in the telephone directory and households that have moved into
an area so recently that they are not included in the current telephone directory.
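
The following sketch shows the general idea of random-digit dialing, not the Field Poll's actual procedure. The exchanges and their household shares are invented for illustration; weighting each exchange by its share is one simple way to give areas a proportional chance of selection.

```python
import random

def random_digit_dial(exchanges, n, seed=None):
    """Generate n phone numbers: a working exchange plus four random digits."""
    rng = random.Random(seed)
    prefixes = list(exchanges)
    weights = [exchanges[p] for p in prefixes]
    numbers = []
    for _ in range(n):
        # Pick an exchange in proportion to its (assumed) share of households.
        prefix = rng.choices(prefixes, weights=weights, k=1)[0]
        # The last four digits are random, so unlisted numbers can be reached.
        suffix = rng.randrange(10000)
        numbers.append(f"{prefix}-{suffix:04d}")
    return numbers

# Hypothetical working exchanges and their shares of households.
working_exchanges = {"555": 0.5, "556": 0.3, "557": 0.2}

numbers = random_digit_dial(working_exchanges, 10, seed=1)
print(numbers)
```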

A nonprobability sample is one in which not every individual in the population has
a known chance of selection in the sample. There are several types of nonprobability
samples. For example, magazines often include questionnaires for readers to
fill out and return. This is a volunteer sample since respondents select
themselves into the sample (i.e., they volunteer to be in the sample). Another
type of nonprobability sample is a quota sample. Survey researchers
may assign quotas to interviewers. For example, interviewers might be told
that half of their respondents must be female and the other half male. This
is a quota on sex. We could also have quotas on several variables (e.g., sex
and race) simultaneously.

Probability samples
are preferable to nonprobability samples. First, they avoid the dangers of
what survey researchers call "systematic selection biases" which are inherent
in nonprobability samples. For example, in a volunteer sample, particular
types of persons might be more likely to volunteer. Perhaps highly-educated
individuals are more likely to volunteer to be in the sample and this would
produce a systematic selection bias in favor of the highly educated. In a
probability sample, the selection of the actual cases in the sample is left
to chance. Second, in a probability sample we are able to estimate the amount
of sampling error (our next concept to discuss).

We would like
our sample to give us a perfectly accurate picture of the population. However,
this is unrealistic. Assume that the population is all employees of a large
corporation, and we want to estimate the percent of employees in the population
who are satisfied with their jobs. We select a simple random sample of 500
employees and ask the individuals in the sample how satisfied they are with
their jobs. We discover that 75 percent of the employees in our sample are
satisfied. Can we assume that 75 percent of the population is satisfied? That
would be asking too much. Why would we expect one sample of 500 to give us
a perfect representation of the population? We could take several different
samples of 500 employees and the percent satisfied from each sample would
vary from sample to sample. There will be a certain amount of error as a result
of selecting a sample from the population. We refer to this as sampling
error. Sampling error can be estimated in a probability sample, but not
in a nonprobability sample.
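
Sampling error can be made concrete with a small simulation. The sketch below assumes a hypothetical corporation of 20,000 employees in which exactly 75 percent are satisfied, then draws 1,000 different simple random samples of 500 and records the percent satisfied in each.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 15,000 satisfied employees (1) and 5,000 not (0).
population = [1] * 15000 + [0] * 5000

def percent_satisfied(sample):
    return 100.0 * sum(sample) / len(sample)

# Draw many samples of 500 and compute the estimate from each one.
estimates = [percent_satisfied(rng.sample(population, 500)) for _ in range(1000)]

print(min(estimates), max(estimates))   # the estimates vary from sample to sample
print(statistics.mean(estimates))       # but they center near the true 75 percent
print(statistics.stdev(estimates))      # the typical size of the sampling error
```

The spread of the estimates around 75 percent is the sampling error. Because the samples are probability samples, that spread can be estimated from theory without actually drawing a thousand samples.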

It would be wrong
to assume that the only reason our sample estimate is different from the true
population value is because of sampling error. There are many other sources
of error called nonsampling error. Nonsampling error would include
such things as the effects of biased questions, the tendency of respondents
to systematically underestimate such things as age, the exclusion of certain
types of people from the sample (e.g., those without phones, those without
permanent addresses), or the tendency of some respondents to systematically
agree to statements regardless of the content of the statements. In some studies,
the amount of nonsampling error might be far greater than the amount of sampling
error. Notice that sampling error is random in nature, while nonsampling error
may be nonrandom, producing systematic biases. We can estimate the amount of
sampling error (assuming probability sampling), but it is much more difficult
to estimate nonsampling error. We can never eliminate sampling error entirely,
and it is unrealistic to expect that we could ever eliminate nonsampling error.
It is good research practice to be diligent in seeking out sources of nonsampling
error and trying to minimize them.


DATA ANALYSIS

Examining Variables One at a Time (Univariate Analysis)

The rest of this
chapter will deal with the analysis of survey data. Data analysis involves
looking at variables or "things" that vary or change. A variable is
a characteristic of the individual (assuming we are studying individuals).
The answer to each question on the survey forms a variable. For example, sex
is a variable: some individuals in the sample are male and some are female.
Age is a variable; individuals vary in their ages.

Looking at variables
one at a time is called univariate analysis. This is the usual starting
point in analyzing survey data. There are several reasons to look at variables
one at a time. First, we want to describe the data. How many of our sample
are men and how many are women? How many are black and how many are white?
What is the distribution by age? How many say they are going to vote for Candidate
A and how many for Candidate B? How many respondents agree and how many disagree
with a statement describing a particular opinion?

Another reason
we might want to look at variables one at a time involves recoding. Recoding
is the process of combining categories within a variable. Consider age, for
example. In the data set used in this module, age varies from 18 to 89, but
we would want to use fewer categories in our analysis, so we might combine
ages into three categories: 18 to 29, 30 to 49, and 50 and over. We might also want to combine
African Americans with the other nonwhite groups to classify race into only two categories: white
and nonwhite. Recoding is used to reduce the number of categories in the variable
(e.g., age) or to combine categories so that you can make particular types
of comparisons (e.g., white versus nonwhite).
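
Recoding amounts to a simple mapping from old categories to new ones. The sketch below uses the age groups described above; the function names and the race labels are hypothetical and would need to match the codes actually used in a data set.

```python
def recode_age(age):
    """Collapse age in years into the three categories used in the text."""
    if age < 18:
        raise ValueError("the sample includes only residents age 18 and over")
    if age <= 29:
        return "18 to 29"
    if age <= 49:
        return "30 to 49"
    return "50 and over"

def recode_race(race):
    """Collapse race into two categories: white and nonwhite."""
    return "white" if race == "white" else "nonwhite"

ages = [18, 29, 30, 49, 50, 89]
print([recode_age(a) for a in ages])
print([recode_race(r) for r in ["white", "African American", "other"]])
```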

The frequency
distribution is one of the basic tools for looking at variables one at a time.
A frequency distribution is the set of categories and the number of
cases in each category. Percent distributions show the percentage in
each category. Table 3.1 shows frequency and percent distributions for two
hypothetical variables, one for sex and one for willingness to vote for a woman
candidate. Begin by looking at the frequency distribution for sex. There are
three columns in this table. The first column specifies the categories (male
and female). The second column tells us how many cases there are in each category,
and the third column converts these frequencies into percents.

Table 3.1 -- Frequency and Percent Distributions for Sex and Willingness to Vote for a Woman Candidate (Hypothetical Data)

Sex
Category                          Freq.   Percent
Male                                380      40.0
Female                              570      60.0
Total                               950     100.0

Voting Preference
Category                          Freq.   Percent   Valid Percent
Willing to Vote for a Woman         460      48.4            51.1
Not Willing to Vote for a Woman     440      46.3            48.9
Refused                              50       5.3         Missing
Total                               950     100.0           100.0


In this hypothetical
example, there are 380 males and 570 females or 40 percent male and 60 percent
female. There are a total of 950 cases. Since we know the sex for each case,
there are no missing data (i.e., no cases where we do not know the proper
category). Look at the frequency distribution for voting preference in Table
3.1. How many say they are willing to vote for a woman candidate and how many
are unwilling? (Answer: 460 willing and 440 not willing) How many refused to
answer the question? (Answer: 50) What percent say they are willing to vote
for a woman, what percent are not, and what percent refused to answer? (Answer:
48.4 percent willing to vote for a woman, 46.3 percent not willing, and 5.3
percent refused to tell us.) The 50 respondents who didn't want to answer the
question are called missing data because we don't know which category to place
them in, so we create a new category (i.e., refused) for them. Since we
don't know where they should go, we might want a percentage distribution considering
only the 900 respondents who answered the question. We can determine this easily
by taking the 50 cases with missing information out of the base (i.e., the denominator
of the fraction) and recomputing the percentages. The fourth column in the frequency
distribution (labeled "valid percent") gives us this information. Approximately
51 percent of those who answered the question were willing to vote for a woman
and approximately 49 percent were not.
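
The percent and valid percent columns can be reproduced with a short sketch. The list of responses is rebuilt from the hypothetical counts in Table 3.1, and the variable names are illustrative only.

```python
from collections import Counter

# Rebuild the 950 hypothetical responses from the counts in Table 3.1.
responses = ["willing"] * 460 + ["not willing"] * 440 + ["refused"] * 50
missing = {"refused"}   # categories treated as missing data

freq = Counter(responses)
n_total = sum(freq.values())                                      # 950 cases
n_valid = sum(n for cat, n in freq.items() if cat not in missing)  # 900 cases

table = {}
for cat, count in freq.items():
    percent = round(100 * count / n_total, 1)
    # Valid percent drops the missing cases from the denominator.
    valid = None if cat in missing else round(100 * count / n_valid, 1)
    table[cat] = (count, percent, valid)

for cat, row in table.items():
    print(cat, row)
```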

With these data
we will use frequency distributions to describe variables one at a time. There
are other ways to describe single variables. The mean, median, and mode are
averages that may be used to describe the central tendency of a distribution.
The range and standard deviation are measures of the amount of variability
or dispersion of a distribution. (We will not be using measures of central
tendency or variability in this module.)


Exploring the Relationship Between Two Variables (Bivariate Analysis)

Usually we want
to do more than simply describe variables one at a time. We may want to analyze
the relationship between variables. Morris Rosenberg (1968:2) suggests that
there are three types of relationships: "(1) neither variable may influence
one another .... (2) both variables may influence one another ... (3) one
of the variables may influence the other." We will focus on the third of these
types which Rosenberg calls "asymmetrical relationships." In this type of
relationship, one of the variables (the independent variable) is assumed
to be the cause and the other variable (the dependent variable) is
assumed to be the effect. In other words, the independent variable is the
factor that influences the dependent variable.

For example,
researchers think that smoking causes lung cancer. The statement that specifies
the relationship between two variables is called a hypothesis (see
Hoover 1992, for a more extended discussion of hypotheses). In this hypothesis,
the independent variable is smoking (or more precisely, the amount one smokes)
and the dependent variable is lung cancer. Consider another example. Political
analysts think that income influences voting decisions, that rich people vote
differently from poor people. In this hypothesis, income would be the independent
variable and voting would be the dependent variable.

In order to demonstrate
that a causal relationship exists between two variables, we must meet
three criteria: (1) there must be a statistical relationship between the two
variables, (2) we must be able to demonstrate which one of the variables influences
the other, and (3) we must be able to show that there is no other alternative
explanation for the relationship. As you can imagine, it is impossible to
show that there is no other alternative explanation for a relationship. For
this reason, we can show that one variable does not influence another variable,
but we cannot prove that it does. We can only show that it is more plausible
or credible to believe that a causal relationship exists. In this section,
we will focus on the first two criteria and leave this third criterion to
the next section.

In the previous
section we looked at the frequency distributions for sex and voting preference.
All we can say from these two distributions is that the sample is 40 percent
men and 60 percent women and that, of those who answered the question, slightly more than half
said they would be willing to vote for a woman, and slightly less than half
said they would not. We cannot say anything about the relationship between
sex and voting preference. In order to determine if men or women are more
likely to be willing to vote for a woman candidate, we must move from univariate
to bivariate analysis.

A crosstabulation
(or contingency table) is the basic tool used to explore the relationship
between two variables. Table 3.2 is the crosstabulation of sex and voting
preference. In the lower right-hand corner is the total number of cases in
this table (900). Notice that this is not the number of cases in the sample.
There were originally 950 cases in this sample, but any case that had missing
information on either or both of the two variables in the table has been excluded
from the table. Be sure to check how many cases have been excluded from your
table and to indicate this figure in your report. Also be sure that you understand
why these cases have been excluded. The figures in the lower margin and right-hand
margin of the table are called the marginal distributions. They are simply
the frequency distributions for the two variables in the whole table. Here,
there are 360 males and 540 females (the marginal distribution for the column
variable, sex) and 460 people who are willing to vote for a woman candidate
and 440 who are not (the marginal distribution for the row variable, voting
preference). The other figures in the table are the cell frequencies. Since
there are two columns and two rows in this table (sometimes called a 2 x 2
table), there are four cells. The numbers in these cells tell us how many
cases fall into each combination of categories of the two variables. This
sounds complicated, but it isn't. For example, 158 males are willing to vote
for a woman and 302 females are willing to vote for a woman.

Table 3.2 -- Crosstabulation of Sex and Voting Preference (Frequencies)

                                         Sex
Voting Preference                   Male   Female   Total
Willing to Vote for a Woman          158      302     460
Not Willing to Vote for a Woman      202      238     440
Total                                360      540     900
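
A crosstabulation is just a count of cases for each combination of categories. As a sketch, the case-level records below are rebuilt from the cell frequencies of Table 3.2; with real survey data there would be one (sex, preference) pair per respondent with complete information.

```python
from collections import Counter

# One (sex, voting preference) pair per case, rebuilt from Table 3.2.
records = (
    [("male", "willing")] * 158
    + [("female", "willing")] * 302
    + [("male", "not willing")] * 202
    + [("female", "not willing")] * 238
)

cells = Counter(records)                           # the four cell frequencies
column_totals = Counter(sex for sex, _ in records)  # marginal distribution for sex
row_totals = Counter(vote for _, vote in records)   # marginal for voting preference
n = len(records)                                    # total cases in the table

print(cells[("male", "willing")], cells[("female", "willing")])
print(dict(column_totals), dict(row_totals), n)
```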


We could make comparisons
rather easily if we had an equal number of women and men. Since these numbers
are not equal, we must use percentages to help us make the comparisons. Since
percentages convert everything to a common base of 100, the percent distribution
shows us what the table would look like if there were an equal number of men
and women.

Before we percentage
Table 3.2, we must decide which of these two variables is the independent
and which is the dependent variable. Remember that the independent variable
is the variable we think might be the influencing factor. The independent
variable is hypothesized to be the cause, and the dependent variable is the
effect. Another way to express this is to say that the dependent variable
is the one we want to explain. Since we think that sex influences willingness
to vote for a woman candidate, sex would be the independent variable.

Once we have
decided which is the independent variable, we are ready to percentage the
table. Notice that percentages can be computed in different ways. In Table
3.3, the percentages have been computed so that they sum down to 100. These
are called column percents. If they sum across to 100, they are called
row percents. If the independent variable is the column variable,
then we want the percents to sum down to 100 (i.e., we want the column percents).
If the independent variable is the row variable, we want the percents to sum
across to 100 (i.e., we want the row percents). This is a simple, but very
important, rule to remember. We'll call this our rule for computing percents.
Although we often see the independent variable as the column variable so the
table sums down to 100 percent, it really doesn't matter whether the independent
variable is the column or the row variable. In this module, we will put the
independent variable as the column variable. Many others (but not everyone)
use this convention. It would be helpful if you did this when you write your
report.

Table 3.3 -- Voting Preference by Sex (Percents)

Voting Preference                   Male   Female   Total
Willing to Vote for a Woman         43.9     55.9    51.1
Not Willing to Vote for a Woman     56.1     44.1    48.9
Total Percent                      100.0    100.0   100.0
(Total Frequency)                  (360)    (540)   (900)
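
The column percents in Table 3.3 follow directly from the frequencies in Table 3.2. The sketch below applies the rule for computing percents with sex, the independent variable, as the column variable; the dictionary layout is just one convenient way to hold the table.

```python
# Cell frequencies from Table 3.2: rows are voting preference, columns are sex.
cells = {
    "willing": {"male": 158, "female": 302},
    "not willing": {"male": 202, "female": 238},
}
columns = ["male", "female"]

# Each column total is the base for that column's percents.
column_totals = {c: sum(row[c] for row in cells.values()) for c in columns}

# Percents sum down to 100 within each category of the independent variable.
column_percents = {
    r: {c: round(100 * row[c] / column_totals[c], 1) for c in columns}
    for r, row in cells.items()
}

# Compare across: the difference between females and males in the first row.
difference = column_percents["willing"]["female"] - column_percents["willing"]["male"]

print(column_percents)
print(round(difference, 1))  # percentage-point difference
```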


Now we are ready
to interpret this table. Interpreting a table means explaining what the table
is saying about the relationship between the two variables. First, we look
at each category of the independent variable separately to describe the data,
and then we compare the categories to each other. The rule for interpreting percents
is to compare in the direction opposite to the way the percents sum to 100.
So, if the percents sum down to 100, we compare across, and if the percents
sum across to 100, we compare down. Since the independent variable is the column variable
in this module, the percents sum down to 100, so we describe down and compare across. In Table 3.3, row
one shows the percent of males and the percent of females who are willing to
vote for a woman candidate: 43.9 percent of males are willing to vote for a
woman, while 55.9 percent of the females are. This is a difference of 12 percentage
points. Females are somewhat more likely than males to be willing to vote for a woman. The
second row shows the percent of males and females who are not willing to vote
for a woman. Since there are only two rows, the second row will be the complement
(or the reverse) of the first row. It shows that males are somewhat more likely
to be unwilling to vote for a woman candidate (a difference of 12 percentage
points in the opposite direction).

When we observe
a difference, we must also decide whether it is significant. There are two
different meanings for significance: statistical significance and substantive
significance. Statistical significance considers whether the difference
is great enough that it is probably not due to chance factors. Substantive
significance considers whether a difference is large enough to be important.
With a very large sample, a very small difference is often statistically significant,
but that difference may be so small that we decide it isn't substantively
significant (i.e., it's so small that we decide it doesn't mean very much).
We're going to focus on statistical significance, but remember that even if
a difference is statistically significant, you must also decide if it is substantively
significant.

Let's discuss
this idea of statistical significance. If our population is all men and women
of voting age in California, we want to know if there is a relationship between
sex and voting preference in the population of all individuals of voting age
in California. All we have is information about a sample from the population.
We use the sample information to make an inference about the population. This
is called statistical inference. We know that our sample is not a perfect
representation of our population because of sampling error. Therefore,
we would not expect the relationship we see in our sample to be exactly the
same as the relationship in the population.

Suppose we want
to know whether there is a relationship between sex and voting preference
in the population. It is impossible to prove this directly, so we have to
demonstrate it indirectly. We set up a hypothesis (called the null hypothesis)
that says that sex and voting preference are not related to each other in
the population. This basically says that any difference we see is likely to
be the result of random variation. If the difference is large enough that
it is not likely to be due to chance, we can reject this null hypothesis of
only random differences. Then the hypothesis that they are related (called
the alternative or research hypothesis) will be more credible.


Table 3.4 -- Development of Chi Square Statistic

Column 1   Column 2   Column 3      Column 4        Column 5
f_o        f_e        (f_o - f_e)   (f_o - f_e)^2   (f_o - f_e)^2 / f_e
158        184        -26           676              3.67
202        176         26           676              3.84
302        276         26           676              2.45
238        264        -26           676              2.56
                                                    12.52 = chi square

In the first column of Table 3.4, we have listed the four cell frequencies from the
crosstabulation of sex and voting preference. We'll call these the observed
frequencies (f_o) because they are what we observe from our table.
In the second column, we have listed the frequencies we would expect if, in
fact, there is no relationship between sex and voting preference in the population.
These are called the expected frequencies (f_e). We'll briefly
explain how these expected frequencies are obtained. Notice from Table 3.1 that
51.1 percent of those who answered the question were willing to vote for a woman candidate, while
48.9 percent were not. If sex and voting preference are independent (i.e., not
related), we should find the same percentages for males and females. In other
words, 51.1 percent (or 184) of the males and 51.1 percent (or 276) of the females
would be willing to vote for a woman candidate, and 48.9 percent (or 176) of the males
and 48.9 percent (or 264) of the females would be unwilling. (This explanation is adapted
from Norusis 1997.) Now, we want to compare these two sets of frequencies to
see if the observed frequencies are really like the expected frequencies. All
we do is to subtract the expected from the observed frequencies (column three).
We are interested in the sum of these differences for all cells in the table.
Since they always sum to zero, we square the differences (column four) to get
positive numbers.

Finally, we divide
this squared difference by the expected frequency (column five). (Don't worry
about why we do this. The reasons are technical and don't add to your understanding.)
The sum of column five (12.52) is called the chi square statistic.
If the observed and the expected frequencies are identical (no difference),
chi square will be zero. The greater the difference between the observed and
expected frequencies, the larger the chi square.
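
The five columns of Table 3.4 can be reproduced with a short function. This is a sketch of the calculation described above, applied to the observed frequencies from Table 3.2; the function name chi_square is illustrative. Each expected frequency is computed as (row total x column total) / n, which is equivalent to applying the overall row percentages to each column.

```python
# Observed frequencies from Table 3.2.
# Rows: willing, not willing. Columns: male, female.
observed = [[158, 302], [202, 238]]

def chi_square(observed):
    """Return the chi square statistic and the table of expected frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, f_o in enumerate(row):
            # Frequency expected if the two variables were unrelated.
            f_e = row_totals[i] * col_totals[j] / n
            expected_row.append(f_e)
            # Squared difference, scaled by the expected frequency.
            chi2 += (f_o - f_e) ** 2 / f_e
        expected.append(expected_row)
    return chi2, expected

chi2, expected = chi_square(observed)
print(expected)          # [[184.0, 276.0], [176.0, 264.0]]
print(round(chi2, 2))    # 12.52
```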

If we get a large
chi square, we are willing to reject the null hypothesis. How large does the
chi square have to be? We reject the null hypothesis of no relationship between
the two variables when the probability of getting a chi square this large
or larger by chance is so small that the null hypothesis is very unlikely
to be true, that is, when a chi square this large would rarely occur by chance
(usually less than once in a hundred or less than five times in a hundred).
In this example, the probability of getting a chi square as large as 12.52
or larger by chance is less than one in a thousand. This is so unlikely that
we reject the null hypothesis, and we conclude that the alternative hypothesis
(i.e., there is a relationship between sex and voting preference) is credible
(not that it is necessarily true, but that it is credible). There is always
a small chance that the null hypothesis is true even when we decide to reject
it. In other words, we can never be sure that it is false. We can only conclude
that there is little chance that it is true.
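
The probability quoted above can be checked with the standard library. The sketch relies on an assumption the chapter does not spell out: a 2 x 2 table has (2 - 1) x (2 - 1) = 1 degree of freedom, and for one degree of freedom the chance of a chi square at least this large equals the two-sided tail of a standard normal distribution, which math.erfc gives directly. The function name is illustrative, and the shortcut does not apply to larger tables.

```python
import math

def chi_square_p_value_df1(chi2):
    """Probability of a chi square this large or larger by chance (1 df only)."""
    return math.erfc(math.sqrt(chi2 / 2.0))

p = chi_square_p_value_df1(12.52)
print(p)            # a probability well below one in a thousand
print(p < 0.001)    # True: reject the null hypothesis at the levels in the text
```

For tables with more rows or columns, a statistical package or a chi square table would be used instead.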

Just because
we have concluded that there is a relationship between sex and voting preference
does not mean that it is a strong relationship. It might be a moderate or
even a weak relationship. There are many statistics that measure the strength
of the relationship between two variables. Chi square is not a measure of
the strength of the relationship. It just helps us decide if there is a basis
for saying a relationship exists regardless of its strength. Measures of
association estimate the strength of the relationship and are often used
with chi square. (See Appendix D for a discussion of how to compute the two
measures of association discussed below.)

Cramer's V
is a measure of association appropriate when one or both of the variables
consists of unordered categories. For example, race (white, African American,
other) or religion (Protestant, Catholic, Jewish, other, none) are variables
with unordered categories. Cramer's V is a measure based on chi square. It
ranges from zero to one. The closer to zero, the weaker the relationship;
the closer to one, the stronger the relationship.
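
As a sketch, Cramer's V can be computed from chi square with the standard textbook formula, V = sqrt(chi square / (n x min(rows - 1, columns - 1))). The presentation in Appendix D may differ; the function name and arguments here are illustrative.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V from chi square, the number of cases, and the table size."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Chi square of 12.52 from the 2 x 2 table of sex and voting preference (900 cases).
v = cramers_v(12.52, 900, 2, 2)
print(round(v, 3))   # 0.118
```

A value this close to zero suggests that the relationship, although statistically significant, is fairly weak.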

Gamma
(sometimes referred to as Goodman and Kruskal's Gamma) is a measure of association
appropriate when both of the variables consist of ordered categories. For
example, if respondents answer that they strongly agree, agree, disagree,
or strongly disagree with a statement, their responses are ordered. Similarly,
if we group age into categories such as under 30, 30 to 49, and 50 and over,
these categories would be ordered. Ordered categories can logically be arranged
in only two ways: low to high or high to low. Gamma ranges from -1 to +1, so its
numerical value runs from zero to one and its sign can be positive or negative. For this module, the sign of Gamma would
have no meaning, so ignore the sign and focus on the numerical value. Like
V, the closer to zero, the weaker the relationship and the closer to one,
the stronger the relationship.
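
Gamma compares pairs of cases that are ordered the same way on both variables (concordant pairs) with pairs ordered in opposite ways (discordant pairs): Gamma = (C - D) / (C + D). The sketch below uses this standard definition, which may be presented differently in Appendix D, and applies it to the frequencies from Table 3.2. The function name is illustrative.

```python
def goodman_kruskal_gamma(table):
    """Gamma for a table whose rows and columns are both in category order."""
    concordant = 0
    discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # cells in lower rows
                for l in range(n_cols):
                    if l > j:                    # lower and to the right
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                  # lower and to the left
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)

# Rows: willing, not willing. Columns: male, female (Table 3.2).
gamma = goodman_kruskal_gamma([[158, 302], [202, 238]])
print(round(abs(gamma), 3))   # 0.237, ignoring the sign
```

The sign here depends only on the order in which the categories happen to be listed, which is why the numerical value is the part to interpret.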

Choosing whether
to use Cramer's V or Gamma depends on whether the categories of the variable
are ordered or unordered. However, dichotomies (variables consisting of only
two categories) may be treated as if they are ordered even if they are not.
For example, sex is a dichotomy consisting of the categories male and
female. There are only two possible ways to order sex: male, female and female,
male. Or, race may be classified into two categories: white and nonwhite. We
can treat dichotomies as if they consisted of ordered categories because they
can be ordered in only two ways. In other words, when one of the variables
is a dichotomy, treat this variable as if it were ordinal and use Gamma, provided
the other variable is also ordered or is itself a dichotomy. This
is important when choosing an appropriate measure of association.

In this chapter
we have described how surveys are done and how we analyze the relationship
between two variables. In the next chapter we will explore how to introduce
additional variables into the analysis.


REFERENCES AND SUGGESTED READING

Methods of Social Research

  • Riley, Matilda White. 1963. Sociological Research I: A Case Approach. New York: Harcourt, Brace and World.
  • Hoover, Kenneth R. 1992. The Elements of Social Scientific Thinking (5th Ed.). New York: St. Martin's.

Interviewing

  • Gorden, Raymond L. 1987. Interviewing: Strategy, Techniques and Tactics. Chicago: Dorsey.

Survey Research and Sampling

  • Babbie, Earl R. 1990. Survey Research Methods (2nd Ed.). Belmont, CA: Wadsworth.
  • Babbie, Earl R. 1997. The Practice of Social Research (8th Ed.). Belmont, CA: Wadsworth.

Statistical Analysis

  • Knoke, David, and George W. Bohrnstedt. 1991. Basic Social Statistics. Itasca, IL: Peacock.
  • Riley, Matilda White. 1963. Sociological Research II: Exercises and Manual. New York: Harcourt, Brace and World.
  • Norusis, Marija J. 1997. SPSS 7.5 Guide to Data Analysis. Upper Saddle River, NJ: Prentice Hall.

Data Sources

  • The Field Institute. 1985. California Field Poll Study, July, 1985. Machine-readable codebook.
  • The Field Institute. 1991. California Field Poll Study, September, 1991. Machine-readable codebook.
  • The Field Institute. 1995. California Field Poll Study, February, 1995. Machine-readable codebook.