Chapter 8: Multivariate Analysis

 
Crosstabs Revisited
 
Simple crosstabs, which examine the influence of one variable on another, should be only the first step in the analysis of social science data (refer to Chapter Five). It is fun to hypothesize that the more conservative a person's political orientation, the more likely they are to oppose abortion, then run the crosstabs, and then conclude you were right. However, this one-step method of hypothesis testing is very limited. What if all the Republicans in your sample are religiously conservative and all the Democrats are atheists? Is it the political party that best explains your findings, or is it religious orientation? Or what if the political conservatives as a group are much older than the liberals? Would age then be the real causal factor? Or is it some combination of all of these variables that explains the varying opinions of your respondents?
In Exercise 3 of Chapter Five, we wondered if TRUST is related to RACE. Using TRUST as the dependent variable and RACE as the independent, here is the table from Chapter Five (Figure 8-1):
 

Figure 8-1

At first glance, RACE differences appear to be very important (overall, 58% of those surveyed said people cannot be trusted, but the epsilon statistic, the difference between the highest and lowest percentage, is 22). Also note that few respondents said “Depends”; most had a definite opinion here.

Let’s do some recoding: RACE should be recoded into a different variable called RACER (Race Recoded). Whites and Blacks will stay the same, but Other is eliminated by recoding it as missing (see Figures 8-2 and 8-3). Review Chapter 3 if you need to refresh your memory on how to recode.
 


Figure 8-2

Figure 8-3

Let’s also recode TRUST into a different variable called TRUSTR to eliminate the “Depends” category. Don’t forget to create new value labels after you recode. Now run the crosstabs for TRUSTR and RACER. Your output should look like Figure 8-4.


Figure 8-4
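Outside of SPSS, the same "recode into a different variable" idea can be sketched in a few lines of Python. This is only an illustration: the numeric codes (1 = White, 2 = Black, 3 = Other; 1 = can trust, 2 = cannot trust, 3 = depends) are assumptions, the responses are made up, and the `recode` helper is a hypothetical name, not an SPSS function.

```python
# Assumed numeric codes (hypothetical; check your codebook for the real ones).
RACE_TO_RACER = {1: 1, 2: 2, 3: None}    # 1 = White, 2 = Black, 3 = Other -> missing
TRUST_TO_TRUSTR = {1: 1, 2: 2, 3: None}  # 1 = can trust, 2 = cannot trust, 3 = depends -> missing


def recode(values, mapping):
    """Return a new list of recoded values; anything not in the mapping becomes missing."""
    return [mapping.get(value) for value in values]


# Made-up responses, for illustration only.
race = [1, 2, 3, 1, 2]
trust = [2, 2, 1, 3, 2]

# The original lists are left untouched, as with "recode into different variables".
racer = recode(race, RACE_TO_RACER)
trustr = recode(trust, TRUST_TO_TRUSTR)

# Keep only the cases that are valid on both recoded variables.
valid_cases = [(t, r) for t, r in zip(trustr, racer) if t is not None and r is not None]

print(racer)        # [1, 2, None, 1, 2]
print(trustr)       # [2, 2, 1, None, 2]
print(valid_cases)  # [(2, 1), (2, 2), (2, 2)]
```

Cases recoded to missing drop out of the table, which is why the percentages change after recoding.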

When the percentage point difference ("pp", also called epsilon) is this high, it’s “interesting” (actually, anything higher than 10-12 is interesting) even if you don't yet know whether it is statistically significant. Here you have a pp difference of 24. And here’s how you might describe what you’ve found so far: “Although most respondents (62%) say that other people cannot be trusted, over 80% of the Black respondents said this, compared to 58% of the Whites in this sample.” Or, “Fewer than one-fifth of Blacks said that people can be trusted, compared to more than two-fifths of Whites.”
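To make the arithmetic behind epsilon concrete, here is a minimal Python sketch. The counts are invented for illustration and are not the GSS figures; the function names are hypothetical.

```python
def column_percentages(counts):
    """counts[column][row] holds a cell count; return the same layout as column percentages."""
    percentages = {}
    for column, rows in counts.items():
        total = sum(rows.values())
        percentages[column] = {row: 100.0 * n / total for row, n in rows.items()}
    return percentages


def epsilon(percentages, row):
    """Percentage point difference between the highest and lowest column for one row."""
    values = [rows[row] for rows in percentages.values()]
    return max(values) - min(values)


# Made-up counts: columns are categories of the independent variable.
counts = {
    "White": {"Can trust": 40, "Cannot trust": 60},
    "Black": {"Can trust": 20, "Cannot trust": 80},
}

pcts = column_percentages(counts)
pp = epsilon(pcts, "Cannot trust")
print(pp)  # 20.0
```

With only two rows, the epsilon is the same whichever row you read it from, because each column's percentages sum to 100.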

Is this a strong relationship (statistically speaking)? There are a lot of choices in the "Statistics" dialog box, but here we will just look at the gamma statistic (your instructor will probably have you look at other statistics, but gamma is almost always appropriate; see Figure 8-5). Yes, it is significant.


Figure 8-5
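For readers curious about what gamma measures, the sketch below computes Goodman and Kruskal's gamma from a table of counts as (C - D) / (C + D), where C is the number of concordant pairs and D the number of discordant pairs. It assumes the rows and columns are listed in a meaningful order, uses made-up counts, and does not compute the significance test that SPSS reports alongside gamma.

```python
def goodman_kruskal_gamma(table):
    """table is a list of rows of counts, with rows and columns in category order."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair each cell with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        discordant += table[i][j] * table[k][m]
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when there are no untied pairs")
    return (concordant - discordant) / (concordant + discordant)


# Made-up 2 x 2 table: rows are the dependent variable, columns the independent.
example = [
    [40, 20],  # "Can trust"
    [60, 80],  # "Cannot trust"
]
gamma = goodman_kruskal_gamma(example)
print(round(gamma, 3))  # 0.455
```

The sign of gamma depends on the order in which the categories are coded, so read it together with the percentages in the table.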

Can you have confidence that race is the causal factor here? While it may indeed be true that race is explanatory, you won't really have confidence in this conclusion until you have failed to account for this variation in any other way. To do this, we will need to do some elaboration analysis by running crosstabs of (i.e., "controlling for") other independent variables to see if something else might account for this variation among respondents.

Recall that your original crosstabs procedure produces one contingency table, with as many rows as there are categories (or values) of the dependent variable, and as many columns as there are categories of the independent variable. When you start using control (sometimes called test) variables, you will get as many separate tables as there are categories of the control variable. For instance, if you want to control for levels of education, and simply used EDUC as the control variable, you end up with 20 separate tables. This is NOT a good idea. Try doing this to see what we mean. Notice how difficult it is to compare across this many tables. So before you do any further analysis, recode your variables into the smallest number of categories that are still logically useful. Review Chapter 3 if you have forgotten how to do this.
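As a rough illustration of why recoding the control variable matters, the following Python sketch counts how many partial tables you would get before and after collapsing years of education into two groups. The values are made up, the cut point of 12 years is the one used for the two-category education recode in this chapter, and `recode_educ` is a hypothetical helper.

```python
def recode_educ(years):
    """Collapse years of education into two categories; missing stays missing."""
    if years is None:
        return None
    return "0-12 years" if years <= 12 else "13+ years"


# Made-up years of education for eight respondents.
educ = [8, 12, 12, 13, 16, 20, 10, 14]
educ2 = [recode_educ(years) for years in educ]

# One partial table is produced per distinct value of the control variable.
tables_before = len(set(educ))
tables_after = len(set(educ2))
print(tables_before)  # 7
print(tables_after)   # 2
```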

In the next example EDUC was recoded as EDUC2 into two categories, those with high school or less (0-12 years), and those with more than high school (13+ years). After you have done these recodes, let's see what happens when we do crosstabs again. This time we will control for our recoded education variable. To do the appropriate crosstabs, go to the Analyze, Descriptive Statistics, Crosstabs menu. Enter TRUSTR into the Row box and RACER into the Column box. Now you are ready for the next step, the addition of a control variable. Choose EDUC2 from your variables list and enter it into the empty box at the bottom of the Crosstabs screen. Figure 8-6 shows you what the Crosstabs dialog box will look like.


Figure 8-6

The SPSS output for this procedure is shown in Figure 8-7.


Figure 8-7


You still have the two columns of your independent variable (RACER), but you can compare TRUSTR for people who have no college education (0-12 years) with those who do (13+). A possible description: "Whites are more likely than Blacks to think people can be trusted, holding education constant (50.1% vs. 31.3% and 32.3% vs. 9.3%). Those with more education are more likely than those with less education to say people can be trusted, holding race constant (50.1% vs. 32.3% and 31.3% vs. 9.3%). Both education and race are related to trust of people."
So what is more important, race or years of education? Just as you can’t stop with a crosstab of only two variables when you want to test out your hypotheses, you also can’t stop with just one control variable. Some of the other “major demographic variables” that might explain social differences include sex, social class, income, occupation, marital status, age, political ideology, and religion.
 
Figure 8-4 shows the original, or zero order contingency table of the relationship between trust and race.
Figure 8-7 shows the two partial tables that resulted from controlling for education, one for each category of that variable (0-12, 13+).
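The relationship between a zero order table and its partial tables can be sketched in Python as follows. The records are invented for illustration, and the function names are hypothetical; the point is only that each case lands in exactly one partial table, so the partials add back up to the zero order table.

```python
from collections import Counter, defaultdict


def zero_order_table(records):
    """Count (dependent, independent) pairs, ignoring the control variable."""
    return Counter((dep, indep) for dep, indep, _control in records)


def partial_tables(records):
    """Build one table of (dependent, independent) counts per control category."""
    tables = defaultdict(Counter)
    for dep, indep, control in records:
        tables[control][(dep, indep)] += 1
    return dict(tables)


# Made-up cases: (TRUSTR, RACER, EDUC2).
records = [
    ("Can trust", "White", "13+"),
    ("Cannot trust", "White", "13+"),
    ("Can trust", "White", "0-12"),
    ("Cannot trust", "White", "0-12"),
    ("Cannot trust", "White", "0-12"),
    ("Can trust", "Black", "13+"),
    ("Cannot trust", "Black", "13+"),
    ("Cannot trust", "Black", "0-12"),
    ("Cannot trust", "Black", "0-12"),
]

zero_order = zero_order_table(records)
partials = partial_tables(records)

print(len(partials))  # 2, one table per EDUC2 category
print(sum(partials.values(), Counter()) == zero_order)  # True
```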
Try other variables as a control to see what happens. As a general rule, here is how to interpret what you find from this elaboration analysis:
  •  If the partial tables are similar to the zero order table, you have replicated your original findings, which means that in spite of the introduction of a particular control variable, the original relationship persists. The only way to convince us that this is indeed a strong, or even causal, relationship is if you control for all the other logical independent variables you can think of, and still find essentially no differences between the zero order tables and their partials.
  • If all the partials are significantly less than those found in the original AND IF your control variable is antecedent (occurs prior in time) to both the other variables, you have found a spurious relationship and explained away the original. In other words, the original relationship was due to the influence of that other variable, not the one you hypothesized.
  • If the partials are less AND IF your control variable is intervening, you have interpreted the relationship. If the time sequence between the independent and control variable is not determinable (or otherwise unclear), you don't know whether you have explanation or interpretation, but you do know that the control variable is important.
  • If one or more partials is stronger than the original relationship and one or more is weaker, you have discovered the conditions under which the original relationship is strongest. This is referred to as specification, or the interaction effect.
  • If the zero order table showed weak association between the variables, you might still find strong associations in the partials (which is a good argument for keeping on with your initial analysis of the data even if you didn't "find" anything with bivariate analysis). The addition of your control variable showed it to have been acting as a suppressor in the original table.
  • Last, if a zero order table shows only a weak or moderate association, the partials might show the opposite relationship, due to the presence of a distorter variable.

Try some of your own three way (or higher) tables using some of the variables in the GSS00A data set. Recall that for this procedure, there should be few categories for each variable, particularly your control variables (so you might need to recode), and you are limited to variables measured at, or recoded to, nominal or ordinal levels.

Multiple Regression

Once you have discovered that several of your independent variables are related to your dependent variable, you might want to try multiple regression (multiple linear regression analysis). The three way (or higher) crosstabs shown previously are more an exploratory technique, whereas multiple regression is more explanatory. With multiple regression you can generate beta values (partial regression coefficients), which give you an idea of the relative impact of each independent variable on the dependent.

You also will generate the R squared value, which is a summary statistic of the impacts of all the independent variables taken together. Remember the important assumptions for using regression: a linear relationship between each independent variable and the dependent, a normal distribution of your variables, and variables measured at interval or ratio levels. Any variable with only two categories can be treated as interval level.

Go to the Analyze, Regression, Linear menu. For your dependent variable, choose TRUST from the variable list. For the independent variables, choose EDUC (unrecoded), CLASS, and AGE (see Figure 8-8).
 

Figure 8-8

Let's look at some of the options available. Choose the "Statistics" button at the bottom of the dialog box and a new dialog box will appear, shown here in Figure 8-9 with the default options. The defaults are appropriate for us.


Figure 8-9

Click on the "Continue" button to return to Figure 8-8, then click on the "Plots" button. Your screen should now look like Figure 8-10. Again the defaults are appropriate for us.


Figure 8-10

Click on "Continue" and then "Options" and your screen should look like Figure 8-11, which shows the default options.


Figure 8-11

Defaults are acceptable, so click "Continue" to return to the Linear Regression dialog box (Figure 8-8). Your last task is to choose your method of analysis. Click on the "Method:" button right under the "Independent(s):" box. You have several choices here, and you can use the scroll button to see what they are. "Stepwise" is the one we chose for this example, and the one that you will probably use most often (see Figure 8-12).


Figure 8-12

For an in-depth discussion of all the possible choices for Multiple Regression, you will need to consult the SPSS manuals.

When you finally click "OK" in the Linear Regression dialog box after having chosen stepwise regression using all the default options discussed above, you will see the results in the Output window (Figures 8-13 through 8-15).
 


Figure 8-13

Figure 8-14


Figure 8-15
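To give a feel for what the beta values and R squared in regression output represent, here is a small Python sketch for the special case of exactly two independent variables, where the standardized betas can be written directly in terms of the three pairwise correlations. It is a simplification, not the SPSS procedure: it does no stepwise selection, it handles only two predictors, and the data and function names are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def two_predictor_betas(y, x1, x2):
    """Standardized betas and R squared for a regression of y on x1 and x2."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    denominator = 1 - r_12 ** 2  # zero if the two predictors are perfectly correlated
    beta1 = (r_y1 - r_y2 * r_12) / denominator
    beta2 = (r_y2 - r_y1 * r_12) / denominator
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Made-up interval-level data for six cases.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2, 3, 5, 4, 8, 7]

beta1, beta2, r_squared = two_predictor_betas(y, x1, x2)
print(beta1, beta2, r_squared)
```

Because the betas are standardized, their sizes can be compared with each other, which is what lets you judge the relative impact of each independent variable.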


 
Chapter Eight Exercises

  1. Create some hypotheses that use RACER and TRUSTR and some of the other independent variables found in our GSS00A data set. Remember to recode any variables that have too many categories. Test your hypotheses first using Crosstabs, then using Regression analysis. Do the race differences ever go away completely?
  2. How would you hypothesize the relationship between FEAR (Afraid to walk at night in neighborhood) and SEX? After you have looked at the output, control for CLASS. How would you discuss what you found? Now run FEAR and SEX but control for TRUSTR. How would you characterize the relationships among these variables?
  3. After creating the appropriate hypotheses, run the Crosstabs for each of the seven abortion variables with SEX, AGE (recoded), some measure of religiosity, and some measure of political ideology, controlling for RACER and for education (recoded). How are all these variables related?