Author: Ed Nelson
Department of Sociology M/S SS97
California State University, Fresno
Fresno, CA 93740
Note to the Instructor: This is the eleventh in a series of 13 exercises that were written for an introductory research methods class. The first exercise focuses on the research design, which is your plan of action that explains how you will try to answer your research questions. Exercises two through four focus on sampling, measurement, and data collection. The fifth exercise discusses hypotheses and hypothesis testing. The last eight exercises focus on data analysis. In these exercises we’re going to analyze data from one of the Monitoring the Future Surveys (i.e., the 2015 survey of high school seniors in the United States). This data set is part of the collection at the Inter-university Consortium for Political and Social Research at the University of Michigan. It is freely available to the public and you do not have to be a member of the Consortium to use it. To analyze the data we’re going to use SDA (Survey Documentation and Analysis), an online statistical package written by the Survey Methods Program at UC Berkeley that is available without cost wherever one has an internet connection. A weight variable is automatically applied to the data set so it better represents the population from which the sample was selected. You have permission to use this exercise and to revise it to fit your needs. Please send a copy of any revision to the author so I can see how people are using the exercises. Included with this exercise (as separate files) are more detailed notes to the instructors and the exercise itself. Please contact the author for additional information.
This page in MS Word (.docx) format is attached.
Goals of Exercise
The goal of this exercise is to introduce measures of association. The exercise also gives you practice in using CROSSTABS in SDA.
Part I – Relationships between Variables
We’re going to use the Monitoring the Future (MTF) Survey of high school seniors for this exercise. The MTF survey is a multistage cluster sample of all high school seniors in the United States. The survey of seniors started in 1975 and has been done annually ever since. To access the MTF 2015 survey, follow the instructions in the Appendix. Your screen should look like Figure 11-1. Notice that a weight variable has already been entered in the WEIGHT box. This will weight the data so the sample better represents the population from which it was selected.
MTF is an example of a social survey. The investigators selected a sample from the population of all high school seniors in the United States. This particular survey was conducted in 2015 and is a relatively large sample of a little less than 14,000 seniors. In a survey we ask respondents questions and use their answers as data for our analysis. The answers to these questions are used as measures of various concepts. In the language of survey research these measures are typically referred to as variables.
In exercise 9RM we used crosstabulation and percents to describe the relationship between pairs of variables in the sample. In exercise 10RM we went beyond simple description. We used the sample data to make inferences about the population from which the sample was selected. Chi Square was used to test hypotheses about the population. Chi Square is the appropriate test when your variables are nominal or ordinal (see exercise 6RM).
Chi Square is a test of the null hypothesis that two variables are unrelated to each other. Another way to put this is that the two variables are independent of each other. If we can reject the null hypothesis then we have support for our research hypothesis that the two variables are related to each other. But showing that two variables are related is not the same thing as determining the strength of the relationship. The strength of a relationship is actually a continuum from very weak to very strong. To measure the strength of a relationship we need to select and compute a measure of association. In this exercise we’re going to focus on nominal and ordinal variables.
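As a reminder of what the test statistic measures, here is a minimal Python sketch of the Pearson Chi Square for a crosstabulation. The function name and the two small tables are hypothetical illustrations, not MTF data, and SDA performs this calculation for you.

```python
def pearson_chi_square(table):
    """Pearson Chi Square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if the variables were unrelated
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (observed - expected) ** 2 / expected
    return chi_sq


# Hypothetical counts, not taken from the MTF survey
unrelated = [[20, 30], [40, 60]]  # same column percents in every column
related = [[40, 10], [20, 50]]    # column percents differ

print(pearson_chi_square(unrelated))
print(round(pearson_chi_square(related), 2))
```

The first table gives a Chi Square of 0 because the observed counts match the expected counts exactly. The second gives a large value, but that value alone does not tell us how strong the relationship is.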
Part II – What is a Measure of Association?
Before we discuss measures of association, we need to talk about independent and dependent variables. The dependent variable is whatever you are trying to explain. For example, let’s say we want to find out why some people think they will eventually graduate from a four-year college while others don’t. The independent variable is some variable that you think might help you answer this question. Perhaps we decide to use their grades in high school as our independent variable.
A measure of association is a numerical value that tells us how strongly related two variables are. There are several characteristics of a good measure of association.
- In absolute value, they range from 0 (i.e., no relationship) to 1 (i.e., the strongest possible relationship).
- For variables that have an underlying order from low to high they can be positive or negative. A positive value indicates that as one variable increases, the other variable also increases. A negative value indicates that as one variable increases, the other variable decreases.
- Some measures specify which variable is dependent and which is independent. The independent variable is some variable that you think might help explain the variation in the dependent variable. For example, if your two variables were education and voting you might choose education as the independent variable and voting as your dependent variable because you think that education will help you explain why some people vote Democrat and others vote Republican. Measures of association that specify which variable is dependent and which is independent are called asymmetric measures and measures that don’t specify which is dependent and which is independent are called symmetric measures.
Part III – Choosing a Measure of Association
There are many measures of association to choose from. We’re going to limit our discussion to those measures that SDA will compute plus a couple others. When choosing a measure of association we’ll start by considering the level of measurement of the two variables (see exercise 6RM).
- If one or both of the variables is nominal, then choose one of these measures.
  - Contingency Coefficient – SDA doesn’t compute this but it’s easy to compute by hand.
  - Cramer’s V – SDA doesn’t compute this either but it’s also easy to compute by hand and we’ll show you how.
- If both of the variables are ordinal, then choose from this list.
  - Gamma
  - Somers’ d with the row variable as the dependent variable
  - Kendall’s tau-b
  - Kendall’s tau-c
- Dichotomies should be treated as ordinal. Most variables can be recoded into dichotomies. For example, marital status can be recoded into married or not married. Race can be recoded as white or non-white.
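Recoding into a dichotomy can be sketched in a few lines of Python. The category labels and the function name below are hypothetical; they are not the codes used in the MTF survey.

```python
def recode_married(status):
    """Collapse a marital status category into a two-category variable."""
    return "married" if status == "married" else "not married"


# Hypothetical responses, for illustration only
responses = ["married", "divorced", "never married", "widowed", "married"]
recoded = [recode_married(status) for status in responses]
print(recoded)
```

Every original category lands in exactly one of the two new categories, so the recoded variable can be treated as ordinal.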
Part IV – Measures of Association for Nominal Variables
There are a few nominal level variables in the MTF survey.
- v13 – region of country where high school is located which we’ll refer to as the region where they currently live
- v2151 – race
- v2152 – type of community where respondent grew up
When one or both of your variables are nominal, you have a choice of the following measures – Contingency Coefficient and Cramer’s V. Let’s start with the Contingency Coefficient (C). One of the problems with this measure is that it varies from 0 to some value less than 1. The larger the number of categories, the closer the maximum value is to 1. For a table with two rows and two columns, the maximum value is .707 but for a table with three rows and three columns the maximum value is .816. So you can’t use C to compare the strength of the relationship unless the tables have the same number of rows and columns.
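The maximum values quoted above follow from a standard result: for a table with the same number of rows and columns, k, the largest possible value of C is the square root of (k - 1) / k. A small Python sketch, with a hypothetical function name, reproduces them:

```python
import math


def max_contingency_coefficient(k):
    """Largest possible C for a table with k rows and k columns."""
    return math.sqrt((k - 1) / k)


for k in (2, 3, 5, 10):
    print(k, round(max_contingency_coefficient(k), 3))
```

The ceiling creeps toward 1 as k grows but never reaches it, which is why values of C from tables of different sizes are not comparable.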
Cramer’s V is an extremely useful measure because it can vary between 0 and 1 regardless of the number of rows and columns. Values of V can therefore be compared for tables with different numbers of rows and columns.
Let’s look at an example to help us better understand measures of association for nominal variables. We’re going to use two variables – v2151 and v13. The first variable – v2151 – is race and the second – v13 – is the region of the country where the respondent currently lives. It would make sense to think of region as the dependent variable since race might influence where they currently live. Always put the dependent variable in the row and the independent variable in the column.
Run CROSSTABS in SDA to produce the crosstabulation of v2151 and v13. Click on OUTPUT OPTIONS and look at PERCENTAGING. Since your independent variable is always in the column, you want to use the column percents. By default, the box for column percents is already checked. Also, click on OUTPUT OPTIONS and check the box for SUMMARY STATISTICS. You probably don’t want any of the charts so you could click on the drop down arrow next to TYPE OF CHART and select NO CHART. Click on RUN THE TABLE to produce the crosstabulation. Your screen should look like Figure 11-2.
Calculating C and V is easy. All you have to do is follow these simple steps.
- C equals the square root of the following: Chi Square divided by the sum of the number of cases in the table and Chi Square.
  - Chi Square is the Pearson Chi Square. SDA expresses this as Chisq-P. (See exercise 10RM.)
  - Look at the SUMMARY STATISTICS that SDA gives you. The Pearson Chi Square is 1,422.29 and the number of cases in the table is 10,982.1.
  - So divide 1,422.29 by the sum of 10,982.1 and 1,422.29. This equals 1,422.29 divided by 12,404.39 or 0.1147.
  - Now take the square root of 0.1147 which equals 0.339.
- V equals the square root of the following: Chi Square divided by the product of the number of cases in the table and the smaller of two values – the number of rows minus 1 and the number of columns minus 1.
  - The Pearson Chi Square is 1,422.29, the number of cases in the table is 10,982.1, the number of rows minus 1 is 4 – 1 or 3, and the number of columns minus 1 is 3 – 1 or 2.
  - The smaller of the number of rows minus 1 and the number of columns minus 1 is 2 since 3 – 1 is smaller than 4 – 1.
  - So divide 1,422.29 by the product of 10,982.1 and 2. This equals 1,422.29 divided by 21,964.2 or 0.06475.
  - Now take the square root of 0.06475 which equals 0.254.
Notice that C and V are both moderate in size: C is 0.339 and V is 0.254. Both measures tell us that there is a moderate relationship between these two variables.
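If you want to check your hand calculations, the steps for C and V can be sketched in Python. The function names are our own. The inputs are the Pearson Chi Square and the number of cases reported above, with 4 rows (region) and 3 columns (race).

```python
import math


def contingency_coefficient(chi_sq, n):
    """C = square root of Chi Square / (N + Chi Square)."""
    return math.sqrt(chi_sq / (n + chi_sq))


def cramers_v(chi_sq, n, n_rows, n_cols):
    """V = square root of Chi Square / (N * smaller of rows - 1 and columns - 1)."""
    smaller = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi_sq / (n * smaller))


# Values from the SUMMARY STATISTICS for the table of v13 by v2151
c = contingency_coefficient(1422.29, 10982.1)
v = cramers_v(1422.29, 10982.1, n_rows=4, n_cols=3)
print(round(c, 3), round(v, 3))  # prints 0.339 0.254
```

For Part V you would substitute the Chi Square, number of cases, and table dimensions from your own SDA output.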
Part V – Now it’s Your Turn
Use CROSSTABS in SDA to give you the table for v2150 and v13. The variable v2150 is the respondent’s sex. We want to find out whether the respondent’s sex is related to where the respondent currently lives. Decide which variable is independent and dependent. Remember to put the dependent variable in the row and the independent variable in the column. Get the correct percents and tell SDA to compute Chi Square. Then compute C and V by hand. Use all this information to describe the relationship between these two variables.
Part VI – Measures of Association for Ordinal Variables
There are a number of ordinal level variables in the MTF survey. Here are a few examples.
- v2173 – how respondent rates self on school ability compared to others of same age
- v2174 – how respondent rates his or her intelligence compared to others of same age
- v2179 – respondent’s report of his or her average grade in high school
- v2183 – how likely the respondent thinks he or she will graduate from a four-year college
You have a choice of four measures that SDA will compute for ordinal variables – Gamma, Somers’ d, Kendall’s tau-b, and Kendall’s tau-c. Let’s start with Somers’ d. This measure is the only one of the four that is an asymmetric measure. That means that Somers’ d allows you to specify one of the variables as independent and the other as dependent. Use CROSSTABS in SDA to get the crosstabulation of v2173 and v2183. If we think that the respondents’ evaluation of their school ability influences how likely it is that they think they will graduate from a four-year college, then graduation from college would be our dependent variable and would go in the row and evaluation of school ability would go in the column. Be sure to get the column percents, Chi Square, and the four measures of association we listed above.
Chi Square tells us that we should reject the null hypothesis that the two variables are unrelated, which provides support for our research hypothesis that the variables are related to each other. Since v2183 is our dependent variable, the appropriate value of Somers’ d is 0.27. Tau-b and tau-c are very close to each other (0.31 and 0.27). Gamma (0.47) is larger. Gamma will always be at least as large as the other three measures in absolute value because it is computed only from pairs of cases that are not tied on either variable.
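To see why Gamma comes out larger, it can help to look at how the four measures are built from pairs of cases. The following Python sketch uses the standard textbook formulas with the row variable as dependent. The function name and the small table are hypothetical, and we are not claiming this is how SDA implements the calculation.

```python
import math


def ordinal_measures(table):
    """Gamma, Somers' d (row variable dependent), tau-b, and tau-c for a table
    whose rows and columns are both ordered from low to high."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    concordant = discordant = tied_row_only = tied_col_only = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i, n_rows):
                for l in range(n_cols):
                    pairs = table[i][j] * table[k][l]
                    if k > i and l > j:
                        concordant += pairs      # higher on both variables
                    elif k > i and l < j:
                        discordant += pairs      # higher on one, lower on the other
                    elif k == i and l > j:
                        tied_row_only += pairs   # same row, different columns
                    elif k > i and l == j:
                        tied_col_only += pairs   # same column, different rows
    diff = concordant - discordant
    untied = concordant + discordant
    m = min(n_rows, n_cols)
    return {
        "gamma": diff / untied,
        # Denominator: all pairs that differ on the column (independent) variable
        "somers_d": diff / (untied + tied_row_only),
        "tau_b": diff / math.sqrt((untied + tied_row_only) * (untied + tied_col_only)),
        "tau_c": 2 * m * diff / (n ** 2 * (m - 1)),
    }


# Hypothetical counts, not MTF data
result = ordinal_measures([[30, 10], [10, 30]])
print(result)
```

For this table Gamma is 0.8 while the other three measures are 0.5. The numerator is the same in every case; only Gamma leaves tied pairs out of the denominator.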
You probably noticed that these measures for ordinal variables can be both positive and negative. It’s often hard to interpret the sign. We would like to be able to say that a positive value indicates that as one variable increases the other variable increases and a negative value indicates that as one variable increases the other variable decreases. But that depends on how the values are coded. So to determine whether a relationship is positive or negative it’s better to look at the percentages and let them tell you if it is positive or negative. In our example, the relationship is positive since the higher respondents rate their school ability, the more likely they are to think they will graduate from college.
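A short Python sketch shows how the sign depends on coding. It computes Gamma for a hypothetical table and then for the same table with the column categories coded in the opposite order. The function name and the counts are illustrative assumptions, not MTF data.

```python
def gamma(table):
    """Gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    pairs = table[i][j] * table[k][l]
                    if l > j:
                        concordant += pairs
                    elif l < j:
                        discordant += pairs
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical counts, not MTF data
table = [[30, 10], [10, 30]]
reverse_coded = [row[::-1] for row in table]  # same data, columns coded high to low

print(gamma(table), gamma(reverse_coded))
```

The strength of the relationship is identical in both versions. Only the sign changes, which is why the percentages are the safer guide to direction.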
Part VII – Now it’s Your Turn Again
Use CROSSTABS to give you a table for v2179 and v2183. We want to find out if the respondent’s grades in high school help us understand why some think they will graduate from college and others don’t. Decide which variable is independent and dependent. Get the correct percents and tell SDA to compute Chi Square and the four measures of association we discussed. Use all this information to describe the relationship between these two variables.
Part VIII – Using Measures of Association to Compare Tables
The primary use of measures of association is to compare the strength of a relationship in several tables. You want to make sure that you compare the same measure of association across tables. Compare Gamma values to Gamma values and V values to V values. Rerun one of the tables that you created in Parts V and VII but this time hold sex constant. Put v2150 (sex) in the CONTROL box, which is right below the COLUMN box in the CROSSTABS dialog box. Now compare the appropriate measure of association to determine if the relationship is stronger for males or females or whether it doesn’t vary much by sex. Remember not to make too much out of small differences in the measures.
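The comparison across control groups can be sketched as follows. This minimal Python example computes Cramer’s V separately for two hypothetical tables, one per sex. The counts are invented for illustration and the function name is our own; with SDA you would read the values from the separate tables it prints for each category of the control variable.

```python
import math


def cramers_v_from_table(table):
    """Cramer's V computed directly from a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (observed - expected) ** 2 / expected
    smaller = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi_sq / (n * smaller))


# Hypothetical counts for each category of the control variable
tables_by_sex = {
    "male": [[40, 10], [20, 50]],
    "female": [[35, 25], [25, 35]],
}
v_by_sex = {sex: cramers_v_from_table(t) for sex, t in tables_by_sex.items()}
for sex, v in v_by_sex.items():
    print(sex, round(v, 3))
```

In this made-up example the relationship is clearly stronger for males than for females. A difference of a few hundredths would not justify that conclusion.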
 See exercise 6RM for a discussion of levels of measurement. Nominal variables have no underlying order and ordinal variables have an underlying order. Measures of association for nominal variables range from 0 to 1 while measures for ordinal variables range from -1 to +1.
 There’s another popular measure called Lambda but SDA doesn’t compute it and it’s harder to compute by hand so we’re going to skip it.
 If your table has two columns and two rows, V is equal to Phi which you might be familiar with. Since Phi is a special case of V, we’re not going to discuss it.
 Which variable goes in the row is important because, as we noted earlier, some measures are asymmetric, which means that their value depends on which of the two variables is dependent and which is independent. In these cases, SDA assumes that the row variable is the dependent variable.