Chapter 3 -- Population Characteristics | SSRIC - Social Science Research and Instructional Council

Last Modified 19 August 1998

Because differences in the age, income, education, gender, ethnicity, and employment characteristics of the population may affect access to resources and social status, these population components are frequently studied in more detail or are controlled when studying an issue. For example, white males tend to earn higher incomes than white females or persons of many other ethnic groups, persons with higher educational attainment tend to earn higher incomes, women predominate in older age groups, and men and women are often found in different occupations.

In this section you will examine a few of the measures commonly used to describe some of these population components.

A. The Sex Ratio

The sex component of the population is a significant element affecting many statistical tabulations. For example, there are important differences between men and women in the areas of employment, age, and income. It may be useful to control these elements in an analysis.

The sex ratio is an often-used measure of the difference in the number of men and women in an area. It is the ratio of males per 100 females and is calculated by dividing the number of males by the number of females and multiplying the ratio by 100. Scores higher than 100 indicate more males than females.

**Table 4. Sex Ratios, 1990**
Glendale	Los Angeles City	California
93	101	100
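The calculation described above can be sketched in a few lines of Python. The function name and the counts are hypothetical and chosen only for illustration; they are not census figures.

```python
def sex_ratio(males: int, females: int) -> float:
    """Return the number of males per 100 females."""
    if females <= 0:
        raise ValueError("the number of females must be positive")
    return males / females * 100


# Hypothetical community: 46,500 males and 50,000 females.
print(round(sex_ratio(46_500, 50_000)))  # 93, i.e. fewer males than females
```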

Several underlying factors may influence the sex ratio. For very large populations in developed countries, such as the entire United States, the ratio is less than 100, indicating that there are more females than males. However, when the ratio is examined across different age groups, it is greater than 100 in the early age groups before dipping below 100 in the early 20s. The early surplus arises because more males are born than females: the sex ratio at birth is about 105 regardless of the country. In the more developed countries, however, males tend to die at a higher rate than females, so the ratio falls with age.

In the early twentieth century, men were predominant in the western United States because many people had migrated into the region and the majority of those migrants were young men. By 1990, however, the sex ratio was almost even (99.5). Local differences remain: state capitals with a large number of women in clerical and administrative jobs have lower sex ratios than other cities. Similarly, retirement communities with large elderly populations have still lower ratios.

While Table 4 indicates that overall the total number of males and females in California is about the same, Table 5 presents the sex ratios by age category. As expected, males predominate up to age 34. After age 60 the proportion of females increases markedly. Among people age 75 and older, the table shows that women outnumber men two to one. The increase in the proportion of males from age 10 to age 24 is due to the male predominance among in-migrants, some of whom were teenagers in young families coming from other states or from countries like Mexico.

**Table 5. Sex Ratios by Age**
Age Group	Sex Ratio
0 - 4 yrs	105
5 - 9 yrs	105
10 - 14 yrs	110
15 - 19 yrs	116
20 - 24 yrs	110
25 - 29 yrs	106
30 - 34 yrs	103
35 - 39 yrs	100
40 - 44 yrs	99
45 - 49 yrs	98
50 - 54 yrs	95
55 - 59 yrs	88
60 - 64 yrs	82
65 - 69 yrs	48
70 - 74 yrs	68
75 - 79 yrs	50
80+ yrs	50

B. The Location Quotient

While percentages provide an indication of the relative proportion of a population subgroup in different areas, the location quotient can be used to compare a local proportion to that of a much larger area. Location quotients can determine if an ethnic population or employment within a certain occupation or industry is relatively strong or weak in different areas.

A location quotient is calculated by first dividing the number of persons in a population subgroup by the total population within a local area. This ratio is then divided by the comparable ratio for a much larger area such as an entire state. For example, if 3000 out of 10,000 persons in a community were Hispanic (a ratio of .3) and 300,000 persons out of one million persons in a state were Hispanic (a ratio of .3), the location quotient would be 1. This would indicate that the community has the same proportion of Hispanics as the entire state. When the location quotient is greater than one, the community would have a higher concentration of Hispanics than the state. A score of 2.0 would mean that the community has twice the proportion of Hispanics as the state while a score of 0.25 would indicate the community has one quarter the percentage of the state.
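A minimal Python sketch of this calculation follows, using the community and state figures from the example above. The function and argument names are illustrative assumptions, and the second community (6,000 Hispanics out of 10,000) is hypothetical.

```python
def location_quotient(local_group, local_total, area_group, area_total):
    """Local share of a subgroup divided by its share in the larger area."""
    local_share = local_group / local_total
    area_share = area_group / area_total
    return local_share / area_share


# Example from the text: 3,000 of 10,000 locally, 300,000 of 1,000,000 statewide.
same_share = location_quotient(3_000, 10_000, 300_000, 1_000_000)

# Hypothetical community with twice the state's proportion.
double_share = location_quotient(6_000, 10_000, 300_000, 1_000_000)

print(same_share, double_share)
```

A result of 1 means the local and statewide proportions match; values above or below 1 show relative concentration or scarcity.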

In the table below, various occupational categories and classes of employment are compared between Los Angeles County and the State of California. The location quotients in the first table indicate that Los Angeles County has about 1.5 times the proportion of workers in private household and machine operator occupations as the state as a whole. Also, Los Angeles County has about half the proportion of workers in farming, forestry, and fishing occupations as found over the entire state. The second table reveals that Los Angeles County has about two-thirds the proportion of state and federal government workers as the entire state. This is surprising to many people, who may have imagined that the much larger number of poorer people in Los Angeles County compared to all other counties in the state would automatically translate into a higher proportion of government workers in that county. Also, there are relatively fewer self-employed people in Los Angeles County than in the state despite the concentration of immigrants in the county and the fact that many immigrants have opened small businesses as a means of adaptation to this country.

**Table 6. Location Quotient by Occupation and Class of Worker**

**Los Angeles County, 1990**
Occupations	California Employed Persons	LA County Employed Persons	California Proportion	LA County Proportion	Location Quotient
Executive, Admin, Managerial	1,939,417	555,616	0.139	0.132	0.95
Professional Specialty Occupations	2,057,087	603,519	0.147	0.144	0.98
Technicians and Support	527,367	141,767	0.038	0.034	0.90
Sales	1,690,007	486,374	0.121	0.116	0.96
Administrative Support	2,319,459	730,744	0.166	0.174	1.05
Private Household Services	95,059	44,456	0.007	0.011	1.56
Protective Services	235,799	65,721	0.017	0.016	0.93
Other Services	1,402,919	406,436	0.100	0.097	0.96
Farming, Forestry, Fishing	382,369	52,446	0.027	0.012	0.46
Precision Production, Repair	1,548,625	462,923	0.111	0.110	1.00
Machine Operators	797,300	345,158	0.057	0.082	1.44
Transportation and Moving	480,057	142,276	0.034	0.034	0.99
Helpers, Laborers	520,844	166,356	0.037	0.040	1.06
Total Employed	13,996,309	4,203,792

**Class of Worker**
Class of Worker	California Employed Persons	LA County Employed Persons	California Proportion	LA County Proportion	Location Quotient
Private for Profit	10,000,783	3,134,368	0.715	0.746	1.04
Private not for Profit	734,520	223,631	0.052	0.053	1.01
Local Government	1,078,146	307,672	0.077	0.073	0.95
State Government	499,399	100,286	0.036	0.024	0.67
Federal Government	446,373	90,789	0.032	0.022	0.67
Self Employed	1,173,375	329,115	0.084	0.078	0.93
Unpaid Family	60,713	17,931	0.004	0.004	0.98
Total Employed	13,996,309	4,203,792
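As a check on Table 6, the location quotients can be recomputed directly from the employment counts. The sketch below does this for three rows; the variable names are assumptions, and the totals are the "Total Employed" figures shown in the class-of-worker table.

```python
CA_TOTAL = 13_996_309  # California total employed
LA_TOTAL = 4_203_792   # Los Angeles County total employed

# (California employed persons, LA County employed persons) from Table 6
rows = {
    "Private Household Services": (95_059, 44_456),
    "Farming, Forestry, Fishing": (382_369, 52_446),
    "State Government": (499_399, 100_286),
}

quotients = {
    name: round((la / LA_TOTAL) / (ca / CA_TOTAL), 2)
    for name, (ca, la) in rows.items()
}

for name, lq in quotients.items():
    print(f"{name}: {lq}")  # 1.56, 0.46 and 0.67, matching the table
```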

C. The Entropy Index

The entropy index (H) is a measure of the diversity of groups in an area. If all component groups are equally present, the index reaches a maximum. If only one of several groups is present it is 0. The maximum score increases with the number of groups used in computing the entropy index. However, it can be standardized to a maximum of 1 by dividing all values by the maximum possible score (i.e. all groups equally present in an area).

In the table below five major ethnic categories have been tabulated for four California cities. The proportion of each group in its city multiplied by the natural log of that proportion, with the sign reversed so that each term is positive, is reported in the lower part of the table. The sum of these terms for each city is the Entropy Index (H), which is reported in its raw and standardized values at the bottom. The raw scores (H) were standardized by dividing by the maximum possible score for five groups (ln 5 = 1.609).

The cities of Los Angeles and San Francisco are found to be much more diverse than Glendale and Burbank (Table 7). Because of their large Asian and Hispanic populations, many cities in California are among the most ethnically diverse in the United States.

**Table 7. Diversity Scores**
Group	Los Angeles Persons	San Francisco Persons	Glendale Persons	Burbank Persons
Non Hispanic Whites	1,299,604	337,118	114,765	64,453
Blacks	487,674	79,039	2,334	1,638
American Indians	16,379	3,456	629	501
Asians & Pacific Islanders	341,807	210,876	25,453	6,335
Hispanics	1,391,411	100,717	37,731	21,172
Group Total 1990	3,536,875	731,206	180,912	94,099

-(P_k/P) ln(P_k/P)
Non Hispanic Whites	0.368	0.357	0.289	0.259
Blacks	0.273	0.240	0.056	0.071
American Indians	0.025	0.025	0.020	0.028
Asians & Pacific Islanders	0.226	0.359	0.276	0.182
Hispanics	0.367	0.273	0.327	0.336

H	1.259	1.254	0.967	0.875
Standardized H	0.782	0.779	0.601	0.544
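The entropy calculation behind Table 7 can be sketched in Python as follows. The function names are illustrative assumptions; the counts are the Los Angeles and Burbank columns of the table.

```python
import math


def entropy_index(counts):
    """Raw entropy H: the sum of -(p * ln p) over the groups present."""
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]  # an absent group adds 0
    return -sum(p * math.log(p) for p in shares)


def standardized_entropy(counts):
    """H divided by its maximum, ln(number of groups)."""
    return entropy_index(counts) / math.log(len(counts))


# Non Hispanic Whites, Blacks, American Indians, Asians & Pacific Islanders, Hispanics
los_angeles = [1_299_604, 487_674, 16_379, 341_807, 1_391_411]
burbank = [64_453, 1_638, 501, 6_335, 21_172]

print(entropy_index(los_angeles), standardized_entropy(los_angeles))  # about 1.26 and 0.78
print(entropy_index(burbank), standardized_entropy(burbank))
```

Five equally sized groups would give a standardized score of 1, and a city with only one group present would score 0.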

D. Geographic Association

Scattergrams

Very frequently social scientists want to determine the strength of the association of two or more variables over space. For example, one might want to know if larger populations within metropolitan counties are associated with higher crime rates. One way to examine this association is to make a scattergram. Scattergrams graphically portray how closely changes in one variable correspond to changes in another. In the example below the population values for the 593 metropolitan counties in the U.S. have been plotted on the x-axis and the corresponding crimes per 100,000 persons have been plotted on the y-axis.

Figure 2. Scattergram of Population vs Crimes Per 100,000 Persons

In this scattergram there does appear to be some association between higher crime rates and larger populations. However, there is quite a bit of variability in this trend: a few counties with large populations have relatively low crime rates and a few small counties have relatively high crime rates. If the relationship were very strong, the points would fall close to a line, and if it were very weak, the points would be scattered randomly over the plot. Very strong, almost linear, relationships may be found in physical systems, such as the increase in pressure in a container with an increase in temperature. However, such strong relationships are rare among social data.

Correlation

If a scatter of points does seem to exhibit a pattern, then one might choose to measure the strength and the direction of it through the use of correlation statistics. Correlation determines whether a relationship exists between two variables which have usually been sampled from a larger population. Correlation measures are usually standardized so that if an increase in the first variable, x, always brings the same increase in the second variable, y, then the correlation value would be +1.0. If the increase in x always brought the same decrease in the y variable, then the correlation score would be -1.0. If an increase in x brought no regular change in y, then the correlation would be 0. In most calculations of correlation, an approximation of a linear relationship is assumed. However, the relationship could be curvilinear or cyclical, and so one should always examine a scattergram to see if the relationship between two values is non-linear.

There are several types of correlation measures, which can be applied to different measurement scales of a variable (i.e. nominal, ordinal, or interval). One of these, the Pearson product moment correlation coefficient, is based on interval-level data and on the concept of deviation from a mean for each of the variables. A statistic called the covariance is the sum of the products of the paired deviations of the observed values from their respective means, divided by the number of observations. The covariance is then divided by the product of the standard deviations of the two variables to get the correlation, or:

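$$ r = \frac{\dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N}}{\sigma_X \, \sigma_Y}, \qquad \bar{X} = \frac{\sum X}{N}, \quad \bar{Y} = \frac{\sum Y}{N} $$

where $\sigma_X = \sqrt{\sum (X - \bar{X})^2 / N}$ and $\sigma_Y = \sqrt{\sum (Y - \bar{Y})^2 / N}$ are the standard deviations of the two variables.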

The correlation statistic above is for the entire population. If a sample had been selected, the N would have been replaced by n - 1.
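The population form of the Pearson coefficient described above can be sketched in Python. The function name and the small data sets are hypothetical and serve only to show the limiting values of +1 and -1.

```python
import math


def pearson_r(xs, ys):
    """Population Pearson correlation: covariance / (sd_x * sd_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


# Hypothetical data: y always rises, or always falls, by the same amount as x rises.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(rising, falling)  # approximately +1.0 and -1.0
```

Because N cancels between the covariance and the two standard deviations, using n - 1 throughout for a sample gives the same value of r.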

Computing the Pearson product moment correlation for the crime and population data yields a correlation of .449, which is only a moderate level of correlation. Another statistic, called the coefficient of determination, gives the proportion of the total variance in one variable that is explained by its linear relationship with the other. The coefficient of determination is simply the square of the correlation coefficient, r. In this example, the coefficient of determination is only .202. Thus, about 20% of the variance in the crime rate is accounted for by population size. This suggests that other variables, as yet unaccounted for, are more powerful influences on the crime rate.
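As a quick check of the arithmetic, using the rounded correlation reported for the crime and population data:

$$ r^2 = (0.449)^2 \approx 0.202, \qquad 1 - r^2 \approx 0.798 $$

so roughly 80% of the variance is left unexplained by the linear relationship.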

The distribution of points in the scattergram rises quickly and then spreads out to the right, which suggests that the relationship may be somewhat non-linear. Replacing the population with its natural logarithm increases the correlation coefficient to .605 and the coefficient of determination to .367. Thus, a non-linear form of the relationship increases the percent of variance explained to about 37%. Apparently the crime rate does increase with population size, but at a decreasing rate.
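The effect of a logarithmic transformation can be illustrated with a small sketch. The populations and crime rates below are invented for illustration (they are not the county data), chosen so that the rate rises quickly at first and then levels off.

```python
import math
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Population Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))


# Hypothetical values only.
populations = [10_000, 50_000, 100_000, 500_000, 1_000_000, 5_000_000]
crime_rates = [3_000, 4_400, 5_100, 6_500, 7_000, 8_600]

r_raw = pearson_r(populations, crime_rates)
r_log = pearson_r([math.log(p) for p in populations], crime_rates)

print(r_raw, r_log)  # the logged version is noticeably closer to 1
```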

Because all 593 metropolitan counties in the U.S. were used to compute the correlation statistic, there is less value in testing its significance. Had a sample of the counties been taken, one could consider the possibility that such a relationship could have occurred by chance. To test the significance of the relationship, one could assume that there is no relationship between population size of counties and the crime rate (null hypothesis) and that the value of r is due to sampling error. A statistic called the t statistic is commonly used to test the hypothesis that the correlation value is due to sampling error.

$$ t = \frac{|r| \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

If the 593 counties had been a sample, the t test would yield a value of 12.204. A table of t-statistic values indicates that a score of 1.96 or higher would be expected to occur by chance only 5% of the time, and a score of 3.922 or higher only .01% of the time. The value of 12.204 is far higher than either. Thus, the null hypothesis could be rejected.
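The t statistic for a correlation can be computed with a short function. This is a sketch: the function name is an assumption, and the input is the rounded correlation of .449, so the result is about 12.2 rather than exactly 12.204 (the gap is presumably due to rounding of r).

```python
import math


def correlation_t(r: float, n: int) -> float:
    """t statistic for testing whether a correlation r differs from zero."""
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_value = correlation_t(0.449, 593)
print(round(t_value, 1))  # 12.2

# Compare with the 5% critical value quoted in the text.
print(t_value > 1.96)  # True
```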

There are a number of assumptions made about the data in correlation analysis which are not always met. For example, the observations should be selected randomly, they should be measured on the interval or ratio scale, they should be normally distributed, and they should be independent of each other. The latter condition may be a particular problem for observations that are geographically near to one another. However, large sample sizes can mitigate many of these problems.

Regression

If the correlation between two variables is found to be significant and there is reason to suspect that one variable influences the other, then it may be useful to calculate a regression line for the two variables. In this example one might expect that an increase in population produces an increase in the crime rate. Thus, the crime rate would be considered a dependent variable and the population size would be considered an independent variable. When plotting these variables, the dependent variable, crime, would be plotted on the y-axis and the independent variable would be plotted on the x-axis of a scattergram.

Regression expresses the relationship between the two variables as the equation for a line which best fits the scatter of points in a scattergram. The line minimizes the sum of the squared deviations of the dependent (y) variable from the line. From the equation one can estimate the value of y for a given value of x. Differences between the real and estimated y values are called residuals.
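A least-squares line of this kind can be fitted with the standard closed-form expressions for the slope and intercept. The sketch below uses invented data and a hypothetical function name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(round(intercept, 2), round(slope, 2))  # 0.05 1.99
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding).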

Figure 3. Regression of Population vs Crimes Per 100,000 Persons

The equation for the above regression line is

Crimes/100k = 3897.35 + 0.005149 * Pop
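The equation can be used to estimate a crime rate and a residual. In this sketch the coefficients come from the equation above, while the county population and its observed rate are hypothetical.

```python
INTERCEPT = 3897.35  # from the regression equation in the text
SLOPE = 0.005149     # crimes per 100,000 persons, per additional resident


def predicted_crime_rate(population: float) -> float:
    """Estimated crimes per 100,000 persons for a given population."""
    return INTERCEPT + SLOPE * population


# Hypothetical county with 500,000 residents and an observed rate of 7,000.
estimate = predicted_crime_rate(500_000)
residual = 7_000 - estimate

print(round(estimate, 2), round(residual, 2))  # 6471.85 528.15
```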

Since it is possible that quite different scatters of points could produce the same line, it is also helpful to calculate the standard error of the estimate. This provides an indication of the scatter of the points about the line. This value can be useful for comparing different samples.

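$$ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N}} $$

where $\hat{Y}$ is the value of Y estimated from the regression line.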

For this crime example the standard error of the estimate is 2252.9.
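The point that different scatters can share one regression line is easy to demonstrate. The two invented data sets below both have the least-squares line y = 2x by construction, yet their standard errors differ tenfold. This sketch assumes the population form of the statistic (dividing by N); a sample version would typically divide by n - 2.

```python
import math


def standard_error_of_estimate(xs, ys, intercept, slope):
    """Root mean squared residual about the line y = intercept + slope * x."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))


# Hypothetical data: same fitted line (y = 2x), different scatter.
xs = [1, 2, 3, 4]
ys_tight = [2.1, 3.9, 5.9, 8.1]
ys_loose = [3, 3, 5, 9]

se_tight = standard_error_of_estimate(xs, ys_tight, 0, 2)
se_loose = standard_error_of_estimate(xs, ys_loose, 0, 2)

print(round(se_tight, 3), round(se_loose, 3))  # 0.1 1.0
```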

The reliability of the regression equation also may be tested with analysis of variance. With the F statistic one can determine how much of the total y variability is due to the regression line and how much is due to the residuals. If a large portion of the variance comes from the equation and the independent variable, then the model provides a good prediction of y and the value of F is high.

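$$ F = \frac{\sum (\hat{Y} - \bar{Y})^2 / df_1}{\sum (Y - \hat{Y})^2 / df_2} $$

where $\hat{Y}$ is the value of Y estimated from the regression line, $\bar{Y} = \sum Y / N$ is the mean of Y, and $df_1$ and $df_2$ are the degrees of freedom of the regression and of the residuals (1 and N - 2 when there is a single independent variable).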

For the crime example, the F statistic is 148.94. The null hypothesis states that the regression equation fails to predict the variation in y; under it, a value of 3.86 or higher (from a table of F statistics) would occur by chance only 5% of the time. Because 148.94 is much greater than 3.86, the null hypothesis can be rejected. This means that more populous counties do indeed tend to have higher crime rates, even though other factors have a greater effect on crime rates than population size.
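The partition of variance behind the F statistic can be sketched for a regression with one independent variable. The data and function name are hypothetical; the fitted values come from the line y = 2x, which is the least-squares line for these four points.

```python
def f_statistic(ys, fitted):
    """F for simple regression: regression mean square / residual mean square."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    df_regression = 1      # one independent variable
    df_residual = n - 2
    return (ss_regression / df_regression) / (ss_residual / df_residual)


# Hypothetical observations and their fitted values from y = 2x.
ys = [3, 3, 5, 9]
fitted = [2, 4, 6, 8]

f_value = f_statistic(ys, fitted)
print(f_value)  # 10.0

# With one independent variable, F equals the square of the t statistic
# for the correlation; the values reported in the text agree.
print(round(12.204 ** 2, 2))  # 148.94
```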
