Compound Measures
In the Chapter 2 exercises the text introduced you to the idea of compound
measures. If you recall, when you have a complex idea like a person's position on
abortion, which varies with the circumstances, several questions are often
required to cover all the possible circumstances under which someone may
be willing to allow abortions to legally take place. The more complex the
concept, the more likely that the researcher will have to use several measurements
to capture an acceptable number of properties that are part of the concept.
Because the text does not really go into this topic and because knowing
how to combine measures is essential to performing research, I want to
give you some additional information here. The major question is how
we can combine those measurements to get a good compound measure. But before
we get to that, let's briefly go over the reasons why compound measures
are often preferable to simple single indicators for concepts.
Advantages of Compound Measures
1. Compound measures, as noted above, allow you to measure several
different aspects of a complex concept. Most concepts, like political
participation or trust or knowledge, have many component parts. To capture
the concept in a measure, to operationalize it, requires that we have questions
that try to measure all or at least many of the most relevant parts.
2. Compound measures are generally more reliable because a mistake or misinterpretation on one component measure does not cause the whole measurement to go wrong. For example, if you think of a test in school as a compound measure, then you should not fail just because you miss one question due to a careless mistake or a misinterpretation.
3. Compound measures allow you to discriminate more closely between units that are really different but might be measured the same using a simple single-component measure. In general, the more components, the more finely you can discriminate between different units. For example, the more questions a teacher asks on a test the more easily she can tell the difference between a B+ student and an A- student.
Evaluating Compound Measures
1. The Reliability and Validity of each component part. Each component part of a compound measure must have
the same general qualities as any simple measure. That is, each part must
be both reliable and valid. If not, then the compound measure has problems.
2. Coverage. If the measure covers all the relevant dimensions or
properties of the concept, we say that it has good coverage.
3. Discrimination. A good compound measure will be able to classify the
units into several different categories to create a good deal of variation
in the overall measurement. For example, a compound measure of the economic
development of nations that resulted in a few nations being classified
as highly developed and only one or two as undeveloped while 98% of all
the nations were classified as "developing" would not discriminate very
well. The same is true of a test on which every student makes either an A
or an F.
4. Homogeneity. Each of the component parts of the measure should have something
in common, at least conceptually. All the component measures of economic
development should have something to do with the economy. If one of the
measures measures political stability, then the overall measure may not
be sufficiently homogeneous.
5. Independence. At the same time the component parts should be
relatively independent from each other. If one part determines the score
on another part then the measure lacks independence. For example, again
using the measure of economic development, if one part is the number of
telephones per capita and another is the percentage of homes with telephone
service, we have two measures that are too closely related to be considered
independent. On the other hand, the number of motor vehicles per capita
is not logically dependent on the number of telephones per capita. Independence
is kind of the opposite of homogeneity. We are looking for balance between
the two ideas.
Ways to Create Compound Measures
1. Simple Counting
This method is often used in survey research when you have a bunch of questions that are either dichotomous yes/no questions or questions that are "Likert scales" (named after their creator).
Yes/no questions are usually coded as 0 for "no" and 1 for "yes." The more "yes's" someone gives, the more of the quality that this person has. You add the number of "yes's" up and get the overall score. For example, we can measure political participation by the number of yes's to questions like: Did you vote? Did you display a yard sign? Did you make a contribution? Did you discuss the election or candidates with others? And so on.
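The counting rule for yes/no questions can be sketched in a few lines of Python. This is only an illustration: the question labels, the answers, and the function name count_score are all made up for the example.

```python
# Hypothetical answers from one respondent, coded 1 for "yes" and 0 for "no".
answers = {
    "voted": 1,
    "displayed_yard_sign": 0,
    "made_contribution": 0,
    "discussed_election": 1,
}

def count_score(coded_answers):
    """Add up the 0/1 codes; the total is the number of "yes" answers."""
    return sum(coded_answers.values())

score = count_score(answers)
print(score)  # 2
```

This respondent said "yes" to two of the four questions, so the participation score is 2.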
A Likert scale is usually either a 5 or 7 point scale. The respondent is read a statement and then says how she feels about it on a scale that runs from "Strongly Agree" through "Don't feel one way or the other" to "Strongly Disagree." If it is a 7 point scale, it includes "Moderately" and "Slightly" on both sides. If it is a 5 point scale, it includes just "Agree" and "Disagree" as points 2 and 4 in the scale. Here you must do two things. First, make sure that the codes all run in the same direction, so that the highest code (5 on a 5 point scale, for example) is attached to the answer that means having more of whatever the concept is all about. For example, "Strongly Agree" indicates more political trust when that answer is given to the statement: "Politicians care about what people like me think." But "Strongly Agree" indicates a low level of political trust when it is given to the statement that "All politicians are out for themselves." So if we are measuring political trust, then the SA answer should be coded as 5 for the first statement and as 1 for the second. Got it so far?
Second, once everything is properly coded, you just add the codes together for each unit. So if there are 5 questions on a 5 point scale, then the total scale score will run from a possible low of 5 to a high of 25. If we are measuring political trust, then the lower a person's overall score, the less trust that person has.
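Here is a rough sketch of the two Likert steps, reversing the codes where needed and then adding. It assumes a 5 point scale coded 1 to 5 with 5 as "Strongly Agree." The item labels and the function names reverse_code and likert_total are hypothetical.

```python
SCALE_POINTS = 5  # 5 point Likert scale, coded 1..5

def reverse_code(code, points=SCALE_POINTS):
    """Flip a code so that 5 becomes 1, 4 becomes 2, and so on."""
    return points + 1 - code

def likert_total(codes, reversed_items):
    """Sum the codes after flipping the items whose wording runs the other way."""
    total = 0
    for item, code in codes.items():
        total += reverse_code(code) if item in reversed_items else code
    return total

# Hypothetical respondent. Raw codes: 5 = "Strongly Agree", 1 = "Strongly Disagree".
raw = {"care_what_i_think": 5, "out_for_themselves": 5}
# Agreeing with "out_for_themselves" signals LOW trust, so that item is flipped.
trust = likert_total(raw, reversed_items={"out_for_themselves"})
print(trust)  # 5 + 1 = 6
```

Someone who strongly agrees with both statements ends up in the middle of the possible range for two items (2 to 10), which is what we want, since the two answers point in opposite directions.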
2. Averaging
The problem with the counting method is what to do with people who fail to answer one of the questions in the scale. You could throw them out. That causes you to lose data. Or you could just count without that question. That distorts their score downward each time. Averaging solves that problem. Here you just add up the scores on each component question, like you did in the counting method, but then you divide the total score by the number of questions answered.
So in the case of a 5 question Likert scale, people who answered all 5 questions would have their total scores divided by 5 and those who answered only 4 will divide their total scores by 4. We will do some examples in class and for homework to make sure that you understand.
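A minimal sketch of averaging, assuming that a skipped question is recorded as None (that coding choice and the function name average_score are assumptions, not something the text specifies):

```python
def average_score(codes):
    """Mean of the answered items; None marks a question that was skipped."""
    answered = [c for c in codes if c is not None]
    if not answered:
        return None  # nothing answered, so no score can be computed
    return sum(answered) / len(answered)

full = [4, 5, 3, 4, 4]        # answered all 5 questions: 20 / 5
partial = [4, 5, None, 4, 3]  # skipped one question: 16 / 4
print(average_score(full))     # 4.0
print(average_score(partial))  # 4.0
```

With simple counting these two hypothetical people would score 20 and 16. With averaging they both score 4.0, so the person who skipped a question is not pushed downward.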
3. Thurstone Scaling ("panel of experts" method)
The first two methods we looked at weighted all the questions the same. That may be ok if all questions are equally important in measuring the concept, but in many situations some questions should be weighted more than others. This is just like when a teacher decides to make one test question worth 25 points because the material it covers is more important than the material covered in a 10 point question. A Thurstone Scale allows you to weight some items more heavily than others. It almost always involves survey questions with which respondents are asked to agree or disagree. Here is how it works. There are 4 steps.
1) You start with a "panel of judges," who are presumably experts in the area, and a set of agree/disagree format survey questions that are presumably all measuring the same underlying concept. The "experts" then read each question and individually rate it on some scale (say, from 1 to 9) according to how much of the concept would be captured by that question if a respondent agreed with it.
2) Then you throw out questions on which the experts have major disagreements about the weights. You keep the ones where they have similar weights.
3) Next you compute an average weight for each item you keep, using the ratings of the experts. Ideally there is a wide variety of weights across the different questions.
4) Then you actually ask the questions to respondents and compute a score for them from the average weight for all the questions to which they said "agree."
I know this sounds complicated, but a simple example will show you how easy it is. The big problem with Thurstone scaling is getting a panel of experts who have a good understanding of the concept that is being measured. Suppose we have four questions that all presumably measure extent of support for public education. What the experts must do is rate each question. A rating of 1 indicates most opposition to public education if a person agrees with that statement. A rating of 9 indicates the most support for public education if a person agrees with that statement. So here are the statements.
1) Tax money spent on public education is mostly wasted.
2) People should only send their children to public schools if they cannot afford private schools.
3) I would be willing to pay $500 more in taxes each year if the money went to improving the public schools.
4) School vouchers should be made available so that parents can choose whatever public or private school they want to send their children to.
Suppose we have a panel of three experts who rate the questions as follows:
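(Ratings are listed in order for statements 1 through 4, one line per expert.)
Expert #1: 1 -- 3 -- 9 -- 2
Expert #2: 1 -- 2 -- 9 -- 4
Expert #3: 1 -- 3 -- 8 -- 1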
You then look at which questions created the most disagreement among the experts. In this case all agreed that agreement with statement 1) indicated the highest opposition to public education. On statement 2) there was a little disagreement, but the range of ratings was only one, from 2 to 3. On statement 3) there was also a little difference in ratings, but all thought that agreement with that statement indicated about as much support for public schools as possible. However, the experts came up with a wide range of ratings for statement 4). The third expert thought agreement with that statement indicated the highest level of opposition to public education, while the second expert thought that agreement indicated only mild opposition. So based on this, we should throw out statement 4). The average ratings for the remaining three questions are as follows:
1) 1+1+1=3 and 3/3=1
2) 3+2+3=8 and 8/3=2.7
3) 9+9+8=26 and 26/3=8.7
Now suppose that a respondent answers the remaining three question/statements as follows:
1) disagree 2) agree 3) agree
You then compute the average weight for the questions with which they agreed. So only the weights for question/statements 2) and 3) are used. So that respondent's score is as follows: 2.7 + 8.7 = 11.4 and 11.4 divided by 2 = 5.7. Thus this respondent is nearly one point to the support side of the neutral point of 5 on support of public education. We will do some other examples in class to make sure you understand this method.
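The whole public education example can be sketched in Python as follows. The ratings are the ones given above. The cutoff MAX_RANGE = 2 for "major disagreement" is an assumption made for this sketch (the text gives no exact rule), and the function names are hypothetical.

```python
# Ratings from the example: one row per expert, one column per statement.
ratings = [
    [1, 3, 9, 2],  # expert 1
    [1, 2, 9, 4],  # expert 2
    [1, 3, 8, 1],  # expert 3
]
MAX_RANGE = 2  # assumed cutoff: drop a statement if its ratings spread more than this

def item_weights(ratings, max_range=MAX_RANGE):
    """Return {statement number: average rating} for statements the experts agree on."""
    weights = {}
    for idx, column in enumerate(zip(*ratings), start=1):
        if max(column) - min(column) <= max_range:
            weights[idx] = sum(column) / len(column)
    return weights

def thurstone_score(weights, agreed):
    """Average the weights of the kept statements the respondent agreed with."""
    used = [weights[i] for i in agreed if i in weights]
    return sum(used) / len(used) if used else None

weights = item_weights(ratings)
score = thurstone_score(weights, agreed=[2, 3])
print(sorted(weights))  # [1, 2, 3]  (statement 4 is dropped)
print(round(score, 1))  # 5.7
```

The sketch averages the unrounded weights and rounds only at the end, which here gives the same 5.7 as the hand calculation.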
4. Guttman Scaling
Guttman Scales can only be used for questions that have dichotomous answers, just as was the case for Thurstone scaling. Here the answers are usually of the yes/no format. Guttman scaling assumes that the statements measure different amounts of the same property. Unlike the Thurstone scale, in which all you have is a number that indicates a relative amount, a Guttman scale allows a substantive interpretation of the score for each individual. That is a real plus in describing how people and groups of people score, especially when you are trying to explain what it all means. The drawback is that you can't be sure that you have a Guttman scale until after the survey is done. But if you use one that has been used successfully elsewhere, you will probably be ok. Here are the rules. Assume that 0 is no and 1 is yes and that we have already asked the questions to the sample of respondents.
1) Rearrange the questions and respondents until you have a triangular pattern that has as few "errors" in it as possible. Suppose you have four statement/questions and 5 people in the sample. The pattern might look something like the following (with statements across the top and people forming the rows):
1 1 1 1
1 1 0 1
1 1 0 0
1 0 0 0
0 0 0 0
In this pattern there is only one error. The second person either should have answered yes to the third question or answered no to the last question. Either way it is an error. You can also remove one or two questions if they are causing a lot of errors (but you can't remove people!).
After you do this (computers do this for us when we have a lot of people and a lot of questions), you calculate the percentage of errors that occurred. If it is less than 10%, then we accept the question/statements as Guttman scalable. In this example, there was one error out of a possible 20 (5 people times 4 questions=20). So the percentage of errors is 1/20=5%. This is less than 10%, so we would accept these questions as Guttman scalable for these respondents.
Finally, what is the Guttman scale score for the third person? It is the number of questions with which they agreed, which is 2. The interpretation of what that means is determined by the meaning of the last question to which they said "yes." So suppose these were questions about how much one loves the South and the second question/statement was "Would you want to live in the South even if it meant that you were paid less money?" Then this person's score of 2 would mean that their love for the South extended to being willing to take a salary cut to remain there. We will do some other examples to illustrate this in class.
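Here is a sketch of the error count and the scale score for the pattern above. It assumes the rows and columns have already been rearranged. It counts a row's errors as the fewest answers that would have to change to make the row a perfect pattern, which matches the count of one error in the example; other ways of counting errors exist and can give different totals. The function names are hypothetical.

```python
# The pattern from the example: rows are people, columns are statements,
# already arranged so the pattern is as close to triangular as possible.
pattern = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

def row_errors(row):
    """Fewest answers that would have to change to make the row a perfect
    Guttman pattern (all 1s followed by all 0s)."""
    n = len(row)
    best = n
    for k in range(n + 1):
        ideal = [1] * k + [0] * (n - k)
        changes = sum(a != b for a, b in zip(row, ideal))
        best = min(best, changes)
    return best

def error_rate(pattern):
    """Total errors divided by the number of answers (people times questions)."""
    total_answers = len(pattern) * len(pattern[0])
    return sum(row_errors(r) for r in pattern) / total_answers

rate = error_rate(pattern)
third_person_score = sum(pattern[2])  # number of "yes" answers
print(rate)                # 0.05
print(rate < 0.10)         # True, so we accept the items as Guttman scalable
print(third_person_score)  # 2
```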
5. Standardized scores ("z scores")
This is a really interesting kind of compound measurement
technique in that it allows us to accomplish two very important things.
First, we can add together very different kinds of things--as long as we
can treat each one as an interval level measure. For example we can add
together miles of railway in a nation with the numbers of newspapers per
capita with the illiteracy rate. Second, it allows us to weight all of
the things equally in the final measure. So we can add together your GPA
along with the number of clubs and organizations you joined at USCA to
get an overall measure that is not distorted by your having joined 8 organizations
(which would otherwise hide your 2.1 GPA in the final score).
Unfortunately, we really can't do this one yet, because
we do not yet know how to do the statistical manipulations. First, we have
to calculate standardized scores for each individual measure. As you will
learn, standard scores all have a mean of 0 and a standard deviation of
one. They are measured in terms of how many standard deviations they are
above or below the mean. So all measurement units, whether grade point
units, or numbers of clubs, or miles of railway, all become standard
deviation units. But you do not know how to do that yet. In any case, once
you make this transformation, you can just add them together. What is nice
is that most computer packages will allow you to do this rather easily.
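For those who want a preview, here is a rough sketch of the idea using Python's standard statistics module. The GPA and club numbers are invented for illustration, and the choice of the population standard deviation (pstdev) is an assumption; a statistics package may use the sample version instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw values to standard deviation units (mean 0, SD 1).
    Uses the population standard deviation, which is an assumption here.
    Fails if every value is identical, because the SD is then 0."""
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]

# Hypothetical data for four students.
gpa = [2.1, 3.0, 3.5, 3.4]
clubs = [8, 2, 1, 1]

z_gpa = z_scores(gpa)
z_clubs = z_scores(clubs)
# Add the two standardized measures for each student.
combined = [g + c for g, c in zip(z_gpa, z_clubs)]
print([round(x, 2) for x in combined])
```

If you simply added the raw numbers, the first student's 8 clubs would swamp the 2.1 GPA. After standardizing, each measure counts equally, so the high club count and the low GPA largely offset each other.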
Assignment for next class: To be made in class. You need to be
there!!!