Sampling
I am going to expand on what Corbett says in Chapter 6, giving a little
more detail in a couple of key areas. My goal here is to give you enough
information so that you could design your own sample if you were doing
a survey.
Here is a general outline of what I am going to talk about. First, if we are doing a survey, we must decide what kind of units to observe in doing research. Second, we need to decide how many to observe--that is, sample size. Third, we need to know precisely how to choose those units--sample selection.
How we answer the second and third questions will have a big impact on error in our findings. In general there are two kinds of error in survey research: sampling error and bias. Sampling error is good. It is good because if we choose our units correctly, we can control it. That is, we can estimate what it will be and make it smaller if we need to before we do the survey.
On the other hand, bias is bad. We can never be sure how big it is. And it is hard to eliminate. All we can do is try to minimize it as much as possible. Bias can enter survey research in three ways. 1) Sample bias can wreck research if the units for study do not represent the population. That is what we will talk about in this section--how to minimize sample bias and maximize the chance that the sample represents the population. 2) Question and questionnaire bias. 3) Interviewer bias. These last two topics will be covered in the next section on surveys, in which we will talk about how to minimize bias in writing questions and in designing questionnaires, and how to train interviewers to minimize bias. For now, we will stick to talking about sampling error and sample bias.
What kinds of units to observe?
Generally speaking, one should choose units that are as close as possible to those specified in the hypothesis. So if the hypothesis is about citizens, then the units observed should be citizens. If it is about cities, then cities should be observed.
Sometimes, as I showed you earlier, you can substitute aggregate data for individual data. For example, if you need to know about voting behavior, ideally you would do a survey of voters or an exit poll. But you may be able to substitute precinct data--IF you know the characteristics of the precincts and the precincts are homogeneous--that is, all of the people in the precinct are alike in ways that are important in your study (e.g. white, middle class, working class, Hispanic, or whatever). Then you can compare precincts with different characteristics and assume that the votes represent how people with these different characteristics are voting. But the more heterogeneous the aggregate units are, the less you can assume about who in each unit is doing what.
How Many? Sample size and sampling error
Corbett does not give us much help here. He has a pretty good discussion of accuracy and confidence levels, talking about a "formula" that he never shows us. Finally, he gives us a table to estimate sample size. It was based on the formula that I will give you. He also has a very confusing and unnecessary discussion of "Variability." Don't bother to read this section. The whole purpose is to make an assumption about the nature of the data we are gathering. I will show you that assumption in class in working with the formulas I am about to give you.
At this point, I will just give you some formulas. In class we will go over them and I will show you how they are applied and a little about how they are related to each other.
Formula for sampling error for proportions AFTER a survey is completed:
Sampling Error for Proportions = [2 x sq rt (p(1-p)/n)] [(N-n)/(N-1)]
where p is the proportion in decimal form, e.g. .35 if 35% of the sample say they voted
n is the sample size
N is the population size
This is great, except that it is only good after the survey is over! What we really need is a formula that will help us plan a survey and keep sampling error in an acceptable range. If we make certain worst case assumptions (like the variability thing that Corbett talks about), this formula can be reduced to the following (I will show you in class how we get to this one):
Formula for sampling error for proportions BEFORE a survey is performed:
Sampling Error for Proportions = [1/sq rt (n)] [(N-n)/(N-1)]
where n is the sample size
N is the population size
You will note that p is not in this formula--because we do not know what p is until after it is done!
The last factor in the formula, [(N-n)/(N-1)], is called the Correction Factor. It can be dropped if the sample is small relative to the population. The rule of thumb here is that if the sample is LESS than 10% of the population, you can drop this factor. So in a sample of 300 out of a population of 10,000 people, the factor is inconsequential and can be ignored. On the other hand, if you are doing, say, 350 out of the 3200 USCA students, then it should be kept.
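The two formulas and the 10% rule of thumb can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function names are made up for this example. The after-survey form used here, 2 x sq rt of p(1-p)/n, is an assumption, chosen because it reduces to 1/sq rt of n when p = .5. The correction factor is applied exactly as written in these notes. Many statistics texts take the square root of that ratio instead, so check which form your own source uses.

```python
import math


def correction_factor(n, N):
    """Correction factor as written in these notes: (N - n) / (N - 1)."""
    return (N - n) / (N - 1)


def sampling_error_before(n, N=None):
    """Planning formula: 1/sqrt(n). The correction factor is applied only
    when N is given and the sample is 10% or more of the population."""
    error = 1 / math.sqrt(n)
    if N is not None and n >= 0.10 * N:
        error *= correction_factor(n, N)
    return error


def sampling_error_after(p, n, N=None):
    """ASSUMED after-survey form: 2 * sqrt(p(1-p)/n), with the same
    rule of thumb for the correction factor."""
    error = 2 * math.sqrt(p * (1 - p) / n)
    if N is not None and n >= 0.10 * N:
        error *= correction_factor(n, N)
    return error


# 300 out of 10,000 is under 10%, so the factor is dropped.
print(round(sampling_error_before(300, 10_000), 3))
# 350 out of 3,200 is over 10%, so the factor is kept.
print(round(sampling_error_before(350, 3_200), 3))
```

With p = .5 the two functions return the same number, which is the worst case assumption behind the planning formula.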
When can these formulas be used? They can only be used when each unit in the population has an equal chance of being chosen (when we have a sample that approximates a random sample -- see the discussion below). Otherwise, the formulas are inappropriate and have little to do with actual error.
How accurate are these formulas? They are all calculated at what is
called the 95% confidence level. That means that there is a one
in twenty chance that the actual proportion for the population will be
outside the + or - confidence interval. Another way to look at it is
that 19 of 20 times you will be within the sampling error of the truth.
Of course, except in the case of exit polls that ask for whom you voted,
we do not know what the truth is. So we must just have faith in the laws
of sampling and do the best we can to minimize bias--the other source of
error.
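The "19 of 20" claim can be checked with a small simulation. The sketch below assumes a very large population in which the true proportion is known, draws many pretend surveys, and counts how often the sample proportion lands within 1/sq rt of n of the truth. The function name, the seed, and the numbers are all made up for illustration.

```python
import math
import random


def coverage(true_p, n, trials, seed=0):
    """Share of simulated surveys whose sample proportion falls within
    1/sqrt(n) of the true proportion."""
    rng = random.Random(seed)
    margin = 1 / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Each respondent says "yes" with probability true_p.
        yes = sum(1 for _ in range(n) if rng.random() < true_p)
        if abs(yes / n - true_p) <= margin:
            hits += 1
    return hits / trials


# Share of 2,000 simulated surveys of 400 people that land within the margin.
print(coverage(true_p=0.5, n=400, trials=2000))
```

The printed share should come out near .95, a little higher or lower depending on the seed.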
How to choose units? Types of Samples
Once you decide how many, you have to come up with a scheme for choosing them. One could have a whole course on sample selection--indeed, such courses exist at the graduate level. For our purposes, we will keep things relatively simple. But I do want to add to what Corbett tells you--his discussion is not really enough to design a practical sampling scheme.
Generally, all sample designs fall into one of two categories: haphazard samples or probability samples. Haphazard samples are pretty terrible in the sense that they ignore the problem of bias and consequently usually have a great deal of bias in them. There are several special kinds of haphazard samples.
Convenience samples are composed of units that the researcher finds convenient to find and survey. The "person-on-the-street" interviews that one sees so frequently are convenience samples that are rarely representative of all the diverse people who exist. The people interviewed are chosen because they are easily found (often at malls these days) and easily interviewed. Years ago in the early days of survey research even the professional pollsters used a form of convenience sampling. They called it "quota sampling," and thought it was a way of being more scientific. What they did was to try and force the sample to look like the general population in specified ways: so many women, so many in each age group, and so on. But after they did that, the interviewers went out and found people in each group who were convenient to find. It was still a convenience sample and was likely biased in that certain kinds of people are more convenient to find than others.
Self-selected samples are even worse, if that is possible. Here the interviewer waits for the sample to come to her. Usually the motivation is caring enough to participate. Straw polls in which one calls in, clips something out of a newspaper and sends it in, or uses a form on a Web page are all versions of self-selected samples. The bias is that those with the most intense feelings are the most likely to volunteer themselves. That usually means a bias in favor of more extreme opinions.
The Literary Digest survey of 1936 is haphazard, as Corbett points out. It has elements of both a convenience sample and a self-selected sample. It was convenient for the magazine to get addresses from auto registrations and from telephone books. Then those who could afford the magazine or had strong feelings about it self-selected themselves to respond. What Corbett does not explain is why the survey got it right in the previous two elections. The reason is that the parties were not aligned nearly as much along economic lines in 1928 and 1932. The splits were more regional. And in 1932, everybody, whether rich or poor, was unhappy with Hoover. But in 1936 the poor had moved to the Democratic Party and the wealthy were angry with FDR, feeling that he was a traitor to his economic class. So the vote was more along class lines. And as the book points out, the sample had very few lower class people in it.
Probability Samples are samples in which one knows the probability of each unit in the population being chosen. If the probability of each unit in the population being chosen is equal, then we can use the sampling error formulas that were given above. Sometimes people will call this a simple random sample. Simple random samples do have this characteristic, but in fact the term refers to a particular method of choosing a probability sample. So let's start with that kind of sample.
A simple random sample requires a complete listing of all units in the population. Each unit is given a unique number (1, 2, 3, and so on). Then a random number table or generator from a computer program is used to select the numbers of the units to be chosen. I will provide you with a random number table and go over some examples in class. Unfortunately, often one does not have a numbered list of all units in the population, so this is not really a practical method most of the time. Choosing the numbers out of a bowl or basket (like in the old draft lotteries) is a variation on this method. We will do an exercise on this in class with M&M's to show how the formulas work.
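When the numbered list does exist, a computer can stand in for the random number table. Here is a minimal sketch using Python's standard random module. The population of 2,500 numbered units and the function name are assumptions for illustration only.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Choose n distinct units so that every unit has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(units, n)


# Hypothetical population: units numbered 1 through 2500.
population = list(range(1, 2501))
chosen = simple_random_sample(population, 400, seed=42)
print(len(chosen), sorted(chosen)[:5])
```

Fixing the seed makes the draw repeatable, which is handy if someone later asks how the sample was chosen.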
A systematic sample is a very practical way of choosing a sample when you have a list but it is not numbered. Say, for example, you have a customer list for a bank, of say 2500 people. If you want a sample of 400, you then choose every kth one, where k is N/n, or in this case, 2500/400 = 6.25, which we then round to 6 so that we don't run out of people in the list at the end. The first one chosen would be randomly chosen from the first 6, and then you count every 6th one from there.
If you have a really long list with pages--like a phone book--you can do a slight variation on this by choosing the rth name (where r is some low random number so that you do not have to count very much) on every kth page (or column), so that you get the desired number and get all the way through the phone book.
There is a slight bias in this in that people at the end may have a lower probability of being chosen. In the example of 400 from a list of 2500, choosing every 6th leaves roughly the last 100 with no chance of being chosen. The best way around this is to go ahead through the whole list and choose a few extra so that all have an equal chance. In this case you would end up with a sample of about 416.
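The systematic procedure, including going all the way through the list, can be sketched as follows. The function name and the stand-in customer list are assumptions for illustration. The interval k is N/n rounded down, the starting point is chosen at random from the first k entries, and every kth entry is taken until the list runs out.

```python
import random


def systematic_sample(units, n, seed=None):
    """Take every kth unit after a random start among the first k,
    continuing to the end of the list so nobody is left out."""
    rng = random.Random(seed)
    k = max(1, len(units) // n)   # 2500 // 400 = 6
    start = rng.randrange(k)      # random position among the first k
    return units[start::k]


# Hypothetical bank customer list of 2,500 entries.
customers = list(range(2500))
sample = systematic_sample(customers, 400, seed=1)
print(len(sample))
```

For 400 from 2,500 this gives 416 or 417 units, depending on where the random start falls.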
In telephone interviewing, firms often use a technique called random digit dialing. This technique uses a random number table (usually a computer program) to generate telephone numbers so that all possible numbers have an equal chance of being chosen. A real plus here is that unlisted numbers are just as likely to be selected. A systematic sample from a phone book misses unlisted numbers. This works pretty well if everyone in the population has the same exchange or it is a national sample with thousands of possible exchanges and hundreds of area codes. But it does not work very well in most other cases because you have a lot of unused exchanges or area codes and therefore a lot of generated numbers that are not usable. So what we usually do is to combine this method with stratification.
When you stratify a sample, you make sure that you select certain proportions of the sample in each stratum of the population. This is only used in combination with other techniques. So for example, in random digit dialing, we might stratify Aiken City area numbers into several strata corresponding to each of the exchanges in the area (641, 642, 648, 649 and so on). Then once you decide how many numbers you want in each stratum, you generate the other 4 digits using random numbers. How do you know how many from each stratum (exchange in this case) to use? You may have to take a systematic sample from the phone book to see what proportion of the numbers in the phone book falls in each exchange. One need only assume that unlisted numbers have the same proportions.
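Random digit dialing stratified by exchange might look like the sketch below. The exchange shares are invented for illustration. In practice they would come from something like the phone book sample described above. The function name is also hypothetical.

```python
import random

# HYPOTHETICAL share of numbers in each exchange (must sum to 1).
EXCHANGE_SHARES = {"641": 0.30, "642": 0.25, "648": 0.35, "649": 0.10}


def stratified_rdd(shares, total, seed=None):
    """Generate phone numbers: a quota per exchange (the stratum),
    then four random final digits within each exchange."""
    rng = random.Random(seed)
    numbers = []
    for exchange, share in shares.items():
        quota = round(total * share)
        # Sampling without replacement avoids dialing the same number twice.
        for last4 in rng.sample(range(10_000), quota):
            numbers.append(f"{exchange}-{last4:04d}")
    return numbers


numbers = stratified_rdd(EXCHANGE_SHARES, 400, seed=3)
print(len(numbers), numbers[:3])
```

Because the last four digits are random, unlisted numbers are as likely to come up as listed ones. Some generated numbers will still turn out not to be working.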
Stratification is often used in deciding whom to talk to once one reaches a working number. If someone calling you says that they need to speak with the oldest/youngest/second oldest male/female in the household, you know that this is a professional survey that is stratifying by age and gender. That is typical. An easy and useful variation on this which chooses all genders and ages proportionally is the "Last Birthday method." Here you simply ask to speak to the person (adult) in the household who has had the most recent birthday. This may sound pretty silly, but if you think about it, it is a quick and excellent off-the-wall way of randomly choosing a member of the household with whom to talk.
Even if you stratify by exchange, one still often ends up with many numbers that are not usable. One way around this is to take advantage of the fact that the phone company assigns numbers in clusters, so that if one is good, the numbers around it are usually also good. The "Plus One" method of number selection in telephone surveys works in two steps. First, you select a sample from the phone book using the systematic sample technique or some variation thereof. Then, in order to allow for unlisted numbers to be chosen (which can run as high as half of all households in some urban areas), you simply add one (or two or three or four or five) to the number. Then you have a sample of mostly working numbers in which all numbers in the population have roughly equal chances of being chosen. This is a very good method, especially when you are having to work through several listings, say all those in Aiken County!
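The second step of the "Plus One" method is simple arithmetic on the listed number. Here is a minimal sketch. The number format ("641-2350"), the function name, and the choice to wrap 9999 around to 0000 are assumptions made for illustration.

```python
def plus_one(listed_number, add=1):
    """Add a small constant to the last four digits of a listed number.
    ASSUMPTION: 9999 wraps around to 0000 within the same exchange."""
    exchange, last4 = listed_number.split("-")
    new_last4 = (int(last4) + add) % 10_000
    return f"{exchange}-{new_last4:04d}"


# Hypothetical numbers drawn from the phone book by systematic sampling.
listed = ["641-2350", "648-0999", "649-9999"]
print([plus_one(number) for number in listed])
```

The same function handles "plus two" or "plus five" by changing the add argument.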
Cluster Samples are used in face-to-face surveys in order to save time and money. It would be impractical to travel to 400 widely scattered homes throughout South Carolina. So instead, using census tracts and block maps, we might randomly choose 25 tracts and then choose two blocks in each tract so that we have 50 blocks to travel to. Then we do 8 interviews in each block area (choosing houses systematically). Often this is combined with some stratification so that we get the right mix of urban, suburban, and rural, for example. Needless to say, this is a very complicated technique that requires a great deal of planning.
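The three stages (tracts, then blocks, then houses) can be sketched as below. The sampling frame here is entirely made up: 60 tracts, each with 10 blocks of 40 houses. The function name is hypothetical, and the sketch assumes every block has at least as many houses as interviews wanted. A real design would work from census tract and block maps and would usually add stratification.

```python
import random


def cluster_sample(frame, n_tracts, blocks_per_tract, per_block, seed=None):
    """Randomly pick tracts, then blocks within each tract, then choose
    houses systematically within each block."""
    rng = random.Random(seed)
    interviews = []
    for tract in rng.sample(sorted(frame), n_tracts):
        for block in rng.sample(sorted(frame[tract]), blocks_per_tract):
            houses = frame[tract][block]
            k = len(houses) // per_block   # interval between houses
            start = rng.randrange(k)       # random starting house
            interviews.extend(houses[start::k][:per_block])
    return interviews


# MADE-UP frame: tract -> block -> list of house labels.
frame = {
    f"tract{t}": {
        f"block{b}": [f"t{t}-b{b}-house{h}" for h in range(40)]
        for b in range(10)
    }
    for t in range(60)
}
sample = cluster_sample(frame, n_tracts=25, blocks_per_tract=2, per_block=8, seed=7)
print(len(sample))
```

With 25 tracts, 2 blocks per tract, and 8 houses per block, the sketch returns the 400 interviews in the example.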
Weighted Samples are chosen in situations where one needs to compare subgroups, one of which would be too small to use if the groups were chosen in proportion to their actual size. Say, for example, that one wishes to compare the attitudes of social science majors (history and political science, sociology, psychology) with all other majors here at USC Aiken. If we were to choose a sample of 200 using some method that made each student equally likely to be chosen, we would end up with only about 20 social science majors in our sample, since they number only 200 out of roughly 2,000 students. Obviously one could not make a very good comparison between such a small subsample and such a large one! The sampling error for the roughly 20 social science majors would be about + or - 22% (using the formula 1/sq rt of n). If making some kind of comparison was the major point of the study, we might wish to weight the two subgroups equally, taking a systematic sample of 100 of each group. That would have about a + or - 10% sampling error for each subgroup (again, using the formula 1/sq rt of n).
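The arithmetic behind that comparison is easy to check. The short sketch below applies 1/sq rt of n to each subgroup under the two designs, using the subsample sizes from the example. The function and variable names are made up for illustration.

```python
import math


def subgroup_errors(counts):
    """Sampling error (1/sqrt(n)) for each subgroup's subsample size."""
    return {group: 1 / math.sqrt(n) for group, n in counts.items()}


# Subsample sizes from the example: proportional design vs. equal weighting.
proportional = {"social science": 20, "other": 180}
equal = {"social science": 100, "other": 100}

for label, design in (("proportional", proportional), ("equal", equal)):
    errors = subgroup_errors(design)
    print(label, {group: round(e, 2) for group, e in errors.items()})
```

Equal weighting cuts the error for the small group from about .22 to .10, at the cost of a somewhat larger error for the big group.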
Further suppose we make our comparison. Then we decide that we would also like to make some projection to the student body as a whole. For example, suppose we compared attitudes toward a career in public service, and found that 60% of the social science majors would consider a career in public service as compared to only 25% of the other majors. Now we would like to know how the whole student body feels about a career in public service. How would we answer this question?
We must re-weight the subsamples so that they are in correct proportion to what they should have been. Here is the method you use.
1) Compute the number of each subgroup that should have been in the subsample were it proportional to the population.
2) Divide that by the number that actually was in each subgroup. This gives you the re-weighting factor for each subgroup.
3) Multiply the number of units in each subgroup that gave the specific answer in which you are interested by that subgroup's re-weighting factor.
4) Add these re-weighted units together.
5) Recompute the proportion of the whole sample that gave that particular response by dividing that sum by the total sample size.
Yes, I know that sounds like IRS Form 1040 instructions on computing the capital gains tax. But I think an example will make it clear--and it should even make intuitive sense--something that has always eluded me in the form 1040. Let's go back to our make-believe example.
If you remember, 60% of the social science majors (out of the subsample
of 100) would consider a career in public service as compared to only 25%
of the other majors (also out of a subsample of 100). So the "number that
was actually in each subgroup" was 100 each. However, there SHOULD have
been 20 social science majors and 180 other majors, based on the fact that
social science majors were 10% of the population (200 out of 2,000, if
you remember). So 20 and 180 are the "should have been" numbers for each
subsample. Using these numbers, the re-weighting factor for the social
science majors is 20/100 = .2, and the re-weighting factor for the other
majors is 180/100 = 1.8. This means that when we re-weight, the social
science majors, who were over-counted, will be recounted for .2 persons
each, and the other majors, who were actually under-counted, will be re-weighted
as 1.8 persons each. The rest can be pretty easily set up in a table.
| Group | % yes | # yes | x re-weight factor | new # yes | new % yes |
| --- | --- | --- | --- | --- | --- |
| Soc Sci maj | 60% | 60 of 100 | 60 x .2 | 12 of 20 | |
| other maj | 25% | 25 of 100 | 25 x 1.8 | 45 of 180 | |
| totals | | | | 57 of 200 | 57/200 = 29% |
You can see that had we just added together the original numbers (which
is inappropriate), we would have had 60 + 25 = 85 or 85/200 = 43%
saying they would consider a career in public service. That obviously overestimates
the truth for the whole student body because too many social science majors
were in the sample, and social science majors are far more likely to consider
a public service career. The other majors should have far more influence
on the total because they are a greater part of the overall student body.
The percentage computed using the re-weighted students does exactly that.
It is much closer to the percentage that would consider a career in public
service among other majors.
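The five re-weighting steps can be written as a short function. This is a sketch with made-up names, using the numbers from the example: each group is described by its population size, its subsample size, and the number in the subsample who said yes.

```python
def reweighted_proportion(groups):
    """groups maps a name to (population size, subsample size, yes count).
    Returns the re-weighted proportion of the whole sample saying yes."""
    total_pop = sum(pop for pop, _, _ in groups.values())
    total_sample = sum(n for _, n, _ in groups.values())
    new_yes = 0.0
    for pop, n, yes in groups.values():
        should_have_been = total_sample * pop / total_pop  # step 1
        factor = should_have_been / n                      # step 2
        new_yes += yes * factor                            # steps 3 and 4
    return new_yes / total_sample                          # step 5


groups = {
    "social science": (200, 100, 60),   # 60% of 100 said yes
    "other": (1800, 100, 25),           # 25% of 100 said yes
}
print(round(reweighted_proportion(groups), 3))
```

The result, .285, is the 57/200 in the table, which rounds to 29%.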
Assignment for next class:
1. How large a sample would you need in order to end up with a sampling error of no more than + or - 5%, if the population is 50,000 (which is about the number of adults over 18 in Aiken County)?
2. Suppose you chose a sample of 500 adults out of this population of 50,000 in question number 1. And further suppose that your survey found that 30% said that they were registered to vote. What is the sampling error for that 30% figure?
3. Suppose that instead of 30%, 45% said that they were registered to vote? What is the sampling error for that figure?
4. Suppose that the % who said they were registered to vote is 50%. What is the sampling error for that finding?
5. One more--suppose the percentage is 55%--what is the sampling error for that figure?
6. Compare the numbers you found for questions 2-5. Do you see a pattern? Describe the pattern. What does it suggest about how to estimate sampling error when you don't know what the proportion (p in the formula) will be? (This is what Corbett was talking about when he had that confusing discussion on "variability," by the way.)
7. Suppose you were asked to do a survey of business majors here at
USC Aiken. There are 450 business majors.
a. What would the sampling error be if you were
to do a sample of 150 of them?
b. If you have a list that can be computer generated
with their names and telephone numbers, how would you design the survey
of 150? Describe the best process.
c. Suppose your friend, who is a business major,
offers to give out the questionnaire in his classes, which have just over
180 students in all of them combined. Then what would the sampling error
be if he got 180 questionnaires completed?
8. Take a page from the Aiken City area telephone directory. Pick the page that comes closest to the middle two digits in your social security number. Use a systematic sample of 100 to produce estimates of the relative number of numbers at each exchange on that page.
9. Suppose you use a weighted sample in doing a survey of male and female
USCA students so that the number of males and females are equal. You know
that 65% of the student body is female. The question is whether students
want to start intercollegiate football here at USCA. Each subgroup has
200 students in it and was chosen using a systematic sample so that each
male and female student had an equal chance of being chosen within their
subgroup. The finding is that 45% of the females in the sample say "yes"
to football and 65% of the males say "yes."
a. Based on these figures, is there likely to be
a real difference in how males and females feel about football in the actual
population of all students?
b. Reweight the two subsamples and recombine them
so that we can project how the student body as a whole feels about football.
Account for sampling error.