Chapter 11. Statistics
I. Introduction. Two types of statistics
Two types of statistics are relevant to this course: descriptive and inferential. Let's look at each in turn.
A. Descriptive Statistics
Descriptive statistics are designed to summarize some quality or aspect of data. So we might use some kind of measure of central tendency (means, medians or modes) to tell us what the middle or center of the data looks like. Alternatively, we might be interested in a quality that focuses on relationships between variables, so we might use a correlation coefficient or a Tau B or Tau C to summarize a relationship. A single number summarizes things quickly, but we might also consider graphs and charts to be statistics that are designed to summarize things about relationships. So crosstabulations are also a kind of descriptive statistic that summarizes a potential relationship. The same would also be true of a scatterplot, or even a mapping feature like we have in MicroCase.
B. Inferential Statistics
Inferential statistics usually look almost the same as descriptive statistics, but they add something extra. Rather than being precise, they include an error term because they are inferring something from a sample to a larger population. For example, the percentage of all students who are seniors at USCA would be a descriptive statistic, but the percentage of seniors in a sample of USCA students would be an inferential statistic. And the inferential statistic should include some kind of expected sampling error. So it might be that 19% plus or minus 4% is the inferential statistic for the percentage of seniors at USCA.
Put another way, as soon as we make inferences, we must introduce the laws of probability. You have already been doing that in working with sampling error. And you know that the range given by the sampling error has a probability of being correct, a 95% probability with the formulas we have been using.
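The "19% plus or minus 4%" idea can be sketched in a few lines of Python. This is only an illustration. It assumes the usual large-sample formula for the 95% sampling error of a percentage (1.96 times the square root of p(1 - p)/n), which may or may not be the exact formula used earlier in the course, and the sample size of 370 is a made-up number chosen so that the error comes out near 4 points.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% sampling error for a sample proportion.

    Assumes a simple random sample and the large-sample formula;
    z = 1.96 corresponds to the 95% level.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample: 19% seniors among 370 sampled students.
p_hat = 0.19
n = 370
moe = margin_of_error(p_hat, n)

print(f"Estimate: {p_hat:.0%} plus or minus {moe:.1%}")
```

Notice that the error shrinks only with the square root of the sample size, so cutting the error in half takes roughly four times as many cases.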
II. Kinds of Statistics to Use -- depends on levels of measurement and purpose
We could organize this discussion in several ways. But because what we can do is so dependent on the level of measurement of whatever variables are involved, we will organize this discussion around the levels of measurement. We will start with what we can do with nominal or categorical levels of measurement, then move to ordinal, and then finally to ratio. Dichotomous measurements are special cases that we can actually treat as nominal or ordinal or even ratio. We will skip interval, because we never really have this one in social science, but if we did it would be the same as ordinal.
One very useful rule to remember is that any statistic for a lower level of measurement can also be used for higher levels of measurement. So if something works for nominal level measurements, it can also be used for interval and ratio measurements.
Then for each level of measurement, we will briefly go over the appropriate statistics for two of the purposes of science, to describe and to explain. The third purpose, to predict, is really an extension of explanation, so the same statistics would be involved. If, for example, we find we can explain why people voted in a regression that used several independent variables, say education, political efficacy, and strength of party identification, then we could use those same variables to predict who will vote in some upcoming election. As another example, suppose we can explain, again using a multiple regression, which party won a presidential election using change in real per capita income, inflation, unemployment, and which party held the White House at the time of the election and for how long. Then we could use that same equation to predict who will win some upcoming presidential election. This, by the way, is called the "Fair model," named after Ray Fair, who developed it.
A. Nominal or categorical
1. Statistics used to describe
Suppose we look at ethnicity and have four groups: whites, blacks, Hispanics, and other. We can only do a limited number of statistical things to describe these data. "Univariate" statistics in Micro Case is where you can find all these statistics.
a. Central Tendency
To see what the center of these data are like, about
all we can do is look at the mode, that is, the category that occurs
most often. For example, suppose whites are the largest in number, with say
70% being white (which is about right for
b. Dispersion or variability
How varied our data are can be seen in several ways. We can do a frequency distribution or a percentage distribution. These would tell us the relative number or percentage in each category, in either graph form or table form.
In doing research that goes beyond just description, we need to consider the variation of our variables. If we do not have much variation, then seeking explanatory relationships is not possible.
For example, if we only have 2 to 3 percent Hispanic in our sample, even if it is a pretty large sample, we can't use Hispanic as a variable simply because we do not have enough cases to look at shifts with any statistical confidence. Any crosstabulation with an Hispanic row or column could have large percentage shifts with only a few people producing that shift.
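Here is a rough sketch of what the mode and a percentage distribution look like when computed directly. The 20 cases below are invented purely for illustration; they are not from any of our surveys.

```python
from collections import Counter

# Hypothetical sample of 20 respondents (illustrative only, not survey data).
ethnicity = ["white"] * 14 + ["black"] * 4 + ["Hispanic"] * 1 + ["other"] * 1

counts = Counter(ethnicity)
total = len(ethnicity)

# Mode: the category that occurs most often.
mode_category, mode_count = counts.most_common(1)[0]

# Percentage distribution: the share of cases in each category.
percentages = {category: 100 * count / total for category, count in counts.items()}

print("Mode:", mode_category)
for category, pct in percentages.items():
    print(f"{category:10s} {counts[category]:3d} {pct:5.1f}%")
```

With only one case each in the two smallest categories, a single person changing categories would double or wipe out those percentages, which is the problem with small groups described above.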
2. Statistics used to explain relationships
The best way to test explanatory relationships using nominal level variables is to produce a crosstabulation. Micro Case does this quite well, as you already know. In producing a crosstabulation, you should eliminate any rows or columns that have no bearing on the relationship you are testing and also remove those that have too few cases in them to tell you anything.
Suppose you are looking at explaining whether people in
Cramer's V is one good statistic that tells you how strong the relationship is in one number. It varies from 0 to 1, with 0 being no relationship (this would reflect no shift in percentage as you look across a row) and 1 being a perfect relationship (reflecting a 100 percentage point shift across a row). Micro Case produces other statistics, but I would suggest that you rely on Cramer's V for summarizing the strength of any relationship you find.
To see whether the relationship is statistically significant, we generally use the chi square statistic. It tells us the chance that we would find a relationship this strong in a sample of the size we had when no relationship exists in the population. In social science, the convention is that before we reject the null hypothesis (that no relationship exists, if you remember), this chance must be no more than 0.05, or 5%. This is often called the 5% significance level, and the probability itself is usually simply called "p." So we insist that p be equal to or less than 0.05.
We will learn to hand calculate chi squares for simple tables and how to interpret them. But for your reports, you can use the chi square and significance level that Micro Case produces for you at the touch of a key. For example, if the significance level of the chi square for the table between ethnicity and party id here in Aiken County is p = 0.0023, we would then reject the null hypothesis that no relationship exists because p < 0.05. If we have a really strong relationship with a fairly large sample, often Micro Case will report a p of 0.000. This does not mean that a zero chance exists that no relationship exists in the general population. Rather, it means that the chance of no relationship is extremely low--we would have to go out several decimal points further to find a number.
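To make the hand calculation concrete, here is a minimal sketch of chi square and Cramer's V for a simple 2 x 2 table. The counts are hypothetical. The p-value shortcut using `math.erfc` is valid only for a 2 x 2 table (one degree of freedom); larger tables need a chi square table or a statistics program.

```python
import math

def chi_square(table):
    """Pearson chi square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if no relationship existed.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V: chi square rescaled to run from 0 to 1."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical counts: rows are the two values of the dependent variable,
# columns are the two values of the independent variable.
table = [[60, 30],
         [40, 70]]

chi2, n = chi_square(table)
v = cramers_v(chi2, n, len(table), len(table[0]))

# p-value for 1 degree of freedom only (the 2 x 2 case).
p = math.erfc(math.sqrt(chi2 / 2))

print(f"chi square = {chi2:.3f}, Cramer's V = {v:.3f}, p = {p:.5f}")
print("Reject the null hypothesis" if p <= 0.05 else "Cannot reject the null hypothesis")
```

In this made-up table the percentage in the first row shifts from 60% to 30% across the columns, a 30 point shift, and Cramer's V comes out at about 0.30.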
B. Ordinal
1. Statistics used to describe
a. Central Tendency
To see what the center of these data are like, we can still use a mode or we can look at the median. This is the value of the case in the middle after we have arranged the data from the one with the least amount of whatever we were measuring to the greatest amount. For example, if we have a five point Likert scale from strongly agree to strongly disagree, then we could say that the mode was agree if more people said agree than any other answer. The median might be agree or even another answer, depending on how the answers were distributed.
b. Dispersion or variability
Again, we can use frequency or percentage distributions, but we can also have range, which is the lowest to the highest answer or value.
So in our Likert scale, the range might be from strongly disagree to strongly agree, but it also might be from strongly disagree to agree if no one chose the strongly agree answer. If we were measuring letter grades on a test, the range might be from B+ to D-, if no A's or F's were made.
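A small sketch of mode, median, and range for ordinal data follows. The eleven answers are invented, and the label "neutral" for the middle category is my assumption, since the scale's middle answer is not named above.

```python
import statistics

# Likert answers coded in order from least to most agreement.
# The middle label ("neutral") is an assumption for illustration.
labels = {1: "strongly disagree", 2: "disagree", 3: "neutral",
          4: "agree", 5: "strongly agree"}

# Hypothetical answers from 11 respondents (no one chose 5).
answers = [1, 2, 2, 3, 4, 4, 4, 4, 2, 3, 4]

mode_code = statistics.mode(answers)          # most common answer
median_code = statistics.median_low(answers)  # middle case after sorting
low, high = min(answers), max(answers)        # range: lowest to highest answer

print("Mode:  ", labels[mode_code])
print("Median:", labels[median_code])
print("Range: ", labels[low], "to", labels[high])
```

Here the mode and the median land on different answers, and the range stops short of the top category because no one chose it.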
2. Statistics used to explain relationships
We can do a crosstabulation, though sometimes we need to combine or collapse groups so that we do not have too many rows to interpret or so that we have enough cases in each column to improve statistical significance.
Here is another informal rule that we can also use when we have ordinal data. If we have more than 7 values, like we did in our survey question about family income (where we had 10 values for family income), we can pretend that the measurements are ratio level and use the statistics for ratio levels of measurement (see below).
C. Ratio
1. Statistics used to describe
a. Central Tendency
In addition to mode and median, we can now add the mean (which people typically just call the average, though mode and median are also kinds of averages). The mean is just the arithmetic average.
One thing to be careful about in looking at means is that a few extreme cases at either end can skew the mean up or down. These are often called "outliers." Medians are not affected by extreme cases, so reporting both the median and mean is a good idea in looking at ratio data.
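The effect of an outlier is easy to see with a few made-up numbers. The incomes below are hypothetical and in thousands of dollars.

```python
import statistics

# Hypothetical family incomes in thousands of dollars (illustrative only).
incomes = [28, 32, 35, 40, 45]
incomes_with_outlier = incomes + [400]  # add one extreme case

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print(f"Without outlier: mean = {mean_before:.1f}, median = {median_before:.1f}")
print(f"With outlier:    mean = {mean_after:.1f}, median = {median_after:.1f}")
```

One extreme case pulls the mean far above every other family in the list, while the median barely moves. That gap between the two is a good signal that outliers are present.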
b. Dispersion
Frequency and percentage distributions and range can be used. However, if the measurements are really precise and we do not have a huge number of cases, frequency and percentage distributions often do not tell us much without collapsing the data into groups. Of course this in effect turns the data into ordinal data!
Ratio level measurements allow us to produce two new statistics, variance and standard deviation. We will learn to compute these by hand, though again Micro Case does a great job in producing them at the click of a key. Basically, variance is the average squared distance from the mean and the standard deviation is the square root of the variance. You might think of the standard deviation as the average distance from the mean. It is ok to think of it this way, but that would not be exactly true. Strictly speaking, if we keep the plus and minus signs, the average distance from the mean would be zero, because the values above the mean would cancel out the values below the mean. Squaring the distances before averaging them is what keeps that from happening.
Once you have the standard deviation, you can transform ratio measurements into normal scores (more commonly called standard scores or z-scores). What this does is measure each case by how many standard deviations it is from the mean. So a unit or case that is right at the mean would be scored as a 0. One that is one standard deviation above the mean would be a +1 and one that is a standard deviation below would be a -1.
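Here is the hand calculation written out step by step, using eight made-up scores. The sketch divides by the number of cases, which matches the "average squared distance" wording above; be aware that many programs divide by the number of cases minus one when working with a sample, so their answers can differ slightly.

```python
import math

# Hypothetical scores (illustrative only).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

mean = sum(scores) / n
deviations = [x - mean for x in scores]   # signed distance of each case from the mean

sum_signed = sum(deviations)              # plus and minus distances cancel out
variance = sum(d ** 2 for d in deviations) / n   # average squared distance (divides by N)
std_dev = math.sqrt(variance)             # square root of the variance
mean_abs_dev = sum(abs(d) for d in deviations) / n  # average distance ignoring signs

print(f"mean = {mean}, sum of signed deviations = {sum_signed}")
print(f"variance = {variance}, standard deviation = {std_dev}")
print(f"average absolute distance = {mean_abs_dev}")
```

The signed distances add up to zero, and the standard deviation is close to, but not the same as, the average absolute distance from the mean.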
What is cool about this is that if we normalize several different ratio level measures, we can combine measures that are normalized even if originally they used different units of measurements. That is because the normal scores are now measures in standard deviation units, not in years of education or thousands of dollars like income would be. This allows us to create some pretty sophisticated compound measures when we have ratio measurements.
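As a rough sketch of how such a compound measure might be built, the code below converts two made-up measures into standard deviation units and averages them. The data, the variable names, and the choice to weight the two measures equally are all assumptions for illustration.

```python
import statistics

def z_scores(values):
    """Score each case by how many standard deviations it is from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # standard deviation dividing by N
    return [(v - mean) / sd for v in values]

# Hypothetical data for six people (illustrative only).
education_years = [10, 12, 12, 14, 16, 20]
income_thousands = [18, 25, 30, 42, 55, 70]

z_education = z_scores(education_years)
z_income = z_scores(income_thousands)

# Compound measure: the average of the two standardized scores for each person.
ses_index = [(ze + zi) / 2 for ze, zi in zip(z_education, z_income)]

for years, income, index in zip(education_years, income_thousands, ses_index):
    print(f"{years:2d} years, ${income},000 -> index {index:+.2f}")
```

Because both measures are now in standard deviation units, years and dollars can be added together without one of them swamping the other.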
2. Statistics used to explain relationships
With ratio level measurements, we usually do several related things to examine explanatory relationships. Of course, we can always collapse and do crosstabs, which is not a bad idea. But collapsing data loses information and groups units together that are in fact different.
The standard practice is to start with a scatterplot, in which the independent variable is plotted on the x-axis and the dependent variable is plotted on the y-axis to produce points corresponding to each case or unit. We look to see if the data points form some pattern, typically rising together left to right (a positive relationship) or falling left to right (a negative relationship). Micro Case does a reasonable job at this, though it does not clearly show how many points fall on top of each other.
Two other statistics help us in interpreting any relationship. The correlation coefficient, often simply called "r," tells us two things. First, its sign tells us whether the relationship is positive or negative. Second, the closer it is to a value of either +1 or -1, the stronger the relationship is, in the sense that the data points fall closer to a straight line.
Slightly different guidelines apply here in terms of the adjectives we use to describe the strength of a relationship than we used in describing the strength of relationships for the Cramer's V or the Tau B or C statistics. Ignoring whether the sign is + or -, less than 0.25 is considered extremely weak and hardly worth talking about. Between 0.26 and 0.34 is weak. Between 0.35 and 0.39 is moderate. And 0.40 and larger is considered strong. I would add very strong at the 0.5 level and extremely strong at 0.6 and above.
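The sketch below computes r by hand for five made-up cases and attaches the adjectives from the guidelines above. The guidelines leave small gaps between the ranges (for example, between 0.25 and 0.26), so exactly where each cutoff falls in the code is my assumption.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient computed from deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength_label(r):
    """Adjective for the size of r, ignoring its sign.

    Cutoffs follow the chapter's guidelines; the handling of values
    that fall between the stated ranges is an assumption.
    """
    size = abs(r)
    if size >= 0.60:
        return "extremely strong"
    if size >= 0.50:
        return "very strong"
    if size >= 0.40:
        return "strong"
    if size >= 0.35:
        return "moderate"
    if size >= 0.25:
        return "weak"
    return "extremely weak"

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f}: a {direction}, {strength_label(r)} relationship")
```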
The straight line drawn through the points in a scatterplot, called the regression line, gives us an estimate of what the relationship would look like if every point fell on the line and it were a perfect relationship. You might think of the regression line as a line that estimates the dependent variable for different values of the independent variable. The regression line takes the form of the equation for a line: Y = A + bX, where the b coefficient tells you how much the dependent variable changes for each unit change in the independent variable. The A in the equation is the constant (where the line crosses the Y-axis, that is, the value of Y when X is 0).
For example, suppose we have a regression that explains income (measured in units of $1,000) in terms of years of education. Suppose the regression line is as follows:
Income = 3.5 + 0.65 (Education)
This means that if education were 0, we would still expect an income of about 3.5 x $1,000 or $3,500. And for each extra year of education, we would expect an additional 0.65 x $1,000 or $650 in income.
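The income example can be turned into a tiny function that makes predictions from the regression line. The function name and the years of education plugged in are just illustrative choices; the constant and the b coefficient are the ones from the example.

```python
def predicted_income(education_years, a=3.5, b=0.65):
    """Predicted income in thousands of dollars from Y = A + bX."""
    return a + b * education_years

# Predictions for a few illustrative values of education.
for years in (0, 12, 16):
    dollars = predicted_income(years) * 1000
    print(f"{years:2d} years of education -> about ${dollars:,.0f}")

# Each extra year adds b x $1,000, no matter where we start.
extra_year = (predicted_income(13) - predicted_income(12)) * 1000
print(f"One more year of education adds about ${extra_year:,.0f}")
```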
Let’s look at some real data. Here
is the scatter plot and regression line for the American Government general
knowledge scores as students enter the course using GPA as the independent
variable. We might theorize that successful hard working students are more
likely to have learned more from other courses and also probably read more as
well than less successful students and therefore should do better on the
general knowledge pretest.

The scatterplot and regression line support our theory. To put it in the scientific language we have been learning to use, we reject the null hypothesis that no relationship exists between GPA and American Government general knowledge pretest score. Each additional point in GPA predicts about 3.2 additional correct answers. So someone with a 3.9 GPA would be expected to have about 3.2 more correct answers than someone with a GPA of 2.9.
Another statistic, variance explained, or r² (which is just the correlation coefficient squared), tells us how well the regression line describes the data. Variance explained is expressed in percentages or proportions, so an r² of 0.55 means that the independent variable explained 55% of the change or variance in the dependent variable.
Here are a couple of examples. To take the extreme case, suppose we had a regression line that perfectly fit the data in a positive relationship. The correlation would then be +1.0 and the r² would be (1.0)² = 1 or 100%. This means that knowing the independent variable allows us to explain ALL the change or all the variance in the dependent variable. A correlation of +0.8, an extremely strong relationship, means that the independent variable explains (0.8)² or 0.64 or 64% of the variance in the dependent variable.
The error that is left over after we
explain the variance might be called “unexplained variance.” We get
a pictorial representation of this in MicroCase by
looking at the residuals. Each residual
is the vertical distance from the regression line to each data point. So if all
points fell on the line, a perfect relationship, the residuals would be zero. The
longer the residual lines, the more the unexplained variance.
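To see how residuals connect to explained and unexplained variance, here is a minimal sketch that fits a regression line by least squares to five made-up points and then measures what is left over. The data are hypothetical, and the least squares formulas are the standard textbook ones rather than anything specific to MicroCase.

```python
def fit_line(x, y):
    """Least squares estimates of A (constant) and b (slope) in Y = A + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)

# Residual: vertical distance from each data point to the regression line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

mean_y = sum(y) / len(y)
ss_residual = sum(e ** 2 for e in residuals)     # unexplained variation
ss_total = sum((yi - mean_y) ** 2 for yi in y)   # total variation in Y
r_squared = 1 - ss_residual / ss_total           # share of variance explained

print(f"Y = {a:.2f} + {b:.2f} X")
print("Residuals:", [round(e, 2) for e in residuals])
print(f"Explained: {r_squared:.0%}, unexplained: {1 - r_squared:.0%}")
```

If every point sat exactly on the line, every residual would be zero and the explained share would be 100%.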
Here are the residuals for the regression line
that explains family income by years of education for the 2006 Aiken County Exit
Poll.

You will note that this was a highly significant relationship (p = 0.000) that was strong (r = 0.408). The double stars after the r (**) are a standard way of saying that the relationship is significant at the 0.01 or 1% level. One star (*) is the standard way of saying that it is significant at the 0.05 or 5% level. But in this case, the stars do not add anything we do not already see, because MicroCase tells us a more precise probability (p = 0.000), so we already know that it is significant at better than the 1% level! If we computed the r², it would be (0.408)² = 0.166, so 16.6% of the variance of family income is explained by years of education.
What is really cool is that we can also do multiple regression. That allows us to try to explain a dependent variable with several independent variables at once. The interpretation is about the same. Each independent variable has its own coefficient, sometimes called the b-coefficient. Moreover, if we look at the standardized coefficients, usually called the "betas," we can compare the influence of the independent variables. Comparing the unstandardized coefficients does not tell us much because the units of measurement greatly influence their values. However, we can't easily look at any regression line in multiple regression because each independent variable adds another dimension to any picture. Micro Case does a great job with both simple regression and multiple regression.
Whether we do simple regression with one independent variable or multiple regression, we use a t test to evaluate the statistical significance of each coefficient. The significance of the regression as a whole is tested with an F test. As we did with chi square, these tests produce a p = 0.___, which tells us the probability that we could have this relationship in a sample when no relationship exists in the population between this independent variable and this dependent variable. So interpret the p's the same way as the p in the chi square test. A p less than or equal to 0.05 is what we need to reject the null hypothesis that an independent variable in the regression had no impact on the dependent variable.
Let’s do an example. Suppose we wanted
to try and explain political knowledge using several independent variables, age
(thinking that older people have had more time to learn things), years of
education (thinking that time spent in school should make one more politically
aware), and exposure to several news sources, days in the previous week that
one has looked at news in newspapers, on the Web, and on television. In the
2006 Aiken County Exit Poll we asked four political knowledge questions (producing
a scale from 0 to 4) and we measured age, years of education, and exposure in
the previous week to these news outlets. Here is what the MicroCase
regression program produced.

Analysis of Variance
Dependent Variable: PolKnow
N: 511
Missing: 122
Multiple R-Square = 0.125
Y-Intercept = 0.293
Standard error of the estimate = 1.012
LISTWISE deletion (1-tailed test) Significance Levels: **=.01, *=.05
Source        Sum of Squares   DF    Mean Square   F        Prob.
REGRESSION    74.187           5     14.837        14.490   0.000
RESIDUAL      517.354          505   1.024
TOTAL         591.540          510
              Unstand.b   Stand.Beta   Std.Err.b   t
EducYrs       0.092       0.256        0.015       5.924 **
Age           0.009       0.122        0.003       2.659 **
PaperNws      0.039       0.097        0.018       2.169 *
TVNws         0.022       0.044        0.023       0.982
WebNws        0.045       0.122        0.016       2.800 **
A lot of information is in the two screens
I pasted in above, but if you know what to look for, it is fairly easy to interpret.
The pictorial path diagram at the top tells you a great deal. It shows the
independent variables on the left, with each having a path to the dependent
variable, political knowledge. First it shows the total variance explained by
all five independent variables (0.125 or 12.5%). That is not a whole lot, but the
model using all the variables was statistically significant. You see that from
the ** next to the r², which tells us that it is significant at the
1% level. We can also see a more precise measure of the significance in the “Analysis
of Variance” table at the bottom in the F statistic (which is kind of
like the chi square, but for different kinds of data), which is 14.49 and has a
probability of 0.000. So the model is indeed very significant.
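You can check how the pieces of the Analysis of Variance table fit together with a little arithmetic. The sketch below uses the sums of squares from the output above and reads the 5 and the 505 on the REGRESSION and RESIDUAL lines as degrees of freedom, which is my reading of the printout. Small differences from the printed values come from rounding.

```python
# Values copied from the MicroCase output above.
ss_regression = 74.187
ss_residual = 517.354
df_regression = 5     # assumed: one degree of freedom per independent variable
df_residual = 505     # assumed: N - number of independent variables - 1

ss_total = ss_regression + ss_residual

# Variance explained: share of the total variation accounted for by the regression.
r_squared = ss_regression / ss_total

# Mean square = sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F = explained variation per degree of freedom over unexplained variation per degree of freedom.
f_statistic = ms_regression / ms_residual

print(f"Total sum of squares = {ss_total:.3f}")
print(f"R-square = {r_squared:.3f}")
print(f"Mean squares = {ms_regression:.3f} and {ms_residual:.3f}")
print(f"F = {f_statistic:.2f}")
```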
However, that does not mean that all the independent variables had equal contributions in explaining change in the dependent variable. We can see the relative importance of each independent variable by looking at the Betas and their significance in both the diagram and the Analysis of Variance table below it. These are really standardized regression coefficients. Standardizing them, if you remember, changes all the different units of measurement (years, and days of the week) into standardized units (what we called standard deviation units) so that we can compare them on an equal footing, so to speak. In short, the bigger the Beta, the more important the independent variable. Betas that do not have at least one * are not significant at the 0.05 level and do not have a significant impact. We can also see the significance in the “t statistic” for each variable in the bottom of the Analysis of Variance table. From this you can see that education years is roughly twice as important (Beta of 0.256 that is significant at the 1% level) as age and web news, which are equal in importance (Betas of 0.122 that are significant at the 1% level). The other significant independent variable is news from newspapers, which is reasonably close in importance to web news and age (a Beta of 0.097 and significant at the 5% level). Interestingly, watching news on television had no significant impact. The Beta was the smallest and it was not significant at even the 5% level.
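The ranking just described can be reproduced with a short sketch. The Betas and t statistics are copied from the output above. The two cutoff values for t are the usual one-tailed critical values for a large sample (1.645 for the 5% level and 2.326 for the 1% level); I am assuming these as an approximation, since the output does not print them.

```python
# (variable, standardized Beta, t statistic) copied from the MicroCase output.
coefficients = [
    ("EducYrs", 0.256, 5.924),
    ("Age", 0.122, 2.659),
    ("PaperNws", 0.097, 2.169),
    ("TVNws", 0.044, 0.982),
    ("WebNws", 0.122, 2.800),
]

# Assumed one-tailed critical values for a large sample (normal approximation).
T_CRITICAL_05 = 1.645
T_CRITICAL_01 = 2.326

def stars(t):
    """Significance stars in the same style as the output: ** = .01, * = .05."""
    if abs(t) >= T_CRITICAL_01:
        return "**"
    if abs(t) >= T_CRITICAL_05:
        return "*"
    return ""

# Rank the independent variables from largest to smallest Beta.
ranked = sorted(coefficients, key=lambda row: abs(row[1]), reverse=True)

for name, beta, t in ranked:
    print(f"{name:9s} Beta = {beta:.3f}  t = {t:.3f} {stars(t)}")
```

The stars the sketch assigns match the ones in the printout, with television news the only variable that earns none.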
D. Special cases: when we have different levels of measurement for the independent and dependent variables
As is often the situation, suppose we have different levels of measurement for the independent and dependent variables. Suppose for example, that the dependent variable is ratio (say family income to the nearest thousand dollars) and the independent is either nominal or ordinal (say ethnicity or broad education groups).
To examine explanatory relationships we have two basic choices. We can do a lot of collapsing and then do a crosstabulation. Alternatively, we can do something called analysis of variance, or ANOVA for short. What this does is look at the distribution of the dependent variable for each value of the independent variable. It compares the means of those distributions and calculates the chance that differences this large could show up in a sample when no difference exists in the population. If that chance is small enough, we conclude that different values of the independent variable go with different values of the dependent variable.
ANOVA produces a really cool graph that is
not too hard to understand. MicroCase does a pretty
good job at showing this graph. Here is
the graphical representation for the ANOVA procedure using the American
Government general knowledge pretest scores for USCA students as the dependent
variable and gender as the independent variable. Below the graph is the table
that compares the means, also produced by Micro Case. One might theorize that
women should have lower scores than men here in the South where women are
socialized to think that politics is not for women. After all,

Means, Standard Deviations and Number of Cases of Dependent Var: t1
by Categories of Independent Var: gender
Difference of means across groups is statistically significant (Prob. = 0.000)
          N      Mean     Std.Dev.
male      554    16.883   10.131
female    1433   12.064   7.521
You can see in the graph that females do have a lower mean score. The actual difference is shown in the table, where females scored nearly 5 questions lower (12.1 versus 16.9). But you can also see from the rectangular boxes around the means (which extend from one standard deviation below to one standard deviation above the mean and so cover about 68% of all cases; imagine a normal curve rising out of the graph as a third dimension) that the distributions for men and women overlap quite a bit. The critical question is whether that difference, given the overlap, is statistically significant. In fact it is, looking at the probability in the table just below the graph (p = 0.000). That p was produced from the F statistic, which you can see in a third table produced by Micro Case in the ANOVA procedure. I did not show it here.
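Even without that third table, the numbers in the table of means are enough to sketch roughly how the F statistic is built. The code below is a reconstruction from the rounded summary figures, so it will not exactly match what MicroCase printed, and it assumes the standard deviations were computed by dividing by the number of cases minus one.

```python
# (N, mean, standard deviation) for each group, copied from the table above.
groups = {
    "male": (554, 16.883, 10.131),
    "female": (1433, 12.064, 7.521),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Variation between groups: how far each group mean is from the overall mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())

# Variation within groups (assumes Std.Dev. used N - 1 in the denominator).
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_statistic = (ss_between / df_between) / (ss_within / df_within)
variance_explained = ss_between / (ss_between + ss_within)

print(f"Overall mean = {grand_mean:.2f}")
print(f"F = {f_statistic:.1f} with {df_between} and {df_within} degrees of freedom")
print(f"Share of variance explained by gender = {variance_explained:.1%}")
```

The F comes out very large because the sample is large, which is why the difference is highly significant even though gender explains only a small share of the variance and the two distributions overlap a great deal.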
One last little twist! What is really cool in ANOVA is that you can pretend that either variable is the independent variable. As long as you know which one is really your independent variable in your own theory, you can do the ANOVA procedure. You see, the stat program does not know what your theory is! Just like regression, you get an r² and an F test to allow you to see how much of the variation was explained and whether the relationship was statistically significant.
That covers pretty much all the statistics we will be using in this course. You have already been using a lot of them. Hopefully, some more practice in the statistics for higher levels of measurement will help you get comfortable in using things like regression and ANOVA.
III. Summary
Here is a summary table that will help you figure out what kinds of statistics to use along with a few more comments focusing in particular on crosstabulations and how to read them.
Suppose you have a hypothesis involving two variables. This is called a bivariate relationship. Now you need to use statistics to see if the data you have gathered support that hypothesis (or allow us to reject the null!). How you do this depends on the levels of measurement in both the independent and dependent variables.
If both independent and dependent variables are nominal or ordinal with just a few values, you should know that a crosstabulation is appropriate. The other situations we have discussed are in the table below. But crosstabulation can almost always be used.
The formula for reading any row in a crosstabulation is as follows. This wording works for almost any crosstabulation:
"As the independent variable changes from ___ to ___ (reading left to right), the percentage of those who are _______ (whatever that row is in the dependent variable) shifts (increases, decreases or however the percentages change, maybe even up and down) from ___ to ___."
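The wording formula above can be sketched as a small program that turns counts into column percentages and then fills in the blanks for one row. The table, the variable names, and the helper functions are all made up for illustration, and the sketch only compares the first and last columns, so it would miss an up-and-down pattern in the middle.

```python
def column_percentages(table):
    """Convert counts to column percentages (each column adds to 100)."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
            for row in table]

def read_row(iv_name, iv_values, row_label, row_percentages):
    """Fill in the row-reading formula using the first and last columns."""
    first, last = row_percentages[0], row_percentages[-1]
    if last > first:
        direction = "increases"
    elif last < first:
        direction = "decreases"
    else:
        direction = "stays the same"
    return (f"As {iv_name} changes from {iv_values[0]} to {iv_values[-1]}, "
            f"the percentage of those who {row_label} {direction} "
            f"from {first:.0f}% to {last:.0f}%.")

# Hypothetical counts: columns are education (low, medium, high),
# rows are the dependent variable (voted, did not vote).
counts = [[30, 50, 80],
          [70, 50, 20]]

percentages = column_percentages(counts)
sentence = read_row("education", ["low", "medium", "high"], "voted", percentages[0])
print(sentence)
```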
Dependent Variable | Independent Variable: Nominal | Independent Variable: Ordinal | Independent Variable: Ratio |
Nominal | X-tab | X-tab | Collapse I.V. & X-tab or ANOVA (if you reverse the variables) |
Ordinal | X-tab | X-tab | Collapse I.V. & X-tab or ANOVA (if you reverse the variables) |
Ratio | X-tab or ANOVA | X-tab or ANOVA | Collapse both variables & X-tab or scatterplot/regression |
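If it helps, the same table can be written as a simple lookup. This is just the table above restated in code; the function name and the lowercase spellings of the levels are my own choices.

```python
# Keys are (level of dependent variable, level of independent variable).
COLLAPSE_IV = "Collapse I.V. & X-tab or ANOVA (if you reverse the variables)"

RECOMMENDED = {
    ("nominal", "nominal"): "X-tab",
    ("nominal", "ordinal"): "X-tab",
    ("nominal", "ratio"): COLLAPSE_IV,
    ("ordinal", "nominal"): "X-tab",
    ("ordinal", "ordinal"): "X-tab",
    ("ordinal", "ratio"): COLLAPSE_IV,
    ("ratio", "nominal"): "X-tab or ANOVA",
    ("ratio", "ordinal"): "X-tab or ANOVA",
    ("ratio", "ratio"): "Collapse both variables & X-tab or scatterplot/regression",
}

def recommend(dependent_level, independent_level):
    """Look up the suggested procedure for a bivariate relationship."""
    key = (dependent_level.lower(), independent_level.lower())
    return RECOMMENDED[key]

# Example: family income (ratio) explained by ethnicity (nominal).
print(recommend("ratio", "nominal"))
```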
The crosstabulation
procedure (X-tab) works best when you have no more than three rows in the
table. That means no more than three values for the dependent variable. Of
course, you can always collapse rows to get this, or you can compare measures
of central tendency (medians or means).
It really does not matter how many columns you have, but you had better have few enough so that you have a sufficient number of cases in each cell. You can collapse columns to get this.
A tradeoff exists between what is desirable theoretically and what is practical. Theory might tell us that many values for the independent variable are needed for a complete understanding of how the dependent variable reacts to different values of the independent variable. But we may only have enough cases in the sample to have only three columns.
Cells with just a few cases in them can radically alter percentage shifts if one or two cases are in one cell rather than in another on that row. So pay attention when cell frequencies get low. You may need to do some collapsing. As you will see later, tests of statistical significance help us to keep a handle on this. If the cell frequencies get too small, the shift will not come out to be statistically significant.
Exercises Using our data and Micro Case
1. Using 2008 survey data, find a variable with high variation and one that has low variation for a nominal or an ordinal variable.
2. Look at ethnicity, a nominal level variable, and describe it in terms of central tendency and variation. Do the same for religious fundamentalism.
3. Look at age and years of education, both ratio level variables, and describe them in terms of central tendency and variation.
4. Do appropriate collapsing to produce a usable frequency distribution for age and then for years of education.
5. Look at the relationship between ethnicity and religious fundamentalism in a crosstabulation. How strong is any relationship and is it statistically significant?
6. a. Look at the relationship between years of education and family income using a crosstabulation. What do you have to do to make this crosstabulation readable? Do it! What is the strength of this relationship? How much of the variance in family income is explained by years of education? Is the relationship significant?
b. Look at the relationship between years of education and family income using a scatterplot (here we are pretending that family income is ratio measurement). What is the strength of this relationship? How much of the variance in family income is explained by years of education? Is the relationship significant?
7. Using ANOVA, examine the relationship between gender and years of education. How much of the variance in years of education is explained by gender? Is the relationship statistically significant?
8. Find two variables that you have some reason to think might be related, with at least one at the ratio level of measurement. Produce the appropriate statistics to evaluate any relationship in terms of strength and statistical significance. Be prepared to explain it to the class.
9. Now that you have had some practice using different statistics, let's put it all together to give you some practice to make sure you know when to use exactly what. I will give you some hypotheses to test. I want you to figure out how to set up the appropriate table/chart/graph to test the bivariate relationship, interpret it as appropriate, produce and interpret the appropriate statistic(s) for how strong it is, and produce and interpret the appropriate statistic(s) for the significance for any relationships you find.
Copyright, Robert E. Botsch, 2010
last updated on 10/22/2010