Chapter 11. Statistics

 

I. Introduction. Two types of statistics

    Two types of statistics are relevant to this course: descriptive and inferential. Let's look at each in turn.

    A. Descriptive Statistics

    Descriptive statistics are designed to summarize some quality or aspect of data. So we might use some measure of central tendency (means, medians or modes) to tell us what the middle or center of the data looks like. Alternatively, we might be interested in a quality that focuses on relationships between variables, so we might use a correlation coefficient or a Tau B or Tau C to summarize a relationship. A single number summarizes things quickly, but we might also consider graphs and charts to be statistics that are designed to summarize things about relationships. So crosstabulations are also a kind of descriptive statistic that summarizes a potential relationship. The same would be true of a scatterplot, or even a mapping feature like the one we have in MicroCase.

    B. Inferential Statistics

    Inferential statistics usually look almost the same as descriptive statistics, but they add something extra. Rather than being precise, they include an error term because they are inferring something from a sample to a larger population. For example, the percentage of all students who are seniors at USCA would be a descriptive statistic, but the percentage of seniors in a sample of USCA students, used to estimate the percentage for all students, would be an inferential statistic. And the inferential statistic should include some kind of expected sampling error. So it might be that 19% plus or minus 4% is the inferential statistic for the percentage of seniors at USCA.

    Put another way, as soon as we make inferences, we must introduce the laws of probability. You have already been doing that in working with sampling error. And you know that the sampling error has a probability of being correct, a 95% probability with the formulas we have been using.
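
    To make the "plus or minus" idea concrete, here is a minimal sketch in Python. It assumes the common textbook formula for the 95% sampling error of a proportion, which may not be exactly the formula we have been using in class, and the sample size of 370 is made up for illustration.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% sampling error for a sample proportion p based on n cases."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample: 370 students, 19% of whom are seniors.
sample_share = 0.19
sample_size = 370
moe = margin_of_error(sample_share, sample_size)
print(f"{sample_share:.0%} plus or minus {moe:.1%}")
```

    With these made-up numbers the sampling error comes out to about 4 percentage points. A bigger sample would shrink it.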

II. Kinds of Statistics to Use -- depends on levels of measurement and purpose

    We could organize this discussion in several ways. But because what we can do is so dependent on the level of measurement of whatever variables are involved, we will organize this discussion around the levels of measurement. We will start with what we can do with nominal or categorical levels of measurement, then move to ordinal, and then finally to ratio. Dichotomous measurements are special cases that we can actually treat as nominal or ordinal or even ratio. We will skip interval, because we never really have this one in social science, but if we did, we would treat it the same as ratio for the statistics discussed here.

    One very useful rule to remember is that any statistic for a lower level of measurement can also be used for higher levels of measurement. So if something works for nominal level measurements, it can also be used for interval and ratio measurements.

    Then for each level of measurement, we will briefly go over the appropriate statistics for two of the purposes of science, to describe and to explain. The third purpose, to predict, is really an extension of explanation, so the same statistics would be involved. If, for example, we find that we can explain why people voted in a regression that used several independent variables, say education, political efficacy, and strength of party identification, then we could use those same variables to predict who will vote in some upcoming election. As another example, suppose we can explain, again using a multiple regression, which party won a presidential election using change in real per capita income, inflation, unemployment, and which party held the White House at the time of the election and for how long. Then we could use that same equation to predict who will win some upcoming presidential election. This, by the way, is called the "Fair model," named after Ray Fair, who developed it.

    A. Nominal or categorical

        1. Statistics used to describe

    Suppose we look at ethnicity and have four groups: whites, blacks, Hispanics, and other. We can only do a limited number of statistical things to describe these data. "Univariate" statistics in Micro Case is where you can find all these statistics.

            a. Central Tendency

    To see what the center of these data is like, about all we can do is look at the mode, that is, the category that occurs most often. For example, suppose whites are the largest in number, with say 70% being white (which is about right for South Carolina in the early 2000s).

            b. Dispersion or variability

    How varied our data are can be seen in several ways. We can do a frequency distribution or a percentage distribution. These would tell us the number or percentage in each category, in either graph form or table form.
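
    Here is a small sketch of what a frequency distribution, a percentage distribution, and the mode look like for a nominal variable. The 100 respondents are made up for illustration, and the variable names are just ones I chose.

```python
from collections import Counter

# Made-up sample of 100 respondents, for illustration only.
ethnicity = ["white"] * 70 + ["black"] * 25 + ["Hispanic"] * 3 + ["other"] * 2

counts = Counter(ethnicity)                      # frequency distribution
mode_category, mode_count = counts.most_common(1)[0]
percentages = {group: 100 * n / len(ethnicity) for group, n in counts.items()}

print("Mode:", mode_category)
for group, pct in percentages.items():
    print(f"{group:10s} {counts[group]:4d} {pct:6.1f}%")
```

    Notice how little variation the small categories give us to work with. That matters as soon as we move beyond description.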

    In doing research that goes beyond just description, we need to consider the variation of our variables. If we do not have much variation, then seeking explanatory relationships is not possible.

    For example, if we only have 2 to 3 percent Hispanic in our sample, even if it is a pretty large sample, we can't use Hispanic as a category simply because we do not have enough cases to look at percentage shifts with any statistical confidence. Any crosstabulation with a Hispanic row or column could have large percentage shifts with only a few people producing that shift.

        2. Statistics used to explain relationships

    The best way to test explanatory relationships using nominal level variables is to produce a crosstabulation.  Micro Case does this quite well, as you already know. In producing a crosstabulation, you should eliminate any rows or columns that have no bearing on the relationship you are testing and also remove those that have too few cases in them to tell you anything.

    Suppose you are looking at explaining whether people in Aiken County are Democrats or Republicans based on ethnic identity. We already know that too few Hispanics will be in the table to do much with, and the same will be true for other groups as well. And because we are not really interested in independents, we can eliminate that row. So that would give us a 2 x 2 table with whites and blacks on top and Democrats and Republicans on the side to produce the rows. You already know how to read such a table.

    Cramer's V is one good statistic that tells you how strong the relationship is in one number. It varies from 0 to 1, with 0 being no relationship (this would reflect no shift in percentage as you look across a row) and 1 being a perfect relationship (reflecting a 100 percentage point shift across a row). Micro Case produces other statistics, but I would suggest that you rely on Cramer's V for summarizing the strength of any relationship you find.

    To see whether the relationship is statistically significant, we generally use the chi square statistic. It tells us the chance that we would find such a relationship in a sample of the size we had when no relationship exists in the population. In social science, the convention is that before we reject the null hypothesis (that no relationship exists, if you remember), the chance that we could find this relationship when no relationship exists must be no more than 0.05, or 5%. This is often called the 5% significance level, and the probability itself is usually simply called "p." So we insist that p be equal to or less than 0.05.

    We will learn to hand calculate chi squares for simple tables and how to interpret them. But for your reports, you can use the chi square and significance level that Micro Case produces for you at the touch of a key. For example, if the significance level of the chi square for the table between ethnicity and party id here in Aiken County is p = 0.0023, we would then reject the null hypothesis that no relationship exists because p < 0.05. If we have a really strong relationship with a fairly large sample, often Micro Case will report a p of 0.000. This does not mean that there is zero chance of finding such a relationship in a sample when none exists in the general population. Rather, it means that the chance is extremely low: we would have to go out several more decimal places to find a nonzero digit.
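
    As a preview of the hand calculation, here is a minimal sketch of chi square and Cramer's V for a 2 x 2 table. The counts are made up, the function names are my own, and the p-value shortcut shown works only for a 2 x 2 table (one degree of freedom). No continuity correction is applied, so a stat program that applies one could report slightly different numbers.

```python
import math

def chi_square(table):
    """Chi square and total N for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if no relationship existed.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V: 0 means no relationship, 1 means a perfect relationship."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

def p_value_2x2(chi2):
    """Significance level for a 2 x 2 table only (1 degree of freedom)."""
    return math.erfc(math.sqrt(chi2 / 2))

# Made-up counts. Columns: white, black. Rows: Democrat, Republican.
table = [[40, 40],
         [60, 10]]
chi2, n = chi_square(table)
v = cramers_v(chi2, n, 2, 2)
p = p_value_2x2(chi2)
print(f"chi square = {chi2:.2f}, Cramer's V = {v:.3f}, p = {p:.6f}")
```

    Here p is far below 0.05, so with these made-up counts we would reject the null hypothesis.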

    B. Ordinal

        1. Statistics used to describe

            a. Central Tendency

    To see what the center of these data is like, we can still use a mode or we can look at the median. This is the value of the case in the middle after we have arranged the data from the one with the least amount of whatever we were measuring to the greatest amount. For example, if we have a five point Likert scale from strongly agree to strongly disagree, then we could say that the mode was agree if more people said agree than any other answer. The median might be agree or even another answer, depending on how the answers were distributed.

            b. Dispersion or variability

    Again, we can use frequency or percentage distributions, but we can also have range, which is the lowest to the highest answer or value.

    So in our Likert scale, the range might be from strongly disagree to strongly agree, but it also might be from strongly disagree to agree if no one chose the strongly agree answer. If we were measuring letter grades on a test, the range might be from B+ to D-, if no A's or F's were made.
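
    Here is a small sketch of finding the mode, median, and range for an ordinal variable. The 17 answers are made up, and I am assuming the middle category of the five point scale is labeled "neutral." The sketch takes the middle case directly, which is only clean when the number of cases is odd.

```python
from collections import Counter

# Answer categories in order from least to most agreement.
SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Made-up answers from 17 respondents; nobody chose "strongly agree".
answers = (["strongly disagree"] * 2 + ["disagree"] * 3 +
           ["neutral"] * 4 + ["agree"] * 8)

ranked = sorted(answers, key=SCALE.index)       # arrange from least to most
median_answer = ranked[len(ranked) // 2]        # the case in the middle
mode_answer = Counter(answers).most_common(1)[0][0]
answer_range = (ranked[0], ranked[-1])          # lowest and highest answers given

print("Mode:", mode_answer)
print("Median:", median_answer)
print("Range:", answer_range[0], "to", answer_range[1])
```

    In this made-up example the mode is agree but the median is neutral, and the range stops at agree because no one chose strongly agree.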

        2. Statistics used to explain relationships

    We can do a crosstabulation, though sometimes we need to combine or collapse groups so that we do not have too many rows to interpret or so that we have enough cases in each column to improve statistical significance.

    Here is another informal rule that we can also use when we have ordinal data. If we have more than 7 values, like we did in our survey question about family income (where we had 10 values for family income), we can pretend that the measurements are ratio level and use the statistics for ratio levels of measurement (see below).  

    C. Ratio

        1. Statistics used to describe

            a. Central Tendency

    In addition to mode and median, we can now add the mean (which people typically just call the average, though mode and median are also kinds of averages). The mean is just the arithmetic average.

    One thing to be careful about in looking at means is that a few extreme cases at either end can skew the mean up or down. These are often called "outliers." Medians are not affected by extreme cases, so reporting both the median and mean is a good idea in looking at ratio data.
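
    A quick sketch shows how one outlier moves the mean but barely touches the median. The incomes are made up and are in thousands of dollars.

```python
import statistics

incomes = [20, 25, 30, 35, 40]        # thousands of dollars, made up
with_outlier = incomes + [500]        # add one extreme case

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Without outlier: mean {mean_before:.1f}, median {median_before:.1f}")
print(f"With outlier:    mean {mean_after:.1f}, median {median_after:.1f}")
```

    The mean more than triples, while the median moves only from 30 to 32.5.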

            b. Dispersion

    Frequency and percentage distributions and range can be used. However, if the measurements are really precise and we do not have a huge number of cases, frequency and percentage distributions often do not tell us much without collapsing the data into groups. Of course this in effect turns the data into ordinal data!

    Ratio level measurements allow us to produce two new statistics, variance and standard deviation. We will learn to compute these by hand, though again Micro Case does a great job in producing them at the click of a key. Basically, variance is the average squared distance from the mean and the standard deviation is the square root of the variance. You might think of the standard deviation as the average distance from the mean. It is ok to think of it this way, but that would not be exactly true, because strictly speaking the average distance from the mean would be zero, because the values over the mean would cancel out values below the mean.
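
    Here is a minimal sketch of the hand calculation, using made-up scores. It follows the definition given above (the average squared distance from the mean, dividing by the number of cases). Many stat programs divide by the number of cases minus one when working with a sample, so their results can differ slightly.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up values

mean = statistics.mean(scores)
deviations = [x - mean for x in scores]

# The plain deviations cancel out, which is why we square them first.
sum_of_deviations = sum(deviations)

variance = sum(d ** 2 for d in deviations) / len(scores)
std_dev = variance ** 0.5

print("Mean:", mean)
print("Sum of deviations:", sum_of_deviations)
print("Variance:", variance)
print("Standard deviation:", std_dev)
```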

    Once you have the standard deviation, you can transform ratio measurements into normal scores. What this does is measure each case by how many standard deviations it is from the mean. So a unit or case that is right at the mean would be scored as a 0. One that is one standard deviation above the mean would be a +1 and one that is a standard deviation below would be a -1.

    What is cool about this is that if we normalize several different ratio level measures, we can combine measures that are normalized even if originally they used different units of measurements. That is because the normal scores are now measures in standard deviation units, not in years of education or thousands of dollars like income would be. This allows us to create some pretty sophisticated compound measures when we have ratio measurements.  
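
    A short sketch of the idea, with made-up data for five people. The name status_index is just a hypothetical label for the combined measure, and simply adding the two normal scores is one of several reasonable ways to combine them.

```python
import statistics

def z_scores(values):
    """Score each case by how many standard deviations it is from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

education_years = [10, 12, 14, 16, 18]       # made up
income_thousands = [20, 30, 40, 50, 60]      # made up

z_education = z_scores(education_years)
z_income = z_scores(income_thousands)

# Both are now in standard deviation units, so we can add them.
status_index = [e + i for e, i in zip(z_education, z_income)]

for row in zip(education_years, income_thousands, status_index):
    print(row)
```

    The person right at the mean on both measures gets a combined score of 0, even though one measure was in years and the other in thousands of dollars.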

        2. Statistics used to explain relationships

    With ratio level measurements, we usually do several related things to examine explanatory relationships. Of course, we can always collapse and do crosstabs, which is not a bad idea. But collapsing data loses information and groups units together that are in fact different.

    The standard practice is to start with a scatterplot, in which the independent variable is plotted on the x-axis and the dependent variable is plotted on the y-axis to produce points corresponding to each case or unit. We look to see if the data points form some pattern, typically rising together left to right (a positive relationship) or falling left to right (a negative relationship). Micro Case does a reasonable job at this, though it does not clearly show how many points fall on top of each other.

    Two other statistics help us in interpreting any relationship. The correlation coefficient, often simply called "r," tells two things. First, its sign tells us if it is a positive or negative relationship. Second, the closer it is to a value of either +1 or -1, the stronger the relationship is, in the sense that the data points cluster more tightly around a straight line.

     Slightly different guidelines apply here in terms of the adjectives we use to describe the strength of a relationship than we used in describing the strength of relationships for the Cramer's V or the Tau B or C statistics. Ignoring whether the sign is + or -, less than 0.25 is considered extremely weak and hardly worth talking about. Between 0.26 and 0.34 is weak. Between 0.35 and 0.39 is moderate. And 0.40 and larger is considered strong. I would add very strong at the 0.5 level and extremely strong at 0.6 and above.
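
    Here is a sketch of how r is computed and how the guidelines above translate into labels. The data points are made up, the function names are my own, and I have had to decide where values that fall between the stated ranges (such as 0.255) belong.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient for paired lists x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength_label(r):
    """Adjective for the strength of r, ignoring its sign."""
    size = abs(r)
    if size < 0.25:
        return "extremely weak"
    if size < 0.35:
        return "weak"
    if size < 0.40:
        return "moderate"
    if size < 0.50:
        return "strong"
    if size < 0.60:
        return "very strong"
    return "extremely strong"

x = [1, 2, 3, 4, 5]     # made-up independent variable
y = [2, 4, 5, 4, 5]     # made-up dependent variable
r = pearson_r(x, y)
print(f"r = {r:.3f} ({strength_label(r)})")
```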

    The straight line that best fits the points in a scatterplot, called the regression line, gives us an estimate of what the relationship would look like if every point did fall on the line and it were a perfect relationship. You might think of the regression line as a line that makes an estimate of the dependent variable for different values of the independent variable. The regression line is in the form of the equation for a line: Y = A + bX, where the b coefficient tells you how much the dependent variable changes for each unit change in the independent variable. The A in the equation is the constant (or where the line would cross the Y-axis when the value of X is 0).

     For example, suppose we have a regression that explains income (measured to the nearest $1,000) in terms of years of education. Suppose the regression line is as follows:

                                            Income = 3.5 + 0.65 (Education)

     This means that if education were 0, we would still expect an income of about 3.5 x $1,000 or $3,500. And for each extra year of education, we would expect an additional 0.65 x $1,000 or $650 in income.
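
    The sketch below does two things. First, it shows one standard way (least squares) to find A and b from data, using made-up points. Second, it uses the example equation above to make predictions. The function names are my own.

```python
def fit_line(x, y):
    """Least-squares constant A and slope b for the line Y = A + bX."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

def predict_income(education_years, a=3.5, b=0.65):
    """Predicted income in thousands of dollars, using the example equation."""
    return a + b * education_years

# Fitting a line to made-up points.
a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Fitted line: Y = {a:.2f} + {b:.2f} X")

# Predictions from Income = 3.5 + 0.65 (Education).
for years in (0, 12, 16):
    print(years, "years of education:", f"${predict_income(years) * 1000:,.0f}")
```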

     Let’s look at some real data. Here is the scatter plot and regression line for the American Government general knowledge scores as students enter the course using GPA as the independent variable. We might theorize that successful hard working students are more likely to have learned more from other courses and also probably read more as well than less successful students and therefore should do better on the general knowledge pretest.

                      

          

     The scatterplot and regression line support our theory. To put it in the scientific language we have been learning to use, we reject the null hypothesis that no relationship exists between GPA and American Government general knowledge pretest score. Each additional point in the GPA predicts about 3.2 additional correct answers. So someone with a 3.9 GPA would be expected to have about 3.2 more correct answers than someone with a GPA of 2.9.

     Another statistic, variance explained, or r² (which is just the correlation coefficient squared), tells us how well the regression line describes the data. Variance explained is expressed in percentages or proportions, so an r² of 0.55 means that the independent variable explained 55% of the change or variance in the dependent variable.

     Here are a couple of examples. To take the extreme case, suppose we had a regression line that perfectly fit the data in a positive relationship. The correlation would then be +1.0 and the r² would be (1.0)² = 1 or 100%. This means that knowing the independent variable allows us to explain ALL the change or all the variance in the dependent variable. A correlation of +0.8, an extremely strong relationship, means that the independent variable explains (0.8)² or 0.64 or 64% of the variance in the dependent variable.

     The error that is left over after we explain the variance might be called “unexplained variance.” We get a pictorial representation of this in MicroCase by looking at the residuals. Each residual is the vertical distance from the regression line to each data point. So if all points fell on the line, a perfect relationship, the residuals would be zero. The longer the residual lines, the more the unexplained variance.  
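
    Here is a sketch of how residuals connect to variance explained, using five made-up points. The constant (2.2) and slope (0.6) are the least-squares values for these particular points.

```python
x = [1, 2, 3, 4, 5]      # made-up independent variable
y = [2, 4, 5, 4, 5]      # made-up dependent variable
a, b = 2.2, 0.6          # least-squares line for these made-up points

predicted = [a + b * xi for xi in x]
# Each residual is the vertical distance between a point and the line.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

mean_y = sum(y) / len(y)
unexplained = sum(e ** 2 for e in residuals)          # squared residuals
total = sum((yi - mean_y) ** 2 for yi in y)           # total variation in y
r_squared = 1 - unexplained / total

print("Residuals:", [round(e, 2) for e in residuals])
print(f"Variance explained: {r_squared:.2f}")
```

    If every residual were zero, the unexplained part would vanish and the variance explained would be 1, or 100%.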

     Here are the residuals for the regression line that explains family income by years of education for the 2006 Aiken County Exit Poll.

                         

 

     You will note that this was a highly significant relationship (p = 0.000) that was strong (r = 0.408). The double stars after the r (**) are a standard way of saying that the relationship is significant at the 0.01 or 1% level. One star (*) is the standard way of saying that it is significant at the 0.05 or 5% level. But in this case, the stars do not add anything we do not already see, because MicroCase tells us a more precise probability (p = 0.000), so we already know that it is significant at better than the 1% level! If we computed the r², it would be (0.408)² = 0.166, so 16.6% of the variance of family income is explained by years of education.

     What is really cool is that we can also do multiple regression. That allows us to try to explain a dependent variable with several independent variables at once. The interpretation is about the same. Each independent variable has its own coefficient, sometimes called the b-coefficient. Moreover, if we look at the standardized coefficients, usually called the "betas," we can compare the influence of the independent variables. Comparing the unstandardized coefficients does not tell us much because the units of measurement greatly influence their values. However, we can't easily look at any regression line in multiple regression because each independent variable adds another dimension to any picture. Micro Case does a great job with both simple regression and multiple regression.

    Whether we do simple regression with one independent variable or multiple regression, we use a t test to evaluate the statistical significance of each coefficient. The significance of the regression as a whole is evaluated with an F test. As we did with chi square, these tests produce a p = 0.___ , which tells us the probability that we could have this relationship in a sample when no relationship exists in the population between this independent variable and this dependent variable. So interpret the p's the same way as the p in the chi square test. A p less than or equal to 0.05 is what we want in order to reject the null hypothesis that an independent variable in the regression had no impact on the dependent variable.

     Let’s do an example. Suppose we wanted to try to explain political knowledge using several independent variables: age (thinking that older people have had more time to learn things), years of education (thinking that time spent in school should make one more politically aware), and exposure to several news sources, measured as days in the previous week that one has looked at news in newspapers, on the Web, and on television. In the 2006 Aiken County Exit Poll we asked four political knowledge questions (producing a scale from 0 to 4) and we measured age, years of education, and exposure in the previous week to these news outlets. Here is what the MicroCase regression program produced.

                   

                  Analysis of Variance

Dependent Variable: PolKnow
N: 511          Missing: 122
Multiple R-Square = 0.125     Y-Intercept = 0.293
Standard error of the estimate = 1.012
LISTWISE deletion (1-tailed test)     Significance Levels: **=.01, *=.05

Source        Sum of Squares    DF    Mean Square      F       Prob.
REGRESSION         74.187        5       14.837      14.490    0.000
RESIDUAL          517.354      505        1.024
TOTAL             591.540      510

            Unstand.b    Stand.Beta    Std.Err.b      t
EducYrs       0.092        0.256         0.015      5.924 **
Age           0.009        0.122         0.003      2.659 **
PaperNws      0.039        0.097         0.018      2.169 *
TVNws         0.022        0.044         0.023      0.982
WebNws        0.045        0.122         0.016      2.800 **

 

     A lot of information is in the two screens I pasted in above, but if you know what to look for, it is fairly easy to interpret. The pictorial path diagram at the top tells you a great deal. It shows the independent variables on the left, with each having a path to the dependent variable, political knowledge. First it shows the total variance explained by all five independent variables (0.125 or 12.5%). That is not a whole lot, but the model using all the variables was statistically significant. You see that from the ** next to the r², which tells us that it is significant at the 1% level. We can also see a more precise measure of the significance in the “Analysis of Variance” table at the bottom in the F statistic (which is kind of like the chi square, but for different kinds of data), which is 14.49 and has a probability of 0.000. So the model is indeed very significant.

     However, that does not mean that all the independent variables had equal contributions in explaining change in the dependent variable. We can see the relative importance of each independent variable by looking at the Betas and their significances in both the diagram and the Analysis of Variance table below it. These are really standardized regression coefficients. Standardizing them, if you remember, changes all the different units of measurement (years, and days of the week) into standardized units (what we called standard deviation units) so that we can compare them on an equal footing, so to speak. In short, the bigger the Beta, the more important the independent variable. Betas that do not have at least one * are not significant at the 0.05 level and do not have a significant impact. We can also see the significance in the “t statistic” for each variable in the bottom of the Analysis of Variance table. From this you can see that education years is roughly twice as important (Beta of 0.256 that is significant at the 1% level) as age and web news, which are equal in importance (Betas of 0.122 that are significant at the 1% level). The other significant independent variable is news from newspapers, which is reasonably close in importance to web news and age (a Beta of 0.097 and significant at the 5% level). Interestingly, watching news on television had no significant impact. Its Beta was the smallest and it was not significant at even the 5% level.
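
    To see where some of these numbers come from, here is a sketch that recomputes the variance explained and the F statistic from the sums of squares in the Analysis of Variance table, and then uses the unstandardized coefficients to predict a score. The example person is hypothetical, and because the printed figures are rounded, the recomputed values match the output only approximately.

```python
# Figures copied from the MicroCase Analysis of Variance output.
ss_regression, df_regression = 74.187, 5
ss_residual, df_residual = 517.354, 505
ss_total = 591.540

r_square = ss_regression / ss_total
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

intercept = 0.293
coefficients = {"EducYrs": 0.092, "Age": 0.009, "PaperNws": 0.039,
                "TVNws": 0.022, "WebNws": 0.045}

def predict_polknow(person):
    """Predicted political knowledge score (0 to 4 scale) for one person."""
    return intercept + sum(coefficients[name] * person[name] for name in coefficients)

# Hypothetical person: 16 years of education, age 40, each news source 3 days a week.
example_person = {"EducYrs": 16, "Age": 40, "PaperNws": 3, "TVNws": 3, "WebNws": 3}
predicted_score = predict_polknow(example_person)

print(f"R-square = {r_square:.3f}, F = {f_statistic:.2f}")
print(f"Predicted score = {predicted_score:.2f}")
```

    Remember that the TVNws coefficient was not statistically significant, so its small contribution to the prediction should not be taken too seriously.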

    D. Special cases: when we have different levels of measurement for the independent and dependent variables

    As is often the situation, suppose we have different levels of measurement for the independent and dependent variables. Suppose for example, that the dependent variable is ratio (say family income to the nearest thousand dollars) and the independent is either nominal or ordinal (say ethnicity or broad education groups).

    To examine explanatory relationships we have two basic choices. Either we can do a lot of collapsing and do a crosstabulation. Alternatively, we can do something called analysis of variance, or ANOVA for short. What this does is look at the distribution of the dependent variable for each value of the independent variable. It compares the means for each of the distributions and calculates the chances that the distributions overlap so little that we should conclude that different values of the independent variable produce different values for the dependent variable.

     ANOVA produces a really cool graph that is not too hard to understand. MicroCase does a pretty good job at showing this graph. Here is the graphical representation for the ANOVA procedure using the American Government general knowledge pretest scores for USCA students as the dependent variable and gender as the independent variable. Below the graph is the table that compares the means, also produced by Micro Case. One might theorize that women should have lower scores than men here in the South, where women are socialized to think that politics is not for women. After all, South Carolina is the state that has the lowest percentage of females in the state legislature.

         

Means, Standard Deviations and Number of Cases of Dependent Var: t1
by Categories of Independent Var: gender
Difference of means across groups is statistically significant (Prob. = 0.000)

            N       Mean     Std.Dev.
male       554     16.883    10.131
female    1433     12.064     7.521

 

     You can see in the graph that females do have a lower mean score. The actual difference is shown in the table, where females scored nearly 5 questions lower (12.1 versus 16.9). But you can also see from the rectangular boxes around the means (which run from one standard deviation below the mean to one standard deviation above it and cover about 68% of all cases; imagine a normal curve rising out of the graph as a third dimension) that the distributions for men and women overlap quite a bit. The critical question is whether that difference, given the overlap, is statistically significant. In fact it is, looking at the probability in the table just below the graph (p = 0.000). That p was produced from the F statistic, which you can see in a third table produced by Micro Case in the ANOVA procedure. I did not show it here.

     One last little twist! What is really cool in ANOVA is that you can pretend that either variable is the independent variable. As long as you know which one is really your independent variable in your own theory, you can do the ANOVA procedure. You see, the stat program does not know what your theory is! Just like regression, you get an r² and an F test to allow you to see how much of the variation was explained and whether the relationship was statistically significant.
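
    Since I did not show the F table, here is a rough sketch of how an F statistic and the variance explained can be worked out from nothing more than the N, mean, and standard deviation of each group. It assumes the standard deviations in the means table were computed by dividing by N minus 1, which I have not verified for MicroCase, so treat the results as approximate rather than as the program's actual output.

```python
# N, mean, and standard deviation for each group, from the means table.
groups = {"male": (554, 16.883, 10.131), "female": (1433, 12.064, 7.521)}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Variation between the group means and variation inside the groups.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)
variance_explained = ss_between / (ss_between + ss_within)

print(f"F = {f_statistic:.1f}")
print(f"Variance explained = {variance_explained:.3f}")
```

    Under these assumptions the F is very large, which fits the p = 0.000 reported above, yet gender explains only about 6 percent of the variance in scores. A relationship can be highly significant without explaining very much.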

     That covers pretty much all the statistics we will be using in this course. You have already been using a lot of them. Hopefully, some more practice in the statistics for higher levels of measurement will help you get comfortable in using things like regression and ANOVA.

III. Summary

     Here is a summary table that will help you figure out what kinds of statistics to use along with a few more comments focusing in particular on crosstabulations and how to read them. 

     Suppose you have a hypothesis involving two variables. This is called a bivariate relationship. Now you need to use statistics to see if the data you have gathered support that hypothesis (or allow you to reject the null!). How you do this depends on the levels of measurement in both the independent and dependent variables.

     If both independent and dependent variables are nominal or ordinal with just a few values, you should know that a crosstabulation is appropriate. The other situations we have discussed are in the table below. But crosstabulation can almost always be used.

     The formula for reading any row in a crosstabulation is as follows. This wording works for almost any crosstabulation:

"As the independent variable changes from ___ to ___ (reading left to right), the percentage of those who are _______ (whatever that row is in the dependent variable) shifts (increases, decreases or however the percentages change, maybe even up and down) from ___ to ___."

 

                                        Independent Variable

Dependent    Nominal            Ordinal            Ratio
Variable

Nominal      X-tab              X-tab              Collapse I.V. & X-tab,
                                                   or ANOVA (if you reverse the variables)

Ordinal      X-tab              X-tab              Collapse I.V. & X-tab,
                                                   or ANOVA (if you reverse the variables)

Ratio        X-tab or ANOVA     X-tab or ANOVA     Collapse both variables & X-tab,
                                                   or scatterplot/regression

 
     The crosstabulation procedure (X-tab) works best when you have no more than three rows in the table. That means no more than three values for the dependent variable. Of course, you can always collapse rows to get this, or you can compare measures of central tendency (medians or means).

     It really does not matter how many columns you have, but you had better have few enough so that you have a sufficient number of cases in each cell. You can collapse columns to get this.

     A tradeoff exists between what is desirable theoretically and what is practical. Theory might tell us that many values for the independent variable are needed for a complete understanding of how the dependent variable reacts to different values of the independent variable. But we may only have enough cases in the sample to have only three columns.

     Cells with just a few cases in them can radically alter percentage shifts if one or two cases are in one cell rather than in another on that row. So pay attention when cell frequencies get low. You may need to do some collapsing. As you will see later, tests of statistical significance help us to keep a handle on this. If the cell frequencies get too small, the shift will not come out to be statistically significant.
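
     To tie the row-reading formula and the warning about small cells together, here is a sketch that computes column percentages for a made-up crosstabulation and fills in the blanks of the formula. All counts and names are hypothetical.

```python
# Made-up counts. Outer keys are columns (independent variable);
# inner keys are rows (dependent variable).
crosstab = {
    "white": {"Democrat": 40, "Republican": 60},
    "black": {"Democrat": 40, "Republican": 10},
}

def column_percentages(table):
    """Percentage of each column's cases that fall in each row."""
    result = {}
    for column, cells in table.items():
        total = sum(cells.values())
        result[column] = {row: 100 * count / total for row, count in cells.items()}
    return result

percentages = column_percentages(crosstab)
row, first, last = "Democrat", "white", "black"
sentence = (f"As ethnicity changes from {first} to {last}, the percentage of those "
            f"who are {row} shifts from {percentages[first][row]:.0f}% "
            f"to {percentages[last][row]:.0f}%.")
print(sentence)

# A column with only 5 cases: moving one person changes the percentage a lot.
small_before = column_percentages({"Hispanic": {"Democrat": 3, "Republican": 2}})
small_after = column_percentages({"Hispanic": {"Democrat": 2, "Republican": 3}})
small_shift = (small_before["Hispanic"]["Democrat"]
               - small_after["Hispanic"]["Democrat"])
print(f"One case moved: a {small_shift:.0f} point change")
```

     In the small column, one person changing cells moves the percentage by 20 points, which is exactly why low cell frequencies call for collapsing.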

 

 

Exercises Using our data and Micro Case

1. Using 2008 survey data, find a variable with high variation and one that has low variation for a nominal or an ordinal variable.

2. Look at ethnicity, a nominal level variable, and describe it in terms of central tendency and variation. Do the same for religious fundamentalism.

3. Look at age and years of education, both ratio level variables, and describe them in terms of central tendency and variation.

4.  Do appropriate collapsing to produce a usable frequency distribution for age and then for years of education.

5. Look at the relationship between ethnicity and religious fundamentalism in a crosstabulation. How strong is any relationship, and is it statistically significant?

6. a. Look at the relationship between years of education and family income using a crosstabulation. What do you have to do to make this crosstabulation readable? Do it! What is the strength of this relationship? How much of the variance in family income is explained by years of education? Is the relationship significant?

     b. Look at the relationship between years of education and family income using a scatterplot (here we are pretending that  family income is ratio measurement). What is the strength of this relationship? How much of the variance in family income is explained by years of education? Is the relationship significant?

7. Using ANOVA, examine the relationship between gender and years of education. How much of the variance in years of education is explained by gender? Is the relationship statistically significant?

8. Find two variables that you have some reason to think might be related, with at least one at the ratio level of measurement. Produce the appropriate statistics to evaluate any relationship in terms of strength and statistical significance. Be prepared to explain it to the class.

9. Now that you have had some practice using different statistics, let's put it all together to make sure you know exactly what to use and when. I will give you some hypotheses to test. I want you to figure out how to set up the appropriate table/chart/graph to test the bivariate relationship, interpret it as appropriate, produce and interpret the appropriate statistic(s) for how strong it is, and produce and interpret the appropriate statistic(s) for the significance of any relationships you find.

 

Copyright, Robert E. Botsch, 2010

last updated on 10/22/2010