Chi-square test for association (independence) | AP Statistics | Khan Academy
We're already familiar with the chi-squared statistic. If you're not, I encourage you to review the videos on that. And we've already done some hypothesis testing with the chi-squared statistic. We've even done some hypothesis testing based on two-way tables. Now we're going to extend that by thinking about a chi-squared test for association between two variables.
So let's say that we suspect that someone's foot length is related to their hand length, that these things are not independent. Well, what we can do is set up a hypothesis test. Remember, the null hypothesis in a hypothesis test always assumes no news. So what we could say here is that there is no association between foot and hand length. Another way to think about it is that they are independent, which is why this is often called a chi-square test for independence.
Then our alternative hypothesis would be our suspicion: there is an association, so foot and hand length are not independent. What we can then do is go to a population, and we can randomly sample it.
Let's say we randomly sample 100 folks. For all of those hundred folks, we figure out whether their right hand is longer, their left hand is longer, or both hands are the same. We also do that for the feet and we tabulate all of the data, and this is the data that we actually get. Now, it's worth thinking about this for a second. What we just did is different from a chi-squared test for homogeneity.
In a chi-square test for homogeneity, we sample from two different populations, or we look at two different groups, and we see whether the distribution of a certain variable amongst those two different groups is the same. Here, we are just sampling from one group, but we're thinking about two different variables for that one group. We're thinking about feet length and we're thinking about hand length.
You can see here that 11 folks had both their right hand longer and their right foot longer. Three folks had their right hand longer, but their left foot was longer. Then, eight folks had their right hand longer, but both feet were the same. Likewise, we had nine people where their left foot and hand were longer, but you had two people where the left hand was longer, but the right foot was longer.
We could go through all of these, but to do our chi-square test, we need to ask: what would be the expected value of each of these data points if we assumed that the null hypothesis was true, that there was no association between foot and hand length?
So, to help us do that, I'm going to make a total of our columns here and also a total of our rows. Let me draw a line here so we know what was going on. So, what are the total number of people who had a longer right hand? Well, it's going to be 11 plus 3 plus 8, which is 22. The total number of people who had a longer left hand is 2 plus 9 plus 14, which is 25.
Then, the total number of people whose hands had the same length: 12 plus 13 plus 28, that is 53. Finally, if I were to total this column, 22 plus 25 is 47 plus 53, we get 100 right over here. Then if we total the number of people who had a longer right foot: 11 plus 2 plus 12, that's 13 plus 12, that is 25.
The total with a longer left foot is 3 plus 9 plus 13; that's also 25. For the people whose feet are the same length, we could either add up 8 plus 14 plus 28 and we would get 50, or we could say, "Hey, 25 plus 25 plus what is 100?" Well, that is going to be equal to 50.
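The table and its totals can be written down in a few lines of Python. This is only a sketch: the counts are the ones read out above, while the variable names (`observed`, `row_totals`, `col_totals`, `n`) are my own labels, not anything from the video.

```python
# Observed counts from the sample of 100 people.
# Rows:    hand (right longer, left longer, same length)
# Columns: foot (right longer, left longer, same length)
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]

# Row totals are the hand totals, column totals are the foot totals.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

print(row_totals)  # [22, 25, 53]
print(col_totals)  # [25, 25, 50]
print(n)           # 100
```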
Now, to figure out these expected values, remember we're going to figure out the expected values assuming that the null hypothesis is true, assuming that these distributions are independent, that foot length and hand length are independent variables. Well, if they are independent, which we are assuming, then our best estimate is that 22 percent have a longer right hand, and our best estimate is that 25 percent have a longer right foot.
So, out of 100, you would expect 0.22 times 0.25 times 100 to have a longer right hand and foot. I'm just multiplying the probabilities, which you would do if these were independent variables.
So, 0.22 times 0.25—let's see, one-fourth of 22 is 5.5, so this is going to be equal to 5.5. Now, what number would you expect to have a longer right hand but a longer left foot? That would be 0.22 times 0.25 times 100. Well, we already calculated what that would be; that would be 5.5.
Then, to figure out the expected number that would have a longer right hand, but both feet would be the same length, we could multiply 22 out of 100 times 50 out of 100 times 100, which is going to be half of 22, which is equal to 11.
We can keep going. This value right over here would be 0.25 times 0.25 times 100. Twenty-five times twenty-five is 625, so that would be 6.25. This value right over here would be 0.25 times 0.25 times 100, which is again 6.25.
Then this value right over here, a couple of ways we can get it: we can multiply 0.25 times 0.50 times 100, which would get us to 12.5, or we could have said the expected counts in this row have to add up to 25, so 6.25 plus 6.25 plus this has to equal 25, and this would be 12.5.
Now this expected value, for people whose hands are the same length but whose right foot is longer, we can figure out because the column has to total 25: 5.5 plus 6.25 plus this is going to equal 25. So let's see: 5.5 plus 6.25 is 11.75, and 11.75 plus 13.25 is equal to 25, so this is 13.25.
Same thing over here for the longer left foot: this would be 13.25 because 11.75 plus 13.25 is equal to 25. If we add these two together, we get 26.5, and the row has to total 53, so the last expected value is another 26.5.
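Multiplying the two proportions by the sample size is the same as taking row total times column total divided by the grand total, so all nine expected counts can be filled in at once. A minimal sketch, with variable names that are my own:

```python
# Observed counts: rows are hand categories, columns are foot categories.
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence: expected = (row total / n) * (column total / n) * n,
# which simplifies to row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
# [5.5, 5.5, 11.0]
# [6.25, 6.25, 12.5]
# [13.25, 13.25, 26.5]
```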
Now, once you figure out all of your expected values, that's a good time to test your conditions. The first condition is that you took a random sample. So let's assume we had done that. The second condition is that your expected value for any of the data points has to be at least equal to five. We can see that all of our expected values are at least equal to five.
The actual data points we got do not have to be at least five, so it's okay that we got a two here because the expected value here is five or larger. The last condition is the independence condition: either we are sampling with replacement, or we have to feel comfortable that our sample size is no more than 10 percent of the population. So let's assume that that happened as well.
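Two of these conditions are easy to check mechanically. The sketch below uses the expected counts worked out above; the names `large_counts_ok` and `min_population` are hypothetical labels of mine. The random-sample condition cannot be checked from the numbers, so it stays an assumption.

```python
# Expected counts under the null hypothesis (from the walkthrough).
expected = [
    [5.5, 5.5, 11.0],
    [6.25, 6.25, 12.5],
    [13.25, 13.25, 26.5],
]
n = 100  # sample size

# Large counts condition: every expected count must be at least 5.
# The observed counts are not part of this check.
smallest_expected = min(min(row) for row in expected)
large_counts_ok = smallest_expected >= 5

# 10 percent condition (sampling without replacement): the sample is at most
# 10 percent of the population when the population has at least 10 * n members.
min_population = 10 * n

print(smallest_expected, large_counts_ok)  # 5.5 True
print(min_population)                      # 1000
```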
So, assuming we met all of those conditions, we are ready to calculate our chi-squared statistic. What we're going to do is, for every data point, find the difference between the data point and the expected value, square it, divide by the expected value, and then add all of those up.
So for the first one, that's 11 minus 5.5 squared over 5.5. Now I'll do this one: plus 3 minus 5.5 squared over 5.5. Plus now I'll do this one: 8 minus 11 squared over 11. Then I'll do this one: 2 minus 6.25 squared over 6.25, and I'll keep doing it. I'm going to do it for all nine of these data points.
I actually calculated this ahead of time to save some time, and if you do this for all nine of the data points, you're going to get a chi-square statistic of 11.942.
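If you want to check that number yourself, the nine terms can be summed in a short loop. This is a sketch using the observed and expected counts from above, with names of my own choosing:

```python
observed = [
    [11, 3, 8],
    [2, 9, 14],
    [12, 13, 28],
]
expected = [
    [5.5, 5.5, 11.0],
    [6.25, 6.25, 12.5],
    [13.25, 13.25, 26.5],
]

# Sum of (observed - expected)^2 / expected over all nine cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(round(chi_square, 3))  # 11.942
```

Most of that total comes from the first cell: 11 observed against 5.5 expected contributes 5.5 all by itself.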
Now, before we calculate the p-value, we're going to think about what our degrees of freedom are. Now we have a three by three table here. So one way to think about it is the number of rows minus one times the number of columns minus one. This is two times two, which is equal to four.
Another way to think about it is if you know four of these data points and you know the totals, then you could figure out the other five data points. So now we are ready to calculate a p-value, and you could do that using a calculator or you could do that using a chi-squared table. But let's say we did it using a calculator and we get a p value of 0.018.
Just to remind ourselves what this is: assuming the null hypothesis is true, it is the probability of getting a chi-squared statistic at least this large. Next, we do what we always do with hypothesis testing: we compare this to our significance level.
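A calculator does this step for you, but the degrees of freedom and the p-value can also be reproduced with only the standard library. The sketch below relies on the fact that, when the degrees of freedom are even, the chi-square upper-tail probability has a closed form; the helper name `chi2_upper_tail_even_df` is my own, and it is not a general replacement for a statistics library.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable with a positive, even df.

    For df = 2m the upper tail equals exp(-x/2) * sum_{k<m} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )

rows, cols = 3, 3
df = (rows - 1) * (cols - 1)

chi_square = 11.942  # statistic from the walkthrough
p_value = chi2_upper_tail_even_df(chi_square, df)

print(df)                 # 4
print(round(p_value, 3))  # 0.018
```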
We actually should have set our significance level from the beginning, so let's just assume that when we set up our hypotheses here, we also said that we want a significance level of 0.05. You really should do this before you calculate all of this, but then you compare your p value to your significance level.
We see that this p value is a good bit less than our significance level. So one way to think about it is we got all these expected values assuming that the null hypothesis was true. The probability of getting a result this extreme or more extreme is less than two percent, which is lower than our significance level.
So this will lead us to reject our null hypothesis, and it suggests to us that there is an association between hand length and foot length.
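The final comparison is simple enough to write as a one-line rule. This is just an illustration with a hypothetical helper called `decide`. It also shows why the significance level should be fixed in advance: the same p-value leads to a different conclusion at a stricter level.

```python
def decide(p_value, alpha):
    """Reject the null hypothesis only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

p_value = 0.018  # p-value from the walkthrough

print(decide(p_value, 0.05))  # reject H0
print(decide(p_value, 0.01))  # fail to reject H0
```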