Calculating correlation coefficient r | AP Statistics | Khan Academy
What we're going to do in this video is calculate by hand the correlation coefficient for a set of bivariate data. When I say bivariate, it's just a fancy way of saying for each x data point, there is a corresponding y data point.
Now, before I calculate the correlation coefficient, let's just make sure we understand some of these other statistics that they've given us. We assume that these are samples of the x and the corresponding y from a broader population, and so we have the sample mean for x and the sample standard deviation for x.
The sample mean for x is quite straightforward to calculate. It would just be 1 plus 2 plus 2 plus 3 over 4, and this is 8 over 4, which is indeed equal to 2. The sample standard deviation for x, we've also seen this before; this should be a little bit of review. It's going to be the square root of the sum of the squared distances from each of these points to the sample mean, divided by one less than the number of data points.
So, 1 minus 2 squared plus 2 minus 2 squared plus 2 minus 2 squared plus 3 minus 2 squared, all of that over, since we're talking about sample standard deviation and we have four data points, one less than four, so all of that over three. Now this actually simplifies quite nicely because the two middle terms are zero and the first and last terms are each one.
So, you essentially get the square root of two-thirds, which is, if you approximate, 0.816. So that's that. The same thing is true for y. The sample mean for y, if you just add up 1 plus 2 plus 3 plus 6 over 4, four data points, this is 12 over 4, which is indeed equal to 3.
Then the sample standard deviation for y, you would calculate the exact same way we did it for x, and you get 2.160. Now, with all of that out of the way, let's think about how we calculate the correlation coefficient.
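As a quick check on these summary statistics, here is a minimal Python sketch that recomputes them from the four pairs. The helper names sample_mean and sample_std are made up for this illustration; they are not from the video.

```python
import math

# The four (x, y) pairs from the video, split into two lists.
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

def sample_mean(values):
    """Arithmetic mean of the data points."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation: divide by n - 1, then take the square root."""
    m = sample_mean(values)
    squared_devs = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_devs / (len(values) - 1))

x_bar, s_x = sample_mean(x), sample_std(x)
y_bar, s_y = sample_mean(y), sample_std(y)

print(f"x: mean = {x_bar:.3f}, sd = {s_x:.3f}")  # x: mean = 2.000, sd = 0.816
print(f"y: mean = {y_bar:.3f}, sd = {s_y:.3f}")  # y: mean = 3.000, sd = 2.160
```

Python's standard library function statistics.stdev also divides by n minus 1, so it should agree with sample_std here.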
Now, right over here is a representation for the formula for the correlation coefficient, and at first, it might seem a little intimidating until you realize a few things. All this is saying is for each corresponding x and y, find the z-score for x.
So, we could call this z sub x for that particular x, so z sub x sub i, and we could say this is the z-score for that particular y, z sub y sub i; that's one way that you could think about it. Look, this is just saying for each data point, find the difference between it and its mean, and then divide by the sample standard deviation.
So, that's how many sample standard deviations it is away from its mean, and so that's the z-score for that x data point, and this is the z-score for the corresponding y data point. How many sample standard deviations is it away from the sample mean?
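Written out in symbols, the formula being described can be sketched as follows. The sample means and sample standard deviations are the ones named in the transcript; the index i and the summation limits are notation added here for clarity.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}
```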
In the real world, you won't have only four pairs, and it will be very hard to do it by hand, and we typically use software, computer tools to do it. But it's really valuable to do it by hand to get an intuitive understanding of what's going on here.
So, in this particular situation, r is going to be equal to one over n minus one. We have 4 pairs, so it's going to be 1 over 3, and it's going to be times a sum of the products of the z-scores. So for this first pair right over here, the z-score for x is going to be 1 minus 2, how far it is away from the x sample mean, divided by the x sample standard deviation, 0.816.
Times, now we're looking at the y variable, the y z-score, so it's 1 minus 3 over the y sample standard deviation, 2.160, and we're just going to keep doing that. I'll do it like this. So the next one, it's going to be plus 2 minus 2 over 0.816.
This is where I got the 2 from, and I'm subtracting from that the sample mean right over here. Times, now we're looking at this 2, 2 minus 3 over 2.160. Plus, and I'm happy there's only four pairs here, 2 minus 2 again, 2 minus 2 over 0.816.
Times, now we're going to have 3 minus 3 over 2.160. And then for the last pair you're going to have plus 3 minus 2 over 0.816 times 6 minus 3 over 2.160.
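Since the spoken version is hard to follow without the screen, here is the same sum written out term by term, using the four pairs and the rounded standard deviations 0.816 and 2.160 from the transcript.

```latex
r = \frac{1}{3} \left[
      \frac{1 - 2}{0.816} \cdot \frac{1 - 3}{2.160}
    + \frac{2 - 2}{0.816} \cdot \frac{2 - 3}{2.160}
    + \frac{2 - 2}{0.816} \cdot \frac{3 - 3}{2.160}
    + \frac{3 - 2}{0.816} \cdot \frac{6 - 3}{2.160}
    \right]
```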
So before I get a calculator out, let's see if there are some simplifications I can do. 2 minus 2, that's going to be 0. 0 times anything is 0, so this whole thing is 0. 2 minus 2 is 0 and 3 minus 3 is 0, so this is actually going to be 0 times 0, and that whole thing is 0.
Let's see, this is going to be 1 minus 2, which is negative 1, and 1 minus 3 is negative 2. So r is equal to 1/3 times, and negative times negative is positive, so this is going to be 2 over 0.816 times 2.160.
And then plus 3 minus 2 is 1, 6 minus 3 is 3, so plus 3 over 0.816 times 2.16. Well, these are the same denominator, so actually I could rewrite if I have 2 over this thing plus 3 over this thing, that's going to be 5 over this thing.
So I could rewrite this whole sum as 5 over 0.816 times 2.160, and now I can just get a calculator out to actually calculate this. So we have 1 divided by 3, times 5 divided by, in parentheses, 0.816 times 2.16. The zero won't make a difference, but I'll just write it down, and then I will close the parentheses.
And let's see what we get. We get an r of, and since everything else goes to the thousandths place, I'll just round to the thousandths place, an r of 0.946. So r is approximately 0.946.
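The hand calculation can be reproduced with a short Python sketch. The function name correlation is hypothetical, chosen for this example. One thing worth noticing: plugging in the rounded standard deviations 0.816 and 2.160, as the video does, gives 0.946, while carrying the unrounded standard deviations through gives a value that rounds to 0.945.

```python
import math

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
n = len(x)

# Unrounded sample statistics.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))

def correlation(xs, ys, mean_x, sd_x, mean_y, sd_y):
    """Sum of the products of paired z-scores, divided by n - 1."""
    total = sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
        for a, b in zip(xs, ys)
    )
    return total / (len(xs) - 1)

r_exact = correlation(x, y, x_bar, s_x, y_bar, s_y)
r_rounded = correlation(x, y, 2, 0.816, 3, 2.160)  # the video's rounded inputs

print(f"{r_exact:.3f}")    # 0.945
print(f"{r_rounded:.3f}")  # 0.946
```

The gap comes only from rounding the standard deviations before dividing; either way the conclusion that r is close to 1 is unchanged.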
So what does this tell us? The correlation coefficient is a measure of how well a line can describe the relationship between x and y. r is always going to be greater than or equal to negative one and less than or equal to one.
If r is positive one, it means that an upward sloping line can completely describe the relationship. If r is negative one, it means a downward sloping line can completely describe the relationship. An r anywhere in between means a line won't describe the relationship quite as well.
If r is zero, that means that a line isn't describing the relationship well at all. Now in our situation here, not to use a pun, in our situation here, our r is pretty close to one, which means that a line can get pretty close to describing the relationship between our x's and our y's.
So, for example, I'm just going to try to hand draw a line here, and it does turn out that our least squares line will always go through the mean of the x and the y. So the mean of the x is two, mean of the y is three.
We'll study that in more depth in future videos, but let's see, this actually does look like a pretty good line. So let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it.
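The claim that the least squares line goes through the point of means can be checked numerically. The video draws its line by hand and defers the details, so the slope and intercept formulas below are the standard least squares ones brought in as an assumption for this sketch, not something derived in the video.

```python
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard least squares formulas (an assumption here, not from the video):
# slope = sum of cross-deviations / sum of squared x-deviations.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

print(slope, intercept)             # 2.5 -2.0
print(slope * x_bar + intercept)    # 3.0, the sample mean of y
```

Evaluating the fitted line at x equals 2 returns 3, so the line passes through the point (2, 3) as stated.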
It isn't perfect; if it went through every point, then I would have an r of one, but it gets pretty close to describing what is going on. Now, the next thing I want to do is focus on the intuition. What was actually going on here with these z-scores, and how does taking products of corresponding z-scores get us this property that I just talked about, where an r of 1 would be a strong positive correlation and an r of negative 1 would be a strong negative correlation?
Well, let's draw the sample means here. So the x sample mean is 2; this is our x-axis here. This is x equals 2, and our y sample mean is 3.
This is the line y is equal to 3. Now we can also draw the standard deviations. This is, let's see, the standard deviation for x is 0.816, so I'll be approximating it. So if I go 0.816 less than our mean, it'll get us someplace around there.
So that's one standard deviation below the mean; one standard deviation above the mean would put us someplace right over here. And if I do the same thing in y, one standard deviation above the mean 2.160, so that would be 5.160, so it would put us someplace around there.
And one standard deviation below the mean, so let's see, if we took away two, we would go to one, and then we're going to take away another 0.160, so that's 0.840, and it's going to be somewhere right around here.
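The positions of these one-standard-deviation marks can be computed directly. This is a small sketch using the rounded values from the transcript; the dictionary name bands is made up for the example.

```python
# Rounded sample statistics as given in the video.
stats = {"x": (2, 0.816), "y": (3, 2.160)}

# For each variable, one standard deviation below and above the mean.
bands = {name: (mean - sd, mean + sd) for name, (mean, sd) in stats.items()}

for name, (low, high) in bands.items():
    print(f"{name}: {low:.3f} to {high:.3f}")
# x: 1.184 to 2.816
# y: 0.840 to 5.160
```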
So, for example, for this first pair, 1 comma 1, what were we doing? Well, we said, all right, how many standard deviations is this below the mean? And that turns out to be negative 1 over 0.816. That's what we have right over here; that's what this would have calculated.
And then how many standard deviations is it from the mean in the y direction? And that is our negative 2 over 2.160. But notice, since both of them were negative, their product became a positive value that contributed to the r, and so one way to think about it is that it's helping us get closer to one.
If both of them have a negative z-score, that points toward a positive correlation between the variables. When one is below the mean, the other is, you could say, similarly below the mean.
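To see the sign of each contribution, here is a minimal sketch that prints both z-scores and their product for all four pairs. It uses unrounded standard deviations, so the digits differ slightly from the hand calculation.

```python
import math

pairs = [(1, 1), (2, 2), (2, 3), (3, 6)]
n = len(pairs)

x_bar = sum(px for px, _ in pairs) / n
y_bar = sum(py for _, py in pairs) / n
s_x = math.sqrt(sum((px - x_bar) ** 2 for px, _ in pairs) / (n - 1))
s_y = math.sqrt(sum((py - y_bar) ** 2 for _, py in pairs) / (n - 1))

products = []
for px, py in pairs:
    z_x = (px - x_bar) / s_x
    z_y = (py - y_bar) / s_y
    product = z_x * z_y + 0.0  # adding 0.0 turns a floating-point -0.0 into 0.0
    products.append(product)
    print(f"({px}, {py}): z_x = {z_x:+.3f}, z_y = {z_y:+.3f}, product = {product:+.3f}")

r = sum(products) / (n - 1)
print(f"r = {r:.3f}")
```

The first pair has two negative z-scores and the last pair has two positive ones, so both products are positive; the two middle pairs have an x z-score of zero and add nothing to the sum.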
Now, if we go to the next data point, 2 comma 2, right over here, what happened? Well, the x variable was right on the mean, and because of that, that entire term became 0. The x z-score was zero, and so that would have taken away a little bit from our correlation coefficient.
The reason it would take away, even though the term isn't negative, is that you're not contributing anything to the sum, but you're going to be dividing by a slightly higher value by including that extra pair.
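One rough way to test this is to compute r with and without the pair (2, 2). This is only a sketch: dropping a pair also shifts the sample mean and standard deviation of y, so it does not isolate the effect of the n minus 1 divisor alone. The function name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient as the sum of z-score products over n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)) / (n - 1)

r_with = pearson_r([1, 2, 2, 3], [1, 2, 3, 6])  # all four pairs
r_without = pearson_r([1, 2, 3], [1, 3, 6])     # the pair (2, 2) dropped

print(f"{r_with:.3f}")     # 0.945
print(f"{r_without:.3f}")  # 0.993
```

For this data set, r is higher without the pair whose x z-score is zero, which is consistent with the point being made.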
If you had a data point where, let's say, x was below the mean and y was above the mean, something like this, the term, if this was one of the points, this term would have been negative because the y z-score would have been positive, and the x z-score would have been negative.
And so when you put it in the sum, it would have actually taken away from the sum and so would have made the r even lower. Similarly, something like this would have made the r even lower, because you would have a positive z-score for x and a negative z-score for y, and so a product of a positive and a negative would be a negative.
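The effect of such a point can be tried out numerically. The extra points (1, 5) and (3, 1) below are hypothetical, picked for this sketch to sit in the "x below, y above" and "x above, y below" regions; they are not the points drawn in the video. The function name pearson_r is also made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient as the sum of z-score products over n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys) / (n - 1))
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys)) / (n - 1)

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

r_original = pearson_r(x, y)
r_upper_left = pearson_r(x + [1], y + [5])   # hypothetical point: low x, high y
r_lower_right = pearson_r(x + [3], y + [1])  # hypothetical point: high x, low y

print(f"original:              {r_original:.3f}")
print(f"with (1, 5) added:     {r_upper_left:.3f}")
print(f"with (3, 1) added:     {r_lower_right:.3f}")
```

Both added points pull r down noticeably from its original value, in line with the negative product each one contributes.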