
Calculating correlation coefficient r | AP Statistics | Khan Academy


7m read
·Nov 11, 2024

What we're going to do in this video is calculate by hand the correlation coefficient for a set of bivariate data. When I say bivariate, it's just a fancy way of saying for each x data point, there is a corresponding y data point.

Now, before I calculate the correlation coefficient, let's just make sure we understand some of these other statistics that they've given us. We assume that these are samples of the x and the corresponding y from a broader population, and so we have the sample mean for x and the sample standard deviation for x.

The sample mean for x is quite straightforward to calculate. It would just be 1 plus 2 plus 2 plus 3 over 4, and this is 8 over 4, which is indeed equal to 2. The sample standard deviation for x, we've also seen this before; this should be a little bit of review. It's going to be the square root of the sum of the squared distances from each of these points to the sample mean, divided by one less than the number of data points.

So, 1 minus 2 squared plus 2 minus 2 squared plus 2 minus 2 squared plus 3 minus 2 squared, all of that over, since we're talking about sample standard deviation and we have four data points, one less than four, so all of that over three. Now this actually simplifies quite nicely, because the two middle terms are zero and the first and last terms are each one.

So, you essentially get the square root of two-thirds, which is, if you approximate, 0.816. So that's that. The same thing is true for y. The sample mean for y, if you just add up 1 plus 2 plus 3 plus 6 over 4, four data points, this is 12 over 4, which is indeed equal to 3.

Then the sample standard deviation for y, you would calculate the exact same way we did it for x, and you get 2.160. Now, with all of that out of the way, let's think about how we calculate the correlation coefficient.
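The summary statistics above can be checked with a few lines of Python. This is just a sketch using the standard library's statistics module; the variable names are my own, and the four data pairs are the ones used in the video.

```python
from statistics import mean, stdev

# The four (x, y) pairs from the video
x = [1, 2, 2, 3]
y = [1, 2, 3, 6]

x_bar = mean(x)   # sample mean of x
y_bar = mean(y)   # sample mean of y
s_x = stdev(x)    # sample standard deviation (divides by n - 1)
s_y = stdev(y)

print("x:", x_bar, round(s_x, 3))
print("y:", y_bar, round(s_y, 3))
```

Rounded to three decimal places, the two standard deviations come out to 0.816 and 2.160, matching the values given in the video.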

Now, right over here is a representation for the formula for the correlation coefficient, and at first, it might seem a little intimidating until you realize a few things. All this is saying is for each corresponding x and y, find the z-score for x.

So, we could call this z sub x for that particular x, so z sub x sub i, and we could say this is the z-score for that particular y, z sub y sub i, is one way that you could think about it. Look, this is just saying for each data point, find the difference between it and its mean, and then divide by the sample standard deviation.

So, that's how many sample standard deviations it is away from its mean, and so that's the z-score for that x data point, and this is the z-score for the corresponding y data point. How many sample standard deviations is it away from the sample mean?
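The formula itself is on screen in the video and does not appear in this transcript. Based on the description here (a z-score for each x, a z-score for each y, and the sum of their products scaled by one over n minus one), it can be written roughly as follows. The subscript notation is my own reconstruction.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} \, z_{y_i}
```

Here \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations.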

In the real world, you won't have only four pairs, and it will be very hard to do it by hand, and we typically use software, computer tools to do it. But it's really valuable to do it by hand to get an intuitive understanding of what's going on here.

So, in this particular situation, r is going to be equal to one over n minus one. We have 4 pairs, so it's going to be 1 over 3, and it's going to be times a sum of the products of the z-scores. So this first pair right over here, the z-score for the x is going to be 1 minus 2, that's how far it is away from the x sample mean, divided by the x sample standard deviation, 0.816, and that gets multiplied by the z-score for the corresponding y.

Now, we're looking at the y variable, the y z-score, so it's 1 minus 3, 1 minus 3 over the y sample standard deviation 2.16, and we're just going to keep doing that. I'll do it like this. So the next one, it's going to be 2 minus 2 over 0.816.

This is where I got the 2 from, and I'm subtracting the sample mean from it. Times, now we're looking at this 2 for y, 2 minus 3 over 2.160. Plus, and I'm happy there are only four pairs here, 2 minus 2 again over 0.816.

Times, now we're going to have 3 minus 3 over 2.160. And then for the last pair you're going to have 3 minus 2 over 0.816 times 6 minus 3 over 2.160.

So before I get a calculator out, let's see if there are some simplifications I can do. 2 minus 2, that's going to be 0. 0 times anything is 0, so this whole thing is 0. 2 minus 2 is 0, 3 minus 3 is 0; this is actually going to be 0 times 0, so that whole thing is 0.

Let's see, this is going to be 1 minus 2, which is negative 1, and 1 minus 3 is negative 2. So r is equal to 1/3 times, well, negative times negative is positive, so the first term is going to be 2 over the product 0.816 times 2.160.

And then plus 3 minus 2 is 1, 6 minus 3 is 3, so plus 3 over 0.816 times 2.16. Well, these are the same denominator, so actually I could rewrite if I have 2 over this thing plus 3 over this thing, that's going to be 5 over this thing.

So I could rewrite this whole thing as 5 over 0.816 times 2.160, and now I can just get a calculator out to actually calculate this. So we have 1 divided by 3, times 5, divided by, open parentheses, 0.816 times 2.160. The trailing zero won't make a difference, but I'll just write it down, and then I will close the parentheses.

And let's see what we get. We get an r of, and since everything else goes to the thousandths place, I'll just round to the thousandths place, an r of 0.946. So r is approximately 0.946.
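The hand calculation can be reproduced in Python as a sketch. The helper name r_from is hypothetical. It sums the products of the z-scores and divides by n minus one, first with the rounded standard deviations used in the video and then with the unrounded ones.

```python
from statistics import mean, stdev

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

def r_from(s_x, s_y):
    """Sum of z-score products divided by n - 1, for given standard deviations."""
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)

# Same rounded standard deviations as in the video
r_rounded = r_from(0.816, 2.160)

# Unrounded standard deviations computed from the data
r_full = r_from(stdev(x), stdev(y))

print(round(r_rounded, 3), round(r_full, 3))
```

With the rounded standard deviations this gives 0.946, as in the video. With unrounded values it comes out closer to 0.945. The small gap is only a rounding effect.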

So what does this tell us? The correlation coefficient is a measure of how well a line can describe the relationship between x and y. r is always going to be greater than or equal to negative one and less than or equal to one.

If r is positive one, it means that an upward sloping line can completely describe the relationship. If r is negative one, it means a downward sloping line can completely describe the relationship. An r anywhere in between says, well, a line won't describe the relationship as well.

If r is zero, that means that a line isn't describing the relationship well at all. Now in our situation here, not to use a pun, our r is pretty close to one, which means that a line can get pretty close to describing the relationship between our x's and our y's.
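To see the extreme cases, here is a small sketch with made-up data sets that are not from the video: one lying exactly on an upward sloping line, one exactly on a downward sloping line, and one with no linear trend. The function name correlation is my own.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r as the scaled sum of z-score products."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data: points exactly on y = 2x + 1
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])

# Hypothetical data: points exactly on y = -2x + 11
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])

# Hypothetical data: a symmetric V shape with no linear trend
r_flat = correlation([-1, 0, 1], [1, 0, 1])

print(round(r_up, 6), round(r_down, 6), round(r_flat, 6))
```

Up to floating-point rounding, the three values are 1, negative 1, and 0. The V-shaped data shows that an r of zero does not mean there is no relationship, only that a line doesn't capture it.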

So, for example, I'm just going to try to hand draw a line here, and it does turn out that our least squares line will always go through the mean of the x and the y. So the mean of the x is two, mean of the y is three.

We'll study that in more depth in future videos, but let's see, this actually does look like a pretty good line. So let me just draw it right over there. You see that I actually can draw a line that gets pretty close to describing it.
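The claim that the least squares line passes through the point of means can be checked numerically. The video leaves the details for later, so this is only a sketch: it uses the standard least squares slope, the sum of the products of deviations divided by the sum of squared x deviations, which is not derived in this transcript.

```python
from statistics import mean

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
x_bar, y_bar = mean(x), mean(y)

# Standard least squares slope and intercept (assumed, not from the video)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Evaluate the fitted line at the mean of x
predicted_at_mean = intercept + slope * x_bar

print(slope, intercept, predicted_at_mean)
```

For this data the slope is 2.5 and the intercept is negative 2, and the line's value at x equals 2 is exactly 3, the mean of y. The hand-drawn line in the video is only an approximation of this one.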

It isn't perfect; if it went through every point, then I would have an r of one, but it gets pretty close to describing what is going on. Now, the next thing I want to do is focus on the intuition. What was actually going on here with these z-scores, and how does taking products of corresponding z-scores get us this property that I just talked about, where an r of 1 would be a strong positive correlation and an r of negative 1 would be a strong negative correlation?

Well, let's draw the sample means here. So the x sample mean is 2; this is our x-axis here. This is x equals 2, and our y sample mean is 3.

This is the line y is equal to 3. Now we can also draw the standard deviations. This is, let's see, the standard deviation for x is 0.816, so I'll be approximating it. So if I go 0.816 less than our mean, it'll get us someplace around there.

So that's one standard deviation below the mean; one standard deviation above the mean would put us someplace right over here. And if I do the same thing in y, one standard deviation above the mean is 3 plus 2.160, so that would be 5.160, so it would put us someplace around there.

And one standard deviation below the mean, so let's see, if we took away two, we would go to one, and then we're going to take away another 0.160, so it's going to be at about 0.840, somewhere right around here.

So, for example, for this first pair, 1 comma 1, what were we doing? Well, we said, all right, how many standard deviations is this below the mean? And that turns out to be negative 1 over 0.816. That's what we have right over here; that's what this would have calculated.

And then how many standard deviations in the y direction? And that is our negative 2 over 2.160. But notice, since both of them were negative, the product is a positive value, and so it contributed to the r. One way to think about it is that it's helping us get closer to one.

If both of them have a negative z-score, that pair pushes toward a positive correlation between the variables. When one is below the mean, the other is, you could say, similarly below the mean.

Now, if we go to the next data point, 2 comma 2, right over here, what happened? Well, the x variable was right on the mean, and because of that, that entire term became 0. The x z-score was zero, and so that would have taken away a little bit from our correlation coefficient.

The reason it takes away, even though the term isn't negative, is that the pair contributes nothing to the sum, but by including that extra pair you're dividing by a slightly higher value.

If you had a data point where, let's say, x was below the mean and y was above the mean, something like this, the term, if this was one of the points, this term would have been negative because the y z-score would have been positive, and the x z-score would have been negative.

And so when you put it in the sum, it would have actually taken away from the sum and so would have made the r score even lower. Similarly, something like this would have made the r score even lower, because you would have a positive z-score for x and a negative z-score for y, and so a product of a positive and a negative would be a negative.
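The sign argument can be made concrete with a short sketch. It lists each pair's z-score product for the original data and then adds one hypothetical point, (1, 5), with x below the mean and y above it. That extra point and the helper names are my own and are not from the video.

```python
from statistics import mean, stdev

def z_products(xs, ys):
    """Product of the x and y z-scores for each pair."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(xs, ys)]

def correlation(xs, ys):
    """Correlation coefficient r: sum of the products over n - 1."""
    return sum(z_products(xs, ys)) / (len(xs) - 1)

x = [1, 2, 2, 3]
y = [1, 2, 3, 6]
products = z_products(x, y)      # positive, zero, zero, positive
r_original = correlation(x, y)

# Hypothetical extra point: x below its mean, y above its mean
x_extra = x + [1]
y_extra = y + [5]
products_extra = z_products(x_extra, y_extra)
r_extra = correlation(x_extra, y_extra)

print([round(p, 3) for p in products], round(r_original, 3))
print([round(p, 3) for p in products_extra], round(r_extra, 3))
```

The new point's product is negative, and r drops from about 0.945 to about 0.49. Adding a point also shifts the means and standard deviations, so the other products change too.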
