Example: Correlation coefficient intuition | Mathematics I | High School Math | Khan Academy
So I took some screen captures from the Khan Academy exercise on correlation coefficient intuition. They've given us some correlation coefficients, and we need to match them to the various scatter plots on that exercise. There's a little interface where we can drag these around in a table to match them to the different scatter plots.
The point isn't to figure out how exactly to calculate these; we'll do that in the future, but really to get an intuition of what we're trying to measure. The main idea is that correlation coefficients are trying to measure how well a linear model can describe the relationship between two variables.
For example, if I have... let me draw some coordinate axes here. So, let's say that's one variable; say that's my y variable, and let's say that is my X variable. And so, let's say when X is low, Y is low. When X is a little higher, Y is a little higher. When X is a little bit higher, Y is higher. When X is really high, Y is even higher. This one, a linear model would describe it very, very well.
We can... it's quite easy to draw a line that effectively goes through those points. Something like this would have an R of one. R is equal to one; a linear model perfectly describes it. It's a positive correlation. When one increases, when one variable gets larger, then the other variable is larger. When one variable is smaller, then the other variable is smaller, and vice versa.
Now, what would an R of negative 1 look like? Well, that would once again be a situation where a linear model works really well, but when one variable moves up, the other one moves down and vice versa. So let me draw my coordinate axes again. I'm going to try to draw a dataset where the R would be negative 1.
So maybe when Y is high, X is very low. When Y becomes lower, X becomes higher. When Y becomes a good bit lower, X becomes a good bit higher. So once again, when Y decreases, X increases or as X increases, Y decreases. So they're moving in opposite directions. But you can fit a line very easily to this. So the line would look something like this.
So this would have an R of negative 1. An R of zero, R is equal to zero, would be a dataset where a line doesn't really fit very well at all. I'll do that one really small since I don't have much space here.
So an R of zero might look something like this. Oh, maybe I have a data point here. Maybe I have a data point here. Maybe I have a data point here. Maybe I have one there, there, there, there, and it wouldn’t necessarily be this well organized. But this gives you a sense of things.
How would you actually try to fit a line here? You could equally justify a line that looks like that or a line that looks like that or a line that looks like that. So there really isn't a linear model that describes the relationship between the two variables that well right over here.
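To put numbers on those three sketches, here is a minimal Python sketch. It assumes R means the standard Pearson correlation coefficient, which the video does not actually calculate. The helper name `pearson_r` and the tiny datasets are made up for illustration; they are not the points drawn on screen.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]
r_up = pearson_r(xs, [2, 4, 6, 8])    # made-up points sitting exactly on a rising line
r_down = pearson_r(xs, [8, 6, 4, 2])  # made-up points sitting exactly on a falling line
# Four corners of a square: no direction of line fits better than another
r_none = pearson_r([0, 0, 1, 1], [0, 1, 0, 1])

print(r_up, r_down, r_none)
```

With these made-up points, the rising line gives an R of 1, the falling line gives an R of negative 1, and the square of points gives an R of zero.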
So with that as a primer, let's see if we can tackle these scatter plots. The way I'm going to do it is I'm just going to try to eyeball what a linear model might look like. There are different methods of trying to fit a linear model to a dataset, an imperfect dataset. I drew very perfect ones at least for R equals 1 and R equals -1, but these are what the real world actually looks like.
Very few times will things perfectly sit on a line. So for scatter plot A, if I were to try to fit a line, it would look something like that. If I were to try to minimize distances from these points to the line, I do see a general trend that when Y is... you know, if we look at these data points over here, when Y is high, X is low, and when X is high, when X is larger, Y is smaller.
So it looks like R is going to be less than zero and a reasonable bit less than zero. It's going to approach this thing here. And if we look at our choices, it wouldn’t be R equal to 0.65. These are positive, so I wouldn’t use that one or that one. And this one is almost no correlation, R equal to 0.02. This is pretty close to zero.
So I feel good with R equal to -0.72. R equal to -0.72. Now I want to be clear: if I didn't have these choices here, I wouldn’t be able to say, just looking at these data points and without doing a calculation, that R is equal to -0.72. I'm just basing it on the intuition that it is a negative correlation.
It seems pretty strong; you know, the pattern kind of jumps out at you that when Y is large, X is small. When X is large, Y is small. So I like something that's approaching R equals -1. So I've used this one up already.
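The eyeballing above can be checked with a small sketch. It assumes the line is fit by ordinary least squares, which is only one of the methods the video alludes to, and it uses a hypothetical downward-trending dataset rather than the actual points of scatter plot A. The function name `fit_line_and_r` is likewise an assumption.

```python
import math

def fit_line_and_r(xs, ys):
    """Least-squares slope, intercept, and Pearson r (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                   # least-squares slope
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)      # same numerator, so r has the slope's sign
    return slope, intercept, r

# Hypothetical data: Y tends to fall as X rises, but not perfectly
xs = [1, 2, 3, 4, 5, 6]
ys = [9, 7, 8, 4, 5, 2]
slope, intercept, r = fit_line_and_r(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

The slope and R always share a sign, so a downward-sloping fitted line means a negative R, and the closer the points hug that line, the closer R gets to negative 1.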
Now, scatter plot B. If I were to just try to eyeball it again, this is going to be imperfect. But the trend, if I were to try to fit a line, it looks something like that. So, it looks like a line fits it reasonably well. There are some points that would still be hard to fit; they're still pretty far from the line.
And it looks like it's a positive correlation. When X is small, Y is small, and when Y is small, X is relatively small. And as X grows, Y grows. And when Y grows, X grows. So this one's going to be positive, and it looks like it would be reasonably positive.
I have two choices here, so I don’t know which of these it’s going to be. It’s either going to be R equal to 0.65 or R equal to 0.84. Let’s look at scatter plot C. Now, this one's all over the place. It kind of looks like what we did over here.
You know, I could... you know, well, what does a line look like? You can almost imagine anything. Does it look like that? Does it look like that? Does a line look like that? These things really don't seem to... there's not a direction that you could say, well, as X increases, maybe Y increases or decreases; there's no rhyme or reason here.
So this looks very non-correlated. This one is pretty close to zero, so I feel pretty good that this is R equal to 0.02. In fact, you know, if we tried, probably the best line that could be fit would be one with a slight negative slope. So it might look something like this.
And notice even when we try to fit a line, there are all sorts of points that are way off the line. So the linear model did not fit it that well. So R equal to 0.02. So we use that one.
Now we have scatter plot D. So that's going to use one of the other positive correlations. It does look like, you know, there is a positive correlation. When Y is low, X is low, and when X is high, Y is high, and vice versa.
We could try to fit something that looks something like that, but it's still not as good as that one. You can see the points that we're trying to fit; there are several points that are still pretty far away from our model.
So the model is not fitting it that well. I would say scatter plot B is a better fit. A linear model works better for scatter plot B than it works for scatter plot D. So I would give the higher R to scatter plot B and the lower R, R equal to 0.65 to scatter plot D. R is equal to 0.65.
Once again, that's because with the linear model, it looks like there's a trend, but there are several data points that are really way off the line in scatter plot D compared to scatter plot B. There are a few that are still way off the line in B, but these are even more off of the line in D.
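The B versus D comparison can be illustrated with a rough sketch. Both datasets below are hypothetical, not the points from the exercise: they follow the same upward trend, but the second one has points that sit much farther from the line. R is again assumed to be the Pearson correlation coefficient, and `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
# Hypothetical "tight" data: points stay close to the line y = x
ys_tight = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
# Hypothetical "loose" data: same trend, but each point is 1.5 above or below y = x
ys_loose = [2.5, 0.5, 4.5, 2.5, 6.5, 4.5]

r_tight = pearson_r(xs, ys_tight)
r_loose = pearson_r(xs, ys_loose)
print(f"tight: r={r_tight:.2f}, loose: r={r_loose:.2f}")
```

Both values come out positive, but the looser dataset gets the lower R, which is the same reasoning used to give the lower positive R to scatter plot D.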