Mean and standard deviation versus median and IQR | AP Statistics | Khan Academy
So we have nine students who recently graduated from a small school that has a class size of nine, and they want to figure out what is the central tendency for salaries one year after graduation. They also want to have a sense of the spread around that central tendency one year after graduation.
So they all agree to put their salaries into a computer, and these are their salaries, measured in thousands. One makes thirty-five thousand, three make fifty thousand, one makes fifty-six thousand, two make sixty thousand, one makes seventy-five thousand, and one makes two hundred fifty thousand. So she's doing very well for herself. The computer spits out a bunch of parameters based on this data.
Here, it spits out two typical measures of central tendency. The mean is roughly 76.2. The computer calculates it by adding up all of these numbers, these nine numbers, and then dividing by nine. The median is 56. The median is quite easy to calculate; you just order the numbers and take the middle number, which is 56.
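As a quick check of those two numbers, here is a minimal sketch in Python. It assumes the nine salaries are entered in thousands, exactly as listed above; the variable names are just for illustration and are not from the video.

```python
from statistics import mean, median

# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

salary_mean = mean(salaries)      # sum of the nine values divided by nine
salary_median = median(salaries)  # middle value once the data is sorted

print(round(salary_mean, 1))  # 76.2
print(salary_median)          # 56
```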
Now what I want you to do is pause this video and think about which measure of central tendency is the better one for this data set, for this population of salaries.
Alright, let's think about this a little bit. I'm going to plot it on a line here. I'm going to plot my data so we get a better sense; we don't just see them as numbers but see where those numbers sit relative to each other.
So let's say this is zero. Let's see: one, two, three, four, five. So this is 50, 100, 150, 200, and this would be 250. If this is 50, then this would be roughly 40 right here, and then about 60, 70, 80, 90. It only needs to be rough; I could draw this a little bit neater, but that's close enough.
Actually, let me clean this up a little bit more, too. This mark right over here should be a little bit closer, so let me put it right around here. So that's 40, and then this would be 30, 20, 10. Okay, that's pretty good.
So let's plot this data. One student makes 35,000, so that is right over there. Three make 50,000, so one, two, and three. I'll put it like that. One makes 56,000, which would put them right over here.
One makes sixty thousand; actually, two make sixty thousand, so it's like that. One makes seventy-five thousand, so that's sixty, seventy, seventy-five thousand; this one's going to be right around there. And then one makes two hundred fifty thousand, so that one salary is all the way over there.
When we calculate the mean as 76.2, our measure of central tendency is 76.2, which is right over there. So is this a good measure of central tendency? Well, to me, it doesn't feel that good because our measure of central tendency is higher than all of the data points except for one.
The reason is that our data is skewed significantly by this data point at 250,000. It is so far from the rest of the distribution, from the rest of the data, that it has skewed the mean. This is something that you see in general; if you have data that is skewed, especially things like salary data, where most people are making 50, 60, or 70,000 but someone might make 2 million, that will skew the average.
The mean adds up all the values and divides by the number of data points you have, so one extreme value pulls it along. In a case like this, where you have a data point that skews the mean, the median is much more robust. The median at 56 sits right over here, which seems to be much more indicative of central tendency.
Think about it: even if, instead of 250,000, you made this 250,000,000, which is a ginormous amount of money to make, it wouldn't change the median at all. The median doesn't care how high this number gets. This could be a trillion dollars; this could be a quadrillion dollars, and the median is going to stay the same.
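To see that robustness concretely, here is a small sketch that swaps in larger and larger top salaries and recomputes both measures. The list names and the particular top salaries are illustrative assumptions, with all values still in thousands of dollars.

```python
from statistics import mean, median

# The eight lower salaries (in thousands), taken from the transcript.
others = [35, 50, 50, 50, 56, 60, 60, 75]

# Top salary in thousands: 250 thousand, 250 million, and a trillion dollars.
top_salaries = [250, 250_000, 1_000_000_000]

results = []
for top in top_salaries:
    data = others + [top]
    results.append((top, mean(data), median(data)))

for top, m, med in results:
    # The mean grows with the top salary; the median stays at 56.
    print(f"top={top:>13,}  mean={m:>15,.1f}  median={med}")
```

The mean climbs every time the top salary grows, while the median column reads 56 on every row.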
So the median is much more robust if you have a skewed data set. The mean makes a little bit more sense if you have a symmetric data set, where the values fall roughly evenly above and below the mean, or if your data isn't skewed incredibly in one direction, especially by a handful of data points, like we have right over here.
In this example, the median is a much better measure of central tendency. So what about spread? Well, you might immediately say, "Well, Sal, you already told us that the mean is not so good, and the standard deviation is based on the mean." You take each of these data points, find its distance from the mean, square that distance, add up those squared distances, divide by the number of data points if we're taking the population standard deviation, and then take the square root of the whole thing.
Since this is based on the mean, which isn't a good measure of central tendency in this situation, the standard deviation is going to be skewed as well. It is going to be a lot larger than the spread you actually see in most of the data. Yes, you have this one data point that's way far away from either the mean or the median, depending on how you want to think about it, but most of the data points sit much closer together.
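The population standard deviation steps described above can be sketched as follows. The function name `population_sd` is a hypothetical helper, and the comparison against the eight lower salaries is added only to illustrate how much that one data point inflates the result.

```python
from math import isclose, sqrt
from statistics import pstdev

# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

def population_sd(data):
    """Population standard deviation: divide by N, not N - 1."""
    mu = sum(data) / len(data)               # the mean
    squared = [(x - mu) ** 2 for x in data]  # squared distances from the mean
    return sqrt(sum(squared) / len(data))    # root of the average squared distance

sd_all = population_sd(salaries)
sd_without_top = population_sd(salaries[:-1])  # drop the 250 for comparison

print(round(sd_all, 1))                   # 62.3
print(isclose(sd_all, pstdev(salaries)))  # agrees with the standard library
print(sd_without_top < sd_all)            # True: one value drives most of the spread
```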
For that situation, not only are we using the median, but the interquartile range is once again more robust. How do we calculate the interquartile range? Well, you take the median, then you take the bottom group of numbers, the four values below the median, and calculate the median of those. So that's 50 right over here, and then you take the top group of numbers, the upper group of numbers, and find the median there.
The middle two numbers of the upper group are 60 and 75, and halfway between them is 67.5. If this looks unfamiliar, we have many videos on interquartile range and calculating standard deviation, median, and mean. This is just a little bit of review. The difference between these two, 67.5 minus 50, is 17.5. Notice that this distance, this 17.5, isn't going to change even if the top salary is 250 billion dollars.
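Here is a minimal sketch of that median-of-each-half method. The helper name `iqr_by_halves` is hypothetical, and it assumes the convention used here of leaving the overall median out of both halves when the number of data points is odd. Library quantile functions may use other conventions and can give slightly different quartiles on other data.

```python
from statistics import median

def iqr_by_halves(data):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]
q1, q3, iqr = iqr_by_halves(salaries)
print(q1, q3, iqr)  # Q1 = 50, Q3 = 67.5, IQR = 17.5

# Replace the top salary with 250 billion dollars (250,000,000 thousands).
extreme = salaries[:-1] + [250_000_000]
print(iqr_by_halves(extreme)[2])  # still 17.5
```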
Once again, both of these measures are more robust when you have a skewed data set. So the big takeaway here is that the mean and standard deviation aren't bad if you have a roughly symmetric data set. If you don't have any significant outliers that really skew the data set, the mean and standard deviation can be quite solid.
But if you're looking at something that could get really skewed by a handful of data points, the median might be a better choice. Median for central tendency, interquartile range for spread around that central tendency. That's why you'll see when people talk about salaries, they'll often talk about the median.
You could have some skewed salaries, especially on the upside. The same goes for things like home prices; you'll see the median reported more often than the mean because, in a neighborhood or in a city, a lot of the houses might be in the 200,000 to 300,000 range. But maybe there's one ginormous mansion that is a hundred million dollars, and if you calculated the mean, that mansion would skew it and give a false impression of the average or the central tendency of prices in that city.