Mean and standard deviation versus median and IQR | AP Statistics | Khan Academy


Nov 11, 2024

So we have nine students who recently graduated from a small school that has a class size of nine, and they want to figure out what is the central tendency for salaries one year after graduation. They also want to have a sense of the spread around that central tendency one year after graduation.

So they all agree to put their salaries into a computer, and these are their salaries, measured in thousands. One makes thirty-five thousand, three make fifty thousand, one makes fifty-six thousand, two make sixty thousand, one makes seventy-five thousand, and one makes two hundred fifty thousand. So she's doing very well for herself. The computer spits out a bunch of parameters based on this data.

Here, it spits out two typical measures of central tendency. The mean is roughly 76.2. The computer calculates it by adding up all of these numbers, these nine numbers, and then dividing by nine. The median is 56. The median is quite easy to calculate; you just order the numbers and take the middle number, which is 56.
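As a quick check on those two numbers, here is a minimal sketch using Python's standard `statistics` module. The variable names are just for illustration; the nine values are the salaries listed above, in thousands.

```python
from statistics import mean, median

# Salaries in thousands of dollars, as listed in the transcript.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

salary_mean = mean(salaries)      # sum of the nine values divided by nine
salary_median = median(salaries)  # middle value once the list is sorted

print(round(salary_mean, 1))  # 76.2
print(salary_median)          # 56
```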

Now what I want you to do is pause this video and think about this data set, for this population of salaries, which measure of central tendency is a better measure.

Alright, let's think about this a little bit. I'm going to plot my data on a line here so we get a better sense of it; that way we don't just see the values as numbers but see where those numbers sit relative to each other.

So let's say this is zero. Let's see one, two, three, four, five. So this would be 250, this is 50, 100, 150, 200, and let's see. Let's say if this is 50, then this would be roughly 40 right here. I just want to get rough, so this would be about 60, 70, 80, 90. Close enough; I could draw this a little bit neater, but 60, 70, 80, 90.

Actually, let me just clean this up a little bit more, too. This one right over here would be a little bit closer, so let me just put it right around here. So that's 40, and then this would be 30, 20, 10. Okay, that's pretty good.

So let's plot this data. One student makes 35,000, so that is right over there. Three make 50,000. So, one, two, and three. I'll put it like that. One makes 56,000, which would put them right over here.

One makes sixty thousand; actually, two make sixty thousand, so it's like that. One makes seventy-five thousand, so that's sixty, seventy, seventy-five thousand. This one's going to be right around there, and then one makes two hundred fifty thousand. So one's salary is all the way around there.

When we calculate the mean as 76.2, our measure of central tendency is 76.2, which is right over there. So is this a good measure of central tendency? Well, to me, it doesn't feel that good because our measure of central tendency is higher than all of the data points except for one.

The reason is that our data is skewed significantly by this data point at 250,000. It is so far from the rest of the distribution, from the rest of the data, that it has skewed the mean. This is something that you see in general; if you have data that is skewed, especially things like salary data, where most people are making 50, 60, or 70,000 but someone might make 2 million, that will skew the average.

The mean adds up all the values and divides by the number of data points, so a single extreme value pulls it a long way. In a case like this, where a few data points skew the mean, the median is much more robust. The median at 56 sits right over here, which seems to be much more indicative of central tendency.

Think about it: even if you made this instead of 250,000, if you made this 250,000,000, which is a ginormous amount of money to make, it wouldn't even change the median. The median doesn't care how high this number gets. This could be a trillion dollars; this could be a quadrillion dollars, and the median is going to stay the same.
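To see that claim in numbers, here is a small sketch that swaps the top salary for larger and larger values and recomputes both measures. The helper `summarize` is a hypothetical name introduced only for this illustration; values are in thousands, so 250,000 stands for 250 million dollars and 1,000,000,000 stands for a trillion dollars.

```python
from statistics import mean, median

# The eight lower salaries, in thousands of dollars.
base = [35, 50, 50, 50, 56, 60, 60, 75]

def summarize(top_salary):
    """Return (mean, median) of the nine salaries for a given top salary."""
    data = base + [top_salary]
    return mean(data), median(data)

for top in (250, 250_000, 1_000_000_000):
    m, med = summarize(top)
    # The mean grows with the top salary; the median stays at 56.
    print(top, round(m, 1), med)
```

The median only depends on which value sits in the middle of the sorted list, so it does not move as long as the top salary stays above 56.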

So the median is much more robust if you have a skewed data set. The mean makes a little bit more sense if you have a symmetric data set, where the values fall roughly evenly above and below the mean and aren't skewed heavily in one direction, especially by a handful of data points, like we have right over here.

In this example, the median is a much better measure of central tendency. So what about spread? Well, you might immediately say, "Well, Sal, you already told us that the mean is not so good, and the standard deviation is based on the mean." You take each of these data points, find its distance from the mean, square that number, add up those squared distances, divide by the number of data points if we're taking the population standard deviation, and then take the square root of the whole thing.
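The steps just described can be written out directly. This is a minimal sketch, assuming the nine salaries are treated as the whole population (so we divide by n, not n - 1); the variable names are illustrative.

```python
from math import sqrt

# Salaries in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]

n = len(salaries)
mu = sum(salaries) / n                                  # population mean
squared_distances = [(x - mu) ** 2 for x in salaries]   # squared distance of each point from the mean
sigma = sqrt(sum(squared_distances) / n)                # population standard deviation

print(round(sigma, 1))  # 62.3
```

The result agrees with `statistics.pstdev(salaries)` from the standard library, which computes the same population formula.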

Since this is based on the mean, which isn't a good measure of central tendency in this situation, the outlier is also going to skew the standard deviation. It is going to be a lot larger than the spread you actually see in most of the data. Yes, you have this one data point that's way far away from either the mean or the median, depending on how you want to think about it, but most of the data points are much closer together.
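One way to see how much that single data point inflates the standard deviation is to compute it with and without the 250. Dropping the point is purely an illustration added here, not something done in the video, and the variable names are assumptions.

```python
from statistics import pstdev

# Salaries in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]
without_top = salaries[:-1]  # the same data with the 250 removed, for comparison only

sd_all = pstdev(salaries)         # population standard deviation of all nine values
sd_without = pstdev(without_top)  # population standard deviation of the other eight

print(round(sd_all, 1))      # 62.3
print(round(sd_without, 1))  # 10.7
```

One salary out of nine accounts for most of the computed spread, which is why the standard deviation overstates how far apart the typical salaries are.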

For that situation, not only are we using the median, but the interquartile range is once again more robust. How do we calculate the interquartile range? Well, you take the median, then you take the bottom group of numbers, the ones below the median, and calculate the median of those. So that's 50 right over here. Then you take the top group of numbers, the ones above the median, and find the median there.

The top group is 60, 60, 75, and 250, so its median is halfway between 60 and 75, which is 67.5. If this looks unfamiliar, we have many videos on interquartile range and calculating standard deviation, median, and mean. This is just a little bit of review. The difference between these two, 67.5 minus 50, is 17.5. Notice that this 17.5 isn't going to change even if the top salary is 250 billion dollars.
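The interquartile range calculation above can be sketched as follows. The function `iqr_by_halves` is a hypothetical helper that follows the method described here (median of the lower half and median of the upper half, leaving out the middle value when the count is odd). Library quantile functions may use other conventions and can give slightly different quartiles.

```python
from statistics import median

def iqr_by_halves(data):
    """Return (Q1, Q3, IQR) using medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (skips the middle value when n is odd)
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

# Salaries in thousands of dollars.
salaries = [35, 50, 50, 50, 56, 60, 60, 75, 250]
q1, q3, iqr = iqr_by_halves(salaries)
print(q1, q3, iqr)  # 50.0 67.5 17.5

# Replacing the top salary with 250 billion dollars (250,000,000 thousand) leaves the IQR unchanged.
print(iqr_by_halves(salaries[:-1] + [250_000_000])[2])  # 17.5
```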

Once again, both of these measures are more robust when you have a skewed data set. So the big takeaway here is that the mean and standard deviation aren't bad if you have a roughly symmetric data set. If you don't have any significant outliers that really skew the data set, the mean and standard deviation can be quite solid.

But if you're looking at something that could get really skewed by a handful of data points, the median might be a better choice. Median for central tendency, interquartile range for spread around that central tendency. That's why you'll see when people talk about salaries, they'll often talk about the median.

You could have some skewed salaries, especially on the upside. The same goes for things like home prices; you'll see the median reported more often than the mean because, in a neighborhood or a city, a lot of the houses might be in the 200,000 to 300,000 range. But maybe there's one ginormous mansion that is a hundred million dollars, and if you calculated the mean, that one house would skew it and give a false impression of the average or the central tendency of prices in that city.
