Judging outliers in a dataset | Summarizing quantitative data | AP Statistics | Khan Academy


Nov 11, 2024

We have a list of 15 numbers here, and what I want to do is think about the outliers. To help us with that, let's visualize the distribution of the numbers.

Here on a number line, I have all the numbers from 1 to 19. Let's see: we have two 1s, so that's one 1 and then two 1s. We have one 6, so let's put that 6 there. We have two 13s, so we're going to go up here: one 13, and two 13s.

Let's see, we have three 14s: 14, 14, and 14. We have a couple of 15s: 15, 15. We have one 16, so that's our 16 there. We have three 18s: one, two, three. And then we have one 19.
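To make the picture concrete, here is a minimal sketch that rebuilds the list from the counts read off the number line and prints a rough text version of the dot plot. The function name `dot_plot_rows` is made up for this illustration; it is not something from the video.

```python
from collections import Counter

# The 15 values, reconstructed from the counts described in the transcript.
data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

def dot_plot_rows(values, low, high):
    """Return one text row per integer from low to high, one '*' per occurrence."""
    counts = Counter(values)
    return [f"{x:>2} | " + "*" * counts[x] for x in range(low, high + 1)]

for row in dot_plot_rows(data, 1, 19):
    print(row)
```

The rows for 13 through 19 carry most of the marks, which is the "meat of the distribution" in the visual.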

When you look visually at the distribution of numbers, it looks like the meat of the distribution, so to speak, is in this area right over here. Some people might say, okay, we have three outliers: these two ones and the six. Some people might say, well, the six is kind of close enough; maybe only these two ones are outliers. Both of those would actually be reasonable things to say.

Now, to get on the same page, statisticians will sometimes use a rule that says anything more than one and a half times the interquartile range below Q1 or above Q3 is an outlier. What am I talking about? Well, let's actually figure out the median, Q1, and Q3 here. Then we can figure out the interquartile range, and then we can figure out, by that definition, what is going to be an outlier.

If that all made sense to you so far, I encourage you to pause this video and try to work through it on your own, or I'll do it for you right now.

All right, so what's the median here? The median is the middle number. We have 15 numbers, so the middle number is going to be whatever number has seven on either side. So that's going to be the eighth number. Counting from the left: 1, 2, 3, 4, 5, 6, 7, and the eighth number is 14. You have 1, 2, 3, 4, 5, 6, 7 numbers on the right side too. So 14 is the median, sometimes called Q2.

Now what is Q1? Well, Q1 is going to be the middle of this first group, the seven numbers below the median. The middle of seven numbers is the fourth number, with three to the left and three to the right, so Q1 is 13.

Then Q3 is going to be the middle of this upper group, the seven numbers above the median. The middle is again the fourth number of that group, with three on either side, so Q3 is 18.
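The quartile steps can be sketched in code. This is a minimal version that follows the method used here (drop the median when the count is odd, then take the median of each half); the function names are made up for the illustration. Library routines often use other quartile conventions and may return slightly different values.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3), excluding the median from both halves when n is odd."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]                                  # values below the median
    upper = s[half + 1:] if n % 2 == 1 else s[half:]  # values above the median
    return median(lower), median(s), median(upper)

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
print(quartiles(data))  # (13, 14, 18)
```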

Now what is the interquartile range going to be? The interquartile range is going to be equal to Q3 minus Q1. That's the difference between 18 and 13, so 18 minus 13, which is equal to five.

Now to figure out outliers: on the low side, outliers are going to be less than Q1 minus 1.5 times our interquartile range. Once again, this isn't some rule of the universe; this is something that statisticians have agreed on. If we want a better definition for outliers, let's just agree that an outlier is something more than one and a half times the interquartile range below Q1, or something greater than Q3 plus one and a half times the interquartile range.

Once again, this is somewhat arbitrary; people just decided it felt right. One could argue it should be 1.6, or one could argue it should be one, or two, or whatever, but this is what people have tended to agree on.

So let's think about what these numbers are. Q1 we already know, so the low cutoff is going to be 13 minus 1.5 times our interquartile range. Our interquartile range here is five, so it's 1.5 times 5, which is 7.5. 13 minus 7.5 is what? 13 minus 7 is 6, and then subtracting another 0.5 gives 5.5. So on the low side, outliers are less than 5.5.

Q3 is 18, and the other term is once again 7.5. 18 plus 7.5 is 25.5, so on the high side, outliers are greater than 25.5.
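The same arithmetic as a short sketch. The function name `outlier_fences` and the parameter `k` (the 1.5 multiplier) are names chosen for this illustration only.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = outlier_fences(13, 18)
print(low, high)  # 5.5 25.5
```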

So based on this, we have a kind of numerical definition for what's an outlier. We're not just subjectively saying, well, this feels right, or that feels right. Based on this, we only have two outliers: only these two ones are less than 5.5. This is the cutoff right over here. So this dot, the six, just happened to make it, and we don't have any outliers on the high side.
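As a rough sketch, the rule can be applied to the whole list, and changing the multiplier shows why the choice of 1.5 is a convention rather than a law. `find_outliers` and `k` are hypothetical names for this example, and the quartiles 13 and 18 are the ones worked out above.

```python
def find_outliers(values, q1, q3, k=1.5):
    """Return values strictly below Q1 - k * IQR or strictly above Q3 + k * IQR."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]

# Try the agreed multiplier alongside two alternatives.
for k in (1.0, 1.5, 2.0):
    print(k, find_outliers(data, 13, 18, k))
```

With a multiplier of 1 the low cutoff rises to 8, so the six would also count as an outlier; with 1.5 or 2 only the two ones do.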

Now, another thing to think about is drawing box and whiskers plots based on Q1, our median, Q3, and the full range of the numbers. You could do it either taking your outliers into consideration or not.

So there are a couple of ways that we can do it. Let me actually clear... let me clear all of this. We've figured out all of this stuff, so let me clear all of that out, and let's actually draw a box and whiskers plot.

So let me set up two number lines here. That's one, and then let me put another one down there. Now, if we were to just draw a classic box and whiskers plot here, we would say, all right, our median is at 14. Actually, I'll do it both ways, so I'll mark the median at 14 on both. Q1 is at 13 on both, and Q3 is at 18 on both.

So that's the box part. Let me actually draw that as a box. My best attempt: there you go, that's the box. And this is also a box. So far, I'm doing the exact same thing on both.

Now, if we don't want to treat outliers separately, we would say, well, what's the entire range here? Well, we have things that go from one all the way to 19. So one way to do it is to start at one, and our whiskers cover the entire range, all the way from one to 19.

Now, in this one, we're including everything; we're including even these two outliers. But if we want to make it clear that they're outliers, well, let's not include them in the whiskers.

What we can do instead is say, all right, including only our non-outliers, we would start at six, because six is in our dataset but it is not an outlier. So we are going to start at six and go all the way to 19.

To show that we have these outliers, we would mark them separately, as points over there at one. So once again, this first one is a box and whiskers plot of the dataset that doesn't single out the outliers, and this second one is of the same dataset where we make it clear where the outliers actually are.
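The two ways of drawing the whiskers can be summarized in a small sketch. `whisker_ends` and its `mark_outliers` flag are made-up names for this illustration, and the quartiles are passed in as computed earlier (13 and 18).

```python
def whisker_ends(values, q1, q3, mark_outliers, k=1.5):
    """Return (low whisker end, high whisker end, points drawn separately)."""
    if not mark_outliers:
        # Classic plot: whiskers span the full range, nothing drawn separately.
        return min(values), max(values), []
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = [x for x in values if x < low_fence or x > high_fence]
    # Whiskers stop at the most extreme values that are not outliers.
    return min(inside), max(inside), outliers

data = [1, 1, 6, 13, 13, 14, 14, 14, 15, 15, 16, 18, 18, 18, 19]
print(whisker_ends(data, 13, 18, mark_outliers=False))  # (1, 19, [])
print(whisker_ends(data, 13, 18, mark_outliers=True))   # (6, 19, [1, 1])
```

The box (13, 14, 18) is the same in both plots; only the whisker ends and the separately marked points change.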
