10% Rule of assuming "independence" between trials | Random variables | AP Statistics | Khan Academy


Nov 11, 2024

As we go further in our statistical careers, it's going to be valuable to assume that certain distributions are normal distributions, or sometimes to assume that they are binomial distributions. If we can make that assumption, we can make all sorts of interesting inferences about them.

But one of the key things about normal distributions or binomial distributions is that we assume they're the sum, or can be viewed as the sum, of a bunch of independent trials. So, we have to assume that trials are independent. Now, that is reasonable in a lot of situations, but not always. Let's say you're conducting a survey of people exiting a mall.

In that case, let's say you're asking whether they have done their taxes already. If they're exiting the mall, it's hard to sample with replacement. They're leaving the mall. You can't say, "Hey, wait! I just asked you a question and you've answered it; now go back into the mall, because I want each trial to be truly independent."

But intuitively, if there are 10,000 people in the mall and I'm going to sample 10 of them, does it really matter whether the trials are truly independent? Isn't it enough that they're close to being independent? Because of that idea, and because we do want to make inferences based on things being close to a binomial distribution or a normal distribution, we have something called the 10 percent rule.

The 10 percent rule says that if our sample is less than or equal to 10 percent of the population, then it is okay to assume approximate independence. There are some fairly sophisticated ways of coming up with this 10 percent threshold, but the exact cutoff is somewhat arbitrary: people could have picked 9 percent, or they could have picked 10.1 percent. But 10 percent is a nice round number, and if we look at some tangible examples, it seems to do a pretty good job.
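As a minimal sketch of the rule just stated, the check can be written as a one-line comparison. The function name `meets_ten_percent_rule` is made up for illustration; it is not part of the lesson or of any library.

```python
def meets_ten_percent_rule(sample_size: int, population_size: int) -> bool:
    """Return True when the sample is at most 10 percent of the population."""
    if sample_size < 0 or population_size <= 0:
        raise ValueError("sample_size must be >= 0 and population_size must be > 0")
    # Integer comparison avoids floating-point rounding right at the 10% boundary.
    return 10 * sample_size <= population_size


print(meets_ten_percent_rule(10, 10_000))  # 10 people out of 10,000 -> True
print(meets_ten_percent_rule(3, 20))       # 3 students out of 20 (15%) -> False
```

The rule only says when approximate independence is a reasonable assumption; it does not make the trials truly independent.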

So, for example, right over here, let's let x be the number of boys from three trials of selecting from a classroom of n students, where 50 percent of the class is boys and 50 percent of the class is girls. What we have over here is a bunch of different n's. What if we have 20 students in the class? What if we have 30? What if we have 100? What if we have 10,000?

We could find the probability that we select three boys with replacement in each of these scenarios, and we could also find the probability that we select three boys without replacement. Then we could think about what proportion our sample size is of the entire population and then we could say, "Hey, does the 10 percent rule actually make sense?"

In this first column, where we are picking three boys with replacement, each of these trials is independent because we are replacing. If our trials are independent, then x would be truly a binomial variable. In the second column, the trials aren't independent because we are not replacing.

Officially, in this column right over here, when we're not replacing, x would not be considered a binomial random variable. But let's see if there's a threshold below which our sample size is a small enough percentage of our entire population that we would feel not so bad about assuming x is close to being binomial.

In all of the cases where you have independent trials and 50 percent of the population is boys and 50 percent is girls, the probability of three boys comes out to one-half times one-half times one-half. So in all of those situations, you have a 12.5 percent chance that x is going to be equal to three. In this case, x would be a binomial variable.

But look over here, when three is a fairly large percentage of our population. With 20 students in the class, a sample of three is 15 percent. The chance of getting three boys without replacement is 10.5 percent, which is reasonably different from 12.5 percent. It is two percentage points lower, and two percentage points relative to 12.5 percent is somewhere between a 10 and 20 percent difference in the probability.
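The arithmetic behind those two figures can be written out as follows. This is a worked sketch that assumes the class of 20 has exactly 10 boys and 10 girls, which is how the 50 percent split is read here.

```latex
\begin{align*}
P(X = 3 \mid \text{with replacement}) &= \left(\tfrac{1}{2}\right)^{3} = 0.125 \\
P(X = 3 \mid \text{without replacement}) &= \frac{10}{20}\cdot\frac{9}{19}\cdot\frac{8}{18} = \frac{2}{19} \approx 0.105 \\
\text{relative difference} &= \frac{0.125 - 2/19}{0.125} = \frac{3}{19} \approx 0.158
\end{align*}
```

So the gap is about 16 percent of the with-replacement probability, which falls in the 10 to 20 percent range mentioned above.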

So this is a reasonably big difference. But as we increase the population size without increasing the sample size, we see that these numbers get closer and closer to each other all the way, so that if you have 10,000 people in your population and you're only doing three trials, the numbers get very, very close. This is actually 12.49 something percent, but if you round to the nearest tenth of a percent, you see that they are close.
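The comparison described here can be reproduced with a short script. It is a sketch under stated assumptions: every class size is even with exactly half boys, and the function names are invented for this example.

```python
from fractions import Fraction


def prob_all_boys_with_replacement(draws: int = 3) -> Fraction:
    """Each draw is a boy with probability 1/2, independently of the others."""
    return Fraction(1, 2) ** draws


def prob_all_boys_without_replacement(class_size: int, draws: int = 3) -> Fraction:
    """Multiply the shrinking boys/students ratios as students are removed."""
    boys = class_size // 2  # assumption: even class size, exactly half boys
    prob = Fraction(1)
    for already_drawn in range(draws):
        prob *= Fraction(boys - already_drawn, class_size - already_drawn)
    return prob


for class_size in (20, 30, 100, 10_000):
    with_repl = float(prob_all_boys_with_replacement())
    without_repl = float(prob_all_boys_without_replacement(class_size))
    sample_share = 3 / class_size  # what fraction of the class the sample is
    print(
        f"n={class_size:>6}  sample share={sample_share:7.2%}  "
        f"with={with_repl:.4%}  without={without_repl:.4%}"
    )
```

Exact fractions are used for the products so that the only rounding happens when the results are printed.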

So I think most people would say, all right, if your sample is three ten-thousandths of the population, you'd feel pretty good treating this without-replacement column as being pretty close to a binomial variable. Most people would also say that in this first scenario, where your sample size is 15 percent of your population, you wouldn't feel so good treating the without-replacement column as a binomial random variable.

But where do you draw the line? As we alluded to earlier in the video, the line is typically drawn at 10 percent. If your sample size is less than or equal to 10 percent of your population, it's not unreasonable to take your random variable, even though it's not officially binomial, and say, "Okay, maybe I can functionally treat it as binomial."

Then from there, I can make all of the powerful inferences that we tend to do in statistics. With that said, the lower the percentage the sample is of the population, the better. Now, to be clear, that's not saying that small sample sizes are better than large sample sizes in statistics. Large sample sizes tend to be a lot better than small sample sizes.

But if you want to make this independence assumption, so to speak, even when it's not exactly true, you want your sample to be a small percentage of the population. So, let's say you're doing a survey at the mall. You might want to survey 100 people, but you would hope that there are at least a thousand people in the mall in order for you to feel like your trials are reasonably independent.

If there are 10,000 people in the mall, or somehow 50,000 people in the mall, which would be a very large mall, well, that's even better.
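The mall numbers can be checked with a small sketch. The helper names `smallest_population_for` and `sample_share` are hypothetical and only illustrate the 10 percent arithmetic.

```python
def smallest_population_for(sample_size: int) -> int:
    """Smallest population for which the sample is at most 10 percent of it."""
    return 10 * sample_size


def sample_share(sample_size: int, population_size: int) -> float:
    """Fraction of the population that the sample represents."""
    return sample_size / population_size


survey_size = 100
print(smallest_population_for(survey_size))  # 1000 people in the mall

# The smaller the share, the more comfortable the independence assumption.
for mall_population in (1_000, 10_000, 50_000):
    print(mall_population, f"{sample_share(survey_size, mall_population):.1%}")
```

With 100 people surveyed, the share drops from 10.0% at 1,000 shoppers to 1.0% at 10,000 and 0.2% at 50,000.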
