Techniques for random sampling and avoiding bias | Study design | AP Statistics | Khan Academy


Nov 11, 2024

Let's say that we run a school, and in that school, there is a population of students right over here. That is our population, and we want to get a sense of how these students feel about the quality of math instruction at this school. So we construct a survey, and we just need to decide who are we going to get to actually answer this survey.

One option is to just go to every member of the population, but let's just say it's a really large school. Let's say we're a college and there are 10,000 people in the college. We say, "Well, we can't just talk to everyone." So instead we say, "Let's sample this population to get an indication of how the entire school feels." We're going to survey a sample of that population rather than all of it.

Now, in order to avoid having bias in our response, in order for it to have the best chance of being indicative of the entire population, we want our sample to be random. So our sample could either be random or not random. It might seem at first pretty straightforward to do a random sample, but when you actually get down to it, it's not always as straightforward as you would think.

One type of random sample is just a simple random sample. This is saying, "All right, let me assign a number to every person in the school. Maybe they already have a student ID number, and I'm just going to get a computer, a random number generator, to pick 100 students." So let's say there's a sample of 100 students that I'm going to give the survey to. That would be a simple random sample. We are just going into this whole population and randomly picking people out, and we know it's random because a random number generator, or a string of random numbers or something like that, is what picks these students.
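As a rough sketch of what "let the random number generator pick 100 students" could look like, here is a short Python example. The ID range, the function name `simple_random_sample`, and the seed are assumptions made up for illustration; they are not part of the original lesson.

```python
import random

def simple_random_sample(student_ids, n, seed=None):
    """Pick n distinct IDs so that every group of n students is equally likely."""
    rng = random.Random(seed)  # a seed only makes the example repeatable
    return rng.sample(student_ids, n)

# Hypothetical population: 10,000 students with ID numbers 1..10,000.
population = list(range(1, 10_001))

sample = simple_random_sample(population, 100, seed=42)
print(len(sample), "students selected, e.g.", sorted(sample)[:5])
```

Because `random.sample` draws without replacement, no student can be picked twice.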

Now, that's pretty good. It's unlikely that you're going to have bias from this sample, but there is some probability that, just by chance, your random number generator just happened to select maybe a disproportionate number of boys over girls or a disproportionate number of freshmen or a disproportionate number of engineering majors versus English majors, and that's a possibility. So even though you're taking a simple random sample and it is truly random, once again, there's some probability that it's not indicative of the entire population.

And so to mitigate that, there are other techniques at our disposal. One technique is a stratified sample. This is the idea of taking our entire population and essentially stratifying it, splitting it into groups. So we take that same population, and I'll draw it as a square here just for convenience. Let's say we're concerned that we get an appropriate sample of freshmen, sophomores, juniors, and seniors.

So we'll stratify it by freshmen, sophomores, juniors, and seniors. Instead of just sampling 100 out of the entire pool, we randomly sample 25 from each of these groups. That makes sure that you are getting responses from all of the different class years, or levels, within your university.
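Here is one way the "25 from each class year" idea might be sketched in Python. The roster of `(id, year)` pairs, with exactly 2,500 students per year, and the helper name `stratified_sample` are assumptions for illustration only.

```python
import random

def stratified_sample(students, stratum_of, per_stratum, seed=None):
    """Group students into strata, then draw a simple random sample in each one."""
    rng = random.Random(seed)
    strata = {}
    for student in students:
        strata.setdefault(stratum_of(student), []).append(student)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical roster: 10,000 (id, year) pairs, 2,500 students per class year.
years = ["freshman", "sophomore", "junior", "senior"]
students = [(i, years[i % 4]) for i in range(10_000)]

sample = stratified_sample(students, stratum_of=lambda s: s[1],
                           per_stratum=25, seed=1)
for year, chosen in sample.items():
    print(year, len(chosen))
```

Each class year is guaranteed exactly 25 respondents, which a simple random sample of 100 cannot promise.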

Now, there might be another issue where you say, "Well, I'm actually more concerned that we have accurate representation of males and females in the school." If I pick 100 people at random, it's very likely that the split is close to 50/50, but there's some chance, just due to randomness, that it's disproportionately male or disproportionately female. That's even possible in the stratified case.
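To put a rough number on "some chance," we can compute it directly under some stated assumptions: a school of 10,000 with exactly 5,000 males, a simple random sample of 100, and "disproportionate" taken to mean 60 or more of either sex. Those figures and that cutoff are arbitrary choices for this sketch, not values from the lesson.

```python
from math import comb

def hypergeom_pmf(k, population, group_size, draws):
    """P(exactly k of the draws come from the group), sampling without replacement."""
    return (comb(group_size, k) * comb(population - group_size, draws - k)
            / comb(population, draws))

# Assumptions for illustration: 10,000 students, exactly 5,000 male, sample of 100.
N, MALES, n = 10_000, 5_000, 100

# "Lopsided" here means 40 or fewer males, or 60 or more males (an arbitrary cutoff).
p_lopsided = sum(hypergeom_pmf(k, N, MALES, n)
                 for k in range(n + 1) if k <= 40 or k >= 60)
print(f"P(60 or more of one sex in the sample) = {p_lopsided:.3f}")
```

Under these assumptions the chance is small but not zero, which is the point being made here.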

So what you might say is, "Well, you know what I'm going to do?" There's a technique called a clustered sample. What we do is we sample groups, where each of those groups, we feel confident, has a good balance of males and females.

For example, instead of sampling individuals from the entire population, let's say we can split our population into groups. Choosing those groups is not a trivial thing to do, but maybe these are classrooms, and each of these classrooms has an even distribution of males and females, or pretty close to an even distribution.

So what we do is we sample the actual classrooms. That's why it's called a cluster technique, or clustered random sample. We're going to randomly sample our classrooms, each of which has a close or maybe an exact balance of males and females. So we know that we're going to get good representation.

But we are still sampling; we're sampling the clusters. Then we're going to survey every single person in each of the clusters we picked, every single person in each of those classrooms. So once again, these are all forms of random samples. You have the simple random sample, you can stratify, or you can cluster: randomly pick the clusters and then survey everyone in those clusters.
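A minimal sketch of cluster sampling, assuming the population has already been split into classrooms. The 400 rooms of 25 students each, the room names, and the helper `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical school: 400 classrooms of 25 students each (10,000 students).
classrooms = {f"room-{r}": list(range(r * 25, (r + 1) * 25)) for r in range(400)}

picked = cluster_sample(classrooms, n_clusters=4, seed=7)
surveyed = [student for members in picked.values() for student in members]
print(sorted(picked), len(surveyed))
```

Note the contrast with the stratified sample: there we sample some individuals from every group, while here we take every individual from a few randomly chosen groups.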

Now, if these are all random samples, what are the non-random things? Well, one case of non-random you could have is a voluntary survey or voluntary sample. This might just be you tell every student at the school, "Hey, here's a web address! If you're interested, come and fill out this survey." That's likely to introduce bias because you might have maybe the students who really like the math instruction at their school more likely to fill it out.

Maybe the students who really don't like it are more likely to fill it out. Maybe it's just the kids who have more time are more likely to fill it out. So this has a good chance of introducing bias. The students who fill out the survey might just be skewed one way or the other because, you know, they volunteered for it.

Another non-random sample is called a convenience sample; "convenience" is the term that's often used when bias comes from sampling whoever is easiest to reach. This might say, "Well, let's just sample the first 100 students who show up in school." That's just convenient for me because I didn't have to do these random numbers or do the stratification or do any of this clustering. But you can understand how this also would introduce bias. The first 100 students who show up at school might be the most diligent students.

Maybe they all take an early math class that has a very good instructor, or they're all happy about it. Or it might go the other way; the instructor there isn't the best one, and so it might introduce bias the other way. So if you let people volunteer or you just say, "Oh, let me go to the first N students," or you say, "Hey, let me just talk to all the students who happen to be in front of me right now," they might be in front of you out of convenience, but they might not be a true random sample.
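To see how a convenience sample could mislead, here is a small simulation on entirely synthetic ratings. The rating scale, the assumption that the first 1,000 arrivals rate their class 4 or 5, and the uniform ratings for everyone else are all invented for illustration; real data could skew either way, as noted above.

```python
import random
from statistics import mean

rng = random.Random(0)

# Synthetic ratings (1 to 5), listed in arrival order. Made-up assumption:
# the first 1,000 arrivals take an early class they rate 4 or 5, and
# everyone else rates uniformly from 1 to 5.
ratings = ([rng.choice([4, 5]) for _ in range(1_000)]
           + [rng.randint(1, 5) for _ in range(9_000)])

convenience = ratings[:100]               # first 100 students to show up
random_sample = rng.sample(ratings, 100)  # simple random sample of 100

print("population mean: ", round(mean(ratings), 2))
print("convenience mean:", round(mean(convenience), 2))
print("random mean:     ", round(mean(random_sample), 2))
```

In this made-up setup the convenience sample overstates satisfaction, while the random sample's mean should land much closer to the population mean.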

Now there are other reasons why you might introduce bias, and it might not be because of the sampling. You might introduce bias because of the wording of your survey. You could imagine a survey that says, "Do you consider yourself lucky to get a math education that very few other people in the world have access to?" Well, that might bias you to say, "Well, yeah, I guess I feel lucky."

While if the wording was, "Do you like the fact that a disproportionate number of students at your school tend to fail algebra more than students at our surrounding schools?" Well, that might bias you negatively. So the wording really, really, really matters in surveys, and there's a lot that would go into this.

The other one is called response bias. And once again, this isn't about sampling. This is just people not wanting to tell the truth, or maybe not wanting to respond at all. Maybe they're afraid that somehow their response is going to show up in front of their math teacher or the administrators, or that if they're too negative it might be taken out on them in some way. Because of that, they might not be truthful, and so they might be overly positive or not fill it out at all.

So anyway, this is a very high-level overview of how you could think about sampling. You want to go random because it lowers the probability of introducing some bias into it. And then these are some techniques. Also, think about whether you're falling into some of these pitfalls that have a good chance of introducing bias.

