Introduction to sampling distributions

Nov 11, 2024

So let's say I have a bag of colored balls here, and we know that 40% of the balls are orange. Now imagine defining a random variable X, and X is based on a trial where we stick our hand in this bag, we don't look around, and we randomly pick a ball, look at its color, we record it, and then we're going to put it back. So we're going to assume that we have replacement here, and we're going to say that our random variable X is going to be equal to 1 if we pick an orange ball, and it's going to be equal to 0 otherwise.

You might already recognize this as a Bernoulli random variable, and we can construct a probability distribution for X. In fact, let's do this. So X is going to be discrete; it can only take on two different values. So X can take on zero, or it can take on one. If forty percent of the balls are orange, then one has a forty percent chance of happening. So let me do that. So there's going to be a forty percent chance of getting a 1. So that's 0.4 right over there, and there would be a 60% chance, or a 0.6 probability, of getting a 0.
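To make the distribution of X concrete, here is a minimal Python sketch. The names `P_ORANGE`, `pmf`, and `draw_ball` are illustrative choices and not part of the original lesson, and the fixed seed is only there so the run is repeatable.

```python
import random

P_ORANGE = 0.4  # population proportion of orange balls, from the example

# Probability distribution of X: 1 if the ball is orange, 0 otherwise
pmf = {1: P_ORANGE, 0: 1 - P_ORANGE}

def draw_ball(rng):
    """One trial with replacement: return 1 for orange, 0 otherwise."""
    return 1 if rng.random() < P_ORANGE else 0

rng = random.Random(0)  # fixed seed (an assumption, for repeatability)
draws = [draw_ball(rng) for _ in range(10_000)]

# Share of trials that came up orange; it should land near 0.4
observed_share = sum(draws) / len(draws)
print(pmf, observed_share)
```

Over many trials the observed share of 1s should settle close to 0.4, which is just the probability distribution above showing up in data.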

So this right over here, just trying to hand-draw it, so this would be 0.6 probability of getting a zero. So we could call this the probability distribution. Probability distribution for X, this is all review so far. But the reason why I did this is because we're now going to introduce ourselves to the notion of a sampling distribution, and it can be a little bit confusing because in our brains, we tend to think in terms of probability distributions and not as much in terms of sampling distributions.

So what you do in a sampling distribution is you still start with a population here, but then you take a sample of that population. So let me label things. So this is our population. This is our population here. We take a sample; we take a sample from that population, and it could have a certain sample size, sample size n. Then we'll calculate some statistic for that sample. So we will calculate a statistic, and then we're going to think about the distributions of these statistics that we can get from these samples.

One way to think about this is to keep doing this. So this is our first sample of sample size n; we calculate the statistic. Then we take another sample of sample size n, and then we calculate the statistic again. Then we take another sample, and we just keep doing this. We take another sample of sample size n, and we calculate the statistic again. And let's say we were to do this an infinite number of times, and we were to plot the distribution of the statistic that we're calculating; well then we have our sampling distribution.
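The procedure just described (take a sample of size n, calculate a statistic, repeat) can be sketched in a few lines of Python. This is only an illustration: the function name `simulate_sampling_distribution`, its parameters, and the ten-ball population in the usage lines are assumptions made for the example, and a finite number of repetitions only approximates the "infinite number of times" idea.

```python
import random
from collections import Counter

def simulate_sampling_distribution(population, n, statistic, num_samples, seed=0):
    """Tally how often each value of `statistic` occurs across repeated samples."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(num_samples):
        sample = rng.choices(population, k=n)  # draw n items with replacement
        tally[statistic(sample)] += 1          # record the statistic for this sample
    return tally

# Hypothetical population: 1 marks an orange ball, 0 any other color (40% orange)
population = [1] * 4 + [0] * 6
sample_proportion = lambda sample: sum(sample) / len(sample)

tally = simulate_sampling_distribution(population, 10, sample_proportion, 1_000)
for value in sorted(tally):
    print(value, tally[value])
```

Any statistic can be passed in place of `sample_proportion`, for example a sample mean or a sample maximum; the loop stays the same.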

Let's try to make this a little bit more tangible by going back to our colored balls example and think about a sampling distribution for that. So let's say we have our population here, and we know the parameter for this population: we know that the proportion of balls that are orange is forty percent. We don't always know the parameters; oftentimes we're estimating the parameters by looking at samples. But let's say we then take samples of size ten, so sample size 10.

Every time, we calculate the statistic for our sample: what percentage are orange. So let's say the first time we take a sample, this time over here we get three oranges. Three oranges. Let's say the next time we get two oranges. Actually, let me do these as a proportion. So if my sample size is 10, I get three oranges, which is thirty percent, and then if I do it again, I get two oranges, and that is twenty percent.

And I just keep doing this, and eventually, I can plot a distribution of these sample proportions. You would end up with some type of a discrete distribution. The way to read this discrete distribution is, let's say this right over here ends up, and I'm just going to make up a number. This isn't going to be the actual number, but let's say that this is 0.15. The way to read that is you have a 15% chance of getting a sample where 50% of your balls are orange. Or if this right over here is 0.07, that would mean that you have a 7% chance of getting a sample where 20% of your balls are orange.
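The 0.15 and 0.07 above are made-up heights. Under the setup described earlier (draws with replacement, so the ten draws are independent, each orange with probability 0.4), the number of orange balls in a sample follows a binomial distribution, and the actual heights can be computed directly. A minimal sketch, with variable names chosen only for illustration:

```python
from math import comb

n, p = 10, 0.4  # sample size and population proportion from the example

# Exact sampling distribution of the sample proportion k/n,
# assuming independent draws (sampling with replacement)
exact = {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

for proportion, probability in exact.items():
    print(f"{proportion:.0%} orange: {probability:.4f}")
```

With these numbers, a sample that is 50% orange has probability of about 0.20, and a sample that is 20% orange has probability of about 0.12.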

Now, to make this even a little bit more tangible, let's run a simulation that actually does this. This right over here is a simulation created on Khan Academy on our computer programming scratch pads by Charlotte Owen. It's a simulation to construct a sampling distribution. So, let's say here she's using candies instead of just colored balls, but these candies are essentially colored balls. And so here we can set the population proportion. So let's say that the actual proportion, as we saw in our example, but let's say it's green as opposed to orange here, is 40 percent.

And so let's say in each sample, just as we said, our sample size is ten, so we're gonna take a sample size of ten. And let's just do one sample first. So let's just draw a sample. And so what we did is we took 10 of these gumballs out, and we are counting how many of them are green. So in this first sample of 10, we see that 1, 2, 3, 4, 5, 6 of them are green. So out of the possible outcomes, we're now going to tally one of our outcomes having, hey, we got six of our 10 to be green.

And if we want to show the proportion instead of just the count, we can just pick percentage here. And so here we've had one scenario already where 60% were green. But we don't want to just do one sample; we want to keep drawing samples. Let's draw another sample. So in this last sample, we have fifty percent are green. So now that we have one that was fifty percent green and one that was sixty percent green, let's try another sample.

Now we have another sample where we got sixty percent green. So there are two situations where we had sixty percent green. And so I can keep doing this over and over and over. And so what we're creating right over here is a sampling distribution. If we were to do this an infinite number of times, we would get the true sampling distribution of the sample proportion given the actual population proportion that is green.
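To see what "an infinite number of times" buys us, one can compare the tallied frequencies with the exact probabilities as the number of samples grows. The sketch below is not the Khan Academy simulation; it is a small stand-in that assumes independent draws with a 40% chance of green, and the helper names (`exact_pmf`, `empirical_pmf`, `max_gap`) are hypothetical.

```python
import random
from collections import Counter
from math import comb

N, P = 10, 0.4  # sample size and population proportion of green

def exact_pmf():
    """Binomial probabilities for the number of green gumballs in one sample."""
    return {k: comb(N, k) * P**k * (1 - P)**(N - k) for k in range(N + 1)}

def empirical_pmf(num_samples, rng):
    """Relative frequency of each green count over num_samples simulated samples."""
    tally = Counter(
        sum(rng.random() < P for _ in range(N)) for _ in range(num_samples)
    )
    return {k: tally[k] / num_samples for k in range(N + 1)}

def max_gap(num_samples, seed=0):
    """Largest difference between simulated and exact probabilities."""
    rng = random.Random(seed)
    exact = exact_pmf()
    empirical = empirical_pmf(num_samples, rng)
    return max(abs(empirical[k] - exact[k]) for k in exact)

for num_samples in (77, 2_200, 100_000):
    print(num_samples, round(max_gap(num_samples), 4))
```

The gap is typically noticeable with only a few dozen samples and shrinks as more samples are drawn, which is the sense in which the tally approaches the true sampling distribution.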

And so this is after 77 samples. Notice this is saying that out of our 77 samples, 22 of those samples resulted in 40 percent of our gumballs being green. Only one of our samples had 80 percent of our gumballs being green. And if we just want to do a ton more samples, I'll go all the way to drawing 50 samples at a time. So let me just keep increasing this. Notice we have 17 samples now where we had zero percent that are green.

We have 91 of the 2200 samples where 10% were green, where one out of the 10 in our sample were green. And we could convert any of these numbers, 17, 91, 256, into percentages by just dividing by the number of samples. But this is fun; we could just keep going and making this larger and larger and larger. I encourage you to play with this; I'll provide a link for it in the description of this video and on Khan Academy.
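The conversion from tallies to percentages is a single division. As a worked example, the sketch below uses only the two tallies whose outcomes are stated above (17 samples at 0% green and 91 samples at 10% green, out of 2200) and, as an added assumption, compares them with the binomial probabilities for independent draws with a 40% chance of green.

```python
from math import comb

num_samples = 2200           # total number of samples quoted above
counts = {0.0: 17, 0.1: 91}  # sample proportion of green -> tally quoted above

n, p = 10, 0.4               # sample size and population proportion

relative_frequencies = {}
exact_probabilities = {}
for proportion, count in counts.items():
    k = round(proportion * n)  # number of green gumballs in the sample
    relative_frequencies[proportion] = count / num_samples
    exact_probabilities[proportion] = comb(n, k) * p**k * (1 - p)**(n - k)
    print(
        f"{proportion:.0%} green: simulated {relative_frequencies[proportion]:.4f}, "
        f"exact {exact_probabilities[proportion]:.4f}"
    )
```

So 17 out of 2200 is a bit under 1% of samples and 91 out of 2200 is a bit over 4%, both close to what the binomial model gives.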

But the main idea is to get an intuition for how a sampling distribution is different from just a traditional probability distribution; that in a sampling distribution, you're taking samples from a population, calculating some statistic for that sample, and what you're plotting in the sampling distribution are the various probabilities, the various likelihoods of the outcomes for those statistics in those samples.
