Marginal distribution and conditional distribution | AP Statistics | Khan Academy
Let's say we're a professor teaching a statistics class at a university, and we administer an exam. We are curious about the relationship between the amount of time that students study and the percent that they get correct on the test.
So, what we do is we grade all of the exams. We set up these buckets for the time studied: 0 to 20 minutes, 21 to 40 minutes, 41 to 60 minutes, or greater than 60 minutes. So, those are our buckets for the amount of time studying. Then, we also create buckets for the percent correct: 0 to 19% correct, 20 to 39% correct, 40 to 59% correct, 60 to 79% correct, or 80 to 100% correct.
Then, we figure out what percentage of our entire student population falls into each of these categories. For example, 2% of our students studied 21 to 40 minutes and got between 80 and 100% on the exam. Additionally, 16% studied for more than an hour, over 60 minutes, and got between 40 and 59% on the exam.
What I have right over here is a two-way table. It describes a joint distribution between what you can view as two variables: the time studied and the percent correct.
Now, what we're going to introduce ourselves to in this video are two new ideas outside of just the joint distribution. One is the idea of the marginal distribution. I will write this in green: marginal distribution.
This is the idea of: okay, I can see I can break down my class based on both of these variables, but what if I only care about one? If I care about the distribution of just the percent correct, I don't want to break it down by time studied.
Well, if I want to figure out the distribution of percent correct, I could just total up each of these rows and I would end up with this distribution right over here. Just to make it clear, I would see that 20% of my students got 80 to 100% correct. I would see that 30% of my students got between 60 and 79% correct. I see that 35% of my students got between 40 and 59% correct. I think you see where this is going: 10% of my students got between 20 and 39% correct, and then finally, 5% of my students got between 0 and 19% correct.
All we did was total up each of these rows. Notice these now add up to 100%. This describes the distribution of the scores in my class. If someone were to give you just this column, you would say, “Okay, 20% of my students got 80 to 100%,” but you wouldn't know the breakdown by how much they actually studied. You'd say 5% got between 0 and 19% on my test, but you wouldn’t know what the breakdown of that 5% was based on how much they studied.
So, this type of distribution is called a marginal distribution. It's called that because you could view it as written in the margin right over here: we total these rows and we write the totals in the margin.
Now, there's another marginal distribution we could figure out: the distribution of the amount of time people study in my class. If we cared about that, we would total up each of these columns and look at the totals right over here.
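The row totals described above can be sketched in a few lines of Python. This is only an illustration: the transcript states the ">60 minutes" column, the 2% cell (21 to 40 minutes, 80 to 100% correct), and the totals, but not the other cells, so the remaining entries below are placeholder values chosen only so that the rows and columns add up to the totals quoted in the video. The names `JOINT` and `marginal_of_scores` are hypothetical.

```python
# Joint distribution as percentages of the whole class.
# Each row is a percent-correct bucket; the four entries are the
# time-studied buckets 0-20, 21-40, 41-60, >60 minutes.
# ASSUMPTION: only the last column, the 2% cell, and the totals come from
# the video; every other cell is a made-up placeholder that matches them.
JOINT = {
    "80-100": [0, 2, 8, 10],
    "60-79": [0, 5, 20, 5],
    "40-59": [2, 4, 13, 16],
    "20-39": [4, 4, 2, 0],
    "0-19": [1, 0, 0, 4],
}


def marginal_of_scores(joint):
    """Total each row: the share of the class in each score bucket,
    ignoring how long the students studied."""
    return {score: sum(row) for score, row in joint.items()}


score_marginal = marginal_of_scores(JOINT)
for score, pct in score_marginal.items():
    print(f"{score:>6}: {pct}%")
print("total:", sum(score_marginal.values()))
```

Whatever the individual cells are, the row totals are what get written in the right-hand margin, and they should add up to 100.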
We'd say, “Okay, 7% of my class studied between zero and 20 minutes, 15% of my class studied between 21 and 40 minutes, 43% of my class studied between 41 and 60 minutes, and 35% of my class studied more than 60 minutes, more than an hour.”
If I just looked at this marginal distribution of the time studied (once again, it's called that because I'm writing it in the margin, in this case below our table), I would not be able to break down that 35% that studied for more than an hour: I would not know the breakdown by the actual percent correct.
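Totaling the columns works the same way as totaling the rows. The sketch below is an illustration under a stated assumption: apart from the ">60 minutes" column, the 2% cell, and the totals quoted in the video, the cell values are placeholders picked to be consistent with those totals. The names `JOINT_ROWS`, `TIME_BUCKETS`, and `marginal_of_time` are hypothetical.

```python
TIME_BUCKETS = ["0-20", "21-40", "41-60", ">60"]  # minutes studied

# Percentages of the whole class, one row per score bucket.
# ASSUMPTION: only the last column, the 2% cell, and the totals come from
# the video; the other cells are made-up values that match those totals.
JOINT_ROWS = [
    [0, 2, 8, 10],   # 80-100% correct
    [0, 5, 20, 5],   # 60-79% correct
    [2, 4, 13, 16],  # 40-59% correct
    [4, 4, 2, 0],    # 20-39% correct
    [1, 0, 0, 4],    # 0-19% correct
]


def marginal_of_time(rows, labels):
    """Total each column: the share of the class in each time-studied
    bucket, ignoring the score."""
    # zip(*rows) transposes the table, so each tuple is one column.
    return {label: sum(col) for label, col in zip(labels, zip(*rows))}


time_marginal = marginal_of_time(JOINT_ROWS, TIME_BUCKETS)
for bucket, pct in time_marginal.items():
    print(f"{bucket:>5} min: {pct}%")
```

These totals go in the margin below the table, and like the row totals they account for the whole class.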
Now, there's another type of distribution that's related to these joint distributions, or you could say these two-way tables. That's thinking about the distribution of one variable given what bucket you fall in for the other variable.
So, let me write this down. If I want to say the distribution of scores... Let me write it this way: distribution of scores among those who studied more than 60 minutes.
So, where would I get that? Well, I'd go to the column of the people who studied more than 60 minutes, and I'd find this distribution of scores right over here. Among the folks who studied more than 60 minutes, I have this distribution of scores.
Reading down that column, 10% of the class got 80 to 100%, 5% got 60 to 79%, 16% got 40 to 59%, none of them (0%) got 20 to 39%, and 4% got 0 to 19%. Note that these are percentages of the whole class, so they total 35%, the share that studied more than 60 minutes, rather than 100%. So, this distribution of one variable given a bucket that you're falling into for another variable is called a conditional distribution, because you're getting a distribution conditioned on a value of another variable.
So, this right over here is a conditional distribution.
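The video reads the ">60 minutes" column straight off the table. To express that column as shares of the group itself, so that it sums to 100%, each entry can be divided by the column total. Here is a minimal sketch of that rescaling, using only the column values read out in the video; the function and variable names are hypothetical.

```python
# Percentages of the whole class in the ">60 minutes" column,
# as read out in the video.
over_60_column = {"80-100": 10, "60-79": 5, "40-59": 16, "20-39": 0, "0-19": 4}


def conditional_distribution(column):
    """Rescale one column of joint percentages so that it sums to 100%
    within the group that the column describes."""
    group_total = sum(column.values())  # the group's share of the class
    return {score: 100 * pct / group_total for score, pct in column.items()}


scores_given_over_60 = conditional_distribution(over_60_column)
for score, pct in scores_given_over_60.items():
    print(f"{score:>6}: {pct:.1f}%")
```

For example, the 10% of the class in the top cell becomes 10 / 35, or about 28.6%, of the students who studied more than 60 minutes.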
The big idea here is that you have this two-way table, and we're trying to study how two variables relate to each other. If we care about just the distribution of one of the variables, for example, the time studied, we can sum up the columns here and get this marginal distribution.
If we cared about the distribution of percent correct, we could sum up the rows and get that marginal distribution. If we wanted, as in the case I just talked about, the distribution of one variable (here, the distribution of percent correct) conditioned on a value of another variable, well, that's going to be a conditional distribution.
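In symbols, the three ideas can be summarized as follows. The notation is an assumption introduced here, not something used in the video: X is the time-studied bucket, Y is the percent-correct bucket, and p(x, y) is the share of the class in a given cell of the two-way table.

```latex
\begin{align*}
p_Y(y) &= \sum_{x} p(x, y)
  && \text{marginal distribution of percent correct (row totals)} \\
p_X(x) &= \sum_{y} p(x, y)
  && \text{marginal distribution of time studied (column totals)} \\
p_{Y \mid X}(y \mid x) &= \frac{p(x, y)}{p_X(x)}
  && \text{conditional distribution of percent correct given time studied}
\end{align*}
```

With the numbers from the video, the top cell of the ">60 minutes" column gives 0.10 / 0.35, or about 0.286.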