
Bayes Classifiers and Sentiment Analysis


15m read
· Nov 3, 2024

Hey guys, this is Matt Kids 101, and today I'm gonna be talking about the machine learning algorithm called a Bayes classifier.

To get started, I'm gonna show you something cool you can do with Bayes classifiers. So here I've created a simple demo where I can type a sentence, something like this, and it tells me whether that sentence has a positive sentiment or a negative sentiment—whether it's good or bad, basically. So if I say, "This product is bad," it says it's negative. If I say, "This product is good," it says it's positive. If I say, "I really liked the movie," it's positive. If I say, "I really hate it," it's negative.

All I had to do to get this to work was gather a bunch of sentences and label each one as either positive or negative. Then I just fed those sentences to my program, and it learned what makes the sentence good or bad. This idea of sentiment analysis can actually be applied to a whole lot of things.

So say you're a company and you've just released a product. You can actually track Twitter, you know, and see what people are saying about your product and see if they're saying generally good things or generally negative things. You can even take that a step further; if you're a stock trader, you could track a whole bunch of products and basically predict the stock market based on people's opinions about various products.

Another thing you can do is graph out the mood over the course of a piece of text. So I've taken a book, in this case, Stephen King's "Under the Dome," and I've plotted basically the average sentiment over time throughout the book. You can see it starts off kind of negative, kind of sad, then there's hope in the middle, and then all hope is lost at the end, where the sentiment is pretty negative throughout.
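As a rough sketch of how a plot like that could be produced, here's a sliding-window average of per-sentence sentiment scores. The sentence_sentiment argument is a hypothetical stand-in for whatever classifier you've trained (returning, say, +1 for positive and -1 for negative); it isn't something defined in this article.

```python
def rolling_mood(sentences, sentence_sentiment, window=200):
    """Average sentiment over a sliding window of sentences.

    sentence_sentiment is assumed to map a sentence to a score,
    e.g. +1 for positive and -1 for negative.
    """
    scores = [sentence_sentiment(s) for s in sentences]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Plotting the returned values (for example with matplotlib) gives a
# mood-over-time curve like the one described for "Under the Dome".
```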

So, this is just a way to analyze a piece of text using an automated program. I hope I've convinced you by this point that Bayes classifiers are actually pretty useful, even just for something like sentiment analysis.

Before I get into the theory and the math behind Bayes classifiers, I just want to give a really simple example, which will show you kind of what we're after and what we're kind of getting at. So in this example, let's say we're trying to train a Bayes classifier to do sentiment analysis like I showed, and the only data we have are these seven sentences. Three of them are negative and four of them are positive. Based on this, we want the program to be able to draw inferences.

If we show it a new sentence, it just knows about these, you know, seven things that it's seen before. It has to decide for this new sentence whether it's positive or negative. So how can we make this happen? You know, it's pretty easy for us because, you know, we already have language. You probably already speak English and have a good idea of feelings.

But how could we get a program with just these sentences to learn what makes some of them good and some of them bad? The answer in this case is pretty straightforward, actually. All of the positive ones use the word "love," and all of the negative ones use some form of the word "hate." Ideally, if we had a Bayes classifier—or any kind of classifier—it would pick up on this and realize love is really important for things being positive, and hate is really important for things being negative.

So, if I see a new sentence and it says "love" in it, it's going to be positive. If it says "hate" in it, it's going to be negative. If it says both, we don't really know what to do because none of the training samples use both. But we'll certainly get to that once we talk about the math behind a Bayes classifier.

After we have our computer learn from these sentences, hopefully it would pick up on this love-hate pattern, and we would be able to show it these two new sentences. One of them says "hate," and one of them says "love." Based on that, it would be able to figure out that one of them is negative and one of them is positive.

So this is just the idea that we're after, basically. You know, we want the program to be able to look at all of the sentences as we show it and figure out which words are more likely to appear in positive sentences or which words are more likely to appear in negative sentences. Just based on that, it can make inferences about new things that it's never seen before.

More generally, you know, it doesn't have to be about sentences and positive or negative sentiment. It can be about anything. The idea is just figuring out, you know, which variables correlate to which classes, you know, which categorizations, basically.

Now I want to turn this very abstract idea into an actual procedure that we can perform, not on sentiment analysis, but on anything. I'm going to be using sentiment analysis as the example because it's pretty straightforward, and I've shown you how cool it is. But you can think about doing this with a lot of different things.

The first step in making a Bayes classifier is to decide how we're going to represent a piece of data. So, if we're doing sentiment analysis, a piece of data is maybe a sentence or, you know, just a piece of text in general. How are we going to represent that piece of text?

Well, the way I've described representing a piece of text previously is, you know, the appearance of certain words. So I might describe a sentence as: it has the word "love," it doesn't have the word "hate," it has the word "elephant," it doesn't have the word "monkey," and I could go on through all of the, say, 30,000 or 60,000 words that we think are important. I could describe every single sentence as a vector of yes or no values, basically: does it have the word, or doesn't it have it.

Another way to describe the piece of text would be counting, you know? I could say it has the word "love" two times, it has the word "hate" one time. You can imagine that might be a little better if we're looking at longer pieces of text, like a newspaper article, where, you know, the longer the piece of text, the more likely any given word is going to appear in it. At some point, you have to start counting. You can't just say it doesn't have the word "hate" in it because a long novel will have the word "hate" in it even if it's not, you know, a very negative novel.

So this is the first thing we have to decide: how are we going to represent the data, you know, the variables in the data? For the purposes of this video, I'm pretty much gonna be sticking to the model of: does the sentence have the word or doesn't it have the word? Because this is just a really simple way to represent a sentence, and for short pieces of text, like tweets, it works great, and it's just the easiest thing to understand. This is, by the way, called Bernoulli Naive Bayes. So if you want to look it up, just look up the word "Bernoulli." I'll probably have that in the description.
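To make that concrete, here's a minimal sketch of both representations: the yes/no one and the count-based one. The tiny vocabulary and the example sentence are made up purely for illustration.

```python
# The two ways of representing a piece of text discussed above.
vocabulary = ["love", "hate", "elephant", "monkey"]

def binary_features(sentence):
    """Bernoulli-style features: does each vocabulary word appear at all?"""
    words = set(sentence.lower().split())
    return [word in words for word in vocabulary]

def count_features(sentence):
    """Count-style features: how many times does each vocabulary word appear?"""
    words = sentence.lower().split()
    return [words.count(word) for word in vocabulary]

print(binary_features("i love love my pet elephant"))  # [True, False, True, False]
print(count_features("i love love my pet elephant"))   # [2, 0, 1, 0]
```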

Now I want to give a really simple example of actually building a Bernoulli Naive Bayes classifier just using a really limited vocabulary of three words. In real life, we would probably use thousands of words, you know, we would look for the appearance of all of those words. But in this example, just to make things easier, we're only going to be looking at the three words "love," "hate," and "elephant," and we're only going to be looking at 12 sentences and, you know, how those sentences use these words.

In this case, I gathered six positive sentences and six negative sentences. I built this table of whether each sentence used the word "love," whether each sentence used the word "hate," and whether each sentence used the word "elephant." So you can see the first two positive sentences use the word "love" and didn't use the word "hate," and didn't use the word "elephant." That's pretty much what we would expect.

The next two positive sentences actually didn't use any of the three words in our vocabulary, so we can't say much about sentences of that nature. The last two said "love." One of them also said "hate," which isn't what we would expect. Maybe they were talking about a love-hate relationship or something like that, and one of them actually said "elephant." So this is going to be interesting.

If you look at the negative examples, three of the negatives used the word "hate." So, based on these statistics, if we didn't know anything else about a sentence besides whether it used the word "love," whether it used the word "hate," and whether it used the word "elephant," how do we decide for a new sentence whether it's positive or negative?
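Before getting into the math, here's roughly what those two tables look like as data. The exact sentences aren't given here, so the rows below are just a reconstruction that's consistent with the counts described above and the probabilities computed below.

```python
# 1 = the sentence contains the word, 0 = it doesn't.
# Positives: four contain "love", one contains "hate", one contains "elephant".
positive_rows = [
    {"love": 1, "hate": 0, "elephant": 0},
    {"love": 1, "hate": 0, "elephant": 0},
    {"love": 0, "hate": 0, "elephant": 0},
    {"love": 0, "hate": 0, "elephant": 0},
    {"love": 1, "hate": 1, "elephant": 0},
    {"love": 1, "hate": 0, "elephant": 1},
]
# Negatives: one contains "love", three contain "hate", one contains "elephant".
negative_rows = [
    {"love": 1, "hate": 0, "elephant": 0},
    {"love": 0, "hate": 1, "elephant": 0},
    {"love": 0, "hate": 1, "elephant": 0},
    {"love": 0, "hate": 1, "elephant": 0},
    {"love": 0, "hate": 0, "elephant": 1},
    {"love": 0, "hate": 0, "elephant": 0},
]
```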

This is where we have to start bringing in some math, and what we're going to be doing is we're going to be computing things called conditional probabilities. All that really means is, you know, if I tell you I've given you a sentence that's positive with positive sentiment, I want to know the probability of seeing the word "love." Or I give you a sentence, I tell you it's negative, what's the probability that the word "elephant" will be inside that sentence?

A conditional probability is just a probability conditioned on something—in this case, conditioned on whether the sentence is positive or negative. We're actually pretty much set up to compute these probabilities with the data we have in front of us because I've already split it up into positive and negative tables. So all we have to do is compute a probability for the positive table, and that'll be conditioned on, you know, a positive sentence. We compute probabilities for the negative table, and that's conditioned on a negative sentence.

I think you'll see what I mean in a second when I actually go through the math. So let's actually just start thinking about these probabilities. Let's say I have a positive sentence, and I want to know: what's the probability that it has the word "love"? Well, we have six positive sentences in total, all six rows in the table, and four of them have the word "love" in them. So four out of six is the probability that a positive sentence will have the word "love."

Now let's say we want to know the probability that a positive sentence will have the word "hate." Well, there are still six positive sentences, and only one of them has the word "hate" in it. So there's a one out of six chance that a positive sentence will have the word "hate." We could go through and do this for some negatives too: three out of six negatives have the word "hate," so there's a three out of six probability that a negative sentence will have the word "hate."

We would get probabilities like the ones I've written at the bottom here: basically, out of all of the entries in that column, how many of them have a "yes." So this is what we get, and these are our conditional probabilities, the conditional probability of seeing each word given the sentiment of the sentence.
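Here's that counting step as a small sketch in code, using the counts from the tables above (Fraction just keeps the numbers exact instead of turning them into decimals).

```python
from fractions import Fraction

# How many of the six sentences in each class contain each word,
# read straight off the two tables.
word_counts = {
    "positive": {"love": 4, "hate": 1, "elephant": 1},
    "negative": {"love": 1, "hate": 3, "elephant": 1},
}
SENTENCES_PER_CLASS = 6

# Conditional probability of seeing each word, given the class.
cond_prob = {
    label: {word: Fraction(count, SENTENCES_PER_CLASS)
            for word, count in counts.items()}
    for label, counts in word_counts.items()
}

print(cond_prob["positive"]["love"])  # 2/3, i.e. 4/6
print(cond_prob["negative"]["hate"])  # 1/2, i.e. 3/6
```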

Now we're pretty much done looking at the actual data. We've basically gotten these probabilities from this data, and we don't need the data anymore. So I'm just gonna make a more compact version of these two tables just with these probabilities we came up with and not with any of the data.

You can see our four out of six for "love" and positive is still there. Our three out of six for "hate" and negative is still there. All of our stats are still there. I just got rid of all of the "yeses" and "noes" that we had from our data. Now I pretty much need to touch on two pretty basic probability ideas.

The first being if I have the probability of something happening, I can easily figure out the probability of that not happening. Because if I know the probability of "love" appearing in a positive sample is four-sixths, then there's two-sixths left. So I know the probability that "love" doesn't appear is, you know, 1 - 4/6, which is 2/6.

I can create those tables here just for you to see, you know, if I subtract the probability of a thing happening from one, what I get out is the probability that thing doesn't happen. This is the idea of complementary probabilities; it's pretty basic, and we're going to be using this a lot.
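In code, building the "doesn't appear" table from the "appears" table is just a subtraction. Here's a quick sketch using the positive-table probabilities from above:

```python
from fractions import Fraction

# Probabilities of each word appearing in a positive sentence.
present = {"love": Fraction(4, 6), "hate": Fraction(1, 6), "elephant": Fraction(1, 6)}

# Complementary probabilities: the chance each word does NOT appear.
absent = {word: 1 - p for word, p in present.items()}

print(absent["love"])  # 1/3, i.e. 2/6
print(absent["hate"])  # 5/6
```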

The second thing I want to talk about is not something I can really easily make a slide for, but it's the idea of independence. So let's say I have a coin, and there's a 1/2 chance I flip heads and a 1/2 chance I flip tails. Well, if I asked you what's the chance I flip heads three times, I would flip it one time, flip it a second time, flip it a third time, and no one of those flips influences another flip; they're independent events.

There's always a 1/2 chance of me flipping heads at any given flip of this coin. The past flips don't affect what happens. So if I want to know the chance I got heads three times in a row, I take the chance I flipped heads the first time, times the chance I flipped heads the second time, times the chance I flipped heads the third time. So it's 1/2 times 1/2 times 1/2, which gives me 1/8.

The idea is if I have things which occur independently, and I want the probability of all of them happening at the same time, I just multiply the probabilities of each thing, and that gives me the probability that they all happen simultaneously. We're going to be applying that by making the assumption that each word in a sentence appears independently.

For instance, I might assume that "love," "hate," and "elephant" are each independent events. You know, if "love" is there, it doesn't influence whether "hate" is there; things like that given that it's a positive sample or a negative sample. This assumption allows me to figure out, okay, the chance that "love" and "hate" appear is the probability of "love" times the probability of "hate."
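As a small sketch of what that buys us numerically (the 4/6 and 1/6 are the positive-table probabilities from above):

```python
from fractions import Fraction

# Independent coin flips: P(heads three times) = 1/2 * 1/2 * 1/2 = 1/8.
p_heads = Fraction(1, 2)
print(p_heads * p_heads * p_heads)  # 1/8

# The naive assumption applied to words: given a positive sentence,
# P("love" appears AND "hate" appears) is taken to be P("love") * P("hate").
p_love_given_positive = Fraction(4, 6)
p_hate_given_positive = Fraction(1, 6)
print(p_love_given_positive * p_hate_given_positive)  # 1/9, i.e. 4/36
```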

It allows me to do things like that, where I can just multiply probabilities to get the probability of both things happening at once. It's actually worth asking: is this assumption a valid assumption to make? Even if it's a positive sentence, if I use a certain word like "love," I'm probably a lot less likely to use a different word like "adore" or, you know, something that means love, a synonym of love, because I've already used "love."

There are a lot of things like that where you might think words actually aren't independent of each other in English. Despite the fact that assuming words are independent is so wrong and so, like, naive, this technique still actually works. So we're making an assumption that isn't even remotely true, but yet we're still going to get really amazing results.

You can kind of think about maybe why that works, but it just does. So just beware that we are making an assumption, and that assumption is actually patently false, but it's going to be okay.

Now I want to actually go on and work on some sentences and use this classifier to decide if those sentences are positive or negative. As I do that, I'll show you how we're going to use these probability principles I've just explained to actually decide the sentiment of a sentence.

Let's have a look at this sentence: "I really love my pet elephant." This sentence has the word "love," the word "elephant," and it doesn't have the word "hate." So, here's how we're going to decide whether this sentence is positive or negative.

First, we're going to assume that the sentence is positive, and we're gonna ask: if we're assuming it's positive, what was the probability that I would get a sentence with the word "love," with the word "elephant," and without the word "hate"? To do that, we multiply the probability of "love" times the probability of not "hate" times the probability of "elephant," all from the positive table.

Then we're going to assume instead that it's negative and we're gonna do the same thing: the probability of "love" times the probability of not "hate" times the probability of "elephant," you know, using this negative table, and we're gonna see which one is more likely. You know, if it was positive, is the sentence more likely than if it was negative? You know, that kind of thing, and we're just going to compare those numbers and make a decision.

So let's actually go through and do that. If this sentence were positive: it has the word "love," and there's a four-sixths chance of that. It's missing the word "hate"; the chance that a positive sentence has the word "hate" is 1/6, so the chance that it doesn't is 1 - 1/6, which is 5/6. It has the word "elephant," and the chance of that one is 1/6.

We multiply four-sixths by five-sixths by 1/6, and we get about 0.093. We can do the same thing for negative. The chance of seeing "love" in a negative sentence is 1/6. The chance of seeing "hate" is 3/6, so the chance of not seeing it is 1 - 3/6, and the chance of seeing "elephant" is 1/6.

We get about 0.014 in that case. The chance of it being positive is much higher than the chance of it being negative. Or I should word that differently: the chance of seeing this sentence, if it were positive, is a lot higher than the chance of seeing this sentence if it were negative.

So, if we were really pissed off, or really sad or something like that, the chances we would write this sentence are a lot lower. So, using this information, we conclude that the sentence is positive.
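Putting the pieces together, here's a minimal sketch of that whole comparison in code. The probabilities are the ones from the tables above; the two classes have the same number of training sentences, so comparing these two products is all we need, and splitting on whitespace is a simplification that ignores punctuation.

```python
from fractions import Fraction

# Conditional probabilities read off the positive and negative tables above.
cond_prob = {
    "positive": {"love": Fraction(4, 6), "hate": Fraction(1, 6), "elephant": Fraction(1, 6)},
    "negative": {"love": Fraction(1, 6), "hate": Fraction(3, 6), "elephant": Fraction(1, 6)},
}

def score(sentence, label):
    """How likely this sentence's word pattern is, assuming the given label."""
    words = set(sentence.lower().split())
    result = Fraction(1)
    for word, p_present in cond_prob[label].items():
        result *= p_present if word in words else (1 - p_present)
    return result

def classify(sentence):
    """Pick whichever label makes the observed word pattern more likely."""
    return max(cond_prob, key=lambda label: score(sentence, label))

print(float(score("i really love my pet elephant", "positive")))  # ~0.093
print(float(score("i really love my pet elephant", "negative")))  # ~0.014
print(classify("i really love my pet elephant"))                  # positive
```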

Something I think is important to note at this point is that the word "elephant" didn't actually matter. Because we saw the word "elephant," we multiplied by 1/6 when we were doing the positive case and by 1/6 when we were doing the negative case. "Elephant" had the same likelihood for positive samples and negative samples, and as a result, we didn't really have to multiply by the "elephant" probability at all, because we shrunk both sides by the same amount for positives and negatives.

We multiplied by 1/6 in both cases, so the answer would have been the same, and will always be the same, whether or not we count "elephant." This is nice because, intuitively, the word "elephant" doesn't really tell us whether a sentence is good or bad, whether it's positive or negative.

It shows that with a Bayes classifier, if there's a word that's basically not correlated at all with the categorization you want, it's essentially equivalent to just ignoring that word. We implicitly ignore it: we can multiply by the 1/6, but it doesn't really change which one is greater than the other one.
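Here's that cancellation worked out as a quick sketch: dropping the "elephant" factor from both products doesn't change which side is bigger.

```python
from fractions import Fraction

# Scores for "I really love my pet elephant", with and without the
# shared 1/6 "elephant" factor.
pos_with = Fraction(4, 6) * (1 - Fraction(1, 6)) * Fraction(1, 6)
neg_with = Fraction(1, 6) * (1 - Fraction(3, 6)) * Fraction(1, 6)
pos_without = Fraction(4, 6) * (1 - Fraction(1, 6))
neg_without = Fraction(1, 6) * (1 - Fraction(3, 6))

print(pos_with > neg_with)        # True
print(pos_without > neg_without)  # True, same verdict either way
```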

So that's just something I think is important to notice—how the Bayes classifier naturally basically ignores variables that don't matter. Now I'm just going to do one more example before the end of the video, just so we have another case to deal with.

This is going to be a situation like the one from before, where we had the sentences with "love" and "hate," and I asked what would happen if "love" and "hate" both appeared in the same sentence. Now we can address questions like that.

So let's say we have the sentence, "I love and hate my elephant." Actually, as a person, I'm not sure whether to call this positive or negative, but we can just see what the Bayes classifier does. If we actually were to do the same thing we did before, we can multiply the probability of "love" times the probability of "hate" times the probability of "elephant" for both positive and negative.

Actually, it came out that positive and negative were pretty close, but positive actually won out a little bit. The reason positive won in this case is because "love" happened in 4 out of the 6 positive cases whereas "hate" only happened in 3 out of the 6 negative cases.
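Here are those two products worked out. All three words are present this time, so each factor is the "appears" probability.

```python
from fractions import Fraction

# "I love and hate my elephant": love, hate, and elephant all appear.
pos = Fraction(4, 6) * Fraction(1, 6) * Fraction(1, 6)
neg = Fraction(1, 6) * Fraction(3, 6) * Fraction(1, 6)

print(float(pos), float(neg))  # about 0.0185 vs 0.0139, so positive wins narrowly
```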

So "love" actually is kind of our pivot in this case because "love" is more strongly associated with a positive sentiment than "hate" is associated with a negative sentiment. In this case, we've been able to turn this into numerical things. You know, "hate" only appeared in three of our things, whereas "love" appeared in four of them. You know, because we made this numerical instead of just kind of intuitive or qualitative, we can actually make decisions where it's not so obvious what the decision would be.

Ultimately, I'm not sure if we would even agree with this. Is this sentence really slightly more positive than negative? Twelve sentences isn't enough to get a good sense. But if we had a lot of sentences and it turned out "love" really did happen more often in positive sentences than "hate" happened in negative sentences, we might draw a similar conclusion.

This sentence might feel a little more positive to a person reading it than it would feel negative. So that's just something to think about—maybe more of a philosophical question: how do we deal with ties like this? This is the answer in the case of the Bayesian classifier.

Anyway, I hope you learned something. Thanks for watching and goodbye!
