
Markov Chains and Text Generation


10m read · Nov 3, 2024

Pictures can also refer to translation to machine code for correctly rounded when converted to Islam in material. Okay, so that probably didn't make a whole lot of sense, but that's because a human didn't write that sentence; a machine did, using something called a Markov chain.

Today, I'm going to talk about Markov chains and some of their applications, including text generation like you just saw. I want to start with a really simple real-life example where a Markov chain might arise. Imagine the only thing I care about is whether or not my car works.

There are just two states the world can be in: either my car is functional and I can use it to drive to work, or my car is broken and sitting uselessly in my driveway, and I can't get anywhere with it. Now I want to think about the probability that I'll move between these two situations.

Let's say that most of the time, if I wake up and my car works in the morning, it will probably still be working at night when I go to bed. There's a small chance I'll have a breakdown in the middle of the day; let's call that a 1% chance. So there's a 1% chance I'll go from the "car works" state to the "car broken" state.

Now, let's suppose that once my car is broken, I really want it fixed. If I wake up on a given day and my car is broken, there's a 90% chance I'll have it fixed by the end of the day. We can represent that the same way. And there's a final piece of the picture: 99% of the time my working car will stay working, and 10% of the time my broken car will stay broken.

I can represent these with arrows pointing back into the same state they come out of. This is an example of a Markov chain. Basically, we have a set of states, in this case just two, car functional or car broken, and we have transitions between those states: the likelihood of going from one state to another.

Finally, instead of representing this Markov chain with circles and arrows, we can also represent it as a table. Here, I've chosen to make the rows the state you're coming from and the columns the state you're going to.

So if my car is currently broken and I want to know the odds that it will stay broken, I would look at the bottom-right corner of the table. This is just another way to write down a Markov chain: as a table, or as a matrix.
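To make the table concrete, here's a minimal sketch in Python (the original doesn't include any code, so the names here are my own). The transition table becomes a nested dictionary, with the outer key as the state you're coming from:

```python
# Transition table for the two-state car example: rows are the state
# you're coming from, columns the state you're going to.
transition = {
    "works":  {"works": 0.99, "broken": 0.01},
    "broken": {"works": 0.90, "broken": 0.10},
}

# The odds a broken car stays broken: the bottom-right corner of the table.
print(transition["broken"]["broken"])  # 0.1
```

Note that each row sums to 1: starting from any state, you have to end up somewhere.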

You might be wondering how I could actually gather this data. How would I find out that 99% of the time my working car keeps working, and 90% of the time my broken car gets fixed? How do I figure out these probabilities in a real-life situation?

One way is to just gather a lot of data. Suppose I spent a little over two years noting, every morning and every night, whether my car works. With that, I could fill in a table.

The idea is: if I started the day with my car broken and ended the day with it working, I would add one to that cell in the table. I might get something that looks a little like this. This is the kind of thing I would get from just gathering a bunch of data.

Now we're left with the task of turning this table of raw counts into a table of probabilities. Before, we had a table with 90%, 99%, and 1%; now we just have a table full of counts. How do we get from this to that?

Let's think about what a probability actually is. If I start the day with a working car and I want to know the probability that my car will be broken by the end of the day, I'm really asking: out of some number of trials, how many times does it end up broken? Nine times out of ten? Fifty times out of a hundred?

So if my car broke down eight times on days when it was working in the morning, the question is: eight times out of what? If I took 800 samples where I started the day with my car working, and there were only eight where it was broken at the end of the day, then there's an 8-out-of-800 chance, that is 1%, that my car will break down on a given day.

To turn this into a formal procedure, I'm just going to add another column to the table: the total sum of each row. For the first row, days starting with my car working, there's a total of 800 days; and there were a total of 10 days that started with my car broken.

This lets me divide everything in the first row by 800 and everything in the second row by 10. Now the first row tells me that if I wake up with my car working, 792 times out of 800 I will go to bed with my car working as well.

This turns our raw counts, the number of days each transition happened, into probabilities. And if we express them as percentages, we get exactly the table from the original example I presented before.
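Here's a sketch of that normalization step in Python: divide each row of counts by its row total. The 792-and-8 counts are the ones from the example above; the 9-and-1 split for the broken row is my assumption, chosen to match the 90%/10% figures:

```python
counts = {
    "works":  {"works": 792, "broken": 8},  # 800 days starting with a working car
    "broken": {"works": 9,   "broken": 1},  # 10 days starting broken (assumed split)
}

# Divide each row by its total (the extra "row total" column).
probabilities = {
    state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for state, row in counts.items()
}

print(probabilities["works"])   # {'works': 0.99, 'broken': 0.01}
print(probabilities["broken"])  # {'works': 0.9, 'broken': 0.1}
```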

So how do we apply this idea of states and transition probabilities to something like the example I showed at the beginning, where I used a Markov chain to generate a bunch of text? What we're really looking for is a way to represent a piece of text as a set of states and transitions between states.

I'm going to show how to do this using a toy sentence, "the quick brown fox jumps over the lazy dog," because it's simple to work with this small amount of text. The first thing I'll do is create a state for each word that appears in the sentence.

So I have a state for "quick," a state for "brown," a state for "the," and so on. The way I'll define a state is: as you read the sentence, whatever word you're currently reading is the state you're in. If I read through the sentence, I start in the "the" state, go to "quick," then "brown," "fox," "jumps," "over," back to the "the" state, and finally to "lazy" and then "dog."

The next step in building this Markov chain is working out the probabilities of going between states. Let's start with the "the" state, because "the" is the only word that appears in this sentence twice. The first time we saw "the," the next word was "quick"; the second time, the next word was "lazy."

So from the state "the" there's actually a 50% probability that we go to "quick" next, and a 50% probability that we go to "lazy" next. Now, if we look at any other word, say "fox": "fox" appears only once in the sentence, and "jumps" comes right after it.

So there's a 100% probability that the state after "fox" is the "jumps" state; there's simply nothing else we can go to from the "fox" state. And the same is true for every word except "the," because "the" is the only word that occurs more than once in this sentence.

So if we turn this into a Markov chain, we get something where the word "the" branches into "quick" and "lazy" with a 50% chance in each direction, and every other word goes to the next word with 100% probability, because nothing else ever comes after it.
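Here's a minimal sketch of building that word-level chain in Python, counting which word follows which and then normalizing each row (lowercasing so "The" and "the" count as the same state):

```python
from collections import Counter, defaultdict

# Build a word-level Markov chain from text: count which word follows
# which, then turn each row of counts into probabilities.
def build_chain(text):
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return {
        word: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for word, row in counts.items()
    }

chain = build_chain("The quick brown fox jumps over the lazy dog")
print(chain["the"])  # {'quick': 0.5, 'lazy': 0.5}
print(chain["fox"])  # {'jumps': 1.0}
```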

The point of this example is just to show how we can build a Markov chain from text: each word is a state, and each transition is the likelihood that a given word follows the one before it. Of course, this sentence, being just a few words, doesn't give us a very good idea of how the English language is structured.

It doesn't tell us much about which words come after other words, although it does show that the word "the" is pretty common. So, to get a better example of a bigger Markov chain, I took about 5,000 Wikipedia articles and trained a Markov chain on them, generating one big chain over all the words that appear there: hundreds of thousands of words.

I'm just going to show you the state for the word "the." I've simplified it a bit, with an "other stuff" bucket in the corner for everything I didn't want to fit, so what's shown is just the most common transitions out of the word "the."

We see "first" is the most likely word to come after the word "the." Right after it is "United," probably from "the United States," then "same," "most," and so on. What this shows is that if I'm reading text and all I'm told is that the last word I saw was "the," I can predict that "first," "same," "most," and "city" are all pretty likely to come next.

In other words, I know what's likely to come next. Applied to text, a Markov chain is basically a model of the probability of certain words following other words. We've already done a great deal, and this is already a pretty useful model to have.

Say you're making a keyboard for iOS or another mobile operating system and you want it to predict the next word the user is going to type; this is actually a fairly reliable way to do that. It's not necessarily the best system, because it depends only on the last word they typed, but it does take some context into account, and it could do a pretty decent job of suggesting which word to type.
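Here's a rough sketch of that kind of suggestion lookup. The numbers are illustrative placeholders consistent with the ordering described above, not the real Wikipedia-trained figures:

```python
# Suggest the k most probable next words given the last word typed.
# The probabilities below are made-up placeholders for illustration.
chain = {
    "the": {"first": 0.0133, "united": 0.0095, "same": 0.0090, "most": 0.0080},
}

def suggest(last_word, k=3):
    followers = chain.get(last_word, {})
    return sorted(followers, key=followers.get, reverse=True)[:k]

print(suggest("the"))  # ['first', 'united', 'same']
```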

Now I still have to explain how we can use this to actually generate a random block of text. Really, I just need to explain how to generate the next word, because if you can generate the next word, you can keep generating words and end up with a long string of text.

So let's say we've just generated the word "the" in our random block of text. We're going to pick the next word randomly, but in a biased way: we'll bias the choice so that "first" is more likely than "end," and "United" is more likely than "other."

And actually, we're going to bias it exactly. Even though we're picking the next word randomly, we're picking it in such a way that "first" has a 1.33% chance of happening, "same" has a 0.90% chance, and so on. We're picking randomly, but in a biased fashion, and we're biasing it exactly the way our Markov chain says we should.

If the chain says "first" comes after "the" more often than "city" does, then we pick randomly in that biased way; we do it the way the Markov chain says to. You can think of this as flipping a biased coin: instead of 50% heads and 50% tails, it's 75% heads and 25% tails.

It's still random what happens, but one outcome is more likely than the other. Picking the next word is kind of like that.
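Here's a sketch of that biased pick in Python, using the standard library's weighted random choice and the toy chain from the fox sentence:

```python
import random

# Pick the next word with probability proportional to the chain's
# transition probabilities, using a weighted random choice.
def next_word(chain, current):
    followers = chain[current]
    return random.choices(list(followers), weights=list(followers.values()))[0]

def generate(chain, start, length=20):
    out = [start]
    while len(out) < length and out[-1] in chain:  # stop at a dead end
        out.append(next_word(chain, out[-1]))
    return " ".join(out)

# The toy chain from the fox sentence; "the" is the only branching state.
chain = {"the": {"quick": 0.5, "lazy": 0.5}, "quick": {"brown": 1.0},
         "brown": {"fox": 1.0}, "fox": {"jumps": 1.0}, "jumps": {"over": 1.0},
         "over": {"the": 1.0}, "lazy": {"dog": 1.0}}
print(generate(chain, "the"))  # e.g. "the quick brown fox jumps over the lazy dog"
```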

One last thing I'll mention: with this technique, each generated word depends only on the single word that comes before it, so the text diverges in meaning extremely fast. A couple of words down the line, there's just no connection to what was said before. One way to fix this is to make each word you generate depend not on the last word but on the last two or three words. You can set that up in a way similar to what we did, except instead of all the words coming off of "the," you'd have all the words that come after a phrase like "the same" or "the most."

So you'd have a pair of words in each state instead of a single word, and that's actually what I did to generate the block of text you saw at the beginning. But that's not really essential to the idea of a Markov chain; it's the same thing with more words per state.
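Sticking with the toy sentence, a sketch of that pair-of-words variant looks almost identical; the state is just the last two words:

```python
from collections import Counter, defaultdict

# Second-order variant: each state is a pair of words, and we count
# which single word follows each pair.
def build_chain2(text):
    words = text.lower().split()
    counts = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        counts[(a, b)][c] += 1
    return {
        pair: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for pair, row in counts.items()
    }

chain2 = build_chain2("The quick brown fox jumps over the lazy dog")
print(chain2[("the", "quick")])  # {'brown': 1.0}
print(chain2[("over", "the")])   # {'lazy': 1.0}
```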

To conclude, I want to show one other cool thing you can do with Markov chains that has nothing to do with text generation. I made a program where I could draw a lot of stuff, and it would record the pen strokes I made: how I moved my finger or my mouse over each brief interval, and the direction I moved it in.

Then I trained a Markov chain on those pen strokes, on small segments of where I moved my mouse. Each state captures roughly what direction the mouse is sliding at that moment, how fast it's sliding, and so on.
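The original doesn't show this program's code, but as a rough sketch of the idea, one way to turn mouse movement into Markov states is to quantize each short (dx, dy) step into a coarse direction-and-speed bucket; the specific 8-direction, 3-speed bucketing below is my assumption:

```python
import math
from collections import Counter, defaultdict

# Quantize a short mouse movement (dx, dy) into a coarse
# (direction, speed) bucket that can serve as a Markov state.
def stroke_state(dx, dy):
    direction = round(math.atan2(dy, dx) / (math.pi / 4)) % 8  # 8 compass bins
    speed = min(int(math.hypot(dx, dy) // 5), 2)               # slow/medium/fast
    return (direction, speed)

# Count transitions between consecutive stroke states, exactly as with words.
def train(deltas):  # deltas: list of (dx, dy) samples from recorded strokes
    counts = defaultdict(Counter)
    states = [stroke_state(dx, dy) for dx, dy in deltas]
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1
    return counts
```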

I used that to generate new images that had never been drawn before. What's remarkable about these images is that even though they aren't letters or numbers or anything truly meaningful, they actually look like something a human could draw. There are curls and squiggles; it's not just a purely random drawing. It emulates the pen strokes a human might make while scribbling with a mouse or on a touch screen.

I thought that was pretty interesting, and just a cool little other thing you can do with a Markov chain that doesn't really have to do with text generation. I hope you enjoyed this video and learned something from it. Thanks for watching, and goodbye.
