
The Zipf Mystery


11 min read · Nov 10, 2024

Hey, Vsauce. Michael here. About 6 percent of everything you say and read and write is the word "the" - the most used word in the English language. About one out of every 16 words we encounter on a daily basis is "the." The top 20 most common English words in order are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you," "was," "with," "on," "as," "have," "but," "be," "they."

That's a fun fact. A piece of trivia, but it's also more. You see, whether the most commonly used words are ranked across an entire language, or in just one book or article, almost every time a bizarre pattern emerges. The second most used word will appear about half as often as the most used. The third one third as often. The fourth one fourth as often. The fifth one fifth as often. The sixth one sixth as often, and so on all the way down.

Seriously. For some reason, the number of times a word is used is just proportional to one over its rank. Word frequency and ranking on a log-log graph follow a nice straight line. A power law. This phenomenon is called Zipf's Law, and it doesn't only apply to English. It also applies to other languages, like, well, all of them. Even ancient languages we haven't been able to translate yet.

And here's the thing. We have no idea why. It's surprising that something as complex as reality should be conveyed by something as creative as language in such a predictable way. How predictable? Well, watch this. According to WordCount.org, which ranks words as found in the British National Corpus, "sauce" is the 5,555th most common English word.

Now, here is a list of how many times every word on Wikipedia and in the entire Gutenberg Corpus of tens of thousands of public domain books shows up. The most used word, 'the,' shows up about 181 million times. Knowing these two things, we can estimate that the word "sauce" should appear about thirty thousand times on Wikipedia and Gutenberg combined. And it pretty much does.
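If you want to check that estimate yourself, the arithmetic fits in a few lines of Python. The only inputs are the two figures quoted above, and the 1/rank scaling is the Zipf assumption.

```python
# Back-of-the-envelope Zipf estimate: a word of rank r should appear
# roughly 1/r times as often as the top-ranked word. Both inputs below
# are the figures quoted above.

the_count = 181_000_000   # approximate occurrences of "the" (rank 1)
sauce_rank = 5_555        # rank of "sauce" according to WordCount.org

estimate = the_count / sauce_rank
print(f"Estimated occurrences of 'sauce': about {estimate:,.0f}")
# Prints roughly 32,600 -- in line with the "about thirty thousand" above.
```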

What gives? The world is chaotic. Things are distributed in myriad ways, not just power laws. And language is personal, intentional, idiosyncratic. What about the world and ourselves could cause such complex activities and behaviors to follow such a basic rule? We literally don't know.

More than a century of research has yet to close the case. Moreover, Zipf's law doesn't just mysteriously describe word use. It's also found in city populations, solar flare intensities, protein sequences and immune receptors, the amount of traffic websites get, earthquake magnitudes, the number of times academic papers are cited, last names, the firing patterns of neural networks, ingredients used in cookbooks, the number of phone calls people receive, the diameter of Moon craters, the number of people that die in wars, the popularity of opening chess moves, even the rate at which we forget.

There are plenty of theories about why language is 'zipf-y,' but no firm conclusions, and this video doesn't contain a definite explanation either. Sorry, I know that's a bummer, since we appear to like knowing more than mystery. But that said, we also ask more than we answer.

So let's dive into Zipf's ramifications, some related patterns, some possible explanations, and the depth of the mystery itself. Zipf's law was popularized by George Zipf, a linguist at Harvard University. It is a discrete form of the continuous Pareto distribution from which we get the Pareto Principle.

Because so many real-world processes behave this way, the Pareto Principle tells us that, as a rule of thumb, it's worth assuming that 20% of the causes are responsible for 80% of the outcome, like in language, where the most frequently used 18 percent of words account for over 80% of word occurrences.
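That connection between Zipf and Pareto is easy to see numerically. Here's a rough Python sketch; the vocabulary size is an arbitrary assumption, picked only for illustration.

```python
# How an idealized Zipf (1/rank) distribution concentrates usage in the
# top-ranked words. The vocabulary size is an assumed, illustrative number.

vocab_size = 50_000                                   # assumed distinct words
weights = [1 / rank for rank in range(1, vocab_size + 1)]
total = sum(weights)

top_20_percent = int(0.2 * vocab_size)
share = sum(weights[:top_20_percent]) / total
print(f"Top 20% of words cover about {share:.0%} of all word occurrences")
# With these assumptions the answer lands in the mid-80s percent --
# the same ballpark as the 80/20 figures described above.
```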

In 1896, Vilfredo Pareto showed that approximately 80% of the land in Italy was owned by just twenty percent of the population. It is said that he later noticed in his garden 20 percent of his pea pods contained eighty percent of the peas. He and other researchers looked at other datasets and found that this 80-20 imbalance comes up a lot in the world.

The richest 20% of humans have 82.7% of the world's income. In the US, 20% of patients use eighty percent of health care resources. In 2002, Microsoft reported that 80% of the errors and crashes in Windows and Office are caused by 20% of the bugs detected. A common rule of thumb in the business world states that 20% of your customers are responsible for 80% of your profits, and eighty percent of the complaints you receive will come from 20% of your customers.

A book titled "The 80/20 Principle" even says that in a home or office, 20% of the carpet receives 80 percent of the wear. Oh, and as Woody Allen famously said, "eighty percent of success is just showing up." The Pareto Principle is everywhere, which is good. By focusing on just 20 percent of what's wrong, you can often expect to solve eighty percent of the problems.

A variety of different unrelated factors cause this to be true from case to case, but if we can get to the bottom of what causes some of them, maybe we'll find that one or more of those mechanisms is responsible for Zipf's law in language.

George Zipf himself thought language's interesting rank-frequency distribution was a consequence of the Principle of Least Effort: the tendency for life and things to follow the path of least resistance. Zipf believed it drove much of human behavior and hypothesized that as language developed in our species, speakers naturally preferred drawing from as few words as possible to get their thoughts out there.

It was easier. But in order to understand what was being said, listeners preferred larger vocabularies that gave more specificity, so that they had to do less work. The compromise between listening and speaking, Zipf felt, led to the current state of language. A few words are used often and many, many, many words are used rarely.

Recent papers have suggested that having a few short, often used, predictable words helps dissipate information load density on listeners, spacing out important vocab so that the information rate is more constant. This makes sense, and much has been learned by applying the least effort principle to other behaviors, but later researchers argued that for language, the explanation was even simpler.

Just a few years after Zipf's seminal paper, Benoit Mandelbrot showed that there may be nothing mysterious about Zipf's law at all, because even if you just randomly type on a keyboard, you will produce words distributed according to Zipf's law. It's a pretty cool point, and this is why it happens.

There are exponentially more different long words than short words. For instance, the English alphabet can be used to make 26 one-letter words, but 26 squared, or 676, two-letter words. Also, in random typing, whenever the space bar is pressed, a word terminates. Since there's always a certain chance that the space bar will be pressed, longer stretches of time before it happens are exponentially less likely than shorter ones.

The combination of these exponentials is pretty 'Zipf-y.' For example, if all 26 letters and the spacebar are equally likely to be typed, after a letter is typed and a word has begun, the probability that the next input will be a space, thus creating a one-letter word, is just one in 27.

And sure enough, if you randomly generate characters or hire a proverbial typing monkey, about one out of every 27 or 3.7 percent of the stuff between spaces will be single letters. Two letter words appear when, after beginning a word, any character but the space bar is hit - a 26 in 27 chance - and then the space bar. A three-letter word is the probability of a letter, another letter, and then a space.

If we divide by the number of unique words of each length there can be, we get the frequency of occurrence expected for any particular word given its length. For example, the letter V will make up about 0.142 percent of random typing. The word "Vsauce," about 0.00000000993 percent - roughly one word in every ten billion.
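Those two figures come straight out of the model. Here's a small Python sketch of the calculation, assuming 27 equally likely keys and treating whatever lands between spaces as a word.

```python
# Per-word probabilities under the random-typing model: 27 equally likely
# keys (26 letters plus the space bar), and a word is whatever lands
# between two spaces.

def word_probability(word: str) -> float:
    """Probability that a randomly typed word comes out as exactly `word`."""
    # A word only begins once a letter (not a space) is typed, so the first
    # character is uniform over the 26 letters.
    p = 1 / 26
    # Every later keystroke is one specific key out of 27: the remaining
    # letters of the word, then the space that ends it.
    p *= (1 / 27) ** len(word)
    return p

for w in ["v", "vsauce"]:
    p = word_probability(w)
    print(f"{w!r}: {100 * p:.12f} percent of randomly typed words")
# 'v'      -> about 0.142 percent
# 'vsauce' -> about 0.00000000993 percent (roughly one word in ten billion)
```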

Longer words are less likely, but watch this. Let's spread these frequencies out according to the ranks they'd take up on a most-often-used list. There are 26 possible one-letter words, so each of the top 26 ranked words is expected to occur about 0.14 percent of the time. The next 676 ranks will be taken up by two-letter words that each show up about 0.005 percent of the time.

If we extend each frequency according to how many members it has, we get Zipf. Subsequent researchers have detailed how changing up the initial conditions can smooth the steps out. Our mysterious distribution has been created out of nothing but the inevitabilities of math. So maybe there is no mystery.
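You can also just play the typing monkey yourself. Here's a minimal Python simulation; the stream length and random seed are arbitrary choices made for illustration.

```python
# Simulating the typing monkey: a long stream of uniformly random
# keystrokes, split on spaces, then ranked by how often each "word" occurs.

import random
import string
from collections import Counter

random.seed(0)                                    # arbitrary, for repeatability
keys = string.ascii_lowercase + " "               # 26 letters plus the space bar
stream = "".join(random.choices(keys, k=2_000_000))
words = stream.split()

ranked = Counter(words).most_common()

for rank in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    word, count = ranked[rank - 1]
    print(f"rank {rank:>3}: {word!r:>6} appears {count} times")
# Ranks 1-26 (the one-letter words) all appear at roughly the same rate,
# then the two-letter words form the next, much rarer tier -- the "steps"
# described above, which smooth out under messier starting conditions.
```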

Maybe words are just the result of humans randomly segmenting the observable world and the mental world into labels, and Zipf's law describes what naturally happens when you do that. Case closed. And as always... And as always, thanks for... wait a minute! Actual language is very different from random typing.

Communication is deterministic to a certain extent. Utterances and topics arrive based on what was said before. And the vocabulary we have to work with certainly isn't the result of purely random naming. For example, the monkey typing model can't explain why even the names of the elements, the planets, and the days of the week are used in language according to Zipf's law.

Sets like these are constrained by the natural world, and they're not the result of us randomly segmenting the world into labels. Furthermore, when given a list of novel words, words they've never heard or used before, like when prompted to write a story about alien creatures with strange names, people will naturally tend to use the name of one alien twice as often as another, three times as often as another...

Zipf's law appears to be built into our brains. Perhaps there is something about the way thoughts and topics of discussion ebb and flow that contributes to Zipf's law. Another way 'Zipf-ian' distributions occur is via processes that change according to how they've previously operated. These are called preferential attachment processes.

They occur when something - money, views, attention, variation, friends, jobs, anything really - is given out according to how much is already possessed. To go back to the carpet example, if most people walk from the living room to the kitchen across a certain path, furniture will be placed elsewhere, making that path even more popular.

The more views a video or image or post has, the more likely it is to get recommended automatically or make the news for having so many views, both of which give it more views. It's like a snowball rolling down a snowy hill. The more snow it accumulates, the bigger its surface area becomes for collecting more, and the faster it grows.

There doesn't have to be a deliberate choice driving a preferential attachment process. It can happen naturally. Try this. Take a bunch of paper clips and grab any two at random. Link them together and then throw them back in the pile. Now, repeat over and over again. If you grab paper clips that are already part of a chain, link 'em anyway.

More often than not, after a while you will have a distribution that looks 'Zipf-ian.' A small number of chains hold a disproportionate share of all the paperclips. This is simply because the longer a chain gets, the greater proportion of the whole it contains, which gives it a better chance of being picked up in the future and consequently made even longer.
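If you'd rather not raid the office supply closet, here's a rough Python sketch of the same experiment. The number of clips and the number of draws are arbitrary, and the exact shape you get depends on how long you keep linking.

```python
# Simulated paperclip chains: pick two clips uniformly at random and link
# their chains. Longer chains hold more clips, so they get picked -- and
# grow -- more often: preferential attachment.

import random
from collections import Counter

random.seed(0)                               # arbitrary, for repeatability
n_clips, n_draws = 10_000, 5_000             # arbitrary experiment size

chain_of = list(range(n_clips))              # which chain each clip belongs to
members = {i: [i] for i in range(n_clips)}   # clips belonging to each chain

for _ in range(n_draws):
    a, b = random.randrange(n_clips), random.randrange(n_clips)
    ca, cb = chain_of[a], chain_of[b]
    if ca == cb:
        continue                             # already one chain; linking changes nothing
    if len(members[ca]) < len(members[cb]):  # merge the smaller into the larger
        ca, cb = cb, ca
    for clip in members[cb]:
        chain_of[clip] = ca
    members[ca].extend(members.pop(cb))

sizes = sorted((len(clips) for clips in members.values()), reverse=True)
print("Ten longest chains:", sizes[:10])
print("Chains of each size:", Counter(sizes).most_common(5))
# A handful of long chains hold a disproportionate share of all the clips,
# while most chains stay tiny.
```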

The rich get richer, the big get bigger, the popular get popular-er. It's just math. Perhaps language's Zipf mystery is, if not caused by it, at least strengthened by preferential attachment. Once a word is used, it's more likely to be used again soon.

Critical points may play a role as well. Writing and conversation often stick to a topic until a critical point is reached, and the subject is changed and the vocabulary shifts. Processes like these are known to result in power laws. So, in the end, it seems tenable that all these mechanisms might collude to make Zipf's law the most natural way for language to be.

Perhaps some of our vocabulary and grammar was developed randomly, according to Mandelbrot's theory. And the natural way conversation and discussion follow preferential attachment and criticality, coupled with the principle of least effort when speaking and listening, may all be responsible for the relationship between word rank and frequency.

It's a shame that the answer isn't simpler, but it's fascinating because of the consequences it has on what communication is made of. Roughly speaking, and this is mind-blowing, nearly half of any book, conversation, or article will be nothing but the same 50 to 100 words.

And nearly the other half will be words that appear in that selection only once. That's not so surprising when you consider the fact that one word accounts for 6 percent of what we say. The top 25 most used words make up about a third of everything we say, and the top 100 about half.

Seriously. I mean, whether it's all the words in "Wet Hot American Summer," or all the words in Plato's "Complete Works" or in the complete works of Edgar Allan Poe or the Bible itself, only about 100 words are used for nearly half of everything written or said.

In Alice's Adventures in Wonderland, 44% and in Tom Sawyer, 49.8% of the unique words used appear only once in the book. A word that is used only once in a given selection of words is called a 'hapax legomenon.' Hapax legomena are vitally important to understanding languages.

If a word has only been found once in the entire known collection of an ancient language, it can be very difficult to figure out what it means. Now, there is no corpus of everything ever said or written in English, but there are very, very large collections, and it's fun to find hapax legomena in them.
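If you have a big text file handy, hunting for hapax legomena takes only a few lines of Python. The file name below is just a placeholder, and the crude regex tokenizer is an assumption - real corpus work splits words more carefully.

```python
# Find hapax legomena: words that occur exactly once in a text.

import re
from collections import Counter

with open("your_text.txt", encoding="utf-8") as f:    # placeholder file name
    words = re.findall(r"[a-z']+", f.read().lower())  # crude tokenizer

counts = Counter(words)
hapaxes = sorted(word for word, count in counts.items() if count == 1)

print(f"{len(counts)} distinct words, {len(hapaxes)} of them hapax legomena")
print(f"That's {len(hapaxes) / len(counts):.0%} of the unique words")
print("A few examples:", hapaxes[:10])
```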

For instance, and this probably won't be the case after I mention it, but the word "quizzaciously" is in the Oxford English Dictionary, but appears nowhere on Wikipedia or in the Gutenberg corpus or in the British National Corpus or the American National Corpus, yet a Google search for it turns up exactly one result.

Fittingly, in a book titled "ElderSpeak" that lists it as a 'rare word.' Quizzaciously, by the way, means "in a mocking manner," as in "The parodist rattled off quizzaciously, 'Hey, Vsauce. Michael here. But who is Michael and how much does here weigh?'" It's a little sad that quizzaciously has been used so infrequently. It's a fun word, but that's the way things go in a 'Zipf-ian' system.

Some things get all the love, some get little. Most of what you experience on a day-to-day basis is forgotten, forgettable. The Dictionary of Obscure Sorrows, as it often does, has a word for this - Olēka - the awareness of how few days are memorable.

I've been alive for almost 11,000 days, but I couldn't tell you something about each one of them. I mean, not even close. Most of what we do and see and think and say and hear and feel is forgotten at a rate quite similar to Zipf's law, which makes sense.

If a number of factors naturally selected for thinking and talking about the world with tools in a 'Zipf-ian' way, it makes sense we'd remember it that way too. Some things really well, most things hardly at all. But it bums me out sometimes because it means that so much is forgotten, even things that at the time you thought you could never forget.

My locker number - senior year - its combination, the jokes I liked when I saw a comedian on stage, the names of people I saw every day 10 years ago. So many memories are gone. When I look at all the books I've read and realize that I can't remember every detail from them, it's a little disappointing.

I mean, why even bother if the Pareto Principle dictates that my 'Zipf-ian' mind will consciously remember pretty much only the titles and a few basic reactions years later? Ralph Waldo Emerson makes me feel better. He once said, "I cannot remember the books I've read any more than the meals I have eaten. Even so, they have made me."

And as always, thanks for watching.
