What we learned from 5 million books - Erez Lieberman Aiden and Jean-Baptiste Michel
[Music] [Applause]
Everyone knows that a picture is worth a thousand words, but we at Harvard were wondering if this was really true. So we assembled a team of experts spanning Harvard, MIT, the American Heritage Dictionary, the Encyclopedia Britannica, and even our proud sponsors, Google. We cogitated about this for about four years, and we came to a startling conclusion, ladies and gentlemen: a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history changing over time. So many books actually have been written over the years, so we're thinking, well, the best way to learn from them is to read all of these millions of books. Now, of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an x-axis for that, which is the practical axis; this is very, very low.
Now, people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there's a company across the river called Google that started a digitization project a few years back that might just enable this approach. They have digitized millions of books, so what that means is one could use computational methods to read all of the books at the click of a button. That's very practical and extremely awesome.
Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books, and this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now, if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the library and digitized by Google, which has scanned 15 million books to date.
Now, when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where it was published, who the author is, when it was published, and what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of 5 million books, 500 billion words, a string of characters a thousand times longer than the human genome—a text which, when written out, would stretch from here to the moon and back ten times over—a veritable shard of our cultural genome.
Of course, what we did when faced with such outrageous hyperbole was what any self-respecting researchers would have done: we took a page out of XKCD and we said, "Stand back, we're going to try science." Now, of course, we're thinking, well, let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well, of course, you want to take the books and release the full text of these five million books.
Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So we have 5 million books, that's 5 million authors; that is 5 million plaintiffs in a massive lawsuit. So although that would be really, really awesome, again, that's extremely, extremely impractical. Now, again, we caved in, and we did the very practical approach, which was a bit less awesome. We said, well, instead of releasing the full text, we're going to release statistics about the books.
So we're going to take, for instance, "a gleam of happiness"; it's four words. We call it a "four-gram." We're going to tell you how many times a particular four-gram appeared in books published in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular phrase was used over time. We do that for all the words and phrases that appear in those books. That gives us a big table of two billion lines that tell us about the way culture has been changing.
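As a rough illustration of the table just described, here is a minimal Python sketch that counts how often each four-word phrase appears per publication year. The function names (`ngrams`, `count_by_year`), the naive whitespace tokenizer, and the two toy "books" are all assumptions made up for this example; this is not the pipeline used to build the actual data set.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the list of n-word phrases in text (naive whitespace tokenizer)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_by_year(books, n):
    """Map each n-word phrase to a Counter of {year: occurrences}.

    `books` is an iterable of (year, text) pairs; this input format is
    an assumption for the sketch, not the real data format.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Made-up example data, not real book text.
books = [
    (1801, "she felt a gleam of happiness"),
    (1802, "a gleam of happiness and a gleam of happiness again"),
]
table = count_by_year(books, 4)

# One row of the table: the yearly counts for a single phrase.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)
```

Each key of `table` plays the role of one line in the big table: a phrase plus its count for every year. A year with no occurrences simply reads as zero.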
So those two billion lines, we call them two billion n-grams. What do they tell us? Well, the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did, and so I might say, "Yesterday I throve." Alternatively, I could say, "Yesterday I thrived." Well, which one should I use? Hmm, how to know?
Well, as of about six months ago, the state of the art in this field is that you would, for instance, go up to the following psychologist with fabulous hair and you'd say, "Steve, you're an expert on the irregular verbs; what should I do?" And he'd tell you, well, most people say "thrived," but some people say "throve." Now you also knew more or less that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair, "Tom, what should I say?" He'd say, well, in my day, most people throve, but some thrived.
So now what I'm just going to show you is raw data: two rows from this table of two billion entries. What you're seeing is year-by-year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows, so the entire data set is a billion times more awesome than this slide. [Applause]
Now, there are many other pictures that are worth 500 billion words. For instance, this one: if you just type in "influenza," you will see peaks at the time when you knew big flu epidemics were actually killing millions of people around the globe. If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature. You might also want to have a look at this particular n-gram and tell Nietzsche that God is not dead. Although you might agree that he might need a better publicist.
Yes, you can get some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700 and 1800 and 1900, no one cared. Through the '30s and '40s, no one cared. Suddenly, in the mid-'40s, there started to be a buzz; people realized that 1950 was going to happen and it could be big.
But nothing got people interested in 1950 like the year 1950. People were walking around obsessed; they couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally, in 1954, someone woke up and realized that 1950 had gotten somewhat passé, and just like that, the bubble burst.
Now the story of 1950 is the story of every year that we have on record with a little twist because now we've got these nice charts, and because we have these nice charts, we can measure things. We can say, well, how fast does the bubble burst? And it turns out that we can measure that very precisely; equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly now.
A little piece of career advice: for those of you who seek to be famous, you can learn from the 25 most famous political figures, authors, actors, and so on. So if you want to become famous early on, you should be an actor because then fame starts rising by the end of your 20s. You're still young; it's really great. Now, if you can wait a little bit, you should be an author because then you rise to very great heights, like Mark Twain, for instance, who is extremely famous.
But if you want to reach the very top, you should delay gratification and, of course, become a politician, right? So here you will become famous by the end of your 50s and become very, very famous afterward. Scientists also tend to get famous when they're much, much older; like, for instance, biologists and physicists can be almost as famous as actors. One mistake you should not make is to become a mathematician. If you do that, you might think, "Oh great, I'm going to do my best work when I'm in my 20s," but guess what? Nobody will really care.
There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887, and this looks like the normal trajectory of a famous person. He gets more and more, and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945 before rebounding afterwards. And of course, what we're seeing is the fact that Marc Chagall was a Jewish artist in Nazi Germany.
Now, these signals are actually so strong that we don't need to know that someone was censored; we can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect, and we compare that to the fame that we observe and we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
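The calculation just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `suppression_index`, the idea of summarizing "fame" in each period as a single number (for example, a mean yearly frequency), and the sample values are all hypothetical and chosen for illustration only.

```python
def suppression_index(fame_before, fame_during, fame_after):
    """Observed fame in a period divided by the fame we would expect.

    Expected fame is the average of the fame before and the fame after
    the period. How "fame" is measured (here: one number per period)
    is an assumption of this sketch.
    """
    expected = (fame_before + fame_after) / 2
    if expected == 0:
        raise ValueError("expected fame is zero, so the index is undefined")
    return fame_during / expected


# Made-up numbers: fame drops sharply during the period, then rebounds.
index = suppression_index(fame_before=4.0, fame_during=0.5, fame_after=6.0)
print(index)
```

An index near one means the observed fame matches the expectation; a value far below one is the "might be being suppressed" case, and a value far above one is the "maybe benefiting from propaganda" case.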
Now you can actually look at the distribution of suppression indices over a whole population. So for instance, here, the distribution of suppression indices for 5,000 people picked in the English books, where there's no known suppression, would be like this: basically tightly centered around one. What you expect is basically what you observe. This is the distribution you see in Nazi Germany. It's very different; it's shifted to the left. People are talked about less than they should have been, but much more importantly, the distribution is much wider. There are many people who end up on the far left of this distribution who are talked about ten times less than they should have been. And then also many people on the far right who seem to benefit from propaganda. This picture here is the hallmark of censorship in the book record.
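One way to put numbers on "shifted to the left" and "much wider" is to summarize each population's suppression indices on a logarithmic scale, where an index of one maps to zero and "ten times less" maps to minus one. The Python sketch below is an illustration only: the function name `summarize`, the choice of median and standard deviation as summaries, and both lists of indices are assumptions invented for this example, not figures from the study.

```python
import math
import statistics


def summarize(indices):
    """Return (median, spread) of suppression indices on a log10 scale.

    The log scale treats "ten times less" and "ten times more" as
    equally far from one. The spread is the population standard
    deviation of the log values.
    """
    logs = [math.log10(x) for x in indices]
    return statistics.median(logs), statistics.pstdev(logs)


# Made-up indices, for illustration only.
control = [0.8, 0.9, 1.0, 1.1, 1.25]   # tightly centered around one
censored = [0.1, 0.2, 0.5, 1.0, 5.0]   # shifted left and much wider

control_median, control_spread = summarize(control)
censored_median, censored_spread = summarize(censored)

print("control: ", control_median, control_spread)
print("censored:", censored_median, censored_spread)
```

A median below zero corresponds to the leftward shift, and a larger spread corresponds to the wider distribution, with the far-left and far-right tails both contributing to it.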
So culturomics is what we call this method. It's kind of like genomics, except genomics is kind of a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar; it's the application of massive scale data collection and analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record.
The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray, and Will Brockman over at Google, saw the prototype of the n-gram viewer and they said, "This is so fun, we have to make this available for people." So in two weeks flat, the two weeks before our paper came out, they coded up a version of the n-gram viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately and also browse examples of all the various books in which your n-gram appears.
Now this was used over a million times on the first day, and this is really the best of all the queries, right? So people want to be their best, put their best foot forward, but it turns out in the 18th century people didn't really care about that at all. They didn't want to be their best; they wanted to be their "beft." So what happened is, of course, this is just a mistake, right? It's not that they strove for mediocrity; it's just that the "s" used to be written differently, kind of like an "f."
Now, of course, Google didn't pick this up at the time, so we reported this in the Science article that we wrote. But this should just stand as a reminder that although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the best standards in the sciences.
People have been using this for all kinds of fun purposes, actually. We're not going to have to talk; we'll just show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one-A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight-A "aaaaaaaargh." This person studied all the "arghs" from one through eight A's, and it turns out that the less frequent "arghs" are, of course, the ones that correspond to things that are more frustrating, except, oddly, in the early '80s. We think that might have something to do with Reagan.
All right, the bottom line is, okay, there are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books; that's 12% of all the books that have ever been published. It's pretty big. It's a sizable chunk of human culture. There's much more to human culture—there's manuscripts, there's newspapers, there's things that are not text, like art and paintings.
All of this will come to be on our computers, on computers across the world, and when that happens, it will transform the way we understand our past, our present, and human culture. Thank you very much. [Applause]