The birth of a word - Deb Roy

·Nov 9, 2024

[Music] Imagine if you could record your life: everything you said, everything you did, available in a perfect memory store at your fingertips, so you could go back and find memorable moments and relive them or sift through traces of time and discover patterns in your own life that previously had gone undiscovered.

Well, that's exactly the journey that my family began five and a half years ago. This is my wife and collaborator, Rupal, and on this day, at this moment, we walked into the house with our first child, our beautiful baby boy. We walked into a house with a very special home video recording system. This moment and thousands of other moments special for us were captured in our home because in every room in the house, if you looked up, you'd see a camera and a microphone, and if you looked down, you'd get this bird's-eye view of the room.

Here's our living room, the baby bedroom, kitchen, dining room, and the rest of the house, and all of these fed into a disk array that was designed for continuous capture. So here we are flying through a day in our home as we move from sunlit morning through incandescent evening and finally lights out for the day. Over the course of three years, we've recorded eight to ten hours a day, amassing roughly a quarter million hours of multitrack audio and video.

So you're looking at a piece of what is by far the largest home video collection ever made. As for what this data represents for our family at a personal level, the impact has already been immense, and we're still learning its value. Countless unsolicited, natural moments, not posed moments, are captured there, and we're starting to learn how to discover them and find them.

But there's also a scientific reason that drove this project, which was to use this kind of natural longitudinal data to understand the process of how a child learns language, that child being my son. With many privacy provisions put in place to protect everyone who's recorded in the data, we made elements of the data available to my trusted research team at MIT, so we could start teasing apart patterns in this massive dataset trying to understand the influence of social environments on language acquisition.

So we're looking here at one of the first things we started to do. This is my wife and I cooking breakfast in the kitchen, and as we move through space and through time, a very everyday pattern of life in the kitchen. In order to convert this opaque 90 thousand hours of video into something we can start to see, we use motion analysis to pull out, as we move through space and through time, what we call space-time worms. This has become a part of our toolkit for being able to look and see where the activities are in the data and, with it, trace the patterns of, in particular, where my son moved throughout the home, so we could focus our transcription efforts on all of the speech environment around my son: all the words that he heard from myself, my wife, our nanny, and, over time, the words he began to produce.

So with that technology and that data, and the ability to transcribe speech with machine assistance, we've now transcribed well over seven million words of our home transcripts.

And with that, let me take you now for a first tour into the data. So you've all, I'm sure, seen time-lapse videos where a flower will blossom as you accelerate time. I'd like you to now experience the blossoming of a speech form. My son, soon after his first birthday, would say "Gaga" to mean water, and over the course of the next half year, he slowly learned to approximate the proper adult form: "water."

So we're going to cruise through half a year in about 40 seconds. No video here, so you can focus on the sound, the acoustics of a new kind of trajectory. [Music] So he didn't just learn "water." Over the course of the 24 months, the first two years that we really focused on, this is a map of every word he learned in chronological order.

And because we have full transcripts, we've identified each of the 503 words that he learned to produce by his second birthday. He was an early talker, and so we started to analyze why. Why were certain words born before others? This is one of the first results that came out of our study a little over a year ago that really surprised us.

The way to interpret this apparently simple graph is: on the vertical is an indication of how complex caregiver utterances are, based on the length of utterances, and the horizontal axis is time. All of the data we aligned based on the following idea: every time my son would learn a word, we would trace back and look at all of the language he heard that contained that word, and we would plot the relative length of the utterances.

What we found was this curious phenomenon that caregiver speech would systematically dip to a minimum, making language as simple as possible, and then slowly ascend back up in complexity. The amazing thing was that the bounce, that dip, lined up almost precisely with when each word was born, word after word systematically.
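The alignment described in the talk can be sketched roughly as follows. This is a minimal illustration, not the research team's actual pipeline: the function name `aligned_utterance_lengths`, the 30-day bin size, and the toy utterances are all assumptions made up for this example, and utterance length is simply counted in whitespace-separated words.

```python
from collections import defaultdict
from statistics import mean

def aligned_utterance_lengths(utterances, birth_day, bin_days=30):
    """Mean caregiver utterance length per time bin, relative to each word's birth.

    utterances: iterable of (day, text) pairs of caregiver speech.
    birth_day: dict mapping a word to the day the child first produced it.
    Returns {relative_bin: mean length in words}; bin 0 starts at the birth day.
    """
    lengths = defaultdict(list)
    for day, text in utterances:
        tokens = text.lower().split()
        for word in set(tokens):
            if word in birth_day:
                # Shift time so that every word's birth lands at zero.
                rel_bin = (day - birth_day[word]) // bin_days
                lengths[rel_bin].append(len(tokens))
    return {b: mean(v) for b, v in sorted(lengths.items())}

# Invented toy data, for illustration only.
toy_utterances = [
    (40, "would you like some water to drink now"),
    (85, "want some water"),
    (100, "water"),
    (130, "here is your water cup"),
]
toy_curve = aligned_utterance_lengths(toy_utterances, {"water": 100})
print(toy_curve)
```

Because every word is shifted so that its birth sits at bin 0, curves for many different words can be averaged on the same axis, which is what makes a shared dip visible.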

So it appears that all three primary caregivers, myself, my wife, and our nanny, were systematically—and I would think subconsciously—restructuring our language to meet him at the moment of the birth of a word and bring him gently into more complex language. And the implications of this, there are many, but one I just want to point out is that there must be amazing feedback loops.

Of course, my son is learning from his linguistic environment, but the environment is learning from him. That environment, people, are in these tight feedback loops and creating a kind of scaffolding that has not been noticed until now.

But that's looking at the speech context. What about the visual context? We're now looking at what you can think of as a dollhouse cutaway of our home. We've taken those circular fisheye lens cameras, and we've done some optical correction, and then we can bring it into three-dimensional life. So welcome to my home.

This is a moment, one moment captured across multiple cameras. The reason we did this is to create the ultimate memory machine, where you can go back and interactively fly around and then breathe video life into this system. What I'm going to do is give you an accelerated view of 30 minutes, again, of just life in the living room.

That's me and my son on the floor, and there's video analytics that are tracking our movements. My son is leaving red ink; I'm leaving green ink. We're now on the couch, looking out through the window at cars passing by, and finally, my son is playing in a walking toy by himself.

Now we freeze the action, 30 minutes. We turn time into the vertical axis, and we open up for a view of these interaction traces we've just left behind, and we see these amazing structures. These little knots of two colors of thread we call social hotspots; the spiral thread we call a solo hotspot, and we think that these affect the way language is learned.

What we'd like to do is start understanding the interaction between these patterns and the language that my son is exposed to, to see if we can predict how the structure of when words are heard affects when they're learned. So in other words, the relationship between words and what they're about in the world.

So here's how we're approaching this. In this video, again, my son is being traced out; he's leaving red ink behind, and there's our nanny by the door. She offers water, and off go the two worms over to the kitchen to get water. What we've done is used the word "water" to tag that moment, that bit of activity.

Now we take the power of data and take every time my son ever heard the word "water" and the context he saw it in, and we use it to penetrate through the video and find every activity trace that co-occurred with the instance of water. What this data leaves in its wake is a landscape. We call these wordscapes. This is the wordscape for the word "water," and you can see most of the action is in the kitchen. That's where those big peaks are over to the left.
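One simple way to picture how a wordscape could be built is as a histogram over floor positions, counting only the tracked activity that falls close in time to an utterance of the word. The sketch below assumes that reading; the function name `wordscape`, the one-meter grid cells, the five-second window, and the toy data are all hypothetical and are not taken from the project itself.

```python
from collections import Counter

def wordscape(word, utterances, positions, cell=1.0, window=5.0):
    """Count tracked activity per floor-grid cell near each time `word` was heard.

    utterances: list of (time_s, text) pairs.
    positions: list of (time_s, x_m, y_m) samples from a movement track.
    Returns a Counter mapping (col, row) grid cells to co-occurring sample counts.
    """
    heard_at = [t for t, text in utterances if word in text.lower().split()]
    grid = Counter()
    for t, x, y in positions:
        # Keep only samples within `window` seconds of some utterance of the word.
        if any(abs(t - h) <= window for h in heard_at):
            grid[(int(x // cell), int(y // cell))] += 1
    return grid

# Invented toy data, for illustration only.
toy_speech = [(10.0, "do you want water"), (200.0, "bye bye")]
toy_track = [(8.0, 2.3, 4.7), (12.0, 2.8, 4.1), (100.0, 9.0, 9.0), (201.0, 0.2, 0.5)]
water_scape = wordscape("water", toy_speech, toy_track)
bye_scape = wordscape("bye", toy_speech, toy_track)
print(water_scape, bye_scape)
```

The height of each peak in the landscape then corresponds to the count in one grid cell, so a word that is mostly heard in one place produces a few tall peaks there.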

And just for contrast, we can do this with any word. We can take the word "bye" as in "goodbye," and we're now zoomed in over the entrance to the house. We look and we find, as you'd expect, a contrast in the landscape where the word "bye" occurs in a much more structured way.

So we're using these structures to start predicting the order of language acquisition, and that's ongoing work now in my lab, which we're peering into now at MIT. This is at the Media Lab. This has become my favorite way of videographing just about any space.

Three of the key people in this project, Philip DeCamp, Rony Kubat, and Brandon Roy, are pictured here. Philip has been a close collaborator in all the visualizations you're seeing, and Michael Fleischman was another PhD student in my lab who worked with me on this home video analysis. He made the following observation: "Just the way we're analyzing how language connects to events, which provide common ground for language, that same idea we can take out of your home, Deb, and we can apply it to the world of public media."

So our effort took an unexpected turn. Think of mass media as providing common ground, and you have the recipe for taking this idea to a whole new place. We've started analyzing television content using the same principles, analyzing event structure of a TV signal, episodes of shows, commercials, all of the components that make up the event structure.

We're now, with satellite dishes, pulling in and analyzing a good part of all the TV being watched in the United States, and you don't have to now go instrument living rooms with microphones to get people's conversations; you just tune into publicly available social media feeds. So we're pulling in about 3 billion comments a month, and then the magic happens.

You have the event structure, the common ground that the words are about coming out of the television feeds; you've got the conversations that are about those topics, and through semantic analysis—and this is actually real data you're looking at from our processing—each yellow line is showing a link being made between a comment in the wild and a piece of event structure coming out of the television signal.
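To make the linking step concrete, here is a deliberately simplified sketch. The talk says the links come from semantic analysis; this example swaps in a much cruder stand-in, plain keyword overlap inside a time window, and every name in it (`link_comments`, `max_lag`, `min_overlap`, the toy events and comments) is an assumption made up for illustration.

```python
def link_comments(comments, events, max_lag=600, min_overlap=1):
    """Link each comment to broadcast events it plausibly refers to.

    comments: list of (comment_id, time_s, text).
    events: list of (event_id, start_s, end_s, keywords), keywords a set of lowercase words.
    A link is made when the comment is posted during the event, or up to `max_lag`
    seconds after it ends, and shares at least `min_overlap` keywords with it.
    """
    links = []
    for cid, t, text in comments:
        words = set(text.lower().split())
        for eid, start, end, keywords in events:
            in_window = start <= t <= end + max_lag
            if in_window and len(words & keywords) >= min_overlap:
                links.append((cid, eid))
    return links

# Invented toy data, for illustration only.
toy_events = [
    ("speech", 0, 3600, {"president", "union", "speech"}),
    ("ad", 3600, 3630, {"truck"}),
]
toy_comments = [
    ("c1", 1200, "great speech tonight"),
    ("c2", 3700, "that truck ad was loud"),
    ("c3", 90000, "nice speech"),
]
toy_links = link_comments(toy_comments, toy_events)
print(toy_links)
```

Each returned pair plays the role of one yellow line: a comment tied to a piece of event structure. The third toy comment mentions a matching keyword but arrives far outside the time window, so it is left unlinked.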

The same idea now can be built up, and we get this wordscape, except now words are not assembled in my living room; instead, the context, the common ground, the activities are the content on television that's driving the conversations. What we're seeing here, these skyscrapers now are commentary that are linked to content on television.

Same concept, but looking at communication dynamics in a different, very different sphere. So fundamentally, rather than, for example, measuring content based on how many people are watching, this gives us the basic data for looking at engagement properties of content. Just like we can look at feedback cycles and dynamics in, you know, in a family, we can now open up the same concepts and look at much larger groups of people.

This is a subset of data from our database—just 50 thousand out of several million—and the social graph that connects them through publicly available sources. If you put them on one plane, a second plane is where the content lives. So we have the programs and the sporting events and the commercials, and all of the link structures that tie them together make a content graph.

And then the important third dimension: each of the links that you're seeing rendered here is an actual connection made between something someone said and a piece of content, and there are, again, now tens of millions of these links that give us the connective tissue of social graphs and how they relate to content.

We can now start to probe the structure in interesting ways. So if we, for example, trace the path of one piece of content that drives someone to comment on it, and then we follow where that comment goes and look at the entire social graph that becomes activated, then trace back to see the relationship between that social graph and content, very interesting structures become visible.

We call this a co-viewing clique—a virtual living room, if you will—and there are fascinating dynamics at play. It's not one-way; a piece of content, an event, causes someone to talk; they talk to other people, and that drives tune-in behavior back into mass media, and you have these cycles that drive overall behavior.
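The tracing described above, from a piece of content to its commenters, out through the social graph, and back to content, can be sketched as a breadth-first search. This is only an illustrative reading of the talk: the function names, the `followers` mapping, the two-hop limit, and the toy graph are hypothetical.

```python
from collections import deque

def activated_audience(content_id, links, followers, max_hops=2):
    """Users reached when comments on `content_id` spread through a follower graph.

    links: list of (user, content_id) comment-to-content links.
    followers: dict mapping a user to the set of users who follow them.
    """
    seeds = {u for u, c in links if c == content_id}
    reached = set(seeds)
    queue = deque((u, 0) for u in sorted(seeds))
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # Do not expand beyond the hop limit.
        for follower in followers.get(user, ()):
            if follower not in reached:
                reached.add(follower)
                queue.append((follower, hops + 1))
    return reached

def related_content(audience, links):
    """Trace back: all content that members of `audience` commented on."""
    return {c for u, c in links if u in audience}

# Invented toy data, for illustration only.
toy_graph_links = [("ann", "show"), ("bob", "game"), ("cat", "show")]
toy_followers = {"ann": {"bob"}, "bob": {"dan"}, "cat": set()}
toy_audience = activated_audience("show", toy_graph_links, toy_followers)
toy_related = related_content(toy_audience, toy_graph_links)
print(sorted(toy_audience), sorted(toy_related))
```

The set returned by `related_content` is where the cycle closes: content that the activated audience also talks about, which is the kind of structure the talk calls a co-viewing clique.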

Another example, very different: another actual person in our database, and we're finding at least hundreds, if not thousands, of these. We've given this person a name: this is a pro-amateur, or pro-am, media critic who has this high fan-out rate. A lot of people are following this person; they are very influential, and they have a propensity to talk about what's on TV. So this person is a key link in connecting mass media and social media together.

One last example from this data: sometimes it's actually the piece of content that is special. So if we go and look at this piece of content, President Obama's State of the Union address from just a few weeks ago, and look at what we find in the same dataset at the same scale, the engagement properties of this piece of content are truly remarkable: a nation exploding in conversation in real time in response to what's on the broadcast.

Of course, through all of these lines is flowing unstructured language. We can x-ray and get a real-time pulse of a nation, a real-time sense of the social reactions in the different circuits in the social graph being activated by content.

So to summarize, the idea is this: as our world becomes increasingly instrumented and we have the capabilities to collect and connect the dots between what people are saying and the context they're saying it in, what's emerging is an ability to see new social structures and dynamics that have previously not been seen.

It's like building a microscope or telescope and revealing new structures about our own behavior around communication. I think the implications here are profound, whether it's for science, for commerce, for government, or perhaps most of all for us as individuals.

So just to return to my son, when I was preparing this talk, he was looking over my shoulder, and I showed him the clips I was gonna show to you today, and I asked him for permission. Granted. And then I went on to reflect: "Isn't it amazing, this entire database, all these recordings, I'm gonna hand off to you and to your sister," who arrived two years later, "and you guys are gonna be able to go back and re-experience moments that you could never, with your biological memory, possibly remember the way you can now."

He was quiet for a moment. I thought, "What am I thinking? He's five years old; he's not gonna understand this." And just as I was having that thought, he looked up at me and said, "So that when I grow up, I can show this to my kids." I thought, "Wow, this is powerful stuff."

So I want to leave you with one last memorable moment from our family. This is the first time our son took more than two steps at once, captured on film. And I really want you to focus on something as I take you through it. It's a cluttered environment; it's natural life. My mother's in the kitchen cooking, and, of all places, in the hallway, I realize he's about to do it, about to take more than two steps.

And so you hear me encouraging him, realizing what's happening, and then the magic happens. Listen very carefully—about three steps in, he realizes something magic is happening, and the most amazing feedback loop of all kicks in, and he takes a breath in, and he whispers, "Wow." And instinctively, I echo back the same.

So let's fly back in time to that memorable moment: Nice walking. [Music] [Applause]

When I think about succeeding in business, I think there's a couple of considerations. First and foremost is teamwork. Today's problems are just too complicated to be solved as an individual, and the opportunity to work with others collaboratively, where you can build on each other's ideas, I think is particularly important.

I think a second area that's quite important is the understanding of different disciplines. It is critical that you be an expert in your major, but it is absolutely essential that you have the ability to understand input from other disciplines and be able to incorporate that to make effective decisions.

And then last but not least, we are in a global business community, and so the opportunity to understand cultures, different cultures around the world, to be able to incorporate some of the learning from those cultures and to incorporate that into your business decisions is essential to success.

I have three degrees from Cornell, so I'd say just about everything that has prepared me came from Cornell, and I am indebted to the University for that experience. But as I think particularly about my business school career, I think the exposure to a variety of disciplines and the wealth of resources across the university were exceptionally helpful.
