Baidu's AI Lab Director on Advancing Speech Recognition and Simulation
Today we have Adam Coates here for an interview. Adam, you run the AI Lab at Baidu in Silicon Valley. Could you give us a quick intro and explain what Baidu is for people who don't know?
Yeah, so Baidu is actually the largest search engine in China. It turns out the internet ecosystem in China is this incredibly dynamic environment, and Baidu turned out to be an early technology leader and really established itself in PC search, but then also remade itself in the mobile revolution. Increasingly today it's becoming an AI company, recognizing the value of AI for a whole bunch of different applications, not just search.
Okay, and what do you do exactly?
I'm the director of the Silicon Valley AI Lab, which is one of four labs within Baidu Research. Especially as Baidu is becoming an AI company, the need for a team to be on the bleeding edge and understand all of the current research, to do a lot of basic research ourselves, but also to figure out how we can translate that into business and product impact for the company, is increasingly critical. That's what Baidu Research is here for. The AI Lab in particular was founded because we recognized how extreme this problem was about to get.
I think deep learning research and AI research right now are flying forward so rapidly that the need for teams that can both understand that research and quickly translate it into something businesses and products can use is more critical than ever. So we founded the AI Lab to try to close that gap and help the company move faster.
And so how do you break up your time between doing basic research for AI and actually implementing it, bringing it forward into a product?
There's no hard and fast rule to this. One of the things we try to repeat to ourselves every day is that we're mission-oriented. The mission of the AI Lab is precisely to create AI technologies that can have a significant impact on at least 100 million people.
We chose this to keep bringing ourselves back to the final goal: we want all the research we do to ultimately end up in the hands of users. Sometimes that means we spot something that needs to happen in the world to really change technology for the better and to help Baidu,
but no one knows how to solve it, and there's a basic research problem there that someone has to tackle. So we'll go back to our visionary stance, think about the long term, and invest in research.
Then, as we have success there, we shift back to the other foot and take responsibility for carrying all of that to a real application, making sure we don't just solve the 90% that you might put in, say, your research paper, but also solve the last mile. We get to the 99.9%.
So maybe the best way to do this is to explain something that started with research here and how it's been brought all the way to a full-on product that exists.
I'll give you an example. We've spent a ton of time on speech recognition. A few years ago, speech recognition was one of these technologies that always felt pretty good, but not good enough.
Traditionally, speech recognition systems have been heavily optimized for things like mobile search. So if you hold your phone up close to your mouth and say a short query in that stilted, non-human voice, the systems could figure it out, and they're getting quite good.
I think the speech engine we've built at Baidu, called Deep Speech, is actually superhuman for these short queries, where you have no context and people can have thick accents.
That speech engine actually started out as a basic research project. We looked at this problem and said, "Gosh, what would happen if speech recognition were human level for every product you ever used?"
Whether you're in your home or in your car, whether you pick up your phone and hold it up close or hold it away, if I'm in the kitchen and my toddler is yelling at me, can I still use a speech interface? Could it work as well as a human being understands us?
And so how did you do that? What was the basic research that moved it forward to a place where it's useful?
We had the hypothesis that maybe the thing holding back a lot of the progress in speech was actually just scale.
Maybe if we took some of the same basic ideas we could already see in the research literature and scaled them way up, put in a lot more data, invested a lot of time in solving computational problems, and built a much larger neural network than anyone had built before for this problem, we could just get better performance.
And lo and behold, with a lot of effort, we ended up with this pretty amazing speech recognition model that, like I said, in Mandarin at least, is actually superhuman.
You can actually sit there and listen to a voice query that someone is trying out, and you'll have native speakers sitting around debating with each other, wondering what the heck the person is saying.
Wow.
And then the speech engine will give an answer and everybody goes, "Oh, that's what it was," because it's just such a thick accent, from perhaps someone in rural China.
How much data do you have to give it to train it? To train it on a new language, that is, because I think on the site I saw it was English and Mandarin.
Yeah, like if I wanted German, how much would I have to give it?
So one of the big challenges for these systems is that they need a ton of data. Our English system uses something like 10,000 to 20,000 hours of audio.
The Mandarin systems for our top-end products are using even more. So this certainly means the technology is at a state where, to get that superhuman performance, you've really got to care about it.
So for Baidu voice search, maps, things like that that are flagship products, we can put in the capital and the effort to do that.
But one of the exciting things going forward in the basic research we think about is: how do we get around that? How can we develop machine learning systems that get you human performance on every product, and do it with a lot less data?
So what I was wondering then: did you see that bird thing that was floating around the internet this week?
They claim they don't need all that much audio data to emulate your voice, or simulate it, whatever they call it. You guys have a similar project going on, right?
That's right, yeah. We're working on text-to-speech.
Why can they achieve that with less data?
I think the technical challenge behind all of this is that there are sort of two things we can do. One is to try to share data across many applications; take text-to-speech as one example.
If I learn to mimic lots of different voices and then you give me the 1,001st voice, you'd hope that the first thousand taught you virtually everything you need to know about language, and that what's left is really some idiosyncratic change you could learn from very little data.
So that's one possibility.
The other side of it, which is much more important for things like the speech recognition we were talking about, is that we want to move from supervised learning, where a human being has to give you the correct answer in order for you to train your neural network, to unsupervised learning, where I could just give you a lot of raw audio and have you learn the mechanics of speech before I ask you to learn a new language.
Hopefully that can also bring down the amount of data that we need.
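To make the first of those ideas concrete, here is a minimal, hypothetical sketch (in PyTorch, and not Baidu's actual code) of sharing data across voices with a learned speaker embedding: after a shared network is trained on many speakers, a brand-new voice is added by fitting only a small embedding vector, which is why very little of the new speaker's audio is needed.

```python
# Hypothetical sketch of speaker-embedding adaptation (not Baidu's code).
# A shared decoder is trained on many voices; adapting to a new voice means
# fitting only a small embedding vector, which needs very little data.
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    def __init__(self, n_speakers, text_dim=64, spk_dim=16, audio_dim=80):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)   # one vector per known voice
        self.decoder = nn.Sequential(                          # shared across all voices
            nn.Linear(text_dim + spk_dim, 256),
            nn.ReLU(),
            nn.Linear(256, audio_dim),                         # e.g. one mel-spectrogram frame
        )

    def forward(self, text_features, speaker_id):
        spk = self.speaker_emb(speaker_id)                     # (batch, spk_dim)
        return self.decoder(torch.cat([text_features, spk], dim=-1))

# Adapting to voice number 1,001: freeze everything learned from the first
# thousand voices and optimize only the new speaker's embedding.
model = MultiSpeakerTTS(n_speakers=1000)
for p in model.parameters():
    p.requires_grad = False

new_voice = nn.Parameter(torch.randn(1, 16) * 0.01)            # the only trainable tensor
optimizer = torch.optim.Adam([new_voice], lr=1e-3)

def adapt_step(text_features, target_frames):
    """One gradient step on a tiny amount of the new speaker's audio."""
    spk = new_voice.expand(text_features.size(0), -1)
    pred = model.decoder(torch.cat([text_features, spk], dim=-1))
    loss = nn.functional.l1_loss(pred, target_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```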
And so on the technical side, could you give us an overview of how that actually works? How do you process a voice for text-to-speech?
Let's do both, actually, because I'm super interested.
All right, so let's start with speech recognition. Before we go and train a speech system, what we have to do is collect a whole bunch of audio clips.
For example, if we wanted to build a new voice search engine, I would need lots of examples of people speaking to me, giving me little voice queries.
Then I need human annotators, or some kind of system that can give me ground truth, to tell me for a given audio clip what the correct transcription was.
Once you've done that, you can ask a deep learning algorithm to learn the function that predicts the correct text transcript from the audio clip.
This is called supervised learning. It's an incredibly successful framework, and we're really good at this for lots of different applications.
But the big challenge there is those labels: someone has to be able to sit there and give you, say, 10,000 hours' worth of labels, which can be really expensive.
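As a rough illustration of that setup, the labeled data is just a collection of pairs, each an audio clip with its human transcription; the file names and queries below are invented for the example.

```python
# Rough illustration of supervised speech data: each example pairs an audio
# clip with its human transcription. File names here are invented.
import torchaudio
from torch.utils.data import Dataset

class LabeledSpeech(Dataset):
    """(waveform, sample_rate, transcript) triples for supervised training."""
    def __init__(self, manifest):
        self.items = manifest            # list of (audio_path, transcript) pairs

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        path, text = self.items[i]
        waveform, sample_rate = torchaudio.load(path)   # the input
        return waveform, sample_rate, text              # the ground-truth label

pairs = [("clips/query_0001.wav", "weather in palo alto"),
         ("clips/query_0002.wav", "navigate to the nearest gas station")]
dataset = LabeledSpeech(pairs)
```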
Okay, and how is it actually recognizing speech? What is the software doing to recognize the intonation of a word?
Well, traditionally what you would have to do is break these problems down into lots of different stages. So I, as a speech recognition expert, would sit down and think a lot about the mechanics of the language.
For Chinese, you would have to think about tonality and how to break up all the different sounds into some intermediate representation.
Then you would need a sophisticated piece of software we call a decoder that goes through and tries to map that sequence of sounds to the possible words it might represent.
Oh, okay.
So you have all these different pieces, and you'd have to engineer each one, often with its own expert knowledge.
But Deep Speech and all of the new deep learning systems we're seeing now try to solve this in one fell swoop.
So really, the answer to your question is kind of the vacuous one: once you give me the audio clips and the characters they need to output, a deep learning algorithm can just learn to predict those characters directly.
In the past, it always looked like there was some fundamental problem, that maybe we could never escape the need for these hand-engineered representations.
But it turns out that once you have enough data, all of those things go away.
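A minimal end-to-end sketch in that spirit (not the actual Deep Speech model) might map audio features straight to per-frame character probabilities and train with CTC loss, so no phoneme inventory or pronunciation lexicon sits in the middle; the model sizes and dummy tensors below are purely illustrative.

```python
# Minimal end-to-end sketch (not the actual Deep Speech model): a recurrent
# network maps audio features straight to character probabilities and is
# trained with CTC loss, with no hand-built phoneme stage in between.
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz '"      # 28 symbols; CTC blank takes index 0

class CharSpeechModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, len(ALPHABET) + 1)    # +1 for the blank symbol

    def forward(self, features):                           # (batch, time, n_mels)
        h, _ = self.rnn(features)
        return self.out(h).log_softmax(dim=-1)             # per-frame character log-probs

model = CharSpeechModel()
ctc = nn.CTCLoss(blank=0)

features = torch.randn(4, 200, 80)                         # 4 dummy clips, 200 frames each
log_probs = model(features).transpose(0, 1)                # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, len(ALPHABET) + 1, (4, 20))     # dummy character indices
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 20, dtype=torch.long))
loss.backward()
```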
And so where did your data come from? Like 10,000 hours of audio?
We actually do a lot of clever tricks in English, where we don't have a large number of English-language products.
For example, it turns out that if you go onto, say, a crowdsourcing service, you can hire people very cheaply to just read books to you.
Wow!
It's not the same as the kind of audio we hear in real applications, but it's enough to teach a speech system all about liaison between words; you get some speaker variation, and you hear strange vocabulary where English spelling is totally ridiculous.
Oh!
In the past, you would hand-engineer these things. You'd say, "Well, I've never heard that word before, so I'm going to bake the pronunciation into my speech engine."
But now it's all data-driven, so if the system hears enough of these unusual words, you see these neural networks actually learn to spell on their own, even with all the weird exceptions of English.
Interesting. And you have the input, right? Because I've heard of people doing it with, say, a YouTube video, but then you need a caption as well as the audio, so it's twice as much work, if not more.
Interesting. And so what about the other way around? How does that work on the technical side?
Right. That's one of the really cool parts of deep learning right now: a lot of these insights about what works in one domain keep transferring to other domains.
With text-to-speech, you could see a lot of the same practices. A lot of systems were hand-engineered combinations of many different modules, and each module would have its own set of machine learning algorithms with its own little tricks.
One of the things our team did recently, with a piece of work we're calling Deep Voice, was to ask, "What if I rewrote all of those modules using deep learning, every single one?"
Not to put them all together just yet, but even just to ask, "Can deep learning solve all of these adequately to get a good speech system?"
Turns out the answer is yes.
You can basically abandon most of this specialized knowledge in order to build all of the subsequent modules.
And in more recent research in the deep learning community, we're seeing that, of course, everyone is now figuring out how to make these things work end-to-end.
They're all data-driven, and that's the same story we saw for Deep Speech, so we're really excited about that.
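For a sense of what "one neural module per stage" could look like, here is a toy skeleton (structure only, with made-up layer sizes; the real Deep Voice networks are far larger): grapheme-to-phoneme conversion, duration, pitch, and audio synthesis each get their own small model and are chained together from text features to audio.

```python
# Toy skeleton of the modular idea (structure only; sizes are made up).
import torch
import torch.nn as nn

class GraphemeToPhoneme(nn.Module):            # text features -> phoneme features
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(64, 48)
    def forward(self, x):
        return self.net(x)

class DurationModel(nn.Module):                # phonemes -> how long each is held
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(48, 1)
    def forward(self, x):
        return self.net(x).clamp(min=0.01)     # durations must stay positive

class PitchModel(nn.Module):                   # phonemes -> fundamental-frequency contour
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(48, 1)
    def forward(self, x):
        return self.net(x)

class Vocoder(nn.Module):                      # phonemes + duration + pitch -> audio frames
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(50, 256)
    def forward(self, x):
        return self.net(x)

def synthesize(text_features, g2p, dur, pitch, voc):
    phonemes = g2p(text_features)
    d = dur(phonemes)
    f0 = pitch(phonemes)
    return voc(torch.cat([phonemes, d, f0], dim=-1))

audio = synthesize(torch.randn(10, 64),
                   GraphemeToPhoneme(), DurationModel(), PitchModel(), Vocoder())
```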
That's wild! And so do you have a team just dedicated to parsing research coming out of universities and figuring out how to apply it? Are you testing everything that comes out?
It's a bit of a mix. It's definitely our role to think not only about AI research but about AI products and how to get these things to impact.
There is clearly so much AI research happening that it's impossible to look through everything, so one of the big challenges right now is not just to digest it all but to identify the things that are truly important.
So what's a 100-million-person product, where you go, "Oh man!"?
Well, speech recognition is what we chose because we felt that, in aggregate, it had that potential.
As we get to the next wave of AI products, I think we're going to move from these sort of bolted-on AI features to really immersive AI products.
If you look at how keyboards were designed for your phone a few years ago, you see that everybody just bolted on a microphone and hooked it up to their speech API.
That was fine for that level of technology, but as the technology gets better and better, we can now start putting speech up front.
We can actually build a voice-first keyboard, which is something we've been prototyping in the AI Lab. You can actually download it for your Android phone; it's called TalkType, in case anybody wants to try it.
It's remarkable how much it changes your habits. I use it all the time, and I never thought I would do that.
It emphasized to me why the AI Lab is here: we can discover these changes in user habits.
We can understand how speech recognition can impact people much more deeply than it could when it was just bolted onto a product.
And that spurs us on to start looking at the full range of speech problems we have to solve to get you away from this close-talking voice search scenario and into one where I can just talk to my phone or talk to a device and have it always work.
So as you've given this to a bunch of users, I assume, and gotten their feedback, have you been surprised by voice as an interface?
I know lots of people talk about it, and some people say it doesn't really make sense. For example, you see Apple transcribing voicemails now.
Are there certain use cases where you've been surprised at how effective it is, and others where you're like, "I don't know if this will ever play out"?
You know, the really obvious ones like texting seem to be the most popular. The feedback that's maybe the most fun for me is when people with thick accents post a review.
They say, "Oh, I have this crazy accent I grew up with and nothing works for me, but I tried this new keyboard and it works amazingly well."
I have a friend who has a thick Italian accent, and he complains all the time that nothing works.
It's working! And all of this stuff now works for different accents because it's all data-driven.
We don't have to think about how we're going to serve all these different users; if they're represented in the datasets and we get some transcriptions, we can actually serve them in a way that really wasn't possible when we were trying to do it all by hand.
That's fantastic! And does that carry through the whole system? In other words, if I want to give myself, say, an Italian-American accent, can I do that yet with Baidu?
We can't do that yet with our TTS engine, but it's definitely on the way.
Okay, cool! So what else is on the way? What are you researching? What products are you working on? What's coming?
Speech and text-to-speech, I think, are part of a big effort to make this next generation of AI products really fly.
Once text-to-speech and speech recognition are your primary interface to a new device, they have to be amazingly good and they have to work for everybody.
And so I think there's still quite a bit of room to run on those topics: not just making it work for a narrow domain, but making it work for really the full breadth of what humans can do.
Do you see a world where you can run this stuff locally, or will it always be calling an API?
Yeah, I think it's definitely going to happen.
One kind of funny thing is that if you look at folks who maybe have a lot less technical knowledge, and don't really have the instinct to think through how a piece of technology works on the back end, I think the response to a lot of AI technologies now, because they're reaching this sort of uncanny valley, is that we often respond to them as though they're human, and that sets the bar really high.
Our expectations for how delightful a product should be are now being set by our interactions with people.
One of the things we discovered as we were translating Deep Speech into a production system was that latency is a huge part of that experience.
The difference between 50 or 100 milliseconds of latency and 200 milliseconds is actually quite perceptible, so anything we can do to bring that down really affects the user experience.
We did a combination of research, production hacking, and working with product teams to think through how to make all of that work, and that's a big part of the translation process we're here for.
That's very cool! And so what happens on the technical side to make it run faster?
When we first started the basic research for Deep Speech, like all research papers, we chose the model that gets the best benchmark score, which turns out to be horribly impractical for putting online.
So after the initial research results, the team sat down with a set of what you might think of as product requirements and started thinking through what kinds of neural network models would let us get the same performance but don't require so much future context.
They don't have to listen to the entire audio clip before they can give you a really high-accuracy response.
So kind of like the language prediction stuff the OpenAI guys were doing with the Amazon reviews, predicting what's coming next?
Maybe not even predicting what's coming next. One thing humans do without thinking about it is, if I misunderstand a word you've said to me and then a couple of words later I pick up context that disambiguates it, I don't skip a beat.
I just understand it as one long stream.
One of the ways our speech systems would do this is that they would listen to the entire audio clip first, process it all in one fell swoop, and then give you a final answer.
That works great for getting the highest accuracy, but it doesn't work so great for a product where you need to give a response online, to give people some feedback that lets them know you're listening.
So you need to alter the neural network so that it tries to give you a really good answer using only what it's heard so far, but can then update that answer very quickly as it gets more context.
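A minimal sketch of that streaming idea (an assumed architecture, not the production system): a unidirectional recurrent network carries its hidden state across audio chunks, so it can emit a provisional transcript after every chunk and refine it as more audio arrives, instead of waiting for the whole clip.

```python
# Sketch of streaming recognition with a unidirectional RNN (assumed design).
import torch
import torch.nn as nn

class StreamingRecognizer(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_chars=29):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)   # no future context required
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, chunk, state=None):
        h, state = self.rnn(chunk, state)          # hidden state persists between chunks
        return self.out(h).argmax(dim=-1), state   # greedy per-frame character ids

model = StreamingRecognizer()
state = None
partial = []

# Feed short chunks as they arrive; each step yields an updated partial result.
for chunk in torch.randn(8, 1, 10, 80):            # 8 chunks of (batch=1, 10 frames, 80 mels)
    char_ids, state = model(chunk, state)
    partial.append(char_ids)
    # A real system would run a proper CTC/beam decode here and display the text.

print(torch.cat(partial, dim=1).shape)             # (1, 80): frame-level ids heard so far
```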
So I've noticed over the past few years that people have gotten quite good at structuring sentences so the system understands them.
You know, they put the noun in the right position so it feeds the result back correctly. I found this when I was traveling: I was using Google Translate, and after one day I realized I couldn't give it a full sentence, but if I gave it a single noun, like just "bread," I could show it to someone and it would translate perfectly.
Do you find that we're going to have to slightly adapt how we communicate with machines, or is your goal for us to communicate exactly as we would with a person?
I really want it to be human level, and I don't see a serious barrier to getting there, at least for really high-value applications.
I think there's a lot more research to do, but I sincerely think there's a chance that over the next few years we're going to regard speech recognition as a solved problem.
That's very cool! So what are the really hard things happening right now? What are you not sure will work?
So, we were talking earlier about getting all this data.
For problems where we can just get gobs of labeled data, I think we've got a little bit more room to run, but we can certainly solve those kinds of applications.
But there's a huge range of what humans are able to do, often without thinking, that current speech engines just don't handle.
We can deal with crosstalk and a lot of background noise.
If you talk to me from the other side of a room, even if there's a lot of reverberation and things going on, it usually doesn't bother anybody that much.
And yet current speech systems often have a really hard time with this.
But the next generation of AI products is going to need to handle all of this.
So a lot of the research we're doing now is focused on going after all of those other things.
How do I handle people who are talking over each other, or multiple speakers who are having a very casual conversation?
How do I transcribe things that have very long structure to them, like a lecture, where over the course of the lecture I might realize I misunderstood something, or some piece of jargon gets spelled out for me and now I need to go back and fix the transcription?
So this is one place where our ability to innovate on products is actually really useful.
We recently launched a product called SwiftScribe to help transcriptionists be much more efficient, and that's targeted at all of these scenarios where the world wants long-form transcription.
We have all of these conversations that are just sort of lost, that we wish we had written down, but it's just too expensive to transcribe all of it for everyday applications.
So in terms of emulating someone's voice, do you have any concerns about faking? Did you see the face simulation work? I forget the researcher's name, so I'll have to link to it, but you know what I'm talking about.
Essentially you can feed it both video and audio and recreate, you know, Adam talking. Do you have any thoughts on how we can prepare for that world?
You know, in some sense this is a social question, right? I think culturally we're all going to have to exercise a lot of critical thinking.
We've always had this problem in some sense: I can read an article that has someone's name on it, and notwithstanding writing style, I don't know for sure where that article came from.
So I think we have habits for how to deal with that scenario.
We can be healthily skeptical, and I think we're going to have to come up with ways to adapt that to this sort of brave new world.
I think those are big challenges coming up, and I do think about them, but I also think a lot about all the positives that AI is going to have.
I don't talk about it too much, but my mother actually has muscular dystrophy, and so things like speech and language interfaces are just incredibly valuable for someone who cannot type on an iPad because the keys are too far apart.
These are all things you don't really think about that these technologies are going to address over the next few years.
On balance, I know we're going to have a lot of big challenges around how we use these technologies and how we as users adapt to all of the implications, but I think we've done really well with this in the past and we're going to keep doing well with it in the future.
So do you think AI will create new jobs for people, or will we all be like Mechanical Turks feeding the system?
I'm not sure. This is something where, you know, the job turnover in the United States every quarter is incredibly high.
It's actually shocking what fraction of our workforce quits one occupation and moves to another.
I think it is clearly getting faster. We talk about this phenomenon within the AI Lab here: deep learning research is flying ahead so quickly that we're often remaking ourselves to keep up with it and to make sure we can keep innovating.
And I think that might even be a bit of a lesson for everyone: continual learning is going to become more and more important going forward.
Yeah! So speaking of which, what are you teaching yourself so the robots don't take your job?
I don't think we're at risk of robots taking our jobs right now.
Actually, it's kind of interesting; we've thought about how this changes careers.
One thing that has been true in the past is that if you were to create a new research lab, one of the first things you'd do is fill it with AI experts who live and breathe AI technology all day long.
I think that's really important. For basic research you need that kind of specialization.
But because the field is moving so quickly, we also need a different kind of person now.
We also need people who are chameleons, these highly flexible types who can understand and even contribute to a research project, but can also simultaneously shift to the other foot and think about how it interacts with GPU hardware and a production system, and how to think about a product team and user experience.
Because often product teams today can't tell you what to change in your machine learning algorithm to make the user experience better.
It's very hard to quantify where it's falling off the edge, so you have to be able to think that through yourself to change the algorithms.
You also have to be able to look at the research community to think about what's possible and what's coming.
And so there's a sort of amazing full-stack machine learning engineer that's starting to show up.
Where are they coming from? If I want to be that person, what do I do? Say I'm, you know, 18.
They seem to be really hard to find right now, believe me.
So in the AI Lab, we really set ourselves to just creating them.
I think this is the way it is with unicorns: we have to find the first few examples, see how exciting that is, and then come up with a way for people to learn and become that sort of professional.
Actually, one of the cultural characteristics of our team is that we look for people who are really self-directed and hungry to learn.
Things are going so quickly that we just can't guess what we're going to have to do in six months.
It takes a do-anything attitude to say, "Well, I'm going to do research today and think about research papers, but once we get some traction and the results are looking good, we're going to take responsibility for getting this all the way to 100 million people."
That's a towering request of anyone on our team, and the things that really help people connect with it and do well are being self-directed, being able to deal with ambiguity, and being willing to learn a lot of stuff that isn't just AI research, stepping way outside your comfort zone to learn about GPUs and high-performance computing and about how a product manager thinks.
Okay, this has been super helpful! If someone wanted to learn more about what you're working on, or even just things that have been influential to you, what would you recommend they check out on the internet?
Oh my goodness! I have to think about this one for a second.
I think the stuff that's actually been quite influential for me is startup books.
Especially at big companies, it's easy to think of ourselves in silos, as having a single job.
One idea from the startup world that I think is amazingly powerful is the idea that a huge fraction of what you're doing is learning.
There's a tendency, especially among engineers, of which I count myself one, to want to build something.
So one of the disciplines we all have to keep in mind is to be really clear-eyed about what we don't know right now, and to focus on learning as quickly as we can: to find the most important part of AI research that's happening, find the most important pain point that people in the real world are experiencing, and then be really fast at connecting the two.
And I think a lot of that influence on my thinking has come from the startup world.
There you go! That's a great answer.
Okay, cool! Thanks, man!
Thanks so much!