Jeff Dean’s Lecture for YC AI
So I'm going to give you a broad-brush sense, not super deep on any one topic, of the kinds of things we've been using deep learning for and the kinds of systems we've built to make deep learning faster. This is joint work with many, many, many people at Google, so this is not purely my work, but most of it is from the Google Brain team, which I lead. The Brain team's mission is basically to make machines intelligent and then use that new capability to improve people's lives in a number of different ways.
The way we do this is we conduct long-term research, kind of independent of any particular application. We build open-source systems, like TensorFlow, that help us with our research and with deploying machine learning models. We collaborate across Google and all of Alphabet to get the machine learning systems and research we've done into real Google products.
So we've done a lot of work in Google Search, Gmail, Photos, Speech Recognition, Translate, and many other places. We also bring in a lot of people into our group through internships and a new residency program that we started last year for people who want to learn how to do deep learning research, and that's been a pretty successful program as well.
The main research areas our group is working in are these; I'm going to focus mostly on a few of them today, plus a little bit of perception. In January, I put out a blog post that highlighted some of the work our group did over 2016. In putting that together, I realized we were doing a lot of different things.
The nice thing about this is that each of those blue links points to something interesting and substantial, like a research paper, a product launch using machine learning, or some new TensorFlow feature we've added. I won't go through it all now, but you can go find that blog post and learn more about some of the stuff we've been up to.
Okay, so why are we here? You probably already all know this, given that you're working on AI-related companies, as I understand it. But the field of deep learning, and neural networks in particular, is really causing a shift in how we think about approaching a lot of problems. I think it's really changing the kinds of machine learning approaches that we use.
In the 80s and 90s, neural nets seemed interesting and appealing, but they weren't the best solution at the time for a lot of problems we cared about, because we just didn't have enough training data or enough computational capability. So people used other methods, or developed shallower machine learning methods with much more hand-engineering of features.
If you fast forward to now, what's happened is we've got much, much more compute. I actually did an undergrad thesis in 1990 on parallel training of neural nets because I liked the appeal of the neural net model. I thought that if we could just get a bunch more compute by parallelizing over a 64-processor hypercube machine, it would all be even better.
It turned out what we needed was like a hundred thousand times as much compute, not sixty times. But if you fast forward to today, we actually have that. So what's happened is we now actually have the case where neural nets are the best solution for an awful lot of problems and a growing set of problems where we either previously didn't really know how to solve the problem or where we could solve it, but now we can solve it better with neural nets.
So the talk is really meant to orient you across a whole bunch of different problems where this is the case: the growing use of deep learning. Our group really started in order to investigate the hypothesis that large amounts of compute could let us solve interesting problems with neural nets.
When we first started, we were sort of the vanguard of people using neural nets at Google. We did a bunch of work on unsupervised learning at a really large scale. At that time, we didn't even have GPUs in our data centers, so we just used 16,000 CPU cores. We did kind of interesting things with unsupervised learning there, but gradually we built tools that enable people to apply machine learning, and deep learning in particular, to a lot of problems.
You can see the growth rate here: this is the number of directories containing model description files, either for our first-generation system or for our second-generation system, TensorFlow. We've deployed machine learning in collaboration with lots of teams. Other teams have also been independently picking up this idea of deep learning and using it in lots and lots of places in Google products, and that's why you see that growth rate, and it's continuing to grow.
One of the things we focus on a lot is how we can reduce experimental turnaround time for our machine learning experiments. There's a very different qualitative feel to doing science and research in a domain where an experiment takes a month versus one where, in minutes or hours, you get an answer and can figure out the next set of experiments you want to run. So a lot of our focus is on scaling machine learning models and the underlying infrastructure and systems so that, for some problems, we can approach minutes or hours rather than weeks or months.
Part of that has been building the right tools. TensorFlow is our second-generation system for tackling deep learning and machine learning problems. The first one we built was not open source. For the second one, we said we should really fix some of the design problems we saw in our first system, keep the good features, and design it from the start to be an open-source platform.
That way, people all over the world, not just at Google, can benefit from it and can help build a community that can contribute to and improve the system. Zak Stone here is our TensorFlow product manager extraordinaire and is doing a great job of building the community both inside and outside Google.
The goal of TensorFlow is to establish a common platform for expressing all kinds of machine learning ideas: something that can be used for deep learning, for other kinds of machine learning, for tackling perception problems, and for language understanding problems.
What if you have a crazy new machine learning research idea that doesn't really fit into what people have done before? We want it to be at least expressible relatively easily in TensorFlow. We want the platform to be really great for research, but we also want you to be able to take something you've developed in TensorFlow, maybe experimentally, and deploy it in a production setting: run it in a data center, run it at scale, run it on a phone. We want all of those to be things you can do within the same framework.
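To make that concrete, here's a minimal sketch of what expressing and training a small model looks like; it uses today's tf.keras API rather than the graph-and-session style code of that era, and the model itself is just an illustrative toy.

```python
# Minimal sketch: expressing a small model in TensorFlow (tf.keras API).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The same model definition can be trained in a data center and then
# exported (e.g. as a SavedModel) to run on servers or phones.
# model.fit(train_images, train_labels, epochs=5)
# model.save("my_model")
```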
By open sourcing it, we make it available to everyone. So how has this been going? Well, this is a comparison of GitHub stars, which is one metric of popularity or interest in different source code repositories on GitHub. I show you a comparison of TensorFlow with a bunch of other open-source machine learning packages, many of which have been around for many more years of intensive attention.
TensorFlow's line, the brown one, is going up fairly steeply, so this has been pretty good. I think the reception shows that what TensorFlow offers, flexible research combined with production readiness and the ability to run in lots of places, is pretty appealing.
If you look at the other open-source packages which we did when we were starting to work on TensorFlow, many of them have two of the three attributes that we care about: being able to be really flexible, scalable, and sort of run on any platform. They all have different emphases, but we wanted something that satisfied all three of those.
We've been focusing a fair amount on speed. I think when we first released TensorFlow, we released a bunch of really nice tutorials that showed how to do different things with TensorFlow. But one of the mistakes we made was we released code that was meant to be exploratory and clear and not necessarily the highest performance way you would write that. Often then, people would take that as the way you should write a high-performance TensorFlow model, and that wasn't necessarily the case.
So we're now adapting and trying to put out things that are both clear and high performance, and we've put out a technical guide on performance. I think TensorFlow got a bit of a bad rap on speed, but actually our performance is quite good.
We've been doing a bunch of benchmarking and producing reproducible benchmark results that show that our scaling is quite good. So this is single machine scaling—nearly linear speed-up for a bunch of different image models on up to 8 GPU cards. Pretty close to linear speed-up for 64 GPU cards for a bunch of different kinds of problems.
So, if you hear TensorFlow is slow, don't believe it. We also support lots of different platforms, and I think this is important because often you want to train a model on a large dataset in a data center but then deploy that on a phone.
So we run on iOS and Android and Raspberry Pis, and on plain CPUs; if you have a GPU card or cards, we're happy to use those. We also run on our custom machine learning accelerators, which I'll talk about in a minute. But really, we want to run on everything.
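As one concrete example of the "train in a data center, deploy on a phone" flow, here's a sketch using today's TensorFlow Lite converter; that tooling postdates this talk, and the tiny model here is just a stand-in.

```python
# Sketch: convert a trained model for on-device inference.
import tensorflow as tf

# A tiny placeholder model; in practice you'd load a trained one.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)   # this file can be bundled into an Android/iOS/Raspberry Pi app
```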
There are a bunch of other device manufacturers developing mobile ML accelerators (Qualcomm has a DSP, for example), and they're all working to make sure TensorFlow runs well on those devices. We also want to be agnostic about the language people use, because you want to be able to run machine learning wherever it makes sense, and different people have different language environments.
The most fully developed front end is obviously Python, but the C++ front end works pretty well for production use, and a bunch of external community members have added reasonable, if not fully fleshed out, support for a variety of other languages.
We have a pretty broad usage base. About a year ago, we had a meeting at Google of people using TensorFlow, and it was pretty impressive. We had people from most of these companies in the room, which I think normally don't all get in a room together: Apple was actually there, along with NVIDIA, Qualcomm, Uber, Google, Snapchat, and many, many other places.
In terms of stars, you know, I showed you the graphs related to machine learning platforms. This is the top repositories on GitHub overall, and we're up to number six, which is pretty good. All the other ones are either JavaScript or a list of programming books.
This is a visualization of where people are interested in different GitHub repositories, which is kind of cool. Machine learning is done all over the world, so that growth in interest has been global, and there's been a pretty broad set of external contributors.
I think we're up to almost a thousand non-Google contributors across the world doing all kinds of different things, adding features, fixing bugs, or improving the system in various ways, which has been really nice. Oh, and I think it's also nice that there's growing use of TensorFlow in machine learning classes as a way of illustrating machine learning concepts.
Really good machine learning universities, like Toronto, Berkeley, Stanford, and other places, are starting to use it as the core of their curriculum. Okay, so now I'm going to switch gears a bit and talk about some more product-oriented applications of deep learning at Google. Google Photos is a good example of computer vision that, obviously, works.
One thing you could do is build a photos product around the idea that you can actually understand what's in people's photos, and that's been going really well. As a lesson for those of you starting companies in applied domains, I think it's really important to look at the machine learning work happening in the world and realize that you can often reuse the same ideas from one domain and, just by pointing them at different datasets, get completely different and interesting product features.
So if you, for example, use the same basic model structure but train it on different data, you get something different. One general model pattern is: given an image, predict the interesting pixels. There are a bunch of ways you could do that, but if you have a model structure that does it, you can apply it in lots of places. My summer intern from a few years ago, Matt Zeiler, who actually went off to found Clarifai, a computer vision company, worked with us in collaboration with the Street View team on identifying text in Street View images.
To do that, you can have training data where people have circled or drawn boxes around text, and you just try to predict the heatmap of which pixels contain text in a Street View image. This works reasonably well, and then you can run an OCR model on those pixels and actually read the text.
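Here's a minimal sketch of a per-pixel heatmap model of that general kind; the layer sizes and input resolution are illustrative, not the production Street View architecture, and `street_view_images` and `text_masks` are hypothetical stand-ins for the training data.

```python
# Sketch: a tiny fully convolutional net that maps an image to a
# per-pixel "text / no text" heatmap.
import tensorflow as tf

def build_heatmap_model(height=256, width=256):
    inputs = tf.keras.Input(shape=(height, width, 3))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    # One sigmoid output per pixel: probability that the pixel contains text.
    heatmap = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, heatmap)

model = build_heatmap_model()
model.compile(optimizer="adam", loss="binary_crossentropy")
# Training data: images plus binary masks derived from human-drawn boxes.
# model.fit(street_view_images, text_masks, epochs=10)
```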
It works, you know, across lots of different font sizes and colors and whether it's close to or far from the camera. Some people in the Maps team decided they would build this thing that would help you identify whether your rooftop has solar energy potential and how much energy you could generate by installing solar panels.
One of the first things you have to do is find rooftops, and that's exactly the same model but just with different training data where you now have circles around rooftops. Then there's a bunch of other work to estimate the angle of the rooftop from the imagery or multiple views of the same house and then some stuff to predict what is the solar energy potential for that.
Another area where we've applied this is the medical domain: take the same basic model and point it at a medical imaging problem. One of the first ones we've been tackling is in ophthalmology, particularly taking a retinal image like this and determining whether or not it shows symptoms of a degenerative disease called diabetic retinopathy.
So this is again the same kind of problem: you want to identify the parts of the eye that seem to be diseased in some way. Then you also have a whole-image classification problem: does this eye show symptoms at grade one, two, three, four, or five?
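As a sketch, the whole-image grading part is just a standard image classifier over five severity grades; the backbone, input size, and the `retina_images` and `grades` arrays below are illustrative assumptions, not the model from the actual study.

```python
# Sketch: grading a retinal photo into one of five severity levels.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                         input_shape=(299, 299, 3),
                                         weights=None)   # or "imagenet" to reuse pretrained features
outputs = tf.keras.layers.Dense(5, activation="softmax")(base.output)
grader = tf.keras.Model(base.input, outputs)
grader.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
# Labels encoded 0..4 for grades one through five.
# grader.fit(retina_images, grades, epochs=10)
```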
It turns out you can do this. Some people in our group did a really nice medical study: you collect 150,000 ophthalmology images and then get each one labeled by seven ophthalmologists. Why seven? Because if you ask two ophthalmologists to grade the same image one through five, they agree only about 60 percent of the time, which is a bit terrifying.
If you ask the same ophthalmologist to grade the same image a few hours later, they agree with themselves about 65 percent of the time, which is also mildly terrifying. So we had every image labeled by seven ophthalmologists to reduce the variance in the score: if five people think it's a two, it's probably more like a two than a three.
But in any case, the punchline of the paper is that we now have a model that performs on par with, or slightly better than, the median of eight US board-certified ophthalmologists. That's exciting because there are a bunch of places in the world, especially in India and other countries, where many people are at risk and there just aren't enough ophthalmologists.
So there are actually clinical trials going on in India. We've licensed this to our Verily subsidiary, which is licensing it to an ophthalmology camera manufacturer that is going to integrate it into the camera itself. Another area where being able to see is pretty useful is robotics: if you're trying to build robots, being able to perceive the world around you clearly makes things a lot better.
So we've been doing a bunch of experiments both with real robots and with simulated robotic environments, and also with imitation learning: watching people perform actions and then trying to get robots to do the same. We set up what we call an arm farm.
Oops, it's not playing. Oh, maybe I'm not on the internet. Well, anyway, it's not that exciting, except that we have a bunch of robots trying to grasp things. They can learn on their own whether they're grasping something successfully, just by having a bin of things in front of them.
They just try to pick something up, and if they fail, their gripper closes all the way. If they succeed, then they don't close the gripper all the way, and they can actually see from the camera that they've managed to pick something up.
They can practice picking things up, and we can pool all the sensor data from all the robots that are doing this to retrain a model every night for grasping so that the next day's grasping attempts are better and better. By having lots of robots do this, you actually get a lot of parallel experience, much more than you can get on a single robotic arm.
We've actually released a public dataset of about 800,000 grasp attempts; the biggest comparable public dataset in the past was about 30,000 grasp attempts. Not surprisingly, 800,000 grasp attempts give you a much better grasping mechanism and model than 30,000 of them.
We've also been trying imitation learning. This is me awkwardly mimicking a robot doing some actions on a screen that you can't see; we have a video of me doing that, and then we try to learn from the videos to transfer that action to the real robots, and that's working reasonably well too.
Here's another example, and we're doing that first in a simulator, and then we're taking that simulator and trying to transfer those activities to a real robot, and that works reasonably well as well.
Another place that I'm pretty excited about deep learning is in lots and lots of scientific domains. You often have the case where you have a simulator of some really complex phenomenon, and that's often a sort of HPC style application and very computationally expensive, but it kind of gives you insight into whatever scientific processes are going on.
That lets you iterate with a computational-science methodology, but those computations are pretty expensive. So one of the things we've been working on, and this is just one example, is using those simulators to generate training data for a neural net.
Quantum chemists have a similar problem: they take a configuration of molecules, run a bunch of time steps, and at the end they get some information about how the ultimate configuration of those molecules turns out. From that, they get a few properties of those molecules, like: is it toxic? Did it bind to something else? A handful of these things.
It turns out that's great training data for a neural net: the configuration goes in as input, you run the really expensive simulator for an hour, and you get these thirty-or-so numbers out. You can train a neural net to do approximately that same task, essentially approximating the entire simulator.
The punchline is at the bottom there: you get essentially indistinguishable accuracy from using the real simulator, but it's 300,000 times faster. That has a lot of implications for how you might do quantum chemistry if you suddenly have something 300,000 times faster. You might run 100 million candidates through your neural net emulator to figure out what's going on, or to identify a bunch of candidates that you want to look into in more detail, so that's exciting.
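Here's a sketch of that emulator idea: a neural net regression from simulator inputs to the handful of output properties. The feature dimensions and the random arrays standing in for real simulator runs are assumptions for illustration.

```python
# Sketch: training a neural net to emulate an expensive simulator.
import numpy as np
import tensorflow as tf

n_features, n_properties = 128, 30
# Pretend these pairs came from many expensive simulator runs.
X = np.random.rand(10000, n_features).astype("float32")
Y = np.random.rand(10000, n_properties).astype("float32")

emulator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(n_properties),          # ~30 predicted properties
])
emulator.compile(optimizer="adam", loss="mse")
emulator.fit(X, Y, epochs=3, batch_size=256, verbose=0)

# Once trained, emulator.predict() is orders of magnitude cheaper than
# rerunning the simulator for each new candidate molecule.
```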
Another place where these kinds of pixel-to-pixel models come in is some people in Google have done a model that tries to predict depth from an input image. We have some training data where we have the true depths given a camera viewpoint and where things are in the room or in the world, and then we try to train a model to predict depths from just the raw pixels of that image.
So that's a pixel-to-pixel learning problem, and you can imagine a lot of pixel-to-pixel learning problems. Indeed, you know, one application in cameras is you want to predict depth in a portrait, and then you can do kind of funky, cool effects, like identify the person in the foreground and turn the background black and white or like make it all fuzzy and artsy in the background, which is kind of cool.
But it turns out you can also take raw microscope images as the input and the chemically stained microscope image as the target for your model. That's often how people see cell bodies and cell boundaries: you apply different kinds of stains to the cells so that they show up better under a microscope and you can see what's going on.
So in this animation, that's the input, that's the ground truth, and that's the predicted output of a neural net trained to virtually stain something without actually staining it. This matters because actually staining something kills the cells, so you don't get any temporal information about what's going on; the cells essentially die when you apply the stain.
But here you can virtually stain the cells, then follow them longitudinally in time and watch cellular processes continue to happen, without actually staining them. You can also stain for things you can't necessarily develop a true chemical stain for: if you have labels for which things are axons and which are dendrites in neural tissue, you can have a microscope view that highlights axons, dendrites, and cell bodies in different colors, even if that's not something you can do with a real chemical stain.
One of the areas we've been doing a lot of work in is in language understanding models, and so this started out as research in our group to do essentially sequence-to-sequence learning. You have some input sequence and conditioned on that input sequence, you want to predict an output sequence. This turns out to be useful for actually a whole bunch of different problems, but one of them is translation.
So if you have a bunch of sentence pairs, a sentence in French and the corresponding sentence in English, you can use a sequence-to-sequence model that takes the input sentence one word at a time, or even one character at a time. When it hits a special end-of-French token, it starts spitting out the corresponding English translation of that French sentence.
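Here's a small encoder-decoder sketch of that kind of model in Keras; the vocabulary sizes, dimensions, and the `french_tokens` and `english_tokens` names are illustrative assumptions, and the production translation model is much deeper and uses attention.

```python
# Sketch: a minimal sequence-to-sequence (encoder-decoder) model.
import tensorflow as tf

src_vocab, tgt_vocab, dim = 8000, 8000, 256

# Encoder: read the source sentence and summarize it in the LSTM state.
enc_in = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, dim)(enc_in)
_, h, c = tf.keras.layers.LSTM(dim, return_state=True)(enc_emb)

# Decoder: generate the target sentence conditioned on that state.
dec_in = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, dim)(dec_in)
dec_out = tf.keras.layers.LSTM(dim, return_sequences=True)(dec_emb, initial_state=[h, c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_out)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit([french_tokens, english_tokens_in], english_tokens_out, ...)
```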
The training setup works like this: you have training data that looks like that, and you try to predict the next word from it using a recurrent neural net. That turns out to work reasonably well. Then, when generating output, you actually want to find the most probable sequence, not just the sequence made of the most probable individual terms.
So you do a little beam search, where you keep a window of candidate partial outputs and search over possible vocabulary items until you've found a likely output sequence, and that's how you do translation.
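Here's a minimal NumPy sketch of that beam search; the `log_probs(prefix)` function is a hypothetical stand-in for the trained decoder, and the random toy model at the bottom just makes the sketch runnable.

```python
# Sketch: beam search over an output vocabulary.
import numpy as np

def beam_search(log_probs, eos_id, beam_width=4, max_len=20):
    beams = [([], 0.0)]                          # (token sequence, total log prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:        # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            lp = log_probs(seq)                  # next-token log probs, shape (vocab,)
            for tok in np.argsort(lp)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + float(lp[tok])))
        # Keep only the best few partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy usage with a random stand-in "model":
rng = np.random.default_rng(0)
fake_log_probs = lambda seq: np.log(rng.dirichlet(np.ones(100)))
print(beam_search(fake_log_probs, eos_id=1))
```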
One application of this is in Gmail. We added a feature called Smart Reply, where essentially we get an incoming email. So this is one sent to my colleague Greg Corrado from his brother: "Hi, we want to invite you to join us for Thanksgiving dinner. Please bring your favorite dish. RSVP by next week."
To reduce the computational cost, we have a small feed-forward neural net that asks, "Is this the kind of message where a short reply would make sense?" If yes, then we activate a sequence-to-sequence model and do a much more computationally expensive thing with that message as input, trying to predict plausible replies.
The system produces three responses: "Count us in," "We'll be there," or "Sorry, we won't be able to make it." So this is a nice application of sequence-to-sequence models, and if you squint at the world, you'll find lots of applications of these.
It turns out that in April 2009 there was an April Fool's joke that Google put out saying, "Haha, we're going to reply to your email automatically." But then in November 2015, we launched this as a real product, and within just three months, 10 percent of mobile Inbox replies were generated by Smart Reply. That's kind of cool.
But obviously, one of the real potential applications of this was Translate. The original research demonstrated effectiveness on a public translation dataset called WMT, which is large by academic standards but still smallish. When we looked at applying this to the real Google Translate product, we actually had a hundred to a thousand times as much training data, so scaling this up was pretty challenging. We also wanted to make the model a lot higher quality, and we did a nice, fairly detailed write-up of the engineering behind that in this paper.
So this is kind of the structure of the model we came up with. It has a very deep LSTM stack, each layer of which runs on a different GPU. There's an attention module, so rather than having just a single state that's updated by the recurrent model, you keep track of all the states and learn to pay attention to different parts of the input data when you're generating different parts of the output sequence.
So when you're about to generate the next word, you look back at, say, the word "hello" in the input sentence, and so on. One replica of this model runs on a single machine with multiple GPU cards, with different pieces of the model in different places. Then we run a lot of copies of this model to do data parallelism across the large training dataset, and we share the parameters.
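For intuition, here's a minimal NumPy sketch of the attention idea just described, using a simple dot-product score; the production model's attention mechanism is more elaborate, and the shapes here are made up.

```python
# Sketch: attend over all encoder states with a dot-product score.
import numpy as np

def attend(decoder_state, encoder_states):
    # encoder_states: (source_length, dim); decoder_state: (dim,)
    scores = encoder_states @ decoder_state       # similarity per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over source positions
    context = weights @ encoder_states            # weighted mix of encoder states
    return context, weights

enc = np.random.randn(7, 16)    # 7 source words, 16-dim encoder states
dec = np.random.randn(16)       # current decoder state
context, weights = attend(dec, enc)
print(weights.round(2))         # how much each source word is attended to
```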
This is a technique we've been using for quite a while; we originally published it in 2012 as what we call a parameter server. Many data-parallel copies process different input data and all try to update the shared parameters by applying gradients to them, and this lets you scale training quickly. So you can have, you know, 50 replicas of this kind of setup, or 20.
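Here's a toy, single-process, synchronous sketch of that data-parallel pattern; the real parameter-server setup is distributed and asynchronous, and the linear-regression problem below is just a stand-in so the sketch runs.

```python
# Sketch: many replicas compute gradients on their own data shards and
# all apply them to one shared set of parameters.
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
params = np.zeros(5)                      # the shared "parameter server" state
lr, n_replicas = 0.1, 4

def replica_gradient(w, n=256):
    # Each replica draws its own shard of (x, y) data and returns the
    # gradient of the squared error with respect to the shared parameters.
    X = rng.normal(size=(n, 5))
    y = X @ true_w
    return X.T @ (X @ w - y) / n

for step in range(200):
    grads = [replica_gradient(params) for _ in range(n_replicas)]
    for g in grads:                       # the server applies each replica's update
        params -= lr * g

print("parameter error:", np.linalg.norm(params - true_w))
```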
I think in this case we were using about 16 replicas, so on the order of 100 GPU cards to train one model. The really good news: the blue line here is the old phrase-based machine translation system, which didn't really have much machine learning in it, if any. It had large statistical models for lots of different sub-pieces of the problem.
So, it had a target language model that told you how often every five-word sequence in English occurred; it had an alignment model that says how words in English and French sentences align, had a phrase table, and a dictionary of plausible English and French phrases and sentences. It was like 500,000 lines of code to glue this whole thing together, and that's the blue line.
What we're showing is the quality of translations generated by that system as judged by humans, and the green line, the neural system, shows a substantial jump in quality for basically every language pair. It doesn't look like much on the chart, but those are really big jumps in quality.
The other nice thing is that the new system is about 500 lines of TensorFlow code instead of 500,000 lines of glue code with lots of handwritten logic. The yellow line on top is human translation, by bilingual speakers rather than professional translators, as judged by other humans, and you can see that for some language pairs we're getting quite close to that human-level quality, which is pretty exciting.
We were rolling this out slowly across lots of different language pairs, and we launched it in the dead of night in Japan. All of a sudden, many people in Japan noticed that English-to-Japanese translation was actually usable in quality, as opposed to before, when it was, as one of the people on our Translate team put it, supported but not usable.
A professor at a Japanese university decided to run an experiment translating the first paragraph of Hemingway's The Snows of Kilimanjaro between English and Japanese to see what the quality looked like. If we focus on the last sentence, the old phrase-based system says, "Whether the leopard had what the demand that that altitude is, no that nobody explained."
So I think there's a leopard involved; other than that, I really can't understand it. Neural machine translation generates much more natural-sounding translations: "No one can explain what leopard was seeking at that altitude." The only mistake it made was leaving out the word "the."
You can see how this transforms it from really not usable to actually pretty good. Another area we're doing a lot of research in is the notion of automating the solution of machine learning problems, what we call learning to learn. The current way you solve a machine learning problem, and probably many of your companies are solving machine learning problems, is this:
You have data, you have some way of doing lots of compute, a bunch of GPU cards, and then you have a human machine learning expert saying, "Okay, I'm going to try this kind of model, use this learning rate, and do transfer learning from this dataset." Then you hopefully get a solution. What we'd like to turn that into is: you have data, and maybe you use a hundred times as much compute, but you don't need a human machine learning expert.
If we could do that, it would be really valuable, because if you think about what's happening in the world, there are probably 10 million organizations that should be using machine learning and probably have data in electronic form suitable for it, but there are only about a thousand organizations in the world that have hired machine learning experts to actually tackle these problems.
So we're trying lots of different efforts in this area, and I'll talk about two of them: one is a way of designing neural architectures automatically, and the other is a way of learning optimizers automatically. For architecture search, the idea is that we want a model-generating model. The same way a human machine learning expert says, "I'm going to try this kind of model," we're going to have a model-generating model that spits out candidate models to tackle a particular problem.
The way this works is: we generate ten model architectures, train each of them for a few hours, and then use the accuracy (or loss) of each generated model as a reinforcement learning signal for the model-generating model. This is just on the edge of feasible for small problems today, but it actually works for small problems.
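Here's a toy sketch of that outer loop. It replaces the real RNN controller with a simple categorical distribution trained with REINFORCE, and `train_and_evaluate` is a fast stand-in for actually training each candidate architecture for hours, so this is purely illustrative of the feedback loop.

```python
# Sketch: sample architectures, score them, feed the score back as an RL reward.
import numpy as np

rng = np.random.default_rng(0)
layer_choices = [1, 2, 4, 8, 16]           # e.g. number of layers in the candidate
logits = np.zeros(len(layer_choices))      # "controller" parameters

def train_and_evaluate(num_layers):
    # Stand-in reward: pretend 8 layers happens to work best on this problem.
    return 1.0 - abs(num_layers - 8) / 16 + rng.normal(scale=0.02)

baseline, lr = 0.0, 0.3
for step in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = rng.choice(len(layer_choices), p=probs)        # sample an architecture
    reward = train_and_evaluate(layer_choices[idx])      # "train" it and score it
    baseline = 0.9 * baseline + 0.1 * reward             # moving baseline for variance reduction
    # REINFORCE: increase the log-probability of choices that beat the baseline.
    grad = -probs
    grad[idx] += 1.0
    logits += lr * (reward - baseline) * grad

print("controller now prefers", layer_choices[int(np.argmax(logits))], "layers")
```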
Here is an example of a model architecture it came up with, and you'll see it looks sort of not like something a human would have designed. The wiring is kind of crazy, and this is CIFAR-10, which is a very small color image problem with ten different classes—pictures of horses and planes and cars. Not that many classes, but it's been pretty well studied in the machine learning literature.
The error rate, as with most machine learning image problems, has been dropping over the years, and everything above the last few lines is a model that a human machine learning expert designed, published, and used to beat the previous state of the art. This neural architecture search, with that architecture, got very, very close to the state of the art without any human design of the underlying architecture.
We also tried it on a language modeling task. The traditional way you handle recurrent models is with an LSTM cell, whose structure is shown there; that's the default thing you'd use for any sequence data. We just gave the architecture search the underlying primitives of an LSTM cell and said, "Go to it, find us some way of dealing with sequential data," and that's the cell it came up with.
It looks somewhat different, but in this case it actually beat the state of the art by a pretty substantial margin for this language modeling task. The other interesting thing is that we then took that cell and used it on a completely different sequential task, a future-prediction task on medical records, and it performed better than the LSTM cell in that domain as well.
Learning the optimizer update rule is similar. We work with symbolic expressions and give the search access to the raw primitives you might consider using in an optimizer update rule: the gradient, the running average of recent gradients, the momentum term, and so on. The top four lines here are human-designed update rules that people traditionally use.
They've been designed over the last decade, or a few decades in the case of SGD, and they're generally what people use: Adam is a pretty good choice these days, and often SGD with momentum, the second line, is the best choice. What you see is that this search came up with fifteen or so completely different expressions from what we've explored, and they're almost all better than all of the human-designed ones.
So that's kind of encouraging. That's going to appear at ICML. We also took one of the most promising of those learned optimizers and transferred it to a different problem, one we hadn't designed the optimizer on, and found that it gave better training perplexity (where lower is better) and a better score on the quality metric (where higher is better) than Adam, which was the best optimizer we'd found before.
So I think this whole notion of learning to learn is going to be pretty powerful because a lot of what machine learning experts do when they sit down to solve a problem is they run lots of experiments. Right now, a human can't run that many experiments; it's just a lot of cognitive load to run 50 experiments or 100 experiments. This thing can run, you know, twelve thousand experiments in a weekend. Many of them suck, but many of them don't.
The other interesting thing is that a lot of what's happened is we've been able to solve lots of problems because we have a lot of data and we've been able to scale the amount of compute we throw at them. And deep learning has some really nice properties that are transforming how we think about designing computers these days.
Deep learning has two really nice properties. One is that it's very tolerant of reduced-precision arithmetic: one significant digit, roughly. You don't need double precision, and you often don't even need single-precision floating point. The other property is that all the algorithms I've shown you are made up of a handful of specific operations, mostly dense linear algebra, cobbled together in different ways.
That leads to an opportunity: if you can build custom machine learning hardware targeted at doing very reduced-precision linear algebra, then you can suddenly unlock huge amounts of compute relative to CPUs or GPUs, which aren't really targeted at exactly these kinds of things.
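As a quick illustration of why reduced precision is tolerable, here's a NumPy sketch comparing a matrix multiply in float32 with the same multiply done on float16-rounded inputs; the sizes are arbitrary, and hardware accelerators use their own formats (like bfloat16) rather than exactly this.

```python
# Sketch: reduced-precision linear algebra lands very close to full precision.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 256)).astype(np.float32)
b = rng.normal(size=(256, 256)).astype(np.float32)

full = a @ b
reduced = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

rel_err = np.abs(full - reduced).max() / np.abs(full).max()
print(f"max relative error from float16 inputs: {rel_err:.4f}")  # a small relative error
```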
So this is—we've been doing custom machine learning accelerators for a while. We've had a first-generation one that was targeted at speeding up inference, so not training but inference when you're actually running a trained model in the context of a product.
We've had the first version deployed in our data centers for two and a half years or so, and at Google I/O we just revealed this system, which is designed for both training and inference. This is a board. One of the things we felt was important was to design not just a chip for training, but an entire system.
Because you're unlikely to get enough compute for large problems out of a single chip, we designed a really high-performance chip and also designed the boards to be hooked together. This is what we call a pod: 64 of these boards, each of which has four chips, so 256 chips, which comes to eleven and a half petaflops of compute.
We're going to have lots and lots of these in our data centers, which is pretty exciting because I think we'll be able to tackle much bigger problems. It's going to bring a lot more compute for learning-to-learn approaches. Normally, programming a supercomputer is kind of annoying, so we decided to make these programmable via TensorFlow.
You can express a model with a new interface we're adding in TensorFlow 1.2 called Estimators, and then the same program will run, with minor modifications, on CPUs, GPUs, or TPUs. That's going to be available through Google Cloud: later this year you'll be able to get a Cloud TPU, which is a virtual machine with a 180-teraflop TPU version 2 device attached, and it will run TensorFlow programs super fast, we hope.
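For flavor, here's a sketch using a premade Estimator from the TF 1.x-era tf.estimator API; `my_input_fn` is a hypothetical input pipeline, and the actual Cloud TPU setup uses its own run configuration rather than exactly this.

```python
# Sketch: an Estimator-style model definition that the same code path
# can point at CPU, GPU, or TPU-backed runtimes via its run config.
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x", shape=[784])]
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[128, 64],
    n_classes=10,
    model_dir="/tmp/toy_estimator")
# classifier.train(input_fn=my_input_fn, steps=1000)   # my_input_fn is hypothetical
```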
We're also making a thousand of these devices available for free to researchers around the world who are doing interesting work, want more compute, and are committed to publishing the results of that work openly, and hopefully to giving us feedback about what's working well on these TPU devices and what's not. Ideally they'd also open-source the code associated with those models, but we're not sure that will be a hard requirement; it's a desire on our part to help speed up the whole science and machine learning research ecosystem.
You can sign up there if you're interested in any of these things. Google Cloud is also producing higher level APIs that are more managed services or pre-trained models that you can just use without necessarily being a machine learning expert. So if you have photographs, you can run them through the vision API, and it will read all the text in it and find all the faces and tell you what kind of objects are in it and do all kinds of good stuff.
The Translation API gives really nice, high-quality translations that might be useful for lots of things. One final closing thing: we've also been experimenting with using machine learning to get higher-performance machine learning models, and in this case what we've been doing is a similar kind of reinforcement learning.
We take a TensorFlow computation graph and a set of computational devices we want to run it on, say four GPU cards, and we tell the RL algorithm, "Find the placement of TensorFlow operations on those devices that makes the model run as fast as possible." The current way people do this is by hand: "Okay, for four GPU cards, I'm going to run this part of my graph on GPU card 1 and this part on GPU card 2," and that's okay.
But it's kind of annoying, because it's not something humans really want to think about. The system is able to come up with pretty exotic placements. Each color there is a different GPU card, and on the left you see a sequence prediction model unrolled in time, with different time steps on different GPU cards, which is kind of counterintuitive compared with what a human expert would do.
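For context, here's a sketch of the manual placement that the learned placer replaces: explicit tf.device scopes chosen by a human. The GPU device names assume those GPUs exist; soft placement lets the sketch fall back to CPU on machines without them.

```python
# Sketch: hand-chosen device placement with tf.device scopes.
import tensorflow as tf

tf.config.set_soft_device_placement(True)   # fall back to CPU if a GPU is missing

with tf.device("/GPU:0"):
    a = tf.random.uniform((2048, 2048))
    b = tf.matmul(a, a)          # first chunk of the computation on GPU 0
with tf.device("/GPU:1"):
    c = tf.matmul(b, b)          # second chunk on GPU 1
print(c.shape)
```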
The other one is an image model, but the punchline is that these placements are basically 20% faster than the placements the human experts came up with. Okay, so that's where we are now, and we think there's a big opportunity with more compute to accelerate a lot of the use of machine learning and the different applications and societal benefits you can get from it.
I'm pretty excited about that. And, you know, example queries of the future: the upper-left one, "Describe this video in Spanish," we can essentially already answer. I didn't show it, but we can actually caption images and generate sentences about them, and it's probably not that long before we'll be able to describe video too. "Find me documents related to reinforcement learning for robotics and summarize them in German."
You know, that's a pretty complicated request; that's the kind of thing you would give to an undergraduate as a semester project and say, please come back with a report for me. But imagine if we could actually do that; imagine how much more productive everyone would be. It'd be pretty amazing.
I think robotics is at an inflection point where, through machine learning for control, we're going to have robots that can actually operate in messy environments like this one or the kitchen over there and actually know how to manipulate things in a safe way interacting with humans. So that's going to be exciting too.
You already know this, but deep nets are making big changes, and you should pay attention. You can find more info about our work at g.co/brain, and, oh, you could join our team, but you're already starting companies! Before we get to questions, I have a poll that was requested of me, and I'm curious too: how many of you are using deep learning models in what you're doing? Okay. How many of you are using Caffe?
How many of you are using PyTorch? How many? Okay, and TensorFlow? Wow, okay, cool, that's good to know! Excellent. Yes, yes, roughly in proportion, in fact. Ah, cool. Ah, anything to add, Zach? Okay, any questions?
Yeah, well, when you talk about the learning-to-learn stuff, the neural net models designing other neural net models (for example, when the model designed a model that performed better on CIFAR-10 than other models), do you look at those models and say, "Oh, I understand why that performs better," or is it the case that it did something wacky and you don't understand why it works better?
I mean, I think it depends. Sometimes you just want the most accurate end model for the problem you care about, and that's fine. Sometimes you're trying to come up with a model, and you want to understand why it's more accurate so that you can then drive further human-oriented machine learning research.
So I think it depends. The symbolic expressions for the optimizer update rules are actually pretty interpretable. If I go back to that slide, it's actually pretty interesting: if you look here, there's this sub-expression, e to the sign of the gradient times the sign of the momentum, that recurs in a lot of these learned optimizers, and that sort of makes sense.
Basically, if the sign of the gradient is the same as the direction you've been going, speed up, and if it's different, slow way down. That's a good intuition to have, and you can see that the reinforcement learning search wanted to do that in five or so of these expressions.
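To make that rule concrete, here's a minimal NumPy sketch of an update of that form, exp(sign(gradient) * sign(momentum)) scaling each step; the learning rate, decay, and toy quadratic below are made up for illustration and are not the paper's hyperparameters.

```python
# Sketch: an update rule that speeds up when the gradient agrees with the
# recent direction and slows down when it disagrees.
import numpy as np

def learned_update(w, grad_fn, lr=0.05, beta=0.9, steps=100):
    m = np.zeros_like(w)                      # running average of gradients ("momentum")
    for _ in range(steps):
        g = grad_fn(w)
        m = beta * m + (1 - beta) * g
        w = w - lr * np.exp(np.sign(g) * np.sign(m)) * g
    return w

# Toy problem: minimize ||w - target||^2.
target = np.array([3.0, -2.0, 0.5])
grad_fn = lambda w: 2 * (w - target)
print(learned_update(np.zeros(3), grad_fn))   # converges close to target
```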
In some sense, depending on what problem you set up in the "learn to learn" framework, you can actually come up with human insights about, "Oh, well, that makes sense from the experiments I've had."
But, you know, here you can investigate that cell and understand it if you look. It's doing a bunch of adds at the bottom, but it's also doing an element-wise multiply on the input data in one of the paths through the cell, which is different from what the LSTM cell does.
So that might give you some insight into why it does better, if you look at it. And that image architecture, I think, is kind of crazy, but we do know from ResNet that skip connections make a lot of sense, and this is just crazy skip connections in lots of places.
Oh, yeah, then I guess a follow-up question is: do you think this is going to be a tool for humans to build better nets, or is this going to be how nets are built in the future instead?
It could be both, but I will say that this system can run twelve thousand experiments in a weekend, and humans are not that good at that. So with all that compute you were showing, it strikes me that you might run out of human-labeled data. Is that compute really for the reinforcement learning settings where you can run twelve thousand experiments in a weekend, or do you have enough human-labeled data?
Oh, so as an example of that amount of computation: when we were training our translation models for one language pair, we were using hundreds of GPUs for a week. For that problem, we actually had enough training data that we could only get through a sixth of it once. So we know that if we could get through all of it, the quality would be way better, right?
Because that's just a general rule of machine learning—if you could get through all your data, probably, it would be better than not. If you could even go through it a few times, it would be even better. So we think there are plenty of problems where there's enough labeled data in the world that you want to tackle a single problem and train a single model on something like that.
But it's also going to be pretty good for small-model exploration, where you try ten thousand different things, each of which takes maybe an hour to run on some subset of the chips. It just depends on the problem. The architecture search is kind of tenable, with not the current generation but the previous generation of GPUs, for things like CIFAR-10, because you run each experiment for an hour, get an answer, and run twelve thousand of those.
That was about 700 GPUs over a weekend. We know there are a bunch of algorithmic improvements we could make to drop that by a factor of 10, but it's just on the boundary of practical for tiny problems, and making it practical for real problems at scale is going to be really, really cool.
Maybe a follow-up question to that: you had that slide where you have data and a lot of compute, with the machine learning expert gone. Do you see anything in the near term where you could have really powerful models on very, very small datasets, much smaller than what a company like Google would have access to?
Yeah, I mean I think the right way to tackle that is right now, the way we as a community tackle learning problems is we say, "Okay, we're going to train a model to do this," and we might say, "Gee, we don't have much data for this problem. We're going to do transfer learning from ImageNet."
Then I have my 5000 flower images, and I'm going to do transfer learning and fine-tuning on that, but that's really kind of lame, right? Like we want to build real systems, real intelligent systems. We want a model that knows how to do a thousand things, ten thousand things.
When the ten-thousand-first thing comes along, we want it to build on its knowledge for how to solve those ten thousand things so that it can solve the ten-thousand-first thing with much less data, with many fewer examples, building on the representations that's already learned.
So if we can build a single giant model that can do thousands of things, that's going to improve the data efficiency problem a lot and also the time to wall time to actually be able to master a new task problem as well.
So I think that's the way we're going to get to more data-efficient, more flexible things because the problem with the current approach is we train a model to do one thing, and then it can't do anything else, which is pretty mean.
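For reference, the transfer-learning workflow he describes as the current approach looks roughly like this; the flower dataset, class count, backbone choice, and the `flower_images` and `flower_labels` names are stand-ins for illustration.

```python
# Sketch: transfer learning from an ImageNet-pretrained backbone to a
# small labeled dataset, training only a new classification head.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         weights="imagenet",
                                         input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained features
head = tf.keras.layers.Dense(5, activation="softmax")(base.output)   # e.g. 5 flower classes
model = tf.keras.Model(base.input, head)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(flower_images, flower_labels, epochs=5)   # a few thousand labeled images
```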
What do your best engineers do while they're waiting for a model to train? Well, they often start up other experiments and hit reload on the visualizer. They write code; they think of ideas at a whiteboard; they do lots of things. But getting that iteration time down from days or weeks to hours really qualitatively changes your workflow, so we're really shooting for making that time-to-result as short as possible.
Then people won't have these week-long waits where they think, "Gosh, I hope my experiment works." What would you attribute the gap in translation quality between languages to? Is it just the amount of data behind each one?
I think for some language pairs the translations are more natural because the languages come from more closely related language families and the alignment is similar, as opposed to having a very different word order and very different character sets, for example. But I think ultimately we will get higher-accuracy models by, you know, using a pod to train a really big model and getting through all the data once.
I suspect we could probably exceed human quality translations for some language pairs, you know, if we get through all the data once, maybe that may need a slightly bigger model. The analogy is, you know, even the best human translator is only going to see so many words in their life.
If your translation system can train a lot more data and see more of them, even though it's probably not as intelligent and flexible at getting maximal information for each word that it sees, maybe at some point, it's going to do better.
To be honest, we haven't experimented with a broad enough set of tasks to really make conclusions here. I suspect that there may be tasks—probably for any supervised tasks like where you have a crisply defined input and output and you have enough training data, you know, it'll probably work.
It's a question of how much compute you need to apply. We have a lot of ideas for making the search algorithmically more efficient, like cutting off experiments that are obviously dead early rather than running them to conclusion, lots of things like that.
For the architecture search itself, right now we train a bespoke model-generating model for each problem we're trying to solve. Obviously, you'd want to train a model-generating model that solves many problems, so that it starts from a better set of good architectures for a new problem because it has seen similar ones: "Oh yeah, lots of convolutions and 12 layers is a good place to start," or something.
How does the internal development cycle look for keeping the machinery up to date? For example, how frequently do you retrain the models, in particular for the pre-trained APIs you provide through the Cloud?
How frequently do we retrain everything, right? It varies depending on the domain. Some domains, like vision, are pretty stable; you don't need to retrain every hour. But other domains, like some of our internal problems, say trying to predict which ads are relevant, change fairly rapidly.
There's a new chocolate festival on Long Island tomorrow that wasn't there yesterday, and now you actually want to know that that's important. So some things have a very stable distribution; some don't.
It really does vary a lot depending on the problem. Certainly, it's easier for things like speech or vision where just the basic perception is what you're trying to do, and the distribution is pretty stable. If you have a changing distribution, that introduces lots of annoying production issues because now you have to retrain, and you need to sort of somehow integrate new data so that you can learn new concepts and new correlations relatively quickly so that you can then produce good correlations and good output.
You mentioned sort of fast iteration being really important for developing this stuff. How much of the process sort of now and like the cutting edge neural net development is still trial and error, and how much of it is like, "I'm going to insert this and I know what's going to happen"?
I mean, I think a lot of machine learning research is empirical these days. Right? You have an idea you think it'll work, but you need to try to implement it, try it on interesting problems, explore the set of hyperparameters or whatever that will make the idea go from not working to hopefully working.
So it's often the case that you need to do this kind of empirical stuff. I mean, there are some ideas that you have a lot of intuition like, "Oh yeah, that's definitely going to work," even beforehand because it's sort of putting together two things that did work with a third thing that also did work, and it seems pretty obvious that combining them is going to work as well.
But for other things, it's hard to build the intuition. A while ago, you guys did some really great work on visualizing what a convolutional neural network doing image classification was actually doing. Interpretability of models seemed to be a focus for a while, and then there's a point where you cross over, and with the learn-to-learn stuff you just can't interpret it.
Maybe this is kind of what you're asking, but how important is that for delivering production models to humans, who maybe are not machine learning experts that need to work alongside a robot classifier?
It's really important in some domains and not important in others. We actually have a pretty big focus on this; I have a much longer set of slides that I selected a subset from.
We've been doing a bunch of work on understanding, visualizing, and building interpretability for models that I didn't talk about, but it is an important area. The main area where I think it's really important is health care.
If you tell someone you're providing advice to a physician and say, "This patient needs a heart valve replacement," you know they're going to want to know why you're saying this.
If you can go back and highlight a part of a medical note that says, you know, a year and a half ago the patient was complaining that their heart felt like it skipped a beat every so often, that makes for a much smoother interaction between the machine learning system and the human and lets each of them play to their strengths.
Whereas if you just give a black-box prediction, that's often not as useful in some domains. But for some things, like image classification, I just want the most accurate classification possible.
Well, thank you.