Meta-Learning and One-Shot Learning
Hey guys, this is Macheads101, and today is a very special video, because I'm going to be describing to you an algorithm that has never been seen before, which achieves general intelligence. I've been working on this algorithm for about a year now, and it has finally gotten to the point where it really passes the Turing test and can learn a huge variety of tasks just from me telling it what to do. So, it is pretty much equivalent to a human-level intelligence; actually, it's a lot better in many respects. I think this is a really big new breakthrough in AI, and I wanted to share it with you guys first.
So, yeah, I'm not actually going to be talking about artificial general intelligence, but what I am going to be talking about is almost as interesting. It is the idea of meta-learning, or learning to learn, and this is a subfield of machine learning that has recently seen some papers that are expanding the field. So, I want to talk about a few different papers and then present some of my own work at the end.
So, before I really dive into the details of meta learning, I want to talk a little bit about one-shot learning, which is what got me interested in meta learning in the first place. So, the basic principle behind one-shot learning is that it should be possible to learn something from one example. So, in the spirit of one-shot learning, I'll give you one example of what one-shot learning might be, and I hope you will figure out what one-shot learning is in general.
So, in this example, suppose I give you two different handwritten characters, and I tell you one of them is character one and one of them is character two. So, I've basically shown you a made-up alphabet with two different letters, and now I can show you a whole bunch of pictures of letters, and you can probably tell me pretty easily which of these you think are character one and which are character two.
So, basically from just the one example I gave you of each character, you can learn how to label basically all handwritten characters, no matter how much the handwriting kind of differs. Maybe it's written in a different part of the image, but you can kind of figure it out. So, that's the basic idea behind one-shot learning.
The reason one-shot learning is interesting is because it's pretty much impossible to get classical machine learning algorithms to do it well. Think about it: if I just gave you two example pictures and you knew nothing, if you were a clean slate, like a neural network that's never been trained, you wouldn't know anything about pen strokes, about handwriting, about lines; all you'd be seeing is two images with a bunch of pixels. There's no way, from that alone, that you're going to derive on your own what defines a handwritten character. There are just too many possible explanations.
So, one-shot learning would require a different approach where we would have a neural network with a lot of background information, a lot of background knowledge, like a human adult person, and it could apply that to new tasks and do things like one-shot learning.
So, that brings us to meta-learning, where the idea is that instead of training some model, some neural network or whatever, to do the specific task we're interested in, we're going to train that model to be able to learn new tasks quickly. We'll train that model on a variety of different tasks, and hopefully that will make it capable of learning a new task without needing a ton of data, because it's basically applying all of the information it already has from its previous experiences to the new experience.
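Just to make that loop concrete, here is a minimal sketch of what episode-based meta-training could look like. Everything here is a hypothetical placeholder; sample_task, adapt, and evaluate don't come from any particular paper or library.

```python
# A minimal sketch of an episode-based meta-training loop.
# sample_task, adapt, and evaluate are hypothetical placeholders,
# not functions from any real library.

def meta_train(model, num_iterations):
    for _ in range(num_iterations):
        task = sample_task()                     # e.g. "classify these 5 new characters"
        adapted = adapt(model, task.train_set)   # learn the task from a few examples
        loss = evaluate(adapted, task.test_set)  # how well was the task learned?
        model.update_from_meta_loss(loss)        # improve the model's ability to learn
```

The outer loop optimizes for learning ability itself: the quantity being minimized is how badly the model does after adapting to a fresh task.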
So, now I'm going to go ahead and talk about these four papers, which I think provide a really good sample of the sort of research that's being done in the field of meta learning right now, and they're also really great because I think they logically build up to some of the work that I was doing, even though I started my work before I knew about these papers.
So, I'm going to present these four papers kind of on a high level, explain what they do in such a way that pretty much anyone should be able to understand kind of what's going on, and then I will describe some of my own research and results.
The first paper we're going to be talking about is called "Human-Level Concept Learning through Probabilistic Program Induction." This paper is really important because it introduces a new dataset called the Omniglot dataset. Basically, they had 20 different people write out over a thousand different characters (1,623 of them, drawn from 50 different alphabets), so what you get is a grid like this where you have all these different letters, and you have 20 examples of each letter.
So, why is this so brilliant? Why is this so useful for meta-learning? Think back to the one-shot learning task I showed earlier, where I showed you one example each of two different letters and you were then able to identify those letters across a large spread of examples. Maybe the reason you were able to do that is because you have experience with an alphabet: you at least probably know the English alphabet, you might have encountered other alphabets, and you've certainly encountered other line drawings in your life.
So, you kind of have background knowledge on what makes a line drawing similar to another line drawing, what makes a letter similar to another letter. What you can do with the Omniglot dataset is you can train some kind of machine learning model with a couple of alphabets that it uses kind of as background knowledge and then see how fast it can learn a new alphabet once it already has that background knowledge.
Now, the way the authors did this is actually to me a little bit questionable because it seems like they actually took a lot of their pre-existing knowledge about handwritten characters and hard-coded it into their model. So, for example, they didn't just train their model on the raw images, you know, the raw pixels of these hand-drawn letters; they also trained the model on pen strokes, meaning the model got to see where the pen was over time.
So, they basically hard-coded into the model the idea that characters are drawn with a pen, or something like that. And they also hard-coded into their model the idea that you can combine pen strokes: drawing one stroke and then another, or taking one stroke and attaching another onto it to make something more complex.
So, they did hard-code some pre-existing knowledge into their model; it didn't learn everything from scratch. With that being said, it still learned a lot, and they got some impressive results. This is a different approach from the other papers I'm going to be talking about, which use neural networks where the network is expected to learn everything, whereas this one builds in some pre-existing knowledge. But it's still very impressive.
The next paper we're going to talk about is called "One-Shot Learning with Memory-Augmented Neural Networks." This paper also uses the Omniglot dataset, but they do it in kind of an interesting way. So, I'm going to give you an example of an episode of the task that they would have their model learn so that you understand kind of what they're doing here.
So, learning is done in episodes, and to make an analogy to a human life, you could imagine that a human lifespan is one episode. A human is born; they have to figure out how to optimize things in their life, and then they die, and their goal is to maximize some quantity, maybe their happiness, during that period of their life.
So, the human performs learning during their lifespan, but evolution, and society in general, performs meta-learning: it makes it so that, over time, when a new human is born, they might do better on their episode than humans in the past have done. Or, more generally, when an animal is born, it is more fit and more capable of dealing with its episode than animals in the past were. That's the evolutionary viewpoint of it.
So, they break their task up into episodes. You can think of an episode kind of like a lifespan. The key behind this idea is that each lifespan might contain different problems. You know, I might have to solve completely different problems than someone else born in some other country, so learning is necessary during an episode.
And they wanted to set up an environment using Omniglot where the episodes are similar, in the sense that they contain the same general kind of problem, but where each episode is a different specific problem. So, here's how they might set up an episode for their experiment. First, they pick out a couple of different characters from the Omniglot dataset, and they give these characters labels.
So, in this example, I've taken four different characters, and I've labeled them one, two, three, and four, and you can see I've chosen two examples of most of them. I chose one example for label one. The basic idea is we just kind of come up with an arbitrary alphabet; in this case, it has four different letters in it, and this is the setup for our episode— this is the situation that the network will have to learn.
Now, what we're going to do is shuffle these examples into some random sequence, go through them one by one, and show each one to the network, which has to predict its label. As it makes guesses and we tell it what it should have guessed, it will learn.
So, let's actually just go through an example of this. Suppose we only see the first letter. We don't know what the labelings are, so we have no idea what this is going to be, because we don't know what arbitrary number was assigned to it. So, maybe I'll guess that it's label one, and I would discover that I was wrong; it was actually label three.
So, after we make a guess, they tell us what it should have been. Now we're asked to identify the second picture in this list, and it doesn't look anything like the first, so I'm going to guess it's not label three. Maybe it's label one. We guess that, and in this case, we're right; it was kind of a guess, but an educated one, because we knew it wasn't three.
And now we have a third image. Well, it doesn't look like one or three, so it's probably either two or four. We can just guess two, but it actually turns out that it's four. So far we've gotten one out of three correct. But now we have more knowledge; we have seen more stuff. This next letter looks a lot like the first letter, so we should guess that it's a three, because we saw the other three earlier in the episode, and we would actually be right.
So, this is the first time we've really taken our knowledge that we gained from earlier in the episode and applied it later in the episode. And now, we see another letter; this one doesn't look anything like the others, so there's only one label left—it's two—so we can guess that, and we would be right. And then the next one also looks like a two; it looks like the last one, so we guess that, and we're right.
And then this next one looks a lot like a four, so we guess that. So, the idea is basically that we go into this episode knowing nothing about which letters are going to be used or how they're going to be labeled, and as we guess and are told the correct answers, over time we learn.
In the paper's experiments, the episodes are either of length 50, where they show you 50 characters and there are 5 different labels, or of length 100 with 15 different labels. They also play around with the labels a little bit, but I don't want to get into the details; that's the basic idea of the task. They basically define an episode as a sequence of images, and at each place in the sequence, you have to make a prediction.
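Here's a rough sketch of how one of these episodes could be generated; it's my own illustration of the setup, not code from the paper. `classes` is assumed to be a list of character classes, each a list of images.

```python
# A sketch of generating one episode of the task described above.
import random

def make_episode(classes, num_labels=5, length=50):
    chosen = random.sample(classes, num_labels)   # the episode's made-up alphabet
    episode = []
    for _ in range(length):
        label = random.randrange(num_labels)      # labels are arbitrary per episode
        image = random.choice(chosen[label])      # some drawing of that character
        episode.append((image, label))
    # At each step, the model guesses the label of the image,
    # and is then told the correct answer.
    return episode
```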
So, in case you're not familiar with neural network stuff, this kind of problem, where there's a sequence and a correct output at each time step of the sequence, is classically solved with something called a recurrent neural network. Just to review what that is: standard neural networks are feed-forward, meaning that the network gets an input, that input propagates through in a forward direction, and then the network produces some output at the end.
The problem with a feed-forward neural network is that it has no notion of memory; it can't remember things. If you give it an input, there's no feedback that allows it to remember that input the next time you show it something, and we need feedback, because we want the network to learn, over the course of the episode, which pictures have which labels.
So, what you do is you have some of the outputs of the neural network feeding back as inputs, and that allows you to have some amount of memory. Basically, by having feedback, you can store information over a longer period of time; that idea is called a recurrent neural network, and that is one way to solve this problem.
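Here's a bare-bones recurrent step in NumPy, just to show what that feedback looks like; the hidden state h is the memory that carries information from earlier in the episode. This is a generic vanilla RNN, not the architecture from any of these papers.

```python
# A minimal vanilla RNN step: h is the feedback (memory) that persists
# between time steps; the same weights are reused at every step.
import numpy as np

def rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # new state mixes input with old state
    y = W_hy @ h + b_y                      # prediction at this time step
    return y, h

# Over an episode:
#   h = np.zeros(hidden_size)
#   for x, label in episode:
#       y, h = rnn_step(x, h, ...)  # h accumulates what was seen so far
```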
What this paper does is not an entirely new idea, but it is fairly new in the field of meta-learning: they take a recurrent neural network and augment it with an external memory structure. You can think of the recurrent neural network as the controller; it's kind of like the CPU of this model, and there's also an external memory module which acts like RAM.
And what happens is the controller network can send information to be recorded in this external memory, and it can send queries to the external memory and get back information from it. So, it's basically like doing this task with a notepad, where you can write things down and then look back at them. The external memory allows the model to store more information about the episode so far, and the authors find that this external memory actually helps their model a lot; they use it to get results that are much better than the results they would get with a plain recurrent neural network.
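To give a flavor of how a controller might interact with such a memory, here's a rough sketch of a matrix memory with content-based addressing. This is in the spirit of memory-augmented networks generally; it is not the paper's exact read/write equations.

```python
# A rough sketch of content-based memory access: reading attends to rows
# similar to the query, and writing softly updates rows by attention weight.
import numpy as np

def read(memory, key):
    # memory: (slots, width); key: (width,)
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sims = (memory @ key) / norms             # cosine similarity per slot
    w = np.exp(sims) / np.exp(sims).sum()     # softmax attention over slots
    return w @ memory, w                      # weighted blend of the rows

def write(memory, w, erase, add):
    # Soft erase-then-add, applied to each slot in proportion to w.
    memory = memory * (1 - np.outer(w, erase))
    return memory + np.outer(w, add)
```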
This next paper is called "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks." This paper takes a slightly more direct approach to solving the problem, and the reason I say that is this: the problem is that training a neural network from scratch with a regular training algorithm takes a long time and needs a lot of data, and what they're really trying to do is get a neural network that can be trained quickly through the traditional training algorithm.
They want to make a neural network which is prepared in such a way that it can be trained quickly on new tasks. Imagine you have this neural tissue with a lot of knowledge floating around in it, and you want it so that, when you go and train this thing on a specific task, all of that tissue can be put to use for whatever the task is: the training quickly figures out how to restructure the neural tissue so that it can do the particular task.
So, that's the idea behind this paper. It's a totally different approach than the previous paper, but it got very similar results on Omniglot, so very good results. Now, the results aren't directly comparable; their experimental setup is a little bit different. You know, they scale the images differently; they do a couple things differently, but overall pretty similar results.
So, this model actually does surprisingly well. It's actually possible to get a neural network, you know, train it in such a way that now it can be trained on a new task really quickly.
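To show the core trick, here's a toy, self-contained cartoon of the idea: take one gradient step on a task, then update the initial parameters so that that single step lands somewhere good. Each "task" here is just fitting a scalar w to a target t with loss (w - t)^2, so all the gradients can be written by hand. This is my illustration of the mechanism, not the paper's code.

```python
# A toy illustration of learning an initialization that adapts quickly.
# Task: fit scalar w to target t under loss (w - t)^2, with analytic gradients.
import random

def meta_train_toy(num_steps=1000, inner_lr=0.4, outer_lr=0.05):
    w = 0.0                                     # the meta-learned initialization
    for _ in range(num_steps):
        t = random.uniform(-1.0, 1.0)           # sample a task (its target)
        w_prime = w - inner_lr * 2 * (w - t)    # inner step: gradient of (w - t)^2
        # Outer step: differentiate the POST-adaptation loss (w' - t)^2 with
        # respect to the ORIGINAL w; here dw'/dw = (1 - 2 * inner_lr) by hand.
        meta_grad = 2 * (w_prime - t) * (1 - 2 * inner_lr)
        w -= outer_lr * meta_grad
    return w
```

The key point is that the meta-gradient flows through the inner training step back to the initialization; in the real paper, an autodiff framework does that differentiation for full networks.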
The last paper is called "Optimization as a Model for Few-Shot Learning." The goal of this paper is similar to the goal of the previous paper, where they want to get a neural network to learn new tasks as fast as possible. So, they're still going to take a neural network, which they call the learner, train it on the samples from the new episode, and try to get it to learn as fast as possible.
What they add on top of the previous paper is that now, another neural network—a small neural network called a meta learner— is going to try to help the learner learn faster. So it does this by influencing basically the parameter updates of the learner network.
So, if you don't really know how neural networks are trained, this is going to be kind of opaque, but the basic idea is each parameter of the learner neural network is updated during training. You know, at each time step, each synaptic weight of the network is updated to make it do better on the example it just saw, and the meta learner helps decide how that weight should be updated.
Now, this architecture is somewhat limited: the meta learner handles each parameter, each synaptic weight of the learner network, independently, and it only sees the history of that particular parameter. So, I was actually surprised this architecture was so helpful, because really they're just implementing a fancy numerical optimization algorithm.
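Here's a cartoon of what a learned per-parameter update rule looks like: a tiny network reads a parameter's current gradient plus a hidden state carrying that parameter's history, and proposes the update. The actual paper uses an LSTM-based rule shaped like a gradient-descent update; this simplified version just shows the interface, with made-up, untrained weights.

```python
# A cartoon of a learned per-parameter optimizer: the same meta-learner
# weights W are applied to every parameter of the learner independently.
import numpy as np

def meta_learner_step(grad_value, h, W):
    # h carries the history of this one parameter across training steps.
    h = np.tanh(W["hh"] * h + W["gh"] * grad_value)
    update = W["hu"] * h        # the proposed change to the parameter
    return update, h

# Usage sketch (W would be trained across many episodes; these are made up):
#   W = {"hh": 0.5, "gh": -0.3, "hu": 0.1}
#   for each learner parameter p, with its own h:
#       update, h = meta_learner_step(dL_dp, h, W)
#       p = p + update
```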
But this is kind of an idea where there's a neural network helping to train another neural network, and it's somewhat related to what I've been working on. So, when I started doing meta learning, I hadn't seen the last two papers I just talked about, but I had seen the first two, and I wanted to do something similar to the second paper I talked about, where you have some kind of controller recurrent neural network, and it can write to some external memory; it can send queries and get results back, and it uses that external memory to kind of learn over time.
However, I wasn't really satisfied with the way external memory is implemented right now in memory-augmented neural networks, because pretty much every external memory structure I know of, not every single one but most of them, is represented as a matrix of values. When you write to the memory, you're just editing a row or some rows in that matrix, and when you read from the memory, you're just retrieving a row from that matrix. So, the memory is really nothing more than a scratch pad where you can scratch things down and get them back verbatim.
I wanted to make a memory that could be smarter: one that could relate concepts to one another, that could maybe compress information, that could generalize to new things it doesn't exactly remember but that are similar to things it does remember. I wanted a more intelligent external memory, and there have actually been attempts to do things like this; DeepMind works on this kind of stuff all the time. But I had a slightly different idea, and that was to use an external neural network as the memory.
I want to have the controller train an external neural network, and that training will infuse memories in that external neural network, and then the external net will basically just serve as memory augmentation. It'll serve as an intelligent scratch pad for knowledge, and this actually works surprisingly well.
To see why this works so well, I'd like to consider a few properties that we would like external memory to have. So, the first property that's obvious is we want the external memory to be able to hold a large amount of data. If it's small, it's no better than just like the recurrent feedback connections in a recurrent neural network. You know, we need a lot of memory because otherwise there's no point.
We also need the external memory to be easy to use, and by that I mean the controller should have an easy time writing to memory, it should be easy for the controller to read exactly what it wants from memory, and so on. And, of course, we also want the external memory to be somewhat intelligent. We want to be able to look up things that aren't exactly the same as what was stored. We might not know exactly what we want from memory, but we still might have an idea, and we want to be able to look things up by similarity, by content, by some semantic meaning.
Most memory-augmented networks attempt to do all these things; the last one is the hardest. My hypothesis was that a neural network can provide all of these properties better than, or at least as well as, an external matrix, and that's what I wanted to find out.
So, speaking generally, here's how I saw a storage neural network fitting these requirements when I was hypothesizing about this. First off, neural networks can be huge; they can have a lot of parameters, so obviously they can store a lot of information. But they can also compress information: if you have two similar things you want to store, the network can figure out a good representation so that it represents them both somewhat compactly. That seemed like a big plus to me.
The other thing is that this seems pretty easy to use. Basically, when the controller wants the external network to remember something new, it produces an input and an output, and it tells the external memory: when I give you this input, give me back this output. And because the external memory is a smart neural network, it will not only learn that association, it will learn to generalize from it.
Like: if the controller gives me something similar to this input, I should give it something similar to this output. So, not only is it easy for the controller to use, because it just has to provide a query and the response it expects, kind of like a key and a value, but the external memory will also automatically try to generalize.
And that's also the last point: the external memory will automatically try to generalize, because it is a neural network. Neural networks are good at generalizing, so you can get something that actually believes it remembers things it maybe never even saw, because it's generalizing from its existing memory. We actually see that in humans a lot, and I think it could be a very beneficial property for a model to have.
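Here's a sketch of the interface I have in mind: a write is a few gradient steps that teach the memory net a key-to-value association, and a read is just a forward pass. This is a bare-bones one-hidden-layer version to show the mechanism; my actual model is more involved than this.

```python
# A sketch of a neural network used as external memory: write = a few SGD
# steps on a (key -> value) pair, read = a forward pass.
import numpy as np

class MemoryNet:
    def __init__(self, key_size, value_size, hidden=64, lr=0.1):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (hidden, key_size))
        self.W2 = rng.normal(0.0, 0.1, (value_size, hidden))
        self.lr = lr

    def read(self, key):
        # Recall: just run the key through the network.
        return self.W2 @ np.tanh(self.W1 @ key)

    def write(self, key, value, steps=5):
        # Memorize: a few gradient steps on the squared recall error.
        for _ in range(steps):
            h = np.tanh(self.W1 @ key)
            err = self.W2 @ h - value                # recall error
            dh = (self.W2.T @ err) * (1.0 - h * h)   # backprop through tanh
            self.W2 -= self.lr * np.outer(err, h)
            self.W1 -= self.lr * np.outer(dh, key)
```

Because all the stored associations share one set of weights, similar keys naturally retrieve similar values, which is exactly the generalization behavior I was describing.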
So, how is my model similar to the other papers that I talked about in this video? Well, it relates to Santoro et al. in the obvious way that both my model and their model are memory-augmented RNN systems—controller network with some external memory bank. The only difference is my external memory bank is another neural network.
This also relates to Finn et al. because the memory network has to be easy to train. In essence, if the memory network were hard to train, if it took a lot of iterations of training to get it to do anything, it wouldn't be an effective source of memory; the controller wouldn't be able to use it to remember things. So the memory network has to learn to be easily trainable by the controller network, and in that way it relates to that paper, Finn et al.
And finally, we relate to Ravi et al. because this is just one network training another network, and theirs was the same thing; that English sentence makes my model and their model sound very similar. The biggest difference is that their meta learner trains the learner by updating the raw weights, the raw parameters, of the learner network, whereas my controller updates the memory network by showing it an input and an output and telling it: I want this input to give this output, learn how to do that.
So, in my case, the controller network has a much simpler job. It gives the memory network a high-level task, and the memory network is trained through standard training algorithms. So really, in one way my model differs from all of these methods, but looked at in a different way, it combines some of the good parts of all of these approaches. With all that being said, I've only been working on this model for about a week now, and I have gotten some really good results, specifically on Omniglot and on another meta-learning task.
But you know, really I have to run more experiments before I know how good this is, and you know, there’s a lot to do still. So anyway, thanks for entertaining all of that. There will be links in the description to all of these papers and to my own work. Thanks for watching, subscribe, and goodbye.