RBF Networks
Hey guys, this is Maads 101, and today I'm going to be talking about an unusual class of neural networks known as radial basis function networks. RBF networks can be used for a whole lot of things, including classification. As I'm showing here, you can make an RBF network that's pretty simple but can still tell which handwritten digit you're drawing on the screen, just from the pixels that it sees.
A lot of algorithms can do this, but RBF networks were actually one of the best algorithms I've tried: they learn really quickly and do a really good job. Another thing you can do with RBF networks is interpolation. Imagine I gave you a couple of points and asked you to connect the dots, that is, to draw a smooth curve through the points. In doing so, you're deciding what goes between the points.
That's what interpolation is: I give you a limited amount of data and you fill in the gaps. One cool thing you can do with interpolation is resizing an image, because if you have a certain number of pixels and you can predict what goes between those pixels, you can generate a larger image.
So I used an RBF network to resize images, which is kind of unusual; I don't usually see people doing that, and it actually looks different than resizing the image with a normal image editor. You see weird things: if there's a circle in the image and you resize it, you'll get ripples inside the circle, interesting artifacts like that.
So, I'll probably link to that work in the description, but that's another cool thing you can do with RBF networks. To explain how RBF networks work, I'm going to start off by giving a really straightforward example. Suppose I give you a bunch of data points and they're just XY coordinates, so they're locations on a plane. For each coordinate, I tell you whether that coordinate is red or green.
Now, I want you to be able to find a pattern, so that if I give you a new XY coordinate you've never seen before, you can tell me what color you predict it will be. As you can see from this data, it's probably pretty straightforward, because there are clusters of red points and everything else is basically green.
I've already made videos showing ways to do this, like nearest neighbors, but RBF networks take a slightly different approach. They model the data in terms of circles, or spheres, or any radial shape. With an RBF network, we might use two circles: if a point is inside one of these two circles, it's red; if it's outside, it's green.
Now, we could use this to color in the whole image and decide, for every pixel, whether we think it's red or green based on whether it's inside or outside a circle. You're probably asking: does this always work? In this case the data happened to be nicely clustered into circles.
But what if we got data that looked something more like this, where it's not obvious how to divide things up into circles? The answer is that you can always divide data up into some set of radial shapes. You might need more and more of them, but in the limiting case, you could just put one circle around each data point.
That way, you could always split the data up into circles; you'd need a lot of circles, but you could do it. Usually you don't need as many circles as there are data points. For that handwriting classifier, I had about 300 of these radial shapes dividing up the space.
Now I need to explain radial basis functions, because they are the heart of RBF networks. Earlier, I showed that we could draw a picture by coloring each pixel based on whether that pixel is inside a circle. That's not quite what we do with RBF networks.
There is still an idea of circles, but they're not hard, sharp, closed-off circles; they're smooth. As you can see in this picture, we still have these circles, but the color transitions smoothly from red to green as you move outside a circle.
Smooth transitions are nice for a whole lot of reasons; they make the entire RBF network algorithm work better. One nice thing is that we can now model how confident we are: if you're on the outskirts of a circle, a point is maybe less likely to be red than if you're at the center, especially if the data wasn't actually split into a perfect circle.
You might want to say that as you get close to the edge of a circle, the circle model isn't as accurate. So having a smooth drop-off is actually really important, and it makes training easier. In general, making things smooth instead of having a hard cutoff makes our lives easier.
So really, we want the effect of a circle to decrease as you get further from its center. Here's a graph with redness on the y-axis and distance on the x-axis: as the distance increases, the redness drops down.
Once it gets to zero, it levels out, and the shape of this curve really matters. We want it to be basically level when you're close to the center and then drop off once you get outside the circle. We want it to be smooth, but we still want a fairly sudden drop-off.
Picking the right shape for this drop-off curve is actually really important, and depending on which mathematical function you use to get that drop-off, the picture will look a lot different. Here are four different pictures using four different drop-off functions.
So that's all a radial basis function is: something that drops off as you get further away. As the distance increases, its value decreases. The radial basis function I see used almost exclusively in RBF networks is this one, which is basically an exponential decay: e^(-distance^2). You square the distance, negate it, and exponentiate. One thing to note is that the distance in this formula is the distance to the center of a circle. So we have the center of the circle, and as you move further away from it, in however many dimensions we're working in, the radial basis function drops off towards zero.
But how do you compute the distance between two points in space? In a previous video I already talked a lot about computing distances; if you watch my nearest neighbors video, you'll see what I'm talking about, and I'll probably link to that timestamp in the description. As a very brief review: to find the square of the distance, we take the difference in x squared and the difference in y squared and add those together.
That gives us the square of the distance. If we have three dimensions, so x, y, and z coordinates, we take the difference in x squared, the difference in y squared, and the difference in z squared, and add all of those squares together to get the square of the distance.
So it's really straightforward to compute the square of the distance between two points as long as we know their coordinates, and this formula extends to any number of dimensions; you can see how you'd expand it.
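As a minimal sketch in plain Python (the function name is my own, hypothetical choice), the squared distance between two points of any dimension looks like this:

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of any dimension."""
    assert len(p) == len(q), "points must have the same number of coordinates"
    # Sum the squared differences along each coordinate.
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Example: two points in 3D space.
print(squared_distance((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))  # 9.0
```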
This is not the only way to measure distance, but it has a lot of nice properties, which is why we're using it. Now, there's one more piece of the radial basis function idea we have to talk about: how do we actually get circles of different sizes? In some cases we might want the radial basis function to drop off really fast, for a small circle, but in other cases we might have a very large circle and want it to drop off slowly at first.
All we have to do is add a parameter that I'll call beta, because that's what Wikipedia calls it and I want to be consistent if you look it up. Beta is basically a control on the radius: the higher beta is, the faster the radial basis function drops off with respect to distance. You can see this in a graph: for beta equals one, two, and three, the higher the beta, the sharper the drop-off.
We can tweak this parameter to change the size of a circle by changing how fast the drop-off happens. If we're training an RBF network, which we'll talk about later in this video, you might have to decide a beta yourself beforehand, or the network might be able to learn the betas of all the different circles so they can all be different sizes; beta is really what controls that.
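Here's a minimal sketch of that radial basis function in Python, building on the squared_distance helper from the previous sketch (the name rbf is my own):

```python
import math

def rbf(x, center, beta):
    """Gaussian radial basis function: e^(-beta * ||x - center||^2)."""
    # Larger beta means a faster drop-off, so effectively a smaller circle.
    return math.exp(-beta * squared_distance(x, center))

# The same point, with circles of different "sizes":
point, center = (1.0, 0.0), (0.0, 0.0)
for beta in (1.0, 2.0, 3.0):
    print(beta, rbf(point, center, beta))  # value shrinks as beta grows
```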
We're almost ready to talk about what the first layer of a radial basis function network actually looks like. But before we do, I'm going to introduce one more piece of notation, so that you'll understand what's going on if you look this up, and so that I can use the notation too.
Suppose we have two points, and we label them x and y. They're points in space; they might have any number of dimensions. We can denote the distance between these two points using double bars: ||x - y||, or ||y - x||, it doesn't matter. That represents the distance between x and y.
To get the squared distance, we just write a superscript two: ||x - y||^2. Now that we know this notation for distances between two points in space, I can show you the formal notation for the first layer of a radial basis function network. We have some input called x and a bunch of centers.
That's what c1, c2, etc. are: the centers of the different circles, spheres, or hyperspheres. For each center, we compute the output of a radial basis function: we take the distance between x and that center, square it, negate it, multiply it by that center's particular beta, and exponentiate, giving e^(-beta * ||x - c||^2).
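As a sketch, the whole first layer is just that function applied once per center, reusing the hypothetical rbf helper from above:

```python
def first_layer(x, centers, betas):
    """Outputs of every radial basis function for one input point."""
    # One feature per center: how close is x to that center?
    return [rbf(x, c, b) for c, b in zip(centers, betas)]

# Three centers in 2D, each with its own beta:
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
betas = [1.0, 0.5, 2.0]
print(first_layer((0.1, 0.1), centers, betas))  # first value near 1, others near 0
```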
All we're really doing is measuring how close the input is to each of a bunch of points that make up our network; that's what the output of the first layer represents. Here is a more concrete example. In this case, I've chosen to make the input and all the centers images, which really means each one is a large set of coordinates, one coordinate for each pixel.
The coordinates might all range between 0 and 1. The input in this case is a picture of a seven, and one of the centers is also a picture of a seven. We might predict that these two pictures are pretty close together in space, because a lot of their pixels overlap.
Suppose the output for the first center is 0.8, because the input and that center are pretty close to each other; 0.8 is close to one, which is the maximum value. The other centers are far away from the input, so we get low values like 0.13. The thing to take away from this is that you can already see how we've managed to split the space of handwritten characters up into these circles.
There's one sphere of some sort for this seven, another for this six, and another for this five. When we input a seven, we hope it falls inside the sphere for the other seven but not inside the spheres for the other digits.
But we're not quite done yet, because some issues can come up if we just use one sphere for seven, one sphere for six, one sphere for five, and so on. The problem looks something like this: the input is a seven, and we even have two spheres for sevens.
We also have a sphere for a nine, and the nine is kind of distorted. In fact, the nine looks more like the input seven than the other sevens do, because more of its pixels overlap; it's a weirdly shaped nine that almost looks like a seven.
The problem is that the output for the nine sphere is actually higher than the output for either of the two seven spheres. An obvious solution might be: we have two seven centers, so maybe we should add their outputs together, and if that sum is greater than the output for the nine center, we say it's a seven, something like that.
More generally, you might have multiple spheres for each category of digit and want to combine their outputs: how close are you to this one, to that one, to that one? You make decisions based on all of those distances, as opposed to just picking the maximum or the minimum.
To generalize that idea, we're going to introduce output neurons: things that combine the values from all the radial basis functions to produce some output value. We might have multiple output neurons. In the picture I'm showing, there are two output neurons, the two yellow things on the right, but for something like digit classification we might have an output neuron for each digit.
So we would have 10 output neurons: one for zero, one for five, and so on. If we were predicting attributes of a human being, we'd probably have one output for age, another for height, things like that. That's basically what an output neuron does: it produces a number based on the values from each of the radial basis functions.
But we still have to decide what the output neuron is actually going to do: how does it combine the information from all of the radial basis functions to produce an output? The kind of output neuron I'm going to use in this video is known as a linear classifier, which I talked about a lot in my first neural networks video.
I'll link to that in the description if you want to check it out, but I'm still going to briefly review it. Let's call each radial basis function's output a feature: f1, f2, f3, and so on. Those are just the arrows coming from each of the centers in the network diagram, and we're going to feed all of those features into our linear classifier, our linear neuron.
That's what you can see in this picture: f1, f2, f3, and f4 all come into this neuron, and they're just the outputs from all the centers. Now we want to produce one output, and the way we do that is by taking a weighted sum.
As you can see here, we compute f1*w1 + f2*w2 + f3*w3, and so on for each feature. Each weight w assigns a certain level of importance to its feature: if a weight is high, the corresponding feature has a large influence on the output of the neuron.
Suppose a weight were zero; then we would completely ignore that feature, because we'd be multiplying it by zero. And if a weight is negative, that feature has a negative influence on the output of the neuron: as the feature increases, the output decreases.
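Putting the pieces together, here's a hedged sketch of the full forward pass: the first layer from before followed by one linear output neuron (names like predict are my own, hypothetical choices):

```python
def linear_neuron(features, weights):
    """Weighted sum f1*w1 + f2*w2 + ... over the RBF features."""
    return sum(f * w for f, w in zip(features, weights))

def predict(x, centers, betas, weights):
    """Full RBF network forward pass: RBF layer, then one output neuron."""
    features = first_layer(x, centers, betas)
    return linear_neuron(features, weights)
```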
One thing to notice in this diagram is what changes and what doesn't. Every time you give a different input to the network, the features will change, because they represent how far the input is from each of the centers. But the weights never change; they're part of what determines how the network behaves.
When you're training the network, you update the weights to find optimal values, but once you're done training and are using the network, the weights are fixed; they're a fixed part of the network. So now we have to talk about how you could actually get one of these networks to learn. Say you have a problem, handwriting recognition for instance, and you want the network to solve it. How do you do that?
The most basic way is to use a regular neural net training algorithm; I already made a video on training neural nets. The network is just a bunch of numbers: maybe you decide beforehand that you're going to have 300 centers, you decide how many output neurons you're going to have, and so on, and you give all of those values some initial setting.
Maybe you randomize the entire thing, and then to train the network you keep showing it examples and adjusting all of the values in the network slightly, so that the output gets a little closer to what you want. If you keep doing that, the network will eventually learn.
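As a rough sketch of that "adjust slightly" idea, here's a single gradient-descent step on just the output weights, assuming a squared-error loss (the centers and betas could be nudged the same way; all names here are hypothetical):

```python
def train_step(x, target, centers, betas, weights, learning_rate=0.1):
    """Nudge the output weights so predict(x) moves toward target."""
    features = first_layer(x, centers, betas)
    error = linear_neuron(features, weights) - target
    # Gradient of 0.5 * error^2 with respect to each weight is error * feature.
    return [w - learning_rate * error * f for w, f in zip(weights, features)]
```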
I have a much more detailed video explaining how to do that, and I'll link to it in the description; that's the simplest way to train the network. For choosing the centers, there's actually a much simpler approach: if you have 60,000 handwritten digits like I did, you can just pick a random set of them, say a random 300, use each random digit as a center, and start things off like that.
Maybe you'll set things up so all the centers are initialized to random points from the data, and then train the network more with backpropagation, like I showed in that neural net video, to fine-tune the centers. So that's another option: initialize the network with a good set of centers, then train and refine it using regular neural net training algorithms.
Another thing you can do, which I haven't talked about much in any of my videos, is use a clustering algorithm, which finds clusters of data points that sit together. You pick the point in the middle of each cluster and use it as a center. That's another option, just so you know it exists.
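Here's a hedged sketch of both center-choosing ideas: random sampling from the data, and k-means clustering via scikit-learn (assuming you have NumPy and scikit-learn installed; the function names are mine):

```python
import random

import numpy as np
from sklearn.cluster import KMeans

def random_centers(data, k):
    """Pick k random data points to use as the initial centers."""
    return random.sample(list(data), k)

def clustered_centers(data, k):
    """Run k-means and use each cluster's mean as a center."""
    kmeans = KMeans(n_clusters=k).fit(np.asarray(data))
    return kmeans.cluster_centers_
```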
There's also an alternative way to train the output weights: you do the procedure I'm about to show you for each output neuron, assuming you've already chosen a good starting set of centers. Maybe you've chosen a bunch of random centers from the data, and now you want to choose output weights accordingly.
The realization is this: say you have a bunch of data points, and you know the outputs of all the radial basis functions for each of them. You've already decided on the centers, so you know exactly what the first layer of the network will output for every data point, and you're just trying to decide what the output neuron should do to give you the values you want.
If you put everything in a table like this, with the outputs of all the chosen centers and the desired output for each sample, you want to pick weights that produce the desired outputs. You can set up a system of equations. I'm not going to talk much about how to solve a system of equations; you might have learned it in school, and there are plenty of tools for it.
You can use MATLAB, Octave, or Mathematica, and solve exactly for the weights that give you the desired outputs. Often you will have more samples than centers (hopefully you will), and in that case you get a system with more equations than unknowns.
In that case, you can't always solve the system exactly, but you can use an approach from linear algebra, least squares regression, that gives you the best solution: the one whose outputs are as close to what you want as possible.
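As a minimal sketch with NumPy (assuming the centers and betas are already fixed, and reusing the earlier first_layer helper; the names are hypothetical), solving for one output neuron's weights looks like this:

```python
import numpy as np

def solve_output_weights(inputs, targets, centers, betas):
    """Least-squares fit of one output neuron's weights, centers held fixed."""
    # Each row is the first-layer output for one training sample.
    feature_matrix = np.array([first_layer(x, centers, betas) for x in inputs])
    # Find the weights minimizing ||feature_matrix @ weights - targets||^2.
    weights, *_ = np.linalg.lstsq(feature_matrix, np.asarray(targets), rcond=None)
    return weights
```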
When I was doing handwriting recognition, I set everything up by choosing a bunch of random digits as centers and then choosing the output weights with least squares regression, and the network already had 91% accuracy on handwriting. After fine-tuning everything with backpropagation, the normal neural net training algorithm, I got it up to 97.5% accuracy.
But that initial setup with linear algebra and random centers was really helpful and sped up the process. That's really my favorite thing about RBF networks: there are so many ways to train them, because the model is so intuitive and so simple.
So anyway, I hope you learned a lot from this video. Leave a comment if you have any questions, or if you have something cool to show that you did with RBF networks; I would love to see it. Thanks for watching, subscribe, and goodbye!