
Neural Networks (Part 2) - Training


34m read · Oct 28, 2024

Hey guys, this is MacHeads101, and today I'm super excited because I'm going to be teaching you how to train an artificial neural network. The way we're going to train neural networks is actually really straightforward on a conceptual level. We're going to have a bunch of samples, and a sample is just an input we could give the network together with the output we would like the network to give for that input.

What we're going to do is we're going to start off with a completely randomized network—no intelligence behind how it's set up; all the weights are just random numbers. We're going to show the network samples, and we're going to adjust all of the weights in the network very slightly so that the output is slightly closer to what we want. Then we're just going to keep doing that: show the network a sample, adjust all the weights slightly so the output is a little bit closer to what we want; then show it another sample, adjust all the weights slightly again. If we do this for long enough, the network will actually start to converge, and the weights will be arranged in such a way that the network will have learned how to give us what we want for a given input.

I want to start off by talking about objective functions. The objective function is something we're going to come up with that is a mathematical definition of what we want the neural network to do. Basically, it will give us a way to compute an actual number that indicates how well the neural network is doing, and you can imagine why this is really important. First off, if you're training two different neural networks and you want to decide which one is better—which one you should use—you use the objective function to compare them and see which one is doing better.

But another reason the objective function is important is because it gives a concrete, formal definition of what we're trying to do by training our neural network, and this will make it a lot easier to write a program to train our network because we actually have a mathematical definition that we can program into the computer. So here is the example I'm going to be using to illustrate things about objective functions. In this example, we have a network, which is a gray box containing a bunch of nodes. We have a sample, which is a purple box, and that just gives an input that we feed into the network, and the network gives an output.

What we want this network to do in this hypothetical scenario is take as input the pixels of an image—in this case, an image of a hand-drawn four, but it could be a picture of a lion, a monkey, whatever—and we want the network to output a one when it sees a picture of a four and a zero when it sees something that's not a picture of a four. So it's basically a four detector. If I show it a picture of a donkey, it will output zero. If I show it a picture of a five, it will output zero. But if I show it a picture of a four, it should output one.

Now I want to just quickly draw your attention to the fact that I have two different networks on this slide, and they are both taking the same picture of a four, but the bottom network outputs something pretty different than the top network. In both cases, the goal—the thing we want the network to do—is we want it to output a 1 in this case because the input is a four. So I could just add that to our picture here, and you can see the bottom network does a much better job than the top network of actually obtaining the correct answer—it's a lot closer.

So the objective function is just supposed to be a measure of how close the output from the network is to what we actually want it to be—you know, the difference between the desired and the actual outputs. So a simple way to measure the difference between two numbers is basically just subtraction. You know, we can take the desired output, we can subtract the output that the network actually gave, and we can call that the error or the objective function, whatever you like.

So for the first network, it's 1 minus 0.3, and we see that the error is 0.7, whereas for the second network, the error is only 0.001. So this basically shows us how much better the second network is than the first network by some measure of error. Now, there is a slight problem with this measure of error, which is that using subtraction gives you the difference between two numbers, but that difference might be positive or negative. So if we desired an output of zero and we got an output of 0.999, that would be a negative error. So this actually looks like less error than these other two things even though zero and 0.999 are really far from each other.

So we don't ever want this to be negative; we basically just want to ignore the minus sign and just use subtraction to get the distance, and to do that, we can just use the absolute value. So the absolute value just gets rid of the minus sign, so now we have a measure of how far two numbers are from each other, and it will always be positive because there's always a positive distance between two numbers.

And I want to briefly talk about one more objective function. Basically, we can just square this objective function to get what’s called squared error. Basically, you do the same subtraction but instead of taking the absolute value, you square it, and the rationale is basically, when you square a negative number, it makes it positive. So this has the same effect as taking the absolute value in that now all your distances are positive, but it has the added benefit that it’s an actual nice mathematical function that can be manipulated with things like calculus.

Now, we're not going to be worrying about this too much in this video; we're just going to be using the absolute value error, which will work fairly well, but I just want you to know that this is also something that people will use. So now that we’ve talked about various objective functions, we can revise our picture and just add a little green thing here which represents our objective function. So we take the output of the network and we take the desired output, and that produces something that measures how poorly the network is doing—it outputs the error, basically.

So in this case, the error is 0.49, and I'm using squared error just for illustration, and here the output is this tiny thing. So we really just want to minimize the output of this green thing; we want to make the error as low as possible when we're training the network.
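To make these two objective functions concrete, here is a minimal sketch (my own Python, not from the video) that computes the absolute error and the squared error for the example outputs above.

```python
def absolute_error(desired, actual):
    # Distance between the desired and actual outputs; always non-negative.
    return abs(desired - actual)

def squared_error(desired, actual):
    # Same subtraction, but squared instead of taking the absolute value.
    return (desired - actual) ** 2

# The two hypothetical "four detector" networks: both see a four, so the
# desired output is 1; one outputs 0.3 and the other 0.999.
print(absolute_error(1.0, 0.3))    # 0.7
print(absolute_error(1.0, 0.999))  # about 0.001
print(squared_error(1.0, 0.3))     # about 0.49, the value used in the illustration
```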

So now I want to move on and discuss a principle called steepest descent, and to do it, I'm going to use a kind of hilarious weird example. Suppose you're a getaway driver; you help criminals escape after robbing banks or something. One time you're driving to a bank with the criminal in the passenger seat of your car—you're about to rob it—and you start noticing your steering wheel is getting a little sticky, and it's getting harder and harder to turn. You figure out, you know, by the time it's time to drive away and get away fast, you're not going to be able to steer anymore; you're only going to be able to go in a straight line.

But the bank robber is like, "Fine, whatever, I don't care." And then you also notice you only have enough gas to go five miles. So once you're at the bank, you'll only be able to drive five miles in any direction, and the direction is going to have to be a straight line because you're not really going to be able to steer the car. So you ask the criminal, "Well, what should I do? You might be robbing the bank for five minutes; during that time, I can try to move the steering wheel to get the car facing in whatever direction you want. But after that, once we're trying to get away, I'm not going to be able to turn anymore, and I'm only going to be able to go five miles."

This criminal, who's already insane for still being on board with this plan even though the car is faulty, says he will pay you, the getaway driver, three dollars for every mile north you can get him, and four dollars for every mile east you can get him. So we can turn our situation into this picture: for every mile north we go, we make three dollars, and for every mile east we go, we make four dollars, and we can go in any direction, but it has to be a straight line.

So for instance, we could go west for five miles, and we would lose twenty dollars—he would, I guess, steal it off us; he is a bank robber, after all. Or we could go east five miles, and we would make twenty dollars. So those are just two basic directions, but what if we went in some diagonal—say, five miles in some in-between direction? Well, in order to figure out how much money we would make going in that direction, we would have to figure out how far north we went and how far east we went.

We can do that by drawing lines to each axis, and then it has to check out with the distance formula. In this example, let's say you go 4.33 miles north and 2.5 miles east, which totals up to five miles. We can calculate our earnings the typical way: 4.33 times three dollars per mile plus 2.5 times four dollars per mile, and we see we make about twenty-three dollars if we go in this diagonal direction.

So it's already better than what we did here when we just went completely east. So you already see that even though we make more money going east than we do going north, we actually would make more money if we went kind of in between them rather than if we just went to the more profitable axis. And actually, we can ask what is the best possible direction we could go to maximize the money we get from this bank robber.

In this case, it is to go five miles at such an angle that we go four miles in the east direction and three miles in the north direction. You can check that this works out with the Pythagorean theorem: a triangle with one side of four and another side of three really can have a hypotenuse of five. So this is a valid direction, and this is how far you would go along each axis.

So the maximum amount of money we would make is this, and it would be twenty-five dollars. Now, it would be reasonable to ask how I know that this is the best possible direction you could go in. You know, there's a lot of directions this arrow could be facing. How do I know this one will make you the most money? And to answer this, I'm going to change these numbers slightly to be a little easier to talk about. So let's just change it so we make six dollars per mile as we go east and three dollars per mile as we go north.

So it's obvious that east is more profitable than north; we make twice as much money going east as we do going north—six versus three. So the question is, given that we're going to choose a direction that's kind of crooked, somewhere between the two, how much more should we go in the east direction than in the north direction? And the answer is twice as much. If east is twice as profitable as north, we want to go in a direction such that we go twice as far east as we do north.

So here in this example, I've gone 4.47 miles. I go four miles east and two miles north; you know, two is half of four. So just the general principle is, you know, you go in a direction in proportion to how profitable that direction is, and that's how you know the best direction. So when I go here, I’m going four miles east and three miles north, and east makes me four dollars per mile, and north makes me three dollars per mile.

So east is four-thirds as profitable as north, and I'm going in a direction such that I go 4/3 times as far east as I go north. The general idea is just that the amount of reward you get per mile in a given direction tells you how fast you should go in that direction.
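Here is a small sketch (my own Python, using the example's numbers) showing that for a fixed five-mile trip, earnings are maximized when the east and north distances are in the same proportion as the dollars per mile in each direction.

```python
import math

def earnings(east_rate, north_rate, east_miles, north_miles):
    # Dollars earned for a straight-line trip split into east/north components.
    return east_rate * east_miles + north_rate * north_miles

east_rate, north_rate = 4.0, 3.0  # dollars per mile, as in the example
total_miles = 5.0

print(earnings(east_rate, north_rate, 5.0, 0.0))  # pure east:  20.0
print(earnings(east_rate, north_rate, 0.0, 5.0))  # pure north: 15.0

# The steepest direction: distances proportional to the per-mile rates,
# scaled so the total trip length is still five miles.
scale = total_miles / math.hypot(east_rate, north_rate)        # 5 / 5 = 1 here
best_east, best_north = east_rate * scale, north_rate * scale  # 4 east, 3 north
print(earnings(east_rate, north_rate, best_east, best_north))  # 25.0, the maximum
```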

So you might be wondering how this actually helps us. How can we utilize this idea of steepest descent to train a neural network? The answer is actually fairly simple and easy to demonstrate. To show you how we can use steepest descent to train a neural network, I have a small neural network here with three neurons. The yellow things are neurons; the little circles attached to them are the weights—each one is the weight on the connection going into that neuron. The circles that aren't attached to anything on the neuron are the biases—a bias is just a number added to the output of the neuron. The gray things are activations, and the things at the beginning are the inputs to the network.

Lines throughout the network get thicker as the values they carry get bigger in magnitude, and things are blue when they're positive and red when they're negative. What I want to do is look at how different parts of this network affect the output, because ultimately the objective function depends on the output of the network: how well the network does depends on how different the output of the network is from what we wanted it to be.

Say we want this output to be positive, and right now it's negative—it's red. So we want it to increase, and we're going to look at how different things increase the output of the network. Well, if the first input changes, it has virtually no effect on the output, whereas if the second input changes, it actually seems to have a gigantic effect on the output. If we move it all the way down to negative one, the output gets pretty close to zero, it looks like, as opposed to being negative.

So that's actually a step in the right direction. Likewise, if we look at this weight here, we can see that it also has a large effect on the output: instead of being negative, the output gets close to zero as we change this weight. But if we change the weight on this other neuron, the output really isn't changing that much.

So the way we're going to apply steepest descent is we're going to say: if changing this weight even a little bit helps us reach the objective a lot, whereas changing that other weight doesn't help us reach the objective as fast, then we're going to change the first weight more. We're going to change each weight in proportion to how much it affects our goal—how much it achieves our goal.

So that's just how we apply steepest descent to neural networks: the larger a parameter's effect on the output, the more we will change it by. Now it might be clear what we're going to do: basically, we'll show the network a sample, and we'll figure out, for each weight, how much the objective function changes with respect to that weight.

Then we're going to update all the weights a little bit, but we're going to do it in a way that each weight is updated in proportion to how much it helps the objective function. In this way, more important weights will get updated faster, and we're basically employing the idea of steepest descent to train the neural network. One thing that's missing from our picture so far is how we actually compute, mathematically, how much the objective function changes with respect to a given weight.

With the car example, I told you how much money you made when you were going east and when you were going north and all that, but if we're dealing with a neural network, how do we figure out how much the objective goes down as we change a weight? To do that, we're going to have to develop a science of change—a science of how much the output changes with respect to an input. And that science actually has a name: it's calculus.

But don't fear if you don't know calculus, and don't leave the video if you do know calculus because actually, I think for both of you, you will learn a lot from this video. And I don't think you actually really have to know calculus to be able to train a neural network. I think it might help a little bit, but yeah, I mean you can be the judge. Just keep watching and see how I do this because I'm not going to teach you calculus, but what I'm going to do is still going to make sense, and you will actually be able to implement it and understand it, and it does what calculus does.

To get started, I want to talk about the notion of a derivative. Let's say we have some unit in our neural network—maybe it's an activation function—and it takes an input and produces an output. It's a reasonable question to ask: if we change the input by a little bit, how much will the output change? We can do this; we input X plus delta X. Delta X is some small number, maybe 0.0001—it's a little change in X. We want to know what's the little change in Y that results.

Let's say sigma didn't do anything—it just took the input and fed it out as the output. Well, obviously, delta Y would be the same as delta X, because whatever we added to the input would get added to the output. Sigma could also be an activation function that does something weird: maybe it squashes things, so delta Y would be less than delta X—the output wouldn't change as much as the input. Or we can imagine this is a neuron with one input or something like that, and in that case the change might actually be amplified, depending on the weight.

So this is just the idea of a small perturbation. Basically, we change the input a little bit, and we get a resulting change delta Y, and we can write this as sigma prime of X. So this is the derivative right here. Basically, the derivative is delta Y divided by delta X—it's the change in output divided by the change in input. So if the output changes not very much as the input changes, maybe it's a squashing activation function, delta Y divided by delta X will be small; you know, it might be 0.2 or something.

But if the output changes a lot, even if the input only changes slightly, delta Y will be a lot bigger than delta X, so the derivative would actually be large—maybe 10 or 20 or something like that. So this is what a derivative is. It's just how much the output changes as the input changes; it's just a ratio between the two changes. So it's really quite simple on a conceptual level.
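As a quick illustration (my own sketch, not from the video), you can estimate a derivative numerically by nudging the input by a tiny delta X and measuring how much the output moves.

```python
def derivative(f, x, dx=1e-6):
    # Delta Y divided by delta X, for a small perturbation dx of the input.
    return (f(x + dx) - f(x)) / dx

# A "squashing" unit changes its output slowly, so the derivative is small...
print(derivative(lambda x: 0.2 * x, 3.0))   # about 0.2
# ...while an amplifying unit changes it quickly, so the derivative is large.
print(derivative(lambda x: 10.0 * x, 3.0))  # about 10.0
```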

Now one thing to note here is this unit only has one input and one output, but suppose we have a neuron. You know, a neuron can take multiple inputs. Well, what we do is we just look at one of the inputs, so we can still talk about derivatives here. Basically, let's call one of the inputs to the neuron X, one of the outputs of the neuron Y.

We can still look at what happens to Y if we change X a little bit. We add delta X to X, and we get some resulting delta Y added to Y—that's what happens because of whatever's going on in here—and we call delta Y divided by delta X the partial derivative. In this case we call it a partial derivative, but it's just a derivative—the only difference is there are a bunch of other arrows in the picture.

So this is just the idea of a partial derivative. And we can take that a step further and extend it to our neural network. We can say, well, this is the output of the network, and this is a weight here. If we add some delta X to this weight, the output will change by some delta Y, so delta Y over delta X will tell us how much this weight at the beginning of the network affects the output at the end of the network.

So now I actually want to start computing some partials, and we're going to start by looking at neurons. Let's use the simple construction that a neuron just takes a weighted sum of its inputs, so the output of the neuron is N1 times weight 1 plus N2 times weight 2. We could also add on a bias term, like plus B, or something like that. The bias term just offsets the output of the neuron, but this is really the simplest possible neuron we could do.

And let's see what happens. So first, let's name this Y because we had all our outputs called Y before, so let's be consistent. But now I want to see what happens to Y—how much does Y change if we change W1? If we perturb the first weight, and I'm going to keep this expression for the input in yellow; so whenever it appears, I'm going to make it yellow.

So let's go ahead and look at what happens. The new output is going to be the new first weight times the first input, plus the old, unchanged second weight times the second input. So really, the only difference between this and the original is that we have changed the first weight by adding delta W1 onto it.

Now we can use sixth-grade arithmetic—I guess algebra—to distribute this across the sum. If we use the distributive law, just some basic algebra, we get what was the old output plus delta W1 times N1. So the only thing different in the new output versus the old output is this one term.

So this term is delta Y; it's how much Y has changed when we change the weight by delta W1. If we divide this delta Y by delta W1, which is how much the parameter was changed, we get the derivative—the derivative of the output of the neuron with respect to the first weight of the neuron. And it tells us the derivative is N1.

So basically what this means is the bigger the first input to the neuron is, the more of an effect the first weight will have on the output of the neuron, which makes perfect sense—they're multiplied together. So if the first input is five, if we increase this weight by one, the output will increase by five. If the input is zero, in that case, then the output will not be changed at all if we change the first weight because it’s multiplied by zero.

So this is just—I mean, it should be pretty intuitive. The derivative of the output of the neuron with respect to the weight is just the input that the weight modifies. And we could flip this around and see what happens when we change the input by a little bit. And we do the same thing, and we get that the output—the derivative of the output with respect to the first input—is the first weight.

So that means the bigger the first weight is, the more rapidly the output of the neuron changes as we change the first input, which makes perfect sense—the weight basically controls how fast the output changes with respect to the input. And so that is how we take the partial derivatives of a neuron.
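Here is a tiny numerical check (my own sketch, with made-up values) of those two rules: the derivative of the neuron's output with respect to a weight is the corresponding input, and with respect to an input it is the corresponding weight.

```python
def neuron(n1, n2, w1, w2, b=0.0):
    # Simplest possible neuron: a weighted sum of its inputs plus a bias.
    return n1 * w1 + n2 * w2 + b

n1, n2, w1, w2 = 5.0, 2.0, 0.7, -1.2  # arbitrary example values
dx = 1e-6

# Perturb the first weight: the output changes at a rate equal to the first input.
d_out_d_w1 = (neuron(n1, n2, w1 + dx, w2) - neuron(n1, n2, w1, w2)) / dx
print(d_out_d_w1)  # about 5.0, i.e. N1

# Perturb the first input: the output changes at a rate equal to the first weight.
d_out_d_n1 = (neuron(n1 + dx, n2, w1, w2) - neuron(n1, n2, w1, w2)) / dx
print(d_out_d_n1)  # about 0.7, i.e. W1
```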

The next thing I want to look at is activation functions. One activation function which is actually used a lot is called the ReLU—I might pronounce it rel-oo. Here is what this activation function does: if the input is positive, it outputs the input—the output is just the same as the input; if the input is negative, it outputs zero. So if you look at its graph, with the input on the x-axis, the output stays at zero until the input reaches zero and then goes up as a straight line. This is pretty simple; it's not a complicated activation function—it's not like those formulas I showed for other ones earlier—but it actually works really well.

And I think it's used in most state-of-the-art convolutional neural networks, image recognition neural networks—it's actually really powerful. So this is a good activation to know, and it also happens to be really easy to think about, you know, in terms of derivatives. Because if the input is positive, well, let’s say we add a little bit to the input; well, that same exact amount will be added to the output.

Now if the input is negative, the output doesn’t change at all with respect to the input. So we can keep increasing the input, and nothing changes—so basically, the derivative of the ReLU activation unit, if the input is positive, the derivative is one (delta Y equals delta X), so delta Y over delta X is one. If the input is negative, the derivative is zero. You know, the output doesn’t change at all.

So that is basically the simplest possible activation function you could have that's not just a straight line—it's basically a straight line, but then there's a bend in the middle, and then there's another flat straight line.
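Here is the ReLU and its derivative as a short sketch (my own code), exactly as described: the derivative is one for positive inputs and zero for negative inputs.

```python
def relu(x):
    # Outputs the input when it is positive, and zero otherwise.
    return x if x > 0 else 0.0

def relu_derivative(x):
    # Derivative is 1 for positive inputs and 0 for negative inputs.
    return 1.0 if x > 0 else 0.0

print(relu(3.5), relu_derivative(3.5))    # 3.5 1.0
print(relu(-2.0), relu_derivative(-2.0))  # 0.0 0.0
```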

Next, I want to talk about the objective function. So if you recall, we have this definition of absolute error, and we can draw it as a unit. You know, we have basically the actual output from the network; we have the desired output, and then we have the error. The desired output is never going to change—that's basically hard-coded into our system or, you know, it's fixed in the database. You know, this is just—we know this is what we want, but this might change—the output of the network might change, and as the network changes, we would expect the error to change.

And so I would like to first think about this intuitively. You know, if right now, let’s just say the output from the network is too low, it's supposed to be higher, you know, we want it to be one, and right now it's 0.3. Well, if we increase this a little bit, you know, say we bump it up to 0.5, now we're closer to the desired output, and the error would become lower because we're closer. So when the actual output is less than the desired output, the derivative is actually -1.

So for every small increase in the actual output of the network, we get an equal decrease in the error, because we're getting closer. But now consider the second case, where the network output is 2 and we want a value of 0.5. Here the error comes from the fact that the network's output is too high. So if we increase the output of the network even more, we will increase the error by an equal amount: if we bump the output up to 2.5, the error will get 0.5 added to it and become 2.

So when the actual output is too high, the derivative of the error with respect to the output is positive; it increases as the network's output increases. So this is just an example of when, you know, the sign matters. You know, the derivative is either 1 or -1, but whether it's 1 or -1 depends on whether we're too low or too high. Basically, if we're too low, the derivative is -1, and for too high, the derivative is positive 1.
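Here is that rule as a sketch (my own code): the derivative of the absolute error with respect to the network's output is -1 when the output is too low and +1 when it is too high.

```python
def abs_error(desired, actual):
    return abs(desired - actual)

def abs_error_derivative(desired, actual):
    # -1 when the output is too low (increasing it reduces the error),
    # +1 when the output is too high (increasing it grows the error).
    return -1.0 if actual < desired else 1.0

print(abs_error(1.0, 0.3), abs_error_derivative(1.0, 0.3))  # 0.7 -1.0: too low
print(abs_error(0.5, 2.0), abs_error_derivative(0.5, 2.0))  # 1.5  1.0: too high
```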

Now I want to go over one last thing, and then I'm going to go into some actual examples where we're going to compute derivatives in neural networks. So this last thing is called the chain rule. So imagine we have a network now where we have one thing—maybe it's an activation function—and then another thing right after it; maybe we have two activation functions, one right after the other, or maybe we have another neuron or something.

So the idea is the input goes into the first thing, and then the output of that first thing goes into the next thing, and then that thing's output is the final output Z. So we call the initial input X, we call the output of the first thing Y, and then the output of the second thing we call Z. And what we want to do is we want to figure out how much Z changes with respect to X; we want to figure out delta Z over delta X.

So if we change X by a little bit, we want to know how much will Z change? And suppose—I'm just going to make up numbers. Suppose we know how much Y changes with respect to X; suppose we know that delta Y over delta X is 2. So if we change X a little bit, Y will change twice as much. And let's say we also know how much Z changes with respect to Y; let's make up a number—let's say 3.

So for every change in Y, Z will change three times as much. Well then, how much will Z change if X changes a little bit? Well, X will change a little bit, Y will change twice as much as that, and then Z will change three times as much as Y changed. So in the end, it's going to be six times as much. We're going to do two times three, right?

So if this changes, we multiply the change by two, then we multiply the change by three to get all the way to Z. And so that's what the chain rule does, and it tells us delta Z over delta X will be six. It's just the product of these two; it's a way to combine our knowledge about derivatives across one little guy and across another guy, and we can combine that knowledge to figure out the derivative across the entire chain—so hence the name chain rule.
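Here is that made-up chain-rule example as a sketch (my own code): multiplying the two derivatives gives the derivative across the whole chain, and a finite-difference check agrees.

```python
dy_dx = 2.0  # Y changes twice as fast as X (made-up number from the example)
dz_dy = 3.0  # Z changes three times as fast as Y

# Chain rule: the derivative across the whole chain is the product.
print(dz_dy * dy_dx)  # 6.0

# Sanity check with a concrete pair of functions that have those derivatives.
f = lambda x: 2.0 * x      # dy/dx = 2
g = lambda y: 3.0 * y + 1  # dz/dy = 3
dx = 1e-6
print((g(f(1.0 + dx)) - g(f(1.0))) / dx)  # about 6.0
```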

It's a chain of things, and we can figure out the derivative across it, so it's pretty straightforward, and it's going to come in handy a lot.

Finally, we're at the part of the video where I'm going to give some concrete examples. What I'm about to do is show you a couple of neural networks with actual numbers for the weights and the inputs, and we're going to do some derivative computation. We're going to compute partial derivatives and figure out how rapidly the error changes with respect to different weights, and as I go, I'll comment on how we would use that to actually train the network.

For this first example, we're going to be looking at possibly the simplest neural network we could have. We have one input, we have a single neuron, we have an activation function, and then we have at the end our objective function. And the activation function we're going to use is ReLU, and the objective function we're going to use is absolute error—so two things we already know how to do the derivatives of.

The first step to doing anything with a neural network is to do something called the forward pass. Basically, we want to figure out what the final error is, and to do that, first we have to figure out the output of the neuron—then the output of the thing that it feeds into, etc. So first, let's work on the neuron itself. So the input is 0.5, and the weight modifying that is 1, and then there's a bias of 3.

So we can compute the output as 1 times 0.5 plus 3—basically just a trivial weighted sum, with the bias term added on at the end—and that gives us 3.5. Then, since this is ReLU and its input is positive, the output is just equal to the input—couldn't be simpler. And now, finally, we have our absolute error. We wanted one, we got 3.5, so the error is 2.5, and we have now completed the forward pass of the network, which just means we've gotten all the way to the end of the network and figured out all the values along the way.

Now that we've completed the forward pass, we need to start computing derivatives, and we're going to start with this part of the network right here. We want to know: if we add some small value delta (a lowercase Greek letter) here, how much will that affect the error function? The network's output is 3.5 and we wanted one, so the network's output is already greater than the desired output; if we add something more to the network's output, that same amount will get added to the error.

The error will grow as the output grows, because the output is getting further and further from what we want. So if we add delta to this part of the network, we also add delta to the error just because, you know, we said the derivative of the error with respect to this part of the network is 1 in cases like this. So we're going to write exactly that, because we added delta here and we added delta here, we can say that the derivative of the output with respect to this part of the network is exactly 1; you know, we add something here, and the exact same thing gets added here when that something is small.

This one just means: let's think about how much the error would change as we change this part of the network, before the activation function. Well, if we add delta before the activation function, since the input to the activation function was positive already, the ReLU just outputs its input, so it'll also add delta to the output of the activation function. This is just how ReLU works; we said the derivative was 1, and that's what we're seeing here.

And we already established that if we add delta to this part of the network, delta will also get added to the end of the network—that's exactly what we established a moment ago. So we see that we add delta to this part of the network, and it comes out at the error, so the derivative of the error with respect to this part of the network is also 1. And once again, that's just what this purple thing means.

Now let's think about how much the bias influences the final error. So what if we add delta to the bias? Well, if we add delta to the bias, that means we've added delta to the output of the neuron; that's just how the bias works, and since we've added delta to the output of the neuron, we've already established, as we can see here, that if we add delta to the output of the neuron, 1 times delta will get out into the final error, so that’s what we see.

So since delta got added here and got out of here, the derivative of the error with respect to that bias is also 1. So a lot of 1s so far, but we're about to get something that isn't 1. So now let's think about this weight. Well, we discussed how changes in the weight will influence changes in the output of the neuron. If we add some delta to the weight, well, it's multiplied by this input.

So since the input in this case is 0.5, we're going to add 0.5 delta to the output of the neuron. So now how do we figure out how much the error will change by? Well, we do 0.5 delta times 1, and that's what the error will change by. You know, if this part changes by 0.5 delta, this tells us that the error will change by 1 times 0.5 delta, so that’s what we get. The error will change by 0.5 delta.

So note here we added delta, and here we added 0.5 delta. That means that the change in Y is 0.5 times the change in this weight, which means that the derivative of Y with respect to this weight is 0.5. So we have now figured out basically all of the derivatives in this network. You know, for the error, we know that if we were to increase this weight, the error would increase half as fast; but if we were to increase this bias, the error would increase at the same speed.
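Here is the whole first example as a short sketch (my own code, mirroring the walk-through above): the forward pass, then the derivatives we just worked out.

```python
def relu(x):
    return x if x > 0 else 0.0

# Forward pass: one input (0.5), one neuron (weight 1, bias 3), a ReLU
# activation, and absolute error against a desired output of 1.
x, w, b, desired = 0.5, 1.0, 3.0, 1.0
pre_activation = w * x + b        # 1 * 0.5 + 3 = 3.5
output = relu(pre_activation)     # 3.5 (the input to the ReLU is positive)
error = abs(desired - output)     # 2.5

# Backward pass, exactly as walked through above.
d_error_d_output = 1.0 if output > desired else -1.0  # output too high: +1
d_output_d_pre = 1.0 if pre_activation > 0 else 0.0   # ReLU derivative: 1
d_error_d_pre = d_error_d_output * d_output_d_pre     # 1
d_error_d_bias = d_error_d_pre * 1.0                   # bias passes straight through: 1
d_error_d_weight = d_error_d_pre * x                   # 1 * 0.5 = 0.5

print(error, d_error_d_weight, d_error_d_bias)  # 2.5 0.5 1.0
```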

Now I want to do a more interesting example. So in this example, we've got two neurons, one feeding into the next, and then after them, we have an activation function, and then we have our objective function. Now, this is pretty uncommon; usually, you wouldn't just see a neuron feeding right into the next neuron without an activation function in between, but for the sake of the example, I think this will be pretty nice.

So the first thing we're going to do, like we did last time, is the forward pass, so we can compute the output of this neuron. I mean, you just multiply this by this and then add this. You can do that again, so multiply this by this and then add this, and then because this is positive and this is ReLU, it's the same thing, and the distance—the difference between 0.95 and one is actually pretty small—it's 0.05, so the error is 0.05 in the end. It's actually pretty good.

And now we want to do the same backward pass. You know, the same backpropagation that we did last time, and we're going to start by looking at this part of the network right before the objective function. So if we increase the part of the network, you know, the network's output right before the objective function by some value delta, we must ask ourselves what will happen to the error?

Well, the network's output is slightly less than the desired output, so if we increase the network's output by a little bit, it's getting closer to the desired output. So naturally, the error will decrease, and we already talked about this. You know, if basically we said the derivative for the absolute error is negative one if the network's output is less than the desired output.

So we can see if we add delta to the network's output, the network's output gets a little closer to the desired output, and so the error gets less. The error reduces by delta. So now we have figured out the derivative basically of Y with respect to this part of the network—it’s minus one, because if we add a little here, we subtract that same amount here.

Next, we can move on to the part of the network right before the activation. So the input to the activation is positive. So if we add a little delta before the activation, we add that same delta after the activation. And we already argued that if we add delta right here, we add minus delta to the end. So we added delta before the activation; we got minus delta back. So the derivative of the error with respect to this part of the network is also minus one. You know, we added delta here; we got minus delta out there.

So now we're ready to start talking about this neuron, and let's begin with the bias, which is always the easiest part. If we add delta to the bias, that just immediately adds delta to the output of the neuron. And what we established here is that if we add delta to the output of the neuron, we add minus delta to the error. So we added delta to the bias and got negative delta at the error, so the derivative is minus one for this bias as well.

And now we can talk about the weight. The weights are always kind of the hardest. So for the weight, if we add delta to the weight, well, the weight modifies this input, which is 1.5. So the output of the neuron doesn’t go up by delta; it goes up by 1.5 times delta, right? You know, this times delta. So the output of the neuron increased by 1.5 delta, and what this tells us is if we add some amount here, we subtract that amount from the end—so this tells us that if we added 1.5 delta to this part of the network, we're going to subtract 1.5 delta from this part of the network at the end from the error.

So what we see is we added delta to this weight, and the resulting change in the error was minus 1.5 delta. So the derivative of the error with respect to this weight is minus 1.5, which actually means the error changes more rapidly than this weight does, and it changes in the negative direction. If we increase this weight a little, the error will decrease by 1.5 times as much, so that's kind of cool.

And we can do a similar thing for this part of the network. So if we add delta here, well, that's fed into this weight, so delta is multiplied by the weight, which is 1.3, and then it comes out. So we add 1.3 delta to the output of the neuron since we added a delta to this input. And we can do similar logic here; we know that negative one times 1.3 delta will make it to the error, so the derivative of the error with respect to this part of the network is minus 1.3, and this is already pretty cool.

We're already pretty far away from the error in terms of where we are in the network, but we know how fast the error will change with respect to that part of the network, so it's pretty cool; let's keep going. So here we have this bias, and the biases are always the easiest—I love this part. We add delta to the bias; that means we add the same delta to the output of the neuron. And we've already established that if we add a delta there, we subtract 1.3 delta from the error.

So we added delta to the bias, and the error changed by minus 1.3 times delta, so the derivative here is minus 1.3. And if you haven't noticed already, the derivative that we get for a bias will always be the same as the derivative we got for the output of the neuron. So we got minus 1.3 here because we got minus 1.3 here, and now finally we can consider this weight.

So if we add delta to it, then since the weight is being multiplied by this input, which is minus 0.5, the change in the output of the neuron will be minus 0.5 delta. And so minus 0.5 delta times minus 1.3 is how much change will make it to the error: we said whatever we add here gets multiplied by minus 1.3, and that's what we'll get at the error. So we can write that.

So here's kind of a nasty thing—we do minus 1.3 times minus 0.5 delta, and we can simplify that. So we see that by adding delta to this weight here, we've added 0.65 delta to the final error, so the derivative we get is 0.65. So let me take a step back, and we actually have to establish what this means now because I just did a lot of stuff. And, you know, what does this really mean? Well, it means if I add something small to this weight—well, let's call that value delta—it means 0.65 delta will get added to the error.

So if I change this weight from 1 to, say, 1.01, you know, I've added 0.01. So what's 0.01 times 0.65? Well, it's 0.0065, and you can see the error actually went up by 0.0065. It was 0.05 before; now it’s 0.0565. So you can see that this derivative is actually useful. We added a little bit here; we multiplied it by this, and that's how much we added here. We actually could predict what would happen if we changed something all the way at this end of the network or what would happen down here at the end, so this is the beauty of backpropagation.

Basically, throughout this entire process, when I wanted to figure out how much influence something had on the end, I didn't have to hop all the way across the network and apply the chain rule from scratch. All I had to do was hop once and apply the chain rule once: I hop here, and then I hop from here to the end. It's nice, and it's subtle—it's not obvious how nice it is—but this is really where the value of backpropagation comes from.
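To tie the numbers together, here is the second example as a sketch (my own code). The weights, the input, and the intermediate values come from the walk-through; the two bias values are inferred from the stated intermediate outputs (the first neuron outputs 1.5 and the final output is 0.95), so treat them as an assumption.

```python
def relu(x):
    return x if x > 0 else 0.0

def forward(w1, b1, w2, b2, x, desired):
    h = w1 * x + b1          # first neuron's output (1.5 in the example)
    y = relu(w2 * h + b2)    # second neuron, then ReLU (0.95 in the example)
    return abs(desired - y)  # absolute error (0.05 in the example)

x, desired = -0.5, 1.0
w1, b1, w2, b2 = 1.0, 2.0, 1.3, -1.0  # b1 and b2 are inferred, not stated

print(forward(w1, b1, w2, b2, x, desired))         # 0.05
# Nudge the first weight from 1 to 1.01, as in the text: the error should rise
# by roughly 0.65 * 0.01 = 0.0065, from 0.05 to about 0.0565.
print(forward(w1 + 0.01, b1, w2, b2, x, desired))  # about 0.0565
```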

So anyway, before I get to the next example, which will just be a whole 'nother journey, I want to talk about how you would use this information—all these little purple things—to actually train this network. So if we want to minimize the error, we want the error to go down as much as possible. I mean, let's just look at what would happen—what we should do for this one weight. Well, we saw increasing the weight increased the error, so really if we were training the network, we would want to decrease this weight; we would want to make it, you know, 0.99 or something like that.

And the reason we know that is that we know the error will increase as this weight increases, because the derivative is positive. So while we're training the network, we're going to want to decrease this weight. Already, if we just used the sign of the derivatives—whether each one is positive or negative—we could go through and get a pretty decent result. If the derivative is positive, we subtract a little bit from the parameter, and if the derivative is negative, we add a little bit to the parameter.

And that is something called negative gradient descent; it's actually a decent learning algorithm, but that's not what steepest descent is. To do steepest descent, remember that thing about going north or going east? Basically, we can think of each of these weights and biases as a different direction. Usually when you're training a network, you will have a step size—maybe it's 0.01 or something like that. So what I'll do is I'll subtract 0.01 times this derivative from this weight, and I'll subtract 0.01 times this derivative from this bias.

I'll subtract 0.01 times this derivative from this weight, and I'll subtract 0.01 times this derivative from this bias. So basically, I'm just scaling down all of these derivatives by the same number—in this case, 0.01—and then subtracting that from the parameter, from the weight or bias.

That is steepest descent: each parameter is decreased at a rate proportional to its own derivative. So that's just how training would work. You modify all of these parameters by multiplying their derivatives by some small number and subtracting that from the parameter, and it actually works really well.
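Here is that update rule as a sketch (my own code), using the derivatives computed in the previous example; the bias values are again the inferred ones, so treat the concrete numbers as an assumption.

```python
step_size = 0.01  # the small step size mentioned above

# Parameters of the two-neuron example and the derivatives of the error with
# respect to each of them, from the walk-through (bias values are inferred).
params = {"w1": 1.0, "b1": 2.0, "w2": 1.3, "b2": -1.0}
grads  = {"w1": 0.65, "b1": -1.3, "w2": -1.5, "b2": -1.0}

for name in params:
    # Subtract a small multiple of the derivative: parameters with positive
    # derivatives shrink, parameters with negative derivatives grow, and the
    # size of each change is proportional to the size of its derivative.
    params[name] -= step_size * grads[name]

print(params)  # w1 shrinks slightly; the others grow slightly
```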

Now for the last example I'm going to do, I'm not going to go in as much detail for backpropagation and forward propagation, but there is one detail I want to emphasize. So I’m going to go through this kind of fast. Here’s the setup. We have two neurons in a hidden layer; they both feed into a third neuron, and then that feeds into our final objective function. And note that there's no activations here—that's pretty unusual, but for this demonstration, it’s fine.

So for the forward pass, it’s pretty straightforward. We just feed 0.5 into both of these neurons and get their outputs. Then we feed these into this neuron, so minus 0.5 times 0.5 plus 2 times 0.5 minus 2 gives us our final neuron output, and then we just get the distance between this and this, giving us our total error of 2.25. And then backpropagation is pretty much the same; we can do this step. You know, I’ve talked about this extensively.

Basically at this point, we can deal with the bias easily, and we can deal with the weights in the same way we dealt with them before. For this weight, we do minus 0.5—which is the input to the neuron that this weight multiplies—and then times 1, and we can do that for both weights; for this other one we did 2 times minus one. So this is just regular backpropagation—basically nothing has changed, nothing is different.

I really just want to show one point. We can keep backpropagating: we can backpropagate to the weights—it's really straightforward to backpropagate here—and we can even backpropagate to the input of this neuron and the input of this neuron. And now, well, here's the weird part, okay?

So if we change the input to this neuron a little bit, we know that the derivative—we found the derivative is minus 1.5, and if we change the input to this neuron, we found that the derivative is 1. But the weird thing is both neurons get their input from the same source—this thing here. So what’s the derivative with respect to this input? You know, and it might seem silly to ask what the derivative is with respect to some input, but this could easily come from some other network or some other neuron, so it’s really important to be able to figure out what the derivative is for this input.

So here's what happens. Let's say we add delta to this value. Well, from this thing I'm circling here, we know that just from the effects of this node, if we add delta here, we subtract 1.5 delta from the error. But we also know, from this other thing, that if we add delta here to the input of this node, we add delta to the error.

And the resulting change, as you can see here, is minus 0.5 delta, so the actual derivative for this value that branches off is minus 0.5. We just add up all of the different derivatives for all of the places it branches off to, and this is a pretty standard thing; you're going to have to do this a lot if you implement a neural network. That's really the main idea I wanted to stress: if you have a value, and that value branches off in two different directions, and you know the derivative for both of those directions, you just add up those derivatives to get the total. And that works no matter how many different directions it branches off in—you just add up all the derivatives.
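Here is a toy sketch (my own code, not the example's actual network) of that branching rule: a value feeds two paths whose contributions are combined, and its total derivative is the sum of the per-path derivatives.

```python
# A toy value x that branches into two paths which are then combined.
def combined(x):
    path_a = -1.5 * x  # contributes -1.5 per unit of x
    path_b = 1.0 * x   # contributes +1.0 per unit of x
    return path_a + path_b

dx = 1e-6
# Numerically, the derivative of the combined output is the sum of the
# per-path derivatives: -1.5 + 1.0 = -0.5, just like in the example.
print((combined(0.5 + dx) - combined(0.5)) / dx)  # about -0.5
```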

Now there's just one last thing I want to discuss before I call it quits. Suppose your network has multiple output neurons, like maybe you just want your network to predict an X and a Y coordinate or something like that. You know, maybe you give it an image, and you want it to tell you the coordinates of where to put the mustache on someone's face. Well then, your network would have multiple outputs; it might have a neuron for each of the variables you want to predict.

So how can we measure how well that network is doing? How do we get it into a single number that measures how accurate the output is? I've already kind of given a hint here. Basically, we have a desired output for each of the output neurons—a desired number—and we're going to have a different instance of our objective function for each neuron. It's probably all the same objective function; we would use absolute error for all of these. But each one produces a different error.

So we have: how off is the third neuron, how off is the second, how off is the first? And then we're just going to add up all the errors. One way to do that is to feed them all into a neuron where all the weights are fixed at one, so it will just add up all of these errors and give us one total error, and we know how to backpropagate through this. So if we want to backpropagate with respect to the error, first we backpropagate here, then we backpropagate here, then we backpropagate through whatever feeds into this neuron, whatever feeds into that neuron, whatever feeds into this node.
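Here is a small sketch (my own code, with made-up outputs) of that idea: one absolute error per output neuron, summed into a single total error with weights fixed at one.

```python
def abs_error(desired, actual):
    return abs(desired - actual)

# Hypothetical network with three output neurons and a desired value for each.
actual_outputs  = [0.3, 0.9, -0.2]  # made-up numbers
desired_outputs = [1.0, 1.0,  0.0]

# One instance of the objective function per output; a "neuron" whose weights
# are all fixed at one just sums them into a single number we can minimize.
per_output_errors = [abs_error(d, a) for d, a in zip(desired_outputs, actual_outputs)]
total_error = sum(per_output_errors)
print(per_output_errors, total_error)  # roughly [0.7, 0.1, 0.2] and 1.0

# Because the summing weights are fixed at one, the derivative of the total
# error with respect to each individual error is just 1, so backpropagation
# proceeds into each branch exactly as before.
```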

So this is just the idea of how we would manage to have multiple outputs and still get a single objective that we can minimize. This has been basically an overview of how you train artificial neural networks, and my goal with this video was to make it so you could sit down at home and, if you're good at programming, implement your own neural network and actually get it to learn something.

Now I will say there's a lot of refinements we can make to the algorithms that I showed you today. For one thing, you can use other activation functions; you can use other objective functions. I mean, you can use different structures of networks; you don't even have to use steepest descent to train everything. But what I try to do is get you the most knowledge for the least amount of time, so I haven't shown you all the refinements, but what I have shown you is basically the kernel, the core, the most important part of training neural networks.

And with this, it should be fairly straightforward to learn more of the refinements, and maybe in future videos, I will actually touch on some of them. So I really hope you enjoyed this video, and you learned a lot. Thanks for watching, subscribe, and goodbye.
