Why the gradient is the direction of steepest ascent

·Nov 11, 2024

So far, when I've talked about the gradient of a function, I've had in mind a multivariable function with just two inputs. Those are the easiest to think about, so maybe it's something like f(x, y) = x² + y². A very friendly function.

When I've talked about the gradient, I've left open a mystery. We have the way of computing it, and the way that you think about computing it is you just take this vector and throw the partial derivatives in there: the partial with respect to x and the partial with respect to y. If it was a higher dimensional input, then the output would have as many components as there are inputs. If it was f(x, y, z), you'd have partial x, partial y, partial z.
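As a minimal sketch of that recipe, here is the friendly function x² + y² with its gradient written out by hand, checked against a finite-difference estimate. The helper names (`gradient`, `numerical_gradient`), the step size `h`, and the sample point (1, 2) are illustrative assumptions, not anything from the video.

```python
def f(x, y):
    # The example function from the text: f(x, y) = x^2 + y^2
    return x**2 + y**2

def gradient(x, y):
    # Partial with respect to x, then partial with respect to y,
    # worked out by hand for this particular f.
    return (2 * x, 2 * y)

def numerical_gradient(func, x, y, h=1e-6):
    # Central differences: nudge one input at a time and measure the change.
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)

analytic = gradient(1.0, 2.0)            # sample point chosen for illustration
numeric = numerical_gradient(f, 1.0, 2.0)
print("by hand:        ", analytic)
print("by finite diffs:", numeric)
```

The two results should agree closely, which is just a sanity check that "throw the partials in a vector" really does describe how f responds to small nudges in each input.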

This is the way to compute it. But then I gave you a graphical intuition. I said that it points in the direction of steepest ascent, and maybe the way you think about that is you have your input space, which in this case is the xy plane, and you think of it as somehow mapping over to the number line, to your output space.

If you have a given point somewhere, the question is about all the possible directions that you can move away from this point. Each of them lands your output somewhere on the number line. As you move in the various directions, maybe one of them nudges your output a little bit, one of them nudges it a lot, one of them slides it negative, one of them slides it negative a lot. Which one of these directions results in the greatest increase to your function?

This was the loose intuition. If you want to think in terms of graphs, we could look over at the graph of f, and this is the gradient field. All of these vectors in the xy plane are the gradients, and as you look from below, you can maybe see why each one of them points in the direction you should move to walk uphill on that graph as fast as you can. If you're a mountain climber and you want to get to the top as quickly as possible, these tell you the direction that you should move to climb as quickly as possible.

This is why you call it the direction of steepest ascent. So, back over here, I don't see the connection immediately, or at least when I was first learning about it, it wasn't clear why this combination of partial derivatives has anything to do with choosing the best direction. Now that we've learned about the directional derivative, I can give you a little bit of an intuition.

So instead of thinking about all the possible directions and all of the possible changes to the output that they cause, let's say you've got your point where you're evaluating things, and then you just have a single vector. Let's actually make it a unit vector. Let's make it the case that this guy has a length of one, so I'll go over here and call that guy v and say that v has a length of one.

So this is our vector. We know now, having learned about the directional derivative, that you can tell the rate at which the function changes as you move in this direction by taking the directional derivative of your function along v. Let's say this point, I don't know what's a good name for this point, is just (a, b). When you evaluate the directional derivative at (a, b), the way that you do that is just dotting the gradient of f with v.

I should say dotting it evaluated at that point, because the gradient is a vector-valued function, and we just want a specific vector here. So you evaluate the gradient at your point (a, b) and dot it together with whatever the vector v is, and in this case, we're thinking of v as a unit vector.

So this is how you tell the rate of change. When I originally introduced the directional derivative, I gave an indication of why. If you imagine dotting the gradient together with, I don't know, let's say a vector like (1, 2), really you're thinking of this vector as representing one step in the x direction and two steps in the y direction. So, the amount that it changes things should be one times the change caused by a pure step in the x direction plus two times the change caused by a pure step in the y direction.
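To make the "one step in x, two steps in y" reading concrete, here is a small sketch that dots the gradient of x² + y² with the vector (1, 2) and compares the result to actually nudging the input a tiny amount along that vector. The sample point (1, 2), the step size `h`, and the function names are assumptions made for illustration.

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    # Gradient of x^2 + y^2, computed by hand.
    return (2 * x, 2 * y)

def directional_derivative(point, v):
    # Gradient evaluated at the point, dotted with v:
    # v[0] times the change from a pure x step
    # plus v[1] times the change from a pure y step.
    gx, gy = grad_f(*point)
    return gx * v[0] + gy * v[1]

point = (1.0, 2.0)   # illustrative choice of (a, b)
v = (1.0, 2.0)       # one step in x, two steps in y

dd = directional_derivative(point, v)

# Direct check: take a tiny step of size h along v and measure the change in f.
h = 1e-6
approx = (f(point[0] + h * v[0], point[1] + h * v[1]) - f(*point)) / h

print("gradient dot v:    ", dd)
print("measured by nudging:", approx)
```

Note that (1, 2) is not a unit vector; the dot-product formula still gives the rate of change per step of v, which is why the two numbers line up.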

So that was the loose intuition. You can see the directional derivative video if you want a little bit more discussion on that. And this is the formula that you have. But this starts to give us the key for how we could choose the direction of steepest ascent. What we're really asking, when we say which one of these directions changes things the most, is this: maybe when you move in that direction v, it changes f a little bit negatively, and we want to know about another vector w. Is the change caused by that going to be positive? Is it going to be as big as possible?

What we're doing is we're saying find the maximum over all unit vectors. So, for all vectors v that satisfy the property that their length is one, find the maximum of the dot product between the gradient of f, evaluated at whatever point we care about, and v. Find that maximum.

Well, let's just think about what the dot product represents. Let's say we go over here and evaluate the gradient vector, and it turns out that the gradient points in this direction. It doesn't have to be a unit vector. It might be something very long like that.

So if you imagine some unit vector v, let's say it was taking off in this direction, the way that you interpret the dot product between the gradient of f and this new vector v is you would project v directly, a perpendicular projection, onto your gradient vector, and you'd ask what's that length? What's that length right there?

And just as an example, it would be something a little bit less than one, right? Because this is a unit vector. So as an example, let's say that was 0.7, and then you'd multiply that by the length of the gradient itself, the vector against which you're dotting. And maybe the length of the entire gradient vector, just again as an example, is two.

It doesn't have to be; it could be anything. But the way that you interpret this whole dot product then is to take the product of those two. You would take 0.7, the length of your projection, times 2, the length of the gradient vector, and the question is when is this maximized? What unit vector maximizes this?
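Here is a quick numerical sketch of that reading of the dot product, using the example numbers from the text: a gradient of length two and a unit vector whose projection onto it has length 0.7. The specific coordinates below are made up to produce those numbers; they are an assumption for illustration only.

```python
import math

# Hypothetical gradient of length 2, laid along the x axis for convenience.
grad = (2.0, 0.0)

# A unit vector chosen so that its projection onto the gradient has length 0.7.
v = (0.7, math.sqrt(1 - 0.7**2))

dot = grad[0] * v[0] + grad[1] * v[1]
grad_len = math.hypot(*grad)

# Length of the perpendicular projection of v onto the gradient's direction.
projection = dot / grad_len

print("length of v:          ", math.hypot(*v))
print("projection length:    ", projection)
print("projection * |grad|:  ", projection * grad_len)
print("dot product directly: ", dot)
```

The last two printed values match, which is the whole interpretation: the dot product is the projection length times the length of the vector you projected onto.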

If you start to imagine swinging that unit vector around, so that instead of that guy you were to use one that pointed a little bit more closely in the direction of the gradient, then its projection would be a little bit longer. Maybe that projection would be 0.75 or something. If you take the unit vector that points directly in the same direction as the full gradient vector, then the length of its projection is just the length of the unit vector itself.

It would be one, because projecting it doesn't change what it is at all. So it shouldn't be too hard to convince yourself. If you have shaky intuitions on the dot product, I'd suggest finding the videos we have on Khan Academy for those. Sal does a great job giving that deep intuition.

It should make sense why the unit vector that points in the same direction as your gradient is going to be what maximizes it. So the answer to what vector maximizes this is going to be, well, the gradient itself, right? It is that gradient vector evaluated at the point we care about, except you'd normalize it, because we're only considering unit vectors.

To do that, you just divide it by whatever its magnitude is. If its magnitude was already one, its length stays one. If its magnitude was two, you're scaling it down by a half. So this is your answer. This is the direction of steepest ascent.
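One way to convince yourself numerically is to swing a unit vector all the way around the circle, record the dot product with the gradient at each angle, and see which direction wins. The sketch below does that for x² + y² at the point (1, 2); the point, the number of angular steps, and the variable names are illustrative assumptions.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2, computed by hand.
    return (2 * x, 2 * y)

point = (1.0, 2.0)                      # illustrative choice of point
gx, gy = grad_f(*point)

# Normalized gradient: divide the gradient by its own magnitude.
grad_len = math.hypot(gx, gy)
unit_grad = (gx / grad_len, gy / grad_len)

# Brute-force search over unit vectors, one every tenth of a degree.
best_rate = -math.inf
best_v = None
steps = 3600
for k in range(steps):
    theta = 2 * math.pi * k / steps
    v = (math.cos(theta), math.sin(theta))
    rate = gx * v[0] + gy * v[1]        # directional derivative along v
    if rate > best_rate:
        best_rate, best_v = rate, v

print("best unit vector found:", best_v)
print("normalized gradient:   ", unit_grad)
print("best rate found:       ", best_rate)
print("gradient magnitude:    ", grad_len)
```

Up to the resolution of the sweep, the winning direction should coincide with the normalized gradient, and no direction should beat the rate obtained along it.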

I think one thing to notice here is that the most fundamental fact is that the gradient is this tool for computing directional derivatives. You can think of that vector as something that you really want to dot against. And that's actually a pretty powerful thought: the gradient is not just a vector; it's a vector that loves to be dotted together with other things.

That's the fundamental fact, and as a consequence of that, the direction of steepest ascent is that vector itself, because if you're asking what maximizes the dot product with that thing, well, it's the vector that points in the same direction as that thing. And this can also give us an interpretation for the length of the gradient.

We know the direction is the direction of steepest ascent, but what does the length mean? So, let's give this normalized version of the gradient a name. I'm just going to call it w. So w will be the unit vector that points in the direction of the gradient.

If you take the directional derivative of f in the direction of w, what that means is the gradient of f dotted with that w, and if you spell out what w means here, that means you're taking the gradient dotted with itself. But because it's w and not the gradient, we're normalizing. We're dividing that, not by the magnitude of f, which doesn't really make sense, but by the magnitude of the gradient.

And in all of these, I'm just writing gradient of f, but you should be thinking about the gradient of f evaluated at (a, b); I'm just being kind of lazy. On the top, when you take the dot product of the gradient with itself, what you get is the square of its magnitude, but the whole thing is divided by the magnitude, so you can cancel that out. You could say the denominator doesn't need to be there, and that exponent doesn't need to be there.

Basically, the directional derivative in the direction of the gradient itself has a value equal to the magnitude of the gradient. So, this tells you when you're moving in that direction, in the direction of the gradient, the rate at which the function changes is given by the magnitude of the gradient.
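Written out in symbols, the cancellation described above looks roughly like this. The notation is a choice made here for illustration: w is the normalized gradient as in the text, double bars denote the magnitude, and every gradient is understood to be evaluated at the point of interest.

```latex
\nabla_{\vec{w}} f
  = \nabla f \cdot \vec{w}
  = \nabla f \cdot \frac{\nabla f}{\lVert \nabla f \rVert}
  = \frac{\lVert \nabla f \rVert^{2}}{\lVert \nabla f \rVert}
  = \lVert \nabla f \rVert,
\qquad \text{where } \vec{w} = \frac{\nabla f}{\lVert \nabla f \rVert}.
```

This assumes the gradient is nonzero at the point; where it vanishes there is no direction to normalize, and every directional derivative is zero.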

So, it's this really magical vector. It does a lot of things. It's the tool that lets you dot against other vectors to tell you the directional derivative. As a consequence, it's the direction of steepest ascent, and its magnitude tells you the rate at which things change while you're moving in that direction of steepest ascent.

It's just really a core part of scalar-valued multivariable functions, and it is the extension of the derivative in every sense that you could want a derivative to extend.
