
More formal treatment of multivariable chain rule


8m read · Nov 11, 2024

Hello everyone. So this is what I might call a more optional video. In the last couple of videos, I talked about this multivariable chain rule, and I gave some justification. It might have been considered a little bit handwavy by some. I was doing a lot of things that looked kind of like taking a derivative with respect to T and then multiplying that by an infinitesimal quantity DT and thinking of canceling those out. Some people might say, "Ah, but this isn't really a fraction; that's a derivative; that's a differential operator, and you're treating it incorrectly."

And while that's true, the intuitions underlying a lot of this actually match with a formal argument pretty well. So what I want to do here is just talk about what the formal argument behind the multivariable chain rule is. Just to remind ourselves of the setup of where we are, you're thinking of V as a vector-valued function. So this is something that takes as an input T that lives on a number line, and then V maps this to some kind of high-dimensional space. Right?

In the simplest case, you might just think of that as a two-dimensional space—maybe it's three-dimensional space—or it could be 100-dimensional. You don't have to literally be visualizing it. And then F, our function F, somehow takes that 100-dimensional space or two-dimensional, three-dimensional, whatever it is, and then maps it onto the number line. So the overall effect of the composition function is to just take a real number to a real number.

So it's a single-variable function, and that's why we're taking this ordinary derivative rather than, you know, a partial derivative or gradient or anything like that. But because the composition passes through a multi-dimensional space, because of this intermediary multivariable nature, you have a gradient and a vector-valued derivative showing up along the way.
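To pin down the setup in symbols (writing n for the dimension of that intermediary space, whether it's 2, 3, or 100):

\[
v : \mathbb{R} \to \mathbb{R}^{n}, \qquad f : \mathbb{R}^{n} \to \mathbb{R}, \qquad f \circ v : \mathbb{R} \to \mathbb{R}.
\]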

With the formal argument, the first thing you might do is just write out the formal definition of a derivative. In this case, it's a limit. Definitions of derivatives are always going to be some kind of limit as a variable goes to zero. Here, you're loosely thinking about H as being DT. You could write delta T, but it’s common to use H, just because that can be used for whatever your differential quantity is.

So that's in the denominator, because you're thinking of it as DT. The numerator is whatever the change to this whole function is when you nudge that input by H. What I mean by that is you'll take F of V, not of T, but of T plus H, that nudged output value, and you're asking how different that is from F of V of T, your original value.
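Written out, that definition is:

\[
\frac{d}{dt} f\big(v(t)\big) = \lim_{h \to 0} \frac{f\big(v(t+h)\big) - f\big(v(t)\big)}{h}.
\]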

This is what happens when you just apply the formal definition of the derivative, the ordinary derivative, to your composition function. Now, what do you do? You're trying to reason about what this should equal. A good place to start, actually, is to look back to the intuition that I was giving for the multivariable chain rule in the first place.

You imagine nudging your input by some DT, some tiny change, and I was saying, "Oh, so that causes some kind of change in the intermediary space, something you could call DV, a change in the vector." The way you think about that change is that you take the vector-valued derivative and multiply it by DT; the derivative is the proportionality constant between the size of your nudge and the resulting vector change. Loosely, you might imagine those DTs canceling out, as if they were fractions.

Whether or not that's rigorous doesn't really matter for the intuition. Then you ask, "What change does this nudge DV cause in F?" By definition, the resulting nudge to the output of F is the directional derivative of F in the direction of that vector nudge. So this is the loose intuition.
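In loose differential notation (still the hand-wavy version, not yet the formal argument), that chain of nudges reads:

\[
dv = v'(t)\, dt, \qquad df = \nabla f\big(v(t)\big) \cdot dv = \nabla f\big(v(t)\big) \cdot v'(t)\, dt.
\]

Dividing through by dt gives exactly the formula we're trying to justify; the rest of the argument is about making that step rigorous.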

And how does that carry over to formality? You say, "Well, in this intermediary space, we had to deal with the vector-valued derivative of V." So it might be a good idea to just write down that definition, the definition of the vector-valued derivative of V. Again, it looks almost identical; all these derivative definitions really do look kind of the same.

What you're doing is taking the limit as H goes to zero. H, we're still thinking of as being DT, so it sits on the bottom. But here you're asking how your vector changes. The difference, even though we're writing this the same way and it looks almost identical notationally, is what's in the numerator: this V of T plus H and this V of T are vectors, so the numerator is a vector minus a vector.
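That definition, written out:

\[
v'(t) = \lim_{h \to 0} \frac{v(t+h) - v(t)}{h}.
\]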

When you take the limit, you're getting a limiting vector, something in your high-dimensional space; it's not just a number. Now, another way to write this that's more helpful, more conducive to manipulation, is to say not that the derivative equals the limit of this difference quotient, but that, copying the quotient down here, the value of our derivative equals the quotient plus some kind of error, which I'll write as e of H, like an error function of H.
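In symbols, with that vector-valued error term written e(h):

\[
v'(t) = \frac{v(t+h) - v(t)}{h} + e(h).
\]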

What you should be thinking is that that error function goes to zero as H goes to zero. This is just rewriting things so that we can manipulate them a little more easily. Now, what you can do is multiply both sides by H. So this is our vector-valued derivative multiplied by H, and you're thinking of this H as a DT.

So maybe in the back of your mind, you're kind of thinking of canceling this DT with the H. What the product equals is this numerator, V of T plus H minus V of T, with the error term carried along. In the back of your mind, you might be thinking that this difference represents, you know, DV, the change in V. So the idea of canceling out that DT with the H really does come through here.
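With the error term included, multiplying through by h gives:

\[
h\, v'(t) = v(t+h) - v(t) + h\, e(h).
\]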

But the difference between the more hand-wavy argument of canceling those out and what we're doing here is that now we're accounting for that error function, which, in this case, is also multiplied by H, because everything got multiplied by H. And there's actually another way that I'm going to write this. There's a very useful convention in analysis where I take something like this and instead write it as little o of H.

This little o of H isn't literally a function; it's a stand-in that says: whatever this term is, whatever function it represents, it satisfies the property that when you divide it by H, the result goes to zero as H goes to zero. Right? Which is true here, because if you take H times the error function and divide by H, the H cancels out, and you're left with just the error function, which goes to zero.
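So, writing g(h) generically for whatever the stand-in represents (g is just my label for this display), the convention is:

\[
g(h) = o(h) \quad \text{means} \quad \lim_{h \to 0} \frac{\lVert g(h) \rVert}{\lvert h \rvert} = 0,
\]

and with that convention the previous equation becomes

\[
h\, v'(t) = v(t+h) - v(t) + o(h).
\]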

So now what I do is use this entire expression to rewrite V of T plus H. The reason I want to do that, if we scroll back up, is that V of T plus H shows up in the original definition we care about. This is just a way of starting to get a handle on that a little more firmly.

What I'd write is that V of T plus H, that nudged output value, equals the original value, V of T, plus this derivative term times H. You can think of it almost like a Taylor polynomial, where this is the first-order term: we're evaluating the derivative at whatever that T is, but multiplying it by the size of the nudge. That's the linear term, and then the rest of the stuff is just some little o of H.

Maybe you'd say, "Shouldn't you be subtracting off that little o of H?" But little o of H isn't an actual function; it just represents anything that shrinks faster than H, so the sign gets absorbed into it. And maybe I should say it's the absolute value, the magnitude, that matters, because this is a vector-valued quantity: the error is a vector, so it's the size of that vector divided by the size of H that goes to zero.
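Rearranged, with the sign of the error absorbed into the little-o stand-in, this gives:

\[
v(t+h) = v(t) + h\, v'(t) + o(h).
\]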

So this is the main tool we're going to end up using; this is the way to represent V of T plus H. Now, let's go back up to the original definition of the ordinary derivative of the composition function, and I'll go ahead and copy that definition down here.

Now, when I rewrite things according to all the manipulations we just did, it's still a limit as H goes to zero. But what we put on the inside, instead of writing V of T plus H, is everything we derived up there: it's F of the quantity V of T plus the derivative at our point times H.

So again, it's kind of like a Taylor polynomial: this is your linear term, and then it's plus something we don't care about, something that gets really small as H gets small, and, more importantly, really small in comparison to H. From that, you subtract off F of V of T, and all of it is divided by H.
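Substituting that expression for v(t + h) into the definition:

\[
\frac{d}{dt} f\big(v(t)\big) = \lim_{h \to 0} \frac{f\big(v(t) + h\, v'(t) + o(h)\big) - f\big(v(t)\big)}{h}.
\]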

Now, the point is that because we're taking the limit as H goes to zero, we'll basically be able to ignore the little o of H component, since as H goes to zero it becomes very, very small in comparison to H. So everything on the inside is essentially just V of T plus a vector quantity: H times some particular vector. And if you think back, I made a video on the formal definition of the directional derivative.

If you remember it, or if you go back and take a look, this is exactly the formal definition of the directional derivative. We're taking the limit as H goes to zero; the thing we're multiplying H by is a certain vector quantity; that vector is the nudge to your original value; and we're dividing everything by H. So, by definition, this entire thing is the directional derivative of F in the direction of V prime of T, the derivative of V.
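For reference, the formal definition of the directional derivative of f at a point p along a vector w (p and w are just my labels for this display) is:

\[
\nabla_{w} f(p) = \lim_{h \to 0} \frac{f(p + h\, w) - f(p)}{h},
\]

which is exactly the shape of the limit above, with p = v(t) and w = v'(t).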

I'm writing V prime instead of spelling out the whole dV/dt down there. And it's the directional derivative of F evaluated where? Well, the place we're starting is just V of T. So that's V of T, and that's it; that's the answer. Because when you evaluate a directional derivative, the way you do it is you take the gradient of F, evaluate it at whatever point you're starting from, in this case the output V of T, and take the dot product between that and whatever your vector is, which in this case is the vector-valued derivative of V.
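Putting it all together:

\[
\frac{d}{dt} f\big(v(t)\big) = \nabla_{v'(t)} f\big(v(t)\big) = \nabla f\big(v(t)\big) \cdot v'(t).
\]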

And that's the multivariable chain rule. If you look back through the line of reasoning, it all really did match the intuition of nudging the input and watching how that nudge propagates. Right?

Because the reason we thought to use the vector-valued derivative was exactly that intuition. And the reason for all the manipulation I did is that I wanted to express what the nudged output of V, the input handed to F, looks like. What it looks like is the original value plus a certain vector, and that vector is the resulting nudge in the intermediary space.

I wanted to express that in a formal way, and sure, we have this little o of H term expressing something that shrinks really fast. But once you express it like that, the original definition of the directional derivative just pops out. So I hope that gives a somewhat satisfying account, for those of you who are a little more rigor-inclined, of why the multivariable chain rule works.
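If you'd like a concrete sanity check, here's a minimal numerical sketch in Python. The particular choices f(x, y) = x^2 * y and v(t) = (cos t, sin t) are hypothetical examples of mine, not anything from the video; the point is only that the two sides of the chain rule agree.

```python
import numpy as np

# Hypothetical example functions (illustrative choices, not from the video):
#   f(x, y) = x^2 * y          maps R^2 -> R
#   v(t)    = (cos t, sin t)   maps R -> R^2

def f(p):
    x, y = p
    return x**2 * y

def grad_f(p):
    # Gradient of f: (df/dx, df/dy) = (2xy, x^2)
    x, y = p
    return np.array([2 * x * y, x**2])

def v(t):
    return np.array([np.cos(t), np.sin(t)])

def v_prime(t):
    # Vector-valued derivative of v
    return np.array([-np.sin(t), np.cos(t)])

t, h = 0.7, 1e-6

# Left side: numerical derivative of the composition f(v(t)),
# approximated with a central difference quotient.
lhs = (f(v(t + h)) - f(v(t - h))) / (2 * h)

# Right side: the multivariable chain rule, gradient dotted with v'(t).
rhs = grad_f(v(t)) @ v_prime(t)

print(lhs, rhs)  # the two values should agree to several decimal places
```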

I should also maybe mention there's a more general multivariable chain rule for vector-valued functions. I'll get to that at another point when I talk about the connections between multivariable calculus and linear algebra. But for now, that's pretty much all you need to know on the multivariable chain rule when the ultimate composition is, you know, just a real number to a real number. And I'll see you next video.
