The Lagrangian
All right, so today I'm going to be talking about the Lagrangian. Now, we've already talked about Lagrange multipliers; this is a highly related concept. In fact, it's not really teaching anything new; it's just repackaging stuff that we already know.
To remind you of the setup, this is going to be a constrained optimization problem. We'll have some kind of multivariable function f(x, y)—the one I have pictured here is x² · e^y · y—and what I have shown is a contour line for this function. That is, we set the function equal to some constant and ask about all values of x and y such that the function outputs that constant. If I chose a different constant, the contour line would look a little bit different, though it keeps a similar shape.
So that's the function, and we're trying to maximize it. The goal is to maximize this guy. And, of course, it's not just that. The reason we call it a constrained optimization problem is because there's some kind of constraint, some kind of other function G of XY—in this case, x² + y²—and we want to say that this has to equal some specific amount. In this case, I'm going to set it equal to four.
So you can't look at just any (x, y) to maximize this function; you're limited to the values of x and y that satisfy this constraint. I talked about this in the last couple of videos, and the cool thing we found was that if you look through the various contour lines of f, the maximum will be achieved when a contour line of f is just perfectly tangent to this contour of g.
You know, a pretty classic example of what these sorts of things could mean or how it's used in practice is if this was, say, a revenue function for some kind of company. You're modeling your revenues based on different choices you could make running that company, and the constraint that you’d have would be, let's say, a budget. So, I'm just going to go ahead and write budget, or B for Budget, here.
You're trying to maximize revenues, and then you have some sort of dollar limit for what you're willing to spend. These, of course, are just kind of made-up functions; you'd never have a budget that looks like a circle and this kind of random configuration for your revenue. But in principle, you know what I mean, right?
So, the way that we took advantage of this tangency property—and I think this is pretty clever—let me just redraw it over here. You're looking at the point where the two contour lines are just tangent to each other. The key observation is that the gradient vector for the thing we're maximizing, which in this case is R, is going to be parallel—proportional—to the gradient vector of the constraint, which in this case is B.
What this means for actually solving the problem is that you compute the gradient of R—which involves two different partial derivatives—and you set it equal, not to the gradient of B (because it's not necessarily equal to the gradient of B), but to the gradient of B scaled by some proportionality constant, lambda.
Now, let me—that's kind of a squirrely Lambda. Lambda now, that one doesn't look good either, does it? Why are Lambdas so hard to draw? All right, that looks fine. So, the gradient of the revenue is proportional to the gradient of the budget.
We did a couple of examples of solving this kind of thing. This gives you two separate equations from the two partial derivatives, and then you use this right here—this budget constraint—as your third equation.
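As a sketch of solving those three equations symbolically, here is a small SymPy example. The stand-in functions are hypothetical, chosen so the algebra stays clean (the function pictured in the video leads to a transcendental system): revenue R = x·y with the same circular budget constraint x² + y² = 4.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)

# Hypothetical stand-ins (not the video's exact revenue function):
R = x * y            # the thing we're maximizing
B = x**2 + y**2      # the budget function
b = 4                # the budget constant

# The three equations: two components of grad R = lam * grad B,
# plus the budget constraint itself.
equations = [
    sp.Eq(sp.diff(R, x), lam * sp.diff(B, x)),  # dR/dx = lam * dB/dx
    sp.Eq(sp.diff(R, y), lam * sp.diff(B, y)),  # dR/dy = lam * dB/dy
    sp.Eq(B, b),                                # constraint
]

solutions = sp.solve(equations, [x, y, lam], dict=True)
best = max(solutions, key=lambda s: R.subs(s))
print(R.subs(best))  # maximum revenue on the budget circle: 2
```

Every candidate point that comes back satisfies the constraint, and comparing R across them picks out the constrained maximum.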
And the Lagrangian—the point of this video—is basically just a way to package up this equation, along with this constraint equation, into a single entity. So, it's not really adding new information, and if you're solving things by hand, it doesn't really do anything for you.
But what makes it nice is that it's something easier to hand to a computer, and I'll show you what I mean. So, I'm going to define the Lagrangian itself, which we write with this kind of funky-looking script L. It's a function with the same inputs as your revenue function—the thing that you're maximizing—along with lambda, that Lagrange multiplier.
The way that we define it—and I'm going to need some extra room, so I'll define it down here—is the revenue function, or whatever it is that you're maximizing, minus lambda, that Lagrange multiplier (which is just another input to this new function we're defining), multiplied by the constraint function—in this case B—evaluated at (x, y), minus whatever that constraint value is. In this case, I put in four, so you'd write minus 4.
If we wanted to be more general, maybe we would write little b for whatever your budget is—so over here you're subtracting off little b. This here is a new multivariable function: you can input x, y, and lambda, plug it all in, and get some value. Remember, little b in this case is a constant—it's not considered a variable. Your variables are x, y, and lambda.
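To make it concrete that the Lagrangian is just an ordinary function of three inputs, here's a minimal sketch in plain Python. The names and the specific functions (revenue R = x·y, budget B = x² + y², b = 4) are hypothetical stand-ins, not the video's exact functions.

```python
def revenue(x, y):
    # Hypothetical stand-in for R(x, y), the thing being maximized
    return x * y

def budget(x, y):
    # Hypothetical stand-in for B(x, y); the constraint is budget == b
    return x**2 + y**2

b = 4  # the constraint value: a constant, not a variable

def lagrangian(x, y, lam):
    # L(x, y, lam) = R(x, y) - lam * (B(x, y) - b)
    return revenue(x, y) - lam * (budget(x, y) - b)

# On the constraint (budget == b), the lambda term vanishes,
# so L equals R no matter what lambda is:
print(lagrangian(2.0, 0.0, lam=3.5))  # same as revenue(2, 0) = 0.0
```

Notice that off the constraint, the lambda term penalizes (or rewards) the difference between the budget used and the budget allowed; on the constraint it contributes nothing.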
This would seem like a totally weird and random thing to do if you just saw it out of context or if it was unmotivated. But what's kind of neat—and we'll go ahead and work through this right now—is that when you take the gradient of this function called the Lagrange and you set it equal to zero, that's going to encapsulate all three equations that you need.
And I'll show you what I mean by that. So, let's just remember the gradient of L—that's a vector that's got three different components. Since L has three different inputs, you're going to have the partial derivative of L with respect to X, you're going to have the partial derivative of L with respect to Y, and then finally, the partial derivative of L with respect to Lambda—our Lagrange multiplier, which we're considering an input to this function.
And remember, whenever we write that a vector equals zero, really, we mean the zero vector. Often, you'll see it in bold if it's in a textbook, but what we're really saying is we set those three different functions—the three different partial derivatives—all equal to zero.
So, this is just a nice, closed form, compact way of saying set all of its partial derivatives equal to zero. And let's go ahead and think about what those partial derivatives actually are.
So, this first one, the partial with respect to x—the partial derivative of the Lagrangian with respect to x—is kind of fun. You know, you have all these curly symbols—the curly D, the curly L—and it makes it look like you're doing some truly advanced math, but really, it's just kind of artificial fanciness, right?
But anyway, so we take the partial derivative with respect to X, and what that equals is, well, it's whatever the partial derivative of R with respect to X is minus—and then Lambda, from X's perspective, Lambda just looks like a constant—so it's going to be Lambda multiplied by the partial derivative of that with respect to X.
Well, it's going to be whatever the partial derivative of B is with respect to x, since subtracting off that constant doesn't change the derivative. So, this right here is the partial derivative of the Lagrangian with respect to x. Now, if you set that equal to zero—and I know I've kind of run out of room on the right here—that's the same as saying that the partial derivative of R with respect to x equals lambda times the partial derivative of B with respect to x.
If you think about what's going to happen when you unfold this property that the gradient of R is proportional to the gradient of B, written up here, that's just the first portion of this, right? If we're setting the gradients equal, then the first component of that is to say that the partial derivative of R with respect to X is equal to Lambda times the partial derivative of B with respect to X.
Then, if you do this for Y, if we take the partial derivative of this Lagrange function with respect to Y, it's very similar, right? It's going to be, well, you just take the partial derivative of R with respect to Y. In fact, it all looks just identical. Whatever R is, you take its partial derivative with respect to Y and then subtract off Lambda, which looks like a constant as far as Y is concerned, and then that's multiplied by, well, what's the partial derivative of this term inside the parentheses with respect to Y?
Well, it's the partial of B with respect to Y, and again—if you imagine setting that equal to zero—that's going to be the same as setting this partial derivative term equal to Lambda times this partial derivative term, right? You kind of just bring one to the other side.
So, this second component of our Lagrange equals zero equation is just the second function that we've seen in a lot of these examples that we've been doing where you set one of the gradient vectors proportional to the other one.
And the only real difference here from stuff that we've seen already—and even then, it's not that different—is what happens when we take the partial derivative of this Lagrangian with respect to lambda. I'll go ahead and give it that kind of green lambda color here. Well, if we look up at the definition of the function, R never has a lambda in it, right? It's purely a function of x and y, so it looks just like a constant when we're differentiating with respect to lambda.
So, its partial derivative is just zero. And then this next part, B(x, y) minus little b, all looks like a constant as far as lambda is concerned—there are x's and y's and that constant little b in there, but no lambdas. So, when we take the partial derivative with respect to lambda, that whole parenthetical acts like one big constant multiplying lambda itself, and what we get—with the minus sign out front—is that we're subtracting off everything that was in the parentheses: B(x, y) minus little b. If we set that whole thing equal to zero, that's pretty much the same as setting B(x, y) equal to little b.
And that's really just saying, "Hey, we're setting B(x, y) equal to that little b," right? Setting the partial derivative of the Lagrangian with respect to the Lagrange multiplier equal to zero boils down to the constraint—the third equation that we need.
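All three partial derivatives can be checked symbolically. This sketch uses hypothetical stand-in functions (R = x·y, B = x² + y², b = 4), not the video's exact revenue function:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)

R = x * y            # hypothetical stand-in for the revenue
B = x**2 + y**2      # hypothetical stand-in for the budget
b = 4                # the budget constant

L = R - lam * (B - b)  # the Lagrangian

# The three components of grad L:
dL_dx = sp.diff(L, x)      # equals dR/dx - lam * dB/dx
dL_dy = sp.diff(L, y)      # equals dR/dy - lam * dB/dy
dL_dlam = sp.diff(L, lam)  # equals -(B - b): zero exactly on the constraint

print(dL_dx, dL_dy, dL_dlam)
```

The little-b constant disappears under the x and y derivatives, and the lambda derivative hands back (minus) the constraint, exactly as derived above.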
So, in that way, setting the gradient of this Lagrangian equal to zero is just a very compact way of packaging the three separate equations that we need to solve the constrained optimization problem. And I'll emphasize that in practice, if you actually see a function for R—for the thing that you're maximizing—and a function for the budget, it's much better, I think, to just directly think about these parallel gradients and solve it from there.
Because if you construct the Lagrangian and then compute its gradient, all you're really doing is repackaging things up, only to unpackage them again. But the reason this is a very useful construct is that computers often have really fast ways of solving things like this—things like the gradient of some function equals zero.
The reason is that that's how you solve unconstrained maximization problems. It's very similar to if we just looked at this function L out of context and were asked, "Hey, what is its maximum value? What are its critical points?" You'd set its gradient equal to zero.
So, kind of the whole point of the Lagrangian is that it turns our constrained optimization problem involving R and B—and this new made-up variable lambda—into an unconstrained optimization problem, where we're just setting the gradient of some function equal to zero.
So, computers can often do that really quickly. So, if you just hand the computer this function, it will be able to find you an answer, whereas, you know, it's harder to say, "Hey computer, I want you to think about when the gradients are parallel and also consider this constraint function."
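Here's a minimal sketch of what "hand the computer grad L = 0" can look like under the hood: a few Newton's-method iterations, again with hypothetical stand-in functions (R = x·y, B = x² + y², b = 4). Production solvers (e.g. SciPy's root finders) do essentially this, with safeguards.

```python
import numpy as np

def grad_L(v):
    # Gradient of L = x*y - lam*(x^2 + y^2 - 4), written out by hand
    x, y, lam = v
    return np.array([
        y - 2 * lam * x,        # dL/dx
        x - 2 * lam * y,        # dL/dy
        -(x**2 + y**2 - 4),     # dL/dlam (minus the constraint)
    ])

def jacobian(v):
    # Matrix of second derivatives of L, needed for Newton steps
    x, y, lam = v
    return np.array([
        [-2 * lam, 1.0, -2 * x],
        [1.0, -2 * lam, -2 * y],
        [-2 * x, -2 * y, 0.0],
    ])

# Newton's method: repeatedly solve J * step = grad_L and step toward the root.
v = np.array([1.0, 1.0, 1.0])  # initial guess
for _ in range(20):
    v = v - np.linalg.solve(jacobian(v), grad_L(v))

print(v)  # converges to x = y = sqrt(2) ~ 1.414..., lam = 0.5
```

From this starting guess, the iteration lands on the constrained maximizer (sqrt(2), sqrt(2)) with lambda = 1/2, and each iteration is just a small linear solve, which is exactly why this formulation is so computer-friendly.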
It's just sort of a cleaner way to package it all up. So, with that, I'll see you next video where I'm going to talk about the significance of this Lambda term—how it's not just a ghost variable, but actually has a pretty nice interpretation for a given constraint problem.