The hidden beauty of the A* algorithm

12m read

·Dec 1, 2024

Why are map applications so fast? If I want to find the shortest path from Prague to Rome, Google Maps tells me the answer extremely fast, in about a second or two. If this were your first course in programming, the answer would be that we can represent the road network as a weighted graph and then run Dijkstra's algorithm, a standard shortest path algorithm. It starts in Prague and searches in all directions until it discovers Rome, at which point it finds the shortest path.

Dijkstra's algorithm is certainly very nice, but before it discovers Rome it needs to visit all other places that are closer to Prague than Rome, like Brussels, Gdansk, or Lviv. But if I asked you to find the shortest way from Prague to Rome, you probably wouldn't even consider going through these cities. This suggests we can improve upon Dijkstra's algorithm and this is exactly what the A* algorithm does.

A* is a variant of Dijkstra's algorithm that can incorporate additional information about our graph and in the case of a map, it can search like this. You can see how the algorithm only needs to explore a much smaller portion of the network before discovering Rome, which is also why real map applications are based on A* and not Dijkstra.

But how exactly does the A* algorithm work? We'll show you in this video! My explanation is going to be very different from the usual one and on the way we'll learn a general technique that was recently used to solve another basic problem from computer science. So even if you already know A*, I'm pretty sure you'll learn something new.

I will explain the algorithm in three parts and in the first one, I want to tell you what is, for me at least, the most important idea behind A*. First, let me think of the graph as a directed graph with every edge having a non-negative length. The main idea is that we will simply change the edge lengths in our graph and then run Dijkstra's algorithm on that new graph.

Here you can see that I changed the edge lengths in such a way that the edges that get us geographically closer to Rome got shorter and the opposite edges got longer. If we run Dijkstra on this new graph, the algorithm prefers the edges going roughly in the right direction and finishes sooner, just like we wanted.

But how should we change the edge lengths? There's an obvious problem. If we for example just pick some edge and make it shorter, maybe the shortest path with original lengths was this one but after the change this one becomes shorter. This means that our algorithm will return the wrong path. So what can we do? Well, maybe we need to change more than one edge at a time.

What if I take these two edges and while increasing one of them, decrease the other one? Now any path from Prague to Rome that uses the longer edge first gets longer by one, but immediately after that gets shorter by one, so it keeps the same length, which is what we wanted. We can make this change even better if we change all these four edges at once.

One reason why this is better is that we can generalize this operation to other nodes of the graph. For example, pick a different node, make all of the ingoing edges longer by one and all the outgoing edges shorter by one. Then, we can apply the same argument. Any path going through this node first gets a bit longer and then shorter by the same amount, so it maintains its original length.

There are two special cases for our operation, Prague and Rome themselves. Let's try the operation for Prague and make the ingoing edges longer by two and the outgoing edges shorter by two. Now it's a bit different because all paths from Prague to Rome go out of Prague but they never go in, so their length does not stay the same, they all get shorter by two.

But that's also okay, we want the shortest path in the new graph to be the same as in the old graph and if we make all paths from Prague to Rome shorter by two, it doesn't change which one of them is the shortest one. Great, we managed to change quite a lot of edge lengths in our graph without changing what the shortest path is. It feels like dark magic, right?

Fortunately, there's also a very visual way of thinking about these operations. Do you still remember how we changed these lengths by one? Here's how we can think about it. What I did is that I simply added a third dimension to the picture and raised this node to the elevation of one.

Every edge then just gets longer by the amount you need to climb or shorter by the amount you descend. It's just a different way of viewing the same trick but it makes it much more intuitive. We can view the three operations that we've discussed so far as raising these two nodes to height 1 and Prague to height 2, and then we just redefine the lengths of all edges using the following formula: the new length of every edge is equal to its old length + the height of the node it goes to - the height of the node where it starts.

For example, the formula says that this edge should get shorter by one if you walk from Prague, but longer by one in the opposite direction. This formula will be super important for us, because although we defined it to be true just for edges, it holds for any path. Example: take this path from Prague to Rome. Originally, its length was 6.4 but because Prague is higher than Rome by two units, the new length of the path and also the new distance from Prague to Rome decreased by 2.

In general, whenever we know the distance between two nodes in the original graph, we can use this formula to compute the distance in the new graph. These heights are usually called potentials, so I'll use that name from now on and I will also use the name potential reweighting for the operation where the edges going up get longer and the edges going down get shorter.

Why is it called potentials? Well, if you know some physics, you might have a bit of a deja vu right now, but let's not get distracted. Let's go back to our plan. Remember, we wanted to first change the input graph and then run Dijkstra on it. We can now refine this strategy. First, we choose some clever potentials. Second, we apply our potential reweighting trick. And third, we run Dijkstra on the reweighted graph.

And that's basically the A* algorithm. But there's still a lot to understand, like how to pick the potentials in the first step, and how should we implement it? Let's see. Let's try to make a list of what a good potential should satisfy. The first property that we want is that the new distance from Prague to Rome after we do potential reweighting is as small as possible. This will ensure that we find Rome fast.

Fortunately, we understand very well how to achieve that. Remember, we have a formula for the new distance. It says that the new distance from Prague to Rome is equal to the old distance + the potential of Rome - the potential of Prague. To make it simple, let's assume that the potential of Rome is zero. We can do that without loss of generality, because moving all potentials by the same amount doesn't change the result of potential reweighting.

Then, minimizing the distance from Prague to Rome is equivalent to making the potential of Prague as large as possible, so let's add that to the list as the first property. Let's see what breaks if we simply follow this first rule. Let's leave the potential of all nodes as 0 and simply increase the potential of Prague. The distance from Prague to Rome decreases like we wanted.

Actually, when the potential of Prague exceeds the original distance from Prague to Rome, the distance even becomes negative! That sounds very suspicious. Dijkstra's algorithm doesn't work if there are negative length edges in our graph and if the distance from Prague to Rome is negative, it doesn't even matter how we set potentials of other nodes, there will simply have to be a negative length edge somewhere.

So we have a necessary condition that the potential of each node needs to satisfy: it's always at most the distance from that node to Rome. Although this condition is quite easy to understand, it's unfortunately not sufficient. Even if the potential of Prague is just three but we leave all other potentials to be zero, all of these edges are still negative, although the distance from Prague to Rome no longer is.

So, the sufficient condition that needs to be satisfied is simply that for every edge its new length has to be nonnegative. If we plug in our formula for the new length of an edge and rearrange we get an inequality that says that along any given edge, the potentials cannot increase by more than the edge's length.

Of course, if the sufficient condition is satisfied, so is the necessary one. However, let me add both of them to the list simply because the necessary condition is easier to understand; it simply says that the potential has to be an optimistic estimate of the actual distance to Rome. Reasonable estimates fortunately also usually satisfy the sufficient condition.

Let's see some examples now. First of all, what happens if we simply choose the potential of every node to be the distance from that node to Rome? That's actually the best potential we can hope for. First of all, this inequality is satisfied and I leave it to you to check the more complicated condition. And second, since the potential of Prague cannot be higher than the distance from Prague to Rome, our choice of potential clearly elevates Prague as high as possible.

This choice of potential is so amazing that if we apply potential reweighting and run Dijkstra's algorithm on the new graph, the algorithm simply walks along the shortest path to Rome and it doesn't even bother exploring anything else. It's so amazing that it's also pretty suspicious, so do you see the problem? Well, the problem is how to compute them. Remember, we want to speed up Dijkstra's algorithm, so we have to be able to compute these potentials fast.

But we chose the potential of Prague to be the distance from Prague to Rome, which is what we are trying to compute in the first place. So, we have to go back to our list of requirements for a good potential and add a third property: the potential should be something we can compute fast. Now, we finally have a complete list of requirements. One good choice of potential is to make it not the actual distance from that City to Rome but the straight line air distance between it and Rome.

For example, Prague is roughly 900 kilometers away from Rome which in our units is 2.9, so its potential would become 2.9. We compute the potentials of other nodes the same way. This potential approximates the true distance from Prague to Rome quite decently. It also clearly is an optimistic estimate of the actual distance to Rome and I again leave checking the more complicated inequality to you.

Finally, it's also easy to compute it if we know the geographical locations of the cities on the map. If you're a flat earther, just plug the two positions into the Pythagorean theorem. If not, the formula is a bit different, but it's the same story. Let's see what happens if we apply potential reweighting with this potential and then run Dijkstra. Dijkstra prefers the paths that go in the direction of Rome simply because they go downhill which means they got shorter.

We implemented A* with this potential on a real sample of European roads and it was four times faster than Dijkstra, which is a great speed-up. Real map applications also usually implement the A* with this potential, together with many other tricks. Speaking of implementation, there is still one important piece of the puzzle remaining. Let's write down the pseudocode of our algorithm. We'll first write the function that computes the potential.

Then we apply the potential reweighting trick to change all edge lengths. And in the end, we implement Dijkstra's algorithm. This implementation of A* is pretty bad, because in the potential reweighting part, we are iterating over all the edges in the input graph which is horrible. Remember, looking at the whole graph is exactly what we're trying to avoid. Fortunately, there is a simple trick that we can do.

Let's erase the code for potential reweighting and instead let's do the reweighting implicitly in Dijkstra. We can actually do it by changing only one line of code in the algorithm. Basically, we let the algorithm work with the old edge lengths, but when we're deciding what node to pick next, we'll use our reweighting formula to convert the distances in the old graph to distances in the new one. And as the final beautification, we can drop this term which is constant, so it doesn't matter.

If you didn't catch everything, never mind. What's important is that the new algorithm is doing exactly the same thing as it was doing before, but it's finally fast. And this is how the A* algorithm is always implemented, and if you understand Dijkstra's algorithm, there's a very simple reason why this code makes sense. At every step of Dijkstra's algorithm, we have a boundary of explored nodes. We look at the node from the boundary that's currently the closest to the starting node and recompute the boundary.

In A*, we also have a boundary of explored nodes, but we always select the node that minimizes the distance to the start plus the potential. Remember, the potential is an estimate on the distance from the node to the end, so when you think about it, the A* rule is trying to select the node that looks like it lies on the shortest path from the start to the end. And that's the usual intuition behind A* and, to be honest, this way of thinking about the algorithm is quite a bit simpler than how we developed the algorithm using the potential reweighting trick.

In fact, what I call potentials is usually simply denoted as the heuristic in the A* context. But there is one puzzling thing if you have the usual intuition. Remember, any A* heuristic needs to satisfy this property, which is called consistency in the context of A*, and this more intuitive version admissibility. But why does the algorithm break if the heuristic doesn't satisfy the consistency property? This is the place where our alternative intuition about A* really shines.

If you also understand that the A* algorithm is, under the hood, good old Dijkstra being run on a different graph, it's pretty clear what consistency is saying. We simply don't want to create negative-length edges. Another thing that I like about this alternative intuition is that it's very physical. A* works because we are literally creating a graph where it's harder to walk in the wrong direction and easier to go in the correct one.

The last reason why I like the potential reweighting approach to A* is that potentials have many other applications. Here's my favorite example. If you try to solve the shortest path on graphs that may have edges with negative lengths, you can't use Dijkstra's algorithm anymore. In fact, we didn't have a very satisfying algorithm for this problem, until a few months back. There is a new fast algorithm for the problem and it's based on using the potential reweighting trick to convert the input edge lengths into new ones that are nonnegative.

So this was A*. Since you already got this far into the video, you might enjoy seeing one more quick application of the algorithm - solving puzzles, for example the 15 puzzle. There, you are given 15 tiles arranged in a square and the goal is to rearrange them using the sliding moves to get the tiles nicely sorted like this. If I give you a solvable position of this puzzle, for example this one, how do you find the shortest possible solution?

Well, you can imagine the graph where all possible states of this puzzle are nodes and edges correspond to the moves you can make. This is a common trick by the way, and if you don't know it that well, I recommend watching this recent extremely nice video by Tom Slama. In this graph, you can run breadth-first search to find the shortest path, but I wouldn't advise that. The graph has roughly 10 to the 13th possible states, so you would need to run the code for weeks and use terabytes of memory.

Let's improve on it by finding a good heuristic and then run A*. One simple heuristic is to look at each tile separately and compute how many moves we would need to put it in its place - like we need three moves to put tile 1 into its final place, zero moves for tile two and so on. If we sum up all of these numbers, we get a lower bound on the number of moves needed to solve the puzzle. In other words, the heuristic is admissible and, as is the case for most reasonable heuristics, it's also consistent.

It's also solid in the other two properties, so let's plug it into A*. We tried that and on average we needed to explore roughly 10 million nodes to find the best solution, which takes about 10 seconds. I really wanted to show you this example because it features a very different heuristic than the one we saw in the case of road network graphs. In general, the A* algorithm is so amazing because whenever you decide to use it, you just need to cook up a consistent heuristic and then the A* algorithm does the rest.

By the way, in an earlier video we were looking at a similar problem - solving a Rubik's Cube using the meet in the middle trick. I find it nice that with a lot of squinting that solution can also be seen as running A* with a certain heuristic. This shows you how general and powerful A* really is. Actually, that's also why it's called the A* - it's simply the best algorithm.

So I hope you like it as much as I do and I'll see you next time! In the meantime, don't forget to check out Tom's Channel and if we forgot about a cool application of A*, let us know in the comments. Ciao!

The hidden beauty of the A* algorithm

More Articles