
DeepSeek R1 Explained to your grandma


6m read
·Jan 29, 2025

This new large language model has taken the tech world by storm and represents a major breakthrough for the AI research community. Last Sunday, while TikTok was banned for 12 hours, an AI research team from China released a new large language model called DeepSeek-R1.

As you can see on the screen, DeepSeek-R1's benchmarks show that it performs at a similar level to OpenAI's o1 model on reasoning problems in math, coding, and science. In this video, I'll cover the three main takeaways from the paper: how they use Chain of Thought to have the model self-evaluate its performance, how pure reinforcement learning lets the model guide itself, and how they use model distillation to make DeepSeek and other LLMs more accessible to everyone.

Chain of Thought is a very simple but effective prompt engineering technique where we pretty much ask the model to think out loud. We add to our prompt that we want the model to explain its reasoning step by step. That way, if the model makes a mistake, we can easily pinpoint where its reasoning went off so that we can reprompt it not to make that mistake again. Here is an example from the paper: if you give the model a math problem like this one, you can see in its response that it actually reasons through it and shows the steps it took to reach the solution. It showed its work. You can see in red where it says, “Wait, wait, there’s an aha moment,” and “Let’s reevaluate this step by step.”
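In practice, the technique is just an instruction added to the prompt. Here’s a minimal sketch; the exact wording is my own, not from the paper:

```python
def with_chain_of_thought(question: str) -> str:
    """Wrap a question in a Chain of Thought instruction."""
    return (
        f"{question}\n\n"
        "Please reason through this step by step, showing your work "
        "at each stage, before giving the final answer."
    )

prompt = with_chain_of_thought("If 3x + 5 = 20, what is x?")
print(prompt)
```

The point is that the model’s visible reasoning becomes part of the output, so a wrong step can be spotted and corrected in a follow-up prompt.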

In doing so, the model is going to give a more accurate response than if it answered directly without Chain of Thought reasoning. The way DeepSeek uses reinforcement learning is a little different from how most AI models are trained. We don’t give it the questions and answers; we let it learn on its own. It’s a lot like how a baby learns to walk for the first time. If you’ve ever watched a baby, it’s actually pretty funny: they stumble around the environment, maybe holding on to things as they figure out how to walk, and in doing so they learn how to move and position their joints so that they don’t fall.

In the same way, reinforcement learning lets us train a model by optimizing its policy, aka how the model behaves, so as to maximize a reward. As it explores its environment over time, it learns which policies maximize the reward and gradually favors those. For example, if you’re solving an equation, there may be two or three different ways to solve it, but one of them is much shorter than the others and thus earns a much higher reward.
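That reward-seeking loop can be shown with a toy bandit: the “policy” is just a preference between two solution strategies, and the shorter solution pays a higher reward. This is a teaching illustration, not DeepSeek’s actual training loop:

```python
import random

random.seed(0)

# Two ways to solve the same problem; the shorter one pays a higher reward.
REWARDS = {"short_solution": 1.0, "long_solution": 0.3}

def run_bandit(steps: int = 2000, epsilon: float = 0.1) -> str:
    values = {a: 0.0 for a in REWARDS}   # estimated reward per action
    counts = {a: 0 for a in REWARDS}
    for _ in range(steps):
        if random.random() < epsilon:            # explore
            action = random.choice(list(REWARDS))
        else:                                    # exploit the best estimate
            action = max(values, key=values.get)
        reward = REWARDS[action]
        counts[action] += 1
        # incremental average: nudge the estimate toward the observed reward
        values[action] += (reward - values[action]) / counts[action]
    return max(values, key=values.get)

print(run_bandit())  # the policy settles on "short_solution"
```

No one ever tells the agent which action is “correct”; it discovers the higher-reward behavior purely from feedback, which is the core idea behind training a model without labeled answers.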

Reinforcement learning is how many robots learn to walk and how self-driving systems like Tesla’s learn to drive through a city. If we go to the paper and look at this graph, we can see how DeepSeek-R1 improves how accurately it answers questions as we train it over time with reinforcement learning, instead of telling the model the correct answer to each question. Since that kind of labeled data is expensive to obtain, we let the model figure things out on its own while measuring how accurate it is.

You can see that while OpenAI's o1 model stays static, DeepSeek-R1 eventually outperforms it. It looks like if we let it train even longer, it would keep improving and approach 90 or even 100% accuracy. You can also see how the model uses Chain of Thought reasoning to improve its responses over time and self-reflect.

In reinforcement learning, we can't exactly tell the model how to change its policy, so we use Chain of Thought reasoning to force the model to self-reflect, evaluate its answers, and change its behavior to get closer to the maximum reward. That way, we give the model the right incentives through prompts, and it can re-evaluate how it answers questions with increasing accuracy.

This equation is the key to how DeepSeek uses reinforcement learning to optimize its policy. It uses Group Relative Policy Optimization (GRPO), which essentially uses this equation to score how well the model answered a question without needing the correct answer. It looks very complicated, so I'll just briefly explain the most important parts. We take the expectation over answers sampled from the model's old policy, and remember, the policy π is the key thing we're trying to optimize with DeepSeek: we want to change the policy so that DeepSeek outputs better, more correct answers.

We take a weighted average comparing how the model answered questions under its old policy versus how its new policy answers them. We also multiply by a standardized advantage term A_i, which is basically asking: compared to the average reward in the group, how much does this answer increase the reward?
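The advantage A_i can be sketched in a few lines. This mirrors the standardization the paper describes (each reward measured against the group's mean and standard deviation), though the reward function that scores the answers is omitted here:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize each reward against the group's mean and std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards match
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the same question, scored 1.0 (correct) or 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> [1.0, -1.0, -1.0, 1.0]
```

Answers that beat the group average get a positive advantage and are reinforced; below-average answers get a negative advantage and are discouraged. No ground-truth answer is needed, only a way to score outputs relative to each other.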

We also want to keep the model's policy from changing too much, because that can cause a lot of instability during training. If you look at most reinforcement learning runs, or even the example of the baby, the baby falls down unpredictably many times, and we want our model to be as stable as possible and avoid a roller coaster of policy changes. That's where clipping comes in. Clipping restricts how much the policy ratio can move, keeping it between 1 − ε and 1 + ε, so that each update makes only a small, controlled change to the policy while still chasing the reward.

We also subtract a regularization term, the KL divergence. This is another way to stabilize training, by making sure the policy doesn't drift too far from a reference model. In short, the equation says that we want to compare the old answers with the new answers and change the policy to maximize the reward, while keeping each change small enough to avoid destabilizing training. It's a kind of min-max situation, and that's what the weighted average is doing.
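For reference, the objective described above can be written out roughly as follows. This is my reconstruction of the GRPO formulation from the paper, where G is the number of sampled answers per question, ε is the clipping range, and β weights the KL term:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\Bigg[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\Big(
      \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\;
      \mathrm{clip}\!\Big(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\Big) A_i
    \Big)\Bigg]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}
```

The ratio compares the new policy's probability of an answer o_i to the old policy's, the advantage A_i standardizes each answer's reward against its group, and the clip and KL terms are the two stabilizers just discussed.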

The third important technique the DeepSeek researchers used with their R1 model is model distillation. The full DeepSeek model has 671 billion parameters, and to run it you pretty much need thousands of GPUs and a very expensive machine. To make it more accessible, they take the larger LLM and use it to teach a smaller LLM how it reasons and answers questions, so that the smaller LLM can perform at nearly the same level as the bigger one at an order of magnitude smaller size, like 7 billion parameters.

In the paper, the DeepSeek researchers distilled from their DeepSeek model into Llama 3 as well as Qwen. The idea is that the teacher uses Chain of Thought reasoning to generate a lot of examples of itself answering questions, and those examples are handed directly to the student for training. The student then answers questions with accuracy similar to the larger model, which makes the whole LLM ecosystem much more accessible for people who don’t have as many resources.
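Structurally, the distillation data pipeline amounts to collecting (question, teacher answer) pairs. In this sketch `teacher_answer` is a hypothetical stand-in for querying the large model; the real pipeline would filter and format the teacher's Chain of Thought outputs:

```python
def teacher_answer(question: str) -> str:
    # Hypothetical stub: in reality this would query the 671B teacher model
    # and return its full Chain of Thought reasoning plus final answer.
    return f"Step 1: restate '{question}'. Step 2: reason it out. Answer: ..."

def build_distillation_set(questions: list[str]) -> list[dict]:
    """Collect teacher CoT outputs as supervised examples for the student."""
    return [{"prompt": q, "completion": teacher_answer(q)} for q in questions]

dataset = build_distillation_set(["What is 6 x 7?", "Factor x^2 - 1."])
print(len(dataset), "training examples")
```

The student is then fine-tuned on these pairs the ordinary supervised way, which is why distillation is so much cheaper than running reinforcement learning on the small model directly.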

The key insight from the paper is that the distilled student model can actually outperform the teacher model by a small margin, while requiring a small fraction of the memory and storage. In their experiments, the researchers found that these smaller distilled DeepSeek models outperform larger models like GPT-4 and Claude 3.5 Sonnet on math, coding, and scientific reasoning tasks, as you can see in the table below.

Those three techniques are the key concepts behind how DeepSeek works. Hopefully you enjoyed this video; if you want, you can read the paper linked in the description below and play around with DeepSeek on Ollama yourself.
