
The REAL potential of generative AI


15 min read · Nov 3, 2024

You've heard of large language models like ChatGPT. They can answer questions, write stories, and even engage in conversation. But if you want to build a business that uses this technology, you'll need to ask yourself an important question: How do I take this raw model, this raw intelligence, and actually customize it to my use case? How do I make it really good for my user so that it's differentiated and better than what's out there?

This is Raza Habib. His company, Humanloop, enables large language models to have even greater superpowers. We can help you build differentiated applications and products on top of these models. The range of use cases now feels more limited by imagination than by technology. You can replicate your exact writing style, customize tone, fact-check answers, and train the model on your company's unique data.

We really hope that this is a platform on top of which the next million developers can build LLM applications. In our conversation, we explore the secrets to building an app that stands out. What made it so good that a million users signed up in five days? Was it a fine-tuning exercise? The impact of generative AI on developers? They're finding a significant fraction of their code is being written by a large language model.

And what the future of large language models might bring to society as a whole? It's an ethical minefield. There are going to be societal consequences on the path to AGI. The potential benefits are huge as well, but we do need to tread very carefully.

Let's start with the basics and high-level: What is a large language model, and why have they suddenly made a splash? I assume they've been around a lot longer than the past year or two.

So, language models themselves are a really old concept and old technology. All a language model is, is a statistical model of words in the English language. You take a big bunch of text and you try to predict the word that will come next, given a few previous words. Given "The cat sat on the...", "mat" is the most likely next word, and then you have a distribution over all the other words in your vocabulary.
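
To make the "predict the next word" idea concrete, here is a minimal sketch of a toy bigram language model in Python. The tiny corpus is invented for illustration; real language models learn the same kind of distribution over a vocabulary, just from vastly more text and with a neural network instead of counts.

```python
from collections import Counter, defaultdict

# Invented toy corpus; a real model would train on billions of words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which word follows which: the simplest possible language model.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def next_word_distribution(prev_word: str) -> dict:
    """Return P(next word | previous word) as a dict of probabilities."""
    counts = next_word_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# After "the", probability mass is spread over every word seen to follow "the".
print(next_word_distribution("the"))
```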

As you scale the language models, both in terms of the number of parameters they have and the size of the dataset that they're trained on, it turns out that they continue to get better at this prediction task. Eventually, to keep getting better, you have to start doing things like acquiring world knowledge. Early on, the language model learns letter frequencies and word frequencies, which is fairly straightforward. That's kind of what we're used to from predictive text on our phones.

But if the language model is going to be able to finish a sentence like "Today the President of the United States...", it has to have learned who the President of the United States is. If it's going to finish a sentence that's a math problem, it has to be able to solve the math problem. Where we are today builds on GPT-1 and GPT-2, but GPT-3 was really the one where everyone said, "Okay, something is very, very different here."

We now have these models of language that have no direct access to the outside world. There are lots of debates about whether they actually understand language, but they are able to do this prediction task extremely well, and the only way to do that is to have gotten better at some form of reasoning and some form of knowledge.

What are some of the challenges of using a pre-trained model like ChatGPT? One of the big ones is that they have a tendency to confidently hallucinate stuff. I think Nat Friedman describes this as alternating between spooky and kooky. Sometimes it's so good that you cannot believe the large language model was able to do that, and then just occasionally it's horrendously wrong.

That's just to do with how the model is originally trained. They're trained to do next-word prediction, so they don't necessarily know that they shouldn't make things up. They sometimes get it wrong, but the danger is that they confidently get it wrong, very persuasively, very authoritatively, and people might mistakenly trust these models. There are a couple of ways that you can hopefully fix that, and it's an open research question.

One way we can help with Humanloop is to make it very easy to pull factual context into the prompt that you give to the model. The model is much more likely to use that context rather than make something up, and we've seen that as a very successful technique for reducing hallucinations.
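
Humanloop's own implementation isn't shown in the conversation, but the general technique (often called retrieval-augmented prompting) looks roughly like the sketch below. The `search_documents` helper is hypothetical and stands in for whatever search index or vector store holds your factual data.

```python
def search_documents(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical retrieval helper: in a real system this would query
    your own knowledge base (a search index or vector store)."""
    return ["<relevant passage 1>", "<relevant passage 2>", "<relevant passage 3>"][:top_k]

def build_grounded_prompt(question: str) -> str:
    """Pull factual context into the prompt so the model is more likely
    to use it than to make something up."""
    context = "\n\n".join(search_documents(question))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_grounded_prompt("What is our refund policy?"))
```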

This is an element of building a differentiated model for your use case. Absolutely, and an element of making it safe and reliable, right? I think when ChatGPT came out, there was a lot of frustration from people who didn't like its personality. The tone was a bit obsequious; it defers and doesn't want to give strong opinions on things. To me, that demonstrates the need for many different types of models, tones, and customizations depending on the use case and the audience.

Can you talk a little bit about what it means to fine-tune a model and why that's important? If you look at the difference between ChatGPT, or the most recent OpenAI text-davinci-003 model, and what's been on the platform for two years, the difference is fine-tuning. It's the same base model, more or less. What made it so good that a million users signed up in five days was a fine-tuning exercise.

Fine-tuning means gathering examples of the outputs you want for the task you're trying to do, and then doing a little bit of extra training on top of the base model to specialize it to that task. What OpenAI did first, and others have followed, is a fine-tuning round on input and output pairs that are actually instructions and the results you would like from those instructions.
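
As a rough illustration of what those input and output pairs look like in practice, here is a minimal sketch that writes a handful of instruction/result examples to a JSONL file, the prompt/completion format that OpenAI's original fine-tuning endpoint accepted (other providers use similar formats). The examples themselves are made up.

```python
import json

# Made-up (instruction, desired output) pairs; in practice you would gather
# hundreds or thousands of these for the specific task you care about.
examples = [
    {"prompt": "Summarise this support ticket:\nCustomer says the reset link 404s.\n\nSummary:",
     "completion": " Password-reset link is broken; customer needs a manual reset."},
    {"prompt": "Write a friendly one-line reply to: 'Where is my order?'\n\nReply:",
     "completion": " Thanks for reaching out! Your order shipped yesterday and should arrive soon."},
]

# One JSON object per line: the prompt/completion format used by
# OpenAI's original fine-tuning endpoint and similar APIs.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```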

So there are human-generated pairs of data, and then they further fine-tune the model using something called reinforcement learning from human feedback (RLHF). You show people different generations from the model, ask them to rank them or choose which of two they prefer, and then use that as a training signal that can ultimately fine-tune the model. Reinforcement learning from human feedback makes a huge difference to performance.
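
The human rankings are typically turned into a training signal with a pairwise loss on a reward model: the preferred completion should score higher than the rejected one. Here is a minimal sketch of that objective, with made-up reward scores standing in for a real reward model's outputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss for training a reward model: push the score of the
    completion humans preferred above the score of the one they rejected."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Made-up scores for two prompts, each with a preferred and a rejected completion;
# in RLHF these scores come from a reward model and the labels from human rankings.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, -0.1])
print(preference_loss(chosen, rejected))  # smaller when chosen scores beat rejected ones
```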

In the InstructGPT paper that OpenAI released, they compared a one-to-two-billion-parameter model with instruction tuning and RLHF to the full GPT-3 model, and people preferred the smaller model despite the fact that it was a hundred times smaller.

Anthropic had a very exciting paper just a couple of weeks ago where they were able to get similar results to RLHF without the H, by having a second model provide the evaluation feedback instead, and that's obviously a lot more scalable.

What data do developers need to bring in to fine-tune a model? There are two types of fine-tuning you might do. Developers might just show up with a corpus of books or some background text because they want to fine-tune for tone: they have their company's chat logs, or the tone of voice from their marketing communications, and want to adjust the tone.

For example, all the emails they've sent. That's kind of almost extra pre-training. I would think about it as fine-tuning as well. The other fine-tuning data comes from in-production usage. Once they have their app being used, they are capturing the data that their customers are providing. They're capturing feedback data from that, and in some sense, it's being automated at this point.

Humanloop takes care of that data capture for you, making the fine-tuning easy. You have an interaction where the LLM produces something for a customer, and the customer gives a thumbs up or thumbs down as to whether it was helpful.

To give you a concrete example, imagine you're helping someone draft a sales email. You generate a first draft for them, and then they either send it or they don't. That's a very interesting piece of feedback that you can capture. They probably edit it, so you can capture the edited text. They may or may not get a response; all those bits of feedback are things we would capture and use to drive improvements of the underlying model.
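
The sketch below illustrates the kinds of signals described here, logged alongside each generation so they can later drive fine-tuning. It is a hypothetical schema for illustration only, not the Humanloop SDK.

```python
from dataclasses import dataclass
from typing import Optional
import difflib

@dataclass
class GenerationLog:
    """One model output plus the feedback it received (hypothetical schema)."""
    prompt: str
    generated_text: str
    sent: bool = False                 # implicit signal: did the user actually use the draft?
    edited_text: Optional[str] = None  # implicit signal: what did they change before sending?
    thumbs_up: Optional[bool] = None   # explicit signal from the user

    def unchanged_fraction(self) -> float:
        """Roughly how much of the draft survived the user's edits (1.0 = unchanged)."""
        if self.edited_text is None:
            return 1.0
        return difflib.SequenceMatcher(None, self.generated_text, self.edited_text).ratio()

# Example: a sales-email draft that the user lightly edited and then sent.
log = GenerationLog(
    prompt="Draft a follow-up email to Acme Corp about renewal pricing.",
    generated_text="Hi team, just checking in on the renewal pricing we discussed...",
    sent=True,
    edited_text="Hi Sarah, just checking in on the renewal pricing we discussed...",
    thumbs_up=True,
)
print(round(log.unchanged_fraction(), 2))
```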

Got it. If a developer is trying to build an app using a large language model for the first time, what problems are they likely to encounter? How do you guys help them address some of those problems? We help developers with three key problems: prototyping, evaluation, and finally customization.

At the early stages of developing a new large language model product, you have to try and get a good prompt that works well for your use case. That tends to be highly iterative; you have hundreds of different versions of these things lying around. Managing the complexity of that versioning and experimenting is something we help with.

The use cases that people are building now tend to be a lot more subjective than what you might have done with machine learning before, so evaluation is a lot harder. You can't just calculate accuracy on a test set. Helping developers understand how well their app is working with end customers is the next thing that we really make easy.
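
When there is no ground-truth test set, one common substitute is to aggregate end-user feedback per prompt or model version and compare the rates. Here is a minimal sketch with invented feedback events.

```python
from collections import defaultdict

# Invented feedback events collected in production: (version, user gave thumbs up?).
events = [
    ("prompt-v1", True), ("prompt-v1", False), ("prompt-v1", True),
    ("prompt-v2", True), ("prompt-v2", True), ("prompt-v2", False), ("prompt-v2", True),
]

def positive_feedback_rate(events: list[tuple[str, bool]]) -> dict:
    """Compare versions by the share of generations users rated helpful."""
    totals, positives = defaultdict(int), defaultdict(int)
    for version, thumbs_up in events:
        totals[version] += 1
        positives[version] += int(thumbs_up)
    return {version: positives[version] / totals[version] for version in totals}

print(positive_feedback_rate(events))  # prompt-v2 edges out prompt-v1 (0.75 vs ~0.67)
```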

Finally, customization. Everyone has access to the same base models; everyone can use GPT-3. But if you want to build something differentiated, you need to find a way to customize the model to your use case, to your end users, to your context. We make that much easier, both through fine-tuning and a framework for running experiments.

We can help you get a product to market faster, but most importantly, once you're there, we can help you make something that your users prefer over the base models. That seems pretty fundamental. I mean, it's prototyping, getting the first versions out, testing and evaluation, and differentiation. This seems pretty fundamental to building something great.

I think so! We really hope that this is a platform on top of which the next million developers can build LLM applications. We've worked closely with some of the first companies to realize the importance of this and understand the pain points they had. In a proper YC approach, we have tried to build something that those people really wanted, and I think we got to a point where we're seeing from others that it really does solve acute pain points for them.

It doesn't really matter to us what base language model you're using. We can help you with data feedback collection, fine-tuning, prototyping, and those problems are going to be very similar across different models. We just want to help you get to the best result for your use case, and sometimes that'll mean choosing a different model.

I wanted to ask how the job or role of a developer is likely to change in the future because of this technology. It's interesting; I've thought about this a lot. I think in the short term, it augments developers. You can do the same thing you could do, but faster.

To me, the most impressive application of a large language model we've seen so far is GitHub Copilot. I think they cracked a really novel UX and figured out how to apply a large language model in a way that's now used by, I think, 100 million developers. Many people I speak to say they find that a significant fraction of their code is being written by a large language model.

If you'd asked people two years ago whether that would happen, no one would have written that down. One thing that is surprising to me is that the people who say they use it the most are some of the people I consider to be better or more senior developers. You might have thought this tool would help juniors more, but I think people who are more accustomed to editing and reading code actually benefit more from the completions.

In the short term, it just accelerates us and allows us to do more. On a longer time horizon, you could imagine developers becoming more like product managers in that they're writing the specs, they're writing the documentation, but more of the grunt work and more of the boilerplate is taken care of by models.

I don't know; on a long enough time horizon, there are very few jobs that can be done as completely through just text as software engineering, right? We've really pushed it to the extreme. We've got GitHub, and with remote work, engineers can do a lot of their jobs entirely sitting at a computer screen.

When we get toward things that look like AGI, I suspect that developers will actually be one of the first jobs to see large fractions of their jobs automated, which I think is very counterintuitive. But predicting the future is hard!

What do you think the next breakthroughs will be in LLM technology? I think here, the roadmap is quite well-known. There are a bunch of things that we know are coming. We just have to wait for them to be achieved.

One thing developers will really care about is the context window. At the moment, when you use these models, there's a limit to how much information you can feed them each time you call them. Extending that context window is going to add a lot more capabilities.
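
To see why the limit bites, here is a small sketch that counts tokens with the open-source tiktoken tokenizer (assuming it is installed) and greedily packs retrieved documents into a fixed token budget; anything that doesn't fit simply can't be shown to the model on that call.

```python
import tiktoken  # assumes OpenAI's open-source tokenizer package is installed

enc = tiktoken.get_encoding("cl100k_base")

def fit_into_context(documents: list[str], max_tokens: int) -> str:
    """Greedily pack documents into the prompt until the token budget runs out."""
    pieces, used = [], 0
    for doc in documents:
        n_tokens = len(enc.encode(doc))
        if used + n_tokens > max_tokens:
            break  # the context window is full; this document gets dropped
        pieces.append(doc)
        used += n_tokens
    return "\n\n".join(pieces)

# With a small budget only the first document fits, which is exactly the
# limitation that longer context windows relax.
docs = ["short policy summary", "a much longer reference document " * 50]
print(fit_into_context(docs, max_tokens=40))
```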

One thing I'm really excited about is augmenting large language models with the ability to take actions. We've seen a few examples of this. There's a startup called Adept AI that is doing this, among a few others, where you essentially let the large language model decide to take some actions.

It can output a string that says, "Search the internet for this thing," and then based on the result, generate some more and repeat. You start treating these large language models much more like agents than just text generation machines.
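
A minimal sketch of that generate-act-observe loop is below. Both `call_llm` and `search_internet` are hypothetical stand-ins (not Adept's system or any real API); the point is only the control flow, where the model's output decides whether to take an action or finish.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    # A real model would decide what to do; here we finish immediately so the sketch runs.
    return "FINAL: (answer would go here)"

def search_internet(query: str) -> str:
    """Hypothetical stand-in for a real search API."""
    return f"<search results for {query!r}>"

def agent_loop(task: str, max_steps: int = 5) -> str:
    """Treat the model as an agent: it emits either an action to execute
    or a final answer, and each observation is fed back into the prompt."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        output = call_llm(
            transcript + "Respond with 'ACTION: search <query>' or 'FINAL: <answer>'.\n"
        )
        if output.startswith("ACTION: search"):
            observation = search_internet(output[len("ACTION: search"):].strip())
            transcript += f"{output}\nOBSERVATION: {observation}\n"
        else:
            return output[len("FINAL:"):].strip()
    return "Stopped after too many steps."

print(agent_loop("Find the current population of Lisbon."))
```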

Well, if something we have to sort of expect or look forward to is AI taking actions, can this technology fundamentally be steered in a safe and ethical direction? Oh gosh, that's a tough question. I certainly hope so.

I think we need to spend more time thinking about this and working on it than we currently do, because as capabilities increase, it becomes more pressing. There are a lot of different angles to that, right? There are people who worry about existential safety. People like Eliezer Yudkowsky distinguish this from just normal AI safety: the concern is making sure AI doesn't kill everyone.

He thinks the risks are potentially so large that this could be an existential threat. Then there are the shorter-term threats of social disruption. People feel threatened by these models. There are going to be societal consequences, even to the weaker versions on the path to AGI, that raise serious ethical questions.

The models contain the biases and preferences that were in the data and in the team that built them at the time of construction. It's an ethical minefield. I don't think that means we shouldn't do it, because the potential benefits are huge as well, but we do need to tread very carefully.

How strong is the network effect with these models? In other words, is it possible that in the future, there may be one model that sort of rules them all because it will be bigger and hence smarter than anything anyone else could build? Or is that not the dynamic that's at play here?

I don't think that's the dynamic at play. The barriers to entry for training one of these models are mostly capital and talent. The people needed are still very specialized and very smart, and you need lots of money to pay for GPUs.

Beyond that, I don't see that much secret sauce. OpenAI, for all the criticism they get, have actually been pretty open, and DeepMind has been pretty open. They've published a lot about how they achieved what they've achieved.

The main barrier to replicating something like GPT-3 is whether you can get enough compute, smart people, and data. More people are following on their heels. There's some question about whether the feedback data might give them a flywheel—I'm a little skeptical that it would provide so much that no one could catch up.

Why? That seems pretty compelling. If they have a two-year head start, and thousands of apps get built, then the lead they have in terms of feedback data would seem to be impressive.

I think the feedback data is great for narrower applications. If you're building an end-user application, you can get a lot of differentiation through feedback and customization. However, they are building this very general model that has to be good at everything.

They can't kind of let it become bad at code while it gets good at something else—others can do that.

Got it. Now let me ask you probably the hardest question here. OpenAI’s mission is to build AGI—artificial general intelligence—so that machines can be at the cognitive level of humans, if not better. Do you think that's within reach? Do the recent breakthroughs mean that that's closer than people thought, or is this still, for the time being, science fiction?

There's a huge amount of uncertainty here, and if you poll experts, you get a wide range of opinions. Even if you poll the people who are closest to it, opinions differ.

However, compared to most of the public's perception, many of them think it's plausible sooner than a lot of us thought. There are prediction markets on this: Metaculus polls people on when they think AGI will arrive, and the median estimate is something like 2040.

Even if you think that’s plausible, that's remarkably soon for a technology that might upend almost all of society. What is very clear is that we are still going to see very dramatic improvements in the short term, and even before AGI, a lot of societal transformations, a lot of economic benefits, but also questions we have to wrestle with to make sure that this is positive for society.

So, on the short end of timelines, there are people who think 2030 is plausible, but those same people will accept there's some probability that it won't happen for hundreds of years. You know, there's a distribution. And even if you take it seriously, it's very hard to really take it seriously: to make that choice of, "I'm going to accept that by 2030 it's plausible we'll have machines that can do all the cognitive tasks that humans can do and more."

Then you ask me, "Okay, Raza, are you building your company in a way that makes sense in that world?” I’m trying, but it’s really hard to internalize that intuitively. Stuart Russell has a point where he says, if I told you an alien civilization would land on Earth in 50 years, you wouldn't do nothing. There’s some possibility that we’ve got something like an alien arriving soon.

So, let me ask you, what does this new technology mean for startups? Oh man, it's unbelievably exciting! It's really difficult to articulate. There are so many things that previously required a research team and felt impossible that you can now do by just asking the model.

Honestly, stuff that during my PhD I didn't think would be possible for years, problems I spent ages trying to solve, is now achievable. For example, say you want a system that can generate questions, or a really good chatbot like ChatGPT: a realistic one that can understand context over long stretches of conversation, not like Alexa or Siri, which only handle single messages.

The range of use cases now feels more limited by imagination than by technology. When there is a technology change this abrupt, where something has improved so much (YC teaches this, right?), it opens up opportunities for new applications.

We're beginning to see it: a sort of Cambrian explosion of new startups. I think the latest YC batch has many more of these startups. We see it at Humanloop; we get a lot of inbound interest from companies that are at the beginning of their explorations and trying to figure out how to take this raw model and this raw intelligence and turn it into a differentiated product.

Hopefully, we have some AI engineers or aspiring AI engineers listening today who might be interested in working at Humanloop. Are you guys hiring, and what kind of culture and company are you trying to build? We absolutely are hiring! We're hoping to build a platform for what's potentially one of the most disruptive technologies we've ever had, ideally used by millions of developers in the future.

There's going to be a lot of doing things for the first time and inventing novel UX and UI experiences. So we're looking for full-stack developers who feel genuinely comfortable up and down the stack, who deeply care about the end-user experience, and who will enjoy speaking to our customers.

They're fun customers to work with because we're working with startups and AI companies that are really on the cutting edge; they're innovators. If that sounds exciting to you: the work will be very hard, and lots of it will be very new, but it will also be very rewarding.

Well, this has been really fascinating. My crystal ball says that one day in the future, literally millions of developers will be using your tools to build great applications using AI technology. So I wish you luck, and thank you again for your time! Thank you, Ollie. It's been an absolute pleasure.
