
How To Build Generative AI Models Like OpenAI's Sora


22m read · Nov 3, 2024

A lot of the Sci-Fi stuff is actually now becoming possible. What happens when you have a model that's capable of simulating real-world physics? Wouldn't it be cool if this podcast were actually an Infinity AI video? One thing I noticed is that the lip syncing is extremely accurate; it really looks like he's actually speaking Hindi. How do YC companies build foundation models during the batch with just $500,000? This was literally built by 21-year-old new college grads, and they built this thing in two months. I think he locked himself in his apartment for a month and just read AI papers. You can actually be on the cutting edge in relatively short order, and that's an incredible blessing.

Welcome back to another episode of The Lightcone. Today we're talking about generative AI. First there was GPT-4, then there was Midjourney for image generation, and now we're making the leap into video. Harj, we got access to Sora, and we're about to take a look at some clips that they generated just for us. Yep. Should we take a look?

Okay, so here's the first one. The prompt is: it's the year 2050, and a humanoid robot acting as a household helper walks someone's golden retriever down a pretty, tree-lined suburban street. What do we think? I like how it actually spells out "help." It's like a flex: I can spell now. Which was not true with the image models; they always screwed up text in images. Stable Diffusion and DALL-E were notoriously bad at spelling text, so that is a major advance that no one's really talked about yet.

I mean, it's wild how high definition it is; it's almost realistic. And the other really cool thing is the physics. The way the robot walks, for the most part, is very accurate. You do notice a little shuffle that's a bit off, but for the most part, it's believable. And the way the golden retriever moves: I have a golden retriever, so I can personally vouch that they modeled it perfectly.

You have one, right? Like your dog, right? Yeah, a perfect representation of how a golden retriever walks. I also like that with DALL-E and Stable Diffusion, as you made your prompts longer and longer, they would just start ignoring parts and not actually do exactly what you told them to do. We gave it a very specific prompt here, and it did exactly the thing we told it to do. You can see it's still not exactly perfect, though. Towards the end, you see what looks like a floating dog or something in there.

Okay, I was going to call out a couple of other imperfections here, which is that the street is not really a street, guys. What's up with that? It's not quite a sidewalk, not quite a street. Yeah, but in the future, we won't need cars anymore. And then one of the structures on the side is sort of jumping, and there's this floating object on the right, if you watch carefully, which looks like a little dog or something. I'm not sure. Still, this is a real breakthrough.

If you look at some of the stuff that Meta put out, I always think about that clip of Will Smith trying to eat a plate of spaghetti, and that looks insane. It's sort of what you would get if you fed the previous frame into the same model to try to generate the next frame, and it just wasn't stable. And that wasn't too long ago.

Yeah, the other thing that I find really impressive about this video is that it has long-term visual consistency. It's like a minute long, and all the houses are similar in architectural style. There's no discontinuity; all the trees look similar. It's clearly all taking place in the same world. The next one's a drone camera circling around the Golden Gate Bridge. The view showcases the magnificent cliffs and ocean waves, with views of San Francisco in the background.

The view is stunning, captured with beautiful photography. That is the Golden Gate; it knows what the Golden Gate Bridge looks like, and I think you can see Alcatraz there a little bit too. Yeah, the high definition is amazing, and you can see the city in the background, as we asked for. It's definitely not geographically accurate; the terrain is not quite the way it is in the real world, but it looks visually somewhat similar.

Yeah, and you can see it's not quite perfect, because early on in the clip, if you look at one of the columns of the bridge from a particular angle, it looks disjointed. Can you see that one? Oh yeah, the back one, and then it sort of lines up when we get to this angle. Also, if you go back to the beginning of the clip and look at the cars driving on the bridge, they're driving on the wrong side of the road! That one's about to cause a traffic accident. Maybe there's some training data from the UK?

I guess the other detail is that, in computer graphics, it's incredibly difficult to simulate fluids. It's still a little bit wonky with the waves; they're a little static. I've seen other Sora clips where it captures the motion of water incredibly well. One thing I'm really curious about is how Sora works under the hood and how they're generating these videos. So, Diana, could you give us a brief primer on what's actually going on?

And one thing I was particularly curious about is: is this a new model, or is this an extension of the transformer model that we all know about as powering ChatGPT? I think the TL;DR, and the really cool thing here, is that it's really a combination of a transformer model, which has typically been used mostly for text, and a diffusion model, which is a lot of the tech behind DALL-E and Midjourney for generating images.

So, it's combining these two and then adding a temporal component so you get consistency between frames over time. And I think the key thing that OpenAI did was to train this on videos using what they call spacetime patches. A spacetime patch is basically a small 3D block of pixels: a spatial patch stacked across several frames in time, since multiple frames make a video. And they vary the sizes of these patches, smaller to bigger, along all three axes. Then they train all of this in a giant architecture, which is really expensive.
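
To make the spacetime-patch idea concrete, here is a minimal sketch of cutting a video into patch "tokens." This is not OpenAI's code, and the patch sizes are purely illustrative; it only shows how a video tensor becomes a sequence of flattened spacetime blocks.

```python
import numpy as np

def extract_spacetime_patches(video, t=4, p=16):
    """Cut a video of shape (T, H, W, C) into spacetime patches.

    Each patch is a small 3D block: p x p pixels spatially, t frames
    temporally. The flattened patches play roughly the same role for
    video that word tokens play for text.
    """
    T, H, W, C = video.shape
    # Trim so the video divides evenly into patches (illustrative only).
    T, H, W = T - T % t, H - H % p, W - W % p
    video = video[:T, :H, :W]
    patches = (
        video.reshape(T // t, t, H // p, p, W // p, p, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch indices first
             .reshape(-1, t * p * p * C)       # one row per spacetime patch
    )
    return patches  # shape: (num_patches, t * p * p * C)

# Example: 16 frames of 240x320 RGB video -> a sequence of patch "tokens".
video = np.random.rand(16, 240, 320, 3)
tokens = extract_spacetime_patches(video)
print(tokens.shape)  # (4 * 15 * 20, 4 * 16 * 16 * 3) = (1200, 3072)
```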

So, are these spacetime patches the video equivalent of tokens? Sort of. There's a lot of prior work behind Sora. Transformers had mostly been applied to text, and one of the prior works was Google's, demonstrating that you could use transformer models not just for English text but for images. That was a foundational paper, which I think they published in 2020, called "An Image is Worth 16x16 Words." They called it a Vision Transformer, and they demonstrated that you could use transformer models for image recognition tasks, because the state of the art up to then was convolutional neural networks, which were very expensive to compute.

So, that was one piece of the puzzle. The other piece was the spacetime concept, and I think some of that comes from stitching together different past work, like this other paper, "World Models," which came out in 2018 from the robotics side. It separates out the perception piece, which handles the visual part, and then there's a memory model for the temporal aspect.

The temporal aspect in the World Models paper uses an RNN, and then there's a controller model that combines them. They don't explain too much; OpenAI is a bit coy about it, so we can only speculate, but it's likely a combination of ideas from robotics papers plus transformers plus the text work. And then how much more expensive is it to generate one of these videos compared to generating text? How do we even think about that?
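
For reference, the World Models paper splits an agent into three parts: a vision model that compresses each frame into a latent code, a recurrent memory model over those codes, and a small controller. Below is a rough, simplified sketch of that wiring; the layer sizes are made up, and the real vision component is a convolutional VAE rather than a single linear layer.

```python
import torch
import torch.nn as nn

class Vision(nn.Module):
    """V: compress a frame into a small latent vector z (stand-in for a VAE)."""
    def __init__(self, frame_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.encoder = nn.Linear(frame_dim, z_dim)

    def forward(self, frame):
        return self.encoder(frame.flatten(start_dim=1))

class Memory(nn.Module):
    """M: an RNN that models how the latent z evolves over time."""
    def __init__(self, z_dim=32, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(z_dim, hidden, batch_first=True)

    def forward(self, z_seq):            # z_seq: (batch, time, z_dim)
        h_seq, _ = self.rnn(z_seq)
        return h_seq                     # (batch, time, hidden)

class Controller(nn.Module):
    """C: a tiny model that picks an action from the current (z, h)."""
    def __init__(self, z_dim=32, hidden=256, n_actions=3):
        super().__init__()
        self.policy = nn.Linear(z_dim + hidden, n_actions)

    def forward(self, z, h):
        return self.policy(torch.cat([z, h], dim=-1))
```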

Oh man, so imagine that GPT-4 is like a trillion parameters, and that's for text, which is basically two-dimensional data, just a matrix. Video adds another dimension, so I can imagine this is at least one order of magnitude bigger, say 10 trillion. Okay, that's amazing! And probably 10 times the number of GPUs. I think it was about 20,000 or 30,000 GPUs for GPT-4; I forget the exact number.

What's crazy is that we have companies within YC that have also been able to achieve similar types of functionality, and they clearly have way fewer resources than OpenAI does. So I'm curious how they managed to do that. The way I kind of think about this is that there are components of building one of these foundational models, like data, compute, and expertise.

Should we talk through some of the YC companies and how they've managed to hack each or all of those things? Basically, how do YC companies build foundation models during the batch with just $500,000? Yeah, I think it's an important topic, because people know how much money OpenAI is spending on GPUs, so there's this meme going around that in order to do this you need to have raised billions of dollars and have a data center full of GPUs. And we've actually seen that that's not true. There are a bunch of companies in the current batch, Winter 2024, that just in the time of the batch, with just the $500K that YC gives them, have built really awesome foundational models that are producing magical results.

Should we look at some of these demos and see how they managed to get this to work? Yeah, let's start with Infinity AI. Infinity AI is a company in the current batch, and what they do is make deepfake videos of a particular person. So, for example, they have an AI replica of Elon Musk, and you can just tell Infinity AI what you want Elon Musk to say, and they will produce a video of Elon Musk saying exactly that thing.

Watch a demo? Yeah, let's see a demo! Speaking of YC companies training their own models, did you guys see the Infinity AI demo last week? Yeah, they're a company in my group. Infinity allows people to make videos by just typing out a script. Wouldn't it be cool if this podcast were actually an Infinity AI video? That'd be super cool! You think they'd be up for that?

Well, guys, I have a surprise for you. Here we are! That was pretty good! So special thanks to the Infinity AI team, who made a model for the Lightcone podcast. The way they did this is they literally just downloaded our YouTube videos from the first three episodes and trained their model on that. The cool thing about these models now is that you don't need much data, once you've trained the foundation model, to adapt it to a new person.

So just the hour or so of YouTube video that we had was enough for them to get a really accurate representation. I can talk about another company: Sync Labs. Sync Labs is an API for real-time lip syncing, and the crazy thing about this team is that they trained their models on a single A100 and are generating these kinds of results. So let's take a look.

I'm guessing this guy doesn't actually speak Hindi? No, no. Okay. One thing I noticed is that the lip syncing is extremely accurate; it really looks like he's actually speaking Hindi. Yeah, and if we put it in this framework that you were mentioning, Harj, of how YC companies do this, there are different vectors: computation, data set, and speed. They kind of hacked all of those.

For the data set, the clever thing they did, and it's almost unheard of to train a video-and-audio model with so few resources, is that they compressed a lot of the data into low-res video. You don't need the high-res video: 1080p versus, say, the 240p version is roughly a quadratic factor fewer pixels, because you're shrinking both dimensions. So that's what they did.
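
As a rough sanity check on that claim, here is the pixel arithmetic (the frame sizes assume 16:9 and are only illustrative; the actual training resolutions aren't public):

```python
# Pixels per frame at two common resolutions (16:9 aspect ratio assumed).
full_hd = 1920 * 1080        # 2,073,600 pixels per frame at 1080p
low_res = 426 * 240          #   102,240 pixels per frame at 240p
print(full_hd / low_res)     # ~20x fewer pixels per frame, before any codec tricks
```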

The other thing that enabled them to move a lot faster is the deal that we did with Azure, where we have a dedicated GPU cluster for companies in the batch. They've been able to iterate 100 times faster than before the batch. A lot of companies out there decide they need to do fine-tuning, and they need access to GPUs and just can't get it, or they've got to pay an arm and a leg and prepay for a year in advance, and maybe they'll get it in 2025. But if you're in the YC batch, it turns out you can get them.

Yeah. You get over half a million in credits, and there's no contention for resources; you actually get access to a GPU cluster within 24 hours, which is pretty cool. YC invests half a million dollars, but I think the companies in the batch training these models literally didn't have to touch the YC money to do it. That was all extra free credit, unrelated to the YC investment.

Should we talk about Sonauto? Sonauto is another company in the Winter 2024 batch, and they have built a text-to-song model. You can give their model lyrics to a song and tell it who you want to perform the song. For example, you can say, I want Taylor Swift to sing a birthday song for my dog, and it will make exactly that song. There are only two or three models in the world that have ever been trained to do this, and I think Sonauto's is actually the best one.

Oh wow. And the really cool thing is that the founders of Sonauto are literally 21 years old. So, Harj, to your point about expertise, this was not built by PhD machine learning researchers who have been working in machine learning for 10 years or something. This was literally built by 21-year-old new college grads, and they built this thing in months. They basically just taught themselves: they went online and figured out how to do it.

That is very impressive. Should we take a look at it? Yeah, so this is a song that they made for the YC batch, and it's like a power march about Y Combinator.

[Music] In the heart of the valley where futures are made, founder of—Is this how we're going to open the batch? Yeah, from now on. That's a good idea! We need big orange banners behind us, and we have to wear military garb, though. With orange armory gear? We could do our own song for Demo Day. AI generated! I think we have to now! You have to!

This is very impressive. One thing I really like about this is that you can actually understand the lyrics, and it really does sound like someone is singing it. This is the first time I've heard AI vocals like that.

Yeah, and to your point, Jared, there's another company that also didn't have PhD-level machine learning expertise. It's called Metalware. They're building a co-pilot for hardware. The founders used to work as hardware engineers at SpaceX, where they had to build all these hardware designs, so they're very familiar with building hardware, and when they came into the batch, they decided to build basically a co-pilot for hardware design.

They didn't have much AI background, and they figured it out. One of the cool things about them is that they also trained a foundation model for this, because there was no model available for it, and they were able to do it during the batch. In that same framework of things to hack, data and computation, in terms of data they got away with using less data but of higher quality. What they did is take a bunch of figures and information from hardware textbooks, scan all of that, and add it as input, which is clever, right?

The other thing is that, because they didn't need as much data, they could choose to work with a model that's less computationally intensive. So they actually used GPT-2, which seems counterintuitive, because GPT-2 only has on the order of a billion parameters, versus GPT-4, which is supposedly around a trillion.

Yeah, and they were able to get away with using fewer computational resources because they used a smaller model and better data, and then they could do all these hardware design co-pilot tasks, which is really cool. So when you constrain your task, stay very specific, and keep the data set very high quality, that's another way you can hack building a foundation model during the batch.
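
As an illustration of that recipe, here is a minimal sketch of fine-tuning a small pretrained model like GPT-2 on a narrow, high-quality text corpus with the Hugging Face transformers library. This is not Metalware's actual pipeline, and the file name hardware_textbooks.txt is hypothetical.

```python
# Minimal sketch: fine-tune GPT-2 on a small, high-quality domain corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assume the scanned textbook material has been cleaned into plain text.
dataset = load_dataset("text", data_files={"train": "hardware_textbooks.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-hardware", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```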

And this works for all kinds of applications, not just generating video or text. There's one I'm really excited about in the current batch called Guab. They're building an explainable foundation model, because one of the problems with all these foundation models and deep learning is that they're kind of a black box: nobody knows what's going on. You put in the data, it predicts the label, and you have no idea how that happened.

Prior to deep learning, you could, because you had the weights and could understand which features contributed to the label. So this team is building a foundation model that can explain its outputs, and they trained a model during the batch. Nice. As a founder, when is it the right call to invest in building your own model versus just using one of the open-source models and fine-tuning and tweaking it to fit what you need?

Well, I guess it depends on what you're really looking to build. If you're in a very specific, even niche, space, you can get away with training your own foundation model, like the Metalware guys did. But if you're, say, doing something broader with language, GPT-4 gets you quite a bit further along. So it depends on the task too, right?

So if we're thinking about it as data, compute, and expertise, we're basically saying expertise is maybe overrated; we're proving that if you're just smart and willing to read the papers, you can figure it out. For compute, there are many ways around it; being in YC is one way, since you can get credits and take some of that cost off.

And so then is the data piece where all the edge is? If you can find high-quality, but not necessarily giant, data sets, is that the hack? Oh yes! Let's talk about Phind. Phind is a company building a co-pilot for software, and the answers they're generating are even better than Stack Overflow's.

Interesting! And these were also kids out of college without much of a background in this, and they did a very clever data hack to build their own model: they created a bunch of synthetic data in the style of programming competitions, so they could generate as many of those data sets as they wanted, and the quality was much higher. Imagine that! It's basically infinite if it's synthetic. It's interesting, because I feel like synthetic data has been looked down on. It was controversial initially.

Yeah, why? Why was it originally controversial, and why does it actually seem to work? It seemed circular; it seemed like it would be impossible for a model to generate its own training data. How can you learn from data that you generated yourself?

Yeah, it wasn't obvious that such a thing could be possible. It seemed to violate some law of conservation of energy. I remember the meme going around on Twitter was the mosquito drinking its own blood; that's how synthetic data works.

Yeah, but then it turns out it actually works! Interesting! I think maybe this is related to the idea that some of these LLMs are actually capable of reasoning. Once you can reason, maybe that's the part that sort of spins up the flywheel and makes it possible.
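
One concrete reason it can work for code, at least, is that synthetic examples can be filtered by an external check before they ever become training data, so the model isn't learning purely from its own unverified output. Here is a hedged sketch of that pattern; model_generate is a hypothetical stand-in for whatever model proposes candidate solutions, and none of this is Phind's actual pipeline.

```python
import pathlib
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run a candidate solution together with its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        pathlib.Path(path).unlink()

def build_synthetic_dataset(problems, model_generate, samples_per_problem=8):
    """Keep only generated solutions that actually pass their tests."""
    dataset = []
    for prob in problems:
        for _ in range(samples_per_problem):
            candidate = model_generate(prob["prompt"])   # hypothetical model call
            if passes_tests(candidate, prob["tests"]):
                dataset.append({"prompt": prob["prompt"], "completion": candidate})
    return dataset
```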

You know, there are other interesting analogies. I think there's a healthy debate out there whether or not this will come together. But you could look at self-driving car models, which are often trained on massive amounts of simulation data instead of actual real drive time—sometimes by a factor of 10:1 or more. And that might end up being true for some of the generative AI models too.

Is it possible Sora will do that as well? Will Sora generate its own video to continue improving its own model? Probably! I know OpenAI doesn't share much about their data sources, because that's part of the secret sauce, but they're almost certainly using video footage generated from Unreal Engine or Unity, one of these game engines, because those have full physics simulators.

So you could create multiple versions of the same scene. Take the example of the car driving along the cliff: they could generate it from multiple camera angles, because with a game engine you can position the camera anywhere and basically render the footage from all possible camera views.
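
In pseudocode terms, the multi-view idea looks something like the sketch below. The scene.render call is a hypothetical placeholder for whatever a real engine such as Unreal or Unity actually exposes, and nothing here reflects how OpenAI sources its data.

```python
import math

def orbit_cameras(n_views, radius, height):
    """Place n cameras in a circle around the subject, all looking inward."""
    for i in range(n_views):
        angle = 2 * math.pi * i / n_views
        yield {
            "position": (radius * math.cos(angle), radius * math.sin(angle), height),
            "look_at": (0.0, 0.0, 0.0),
        }

def render_multiview(scene, n_views=12, n_frames=120):
    """Re-render the same simulated scene from many camera poses."""
    clips = []
    for cam in orbit_cameras(n_views, radius=30.0, height=5.0):
        clips.append(scene.render(camera=cam, frames=n_frames))  # hypothetical API
    return clips
```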

The physics part of this is really interesting. Most people, when they see these Sora demos or first encounter this concept, their minds go to, “Oh, this will be cool for generating films or video games,” like entertainment. But if what you're saying is that it can actually simulate the real world, there are probably going to be lots of further-reaching implications.

What's happened in America is that no commercial company has been able to create its own weather model, because it's too expensive to do it the old-school, physics-based way. And so what's really cool about Atmo is that, instead of using the old-school physics approach, they've trained a foundational model. Using machine learning, it's something like a million times more efficient to run the same calculation.

And because of that, this startup, which has only raised a seed round, is actually able to make a weather prediction model that is more accurate than the NOAA-funded one that cost over a billion dollars. Interesting! What's really surprising about text-to-video is just how far-reaching the implications are. You can go way beyond just generating video games.

What are other examples of cool things we could do if we have a physics simulator of the real world? Well, there are a bunch of companies applying it to biology. Diana, do you want to talk about a couple of those?

Yeah, so it turns out all these foundation models are great function approximators for almost anything. They're general-purpose learning algorithms, and the human body can be simulated with functions too. One of the companies we funded is called Diffuse Bio. They're building generative AI for proteins.

What they're doing is building these big models to create new molecules for new types of drugs and new kinds of gene therapies. In terms of hacking how to make progress without as many resources, they had a lot of expertise. This is different from the set of founders we talked about who don't come from an AI background. Namrata, the founder, has published some very legit papers in Nature before this.

She had a lot of expertise in how to short-circuit the computation loop. What she did was build custom kernels for the models, so the whole process of training the foundation models is a lot faster with fewer resources. One of the other companies in the current batch is Piramidal. Do you want to talk about them? They're building a foundation model for the human brain; it turns out they're predicting EEG signals, which could be used for all sorts of applications, from predicting strokes to, eventually, reading.

At some point, perhaps, your brain could be read. The parallel with EEG signals is that, sort of like Sora, which has images over time, and that's video, EEG is the same thing: it's just electrical impulses over a time period. So they do something similar to the spacetime chunking, but for EEG.

So they were able to train this model, and the way they could train and iterate during the batch is that they were experts in the space, and they also did a lot of hacks around the computation. They found a way to divide the long sequential data into chunks, sort of like what Sora does, and that cut down the quadratic runtime complexity dramatically, which is impressive. They could get a single training run of an initial model done with just 800 hours of GPU compute, which is really cool.
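
To see why chunking helps, here is some back-of-the-envelope arithmetic (the numbers are illustrative, not the company's actual figures). Full self-attention over a sequence of length N costs on the order of N^2 pairwise comparisons, while attending only within chunks of length c costs about (N / c) * c^2 = N * c.

```python
# Illustrative attention-cost comparison for one long EEG recording.
N = 100_000                      # samples in the sequence
c = 1_000                        # chunk length

full_attention = N * N           # ~1e10 pairwise comparisons
chunked = (N // c) * c * c       # ~1e8 comparisons (quadratic only within chunks)
print(full_attention / chunked)  # 100.0, i.e. a factor of N / c fewer
```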

One thing that's really cool about that is that if people sat down and tried to think of different applications for foundational models, EEG would not be the one that would immediately come to mind. And to me, that suggests there are probably a lot of other application areas like EEG data that people just haven't thought of yet.

Yeah, who would have thought that EEG is sort of like video? It's this whole spacetime concept; you can apply it to lots of things. It's also possible that applications of AI that people thought wouldn't exist will now exist. Robotics, I think, is a good one, a huge one.

You remember, I think we talked about this in a previous episode, how when Sam was starting OpenAI, they originally thought that AI in robots and AI in the real world would be the first application. And I remember I went over to the OpenAI office in the first year or two, and they had all these robots trying to learn how to solve the Rubik's Cube by reinforcement learning.

Which is also kind of an interesting side note, because OpenAI is so wildly successful right now that it's easy to think they had a straight-line path to get there, but it was definitely not that. It was a meandering path; they pursued a bunch of dead-end ideas, like the reinforcement learning robots that didn't work well. Even the researchers working on the transformer architecture at OpenAI were off in a corner, I think, at the start.

It wasn't clear, even within OpenAI, that that was the right thread to pull on. But Sora and text-to-video are interesting because, again, if we have a real physics simulator for the world, plugging that into robots is potentially the breakthrough that makes AI robots a reality.

We actually have a company in the current YC batch, K-Scale Labs, that's working on consumer humanoid robots. They have pretty cool demos; they're very early. But a lot of the Sci-Fi stuff is actually now becoming possible. The cool thing about Ben, the founder of K-Scale, is that he's the guy who built the foundational robotics model for Tesla.

Oh cool! He put it into the Optimus robot as well. Oh awesome! The real world is governed by the laws of physics, and it turns out we have a bunch of equations that can describe it for different things, like weather. There are other spaces too: for example, there's this company that we funded called Draft-A that is building AI models for CAD design.

CAD follows the laws of physics, Newton's laws, with forces, shear, and so on. A lot of the software behind SolidWorks and AutoCAD runs on these really old kernels that basically solve giant systems of equations, so that when you design a structure and want to calculate the forces and tolerances, it's accurate; you don't want a building to just flop, right?

And it's very expensive. Whenever you build these models in CAD, the kernels are super old, and at the end of the day they run on equations compiled into something wild like Fortran, because they haven't been updated.

What Draft-A is doing is short-circuiting some of this with AI models that can do some of the predictions, so it's a lot faster and cheaper in terms of computation. There's a lot of computational geometry behind the scenes. That's really cool. That's a perfect example of a valuable problem to solve that the general-purpose models are just never going to get around to specializing in.
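
A generic way to picture that approach is a surrogate model: train a small network to approximate the output of an expensive solver, then query the network during design iteration. The sketch below is purely illustrative, not Draft-A's method, and expensive_solver is a hypothetical stand-in for a real physics or geometry kernel.

```python
import torch
import torch.nn as nn

def expensive_solver(params: torch.Tensor) -> torch.Tensor:
    """Stand-in for a slow physics routine (e.g. a force/tolerance calculation)."""
    return (params ** 2).sum(dim=-1, keepdim=True)

# Small network that learns to imitate the solver on sampled design parameters.
surrogate = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for step in range(2_000):
    designs = torch.rand(256, 8)             # random design parameters
    targets = expensive_solver(designs)      # "ground truth" from the slow solver
    loss = nn.functional.mse_loss(surrogate(designs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At design time, surrogate(new_design) is far cheaper than calling the solver,
# at the cost of some approximation error.
```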

That's a great point. A lot of startups are worried that if they go into AI, they're going to get run over by OpenAI or the other foundational model companies. And one solution to that is to train your own model that does something else.

Yeah, a great point! There's actually a YC company called Playground, run by our friend Suhail Doshi, that's a good example of how you can go up against people who are really well funded and come up with something that is far better. What we're looking at here is the newest version, Playground 2.5, and they are hot on the heels of Midjourney. At the same time, the models they've released open source go toe-to-toe against the latest versions of Stable Diffusion and, in a lot of cases, outperform them.

And they've done it on far less money than Stability AI and other teams in the space. So I think Suhail and Playground are really one to watch to go toe-to-toe with Midjourney and, in the long run, potentially beat it, because I would never bet against Suhail Doshi; that guy is a beast! The image quality is super impressive! That looks so cool!

And maybe some of the audience would have thought that Suhail comes from an AI background, but he doesn't. Yeah, he started Mixpanel before, when he was 19. And Playground is also an interesting example of something Harj was talking about last night, which is the phenomenon of companies pivoting into AI, because Playground did not start with this idea. When it started, it was a completely different idea, and a couple of years in, after raising a bunch of money, Suhail hard pivoted the thing into AI.

And he literally just taught himself AI! I think he locked himself in his apartment for a month and just read AI papers, and then he built Playground. So don't be afraid! I think that's one of the most interesting things we've seen across many of these examples: if you're looking for a reason why you can't succeed, guess what? You're right! But on the other hand, the field is so new, so brand new, that if you spend six or nine months literally reading every paper and then meeting all the people in the space, they will meet you.

You can actually be on the cutting edge in relatively short order, and that's an incredible blessing! Totally! It's a really important message, actually, right? Because we're all grateful to Sam and OpenAI for bringing this field forward and making all of this stuff possible. But at the same time, all of the news headlines tend to be around the companies that are raising huge amounts of money or about, you know, like Sam himself, who is a world celebrity at this point.

But you can actually compete with OpenAI for very valuable verticals and use cases by training your own model, without having to be Sam Altman or have $100 million. So we're out of time for today, but we could talk for hours about the crazy things we're seeing in AI being built by people who are probably not that different from you, watching right now. A lot of the world is looking at people like Sam Altman and Dario Amodei and some of the luminary figures who have really pushed the whole space forward.

But remember, all of these people started someplace, and we hope that Y Combinator might actually be the place for you to start, just like it was for Sam Altman back in the day. That's it! Catch you next time!

[Music]
