
Building The World's Best Image Diffusion Model


39m read · Nov 3, 2024

I think we thought the product was going to be one way, and then we literally ripped it all up a month and a half or so before release. We were sort of lost in the jungle for a moment, a bit of a panic. There are a lot of unsolved problems, basically. I mean, even this version of it, people are going to try it and they might be blown away by it, but the next one's going to be even crazier. To get to SOTA, you basically have to be maniacal about every detail. There are going to be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off, or are you the kind of person that is okay with it, or you don't even notice?

[Music]

Welcome back to another episode of The Light Cone. I'm Gary. This is Jared, Harj, and Diana. Collectively, we have funded companies worth hundreds of billions of dollars, usually with just one or two people just starting out. And we're in the middle of this crazy AI revolution, so we thought we would invite our friend Suhail Doshi, founder and CEO of Playground, which is a state-of-the-art image generation model with a state-of-the-art user experience, and it just launched. So, how are you feeling, Suhail?

Very under pressure right now.

That's good! You sound like a startup founder, which is normal. Maybe the best way to start off is to look at some examples of the images that you were able to generate. And this is stuff right off the presses. At Y Combinator, I'm also one of the group partners, so I fund a number of companies; I funded about 15 in the summer batch. What we're looking at here is one of the t-shirt designs I made. As you can see, there's a GPU, and it was based on one of the core templates in your library. I like metal, so this very much spoke to me. This one was off of a sticker design, and I guess I just really like that sword. What I was able to do is add GPU fans.

Love it! I love it!

And so, that's one of the noteworthy things about Playground. You can upload an image, and it'll sort of extract the essence of the aesthetic and some of the features of it.

This feels like a tattoo.

Yeah, exactly. Do you remember what you prompted it with to get those?

Oh, yeah. So the cool thing about Playground, to create this, was I picked a default template that I liked, and I think it only had the sword and this ribbon. I said make it say "House Tan" on the ribbon and add a GPU with two fans. I was very specific; I wanted a two-fan GPU. And that's one of the things you'll see in all these designs. This is actually the t-shirt that House Tan itself chose. So, you know, it's a very summery vibe. I think this was based on something around summer and surfing, and we replaced the surfboard with a GPU.

I feel like you used a preset that we had.

I did! These are all from presets; they're pretty good. I think the noteworthy thing is that I didn't have to prompt and reprompt and reprompt and keep trying to refine the same text prompt. I actually could just talk to a designer, and it would give me what I wanted, going from left to right. For instance, by default, I think the template had this yellowish background, and I said make it on white. And that was a very unusual interaction that I'm not used to. Usually, you're either used to Discord with MidJourney, or you're used to a chat interface, or you prompt and then twiddle things and reprompt and reprompt and reprompt. Whereas this felt much more like natural language; I could just talk to a machine designer that would take my feedback into account.

Yeah, normally when you make these kinds of images, you have to describe all of it, right? You'd have to say I want it on this beige background and I want this orange sunset, and then you'd have to even describe the lines of the sun. Or you don't describe very much, and then every time you try, it's totally different from the last thing. So usually you either have to learn a magical incantation of words, versus being able to pick something that you start from.

And then also with these images, Gary, did you add this text in post-processing, or is the model actually like incorporating the text organically?

Oh, the model will follow your direction on what should be there and what its size is. You can actually specify where in the design; you could say I want it in the middle, I want it at the top. Could we use a font that's bigger, smaller, with better lettering? Could you tweak it a little bit? You can just speak to it in plain English, and I've never seen that in any image model today.

That's crazy, 'cause the text is flawless. And anyone who's used DALL-E knows that if you try to get it to write text, the text comes out like garbled and like — yeah, it's pretty incredible having just accurate text and then being able to position the text exactly where you want. That is very cool.

It is really SOTA in terms of text adherence and coherence in terms of following prompts, which is really cool. One thing that we think is really cool is it's inventing fonts. Like, I don't know what font that is. Yeah, it might be a real font, but I think there are all these circumstances where it's actually extrapolating from many different kinds of fonts and inventing new things, which is really, really cool.

Okay, and these are just a couple other versions. You know, I saw some old-timey thing, and I was like, okay, could you do a vector-y version of a GPU on the left? And then on the right, there was a very sort of Japanese arthouse aesthetic. These are great!

And this one, if left to my own devices, I was going to print this one 'cause I really like — yeah. And what I could do is actually tell it like make it even more sort of prototypically like Japanese art, you know? Like I want more waves and like more sun and, you know, it basically kept doing it.

I think I know this preset; I remember making this preset like a month and a half ago. I think it's called Mythic Ink or something? That's how the app works. You open the app, you select a preset, or you can upload your own design that you're really into, and then it will seemingly extract the vibe of that particular thing. It's not going to be a copy; it will be a remix.

Did you purposefully design it to be so good with text, or is that like an emergent property of just how you architected everything?

We definitely focused on making text accuracy really good. I think it's been our number one focus, and part of it is that text, to us, is so interrelated with the actual utility of graphics and design, because a lot of things without text mostly just feel like art. But yeah, text was an extraordinarily high priority, and it was really hard actually. There was maybe a point where our text accuracy was 45%. We were sort of lost in the jungle for a moment, a bit of a panic, but we figured it out.

I think one of the remarkable things about all these designs, and I was playing a lot with it as well, is that a lot of the outputs are very utilitarian and useful. I've played with MidJourney and all of those, and I think they're fun, but they're more like toys, more like art. It's really hard to work with them if you actually want to design logos, t-shirts, font sizes. I could totally see this replacing Adobe Illustrator.

Right, right.

Yeah, it's kind of funny. Part of the reason I'm so excited about graphic design is that when I was younger, in high school, I used to do logo contests and try to win them. I think there's a site called sitepoint.net or something. I was just trying to make a little bit of money before going to college, so I did all these logo designs and did all these tutorials trying to win. And during the training of this model, I tested it for logos, and I started to be like, wow, it's actually way better than anything I could have made. And then I've also made my own company logos typically, which are also very bad. So it just feels to me like if you can get text and you can get these other kinds of use cases, you're probably going to be able to beat at least the median graphic designer or illustrator. And then I think over time we should be able to get to the 90th percentile graphic designer.

So this is actually a really different use case that really hasn't been addressed. I haven't seen image models try to design graphics or illustrations. It's less about generating really cool images that would replace stock art or something like that; it's more literally allowing you to create Canva-type things whenever you want. And you don't have to mess around with it; it's plain English. Just talk to the model, and the model is going to create what you want. I've never seen anything like that.

Yeah, I think we're just looking at what the use cases for graphic design are, and interestingly it has a lot of real-world, physical impact, because there are bumper stickers and t-shirts. I was at Outside Lands the other weekend, and I was just looking at everyone's t-shirts, looking at what they have on them. And I saw a bunch of women at Outside Lands had this t-shirt that said "I feel like 2007 Britney." I just thought that was such a cool shirt. And so we made the template for it and put it in the product. There's just so much cool real-world impact. I'm almost a little disappointed that Myspace doesn't exist anymore, for those that were on Myspace, 'cause it was such an expressive social network. And I feel like humans really deeply care about that form of expression. And so it's really cool to be able to make a model that's really focused on all those kinds of things.

But you're actually building a product. It's not just research because with all these designs in Playground you can actually go and purchase them, like the stickers, the t-shirts.

Right, right! Can you tell us about kind of this marketplace that you're building?

Yeah, so one thing we learned was that it's kind of hard for people to prompt, and because it's hard to prompt, we found it's also hard to teach people how to prompt. And the truth is that when you make these models, it's not like we even know how they work. We are also discovering with the community how the model works. And so one of the things we decided, me and our designer, was that one core belief was that the product should be visual first, not text first, which is a huge departure from language models and ChatGPT. Our product is so visual, so why should it not be? And so in order to make it visual first, and to make it so that you don't have to learn how to prompt, we decided that we would start from something like a template, which is something people already understand from a tool like Canva. Right? It's not something that we necessarily invented; there are templates everywhere. But I think that if you can start from a template, and then we make it really easy to modify that template, then it feels like we've already taken you 80% of the journey. If it was the "I feel like 2007 Britney" shirt, but you wanted to change the celebrity and the year to a different person, you totally could. We wanted to make that very easy. But it also required a lot of integration with research, because how do you make these changes? How do you make them coherent? How do you keep things similar? It's not as simple as, you know, just the 75-77 tokens that you put into Stable Diffusion. The existing open-source models aren't really capable of that. So it required the marrying of what a good product should feel like with enabling that through research, which is not always possible.

I think that's what Gary was getting at with you building the state-of-the-art UX, the UI for all these models. Because up to this point, people just get raw access. It feels kind of like back in the day you would just SSH into a computer and work with it. That's how people interact with these models. But you've basically built a whole new browser on top of it. Nobody has done it, and you've done it really well! Can you talk about this idea of departing from raw model access?

Yeah, we observed users over 18 months, failing, you know? AI is a little bit weird right now because there's such a big novelty factor, I would say. And it's exciting because we're able to do things we've never been able to do before. And so, as a result, you can easily get millions of users using your product, and that's totally what happened to us. And so it feels almost like, oh, maybe I've got the product. But then when you actually go look at the data and how people are using it, there's just this constant failure of people using the product.

Yeah, you're talking about sort of the prior version of Playground.

The prior version of Playground, yeah. So it didn't have this type of model; we mostly used open-source models, and then we started training some of our own that were very similar to Stable Diffusion, as a way to ramp up to where we are now. When we watched users prompt those models, obviously the two pieces of feedback were: this is fun, it's cool, I can get a cat drinking beer and then post it to Twitter and it's exciting. But then, why would people come back? That's one big question. And the second part is that people are using our service a lot, but they're not always using it a lot because it's a useful thing. It's because they're not getting what they want, so they have to keep retrying.

Yeah, yeah! You know, whereas Google's trying to get you off the website, you know, that sort of feeling? It's almost bad that people are using it too much in some sense. And they just keep retrying; we call it rolling, right? They just keep rolling to get a different image or a slightly better image, to fix a paw or a tail that's off. And the other thing that happened was that our model can take an extremely long prompt. Most of these models, you can only write 75 tokens, but with our model, it's like 8,000. And most people are never really going to go over a thousand. I say that now, but we'll see. A thousand tokens is a lot.

And our model lets you be extremely descriptive. And so you can really describe the texture of the table, the skin texture. We have all those like puzzle prompts where it's like green triangle next to an orange cube, you know? And it works; like spatial reasoning is all there, actually, including text. That's totally novel, and I’d never seen that before. You know, the first generation of these models, almost immediately what you do is say like generate me a green, you know, green sphere on top of a blue, you know, triangle, right? And just wouldn't do it. It's like, you know, there would be those elements, but it would just be all jumbled up because it was using CLIP. It did not have contextual reasoning or understanding.

Yeah, and CLIP was trained with a lot of error, actually, because it's just using kind of the alt tags of the images that are scraped on the internet, which could be like anything. We sort of decided that what we were going to spend our time on was prompt understanding and text generation accuracy because we also felt like aesthetics were kind of saturating. Like they're getting better, but they're also just kind of like not getting better at a fast enough rate. And users even vote and say, even in the MidJourney Discord, you know, they'll poll their users and they'll say what do you want to be better. And like aesthetics is, like, going lower and lower on the rank of things that people care about; so we wanted to like try to leap on something that really mattered to users, which was prompt understanding and text generation accuracy for those kinds of use cases.
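A quick aside on the 75-token ceiling mentioned above: it comes from the context window of the CLIP text encoder that Stable Diffusion-style models condition on. A minimal sketch, using the Hugging Face tokenizer for the CLIP checkpoint that Stable Diffusion 1.x uses (not anything Playground-specific), shows how a long prompt simply gets truncated:

```python
# Minimal sketch (not Playground's code): the 77-token window of the CLIP text
# encoder that Stable Diffusion-style models condition on.
from transformers import CLIPTokenizer

# CLIP checkpoint used by Stable Diffusion 1.x; any CLIP tokenizer shows the same limit.
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tok.model_max_length)  # 77 (roughly 75 usable tokens plus start/end tokens)

long_prompt = "a highly detailed poster, " * 100  # far longer than the window
ids = tok(long_prompt, truncation=True, max_length=tok.model_max_length).input_ids
print(len(ids))  # 77 -- everything past the window is silently dropped
```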

But when you have a very long prompt, it's not really feasible to ask anybody, are you going to write an essay? And so we started to realize that the prompt is almost like HTML for graphics, which I think is so cool. I think you've done a lot here, because you have a completely novel architecture that really gets to magical prompting, because the experience of using Playground feels as if you're talking to a designer. It has coherence; it listens to you. Because with other models, I know with MidJourney, if you want to move the text, it doesn't do it, and the positional awareness is not there.

I guess one of the insights you had, when we chatted a bit earlier, one of the things you learned is that to create good designs, you have to have a lot of description for the images. Yeah, and users are basically lazy.

Right, right. They might just tell you I want a nature scene. And if you input this into MidJourney, what would it give you?

Like, yeah, it'll give you this very beautiful, very rich, high-contrast nature scene, but you've done something very interesting. We want to talk a bit about how you aid users by expanding upon the prompt to actually build something much better.

The first thing to improve our prompt understanding was just making our data better, pretty much. It's actually just that simple. And so one of the first things we wanted to do was to have extremely detailed prompts. So when we train the model, we train on very, very detailed prompts, but we also want users to feel like they can just say "nature scene." And so what you see here is just how detailed we can get. And actually, we're even more detailed than this these days; when we train the next model, it'll be even more than this. But once you get to this level of detail, we're just teaching the model to represent all of these concepts correctly, you know, whether something is in the center or whether there's a background blur.

One thing that we want to get better at, and I think we're actually already pretty good at this, is emotional expression, which is another dimension. We have this image of Elon Musk, and he's disgusted, he's anxious, he's happy, he's sad, he's confident, and we're trying to see his expression in all these different ways. So that's just one thing that we want to make sure is represented in these prompts. There's obviously a lot more, like spatial location. And so by doing this, we can ensure that the model could still be a good experience if you raw prompted it as a user and said almost nothing. And most of the time users are not really writing more than maybe caption 3 here or something. I mean, even that's kind of a lot.

That's a lot! I think when I was playing, I was mostly like on 5 and 6.

Yeah, yeah, exactly! When you're playing around. The normies are kind of doing 5 and 6, and then the hardcore prompters are copying each other's prompts, and they end up more like caption 1, but they don't even look like caption 1. And caption 1 is not a very natural way of typing, you know? Nobody's writing these essays and paragraphs of text; it's too much work. And that was one thing we knew: we were probably going to fail if we expected users to do this. So this kind of led us to a more visual approach, where you're picking something you already like in the world, and we understand how that's represented in our model, and then we can make those changes and edits and stuff like that.
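One common way to reconcile essay-length training captions with lazy user prompts, sketched below purely as an illustration and not necessarily Playground's exact recipe, is to store several caption detail levels per image and randomly sample one each training step, biased toward the detailed end. The function, levels, and weights here are all hypothetical:

```python
# Hypothetical sketch of training-time caption sampling: each image carries captions
# at several detail levels (1 = essay-length, 6 = terse), and one is sampled per step
# so the model learns both detailed and lazy prompts. Not Playground's actual code.
import random

def sample_caption(captions_by_level: dict, weights=None) -> str:
    """Pick one caption level for this training example.

    captions_by_level: {1: "very detailed essay...", ..., 6: "nature scene"}
    weights: optional sampling weights; biasing toward low (detailed) levels teaches
             fine-grained concepts, while occasionally sampling terse levels keeps
             short prompts working at inference time.
    """
    levels = sorted(captions_by_level)
    if weights is None:
        weights = [0.4, 0.25, 0.15, 0.1, 0.06, 0.04][: len(levels)]
    level = random.choices(levels, weights=weights, k=1)[0]
    return captions_by_level[level]

example = {
    1: "A wide landscape at golden hour, soft background blur, a lone pine centered, warm orange sky, no text",
    3: "A golden-hour landscape with a centered pine tree and soft bokeh",
    6: "nature scene",
}
print(sample_caption(example))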

Is the benefit of like expanding the prompts this way that you're more likely to get what the user wants at the first go, or is it that it just makes it easier for them to iterate on it to get to what they want?

You know, I don't even know that we necessarily needed to do this. But I think the reason why we did it was because initially we didn't know how good the model would be. And so we needed to serve users in the way that they already used the existing models. We didn't exactly know the breakthrough interface; we hadn't gotten there yet. And so, in order to make sure it would work in a way everyone is happy with, we wanted to do this kind of segmented out. It's almost like a lossy prompt, and so that's why we do it. But I think, you know, it's not that it's not necessary. And then the other reason to do it this way is that once the prompts get extremely detailed, it's hard to have much variation between the images, because you're kind of locking in on your image.

And so by having some ambiguity in the prompt, you can get more variational ability. We call it image diversity. That way you say "squash dish," but it's really different each time.

Yeah, I guess the cool thing about your product is you basically remove all of the prompt engineering and the guesswork, because you do it behind the scenes by expanding and exploding the prompt into this multi-level caption system.
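To make the "prompt engineering behind the scenes" idea concrete, here is a minimal, hypothetical sketch of inference-time prompt expansion: a terse user prompt is rewritten into a detailed, caption-style prompt by a language model before it reaches the image model. The template, function name, and stub LLM are illustrative assumptions, not Playground's actual system:

```python
# Illustrative sketch of doing the prompt engineering for the user: a terse prompt
# is expanded into a detailed caption before it hits the image model.
# The template and the `llm` callable are placeholders, not Playground's API.
from typing import Callable

EXPANSION_TEMPLATE = (
    "Rewrite the user's request as a single detailed image caption. "
    "Describe subject, composition, lighting, colors, background, and any text "
    "with its placement and font style.\n\nUser request: {prompt}\nDetailed caption:"
)

def expand_prompt(user_prompt: str, llm: Callable[[str], str]) -> str:
    """Turn a lazy prompt like 'nature scene' into a detailed caption."""
    return llm(EXPANSION_TEMPLATE.format(prompt=user_prompt)).strip()

# Stub LLM so the sketch runs end to end; swap in a real model call in practice.
fake_llm = lambda _: (
    "A serene mountain lake at sunrise, centered composition, warm orange light, "
    "pine trees framing the left edge, soft mist over the water, no text."
)
print(expand_prompt("nature scene", fake_llm))
```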

Right! I guess what comes to mind is, sort of back in the day, if you needed to navigate a website through the command terminal, maybe you'd cURL and do GETs and POSTs, literally typing the commands, until you had a browser to actually give you the right UI.

Right, and what I told my team was, yeah, we should be doing the prompt engineering for users. It should not be that the users are the prompt engineers, or the prompt graphic designers, if you will. What are we going to do, write a manual on how to do this? It's just too tricky. Like 1% of humanity will understand that manual, and the rest will be like, I don't know how to use this; too difficult. So I told my team I think it's very important that we do all of that work. We should have an extremely strong sense of how the model works rather than putting that on the users. And then the other thing we do is we now work with creators to help us construct these different templates and different prompts around the templates and stuff like that. They might be the 1% of humanity that's willing to learn this on behalf of users. And this is totally normal; that's what YC does. We build these great companies that, you know, billions of people in humanity use as a result of that.

Yeah, I guess there are two things that come out of this. One is you might be creating a whole new profession. Sort of like back in the day with design, you had Behance, where people hire designers. Now, through Playground, people will hire AI designers, this new kind of top creator.

Well, we’re doing it actually, so we are hiring them.

Oh, you’re hiring them?

Yes, we're hiring them. We're going to launch a creator program soon, actually, and the goal is to bring on creators that have good taste. That still matters, right? Like, there's this image of, you know, a squash dish, but it's not a very beautiful image. I think taste is still real in the world. And design is different: with LLMs you get to measure how well you did on a biology test, and that's a pretty objective thing. But design is constantly evolving. Design from 10 years ago can look dated, unless you're Dieter Rams. But I think, more fundamentally, we want to bring on creators that are going to help create graphics that other people can then use, and we're actually paying them.

I guess that's one thing that's cool. The second thing, because of this, is you actually are state-of-the-art on many aspects of this model, and so much of it was driven by the product. Because in order to get the good captioning, you're probably beating GPT-4, right, in terms of image captioning?

We are, yeah. We now have a new SOTA captioner to generate these, and that was not just to hit a benchmark but for a very practical purpose in building the model.

Yeah, can you maybe tell us a bit about what's underneath? Because PG V3, Playground V3, right, is all in-house and state-of-the-art on many aspects.

Yeah, so for the whole architecture of the model, we had to rip everything out. The normal Stable Diffusion architecture that people know about is: there's a variational autoencoder, a VAE, and then there's CLIP, and then there's this U-Net architecture, for people that are in the know. And since then, it's kind of evolved to using more transformers. There's this great paper, I think by William Peebles, that did DiT, which I think is what people believe Sora is based on as well. And so there are some new models that are using that. We actually don't use any of those new architectures either; we did something completely from scratch. But one of the reasons we had to blow everything up was because you can't really get this kind of prompt understanding using CLIP, because there's just so much error in CLIP, and it's also just bounded by the architecture of that model.

And then the second thing is we also needed the text accuracy to be really high. So you can't just use the off-the-shelf VAE from Stable Diffusion, because it cannot reconstruct small details. I don't know if you guys ever noticed, but the hands, the logos, zoomed-out faces. You also need a state-of-the-art VAE, or at least a VAE that's better than the existing one; the existing one's only four channels. So there are all these pieces, and they all interact, and they can all bound the overall performance of your model. And so we basically looked at every single piece. And then I think like four months ago, we were literally at the whiteboard with the research team, and there was the non-risky architecture, which was more similar to some of the state-of-the-art open-source models that are out these days, like Flux and stuff. And then there was this other architecture that shall not be named. And we were like, well, that's the risky one, where we don't even know if it'll work. And if we try it for two or three months, we'll waste compute, and if it doesn't work, it might just blow up and then we'll be behind. And we just put everything in that basket.
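For readers who want to see the standard stack he is contrasting with, a short sketch using the Hugging Face diffusers library loads an off-the-shelf Stable Diffusion 1.x pipeline and inspects the pieces named above: the four-channel VAE, the CLIP text encoder with its 77-token window, and the U-Net. This is the conventional architecture, not Playground V3, and the checkpoint id is just a commonly used example:

```python
# Sketch: inspect the off-the-shelf Stable Diffusion stack being contrasted here
# (VAE + CLIP text encoder + U-Net). This is the standard architecture, not Playground V3.
from diffusers import StableDiffusionPipeline

# Any SD 1.x checkpoint works; this id is illustrative and may require a local copy.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.vae).__name__)           # AutoencoderKL -- the VAE
print(pipe.vae.config.latent_channels)   # 4 -- the "four channel" VAE mentioned above
print(type(pipe.text_encoder).__name__)  # CLIPTextModel -- the CLIP conditioning
print(pipe.tokenizer.model_max_length)   # 77 -- the short prompt window
print(type(pipe.unet).__name__)          # UNet2DConditionModel -- the U-Net backbone
```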

Nice!

We decided that we had no choice, you know. It was like we were just going to fail if we didn't do it.

Anyway, I think what's remarkable is you're an order of magnitude better on text, and on a lot of these aspects you're basically SOTA. I think that's really impressive. Can we maybe talk a bit, as much as you can, about how you cracked text? I mean, you teased that out a bit. You basically don't use CLIP, because traditional Stable Diffusion just uses the last layer, right? But you guys have done something completely new where you allowed a basically almost infinite context window, whereas MidJourney is only 256. And the prompt adherence; like, you can actually talk to it like a designer. So, tell us what you can about that.

Yeah, as much as you can tell us about it. I think it's fair to ask the question, share as much as you want.

I think that to get here, you know, there are some obvious things that you would do. The most obvious thing is not to use CLIP. But the second most obvious thing is using the tailwinds of what's already happening in language. The language models already so deeply understand everything about text. And so there are some models that use this; they use T5-XXL, which is another embedding, but a much richer embedding of language understanding. I kind of feel like language is just the first thing that happened. And there are a whole bunch of AI companies that are going to come about, whether they train their models or not, that are just going to benefit from everything that's going on in language and in open-source language. And so, you know, I think our model is able to have such great prompt understanding in part because of the big boom in language and all of the stuff that, whether it's Google or Meta or what have you, is doing. And so we can be slightly behind in terms of language for our prompt understanding, because the language stuff is already just so good, and it will just continue to get better, and our models will also continue to get better. So that might be my one small hint.
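As a hedged illustration of "riding the language-model tailwind": some image models condition on per-token embeddings from a large pretrained T5 encoder rather than (or alongside) CLIP. The sketch below extracts such embeddings with Hugging Face transformers; the small checkpoint is a stand-in for a quick test, since the XXL variant mentioned above is very large, and none of this is Playground's own encoder:

```python
# Sketch of the "use language-model tailwinds" idea: condition image generation on
# embeddings from a pretrained T5 encoder instead of CLIP. Checkpoint is a small
# stand-in; models that do this in practice use much larger encoders like T5-XXL.
import torch
from transformers import AutoTokenizer, T5EncoderModel

name = "google/t5-v1_1-base"  # swap for "google/t5-v1_1-xxl" if you have the memory
tok = AutoTokenizer.from_pretrained(name)
enc = T5EncoderModel.from_pretrained(name)

prompt = "A poster that says HOUSE TAN on a ribbon, with a two-fan GPU below the sword"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    states = enc(**inputs).last_hidden_state  # per-token embeddings for conditioning
print(states.shape)  # (1, num_tokens, hidden_dim) -- no hard 77-token ceiling
```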

Maybe the analogy, from playing with a lot of this and from chatting with you: the current state-of-the-art Stable Diffusion models, their language understanding feels like word2vec in NLP land. That was this paper that came out from Google in 2013, and it didn't really understand text per se; it was more about the latent space. The famous example was that you could take the vector of "king," subtract the vector of "man," and add the vector of "woman," and the output would be the vector of "queen," right? Which is very basic but still very cool, and I think that's where the current Stable Diffusion models, before yours, are. But playing with your model, the leap, for the audience, is that you basically got GPT-level understanding. And the leap from word2vec to GPT was, I don't know, like six to ten years?
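The king/man/woman/queen analogy referenced here can be reproduced with gensim's pretrained Google News word2vec vectors; this is purely an illustration of the 2013-era representation being compared, not anything from Playground:

```python
# Reproduce the 2013-era word2vec analogy referenced above (king - man + woman ~= queen)
# using gensim's pretrained Google News vectors. Illustrative only; large download.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically ranks "queen" first: vector arithmetic, but no contextual understanding.
```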

Yeah, I'd say it's like a GPT-3-level image model in terms of prompt understanding now. And I think there's another leap to go, many more actually, I would say.

And that's impressive! It's safe to say that this is the worst the model will ever perform, for sure!

Yeah, for sure! I mean, there's small things that we already want to fix, like, you know, we wish that the model understood concepts like film grain. I mean, it could still go — it could still be better at spatial positioning. Even like the model has issues with like the idea of like left and right. Like, put the bear to the left. What is left? Is it to you? You know, is it your left? Is it the bear's left? So there's still like lots of interesting problems that I think are really fun that probably we're going to have to figure out.

But what we hear from users is that they feel a strong sense of control now. Like, it has really good prompt adherence. And actually, there was this really funny thing we realized about the model, I think a week or two ago, when we started to do evals for aesthetics. The way we do this is an A/B test. We show users two images: one from maybe a rival of ours and another image from our model. And we're constantly doing evals and constantly asking our users what they think so that we can get better. Anyway, one of the things we realized was that there's this new thing that I don't think has been talked about, but I apologize to the audience if it has. We have entanglement issues, which is that if the model adheres too well to the prompt, it can have an effect on aesthetics.

So when we compare ourselves to, say, something like MidJourney, which we've actually evaled, it has great aesthetics, best in the world at that. One of the problems is that we will get dinged because our model is adhering more. So I'll give you an example. We have an image of a woman, and it's kind of a split plane, like she's on this side and on that side, so it's a composite, and MidJourney doesn't respect that. It just shows the woman in one frame, and the users will always pick that because it's more aesthetically pleasing compositionally versus this split-plane thing, but our model is adhering to that prompt.

Right, right. And so the users ding us and we get a lower aesthetic score, even though, to me, the other model just isn't listening. And so there's this entanglement problem: what do you do? We had another image that was like hand-painted palm trees or something, and the users chose the other model because theirs looked less hand-painted, and the hand-painted ones do look less aesthetic. But our model is adhering. So we have this entanglement problem, and we don't know how to measure ourselves on aesthetics now. And I'm not aware of any literature on this; if anyone has any, please send it to me. So we don't know what to do.
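The A/B eval described here boils down to collecting pairwise votes and computing win rates. Below is a tiny illustrative sketch; recording separate votes for aesthetics and prompt adherence is one plausible way to start disentangling the two signals, not necessarily how Playground actually runs its evals, and the data is made up:

```python
# Sketch of the pairwise eval described above: users see two images and pick one.
# Recording separate votes for aesthetics and prompt adherence is one plausible way
# to disentangle the two signals; illustrative only, not Playground's eval code.
from collections import Counter

# Each vote: (winner, axis) where axis is "aesthetics" or "adherence". Fabricated data.
votes = [
    ("ours", "adherence"), ("rival", "aesthetics"), ("ours", "adherence"),
    ("ours", "aesthetics"), ("rival", "aesthetics"), ("ours", "adherence"),
]

def win_rates(votes):
    """Per-axis win rate for 'ours' against the rival model."""
    by_axis = {}
    for winner, axis in votes:
        by_axis.setdefault(axis, Counter())[winner] += 1
    return {
        axis: counts["ours"] / sum(counts.values())
        for axis, counts in by_axis.items()
    }

print(win_rates(votes))  # e.g. {'adherence': 1.0, 'aesthetics': 0.33}
```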

I think what it sounds like to me is your model is so SOTA that the current evals don't work, because it's actually following the rules.

Yeah, we're trying to figure out what to do. We have to make a new eval, basically. Because you're too advanced; like, you broke the test.

Yeah, you kind of broke the test. And yeah, so now it's a little weird externally. Obviously we want to portray to the world, hey, we have this great thing, and okay, we lose here, but not really.

Yeah! But it does what you want. And so I think we're going to talk about this in more detail, this kind of entanglement problem, because it's actually a very interesting, more fundamental insight.

Yeah, it sounds like you're just building a completely different kind of company. The thread that comes up hearing everyone here is that using Playground feels like you're talking to a graphic designer. Which in my head actually buckets you with the companies in YC that are really taking off, the ones that are replacing some form of labor, which is just different from how people talk about MidJourney. Right? It's not just a tool to play around with. This is actually going to be the replacement for hiring a graphic design team potentially, which is way more commercial.

Yeah, yeah. I mean we've been — we've been searching for like where is the utility? Where, how are people using, you know, things like MidJourney? And I think that for me, it's actually — it's even simpler; it's just that I think we're just enabling the person to have more control over the whole thing.

Like I always feel bothered, you know, when you're like I produce music, and so if I make a song, like I have to go to a designer and say can you make me album art? And then I only get like four variations of it, and then I feel badly asking for a fifth if I don't like any of the four. But the more you just like put in control the person that’s actually making the thing, they'll always — they'll be able to connect exactly the thing that they're looking for with, you know, the core product or song or whatever they're making.

So we're always telling founders, hey, you should talk to your users more. And what you did was, you had so many users you couldn't just talk to them; you needed to look at how they were actually using it. And at some point, you realized, somewhat uncomfortably, that they were generating near porn.

Yeah, we get a lot of near porn and porn. And then, you know, I think people sort of, when they're exploring a space, often run into that situation. Like what happens when you know the users that you're getting aren’t the users you actually want?

Yeah, me and my COO talked about this: if we listened to what the users wanted, we would have to build a porn company, potentially, which is not something that I think my wife would be happy with, or my mother. It was kind of this tricky thing where you're told, listen to your users, talk to your users. And look, I'm not saying everybody does that with image models, for sure they don't, but a lot of them do. And so we had to go ask ourselves, well then, what can you do with these things? And the answer was: not much else. Nothing big and commercial enough. We could make a cool website that people use. And the problem is all the image generator sites are plagued with this problem, and we all know it. We all know, and there are huge safety problems. And, you know, it turned out to be just a business we didn't like.

That's a hard thing after, you know, 12-18 months of working on something, and you're just like, well, I don't really like this that much, and now what? And when we looked around for use cases, we were like, oh, all the big ones, practically all of them, logos, posters, t-shirts, bumper stickers, everything, everything has text. Because text is also a way to communicate with humans. That's why it became number one, like the number one priority.

I mean, this isn't the first time that you've sort of confronted this issue before, you know, in your prior startup, Mixpanel, which you built into a company that, you know, makes hundreds of millions of dollars a year, one of the leaders in analytics from a really young age. You know, I think you started it when you were 19, and I remember because I met you when you first started it. That was another moment where here's this brand new technology, and there are sort of very commercial use cases that you could build a real business on, and then there were other use cases, in that case, I think it was sort of Fly-by-Night gaming operations that would come and sort of pop up on Facebook, steal a bunch of users and then disappear, and you had to make some choices about who you wanted your users to be. Like do you want it to be people who can actually pay you money for a real product over the long haul, or sort of, oh yeah, they're here and they're gone, and we can make our graph go up.

Like it's sort of a quandary that a lot of founders are facing. How did you approach that?

Yeah, I mean, that one's burnt into my memory, actually. So the simple story was just that we got all these gaming companies back in the gaming heyday, you know, Zynga, and RockYou, and Slide and all this stuff. And we were making so much money off of them, but then they would die because they had bad retention, or games just have a decay factor. And you could tell that they were going to die because of the retention; we saw it happening, you know. It's all real-time data. And so one day I go to one of my mentors, Max Levchin, who I had interned for at his other company. And I was like, hey, this thing is happening and we have all these competitors that are building gaming analytics tools or products, and I don't really know how to compete. It feels a little weird to just go after gaming when it's this weird thing that's churny. And he just looked at me and was kind of like, what do you think is the biggest market? And I was like, probably not gaming, probably the rest of the internet, you know? And mobile was just starting, and we didn't really know; the top free app in the App Store was like a mirror app. So it was kind of like, is mobile going to happen? Maybe it'll be there next year, I hope.

But anyway, he named our competitor and was just like, if your competitor that's focused on gaming gets bought for $100 million tomorrow, don't cry about it then. Just go after the biggest market. And that's what we did. And then mobile went huge; it went so big. We got rid of all the gaming stuff, and that was 100% the right decision. So I think it's just being kind of ruthless almost about where the value is with your own users, with what you're doing. All those things, I think, are very, very important.

I mean, it sounds like you had to close a door and then God came and opened a window for you.

Yeah, yeah. I mean, we kind of had a similar problem, where the current user base that we have is not exactly, you know, an exciting thing we want to serve as a team. And so then we're hunting for where the rest of the value is.

Yeah, that's a really important lesson. I guess the super big lesson here is that you can choose your users or your customers. Often your customers or users choose you, and if you don't want them, it is a choice that you can make, and sometimes it actually allows you to find the global maximum instead of just the local maximum.

Yeah, we're kind of faced with the same decision; it's like a real-time decision almost. It's fun to talk about things when they work, when your decisions are right. So we'll see, you know, years later, if this is right. But I think it's tough because, you know, MidJourney is doing $200-$300 million, but the biggest market in graphic design is probably Canva, doing $2.3 billion. And so we're just kind of like, well, forget it. Let's go after the biggest, most valuable thing in the world. And, you know, not a lot of people know a lot about Canva, I find. In Silicon Valley, most people know about Figma, but Canva makes vastly more money than Figma. So if you have this amazing AI graphic designer of sorts, you're enabling so much more of humanity. I think a lot of people believe this, and I also believe it: AI feels like it's expanding the pie of all these markets. They're not staying the same size, I think, most of the time, right? Like you're enabling more people to write code that otherwise couldn't, that kind of thing.

I guess the interesting thing about Playground is that it also came out of a previous, more radical pivot, because you had gone through YC twice.

Yeah.

So you went through with Mixpanel, which became this successful company making hundreds of millions of dollars. Then you went through with Mighty. Can you tell us about that second time going through YC, what Mighty was, and what you pivoted into?

Yeah, so I did this browser company called Mighty, where our goal was to stream a browser, and the real goal was to try to make a new kind of computer. And we basically did it, but the problem was that we hit this wall where I didn't believe it was going to be a new kind of computer anymore. I just couldn't make it more than two times faster, and I felt like if I couldn't get a 10x or 5x on this thing, or at least see that it could get to 10x, it just wasn't a company that I wanted to work on anymore. I remember, before I came back to YC of course, one of the big ideas that really got me was that our MacBook Pros were really sucking at the time.

Yeah, they were.

Yeah, there was no M1 at the time?

Yeah, and I don't think we even knew that Apple was going to do M1 silicon yet.

I mean, it’s interesting. I think that in Silicon Valley, we maybe underestimate how valuable strategy actually is. Mainly because strategy is so fun and so interesting, and the MBAs who come into our sector like immediately seize on that and just want, you know, like you need a strategy person as like the, you know, as a co-founder. And it's like no, no, we don’t actually need that. But that's not to say that strategy is not necessary in this particular case. Like I think that we were trying to solve a real problem, which was our browsers really sucked and the cloud was getting very, very good. And then suddenly, you know, the mood changed when Apple released Silicon.

Well, they clearly thought so too!

So, you know, the strategy was right in some sense, like the overarching problem of trying to make our computers faster. They were able to make a chip. Yeah, but still, even in the face of the M1, we had kind of convinced ourselves, like, well, it doesn't really matter. The Mac only has like 8.3% desktop market share; the rest is Windows. And, you know, I even met the prior CEO of Intel, Bob Swan, and talked to him about, yeah, why is Intel behind here and all that? I was trying to figure out: why are AMD and Intel behind? Where is it going? Is anyone even going to get close to the M1 or not?

And so I think one problem is that needing them to stay behind is non-ideal for your company, right? Don't bet against the macro trend, is the problem.

Yeah, you definitely don't want to bet against that.

And then I think the second piece was I sat down with one of the engineers that works on V8, the JavaScript engine behind Chrome, and I gave him every imaginable idea that I or the team had for figuring out how to speed up the browser, and he had an answer to all of them. Once I realized that that team is basically focused on 1% improvements, and they had already tried everything, I mean, that was a very depressing moment. I was out of ideas, you know? People ask, when is it the right time to pivot or change or whatever? And I had just run out of ideas. I really wanted to stick with it, but I just couldn't figure out another way to get there.

We went so far as to start building a computer in the data center, and we had figured out how to use consumer CPUs in a data center legally, with the right architecture. I think PG came over once, and it was just the sprawl of all these components; maybe we were building physical hardware. We were kind of building hardware.

I learned major lessons at Mixpanel, but the one major lesson that I learned at Mighty was that it's so valuable to have a tailwind for your company as opposed to a headwind. There were just so many obstacles in our way, whether it was the M1 or the fact that there's no real way to change the fundamental architecture of the browser. You know, JavaScript just innately runs single-threaded in a tab; we can't change that.

With Playground, it feels like it's all tailwinds all the time, you know, we just like wait and things get better, things get faster, cheaper, better, easier.

The thing that's remarkable is you had this really impressive career building a standard SaaS business with Mixpanel. You gave it a try with the streamed browser, and now you've kind of retooled yourself and are building this SOTA diffusion model. What was that journey like? How did you retool yourself?

That was one of those things that is so impressive. I just started learning. I took whatever AI courses were out there that I could take. Unfortunately, the Karpathy courses didn't exist back then. At first, I was actually trying to build a better AI address bar in the browser, which now exists, Google just released that, I think, and this was before GPT-4. I think we were talking to OpenAI. They were very helpful, because ChatGPT didn't exist yet, and we were trying to figure out how to get that integrated in the address bar at low latency. And so I was learning AI, doing AI, learning how to do AI research and train models before all of that happened.

But I think something weird happened, which was that in doing that, in getting connected with the folks at OpenAI and learning these things, I ended up just getting to see it happen. I knew it was about to happen earlier than other people. I got kind of lucky. I guess a lot of people probably remember the DALL-E 2 moment. That was a crazy moment where image generation really got exciting. And so I just kept learning. And then Stable Diffusion came out; I think I got access maybe two weeks before it came out. And so just by being in the mix of this thing, I got to see everything about to start.

And so I think we were the first AI image generation website that you could just go to and sign up for, where you didn't have to run it manually on some GPU. And I think our website just took off because of that. That was the easiest thing. I think MidJourney was still in Discord, right? So we were like, what if we make a website?

I didn't actually know that story. I mean, that's a great lesson for any technical founder, right? Like essentially you stumbled on the biggest tailwind of our generation by just following technical things you found interesting.

It's great.

Yeah, it's a little weird. After Mixpanel, I actually tried to do an internship at companies, because, I don't know, I wanted to do something but wasn't ready to start a company, and I was only trying to talk to AI companies. And I interviewed at OpenAI, and they wanted me to come work five days a week, but I only wanted to do three. And then somehow at the end of that, I made this huge mistake, in I guess 2018, where I decided that there was nothing interesting going on in AI, even though I was training my own models even then. I was trying to help a scooter company detect whether the scooters were on the sidewalk or on the road, because the regulation in SF required them to do that. And I learned all this AI stuff, went to all these AI events, and then I concluded there was nothing there, and I started Mighty. And I was off by like three months. So I almost feel redeemed in some sense, you know? It's so hard to time these things; how do you know whether you're early or late?

And then for a long time, you were behind on the model, right, for Playground.

And I've just felt continuously behind, but I've now kind of come to realize after like learning the history of like Microsoft and microprocessors, like I don't know, it might just be like year two. This all still might be really, really early. We really don't know where it's going.

How does it feel to run Playground, which is sort of part startup, part research lab versus just pure startup?

Well, one thing we try to do is differentiate by not trying to go after AGI. That's one thing we say we're not doing, because there are lots of people doing that. And the research we do feels really tractable, I guess, where it's not always clear whether research will be like that. You know, I've kind of learned that you can't do research in a rush.

So one big problem is that when you're building a startup, you want to ship everything; you want to ship it today. You want to fix the bug, you want to ship the feature. You're just trying to move at such a fast pace, but that's not tenable with research in the same way. Research is moving fast, but you can't build and ship your model in a week. And so I think that's been really challenging, and I've had to adjust my brain for one team versus the other.

Yeah, one thing I think is interesting about successful research labs in the past, if you look at Bell Labs for example, is that it's almost like the main responsibility of the CEO of the lab is shielding the lab from the commercial interests that are pushing for things now. But as CEO of Playground, you're both the protector of the researchers and the commercial interest. How do you juggle those competing forces?

Yeah, I don't know that I've mastered it yet by any means, but I asked Sam Altman once to what degree he allowed the researchers at OpenAI to wander. I just wasn't really sure, you know? Usually there's a task and you do it. But what about wandering? How does wandering make sense in a research or an engineering team? And he said there's quite a bit of wandering. So I took that to heart. And so I let the research team kind of wander and get to a point where they are able to show an impressive result, and then we start to really accelerate that. But until then, there's not much to do.

Well, not all who wander are lost. I love that! That should be a t-shirt, right?

Make it! We will add that as a template! We can link it below the video. I'll be a creator in the Playground marketplace. Love it!

You were asking how these two teams integrate in a startup? I think that we just have a channel now where we see so much feedback that the researchers can actually look into the failures, and they can decide for themselves, while wandering: do I want to fix that? That's surprising, why did that happen? And so I want to try to integrate these two, because I think that's a more differentiating factor these days. I think the research labs are very lab-based, and they're not always deeply looking into real user behavior, what users are really trying to do. Sometimes it's just: we need to get a high score on this eval, we've got to put it in the paper, we've got to get a really good score on the LLM arena, and there's some KPI, you know, to do that.

But then, does that thing matter? Does it correlate? Does the eval that we see out in the world strongly correlate with your usefulness to users? Like, I still want the LLMs to help me make rap lyrics, but there's no eval for that. So, you know, who will do that? How will that happen? It's certainly possible to do that, but, if you notice, I always pick on this rap lyrics thing because to me it belies a fundamental problem with how people are evaluating the models: the models should be extremely good at it, but they're not.

Maybe the problem is there's a gap between research and commercialization, because all these public evals are academic and very different use cases than if you wanted to go beat Canva, let's say.

Yeah, I mean, I might be talking out of turn here, sorry to the LLM folks, but if you go look at the evals for the language models, they're all, you know, math, biology, legal questions. It's no wonder that the biggest use case of ChatGPT is homework, because all the models basically hit these numbers, right? At least initially; maybe they're different now. They're probably more sophisticated now, but it's no wonder that the models are good at homework, and that's a huge category.

So you made it to SOTA! People are watching right now, and they're asking, how do I do it? What's your answer to that?

There's this feeling that all you need is a lot of data and a lot of compute, and then you just run, you train these models, and you'll get there. They'll just generalize, and suddenly everything will be great.

I think there are a lot of smart software engineers, and so they fundamentally understand that these are the core components, the ingredients to make this great model. But it's vastly more complex than that. At least what I've experienced is that to get to SOTA you basically have to be maniacal about every detail of the model's capability. For example, you can look at text generation: there are going to be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off, or are you the kind of person that is okay with it, or you don't even notice it?

Do you have this just maniacal sense? Like, we look at skin texture. My eyes feel burnt out basically from looking at the smallest little skin textures, you know. Smoothness! We talk about these things as a research team day in and day out. We argue about it. To build these SOTA models, you have to care so much about, in our world, image quality. And we even look at little small things: if there's even a slight film grain and it's missing, we go, oh, the captioning model is bad, not good enough. We need to be better at this.

And this maniacal mindset, I think, allows you, if you do this a hundred times, to make the model extrapolate even more. I think people don't quite internalize the extrapolation of all of these dimensions together and how they work together to make everything better. You don't know how making one thing better here will impact another thing over there. It's hard to understand that. But I think that's what's required to get to a SOTA model. And it is possible! It is possible! It is possible!

It's not easy though; it’s really hard!

Yeah, well, Suhail, thanks a lot for coming on The Light Cone. That's all we have time for, but you can try Playground right now at playground.com, or in the app stores on Android and iOS. And this is actually the biggest flex: you didn't have any waitlists. It was just available on day one. So go try it out right now, and we'll see you guys next time.

[Music]
