Things That Don't Scale, The Software Edition – Dalton Caldwell and Michael Seibel
We'll get a founder that's like, "Oh, how do I test my product before I launch to make sure it's gonna work?" And I always come back and tell the founders the same thing: if you have a house and it's full of pipes, and you know some of the pipes are broken and they're gonna leak, you can spend a lot of time trying to check every pipe, guess whether it's broken or not, and repair it, or you can turn the water on and you'll know. You'll know exactly the work to be done.
Hey, this is Michael Seibel with Dalton Caldwell. Today we're going to talk about what it means to do things that don't scale: the software edition. In this episode, we're going to go through a number of great software and product hacks that companies used to figure out how to make their product work when perhaps they didn't have time to really build the right thing.
Now Dalton, probably the master of this is someone we work with, a guy named Paul Buchheit, who invented this term, the 90/10 solution. He always says something like, "How can you get 90% of the benefit for 10% of the work?" This is what he always pushes on people when they tell him it's really hard to build something or it takes too long to code it. He'll just always push on this point, and you know, founders don't love it.
Right? Would you say that's a fair assessment, Michael?
Um, that's a fair assessment. Yes, founders hated it.
Tell the audience why it's worth listening to the guy; why does he have the credibility to say that to people?
Well, PB is the inventor of Gmail, and as kind of a side project at Google, he invented something that 1.5 billion people on Earth actively use. And he literally did it doing things that don't scale.
So, I'll start the story and then please take it over. So as I remember it, PB was pissed about the email product he was using. Google had this newsletter product, Google Groups, and the first version of Gmail was basically him figuring out how to put his own email into that Google Groups UI. And as he tells the story, his eureka moment was when he could start reading his own email in this UI. From that point on, he stopped using his old email client.
And what I loved about this is that, as he tells the story, from that point he just started building every email feature that any human would want. He would talk to the YC batch and he'd be like, "And then I wanted to write an email, and so I built writing emails." If you know PB, he could have gone a couple days just reading emails without replying at all, so he didn't need writing emails to start.
I remember him telling the story of the first time he got his co-worker, like literally his desk mate or something, to try to use it, and his desk mate was like, "This thing's pretty good! It loads really fast! It's really great! The only problem is, PB, it has your email in it, and I wanted to have my email," and PB was like, "Oh, okay, well I gotta build that."
And so then it started spreading throughout Google.
And do you remember when it broke?
No, what happened?
Oh, so he told the story where one day PB came in late to work, which, knowing PB, was every day, and everyone was looking at him really weird, and they were all a little pissed when he got to his desk, and someone came over to him and was like, "Don't you realize that Gmail's been down all morning?" And PB was like, "No, I just got to work; I didn't know!"
And so he's trying to fix it, trying to fix it, and then his co-workers see him grab a screwdriver and go to the server room, and they were like, "Oh god, why did we trust PB with our email? Like we're totally screwed!"
And I think he figured out there was a corrupted hard drive. I remember at that point of the story he was like, "And that day I learned that people really think email's important, and it's gotta always work!" [Laughter]
And it's perfect, because I think the reason he did it, man, is that he liked to run Linux on the desktop and he didn't want to run Outlook. The Google suits were trying to get him to run Outlook on Windows, and he was like, "I don't really want to run Windows!"
But yeah, it was the dirtiest hack! And as I recall in this, you know, final part of the story, it was hard for him to get Google to release it because they were afraid it was going to take up too much hardware.
And so there were all these issues, and there was a decent chance, I think, that it never would have been released.
Well, the other part was that everyone thought Gmail's invite system was like some cool growth hack—yeah, like a virality hack! It's like, "Oh, you got access to Gmail, you got I think four invites to give someone else," and these were like precious commodities.
And it was just another version of things that don't scale—they didn't have enough server space, so they had to build an invite system.
Yes, there was no option other than building an invite system; it was not like genius PM growth hacking; it was like, "Yeah, well we saturated this; the hard drives are full, so I guess we can't invite anyone else into Gmail today." That's it! That's it!
So you had another story about Facebook early days that is similar in this light.
So let me paint the picture. Back when you started a startup a long time ago, you had to buy servers and put them in a data center, which is a special room that's air-conditioned that just has other servers in it, and you plug them in. They have fast internet access.
And so being a startup founder, until AWS took off, part of the job was to drive to the suburbs or whatever, drive to some data center which is an anonymous warehouse building somewhere, go in there and like plug things in.
And what was funny is when your site crashed, it wasn't just depressing that your site crashed; it actually meant getting in your car. Part of being a startup founder was waking up at 2 A.M. and getting in your car and driving to like Santa Clara because your code wedged. You had to physically reboot the server, and your site was down until you did.
So I'm just trying to set the stage for people; this was what our life was like, okay? And so my company, imeem, we had a data center in Santa Clara, and there were a bunch of other startups there as well.
And so something that I liked to do was to look at who my neighbors were, so to speak. There were never people there; it was just their servers, and there'd be a label at the top of the rack, and you could see their servers, and you could see the lights blinking on the switch.
Okay, so this is what it was like, and so, uh, our company was in that data center in Santa Clara, and then one day, there's a new tenant—oh, a new neighbor!
So I look at it, and the label at the top of the cage next to ours, you know, three feet away, the label said "thefacebook.com." And I remember being like, "Oh, yeah, I've heard of this! Like cool, sounds good."
And they had these super janky servers; I think there were maybe eight of them when they first moved in, and they were super cheap, like Supermicro servers. You know, the wires were hanging out, and I'm like, "Cool!" But the lights were blinking really fast, okay?
And so what I remember was that there were labels on every server, and the labels were the name of a university. So at the time, one of them, one of the servers was named Stanford; one of them was named Harvard. You know, like, and it made sense because I was familiar with the Facebook product at the time, which was like a college social network that was at like eight colleges.
Okay, so then I watched: every time we would go back to the data center, they would have more servers in the rack with more colleges, and it became increasingly obvious to me that the way they scaled Facebook was to have a completely separate PHP instance running for every school that they copied and pasted the code to.
They would have a separate MySQL server for every school, and they would have like a Memcache instance for every school. And so you'd see like the University of Oklahoma, you'd see the three servers next to each other, and the way that they managed to scale Facebook was to just keep buying these crappy servers.
They would launch each school, and it would only talk to a single school database, and they never had to worry about scaling a database across all the schools at once because, again, at the time, hardware was bad.
Okay, MySQL was bad; the technology was not great. Scaling a single database, a single users table, to hundreds of millions of people would have been impossible.
And so their hack was the 90/10 solution, like PB used for Gmail, which is: just don't do it. And so at the time, if you were a Harvard student and you wanted to log in, it was hard-coded: the URL was harvard.thefacebook.com, right? And if you tried to go to stanford.thefacebook.com, it'd be, you know, an error; that was just a separate database.
And so then they wrote code so you could bounce between schools, and it actually took them years to build a global users table, as I recall, and get rid of this hack.
And so anyway, the thing they did that didn't scale was to copy and paste their code a lot and have completely separate database instances that didn't talk to each other.
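A rough sketch of what that per-school routing looks like, in Python rather than the PHP they actually ran, with made-up names; the point is that a request for one school never touches another school's database:

```python
# Hypothetical sketch of per-school sharding, routing each subdomain to its own
# independent database. Facebook's real stack was copy-pasted PHP + MySQL + memcached
# per school; the names below (SCHOOL_DBS, connect_for_host) are invented.
import sqlite3  # stand-in for a dedicated MySQL instance per school

SCHOOL_DBS = {
    "harvard.thefacebook.com":  "harvard.db",
    "stanford.thefacebook.com": "stanford.db",
    # ...one more entry (and one more cheap server) for each school launched
}

def connect_for_host(host: str) -> sqlite3.Connection:
    """Route a request to the single-school database for its subdomain."""
    db = SCHOOL_DBS.get(host)
    if db is None:
        raise LookupError(f"{host} has not launched yet")  # no global users table to fall back on
    return sqlite3.connect(db)

# A Harvard request only ever touches harvard.db; no query ever spans schools.
conn = connect_for_host("harvard.thefacebook.com")
```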
I'm sure people that work at Facebook today, I bet a lot of people don't even know the story, but like that's what it took—that's the real story behind how you start something big like that versus what it looks like today.
So in the case of Twitch, most if not all of the examples of this came from one core problem, and it's why I tell people not to create a live video site. A normal website, even a video site, on a normal day will basically have peaks and troughs of traffic, and the largest peaks will be 2 to 4 times the steady-state traffic.
So you can engineer your whole product such that if you can support 2 to 4 times steady-state traffic and your site doesn't go down, you're good.
On a live video product, our peaks were 20 times. Now you can't even really test 20x peaks; you just experience them and fix what happens when 20x more people than normal show up on your website because some pop star is streaming something.
And so two things kind of happened that were really fun about this. So the first hack we had was: if suddenly some famous person was streaming on their channel, there'd be a bunch of dynamic things that would load, like your username would load up on the page, and the view count would load up, and a whole bunch of other things that would basically hit our application servers and destroy them if a hundred thousand people were trying to request the page at the same time.
So we actually had a button that could make any page on Justin.tv a static page. All those features would stop working; your name wouldn't appear, the view count wouldn't update, like literally a static page that loaded our video player, and you couldn't touch us!
We could just cache that static page, and as many people as wanted to could look at it. Now to them, certain things might not work right [Laughter], but they were watching the video; the chat worked because that was a different system; the video worked, that was a different system.
And we didn't have to figure out the harder problems until later. Later Kyle and Emmett actually worked together to figure out how to cache parts of the page and make other parts of the page dynamic, but that happened way, way later.
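Here's a minimal sketch of that kind of "make it static" panic button, assuming a hypothetical render_channel_page() that normally personalizes the page; none of this is Justin.tv's actual code:

```python
# Sketch of a static-page kill switch: when a channel is flipped to static, everyone
# gets one cached, impersonal copy of the page instead of hitting the app servers.
STATIC_MODE = set()     # channels an operator has flipped to static
_page_cache = {}        # channel -> pre-rendered HTML

def render_channel_page(channel, user, dynamic):
    """Stand-in for the real page renderer (hypothetical)."""
    who = user or "guest"
    view_count = "<span id='views'>live count</span>" if dynamic else ""
    return f"<html><body>{channel} video player for {who} {view_count}</body></html>"

def serve_channel(channel, user=None):
    if channel in STATIC_MODE:
        # No username, no live view count; video and chat are separate systems anyway.
        if channel not in _page_cache:
            _page_cache[channel] = render_channel_page(channel, user=None, dynamic=False)
        return _page_cache[channel]
    return render_channel_page(channel, user=user, dynamic=True)  # normal, per-user path

STATIC_MODE.add("popstar")                       # the button gets pressed...
html = serve_channel("popstar", "some_viewer")   # ...and 100k requests hit the cache
```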
Dude, let me give you a quick anecdote. Yes, remember Friendster before MySpace?
Yeah, of course!
Every time you would log in, it would calculate how many people were two degrees of separation from you. It would fire off a MySQL query when you logged in: it would look at your friends, calculate your friends' friends, and show you a live number of how big your extended network was.
And the founder, you know, Jonathan Abrams, thought this was a really important feature—I remember talking about it. Guess what MySpace's do-things-that-don't-scale solution was?
I mean, if someone was in your friends list, it would say, you know, so-and-so is in your friends list, and if they weren't, it would say so-and-so is in your extended network.
There it is! That was it, that was the feature! And so Friendster was trying to hire engineers and scale MySQL, and they were running into too-many-threads-on-Linux issues and updating kernels, and MySpace was like, "Uh, so-and-so is in your extended network." That's our solution!
Anyway, carry on! That's the same deal!
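For flavor, a toy comparison of the two approaches with made-up data; the point is that the MySpace-style label costs nothing per profile view:

```python
# Friendster-style: compute the live second-degree network on every login (expensive).
# MySpace-style: compute nothing; just label the relationship.
friends = {"alice": {"bob", "carol"}, "bob": {"alice", "dave"}, "carol": {"alice"}}

def extended_network_size(user):
    second_degree = set()
    for friend in friends.get(user, set()):
        second_degree |= friends.get(friend, set())
    second_degree -= friends.get(user, set()) | {user}
    return len(second_degree)            # the graph walk grows as the network grows

def relationship_label(viewer, profile):
    if profile in friends.get(viewer, set()):
        return f"{profile} is in your friends list"
    return f"{profile} is in your extended network"   # constant time, no graph walk

print(extended_network_size("alice"))        # 1 (just dave)
print(relationship_label("alice", "dave"))   # 'dave is in your extended network'
```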
So our second one always happened with popular streamers. If you imagine someone who's really popular and there's a hundred thousand people who want to watch their stream, we actually need multiple video servers to serve all of those viewers.
So we would basically propagate the original stream coming from the person streaming across multiple video servers until it was on enough video servers to serve all the people who were viewing.
The challenge is that we never had a good way of figuring out how many video servers we should propagate the stream to. If a stream would slowly grow in traffic over time, we had a little algorithm that worked: it would spin up more video servers and be fine.
But what actually happened was that a major celebrity would announce they were going on and all their fans would descend on that page, and so the second they started streaming, a hundred thousand people would be requesting the live stream BAM! Video server dies!
And so we were trying to figure out solutions, like how do we model this? There were all kinds of overly complicated solutions we came up with.
And then once again, Kyle and Emmett got together and they said, "Well, the video system doesn't know how many people are sitting on the website before the stream tries to start, but the website does! All the website has to do is communicate that information to the video system, and then it can pre-populate the stream to as many video servers as it needs and then turn the stream on for users!"
So what happened in this setup is that some celebrity would start streaming; they would think they were live; no one was seeing their stream while we were propagating it to all the video servers that were needed; and then suddenly the stream would appear for everyone, and it would look like it just worked!
Well, and like the delay was a couple seconds—it wasn't that bad, right? But like dirty, super dirty! But it worked!
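A hedged sketch of the fix they describe: the web tier tells the video tier how many viewers are already waiting, so the stream fans out to enough relay servers before it goes live. The capacity number and names here are invented:

```python
import math

VIEWERS_PER_SERVER = 5_000   # assumed capacity of one video relay (made-up number)

def relays_needed(waiting_viewers: int) -> int:
    return max(1, math.ceil(waiting_viewers / VIEWERS_PER_SERVER))

def go_live(channel: str, waiting_viewers: int) -> None:
    n = relays_needed(waiting_viewers)
    # 1. Fan the origin stream out to n relays while the streamer thinks they're live...
    print(f"{channel}: propagating stream to {n} relays")
    # 2. ...then flip the stream on for everyone at once, a couple of seconds later.
    print(f"{channel}: now visible to {waiting_viewers} waiting viewers")

go_live("celebrity", waiting_viewers=100_000)   # 20 relays ready before anyone connects
```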
And honestly, that's going to be kind of the theme of this whole setup, right? Super dirty, but it worked!
You had a couple of these at imeem, right?
Yeah, there were a couple that we had at imeem. So one of them, um, at the time, again, like to set the stage, the innovation was showing video in a browser without launching RealPlayer—no one here probably knows what that is.
But it used to be that to play a video, it would launch another application in the browser, which sucked; it would crash your browser, and you hated your life!
Okay, so one of the cool innovations that YouTube, the startup YouTube, had before it was acquired by Google was to play video in Flash in the browser. It required no external dependencies; it just played right in the browser! At the time, that was like awesome!
Like it was like a major product innovation to do that!
Yeah! And so we wanted to do that for music at imeem, and we were looking at the tools available to do it, and we saw all this great tooling to do it for video.
And so rather than rolling our own music-specific tools, we just took all of the open-source video stuff and the other video code that we had and hacked it so that every music file played on imeem was actually a video file.
It was a .flv back in the day, and it was actually a Flash video player.
And the entire thing was basically that we were playing video files that had like zero bits of video in them; it was just audio!
And we actually were transcoding uploads into video files, you know what I'm saying? Like the whole thing, the entire thing, was a video site with no video.
I don't know how to explain it!
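A rough modern equivalent of that hack, assuming ffmpeg is installed; imeem's actual pipeline predates this, so the snippet only illustrates the idea of wrapping audio in a "video" container that a Flash player understands:

```python
# Wrap an uploaded audio file in an FLV container with no video stream at all,
# so a Flash-era video player will happily play it.
import subprocess

def audio_to_flv(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", src,              # the uploaded audio file
         "-vn",                  # omit video entirely: a "video" file that is just audio
         "-c:a", "libmp3lame",   # FLV supports MP3 audio
         dst],
        check=True,
    )

audio_to_flv("upload.wav", "upload.flv")  # serve upload.flv to the Flash video player
```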
Um, and it worked. And I do think this is a recurring theme: a lot of the best product decisions are ones made kind of fast and kind of under duress.
I don't know what that means, but it's like when it's like 8 P.M. in the office and the site's down, you tend to come up with good decisions on this stuff!
So we had two more at Twitch that were really funny. The first one, talking about duress, was our free peering hack.
So streaming live video is really expensive! Back then, it was really expensive, and we were very bad fundraisers; that was mostly my fault.
And so we were always in a situation where we didn't have enough money to stream as much video as we needed to, and we had this global audience of people who wanted to watch content.
And so we actually hired one of the network ops guys from YouTube who had figured out how to kind of scale a lot of YouTube's early usage, and he taught us that you could have free peering relationships with different ISPs around the world.
So that you wouldn't have to pay a middleman to send your video to folks in Sweden; you connect your servers to theirs directly, I forget exactly how they do it, and it saves you money and it saves them money.
That's what they wanted!
Yeah! And there were these massive like switches where you could basically like run some wires to the switch, and bam! You can connect to the Swedish ISP.
Now the problem is that some ISPs wanted to do this free peering relationship, where basically you can send them traffic for free and they can send you traffic for free, and others didn't.
They didn't want to do that, or they weren't kind of with it. And so, I think it was Sweden but I don't remember, some ISP was basically not allowing us to do free peering, and we were spending so much money sending video to this country, and we were generating no revenue from it! It's like we couldn't make a dollar on advertising!
And so what we did is, after 10 minutes of someone watching free live video, we just put up a big thing that blocked the video and said, "Your ISP is not doing a free peering relationship with us, so we can no longer serve you video. If you'd like to call to complain, here's a phone number and email address!"
And that worked!
And how fast did it take for that to work?
I don't remember how fast; I just remember it worked, and I remember thinking to myself, it's almost unbelievable, like that ISP was a real company! Like we were just a website in San Francisco!
Um, and hey, that worked!
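A sketch of that gate, with invented names and values (NON_PEERED_ASNS; the network lookup is omitted), since only the behavior is known: after ten minutes of free video on a non-peered network, block the player and show the contact info:

```python
import time

NON_PEERED_ASNS = {64500}    # placeholder ASN for an ISP refusing free peering
FREE_MINUTES = 10

BLOCK_MESSAGE = (
    "Your ISP is not doing a free peering relationship with us, so we can no longer "
    "serve you video. If you'd like to complain, here's a phone number and email address."
)

def should_block(viewer_asn: int, watch_started: float) -> bool:
    minutes_watched = (time.time() - watch_started) / 60
    return viewer_asn in NON_PEERED_ASNS and minutes_watched >= FREE_MINUTES

# A viewer on the non-peered ISP who started watching 11 minutes ago gets the overlay.
print(should_block(64500, time.time() - 11 * 60))  # True
```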
And then the second one was translation.
So we had this global audience, and we would like call these translation companies, and we'd ask them like, "How much would it cost to translate our site into these like 40 different languages?" and they were like, "Infinite money!"
And we're like, "We don't have infinite money!"
And so I think we stole the solution from Reddit. We were like, "What happens if we just build a little website where our community translates everything?"
And so basically it would just serve up every string in English to anyone who came to the site from a non-English-speaking country and ask, "Do you want to volunteer to translate this string into your local language?"
And of course, you know, people were like, "Well, what if they do a bad job on the translation?" I was like, "Well, the alternative is it's not in their language at all, so let's not make the perfect the enemy of the good!"
And I think we had something where we would get three different people to translate a string and check that the translations matched, but that happened later.
We basically got translation for the whole product for free.
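A toy sketch of that crowd-translation flow; the function names and the three-volunteer quorum check are illustrative, not the real system:

```python
from collections import Counter, defaultdict

strings = {"watch_now": "Watch now", "follow": "Follow this channel"}  # English sources
votes = defaultdict(Counter)   # (string_key, language) -> Counter of proposed translations

def next_untranslated(lang):
    """Serve a volunteer the next string their locale still needs."""
    for key, english in strings.items():
        if accepted(key, lang) is None:
            return key, english
    return None

def submit(key, lang, translation):
    votes[(key, lang)][translation] += 1

def accepted(key, lang, quorum=3):
    """The later refinement: accept a translation once enough volunteers agree."""
    best = votes[(key, lang)].most_common(1)
    if best and best[0][1] >= quorum:
        return best[0][0]
    return None

for _ in range(3):
    submit("watch_now", "sv", "Titta nu")
print(accepted("watch_now", "sv"))   # 'Titta nu'
```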
Um, maybe to end, because I think this might be maybe the funniest of them all, tell the Google story.
So look, for the Facebook story, that was firsthand, where I personally witnessed the servers with my own eyes, so I'm 100% confident that is what happened, because it was me, right?
This Google story is second hand, and so I may get some of the details wrong; I apologize in advance, but I'll tell you this was related to me by someone that was there.
All right, you ready?
So look, the original Google algorithm was based on a paper that they wrote, which you can go read: PageRank. It worked really well; it was a different way to do search.
Okay, it worked! But they never had enough hardware to scale it because, remember, there was no cloud back then; you had to run your own servers.
And so as the internet grew, it was harder and harder to scale Google. You still with me?
Like there were just more web pages on the internet, so it worked great when the web was small, but then they kept having more web pages really fast.
And so Google had to run as fast as they could to just stay in the same place—just to run a crawl and re-index the web was like a lot of work.
And so the way it worked at the time, they weren't re-indexing the web in real time constantly; they had to do it in one big batch process back in the day.
Okay? And so there was some critical point, probably in the 2001 era (again, this is secondhand; I don't know exactly when it was), where this big batch process to index the web started failing.
And it would take three weeks to run the batch process; it was like one big reindexweb.sh script, you know, and it started failing.
And so they tried to fix the bug and they restarted it, and then it failed again.
And so the story that I heard is that there was some point where for maybe three months, maybe four months, I don't remember the exact details, there was no new index of Google; they had stale results.
And the users of Google didn't know that! A user of Google was just seeing stale results, and no new websites were in the index for quite some time.
Okay?
And so obviously they were freaking out inside of Google. Um, and this was the genesis for them to create MapReduce, which they wrote a paper about, which was a way to parallelize and break into pieces all the little bits of crawling and re-indexing the web.
And, you know, Hadoop was created off of MapReduce, along with a bunch of other software, and I would argue every big internet company now uses the descendants of this particular piece of software.
And it was created under duress, when Google was secretly completely broken for an extended period of time because the web grew too fast.
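For readers who haven't seen it, here's the shape of the map/reduce programming model from the paper, shrunk to a single process; real MapReduce adds the distribution, shuffling, and retries across machines:

```python
from collections import defaultdict
from itertools import chain

pages = {                          # tiny stand-in for a crawl of the web
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
}

def map_page(url, outlinks):       # map: each page independently emits (target, 1) per link
    return [(target, 1) for target in outlinks]

def reduce_counts(pairs):          # reduce: group by key and sum
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapped = chain.from_iterable(map_page(url, links) for url, links in pages.items())
print(reduce_counts(mapped))       # {'b.com': 1, 'c.com': 2}, i.e. inbound link counts
```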
But I think this is the most fun part about this story: when the index started getting stale, did Google shut down the search engine?
They didn't! That's the coolest part!
Like people just didn't realize—they didn't know.
And did they build this first? Again, in terms of doing things that don't scale: did they build MapReduce before they had any users?
No! Like they basically made it this far by just building a monolithic product, and they only dealt with this issue when they had to.
You know, I think this is like such a common thing that comes up when we give startup advice.
You know, we'll get a founder that's like, "Oh, how do I test my product before I launch to make sure it's going to work?"
And I always come back and tell the founders the same thing: if you have a house and it's full of pipes, and you know some of the pipes are broken and they're going to leak, you can spend a lot of time trying to check every pipe, guess whether it's broken or not, and repair it, or you can turn the water on and you'll know. You'll know exactly the work to be done when you turn the water on.
I think people are always surprised that that's basically all startups do: just turn the water on, fix what's broken, rinse and repeat, and like that's how big companies get built!
It's never taught that way, though, right? It's always taught like, "Oh, somebody had a plan and they wrote it all down," and it's like never—never!
And you earned the privilege to work on scalable things by making something people want first!
You know what I think about sometimes with Apple? Picture Wozniak hand-soldering the original Apple computer, and those techniques, compared to whoever it is at Apple who designs the AirPods.
Like it's the same company, but Wozniak hand-soldering is not scalable!
But you know, because that worked, they earned the privilege to be able to make AirPods now.
And because Google search was so good, they earned the privilege to be able to create super scalable stuff like MapReduce and all these other awesome internal tools they built, right?
Yes! But if they'd put that stuff first, it wouldn't be Google, man!
And so to wrap up, kind of what I love about things that don't scale is that it works in the real world, right? The Airbnb founders taking photos, the DoorDash folks doing deliveries.
It also works in the software world, right? Like don't make the perfect the enemy of the good; just try to figure out any kind of way to give somebody something that they really want and then solve all the problems that happen afterwards, and you're doing way better.
All right, thanks so much for watching the video!