YC Tech Talks: Machine Learning
Welcome to YC Tech Talks: Machine Learning. I'm Paige from our Work at a Startup team, the team that helps people get jobs at YC startups. For tonight's event, we have founders who are going to be talking about interesting problems they deal with in the machine learning space and a little bit about the problems that keep them up at night.
Hi everyone, I'm Andrew Yates. I'm the CEO and founder of promoted.ai, and I'm going to be talking about composing models: how to stack recommendation models to always win.
So what does Promoted do? We rank search results and feeds, promoting the best listings to the top to increase revenue. You're familiar with this problem if you've ever used Facebook News Feed, Google Search, Airbnb, or Amazon: any product where there's a list of things you want to sort against some objective. It's a very, very common machine learning task: what should I show you at this moment that you are most interested in? And then we can auction those positions off and make money on them.
The idea is: if I can show you these things, maybe you'd be 10th, but if you want to be first, here's how much you would pay for that. We run an auction, and then you have an ads business.
So in our business, in our machine learning task, there are many, many machine learning models and techniques. It is a very well-established problem, and there are many different alternatives: vendors, in-house teams, different strategies and techniques, and so on.
Another challenge is that there are frequently trade-offs: some models are good at one thing, and other models are better at something else. For example, some models are trained on past engagement, so they're very good for established items. But what about new items that have no engagement? Those do poorly, so you may need some sort of content understanding model. The content understanding model, in turn, doesn't do well on items where we already have a tremendous amount of user engagement.
Another example: we have a model that is very fast to update within the current day, for an ads model, for example. People are always creating new campaigns that may only run for a day, so it needs to update very quickly. Versus a gigantic recommendations model trained on years of user preferences, maybe some sort of gradient descent type of algorithm, that updates very slowly and won't learn a pattern until weeks and weeks after accumulating data.
But you want both of these things. Then there are different aspects of the objective you're trying to accomplish in your search and feed: relevance versus engagement, for example. You want things that people are going to buy, and you also want things that are relevant to the search query as a person would judge them, say in some sort of human review.
So you have many different types of models, and you have different production constraints: data availability and dimensions, inference time, different trade-offs, reliability and fault tolerance. And the challenge, for you as a machine learning practitioner or for us at Promoted, is: what do we do to always win in A/B experimentation? How do we always win?
And the answer is, you will always lose if you try to make a single best model. So what we do is combine all the models together, which is of course a very straightforward, pragmatic solution: we just take all of the good things, combine them, and get a better model. In the worst case, it's going to be as good as the individual signals we're composing.
I'm going to talk a little bit about how this is actually done in practice. There are two techniques for doing this effectively. They both sound simple, but there's actually a lot of really interesting theory behind them, and some very successful techniques use one of these two.
First is horizontal composition. This is as simple as taking all of the models and averaging them together, and you usually get a better result than any individual model you started with. This is the wisdom of crowds. Some extremely powerful, successful models are based on this idea. One is a PID controller, if you're familiar with real-time control systems: do you want the difference now, the integral of the difference, or the derivative of the difference? How about you just add them all together, and that gives you the best controller output? Yes, it works really well.
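To make the PID example concrete, here is a minimal sketch (the gain values are illustrative, not from the talk) of how the proportional, integral, and derivative signals are combined into one control output:

```python
class PIDController:
    """Combine three simple 'models' of the correction (P, I, D) into one signal."""

    def __init__(self, kp=1.0, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement                 # the difference now (proportional)
        self.integral += error * dt                    # the integral of the difference
        derivative = (error - self.prev_error) / dt    # the derivative of the difference
        self.prev_error = error
        # Horizontal composition: the output is just a weighted sum of the three signals.
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```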
Gradient boosted decision trees: which decision tree is best? I know, let's just take a lot of decision trees and average their results together, and that gives us the best possible model. This is a very, very powerful model, very difficult to outperform even with very sophisticated techniques.
Another advantage of horizontal composition is that it's simple, effective, and easy to understand. A disadvantage is that it is hard to tune, because the composition is itself a type of model. If you've ever had an internship or a job on a ranking or recommendations team, as soon as you have a simple average, people ask: why is it an average? Can we make it a weighted average? Can we multiply? Can we put some sort of non-linear transformation on it? It gets difficult to answer what the best model on top of the signals is, because the composition is usually something that was very easy to get started with, not an organized, principled composition.
An advantage from the infrastructure perspective is that with horizontal composition you can compute all of these signals in parallel, which is efficient and modular. You don't depend on any single signal being computed before you compute another model's inference; you can do all of them in parallel and then combine them.
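As a rough illustration of that fan-out-and-average pattern (the scorer names and fields below are hypothetical, not Promoted's actual models), horizontal composition boils down to running independent inferences in parallel and combining them:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-model scorers; in a real ranking system each would be its own
# model inference or service call.
def engagement_score(item):   # trained on past engagement, good for established items
    return item.get("engagement_score", 0.0)

def content_score(item):      # content-understanding model, good for cold-start items
    return item.get("content_score", 0.0)

def freshness_score(item):    # fast-updating model, good for brand-new campaigns
    return item.get("freshness_score", 0.0)

SCORERS = [engagement_score, content_score, freshness_score]

def horizontal_score(item):
    # No scorer depends on another's output, so all of them can run in parallel;
    # the composition here is a plain average, the simplest possible combination.
    with ThreadPoolExecutor(max_workers=len(SCORERS)) as pool:
        scores = list(pool.map(lambda f: f(item), SCORERS))
    return sum(scores) / len(scores)

print(horizontal_score({"engagement_score": 0.7, "content_score": 0.4}))
```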
The other type of model composition is vertical composition. This is when one model's outputs are inputs to another. A big disadvantage of this is that it can be very intensive to log the training data.
A little bit about Promoted and why Promoted is very successful at doing this: we are, in some ways, a data streaming infrastructure business. We take every single inference and log it with a tremendous amount of metadata and features, so that we can take every single model output at every single inference, log it, and train on top of it. Depending on what you're trying to do in your system, this is sometimes infeasible.
Another way of thinking of vertical composition is as a form of feature engineering: I have these raw signals, and I need to transform them in some way before they go into the main model to be used. You can think of vertical composition as a really, really big version of feature engineering.
An advantage of vertical composition is that the composition itself is learned as part of the model, whatever the architecture is. Unlike horizontal composition, where you have an average or a weighted average or some other ad hoc combination, here the composition is just part of the model: you put in the signals, and the model definition figures the combination out for you.
The disadvantage is that computation is serial: you have to finish executing all of the input signals before you can start on the next layer of execution.
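Here is a minimal stacking sketch (synthetic data, not the production setup) of what "model outputs are inputs to another model" looks like in practice. Note that the base scores must exist before the second-stage model can run, which is the serial constraint just mentioned:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Pretend these are p(click) estimates from three base models for n past impressions.
base_model_scores = rng.random((n, 3))
# Synthetic labels: whether the impression was actually clicked.
clicked = (base_model_scores.mean(axis=1) + 0.1 * rng.standard_normal(n)) > 0.5

# The second-stage model learns how to weight and combine the base signals,
# instead of us hand-tuning an average.
stacker = LogisticRegression()
stacker.fit(base_model_scores, clicked)

new_scores = np.array([[0.2, 0.9, 0.4]])          # base model outputs for a new item
print(stacker.predict_proba(new_scores)[:, 1])    # learned, combined prediction
```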
One really interesting thing about this concept of model composition, as opposed to the more typical machine learning idea of having the single great model, is that it helps you think about organizing an entire machine learning engineering organization. If you are running Pinterest or Facebook or Airbnb, you don't have the recommendation model, and you don't have the one recommendation engineer who built it. You have many, many different teams doing different pieces of this, all working in parallel and evolving over time to build and deliver the final product.
So how do you think of models not only as pieces among other models, but then map that onto an entire engineering organization, so that all of these pieces work together, literally in the computer, but also as organizations and people working together?
One way this is done is to separate meaning from implementation. Different models have different characteristics. As an example, a click prediction model means: this is the probability of a click on this specific item in this specific location. That means you can change how the model is implemented, say from a gradient boosted decision tree to some sort of neural network, and it doesn't fundamentally change what that signal is meant to be. It can then be used in another system that says: okay, my feature is the probability of a click for this item. It doesn't matter how that was computed; it just matters that it's the same interface.
So you can start thinking of models as having an abstract interface, in the same way that you build other types of software, and then you can apply other software organization concepts, like microservices, to them.
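A small sketch of what "meaning as an interface" might look like in code (all names here are hypothetical, for illustration only): consumers depend on what the score means, not on how it was computed.

```python
from typing import Protocol

class ClickModel(Protocol):
    """Interface: the probability of a click on a given item at a given position."""
    def p_click(self, item_id: str, position: int) -> float: ...

class GbdtClickModel:
    def p_click(self, item_id: str, position: int) -> float:
        return 0.03  # gradient boosted trees under the hood (stubbed out here)

class NeuralClickModel:
    def p_click(self, item_id: str, position: int) -> float:
        return 0.03  # neural network under the hood (stubbed out here)

def expected_revenue(model: ClickModel, item_id: str, position: int, bid: float) -> float:
    # Downstream code only cares that p_click means "probability of a click";
    # swapping the implementation does not change this calculation.
    return model.p_click(item_id, position) * bid

print(expected_revenue(GbdtClickModel(), "item_123", 1, bid=2.50))
```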
One example of this is in ads systems: this idea of solving parts of the problem analytically versus approximately. An example in ads, and this is something Promoted is doing, is that you can separate the pricing, like what the optimal price someone should bid is, from the other aspects of the rest of the system, like the probability of a click or a conversion, or other objectives around user experience versus ad revenue.
That’s my few-minute talk. I'd love to take any questions.
How do you pick your models within your compositions, and have you run into a scenario where either a single model or a few models have negatively skewed your results?
Oh, great questions! For picking models, it's engineering. Generally, as an engineer or a practitioner working at a job, this is not a research problem: you're not developing new techniques or new models; your job is to accomplish some task for a system objective.
So, it depends; that's the short answer. The longer answer is to choose models that work reasonably well, and then only move to a more complicated model if it's worthwhile to spend the resources and energy to do so. There's a whole discipline around that.
I'll come back to the second part. So: start with a linear model, start with a hand-tuned rule, and then eventually decide whether the additional complexity is worth investing in.
The other part is: is it possible to overdo this? Yes, yes! If you follow Elon Musk's Twitter feed, think of the microservices video that was recently posted; models are the same way. You can construct a horrible Kafka-esque world of models feeding into models all over the place, which could all just be condensed down to something relatively straightforward. That is sometimes more of a human, organizational problem than an engineering or technical problem.
On the engineering and technical side, sure: not every single model is going to add additional value, and you may not have the model complexity or the training data that matches your domain. Simply increasing the complexity or the number of signals may not actually improve your objective.
How has your experience been with developers who might not be very familiar with, you know, certain ML concepts or even the fundamental basics interacting with these technologies?
This is where those model interfaces are important, this concept of "this is what this model means, and this is what it's supposed to be used for." You don't have to understand how the score was computed, and often you shouldn't and won't. Say it's the probability of a click: it just is the probability of a click. You can still know what it means without knowing how it was produced.
So that’s in contrast to some other types of black box models where you can’t understand what the score means unless you have the entire final composition.
An example of a model like this in our domain is a learning-to-rank model, where the score produced only matters in the context of the other scores in the same result set.
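To make that contrast concrete (with made-up numbers): a calibrated click probability can be consumed on its own, while a learning-to-rank score only determines an ordering within one result set.

```python
import numpy as np

# Calibrated probability model: each score is meaningful by itself.
p_click = {"item_a": 0.08, "item_b": 0.02}   # "8% chance of a click" is interpretable alone

# Learning-to-rank model: raw scores only mean something relative to the other
# scores in the same result set; the real output is the ordering, not the numbers.
ltr_scores = np.array([2.3, -0.7, 0.4])
ranking = np.argsort(-ltr_scores)            # indices of items from best to worst
print(ranking)                               # [0 2 1]
```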
What I've seen, and I think most software engineers are generally intelligent people, is that people don't have a problem with the more complicated theory part of it; they recognize it's complicated and don't try to understand it. Where I've seen people have trouble, if they have less experience, is using models without understanding what they're meant to do, and then running into a mess. Then they try to fix it with A/B experiments: "well, it's a mess no one can understand, let's just run an A/B experiment." That's where I see people burn a tremendous amount of time and energy, versus simply not understanding how gradient descent works in practice or something like that.
You don't really need to know that if you're simply consuming a signal from another system.
So with that, I’m going to turn it over to Josh.
Well, if you are interested in learning about deep learning and how it works and all that stuff, and not necessarily in using it for anything practical, then maybe this presentation will be up your alley.
So how and why we created one of the fastest 3D reinforcement learning simulators.
What we built is a reinforcement learning environment called Avalon, and we actually just presented it at NeurIPS, the machine learning conference, this morning. It's open-source, it's free, and anyone can download it and play around with it. It's a procedurally generated set of worlds; there are an infinite number of worlds with different tasks. There are buildings and predators and tools, and it's sort of a game like Minecraft in which reinforcement learning agents can learn to interact with the world.
So why did we build this? Our goal at Generally Intelligent is to make more intelligent software agents. And why do we want to do that? Well, we want to automate boring tasks, we want to cure diseases; there are all sorts of really cool things that we could do if we had very intelligent software.
Today we have some pretty cool machine learning stuff. You know, we can learn to rank things. We've got things like ChatGPT, but it’s still pretty far from AGI.
Here’s a good example that I pulled just this morning from ChatGPT: someone asked it, "What is the fastest marine mammal?" It says, "The fastest marine mammal is a peregrine falcon." A falcon is not a marine mammal!
Yeah, so okay, maybe it's a sailfish; that's not a mammal either, right? It just goes on. It doesn't really know things, necessarily. In another sense, it's very powerful and very interesting; it can definitely do lots of really cool stuff. But these systems are still pretty far from AGI. Even Sam Altman, previously of Y Combinator, agrees: a lot of people think "oh, this is the AGI," but he acknowledges it's obviously not very close yet.
So, you know, they are very powerful, though, right? When we apply AI to a particular task like ranking things or playing Go or Dota or something like that, it does extremely well, often much better than people.
So isn't this kind of a contradiction? The real problem, though, is that we want general intelligence. A really good definition from Shane Legg at DeepMind, via the definition-of-intelligence page on Wikipedia, is that intelligence is the ability to achieve many goals in a wide range of environments.
So really what we want is the way to construct and evaluate a wide range of problems and environments—in other words, a simulator.
We actually did a lot of customer research with researchers as we were developing this, and what we heard over and over again is that one of the biggest things holding back the reinforcement learning field is a lack of really good benchmarks. A lot of people work on Minecraft or Atari or other games, but those environments really cap our ability to build really interesting agents in them.
So Avalon is built from the ground up as a simulator made specifically for reinforcement learning. Most systems use existing games like Atari or Unity or Minecraft, and those have made trade-offs to be good games, which is very different from what you want in a reinforcement learning simulator.
For example, in a game you want stuff to be fun, right? But that’s not actually what you want in a reinforcement learning simulator. Instead you want it to be similar to tasks that people do every day, which are often kind of grindy and not very fun.
In a game, you want it to run, you know, 30 to 60 frames per second. In a simulator, you want to run 1,000 frames a second or 10,000. You want it to be profitable as a game. Here we want it to be free and open-source so people can do research on it.
You want a game to have lots of features; we instead want this to be really debuggable and simple. You want a game to be challenging for adults; here we want a range of challenges. Some things are very easy so that agents can get started, some are very challenging, and there's a wide spectrum in between.
So what we did is build Avalon on top of the Godot game engine, which is actually a really cool game engine. It's completely open-source, has physics and rendering and everything inside of it, is cross-platform, supports VR, and is about a 30-megabyte download, a single executable. It's really, really easy to use, with lots of tutorials and an active community, and it has a really good debugger and editor. It's a great base to build Avalon on.
It's also nice and simple. So what we did is pack it up into a crazy-fast simulator. It was simple enough that we could just reorder things to turn it into a deterministic, actual simulator where we can say: step, wait for the agent, step, wait for the agent, which is very different from how a game normally runs.
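A sketch of that lockstep pattern, using Gymnasium's CartPole as a stand-in environment so the snippet runs anywhere (Avalon exposes a similar step/reset structure, but check the repo for its exact API):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()   # stand-in for the learning agent's policy
    # The simulator advances exactly one step and then waits for the next action,
    # unlike a game loop that keeps running in real time regardless of the player.
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```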
We also created our own custom EGL rendering back end to avoid needing an X server at all, because we want to run these on headless Linux machines with lots of GPUs in the cloud, and this avoids some extra frame buffering and passing and things like that.
We also tweaked it so that physics happens much less frequently than it normally would. A normal game might run physics 30 or 60 or maybe 120 times per second, but here we update physics just 10 times per simulated second, which lets us do a lot less physics computation, although at the cost of some funny physics bugs if you don't tune it properly. Things fall through the floor or go through each other or do all sorts of weird stuff, so it took a while to get down to this very minimal amount of physics work.
We also did a lot of work tracing through the OpenGL rendering with the Nvidia profiler until we had trimmed out everything: transparency, don't need that; textures, don't need that; mipmaps, don't need that; shadows, you can turn those back on if you want more visual realism. We did this so we can get to about 10,000 frames per second. Eventually, the GL clear call, which just clears the screen, becomes a significant fraction of our rendering time.
Another thing we did was transfer the data in a very fast serialized way via shared memory using NumPy, and then, you know, profile and move stuff to C++ if necessary. We also wrote our own reinforcement learning worker rollout logic to deal with the fact that this is a much more complex environment: it takes a little while to reset when the agent dies, we want to change to a new world, and a whole lot of other stuff.
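A rough sketch (not Avalon's actual implementation; shapes are illustrative) of the shared-memory idea: the simulator writes a rendered frame into a shared block, and the training process attaches to the same block by name, so the pixels never get serialized or copied through a pipe.

```python
import numpy as np
from multiprocessing import shared_memory

frame_shape, dtype = (96, 96, 3), np.uint8
nbytes = int(np.prod(frame_shape)) * np.dtype(dtype).itemsize

# Producer (simulator side): allocate a shared block and write the frame into it.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
frame = np.ndarray(frame_shape, dtype=dtype, buffer=shm.buf)
frame[:] = 255  # pretend this is the rendered observation

# Consumer (training side): attach to the same block by name, zero-copy.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(frame_shape, dtype=dtype, buffer=shm2.buf)
print(view.mean())

del frame, view       # release the buffer views before closing
shm2.close()
shm.close()
shm.unlink()
```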
Basically, we did all these things and ended up at maybe seven to ten thousand frames per second on a single GPU, which is pretty impressive. We're hoping to get to the point where, even for a single agent, it's about 100 times faster than real time, so you could train a "one-year-old", a year of simulated experience, in maybe three and a half days (365 days divided by 100 is about 3.7 days).
There's a lot of interesting future work to be done. One thing we'll be doing is moving the Godot game process into a library rather than a standalone process, which will let us do batched rendering. We'll also be doing more with multi-threading, so we can work on multiple kinds of agents at the same time in a single simulation. We'll be moving some of the parallelism out of the Python level into the C level within Python to avoid the global interpreter lock. We're also considering asynchronous execution of environments and agents, so that rather than everything waiting around for the slowest agent, an agent that's too slow just misses its turn, just like in real life: if you're too slow, stuff happens.
We'll also be doing a lot of performance work, both on existing networks and on other large language models and agents, in order to make them fast enough to actually integrate and put together. So there's a lot of really cool work.
If any of this stuff sounds interesting, or if doing machine learning work and research in general sounds interesting, we're definitely hiring. So feel free to reach out to me, and thanks for listening.
Do clients use this as an API to build their own environments?
So it’s free and open-source, so people are welcome to use it for whatever they want. We’re hoping that academic researchers will use it primarily. It can run on a single GPU pretty easily, so it’s accessible for most academic labs. If people want to use it for business, they’re welcome to, although it is currently GPL, so we, you know, kind of require that you contribute back any fixes or changes that you make.
Yeah, um, another question here: can it be used to generate a digital twin of small-scale agricultural farms with 3D simulated plants for various types of AG work?
You could try doing that, but we have very purposely stayed away from making it ultra-realistic, like the 10 physics ticks per second thing, and the visuals you saw before are purposely not very realistic, because we're really going for speed. It's really meant as a scientific tool for asking questions about how we can make agents learn, and a little bit less for making agents that we could then transfer to the real world. We might do that in future work, extending things to be more realistic, but that's not our focus right now.
And you had started to answer this, but what are some of the most common use cases?
Yeah, so right now it's primarily intended as a research tool, as I was saying. Some researchers are working on extending this in various ways. We are extending it to add a bunch more tasks: not just the simple physical tasks like running and jumping and throwing that are in there right now, but more linguistic tasks and sort of unbounded tasks. One of the things we really want to work towards is making effectively a benchmark for general intelligence, which would have thousands or tens of thousands of tests. So that's the thing we're working towards.
Other people might use it in the future for more multi-agent scenarios, and some people are looking at how you can reuse computation from previous RL agents. One of the things this opens up is the ability to train agents for much longer than agents have been able to train before, so you can ask: how do you learn things within your lifetime? It opens up a bunch of new possible research questions.
So my question is, you know, do you think this sort of training agents to act in this sort of simulated virtual environment is going to be the most important or useful application of RL in the next few years? Or will it, you know, be more of things like, you know, reinforcement learning from human feedback for like aligning large language models and code generation models, things like that? Just, yeah, curious how you think about the evolution of RL in the next few years.
Well, yeah. RL from human feedback is an application of reinforcement learning to other things, and the purpose of this tool is to let us make agents that are better at reinforcement learning in the first place. So anything we discover using this tool applies to those types of applications as well.
So this is sort of one meta-level removed from that: if we can make agents that learn much better from less data, or learn much more quickly, we can apply that to tons of different possible applications.
Cool, makes sense! Thanks.
Okay, so we’ll have a couple of pitches now from other companies. First up, we have Jay.
Cool!
Yeah, so hi everyone! I'm Jay, I'm a co-founder at Eventual. We are the data warehouse for complex data: data like images, audio, video, and documents, things that don't really fit in a table like a SQL table. Our product is open source, and it's called Daft.
You know, it's often said that data is the most important part of machine learning, and Daft is the data engine that forms a core part of that infrastructure. Daft is a distributed Python dataframe library, so if you've ever used pandas or PySpark before, you'll be right at home with Daft.
It's built for Python, so you can use all of your Python functions, classes, and objects, making it super easy to work with any custom complex data types you have, like images or crazy DICOM file formats for healthcare. It's also distributed: we built it to run on Ray, which is a distributed computing framework, so we can process petabytes of data on hundreds of machines, and it goes really quickly.
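As a hedged sketch of the dataframe-style workflow Jay is describing (consult the Daft documentation for the current API; this only shows the basic from_pydict / col / with_column pattern, and the column names are made up):

```python
import daft

# Build a small dataframe; in practice the columns could hold images, audio,
# or other complex objects rather than plain numbers and strings.
df = daft.from_pydict({
    "image_path": ["a.png", "b.png"],
    "label": [0, 1],
})

# Column expressions look much like pandas or PySpark.
df = df.with_column("label_plus_one", daft.col("label") + 1)
df.show()
```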
We did a soft launch with no real marketing about two months ago, and we have about 400 stars; we're doing a bigger beta launch at the end of the year. We're working on really core technical challenges like query optimization, distributed systems, and data engineering. Our front end is in Python and the backend is now in Rust.
And yep, learn more about daft.daft.io and come talk to me. Thanks!
Hey everybody! I’m Stan, I’m one of the co-founders over at HyperGlue. Here at HyperGlue, we’re building a platform that extracts insights from text you find in apps like Slack or Zoom, or Zendesk, or customer-facing forums like Reddit.
Our customers leverage HyperGlue to generate real-time analytics on all sorts of things. So like, uh, what are my gamers saying about my latest DLC on Reddit? What are our common questions or competitors that are coming up in my sales calls? Or what are the top issues in my support tickets this week?
The best way to think of it is as a center of excellence, if you will, for unstructured data. We provide the why behind why your numbers are moving: your numbers are up or down, churn is up or down, why is that? Well, the hints are probably in your customer touch points. So we want to make it really easy for you to have that visibility across the org.
Yeah, about the founding team: we're guys who have nothing better to do than work with ML, apparently. Before this, we used these language tools primarily in national security, doing things like media monitoring or tracking terrorists around the world. And yeah, it's a really cool technical problem, and we're happy to connect with anybody interested in the ML space or in how it translates to real-world commercial applications.
And happy to answer any questions—any specific types of engineers that you’re looking to hire?
Yeah, so we are hiring on the platform side, the ML side, and the UI side. We're just about to start growing, hopefully. So, like I said, happy to connect with anybody who feels like the problem space is something they'd be interested in.
Okay, my name is Ben Coleman. I'm the co-founder and CEO of Reality Defender. We do real-time deepfake detection for platforms: banks, streaming, social media, adult entertainment. Just to dive into what that actually means: the arms race is expanding incredibly fast and is imbalanced. There are over a hundred thousand deepfake models, and only about three percent are focused on detecting deepfakes; all the rest are all the cool generative AI you're seeing every day.
What we do is provide an ensemble approach to deepfake detection, integrating multiple models together and looking for examples of known models, unknown models, and known signatures. We create our own deepfakes in the lab. We have a no-code web app, we have an API, and we also have a passive internet-scale scanner.
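An illustrative sketch of the ensemble idea (the detectors, scores, and threshold below are hypothetical, not Reality Defender's actual models): run several independent detectors over the same media and aggregate their scores into one verdict.

```python
# Stand-in detectors; each returns a probability that the media is a deepfake.
def artifact_detector(media):   return 0.92   # e.g. trained on known generator artifacts
def frequency_detector(media):  return 0.40   # e.g. frequency-domain signature model
def in_house_detector(media):   return 0.85   # e.g. trained on deepfakes made in the lab

DETECTORS = [artifact_detector, frequency_detector, in_house_detector]

def deepfake_score(media, threshold=0.7):
    scores = [detect(media) for detect in DETECTORS]
    combined = sum(scores) / len(scores)   # simple average; a learned combiner also works
    return combined, combined >= threshold

print(deepfake_score("clip.mp4"))
```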
We provide all kinds of dashboards, exports, report cards, and email alerts. The possibilities are endless, because deepfakes attack every single industry vertical: everything you do, everything you touch, everything that looks like you, whether it's a bank, an insurer, a streaming platform, social media, or a government group.
Going very fast here: it's a very, very obvious use case. Fake faces, fake accounts, social media, shopping, banks, fake housing, fake interiors. With two minutes of Paige's voice, I could create a perfect deepfake (we won't do it on this call), and any of us can be depicted online: fake people at real companies, real videos with fake voices, or vice versa.
We have an API platform; it's super simple to log in with a password, drag in a file, and get immediate results across multiple models relevant to the type of media, the type of codec, and the type of compression, with all kinds of exports. I'll leave it at that. It's a fun company, a scary problem, and an amazing company, and we're recruiting across a number of areas: research, engineering, data science, and strategy and operations.