Making Music and Art Through Machine Learning - Doug Eck of Magenta
Hey, this is Craig Cannon and you're listening to the Y Combinator podcast. Today's episode is with Doug Eck. Doug's a research scientist at Google, and he's working on Magenta, which is a project making music and art through machine learning. Their goal is basically to create open-source tools and models that help creative people be even more creative. So if you want to learn more about Magenta or get started using it, you can check out magenta.tensorflow.org.
All right, here we go. I wanted to start with the quote that you ended your I/O talk with because I feel like that might be helpful for some folks. It's a Brian Eno quote, and I have the slightly longer version. Yeah, good, yeah. So it goes like this: whatever you now find weird, ugly, uncomfortable, and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit: all of these will be cherished and emulated as soon as they can be avoided. It's the sound of failure. So much modern art is the sound of things going out of control, of a medium pushing to its limits and breaking apart.
So, that's how you ended your I/O talk, correct? And what it kind of opened up for me was: when you're thinking about creating Magenta and all the projects therein as new mediums, how are you thinking about what's going to be broken and what's going to be created? The reason that I put that quote there, I think, is to be honest about the division between engineering and research and artistry, and to not think that what I'm doing is being a machine learning artist. We're trying to build interesting ways to make new kinds of art.
And I think, you know, it occurred to me, I read that quote and I thought, that's it, right? No matter how hard Eastman or whoever invented the film camera tried, sorry if that's the wrong person, right? Like, they clearly weren't thinking of breakage, or they were trying to avoid certain kinds of breakage. I mean, guitar amplifiers aren't supposed to distort, you know? And I thought, well, what if we do that with machine learning? Like, what's the first thing you're going to do if someone comes to you and says, here's this really smart model that you can make art with? You're going to try to show the world that it's a stupid model, right? But maybe it's smart enough that it's kind of hard to make it stupid, so you get to have a lot of fun making it stupid.
Right? I was playing with Quick-Draw this morning with my girlfriend, and what she was trying to do was make the most accurate picture that the computer wouldn't recognize. Like, immediately out of the gate. She works in art and, yeah, didn't want to believe it. Okay, it's a good intuition. I mean, you know.
Yeah, so maybe the best way to start is to talk about what you're working on right now. What are you guys making? So, right now we're working on, well, it's a good question. We have this project called NSynth, which is trying to get deep learning models to generate new sound, and we're working on a number of ways to make that better. I think one way to think about it is we have this latent space. To make that a little bit less of a buzzword: we have a kind of compressed space, a space that doesn't have the ability to memorize the original audio, but it's set up in such a way that we can try to regenerate some of that audio.
And then regenerating it, we don't get back exactly what we started with but hopefully we get something close, and that space is set up so that we can move around in that space and can come into new points in that space and actually listen to what's there. Right now it's quite slow to listen, so to speak. We're not able to do things in real time, and we also would love to be at kind of a meta level building models that can generate those embeddings, having trained on other data so that you're able to move around in that space in different ways.
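The latent-space idea described above can be sketched in a few lines. This is a toy illustration, not the real NSynth model: the weights are random stand-ins, the "trombone" and "flute" frames are synthetic sine waves, and the frame and latent sizes are invented. The point is just the shape of the operation: compress a frame of audio into a small embedding, decode it back approximately, and walk between two points in the space to hear what's there.

```python
import numpy as np

# Toy sketch of a lossy audio autoencoder with a compressed latent space.
# All weights and signals here are made up for illustration.
rng = np.random.default_rng(0)
FRAME, LATENT = 64, 8               # 64-sample frames squeezed to 8 dims

W_enc = rng.normal(scale=0.1, size=(LATENT, FRAME))   # stand-in encoder weights
W_dec = rng.normal(scale=0.1, size=(FRAME, LATENT))   # stand-in decoder weights

def encode(frame):
    return np.tanh(W_enc @ frame)   # lossy: 64 numbers become 8

def decode(z):
    return W_dec @ z                # approximate reconstruction

def interpolate(z_a, z_b, alpha):
    """Move between two sounds inside the latent space."""
    return (1 - alpha) * z_a + alpha * z_b

trombone = np.sin(np.linspace(0, 4 * np.pi, FRAME))   # fake "trombone" frame
flute = np.sin(np.linspace(0, 12 * np.pi, FRAME))     # fake "flute" frame

z_a, z_b = encode(trombone), encode(flute)
halfway = decode(interpolate(z_a, z_b, 0.5))          # a new in-between sound
print(halfway.shape)
```

A trained model would learn `W_enc` and `W_dec` from data; here they only demonstrate that every point along the interpolation decodes to some audio frame.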
And so we're continuing to work with sound generation for music, and we're also spending quite a bit of time rethinking the music sequence generation work that we're doing. We put out some models that were, by any reasonable account, primitive. I mean, very simple recurrent neural networks that generate MIDI from MIDI, that maybe use attention, that maybe have a little bit smarter ways to sample when doing inference, when generating. And now we're actually taking seriously: wait a minute, what if we really look at large data sets of performed music? What if we actually start to care about expressive timing and dynamics, care deeply about polyphony, and really care about not putting out what you would consider a simple reference model but actually what we think is super good?
And I think, you know, those are the things we're focusing on. I think we're trying to really pull up quality and make things that are better and more usable for people. And so, with all that supervised learning, are you going to create a web app where people will evaluate how good the music is? Because I heard a couple of interviews with you before where that was the issue, right? Like, how do you know what's good?
Yeah, so, I'm pausing because that's the big question in my mind: how do we evaluate these models? Yeah, at least for Magenta, I haven't felt like the quality of what we've been generating has been good enough to bother, so to speak. Like, you cherry-pick, you find some good things, like, okay, this model trains and it's interesting, and now we kind of understand the API, the input-output of what we're trying to do. I would love... yeah, I don't know how to solve this.
Conceptually, here's what we do, right? We build a mobile app and we make it go viral. That's what we do, right? And then once it's viral, we just keep feeding all of this great art and music in, and, I used to do music recommendation, we just build a collaborative filter, which is a way to recommend items to people based upon what they like, and we start giving people what they like and we pay attention to what they like and we make the models better.
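A bare-bones version of the collaborative filtering Doug mentions can be sketched like this. The ratings matrix is invented for illustration, and real systems use matrix factorization rather than this direct similarity weighting, but the idea is the same: score an item a user hasn't tried by weighting other users' ratings by how similar their tastes are.

```python
import numpy as np

# Made-up user-item ratings (0 = not rated / disliked).
ratings = np.array([
    # item: A  B  C  D
    [5, 4, 0, 1],   # user 0
    [4, 5, 0, 2],   # user 1 (tastes similar to user 0)
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine(u, v):
    """Similarity between two users' rating vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def predict(user, item):
    """Weight other users' ratings of `item` by their similarity to `user`."""
    others = [o for o in range(len(ratings)) if o != user]
    sims = np.array([cosine(ratings[user], ratings[o]) for o in others])
    their_ratings = np.array([ratings[o, item] for o in others])
    return (sims @ their_ratings) / (np.abs(sims).sum() + 1e-9)

# User 0 hasn't rated item C; the similar user 1 also ignores it, so the
# predicted score comes out low despite user 2 loving it.
print(round(predict(0, 2), 2))
```

In the scenario Doug describes, the "ratings" would be implicit feedback from people playing with generated music, and the same machinery would steer the generative models toward what listeners respond to.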
So all we need to do is make that app go viral. Oh, okay. In fact, maybe someone in the Y Combinator world can help us do that. Exactly, yeah, right? Maybe that particular app is not the right answer now. I mean, I'm saying that as a joke, but look at it this way: if we can find a way, or the community in general can find a way, for machine-generated media to be out there for a large group of interested users to play with, I think we can learn from that signal and I think we can learn to improve.
And if we do, we'll make quite a nice contribution to machine learning. We will learn to improve based upon human feedback to generate something of interest, so it's a great goal. But, today in this room, I wish I could tell you we had a secret plan, you know, like, oh, he's figured it out, it's going to launch tomorrow. Yeah, it's really hard.
Okay, because I was wondering what kind of data you were getting back from artists. You know, do people just use all of your projects, all of the repos, to create things of their own interest? Are they pushing back valuable data to you? So we're getting some valuable data back, and some of the signals that we're getting back are giving us such an obvious direction for improvement. Like, why would I want to run a Python command to generate a thousand MIDI files? That's not what we do, you know? You get that kind of feedback and it's like, okay, we wanted this command-line version because we needed to be able to test some things.
But if musicians are really going to use the music part of what we're doing, we have to provide them with a more fluid, more useful tool. And there I think we're still sitting with so many obvious hard problems to solve, like integration with something like Ableton, or really solid real-time I/O, and things like that. We know what to work on, and I think we'll get to the point pretty quickly where we'll have something that kind of solves the obvious problems and plugs in reasonably well to your workflow, and you can start to generate some things and play with sound. Then we need to be much more careful about the questions we ask and how good we are at listening to how people use what we're doing.
And so, what are artists using it for right now? Most of what we've done so far has had to do with music. Yeah, but if we look for a second away from music and look at Sketch RNN, which is a model that learned to draw, we've actually seen quite a bit of activity. So first, at a higher level, Sketch RNN is a recurrent neural network trained on sketches to make sketches, and the sketches came from a game that Google released called Quick-Draw, where people had 20 seconds to draw something to try to win at Pictionary with a computer, you know, a classifier counterpart.
And so, you know, we trained a model that can generate new cats or dogs or whatever. There are some really cool classes in there. Cruise ship, yeah. The one that always threw me was camouflage. Like, it fails at camouflage all the time! Same, I never see it, because by definition you can't draw it. Yeah, yeah. And in 20 seconds, you couldn't do better, right?
Yeah, I actually won a Pictionary round with the word "white," and I just pointed at the paper. I'm like, no way! And she said white, and I'm like, you've got to be kidding! It's kind of like a corollary to camouflage. So we've seen artists start to sample from the model. We've seen artists using the model as a distance measure to look for weird examples, because the model has an idea of what's probable in the space. We've also seen artists just playing around with the raw data, and so there's been a nice explosion there.
I think, you know, I'm not expecting that artists really do a huge amount with this Quick-Draw data because, as cool as it is, these things were drawn in 20 seconds, right? That's kind of a limit to how much we can do with them. On the music side, we've had a number of people playing with NSynth with just, like, dumps of samples from the model. So, basically, a rudimentary synthesizer. And there, I've been surprised. I would expect that if you're really good at this, like you're Aphex Twin, or, how about this, you want to be Aphex Twin, right, that you look at this and go, yeah, whatever! There are 50 other tools that I have that I can use.
But those are the people that we've found have been the most interested, because I think we are generating some sounds that are new. I mean, sure, as someone pointed out on Hacker News, you can take a few oscillators and a noise generator and make something new. But I think these are new in a way, when you start sampling the space between, you know, a trombone and a flute or something like that, that captures some very nice harmonic properties. It captures some of the essence of the Brian Eno quote: they're kind of broken and glitchy and edgy in a way, but that glitchiness is not the same as you would get from, like, digital clipping. The glitchiness sounds really harmonic.
And so, for example, Jesse on our team built an Ableton plugin where you're listening to these notes but you're able to erase the beginnings of the notes. So you erase the onsets, which is usually where most of the information is. Like, most of the information in a piano note is kind of that first percussive onset. And it's the onsets that the model does such a great job of reproducing, because it gradually moves through the temporal embedding in time, and the noise kind of adds up as it goes.
So it's the tails of these notes that start to get ringy and, like, detuned, and you'll hear these rushes of noise come in, or there'll be this little weird glitch at the end. And so we've found that musicians who've actually played with sound a lot find these particular sounds immensely fascinating. They're interesting in a way that's hard to describe unless you've played with them. I think they're interesting because the model has been forced to capture some of the important sources of variance in real musical audio.
And even when it fails to reproduce all of them, when it fills in with confusion, so to speak, even that confusion is somehow driven by musical sound. You see the corollary if you look at something like Deep Dream: when models are sort of showing you what they've learned, it may not be what you expect from the world, but there's something kind of interesting about it, right?
Anyway, it's a long answer, but the short version is we found that working with very talented musicians has been really fruitful. And our challenge now is to be good enough at what we do, and make it easy enough and clean enough, that even someone who's not an Aphex Twin, and I'm not saying we worked with Aphex Twin, we didn't work with Aphex Twin, but, you know, that caliber of artist, that we can also say, hey, this is genuinely, musically engaging for a much, much larger audience. Not surprising. So, it's not necessarily generating melodies for people so much as it is generating interesting sounds. That's what's brought them in.
That's what's brought them in, though the parallel has existed for the sequence generation. Yes. And what I noticed even with AI Duet, which is this web-based thing. It's a simple RNN, and I can lay claim: it's technology that was published in 2002. It's really very simple. So, with this model, if your viewers haven't seen it, you play a little melody and then the model thinks for a minute, and the AI genius, which is an LSTM, comes back and plays something back to you, right? If you play Für Elise and you wait, you're expecting, maybe, that it'll continue the tune. It's not going to, right? It's going to go off and do its own thing.
Right. So, this idea of expecting the model to carry these long arcs of melody along, that's not really understanding the model. What we saw was that musicians who listen, especially jazz musicians, the game they play is to follow the model. And so we'd see people sit down, go, like, dum, dum, dum, dum, and just wait! It's almost like pinging the model with an impulse response: what's this thing going to do? And then instead of trying to drive it, it comes back and goes, dum, dum, dum, dum, right?
Yeah, and then the musician says, oh, I see, let's go up to the fifth, and then you get this really, it's almost like follow-the-leader, yeah, but you're following the model. And then it's super fun, and it's basically a challenge for the musician to try to understand how to play along with something that's so primitive, right? Right. But if you don't have the musical background... So, basically, it's the musician bringing all the skill to the table, right?
Yeah, so even with the primitive sequence generation stuff, it's still been interesting to see that it's musicians with a lot of musical talent, and particularly the ability to improvise and listen, that have managed to actually get what I would consider interesting results out of it. So, yeah, it's become more of a kind of call-and-response game than a tool.
Yeah, I think so. And that's partially because the model's pretty primitive. I think that if we can get the data pipelines in order so that we know what we're training on, and we can actually do some more modern sequence learning, you know, having generative adversarial feedback and things like that, we can do much better. We even have some stuff that we haven't released yet that I think is better. But, yeah, I think as we make it better, it'll be more of a "this model's going to give me some more ideas from what I've done."
Right now, it's more of a "this model is kind of weird, but I'm going to try to understand what it's doing." Both are fun modes, by the way. Yeah, they're both cool modes, right? Yeah, I mean, I've enjoyed it. I'm definitely not a pianist. I've played guitar before, and I tried to get a song going, but I had trouble with it.
Yeah, I think it's mostly my fault, to be honest. Blame the user, right? I love the YouTube video. Yeah, there was a video somewhere where a guy played a song with it. Mmm, that was amazing. Yeah, that was cool. It was very cool. Have you seen a lot of that stuff?
Well, yeah, we've seen some. We haven't pushed the sequence generation stuff much because we really wanted to focus on timbre. But we have released things and kind of tried to show people what's there. There's a Magenta discussion list, it's linked from g.co/magenta, which is as flamey and spammy as some discussion lists, but a little less so. Every couple of weeks, someone will put up some stuff they composed with Magenta, and usually it's more effective if they've layered their own stuff on top of it, or they've taken time offline rather than generating in performance, but some of the stuff is actually quite good.
I guess it's a start. Yeah, I mean, I think it's great. And so, you compared it to the work you did in 2002. Where has LSTM gone since then? You talked about how you ended up doing this project; I saw in your talk that you kind of failed at it a while ago. Failure is good. Yeah, yeah!
So, there was a point in time when I was at a lab called the Dalle Molle Institute for Artificial Intelligence, and I was working for Jürgen Schmidhuber, who's one of the co-creators of LSTM; he was the adviser to Sepp Hochreiter, who did the original work on LSTM. And there was a point in time when there were three of us in a room in Manno, Switzerland, which is a suburb of Lugano, who were the only people in the world using LSTM. It was myself, Felix Gers, and Alex Graves. Among the three of us, by far, Alex Graves has done the most with LSTM. After he finished his PhD, he continued doggedly to try to understand how recurrent neural networks worked, how to train them, and how to make them useful for sequence learning.
I think more than anybody else in the world, including the people who created LSTM, Alex just stuck with it and finally started to get wins, you know, in speech and language. Right? And I more or less put down LSTM; I started working with audio, audio statistics, and other more cognitively driven music stuff at the University of Montreal. But it worked, finally. Right? And there's this thing in music, the 20-year overnight success, right? Something like that: it worked because he stuck with it, and now, of course, it's become sort of the touchstone for recurrent models in time series analysis.
Some version of it forms the core of what we're doing with translation. I mean, these models have changed, right? They've evolved over time, but basically, recurrent neural networks as a family of models are around because of that effort. It's interesting, right? There really were three of us, and Felix went on with his life and I went on with my life and Alex stuck with it, which is kind of really one person carrying it forward. But you may get letters from people saying, hey, wait, you forgot about me!
You forgot about me! This is a little bit reductionist. Obviously, there were more, but I felt that way at the time, right? What was the breakthrough, then, that got people interested? I think it was the same breakthrough that got people interested in deep neural networks and convolutional neural networks. These models don't work that well with small training sets and small models. They're data-absorptive, meaning that they can absorb lots of data if they have it.
And, you know, neural networks as a class are really good with high-dimensional data. And so, as machines got faster and memory got bigger, they started to work. We were working with really small machines, with LSTMs that maybe had 50 to 100 hidden units and a couple of gates to control them, and trying things that had to do with the dynamics of how these things can count and how they can follow time series. Then you try to scale that to speech, you know, speech recognition was one of the first things.
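For readers who haven't seen inside an LSTM, here is a minimal single step of a cell in the spirit of those early small models. The weights are random stand-ins and biases are omitted for brevity; a real implementation learns these parameters by backpropagation through time. The point is the gating Doug mentions: separate gates decide what the memory cell forgets, what it writes, and what it reveals.

```python
import numpy as np

rng = np.random.default_rng(1)
H, X = 4, 3                         # 4 hidden units, 3 input features

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate plus the candidate write (biases omitted).
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(H, H + X)) for _ in range(4))

def lstm_step(x, h, c):
    """One time step: gates control the memory cell `c`."""
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z)                   # forget gate: what to erase
    i = sigmoid(Wi @ z)                   # input gate: what to write
    o = sigmoid(Wo @ z)                   # output gate: what to reveal
    c_new = f * c + i * np.tanh(Wc @ z)   # gated memory update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

h, c = np.zeros(H), np.zeros(H)
for t in range(5):                  # run a short random sequence through it
    h, c = lstm_step(rng.normal(size=X), h, c)
print(h.shape)
```

The additive update to `c` is what lets gradients survive over long sequences, which is why these cells scaled up once machines and data did.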
It's really hard to do. So, I think a lot of this is just due to having faster machines and more memory; that would be it. Surprising, in a way. Yeah, I think it surprises everybody a little bit. Now the running joke, having coffee here at Brain, is sort of: what other technology from the '80s should we rescue? AI's back, exactly, right?
How far have you pushed LSTM? Like, obviously, there's some amount of text generation that people are trying out. Have you let it create an entire song? No, we haven't, because we haven't got the conditional part of it right yet. I think with LSTM in its most vanilla form, everybody's pretty convinced that it's not going to handle really long time scales, hierarchical patterning.
And I'd love it if someone comes along and says, no, you don't need anything but vanilla LSTM to do this. But what makes music interesting over, you know, even five seconds or ten seconds, is this idea that you're getting repetition, you're getting harmonic shifts like chord changes. They're there, right? And one way to talk about how they're there is that you have some lower-level pattern generation going on, but there's some conditioning happening: oh, now continue generating, but the condition has shifted; we've shifted chords, for example.
And so I think if we start talking about conditional models, if we talk about models that are explicitly hierarchical, if we talk about models that we can sample from in different ways, we can start to get somewhere. But a recurrent neural network alone, it would be reductionist to say that it's the whole answer, and in fact it's not the whole answer.
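The conditioning idea can be sketched very simply. The chord-to-note weight tables below are invented for illustration, and a real conditional model would learn them rather than look them up, but the structure is the one described: a low-level generator keeps emitting notes, while an external condition, the current chord, shifts which notes are probable.

```python
import random

# Invented weights: how likely each note is under each chord condition.
NOTE_WEIGHTS = {
    "C":  {"C": 5, "E": 4, "G": 4, "D": 1, "A": 1},
    "G7": {"G": 5, "B": 4, "F": 4, "D": 2, "C": 1},
}

def sample_note(chord, rng):
    """Generate one note, conditioned on the current chord."""
    notes, weights = zip(*NOTE_WEIGHTS[chord].items())
    return rng.choices(notes, weights=weights)[0]

rng = random.Random(42)
progression = ["C", "C", "G7", "G7", "C"]      # the condition shifts over time
melody = [sample_note(chord, rng) for chord in progression]
print(melody)
```

The generator never has to learn the long-range structure itself; the chord progression carries it, which is the division of labor a hierarchical or conditional model aims for.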
Hmm, I was thinking about, was it the TensorFlow talk or the I/O talk where you were talking about Bach? Oh, probably; that was where we did stuff that was, like, more Bach than Bach. Yeah, we nailed it. And I think you could start making things that are more palatable. It's like saying, I'll make the best Picasso painting for you, but that's not necessarily a Picasso painting. Correct. It's hard to say any of this precisely right.
So, I think by analogy. First, in case it's not clear, I don't believe that we made something that was better than Bach, but when we put these tunes out for untrained listeners to listen to, they at times voted them as sounding more like Bach. And imagine what these models are learning, right? They're learning kind of the principal axes of variance; they're learning what's most important. They have to, right? Because they have a limited memory; they're compressed.
So if you sample from Sketch RNN with very low temperature, meaning without a lot of noise in the system, you actually get what, if you want to squint your eyes and wax philosophical, is the Platonic cat. You know, you get the cat that looks most like a cat, as if you asked everyone to draw sort of the average cat. And I think that's sort of what we're getting from these time series models as well: they're kind of giving you something that's more a caricature than a sample, right?
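Temperature sampling is easy to show concretely. The logits below are made up, but the mechanism is the standard one: divide the logits by a temperature before the softmax. Low temperature concentrates probability mass on the likeliest outcome (the "Platonic cat"); high temperature flattens the distribution and adds variety.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5, 0.1]         # invented scores for four outcomes
cold = softmax_with_temperature(logits, 0.1)   # near-deterministic
hot = softmax_with_temperature(logits, 2.0)    # close to uniform
print(cold.round(3), hot.round(3))
```

Sampling repeatedly from `cold` gives you nearly the same "average" outcome every time; sampling from `hot` is where the noisy, surprising variants come from.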
So then, in the creation of art, what are you predicting is going to happen as Magenta progresses? I tend to make predictions on a timeframe of like 20 to 40 years, so no one will ever be able to check them. Joking aside, I do believe that machine learning and AI will continue to become part of the toolkit for communicating and for expression, including art. And I think it's healthy for me to admit that those of us who are doing this engineering won't, almost by definition, know where it's going to go.
We can't, and we shouldn't know where it's going to go. I think our job is to build smart tools. At the same time, I want to point out, some people find that answer boring; it's hedging. But there are directions I can imagine we could go that would be really cool. For example, thinking of literature: I think plot is really interesting in stories.
And you can imagine that we have, as humans, a particular set of cognitive constraints, kind of limitations in how we draw plots out as an author. You're not going to do it in one pass left to right like a recurrent neural network; you're going to be sketching out the plot: do we kill this character off? And I can imagine that generative models might be able to generate plots that are really, really difficult for people to generate but still make sense to us as readers.
Right? Think of it this way: I think jokes are hard because it's really hard to generate the surprising turns, where you go in one direction and you land over here, but it still makes sense. And I can imagine that the right kind of language model might be able to generate jokes that are super, super funny to us and that actually might have a flavor of being, like, yeah, I know this joke must have been machine-generated, because it fits together in so many different ways, ways a human couldn't have managed, but it's super funny to us.
Like, I don't know how to do that, but I could totally imagine that we would be in a world where we get that. I thought about it in the complete opposite way, but that makes sense. I was thinking about training it to create pulp fiction. Like, that'd be so simple in my mind, like, oh my gosh, create these airport novels. They can just bang out plots. I mean, that's probably where we'll start.
I would love it, and so everybody listening or watching understands: we can't generate a coherent paragraph. Right? And by we I mean Magenta, and really, kind of, we, humanity. We can't write at all. It's really hard! And I think it all hinges on structure at some level. Nested structure, whether it's music or, in art, maybe it's geometry or color or something else, and meaning, you know, there's nested structure somewhere.
And has the art world, or, I guess, any kind of artist, any kind of creator, have people pushed back in the way that they're scared? I imagine when photography came out, everyone was pushing back, saying this might end painting, because photography captures the exact moment. But it ended up changing painting, because people realized that painting wasn't just about capturing an exact moment.
Certainly in the generative art world we've seen some of that. Another researcher in London, Bob Sturm, who generates folk songs with an LSTM, got a tweet that was like, what you're doing is bad for humanity. And it was like, really? He's making new folk songs; that's probably not bad for humanity! But what I love about that is, you know, it's okay if a bunch of people don't like it.
And in fact, if art is interesting, it's never the case that everybody likes it, right? Or it's really boring, right? So you have this idea that if you want to really engage with people, you're probably going to find an audience, and that audience is going to be some slice. Frankly, it's probably going to be some slice coming up from the next generation, people who have experienced technology and take some things for granted that are still novel to someone like myself, right? But it's okay if a bunch of people don't like it!
Yeah, because, well, when we were talking before, I was surprised that you hadn't gotten more pushback. It seems like most people in our world are just like, eh, okay, whatever, do your thing! It's kind of opening up new territory rather than challenging anyone. I think the pushback I've gotten has been in terms of questions, and I think, as a community, in Google and outside of Google and outside of Magenta, people are really clear that what's interesting about a project like this is that it be a tool, not a replacement!
And I think if we presented this as, you know, push this button and you'll get your finished music, it would be a very different story, but that's kind of boring. It's funny you mentioned Hacker News, because I was talking with Scott, one of the moderators. The comments there are impersonal, so it's easy to critique people. But he was wondering if you guys were concerned with the actual cathartic feeling of creating music, or if that's just something you don't even consider right now.
I mean, as people, of course we have! And I think there's a couple of levels there. I think you lose that if all you're doing is pushing a button, and I think, this is where the drum machine is such a great thing to fall back on: it is just not fun to push the button and make the drum machine do its canned patterns.
And I think, from my sense of the reading that I've done, that was the goal: this will make it really easy, right? But what makes the drum machine interesting is people working with it, writing their own loops or their own patterns, changing it, working against it, working with it. And so, I think this project loses its interest if we don't have people getting that cathartic release, which, believe me, I understand what you mean. That's thing one. The other thing I'd mention is, if there's anything that we're not getting that I wish we were getting more of, it's creative people, people coding creatively.
Right, we talk about creative coding in a hand-wavy sense, but I would love to have the right kind of mix of models in Magenta, and in open source linking to other projects, so that you as a coder could come in and actually say, I'm going to code for the evening and add some things, I'm going to maybe retrain, I'm going to maybe hack the data, and I'm going to get the effect that I want, and part of what you're doing is being an artist by coding, sure!
And I think we haven't hit that yet in Magenta, and I'd love to get feedback from anyone on ways to get there. The point is, there's a certain catharsis for those of us who train the models. If you just push the button you'll be bored, but it feels good for me to push it, because I'm the one who made that button work, you know? So there's a creative act in its own right!
And have people been creatively breaking the code? Like, oh, it would be funny if it did this, or interesting if it did that? A few, yeah. Though most open-source projects need to be rewritten a couple of times, and we're on our second rewrite. The thing is, if the code is brittle enough that it's easy to break accidentally, then it's hard to break it creatively.
And listen, I'm being pretty critical. I'm actually really proud of the quality of the Magenta open-source effort. I think we have, you know, well-tested, well-thought-out code. It's just a really hard problem, writing code for art and music, and if you get it a little bit wrong, it's wrong enough that you have to fix it.
So, you know, we still have a lot of work to do. So then where does that creative coder world go? I've seen a lot of people concerned with, even just, preservation; I think Rhizome is working on preserving digital art objects. What direction do you think this goes in? Presumably a number of cool directions in parallel. The one that interests me personally the most is reinforcement learning, and this idea of tuning models after they're trained.
So, there's a long story or a short story, which one do you want? Long! Okay, yeah, it's not that long, honestly. So, generative models 101: you start generating from a model that's trained just to be able to regenerate the data it was trained on. You tend to get output that's blurry, right? Or just kind of wander-y, and that's because all the model learns to do is sit somewhere high on the distribution. Imagine the distribution as a mountain range, and the model just camps out on the high peaks, you could say.
Yeah, it plays it safe. All t-shirts are gray if you're colorizing, because that's safe! You're not going to get punished. And, you know, one revolution that came along, thanks to Ian Goodfellow, is this idea of a generative adversarial network. It's a different cost for the model to minimize, where the model is actually trying to create counterfeits that fool a critic, so it's forced to not just play it safe, right? I don't know if this is too technical. It's very interesting to me!
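That "playing it safe" behavior is easy to demonstrate numerically. Here's a toy sketch (mine, not Magenta code): a one-parameter model fit with plain mean squared error on two-mode data lands on the average of the modes, the numeric version of every t-shirt coming out gray.

```python
import numpy as np

# Toy illustration of the "gray t-shirt" effect: if the true data has two
# modes (red shirts at 0.0, blue shirts at 1.0), a model trained to minimize
# mean squared error settles on the average, a gray 0.5, because splitting
# the difference is the "safe" answer under that loss.
data = np.concatenate([np.zeros(500), np.ones(500)])  # two sharp modes

pred = 0.9                  # one-parameter "model": a single predicted value
for _ in range(1000):
    grad = 2 * np.mean(pred - data)   # gradient of the MSE loss
    pred -= 0.01 * grad

print(round(pred, 2))  # lands between the modes, matching neither
```

The model never produces a sample that looks like either mode of the data; that is exactly the blurriness Doug is describing.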
Oh, yeah! This was part of the talk, right? Where you cut out the square? Yeah, exactly. So another way to do this is to use reinforcement learning. It's slower to train, because all you have is a single number, a scalar reward, instead of a whole gradient flowing the way it does in GANs. But it's also more flexible. Okay, so my story here is that GANs are part of a larger family of models that are, at some level, critics.
Everybody needs a critic. They're pushing back, pushing you out of your safe spot, and that's helping you do a better job of generating. We have a particular idea that you can use reinforcement learning to provide reward for following a certain set of rules, a certain set of heuristics. Okay? And normally, if you mention rules at a machine learning dinner party, everybody looks at you funny, right? Like they're stepping backward.
Yeah, as soon as rules come up, it's: we don't use rules. But the point is we're not building the rules into the model. The machine learning isn't the rules; the rules are out there in the world, and you get rewarded for following them. And we had, I thought, some very nice generated samples of music that were pretty boring with the plain LSTM network. But when the LSTM network was additionally trained, using a kind of reinforcement learning called deep Q-learning, to follow some of these rules, the generation got way different and way better.
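To make the "rules outside the model" idea concrete, here's a minimal sketch in Python. The specific rules, weights, and function names are my own invention for illustration; Magenta's actual deep Q-learning setup is much richer, but the shape is the same: the model proposes, and an external reward function scores.

```python
# Hedged sketch: rules live *outside* the model, as a reward function
# applied to generated notes. Rule choices and weights here are invented.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes in the key of C major

def rule_reward(melody):
    """Score a melody (a list of MIDI pitches) against simple heuristics."""
    reward = 0.0
    for prev, note in zip(melody, melody[1:]):
        if abs(note - prev) > 7:      # penalize leaps wider than a fifth
            reward -= 1.0
        if note % 12 in C_MAJOR:      # reward staying in the key
            reward += 0.5
    return reward

def combined_reward(melody, base_log_prob, rule_weight=0.5):
    """Blend the pretrained model's log-probability (so output still
    resembles the training data) with the external rule reward."""
    return base_log_prob + rule_weight * rule_reward(melody)

smooth = [60, 62, 64, 65, 67]   # stepwise, in C major
jumpy  = [60, 73, 58, 71, 49]   # wild leaps, mostly out of key
print(rule_reward(smooth) > rule_reward(jumpy))  # True
```

The key design point is that swapping in a different `rule_reward` requires no change to the model at all, which is what makes the approach attractive for creative coding.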
And specifically, what were the rules? The rules were rules of composition for counterpoint from the 1800s; they were fairly simple. Now, we don't really care about those particular rules, okay? But there's a really nice creative coding aspect here. Think of it this way: I have a ton of data and a generative model trained on it, whatever it may be. Maybe it's trained to draw, maybe it's trained for music, and that model has tried to disentangle all the sources of variance sitting in that data.
And so it can smartly generate new things. But now, as long as I can write a bit of code that takes a sample from the model and evaluates it, providing a scalar reward, then whatever I stuff into that evaluator, I can get the generator to try to do a better job of generating stuff that makes that evaluator happy. It doesn't have to be 18th-century rules of counterpoint, right?
So you could imagine taking something like Sketch RNN and adding a reinforcement learning reward that says, I really hate straight lines! And suddenly the model is going to try to learn to draw cats without straight lines. The data is telling it to draw cats. Sometimes the cats have triangular ears with straight lines, but the model's going to get rewarded for drawing the cats it can draw without using straight lines, right?
And straight lines was just one constraint I picked off the top of my head. It has to be a constraint you can measure in the output of the model. But, musically speaking, if I could come up with an evaluator that described what I mean in my mind by shimmery, really fast-changing, small local changes, I should be able to get a kind of music that sounds shimmery by adding that reward to an existing model.
And furthermore, the model still retains the kind of nice realness it gets from being trained on data. I'm not trying to come up with a rule that generates shimmery; I'm trying to come up with a rule that rewards a model for generating something shimmery. Clear?
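One way to "tilt" a generator using nothing but a scalar reward is the REINFORCE policy-gradient trick. This is a toy sketch under my own assumptions: the "generator" is just a softmax over melodic intervals, and the stand-in "shimmery" evaluator rewards small-but-nonzero steps; real models and evaluators would be far richer.

```python
import numpy as np

# Toy REINFORCE loop for "tilt a generator with a scalar reward".
# The generator is a softmax over melodic intervals; the evaluator
# (my stand-in for "shimmery") rewards small-but-nonzero steps.
rng = np.random.default_rng(0)
INTERVALS = np.arange(-5, 6)          # candidate steps between notes
logits = np.zeros(len(INTERVALS))     # uniform "pretrained" policy

def shimmer(step):
    return 1.0 if 1 <= abs(step) <= 2 else 0.0  # evaluator: scalar reward

def small_step_prob(lg):
    p = np.exp(lg - lg.max()); p /= p.sum()
    return p[(np.abs(INTERVALS) >= 1) & (np.abs(INTERVALS) <= 2)].sum()

before = small_step_prob(logits)
for _ in range(2000):
    p = np.exp(logits - logits.max()); p /= p.sum()
    a = rng.choice(len(INTERVALS), p=p)  # sample an interval
    r = shimmer(INTERVALS[a])
    grad = -p; grad[a] += 1.0            # grad of log p(a) w.r.t. logits
    logits += 0.1 * r * grad             # REINFORCE update

print(round(before, 2), round(small_step_prob(logits), 2))
```

Note that the policy may end up concentrating on just one or two of the rewarded intervals; for this illustration that's fine, since they all sit inside the rewarded set. The takeaway is that only `shimmer` encodes the aesthetic goal; the learning loop never changes.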
Yeah, it's very different, right? So I think that's one really interesting direction: opening up the ability where, if you can generate a scalar reward and drop it in this box over here, we'll take a model that's already trained on data and tilt it to do what you want it to do. And that kind of underlies a fear that people have, right? Which is, what happens when you can create the best pop song?
And what do people do? Do you have thoughts on, A, is that possible? And B, what would the world look like if that comes to be? Well, I used to have an algorithm for finding the best pop song, for me, back when we used to sell used CDs. The trade was usually like two-to-one, okay? So if you have a thousand CDs, you trade them in, you come home with five hundred that you like better, and you just keep going, right?
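For what it's worth, that used-CD procedure really is a little hill-climbing algorithm: repeatedly keep the half you like best, and the average quality of the collection can only go up. A throwaway sketch, with "how much you like a CD" reduced to a random number purely for illustration:

```python
import numpy as np

# Toy version of the used-CD "algorithm": trade in two-for-one, keep the
# half you like best, repeat until one favorite remains.
rng = np.random.default_rng(0)
collection = rng.random(1000)   # how much you like each of 1000 CDs

averages = [collection.mean()]
while len(collection) >= 2:
    collection = np.sort(collection)[len(collection) // 2:]  # keep better half
    averages.append(collection.mean())

print(averages[0] < averages[-1])  # True: every trade climbs toward favorites
```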
Until you finally get down to that one: it's hill climbing in that space! Yeah. So I'm not sure; part of me wants to say people love the rawness and the variety of things that aren't predictable pop. But let's face it, people love pop music! And most of your listeners or viewers are probably the same: there's pop that we love, you know?
Like, I love the poppiest of Frank Ocean's music; I can listen to it forever! But then there's the gutter pop too, yeah? And you can't separate the two just by which artist played the big festival. So I guess that kind of un-asks the question of the perfect pop song; pop is such a broad thing, yeah?
But, yeah, here's another way to look at it: with machine learning and AI at the table, some things that used to be hard will be easy, right? And so we'll offload all of that, and if people are happy just listening to the stuff that's now easy, then, yeah, problem solved! We'll be able to generate lots of it. But what people tend to do is go look for something else hard, right?
It's like the drum machine argument! You solved the metronomic beat problem, and then what you actually find is that artists who are really good at this play off of it. When they sing, they can do many more rhythmic things than they could before, because now they have this steady scaffolding they didn't have to work for, and they constantly break it, right?
Same as soon as you had the electric guitar. But I hope that's an honest answer to your question. I mean, your question was a different flavor: hey, are we really moving towards a world where we're going to generate the perfect pop song?
Yeah, I don't know. I don't think so; I don't feel like that's going to happen. But, you know, maybe it happens, and as soon as we realize it, it's like, okay, this is how we're going to break it; this is how we're going to retrain ourselves. Yeah, but it can learn so fast that it's like, okay, now I can do that too!
Yeah, exactly, it can do that too! So then, what I was wondering is: in the next handful of years, is there a Holy Grail you're working toward for Magenta? Like, okay, now we've hit it, this is the benchmark we're going for?
There are a couple of things I'd love to do. I think composing, creating long-form pieces, whether they're music or art, is something we want to do. And this hints at the idea of not just having things that make sense at the scale of twenty seconds of music, but that actually say something more. That direction is really interesting because, at the very least, it would be more interesting when you push the button and listen to it!
But it also leads to tools where composers can offload very interesting things. Like, some people, and I'm one of them, are really obsessed with expressive timing, really obsessed with musical texture. Okay, I don't know what that is!
Oh, no, I just mean, let's say you're playing the piano... Oh, I know! Like in that great talk at Gray Area, the art space, where you contrasted piano performances. Oh, yeah! You did your homework! Yeah, yeah!
If you listen to someone play a waltz, it'll have a little lilt to it. Or take some of my favorite musicians, like Thelonious Monk, if you're familiar. If you're not familiar, do your homework, go listen to him. He played piano with a very, very specific style that almost sounded lurching sometimes.
He really cared about time, and so if the way you're thinking about music and composition is really caring about that kind of local stuff, right, it would be very, very interesting to have a model that handles for you some of the decisions you'd make about longer-time-scale things, like when chord changes happen, right?
Usually it's the other way around: you have these machine learning models that can handle local texture, but you have to decide the structure. So, yeah, my point is, if we get to models that can handle longer structure and nested structure, we'll have a lot more ways to decide what we want to work on ourselves versus what we have the machine help us with.
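Mechanically, "expressive timing" just means nudging notes off the quantized grid. Here's a toy sketch, with a made-up helper name and a made-up offset amount: Viennese waltz players famously anticipate beat two, so we shift it slightly early and leave beats one and three alone.

```python
# A sketch of what "expressive timing" means mechanically: the score is a
# grid, the performance nudges notes off it. The push amount is invented.
def humanize_waltz(onsets, beat=0.5, push=0.06):
    """Shift beat 2 of each 3/4 bar slightly early; leave beats 1 and 3."""
    out = []
    for t in onsets:
        beat_in_bar = round(t / beat) % 3
        out.append(t - push if beat_in_bar == 1 else t)
    return out

grid = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]   # two quantized 3/4 bars, in seconds
print(humanize_waltz(grid))
```

A model that handles long-range structure would let the human keep owning transformations like this one at the local level, which is exactly the division of labor Doug is describing.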
Right. So, has this affected your creative work? Do you still compose? Yeah, so, I'm working here at Google, and there's a coal mine's worth of work to do on this project, Magenta, like, every day now! All joking aside, though.
Yeah, no, basically, I've been using music as more of a catharsis, a relaxation thing. I don't feel like I've personally done anything recently that I'd consider creative at a level I want to share with someone else. It's been more like jamming with friends, you know, throw-away compositions.
Like, here are ten chords that sound good, let's jam over them for the evening, and then we don't even remember them the next day! And I'm really trying hard to understand this creative coding thing myself. A lot of it is just, I'll start and then I'll get distracted! But yeah, that's about the level of my creative output, I'm afraid.
Well, in the creative coding vein, it seems like so many people are looking for it in every venue, and it's so difficult to find people; they're just kind of one-offs now. Yeah, I think that's right! It's so hard to get right. I think maybe we need the GarageBand of this: something so well put together that it makes it easy for a whole generation of people to jump in and try it.
Yes, even people who haven't had, you know, four or five years of Python experience. I didn't know if that's what you were alluding to; the command line, where it dumps MIDI files, is obviously not the way to do it. But now it's an API, right? Yeah!
Like, what's the next step there? Yeah, the obvious one: try to make it more usable and more expressive, right? Expressivity is hard in an API! Right? It's so hard to get it right!
And I think it almost always takes multiple passes. The core API that lets us move music around in real time over MIDI, and actually have a meaningful conversation between an AI model and multiple musicians, is there; there's just a bunch more thinking that needs to happen, right, to get it right.
Cool! So if someone wants to become a creative coder or wants to learn more about you guys, what would you advise them to check out? I would say the call to action for us is to visit our site. The shortest URL is g.co/magenta. Okay, interesting!
It's also magenta.tensorflow.org, where you can have a look around; we have some open issues, and a bunch of code that you can install on your computer and hopefully make work! And, you know, we want feedback. We have a pretty active discussion list, and that's what we follow most closely.
We're game for philosophical discussions, and game for technical discussions. And beyond that, we're just going to keep rolling: keep doing research and keep trying to build this community.
Okay, great! Thanks, man! Sure, it was fun!