Transitioning from Academia to Data Science - Jake Klamka with Kevin Hale
So Kevin, for those of our listeners that don't know who you are, what's your deal?
I'm a partner here at Y Combinator. I was actually in the second-ever batch—Winter 2006—and I founded a company called Wufoo, ran that for five years, and then we were acquired by SurveyMonkey. That moved us from Florida to California, and that's when PG asked if I'd be interested in helping out at YC. I've been here pretty much ever since.
Yeah, and you suggested Jake as a guest for this episode. So Jake, what do you do?
I'm the founder and CEO of Insight. Insight is an education company. We run fellowships that help scientists and engineers transition to careers in data science and AI. It's a pretty unique model because the fellowships are completely free. They're full-time, and the companies fund the process—scientists and engineers build projects for four to seven weeks, meet top data teams, and get hired onto those teams. We've got over 2,000 Insight alumni working as data scientists now across the U.S. and Canada.
Nice! And you haven't always been working on this—you applied to YC for the Winter 2011 batch?
That's right, yeah.
And what was your idea then?
So this goes back to how I started my career—and it's relevant to why I started Insight, because Insight is basically what I wish I'd had access to back then. I was a physicist at the University of Toronto. I thought I was going to be a scientist for the rest of my life, and then partway through my PhD I realized I wanted to go into technology. I thought to myself, I'm writing code, I'm building machine learning models—this is great, I've got what I need. And frankly, it took me a long time to transition. Eventually I got into Y Combinator and came down here for the Winter 2011 batch. I was building a bunch of machine learning-enabled mobile productivity apps that didn't quite get the up-and-to-the-right uptake you hope for after YC, but it was an incredible experience.
In late 2011—call it six to twelve months after YC—I was searching for a new idea. I went and spoke with Paul Graham and a few other advisors, and the recommendation was to work on a problem you yourself have. I had been building these apps, trying to use machine learning models and hoping somebody out there had that as a problem. But flip it around: start with a problem, then figure out what the solution is. When I reflected on it, it had taken me a few years to really make this transition. I'd been so close all along, but I didn't know product, and I wasn't really connected in the Valley. Technically I had the fundamentals, but a lot of the tool sets in industry were different.
So I didn't know what I didn't know. When I got down here and started talking to people, that's when I finally started figuring it out—and I saw a lot of my friends having that same struggle.
Brilliant mathematicians, neuroscientists, biologists, and engineers all felt the same thing—the desire to go into data science, AI, and other cutting-edge fields—yet they felt their resumes didn't say the right things. It's really hard to bridge that last mile. I thought, okay, this is a problem I want to solve, because these are some of the most brilliant people I'd ever worked with; a lot of them were my former colleagues from physics. So I thought, what does the solution look like?
At first I was focused on it being an app again, right? Something machine learning-enabled. Then I realized, no, it actually probably looks more like an in-person program where folks get together, build cool projects, and get started from there.
So did you just go ahead and teach a class?
Yeah, I basically started by talking. I talked to companies and said, listen, I've got these brilliant friends coming out of academia who I think you should be hiring—why aren't you hiring them? What they told me was: I know they're brilliant, I know they've got all these great skills, but they're probably one to two months away from where I need them to be. If I had full days to mentor them for a month or two, they'd be incredible data scientists. But I don't have a month or two to mentor them, so I say no in the interview, right?
And so I'm like, why not make Insight that month or two—where folks fill in this last piece of the puzzle, learn the cutting-edge techniques, and then let's bring those data scientists into the room and get them hired. I just jumped right in and ran the first session. That first session was just me, and the focus was on PhDs.
So that first group was in 2012?
That's right. And I had to go beyond my friends—at first, I started by talking to my friends in academia. I already knew they were looking for jobs and were excited about transitioning, and I got confirmation of that from them. Then I got confirmation from the hiring managers: listen, we're hiring, we can't find folks with the full skill set; if you bring them into a room, we'll come look at them. The rest comes back to what you mentioned—that I didn't know what I didn't know. By that time I'd spent about three years figuring it out, including doing YC, meeting so many data scientists, and building a bunch of data products, so I knew what the pieces were. But really, the program wasn't focused on me teaching the fellows; it was focused on me bringing in the leading data scientists of the time and connecting them directly with the fellows.
So we had Facebook, LinkedIn, Twitter, Square—all the early data teams in 2012—their heads of data science came in. They could only commit a few hours, though. They'd say, okay, I'll come in for a few hours, but I don't have two months. And I'm like, well, if I have a bunch of you each come in for a few hours, plus these folks working away for a couple of months, learning from each other and from these mentors, that adds up. And once we had alumni, it was incredible—we had all these alumni coming in to help.
So that first cycle was eight fellows. How many of them got jobs?
Pretty much all of them—yeah, 100%. One went to Facebook, one to Square, one to LinkedIn, one to Twitter. At the time, those were—and still are—the top data science positions. It was a clear success, but it was super stressful. I didn't know the model, I hadn't figured it out; it was crazy.
I mean, what mistakes did you make? Was that first class kind of a shit show?
Of course—in the sense that it was the first time I was doing it. A career transition is always stressful; whenever people do Insight, they're stressed. But at least now there's a track record, and we have things baked in pretty well. At that time the overall idea was there, but a lot of the details weren't, and frankly the track record wasn't. So a lot of these folks were like, what have I done? I'm in a room with this guy who's never done this before. There was a lot of stress around, is this weird model even gonna work?
But you made it work. What was that like? What made those eight students believe, right? Were they desperate? Or were you just great at sales?
No, no, I think they were genuinely excited. I got way more applications than I expected—there was real demand to get into the field. I didn't have a track record, but I basically went around to these universities and said, I'm gonna have the heads of data science from Facebook, LinkedIn, Twitter, all these companies coming in, and you're gonna meet them. That was enough to fill the roster.
Yeah, and my interview process really centered on how excited you were about this. For folks who were like, I really don't want to do this but I need a plan B—no thank you, right? It was the people who said to me, I love my work as a scientist, but I really want to have a more applied impact in the world; I'm excited about what I'm seeing here; here's what I think I could do. Those were the folks I'd take into the program.
Totally makes sense—starting off by qualifying the lead. It's a technique you see a lot of startups use now, like Superhuman for example.
Yeah—heavily qualifying a lead before they'll even let them access the product, so that the time spent is with someone who's going to have a spectacular outcome.
That's what this is. How did you know which hiring managers wanted to come in? How did you figure out which students were going to be most excited about this? What did you ask them?
Yeah, I had some opinions, but really what I did is go to these early heads of data science teams and ask, what do you look for? They'd list off some technical skills—kind of a laundry list: they need to know SQL, they need to know Python, they need to know… And you're like, okay, but what would really clinch it for you?
What would make you say, I want this person? And there were two things, always. First: they have a side project. Their eyes would light up—oh, they had a side project? If you send me a URL, oh my God, I know they're excited! That's where the idea came from: this isn't about more classes—these folks have been through enough classes—it's actually about building something. Creating something that proves: I've got this great background, and now I'm doing the last piece of the puzzle to show you I can do something relevant in this area.
The second thing they wanted—and I think this is where the project really shows it—was just overall curiosity. Honestly, I thought they weren't being serious, because I was like, you say you want curiosity, but really you just want somebody who's good at SQL or good at machine learning, right? But it proved to be true. The people they would hire were the ones who'd say, hey, I studied astrophysics, but in my spare time I was dabbling with genomics; then I got into machine learning on the side and built this fun project that predicts where I should go camping, because I'm a big camper. You take a person like that—that's the kind of person these teams wanted, and still want, because these problems are so open-ended.
People who are curious don't get blocked as much, right? They're willing to try things. And it's such a new field. At most of the companies our fellows are getting hired into, it's not "we know what we need you to do; just do it." It's "what can we even do here? What kind of impact can our data have?"
How did you test for that curiosity? Like, I think the project seems like, okay, that’s something we have to shoot for. But again, it’s like, how did you know that these were the right eight people?
Well, it was a lot of trial and error. I would do twelve-plus interviews a day, and you get to know folks. But the main signal I saw—like that example I gave—was that people would be almost apologetic. They'd say, listen, what I'm about to tell you is not part of my usual work; it's on the side. And I'm like, no, no, that's exactly what I want to hear about.
I remember one of the fellows—she was in an early session, a mathematician at Berkeley—had done this incredible analysis, this really cool data analysis project, I think on flight times or something in sports; I can't quite remember. Partway through the interview, I'm like, but you're a mathematician—don't you do pencil-and-paper math?
And she's like, yeah, I know this isn't part of what I do—it's not even part of my research—and she almost felt apologetic. I was like, this is who I want as a fellow, right? A brilliant mathematician doing incredible work who can also, on the side, on a weekend, quickly pick up Python and make something useful. She went to Facebook after the program and has been super successful ever since.
So this is related to one of the overarching questions we had for you: how can people get into data science, and what are the pitfalls for people who, say, have a PhD? They know Python; they're at a higher level than a coding bootcamp person. What mistakes do they make when they're trying to bridge that gap and get into a data science role, provided they didn't do your program?
Yeah, absolutely. And we see it because—I started with scientists, and now we also have programs for engineers transitioning to machine learning engineering and deep learning research—you see very similar problems on both sides, which is that folks are extremely focused on the technical side. "Let me get the algorithmic knowledge down, let me know every last algorithm"—which of course you need; you need those foundations.
But when you're dealing with someone who has already been doing years of work in a PhD or in engineering in these areas, what these teams actually want to see is communication ability, and the ability to understand the underlying business and product problem. They want to hire someone who first thinks about: what are we trying to accomplish here? How can we help our users? How can we help our company succeed? And only then figures out, how do I use my tool set of machine learning or analysis to do it?
What often happens—this is the pitfall—is that part of why you got into it is because you're excited about AI, excited about the machine learning. So you always put that first: let me tell you about the algorithm I can build. Folks trying to transition need to start thinking about product, need to start thinking about business.
They need to ask: what is the company actually trying to solve? It's like what makes someone a better salesperson. What's interesting about the advice we give a lot of people is that it's not about selling your own thing; it's about understanding their problem, and then fitting whatever you have into that.
And it seems like, for data science, the same thing is happening. It's not, "All right, here are all the things I have." It's figuring out what it is that you fit into for them.
Exactly right. Understand the underlying goal—forget data, forget machine learning. What are we trying to accomplish here? What's our mission? What are we trying to do for our users? Then make yourself the solution, instead of saying, oh, I have a bunch of stuff—which of these things are you interested in?
Exactly! It's not laying out the hammer and the screwdriver and asking, which one can I use? It's, what are we trying to build here?
And sometimes that's actually a separate role. So for instance, Facebook might list a data science job, whereas some smaller startup would say, we have an engineering role open.
So you might classify yourself as a data scientist but have to pitch data science to a startup.
Right. How do you do that?
This is a great question. First of all, data science, machine learning—these are super broad umbrella terms. It's such a new field.
Yeah, maybe you should define it.
Broadly speaking—let's not worry about the details—I see three big buckets of how data science is used in industry.
So some data science roles are what I'd call product analytics or business analytics roles. There, you're analyzing data about your users or your company, trying to understand how to improve things—help users succeed, help the business succeed.
The second type we see are data product roles. These are roles where you're actually using machine learning and predictive models to change the user experience and give users something they want. And the third is what you usually hear termed AI—AI roles, machine learning engineering roles—where the prediction isn't just a feature in the product; the product is the machine learning.
Like a self-driving car. If the machine learning doesn't work, the whole product doesn't work.
That clears up a lot of misunderstanding. Is there a category in between, where machine learning supplements a feature?
Or a service?
Yeah, yeah! That's usually where folks talk about data products. A data product is often a feature—like the Netflix recommendation engine. Honestly, if they didn't have machine learning, they could still just say, here are the top movies; go watch them.
But with that predictive model, you get a much better experience. We have probably 30-plus fellows working at Netflix. A lot of them work on that stuff, but some work on analytics: how are people even using this product? What can we add at the product level to improve it?
And there the output isn't a feature the user sees, like an algorithm serving recommendations. They have to go communicate with the product team: hey, users seem to want us to build this sort of product for them; let's take the product in that direction over the next 6 to 12 months.
It's a very different role.
Here's an interesting question. I know what the dream scenario is for a lot of data scientists: I want to get a job and work on these interesting problems.
What should they look out for—what should they avoid in a company? How do they tell that a company isn't ready to actually hire them, so that going there would be a bad experience?
It cuts both ways. I said the data scientist needs to know what the actual problem is—well, the company needs to know what the actual problem is too. So be wary of the companies where it's just, hey, I want deep learning.
It's like, what does that even mean? What do you want done here? You want to go to a company with a mission you align with—you want to see them succeed, you want whatever solution they bring to market to thrive in the world.
And where they have a clear sense of: if we add some data analysis, if we add machine learning, it's going to be better—and then you can help them get there.
So someone on Twitter, Chuck Graham, asks: when do you know you need to bring in seasoned data scientists? Is there any kind of benchmark you can offer?
Yeah. First of all, as a founder, start with the idea—and you can do this before you have a data scientist: understand whether data is critical to building your product, or whether it's something you'll add on once the product is already working and you need to optimize the experience.
An example of something critical is Amazon Alexa. If you're building Alexa, that voice recognition algorithm had better work from day one—versus a scenario where, say, you're on an analytics team at Airbnb, you already have a lot of users, and you're just trying to optimize the experience.
So for a startup, figure that out first, and if you need one from day one, hire one from day one. If you get a machine learning engineer in the door whose forte that really is, you're set up for success, instead of trying to hack it together and having to catch up later—because often you don't know what you don't know.
You might not be tracking the right data, or you're not setting up your infrastructure in a way that will help you scale later. Especially in products where machine learning is critical, that becomes challenging.
One thing I actually recommend to startups is to talk to folks in the industry and frankly get an advisor. If you're not ready to hire a data scientist yet, at least think about getting a data science advisor.
Where do you find one?
Yeah, good question! Feel free to email me!
I mean, you'd be surprised. Maybe the top folks who started the data science team at, say, LinkedIn are hard to get. But any data scientist who's been in the field and knows what they're doing will be able to sit with a founder and say, listen, you're probably going to want to instrument these features to collect this data, because you'll want to analyze it later—and here's the type of work you'll probably want done down the road.
So if you want someone to help you lay the groundwork for that, hire someone.
You started off with eight students in that first class. Where is it now? How many students do you take per session, and what's different about the program?
Yeah, it's definitely scaled up a bit since then. We're now in five cities—San Francisco, New York, Boston, Seattle, and Toronto, my hometown, which we just launched this year, which is fun. We've got a bunch of different specializations now: data science, data engineering, health data, AI. We're even doing product management now, helping product managers transition to AI.
Overall, we do three sessions a year. It's almost like there are different classes depending on where you're starting and which specialization you're going for—because the field has specialized, right?
It used to be that you'd just hire a data scientist and hope they'd take care of everything. Now you want folks building infrastructure—the data engineers—you want data scientists building the early prototypes and figuring out what to build, and then, more often than not, you need machine learning engineers to really put that into production.
So you see these different specializations, and we essentially have a program for each. The data science program is for PhDs, because that scientific experience is critical. The AI program, for instance, is predominantly for engineers going into machine learning engineering roles.
And how big are these classes?
Overall, across all the cities and programs, we're at just over 300 fellows per session now. But each program is small—we keep it to a maximum of 20 to 35 fellows per program in each location—because the collaboration is critical. You want that group to gel. Everybody's working on a project; you want people tapping each other on the shoulder, asking for help.
You want the alumni who come in to be able to sit with the fellows. The small groups are really critical for that!
How long is the program?
Seven weeks.
And then what gets done in seven weeks?
Yeah, it's pretty incredible how fast people learn and what they build. In Week 1, they're coming up with an idea or partnering with a startup—fellows often work with startups through a partnership with YC. So your first week is figuring out the project: should I come up with something on my own, building on advice from our alumni, our mentors, and our team?
Or should I partner with a YC startup that has a data challenge they want solved? So step one is figuring out what you're building and, again, what real problem you're solving. Then over the next couple of weeks you'd better build it fast—folks go from literally nothing to an MVP.
Then they're out presenting to companies. They work individually, because they're trying to show they can execute end to end on a real-world problem, but the environment is incredibly collaborative.
If you come to Insight, it doesn't look like a classroom; it looks like a startup office. Everybody's at desks sitting together, people are at whiteboards, talking, helping each other—because you all encounter the same problems, technical and otherwise—and it's that collaborative aspect that lets people move super fast and learn a ton.
And if you're in the program—or just checking it out, maybe applying for jobs like this—what types of projects do you recommend avoiding?
You know, things people have seen a hundred times before?
Yeah—like, "are people happy on Twitter?" That one's been done to death; it's a good example of the generic project. The more useful advice is: make something useful, right?
It's really easy to just say, I took this algorithm that used to run at 92.1% accuracy and now it's 92.3%—and I don't know why, but it's better now! Or what you sometimes see scientists do is something very generic. Here, I'll give you an example of a project I loved that came out of a session.
Here's the bad version first, then the version someone actually did at Insight—and you can do this at home. Say the topic is solar panels: you want to understand solar panel usage and help people adopt solar. The bad project is, "I analyzed general trends in solar panel usage in California—look at this interesting fact I found."
It's like, okay, whatever, right? Maybe that's interesting for an analyst report, but it doesn't get anything done. It has no call to action.
Exactly! You want it to be almost opinionated, because that way a business can look at it and know what to do.
Exactly right! The bad projects are the ones that make you feel like, oh, now I have homework.
That's right! That's a problem I have with a lot of analytics services—yeah!
All they do is tell me things, yeah!
Things I already knew.
Yeah, and now I still don't know what to do! So I've paid, and I still have to figure everything out myself. It feels dumb.
Exactly! And the good version of this project—a fellow's project, one of my favorites—is: I'm a homeowner. Should I buy solar or not? Will solar be profitable on my roof?
That's a hard problem—what's the weather like, what's the roof like, a ton of different factors, some of them predictive. This fellow took all that data, synthesized it, and built a predictive model. I come in, type in my address, and it tells me whether I should buy solar.
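As a rough illustration—not the fellow's actual model, which isn't described here—a back-of-the-envelope version of that "should I buy solar?" decision could look like this. Every feature name, coefficient, and threshold below is invented for the sketch:

```python
# Toy sketch: estimate whether rooftop solar is likely to be profitable.
# All constants here are illustrative assumptions, not real data.

def solar_profitability_score(sun_hours_per_day, roof_area_m2,
                              electricity_price_per_kwh, install_cost):
    """Crude estimate: annual savings minus amortized install cost."""
    panel_efficiency = 0.18       # assumed fraction of sunlight converted
    kw_per_m2 = 1.0               # assumed peak solar irradiance per m^2
    annual_kwh = (sun_hours_per_day * 365 * roof_area_m2
                  * kw_per_m2 * panel_efficiency)
    annual_savings = annual_kwh * electricity_price_per_kwh
    amortized_cost = install_cost / 20  # assume a 20-year panel lifetime
    return annual_savings - amortized_cost

def should_buy_solar(**features):
    # Positive score means the panels pay for themselves.
    return solar_profitability_score(**features) > 0
```

A real version would replace the hand-picked constants with a model trained on weather, roof-geometry, and pricing data keyed off the address—but the product shape is the same: features in, a yes/no answer out.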
Oh, that's amazing!
Yeah! So all these projects are very product-focused—so product-focused that sometimes companies ask, why are you showing us products when we just want data scientists? And the answer is: because it demonstrates that people can think product-wise.
And they end up loving it. They may not understand in the abstract why we're showing them products, but people gravitate to real solutions.
And then they hire the fellows.
Well, this is related to something we talked about the other day: in the future, are more data scientists going to become founders, or is that mentality best suited to working within a big company?
Oh, really?
Yeah—unlike, say, designers, who for some reason don't tend to become founders.
We'll see how it shapes up—whether data scientists become founders en masse. But I'd say about a quarter of the fellows in every session raise their hand when asked if they want to start a company in the next five years.
Fuck yeah!
Yeah. So I think that's going to be a big thing. We've already seen some of our alumni start companies.
Although again, it's early days. Diana Wu, who was in one of the early sessions, started Trace Genomics, a genomics company that uses genomic data to tell farmers when to plant and when not to plant—super interesting!
And not an alumnus, but an early mentor: Ben Kamens, who used to be the founding engineer at Khan Academy, hired one of our fellows, Lauren, a physicist.
She went there and helped them—hopefully—impact a bunch of kids' lives by helping them learn faster.
Yes—they have millions of data points on how people learn, and she was there for a few years with him, working on education. I mention Ben because he's very much a data scientist at heart.
And although he's a founder—his title is officially CTO—he went off and founded Spring Discovery, which is tackling aging-related diseases using machine learning.
Lauren went over there with him as part of the founding team. So again, TBD what the stats on founders will be, but that founder spirit is there, and the skill set transfers.
I mean, that's the thing: regardless, having an understanding of product is invaluable. Whether you're a founder or employee 10 or 100 or, frankly, 1,000, you'd better know what you're building toward.
So do you teach that as well?
Oh yeah, it’s one of the biggest things!
I mean, how do you teach it?
You know, I've found the only way to teach it is by doing.
Yeah—so you say, build the product, and they give you a graph that shows interesting things, and you say, no, no—
Like, yeah, and you iterate. You just iterate.
Yeah, I mean, that’s the learning experience! You do it wrong, and then you iterate and you fix it and get better!
The model at Insight is really continual feedback. If I waited until the end of the program to tell you that's wrong, that would be a bad learning experience. At Insight, you'll be told half a day in that that's not the way to go, and by the next half day you'll be closer, and by the end of the first week you'll hopefully be on a good path, building a cool product!
That fast iteration.
Which is cool. I think one of the things that ends up being a problem for a lot of startups—or even for people getting into the data science field—is that they encounter very dirty data.
And so a lot of the time it's not, oh, I'm solving cool problems, I'm building products; it's, oh, I'm just sitting here cleaning up this data so I can even get to that point.
And so I’m trying to figure out, is this something that data scientists need to beware of? That you’re just gonna walk into this?
And there’s something like startups need to start thinking about and about like what can they do to like prevent that?
Both, but I think you can never avoid it entirely. So it really is the data scientist’s job to be prepared for that and to do it well.
What’s the ratio of the job, like, data cleaning versus modeling?
There’s this joke like 90% of the job is data cleaning!
I don’t know if it’s 90, but it’s a lot!
And it’s not just data cleaning; data cleaning sounds kind of lame like you’re just kind of cleaning things up.
Yeah, I think it’s more interesting than that. It’s literally, what data even makes sense to collect here? It’s not obvious in advance.
You think it’s obvious. You’re like, I’ll just throw some data at it.
What data, and how can you combine that data, and what does it mean to have clean, relevant data?
Again, that’s a skill set!
Well, you know, I have an example on the founder side, right?
I think founders often make the assumption that they’re tracking all the right things, and then we’ve had many experiences where a fellow’s going to work with a founder, and the founder will say, yeah, I’ve got all the data. We’ve got everything!
Right?
Big data, big data!
Yeah, it’s all the data! And then you open it up, and it’s like, oh shit! They didn’t track user logins.
Like which user was logging in!
They’re tracking all the movements on the site, yeah, but not which user was making those movements, and at what timestamp!
And again, it’s like, oh my God! All this data is borderline unusable, because we can’t peg it to specific user behavior and model that behavior.
And when you’re looking at it from the data perspective, it sounds hilarious, like, why didn’t you track users?
Yeah, but you know what? When you’re a founder, you’re thinking about a million different things!
Yeah!
You have a million different trade-offs!
Exactly! And honestly, yeah, the logging’s turned on, let’s go, right?
Let’s build, let’s build!
Then a year later, you’re regretting it. So again, a lesson learned for sure.
That’s why it’s like, hey, have a coffee with a data scientist! Maybe all you’ll get from it is, log your user logins, but that might be enough.
And then a year later, you can get started with the data scientist.
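The logging gap described above comes down to recording who did what, and when. As a minimal sketch only (the field names and file name here are illustrative, not from any particular analytics product), an event logger needs little more than a user ID and a timestamp on every row:

```python
import csv
import time
import uuid

def log_event(writer, user_id, event_name, properties=""):
    """Append one event row: which user did what, and when."""
    writer.writerow({
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,        # the field founders forget to record
        "event": event_name,
        "timestamp": time.time(),  # when it happened
        "properties": properties,
    })

with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["event_id", "user_id", "event", "timestamp", "properties"]
    )
    writer.writeheader()
    log_event(writer, "user_42", "login")
    log_event(writer, "user_42", "page_view", "pricing")
```

With those two fields present on every row, later analyses like churn, cohorts, and funnels can peg behavior to specific users, which is exactly what the unusable dataset described above was missing.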
What are the best tools people should use for tracking data? Or is there a product they should be using?
That was one of the questions, yeah, that someone wrote in.
Like, if they’re going to do this, where should they even start, I guess.
So you know, honestly, I saw some of the questions on Twitter, and I know, you know, folks always ask about tools.
So I was actually asking around some of the data teams, like, hey, what’s the latest on this?
And there are great tools, I think, for basic analytics tracking on websites. But if you’re really building data products, to this day we see teams roll their own.
Hmm!
Because there’s so much, there’s so much!
Such a disappointing answer!
And listen, there are companies working on it, some YC companies, and they’re slowly progressing toward more sophisticated data products!
But at the end of the day, if your lifeblood is a very specific product that does something very specific, nothing beats having somebody very thoughtfully ask: what do we actually care about tracking here?
Okay! So there’s no shortcut, then.
Yeah! Like, I’m assuming there’s no easy answer.
So say you’re a founder, you just got started. You’re saying, can you give me, like, five or ten things I should be tracking?
Well, I mean, it really depends on the company, right?
Okay, fine! So I think the number one thing you have to think about as a founder is actually not even what you’re tracking, because honestly, if you get this first thing right, the tracking becomes more obvious.
Yeah.
The first thing you’ve got to think about, I think, is: what are you actually trying to optimize? What are the one or two metrics you actually care about?
Or, if you’re thinking about machine learning and building predictive models:
say you had a magic machine learning model that did whatever you want, but you only get one or two metrics. To which problem in your company would you apply it?
Because what I see folks do is, oh, I know my business in and out, my metrics are this, this, this, and this, and then, oh, machine learning, I’ll build this, this, and this. You know what? You might, at some point down the road!
But initially, you’re gonna have to focus!
And if you don’t have that focus, that’s where you get into this habit of, I’ll just track everything, or nothing.
Whereas if you know what you’re trying to optimize, let’s say I’m Netflix.
Yeah, what am I going to start tracking?
Oh, I mean, you obviously want to see how long people are watching the video, how far they get in that video.
One of the things that’s less obvious is that people are using different devices on different bandwidths, so they track that.
I mean, they test this stuff and track it on all sorts of different machines.
So again, with a generic tool, would you ever have a situation where you’re testing a stream on a hundred different devices?
No! You wouldn’t, because if that’s not the core of your business, why would you ever do that?
Right!
But if you’re Netflix, you’d better be doing that, because you know that user experience is the key, right?
Khan Academy is something different, right? For Khan Academy, it’s like, you know, maybe it’s the amount of time kids are spending on a question, and that’s telling you something about whether they’re learning.
Whereas on another site, you don’t really care about the timing! You just care about the flow!
So my understanding for any startup, and most companies, is it’s always like: my goal is growth!
Yeah!
And for us, the way I see it, we’ve pretty much simplified it: for the most part, the KPI your company is actually interested in driving is going to be revenue, right? And that’s like 99% of companies.
Yeah! And for some, like consumer, where it’s a very difficult play, it’s like, I’m going after engagement!
Like, sure!
The KPI is daily active users; that’s the ideal. Sometimes it’s weekly active users; that’s just the nature of the product.
So to me, it’s just like, okay, what drives those two things is really just two numbers.
It’s conversion, and then churn!
Yeah!
And so I imagine most questions fall into those two categories: what increases conversion for revenue, and what reduces churn for revenue?
And the same thing for like engagement!
So maybe I’ll speak directly to those, because now you’re zeroing in on certain types of companies.
And for churn, we often have fellows build churn prediction models for startups.
So again, there are nuances there, because, I mean, churn of what? What’s actually happening?
But when we’re talking about churn, it’s a customer deciding to stop using the product, and if we can predict that ahead of time, the company is able to intervene: maybe offer a discount, maybe engage that user, get feedback. So those are top of the list!
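The fellows' actual models aren't described here, so as an illustrative sketch only: a churn predictor can start as simple as a logistic regression over usage features. The features and toy data below are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_churn_model(features, labels, lr=0.1, epochs=2000):
    """Logistic regression via SGD: estimate P(churn | usage features)."""
    n = len(features[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss for this example
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def churn_probability(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Invented training data: [logins last 30 days (scaled), support tickets (scaled)]
X = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.2],   # active users who stayed
     [0.1, 0.8], [0.2, 0.9], [0.0, 0.7]]   # inactive users who churned
y = [0, 0, 0, 1, 1, 1]

w, b = train_churn_model(X, y)
at_risk = churn_probability(w, b, [0.1, 0.9])   # usage looks like the churners
healthy = churn_probability(w, b, [0.9, 0.0])   # usage looks like the stayers
```

Scoring every customer this way is what makes the intervention above possible: the ones with the highest predicted probability are the ones to offer a discount or reach out to first.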
And for conversion, experimentation is the key! These experimentation frameworks. It always feels like a lot of startups, especially earlier ones, neglect that whole churn question, because, I always tell them, look, you’re obsessed with conversion because you’re in sales mode, trying to bring people in!
But acquisition always feels very expensive, and improving churn by the same percentage does the exact same thing for revenue as acquiring new customers, but it’s way easier!
Keep your users!
And so is that usually one of the first projects that startups and companies should be looking at, if they haven’t done anything at all?
Absolutely! And you know, one thing about churn is it’s often more reflective of what is actually working or not working, right?
It’s like, if you improve churn, that means you’re truly understanding what the user wants.
Maybe you can get them to sign up or convert just by sort of having a flashy sales pitch.
But churn? You really have to understand it, and that’s where the exploratory data analysis comes in.
Do you really understand what your users are doing? That’s where the A/B testing comes in, and often what’s called multi-armed bandit testing, where you’re trying various experiments at once!
That’s where you’re predicting churn and then trying to intervene to help the customer.
But, you see what I’m saying?
It’s a number of different things, all of which are grounded in: do I understand what my user wants, and am I building toward what they really care about?
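As a toy illustration of the multi-armed bandit testing mentioned above (epsilon-greedy is one of the simplest variants, and the conversion rates below are made up), the idea is to keep a small exploration budget while routing most traffic to whichever variant currently looks best:

```python
import random

def epsilon_greedy(conversion_rates, pulls=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over simulated product variants.

    With probability epsilon we explore a random variant; otherwise we
    exploit the variant with the best observed conversion rate so far.
    """
    rng = random.Random(seed)
    n = len(conversion_rates)
    counts = [0] * n  # times each variant was shown
    wins = [0] * n    # conversions observed per variant
    for _ in range(pulls):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(n)  # explore (and cover untried variants)
        else:
            arm = max(range(n), key=lambda i: wins[i] / counts[i])  # exploit
        counts[arm] += 1
        if rng.random() < conversion_rates[arm]:  # does the simulated user convert?
            wins[arm] += 1
    return counts, wins

# Hidden truth: the third variant converts best; the bandit should find it.
counts, wins = epsilon_greedy([0.02, 0.05, 0.20])
```

Unlike a fixed 50/50 A/B split, the bandit reallocates traffic while the experiment is still running, so fewer users are shown the losing variants.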
I think the other big trend, metrics-wise, that people are obsessed with is cohorts, and retention curves over time.
And so what are usually the best things people should do there? Like, yes, understanding and knowing your churn is sometimes really difficult, but in terms of improving it, where does data science usually help?
Right?
I mean, I think it comes back to churn, right? Because if you’re seeing folks drop off at month three in your early cohorts, that’s a churn problem right there!
So yeah, it goes back to churn. And a lot of those dashboards, you know, there are great tools for those!
Certainly when I started, people would hand-code cohort analyses; now there’s a bunch of tools for that!
So in the metrics-and-dashboards domain, there are a lot of solutions.
When I was saying there isn’t really a ready-made solution, that’s more for the cases where you’re actually building models to improve the product in a very deep way.
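A hand-coded cohort analysis of the kind mentioned above is small enough to sketch. Assuming events of the form (user, signup week, active week), which are invented for this example:

```python
from collections import defaultdict

def retention_table(events):
    """Build a cohort retention table from (user, signup_week, active_week) rows.

    Row = signup cohort, column = weeks since signup,
    cell = fraction of that cohort still active that week.
    """
    cohort_users = defaultdict(set)  # signup_week -> users in the cohort
    active = defaultdict(set)        # (signup_week, weeks_since_signup) -> users
    for user, signup_week, active_week in events:
        cohort_users[signup_week].add(user)
        active[(signup_week, active_week - signup_week)].add(user)
    table = {}
    for week, users in cohort_users.items():
        size = len(users)
        offsets = [off for (w, off) in list(active) if w == week]
        table[week] = [len(active[(week, off)]) / size
                       for off in range(max(offsets) + 1)]
    return table

events = [
    ("a", 0, 0), ("b", 0, 0), ("c", 0, 0),  # week-0 cohort: 3 signups
    ("a", 0, 1), ("b", 0, 1),               # 2 of 3 return in week 1
    ("a", 0, 2),                            # 1 of 3 still here in week 2
    ("d", 1, 1), ("e", 1, 1),               # week-1 cohort: 2 signups
    ("d", 1, 2),                            # 1 of 2 returns the next week
]
table = retention_table(events)
```

Each row is a signup cohort; reading across a row gives that cohort's retention curve, and a drop at the same offset across cohorts is exactly the "folks drop off at month three" signal described above.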
Do you guys have a favorite vendor? Like, since you said that good startups have good problems?
Yeah!
Are you waiting for a sponsorship from some tool before you recommend one?
No! Honestly, at Insight, almost everybody just uses open source!
Right, okay! Reasonable!
And frankly, that’s actually what we see reflected in the industry. If you go to a top data science team, far and away the vast majority of what they’re using and building on is open source!
What are those projects?
Like, I think Python is definitely it; Python, and then what they build on top of it themselves?
Yeah!
Absolutely!
And then they just use, like, what, Jupyter Notebooks for prototyping?
And then you’ve got to start building!
Then you roll your own!
And frankly, at that point, as soon as you get past the prototyping stage, you’re really just building product, right?
It’s the same thing an engineering team does at a startup, right?
It’s like what tools are they using to build the fundamental product?
And that’s where you’re living!
Those data scientists are often embedded with the team.
They’ll be directly—
Who makes the best data scientists?
Like, from what field? Have you noticed, oh, they’re much better when they come from this field?
What’s been kind of shocking about it?
One of your favorite children?
Ha! I’d be accused of bias, you know: I’m from physics!
But no, we have fellows from all different backgrounds, and they all succeed. I mean, I think that’s been the shocking thing, how different the backgrounds are.
We have a fellow in this session; he’s an archaeology PhD.
We had a fellow a session ago who was an engineer at SpaceX, right?
Like, imagine them next to each other.
So you’ve got, say, a mathematician. They’re going to get the math, understand the particulars; but selling themselves and understanding problems is probably the harder part?
So often you’ll find, for instance, that mathematicians make great data engineers, because they think about large-scale systems and how they can fail.
In math it’s logical systems, but they can transfer that mode of thinking to data infrastructure!
But someone from, for instance, psychology: that was one where, in the early days, I didn’t really have a network in psychology or neuroscience, so we did a lot of work to put the word out there.
We found social scientists quite often make incredible data scientists, because they know how to write questions and they know how to think about people.
And ultimately, you know, obviously data is branching out, but most of the time when you’re talking about users, you’re talking about customers; it’s people!
Right? So fantastic data scientists come from those fields! And it’s one of my favorite parts of my job, actually: you’ll sit at lunch or a happy hour, or just hang out at the office, and it’s an astrophysicist with a psychologist, with a software engineer, with an electrical engineer, and they’re all collaborating.
It’s just an incredible, incredible environment, being around all these different people!
You have all these companies coming in, talking with all your students during the program, and they usually come with a problem, or they just talk about, here’s the kind of problems we work on and solve, because they’re doing a little recruiting in addition to giving an understanding of what they do!
Absolutely!
Yeah! Which companies are really great at that? I mean, what do they do that works?
There are a bunch of teams. So, listen, the way the program works is fellows will often work with a startup on a project, but most of the interactions the fellows have with companies is actually companies coming in to try to hire them!
Right? And when I say companies, I mean like the actual technical data team coming in talking about what they work on and, you know, trying to hire them.
And the teams that do really well, listen, obviously there are the ones with great brands: the Airbnbs, the Ubers, the Facebooks.
I don’t know how the little guys compete with them.
But this is what I’ve found: when startups come in, fellows often go in thinking, what’s this startup? I’ve never heard of it. Why do I have to go to this?
And they come out and they’re like, this is my dream job! I want to work at this company!
And I started trying to figure out what certain startups did to pull that off, and what it really boils down to is impact.
The startups that do well recruiting data scientists make the pitch: you are critical to our success!
If this works, they’re going to be all-stars! And they’re telling the truth, because for a lot of companies these days, frankly, if the machine learning or the analytics doesn’t work, the company will fail!
Like, that’s what they’re pitching.
Well, also when there’s one of you versus 300 of you, right?
Well, that’s a personality thing, right? Some people are excited about I’m gonna be the first data scientist. And some people are like, I want some mentorship.
Yeah, some of us need a little mentoring!
But when it comes to like, I’ve never heard of this company before and then an hour later, like, oh my God, I want to work for that!
It’s always the impact piece. It’s always the like, if you come here, what you do will matter in a big way!
And obviously there’s the technical piece, you’re gonna work on cool stuff!
But I thought the technical piece would be the biggest one.
But the biggest one actually is the impact, for sure.
So one thing we haven’t talked about, and I actually don’t know if you have an opinion on this, is contracting. An average startup, say a couple of years in, might think: I don’t know if we really need this full-time, but we have all this data; maybe we could put it to use.
Do you see people doing two-month contracts, getting a system up, and then just letting it go? What happens?
Yeah, I think contracting is good for prototyping!
So we see a lot of YC startups work with our fellows; essentially it’s a pro bono consulting arrangement.
But they’re only working with them for, you know, the duration of the program!
Yeah, to deliver some results. And where that works really well, often it’s integrated, but it’s at this prototyping stage.
Will this even work? Or, I’ve got a model; will this one work better?
Okay, what if we try this?
So let me give you an example of one I really like. Recently, a fellow worked with iSono Health, a YC startup, really amazing product: it’s in-home breast cancer screening.
It’s a device: instead of going once a year to get screened for breast cancer, if you’re at high risk, you can do it at home!
And breast cancer is a leading cause of cancer death in women!
So huge impact, potentially life-saving technology!
And obviously a big part of that is, do we have the right algorithms to detect and notify a user that, hey, you need to go speak to your doctor?
Or notify a doctor; obviously a doctor makes the final call, but is there something abnormal here that needs a closer look?
They had algorithms that were working great and doing well for them, especially at that stage of, hey, let’s just bring it to a doctor to be safe; but they were curious, hey, are some of these newer deep learning algorithms that have just been published...
Are they going to do better for us?
So a fellow did that. They took the data, essentially applied some brand-new convolutional neural network techniques that had just been published, and got better results for them, almost on par with expert radiologists!
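The fellow's actual model isn't public, so no claims about it here, but the core operation of any convolutional network is easy to show. A toy 2-D convolution in plain Python (the image and kernel values are invented for the example): a small kernel slides across the image and responds where a pattern, here a vertical edge, appears.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as in most deep
    learning libraries): slide the kernel over the image and, at each
    position, sum the elementwise products."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A vertical-edge detector: responds where intensity changes left-to-right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1],
          [-1, 1]]
feature_map = convolve2d(image, kernel)  # peaks along the 0-to-1 boundary
```

Each learned kernel in a real CNN plays this role, and stacking many such layers is what lets a network pick out subtle structures in a scan.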
And so, I mean, that’s—that’s awesome, right?
And of course that team then has to do some more work to implement it, but that’s an example of where I think consulting works: is this going to work? Is this feasible?
As a prototype. Any time it actually becomes part of your product, you need a team, right?
Because it’s never static! Something’s gonna evolve and change!
You need to be able to evolve it. It’s just like asking, can a startup just have contract software engineers overseas?
Maybe to prototype something. But in general the answer’s probably no, because that product’s going to keep evolving every month, every year!
And you need people focused on it to do that!
Makes sense to me!
Great, cool!
So I think one last thing I wanted to talk about is just areas you’re excited about in particular. We mentioned health earlier, but what’s exciting to you right now in the field?
There’s a bunch of stuff that excites me, but health! Health is the top one I’m pumped about, because the impact’s there, right?
Like the example I just shared with you: early detection, disease monitoring. This stuff literally saves people’s lives when it works!
And what’s interesting is people have been talking about the impact of data science and machine learning in health for years, because you start thinking about this stuff and pretty quickly you’re like, this could make an impact!
But actually getting it to work is tough!
Only in the last few years have we been seeing teams actually make really amazing progress there.
I’ll give you an example I love of the impact here: Memorial Sloan Kettering Cancer Center in New York has hired a team of data scientists and data engineers from us over the last few years.
And what they do is build data products, essentially, that are used internally by their doctors.
These are cancer doctors, really tough situations, and they’re faced with a situation of what clinical trials do I recommend to my patient?
And there’s thousands of clinical trials, and there’s new ones coming online every day!
Which one do you suggest?
And so they’re building these data products where, based on the patient’s specific personalized genomic or clinical factors, the doctor gets, hey, you should at least think about these new clinical trials that are coming online.
And again, the doctor makes the final decision, but it’s, hey, maybe one of those trials they hadn’t heard about now saves that patient’s life!
Right? And it’s fascinating: this is a hospital doing this, right?
And then soon thereafter, New York-Presbyterian hired a fellow, and then Mount Sinai hired a fellow.
And now pharma companies are hiring fellows!
And it’s really fascinating to see data broaden out now, as companies realize they can take it beyond, oh, I want to optimize this!
Beyond business efficiencies, and really think: what can I create that’s going to add incredible value?
So health is what I’m excited about, but there’s a ton more out there!
Yeah, that’s super cool!
Alright, well thanks for coming in!
Thanks so much!
Thanks, Trey!