Everybody's going deep now. Deep work, deep learning, DeepMind. If 2025 is the year of agents, then the 2020s are the decade of deep. While LLM-powered search is as old as Perplexity and SearchGPT, and open source projects like GPT Researcher and clones like Open Deep Research exist, the difference with commercial deep research products is that they are both agentic and bundle custom-tuned frontier models like OpenAI's o3 or, as today's guests discuss, a fine-tuned version of Gemini.
Since the launch of OpenAI's Deep Research on February 2nd, the reactions have been nothing short of breathless. "Deep Research is the best public-facing AI product Google has ever released. It's like having a college-educated researcher in your pocket," from Jason Calacanis. "I have had Deep Research write a number of ten-page papers for me, each of them outstanding. I think of the quality as comparable to having a good PhD-level research assistant, and sending that person away with a task for a week or two, or maybe more. Except Deep Research does the work in five or six minutes," from Tyler Cowen. "Deep Research is one of the best bargains in technology," from Ben Thompson. "My very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone," from Sam Altman. Since then, a dozen open and closed source clones have emerged from the woodwork trying to replicate this success, from Perplexity to x.ai with their Grok 3 launch late yesterday.
In today's episode, we welcome Aarush Selvan and Mukund Sridhar, the Lead PM and Tech Lead for Gemini Deep Research, the originators of the entire category of deep research agents, which have overnight become the newest killer use case for AI. We asked detailed questions from inspiration to implementation: why they had to fine-tune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases.
Aarush and Mukund will also be joining us as keynote speakers for the Agent Engineering track at the AI Engineer Summit in New York City on February 21st. This is the last in our recent series of episodes with upcoming AI Engineer Summit speakers, and we hope you're as excited for their talks and workshops as we are. You can sign up for the online live stream linked in the show notes. See you at the summit. Watch out and take care.
Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Hey, and today we're very honored to have in our studio Aarush and Mukund from the Deep Research team, the OG Deep Research team. Welcome.
Thanks for having us. Yeah, thanks for making the trip up. I was fortunate enough to be one of the early beta testers of Deep Research when it came out. And I was very keen on it. I think even at the end of last year, people were already saying it was one of the most exciting agents that was, you know, coming out of Google. You know that previously we had on Raiza and Usama from the NotebookLM team. And I think like this is like an increasing trend that
Gemini and Google are shipping interesting user-facing products that use AI. So congrats on your success so far. Yeah, it's been great. Thanks so much for having us here. Yeah, excited. Yeah, thanks for making the trip up. And I'm also excited for your talk that is happening next week. Obviously, we have to talk about what exactly it is. I'll ask you towards the end. But so basically, okay, you know, we have the screen up. Maybe we just start at a high level for people who don't yet know: what is deep research? Sure. So deep research is a feature where Gemini can act as your personal research assistant to help you learn about any topic you want more deeply. It's really helpful for those queries where you want to go from zero to 50 really fast on a new thing. And the way it works is it takes your query, browses the web for about five minutes, and then outputs a research report for you to review and ask follow-up questions about.
This is one of the first times, you know, something takes about five or six minutes to perform your research. So there's a few challenges that brings: you want to make sure you're spending that time in the computer doing what the user wants. That's part of the UX design that we can talk about as we go through an example. And then there's also challenges in
the fact that the web is super fragmented, and being able to plan iteratively as you parse through this noisy information is a challenge by itself. Yeah. This is like the first time Google is sort of automating your searching for you. You're supposed to be the experts at search, but now you're meta-searching and determining the search strategy.
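As an aside for readers: the high-level loop described so far (take a query, propose a plan, browse the web, write a report) could be sketched roughly like this in Python. Everything here, the function names, the staging, and the fake model calls, is our own illustration, not the actual Gemini implementation:

```python
# Hypothetical stand-ins for model calls; the real system uses a
# fine-tuned Gemini model for each stage.
def propose_plan(query: str) -> list[str]:
    # The model decomposes the query into research steps;
    # here we fake a fixed two-step plan.
    return [f"Survey background on: {query}",
            f"Compare sources on: {query}"]

def research_step(step: str) -> str:
    # Placeholder for web browsing on one plan step.
    return f"findings for [{step}]"

def write_report(query: str, findings: list[str]) -> str:
    return f"# Report: {query}\n" + "\n".join(f"- {f}" for f in findings)

def deep_research(query: str, user_edits=None) -> str:
    plan = propose_plan(query)        # 1. propose an editable plan
    if user_edits:                    # 2. user may steer the plan
        plan = user_edits(plan)
    findings = [research_step(s) for s in plan]   # 3. browse per step
    return write_report(query, findings)          # 4. synthesize report
```

The editable-plan step, which the guests discuss next, is what distinguishes this shape from a one-shot search-and-summarize call.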
Yeah, I think at least we see it as two different use cases. There are things where you know exactly what you're looking for, and there search is still probably one of the best places to go. I think where deep research really shines is when there are multiple facets to your question, and you'd spend a weekend, you know, just opening like 50, 60 tabs, and many times I just give up. And we wanted to solve that problem and give a great starting point. Do we want to start
a query so that it runs in the meantime, and then we can chat over it? Okay, here's one query that we like. We love to test super niche, random things, things where there's no Wikipedia page already about the topic or something like that, right?
Because that's where you'll see the most lift from a feature like this. So for this one, I've come up with a query. This is actually Mukund's query that he loves to test: help me understand how milk and meat regulations differ between the US and Europe. What's nice is the first step is actually where it puts together a research plan that you can review. And so this is sort of its guide for how it's going to go about and carry out the research. And so this was a pretty
decently well-specified query. But let's say you came to Gemini and were like, tell me about batteries, right? That query, you could mean so many different things. You might want to know about the latest innovations in battery tech. You might want to know about a specific type of battery chemistry. And if we're going to spend like five to even 10 minutes researching something, we want to, one, understand what exactly are you trying to accomplish here? And two, give you an opportunity like...
to steer where the research goes, right? Because if you had an intern and you asked them this question, the first thing they'd do is ask you a bunch of follow-up questions, like, okay, help me figure out exactly what you want me to do. And so the way we approached it is we thought, why don't we just have the model produce its first stab at the research query, at how it would break this down, and then invite the user to come and kind of engage with how they would want it to
steer this. Yeah, and many times when you try to use a product like this, you often don't know what questions to ask or the things to look for. So we made this decision very deliberately that instead of asking the users follow-up questions directly, we kind of lay out, hey,
this is what I would do. These are the different facets. For example, here it could be what additives are allowed and how that differs, or labeling restrictions and so on in products. The aim of this is to kind of tell the user about the topic a little bit more and also get steer. At the same time, we elicit follow-up questions and so on.
It's kind of like editable chain of thought. Right. Exactly. Exactly. Yeah. I think that, you know, we were talking to you about your top tips for using deep research, and your number one tip is to edit the plan. Just edit it. Right. So you can actually edit conversationally. We put in a button here
just to draw users' attention to the fact that you can edit this. Oh, actually, you don't need to click the button. You don't need to click the button. Yeah, actually, in early rounds of testing, we saw no one was editing. And so we were just like, if we just put a button here, maybe people will. I confess, I just hit start a lot. I think we see that too. Most people hit start. It's like the "I'm feeling lucky." Yeah. Yeah. All right, so I can just add a step here,
and what you'll see is it should refine the plan and show you a new proposed plan. Here we go. So it's added step seven: find information on milk and meat labeling requirements in the US and EU. Or you can just go ahead and hit start. I think it's still a nice transparency mechanism, even if users don't want to engage. You still kind of know, okay, here's at least an understanding of why I'm getting the report I'm going to get,
which is kind of nice. And then while it browses the web, and Mukund, you should maybe explain how it browses, we show the websites it's reading in real time. Yeah. I'll preface this with... I forgot to explain the roles. You're a PM and you're a tech lead. Yes. Okay. Yeah. Just for people who don't know. Oh, okay. We maybe should have started with that. Yeah.
We do each other's work sometimes as well, but more or less that's the boundary. So what's happening behind the scenes is we kind of treat this research plan as a contract once it has been accepted. But then, if you look at the plan, there are things that are obviously parallelizable, so the model figures out which of the sub-steps it can start exploring in parallel.
And then it primarily uses two tools. It has the ability to perform searches, and it has the ability to go deeper within, you know, a particular web page of interest, right? And oftentimes it will start exploring things in parallel, but that's not sufficient. Many times it has to reason based on information found. So in this case, one of the searches could have found that the EU Commission has these additives banned, and it wants to go and check if the FDA does the same thing, right? So
this notion of being able to read outputs from the previous turn, and ground on that to decide what to do next, I think was key. Otherwise, you have incomplete information and your report becomes a little bit of a high-level bullet-point blueprint. We wanted to go beyond that blueprint and actually figure out, you know, what are the key aspects here. So yeah, this happens iteratively until the model thinks it's finished all its steps.
And then we kind of enter this analysis mode. And here there can be inconsistencies across sources. You kind of come up with an outline for the report, start generating a draft, and the model tries to revise that by self-critiquing itself to finalize the report. That's roughly what's happening behind the scenes. What's the initial ranking of the websites? So when you first started it, there were 36. How do you decide where to start? Since it sounds like, you know,
the initial websites kind of carry a lot of weight, too, because then they inform the following ones. Yes. So what happens in the initial turns? Again, this is not something we enforce. It's mostly the model making these choices. But typically, we see the model exploring all the different aspects in the research plan that was presented. So we kind of get
a breadth-first idea of what the different topics to explore are. And in terms of which ones to double-click on, I think it really comes down to: every time you search, the model gets some idea of what each page is. And then, depending on what pieces of it, sometimes there are inconsistencies, sometimes there's just partial information. Those are the ones it double-clicks on. And yeah, it can continually, iteratively search and browse until it feels like it's done.
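A rough sketch of the browsing behavior just described: two tools (search, and opening a page in depth), going breadth-first over the plan, then "double-clicking" where information is partial. All the names and the control flow here are illustrative assumptions, not Google's internal API:

```python
# Tool 1: web search. A stub that returns fake results; a real tool
# would hit a search backend. "partial" marks incomplete information.
def search(query):
    return [{"url": f"https://example.com/{query.replace(' ', '-')}",
             "snippet": f"snippet about {query}",
             "partial": True}]

# Tool 2: read a specific page in depth (the "double-click").
def open_page(url):
    return f"full text of {url}"

def research(plan, max_iters=10):
    notes = []
    frontier = list(plan)           # breadth-first over plan steps
    for _ in range(max_iters):
        if not frontier:
            break                    # model decides it is done
        step = frontier.pop(0)
        for result in search(step):
            notes.append(result["snippet"])
            if result.get("partial"):
                # incomplete or conflicting info -> read the page fully
                notes.append(open_page(result["url"]))
    return notes
```

In the real system the model itself decides when to search, when to open a page, and when to stop; this loop just makes the shape of that decision process concrete.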
Yeah, I'm trying to think about how I would code this. A simple question would be: do you think that we could do this with the Gemini API? Or do you have some special access that we cannot replicate? You know, like if I model this with tool calls like search, double-click, whatever. Yeah, I don't think we have special access per se. It's pretty much the same model. We, of course, have our own post-training work that we do.
You all can also, you know, fine-tune from the base model and so on. I don't know that we can do fine-tuning. Well, if you use our Gemma open source models, you could fine-tune. Yeah. Yeah. So I don't think there's special access per se, but a lot of the work for us is first defining these things, like there needs to be a research plan and how do you go about presenting that, and then a bunch of post-training to make sure, you know, it's
able to do this consistently well and with high reliability. Okay, so 1.5 Pro with Deep Research is a special edition of 1.5 Pro. Yes, it's not pure 1.5 Pro. This also explains why you can't just toggle on 2.0 Flash. Yeah. Right. Yeah. But I mean, I assume you have the data and, you know, it should be doable. Yeah. There's still this question of ranking. Yeah. Right. And, oh, it looks like you're already done. Yeah. Yeah. We're done. We can look at it. Yeah. So let's see.
It's put together this report, and what it's done is it's sort of broken it down, starting with milk regulation. And then it looks like it goes into meat probably further down, sort of covering how the US approaches this
problem of how to regulate milk, comparing, and then, you know, covering the EU. And then, yeah, like I said, going into the meat production. And what's nice is it kind of reasons over why there are differences. And I think what's really cool here is it's showing that there's a difference in philosophy between how the US and the EU regulate
food. So the EU adopts a precautionary approach: even if there's inconclusive scientific evidence about something, it's still going to prefer to ban it. Whereas the US takes sort of a reactive approach, allowing things until they can be proven to be harmful, right? So what's kind of nice is that you also get these second-order insights from
what it's putting together. So yeah, it's kind of nice. It takes a few minutes to read and understand everything, which makes for a quiet period during a podcast, I suppose. But yeah, this is kind of how it looks right now. Yeah. And then from here you can kind of keep the usual chat-and-iterate thing going. So if you were to compare it to other platforms, it's kind of like an Anthropic Artifact or a ChatGPT Canvas, where
you have the document on one side and the chat on the other, and you're working on it. Yeah, this is something we thought a bit about. And one of the things we feel is your learning journey shouldn't just stop after the first report. And so actually what you probably want to do is, while reading,
be able to ask follow-up questions without having to scroll back and forth. And there are broadly a few different kinds of follow-up questions. One type is, maybe there's a factoid that you want that isn't in here, but it's probably been already captured as part of the web browsing that it did, right? So we actually keep everything in context. All the sites that it's read remain in context. So if there's a piece of missing information, it can just fetch that.
Then another kind is like, okay, this is nice, but you actually want to kick off more deep research. You're like, I also want to compare the EU and Asia, let's say, in how they regulate milk and meat. For that, you'd actually want the model to decide, okay, this is sufficiently different that I want to go do more deep research to answer this question; I won't find this information in what I've already browsed.
And the third is actually maybe you just want to change the report. Maybe you want to condense it, remove sections, add sections, and actually iterate on the report that you got. So broadly, a lot of our work is basically to try and teach the model to be able to do all three. And the side-by-side format allows the user to do that more easily. So as a PM, there's an open-in-Docs button there, right? How do you think about
what you're supposed to build in here versus... It kind of sounds like the condensing and things should be in Google Docs. Yeah, Bard Extensions is a different thing. Docs is just an amazing editor. Sometimes you just want to directly edit things. And now Google Docs also has Gemini in the side panel. So the more we can help this be part of your workflow throughout the rest of the Google ecosystem, the better, right?
And one thing that we've noticed is people really like that button and really like exporting. It's also a nice way to just save it permanently. And when you do export, all the citations carry over, and in fact, I can just run it now, which is also really nice. Gemini Extensions is a different feature. That is really around Gemini being able to fetch content from other Google services in order to inform the answer.
So that was actually the first feature that we both worked on on the team, building extensions in Gemini. And so I think right now we have a bunch of different Google apps, as well as, I think, Spotify and a couple of others, and Samsung apps as well. Who wants Spotify? I have this whole thing about... Who wants Spotify? Who wants that in their deep research? In deep research, I think less. But the interesting thing is
we built extensions and we weren't really sure how people were going to use them. And a ton of people are doing really creative things with them. And a ton of people are just doing things that they loved on the Google Assistant. And Spotify, like playing music on the go, was a huge value. Oh, it controls Spotify? Yeah.
Deep research? For deep research, you purely use... Yeah, but otherwise, yeah, you can have Gemini go... Yeah, you have YouTube, Maps, and Search for 2.0 Flash Thinking Experimental with apps, the newest, longest model name that has been launched. But yeah, I think Gmail is an obvious one. Calendar is an obvious one. Exactly. Those I want. Spotify? Yeah. Fair enough. Yeah. And obviously,
feel free to dive in on your other work. I know you're not just doing deep research, right? But we're just kind of focusing on deep research here. I actually have asked for modifications after a first run, where I was like, oh, you stopped. I actually want you to keep going. What about these other things? And then continued to modify it. So it really felt like a little bit of a copilot-type experience, but more like an agent that would research. I thought it was pretty cool.
Yeah, one of the challenges is currently we kind of let the model decide, based on your query, amongst the three categories. So there is a boundary there. Some of these things, depending on how deep you want to go, you might just want a quick answer versus
like kicking off another deep research. And even from a UX perspective, I think the panel allows for this notion of, you know, not every follow-up is going to take you five minutes. Right now, it doesn't do any follow-up... or does it do follow-up search?
It always does? It depends on your question. Since we have the liberty of really long context models, we actually hold all the research material across turns. So if it's able to find the answer in things it's already found, you're going to get a faster report. Yeah. Otherwise, it's just going to go back to planning.
Yeah, yeah. A bit of a follow-up, since you brought up context, I had two questions. One, do you have an HTML-to-markdown transform step? Or do you just consume raw HTML? There's no way you consume raw HTML, right? We have both versions. Every generation of models is getting much better at native understanding of these representations.
I think the markdown step definitely helps in terms of, you know, there's a lot of noise, as you can imagine, with the pure HTML. JavaScript. Exactly. So, yeah, we do it when it makes sense; we don't automatically try to make it hard for the model. But sometimes it depends on the kind of access we get as well. For example, if there's an embedded snippet that's HTML, we want the model to be able to work on that as well. Yeah, and no vision yet. But...
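As an illustration of why an HTML-to-markdown (or plain-text) step helps, here is a minimal standard-library sketch that strips scripts, styles, and markup before a page would go into the model's context. Real pipelines use much more careful converters; this is just the idea of removing noise:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

For example, `html_to_text("<html><script>var x=1;</script><p>FDA bans additive X.</p></html>")` keeps only the sentence, dropping the JavaScript that would otherwise waste context tokens.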
Currently, no vision. The reason I ask all these things is because I've done the same. I haven't done vision. The tricky thing about vision is, I think the models are getting significantly better, especially if you look at the last six months, at being able to do VQA stuff and so on. But the challenge is the trade-off between having to actually render the page and so on, the trade-off between the added latency versus
the value-add you get. You have a latency budget of minutes. Yeah, yeah, yeah. It's true. In my opinion, the places you'll see a real difference are, I don't know, a small part of the tail, especially in this kind of an open-domain setting. If you just look at what people ask, there are definitely some use cases where it makes a lot of sense to do it, but I still feel it's not in the
head cases. And we'll do it when we get there. The classic is, it's a JPEG that has some important information and you can't touch it. Yeah. Okay, and then the other technical follow-up was: you have one million to two million token context. Has it ever exceeded two million? And what do you do there? Yeah, so we had this challenge sometime last year when we started wiring up this multi-turn capability, where we said, hey,
let's see how long somebody on the team can take DR, you know? Yeah. What's the most challenging question you can ask that takes the longest? Yeah. And we also keep asking follow-ups. For example, here you could say, hey, I also want to compare it with how it's done somewhere else. Yeah, yeah, yeah. We also have retrieval mechanisms if required. So we natively try to use the context as much as it's available, beyond which, you know, we have a RAG setup to figure out
what to fetch. Okay. This is all in-house tech. Yes. Okay. What are some of the differences between putting things in context versus RAG? When I was in Singapore, I went to the Google Cloud team and they talked about Gemini plus grounding. Is Gemini plus search kind of like Gemini plus grounding, or how should people think about the different shades of I'm doing retrieval on data versus I'm using deep research versus I'm using
grounding? Sometimes the labels can be hard, too. Yeah. Let me try to answer the first part of the question. The second part, I'm not fully sure of the grounding offering, but I can at least talk about the first part. So I think you're asking the difference between being able to... when would you do RAG versus rely on the long context? I think we all get that. I was more curious, from a product perspective, when you decide to do RAG versus not. Like this, you didn't need to.
You know, do you get better performance just putting everything in context? The tricky thing for RAG... it really works well because a lot of these things are doing cosine distance, like a dot product kind of thing, and that gets challenging when your
query side has multiple different attributes. The dot product doesn't really work as well. I would say, at least for me, that's my guiding principle on when to avoid RAG. That's one. The second one is, with every generation of these models... the initial generations, even though they offered long context, their performance as the context kept growing would show some kind of decline. But I think as the newer generation models came out, they were really good even if you
kept filling in the context, at being able to pick out these really fine-grained pieces of information. So I think these two, at least for me, are the guiding principles. Just to add to that, I think, just a simple rule of thumb
that we use is: if it's the most recent set of research tasks, where the user is likely to ask lots of follow-up questions, that should be in context. But as stuff gets to be, say, ten tasks ago, you know, it's fine if that stuff is in RAG, because it's less likely that the user needs to do very complex comparisons between what's currently being discussed and the stuff that was asked about, you know, ten turns ago, right?
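That rule of thumb, keep the most recent research tasks fully in context and push older ones into a retrieval index, could be sketched like this; the threshold and function names are purely illustrative:

```python
# Keep this many recent research tasks fully in the model's context;
# older tasks go into a retrieval (RAG) index instead. The number 3
# is an illustrative choice, not the product's actual setting.
RECENT_TASKS_IN_CONTEXT = 3

def split_context(task_history):
    """task_history: oldest-first list of research task transcripts.

    Returns (in_context, in_rag): the recent tasks to keep verbatim
    in context, and the older tasks to index for retrieval.
    """
    in_context = task_history[-RECENT_TASKS_IN_CONTEXT:]
    in_rag = task_history[:-RECENT_TASKS_IN_CONTEXT]
    return in_context, in_rag
```

The point of the split is that cross-task comparisons mostly happen between the current question and the last few tasks, so only those need the full-fidelity, in-context representation.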
So that's just the rule of thumb that we follow. And so from a user perspective, is it better to just start a new research instead of extending the context? Yeah, I think that's a good question. I think if it's a related topic, there's benefit to continuing with this thread, because the model, since it has this in memory, could figure out, oh, I've found this niche thing about,
I don't know, milk regulation in this case, in the US; let me check if the follow-up country or place also has something. These kinds of things you might not have caught if you started a new thread. So I think it really depends
on the use case. If there's a natural progression and you feel like this is part of one cohesive kind of project, you should just continue using it. If my follow-up is going to be, oh, I'm just going to look for summer camps or something, then, yeah. I don't think it should make a difference, but we haven't really pushed and tested that aspect of it. Most of our tests are more natural transitions. How do you eval deep research? Oh, boy. Yeah, this is a hard one.
I think the entropy of the output space is so high. People love autoraters, but they bring their own set of challenges. And so for us, we have some metrics that we can auto-generate. So for example, when we do post-training and have multiple models, we kind of want to look at the distribution of certain stats, like how long it spent on planning, or how many iterative steps it does, on some dev set.
If you see large changes in distribution, that's kind of an early signal that something has changed. It could be for better or worse. So we have some metrics like that that we can auto-compute. So every time you have a new version, you run it across a test suite of cases and you
see how long it takes. Yeah, so we have a dev set and we have some automatic metrics that we can track in terms of the end-to-end behavior. For example, how long is the research plan? Does a new model produce really longer ones, many more steps? Just number of characters? Like number of steps, in the case of a research plan. In the plans, it could be,
like we spoke about, how it iteratively plans based on previous searches: how many steps does that go, on average, over some dev set? So there are some things like this you can automate. But beyond that, there are autoraters, and we definitely do a lot of human evals. And there we have defined, with product, certain things we care about, and been super opinionated about them. Is it comprehensive? Is it complete? Groundedness and these kinds of things.
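The automated check described here, comparing distributions of cheap behavioral stats across model versions on a dev set, might look roughly like this; the stat names and the drift threshold are illustrative, not the team's actual metrics:

```python
from statistics import mean

def behavior_stats(runs):
    """Aggregate per-run stats into means.

    runs: list of dicts like {'plan_steps': 6, 'search_iters': 14},
    one dict per dev-set query processed by a given model version.
    """
    return {key: mean(run[key] for run in runs) for key in runs[0]}

def drifted(baseline, candidate, rel_tol=0.25):
    """Flag stats whose mean moved more than rel_tol vs the baseline.

    A flagged stat is not necessarily worse, just changed enough
    that a human should look at why.
    """
    return [key for key in baseline
            if abs(candidate[key] - baseline[key]) > rel_tol * baseline[key]]
```

A drift flag then routes those dev-set examples to the human-eval side (comprehensiveness, completeness, groundedness) rather than deciding quality by itself.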
So it's a mix of these two approaches. There's another challenge, though. Another challenge is that sometimes you just have to have your PM review examples. Yeah, exactly. And for latency. So you're the human rater. The human rater. But broadly, what we tried to do for the eval question is, we tried to think about what are all the ways in which a person might use
a feature like this, and we came up with what we call an ontology of use cases. Yes. And really what we tried to do is stay away from verticals like travel or shopping and things like that, and really try to go into: what is the underlying research behavior type that a person is doing? So there are
queries on one end where you're going very broad but shallow, right? Shopping queries are an example, or like, I want to find the perfect summer camp, my kids love soccer and tennis. And really you just want to find as many different options as possible, explore all the different options that are available, and then synthesize, okay, what's the TL;DR about each one? Kind of like those journeys where you open many, many Chrome tabs, but then you
need to take notes somewhere of the stuff that's appealing. On the other end of the spectrum, you've got a specific topic and you just want to go super deep on that and really, really understand it. And there are all sorts of points in the middle, right? Like, okay, I have a few options, but I want to compare them. Or, I want to go not super deep on a topic, but I want to cover slightly more topics. And so we sort of developed this ontology of different
research patterns. And then for each one, we came up with queries that would fall within it. And that's sort of the eval set by which we then run human evals and make sure we're doing well across the board on all of those. Yeah. You mentioned three things. Is it literally three, or is it three out of like 20 things?
I basically just told the full set. Yeah, I told you the extremes, right? The extremes, okay. Yeah, and then we had several midpoints. So basically, yeah, going from something super broad and shallow to something very specific and deep. We weren't actually sure which end of the spectrum users were going to really resonate with. And then on top of that, you have compounds of those, right? So you can have things where you want to make a plan.
Right. A great one is: I want to plan a wedding in, you know, Lisbon, and I need you to help with these ten things. Right. And so that becomes a project with research enabled. And so then it needs to research planners and venues and catering, right? So there are compounds when you start combining these different underlying ontology types. And we also thought about that when we tried to put together our eval set.
What's the maximum conversation length that you allow or design for? We don't have any hard limits on how many turns you can do. One thing I will say is most users don't go very deep. Right now? Yeah. It might just be that it takes a while to get comfortable, and then over time you start pushing it further and further. But right now we don't see a ton of users going deep. I think the way that you visually present it
suggests that you stop when the doc is created. Right. So you don't actually really encourage... the UI doesn't encourage ongoing chats as though it was a project. Right. I think there are definitely some things we can do on the UX side to invite the user: hey, this is the starting point, now let's keep going together. Like, where else would you like to explore?
So I think there are definitely some explorations we could do there. In terms of how deep, I don't know. We've seen people internally just really push this thing in quite a lot of ways. I think the other thing that will change with time is people uncovering different ways to use deep research as well. Like the wedding planning thing, for example; it's not one of the, you know, first things that comes to mind when we
tell people about this product. So that's another thing: as people explore and find that this can do these various different kinds of things, some of this can naturally lead to longer conversations. And even for us, right, when we dogfooded this, we saw people use it in ways we hadn't really thought of before. That was because this was a
little new like we didn't know like will users wait for five minutes what kind of tasks will are they you know going to try for something like that takes five minutes so our primary goal was not to specialize in a particular vertical or target one type of user. We just want to put this in the hands of like...
We had this busy parent persona and various other user profiles, and we wanted to see what people would try to use it for and learn from that. And how does the ontology of the DR use cases tie back to Google's main product use cases? You mentioned shopping as one ontology, and there's also Google Shopping. To me, this sounds like a much better way to shop than going on Google Shopping and looking at a wall of items. How do you collaborate internally to figure out where AI goes?
Yeah, that's a great question. When I said shopping, I tried to boil down what the underlying behavior really is, and I called it options exploration. Whether you're shopping for summer camps or shopping for a product
or shopping for scholarship opportunities, it's the same action: I need to sift through a lot of information to curate a set of options for me. That's what we tried to distill, rather than thinking about it as a vertical. But Google Search is awesome if you want really fast answers and you have high intent, like, I know exactly what I want.
And you want super up-to-date information. I still kind of like Google Shopping because it's multimodal; you see the best prices and things like that. I think creating a good shopping experience is hard, especially when you need to look at the thing. If I'm shopping for shoes, I don't want to use deep research, because I want to see how the shoes look. But if I'm shopping for HVAC systems, great, I don't care how it looks. I don't even know what it's supposed to look like, and I'm fine using deep research because I really want to understand the specs, how exactly this works, the voltage rating, and so on. And I also need to look at contractors who know how to install each HVAC system. So where we really shine when it comes to shopping is that end of the spectrum.
It's more complex, and how it looks matters less. It's maybe less the consumer-y side of shopping. One thing I've also observed about the metrics, or the communication of the value you provide, and this also goes into the latency budget, is that there's a perverse incentive for research agents to take longer and be perceived as better.
People are like, oh, you're searching 70 websites for me, but 30 of them are irrelevant. I feel like right now we're in a honeymoon phase where you get a pass for all this. Being inefficient is actually good for you, because people just care about quantity and not quality.
Right, they're like, oh, this thing took an hour for me, it's doing so much work. That was super counterintuitive for us. Actually, the first time I realized what you're saying was when I was talking to Jason Calacanis, and he was like, do you actually just make the answer in 10 seconds and then make me wait for the balance? Yeah. We hadn't expected that people would actually value the work it's putting in. Because you were actually worried about it. We were really worried about it. I remember we actually built two versions of deep research. We had a hardcore mode that takes like 15 minutes, and what we actually shipped is a thing that takes five minutes. I even went to eng and said there has to be a hard stop, by the way: it can never take more than 10 minutes. Yep. Because I thought at that point users would just drop off.
But what's been surprising is that's not the case at all; it's been going the other way. Because when we worked on Assistant, at least, and other Google products, the metric has always been: if you improve latency, all the other metrics go up. Satisfaction goes up, retention goes up, all of that. So when we pitched this, it was like, hold on. In contrast to all Google orthodoxy, we're actually going to slow everything right down.
And we're going to hope that users still stick with it. Not on purpose. Not on purpose. Yeah, I think it comes down to the trade-off: what are you getting in return for the wait? From an engineering slash modeling perspective, it's just trading off inference compute and time to do two things: either to explore more, to be more complete,
or to verify more on things you probably already know. Since it's a spectrum and we don't claim to have found the perfect spot, we had to start somewhere, and we're trying to see where the balance lies. There are probably some cases where you care about verifying more than others, and in an ideal world, based on the query and conversation history, you'd know what that is. So it basically boils down to these things from a user perspective.
Am I getting the right value from the wait? And from an engineering slash modeling perspective, are we using the compute to explore effectively, and also to verify and go in depth on things that are vague or uncertain in the initial steps? The other point, about the number of websites: again, it comes with a trade-off. Sometimes you want to explore more early on, before you narrow down on either the sources or the topics you want to go deep on.
For most queries, the way deep research works is that initially it goes broad. If you look at the kinds of websites, it's trying to explore all the different topics we mentioned in the research plan. Then you'll see the choice of websites getting narrower, on a particular topic or a particular entity it has come across, and so on. That's roughly how the number fluctuates.
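That broad-then-narrow behavior could be sketched very roughly like this. All names here (`run_research`, `search`, the fan-out numbers) are invented for illustration, not Google's actual implementation:

```python
# Toy sketch of a research loop that starts broad and then narrows.
# `search` is a stand-in for any web search call; names are hypothetical.
def run_research(plan_topics, search, max_steps=2):
    findings = []
    focus = list(plan_topics)              # step 0: every topic in the plan
    for step in range(max_steps):
        breadth = max(1, 8 // (step + 1))  # fan-out shrinks each iteration
        next_focus = []
        for topic in focus:
            for source, snippet in search(topic, limit=breadth):
                findings.append((topic, source, snippet))
                # follow up on the specific entities we just surfaced
                next_focus.append(f"{topic}: {source}")
        focus = next_focus[:breadth]       # keep only the narrower follow-ups
    return findings
```

Early iterations hit many sites across every planned topic; later ones re-query only the entities surfaced so far, which matches the fluctuation in website counts described here.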
We don't do anything deliberate to keep it high or low. Would it be interesting to have an explicit toggle for the amount of verification versus the amount of search? I think users would always just hit that toggle. I worry that... Max everything. Yeah. If you give users a max power button, they're always just going to hit that button. So then the question becomes: why don't you just decide, from the product point of view, where the right balance is?
Where's the right balance? I think it's either Anthropic or OpenAI that has a preview of this model routing feature where you can choose intelligence, cheapness, and speed. But they're all zero-to-one values, so you'd just choose one for everything. Obviously they're going to do some normalization, but users are always going to want the max.
We've discussed this a bit. If I wear my pure user hat, I don't want to say anything. I come with a query; you figure it out. At the same time, based on the query, for example, if I'm asking, hey, how do rising rates from the Fed affect household income for the middle class, and how has that traditionally played out, for these kinds of things you want to be very accurate.
And you want to be very precise on the historical trends and so on. Whereas there's a little more leeway when you're saying, hey, I'm trying to find businesses near me where I can go celebrate my birthday. So in an ideal world, we'd figure out that trade-off based on the conversation history and the topic. I don't think we're there yet as a research community, and it's an interesting challenge in itself.
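In the simplest possible form, that query-dependent verification budget could be a keyword heuristic. The keywords and numbers below are made up; the real version they describe would presumably be inferred from the query and conversation history:

```python
# Toy router: high-stakes factual queries get more cross-checking
# passes than casual, low-stakes ones. The keyword list is invented.
HIGH_STAKES = ("fed", "rates", "income", "historical", "medical", "specs")

def verification_budget(query: str) -> int:
    q = query.lower()
    # 3 verification passes for factual/financial topics, 1 otherwise
    return 3 if any(word in q for word in HIGH_STAKES) else 1
```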
So this reminds me a little bit of the NotebookLM approach. We asked Raiza the same thing, and she was like, yeah, people just want to click a button and see magic. Yeah, like you said, you just hit start every time. Most people don't even want to edit the plan. So, okay, my feedback on this, if you want feedback, is that I am still kind of a champion for Devin, in the sense that
Devin will show you the plan while it's working the plan. You can say, hey, the plan is wrong, and you can chat with it while it's still working, and it live-updates the plan and then picks off the next item on the plan. Yours is static, right? While it's working on a plan, I cannot chat. That's just the norm; Bolt also has this. It's the most common default experience. But I think you should never lock the chat.
You should always be able to chat with the plan and update the plan, and the plan scheduler, whatever orchestration system you have under the hood, should just pick off the next job on the list. That would be my two cents. Especially if we spend more time researching. Right. Because right now, if you watch that query we just did, it was done within a few minutes. It left the research phase after a few minutes, so your opportunity to chime in and steer it was small. But you could imagine a world where these things take an hour, and you're doing something really complicated. Then yeah, your intern would totally come check in with you: here's what I found, here are some hiccups I'm running into with the plan, give me some steer on how to change that or how to change direction.
And you would do that with them. So I totally see, especially as these tasks get longer, that we'd want the user to come engage way more to create a good output. I guess Devin had to do this because some of these jobs take hours. Right. Yeah, I can totally imagine. And it's a perverse incentive, since they charge by the hour.
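The never-lock-the-chat idea, a plan the user can edit while the orchestrator keeps pulling the next step, could be sketched like this (the class and method names are invented for illustration):

```python
from collections import deque

# Minimal sketch of a live-editable plan: the orchestrator pops steps
# while the user may replace the pending ones mid-run.
class LivePlan:
    def __init__(self, steps):
        self.todo = deque(steps)
        self.done = []

    def next_step(self):
        # Orchestrator pulls the next job; None means the plan is finished.
        step = self.todo.popleft() if self.todo else None
        if step is not None:
            self.done.append(step)
        return step

    def revise(self, new_steps):
        # User chats mid-run: pending (not yet completed) work is replaced.
        self.todo = deque(new_steps)
```

`revise` can be called between any two `next_step` calls, so steering the agent never requires stopping it.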
Oh, so they make more money the slower they are. Have we thought about that? I'm calling this out because everyone right now is like, oh my God, it does hours of work autonomously for me, so it must be good. But this is a honeymoon phase. At some point, we're going to say, okay, but it's very slow. Yeah.
Anything else? Obviously within Google you have a lot of other initiatives; I'm sure you sit close to the NotebookLM team. Any learnings from shipping AI products in general? They're really awesome people, really nice and friendly, just as people. I'm sure you realized this when you met Raiza and the team. They've actually been really,
really cool collaborators, and just good people to bounce ideas off. One thing I found really inspiring is they just picked a problem. Hindsight is 20/20, but going in it was just: hey, we want to build the perfect IDE for you to do work, upload documents, ask questions about them, and make that really, really good.
And I think we were definitely inspired by their vision: pick a simple problem, really go after it, do it really, really well, be opinionated about how it should work, and hope that users resonate with that. That's definitely something we tried to learn from. Separately, they've also been really good at, and maybe Mukund, you want to chime in here, extracting the most out of Gemini 1.5 Pro.
And they were really friendly about sharing their ideas on how to do that. Yeah, you learn a lot when you're doing the last mile of these products, the pitfalls of any given model, and so on. So yeah, we definitely have a healthy relationship and share notes, and we do the same with other products. You'll never merge, right? It's just different teams.
They are a different team. They're in Labs as an organization, and the mission there is to explore different bets and explore what's possible. Even though there's a paid plan for NotebookLM now. Yeah, and it's the same plan as us, actually. So it's more than just Labs, is what I'm saying. It's more than just Labs. Because, ideally, you want things to graduate and stick around.
But hopefully one thing we've done is not create different SKUs, but just say: hey, if you pay for the AI Premium SKU, you get everything. What about learning from others? Obviously OpenAI has its deep research, literally the same name, and I'm sure there's a lot of contention. Is there anything you've learned from other people trying to build similar tools? Do you have opinions on
what people are maybe getting wrong, what they should do differently? From the outside, a lot of these products look the same: ask for research, get back research. But obviously when you're building them, you understand the nuances a lot more. When we built
deep research, there were a few different bets we took on how it should work, and what's nice is some of them do feel like the right way to go. We felt agents should be transparent about telling you upfront, especially if they're going to take some time, what they're going to do. That's really where the research plan comes in; we showed it in a card. We also really wanted to be very publisher-forward
in this product. So while it's browsing, we wanted to show you all the websites it's reading in real time and make it super easy for you to double-click into those. And the third thing is putting the output into a side-by-side artifact, so it's ideally easy for you to read and ask questions at the same time. What's nice is that as other products come around, you see some of these ideas appearing in their iterations of the product too. So I definitely see this as a space where everyone in the industry is learning from each other; good ideas get reproduced and built upon. So yeah, we'll definitely keep iterating, following our users, and seeing how we can make our feature better. But yeah, I think
this is just the way the industry works: everyone's going to see good ideas and want to replicate and build off of them. And on the model side, OpenAI has the o3 model, which, the full one, is not available through the API. Have you tried this already with the 2.0 models? Is it a big jump, or is a lot of the work in the post-training?
Yeah, I would say stay tuned. It currently runs on 1.5. The new generation of models, especially these thinking models, unlocks a few things. One is obviously better capability in analytical thinking, in math and coding and those types of things. But also, as they produce thoughts and think before taking actions, they inherently have this ability to
critique the partial steps they take, and so on. So yeah, we're definitely exploring multiple options to deliver better value for our users as we iterate. I feel like there's a bit of a conflation of inference-time compute here, in the sense that, one, you can spend inference-time compute within the thinking model, and two, you can spend it by searching and
reasoning. I wonder if one gets in the way of the other. Presumably you've tested thinking plus deep research. Maybe the thinking does a bit of verification, so it saves you some time; or maybe it tries to draw too much from its internal knowledge and therefore searches less.
Does it step on each other? Yeah, that's a really nice callout, and this also goes back to the use case. The reason I bring that up is there are certain things I can tell you from model memory, like, last year the Fed did X number of rate updates and so on, but unless I sourced it, it's going to be treated as hallucinated. One issue is hallucination itself; but even if I got it right,
as a user, I'd be very wary of that number unless I'm able to source the .gov website for it, right? So that's another challenge. There are things you might not optimally spend time verifying, even though the model is like, this is a very common fact that I already know and can reason over. Balancing that, leveraging the model's memory versus grounding it in some kind of source, is the challenging part. And,
as you rightly called out, with the thinking models this is even more pronounced, because the models know more. They're able to draw second-order insights just by reasoning. Technically, they don't know more; they just use their internal knowledge more, right? Yes, but also, for example, things like math. I see. They've been post-trained to do better math. Yeah, they probably do a way better job at math than the previous generation, in that same way.
Yeah, I mean, obviously reasoning is a topic of huge interest, and people want to know what the engineering best practices are. We think we know how to prompt these models better, but engineering with them is, I think, still very unknown. Again, you guys are going to be among the first to figure it out. Yeah, definitely interesting times. And yeah, no pressure. If you have tips, let us know.
While we're on the technical bent, I'm interested in other parts of the deep research tech stack that might be worth calling out. Any hard problems that you solved, more generally? Yeah, I think the iterative planning, doing it in a generalizable way. That was the thing I was most wary about. You don't want to go down the route of teaching the model how to plan iteratively per domain or per type of problem. Even going back to the ontology:
if you had to teach the model, for every single type of ontology, how to come up with these planning traces, that would have been nightmarish. So we tried to do it in a super data-efficient way, by leveraging a lot of what's in model memory. There's a very tricky balance when you work on the product side of any of these models:
knowing how to post-train it just enough without losing what it learned in pre-training, basically not overfitting in the most trivial sense. So the techniques there, the data augmentations, and multiple experiments to tune this trade-off, that's one of the challenges. On the orchestration side, you're basically spinning up a job. I'm an orchestration nerd, so how do you do that? Is it some internal tool?
Yeah, so we built this asynchronous platform for deep research. Most of our interactions before this were sync in nature. Yeah, chat is sync. Exactly. And now you can leave the chat and come back. Exactly. And close your computer. And now it's on Android and rolling out on iOS. I saw you say that. I told you we switch roles sometimes. Okay, you're reminding him, right? Yeah, we've ramped on all Android phones, and then iOS is this week.
But yeah, what's neat is you can close your computer, get a notification, and so on. So it's some kind of async engine that you made. Yes. So one piece is this notion of asynchronicity and the user being able to leave. But also, if you build
five- or six-minute jobs, there are bound to be failures, and you don't want to lose your progress. So there's this notion of keeping state, knowing what to retry, and keeping the journey going. Is there a public name for this, or no? I don't think
there's a public name for this. Yeah. All right, in data science this would be a Spark job, or a Ray thing, or in the old Google days it might be MapReduce or whatever. But it's a different scale and nature of work than those things, so I'm trying to find a name for this.
And right now this is our opportunity. Yeah, we can name it now. Well, the classic, because I used to work in this area, that's why I'm asking, is durable workflows: Airflow, Temporal. You were both at Amazon, by the way; AWS Step Functions would be one of those, where you define a graph of execution. But Step Functions are more static
and would not be as able to accommodate deep-research-style backends. What's neat, though, is we built this to be quite flexible, so you can imagine it once you start doing hour-long or multi-day jobs. Yeah, you have to model what the agent wants to do. Exactly. But also, in short, it needs to be stable across hundreds of LLM calls. Yeah. It's boring, but this is the thing that makes it run autonomously.
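The "keep state, know what to retry" behavior described here is essentially a checkpointed, retrying step runner. A rough sketch, with a hypothetical API rather than Google's actual system:

```python
# Durable-job sketch: results are checkpointed per step, so a resumed
# job skips finished work, and transient failures are retried.
def run_durable(steps, checkpoint, max_retries=3):
    for name, fn in steps:
        if name in checkpoint:
            continue                      # finished in a previous run
        for attempt in range(max_retries):
            try:
                checkpoint[name] = fn()   # e.g. one LLM call or one search
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                 # persistent failure: surface it
    return checkpoint
```

In a real system the checkpoint would live in durable storage (the role systems like Temporal or Step Functions play), so losing a worker or closing your laptop doesn't throw away five minutes of progress.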
Yeah, anyway, I'm excited about it. Just to close out the OpenAI thing: I would say OpenAI easily beats you on marketing, and I think it's because you don't launch with benchmarks. My question to you is, should you care about benchmarks? Humanity's Last Exam, or not MMLU, but whatever. I think benchmarks are great.
The thing we wanted to avoid is stuff like, the day Kobe Bryant entered the league, who was the president's nephew, these weird benchmark questions. He's a big Kobe fan. Okay, perfect. Just these weird things that nobody actually asks. So why would we over-solve
for some benchmark that doesn't necessarily represent the product experience we want to build? Nevertheless, benchmarks are great for the industry; they rally a community and help us understand where we're at. I don't know, do you have anything to add? No, I think you hit the point. For us, the primary goal is solving deep research for user value, for the user use case. The benchmarks, at least the ones that we are seeing.
They don't directly translate to the product. There are definitely some technical challenges you can benchmark against, but if I do great on HLE, that doesn't really mean I'm a great deep researcher. So we want to avoid going down that rabbit hole. But we also feel benchmarks are great, especially in the whole gen AI space, with models coming out every other day and everybody claiming to be the best.
So it's tricky. The other big challenge with benchmarks, especially for these models, is the entropy of the output space. Everything is text, so there's the problem of verifying whether you even got the right answer. Different labs do it in different ways, but we all compare numbers. So there's a lot of art in figuring out how you verify this, or how you run it on a level playing field.
But yeah, there are trade-offs. There's definitely value in doing benchmarks. At the same time, from a selfish PM perspective, benchmarks are a really great way to motivate researchers. Make the number go up. Exactly. Or just prove you're the best. It's a really good way of rallying the researchers within your company. I used to work on the MLPerf benchmarks, and
you'd put a bunch of engineers in a room, and in a few days they'd make amazing performance improvements on our TPU stack and things like that. Just having that competitive nature and pressure really motivates people. There's one benchmark that is impossible to benchmark, but I just want to leave you with it: most people are chasing this idea of discovering new ideas, and deep research right now will summarize the web
in a way that is much more readable. What will it take to discover new things from the things you've searched? First, I think the thinking-style models definitely help here, because they are significantly better at reasoning natively and drawing these second-order insights, which is the premise: if you can't do that, you can't even think of doing what you mentioned. So that's one step. The other thing is,
it also depends on the domain. Sometimes you can riff with a model on a new hypothesis, but depending on the domain, you might not be able to verify that hypothesis. For coding and math, there are reasonably good tools the model already knows how to interact with, and you can run a verifier, test the hypothesis, and so on, even from a purely agentic perspective: saying, hey, I have this hypothesis in this area,
go figure it out and come back to me. But say you're a chemist. What are you going to do there? We don't have synthetic environments yet where the model can verify these hypotheses by playing in a playground, with a very accurate verifier or reward signal. Computer use is another one where,
both in open source research and elsewhere, nice playgrounds are coming up. So if you're talking about truly being able to come up with new things, my personal opinion is that the model not only needs the second-order thinking we're seeing now with these new models, but also the ability to play and test things out in an environment that can verify and give feedback, so that it can continue iterating.
Basically code sandboxes, for now. Yeah, yeah. So in those kinds of cases it's a little easier to envision this end to end, but not for all domains. Physics engines. Yeah, yeah. So, if you think about agents more broadly, there are a lot of things that go into them. What do you think are the most valuable pieces people should be spending time on? Things that come to mind that I'm seeing from a lot of early-stage companies: memory,
evals, which we already touched on, and we touched a little on tool calling. There's the auth piece: should this agent be able to access this, and if yes, how do you verify that? What are the things you want more people to work on that would be helpful to you? I can take a stab at this through the lens of deep research. Some of the things we're really interested in, in how we can push this agent, are:
one, similar to memory, personalization. If I'm giving you a research report, the way I'd write it for a 15-year-old in high school should be totally different from the way I'd write it for a PhD or a postdoc. You can prompt it. You can prompt it, right. But the second thing is, it should ideally know where you're at and everything you know up to that point,
and customize further, with an understanding of where you are in your learning journey. I think modality will also be really interesting. Right now we're text in, text out. We should go multimodal in, but also multimodal out. I would love it if my reports were not just text, but charts, maps, images: super interactive and multimodal,
and optimized for the type of consumption. The way I might put together an academic paper should be totally different from the way I'd put together a learning program for a kid, just in how it's structured. Ideally, you want to
do things with generative UI and the like to really customize reports. Those are definitely things I'm personally interested in when it comes to a research agent. I think the other part that's super important is that we will reach the limits of the open web.
A lot of the things people care about are in their own documents, their own corpora, or within subscriptions they personally care about, especially as you go more niche
into specific industries. Ideally, you want ways for people to complement their deep research experience with that content, to further customize their answers. There are two answers to this. One is, in terms of the approach, for us at least, or for me rather, it's about figuring out the core mission for an agent and building that. I feel it's still early days to try to platformize, to say,
oh, there are these five horizontal pieces and you can plug and play and build your own agent. My personal opinion is we are not there yet. To build a super engaging agent, if I were to start on a new idea, I would start from the idea and try to
just do that one thing really well. Yes, at some point there will be a time when these common pieces can be pulled out and platformized. I know there's a lot of work across companies and in the open source community on providing tools to build agents very easily. Those are super useful to get started, but even once those tools cover the basic layers,
I as an individual would try to focus on really curating one experience before going super broad. Yeah, we had Brett Taylor from Sierra on, and he said they mostly built everything in-house. Built everything in-house, which is very sad for VCs. I want to find the next great framework and tooling and all that. But the space is moving so fast; the problem I described might be obsolete six months from now, I don't know. We'll fix it with one more LLM ops platform. Yes, yes.
Okay, so just a final point on plugging your talk. People will be hearing this before your talk. What are you going to talk about? What are you looking forward to in New York? I would love to actually learn from you guys. What would you like us to talk about, now that we've had this conversation with you?
Yeah, what do you think people would find most interesting? I think a little bit of implementation and a little bit of vision, kind of 50-50. And I think both of you can fill those roles very well. Everyone looks at you and sees very polished
Google products, and I think Google always does polish very well. But everyone will want deep research for their industry. People have invested in deep research for finance, where they focus on their thing, and there will be deep researches for everything. You have created a category here that OpenAI has cloned. So let's talk about the hard problems in this brand of agent, which is probably the first real
product-market-fit agent, I would say, more so than the computer use ones. This is the one where people say it easily pays for $200 a month worth of value, probably $2,000 once you get it really good. So let's talk about how to do this right, from the people who did it, and then where this is going. So yeah, it's very simple.
Happy to talk about that. Yeah, thank you. For me as well, I'm also curious to see you interact with the other speakers, because there will be other sorts of agent problems. I'm very interested in personalization, very interested in memory; I think those are related problems. Planning, orchestration, all those things. Auth and security is something we haven't talked about: a lot of the web is behind auth walls. How do I delegate my
credentials to you so that you can go and search the things I have access to? I don't think it's that hard; people just have to get their protocols together. And that's what conferences like this are hopefully meant to achieve. Yeah, I'm super excited. For us, we often live and breathe within Google, which is a really big place, but it's really nice to take a step back and meet people approaching this problem at other companies or in totally different industries.
Inevitably, at least where we work, we're in a very consumer-focused space. I see. Right? Yeah. It's also really great to understand what's going on in the B2B space and within different verticals. Yeah, the first thing they'll want is research over their own docs, their company docs. So obviously you're going to get asked for that. Yeah, I mean, there'll be more to discuss.
Looking forward to your talk, and thanks for joining us. Yeah, thanks for having us. Thanks so much, guys.