
Uber’s On-Call Copilot with Paarth Chothani and Eduards Sidorovics

Apr 08, 2025 · 44 min

Summary

This episode explores Uber's AI-powered on-call copilot, Genie, designed to improve engineering efficiency. The discussion covers the motivations behind Genie, its architecture, and the challenges in building and evaluating such a system. Key topics include data curation, model selection, security considerations, and the productivity gains observed after implementing Genie.

Episode description

At Uber, there are many platform teams supporting engineers across the company, and maintaining robust on-call operations is crucial to keeping services functioning smoothly. The prospect of enhancing the efficiency of these engineering teams motivated Uber to create Genie, which is an AI-powered on-call copilot. Genie assists with on-call management by providing real-time responses to queries, streamlining incident resolution, and facilitating team collaboration.


Transcript

At Uber, there are many platform teams supporting engineers across the company, and maintaining robust on-call operations is crucial to keeping services functioning smoothly. The prospect of enhancing the efficiency of these engineering teams motivated Uber to create Genie, which is an AI-powered on-call copilot. Genie assists with on-call management by providing real-time responses to queries, streamlining incident resolution, and facilitating team collaboration. Paarth Chothani is a staff software engineer on the Uber AI Gen AI team. Eduards Sidorovics is a senior software engineer on the Uber AI platform team. In this episode, they join the show with Sean Falconer to talk about the challenges that motivated the creation of Uber Genie, the architecture of Genie, and more.

This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. Thank you. Yeah, thanks for being here. You know, I'm really excited to talk about, you know, Genie, this on-call copilot that you guys were involved in at Uber. But maybe before we get there, let's have you introduce yourselves, just so, since there's both of you,

people can, you know, hopefully learn your voices. But let's start with you, Paarth. Like, who are you? What do you do? Yeah. Hey, everyone. I'm Paarth here. I'm a backend infrastructure engineer on Michelangelo, which is like the SageMaker equivalent at Uber. And I've been here at Uber for four years, working on distributed systems, generative AI,

and core ML problems. Before that, I was at AWS, also building chatbot-like solutions, and at Microsoft, working on Teams and those kinds of products. Awesome. And Eduards, same question to you. Who are you? What are you doing? My name is Eduards. And I joined Uber a bit more than a year ago and pretty much started right away to work with Paarth on Genie. I'm also part of the ML AI platform team. Yeah, and before that I was pretty much working in some startups, training mostly some deep learning models,

and partially, for a year, also working at a manufacturing company and doing some MLOps stuff for them. Awesome. So let's get into Genie a little bit. Can you explain what this project was and sort of how it came to be? Yeah, I can maybe take a first shot at it. So at Uber, generally, there are so many platform teams supporting many engineers across the company.

And there is a lot of tooling and all of that that gets built to support all the engineers and to make sure that infrastructure is very highly scaled. And as part of that, there are many, many support forums; specifically, Slack is a very popular one. And generally, engineers will come to Slack for help.

And what we also went through, as part of our own background and whatnot, is that there was a lot of pain that we as engineers faced when asking for help from other support teams or other platform teams. And this was a recurring pain across the company. So that was something that prompted us to think, like, how can we solve this kind of a problem? And that's where the inception of Genie started.

We wanted to have an automated solution which can look at all the internal knowledge sources and be able to answer questions, so that customers across the company, engineers across the company, can get help from it and really improve their efficiency. So, I mean, I think this is a super common problem that many companies suffer from. And certainly, as you scale any organization, it becomes more and more of a problem where you end up with

data silos that exist, or in the Slack world, chat silos of one-off bespoke conversations where someone gets help. And then inevitably people ask sort of the same types of questions. It's hard to surface that in a uniform way. And it becomes this kind of death by a thousand cuts. Like, you know, basically every company suffers from this, and certainly at scale it becomes a real hindrance.

So I totally get that. I want to get into, you know, Genie's architecture a little bit and how you built up the project. So can you talk a little bit about what's actually going on? What is the user interaction, and then sort of what is happening behind the scenes to support that user interaction? I mean, I'll just maybe start with the user experience of it. Assuming I have a team, and we maintain our own eng wiki and we have our own helpdesk channel,

so we, as customers, want to onboard Genie. How it happens is pretty much: we have a platform, Michelangelo; you go there, you create a project, you specify the eng wiki which you want to use, and then everything else happens on our side. It creates the pipeline, you run the pipeline. On the backend it pretty much scrapes the data, embeds everything, and stores it.
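To make that flow concrete, here is a minimal sketch of what a scrape-chunk-embed-store ingestion step could look like. The helper names (`chunk_text`, `VectorStore`, the page dictionaries) and the choice of OpenAI's ada embedding model are illustrative assumptions, not Uber's actual implementation, which runs this at scale with Spark and Cadence as described later.

```python
# Hypothetical RAG ingestion sketch; names and defaults are assumptions, not Uber's code.
from dataclasses import dataclass, field

from openai import OpenAI  # assumed dependency: pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Naive fixed-size chunking; real systems usually split on headings or sentences."""
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]


@dataclass
class VectorStore:
    """Stand-in for a managed vector DB; keeps chunk embeddings in memory."""
    rows: list[dict] = field(default_factory=list)

    def upsert(self, doc_id: str, chunk: str, embedding: list[float], source_url: str) -> None:
        self.rows.append({"doc_id": doc_id, "chunk": chunk,
                          "embedding": embedding, "source_url": source_url})


def ingest(pages: list[dict], store: VectorStore) -> None:
    """Scrape -> chunk -> embed -> store, the flow described in the episode."""
    for page in pages:  # each page: {"url": ..., "text": ...} from the scraped eng wiki
        for i, chunk in enumerate(chunk_text(page["text"])):
            emb = client.embeddings.create(
                model="text-embedding-ada-002",  # the ada family mentioned later in the episode
                input=chunk,
            ).data[0].embedding
            store.upsert(f"{page['url']}#{i}", chunk, emb, page["url"])
```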

And yeah, there's another backend service, which is actually queried when someone asks a question. Meaning that on the other end there's, let's say, a Slack bot, and it calls this backend service, which gathers which channel it is queried from and then sends the question to the LLM with the provided context.

Yeah. And just to add to that, basically our goal has always been a simplified, very cohesive user experience where people can come in, they just specify the sources, and then boom, that's it. And then everything else is just a one-click setup for them to be able to, you know, use Genie in their own Slack channels or in their own UIs, whatnot. Yeah, so that's been our North Star experience that we have always been trying to build.

So me as a user, I point this to some internal wikis and knowledge bases. And then this RAG pipeline kicks off where it's going to go and essentially parse those, presumably go through some chunking process, create embeddings, land that in some sort of vector store. Can you get into the details of how that pipeline works? What are the steps and components of it? Yeah, so underneath, we use a lot of big data technologies like Spark, which helps us be able to take a lot of internal sources,

generate embeddings on the fly, and parallelize, basically, with, let's say, different executors taking different chunks and creating embeddings either through in-house models or third-party embedding models. And then even for something like when we want to push the data to a vector store, we have workflows and open source technologies, something like Cadence, which is an Uber-grown workflow system,

to be able to ingest data into the vector store at scale. And there also we have Spark behind the scenes to be able to take all this data and do a faster ingestion on the fly. And when I come in and I select these sources, how long does it take to essentially generate the vector embeddings to a point where I can start actually interacting, going through sort of the user experience, being able to get my questions answered by this copilot?

Let's say there are different things. One is when you onboard yourself, and because there are some approvals you have to wait for and whatnot, it can take like a day, for example. But if you, let's say, update your sources or completely revamp your sources, whatever, it doesn't matter, you run the pipeline and, yeah, today I ran one and it took 15 minutes. Then it completely updates the sources. Yeah. Why have it where it's sort of a user configuring these sources

for their specific needs, versus more of a wholesale pipeline that is scraping everything, building one vector representation of all these internal knowledge bases and wikis, and then presumably being able to use semantic search to attach the right context for the user when they're interacting with the copilot?

No, that's a fantastic question, because that was very much our first thought as we wanted to build something like that. I think some of the things we learned also: we were trying to explore some solutions like Glean, which could support something like this out of the box, and what we kind of found was that the way we had configured Glean inside Uber,

it was very, very individual-access oriented, and there was not something like public data which was all scraped for us already. So that was one problem that we surfaced very, very early on. And then I think we also found that when we are more focused on ingesting sources which are, let's say, more hand curated, more

filtered, the vector store always does a better job at surfacing that information, and also the accuracy is much higher. Versus when we experimented, like, we tried ingesting all of the eng wiki, for example, and the accuracy of the answers seemed all over the place for a given use case. So we felt like it was not the best performance either. So we tried to find a sweet spot where we can enable people to bring in their sources, but then again,

make it more like a magical user experience so that it's more UI driven and people don't have to do much. And that's what we have been building towards. Right. I just wanted to add that it's also use-case driven. In our case, we have a helpdesk channel for Michelangelo, for our ML platform. So people are coming there to ask specifically about Michelangelo, so it's better to narrow down only to Michelangelo and not to surface anything

someone else has written about Michelangelo which might not be updated. Maybe they once wrote it, they're not updating it, and now, because of their outdated information, it will surface the wrong answer. So we can also kind of eliminate that. Got it. Yeah, sort of like an easier way to get the performance and accuracy that you need, by having people sort of self-select into how they want to constrain the universe, than to try to

programmatically, essentially, figure out what the right context is going to be, because you're going to end up with a lot of noise across these different potential internal wikis. Yeah. Was there consideration around essentially scraping everything, but then, for each sort of chunk,

keeping the representation of the source? That way, when I'm selecting my sources, I don't have to go through the generation process of the pipeline based on sources; I'm essentially just subsetting the existing set of sources and embeddings. Yeah, we definitely wanted that kind of experience to start with. I think we found some infrastructure gaps, where we pretty much don't have all of these internal sources in an offline store that we can just take and create embeddings from in the background.

So we found some limitations there, which is where we went with the next best experience, which is like, let's just create it on the fly. Also, the other thing is, there is a lot of wastage if we were to just create everything behind the scenes. The reason being that it feels like almost every team runs its own processes and own style. So some teams prefer, like, Michelangelo, for example, is very wiki driven.

We will be very meticulous in updating wikis with all FAQs and all user documentation. And some other teams seem to not have that discipline, which is where, again, if we were to just blindly do everything, we pretty much might waste a lot of resources and not have much business gain either. Versus letting people choose; then we have given them the capability to refresh knowledge,

which means, like, once they know what they need, then they are able to refresh it at their own pace, which pretty much is like a second-best experience of what you're talking about. So if internal information gets updated in any of these sources, do I need to go manually do a refresh? Or, when these updates happen to the actual internal knowledge base or wiki, does it automatically kick off this pipeline to do the updates?

I mean, it doesn't detect anything so far. Meaning that initially it was only manual: if you update and you want to refresh, you go ahead and click. But now it's also orchestrated, like a cron job, in a way. You can update it daily or at whatever cadence you prefer. Okay. And then what model are you using for generating the embeddings?

Yeah, so we have two different flavors of models. We have some of the third-party models which are open source and which we have hosted inside; those are some options. And other options are also, you know, third-party models, which OpenAI and other providers give.

So I think we have those options, but generally we preferred the ada embedding models from OpenAI to begin with, and those have worked reasonably okay for what we have been trying to test so far. And what are you using for the vector store? Yeah, so we have a homegrown solution right now for the vector store, and we are trying to move towards other, better vector store solutions, but

that homegrown solution is what we call SIA. And that's a solution that we have been working towards. And we're trying to embrace a new technology, OpenSearch, as we are trying to become more open source compatible. Was that something that existed already at Uber or was that something you built specifically for this project? Yeah, so the technology did exist, but the technology existed more for the typical search, which is more text-based search.

I think when we started the project, the company started realizing that there is a much bigger need for a hosted vector DB solution, right? An in-house hosted solution. So then one of our sister teams spun up infrastructure to be able to host a managed vector DB solution. So it was more like, I think we learned as a company that there was a need, across GenAI solutions like this,

to have a very nice, highly available infrastructure for the vector DB here. And in terms of both the pipelines and also sort of the user experience interacting with the copilot, how is that essentially built to maintain reliability and sort of durability, so that parts of this pipeline don't end up breaking or going out? Yeah, I can start, and maybe Eduards, you can chime in.

As part of any production internal or external application, we have a highly available monitoring system. So as part of that, what we do is we make sure we have alerts on the backend APIs that surface responses to the Slack channels or any UIs that we are supporting for Genie.

So that's obviously part one. We also look at logs to make sure there's nothing obvious that is going wrong. And then Eduards can probably chime in on more of the evaluation and what we have built around it, to make sure customers learn about how their applications, how their channels and UIs, are performing against our backends and the whole end-to-end. I think one of the solutions is that we constantly receive feedback.

Meaning that when Genie replies, there's like a prompt where you can react with an emoji saying, okay, is it good, is it resolved by Genie, or is it not good enough? So this always keeps a feedback loop for us to know if something is good or not good. And then we kind of build some evaluation on top of it. One of the more interesting solutions we did is that we thought, okay, Genie has an interesting perspective on the documentation,

because, like, when you as an engineer write the documentation, you think you know what people need to know, but typically it's not true. And when customers ask a question, typically it means that something is not covered in the documentation, or they were just too lazy to check the documentation. And what we actually built is that we checked which answers were not good.

Meaning that if the answer is not good, it means that either the RAG components were not good or the documentation was actually not there. And if the documentation was the issue, then, with another LLM as the judge, we try to suggest what is missing in the documentation. It kind of summarizes all the unanswered questions

and tries to point out where it should be added and what should be added. Specifically, how to run this pipeline, how to debug, it kind of points that out. Obviously, it doesn't know how to do it, because it's our internal knowledge. But yeah, it helps, and some users are actually acting on it pretty well. And just to add to this, basically our idea is to give people these tools that Eduards was talking about,

where they can pretty much figure out some of the high-level themes around what documentation is missing, where the bot might be underperforming. They at least have some headway to figure out how to improve their channel quality. In terms of the feedback loop, is that primarily for you to sort of monitor performance and also give the team some insight into where maybe the documentation isn't meeting the needs, essentially?

Or is some part of that also factored into sort of the learning cycles of the actual copilot, so if I know that the response wasn't good, I can take that into account the next time I generate a response to a similar query? So it is more of the first one: yeah, it's to hold us accountable, knowing how good it performs, and to make sure that we also motivate customers to update their eng wikis,

and yeah, to point out what is missing. But also, now we're kind of adding more on top of it. That means that you can help to update the documentation right away. Like, I think Paarth can maybe explain it more, but this means that you can update the FAQs, and then it will eventually go into the knowledge base and it will help to answer the question later on. Yeah, it's more like building a loop where

people find out what is missing, they add FAQs to the documentation, and then, you know, we have the refresh-knowledge pipelines, which pretty much take these FAQs and refresh them. So it's like a quick feedback loop. And we've actually found out, even inside Michelangelo, that

there were parts of our documentation which were outdated. We didn't know about it. And then the bot surfaced some answers and we were like, how did this happen? And then we took initiatives to clean up the documentation. And that was like a quick feedback loop: without even looking at our evaluation reports, we found out immediately that, hey, there is this problem within our documentation where we give conflicting information that we ourselves have not reconciled.
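As a rough illustration of the LLM-as-judge idea described above, a documentation-gap report could be generated from the negatively rated questions and the context that was retrieved for them. The prompt wording, data shape, and model choice below are assumptions for the sketch, not the actual Genie implementation.

```python
# Hypothetical LLM-as-judge documentation-gap report; not Uber's actual code.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing an internal help-desk bot.
Each question below received negative feedback on the bot's answer.
Given the documentation excerpts that were retrieved, say whether the failure looks like
(a) missing or outdated documentation or (b) a retrieval/answering error, and suggest
what section should be added or updated in the docs.

{items}
"""


def documentation_gap_report(failed_cases: list[dict]) -> str:
    """failed_cases: [{"question": ..., "retrieved": ...}, ...] collected from thumbs-down feedback."""
    items = "\n\n".join(
        f"Q: {c['question']}\nRetrieved docs:\n{c['retrieved']}" for c in failed_cases
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(items=items)}],
    )
    return resp.choices[0].message.content
```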

Yeah, I mean, I think that's super valuable, because I know at any significantly large organization I've ever worked for, the internal documentation usually can get really horrific over time. And, you know, there's not a huge incentive to keep those things up to date.

So it can really fall out of date. But they're really valuable, especially for new people, because they don't know where to get those answers. And the only option you have is internal documentation, or you end up having to message somebody and get that sort of bespoke answer. This episode is sponsored by MailTrap, an email platform developers love. Go for high deliverability, industry-best analytics, and live 24-7 support. Get 20% off for all plans with our promo code SEDaily.

Check the show notes for more information. Can you take me through sort of the life of a query? So I'm interacting with this over Slack, I put in a query, then what happens sort of behind the scenes? Yeah, so behind the scenes, when you're querying, basically, we will invoke an API, which underneath tries to figure out what the user is trying to do. And also, as part of the query, we have a very customized Slack workflow functionality, a plugin that we have built,

which can take additional information from the user on what they're trying to do, what action they're trying to perform, which particular product they're trying to interact with, and what the way to reproduce their problem is. So pretty much think of all the additional context that an on-call, and the bot, needs to be able to even figure out what the user is trying to do. All of this additional information we send as part of the question.

Then we pretty much generate embeddings on the fly for the question. We do a vector DB lookup. We make sure, you know, we have all the right context. And as part of the ingestion that we have done for the source data, we make sure the ingestion follows a schema. That way there are, you know, source URLs, there is metadata around what the page was about. And all of this is pretty much available as part of the ingested data in the vector DB.

When we are sending all the information to the LLM, we want to make sure that there is information around citations; there's much more metadata that we can surface for different use cases, pretty much. So all of that metadata is fetched along with the source URL and everything. And we send that to the LLM with different prompts, and we allow different users to configure prompts. There is flexibility in what they want to solve.
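Put together, the query path described here, embed the question, look up the vector store, and prompt the LLM with the retrieved chunks and their source URLs, might look roughly like the sketch below. The cosine-similarity lookup, prompt text, and model name are simplifying assumptions; the production system sits behind Uber's gateway and managed vector DB rather than these stand-ins.

```python
# Hypothetical query-time RAG sketch; prompts, names, and the in-memory lookup are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def top_k(question_emb: np.ndarray, rows: list[dict], k: int = 5) -> list[dict]:
    """Cosine-similarity lookup over ingested chunks; stands in for the vector DB query."""
    def score(row: dict) -> float:
        v = np.array(row["embedding"])
        return float(question_emb @ v / (np.linalg.norm(question_emb) * np.linalg.norm(v)))
    return sorted(rows, key=score, reverse=True)[:k]


def answer(question: str, rows: list[dict], team_prompt: str) -> str:
    hits = top_k(embed(question), rows)
    context = "\n\n".join(f"[{h['source_url']}]\n{h['chunk']}" for h in hits)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": team_prompt},  # teams can configure their own prompt
            {"role": "user", "content": (f"Context:\n{context}\n\nQuestion: {question}\n"
                                         "Answer using the context and cite the source URLs you used.")},
        ],
    )
    return resp.choices[0].message.content
```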

And as part of this, the LLM pretty much decides what the answer should be based on the prompt and all, and then that's what is surfaced to the user today. In terms of the LLM, what model are you using? Yeah, we have experimented with different models that OpenAI came up with. So we started with GPT-4, then we moved to Turbo, there's GPT-4o now, and then we're trying to look at the reasoning models also to see,

you know, how we can have certain questions answered in a much crisper and cleaner way, with the detailed reasoning still. You mentioned at the beginning of that query-to-response pipeline that you're trying to figure out what the user actually wants, so that you can sort of attach that to creating the correct context. What's involved with figuring out what the user actually wants, what the intention behind the query is?

Yeah, I think part of this, what we are also currently experimenting with, is user intent detection, where we can figure out, like, is the user trying to debug a problem? Is the user's question about a product? Those kinds of things we are trying to experiment with, and see where intent detection can help us figure out more of the user's thought process, because we have also understood that not all types of questions are ones the bot can do a great job at.
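A simple version of that intent-detection experiment could be a single classification call before retrieval; the label set and prompt here are assumed for illustration and are not necessarily the labels Genie actually uses.

```python
# Hedged sketch of LLM-based intent detection; labels and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

INTENTS = ["debug_an_incident", "product_question", "feature_request", "other"]


def classify_intent(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (f"Classify this support question into one of {INTENTS}. "
                        f"Reply with the label only.\n\nQuestion: {question}"),
        }],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in INTENTS else "other"  # fall back if the model goes off-script
```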

So as part of our accuracy enhancement, we also want to be more mindful of where the bot can excel and where the bot cannot excel. And that's part of why we are trying to do experimentation and some user intent detection right now. What about metrics around evaluating sort of the effectiveness of this? Like, do you have

things that you're tracking, even in sort of the development process, like using an eval framework, some of these newer frameworks that exist for building generative AI applications, in order to figure out, if you make a change to how you're generating your embeddings or how you're figuring out the intent, whether that's actually a performance improvement versus a degradation of some sort?

I think the main metric is the customer feedback. That's, I guess, our end goal. Yeah, so if you make a change, essentially, you're waiting for sort of live feedback to see if your accuracy has improved based on the feedback from the users. So I think that's part one of it, obviously. And then there are the golden data sets that people generally hand curate,

so that we make sure there is more quality built in before deploying the change. So if somebody changes a prompt or something, we generally ask the users to do more testing against the golden data set, so that they have thought about what kind of implications that has. And Eduards can maybe chime in on the post-production rollout here.
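For readers who want a picture of that golden-data-set gate, here is one way a pre-deployment check might be wired up. The JSONL format, pass threshold, and LLM-graded comparison are assumptions for the sketch, not the team's actual tooling.

```python
# Hypothetical golden-data-set regression check run before shipping a prompt change.
import json
from typing import Callable

from openai import OpenAI

client = OpenAI()


def passes_golden_set(generate_answer: Callable[[str], str],
                      path: str = "golden_set.jsonl",
                      threshold: float = 0.9) -> bool:
    """Each JSONL line holds a question and the expected points a good answer should cover."""
    with open(path) as f:
        cases = [json.loads(line) for line in f]
    correct = 0
    for case in cases:
        answer = generate_answer(case["question"])
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": ("Does this answer cover the expected points? Reply YES or NO.\n"
                            f"Expected: {case['expected']}\nAnswer: {answer}"),
            }],
        ).choices[0].message.content
        correct += verdict.strip().upper().startswith("YES")
    return correct / len(cases) >= threshold
```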

And we also tried different evaluations, I think more classical NLP and then also LLM as a judge. And apparently, in most of the cases, the LLM is just simpler, but it actually typically works better. Were there any challenges or thought around the risk of sensitive information being shared with Genie? Yeah, that was something we really brainstormed and thought a lot about.

And I think, me also coming from Amazon, where I was a security certifier, security was always top of our minds when we started this. And, you know, we want to be very, very mindful of what data gets exposed outside the company. So in the beginning, we were very, very thoughtful about hand curating which data sources are secure.

And we have different levels of gradations, like many other companies, of what data is private versus public, or, you know, what is very sensitive and cannot be leaked outside. So we worked with our security teams, we hand curated certain data sources which were reasonably, you could say, public inside the company. And we obviously went through a lot of different processes inside the company before we

were okay to even create embeddings for those kinds of data sources. And that was our due diligence to make sure that as we develop a new productivity enhancement, we don't leak out data that will mess up our company's reputation. One thing to add: I think there is a very cool solution which is built at Uber. It's a GenAI gateway. It's pretty much, imagine that you have the OpenAI

API, but then it doesn't go directly to OpenAI. It goes through a gateway, and the gateway actually filters PII data. So there's not a high risk of leaking anything. Yeah. So if I put in my social security number for some reason, it's going to get filtered out by the gateway. Yeah. Yeah. We really wanted, as Eduards mentioned, the redaction: PII data should be redacted before it gets sent out.
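As a toy illustration of that gateway-style redaction, a filter in front of the LLM call might scrub obvious PII patterns before anything leaves the network. The real Uber GenAI gateway is far more sophisticated; the regexes below are only examples.

```python
# Toy PII redaction filter; illustrative patterns only, not the actual gateway logic.
import re

REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),          # US SSN-style numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED_EMAIL]"),  # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[REDACTED_CARD]"),        # long card-like digit runs
]


def redact(text: str) -> str:
    """Scrub obvious PII before the prompt is sent to an external model."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


# Example: redact("my ssn is 123-45-6789") -> "my ssn is [REDACTED_SSN]"
```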

That's built into the other ecosystem that our sister teams have built, to make sure that we have security built in and we don't have to worry. But still, as application owners, we have done our due diligence to make sure PII doesn't even come into our ecosystem. Yeah, basically shift that problem left, before it enters the model. Exactly. You know, in terms of building the system, what were some of the biggest sort of technical hurdles that you had to work through?

Yeah, so I think there are many, many different angles where we struggled in the beginning. One was obviously hallucination, for a start, where the bot was, you know, just spitting out things that were sometimes pretty much wrong and not right. So there's a lot of this prompt-based evaluation that we had to do. There is also the UI experience that we really, really thought very deeply about, because there were other solutions that Glean provided, for example,

and those solutions were very individual-access driven, needed approvals from users even to see the answers in channels. And we didn't want that kind of experience, because we wanted a frictionless experience. So definitely the experience was part of it. Then, obviously, when we started developing, there was no industry standard on how to evaluate a GenAI app.

So building the feedback loop system, for example, we had to come up with methodologies on how we can even compute and say we are saving time for the company and users. So there was some methodology we had to develop internally to figure out how to even say there are some productivity gains here.

And then obviously, Eduards can speak more about the eval part; he's driving the whole evaluation of how to showcase what the problem with your documentation is. That was a unique thing that we had to brainstorm, and the UI, the product we built around it to support this kind of monitoring,

that was a very new thing also; there's no industry precedent as such, no "this is how other people have done it." So there were a lot of these new things we had to maneuver. And also, we were working in a very small team, pretty much a two-to-three-person team, so we were very short on people to try something like this. And the other challenge was how to platformize this kind of stuff, not only prove that this works well, but

how to platformize it in a way that we can benefit a lot of, you know, other parts of the company, and make sure that people can leverage this fast enough and show gains. The speed of execution, the accuracy, the UI experience, the monitoring, and working in a very small team: I think all of these were different challenges we had to really maneuver all throughout to deliver something.

Yeah. And I think, just to jump in for a second, one of the challenges that probably anybody building, you know, sort of an AI or gen AI application like this today is facing is that even if you have deep expertise in ML, very few people have

10,000 hours of experience building these types of applications, right? So there is a lot of sort of net new ground to figure out, and you can't necessarily draw on your 10 or 20 years of engineering experience where you've seen this problem a hundred times before. I think one of the challenges was, maybe not the UI, but the UX: how to make it scalable so that everyone can create their own Genie and have it be specifically tuned for them.

I think there was quite a lot of design thinking on how to do it. And also, I think one of the challenges, not really a technical problem, was expectation management. I think it's like, ChatGPT works well and then everyone has this miracle experience, right? But then you go to another helpdesk channel and you see that it performs very well.

Even though you feel that it performs well because you don't know much of the context, you think that it works very well. But when you run it on your own documentation, it's like, oh no, it doesn't work as well as you expected. And you look: okay, why is that? And typically it's maybe just because the documentation is not up to date. So it was challenging to explain that.

In machine learning, we say garbage in, garbage out. Yeah, I mean, it goes back to the data quality problem. And if your data is bad to begin with, or some portion is bad to begin with, what can you expect? The model can only do so much, right? It's not going to fix your data problem for you.

Yeah. Yeah. And also, to circle back to the question you were talking about, the lack of experience in building this kind of thing: I think there are some parallels that I kind of sense still. Like, yeah, while nobody had experience in this technology and whatnot, I think we, inside the small team we were all part of, were trying to be scrappy and at the same time

speedy in execution, and we had to balance, obviously, security. I think those three angles we tried to balance, and I felt like pretty much most new projects have that kind of thing, where you're wanting to be scrappy, you're wanting to be showing something, but also being mindful, because we are in a bigger company. We're not in a smaller company where you can afford to make mistakes. And this is a public company.

Drawing from our previous experiences, we tried to have these principles in mind. And I think these principles helped guide us while we didn't know the nuances of the technology. But I felt the basics of software engineering were still in place to be our guiding light as we delivered something here. Developers, we've all been there. It's 3 a.m. and your phone blares, jolting you awake. Another alert.

You scramble to troubleshoot, but the complexity of your microservices environment makes it nearly impossible to pinpoint the problem quickly. That's why Chronosphere is on a mission to help you take back control with Differential Diagnosis, a new distributed tracing feature that takes the guesswork out of troubleshooting. With just one click, DDX automatically analyzes all spans and dimensions related to a service,

pinpointing the most likely cause of the issue. Don't let troubleshooting drag you into the early hours of the morning. Just DDX it and resolve issues faster. See why Chronosphere was named a leader in the 2024 Gartner Magic Quadrant for observability platforms at chronosphere.io.

Understanding the details of infrastructure tools matters, and there's no better way to understand that than looking directly at the code. Open source code bases give everyone the ability to inspect, audit, and contribute to the software they use, enhancing trust and transparency. Bitwarden is a trusted open source and end-to-end encrypted security solution that empowers businesses and individuals to securely manage and share information online. Made by developers like you,

Bitwarden offers open-source solutions for virtually every credential management use case, from secrets management to password management and passwordless. Developers can even securely manage their SSH keys with the new Bitwarden SSH agent. Get started on your open source security journey today and start your free trial at bitwarden.com.

Though, with building this type of application, where there is a certain amount of non-determinism involved in sort of the stochastic nature of some of these models themselves, does it require a bit of a mindset shift when you're engineering in that way versus

traditional application development, where it's going to be very deterministic? You can rely on, if the output's not what you expect, being able to trace it back to a bug in the program you put in there. I think that non-deterministic aspect definitely did throw us off. And I think we had to even build stuff into our, you know, experience and UIs and explicitly call out: hey, these answers, make sure you don't take them at their word, make sure you evaluate, right?

And that non-determinism definitely is one of the things that makes this whole product building so challenging. Though I felt like that aspect has also changed as the models have become better, as we have learned how to restrict the prompts, restrict the citations. And we started doing citations, and that also has led to more, let's say, trust in what we are now saying, versus,

you know, just being so open-ended that you just don't know whether what it's saying is true or not. So I think with the evaluation stuff that Eduards led and built, a lot of that is starting to come together now and it's become more deterministic, where we feel like, okay, there is more control on what we are saying now, on what the system is generating, versus what it was.

I think we got to build the muscle when the models were less deterministic. And now, as we see all this progression of new models, we are healthily skeptical, which is, I guess, good. Yeah, I've definitely seen, in the two years or so that I've been building on large language models, a significant improvement in terms of the reliability of their performance and answers. You know, the problems haven't completely gone away, but

it's a lot better than it was, I guess, two years ago. I think they've addressed a lot of these challenges. Yeah, I think it's also a good thing that the LLMs are becoming cheaper, right? And what helps is that, I mean, before, you'd make one LLM call and then you're like, okay, that's enough. Now you can make, okay, validation calls, like two or three times, to validate the answer. So you can make it artificially more deterministic,

because, first of all, it becomes better and then it becomes cheaper. Yeah, that helps a lot with applying some of these basic patterns around reflection, and going through a series of iterations of refinement and so forth, so that you can actually get a much better response.
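A minimal sketch of that validate-then-retry pattern, cheap extra LLM calls used to check an answer before it is posted, is below; the checker prompt, retry count, and model are assumptions rather than Genie's actual logic.

```python
# Hypothetical answer-validation loop; checker prompt and retry count are assumptions.
from typing import Callable

from openai import OpenAI

client = OpenAI()


def answer_with_validation(question: str,
                           draft_answer: Callable[[str], str],
                           max_attempts: int = 3) -> str:
    """draft_answer is the RAG answer function; re-draft until a checker call approves."""
    answer = draft_answer(question)
    for _ in range(max_attempts - 1):
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": ("Is this answer grounded in its cited sources, on-topic, and safe "
                            "to post? Reply OK or REVISE.\n"
                            f"Question: {question}\nAnswer: {answer}"),
            }],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("OK"):
            break
        answer = draft_answer(question)  # try again; a real system might pass the critique back in
    return answer
```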

And the validation that you mentioned, especially if you're expecting a certain type of output. And then, of course, all the things that are happening with agents, where you can bring in tools to help evaluate or request data as needed and so forth. You mentioned productivity gains earlier and how it was important to be able to demonstrate

that this project is worth the time investment, worth, presumably, the compute resources that you're putting into this, the token costs and stuff like that. So what were sort of the impact and productivity gains that you saw? Yeah, I think, as we were publishing the blog, we have been able to roll the bot out to more than 150-plus channels, right? And it's answered like 70,000-plus questions.

We've seen around a 48% helpfulness rate, which is a mix of, you know, the questions that the bot auto-resolved and where the bot actually helped prompt the user in the right direction. I think from that perspective, when we did the math, we estimated roughly around 13k engineering hours saved so far across the company. And that's, as I was saying, where we had to do some creativity to even figure out how to measure this kind of thing.
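As a back-of-envelope check on those numbers: the minutes-saved-per-helpful-answer figure below is an illustrative assumption (the exact methodology isn't spelled out in the episode), but it shows how the emoji feedback can be converted into an hours-saved estimate in the right ballpark.

```python
# Illustrative savings arithmetic; the per-answer minutes figure is an assumption.
questions_answered = 70_000
helpfulness_rate = 0.48                 # share of answers marked helpful or auto-resolved
minutes_saved_per_helpful_answer = 23   # assumed average on-call time avoided per helpful answer

hours_saved = questions_answered * helpfulness_rate * minutes_saved_per_helpful_answer / 60
print(round(hours_saved))  # ~12,880 hours, close to the ~13k figure quoted
```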

I think the first part is what Eduards was previously mentioning: we have these emojis that people react with. So that is something we did. Some other things we did were also to gain more data. With some of the partners that we were working with, we wanted to have a higher rate of feedback. Like Google search, not many people leave feedback on accuracy, because users just want answers. They don't like to leave feedback. And that's something

we have seen. I mean, I personally observed in my own experience that, while working with any customer support, like an airline or anything, you know, I just never want to leave feedback. It's a waste of my time. That's what I always feel. Unless it's negative feedback. Yeah, unless it's negative feedback. That's what we found. In some channels, with some partners, we enforce the feedback, because that gave us more confidence that,

you know, the bot is really even performing well, right? So that was one of the features we initially built. And other experiences we also built: with some teams we tried to experiment and say the bot is the first level of resolution always,

and the on-calls only come in when the customers say, I want to escalate to on-call. So that was another experience we built, to validate and see how useful the bot is. So this mix of different experiences, plus us doing some creative math to incorporate these feedback emojis and convert them into engineering hours, and determine how much we are actually saving the company. That's how we came up with some of this math, you know, the response rate evaluation. The 13,000 hours,

over what time frame is that? Since the inception of the bot, roughly, I would say a year plus. Okay, so that's a pretty substantial amount of time saved. I mean, the adoption is kind of, it's not like everyone is onboarded, right? Meaning that you have to come and onboard yourself. So I think most of the heavy lifting was also in the latest months. What's next for this project? Are you continuing to invest in this? What are you looking to do with it?

Yeah, definitely. I think what we have seen is that the expectations have completely shifted. The answers we were giving six months back, and what was acceptable, have completely shifted; users are expecting much more. So what was a helpful answer six months back seems like not a helpful answer anymore. So we are definitely very much thinking about taking this and making it a V2 version, where

we can have a very high level of accuracy and work with substantial partners. That way we can bring this to the next level of expectations that people have from the bot. So that's an ongoing investment for sure. As we start to wrap up, is there anything else you'd like to share? Overall, for anybody building GenAI apps out there, this landscape is definitely extremely fast evolving. It changes literally in days, not weeks, not months.

So I think the pace at which technology is changing here is way faster than any other technology that I've ever worked with in my career so far. People have to just be open to the fact that whatever we build might be thrown away in a week or two.

And just being open about that makes us not feel frustrated, because I think there were times when we were feeling, pretty much, hey, what have we built? Do we have to throw everything away? I mean, that was the question that we would get a lot. So I think just being open about the pace of the change and being open to experimentation is a healthy mindset for the GenAI world, at least I feel.

Yeah, I would think you would have to, you know, ideally factor that a little bit into your design as well, so that you have flexibility in sort of the architecture of the design to swap models in and out as those things improve, and other components, essentially, where you might be able to squeeze out a little extra performance by going through an additional cycle of inference or something like that.

Yeah, I think I also wanted to encourage people to build these GenAI apps. I mean, first of all, it's kind of fun. And second of all, sometimes it feels frustrating, because you build something and then something new develops, and then, okay, we feel like it's a waste of time. But at the same time, I mean, you build something and then you can make it adapt to your specific

case. And then I think it was also kind of like that with Genie: okay, I mean, we built the chatbot, but then, at the same time, Glean was coming up with something similar.

But because we built something on our own, we can put in agentic stuff. We can say, okay, if someone posts a log, we can go and check the log, and that's something other solutions cannot do and will not be able to do, at least in the very near future. So yeah, I just wanted to encourage people to experiment. And to add to that, I think what Eduards was mentioning is very spot on: build unique features, because I think that's what creates value in the long run.

I think while we had other competitive solutions also being built outside by third-party vendors and whatnot, we focused on trying to be unique with the experience of the UI or the tools that we allow users to integrate. And I think that probably proved us right in the long run, that we're able to customize a lot more things because it's in-house:

the experiences can be tuned and changed much faster. So overall, being unique also helps you stand out in the long run. Yeah, I think even if you were building a company around some sort of AI application today, going deeper might be better than going really wide in general, because a lot of the hyperscaler companies are probably going to address the wide.

But you can out-compete them if you go really deep on a particular thing. You can create the best possible, I don't know, medical device-related AI experience or something like that. And that's probably not going to be something that Amazon's going to put a ton of resources into, or OpenAI or something like that, versus sort of the generality of what they're trying to solve.

Yeah. Awesome. Well, Paarth and Eduards, thank you so much for being here. This was great. Thank you. Thank you, Sean, for having us. It was really, really nice to have this podcast here. Cheers.

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.