Hello, and welcome to the data engineering podcast, the show about modern data management. If you lead a data team, you know this pain. Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one off tools instead of doing actual data work.
Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like build me a self-service reporting tool that lets teams query customer metrics from Databricks, and they get a production ready app with the permissions and governance built in. They can self serve, and you get your time back. It's data democratization without the chaos.
Check out Retool at dataengineeringpodcast.com slash Retool today, that's r e t o o l, and see how other data teams are scaling self-service. Because let's be honest, we all need to retool how we handle data requests.
Your host is Tobias Macey, and today I'm interviewing Raj Shukla about building self-improving AI systems and how they enable AI scalability in real production environments. Raj, can you start by introducing yourself?
Hi, Tobias. Very nice to be here. Thanks for having me. My name is Raj Shukla. I am the CTO at Symphony AI. Symphony AI is a vertical AI company, which means we take one domain or one vertical at a time, and we go very deep with AI agents, AI models to achieve end to end business process automations, and just generally helping build the autonomous enterprise.
That's what we sell to our customers. So it'll be agents as a service, models as a service, and some applications that bring it all together. I've been at Symphony AI for the last three and a half years. It's been a great journey working with real AI in real industries: factory floors, grocery stores, and financial institutions fighting crime. So that's my journey here. Before this, I was at Microsoft, where I had roles across the company, mostly in applied AI, machine learning, and engineering leadership positions.
And do you remember how you first got started working in the ML and AI space?
Yes, I do. I was a computer science major, and right about when I was finishing my bachelor's and coming into my master's, machine learning systems were becoming popular. So I got introduced to machine learning with its mathematical background in optimization problems and all related areas. It started with a bit of theory around machine learning, and then I got into applied machine learning. As a part of my master's, I was doing anomaly detection systems in very large scale network systems, like telecom networks: how you detect anomalies, and how you can build systems that predict and react fast to anomalies and so on. So as a part of my growing up in computer science itself, I was trained a bit in machine learning. When I came into the industry, my first real jobs were in these areas of search and advertising. My first job was in click prediction models that allowed advertisers to figure out where to place their ads on pages, etcetera.
And once I got into Microsoft, my first work was on search ranking problems, which were some of the early at-scale machine learning models applied in the industry. So I've always been in this space and always enjoyed it. There is an element of magic to how it all works. So that's what keeps me going.
And now digging into this concept of self-improving AI, obviously the system is a very necessary element of it, because the AI models themselves are largely static until you retrain them or post-train them. I'm wondering if you can start by giving an outline of what constitutes a self-improving AI system, and what signals you look to for whether the system is capable of that type of evolution?
Yeah, absolutely. This is very critical once you apply the modern foundation models, and agentic systems based on these foundation models, out in the industry. So you have the concept of the environment. An environment is a real-world system where an agent or one of your models is operating and trying to do a task, and it's operating under the influence of certain external variables. So there could be triggers. For example, people are buying things off of grocery shelves and the shelves are going empty.
There is a monitoring system sitting there which is observing it, and that can create a trigger. Once the system takes actions on that, there is a reaction to it from the environment, right? The reaction could be on the positive side, that the action was good, or on the negative side, that the action wasn't good. For example, in financial crime fighting, the models and agents can detect whether something is fraudulent or looks like a money laundering activity.
At the end of it, there is a human who is investigating it. The model makes a recommendation, and the human could say, no, that is not right, because of something that they know or found out, or that it's largely right. So there is a definition of an environment, which creates the conditions, all the input variables, under which the agent is operating. Once the agent takes an action, there is some feedback coming on whether that is right or not. And the idea is that, whether it is with a self-learning model inside the agent or through intelligent memory updates or other techniques, once you get the feedback, something should change, right? The system should say, okay, I know I made a mistake and I got feedback for it, and I'm updating something, whether it's a memory database or its core LLM configuration or the model itself that I'm trying to improve. Next time, this is not going to happen, probabilistically, of course. That's what we try to achieve. There are theoretical ways of treating the environment as a reinforcement learning, or RL, environment, and there are practical aspects of treating it as a self-learning system with triggers and hooks and so on. So the field is very exciting, and I would not say it's primed for disruption, but it's primed to go big this year and next.
And in that element of interacting with the environment of the system that the model is operating within, what are some of the components that comprise that environment, whether that's digital infrastructure or physical manipulation? Particularly if you're starting to think towards things like robotics, I'm wondering if you can talk through some of the pieces that are necessary in order for an AI system to have some of those reinforcement and learning loops.
Yeah, that's a good question. I think the core aspect of the environment, first, is how you are digitizing the information that is coming in, the core data. Sometimes that's obvious, sometimes that's hard, right? When you're working with physical environments like the factory floors or grocery stores we operate in, you have to have some vision-based, image-based sensors out there feeding in an input, and that is going to be digitized. That forms your core data layer on which these triggers are created, etcetera. On the flip side, on the action side, sometimes the actions are notifications to humans to change something, for example, to go restock the items on the factory floor or the grocery floor. And sometimes it's about ordering your next batch of goods to come in based on what the agent sees. So on both sides, fundamentally you have to look at what is the digital form of the information and then what is the physical translation of it. It can be as complex as edge computing or edge ML models, and it can be as simple as an API connection that triggers some action on the action side.
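A minimal sketch of the trigger, action, and feedback loop described in this part of the conversation, using the grocery shelf example. Everything here is an assumption made for illustration: the `ShelfAgent` and `Feedback` names, the restock threshold, and the update rule are hypothetical and do not describe Symphony AI's implementation. The only point is that negative feedback changes some piece of state, so the same input is handled differently next time.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Reaction from the environment (or a human) to one agent action."""
    trigger: str     # what the monitoring layer observed
    action: str      # what the agent did in response
    positive: bool   # whether the action turned out to be right


@dataclass
class ShelfAgent:
    """Hypothetical agent whose only learnable state is a restock threshold."""
    restock_threshold: int = 5
    history: list = field(default_factory=list)

    def act(self, units_on_shelf: int) -> str:
        # Digitized input (a unit count from a sensor) becomes an action.
        if units_on_shelf <= self.restock_threshold:
            return "notify_restock"
        return "no_action"

    def learn(self, fb: Feedback) -> None:
        # Keep the feedback, then change something because of it.
        self.history.append(fb)
        if fb.action == "no_action" and not fb.positive:
            # The shelf ran empty without a notification: react earlier next time.
            self.restock_threshold += 1


agent = ShelfAgent()
first_action = agent.act(units_on_shelf=6)
agent.learn(Feedback(trigger="shelf_low", action=first_action, positive=False))
second_action = agent.act(units_on_shelf=6)
```

In a real deployment the updated state could be a memory store, a prompt configuration, or model weights; the integer threshold only stands in for that.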
One of the pieces that comes to mind as a way of guiding a particular model and improving its capabilities is what's broadly being termed context engineering: making sure that the model has access to particular information at a particular point in time. Another component is this idea of agentic memory, where the model, as part of its execution, decides that there's a particular piece of information that is relevant for future use, and so it will decide to push that into some form of context store, whether that's a dedicated memory layer or just a text file somewhere. And I'm wondering if you can give some differentiation between the ways that you think about this overall concept of self-improving systems versus the agentic system terminology that's also broadly used, and how you can decide which one you have.
Yeah, absolutely. I think there have been many evolutions of it. If I were to start from the very early evolution that existed even two years ago in the early agentic systems, there is this area of in-context learning, which is basically saying that if an agent at a particular task or step is making a decision, it is guided by a prompt. Now, what is the prompt guided by? The prompt is usually guided by some examples, which is a few-shot kind of exercise.
Even two years ago, people knew that you could make those few shots dynamic to the problem. Depending on what type of input you're getting, you could select a different set of examples to make the result a little better, and that itself gave huge improvements in what the models could accomplish. In this case, the learning loop is quite simple. The agent, or a step of the agent, makes a decision. You get feedback in terms of examples of where it got things wrong or right. And in the context of the next input coming in, you choose the right examples. Right? That's the very simplistic extreme of it.
On the other side is to set up a true RL environment and have a true language model be trained with the right kind of techniques through reinforcement learning. In some cases we have verifiable feedback, so you can do these RLVR kind of loops. In some cases, you do GRPO kind of policies where you are creating reward systems out of the feedback you're getting. But you are truly learning in the sense of learning and not just updating artifacts. So that's the true form of learning, and then this always-learning model is what's guiding the decision step that the agent is operating on.
I think what is getting very interesting lately with agents like Claude and agentic systems like OpenClaw is that there is a middle ground somewhere. The feedback loop is coming in, but it's not going directly into an RL-learned system, and it's not going in as raw prompt. It's actually being updated as memory, an intelligent memory that gets pulled in at the right context at the right time. And the update of that memory is also not just a pure append of the feedback. It's an intelligent append in the sense that there is an LLM call that says, you can update the taste and preferences of this user based on this interaction. There are episodic memories, there are stochastic or time-based kinds of memories, and so on. And you can keep all these memories in rather simple file-system-based architectures. The agent harness is intelligent enough to pick the right memory and update the memory at the right time as a background process. So it's really an engineering feat that we are accomplishing more than a science feat. I think it's definitely turning out to be better than prompt-based in-context learning, and on the other side, the RL-based true learning models are harder to implement. So from a practicality standpoint, it feels like the right middle ground, which is gaining traction, and we are adopting it in our products as well.
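The simplest end of that spectrum, dynamic few-shot selection, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the example pool is made up, and word-overlap (Jaccard) similarity is used only as a dependency-free stand-in for whatever retrieval method a real system would use. Feedback enters the loop by appending corrected cases to the pool.

```python
def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization; a stand-in for a real embedding."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query: str, pool: list, k: int = 2) -> list:
    """Pick the k pool examples most similar to the incoming input."""
    q = tokenize(query)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokenize(ex["input"])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pool: list, k: int = 2) -> str:
    """Assemble a few-shot prompt whose examples depend on the input."""
    shots = [f"Input: {ex['input']}\nLabel: {ex['label']}" for ex in select_examples(query, pool, k)]
    shots.append(f"Input: {query}\nLabel:")
    return "\n\n".join(shots)


# Hypothetical pool of labeled cases.
pool = [
    {"input": "large cash deposit split across many accounts", "label": "suspicious"},
    {"input": "monthly salary payment from employer", "label": "normal"},
    {"input": "wire transfer to new overseas account after cash deposit", "label": "suspicious"},
]
# Feedback from a reviewer becomes one more example in the pool.
pool.append({"input": "recurring rent payment to landlord", "label": "normal"})

query = "cash deposit split into small amounts"
chosen = select_examples(query, pool)
prompt = build_prompt(query, pool)
```

The model itself never changes here; only the examples placed in front of it do, which is why this sits at the lightweight extreme compared with RL-based training.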
And another piece of context engineering, beyond the specific memory system where the agent is involved in the creation, retrieval, and curation of those pieces of information, is the idea of the various tool calls or MCP servers or even agent-to-agent interactions that are involved. I'm wondering if you can give your sense of whether and how you differentiate between those tool use capabilities, and the evolution of information available from those tools, versus this idea of the memory system or reinforcement learning or model fine-tuning.
Yeah. I think the tool use area overall has also seen a very rapid evolution in the last, I would say, one year or so. Right? At Symphony AI, we operate in very regulated industries like financial crime, so our need for our agents to get things right is very high, in the high 90s, 99 percent, etc. So we tended to not leave things to LLMs. We had made hundreds of tools, which are very specific deterministic tools, and we used to rely on our agents to do the right tool calling. Within the tools, you would do deterministic calculations, so that we don't leave a chance for LLMs to hallucinate while doing those sorts of calculations, right? So it was a very complex system, but it used to get things right. And then models kept improving at tool usage in general, but particularly they kept improving a lot at certain kinds of tools. Search as a tool became very popular. Code execution and code writing, obviously, the models became really good at. So in today's day and age, you can actually assume that the code writing aspect of the model will get 95% of those tools right, and so you don't have to pre-write those tools. You can explore that space at runtime and cache those tools later on if you have to.
But I think for me, the mind-blowing moment was when these Unix tools and file system tools started getting really good. I think a lot of credit goes to the Claude team for exploring that path. Just using the file system or Unix-based tools as base tools, which can come together to produce a lot of derived tools, really simplifies the stack, right? And I think by the end of last year, we had thrown away maybe 80% of our tools and gotten the same kinds of results by just relying on core basic tools. Now, in certain cases we cannot do that, because you are effectively paying in latency and execution speed for test-time compute, long-running tool usage, and thinking models. So in certain domains we cannot, and we've kept the older architectures. But wherever we can trade off time to let the model do its kind of magic with the base tools that it brings, plus the business context and the right knowledge context that we bring from our verticals, that has really simplified our architecture.
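The older pattern mentioned here, where the agent only chooses a tool and the arithmetic happens in deterministic code, can be sketched like this. The registry, decorator, and tool names are hypothetical and chosen for illustration; the snippet assumes the model's output has already been parsed into a tool name and keyword arguments.

```python
from typing import Callable

# Hypothetical registry mapping tool names to deterministic functions.
TOOLS: dict[str, Callable[..., float]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("total_amount")
def total_amount(amounts: list) -> float:
    """Sum of transaction amounts, computed in code rather than by the model."""
    return float(sum(amounts))


@tool("share_above")
def share_above(amounts: list, threshold: float) -> float:
    """Fraction of transactions strictly above a threshold."""
    if not amounts:
        return 0.0
    return sum(1 for a in amounts if a > threshold) / len(amounts)


def call_tool(name: str, **kwargs) -> float:
    """Dispatch a tool call chosen by the agent; unknown names are rejected."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


result = call_tool("share_above", amounts=[100, 9500, 9800, 50], threshold=9000)
```

The trade-off described in the conversation is that a registry like this has to be written and maintained by hand, whereas letting the model compose base tools at runtime removes most of that work at the cost of latency.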
And as you're talking about this idea of creating the tools dynamically at runtime, obviously, as you pointed out, for people who are using these models for software engineering use cases, the model is doing that all the time, and it's encouraged to do so. I think that also opens up the overall question of what it means for an AI system to improve: improve along what axes, and for what use cases?
Yeah, absolutely. I think if you look at how the leading foundation model companies are thinking about it now, they don't think about AGI as a pure LLM concept. Right? They are saying, oh, you know what? We realized that in the middle something interesting happened. Models are improving, but models got really good at code writing. And that gave an ability for other kinds of inner loops where, when the model cannot figure something out, it can write code and it can create environments where it tests the code. So it can create systems not entirely as a model output, but as inner-loop sub-agents where it can keep improving. And that is seen as a much clearer path towards AGI than just the model itself improving overall.
I think we take a similar approach in our vertical AI systems and agents. We've got to make sure that we build these sub-agentic loops, which are vertical specific for us, and we have to be the best at it: the models write code, we give them an environment where they get the feedback, and many times, based on the feedback, they rewrite the code and update these systems to be more effective. A very simple example is deep research agents. In enterprises, you can think of deep research agents working on enterprise APIs as the main knowledge context, rather than on search as an API. And we realized internally that if you're working with internal enterprise APIs and trying to build a deep research agent on them, you could let the model write code in the middle. As you give it feedback on whether it was able to find the right kind of information or not, it will write more code, so that in the future, when it does one API call and gets an output, it does not pass that output straight to the next LLM call. It has actually written code that does analytics on that output and passes only a good summary of it to the next tool. So in simple terms, our own sub-agents evolved from RAG-based systems to highly evolved agentic search, and now into these agentic search plus code-writing-in-the-middle kind of sub-agents. And that's really been groundbreaking in some sense in helping us achieve the level of autonomy we were expecting from these subsystems.
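To make the "code in the middle" idea concrete, here is a small sketch of the kind of function a model might write to reduce a raw API response before the next LLM call sees it. The record fields (`amount`, `country`) and the function name are assumptions for illustration, not an actual enterprise API.

```python
from collections import Counter
from statistics import mean


def summarize_records(records: list, top_n: int = 2) -> dict:
    """Reduce a raw API response to a few aggregates for the next step."""
    if not records:
        return {"count": 0}
    amounts = [r["amount"] for r in records]
    by_country = Counter(r["country"] for r in records)
    return {
        "count": len(records),
        "mean_amount": round(mean(amounts), 2),
        "max_amount": max(amounts),
        "top_countries": by_country.most_common(top_n),
    }


# Hypothetical raw output of one internal API call.
raw = [
    {"id": 1, "amount": 120.0, "country": "US"},
    {"id": 2, "amount": 9800.0, "country": "KY"},
    {"id": 3, "amount": 75.5, "country": "US"},
]
summary = summarize_records(raw)
```

Only `summary` would be placed in the next prompt, which keeps the context small and moves the arithmetic out of the model.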
One of the other challenges that comes up when you have this dynamic, probabilistic system that is capable of creating its own tools and exploring particular areas of focus is the potential for misalignment with the stated goals of that system. That brings a lot of questions around security, identity management, access controls, and guardrails. I'm wondering how you're seeing some of those capabilities evolve in the ecosystem, in terms of how people are starting to think about constraining and guiding these systems to ensure that they don't become misaligned, or that if they do, they're quickly redirected to the stated purpose?
Yeah. I think that is probably the biggest difference between a consumer application of agents and an enterprise application, so we pay a lot of attention to it. We build a lot of agent lifecycle management layers in our platforms to deal with it. To give you an example from financial services, in financial crime fighting, before an agent can make any decision, there is an agentic process first, which is on policy alignment. The first thing the agent has to show is that it has gone through the policy and standard operating procedures, and it proves that it got them right. It actually goes one step further and highlights the policy gaps it sees, where a human should come in and clarify what to do in what scenario.
In a practical application, when we go to our customers, we'd say, let's make sure we get this right. The agent creates this big to-do list, and it maps how it created that to-do list to the snippets in the bank's standard operating procedure. Every bank and every big financial institution has governance committees and model governance teams, and just that mapping is actually far more transparent than what predictive ML models used to do. These agents are actually telling them, okay, I took this document, I created this to-do list, I think when I execute this tool I will write this kind of code, and I'm doing this because this mapping exists. They are actually very receptive to it because they never got this level of transparency and explainability from ML models before. So that's a conscious step we take when something has to be policy guided.
That policy alignment itself is the first goal of an agent or a sub-agent. Then comes the process of executing on it, and while it's executing, can it go off trail or whatnot? Yes, so you just have to keep the right evals and the right guardrails around it.
Many of these guardrails apply while the agent is running, but many are also at an aggregate level, at an agent monitoring step. While these agents are running in production, you are seeing how they are performing and why they are making certain decisions. We have metrics in place which we are observing as these agents run. And I'll go one step further. We are prepared for these agentic systems to not go live right away. The way we prep for it is we say, let these agents run in the background and see how they are performing, right? Then over time, with these data-driven, metrics-driven approaches, you build confidence that, look, over the last three months the agent is performing as well as your human investigator, in the case of financial crime, for example. And there also we take extra steps, because there are three levels of investigators: level one, level two, level three. We'll first say, look, it is doing as well as a level one investigator, and that gets adoption. Then you prove it is doing as well as a level two investigator. It's kind of like software engineering, where you would say these models are doing as well as a junior software engineer, then a senior software engineer, and then a staff engineer or architect or whatnot. But these stages are very clearly defined in the verticals and industries we operate in. We are taking gradual steps, keeping this transparency and strict guardrails in mind, proving it out while running in the background in production systems, not going live, and then going live stage by stage. I think it takes all of that to actually develop confidence in a system like this. We are also learning along the way as we do that. But I'm proud to say we feel like we are the furthest along among the companies we see in these domains in how we are operating on this.
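The background, or shadow, evaluation described here can be sketched as a comparison between what the agent would have decided and what the human investigator actually decided. The function names, the decision labels, and the thresholds (`min_cases`, `min_agreement`) are hypothetical values picked for illustration; the conversation does not state the actual metrics or cut-offs used.

```python
def agreement_rate(pairs: list) -> float:
    """Share of cases where the shadow agent matched the human decision."""
    if not pairs:
        return 0.0
    return sum(1 for agent, human in pairs if agent == human) / len(pairs)


def ready_for_promotion(pairs: list, min_cases: int = 4, min_agreement: float = 0.75) -> bool:
    """Gate a stage-wise go-live on both sample size and agreement."""
    return len(pairs) >= min_cases and agreement_rate(pairs) >= min_agreement


# Hypothetical shadow log of (agent_decision, human_decision) pairs
# collected while the agent runs in the background without acting.
shadow_log = [
    ("escalate", "escalate"),
    ("close", "close"),
    ("close", "escalate"),
    ("escalate", "escalate"),
]
rate = agreement_rate(shadow_log)
promote = ready_for_promotion(shadow_log)
```

Running the same gate separately against level one, level two, and level three investigator decisions would give one possible way to express the staged adoption described above.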
The other element of security comes in when you're talking about the agent's ability to dynamically create tools, and therefore, it's executing unreviewed code in potentially a production environment, which necessitates a certain level of sandboxing or constraints on what code or what features or functions can be executed. And I'm curious how you're seeing people manage that aspect of that dynamic runtime compute.
Yeah, 100%. We actually provide code execution sandboxes as a part of our platform. They can be configured to run Python or TypeScript, but more importantly, you configure how much of the network the sandbox can access, which APIs it has access to, and so on. Sandboxing and securing what these agents are doing is a very critical aspect, and we don't rely on any third party for that. We ship it as a part of our platforms. What is interesting is that file systems are getting an interesting twist here, because the agents are operating more and more on files in the file system, and between sandbox-local file systems and some more persistent cloud storage, there has to be a lot of systems engineering done to keep those two things in sync as the agents operate on them. So more than the LLM magic, there is a lot of rigorous systems engineering that has to be done to make sure that happens. But, yeah, sandboxes come by default as a part of all our deployments. And the choice of Python versus TypeScript, or whatever they are writing, is based on evals in the use cases that we are doing.
The final thing I would say is that we try to make sure the agents operate as real humans would. Every agent gets its own auth, and every agent has to operate as an identity which can be reviewed and verified in the customer environment that we deploy it in. It follows the right authorization as well. So whether it's the MCP servers it accesses or the API calls or data that it accesses, all of that is governed by the same RBAC and other access control policies that the customer has. That's actually the hardest part to get right once you're going live. I think POCs and proof points are easy. But when you're going live, these governance concepts matter. For us, every agentic system is a project with a set of agents operating in it, a master agent of sorts, and a set of resources that are governed in the project. The project comes with its own RBAC and access control policies and so on. And we have a step of onboarding an agent, which goes through getting it the right authentication and the right ID in the company that it operates in. So all of that has to come into play to get the right kind of secure setting in an enterprise for it to work efficiently.
Another aspect of this idea of self-improving systems is that the models themselves are no longer the differentiator. Before generative AI became so widely adopted, there were a number of businesses that were investing a lot of resources into building custom models that were specific to the problem domain they were focused on. Deep learning maybe expanded the bounds of what those models could do, but they were still purpose built for a particular use case.
Now everyone broadly has access to the same set of models. So in order for it to be something that is useful and a differentiating capability of that business, it needs to have all of these other system-level capabilities around it. I'm wondering how you're seeing organizations think about that level of competition, and about the purpose of these machine learning and AI models in the broader context of their organization, beyond just being an API call to a foundation model provider?
Yeah, that's a great question. We see it when we talk to our real customers. There is an unknown in how much of their data and IP is leaking into the models, and there is an unknown in what they are creating as dependencies on these models. I think we are very clear on our stance. We know our industries, we know our domains, and we bring that domain knowledge, or the domain knowledge graph, into every problem that we work on. We know that there is a lot of context that is customer specific, and we provide the right semantics around how to bring that customer-specific context in. And as the agent works on it, there is the context it uses and the context it updates, as we discussed earlier in terms of memories. We are very clear with our customers on the IP that lies in there, and much of the IP sits with them. So it is very interesting now that, if we can create these self-learning systems truly in production, where the model remains the same but the updating memory sitting in file systems and markdown files and so on is the real magic, that's a very attractive proposition to most enterprises, right? It feels like, with that, they are creating that layer of problem context and then process context. We also use a term called action context. And in some sense, that is not captured by any CRMs and ERPs. As these systems go live, these folders with markdown files that get created by the agent start becoming the sources of your processes and your actions and your escalations and all of that. We are very early in this journey right now, but it's playing out, is what I can say. If we are successful with this, I think every enterprise can own their own knowledge layer, per se, that keeps updating. We as vendors are experts in our industry knowledge, and we bring that context. And we help every enterprise have their own sovereign knowledge and process context. That'll be a great place to land. But I think we are very early. I think it'll take a whole year, maybe 2026, for that to evolve and to see whether it is truly successful.
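A minimal sketch of memory kept as markdown files in a folder, which is the storage shape described in this answer. The function names, the one-file-per-topic layout, and the bullet format are assumptions. In the system described, an LLM call decides what to write and when; here a plain duplicate check stands in for that step so the example stays self-contained.

```python
import tempfile
from pathlib import Path


def remember(memory_dir: Path, topic: str, note: str) -> Path:
    """Add a note to the topic's markdown file unless it is already there."""
    path = memory_dir / f"{topic}.md"
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
    else:
        lines = [f"# {topic}"]
    entry = f"- {note}"
    if entry not in lines:  # stand-in for an LLM-driven "intelligent append"
        lines.append(entry)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def recall(memory_dir: Path, topic: str) -> list:
    """Return the notes stored for a topic, without the bullet markers."""
    path = memory_dir / f"{topic}.md"
    if not path.exists():
        return []
    text = path.read_text(encoding="utf-8")
    return [line[2:] for line in text.splitlines() if line.startswith("- ")]


# Hypothetical usage with a temporary folder standing in for the memory store.
memory_dir = Path(tempfile.mkdtemp())
note = "Route wire-transfer alerts above the review limit to level two"
remember(memory_dir, "escalations", note)
remember(memory_dir, "escalations", note)  # repeated feedback is not duplicated
notes = recall(memory_dir, "escalations")
```

Because the store is just files, it can live in the customer's own storage, which matches the point made above about the knowledge layer sitting with the enterprise.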
And another piece of these self-improving systems is that, beyond all of the context and knowledge capture that these agents are capable of, there is also the underlying evolution of the models by the foundation providers, as each generation adds new capabilities or focuses on particular use cases. But that also brings with it a certain level of platform risk, as a particular model that you build a lot of your operational capacity around gets deprecated in favor of a different model that maybe behaves slightly differently. I'm wondering how you think about that aspect of owning your own destiny, and how enterprises are thinking about their level of reliance on a particular model governed via API access versus self-hosting one of these open-weights models, or even building some of their own large language models now that that has become, I'm gonna say, commoditized. There is still a lot of knowledge and capability, and in particular hardware, required, but it's at least a known quantity that people can do if they so choose.
Yeah, that's a great question. I think people don't realize practically what havoc it creates to move from one version to another, like Claude moving from 3.5 to 3.7 and then 4.5, and then 3.7 or 3.5 being deprecated. The reality is that enterprise systems need a lot of reliability, right? When taking something to production, there are these strict checks. We have an evaluation in our platform where the first thing it checks is, if I run this 100 times, how many times does it get the same result? And every time we've updated the model version, that reliability metric has broken in practice. So we've never been able to do model upgrades just by switching from one API to the other. There are always some prompt changes, and there is always some investigation and updates we have to do. So I think your point is very right. At the end of the day, the foundation model companies have a very wide set of benchmarks that they are operating against. They can see that overall the model is improving, but on some local benchmarks, it does go down from one model to the next. There are some very popular examples of this from when GPT-5 launched and 5.2 came along and all that, where on some of the coding benchmarks it was actually getting worse. So I think that's a real problem. Right now, what's happening is that vendors like us take the hit of that for our customers. We at least build systems where we catch this first and don't let it go live, and then we iterate and help our customers on it. But the whole thing is a little brittle.
The challenge is how do you make it, like how do you improve it and not let it be brittle? So of course, one way is to say, you know, have your small language model, small reasoning model that is very good at that task, host it yourself and, you know, let it live out. The, of course, it comes with more investment and more, it's more capital intensive. You have to run your own GPUs, etcetera.
And, but it does come with more reliability and more bet on the future. So far, I haven't played that one, like that thing play out primarily because even though benchmarks are getting saturated, models are getting commoditized. Even now, every new big release that comes out in models does improve overall performance in many things a lot, right? So having your own small language model is
seen as a bit of a maintenance project by enterprises and they are a little afraid to do it. I see it as an opportunity for vendors like us, like Symphony AI, to make it easy for them, where they don't have to worry about it. And this is what I was saying: if we go from models to the learning capabilities sitting in agentic systems as memory, memory upgrades, and intelligent invocation
of the right memory updates to the right memory, if that agentic harness can play a lot of that magic, then everything will get simplified. And so I think that is a more practical approach in my mind. Of course, if RL systems become very easy for every enterprise to operate with, I think, you know, what you said can play out. But right now, it feels a little far out.
Regarding the ways in which you're seeing organizations actually invest in these self improving AI systems, what are some of the common patterns that you're seeing play out where it sounds like reinforcement learning and fine tuning are maybe further out or not as widely adopted, but just wondering how you're seeing people think through the levels of investment and sophistication that they're willing to operationalize.
Yeah. I think one very clear pattern is that everyone's realizing that to form any self learning system, the environment setup has to be right. And it takes a lot to set up the right environment. You have to start all the way from the input data ingestion,
all the way to the action layer, and then getting the human feedback right. And I think I see a lot of enterprises putting effort into at least getting that right, like getting the data right for this environment to be ready. And I see a lot of enterprises asking questions, rightly so, like, you know, everybody has agents. How is your agent better than the other? How is it learning? And when we go and we say,
you know, we show them how it learns, but we also ask in return: how are we going to get this feedback back? Right? Is this feedback getting captured somewhere? We would like to implement the right hooks in your systems for that feedback to flow in. And I think customers are receptive to that. They realize the value of digitizing that whole process, from input data capture and trigger capture, to actions being digitized, to feedback loops being very clear digital entities, if you will, as
sort of a task action being taken and a human feedback being written, etc. I see enterprises even being ready to put a human judge in the loop, or at least an LLM judge in the loop. And that is something they want to own as a property, to convert
whether the task was right or not into an input format that the LLMs can use for feedback. So I think readying the enterprise for this environment is one pattern I see. The second pattern, which is where we play, is building the right implementation
of the context layer, building the right implementation of the memory layer, and the fact that it sits in your file systems or as knowledge graphs or as something else. And I think that is becoming a differentiator and
that's where the right architecture questions are getting asked when we go to sell inside enterprises and so on. So overall, I think the enterprises are getting ready for agents operating, a lot of agents operating. They know that they have to get their data and APIs and just the whole feedback loop ready. And practically speaking, I feel like that's where the efforts are right now.
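The feedback loop described above, where a trigger, the action taken, and a judge's verdict become explicit digital records that can then be converted into something an LLM can consume, could look roughly like this. It is a minimal sketch under stated assumptions: the `FeedbackEvent` fields and the two helper functions are hypothetical names chosen for illustration, not an actual schema from the conversation.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Literal


@dataclass(frozen=True)
class FeedbackEvent:
    """One captured loop: what triggered the task, what the agent did,
    and what a human or LLM judge said about it. Field names are assumed."""
    task_id: str
    trigger: str
    action_taken: str
    judge: Literal["human", "llm"]
    correct: bool
    comment: str = ""


def to_feedback_line(event: FeedbackEvent) -> str:
    """Serialize an event as one JSON line for storage or later analysis."""
    return json.dumps(asdict(event), sort_keys=True)


def summarize_for_prompt(events: List[FeedbackEvent]) -> str:
    """Turn judged events into plain text that can be placed in an
    agent's context or memory as feedback."""
    lines = []
    for e in events:
        verdict = "correct" if e.correct else "incorrect"
        note = f" Reviewer note: {e.comment}" if e.comment else ""
        lines.append(
            f"- Trigger: {e.trigger} | Action: {e.action_taken} | "
            f"Judged {verdict} by {e.judge}.{note}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    event = FeedbackEvent(
        task_id="case-001",
        trigger="transaction flagged by monitoring rule",
        action_taken="closed as false positive",
        judge="human",
        correct=False,
        comment="should have been escalated",
    )
    print(to_feedback_line(event))
    print(summarize_for_prompt([event]))
```

The point of the structure is that the verdict stops being an email or a hallway comment and becomes data the learning loop can read.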
As you're talking to these organizations and enterprises about investing in these self improving AI systems and really pushing forward in a focused and concerted manner on agentic capabilities, what are some of the pitfalls or hidden costs that you need to explore before they are able to actually put these things into production in a repeatable and reliable fashion?
Yeah. I think it's full of landmines and not just pitfalls once you go into putting these systems in place. I think the first thing you realize is, yes, there are databases, there are ERPs and CRMs, and then there are policies and standard operating procedures. Essentially, you're trying to see how humans run these complex business processes today. The first thing you realize is enterprises
think that they are running as per that policy or as per that standard operating procedure, but they are not. So there are lots of policy gaps. The reality is humans over time have found a way to work around those policy gaps and have developed tribal knowledge around that. But agents fail with those policy gaps. And so how do you fill in that domain knowledge for the LLM, or the brain inside, when it says,
you know, I did everything right as per what the policy said and yet my outcome was wrong? Well, it was wrong because there are gaps in the policy and there are hidden processes in every enterprise where they found ways to fill those gaps. Those processes could be people oriented, like escalation paths, etc. Those gaps could be just tribal knowledge built over the type of scenarios that you operate in, etc. But that data just
isn't captured anywhere. And there is no way to backfill it either, because that data just doesn't exist historically. So you have to kind of start at day zero and build that knowledge graph, or build that sort of tribal knowledge layer, if you will, in a way that the LLMs can operate on it and build on it. And I think that's the hardest one to get right. The second one is, for these things to be truly actionable, there are a lot of gaps in the action steps. Not everything
exists. Not every action is in an executable, API-like format or, you know, in a digital format. It is in a set of steps which goes through its own processes of, you know, email channels or some other formats of executing on it. So it is hard to capture the end to end loop because it is very fragmented.
And so in some sense the right way to attempt it is to first automate the sub processes and get that right while you figure out how to digitize, or get right, the integration gaps across those sub processes.
And as you have been working with these businesses through that overall adoption curve and as they start to operationalize and rely on these systems, what are some of the most interesting or innovative or unexpected ways that you've seen them either apply self improving AI or ways that you've seen them approach the creation of the systems that support that capability?
Yeah, I think we have a lot of success stories. I think one thing we pride ourselves on is going from zero to production in the least amount of time, right? Even in a POC, we go from having signed a POC to going live with some kind of results in the fastest time. So I think what has been really great to see is
what we bring once we go into a use case. As I said, we are a vertical AI company. We operate in a few verticals where we know the end to end use cases perfectly. So we have a process graph for it already. We have the underlying entity graphs for what data is needed, what the relationships between them are, and so on. And we now have agentic capabilities. Once we map it to the customer's data,
we have agents that can do this mapping from their entity graph to our entity graph, or rather an industry entity graph. And we can map it from their data to a process graph very fast. So there is an agent that can do that. That kind of standardization really speeds up the path to value that these enterprises are getting. Otherwise there is a whole discovery phase. Now, you can only do this when you're operating in these vertical slices,
in the use cases you're aware of. If you start with a general problem in an enterprise, which can go down any number of paths, it's a little hard to do. So for us, practically speaking, it has been a very fruitful exercise to develop these standard industry or vertical knowledge graphs and process graphs that we land with on day one in the enterprises that we go into. And on the flip side, on the outcome side, it is actually quite remarkable what,
you know, some of these automations help you get to. Like I was saying, there are customers we have in financial services where their L1 and L2 agents and an investigation step have been completely automated by AI. And so L1 agent is primarily, it follows a policy and standard operating procedures. There's no human involved. It takes in that document. It creates an agent for us that would follow it, and it would follow it exactly to the T. And it will get it like 100%
right. And then you get to the L2 investigator stage, where there is some human intuition involved, and we are getting to very, very high accuracies there as well. The ultimate surprising example I'll give you is that by following policies and by coming up with its own judgment of some of the ways these investigations can be done, we are actually finding new detections.
Banks' policies were not getting updated with the latest in what the regulations say. But because these agents are always improving, always discovering new stuff, it started detecting new crime, which the policies were not ready to capture yet. So the self improving loops actually fix the refresh cycle: the time from when a US regulator passes a new regulation to when it gets caught in real scenarios,
that time really shortened. And it was surprising to us, it's surprising to the end customer. But I mean, if you think about it, it does make sense. This is the kind of like you're cutting many layers of human processes in the middle, and that's why you kind of expect it. We just didn't expect it to come so fast.
And in your own experience of working in this space and exploring this overall capability of allowing these AI systems to learn and improve from their execution and interactions, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
I think for me and my teams, we are following kind of two parallel paths. One is our own product building, system building, platform building, which is a process. It's a software development lifecycle process, and agents are
getting more and more autonomous there. And so we see that as a parallel to what our products and our agents are doing. How much autonomy we can get in our software development processes kind of parallels how much autonomy we are able to build into our end customer process automation with agents. And there's a lot of lessons we learn in our internal code writing and PR and testing and DevOps processes
that we are able to apply to the end scenarios. And a lot of problems that we face with implementation here are similar to the ones that we face in practice there as well. Creating that right environment that self improves in a coding environment is just as complex as creating that environment in a practical example. So I think without answering your question specifically,
I think that's been the most interesting journey for us. So in some sense, we are improving our own selves as developers while we are learning how to improve autonomy for the industries we operate in.
When you're thinking about the application of AI for a particular problem, what are situations where you would advise against the investment in those self reinforcing loops where you're fine just using an out of the box LLM or out of the box predictive model?
Yeah, that's a question we ask almost every day, in every use case we go into. I think there are levels to this question as well, right? Should you use an LLM? Yes. Should you use a large language model or a small language model? You should probably use a very small language model in many cases; it doesn't have to be large for the kind of task you are on. So I think an awareness of the task complexity
and whether an LLM can do it, a smaller language model can do it, or whether it needs an LLM in a loop to do it, that is the key categorization. Agents are effectively an LLM in a loop with certain context and clever ways of changing that context, etc. Because we operate in these specific use cases in the industries, the tasks are very, very clear to us. So for example, if you were to do a web research summary,
we actually have a small model for it. And it's like, you don't need to burn tokens on a very large language model there. When you are operating in a more undefined domain where, you know, the agent is making decisions with a lot of different contexts and variables, it should probably be thinking or it should be reasoning. And so for us, every step of the process automation
is fairly well understood: how deep it has to go in reasoning or thinking, or what kind of natural language understanding task it is. And some of those tasks are not that hard. Some of those tasks actually don't even need an LLM. And so we have that fairly well laid out. The recommendation
I would give is that it's a bit of an art to figure out which task requires what level of capability. But in general, you would say if it's the same task over and over again, and if it can be done by deterministic code, you can use an LLM to write that code, but then just save it, right? Don't use an LLM again and again to do it. Just use the LLM once, save that code, and execute that code again and again. If the input changes and if the decisions have to change again and again, that's when you have to decide whether you truly need an agent. So in our cases,
you look at some of the truly hard scenarios, like root cause analysis when some systems are going down in a plant, and it has to go into figuring out previous context, exploring whether similar things have happened and, you know, making judgments based on that. When it's a complex task like that, something which takes humans hours and days today, it's quite obvious you have to look into agentic systems.
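The "use the LLM once, save the code, execute it again and again" recommendation can be sketched as a small generate-once cache. This is a hedged illustration, not a described implementation: `load_or_generate`, the `run` entry point convention, and `fake_llm_codegen` (a stub standing in for a real model call) are all assumptions, and in practice generated code should be reviewed and tested before it is saved and trusted.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_or_generate(task_name: str,
                     generate_source: Callable[[str], str],
                     cache_dir: Path) -> Callable:
    """Return the task's `run` function, asking the generator for source
    code only when no saved file exists for this task yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_name}.py"
    if not path.exists():
        # The (expensive) model call happens once per task.
        path.write_text(generate_source(task_name))
    spec = importlib.util.spec_from_file_location(task_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run  # assumed convention: generated code defines run()


# Stub generator for illustration; a real one would prompt an LLM.
calls = {"count": 0}


def fake_llm_codegen(task_name: str) -> str:
    calls["count"] += 1
    return "def run(amount_cents):\n    return round(amount_cents / 100, 2)\n"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        convert = load_or_generate("cents_to_dollars", fake_llm_codegen, Path(tmp))
        convert_again = load_or_generate("cents_to_dollars", fake_llm_codegen, Path(tmp))
        print(convert(1250), convert_again(200), calls["count"])
```

Every later call runs plain deterministic code, so the token cost is paid once and the behavior does not vary from run to run.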
And when you were talking about that aspect of trying to avoid burning tokens unnecessarily, that also brings up another axis of self improvement of cost optimization where maybe one of your objectives is to minimize the expense of a particular agent use case or, obviously, cost optimization on the infrastructure side is a separate question. But in terms of the agent itself or the AI system itself optimizing its own efficiency,
I'm wondering what you're seeing as far as capabilities on that horizon as well.
Yeah, that's a great question. I mean, we are facing that every day with our Claude Code and Cursor costs. That would be my favorite feature from the Cursor team, if they could build a self improving cost optimizer.
But I think it probably goes against their business model to try to do that. To be fair, we haven't invested much in the area, but you can see it on the horizon. As I said, we do use many small language models in many of our tasks. I think the real challenge is how repeatable a task it is, like how exactly the same pattern you see the next time you do it. And the great thing about LLMs and bigger models is how many
different tasks they are great at, right? So if your task just changes parameters a little bit, they will adapt to it, or rather they have already shown good results in many other similar tasks and so on. So the task specificity versus how general you want to go typically defines it. Wherever we can achieve the levels of task specificity where, you know, it's
hard for it to deviate too much. I think that's where we should just optimize the hell out of it and and go to smaller models. You know, I'll tell you, practically speaking in the industry, we haven't built systems which try a whole lot of models. Like, I am very surprised by how good the Gemma series of models are from Google.
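For a narrow, well specified task, trying several models against a fixed eval and keeping the cheapest one that clears the bar could be sketched like this. It is a minimal sketch under assumptions: the `Candidate` type, the cost figures, the accuracy threshold, and the stub predictors are all hypothetical, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

EvalSet = Sequence[Tuple[str, str]]  # (input, expected output) pairs


@dataclass(frozen=True)
class Candidate:
    name: str                       # illustrative label, not a real model id
    cost_per_call: float            # assumed relative cost
    predict: Callable[[str], str]   # stand-in for a model call


def accuracy(predict: Callable[[str], str], eval_set: EvalSet) -> float:
    """Share of eval examples where the prediction matches exactly."""
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)


def cheapest_passing(candidates: Sequence[Candidate], eval_set: EvalSet,
                     min_accuracy: float) -> Optional[Candidate]:
    """Try candidates from cheapest to most expensive and return the
    first one that meets the accuracy bar, or None if none does."""
    for candidate in sorted(candidates, key=lambda c: c.cost_per_call):
        if accuracy(candidate.predict, eval_set) >= min_accuracy:
            return candidate
    return None


# Stub predictors for a toy ticket-routing task.
def always_billing(text: str) -> str:
    return "billing"


def keyword_router(text: str) -> str:
    return "billing" if ("refund" in text or "invoice" in text) else "account"


EVAL_SET = [
    ("refund request", "billing"),
    ("password reset", "account"),
    ("invoice copy", "billing"),
    ("login error", "account"),
]

CANDIDATES = [
    Candidate("large-stub", 10.0, keyword_router),
    Candidate("tiny-stub", 0.1, always_billing),
    Candidate("small-stub", 1.0, keyword_router),
]

if __name__ == "__main__":
    choice = cheapest_passing(CANDIDATES, EVAL_SET, min_accuracy=0.9)
    print(choice.name if choice else "no candidate passed")
```

Rerunning the same selection whenever the input distribution shifts is one way to notice that a small model no longer clears the bar.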
And like, I see hardly anyone in the industry trying Gemma for many of their tasks. Right? And even within Gemini versus Gemma, like people don't even try Gemma. And so I think we should build something where, once you've defined a task and an eval, you give it to an optimizer and it tries all these cheap models
and it says where it gets the most accuracy and it's able to achieve it. I think the question that always comes up is, what happens when there is a little drift in the input? Are you ready to absorb that? And I think with smaller models that has been a risk, but it's coming. I think the amount people are spending on coding agents is getting quite crazy out there. I think coding agents
will drive that cost optimization first. I've already seen a lot of our developers trying to use Ollama and local models, trying to use cloud only for thinking and a local model for writing the actual code. And so I think people have begun to develop these systems for wherever they are facing these cost pressures. I think the economy and the cost pressures and product margins will drive many of these things. I think right now people are just trying to get their use cases
and automation right with high reliability. Right after this will come the cost aspects. I think everyone feels they can reduce the cost, so they're kind of delaying it to one year onwards, but it's coming. It's coming for sure.
And as you continue to work in this space and monitor the evolution of the ecosystem, what are some of the ways that you anticipate the tooling and substrates and agentic frameworks adapting to these concepts of self improvement and making the actual execution and integration of the supporting systems easier to do?
I think it's very, very hard to predict. That's the true honest answer. I think what we can see is there'll be companies like us at Symphony AI who will build very performant, reliable agentic systems in the industries we are in. And our hope is that that results in wide adoption in those industries. But the enterprises will realize that the stacks are converging.
We get a lot of requests from our customers asking, besides the use cases you are in, can you help us in our company in standardizing the ways in which we should see our agentic systems? In the layers of data, I think we already see a fair bit of standardization in, you know, MCP servers and the MCP protocol standardizing interfaces to agents.
I haven't seen that much pickup in A2A-like protocols for multi agent interoperability, but, you know, with systems like OpenClaw and all getting popular, there will be some standardization in the agent control planes that enterprises see as well. With many agents running, who's governing and controlling policies across those? Clearly that is emerging as an area. I think the tooling is getting very, very standardized going forward.
What is hard to predict is whether it'll be Postgres databases as the agent preferred layer or whether it'll be file systems, and how these things change. That's still evolving and that's hard to predict. But the concept of agent lifecycle management has become fairly standardized, just like model ops and model lifecycle management.
And so those things are getting standardized. I think it's a question of agents going into production, creating value, and then everything under them will start getting standardized in its layers.
All right. Are there any other aspects of self improving AI systems or the ecosystem of AI applications that we didn't discuss yet that you'd like to cover before we close out the show?
I think for me, it's very interesting. I mean, ultimately, it's known to everybody that for the foundation model companies, reinforcement learning with real world environments, and especially with verifiable systems, is a big area of investment, and models are improving in different domains with more and more RL environments being created and so on. What's very interesting to me is if that same subset of capability
can come very quickly to enterprises in a way they can harness it. And so for just my business process, can I get kind of that same level of capability? Start with a smaller model and how do I apply the same reinforcement learning steps without needing to have a research scientist employed in my company, right? But knowing that the process is dynamic, knowing that if
I follow the process in one way, I know it is suboptimal. If I follow it another way, it is better. How can I turn it into a reward for the model without knowing what rewards mean and RL means and all that? If we can simplify that, I think that helps a lot of businesses, and that's an area we are trying to go deep into as well. I think that will enable companies to own sort of reasoning layers of their own without relying on the big model companies. So I'm excited to see how this area emerges.
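The idea of turning "this way of running the process is better than that way" into a reward signal can be illustrated with a small scoring function. This is only a sketch under stated assumptions: the `ProcessOutcome` fields, the 24 hour target, and the weighting are invented for illustration, and a real reward design would need to reflect the actual business process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessOutcome:
    """Observable result of one run through a business process.
    Field names are assumptions for illustration."""
    resolved: bool
    hours_taken: float
    policy_violations: int


def reward(outcome: ProcessOutcome, target_hours: float = 24.0) -> float:
    """Map an outcome to a scalar: violations are penalized, unresolved
    runs earn nothing, and resolved runs earn more when they are faster."""
    if outcome.policy_violations > 0:
        return -1.0
    if not outcome.resolved:
        return 0.0
    speed_bonus = max(0.0, 1.0 - outcome.hours_taken / target_hours)
    return 1.0 + speed_bonus


if __name__ == "__main__":
    slow_path = ProcessOutcome(resolved=True, hours_taken=48.0, policy_violations=0)
    fast_path = ProcessOutcome(resolved=True, hours_taken=12.0, policy_violations=0)
    # The path the business already knows is better should score higher.
    print(reward(slow_path), reward(fast_path))
```

A business user can reason about these fields without knowing RL terminology, which is the simplification being asked for here.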
All right. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling technology or human training that's available for AI systems today.
Biggest gap in tooling and technology. I would probably say it the other way. I think with the way Claude Code and some of these other agents are evolving, it is very easy to fill whatever gap exists, right? I think the gap I'm more worried about is the integration steps needed to apply this to real practical industries, real practical examples. I think there is no clear integration outline. Like everything has to be done
differently for every implementation that you go to. And so if there is a way to use agents to go and discover these processes and create, like, templates. I think effectively we've been talking about digital twins for a while in the industry, and there have been several attempts at it. We have our own attempt at it from an industry perspective at Symphony AI, but a true recognition of a digital twin in a company as tooling, as an integration layer, is
not there. And if it was there, then agents would onboard onto it very easily. But I think on the flip side, you put Claude Code in an environment in a company, you give it access to various live infra and resources, and it starts to figure things out. I think we've got the best tooling ever in the history of mankind of what developers have had, especially with these AI tools. So I'm actually seeing the picture as very, very optimistic on what we can do to fix the gaps in tooling where they exist.
All right. Well, thank you very much for taking the time today to join me and share your experiences and insights into how to build these AI systems in a way that they can continually evolve and improve
and some of the safety considerations around how to make sure that they stay well aligned with the organization's objectives. It's a fascinating and fast moving space. So I appreciate you taking the time to help share some of the expertise that you've developed through working through the hard bits and hope you enjoy the rest of your day. Thank you. Thank you for having me.
Its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
