Hello, and welcome to the AI Engineering Podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Unlock the full potential of your AI workloads with a seamless and composable data infrastructure. Bruin is an open source framework that streamlines integration from the command line, allowing you to focus on what matters most, building intelligent systems.
Write Python code for your business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. With native support for ML and AI workloads, Bruin empowers data teams to deliver faster, more reliable, and scalable AI solutions. Harness Bruin's connectors for hundreds of platforms, including popular machine learning frameworks like TensorFlow and PyTorch.
Build end to end AI workflows that integrate seamlessly with your existing tech stack. Join the ranks of forward thinking organizations that are revolutionizing their data engineering with Bruin. Get started today at aiengineeringpodcast.com/bruin. And for dbt cloud customers, enjoy a $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey, and today I'm interviewing Carter Huffman about his work in building an ensemble approach to low latency voice AI. So, Carter, can you start by introducing yourself?
Hey. I'm Carter Huffman. I'm the CTO and cofounder over at Modulate. We do proactive voice analysis and understanding in frontier voice AI. Super nice to meet you.
And do you remember how you first got started in the ML and AI space?
Absolutely. Yeah. It was actually over at the Jet Propulsion Laboratory in Pasadena, California, which is one of the NASA centers focused on the robotic exploration of space. And I've been really into space, really into robots my whole life. And so when a recruiter from JPL came to my school, and I, you know, kind of talked to him, and he was like, hey, you know, my job is landing robots on Mars. You want to come interview? I was like, oh, yeah, sure. You know, I don't do a ton of, like, space robotics. But yeah, absolutely. I'll come apply. And I got hired into the machine learning and instrument autonomy group. So I helped make the spacecraft smarter. That's where I started.
And so in the context of what you're working on now, I know that you recently had an announcement about some of the restructuring and refactoring of at least some component of your voice AI capabilities. And before we get too much into the details of that, I'm wondering if you could just unpack a little bit about what you mean by that term voice AI and how it is maybe differentiated from a lot of the other AI applications that folks are likely using today.
Sure thing. Absolutely. So by voice AI, I really mean machine learning and artificial intelligence models that are built towards understanding, manipulating, and producing human voice signals. And the reason why voice AI is kind of its own category is because voice is a really interesting form of media.
You have text. You have images. You have video. You have voice. And if you've been following the trajectory of AI just in general over the past twenty years, you notice that people build models that tackle text, images, and video first. Voice kinda trails behind, and that's because voice is really, really technically difficult. There's a lot of data there, but it all encodes
a ton of nuanced information that's very hard for us as human beings to verbalize and really recognize. Like, you can tell my tone is off just because you're, like, evolutionarily built to notice that stuff. And when you try to reproduce that or manipulate that with a machine, it's very, very difficult to get it perfect. And if you do voice AI slightly less than perfect, it becomes super noticeable and super, like, disconcerting.
Contrast with, like, generating an image of a face: you see a face in anything. So it's very easy to make a face even if there's a ton of cool research going on. So that's kind of why voice AI is its own separate thing.
And another aspect of that is that, actually, maybe one of the earlier models that was introduced in this current era of generative models and transformer architectures was the Whisper model from OpenAI, which is very focused on voice with the caveat that it's only trying to convert that to a textual representation to increase the amount of corpora available for training of these language models. And I'm wondering if you can add a bit more nuance to some of the detail of what you're talking about with those elements of inflection and tone and how that's differentiated from purely transcription based or being able to even do speech to speech where I just wanna be able to take that voice signal, convert it to text, feed it to an LLM, generate some output, and then convert it back to voice.
Absolutely. Yeah. So doing that flow, speech to text to speech, works reasonably well when the emotional content, sentiment, or kind of long term context of a conversation is relatively unimportant. So you can get kind of reasonable responses and reasonable back and forth on short time scales in that situation.
But voice contains a ton more signal than just the, like, textual content of what you're saying. How you say it is super, super important. So, like, for example, you know, I say, like, hey. I'm super excited to be here. I'm thrilled. Or I can say, like, yeah. I'm super excited to be here. I'm thrilled.
The exact same words. And if you transcribe it and feed it to an LLM, it will do exactly the same thing with those two signals. But you and I know that you would react to that statement extremely differently. And that's just one of the aspects. And there's a ton of, like, social relationship and context packed into
why am I expressing that emotion in that way. So that's what makes it difficult, and that's kind of what makes it very important to be able to capture those other signals because you have to react almost in opposite ways depending on the tone and emotion of the voice.
And now digging into some of the announcement of your ensemble listening approach, what are some of the challenges or constraints that are specific to voice AI that led you to putting in the effort to actually researching and developing this approach to how to actually manage some of that input processing and some of the general expectations and requirements around how it's supposed to operate that led you down that path?
Absolutely. So I would say there are two main constraints that factored into our decision to pursue a totally new architecture. Constraint number one is cost and processing power required. Just transcribing, analyzing a waveform, even doing relatively simple operations like detecting accent or detecting whether there's music in the background or what kind of background noise there is, those kinds of operations on an audio stream, you can get
a little bit of the way there with relatively simple and cheap models. But to do a very good job, you have to deploy much more powerful and expensive models. Like, there are transcription services out there that cost, you know, anywhere from 10 cents to a dollar to process an hour of audio. If you're a large social platform, you could have 50 million, 100 million, 3 billion hours of voice content on your platform each month. And you're not really gonna pay $3 billion
a month to analyze that audio if you wanna do something useful with it. So on the one hand, you have to be really, really cost effective and efficient at processing those audio signals if you're gonna do anything useful at scale. So that's problem statement number one. Problem statement number two is that audio, and especially voice conversations in particular, follow what I call a lumpy multimodal distribution.
Right? Like, you have different pockets of kinds of audio, but within those pockets, there's a ton of shared similarities. So we're doing, like, a podcast conversation. Right? We're in a relatively high quality audio environment, relatively low background noise. We're enunciating relatively clearly or at least you are. I hope I am. And we have kind of these properties that are shared among other very similar audio environments.
And then contrast that with calling into a customer support center, and I'm on my phone. It's over telephony, so it's, like, eight kilohertz audio, and I'm on a bus, and there are people talking in the background. And there are also a ton of calls like that, but that's a very different distribution. You take a model like Whisper and you apply it to that distribution, it's gonna produce absolute junk. Right?
So you have these different pockets where you have a lot of voice conversations that are very similar to each other, but the pockets are very, very different and require very different modeling approaches. And if you wanna build a big model that can handle all of those different pockets, all of the different modes of that distribution, it has to be extremely flexible and powerful to handle those different data distributions
equally well. If you build a model that's extremely powerful, it's not gonna be cost effective, so you violate the first constraint. So it's a really, really tricky situation, which is why it required new tech.
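To put the cost constraint in numbers, here is a small sketch that multiplies the monthly audio volumes mentioned above by the quoted per-hour price range. The function name and the flat per-hour pricing model are assumptions for illustration; real services may price differently.

```python
def monthly_cost_usd(hours_per_month: float, usd_per_hour: float) -> float:
    """Cost of running one per-hour-priced service over a month of audio."""
    return hours_per_month * usd_per_hour


# Volumes and prices quoted in the conversation.
VOLUMES_HOURS = {"50 million": 50e6, "100 million": 100e6, "3 billion": 3e9}
PRICES_USD_PER_HOUR = (0.10, 1.00)

if __name__ == "__main__":
    for label, hours in VOLUMES_HOURS.items():
        low, high = (monthly_cost_usd(hours, p) for p in PRICES_USD_PER_HOUR)
        print(f"{label} hours/month: ${low:,.0f} to ${high:,.0f}")
```

At the top of both ranges this reproduces the $3 billion per month figure, which is why a per-hour price in that range does not work at platform scale.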
And so digging into the specifics of this Ensemble architecture, what are some of the high level details about how it actually addresses some of those complexities around the high cost as well as high latency issues that come up when you're dealing with these large multipurpose models and then also tries to address some of the accuracy issues that you run into when you go to a smaller or more quantized model to address some of those latency considerations?
Absolutely.
Great question. So there are really two main properties that led us to the Ensemble model architecture. Property number one is that when we're talking about those multimodal distributions, because the similarities are so close to each other within a single kind of mode of that distribution, all of the data is relatively similar, and so you don't need a ton of capacity. A small model could get you very far.
Small specialized models can perform with high accuracy in certain circumstances, but you lose a lot in terms of generalizability if you are focused on a small model. However, the other insight is that these distributions continue on over the course of the conversation. So you and I are having this high audio quality back and forth conversation. We're very unlikely to switch to something like a noisy bus, low audio quality environment
halfway through the conversation. Or even more importantly, we're unlikely to be swapping back and forth all the time. And you might have different distributions for different participants, but they remain relatively static over the course of a conversation.
So the way to bring these two things together is that if you can isolate quickly and efficiently which distribution you're in or which distribution each participant of the conversation is in, then if you also have a small model or a set of small models that are tuned specifically for that distribution, you can pick the right model to apply to that distribution, trust it will mostly do a good job over the whole course of the conversation,
and then course correct if something does change, as it occasionally might. So this is sort of like how a really big model will find paths through the weights in the large network that are appropriate for different distributions.
But instead of having to run all the weights in the network or at least a large subset of the weights in the network, you are pre isolating down to a very, very small subset of the overall Ensemble model that you know ahead of time or know based on the first analysis of the data in the conversation is going to be applicable to likely the whole rest of the conversation distribution.
Architecturally, having more models increases the complexity
and potential for failure as well as the variety of failure modes as is the case with any complex distributed system. And so that definitely requires having a high degree of confidence in the orchestration, the failure recovery,
and I imagine also some measure of fallback capability for the situation where the ensemble just catastrophically fails. And I'm just wondering if you can talk to some of the ways that you thought through and proved out this overall engineering effort and what your calculus was as far as which pieces you were able to pull off the shelf and what are the pieces that you had to engineer from whole cloth.
Great question.
So one of the things that surprised us most when we started deploying these Ensemble models was really that as they were running inference, we found it very hard to predict actually which models were important to the results
and which models were relatively unimportant to the results and when a model was doing a good job or when a model was doing a bad job. So actually, the initial idea, the initial architecture, sort of the v1 of these ELMs in our particular model, Velma, was much more of a static ensemble that is preselected at conversation initialization. So you would have, like, okay. We have a big ensemble of a bunch of different models.
As you mentioned, a ton of models leads to a complex system, so you have to select down to a very sparse set of these models to actually deploy at runtime for a given conversation. Otherwise, complexity goes up, which means the cost and difficulty of orchestration goes up, which also means that the cost in general of running the system goes up. And we talked about cost being one of the big constraints earlier on.
So the initial idea was to statically select, like, okay. We know for a given language and a given, like, social chat environment or interview call environment or other things like that, we can kind of test based on some ground truth labeled data sets which subsets of models work best on these environments and then just kind of do a lookup table and statically look up which models to run and run those for the whole conversation.
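The static approach described here amounts to a lookup table keyed on conversation properties. A minimal sketch follows; every key and model name is hypothetical and made up for illustration, and it is not a description of how Velma is actually implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical table: (language, environment) -> models chosen offline by
# testing against ground-truth labeled data sets.
STATIC_ENSEMBLES: Dict[Tuple[str, str], List[str]] = {
    ("en", "social_chat"): ["asr_en_noisy_small", "emotion_casual_small"],
    ("en", "interview_call"): ["asr_en_clean_small", "emotion_formal_small"],
    ("es", "social_chat"): ["asr_es_noisy_small", "emotion_casual_small"],
}


def select_models(language: str, environment: str) -> List[str]:
    """Pick the model subset once, at conversation start, and never revisit it."""
    key = (language, environment)
    if key not in STATIC_ENSEMBLES:
        raise KeyError(f"no static ensemble configured for {key}")
    return STATIC_ENSEMBLES[key]
```

Nothing in this design notices when the conversation does not match its assumed environment, so a wrong lookup stays wrong for the whole call.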
And what we found was that when that fails, it fails very, very, very poorly because these models are small. And, individually, they don't generalize well. The whole ensemble generalizes well, but, individually, the models don't generalize well. If you pick the wrong small model, you're going to get very bad results. So this led us to doing a ton of research into how do you cost effectively monitor the results of the models,
how do you spot check whether or not the models are performing well, and how do you course correct and select different subsets of models if you know these aren't performing well? And this is the idea of the dynamic ensemble block, which is one of the cool pieces of the ELM architecture that we invented, where we realized that this is actually an optimization problem.
Each model has a cost to run. In a given ensemble, like our ensemble of transcription models or an ensemble of emotion detection models, each has an expected accuracy given the distribution of audio that you're currently looking at. And they each have a degree of confidence that you can have about their outputs. And the optimization problem is that as you're running data through, you're getting results from the models that you selected in the Ensemble.
You compare those results to each other and to the rest of the context of the conversation to figure out how much you trust those results. And then the optimization problem is around the next part of the conversation, say, the next minute that you're analyzing. For that next minute, how do you optimize for accuracy while minimizing cost?
And that transforms that whole problem down into a very well known, like, multi-armed bandit exploration versus exploitation problem, so you can deploy a ton of well known machinery to solve that optimally. That was the really, really cool thing once we found out this surprise that when the models go off the rails, they go really bad.
Yeah. As you're describing the architecture, it brings to mind two, I guess, complementary
aspects of prior art where you're talking about the optimization question. It brings to mind a lot of the cost based optimizers that go into SQL query planners, particularly for these large scale out distributed database engines or things like the Trino engine that allows you to be able to do predicate pushdowns based on what is the actual underlying storage.
And then in terms of being able to determine which particular model or module you want to actually route to, it brings up a lot of the conversation that's happening right now about how to actually implement these multi agent systems and how to manage communication and routing between those different agents to be able to achieve the desired outcome.
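The exploration versus exploitation framing mentioned above can be sketched with a standard UCB1 bandit plus a linear cost penalty. This is a generic textbook technique, not Modulate's actual optimizer: the model names, costs, scores, and the `cost_weight` parameter are all assumptions made up for illustration, and the fixed scores stand in for whatever quality signal a real system would derive from cross-checking model outputs.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelArm:
    name: str
    cost: float               # hypothetical relative cost per minute of audio
    pulls: int = 0
    total_score: float = 0.0  # sum of observed quality scores in [0, 1]

    def mean_score(self) -> float:
        return self.total_score / self.pulls if self.pulls else 0.0


def choose_arm(arms, cost_weight=0.5):
    """UCB1 selection with a linear cost penalty (the penalty is an assumption)."""
    for arm in arms:          # try every model once before comparing
        if arm.pulls == 0:
            return arm
    total_pulls = sum(arm.pulls for arm in arms)

    def value(arm):
        exploration_bonus = math.sqrt(2 * math.log(total_pulls) / arm.pulls)
        return arm.mean_score() + exploration_bonus - cost_weight * arm.cost

    return max(arms, key=value)


def record(arm, score):
    """Store the quality score observed for the segment this model handled."""
    arm.pulls += 1
    arm.total_score += score


def simulate(rounds=200):
    # Fixed, made-up scores stand in for how well each model fits the
    # current audio distribution (here: clean, high quality audio).
    true_score = {"clean_specialist": 0.90, "noisy_specialist": 0.40, "generalist": 0.95}
    arms = [
        ModelArm("clean_specialist", cost=0.1),
        ModelArm("noisy_specialist", cost=0.1),
        ModelArm("generalist", cost=1.0),
    ]
    for _ in range(rounds):
        arm = choose_arm(arms)
        record(arm, true_score[arm.name])
    return {arm.name: arm.pulls for arm in arms}
```

Calling `simulate()` shows the cheap specialist that fits the audio being selected for most segments, while the mismatched specialist and the expensive generalist are still sampled occasionally, which is the exploration part of the trade-off.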
Exactly. And I think that's what makes it so exciting and so, I don't wanna say possible, you know, but at least tractable to develop this Ensemble listening model architecture as a, you know, what, 45 person startup, is that as you described, cost optimization problems are everywhere.
And we're really, really like, as a species, we're really good at solving cost optimization problems. So as soon as we found out, oh, this really tricky part of analyzing conversations cost effectively and accurately can be transformed into a cost optimization problem.
That was the eureka where we were like, okay. We have the tools to actually deal with this. And you can deal with that in terms of, like, you know, what is your overall cost budget, and how sensitive are you to that? You can add in things like how many GPUs or clusters of GPUs we have available. Because one of the exciting things about cost optimization around our small models
based architecture here is that you don't need the biggest and best GPUs. We're running on, you know, five, six year old GPUs sometimes because they're the most cost effective. So you can make the optimization problem even more complicated, but we have the optimizers to do that. And we've had them for decades, and they're really, really, really good. The routing question, I think, is a really, really fun research problem.
And the thing that makes multi agent systems routing so hard, in my opinion, is that agents are so flexible. Any agent could, in theory, do a wide variety of things, take in a wide variety of inputs, produce a wide variety of outputs. They're very flexible subcomponents of the system. So figuring out how to route to which one is optimal when many of them can do many of the tasks can be very tricky. This is an advantage of
the ensemble of small models approach, at least when trying to solve the routing problem: we know exactly what all of these small models do. We know exactly what their cost is. They don't have a thinking budget. They don't really have a very flexible compute graph. Each of these individual models is relatively small and self contained and predictable.
And we also, because they're small, have a very good understanding of their output distribution across different kinds of inputs that they might be given and what kind of accuracy ranges we can expect. So in terms of routing between the different models, making each of the individual nodes super predictable, as we have, turns that routing problem into a pretty deterministic
kind of thing as opposed to something with agents where it's very easy for the agents to, like, get themselves into a loop or something like that. We're very much a, like, directed graph with some feedback loops that are asynchronous and low priority, to do things like optimizing which models to select. But that doesn't get in the way of the feed forward processing of the data.
It also reminds me of at least my understanding of the GPT-5 category of model deployments where my understanding is that it's not actually one monolithic model. It's actually several sub models that are specialized for different cases, and they abstract that underneath
a routing layer that determines which model to actually use under the covers. And I'm wondering what are some of the aspects of prior art that you were able to pull from either anything that they've published or any other similar product launches that have happened over the past several months.
Absolutely. We've been working with ensembles of models for years.
So the prior art that we pull from is mostly more from traditional mixture of experts models and traditional hybrid systems between rules based and neural network or black box kind of architectures and combining those things together. I also, and it was really exciting, got to pull on some of my work from JPL where we were doing a lot of investigations into how can you deploy AI systems safely on spacecraft. And on a spacecraft, doing something unpredictable
is often quite fatal. Right? If you accidentally turn your solar panels away from the sun, it's hard to recover from that if your battery goes dead. And if you accidentally turn off your radio, unless you have a fallback system, which they do, it's hard to tell the spacecraft to turn it back on again. So when you're deploying an AI system onto a spacecraft, the number one lesson is bounding the effects that that one model can have
on the rest of your system. So no matter what the network says, like, hey. We need to go look at this rock next. It's the most important rock I've ever seen in my entire life. Ultimately, the outputs of that model can be perfect or they can be garbage, but they're all fed into more of a rules based, well understood, constrained optimization algorithm such that even if the AI encounters something completely unpredictable, you've already got the predetermined bounds to keep it from
completely crashing everything else. So this actually related back to how we orchestrated some of our models because we're using those input distribution features and the results from some of our sub ensembles to figure out, are we doing a good job at understanding this conversation? But you might see the problem with that of we're using our models to figure out if our models are doing a good job. And if they're confused but confident, you can mess up the entire conversation.
And some of the stuff we're doing is actually pretty, like, high risk, high reward kind of stuff. You know? We're deployed in social games, stopping harassment. If you accidentally ban the victim instead of the aggressor in a harassment situation, that can cause a lot of real harm to a lot of real people. We look for extremism,
child safety incidents, other things like that. So it can be relatively high cost if you get this stuff wrong. Of course, you have other safeguards on it, but it's still an important problem to get right. So if your models are just confident in their own abilities and you don't have a check on them, you can run into serious problems. You were asking a little bit about off the shelf models earlier. There's actually a place for the large, highly general models
in an ELM framework as kind of a supervisor or a check. So we actually have a suite of models in our different ensembles that are very high capability, high cost generalists. And we will occasionally and this is, again, an optimization problem. We will occasionally run them to check the outputs of our smaller, more specialized models and make sure that they agree to a reasonable degree. And you can't run it too often or it gets expensive,
but you can't go without running it at all. Otherwise, you fall into this confidence problem. So you have this kind of checking feedback loop, and that's how you constrain your overall system to do a pretty good job. But, again, a lot of that prior art is coming from stuff that people were working on one, two decades ago. Some of these ideas, they're not super, super new.
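The checking loop described here can be sketched as a periodic spot check: the cheap specialist runs on every segment, and an expensive generalist runs only on a sample to measure how often the two disagree. The function name, the fixed sampling interval, and the disagreement threshold are assumptions for illustration; the conversation describes the check frequency as part of an optimization problem rather than a fixed interval.

```python
from typing import Callable, List, Sequence, Tuple


def spot_check(
    segments: Sequence,
    specialist: Callable,
    generalist: Callable,
    check_every: int = 5,            # hypothetical sampling interval
    max_disagreement: float = 0.3,   # hypothetical alert threshold
) -> Tuple[List, float, bool]:
    """Run the specialist everywhere and the generalist on every Nth segment.

    Returns the specialist outputs, the observed disagreement rate on the
    checked segments, and whether that rate exceeds the threshold.
    """
    outputs = []
    checks = 0
    disagreements = 0
    for index, segment in enumerate(segments):
        label = specialist(segment)
        outputs.append(label)
        if index % check_every == 0:
            checks += 1
            if generalist(segment) != label:
                disagreements += 1
    rate = disagreements / checks if checks else 0.0
    return outputs, rate, rate > max_disagreement
```

Raising `check_every` lowers cost but lets a confidently wrong specialist run longer before anyone notices, which is the tension described above.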
And then generalizing beyond the voice use case, what are some of the environments or situations that you would advocate for teams to explore, either building their own Ensemble approach or if there are any other off the shelf frameworks that allow for building this style of Ensemble model capability, again, not including the multi agent use case because of the inherent flexibility involved.
Great question.
The applicability of Ensemble models... well, first, I would say anybody working in a regime where you are going to do a similar task many, many times over, I would recommend looking into small models in general and ensembles of small models in the cases where you need a lot of general flexibility, but all around a similar kind of task. So the reason why ELMs were so directly applicable to voice conversations is because voice conversations have a ton of shared properties.
You're always gonna wanna get a transcript. You're always gonna wanna figure out who are these people that are talking. You're always gonna wanna figure out what emotion are they using, what accent are they using. Give me some basic demographics so I can understand the context. Always gonna wanna understand the topics of discussion, the intents of the people involved, like what are these people trying to achieve when they're discussing in order to understand that conversation.
And there are tens of billions of conversations that happen on this planet every single day, but they all fall into that structure. And if you contrast with something like just a large language model, there's none of that structure of how must I analyze this data built into it. Even though the space of conversations, all voice conversations is very, very general, it's still a much, much more constrained
space of what kinds of things do you want to do to analyze this data than just say an arbitrary stream of text or an arbitrary stream of data, which is what the big foundation models are applicable to. So if you can identify that structure and say, I'm gonna be tackling a very diverse set of data, but they all need to be analyzed in similar ways, then deploying small models lets you take advantage of the cost efficiencies of those small models.
You can break the problem down into several pieces, like let's analyze emotion. Let's analyze transcription. Let's do these other things. And if you still want to maintain the flexibility and generality of applying to a bunch of different distributions, like different kinds of conversations, then that's where you'll want ensembles of those models. So that's where I would suggest looking into using ensemble models instead of just going straight for a big foundation model.
And then the large open question that I see a lot of people trying to deal with right now as a broader set of people are entering this ML and AI ecosystem is how do I actually verify that anything that I changed didn't just completely destroy everything that I've been working on for the past six months? And so one of the approaches that has gained a lot of popularity is the idea of evaluations.
There are also the observability aspects of doing real time analysis of model performance and outputs to determine if that is actually producing the desired result. And as you add this Ensemble aspect to the overall request path, what are some of the ways that that complicates that question of how do I actually validate the overall functionality of the system as well as being able to do more isolated validation of those sub models?
It gives you added complexity you need to handle, but it also gives you a lot of power and tools to solve that complexity. So the added complexity, as you might guess, is the extraordinary number of different routes through the system that your data can take. So if I have some small models that are specialized to high quality audio and my ensemble picks a model specialized to low quality audio instead, then for the rest of the system, anything downstream from that, all bets are off.
And because you're doing this dynamic optimization that I was talking about earlier, where the results of models running earlier in the conversation feed back into the system to help it pick which models to run in the future, if you've got garbage coming out of your models early in the conversation, you can poison that decision making and those choices for later on in the conversation. So it makes it really, really hard because of the sheer number of paths the data can take through your Ensemble.
And the flip side is the Ensemble approach also gives you a lot of tools. So part one is the structure that I was talking about earlier. So we know that a conversation has these different pieces of structure that we're looking for. And all of the structured outputs, like for example, the text content and emotion content of what I'm saying, are related. They're not identical. One's not just a strict function of the other. Otherwise, you wouldn't need the multiple models, but they're related.
And so for observability and debugging purposes, one thing you can do, which we in fact do, is check sentiment extracted from text and check emotional tone and see if they're matching up relatively often. It's a statistical approach, not a deterministic debugging tactic, but you expect different outputs from different pieces of your ensemble to correlate in certain ways.
And so if I identify that for certain kinds of conversations, like, let's say, low quality audio calls, no matter what text is being output, my emotion model always thinks people are angry, then I know that something's likely to be wrong. And in certain situations, again, the emotion and the text might not match. But if they're completely uncorrelated over a large amount of data, then that gives you a pointer for what to look into.
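The statistical check described above can be sketched as a Pearson correlation between two per-segment scores: sentiment extracted from the transcript and valence from the emotion model. The score ranges, function names, and the `min_corr` threshold are assumptions for illustration; a constant emotion output (the "always angry" case) has zero variance and is treated here as zero correlation.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a stuck model output carries no information
    return cov / math.sqrt(var_x * var_y)


def looks_suspicious(text_sentiment, emotion_valence, min_corr: float = 0.2) -> bool:
    """Flag a batch of segments whose two signals barely correlate."""
    return pearson(text_sentiment, emotion_valence) < min_corr
```

A flag from this check is only a pointer for investigation, since individual segments can legitimately pair positive words with a negative tone.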
And digging now into some of the specifics of these voice models, I have a generalized understanding of the ways that large language models work in terms of these transformer architectures, the attention mechanisms that are used, the variants there, the idea of these mixture of experts models for the case where you want to be able to incorporate more reasoning.
And I'm just curious how much of that is analogous in this case of audio and particularly voice focused models, and how much of the actual underlying architecture needs to shift because of the modality and the specifics of speech?
It's actually very complementary. So when you're tackling an audio analysis problem, you're looking at both how do I get the right results out of the underlying data?
And then how do I interpret those to make broader sense of the conversation and do something useful with it? The techniques that you're talking about around attention and all sorts of other capabilities for these models, audio is a very long time scale kind of data, right? Like, you've got a 48 kilohertz audio signal. That's
a ton of different data points for just a second where you might utter, like, you know, what, one or a couple of different phones, make a couple of different sounds. So you're mapping this very, very, very high dimensional audio signal down to, even if you're doing emotion and all the nuance and everything, a much, much, much lower dimensionality space.
Technologies like transformers and even older school stuff from the 2010s, you know, earlier 2010s, like dilated convolutions and WaveNets and things like that, all of these different strategies and architectures perform
very, very well on these longer sets of data. And a lot of the research that's been going into these very large models has been increasing their context windows and how to efficiently handle and select in extremely large data contexts what parts of the data to pay attention to. All of that is applicable in extracting different pieces of information accurately from these audio signals. But the next layer up is really where a lot of the Ensemble technology that I've been talking about comes in.
So when you're building an ensemble model, if you can slot in a better emotion detection model, the whole ensemble's gonna work better. And the better emotion detection model might be replacing, say, a standard dilated convolution network with a network that employs an attention mechanism or some other thing like that. And all that you have to change in terms of the Ensemble model is your priors on the accuracy of that model, assuming it gets more accurate,
and your cost parameters of the model, assuming it might cost more to run. And then once you've changed that, the rest of the Ensemble just works as it normally did. So they're very complementary. Speaking to that longer horizon and the longer window of attention that's necessary, it also brings up another trend that's becoming very popular in agentic contexts, which is the idea of agentic memory and being able to store and retrieve certain aspects of information for later usage.
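The point above about swapping in a better emotion detection model can be sketched as follows. This is a minimal illustration, not Modulate's implementation: the `ModelSpec` class, the log-odds weighted vote, and all numbers are assumptions. The combining rule is a standard weighted vote where each model counts in proportion to the log-odds of its prior accuracy.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelSpec:
    predict: Callable[[bytes], bool]  # hypothetical binary detector
    prior_accuracy: float             # prior belief the model is right, in (0.5, 1)
    cost: float                       # relative cost of one call


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


def ensemble_vote(models: Dict[str, ModelSpec], audio: bytes) -> bool:
    """Weighted vote: more trusted models pull the score harder."""
    score = 0.0
    for spec in models.values():
        weight = log_odds(spec.prior_accuracy)
        score += weight if spec.predict(audio) else -weight
    return score > 0


def total_cost(models: Dict[str, ModelSpec]) -> float:
    return sum(spec.cost for spec in models.values())


models = {
    # Stand-in for an older dilated-convolution emotion model.
    "emotion": ModelSpec(lambda audio: False, prior_accuracy=0.70, cost=1.0),
    "keywords": ModelSpec(lambda audio: True, prior_accuracy=0.75, cost=0.5),
}
before = ensemble_vote(models, b"")

# Swap in a (hypothetical) attention-based replacement: only the accuracy
# prior and the cost change; ensemble_vote itself is untouched.
models["emotion"] = ModelSpec(lambda audio: False, prior_accuracy=0.90, cost=3.0)
after = ensemble_vote(models, b"")
```

With these made-up numbers, the more trusted emotion model now outweighs the keyword model, so the combined decision flips while the combining logic stays the same.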
And particularly if you're dealing with a minutes to hours long conversation or even monologue. Obviously, there are cases where you're going to overflow the available context window for even an ensemble model, and I'm just curious how you take advantage of some of these storage and retrieval questions for being able to do that more long horizon analysis of this high information density medium. This is where knowing the structure
of your data is so important. It's like what's old is new again. Right? It's always been in machine learning, like, understand what data your models are going to operate on. Bring in domain expertise as well as do backprop through the neural net. Right? For something like a conversation, you could have a five hour long conversation, but you have priors, very good priors on what data is actually important to that conversation.
Right? You wanna know who are these people, what are their intents right now, what have their intents and goals been through the course of the conversation, how are they acting towards each other? What's the distribution of emotions? Like, for example, I sound pretty excited all the time. So when I deviate from that norm, then that's information that's important.
By digging into, like, how do humans understand conversations, like, what kinds of tasks are you trying to achieve by analyzing these conversations, you can actually kind of preregister, precreate, preallocate the pieces of information that are going to be broadly, generally relevant to
the conversation. If I'm gonna make a super, super simplified example, it's like, you know, pinning the header or pinning the rows in an Excel spreadsheet or something like that. It's like, yeah, you can kinda scroll back and look at a sentence that happened, like, three hours ago or, you know, a cell that's, like, 200
rows over. But you always know that some of this data is going to be important and should be front and center and visible to your models at all times. So for our use case, instead of trying to have some extremely flexible dynamic storage with absolutely no priors or structure on it, we have a ton of preallocated structure around what's a running summary of the conversation. Who are the participants?
What do they sound like? What's their general distribution of emotions across the conversation? All of these different things. And those are preregistered as kind of a fixed memory block that serves as the starting point for context for all of the individual models in the ensemble. And of course, some of the models can add more context and go in dynamically.
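A minimal sketch of what such a preregistered memory block might look like. The field names, the `is_unusual` check, and its threshold are all hypothetical; the only things taken from the conversation are the categories themselves (running summary, participants, what they sound like, their distribution of emotions) and the idea that individual models can add dynamic context.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ParticipantState:
    role: str = "unknown"
    voice_description: str = ""
    emotion_counts: Counter = field(default_factory=Counter)

    def is_unusual(self, emotion: str, threshold: float = 0.1) -> bool:
        """True if this emotion made up less than `threshold` of what was heard so far."""
        total = sum(self.emotion_counts.values())
        if total == 0:
            return False  # no baseline yet
        return self.emotion_counts[emotion] / total < threshold


@dataclass
class ConversationMemory:
    # Fixed, always-present slots that every model in the ensemble can read.
    running_summary: str = ""
    participants: Dict[str, ParticipantState] = field(default_factory=dict)
    # Free-form slot for context that individual models add dynamically.
    extra: Dict[str, str] = field(default_factory=dict)

    def observe_emotion(self, speaker: str, emotion: str) -> bool:
        """Record an emotion label; report whether it deviates from the speaker's norm."""
        state = self.participants.setdefault(speaker, ParticipantState())
        unusual = state.is_unusual(emotion)
        state.emotion_counts[emotion] += 1
        return unusual


memory = ConversationMemory()
flags = [memory.observe_emotion("carter", "excited") for _ in range(20)]
flat = memory.observe_emotion("carter", "flat")  # deviates from the excited baseline
```

Because the slots are fixed, a model that needs the participant baseline never has to wonder whether a retrieval step happened to surface it.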
And of course, those things like the summary and the participant roles and the behaviors they're exhibiting, they change over time. But we always know that those pieces of information are going to be relevant. So it massively, massively, massively simplifies the memory problem for this system, and it also reduces the ability to make weird errors. Right? If you're thinking about a system with memory, the system might be doing something really, really stupid.
And you're like, why is it doing this? And it's because, oh, it didn't have the right information in memory. Why didn't it have the right information in memory? Now you're going down a very long debugging path that is going to be very complicated. And by changing how the system stores things in memory, you could change a ton of other behaviors without knowing it.
Preregistering the structure of the memory or even just a subset of the memory ahead of time means that your models will always have the relevant information and reduces a ton of the errors that these very, very, very dynamic and flexible systems can have. As you have been exploring this architecture, developing your own implementation of it, and keeping an eye on just the overall ecosystem of how people are dealing with these model deployments, model optimization,
what are some of the complexities that you as an engineering team have had to come to terms with and address to be able to achieve this objective and some of the useful learnings that you developed in the process of actually getting to delivering this as a feature set? Probably the biggest and most obvious complexity that we've had to deal with and overcome has been the fact that this Ensemble listening model architecture by necessity is a very distributed system.
In general, if you're going to run a neural network on a lot of data, you want to load the weights into the GPU and keep them there. You don't want to be swapping models in and out. If you're going to do that for a possibility of over 100 different models that you might run on any given piece of data, some models can be shared in a single GPU. But you're not going to fit all 100 models into one GPU or even a couple GPUs.
So you're starting to look at, in order to run inference on any individual data point, I'm going to need to route that data to a small subset of a bunch of different machines with a bunch of different models loaded onto them. So it very, very quickly becomes a distributed computing problem. When tackling a distributed problem, the distributed problem becomes a lot harder if you need strong consistency guarantees.
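The placement and routing problem described above can be sketched in a few lines. This is an illustration under stated assumptions, not the system discussed in the interview: the model sizes, the GPU capacity, and the first-fit placement strategy are all made up for the example.

```python
from typing import Dict, List


def place_models(model_mem_gb: Dict[str, float], gpu_capacity_gb: float) -> List[List[str]]:
    """First-fit decreasing: put each model on the first GPU with room, else open a new GPU."""
    gpus: List[List[str]] = []
    free: List[float] = []
    for name, mem in sorted(model_mem_gb.items(), key=lambda kv: -kv[1]):
        if mem > gpu_capacity_gb:
            raise ValueError(f"{name} does not fit on a single GPU")
        for i, room in enumerate(free):
            if mem <= room:
                gpus[i].append(name)
                free[i] -= mem
                break
        else:
            # No existing GPU had room, so start a new one.
            gpus.append([name])
            free.append(gpu_capacity_gb - mem)
    return gpus


def route(needed: List[str], placement: List[List[str]]) -> Dict[int, List[str]]:
    """Map each GPU index to the subset of requested models it hosts."""
    plan: Dict[int, List[str]] = {}
    for gpu_index, hosted in enumerate(placement):
        wanted = [m for m in needed if m in hosted]
        if wanted:
            plan[gpu_index] = wanted
    return plan


# Assumption: 100 models of 2 GB each and 24 GB GPUs.
models = {f"model_{i:03d}": 2.0 for i in range(100)}
placement = place_models(models, gpu_capacity_gb=24.0)
plan = route(["model_000", "model_050", "model_099"], placement)
```

Even in this toy setup, a single data point that needs three models fans out to three different GPUs, which is where the distributed systems concerns start.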
One of the really nice things about modern machine learning is that a lot of models can be very forgiving when given errors or gaps in data. And we exploited that very, very heavily in the ELM architecture.
So instead of asserting that all of the different models in the ensemble must have their data available by some sort of deadline or, worse, locking until every single data point has come in, we make sure that our data flow goes from less flexible models to more flexible models in a pretty strict hierarchy, at least for that feedforward processing.
So if our less sophisticated models, smaller models, don't all return results in a required amount of time, the good news is that data is flowing into a more flexible and more forgiving, in some ways, model. And we have a much greater ability to account for and deal with missing data than we would if we were feeding into another super tiny, super specialized model in the Ensemble.
So that's been a very, very strong rule of thumb for us. Always flow data from less sophisticated, smaller models into more flexible, more general models. And that helps us account for these distributed computing errors and still do a good job generally, even in times of high model dropout.
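The rule of thumb above can be sketched with a deadline instead of a lock. This is a minimal illustration, assuming hypothetical stand-in models; `small_model`, `general_model`, the delays, and the deadline are all invented for the example. Small models that miss the deadline simply show up as missing values, and the more general downstream model works with whatever arrived.

```python
import asyncio
from typing import Awaitable, Dict, Optional


async def small_model(delay_s: float, value: float) -> float:
    """Stand-in for a small specialized model served on some remote machine."""
    await asyncio.sleep(delay_s)
    return value


async def gather_with_deadline(
    calls: Dict[str, Awaitable[float]], deadline_s: float
) -> Dict[str, Optional[float]]:
    """Collect whatever finishes before the deadline; everything else becomes None."""
    tasks = {name: asyncio.ensure_future(call) for name, call in calls.items()}
    done, pending = await asyncio.wait(list(tasks.values()), timeout=deadline_s)
    for task in pending:
        task.cancel()  # do not block on stragglers
    return {
        name: task.result() if task in done and task.exception() is None else None
        for name, task in tasks.items()
    }


def general_model(features: Dict[str, Optional[float]]) -> float:
    """Stand-in for the more flexible downstream model: uses only the inputs that arrived."""
    present = [v for v in features.values() if v is not None]
    return sum(present) / len(present) if present else 0.0


async def main():
    calls = {
        "emotion": small_model(0.01, 0.75),
        "keywords": small_model(0.01, 0.25),
        "slow_model": small_model(5.0, 0.1),  # simulates a dropped or late model
    }
    features = await gather_with_deadline(calls, deadline_s=0.2)
    return features, general_model(features)


features, score = asyncio.run(main())
```

The design choice to note is that tolerance for missing inputs lives in the downstream model, so the upstream fan-out never needs strong consistency guarantees.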
And as far as the cost question of being able to deliver this functionality more economically, what are some of the ways that you're tracking the real world impact of that, particularly given the distributed nature of the actual cost as well as the increased operational overhead and potentially headcount needed to be able to keep the system running? Good question. In terms of the additional complexity, I would say that most engineering systems have complexity in a variety of different places.
And by specifically focusing on orchestrating a bunch of different models together, we're solving a ton of these different orchestration problems, but they're all very similar problems. So it actually pays off for us to have a small number of really, really deep knowledge experts in orchestrating systems of models and then building reusable tools to manage those different ensembles.
Given that we can do that, we end up not having a huge amount of what I would call additional overhead, because the alternative is if you're working with a very, very, very big foundational model, now you're trying to shard that model across a bunch of different GPUs. And if you're trying to do data communication between different parts of, say, one neural network, that's a situation where changes in data or dropout or distributed system failures can be very catastrophic.
It's very, very unpredictable what a neural network will do if you just cut out a little chunk of its weights. And so you have to solve those distributed problems if you're building big, like very big foundation models in strong consistency ways. And you have to deploy a ton of really, really capable hardware and network them in extremely low latency setups in order to do that kind of inference in a way that results in a good customer experience.
By chunking down all of our models to be able to run on single machines or single GPUs, we actually buy a ton of flexibility. And so I wouldn't say it's more complex. It's a different kind of complexity, and in some ways, actually more forgiving than trying to be a model provider that just holds a really big foundational model.
And then in terms of the evolution of this style of architecture, what are some of the impacts that you foresee these capabilities having on the broader industry and the challenges that we're already running into as far as just brute force scaling and increasing the number of parameters versus these more small specialized models, and then also how that factors into
the push to the edge of having this be a more naturally distributed system where you can have some of these very small models on the edge to determine whether to even propagate that request downstream. I think that Ensemble models are going to replace a lot of existing foundation model use cases and extend a lot of other foundation model capabilities. So in the replacing existing use cases, I see Ensemble models as being the right tool to solve repeated structured problems.
Again, even as general as conversation analysis, that is a very general problem statement but much, much narrower than arbitrary operations on streams of data. So if you're going to do a lot of conversation analysis, it can pay off in terms of accuracy, determinism, and cost to build out an ensemble to take advantage of that structure. I think of the parallel as if you have a novel task, say, this is the first time I ever wanna just write down a transcript of this podcast.
You just go ask somebody to do it. But if you need to transcribe 10,000 podcasts, you don't go ask a human to transcribe 10,000 podcasts. You ask a specially constructed system to do that transcription because it's going to do that same work deterministically 10,000 times over. In a similar way, these large model companies, like OpenAI and Anthropic, are building basically artificial people.
And those systems are trying to be extremely, extremely general. But just as you wouldn't ask a real person to do the same repetitive task 10,000 times, I think you also wouldn't ask an artificial person to do the same repetitive task 10,000 times. You're just wasting a ton of capacity in either case, and you would much rather prefer a deterministic, accurate, and more cost effective solution that takes advantage of the structure and repeatability of the problem you're trying to tackle.
So a lot of businesses and individuals are using big foundation models to tackle repetitive, highly structured tasks just because they're so powerful. But I think that is going to turn out to be very, very far from the most cost effective and scalable solution. When you have these repeated situations, you can deploy systems that are much better adapted to them. So I think that's how Ensemble models will replace
some of the applications of the larger foundation models. I think the flip side is they'll actually work in harmony to augment the intelligence and capabilities. Right? So especially with things like chain of thought systems and memory systems, having more compute gives you access to more intelligence capabilities.
But if you're trying to do something like media understanding or voice understanding or conversation understanding, you're spending a lot of your compute and your capacity to process that stream of data and get to a useful representation of that data that the rest of the system can work with.
I see Ensemble models as basically becoming a suite of extremely powerful data processing tools available to these more flexible agents so that if you have an incoming data stream or a media stream, the agent can deploy a tool like an ELM to understand that stream, save, again, 100,000x on the compute it would have by itself needed to understand that data stream, and then take the representation of that data produced by the Ensemble model
and process it with all of that leftover compute that was available to it because it didn't have to understand the data itself. And the real question I wanna know is when is this Voice AI going to eliminate the need to manually edit my podcasts? It's coming. It's coming very, very soon. I think 2026 is already starting off as an extraordinary year for voice AI models.
You're seeing increases from Ensemble models and other kinds of technologies of 10x, 100x in terms of cost, in terms of scalability. You're seeing improvements in accuracy, not just on the normal standard clean data sets, but on tricky conversation data sets. You're seeing voice AIs that are starting to be able to take instruction and follow instructions and follow nuances in order to be much, much more capable.
And you're starting to get to the level of accuracy and capability that human beings doing these kinds of tasks also exhibit.
So I think we're going to see a lot of really, really cool tech around voice AI coming out within the next twelve months. 2026 is gonna be a very, very exciting year for voice AI tech. Are there any other aspects of the work that you're doing at Modulate, this overall Ensemble architecture, or the applications that you're applying it to that we didn't discuss yet that you'd like to cover before we close out the show?
I think we've hit on the main points. I think for a different audience, it could be interesting to go into the determinism and kind of ability to avoid things like hallucinations better by using these ensembles of small models. But I think that's actually more applicable to, like, maybe almost like a compliance or safety audience,
which is a lot of aspects of the things we're working in. So I think we actually hit on a ton of the main topics I really wanted to cover. I'm feeling pretty good about this.
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling technology or human training that's available for AI systems today.
The biggest problem with AI tooling and systems is understanding where they fail and what kinds of failures they can exhibit. Think of things like customer support bots and voice bots that will just offer refunds even when they're out of policy, or reset people's passwords even when they haven't authenticated, or will flirt with somebody because they started flirting with it. These systems can go off the rails very, very easily and in unpredictable ways.
These large models are the epitome of a black box algorithm. Even things like chain of thought, it gives the illusion that you can understand how the model operates. But there's actually a ton of different kinds of failure modes and different kinds of behaviors that these models can unpredictably exhibit from
the right random combinations of input data. And it's very, very hard to know or guarantee or even put bounds on when you will experience the normal happy path operation and when you will experience a negative, unexpected, or out of distribution operation. And I think that in addition to some tools for testing and observability, which are, I think, still relatively primitive,
we are going to start looking for a lot of automated tooling around red teaming models, finding gaps in how models are running on data and systems and data streams, and monitoring these systems at extremely high scale. Watch every single conversation that's going on just like you would monitor human interactions at a very, very high level of scale. And right now, a lot of the tooling for observability and monitoring relies on looking at the model's own intermediate outputs.
Like if you have a voice bot that's doing speech to text to speech, you monitor the text. But the problem with that is that you're monitoring one part of the system when you're trying to observe the behavior of the whole system. And so if a different part, like the text to speech piece of the bot, is going wrong, if you're just monitoring the text part,
you're not going to know what's happening. So being able to observe the behaviors of these models interacting with each other and with people in the same way that you would observe human beings interacting with each other, and being able to derive insights and debug those things, is I think a really big gap and where a lot of these problems and failure modes and newsworthy headlines are coming from. Absolutely.
Well, thank you very much for taking the time today to join me and share the work that you and your team have put into developing this Ensemble architecture and some of the interesting use cases for voice AI as differentiated from the textual interfaces that have exemplified
the past couple of years. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Well, thank you so much for having me. It was wonderful to chat, and I really enjoyed it. I hope you enjoy your day too. Thank you for listening. Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management, and Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@aiengineeringpodcast.com with your story.
