Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine learning. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected.
Get started quickly using their built in library of checks for testing and validating your model's behavior and performance and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to learn more and get started.
Your host is Tobias Macey. And today, I'm interviewing Vikram Chatterjee about Galileo, a platform for uncovering and addressing data problems to improve your model quality. So, Vikram, can you start by introducing yourself? Thanks, Tobias. Thank you for having me. So I'm Vikram. I'm the cofounder of Galileo along with 2 other folks. Before Galileo, I was leading product management at Google Cloud AI, where we were building out vertical specific machine learning models for the enterprise.
And do you remember how you first got started working in machine learning? My tryst with machine learning actually goes way back to 2009. I was an undergrad at the Indian Institute of Technology, and I was super enamored by Carnegie Mellon's work in the Language Technologies Institute, which still very much exists and thrives.
And I applied there for an internship position. And when I went out there and worked with that team, we were building out cognitive tutoring tools for kids. So essentially think of that as a q and a bot for teaching kids math and science. And this was basically natural language processing in 2009.
And we did really interesting named entity recognition and classification models, but I literally had to take the help of postdocs to be able to do that, which is very different from the way it is today. But I was enamored by ML back then. And much more recently, before starting Galileo, I was at Google Cloud AI leading product management, and got to face firsthand what ML looks like today on unstructured data with a team of, like, 20 to 30 people. Saw a bunch of huge bottlenecks, which led me to start Galileo with my cofounders.
Focusing on the Galileo product, what is it that you're building there? And you mentioned a little bit about the story behind it, but why is it that this is the particular problem that you've decided is worth your time and energy right now? Yeah. That's a great question. I feel like while I was at Google, and my cofounders come from Uber and Google Speech,
we recognized a very similar pattern that was emerging. Right? Like, for my team, we were building out ML models. Our idea was, can we build out specific ML models for different industries like financial services, health care, retail? We had these great ideas, and we realized 3 things. Number 1 was just the magnitude of unstructured data that exists in the world. I know there's 1 part of this where you think of it as a statistic that, oh, over 80% of the world's data is unstructured. But when you actually look at banks and insurance companies and, you know, companies dealing with tons of images and video and speech and contact centers, it is enormous. It is a huge amount of data. And you can think of that as just latent data that is very, very hard to find insights from. It's just lying there. And so that was insight number 1. Insight number 2 was we noticed that these large enterprises were now waking up to the fact that, wait, we can now maybe use machine learning on all of this data to get some insights, to figure this out better, to improve our processes, to not have 4,000 analysts look through different W-2s. But there is a smarter way to do this.
So we saw that the need for machine learning in these enterprises was moving away from just being, you know, "what is machine learning, and I'm curious generally" to "I know that this can help; I wanna now figure out how we can do this better, faster, cheaper."
And insight number 3 for us was around the fact that models for unstructured data were absolutely commoditized, especially if you look at the NLP world with transformers and everything that Hugging Face has done on top of that and all these great libraries.
It's super easy for somebody to plug and play. If I went back to 2009 now and told myself that, hey, you can build an NLP model, just plug and play a model, you don't need these postdocs for that, it would blow my mind. But that's the reality of today. Still, that being said, my team was struggling with building out these models and reaching a good level of efficiency and accuracy and, for NLP models, a good F1 score.
And if you looked at why that was happening and why it took them many, many weeks to build out a high performing model, it was all about the data. And if you looked over the shoulders of these data scientists in my team and asked them, like, hey, what are the tools that you're using right now? Are you using
some really great monitoring tools? Are you using some really shiny MLOps product? It wasn't the case. They were mostly using Excel sheets or Google Sheets or Python scripts to just rummage through the data to make sure that they're working with the right kind of data at every step of the ML process.
And so I saw this with NLP. My cofounder was heading up Google Speech ML. He saw this with speech. And our third cofounder comes from Uber AI. He saw this at Uber as well. Lots of data quality problems there on the machine learning side. So we decided to start this company with the idea that let's build out a way such that people can
short circuit this entire process of working with the right kind of data. And can we do this in a way such that we can provide a really intelligent system, an intelligent bench on top of which they can bring their data, analyze it, fix it, store it, and keep track of it? We call this the ML data intelligence platform, and that's what we just recently launched with Galileo.
And so as far as the data intelligence aspect of what you're trying to do with machine learning, you mentioned that people are building their own ad hoc workflows to validate that the information that they're using for building the models is actually what they want it to be and what they expect
it to be. But what are some of the core challenges that users are facing in terms of how to understand what the data is that they're working with, you know, do any sort of maybe cleaning or validation or testing before they bother training their models, and just some of the types of tools and workflows that they're using, and how you're working to replace some of those ad hoc and bespoke approaches with something that is a bit more sort of purpose built and standardized.
Yeah. That's a great question. I think the issues that we've noticed users have vary based on where in the ML process they are at that point in time for a particular model. What I mean by that is so we've spoken to hundreds of ML teams in the course of the last year to kind of figure out what their issues are and what tools they're using.
Turns out that if you look at the 1,000-foot view of unstructured data ML, you can have 3 different parts to this. There's a pretraining, posttraining, and postproduction phase. In the pretraining phase, we noticed that the issue there becomes that, you know, labeling is expensive. Data procurement is extremely expensive. How can you be better about making sure that you're working with just the right kind of data? You're labeling just the right kind of data.
You're procuring the right kind of data, either synthetic or not, so that you're being mindful about your budgets. And also making sure that the data is representative of the real world, and that it's not biased and skewed in some way or just garbage to some extent. And on that side, the pretraining side, we noticed that, for instance, with B2B companies,
there'll be these folks who just get a data dump from their customers. And now they just have to look through millions of lines of text in an Excel sheet to make sure that there aren't any issues with it. That's a data insight problem. That's a problem where they need to figure out what the errors are. Are there empty values? Are there different languages in there? Is there PII data that's stored in there? Are there clusters that are just weird and strange that they need to check out and remove or ask more of?
So those are big questions that they have to answer on that side of the fence, which is, again, just a question of finding all these insights.
Once you do that, you've labeled the data and you start training your model. The data scientist comes in, and that's the post training piece, where you're doing what we call data detective work, right, where you're just constantly going through all of this data. You've done a training job once, and you get an F1 score of, let's say, 0.3. It's not good enough. You try again. Mostly you're fixing the data over and over again. You go through many, many different iterations, go through that experimentation
zone. And between each and every experiment, we realized that you're just trying to figure out how you can massage and tweak the data so that it's more representative and more robust. And on the third side, once you're done with all of this, when you have a model in production, then it's this constant process of retraining
where typically, in theory, you're using the active learning workflow. You're trying to figure out what your outlier samples are, what's maybe on the boundary, what has potentially drifted from a semantic perspective. And all of this
is done across the 3 phases that I mentioned. All of this is done in a very ad hoc way. Like, if you ask someone, like, what's the tool that you're using for all of this? For all of this ML data analysis and insight generation and storage, how are you tracking all of these experiments on the data side? How do you know that
like, for the model that you trained and retrained 5 days ago, what data changes did you specifically make? It's very hard to have a rigorous answer around all of that.
So that's the big bottleneck that we noticed. And, again, from a tooling perspective, for NLP, for instance, folks are just using Excel sheets, Python scripts, and rummaging through all of that, which is really not the way they should be doing things for something as critical as data for ML.
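For illustration, the retraining loop described above often boils down to picking the unlabeled samples the model is least sure about. Here is a minimal sketch of that kind of boundary sampling, assuming a scikit-learn-style classifier with predict_proba and an already-fitted vectorizer; the helper name select_uncertain is my own, not part of any particular product.

```python
# Minimal sketch of uncertainty-based sampling for an active learning loop.
# Assumes a fitted scikit-learn-style classifier and text vectorizer.
import numpy as np

def select_uncertain(model, unlabeled_texts, vectorizer, k=100):
    """Return the k unlabeled samples closest to the model's decision boundary."""
    X = vectorizer.transform(unlabeled_texts)      # featurize the raw text
    probs = model.predict_proba(X)                 # per-class probabilities
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    # A small margin between the top two classes means the sample sits near
    # the boundary, so it is a strong candidate for labeling next.
    boundary_idx = np.argsort(margin)[:k]
    return [unlabeled_texts[i] for i in boundary_idx]
```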
And as far as the stakeholders who are most impacted by this current state of the art, if you will, I'm wondering who you see as being the typical persona or role that identifies that this is actually a problem and then goes out and identifies something like Galileo as a potential solution and then brings it into the business? And how does that sort of core persona influence the way that you think about the design and user experience and future priorities for the platform that you're building?
Yeah. It's a great question. So we think of the central figure in our universe as the person who is working on these models, who is training these models and making sure that it goes into production, does the active learning. Typically, this person is called the data scientist. Sometimes the person is also called the ML engineer, especially with unstructured data. But it's that person who's working right now. We've started with a focus on NLP. So the person who's working with their NLP models or thinking about building out some NLP models, that's the person we have focused on to start with.
However, what we realized was that training and productionizing a model does take a village, and making sure that it actually can work really well in production over time takes a lot of other folks in the loop as well. So you typically have a subject matter expert, especially if you're working in something like legal tech or Fintech.
You have labelers, of course, but you also have product managers, and as a data scientist you have your own managers. And now you're all of a sudden answerable to everyone around, you know, I just spent
2 weeks trying to retrain this model and I changed a bunch of stuff, but my overall F1 score dropped. Why did that happen? How do I explain that? I just spent $500,000 on labeling all of this data, and then I actually made things worse. Can we actually go back and look at why that happened? Maybe there's a bias that was introduced. Maybe there was something else. And can I work with my PMs and my subject matter experts on that? When we think about that, we think of the user experience from a perspective of,
let's optimize for the data scientist and everything that they wanna see in the product. Whether it's the right metrics, whether it's the right insights, can we short circuit that? Can we provide almost like an assistant to them through this entire process? So it goes from rummaging through their Excel sheets to having an incredible assistant on the side for their ML data. But at the same time, can we make sure that this is a collaborative data bench for the entire ML team, so that we can just break down the data silos?
And that was a very big problem for us in my team at Google as well. Like, if I wanted to just ask, like, what data did you just train with? Can I look at how the model reacted to this? The answer wasn't straightforward. Somebody had to go and export an Excel sheet from somewhere, and I would look at that and sort it by the right stuff. It was not easy. Versus here, we try to make sure that it's collaborative and anybody can get a link to
exactly what's going wrong. So that's how we think of it. It's the data scientist in the middle, but there is an entire ML team around that person who also needs access to and visibility into the data. Given what you were talking about of wanting to identify
what the impact was of the different changes in the source data to the, you know, F1 score or the accuracy of the model. I'm wondering how much things like explainability factor into what you're trying to do to be able to identify inside the actual, you know, neural network, in the case of deep learning, what were the features that were actually being highlighted and used as the decision points as it propagated through the different layers of that network
to then be able to trace that back to say, actually,
this approach to labeling the data is what caused the drop in recall, and so we want to actually change the types of labels that we're trying to use because it's keying too heavily on this 1 feature that is not actually indicative of the problem that we're trying to solve for versus being able to actually just treat it as a black box and say, I got this outcome. This is the change in the data. I'm going to use my domain expertise to understand that because this was the diff between the initial data set and then the secondary data set that this is actually what's happening under the covers?
No, it's a great question. I think explainability is a big part of this. I do feel like that can be defined in different kinds of ways. The absolute research side of things would be, like, exactly which features led to what impact on the model. And in some respects, that is 1 big part of what we provide, but we provide that in different kinds of ways. What I mean by that is we have, like, a score within the product, for instance, which tells you, literally for every single
sample, how easy or hard was that sample for the model? Now you can go 1 level deeper than that and talk about, like, within each text sample, every word and every token, how much of an impact did that have. What we've noticed from talking to users is that that's important and critical, but the issues that the users have are even more brass tacks than that. They have issues around just how do I detect what the mislabeled samples are? How do I detect what the garbage samples are? I thought my model was training well on Spanish samples, whereas
it's not. And I didn't even realize we had so many of these Spanish samples. Maybe I need more of that. How do you figure out that, you know, you have metadata around
gender or state, and how do you make sure that there is an even class distribution there? And some of this stuff, you can even do in pandas, for instance, with a few lines of code. But that's where it starts becoming problematic: how much code are you gonna write? Instead, can you just do a single click to see, if you find a problematic sample, how many samples there are that are similar to that? How do I find samples like that from my unlabeled sample source
and do better sampling overall? So those become the more brass-tacks problems. But the features within the product that help them get to that do involve some level of understanding of how the model was learning on the samples overall.
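As a concrete illustration of the "few lines of pandas" style checks mentioned above, here is a rough sketch that surfaces two of the issues he describes: unexpected languages hiding in a supposedly-English dump, and class imbalance across a metadata column. The column names and file path are assumptions, and langdetect is just one common library choice for this.

```python
# Sketch of manual data-insight checks: language distribution and class balance.
# Assumes a DataFrame with 'text' and 'gender' columns (illustrative names only).
import pandas as pd
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0                    # make langdetect deterministic

def safe_detect(text):
    """Detect the language of a text sample, falling back to 'unknown'."""
    try:
        return detect(text)
    except Exception:
        return "unknown"                    # empty or non-linguistic content

df = pd.read_csv("training_data.csv")       # hypothetical dataset path

# 1. Language distribution: surfaces surprises like Spanish in an "English" dump.
df["lang"] = df["text"].fillna("").apply(safe_detect)
print(df["lang"].value_counts())

# 2. Class balance across a metadata column, e.g. gender or state.
print(df["gender"].value_counts(normalize=True))
```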
In terms of the Galileo product itself, can you talk through some of the architectural and design elements that go into it and how you've gone about implementing it and some of the assumptions that you had early on in your work of approaching this problem that have been challenged or updated in the process of actually building it out and putting it in front of early customers?
Yeah, of course, Tobias. So I think at a high level, we have 3 different parts of the product or the platform. There is the Galileo library, the client; then there is what we call the intelligence engine; and the third part is our UI. The idea behind the library was can we make it very, very easy for any data scientist to be able to just add a few lines of code based on the framework that they're using? And we do the rest around just, you know, auto logging all of their data
and making sure that that's easily available for our intelligence engine to run a bunch of algorithms on top of that to be able to provide certain insights and not just visualizations, but actually showcase, like, what the potential errors are or the potential issues are with your data in our client. So all 3 of them kind of work together towards that effect.
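To make the "few lines of code around your training loop" idea concrete, here is a hypothetical sketch of that integration pattern. The DataLogger class and train_with_logging function below are placeholders I wrote for illustration; they are not Galileo's actual client API, just a stand-in showing where such logging calls would sit in a PyTorch-style loop.

```python
# Hypothetical sketch of per-epoch data logging hooks in a training loop.
# DataLogger is a placeholder, NOT a real client library.
class DataLogger:
    """Collects per-epoch samples locally instead of shipping them anywhere."""
    def __init__(self, project):
        self.project, self.records = project, []

    def log_batch(self, texts, labels, logits, epoch):
        self.records.append(
            {"texts": texts, "labels": labels, "logits": logits, "epoch": epoch}
        )

def train_with_logging(model, train_loader, criterion, optimizer, num_epochs=3):
    """A standard PyTorch-style loop with the data-logging hooks added."""
    logger = DataLogger(project="intent-classifier")   # hypothetical project name
    for epoch in range(num_epochs):
        for batch in train_loader:                     # assumes dict-style batches
            logits = model(batch["inputs"])
            loss = criterion(logits, batch["labels"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # the "few lines" in question: capture inputs, labels, and outputs
            logger.log_batch(batch["texts"], batch["labels"],
                             logits.detach(), epoch)
    # in a real product, these records would feed an intelligence engine and UI
    return logger.records
```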
In terms of how we changed over time and some of our initial assumptions that were challenged, we initially thought that it might be a good idea to just have the library,
you know, because you meet the data scientists where they work, in their notebook, as they're training a model; add a few lines of code. That's great. And the notebook itself has a great UI. Right? So let's just provide them the errors and insights right there. And we realized that, you know, people do write a bunch of code there, but when it comes to inspection and analysis of your data,
turns out the notebook is not the best idea. It doesn't scale very well. In unstructured data, you know, you have all these embeddings and you have millions of data points, and it's a nice big data intelligence problem. Right? And that's when we realized that, wait. We need to go beyond this. We need to create our own UI. We need to create our own back end services, which can deal with very large amounts of data points at millisecond latencies.
And that's the only way we can serve our users in a magical way rather than just asking them to write a couple more lines of code in their notebook, which is kind of the problem which we're trying to go away from and flip the narrative around. Don't seek the answers by writing a few lines of code, but instead get some answers and errors in front of you as soon as you get to this UI. So that was something which we challenged ourselves on in the
middle of last year. And since then, we've matured the product to also have this intelligence engine in the middle as well as the UI at the end. So you mentioned that a big part of your focus is on this unstructured data problem, so working with audio data, image data, videos, you know, unstructured text.
I'm wondering what is maybe unique about that aspect of machine learning and those types of data sources that is disjoint from the tabular approach that a lot of these AutoML frameworks are able to build off of or that people might be using to try and build different predictor algorithms or recommender algorithms because they have these very structured datasets that they're able to easily featurize and turn into signals for the machine learning models and just what you see as whether that is
a kind of black and white comparison between these different approaches to machine learning and the types of data that you're working with, or if it exists more along a gradient. We've come to this realization that the users' needs change quite a bit depending on whether they're working with structured or unstructured data. The needs around what they want from
a data quality perspective or a data error checking perspective change quite a bit. In fact, even within unstructured data, if you look at folks who are working on NLP versus folks who are looking at speech, their needs can change quite a bit as well. You're not looking at F1 scores anymore. You're looking at word error rates, and a bunch of other kinds of nuances come into play. The different task types that you have for NLP are fairly different from speech or images.
And so we took this approach of let's start with NLP. But the underlying algorithms that we leverage for the product, they scale really well to images as well as to speech and any other unstructured data, as well as to structured data eventually. But the other big interesting thing for us was with the unstructured data, you get all of the features for the model basically from the model. The embeddings themselves
kind of act as those features in a way. And so from a user experience perspective, it became really interesting to us where, as the model is learning, you get all of these great signals from the model around how it's learning per epoch, which is not the case for a lot of structured data machine learning. So that's why we started here, especially because of all of the very recent usage of unstructured data ML in the enterprise.
We noticed that the bottleneck around what data am I working with was even larger on the unstructured data side than on the structured data side. So that's why we started with it, and we think of it as the user persona for NLP is very specific. And then we'll eventually be moving to other data modalities as well within unstructured. But we'll think of that very deeply and carefully before we move into each 1 of those modalities. It just feels like a whole different world altogether.
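The per-epoch signals and per-sample "how hard was this for the model" score mentioned above have a published analogue in dataset cartography (Swayamdipta et al., 2020), which ranks samples by the mean and variability of the model's confidence in the gold label across epochs. This sketch is in that spirit only; it is not Galileo's actual scoring method, and it assumes you have already logged per-epoch gold-label probabilities.

```python
# Rough sketch of ranking samples by training dynamics (dataset-cartography style).
import numpy as np

def difficulty_scores(gold_probs_per_epoch):
    """gold_probs_per_epoch: array of shape (num_epochs, num_samples) holding the
    model's probability for each sample's gold label at each epoch."""
    probs = np.asarray(gold_probs_per_epoch)
    confidence = probs.mean(axis=0)     # low mean confidence -> hard or mislabeled
    variability = probs.std(axis=0)     # high variability -> ambiguous samples
    return confidence, variability

# Example: surface the 20 samples the model found hardest across training.
# confidence, variability = difficulty_scores(logged_probs)
# hardest = np.argsort(confidence)[:20]
```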
For an individual or a team that is starting to adopt Galileo,
can you talk through what their workflow looks like as they decide to say, this is the problem that I'm trying to solve for, so I'm going to build a model for it. I need to identify my data sources. I will onboard them into Galileo. I will actually use those data sources to be able to build the machine learning model, but I need to understand what the featurization looks like and just that overall process of saying, this is what I'm trying to solve for. I'm going to use Galileo to help me, and then moving that all the way to, I've got something in production.
The way our customers are using us right now, as of today, we work with teams where they have a few NLP engineers. They typically have some ideas around some business critical use cases that they're gonna be using NLP for. They have a labeled dataset, or they have labeling teams, some kind of supervised mechanisms in place or semi supervised.
And once you have that labeled dataset and you have 1 of those off the shelf models that you would be using for NLP, the next step becomes, wait, I'm just gonna run this model, and I'll try to figure out where my model struggled the most. On what data did it struggle the most? What kind of issues did it face? How can I clean that up? And that's where the Excel sheets and Python scripts typically come in. And we wanna say, don't do that anymore. Just add a few lines of Galileo code while you're training your model. We auto log everything, and we'll provide you with a UI where you can not just see your data, but actually see the errors. And that's where the fixing takes place. You inspect it. Instead of a couple of weeks, you spend maybe an hour or 2. You identify a lot of these unknown unknown issues which you had no idea
existed in your data. That's what we typically see. And then we have a bunch of integrations because you're just replacing your Excel sheets and scripts. Right? So we have integrations into your other tooling, whether it's labeling tools or whether it's, let's say, Amazon S3.
The main goal at that point for the data scientist is to get to a better dataset so they can train on that again. And over time, just be able to compare and contrast: what did I change, and what improvement did that have on my model? Did I change stuff across different dataset slices or just 1 slice? And if I'm improving my F1 score overall, that's great. But am I also improving for female users in California?
If not, that might not be a good thing because it's gonna blow up for me in production. I need to fix that. That's where Galileo comes in. Now once you get into production, you can run a quick inference job in a very similar way, and we have all of your training data. We enable faster, better active learning essentially at that point where we can, you know, give you a lot of the different tooling within our product to be able to say that, you know, this is how you can perform better sampling
of your unlabeled data to make sure that you're capturing all of the different data gaps that your model had between what your model was trained on versus what your model is seeing right now in production. Because as the world is changing, you have to keep adapting to it, and we make sure that you have all the right ways to be able to not just sample better, but also track that over time. The other interesting element of the machine learning ecosystem is that there are so many different
frameworks and tool chains that people are using. You know, there are definitely the dominant ones of TensorFlow and PyTorch, but there is a constant evolution. There's JAX and MXNet and Keras and, you know, all the different frameworks that are in the Julia and Java ecosystems, which I am willfully ignorant of. I'm just wondering how you have approached that aspect of whether and how deeply to integrate with those different tool chains versus
the abstractions that you want to build to make it agnostic to what people are actually using for the model training and development? We made sure that we provide support for the most popular ML frameworks out there. So whether that's Keras, TensorFlow, PyTorch, spaCy. And then I'm sure there'll be more that'll come around every now and then. But we've noticed that if you cover these 4, then that's around 95% of what most folks are using today in the enterprise.
Some of them are a bit harder to work with because they're more like black boxes, like spaCy for instance. But that's been a big part of what we do, where we make sure that the user experience that a person who's working with these frameworks gets by using Galileo
is you just add a few lines of Galileo logger code, and it should work like magic where we just do everything behind the scenes. And you just get a URL to kind of look at your data and make sure that you're identifying the blind spots quickly. Predibase is a low code ML platform without low code limits.
Built on top of their open source foundations of Ludwig and Horovod, their platform allows you to train state of the art ML and deep learning models on your datasets at scale. The Predibase platform works on text, images, tabular, audio, and multimodal data using their novel compositional model architecture. They allow users to operationalize models on top of the modern data stack through REST and PQL, an extension of SQL that puts predictive power in the hands of data practitioners.
Go to themachinelearningpodcast.com/predibase today to learn more. That's p r e d i b a s e. Another aspect of what you were talking about and that is a kind of key factor of the model development workflow is the experiment tracking and understanding
for this model output that I created, this is the code that I used. This is the data set that it was trained on. This is what I changed in the source data. This is how I changed the code. This is how the model performed, and just being able to manage those linkages of the different datasets and how the datasets change and then being able to track that through to the model outputs and being able to version those different models and understand, you know, that matrix of
differences and version changes and figuring out what types of information is useful for the people who are trying to debug and which pieces of information are just noise and you actually wanna suppress them because it's not ultimately useful. Yeah. So for the most part, what we try to do is we act like a data bench. So we try to not necessarily take action on the user's behalf, but try to surface different kinds of issues, and they kind of take action thereafter.
Our goal with trying to provide some kind of a mechanism for users to be able to track their data over time is more from a perspective of 2 things. Number 1 is just organizing their runs and organizing their data in a better way. That's number 1. But number 2 is also, for some highly regulated industries, just from a compliance perspective, it becomes really important to be able to say that, hey, I trained on x and y, and that's what I've pushed to production.
And from an auditability and transparency perspective, that becomes really important. But over time, essentially, because we have this ML data store, and it's a really rich store of not just the data itself, but the metadata behind it as well from the model, there are really interesting things that you can then start surfacing in terms of insights across all of your different runs.
And that's something which we're working on from a future perspective in the next many months to come. But as of now, it's helping users just become much, much more organized
around how they work with the data without them having to maintain Google Docs and all sorts of other mechanisms to figure out what they changed in their data. And secondly, for the customers that we have which are highly regulated, they can really point to the data that they worked with before. And if they wanna recreate a particular run with the same specific kind of data, they can do that again. And that's important for folks to be able to do in ML, just given how iterative everything is, just given how complex things can become.
And then as far as being able to understand the impact that you're making on the productivity and
the quality of the output for these data scientists and machine learning engineers. I'm wondering how you think about quantifying that and the types of metrics that you collect in terms of how people interact with your tool and your platform to be able to then feedback and say, since you've adopted Galileo, you've been able to ship models at, you know, x times your usual rate with x times the usual accuracy.
You know, we've prevented x number of potential errors in production. Just what kinds of metrics are useful for those data scientists and machine learning engineers to be able to maybe demonstrate the value of your tool to their, you know, VPs and CSOs and CDOs? Yeah. That's a great question. So we think of it in the classic faster, better, cheaper mechanism. So if you look at those 3 axes, on the faster side, we typically say, let's look at the time to insight.
And if you're looking at an Excel sheet or scripts, you're typically looking for known unknowns, versus with us, you very, very quickly find a bunch of unknown unknowns, and we'll just bring it in front of you. So what's your time to insight? And that's generally what gets a lot of our customers excited as well. They're like, wait, this is like magic. I had no idea I had this kind of data, and I could find that within a matter of minutes.
So that's 1. It's a bit more qualitative. That's something which we talk to our customers about. But the second 1 is around just the amount of time that you're spending between different runs.
Typically, that's like a week to 2 weeks of rummaging through your data, versus with us, we can track that. And we've noticed a 10x improvement from before versus after Galileo in terms of just how fast you go through the entire process of experimentation itself. And the other piece is around better models.
When you improve your data very quickly, you start seeing better models. And I don't mean F1 scores particularly. Right? Because you can hack your F1 score to be, like, 0.99 for all you care. But it might still be producing bad results for very, very important cohorts. So identifying what the
quality of the predictions is and making sure that those are better is also really important. And that leads to the whole cost and cheaper piece, where you're saving a bunch of cost from using labeling tools willy-nilly, just throwing the kitchen sink of data at them or asking for a ton of extra
synthetic data from data procurement teams. So you save a lot of cost. But we've also noticed that ML is mostly used for very critical business use cases. And so we ask them about, like, what's your business KPI here,
and how is Galileo helping improve that? Because better data leads to better models, better models lead to better cost efficiencies and better revenue or improved churn numbers. It's a combination of all 3 of those things that we stack up and go back to our users as well as the buyers and say that this has actually pushed the needle quite a bit. And that's sometimes an organic conversation, but we try to be very quantitative about it when we speak with our customers.
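The point above, that an overall F1 can look great while an important cohort regresses, is easy to check per slice. Here is a minimal sketch using scikit-learn; the column names ('state', 'label', 'prediction') are illustrative, assuming a DataFrame with one row per evaluated sample.

```python
# Minimal sketch of per-cohort evaluation: overall F1 plus F1 per metadata slice.
import pandas as pd
from sklearn.metrics import f1_score

def f1_by_slice(df, slice_col, label_col="label", pred_col="prediction"):
    """Return the overall macro F1 and the macro F1 for each value of slice_col."""
    overall = f1_score(df[label_col], df[pred_col], average="macro")
    per_slice = (
        df.groupby(slice_col)
          .apply(lambda g: f1_score(g[label_col], g[pred_col], average="macro"))
          .sort_values()            # worst-performing slices first
    )
    return overall, per_slice

# e.g. overall, by_state = f1_by_slice(results_df, "state")
```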
Another question that many ML teams and data scientists are dealing with is the question of ethics and bias and being able to identify and account for or counteract some of those potential problems that could come out of training a model on unbalanced datasets. And I'm just wondering how you think about surfacing that type of information as people are identifying which pieces of data they actually want to use in the model that they ultimately ship to production.
Like, 50% of our team is ML research, which is, you'll notice, very different from most other ML companies, especially in this field. And the reason for that is because we realized that there's a lot that you can do from all of the different signals that the model is giving you and then provide that back to the user as kind of, like, almost a superpower.
And 1 aspect of that is, you're right, it's around ethics and how you can figure out is there a certain kind of bias? And biases can come in and creep in in many different ways. Can you kind of automate that? These are all pieces of research that we're working on embedding into the product in different kinds of ways. Apart from that, just providing the user with a quick way to be able to check for, wait, is there PII
within my data? So if there is an address or an email lurking in there, can I just remove that very quickly? Is there a certain kind of class imbalance
across, say, the gender metadata or state metadata that I need to know about? Sure, you can maybe look at it and you can kind of write a few lines of code to check that out, but can we kind of automate that experience for you? Can we just tell you that? Those are lower hanging fruit which we can provide back to the user, and you'd be surprised by how often all of that just goes completely unnoticed because you're not looking for it.
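For the quick PII scan he describes, a simplified sketch might look like the following, using plain regular expressions for emails and US-style phone numbers. Production PII detection is far more involved; these patterns are only illustrative.

```python
# Simplified sketch of a quick PII scan over text samples.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def flag_pii(texts):
    """Return indices of samples that appear to contain an email or phone number."""
    flagged = []
    for i, text in enumerate(texts):
        if EMAIL_RE.search(text) or PHONE_RE.search(text):
            flagged.append(i)
    return flagged

# e.g. suspicious_rows = flag_pii(df["text"].tolist())
```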
And apart from that, there's this other notion of data slices, which is becoming very popular amongst a lot of tools. But I think that's also a very powerful mechanism
for giving the power back to the data scientist and the data science team to say that, look, I have all of this data. I know what the critical cohorts are. I know that we need to make sure that we're doing well across every single state, across every single gender identity, and only when we are doing well across all of these different kinds of slices is when this model should be productionized.
If not, then it fails. And being able to do that in a rigorous way also helps with a bunch of these ethical issues and making sure that you don't have these different kinds of biases creeping in. And in terms of the
ways that you're seeing Galileo applied across the different teams that you're working with, what are some of the most interesting or innovative or unexpected applications of it that you've seen? I think our customers keep surprising us all the time. We have an inference version of the product, right, where you can run an inference run and you can see how your model did on a bunch of unlabeled data. And we're noticing a lot of customers kind of using it in the way that we would expect them to, where it's fairly
typical around active learning and trying to make sure that they're building the right kind of samples for the next set. But we've also seen a lot of customers who say, look, even before we start figuring out what model to use here, we just have a ton of data. Some of it is labeled. How can we find issues with that before we really embark on this massive experimentation process?
That was really interesting, where this whole pretraining piece comes in before the data scientist actually starts using Galileo. That was really interesting to us. Apart from that, we've also been always very amazed by the kinds of insights that customers find. Right? Like, 1 of our customers is in the call center AI space, and they had a huge data dump that was
basically supposed to be all English data, but it turned out that a huge part of that was Spanish. And we would imagine that that should not have happened when the customer gave them that data, but it did. And it would have, for the most part, just gone unnoticed. And those are the kinds of insights they found within minutes. And they would talk about how now identifying languages,
different kinds of languages, within the data dump that they get is the first thing that they try to do. And if they identify that, hey, there's German in here, Spanish in here, French in here, either they go back to the customer and say, look, give us data that's just English, or they'll go back to the customer and say, look, give us more German and more Spanish data so that we can actually build a representative model for your use cases. And so those are things which actually influence how we think about our product as well. Like, can we automatically detect this stuff for them? Can we short circuit these kinds of insights? So we're constantly on the lookout for what are those quick insights that people are finding, and how can we short circuit that for them so that this data problem is not
a gnarly issue for them anymore? In your experience of building the product and the business for Galileo and starting to help teams understand more about the datasets that they're using for building their models, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I think data is hard. It's just very, very messy. It comes in many different shapes and forms.
We knew this going in, eyes wide open, that if you're looking at trying to be a product, which is trying to uncover data intelligence and give that back to the user,
it's not gonna be as easy as just looking at the model's F1 score and giving that back to the user or just giving them a loss curve, for instance, which is basically software engineering. In this case, we would have to kind of adapt to the different kinds of frameworks that exist, make sure that our product, our library, can work well with all of them, and that we can log all of these different kinds of data from each 1 of these frameworks. Once we do all of that, then how do you surface this data in an adaptable UI
depending on, you know, the task type. Right? If it's named entity recognition, what does that mean to a user? How does the user wanna see that data? Let's not just slap all of the data on a UI, but let's be very, very
thoughtful about how that's surfaced. And when you think of doing all of that at scale, with, like, millions of data points at millisecond latencies, now you need everything to work in harmony, all the way from the client to the platform stack that we have, as well as the UI and user experience.
So we're forever learning. We forever hit edge cases, and I feel like that's in a way a good thing because it keeps robustifying the product more and more. And our users keep challenging us in a lot of ways around, you know, this is all great, but here's 1 other use case, and here's another edge case. And that just keeps battle testing the product more and more, and I feel like that's the piece which really keeps us going. It's a big technological challenge to solve for, but it's a very gnarly problem where on the other side is a very good user experience for a very critical problem in the industry.
For people who are working with unstructured data sources, they're trying to build machine learning models on top of it. What are the cases where Galileo is the wrong choice?
Typically, what we've noticed is, currently, we are a great fit for any company where NLP is being used or planned on being used for critical business functions. But if you come across companies where they're still, like, thinking about it, right, and they're not completely sure if they should be using NLP, or they maybe have not even adopted machine learning yet, I feel like that's where it's the wrong choice. We start becoming extremely
helpful once you have a good idea for how you wanna use NLP within your company and you have at least 1 or 2 NLP engineers already, because those are our target personas. So as of now, those are the kinds of companies that we talk to, where we say, look, let's definitely talk once more once you have NLP engineers and you're working on certain techniques, and we can help you across the entire ML workflow there.
As you continue to build and iterate on the product and work with your customers to help identify areas of improvement and where you should be spending your focus, what are some of the things you have planned for the near to medium term of the project? Yeah. So what gets us really excited is the fact that there is just so much you can do in terms of looking at all of the different research that exists around what's now called data centric ML and providing that back to the user.
And that's also why 50% of our team is ML researchers, who are working on productizing not just research that's out there, but for the most part, research that we have worked on for the last year, year and a half. And so we plan, in the next 12 to 18 months, on continuing to invest a lot more in R&D. We've seen a huge ROI in terms of how you can take that into the product and create really incredible value for our customers.
So that's something which we wanna continue doing, so that the North Star for us stays: how can we get users to faster data insights and error discovery, and just completely short circuit this process of gnarly data management and data error analysis and data insight discovery, and instead make that a really, really smooth experience for them and make it seem like magic.
Are there any other aspects of the Galileo product and the overall space of doing machine learning across unstructured data that we didn't discuss yet that you'd like to cover before we close out the show? We touched upon this just a little bit, but I do feel like it's a very important piece. That whole topic of collaboration
and making sure that that's easy to access, I think, is very, very key. Because, again, machine learning is a team sport, and ML data is the lifeblood of machine learning models. And so I think in any tooling that comes across these days for ML teams, it's very important for you to think about how everybody in the team can access that very, very quickly.
And so from our side, we try to remove any kind of bottlenecks and any friction points that might exist between, let's say, you as a data scientist sharing data insights and errors with me, the product manager, or with me, the subject matter expert, and me being able to communicate back to you. So when we think of our own product, we think of it as an intelligent but collaborative
data bench for ML teams. And that's an aspect which people notice firsthand as soon as they start using the product, and I think it's something we wanna keep doubling down on. In terms of that collaboration aspect, I know that the core sort of user role that you're focused on is the data scientist and machine learning engineer.
And I'm wondering how you think about the collaboration and the sort of handoffs and interfaces with the data engineering team who are responsible for collecting and managing these source datasets that you're working with and also some of the maybe organizational visibility that you're looking to present to managers and VPs to understand, you know, how the data science team and machine learning team are progressing on the business objectives? Typically, what happens is you
would find certain issues where the model did not perform very well on the data. And then the question becomes, either I just don't get it. I don't get why the model is not performing well here. It's legal tech or it's insurance information. I need a subject matter expert to look at this. Can I just quickly send a link to them and they can just see what I see and share their notes with me within the product? That's 1 aspect of collaboration where you just break down the data silos immediately.
The other aspect is more functional. Right? Like, you have labelers. You have product managers in the loop. And for each 1 of those kinds of operational folks, you would need them to take certain action within the product itself. And so we have a huge emphasis on fixing and taking action, whether that's quick relabeling or quickly removing certain samples. You can do all of those kinds of data modifications very easily, and the team can do it. And you can keep track of who made what changes.
In terms of upwards visibility, it's your execs looking at how much is being spent and whether we are actually moving the needle for machine learning across the different workflows that we wanted to. That's where comparing across all of your runs becomes really useful. So we have 1 click comparisons across all of your different runs for different kinds of projects from a data perspective
within the product, which is typically where we've seen customers use this in their, let's say, Friday meetings with their heads of data science to say that, look. We have these 10 machine learning models running right now. This is what we changed on the data side. This is what we changed on the model side. This is what it led to overall. This is the decision that we wanna come to, and they can have a discussion around that. You can generate reports pretty easily as well. That's where we've seen a lot of visibility
happen at the highest level of abstraction. So it's a bit of both when it comes to collaboration. How do you share data with your stakeholders immediately when you're in the trenches trying to improve the model performance and prediction quality? But also, how do you communicate upwards to let people know that, hey, this is the state of the world right now, and this is where I need your help. The first 1 also includes, to your point, the data engineers, the data procurement teams,
where, in the example that I gave before around, I just didn't know that we had all the Spanish data, you can go to your data procurement team or maybe the data engineering team and say, look, I need more Spanish data. Or just make sure that you do not have Spanish data the next time that you give me data to work with. So that's typically how we've seen folks work across the organization and use Galileo in each step.
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest barrier for machine learning today.
I feel like maybe we've said this a few times, but I think it's extremely important: with ML, the data is everything. The ability to work with the right kind of data, across the different phases, as you're training your model and once it's in production, is extremely critical. And I don't think it's
taught enough in schools. It's definitely not practiced as much as it should be just yet. But I think the world is moving in that direction, and having the right kind of tools to be able to do that in an efficient way is hopefully gonna help us move in that direction much, much faster. But working with the right data is a big bottleneck today. And if we can solve for that, we'll have much faster, better, cheaper ML for everyone in the world.
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Galileo. It's definitely a very interesting problem space. It's definitely great to see somebody who is trying to make it a more tractable problem and manageable for people who are trying to
build machine learning models on the majority of data that exists out there. So I appreciate all of the time and energy that you and your team are putting into solving that problem, and I hope you enjoy the rest of your day. Thanks, Tobias. You too. Thanks for having me. Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management,
and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com
with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.