Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Do you wish you could use artificial intelligence to drive your business the way big tech does, but don't have a money printer? Graft is a cloud native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs.
No machine learning skills required, no team to hire, and no infrastructure to build or maintain. For more information on Graft or to schedule a demo, go to themachinelearningpodcast.com/graft. That's G-R-A-F-T, and tell them Tobias sent you. Your host is Tobias Macey. And today, I'm interviewing Jacopo Tagliabue about building reasonable scale ML systems. So, Jacopo, can you start by introducing yourself?
Of course. First of all, 9 out of 10 on the pronunciation of my name, very good, very good. Thank you very much for having me here. So I'm Jacopo, director of AI at Coveo, a public company in Canada, but I'm actually calling in from New York. I used to be an entrepreneur in Silicon Valley, where I built a company called Tooso. I was the CTO and cofounder. It was a company doing NLP
and information retrieval. Before all of these developer tools existed, it was a lot of work. We grew it for 3 years and then sold it to Coveo, where I'm now leading, like, the AI and MLOps roadmap as well. And I'm pretty active in the open source community and world, where, since we made all the possible mistakes, or most of the possible mistakes, in DataOps and MLOps, we basically try to save people some time. And so we're open sourcing a bunch of repositories and data
and kind of articles that explain our journey and, you know, the setup that makes us happy and the mistakes that we made along the way. And I'm also the author of RecList, which is a relatively popular package for testing ML systems, in particular recommender systems, with behavioral testing. So if you're curious about that, there's an open challenge right now that you can actually participate in for CIKM. So thanks again for having me. And do you remember how you first got started working in machine learning?
I think it wasn't called machine learning at the time. So I think the first thing that would qualify as machine learning today was a part of speech tagger with Markov models. It was 2012, maybe, something like that, 2011, something like that. A long time ago.
But my interest in a lot of the themes that people now associate with ML is older than that. I used to say to VCs, when we were building the company, that I used to do AI before it was cool, as I grew up with this interest in, you know, AI, in particular language, formal languages, and also natural languages. So NLP was kind of a natural first application.
I would say the first modern application, in the sense of, like, you know, being dockerized, deployed, and scaled to millions of people, would have been at Tooso, when we had a shallow parser based on conditional random fields that would actually help you understand that a query such as "blue shoes by Prada"
is a color and a type of object and a brand and not just a string of letters. And that was used at Tooso, my company, to provide people with a better experience. So that, I think, would be my first thing at, you know, millions-of-people scale.
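For illustration only, here is a toy sketch of the kind of slot-tagged output such a query parser produces. The dictionaries and labels below are hypothetical stand-ins; the actual system used conditional random fields rather than hand-written rules.

```python
# Hypothetical, rule-based stand-in for a CRF shallow parser:
# it tags each token of an e-commerce query with a slot label.
KNOWN_COLORS = {"blue", "red", "black"}
KNOWN_BRANDS = {"prada", "nike", "gucci"}
KNOWN_TYPES = {"shoes", "bag", "jacket"}

def tag_query(query: str) -> list[tuple[str, str]]:
    """Return (token, label) pairs, e.g. color / product type / brand."""
    tags = []
    for token in query.lower().split():
        if token in KNOWN_COLORS:
            tags.append((token, "COLOR"))
        elif token in KNOWN_BRANDS:
            tags.append((token, "BRAND"))
        elif token in KNOWN_TYPES:
            tags.append((token, "PRODUCT_TYPE"))
        else:
            tags.append((token, "O"))  # not a slot ("by", etc.)
    return tags

print(tag_query("blue shoes by Prada"))
# [('blue', 'COLOR'), ('shoes', 'PRODUCT_TYPE'), ('by', 'O'), ('prada', 'BRAND')]
```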
Very cool. So given the fact that you have been working in this space, as you said, since before it was cool, I'm wondering if you can give your take on how you view the current state of the overall ecosystem for ML practitioners, you know, both in terms of the availability of tools and platforms for being able to get work done and the availability of useful resources and tutorials for being able to understand how to do those things?
I think the answer to this question kind of diverges. So the first question is, how about the tooling and the things that make you productive? I mean, it's awesome. It's the best moment possible, at least that we've ever seen, to be in this field, as you don't have to do much anymore. I think one of the lessons that we learned by building, you know, the "You Don't Need a Bigger Boat" repo and the other open source stuff, is how much the field has progressed in the last 5 years.
Now one question that we kept asking ourselves was, what if we had to rebuild what we built for Tooso today? What technology would we keep? And the answer is none. Like, it's only 5 years, but things have changed and improved so rapidly that none of the choices I made at Tooso would I make today. Not because, you know, I mean, not because I'm terribly dumb, but because now there's a variety of tools, open source and platforms, that you can use. They will actually save you a lot of the work
that I had to do with my first company. My first company basically had a terrible version of dbt that we built ourselves on top of Spark SQL, you know, and a terrible version of, I don't know, Seldon, you know, with our dockerized deployments, but it all kind of sucked. We basically had a terrible version of everything we have today, but it was terrible, and you had to build it yourself. I think now the ecosystem is so mature that for most use cases,
not the Facebooks of this world and not some niche things, you know, where maybe security is super important or latency is super important. But for most use cases, the reasonable scale, as I call it, there are tools in place that get you from data to serving to monitoring
with a fairly small team and with a fairly low barrier to entry. So, question number one: at the moment, the ecosystem is amazing. I think it's a fantastic moment to be in this field, and you can be incredibly productive. Question number two is, how is the educational material out there? And, of course, I'm biased, so, you know, just keep that in mind. I think most of the material out there is bimodal. So there's a million Towards Data Science articles of, like, how you deploy a scikit-learn,
you know, logistic regression with Flask. Good. Not that it's not useful, but that's really a toy. Like, you know, that's a very small thing that just runs on an EC2. And there's a lot of material about what Uber's platform does or Meta's platform does or what Pinterest is doing and so on. What is really missing, or what has been missing, and it's one of the reasons why we've been writing so much in this space in the last year, has been
something in between the low code or no code or simplified toy environment and the custom, ad hoc mega platforms for FAANG. There's a huge chunk of people in the middle, the reasonable scale, that we think are not as well served by tutorials. Because, again, everybody will discuss how to serve a petabyte of data or 3 data points on your laptop, but very few people actually tell you what to do with 10 million.
You know, it's big enough that you have to worry about it, but it's not an unreasonable amount. Like, it's not, you know, the population of India. Right? You know what I mean? I think there's a huge opportunity, not just for education, but in the market of ideas, also for papers, for research. And, you know, something that we've tried to do in the last couple of years is address this market, which is super interesting and is where the bulk of applications
are gonna be in the next 5 years. Yeah. I think that that's a pretty common challenge across all kinds of technical endeavors, where there's a plethora of information out there for people who just wanna go from 0 to 1 of understanding: how does this thing even work? How do I do the equivalent of hello world?
And then there's a giant gap between that and, okay, this is how you build a system that will serve production grade capacity for Google or Facebook. And everything in the middle is few and far between and of variable quality, and it's hard to really get anything that's of intermediate level, that gives you enough information beyond what you may already know, that doesn't have to start with, okay, here's how you do hello world, and then do the next thing.
In the process of writing your examples and the repos that you've put together and the tutorials that you've gone through, I'm wondering what your experience has been or the thought process you went through about how to structure those tutorials and how to actually address that kind of intermediate scale and the assumptions that you had to make about what knowledge and understanding they already had by the time they came to your tutorial?
I think for us, and this has its pros and cons, the way to streamline this has been imagining myself on the other side. Like, hey, if I were somebody with maybe my knowledge of 2 years ago, not now, but, like, 2 years ago,
what would I have wanted somebody to explain to me 2 years ago to streamline my process, you know, to somehow speed up my learning in this landscape? And so, I mean, it's biased because, of course, you know, talking to a previous version of yourself doesn't exactly generalize much. But as a pro, you have a very concrete set of skills and persona and, you know, type of problems in mind. I think a lot of the things that we built and put out have been successful because
we are ourselves a reasonable scale company. We are ourselves reasonable scale people. We build for people that are basically like us and that have our problems, so to speak. So a lot of the things that we put out are the things that I use every day. Like, I don't
have to put myself into the shoes of somebody using Metaflow. I actually use Metaflow. I don't have, you know, to put myself into the shoes of somebody using Snowflake. I love Snowflake. Like, it's one of the things that changed my life the most, technically speaking, in, you know, the last 4 years. So all of the things that we actually propose are things that we actually use every day, which is how we know that they work at our scale. Like, you know, for our needs, they work. The other thing that I would add is that a lot of the things we put forward are information
retrieval use cases, if you think about where they fit as a functional piece. Which has the pro that, a, it's very easy to understand what they do. Even if you never built a recommender system in your life, everybody kind of understands what a recommender system is. You know what I mean? It's something that, you know, you can kind of relate to as far as input and output goes. And the second thing is that they kind of upper bound the reasonable scale people for data and sophistication.
Ecommerce, even small ecommerce, will generate a lot of data, way more than most people will see in their life. And ecommerce tech is way more advanced than, again, most applications at reasonable scale. So the argument is, if this runs a transformer based recommender system with a click from your laptop, chances are it's gonna be okay for many other use cases, because this is actually on the far side of the spectrum.
As you mentioned, the work that you did at your previous company, you said you would just throw away and rebuild from scratch today because of all the changes that have happened. And I'm wondering if you can maybe highlight the most notable aspects of the change that you've seen in that time frame and maybe even relating that to some of the ways that those changes have been catalyzed or influenced by the accompanying changes that have happened in the data engineering and data management space?
I mean, absolutely. So the biggest change of all, the one that had the most impact, is the most boring. And it's when we moved our SQL operations from Spark SQL to Snowflake. Again, as I said, like, my company was doing information retrieval use cases. So most of our data is
semi structured, like, you know, sort of tabular in nature. There are some fields that are not categorical or numerical; there are some fields that are free text. But if you think of a typical search engine, queries are typically fairly short. So what you actually end up with is mostly tabular data transformations. And for that we were using Spark, in particular Spark on EMR.
It's kind of a pain to work with. Like, you know, it's very slow. When there's an error, you get this, you know, hundred lines of Scala that you don't really get, or I don't really get. It kind of forces you into, you know, some patterns that are not super familiar to, you know, people with my background. And it's very hard to tune properly because of, like, the million configurations that you can use for Spark. When we moved from that to just doing stuff in Snowflake, my life changed completely.
Now I don't have to maintain clusters anymore. I can just use SQL, which I'm decent at, to retrieve my data. And I don't have to worry about scaling that, or about configuring the clusters, or about any of that. I just throw it at Snowflake and it figures it out itself. I mean, as far as my code knows, I could be querying SQLite,
like a high latency SQLite, because the experience to me is exactly the same. You know? I give you a SQL query, and I get back, like, a data frame, a list of dictionaries, or whatever it is. So that has been massive. We decommissioned Spark SQL, Athena, you know, and a lot of stuff when we moved to Snowflake. And that unlocks,
you know, building on top of the modern data stack, you know, dbt and so on for data preparation, which, again, we kinda sort of had at my previous company, Tooso, but it was a hacky, you know, kind of not super good way to do it ourselves. But the principles were there. My point is, like, the transition to all these tools was easy for the team because we were already philosophically thinking the same things. Immutable data, append-only patterns, ELT. Like, all of that was already in place. It was just with bad tools. So that was a huge
time saving and maintenance saving. So that was great. The second piece, I would say, in order of importance, that changed our workflow was the adoption of Metaflow. So this is more on the MLOps side. So DataOps has been, let's say, figured out with, you know, this new setup, so to speak. And then the question is, well, now that the data is aggregated and prepared, what do you do for training your model? And before,
if you ever trained a model in your life, you know that it's not like a tidy 4 lines of code, you know, that you send to Keras or to PyTorch. It's like, you know, a 100 lines before, a 100 lines afterward. There's a lot of checks you need to do, a lot of things you need to move around. And before, it was, like, either a series of scripts,
but, like, you know, hacky and terrible, or, you know, things like, I don't know, Luigi or even Airflow. You can even use, you know, a general orchestrator to do that. But that kind of conflates
the general problem of orchestrating tasks with the interactive job of the data scientist, which should be much faster. So enter Metaflow, which is an open source tool that everybody can use. It's from Netflix. So when Metaflow came on board, we kinda threw away all this logic and scripts and, you know, kind of plumbing pipelines. And now with Metaflow, we have one way of basically doing fast deployment, fast, sorry, fast development, so we can iterate locally, and then we can use GPUs when we want from the cloud. But then the same setting can be easily
sent to production. So now our feedback loop from, you know, when we try stuff and, you know, want to run experiments, to when we have an endpoint running, is much shorter. And also Metaflow gives us a lot of stuff that we didn't have before, like versioning, you know, auto scaling, and so on and so forth. So this is the second big thing that I would change. Everything else, experiment tracking, monitoring, and so on, is awesome, and now that I use them, I can't live without them.
But I would say the 2 foundational pieces for my productivity and my team's productivity have been Snowflake for, like, DataOps, and Metaflow for, like, you know, the training loop and fast iteration. I don't know if that makes sense.
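As a rough illustration of the setup described above, and not the team's actual code: a minimal Metaflow flow that pulls a training set out of a warehouse with one SQL query and trains a simple model. The table, credentials, and model choice are hypothetical placeholders.

```python
# Hypothetical sketch: one flow goes from warehouse query to trained, versioned artifact.
from metaflow import FlowSpec, step, retry


class TrainFlow(FlowSpec):

    @retry(times=2)  # cheap insurance if the warehouse hiccups
    @step
    def start(self):
        import snowflake.connector  # assumes snowflake-connector-python is installed
        conn = snowflake.connector.connect(
            user="...", password="...", account="...", database="ANALYTICS"
        )
        # Hypothetical table of labeled rows prepared upstream (e.g. by dbt).
        rows = conn.cursor().execute(
            "SELECT feature_1, feature_2, label FROM training_set"
        ).fetchall()
        self.X = [r[:2] for r in rows]
        self.y = [r[2] for r in rows]
        self.next(self.train)

    @step
    def train(self):
        from sklearn.linear_model import LogisticRegression
        self.model = LogisticRegression().fit(self.X, self.y)  # stored/versioned by Metaflow
        self.next(self.end)

    @step
    def end(self):
        print("trained on", len(self.y), "rows")


if __name__ == "__main__":
    TrainFlow()
```

The same file runs locally for fast iteration and, with the appropriate decorators or CLI flags, on cloud compute, which is the short-feedback-loop point being made here.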
In terms of the mistakes that you mentioned that you had to go through the pain of making and learning from, I'm curious what are some of the most notable ones that you've held on to and that have influenced the way that you think about building these systems today, especially given the advent of these new tools and platforms to build on top of?
The mistakes we made, we made some mistakes because we knew that what we were doing was not optimal, but the optimal solution at the time wasn't really possible.
Our model deployment package at the time was, like, you know, a Flask app in a Docker container. When you spin it up, what happens? It goes to S3, you retrieve the latest artifact that is there, and then it's just gonna reboot, basically, the container. It's gonna load that in memory, and then it's gonna serve a Flask endpoint in Python. Right? It was a conditional random field model. The model was small enough, it could totally fit in memory, and, you know, it was fast enough. And we knew that that wasn't particularly good. Why is it not particularly good? Because, well, every time you scale it, you know, you have to download, like, N copies of the model, which is not, you know, a super good way of doing this. It's not good because there's no way of versioning or specifying in the container
which type of model you want. Right? Sure, if you always pick the latest one, that may be fine. But sometimes you want to roll back. Sometimes you want to do A/B testing. Sometimes you want to experiment, all of that. And if you build just a "the latest model is what goes in production" kind of thing, which is what we did,
obviously none of these use cases can be solved. I mean, they can be solved, but with another hack on top of the original hack. So our entire model registry and model serving stuff was obviously wrong. It did the job. It was okay. Again, we were most worried about, you know, velocity and kind of iterating rather than, you know, it being right. But it was something that we were not super happy with. Plus, sorry, last thing, we built it on Fargate, which was good for something,
but it's bad because Fargate doesn't support GPUs. By the way, Amazon people, if you're listening to me, please add GPU support. So for the conditional random field it was fine, but if I had to port the same mentality to what I'm doing today, because, like, you know, I mentioned before, transformer models, it won't work. Right? So the entire thing wouldn't work. Okay? So that was clearly wrong. And the truth of the matter is that I don't really have today, like, a full fledged solution to this problem.
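For context, here is a minimal sketch of the "pull the latest artifact from S3 at container start and serve it with Flask" pattern described above. Bucket, key, and model format are hypothetical, and, as noted, the weakness is that "latest" is implicit: there is no version pinning, rollback, or A/B testing.

```python
# Hypothetical sketch of the boot-time loader pattern: fetch latest model, serve it.
import pickle

import boto3
from flask import Flask, jsonify, request

BUCKET, PREFIX = "my-models", "query-parser/"  # hypothetical locations

def load_latest_model():
    """Download the most recently modified artifact from S3 and unpickle it."""
    s3 = boto3.client("s3")
    objs = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)["Contents"]
    latest = max(objs, key=lambda o: o["LastModified"])  # "latest" is baked in here
    body = s3.get_object(Bucket=BUCKET, Key=latest["Key"])["Body"].read()
    return pickle.loads(body)

app = Flask(__name__)
model = load_latest_model()  # loaded once, at container start

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    pred = model.predict([features])[0]
    return jsonify({"prediction": str(pred)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```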
So for some of the standard model serving and GPU components, you can use SageMaker or a similar tool, whatever you want to use. And it will do a decent job at that. At least it will scale for you, like, somewhat properly. It will batch your requests to, you know, kind of utilize, you know, GPUs a bit better. Some versions of SageMaker, I don't think the GPU ones, also do multi model endpoints, so you can have multiple models in the same endpoint to, you know, kinda make it more, you know, efficient and cost effective.
When your deployment thing is a bit more complex, think about a recommendation engine with a 2 stage process or something like that, it's still not a solved problem, as in there's still not a tool that I know of, and if somebody knows, please reach out, that you can just give, hey, this is my artifact.
Can you please deploy it for me and scale it, you know, and scale it out? So I'm still in search of that type of solution. Of course, now you can build it yourself with a lot of other tools. But my brain's like, since I never want to build stuff myself that is not what I really care about, I'm always looking for a solution that will simplify my life. So that mistake is kind of still partly with us,
but at least some of the, let's say, low hanging fruit has been reaped by these past solutions that you can just adopt. I don't know if it makes sense. One of the things that you just noted in there is the difficulty in being able to say, I have this model, I just want it to be in production. And I'm wondering, what are some of the other points of friction that ML practitioners and ML teams might typically run into if they're trying to go from,
I have this idea of something that can be addressed as a machine learning problem, and then to the point of, okay, I have this in production, and I'm able to iterate on it and deploy whenever I want to. Just some of the challenges that still exist on that path.
I think a lot of the challenges are not in the tools. Let's say the use case, and I think it's very telling, is a new problem. So I'm not maintaining a system, but there's something in my app, I don't know, a churn prediction model, a recommendation system, whatever, that I want to automate for some reason. I want to build machine learning out of it. I think today, if you want to build this thing, it's mostly gonna be a people and process problem and not a tool problem.
And, you know, I can say it more clearly. Like, if you want to solve this problem, you need to strive for one thing only, and that is velocity.
Like, the only thing that matters is the time that it takes for you to build the end to end system, from when your data is sitting in your Snowflake to when you have an endpoint that your clients can use and interact with. And then you need to get that feedback back into the same Snowflake so that you can actually join the two datasets, your predictions and the feedback, and then kind of, like, iterate on that. I think a lot of people get stuck, not because of tooling, but because of
internal processes or the temptation
of being perfect first, which is really something we should avoid. Being stuck in the middle, stuck for weeks or months optimizing tiny little things that don't really matter. And how do you know they don't really matter? Well, that's exactly it, you don't know. Like, until you put something in production, you don't know what the potential value is, you don't know how users interact with it, and you don't even know where the bottlenecks in your system are.
Unfortunately, ML is much more of an art than a science. Well, ML itself is a science, but, like, ML in production, especially, is more of an art than a science. And we figure out a lot of stuff just when we do it, you know, to some extent. Do the least amount of work and sophistication possible to put out a feature that makes some sense. And, hopefully, work in an environment that rewards this, work in an environment that rewards experimentation.
Safe experimentation, you know, with checks in place and making sure you're not destroying, you know, people's lives. You know what I mean? An environment that is okay with the idea that all of this is gonna be an iteration, and all of this is gonna be, you know, kind of validated as we go, instead of trying to do something in a cave for 3 months and then trying to bring it into production by merging 4 teams together. And give the people that build the model the keys to production.
Right? We always push for this end to end ML person. The end to end ML person doesn't know infrastructure, doesn't know Kubernetes, doesn't know provisioning. He doesn't know how a distributed SQL query runs, but he knows SQL to get the data from Snowflake.
You know? He knows Metaflow to run stuff at scale on GPUs, even if he doesn't know how to provision a GPU, no worries, Metaflow is gonna do it for you. And then he knows enough to send the model to SageMaker and evaluate that. You know what I mean? Like, our job as leaders, I think, is to abstract away all the maintenance and kind of, like, provisioning.
But the job of the ML person is not just to do the model. The job of the ML person, at least on my team, is to do the entire thing, because it's only by understanding the entire thing that they're gonna know what the problems are. On that question of saying, okay, as an ML person, you don't necessarily know how to deploy a Kubernetes cluster or how to deploy a pod to Kubernetes or how to manage the scaling, etcetera.
And in the tutorials that you wrote, you made a point of trying to focus on a so called no ops approach to being able to get models out into production. And I'm wondering what you see as the benefits of being able to empower the ML engineer to be able to actually own the whole end to end workflow of saying, I have this model, now it's in production
and not having to involve an operations team? And what are the points where you might decide that you do need to bring in an infrastructure team, and some of the challenges that a typical production engineering team might need to be aware of as they start to move into the space of MLOps? It's a very good point, and I think there's no clear cut distinction. But my point is that in a company, these two things can totally coexist.
There are new features that are still experimental. You're still testing on 5% of your users, of your clients, and then they may go in a totally no ops approach because, again, you're optimizing for speed. You're optimizing for velocity. And the faster the person that builds the model
can deploy, get feedback, and iterate on that, the better it's gonna be for the company. That's why if you start the handover and kind of, you know, finger pointing or, like, you know, rewriting stuff at that level, you're gonna kill innovation. And you might do a lot of work that doesn't really matter in the end. Right? If the person, like, you know, runs this for a couple of weeks and it doesn't really
work, we just wasted one person's time. If instead we spend 2 months to bring this properly to production and it doesn't really work, now we've spent 2 months of time across different teams. Right? There's an analogy to the reinforcement learning thing. Right? There's a moment for exploration and there's a moment for exploitation. So in exploration, you privilege speed and self service. The people that build the feature should be autonomous.
In the exploitation, so when we've figured out that this recommender system is gonna make us 10 more million dollars at the end of the year, you may want to, you know, optimize for, you know, margins, latency, robustness, and so on. So that's the moment when another team can get involved. And, again, in healthy companies, this is not an or.
This is an and. Like, there are some things that work one way, and then you somehow move them the other way. Not that there's just one way of doing stuff. And, hopefully, I mean, at least that's what works for us. What people should be aware of when moving from normal engineering to models, well, it's the thing that everybody says. Right? The model fails silently. I think that's the most common problem of ML. Right? Because when you have an endpoint that is a normal API,
when it typically fails, it's gonna go into, you know, a 500 or, like, something. And, you know, there's a, you know, Datadog alert or something, and it's like, oh, there's something wrong.
And most of, I mean, not all of them, but a significant amount of bugs are gonna be caught because the endpoint somehow fails. In models, the vast majority of bugs are not because the endpoint somehow returns an error or just dies. It's because the model is doing something that you really don't want it to do, for whatever reason, and then we can discuss the causes. And so I think that's the difference between the two things, is that you should be paying attention
to a wider variety of failures as compared to a normal deterministic system. I think that's the biggest change for ML in production. In terms of the level of infrastructure knowledge that you would expect an ML engineer to be aware of, I'm wondering,
at what point would you say, okay, stop wasting your time on that. Just go and use this managed service instead, or stop wasting your time on that. Bring in the production engineering team? Like, what do you see as the kind of useful overlap between an ML engineer understanding infrastructure and then the infrastructure engineer needing to understand some of those vagaries of ML systems?
I think the ML engineer needs to know enough, in his company, to be independent and get feedback from his model. Then how much is enough? It really depends on the maturity of the company. Like, even in the case when there's a data engineering team, like, call it an ML platform team in our case, but same thing, I don't think the role of the ML platform team is to take over or somehow, you know, rewrite what the ML people do.
It's to build tooling and abstractions so that the ML people can, you know, still do their job in as automated a fashion as possible, you know, abstracting away all the details. I don't think communication between teams is maximized when people need to talk to each other. The communication between teams is maximized when we have an API, a contract, or an abstraction that actually works. So to me, the job of an ML platform team is not to talk to the ML people. Sure, let's talk.
But it's to build tools that make my ML people, you know, as productive as possible. You know, it depends on where you tilt the balance. Right? You may have an incredibly good ML platform team that simplifies everything, so the requirements for the ML engineer are very low, or you may have, as in less mature companies, an ML platform that is nascent, and so the ML engineer needs to kinda know a bit more of the details.
But in theory, in an ideal world, I think, you know, the work of the ML engineer ends, once you validate in production, you know, with the artifact and, you know, what you need to kind of deploy. And then everything else will be taken care of by the platform. So
As ML engineers and ML teams are going down this path of saying, okay, I have this ML problem. I've built a model. Now I need to figure out what this end to end life cycle will look like and how we manage to build in these feedback loops and the cyclical aspects of this. As you said, there's a whole suite of tools out there, open source, managed platforms, etcetera.
I'm wondering what are the kind of key elements that you see as being useful to put into a rubric to figure out what are the tools that we need to use, what is the selection process for understanding which ones will fit well in our environment and suit the expertise that we have on staff? And just some of the kind of challenge of this paradox of choice that has arisen in the space?
There's surely, like, you know, vendor tool fatigue in the space. Like, you know, I also experience it myself. Right? You know? Even if I think we are in a relatively stable place, as in we kinda figured out most of the things we wanted to figure out, so we're not exactly, you know, drowning in pain points. I still,
of course, look at the new cool kids on the block, or, like, I'm asked, you know, to chime in on, like, a new product or feature or something like that. So that's definitely true for everybody at different levels of maturity. My usual suggestion is start simple and start from the things you cannot do without.
The things we cannot do without are 3. So one is data. So where is your data and how do you retrieve it? If you're doing, again, tabular or, like, you know, semi structured data, I think the problem has been solved by, you know, the BigQuerys and Snowflakes of this world. So there's not much thinking there, I think. Then, how do you do your training?
And the question is gonna be, which type of model do you train? Do you train, like, a massive deep learning model, or can you just do a regression? How much data are you realistically facing, and so on and so forth. So find a tool for the training that actually, you know, suits your needs. Training means
training. So actually doing the training and the ancillary things of, like, how do you version the model you get from training? How do you reschedule training every day? Or, you know, how do you retry a training if it's failing? Okay? So all of this needs to be figured out. And then, of course, deployment, you know, because if your model stays on your laptop, it has no impact on the world. Right? So you need to find a way for the model to be served. Attention, though,
that doesn't really mean that online serving is the only way. For example, recommendations are very common, and a very common, I would say prevalent, way of serving them, even at Netflix scale, is in batch. So if you know your users, for example, you can precompute movie recommendations for your users every day. And when the user logs in, you're not gonna run a prediction. You're just gonna get the precomputed information from a cache. The last open source stuff we did with NVIDIA
is actually following this pattern. Okay? So you precompute all your stuff, and then you put it in DynamoDB, and then you can serve it even with a Lambda, and so on and so forth. So these are the key things that you cannot
miss. Otherwise, you don't have a machine learning pipeline. If you're missing data, if you're missing training or a model, like, you're doing something else. And so I would suggest starting with this. And once you get comfortable with this, you can go into the higher level stuff of, like, how do I track my experiments properly? How do I monitor my system in production? Because, again, monitoring may be tricky, and so on. So start with the basics, and then, you know, go on to the higher level stuff.
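A rough sketch of the batch pattern mentioned above, with hypothetical table and key names: an offline job precomputes recommendations and writes them to DynamoDB, and a small Lambda handler serves them as a cache lookup, with no model in the request path.

```python
# Hypothetical sketch: offline job writes precomputed recommendations to DynamoDB,
# and a tiny Lambda handler serves them as a lookup, not a prediction.
import json

import boto3

TABLE = "user_recommendations"  # hypothetical table, keyed by user_id

def publish_recommendations(recs_by_user: dict[str, list[str]]) -> None:
    """Run this at the end of the daily batch job."""
    table = boto3.resource("dynamodb").Table(TABLE)
    with table.batch_writer() as batch:
        for user_id, items in recs_by_user.items():
            batch.put_item(Item={"user_id": user_id, "items": items})

def lambda_handler(event, context):
    """Online path: read back whatever the batch job precomputed."""
    user_id = event["queryStringParameters"]["user_id"]
    table = boto3.resource("dynamodb").Table(TABLE)
    resp = table.get_item(Key={"user_id": user_id})
    items = resp.get("Item", {}).get("items", [])  # fall back to an empty list
    return {"statusCode": 200, "body": json.dumps({"items": items})}
```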
Another aspect of the challenge of doing that tool selection is what you were saying earlier about the fact that the space is evolving so rapidly,
and I'm curious what you have found to be some of the useful interfaces for being able to say, okay, this is a seam where I can easily take out this tool and replace it with something else if I find something better or if this tool stops being maintained or if my requirements change, or just how to think about the interfaces and the sockets where you can plug in the different pieces without having to rebuild everything from the ground up every time.
This is a concern that most people have, and I think it's a very genuine one. In our case, in our setup, this boils down to the choice of the backbone of the system, in how you build your ML pipeline, which in our case, as we mentioned, is Metaflow. I think a similar reflection would apply to
more traditional systems like, you know, Airflow or Prefect, if you don't want to use an ML specific orchestrator, for example. But if you think about the 3 functional pieces that I mentioned before, the connection between them is already fairly clean. Like, my connection to a data warehouse is a SQL query. If Snowflake
ceased to exist tomorrow, which is, by the way, fairly unlikely, I could just replace it with BigQuery, and most of my code is gonna basically stay the same. Actually, my downstream code is gonna stay exactly the same, as long as I replace the client object to point to BigQuery instead of Snowflake.
My training doesn't really depend on anything. It's pure Python on, you know, PyTorch or, you know, TensorFlow or whatever. Again, they're here to stay, they're not gonna change. And the result is an artifact in S3 that gets shipped to whatever deployment platform I picked. So the way in which the 3 parts talk to each other: data with training is SQL.
Training with deployment is, you know, a versioned artifact. And those, at least to me, are already solved, and they're not gonna change. They're here to stay. And I have to say that most of the other tools that I saw in the monitoring space or experiment tracking and stuff like that are also very good at this integration game, as in they provide you very, very nice hooks
to kind of improve what you have very easily. And, of course, if you should get tired of them, you can also remove them or change them, I would say, with not much more work. Like, for us,
like, changing Snowflake, going back to Spark SQL, would be, like, going back 5 years in our life, and erasing Metaflow would be going back, like, 2 years or something like that. Once that is in place, I think every other tool has been, you know, like, relatively straightforward to incorporate into this vision.
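To make that seam concrete, here is a minimal sketch of the kind of thin wrapper this implies (credentials and table names are placeholders): downstream code only ever sees "SQL in, list of dicts out", so swapping Snowflake for BigQuery touches one function rather than every caller.

```python
# Hypothetical sketch: downstream code depends only on this contract,
# "give me a SQL string, get back a list of dicts", not on the warehouse behind it.
from typing import Any

def read_sql(query: str) -> list[dict[str, Any]]:
    """Run a query against the warehouse and return rows as dictionaries."""
    import snowflake.connector  # assumes snowflake-connector-python
    conn = snowflake.connector.connect(user="...", password="...", account="...")
    cur = conn.cursor(snowflake.connector.DictCursor)
    cur.execute(query)
    return cur.fetchall()

# Swapping warehouses means swapping the body, not the callers, e.g. roughly:
#
# from google.cloud import bigquery
# def read_sql(query):
#     return [dict(row) for row in bigquery.Client().query(query).result()]

rows = read_sql("SELECT product_id, clicks FROM events LIMIT 10")  # hypothetical table
```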
On this question of the kind of reasonable scale moniker that we've been addressing, as we said, it's the question of going from your laptop to actually serving production traffic with potentially thousands or hundreds of thousands or millions of users.
What are some of the common bottlenecks that you tend to run into along that path from training to production and then once it's in production, some of the performance issues that you might need to account for. And I'm also curious how the specific application of ML might influence
the manifestation of those bottlenecks. So whether it's an NLP system or a recommendation engine or a computer vision problem, things like that. For some problems, NLP is a good example. Let's say you build something in, like, contemporary NLP with deep learning and vectors and so on. At some point, it's very, very likely that your prediction system is gonna have a retrieval phase in which you're gonna be asked to go and find
a vector that is similar to the vector, you know, that the user is inputting, you know, in some sense. This applies to recommendations as well. Right? You have a recommender system at some point, and the user is viewing a Samsung TV on Amazon. You may want to go to a place and fetch similar Samsung TVs,
you know, to then recommend them to the user. Right? So the first thing you need to do is fetch similar ones, and the second thing is you have to rank them, because, of course, different users may have different price points or preferences and so on. That part,
for us in the last year, I think, has been the part where we've actually been working the most in production: the retrieval, the vector search, basically at scale. Okay? It's a very hard problem. Now there's a bunch of startups trying to solve it, Pinecone, for example. There's open source stuff that you can use, like Vespa. Like, there's a bunch of stuff that, of course, you can use. I think Redis
right now also has a built-in in-memory vector search or something like that. And, of course, some models, like TensorFlow Recommenders, include, like, a kNN inside of them if you deploy them as Keras models. But there is a huge difference between what happens on your laptop and what happens in production. On your laptop, you know, you don't really care that much about the latency of that moment. You just care about the quality of the recall. But when you put it in production, now you have maybe a million users. Now, you know, the vector search over
hundreds of thousands of items needs to be very fast. And that's why, you know, like, complex use cases with more latency constraints, like, for example, information retrieval, become especially challenging, because that's not the end of the game. Like, you know, once you retrieve that, you still have to do stuff to produce value. So there's a bunch of things that need to happen, and they need to happen in a very constrained amount of time. Well, when you train your model, or even when you
maybe start trying it out with a bunch of users, like, you know, for a bit of, like, testing, latency is less important. You know? You can sacrifice
some of that speed to have an accurate prediction. But then when you actually deploy, it needs to be fast and scalable. And so at Coveo, there's been a lot of work into making our indexes, processes, and so on vector compatible. Honestly, it's very, very challenging work. It also brings into question what is the best possible architecture for this. Do you have 2 indexes, 1 dense and 1 sparse?
Do you have one that kind of does both of these? Like Elastic, for example, is a popular open source choice that sort of does both.
But there's an argument to be made that maybe the vector search in Elastic is not the most optimized one for machine learning. Now what do we do? Do we have a part of the company that uses Elastic and a part of the company that uses, you know, something else? It's a very tricky thing when you actually have to work it out. But, yeah, it's been an interesting journey so far.
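Here is a toy sketch of the two-stage pattern described above, retrieve candidates by vector similarity, then rerank the short list, using brute-force numpy rather than any of the vector stores mentioned; the catalog, embeddings, and scoring are made up. At scale, stage 1 is exactly the part that gets handed to an approximate-nearest-neighbor index.

```python
# Hypothetical sketch: stage 1 retrieves nearest items by dot product,
# stage 2 reranks the small candidate set with a user-specific score.
import numpy as np

rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(100_000, 64)).astype("float32")  # catalog embeddings
item_prices = rng.uniform(10, 500, size=100_000)

def retrieve(query_vec: np.ndarray, k: int = 100) -> np.ndarray:
    """Stage 1: brute-force similarity; in production this is an ANN index instead."""
    scores = item_vectors @ query_vec
    return np.argsort(-scores)[:k]

def rerank(candidates: np.ndarray, user_budget: float, top_n: int = 10) -> np.ndarray:
    """Stage 2: cheap personalization, e.g. penalize items far from the user's budget."""
    penalty = np.abs(item_prices[candidates] - user_budget)
    return candidates[np.argsort(penalty)][:top_n]

query = rng.normal(size=64).astype("float32")  # e.g. the Samsung TV the user is viewing
print(rerank(retrieve(query), user_budget=300.0))
```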
To that point of there being some particularly challenging problems that you've had to tackle as you move to these reasonable scale levels, what are some of the machine learning applications or specific use cases or model architectures where it's not really feasible to run at so called reasonable scale, and those are still kind of the domain of the big tech companies that have deep pockets to be able to spend on the research and the infrastructure and the talent to be able to realize these capabilities?
I think the main division, the first parameter that is gonna put you out of the reasonable scale, is gonna just be a boring one: just how much data do you have? At some point, even nice tools like Snowflake and so on become just pretty expensive if you really need to run all the computation within Snowflake on that. And then it will make sense to have internal tools that are built, you know, on more
old school, but better scalable platforms that you can actually control and tune, because at petabyte scale, tuning your Spark cluster is actually valuable. At my scale, it's mostly just annoying,
but at petabyte scale it actually makes the difference, you know, between a successful company and not. So I think data is the first big hurdle. Remember what Peter Norvig said when they asked him, like, a long time ago, why Google was so good. And it was like, we don't have a better algorithm. We just have more data than you. Because the single reason why,
you know, models in production actually work better is because these companies have way more data and can leverage way more data than you could ever do. And then, of course, their scale also makes the second part interesting, which is the modeling aspect. Right? A 1% increase in modeling accuracy for a recommender system at petabyte scale is gonna produce a billion dollars,
you know, down the line. So everybody wants to optimize for it. If you run at a reasonable scale, a 1% increase in your recommender system, you know, may or may not actually change the ROI for the company or the client. It may not even justify the amount of GPUs you're spending in proportion. You know what I mean? So there's a lot of trade offs that make sense only after a certain level of scale.
But, and this is what I'm always, like, interested in, if you read, like, you know, what Pinterest has been doing for a while, some of the things that Pinterest is doing, not the whole story, but some of the things they do, are really basic. Like, the dot product between users and pins or something like that. When you read about that, it's like, it's not sophisticated. It's actually way less sophisticated than a lot of things that other people do.
But they are so good at producing good representations of users and products as vectors that, at the end of the day, you know, they can kinda simplify the problem. Like, well, we have a lot of data. We're very good at representation.
Then the dot product is a very easy thing to do. You know what I mean? Like, I also think there's something that people don't say that much, which is that at a large enough scale, simpler things become convenient again, because data makes up for the difference in performance. To your point of saying, if I have a recommendation engine and I'm able to improve the accuracy or the recall by 1%, it's not gonna have a meaningful impact on my business. What are some of the kind of premature optimizations
that people might fall into where you can say, don't bother wasting your time with that, it's not gonna have the impact you think it's going to. Focus on making sure that you can catch errors faster, or focus on making sure that you can reduce the latency from model development to model deployment. Like, what are the things that make sense to focus on at reasonable scale, and what are the things that are just gonna be
a waste of your time? So first thing, second thing, and third thing is gonna be your data your data cycle. So make sure that your data is collected properly. Make sure you have the proper, you know, ELT patterns and things that actually makes you, you know, makes you confident that what you do is a replayable, scalable, and version, and you can kind of count on it. And second, make sure that the data you ingest at the end of the pipeline are joinable with the data that you that you start with it. Right? If data is not joinable, data doesn't exist as a general thing. Like, if you're gonna put data in relationship with other data, just not having that data. And you'd be surprised by how many times I saw people deploying a model in production and thinking later of, like, well, what do we know if we're gonna be successful or not? Like, well, that's the first question to us before building the system, not the last 1. At the moment when you write the first line of code, you need to have very, very clear how you're gonna get the feedback from the user back into the system. If you don't have that clear, don't bother building the system in the 1st place. So data is always the first place to start looking for, you know, like optimization.
And if you don't have enough data, you can do augmentation or synthetic data generation. So knowing your data, becoming one with your data, is important for everybody. That goes for Tesla. There's a famous piece by, you know, Karpathy about that, become one with the data set. And it's even more important at, you know, our scale. The second thing that I would say is, you know, try to build a model that's simple enough to produce value,
but not so complex that you don't immediately understand, you know, when it's wrong. The secret to many applications is understanding that the distribution is a power law. So in many things, like, I don't know, customer service, for example. Let's say you need to build a model
that tells you if a customer is happy or sad about a topic. Okay? And you don't have any data, and you don't know how to start. A very, very easy way to start is, like, highlighting from your backlog of data, like, sentences that are very clearly expressing, like, anger. Like, I hate you, I want to deactivate the account, whatever. And you literally can regex your way out of your first month,
and then you put the prediction live, and then you collect the data at that point to kind of bootstrap your machine learning system. Like, the first rule of machine learning is always asking if you need machine learning. And in many cases, the truth is that you do need it, but maybe not at day 1. Or you need it at day 1, but you don't have the data. And, again, data is the most important thing. So to get to the point where you have the data, you need to do some tricks here and there.
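A toy sketch of the regex bootstrap being described, with made-up patterns: ship the dumb rules on day one, log what they flag, and use that feedback to train the real model later.

```python
# Hypothetical sketch: a regex "model" that buys you a first month of labeled feedback.
import re

ANGRY_PATTERNS = [
    re.compile(r"\bi hate\b", re.IGNORECASE),
    re.compile(r"\b(deactivate|cancel|close) my account\b", re.IGNORECASE),
    re.compile(r"\bworst (service|product)\b", re.IGNORECASE),
]

def is_angry(message: str) -> bool:
    """Day-one 'classifier': any pattern match counts as anger."""
    return any(p.search(message) for p in ANGRY_PATTERNS)

# Log every prediction, so the real model can be bootstrapped from this feedback later.
print(is_angry("Please cancel my account, this is the worst service ever"))  # True
print(is_angry("Thanks, the new feature works great"))                       # False
```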
So these are the things that should occupy your first month of a new ML feature. Not latency,
not scaling, not GPUs, and so on and so forth. And then, you know, the more confident you get, the more you can improve that. Of course, that doesn't apply to everybody. Like, depending on your level of experience and maturity, you could, you know, when I need to build a recommender system nowadays, I don't start with rules. You know, I already have a base where I know what things work and don't because of my background, and everybody has their own depending on their application. But if it's a new problem,
don't be afraid of being simple and dumb, you know. Like, everybody needs to start somewhere. In your experience of working in this space and putting together these tutorials and talking to people who you encounter at conferences and in your day to day work, what are some of the most interesting or innovative or unexpected ways that you have seen people design their ML systems and be able to build out and prove the utility and value of their ML approaches to a particular problem?
There's a bunch of cool things that I've seen recently. I wouldn't say it's a finished job. Like, my friend Ethan published a blog post recently on using Materialize and SQL to somehow do ML monitoring, basically a pure SQL solution to ML monitoring, which I think is an interesting and clever take on the problem. I'm not sure if it is right, though.
So the idea is that you don't really need Python or fancy stuff to do the bulk of your monitoring. What you need is a good streaming service, let's say Materialize, and some SQL queries to kind of, like, you know, do basic counts. Okay? And I think he, like, you know, just published, like, an open source example on that. And I thought it was clever. All the things that I find that try to simplify
things are welcome from my perspective. I'm still not sure if it's completely right. I'm not sure if you can actually do proper monitoring without at least some statistical testing or, you know, some fancier stuff, especially if you do NLP, if you do more complex stuff than just counting or tabular problems. But I thought it was a very clever spin on the usual, well, do you need a full monitoring tool solution? And then it was like, well, you know, maybe, but also maybe something that literally is, like, open source and simpler will do. So that definitely, like, resonated with me, and I thought it was pretty cool.
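A hedged sketch of the "counts in SQL" idea, not the actual blog post's code: a couple of queries over a predictions table or stream that would surface silent failures like a drifting score or a feature that went null. The names and the Postgres-style syntax are assumptions.

```python
# Hypothetical sketch: basic ML monitoring as plain SQL aggregates over recent traffic.
MONITORING_QUERIES = {
    # A sudden jump or drop in the average score is the classic silent failure.
    "score_drift": """
        SELECT DATE_TRUNC('hour', served_at) AS hour,
               AVG(score)                    AS avg_score,
               COUNT(*)                      AS n
        FROM predictions
        WHERE served_at > NOW() - INTERVAL '24 hours'
        GROUP BY 1 ORDER BY 1
    """,
    # Nulls creeping into an input feature rarely throw a 500, they just degrade output.
    "null_rate": """
        SELECT AVG(CASE WHEN user_embedding IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
        FROM predictions
        WHERE served_at > NOW() - INTERVAL '1 hour'
    """,
}
# Each of these can run on a schedule, or live as a materialized view in a streaming
# database, with whatever SQL client you already use, e.g.:
# for name, sql in MONITORING_QUERIES.items():
#     print(name, read_sql(sql))
```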
Generally speaking, I really like all the serverless stuff. And I don't mean the Serverless framework specifically, or the AWS implementation, but the general direction in which the world is kinda moving, where you don't really care about much, if not just, you know, giving a function to some provider, and the provider is gonna run it. So every time that I see people that are able to
serverless their way out of their pipeline, I'm always kind of, like, thinking, well, good job. Like, you know, I really like the approach. So yeah. In your own experience of working in this space and building out your own systems design and managing these machine learning problems, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
There's one thing that I realized, which is maybe controversial, but, like, smaller teams in this space may, or will typically, outperform bigger teams if they have the proper tooling and data in place. Like, a small team will drastically outperform teams that are, like, you know, 10 or 20 times bigger if they know how to,
you know, leverage this ecosystem in a way that makes, you know, their feedback loop very, very short, and if they have a way to abstract away the non important stuff, or, sorry, not non important, the non differentiating stuff, like infrastructure, scaling, and so on and so forth. So one thing that I learned is that it's very hard to judge the productivity of an ML team by size, as, you know, incredible teams can actually be found in relatively small companies.
I know it's not small by most definitions, but if you think about OpenAI, like 300 people or something like that, well, consider how complex what they do is. Like, if you had to guess from the outside, you would think it's a much bigger company. That's my point. So I think that's one interesting learning that I have, is, like, a smaller talented team will actually outperform bigger ones, because it will cut down on dependencies and all the communication
channels, you know, which slow everybody down, basically. That's surely one interesting learning. And on the other side is, well,
building models is super cool, and I really like it, but it matters less in practice than what I thought it would. It's not true in every use case. It's not true in every company. It's not true when life or death is on the table or when marginal gains mean billions of dollars. But in many cases, I think I came to the realization, and many ML people came to the realization in the last, you know, 4 or 5 years, that our job is not very important.
I mean, it's cool and it's fun, but where the value is, is in how that job is connected in a principled way upstream with our data, as we said, or downstream with our users. That's, I think, where the real applied ML person shines. The applied ML person shines in using the right tool for the job, but mostly in understanding how his job sits in a context
where the model doesn't just have to predict, but needs to predict certain types of things for certain types of people in a certain way based on certain data. And that, I think, is where the value is. But this, I think, is less controversial. I think more and more people, especially those that have these two sets of skills, like research skills and practice, they do both things, more and more of us realize the huge gap between these two worlds.
They're both interesting and worthwhile in their own right. But then I think a successful person needs to be able to switch off the hats, the research hat, when you do one or the other. And on the other side, you need to switch off the practical hat when you're trying to advance the field. But these two things don't necessarily go hand in hand. And I'm not sure if it's good or bad or, you know, where we're going. But I think many people in my position,
like, deeply realize this every day now. As you continue to build and manage your own systems and work in the ML space, what are some of the things that you're keeping a close eye on as far as evolving patterns or upcoming tools or practices and some of the new capabilities that you think are worth paying attention to or useful resources for people who want to gain more understanding of this space and evolve in their own capabilities?
So I'm generally interested in team topologies in large organizations and how people work with, you know, with data together. The fact that you are an end to end ML, like, engineer
doesn't mean you're the only person touching data. So at some point, there's gonna be some data that you need to share with the analytics engineer or the BI folks or something like that. And I'm always interested in talking to people that are solving this problem, as I think it's a totally unsolved problem. Like, the collaboration of people with different skills and different downstream needs over the same data warehouse, I think, is one of the big themes of our time. Like, now that all the data is centralized
and we pretty much use the same tools or languages, how can we work together instead of stepping on each other's toes? So this is one thing. If anybody has interesting things to say, please reach out to me. I think this is a pretty novel but exciting area of research, and I hope to devote a bit more of my time as well to solving this. On existing things, so less far out in the future, there's a lot of emphasis right now on streaming infrastructure.
So, going from, you know, batch to quasi real time. I would keep an eye on that, not because it's gonna happen in the next year or so, or even 2 years. I don't think that's gonna happen for most people soon enough. But I think that's where the future naturally lies, you know, and it's always,
you know, the saying, right? I skate to where the puck is gonna be, not where the puck is. So it's always good to keep yourself, you know, a bit ahead of the curve. But, yes, I think streaming is super cool. I just think that the amount of people that right now can operationalize it and make sense of it is a tiny minority, because the tooling is so hard to use. But in a couple of years, and even considering the investment in databases in that space, I don't think I'm making this up, I see streaming as obviously a growing market
for MLOps and, you know, ML applications. Are there any other aspects of this question of being able to design and build ML systems at reasonable scale and your experiences working in that space that we didn't discuss yet that you'd like to cover before we close out the show? No, the last thing that I'd mention is testing, just because we've been working a bit on this in the open source stuff as well. So testing a model is
hard, you know, and it's much harder than what people think. Not even monitoring, not even when the model is online, but even when you have a test set and you test your recommender system or classifier, and you get a number out of it, and based on that number you decide if the model is good enough to be shipped outside. That is a tremendously hard thing to do, and we ourselves came to realize this relatively recently, which is why we built RecList. We couldn't find anything
that would satisfy our need. So, you know, we didn't set out to build it. It's just that there was nothing that could actually satisfy that need. So we built this package and open sourced it. But there's a lot of nuance in what models do in real life that is not captured by a test set or what we call point wise metrics.
Like, the typical example we make is, you know, for a recommender system, since a lot of people watch blockbuster movies, if you build a recommender system that is slightly better at recommending Batman, it's gonna give you the impression that it's a better recommender system if you measure by hit rate or MRR or whatever.
And then if you zoom in, it may be that by being a bit better on Batman it's actually destroying the viewing experience of every Italian viewer, because they're a niche, and now the recommender system doesn't really take care of them. And so the point is, like, not telling you which one is the best recommender system, but just kinda making everybody aware that deciding what to put in production, what the model is gonna do when unleashed, is harder than one measure
over all the data points. So if you like this topic, I'm super happy to talk about that. Again, we're organizing a free open competition with nice prizes and some student awards as we speak. If you're passionate about building recommender systems and evaluating them according to not just accuracy, but fairness and robustness, you know, please get in touch.
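To illustrate the point (this is not RecList's actual API, just the idea): a toy slice-based check that compares a global hit rate against the hit rate on a niche slice, like the Italian viewers mentioned above, using made-up data.

```python
# Hypothetical sketch: the global metric looks fine while a niche slice is broken.
def hit_rate(predictions: list[list[str]], truths: list[str]) -> float:
    """Fraction of users whose actually-watched title appears in the recommended list."""
    hits = sum(1 for recs, truth in zip(predictions, truths) if truth in recs)
    return hits / len(truths)

# Made-up evaluation rows: (user_country, recommended_titles, actually_watched)
rows = [
    ("US", ["Batman", "Dune"], "Batman"),
    ("US", ["Batman", "Avatar"], "Batman"),
    ("US", ["Dune", "Batman"], "Dune"),
    ("IT", ["Batman", "Dune"], "La Dolce Vita"),
    ("IT", ["Avatar", "Batman"], "8 1/2"),
]

preds = [recs for _, recs, _ in rows]
truths = [truth for _, _, truth in rows]
print("overall hit rate:", hit_rate(preds, truths))  # 0.6, looks acceptable

it_rows = [(recs, truth) for country, recs, truth in rows if country == "IT"]
print("IT slice hit rate:", hit_rate([r for r, _ in it_rows], [t for _, t in it_rows]))  # 0.0
```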
Yeah. And I'll make sure to add a link in the show notes for people who want to get involved in that. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as my final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today. I always do 3. Like, first,
algorithms are hard, but this is getting better because models are getting better abstracted away and so on and so forth. Second, MLOps is hard, but, of course, it's actually getting better because of all the tools and the things we discussed. And the third one is, interesting questions in this space are typically
geared, or set as an agenda, by people that have problems that nobody else has. And I think that's still hard, as in the people that actually do interesting research at reasonable scale are very few because of the constraints. So we need to be better at this. Anyway, those are the 3 big barriers for me. But
the message is, this is a fantastic moment to live in, because 3 years ago, my answer would have been that these 3 barriers are, like, you know, insurmountable. Like, nobody can do that. Now 2 out of 3, you know, can be tackled, you know, by reasonably skilled people that know what they're doing. And, hopefully, in 2 years, they will be able to be tackled by almost anybody, right, if we continue on this trend. So it's a very, very good moment to be in this space.
Alright. Well, thank you very much for taking the time today to join me and share your experiences of building and designing machine learning systems and how to think about the set of requirements that map well to the types of scale that most people are going to be encountering. So I appreciate all the time and energy that you've put into that, and I hope you enjoy the rest of your day. Of course. Thanks so much for having me. And guys, thanks so much for listening. Ciao.
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com
to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.