
Build Better Models Through Data Centric Machine Learning Development With Snorkel AI

Jul 29, 2022 · 54 min · Ep. 5

Episode description

Summary
Machine learning is a data-hungry activity, and the quality of the resulting model is highly dependent on the quality of the inputs that it receives. Generating sufficient quantities of high-quality labeled data is an expensive and time-consuming process. In order to reduce that time and cost, Alex Ratner and his team at Snorkel AI have built a system for powering data-centric machine learning development. In this episode he explains how the Snorkel platform allows domain experts to create labeling functions that translate their expertise into reusable logic, dramatically reducing the time needed to build training data sets and driving down the total cost.
Announcements
  • Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
  • Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
  • Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix, and track their data across the ML workflow (pre-training, post-training, and post-production) – no more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, and see 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
  • Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out!
  • Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel AI, a platform for data-centric machine learning workflows powered by programmatic data labeling techniques
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what Snorkel AI is and the story behind it?
  • What are the problems that you are focused on solving? 
    • Which pieces of the ML lifecycle are you focused on?
  • How did your experience building the open source Snorkel project and working with the community inform your product direction for Snorkel AI? 
    • How has the underlying Snorkel project evolved over the past 4 years?
  • What are the deciding factors that an organization or ML team need to consider when evaluating existing labeling strategies against the programmatic approach that you provide? 
    • What are the features that Snorkel provides over and above managing code execution across the source data set?
  • Can you describe what you have built at Snorkel AI and how it is implemented? 
    • What are some of the notable developments of the ML ecosystem that had a meaningful impact on your overall product vision/viability?
  • Can you describe the workflow for an individual or team who is using Snorkel for generating their training data set? 
    • How does Snorkel integrate with the experimentation process to track how changes to labeling logic correlate with the performance of the resulting model?
  • What are some of the complexities involved in designing and testing the labeling logic? 
    • How do you handle complex data formats such as audio, video, images, etc. that might require their own ML models to generate labels? (e.g. object detection for bounding boxes)
  • With the increased scale and quality of labeled data that Snorkel AI offers, how does that impact the viability of autoML toolchains for generating useful models?
  • How are you managing the governance and feature boundaries between the open source Snorkel project and the business that you have built around it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Snorkel AI used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Snorkel AI?
  • When is Snorkel AI the wrong choice?
  • What do you have planned for the future of Snorkel AI?
Contact Info
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
  • Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Transcript


Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine learning. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected.

Get started quickly using their built in library of checks for testing and validating your model's behavior and performance and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to learn more and get started.

Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building natural language processing models to programmatically inspect, fix and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts.

Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10x faster ML iterations. Galileo is offering listeners of the Machine Learning Podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to themachinelearningpodcast.com/galileo

and request a demo today. Your host is Tobias Macey. And today, I'm interviewing Alex Ratner about Snorkel AI, a platform for data centric machine learning workflows powered by programmatic data labeling techniques. So, Alex, can you start by introducing yourself? Thanks so much for having me on the new show. And like you said, I'm Alex. I'm one of the co-founders and CEO at Snorkel AI. We build what we call a programmatic, data-centric development platform for machine learning and AI.

And I'm also an assistant professor at the University of Washington in the computer science department. And before that, we started the Snorkel project back at the Stanford AI Lab, which we spun out of. So I'm very excited to talk about what we do and how it ties into some of the broader,

you know, shifts and themes we're seeing in the ML space today. And do you remember how you first got started working in the area of machine learning? Yeah. Well, you gotta cut me off if I ramble on too long about this, but let's see. I've coded since I was little, but I kinda did a detour into physics during undergrad. So then, you know, I was actually not a computer science major.

I was actually working in some consulting, and I remember we were looking at the patent corpus. I started to think it was really fascinating that you had all this information

about, you know, basically everything anyone thought was worth patenting. My view about what goes into the patent corpus has gotten a little bit more nuanced since then, but just this idea that all the scientific knowledge, even back then, in 2011 or 2012, you could fit onto a thumb drive, at least if you ditched the images, but it wasn't functionally accessible.

Even just, you know, figuring out how to pull out simple facts about basic attributes of technologies was beyond modern search or analysis capabilities. So I started getting really fascinated about how the heck can you, you know, pull information out of this great mass. And that led me into the field of natural language processing,

which, you know, right around then was really starting to evolve towards more machine learning or statistical based methods. So that in turn led me into the world of machine learning. You know, in a nutshell, one of the grand challenge kinda areas that machine learning has been uniquely

good at solving, and those boom times continue more than ever today, is, you know, dealing with language and classifying it, tagging it, pulling information out of it, etcetera. So I kinda went from this big pile of documents that I was fascinated about, how could we use computers to pull information out of it, to natural language processing. Right around then, it was shifting towards machine learning, and that led me to machine learning and ultimately back to grad school and Snorkel.

And as a brief detour, you used the terms both ML and AI during your introduction, and that's one of the things that's always interesting to get people's perspective on: what is the semantic difference between those two terms, or are they just synonyms of each other? I'm curious about your take on that. I won't do as good and complete of a job as, you know, an intro to AI textbook or course. But, you know, in a nutshell, machine learning is a subset. AI is, you know, I won't even try to give a proper definition, but broadly speaking, the field of techniques for automating things that we traditionally associate with intelligence.

And machine learning is the subset of AI methods that basically, you know, learn to do this automation from data rather than from kind of manual

programming of that automation. And I think maybe one of the key things that we'll touch on later, I'm guessing, is that, you know, it's key to point out that it's never just learning from data. I mean, it's effectively never, when you're talking about practice; there's always some kind of manual specification of the domain expertise about the thing you're trying to automate.

And, you know, which parts of the machine learning system are truly learned from data, kind of learned automatically, and which parts involve human specification,

you know, is one way of phrasing kind of the practical problem of building ML, in my view. So we'll come back to that. But in a nutshell, ML is a subset of AI where you're actually learning to do this automation from data, at least in part. Effectively, AI would incorporate or encapsulate the purely expert systems approach where you just have a decision tree where some subject matter expert says, you know, in this situation, choose this path or this path. And, you know, machine learning is the portion of that overall umbrella that says, you know, we will take some of that input. We're actually going to use observed data from this particular subject domain to be able to influence what that weighting of the decision tree might be.

Exactly. And I think that example of, you know, kind of learning the weights or tuning is a great example. I mean, if you take the simplest form of machine learning, where you're taking some hand-engineered features and learning some kind of weighting over them, and let's start with a simple linear one, then it's exactly that. I mean, another example is, you know, spam filters. Say, if I see the word, you know, loan,

then I think that maybe makes it spam. If I see the key phrase, you know, "please let your father and I know when we can see you," then I think it's not spam. And if I see the phrase, you know, "use card," then I think it's spam. And how do we weight these? Right? And so, you know, basic machine learning, and this is the classical approach, kind of pre feature representation learning or deep learning, is where

we still specify those features by hand, but we learn from data, from statistics, how to weight them in terms of their correlation with the target prediction variable, spam or not, say.
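
A minimal sketch of that classical setup, with hypothetical hand-written keyword features and scikit-learn learning the weights over them from a toy labeled set:

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

def featurize(email):
    # Each hand-engineered feature is just a yes/no rule over the text.
    text = email.lower()
    return [
        int("loan" in text),                        # suggests spam
        int("let your father and i know" in text),  # suggests not spam
        int("use card" in text),                    # suggests spam
    ]

# Tiny illustrative training set: 1 = spam, 0 = not spam.
emails = [
    "Low interest loan approved, use card today!",
    "Please let your father and I know when we can see you.",
    "Final notice: your loan offer expires soon.",
    "Dinner next week? Please let your father and I know.",
]
labels = [1, 0, 1, 0]

X = np.array([featurize(e) for e in emails])
model = LogisticRegression().fit(X, labels)

# The learned coefficients are exactly the "weighting" over the
# hand-written rules that gets learned from the data.
print(dict(zip(["loan", "family phrase", "use card"], model.coef_[0])))
```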

And then, of course, one of the big shifts in the types of AI technologies that we practically use today, not a fundamentally new idea, but new in terms of practical usability, is this idea that we can also learn, you know, which features to pick out as well. Right? So we can go straight from kind of labeled, what we call training data, to this model, with the features and the weighting basically learned as well. That fits in nicely with the work that you're doing at Snorkel. So I'm wondering if you can just give a bit more context and detail about what you're building at Snorkel AI

and some of the story behind how you ended up building a business around what started as an open source project that was the output of a research endeavor.

Let me dive in. And if I forget and you're curious, we can go back into why I say AI, not just machine learning. The TL;DR is that what we support today on our platform often involves a mix of, you know, learned components and other operators. So that's actually why, you know, I'm being more precise if I say AI in terms of the full output that users build in our platform. But, mainly, I'll talk about the ML part today. And so in a nutshell, Snorkel started as

a project very broadly scoped to study the shift we were seeing or rather that we were kind of betting was gonna come into play from what we, you know, call model centric to data centric development. So if we looked at how people build machine learning models,

it used to be that you start with some labeled data that the model learns from. You know, here's a bunch of emails labeled spam or not spam. We call that training data. And that came from somewhere else that you didn't care about as a machine learning developer. It's exogenous to your process. You download it from Kaggle or from a benchmark dataset. You got it in a spreadsheet from a subject matter expert or line of business partner, and then you started, you know, building your model and doing feature development and bespoke architecture design and all that kind of stuff. And we started seeing the shift where the models were getting more

automated, more push button, more supported. You know, projects like TensorFlow and then soon after PyTorch were just taking off, and, you know, deep learning was coming back into vogue with new ways that made it really practically usable and increasingly effective. And all these techniques were showing this trend toward the models becoming more push button. So a lot of the research project was very broadly scoped as, okay, if the models are becoming more push button, you know, what's the new bottleneck? Where does the development activity go in machine learning? And what we started to see was that it kinda shifted to the data

because these models, you know, you don't get anything for free in life. Right? So they were, you know, much more push button, automated, but they were much more data hungry. And, specifically, they need lots of this labeled data. And so this data labeling and relabeling and curation and broader set of operations around the data was becoming what people got stuck on, but also what people were starting to kinda get creative around, you know, hacking together. So we really started at a very high level saying, look, let's imagine that the model is effectively

fixed. We'll just, you know, put table stakes down that it's gonna get solved increasingly and bet on that trend. And as academics, let's focus on how people develop this data. And can we, in particular, look at ways beyond just labeling it by hand, you know, spam, not spam, spam, not spam, one data point at a time. So that was really how the project started. And then we decided to start by looking at labeling as one of the key operations and understanding if we could make it more programmatic.

And this was also influenced by working with all kinds of subject matter experts in all kinds of settings where it's not just cheap to label. Right? A lot of ML is kind of used to things like spam, not spam where you have some data that's labeled by people clicking, you know, those buttons or data like, you know, cat versus dog or stop sign versus pedestrian where you can kinda ship it out to crowd workers and get it labeled effectively.

Or, you know, think of other examples you see all the time, customer churn prediction, where you know if a customer churned or not. Right? You've got the label already. We were, back to my little personal intro, always fascinated by and anchored on these problems where you had very unstructured and unlabeled data. You were trying to pull some stuff out of some medical records or out of patents or out of the scientific literature. So we did a lot of work with, you know, doctors and scientists

and folks in, you know, areas with very private, highly expert data. You know, today: government, finance, insurance, medical and health care, you know, telecommunications,

and stuff involving user data. So all these settings, it's not easy. In fact, the largest organizations in the world, many of whom we've deployed tech with and published with, are kind of blocked on this task of labeling and relabeling training data. So we looked at this and said, okay. Can we make it more programmatic, more like software development? And that's, you know, 1 of the kind of key anchor point or starting point ideas around Snorkel is this idea of writing what we call labeling functions to programmatically

label data. And a lot of the academic project was all about that, and about how you deal with the noisiness inherent in this programmatic labeling approach, which formally we refer to as weak supervision. We can go into that more later, about how you kinda clean that up and make it usable for machine learning models.
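
To make the idea concrete, here is a minimal sketch of a couple of labeling functions using the open-source Snorkel labeling API, assuming a pandas DataFrame with a hypothetical `text` column; each function votes a label or abstains, and is noisy on purpose:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_loan(x):
    # Heuristic vote: mentioning a loan is weak evidence of spam.
    return SPAM if "loan" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_family_phrase(x):
    # A personal phrase is weak evidence the email is legitimate.
    return NOT_SPAM if "let your father and i know" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Low interest loan, reply now",
    "Please let your father and I know when we can see you",
]})

# Apply every labeling function to every example; the result is a
# (num_examples x num_functions) matrix of noisy votes, where -1 means
# the function abstained on that example.
applier = PandasLFApplier(lfs=[lf_contains_loan, lf_family_phrase])
L_train = applier.apply(df_train)
print(L_train)
```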

But really then, from there, the project, and now the company, zoomed back out again to this broader idea of data-centric development. What are these operators and operations and sources of signal you can use, above and beyond just kind of click, click, click, one data point at a time, to make ML data labeling and development, you know, faster, more auditable, more adaptable? Again, more like software development. That's ultimately what it was, and we studied all kinds of things: labeling, slicing, augmenting, sampling data, all these data operations

and said, how could we make all of this more like a structured software development process where you could sit down and build a training set in, you know, an hour or a day. And then when something changes in the world, you need to relabel or or reshape it, you just edit it like you would with any other kind of code and ship it out again

versus what it still often looks like today, you know, 6 or 7 years later, which is, you know, months of people clicking one data point at a time for every single turn of the crank, which increasingly folks are seeing is, you know, even for the largest organizations in the world, just not feasible for productionizing ML. That's the project in a nutshell, and the company spun out of that really as we started to go from some of the core ideas and, you know, theoretical and algorithmic techniques

for dealing with these more programmatic approaches

to this realization, really, that kind of hit us in the face from the users of the various open source repos and projects we put out, that, hey, this is just a new development process. And the ideas are great. The theory and the algorithms are key to it, but it's also a whole new development process that needs new workflow tooling, new platforms, new forms, new IDEs, new tools, etcetera. And that's what we spun out a company, Snorkel AI, to build and what we've been building and serving to customers over the last 3 plus years.

And there are a lot of interesting areas to dig into there. 1 of the things that I was thinking about as you were commenting on how you are trying to turn this labeling exercise into something that's more akin to regular software development is the question of how do you handle testing and validation

of those labeling functions to make sure that it's doing what you want it to do just like with regular software where, you know, it works when I run it, you know, manually, but then when I actually put it out in front of, you know, real people and real data, I find weird edge cases and bugs that I didn't know about. And so just identifying

when you're starting to hit those edge cases and failure conditions and then what the error handling logic looks like when you do hit a case where, okay, this labeling function is getting data that it was never expecting to, it's, you know, seg faulting or issuing a traceback or whatever it is. But, you know, maybe that's 1% or 5% of the actual data assets. And for the majority of them, it runs fine. And so just being able to handle that, you know, testing, validation, error handling aspect of the iterative development process that goes into building those labeling functions.

And that's a great place to go to for framing, you know, both the ideas and also just a lot of what we build today, right, and the reality of this workflow.

Because, again, that's why I use and often use this idea of software development. Right? And I'll circle back to your question in a second, but just to kind of, you know, frame it: there's a lot of noise out there in the market, the commercial market and frankly various, you know, parts of the academic literature too, kind of claiming to make this data development, you know, problem go away, to just solve it with some kind of push-button, auto-magic labeling approach. Right? There's a lot of logic that sounds very circular

and indeed is completely circular. We're gonna, you know, automate the labeling with a model to label data for a model. Right? It is actually as circular,

you know, down to the core ideas as it sounds, and that has never been our goal. And I think it's important, even with prospective customers, users, you know, collaborators, to frame it that way and say, look, we're trying to make this a software development process, and it's fundamentally about making it as efficient as possible for the human in the loop, but not kicking the human entirely out of the loop, rather just making it structured and fast. So what does that actual process look like? Well, let's go back to the example of software development. I won't be too precise here, but at a high level, you know, when you're going and taking some code to compile it, right, you have a whole bunch of techniques, including newer statistical ones using machine learning like we do, that are, you know, automating some of that process. But you also have, you know, a whole bunch of manual validation at various levels. You have manually written tests, and then you have just actual, you know, usage out in the field and rigorous ways of testing that. And then you have, you know, tools for iterating and debugging.

Right? And so there is a direct parallel to that in, you know, how we build our platform, Snorkel Flow. So there's a part, and this is a lot of what the academic project and my thesis work and many others now studied, this weak supervision idea of using theoretically grounded techniques to figure out kind of how to up and down weight and trust the different labeling functions, actually without needing any ground truth. So the cool thing, and this builds on very old theoretical ideas, is that once you have a collection of labeling functions,

even if some of them are straight up worse than random, you know, adversarial, as long as the majority are decent, because someone's trying to, you know, non-adversarially develop something in a positive way, which is one of our base kind of symmetry breaking assumptions in the theory and the practice. As long as you have a couple of these labeling functions, you can look at how they agree and disagree with each other, and you can back out with formal guarantees

kinda which ones to trust, which ones might be correlated, which ones to trust where, in different parts of the data where they're better or worse. And so there's a lot of the theory and algorithms, and we can talk more about that. Let's say that's the automated part. That's bucket one.
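
A sketch of that automated "bucket one" using the open-source Snorkel label model, continuing the hypothetical vote matrix from the earlier sketch:

```python
from snorkel.labeling.model import LabelModel

# L_train is the (num_examples x num_functions) matrix of noisy votes
# produced by the labeling functions above. The label model looks only
# at how the functions agree and disagree with each other (no ground
# truth) to estimate their accuracies and correlations, then outputs
# denoised labels for training an end model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)

probs_train = label_model.predict_proba(L_train)  # soft (probabilistic) labels
preds_train = label_model.predict(L_train)        # hard labels
```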

Then there's bucket two, just good old-fashioned testing, which we strongly recommend. Right? We include a lot of tooling for just manual annotation of testing and validation splits, and we think that's the most kinda trusted and basic way to validate, you know, a final score before you wanna, say, get permission to put it in production.
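
For that second bucket, a small hand-labeled split goes a long way. Here is a sketch using the open-source Snorkel tooling, reusing the hypothetical `applier`, `label_model`, and labeling functions from the earlier sketches:

```python
import numpy as np
import pandas as pd
from snorkel.labeling import LFAnalysis

# A tiny hand-labeled dev split (a small fraction of what you would
# otherwise need to label for training), used purely to spot-check.
df_dev = pd.DataFrame({"text": [
    "Your loan is pre-approved, act now",
    "Please let your father and I know about Sunday",
]})
Y_dev = np.array([1, 0])

L_dev = applier.apply(df_dev)

# Per-function coverage, overlaps, conflicts, and empirical accuracy.
print(LFAnalysis(L=L_dev, lfs=[lf_contains_loan, lf_family_phrase]).lf_summary(Y=Y_dev))

# And a blind check of the denoised labels themselves.
print("label model accuracy:", label_model.score(L=L_dev, Y=Y_dev)["accuracy"])
```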

Again, the key point there is, you know, you can usually label an infinitesimal fraction of the data that you'd otherwise have to label for training the model, just to kind of spot check it over time. Right? And then, you know, potentially most important of all, there's the debugging tools. Right? No matter how much automation you have, garbage in is still garbage out. You know, no theoretical tricks will cover that up. And, you know, writing better inputs will produce better outputs, and not just better, but adapted when you have to change things over time. And so these tools guide you back and say, okay, where are there error modes in my data? Where are there modes where there's been shifts? Either with the data distributions coming in or with my output objectives, because the team needs, you know, a different set of labels now or something like that, which happens all the time. Right? So, again, just in summary, there's a whole bunch of kind of, you know, fancy automation that we spent years working on and that has, you know, cool theoretical ideas and cool, you know, impact in practice.

But it's critically complemented by, you know, lots of kind of good old fashioned kinda standard validation, you know, blind held out test sets, regular checks and refreshes over time, you know, regular efforts to audit and govern and debias, etcetera that all ML practitioners, you know, should be doing. And then there's the debugging loop, which is a big thing that we build around.

As far as the point that you were making, where you potentially have this cold start problem of, I have this data that I need to label, but in some cases, the labeling function is almost as sophisticated as the model that I'm trying to develop with the labeled data. And I'm curious

how you have seen people approach that problem of working with complex datasets. So maybe you've got audio data, video data, image data that you're working with, or even in the case of NLP where you have to try and bootstrap for a particular, you know, application of, you know, sentiment analysis within a particular language community or something like that. Using the image data use case as an example, You know, I want to programmatically generate appropriate bounding boxes for identifying,

you know, people on a bicycle in a given context. But in order to do that, they need a machine learning model to do the object detection to build the bounding box. And so I'm just wondering how you approach some of those kind of complex scenarios of being able to generate label data for complex or sophisticated data types and use cases.

Let's go with that example. We can also talk about others, but anyway, I think there's, again, kind of a suite of ways that we approach this, that we have researched and published about over the years, as well as spent and are spending lots of effort building for all kinds of use cases and data types in the platform today. I'd start with the first one, which is just, you know, the basic but super important answer of making it easier to write

these labeling functions and other operators, make it easier to do this programmatic development,

especially for the folks who actually know about the data, who we often refer to as the subject matter experts. Right? So, you know, this is, like, for example, making it easy for, you know, we've done a bunch of publications with the Stanford Hospital and the radiology department and others there, making it easy for a clinician to, you know, click on a mass in an image and say, without writing a line of code, okay, well, if I see a bright mass that's bigger than this number of centimeters in this part of the image, that's a signal that it should be labeled as malignant or something.

Right? And this is similar to, you know, other efforts to make software development more, you know, no code or low code. And this is both for data scientists who just wanna move quickly, here's an artifact, or here's a keyword or key phrase in a document that indicates it should be labeled this way, let me just kind of point and click real quick, as well as for actually looping in the subject matter experts who actually have the rich domain knowledge but may not write, you know, code

into the process. So that's kind of bucket one. There's a second bucket, which is a lot of technology we've developed, and my cofounder published a bunch of papers about some of the early work here, about automating the generation of these labeling functions and other operators. So you can start with a seed set of a couple labels, you know, not anywhere near what you need to train a model, but this can be used to auto generate or auto suggest

labeling functions and other operators that then you can kind of review and approve. Right? So you're still, you know, curating with human knowledge, but you're just kind of reviewing auto suggestions or generations. And so, you know, we have that in the platform today, and that's how a lot of users start, especially in these high cardinality problems where you have, you know, maybe hundreds or thousands of classes.
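
Just to illustrate the auto-suggestion idea (this is only a toy sketch of the general pattern, not how Snorkel Flow actually generates labeling functions), you can mine a small, imperfect seed set for high-precision keywords and surface them as candidate rules for a human to review:

```python
from collections import Counter

def suggest_keyword_rules(texts, seed_labels, min_count=2, min_precision=0.9):
    # Count, for each token, how often it co-occurs with each seed label.
    token_counts = {}
    for text, label in zip(texts, seed_labels):
        for token in set(text.lower().split()):
            token_counts.setdefault(token, Counter())[label] += 1

    suggestions = []
    for token, counts in token_counts.items():
        total = sum(counts.values())
        label, hits = counts.most_common(1)[0]
        # Only suggest tokens that are frequent and strongly associated
        # with a single label; a human still reviews and approves these
        # before they become labeling functions.
        if total >= min_count and hits / total >= min_precision:
            suggestions.append((token, label, hits / total))
    return sorted(suggestions, key=lambda s: -s[2])
```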

You'll upload a little bit of labeled data. Maybe it's, you know, very imperfectly labeled, which is almost always the case, and Snorkel Flow will auto generate a bunch of these labeling functions, and then you can basically start warm started, not cold started, and edit from there. And then the final bucket, and this is in many senses the biggest idea behind all the stuff we worked on and all the stuff we build, is just this idea that all of this is about pulling in organizational

knowledge or domain knowledge into the process in ways that can include, but also are much more expansive than just kind of labeling data. Right? So this idea of a labeling function, it's kind of like a low level mechanism or it's like a low level API. The bigger idea is I've got all this knowledge. It's in, you know, my head as a subject matter expert. It's in codified sources like knowledge bases or rule sets or business logic, you know, specifications or lexicons or dictionaries. It's in

existing resources like pretrained models that detect bounding boxes or now large language models or foundation models. And how do I get all this information into a model for a new dataset and a new task that is specific to what I need to do? And effectively, what we're saying with this whole approach is that you can get it in via the data. You can use all these resources to shape your dataset

and then, you know, use that. Right? So the abstraction is often this labeling function. Right? So I can use a labeling function. You know, we just posted a paper from the research team at Snorkel about using zero-shot learning and large language models to basically turn natural language prompts from users into labeling functions. We have, you know, work on how to auto generate labeling functions from ontologies or knowledge bases. Let's say I'm trying to tag some symptoms in a medical record. I can auto generate those labeling functions from a bunch of existing medical terminology ontologies, etcetera.
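
The paper he mentions describes their approach; just to illustrate the general pattern, here is a rough, hypothetical sketch of wrapping an off-the-shelf zero-shot model behind a labeling function, where the user's natural-language framing becomes the candidate labels:

```python
from snorkel.labeling import labeling_function
from transformers import pipeline

# A pretrained zero-shot classifier stands in for the large language model.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

@labeling_function()
def lf_prompted_spam(x):
    # The "prompt" is just how the user phrases the candidate labels.
    result = zero_shot(x.text, candidate_labels=["a sales pitch", "a personal message"])
    if result["scores"][0] < 0.8:
        return ABSTAIN  # only vote when reasonably confident
    return SPAM if result["labels"][0] == "a sales pitch" else NOT_SPAM
```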

But, again, the high level idea is, you know, let's not do what a lot of AI people do, which is to say, implicitly or explicitly,

throw out everything you know, throw out that expert knowledge, throw out those expert systems, throw out everything else, and just start clicking, you know, labeling data one at a time every single time you wanna turn the crank for machine learning. And so we're saying no. Let's use all that signal, any of that signal, including the signal that's really messy. Labels you have sitting around that are stale, models that aren't good enough or are maybe even just for related tasks,

knowledge bases, language models, trained on web data, you know, all this stuff that's not gonna solve your task alone, but you can use it to label and develop data and power your models. That's kind of the bigger idea here. Sorry for waxing poetic about it. No. It's definitely great. I appreciate all of the context and the examples there. And particularly,

1 direction that I wanna go to in a minute, but keeping on with this theme for a little bit longer is you were pointing out, you know, existing models that maybe aren't perfect, which brings in the possibility

of saying that I have this model. I've started to detect that it's undergoing concept drift because the world is changing around it, but it still has, you know, some of the context and a decent amount of the detail that I'm trying to capture. So now I'm going to use that as part of my relabeling

regimen to be able to say, okay, here's, you know, 80% of what I'm trying to do, but now I'm going to write a function that will kind of complement that to adjust for some of the skew that I'm observing in the real world operation

so that I can then get a higher accuracy of the labeling function that I want to feed into the retraining of its replacement model. So, basically, you're kind of in sort of the workplace analogy of, you know, I'm hiring somebody to train as my replacement so that I can go do other things or so that I can go retire. This idea of, you know, this is a way of kind of, you know, adapting and repurposing,

you know, information. Again, this is a very classical idea in AI, not just machine learning, AI. How do I repurpose old knowledge for new tasks. Right? And in some ways we're trying to make it look more like software development where you can take kinda old existing models, labels, sources of information,

business logic, sources of signal, use it to get started and then iterate on it with code. Right? To make it more concrete, that's how a lot of projects in Snorkel Flow start, especially these bigger, more complex ones. You know, you dump in a bunch of these kinda sources of signal. Again, example, 1 of our customers, a large global insurer,

you know, for one of the problems they tackle, they started, for one project, with 85,000 labels from junior underwriters that they estimate are about 60 to 70% accurate.

This is for, like, a multi hundred way classification problem. And then they iterated from there. So they then looked at some of the analysis tools that we have and said, okay, here are the classes that are getting conflated, here are the error modes, and then did targeted iterations to make corrections to those with the labeling functions and other operators. So, you know, definitely, it's this old idea of

repurposing and adapting, whether it's to bootstrap a new task from 0 to 1 using some of the old things you have laying around or to just modify an existing task as things shift in the world. And in some ways, we're just letting you use more

programmatic or kind of software development techniques, rather than the completely automated way that a lot of research is focused on. And that's cool, and we use a lot of those techniques too. But in our experience, you have to be able to have the human in the loop to do this intervention and modification and development, and that's kinda what Snorkel is about.
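
As one hedged illustration of that repurposing pattern (the names `old_model` and `NEW_PRODUCT_CLASS` are hypothetical placeholders, and the stale model is assumed to expose a scikit-learn-style `predict_proba`), an existing drifting model can become just one more noisy voter, patched by a hand-written corrective labeling function:

```python
from snorkel.labeling import labeling_function

ABSTAIN = -1

@labeling_function()
def lf_legacy_model(x):
    # Reuse the old, drifting model as one noisy source of labels,
    # but only where it is still confident.
    probs = old_model.predict_proba([x.text])[0]
    return int(probs.argmax()) if probs.max() > 0.9 else ABSTAIN

@labeling_function()
def lf_drift_correction(x):
    # A hand-written patch for a specific skew observed in production,
    # e.g. a new product name the old model has never seen.
    return NEW_PRODUCT_CLASS if "acme turbo widget" in x.text.lower() else ABSTAIN
```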

For organizations who are either starting their journey of implementing machine learning, and they are exploring the so called traditional approach of managing their labeled data, or they're already in that loop of, you know, we manually label our data or, you know, we hire 1 of these data labeling firms or we use something like Mechanical Turk. What are some of the kind of decision points or customer education that you work with them on to understand what is sort of the cost benefit of

continuing with that manual labeling approach where you have a fairly high confidence that you're going to get a higher accuracy from those manual approaches versus going this programmatic path where maybe it complements the work that they're doing, but it allows them to scale more economically and effectively. Let me just start by pulling out that point that you made about kinda complementing.

This kinda goes back to what I was talking about, but this idea of it's not just about, like, labeling functions and programmatic labeling. It's really about this bigger idea of pulling in all the information and being able to edit and iterate and develop it like software, you know, like you would with any other kind of piece of code. We have lots of projects, some of which we've published about, others with customers, where, you know, there are both manual annotation

and programmatic approaches being used. And in fact, that is actually something we build core workflows around. You know, on the 1 hand, how can you, you know, speed this up and also make it more auditable and adaptable, etcetera, through programmatic approaches?

But then how can you use your subject matter expert collaborators, many of whom are very busy? You know, we work with a lot of data science teams who are kind of having to beg for time from these partners who know how to label the data. How can you make better use of their time by being more targeted? You know, we actually have a manual annotation kind of focus mode in our platform, and we have a lot of workflows that support things like doing some programmatic iterations,

being guided to some of the error modes where you're getting stuck, and then sending those slices of the data out for manual annotation

or programmatic annotation in simple ways by the subject matter experts and back. And so the reality is that you almost always have both of these folks in the loop and you have a mix of some, you know, programmatic to speed things up, but then also manual review at very least for final testing and validation, which we talked about at the beginning and we thoroughly recommend.

So that's an important note to make because we work with data science teams as our primary kind of, you know, users and customers. But almost always, you know, 99% of the time, there's a kind of subject matter expert or annotation team who we're also trying to support and work with very closely. We're trying to save both of those groups time and also make the process more kind of manageable, auditable, etcetera. You know, back to the actual decision tree, I'd say 1 decision

point, you know, is kinda basic. It's just, are you struggling around, you know, the construction development of the training sets? Now

most people start to struggle over time because of things like drift and everything, but there are problems where, you know, you'll look and say, okay, look, I'm doing customer churn prediction, I already have those labels, you know, organically collected, or, you know, they're really cheap for me to get. I'm in the zone where I have, you know, 15 columns and I wanna predict column 15 from columns 1 through 14. Right? Structured, labeled data.

We can help there. We can help denoise the labels and help make them more adaptable, etcetera, but maybe that's not your biggest pain point. You know, a lot of the folks we're talking to, they have very unstructured data, and they don't have the labels, and that's a major pain point. Right? So we tend to gravitate there, also just because we like to,

you know, pick the biggest delta as a small company. And then, you know, once you're there, then I think a lot of what determines it is, you know, what is the cost associated with, you know, throwing people at the problem and kind of the old way? And, you know, the cost can be both kinda time and monetary, but also things around, you know, governance. You know, how do I actually

check the labeling of the data? How do I prove that it's bias free? How do I correct it and adapt it over time? And so, you know, if you have a really simple problem, you know, stop sign versus pedestrian, even simple problem

cases, and it's not, you know, again, not really a simple problem, but just to use it as a toy example. Yeah, you've got settings maybe where most people know how to label stop signs, and they don't change that much because they're kind of, you know, restricted from frequent change by regulation

versus if you're dealing with, you know, legal documents in a bank or clinical information or, you know, government and, like, all this kind of stuff where it's often very private and it's constantly changing and it requires special experts to to label it. And, you know, then, you know, you're looking at the effective cost of training data, labeling, and development that is prohibitive even for the world's largest organizations as, you know, we published about Google and others.

So that's a rough guide. You know, are you in this kind of land of unstructured unlabeled data? We think of this kind of iceberg under the surface of ML, you know, applications today where you're really gonna run into this right away.

And then, you know, are you especially gonna run into it in a painful way because you're not dealing with trivial problems that are kind of, you know, cheap and quick to label? That's a rough guide that at least we use to try to find, like, the highest value problems that we know how to solve. Yeah. And the evocative terminology that I think you had used when you first came out with the Snorkel project was calling it dark data, or sort of dark data that is yet to be sort of unearthed and turned into something valuable. Yeah. And it's still the case. I mean, you know, it's striking

the level of talent and technology that's kind of

coming into the enterprise via data science teams. I mean, I'm sounding like I'm just trying to suck up a little bit, and maybe I am, but I really mean it. You know, like, the talent is there, the spend is there, the tooling is there. I mean, especially, you know, both your vendor solutions and internal tools, but also just out in the open source. You know? But the number of teams that are kind of blocked on the data because they can't even start training the latest fancy machine learning model that's out there in the open source until, you know, a line of business team has spent, you know, several months labeling data.

Like, it's crazy. So that idea that there's this data that is dark for the broader organization, you know, because it hasn't been refined and classified and tagged and extracted by ML methods. But it's also really dark for the data science teams who just don't have it structured and refined in a way where they can use it. Predibase is a low code ML platform without low code limits.

Built on top of their open source foundations of Ludwig and Horovod, their platform allows you to train state of the art ML and deep learning models on your datasets at scale. The Predibase platform works on text, images, tabular, audio, and multimodal data using their novel compositional model architecture. They allow users to operationalize models on top of the modern data stack through REST and PQL, an extension of SQL that puts predictive power in the hands of data practitioners.

Go to themachinelearningpodcast.com/predibase today to learn more. That's Predibase. And so bringing it into the Snorkel Flow platform that you're building, I'm wondering if you can talk to some of the overall design and functional elements that you're providing there and the way that you approached the implementation and integration points for people who are using it and want to hook it into their overall ecosystem of data engineering or MLOps or custom internal tooling?

Integrations alone is a huge topic, but let me start with kind of the basic workflow that we support. So a lot of what we built is based around this kind of core, what we think of as data centric development workflow, where you start with labeling and more broadly developing your data programmatically. This is, you know, labeling. We talked about this idea of labeling functions of, you know, labeling via code. Other operators like transforming or augmenting data, slicing data,

sampling, relabeling, all these operators that, you know, we've published about and talked about, broadly make up these kind of very critical AI operations. But we'll start by talking about labeling today. Right? So you have this first, you know, area where you're developing these, and we call this studio. And a lot of this is based on, you know, offering a stack of interfaces, all the way from, you know, hosted notebooks and the Python SDK, by which you can just write these labeling functions as code, all the way up to, you know, auto generated

point and click, even kind of natural language interfaces that make it, you know, easier and faster, more accessible to write them. So that's the kind of, you know, stack and kind of studio or IDE there. Then we have this kind of weak supervision

component that is figuring out how to clean and combine and model those inputs. And then we include a full AutoML suite that gives, you know, basically feedback. Right? So we can come back to this; some of our users actually use this, and then we support the MLflow export format and others to basically take the models out, or even, more broadly, the applications, which can have multiple components.

Some just pull out the training data, but all of our users use this because it's the way that you actually get feedback. It's the way that you avoid flying blind. Right? If you're just labeling data, you know, you could be spending

especially if you're labeling manually, months labeling a part of the data space that is really not helping model performance, right, and missing critical error modes. So that key part of the loop of, you know, labeling data programmatically and or manually or via other sources like we talked about, cleaning it up, but then training a model to give fast feedback

about where do I go next. You know, both via active learning approaches, which kinda guide you to where the model is confused, but also a range of other analysis about, basically, where do I go next? That's the core loop. And so I'm simplifying quite a bit, and I'll note that a lot of our engineering and product effort, working with, you know, customers and users, goes into all the rest of the stuff that you need to tackle with these complex

data types and problem types. You know, let's say you wanna build a news classifier. A top 3 US bank built a news classifier on top of Snorkel Flow. You know? That was actually what we call an AI application. We have a blog post about this because it was not just a model. It was, you know, a model to tag entities in news feeds, and then a model to link them to canonical identifiers like stock tickers, and then a model to classify the semantic role. And a lot of these problems over, you know, natural language data and image data and others

really need to be broken apart. And it's a lot more complex than just building a single model. So there's a lot more there, but this basic loop of just, you know, labeling and developing your data, cleaning it up automatically, and then getting feedback from a range of analyses, including from kind of ones that are driven by models, that's the basic loop we support. Now all of this is accessible in a GUI,

and in the newest one that we just rolled out, literally, this whole loop happens, you know, automatically. You poke around your data, you develop these labeling functions or auto generate them and other operators, and then a model trains in the background to give you feedback about where to go next in real time.
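
Stripped down to open-source pieces (and reusing the hypothetical objects from the earlier sketches), that core loop looks roughly like the following; Snorkel Flow wires this together behind a GUI, so treat it as an approximation of the workflow rather than the platform's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.labeling import filter_unlabeled_dataframe

# 1. Programmatic labels from the label model (see earlier sketches).
probs_train = label_model.predict_proba(L_train)
df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train)  # drop rows every function abstained on

# 2. A quick end model purely for feedback, not the final production model.
vec = TfidfVectorizer()
X_feats = vec.fit_transform(df_filtered.text)
clf = LogisticRegression().fit(X_feats, probs_filtered.argmax(axis=1))

# 3. Score on the small hand-labeled dev split and pull out the errors.
preds_dev = clf.predict(vec.transform(df_dev.text))
errors = df_dev[preds_dev != Y_dev]

# 4. Inspect the errors, then go edit or add labeling functions and repeat.
print(errors.text.tolist())
```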

So that basic loop is there. But then one of the key design principles that we've worked around from the beginning is mapping everything in our workflow to a Python SDK, or to Python SDK endpoints, one to one, so that you can always kinda mix and match, especially with the sophisticated teams that we have the privilege of working with. But just in general, you know, there's open source tools, there's internal tools, there's vendor solutions. Like, we wanna plug in very easily. Right? So, you know, you can do all of this in Snorkel Flow and just, you know, ship out a package at the end that has the model or the application.

You can pull stuff in. You can pull stuff out. Right? So everything is accessible and drivable via the Python SDK, which is kinda how we, you know, we have other push button integrations, but that's how we make sure there's always kind of a way to mix and match and integrate for the developer.

To that point of integration too, I'm also curious how you think about working with the overall experimentation cycle of, I have this model that I'm trying to develop, trying to do it for the first time, so I'm going to see, you know, which layering of networks are going to give me the best results, and I'm going to tweak the labeling function to label things in a certain way to be able to generate a certain set of features for how the model wants to think about the problem space.

Okay. That didn't work, so now I'm going to go back and tweak the labeling function to regenerate the training set to rerun the model development and just working with that kind of flow of the machine learning process

and how that's maybe disjoint from the kind of software development flow that some people might be familiar with. Just to start with the first point, I mean, a lot of our design principles for the platform and the workflow really just derived from this view of a data centric AI development.

And as a side note, if I didn't, you know, make this clear enough at the beginning, I think now this phrase is getting used more as a buzzword, and people will shorten it to data centric AI. And I don't wanna, you know, play the stereotype of the pedantic academic, but, you know, not even in an academic sense, the real intention, at least from us, is to say data-centric AI development. Right? Because

AI, and inasmuch as people are mostly talking about ML these days, ML, has always been data centric. I mean, it's literally in the definition of machine learning. We mean development in terms of what's the kind of main object that you're developing with for the majority of your time. Is it the model and attributes of the model, or is it the data? So we build our workflow

around this idea that, you know, the data development is the main thing that you need to be able to iterate on, and the model development can be largely automated or kind of, you know, done more minimally. And you see this, I mean, with a whole bunch of trends that lead into the kind of model development being more automated.

Some of them are just, you know, first big wave is just deep learning models or representation learning models that take away a lot of the need for manual feature engineering in many cases. Then you've got also the convergence of architectures. Right? Increasingly, a whole range of tasks are accomplished by an increasingly consolidated set of model architectures.

And then you've got progress in AutoML where, you know, basically a lot of the kind of, you know, experimentation to find the right model or the right configuration and tuning of the model can be automated itself. So a lot of our building is done, you know, based on this bet that mostly you're gonna be wanting to iterate on the data and just kind of push a button, to train a standard architecture or do AutoML. Now, of course,

it's never that, you know, black and white, and we're not dogmatic about this. Right? So we offer some capabilities for doing model development. And then, you know, data-centric development works because there's been great progress with model-centric development. Right? So that's where we, you know, make it really easy to integrate with other platforms

or internal tools for doing that model centric development. But, again, our bet is that you're spending most of your time and you're getting kind of the the biggest bang for your buck iterating on the data versus kind of, you know, tweaking different configurations of a model.

But again, we're not dogmatic, depends on the situation, and that's again where integrations as well as tooling that we have and inherit from all the great open source progress or kind of, you know, hook onto is a practical way forward. And then the other element of the model development process is the question of explainability and what

impact are these different features that I'm pulling out of this labeled data set having on the overall performance of the model? And so I'm wondering how you have seen teams use tools such as SHAP or some of the other explainability

projects for being able to say, okay. Based on this model training, based on this dataset, this is the explainability of, you know, these are the networks that were being activated based on this set of data, feeding that back in to say, okay. I'm going to, you know, enhance the logic for how I label this particular attribute of my dataset or add more metadata to this object at the point of labeling to be able to get some better outcome from the model and just that overall cycle and how explainability

factors into this data centric approach to AI development. Explainability is a huge topic. And, again, you know, it's an exciting area. I mean, if you can't even have full agreement on the definition or the desiderata, then it's, you know, definitely an exciting intellectual space, and one that, on the kinda production and practical or commercial side, one should tread carefully around.

So explainability is a huge area, and I'm not the expert to speak to that progress. But what I'll say at a high level is that, yeah, I think one way to bucket it would be to say, okay, there's one definition of explainability around models, which is, can I explain in an interpretable way kind of why a model made a certain decision?

And then there's the part that we focus more on, which is, can I actually, you know, trace back model predictions to parts of the data and actually have, you know, lineage and pathways for editing and also for auditing? So I think that former one is, you know, and that's where there are approaches like SHAP and others for this, as well as, you know, tons of research.
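
For that first bucket, model explainability, a typical sketch with the shap library looks something like this (the model and dataset here are hypothetical stand-ins, and this is orthogonal to the labeling workflow discussed above):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# A stand-in tabular model whose individual predictions we want to explain.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# SHAP attributes each prediction to per-feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])

# Global view: which features drive the model's decisions overall.
shap.summary_plot(shap_values, X.iloc[:100])
```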

That's something that, strictly speaking, a lot of the stuff we build is orthogonal to. Right? We can work with any models and any techniques for explaining the decisions of those models. But a lot of where we focus, and where this whole programmatic approach offers a unique and often very significant advantage in a 0 to 1 kind of way, is this ability to actually,

you know, audit and trace lineage around your training data. Right? If you label a bunch of data by hand, you know, there's some interesting work on tracing model errors back to specific labels. But really auditing a manually labeled dataset is a very difficult thing to do, theoretically often as hard as or harder than the problem you're trying to solve by labeling the data.

And then going and correcting it or debiasing it or anything like that is equally difficult, because you have to relabel massive parts of, or potentially, the entire training set. When your training set is labeled by code, you can, you know, go back and just make edits to that code, relabel, and proceed. And so this ability to both have auditability and explainability of how your data is labeled, as well as to actually take action

in a really concrete way is one of the big advantages of our approach. So that's a little bit of a shift from traditional model explainability, but we find it to be very important and practical to have this kind of data explainability and, yeah, also ways to take action to correct or debias.
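As a concrete illustration of "training data labeled by code", here is a toy sketch using the open source Snorkel library's labeling-function API (not the commercial Snorkel Flow platform); the heuristics and data are invented for the example.

```python
# Toy programmatic labeling sketch with the open source Snorkel library.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1


@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages with URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_short_message(x):
    # Heuristic: very short messages tend to be legitimate.
    return HAM if len(x.text.split()) < 4 else ABSTAIN


df_train = pd.DataFrame(
    {
        "text": [
            "win cash now http://spam.example",
            "see you tomorrow",
            "check out http://offers.example today",
            "thanks again",
        ]
    }
)

# Apply the labeling functions and combine their noisy votes into training labels.
applier = PandasLFApplier([lf_contains_link, lf_short_message])
L_train = applier.apply(df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=0)
probs = label_model.predict_proba(L_train)

# Auditing or correcting the labels means editing a labeling function above and
# re-running these few lines; there is no hand-labeled set to re-annotate.
```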

Another aspect of what you're building at Snorkel AI is the fact that you did build it on top of this open source project that you built as part of your research, and I'm curious how you have been approaching the question of governance of that open source project and how you think about the

boundaries between which features live in that code versus which features live in the platform for the commercial feature set, and just how you're managing the community element of sustaining this open source project while still building a viable business around it? We're actually a little bit different from the kind of open core model that you're speaking to there. The platform Snorkel Flow that we build and sell commercially is, at this point, 100% new code compared to, at least, the open source repos we've had out there. There's one that I guess got a lot of attention. There are, you know, dozens, or I think essentially a hundred, that really were our research artifacts or research repos. Right? So, you know, those we're always going to maintain and keep out there, at minimum for our commitments to scientific reproducibility and for their value in helping to illustrate the concepts that we've published about out in the open for years and still publish about.

But they're not, you know, production open source projects, and they're distinct from the platform that we build today, which incorporates ideas from all of them over the last 7 years, plus a lot of new stuff, but is distinct. And so, in terms of the ways that you have seen your customers using the Snorkel Flow project and the Snorkel AI platform, what are some of the most interesting or innovative or unexpected applications that you've seen?

First of all, we've seen, even with, you know, kind of basic use cases, some really cool impact in deployment, in terms of saving hundreds of thousands of person hours of work that wouldn't have gotten done otherwise, across

clinical trial records and financial documents and insurance records and, you know, all kinds of stuff like that. The other creative applications, I think, have come out of this functionality we've leaned into around enabling kinda multiple models or operators to be pieced together.

You know, I gave one example before about a large US bank putting together a news analysis application that was, you know, tagging, linking, extracting. This isn't a novel thing; a lot of real applications, especially in NLP, are pipelines of operators. But, you know, opening it up has led to lots of creativity, breaking problems down into hierarchical decompositions and solving them that way. That's been really cool to see.
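To illustrate that kind of decomposition, here is a hypothetical, framework-free sketch of a small news-analysis pipeline built from simple operators (tag, link, extract); the operators and rules are invented placeholders, not Snorkel Flow's actual operator API.

```python
# Hypothetical pipeline of small operators (tag -> link -> extract); illustrative only.
from typing import Callable, Dict, List

Doc = Dict[str, object]


def tag_topic(doc: Doc) -> Doc:
    doc["topic"] = "finance" if "bank" in str(doc["text"]).lower() else "other"
    return doc


def link_entities(doc: Doc) -> Doc:
    # Naive "entity linking": capitalized tokens stand in for resolved entities.
    doc["entities"] = [w for w in str(doc["text"]).split() if w.istitle()]
    return doc


def extract_amounts(doc: Doc) -> Doc:
    doc["amounts"] = [w for w in str(doc["text"]).split() if w.startswith("$")]
    return doc


def run_pipeline(doc: Doc, operators: List[Callable[[Doc], Doc]]) -> Doc:
    for op in operators:
        doc = op(doc)
    return doc


result = run_pipeline(
    {"text": "Acme Bank reported $4B in quarterly revenue"},
    [tag_topic, link_entities, extract_amounts],
)
print(result)  # the document with topic, entities, and amounts attached
```

Each operator could just as well be a trained model; the decomposition is what makes the overall problem tractable and auditable.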

A lot of what our recent learnings have been over the last year or two is just the criticality of looping these other stakeholders in, right, of having not just data scientists but subject matter experts and annotators in the loop, giving signal in various different ways, tags, comments, labels, labeling functions, labeling function ideas, operators, etcetera, and learning to better support those loops has been

very interesting in terms of learnings around how to build the right workflow. And in terms of your experience of going from research in academia, building this open source project as an artifact of the work that you were doing, and then building a business around those core ideas that you were researching and experimenting with, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in that process? Tons, obviously.

I don't think I'd get away with even trying to claim that I knew even a fraction of the requisite things walking in the door. And I'm not saying this as a shameless pitch for grad school, but I think there were a lot of lessons that, with the appropriate and very critical grains of salt or modifications, did transfer, around how you communicate to users and collaborators, how you go about actually, you know,

getting them successful with new techniques. A lot of the learnings have been asking, how can we adapt the ways we went about things to this different context? We find that there's obviously a ton to learn, but there are some similarities, right, in how we engaged with users on the academic side of not just throwing an algorithm over the wall, but trying to understand,

you know, help them scope problems, help them set up great metrics, scope how to evaluate. We have a similar motion in our process of engaging new users on the company side now too. Biggest learnings?

Some things are obviously slower out in the commercial world, but some things can be faster too, which is pretty cool. So, you know, there are some places where I've been blown away by how quickly stuff can actually reach production and have impact.

I guess nothing new there. Right? When you're on the academic side, you can produce things at times more quickly, but they're meant to be upstream. They're meant to spread ideas. Right? And some of the applications take a little longer to kind of diffuse down.

Whereas when you're out in the field, sometimes things are slower and there's more process and tougher requirements and everything, but then you can just ship stuff to prod and it's out there. We've been fortunate to work with all these Fortune 100 companies and government agencies and other folks who have massive scope of impact. So you get to make a change there and you see it have

pretty huge impact. It's extremely exciting. It's all differences, I guess that's the generic thing to say, but it's been super cool to see the different ways in which we've both had new learnings and new access to getting hands on with customers and users, in ways that were harder in academia, as well as new opportunities for actually shipping ML in impactful ways.

And we touched on this a little bit earlier, but what are the cases where Snorkel is the wrong choice and somebody may be better suited with just manual labeling approaches? The first answer I'll give is that we include full capabilities for manual labeling, so I would say, with that bifurcation, never, with my sales hat on. But for the broader question of when it's not the best fit, I mean,

one of the ways that we're trying to make that easier for everyone to answer is we do a lot of work to templatize various use cases, which both lets us offer more push-button, pre-templatized support and also helps guide to where we're a good fit. Right? We're a very broad development platform. We've published for years, in true academic fashion, kind of breadth first, on everything from core ads tech to self driving to genomics to all kinds of stuff. On the company side, we try to use this templatization again, both to improve the experience for core use cases, but also to help guide to

where we're best suited in a really granular way. That sounds like a non-answer, but it's really to say, look, we have this library of templates that we continue to grow and add to, and it's meant to help address that question of, hey, where is this platform a really good fit today, and where is it either not ready or not the best approach, via that kind of use case granularity. Are there any other aspects of the Snorkel Flow

platform and the business of Snorkel AI and just the overall space of programmatic data labeling and data-centric machine learning development that we didn't discuss yet that you'd like to cover before we close out the show? There's too much exciting stuff that both the team at Snorkel has worked on or is working on actively and that others in the space are working on. It's really exciting to see this idea

really at its core of just saying, hey, as ML people and AI people, we should pay more attention to the data and the development processes, both existing and emergent, around the data. You know, that shift is extremely exciting.

There's lots of cool ideas out there. You know, and I touched on some of the ones that we've talked about, hey, it's not just labeling. It's augmenting and slicing and shaping and sampling and all these things that we've published over the years, and that are either in or making their way into Snorkel Flow, our platform. But there's also just a growing amount of exciting work out there and, yeah, too much to enumerate. So maybe I'll leave it at that. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest barrier to

adoption of machine learning today. I'd say it's connecting, you know, all the subject matter experts and expertise that a lot of folks have in their colleagues, their organizations, you know, their existing models and resources,

and, you know, the new task you wanna accomplish. Having the only conduit for that be arduous manual labeling of data is one of the things that I think has slowed down a lot of progress in ML, and opening that bottleneck up and creating new conduits is, you know, the even higher level kind of phrasing. But I have to end by saying, you know, it's the data. The data is the blocker, but it's also, if you handle it right, the enabler and the interface. So I'll round out on that note. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Snorkel. It's definitely a very

interesting product and platform, and it's great to see that you have been able to provide a way for people to get over that initial hump of needing access to labeled training data to be able to start exploring AI and making it a reality for their organization. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Tobias, thank you again so much for the invite today, this has been an awesome conversation. I really appreciate it. Looking forward to our next one. Absolutely.

Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
