
Accelerate Development And Delivery Of Your Machine Learning Projects With A Comprehensive Feature Platform

Aug 06, 2022 · 51 min · Ep. 6

Episode description

Summary
In order for a machine learning model to build connections and context across the data that is fed into it, the raw data needs to be engineered into semantic features. This is a process that can be tedious and full of toil, requiring constant upkeep and often leading to rework across projects and teams. In order to reduce the amount of wasted effort and speed up experimentation and training iterations, a new generation of services is being developed. Tecton first built a feature store to serve as a central repository of engineered features and keep them up to date for training and inference. Since then they have expanded their set of tools and services into a full-fledged feature platform. In this episode Kevin Stumpf explains the different capabilities and activities related to features that are necessary to maintain velocity in your machine learning projects.
Announcements
  • Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
  • Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
  • Do you wish you could use artificial intelligence to drive your business the way Big Tech does, but don’t have a money printer? Graft is a cloud-native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs. No machine learning skills required, no team to hire, and no infrastructure to build or maintain. For more information on Graft or to schedule a demo, visit themachinelearningpodcast.com/graft today and tell them Tobias sent you.
  • Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix and track their data across the ML workflow (pre-training, post-training and post-production) – no more Excel sheets or ad-hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
  • Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about the role of feature platforms in your ML engineering workflow
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what you mean by the term "feature platform"? 
    • What are the components and supporting capabilities that are needed for such a platform?
  • How does the availability of engineered features impact the ability of an organization to put ML into production?
  • What are the points of friction that teams encounter when trying to build and maintain ML projects in the absence of a fully integrated feature platform?
  • Who are the target personas for the Tecton platform? 
    • What stages of the ML lifecycle does it address?
  • Can you describe how you have designed the Tecton feature platform? 
    • How have the goals and capabilities of the product evolved since you started working on it?
  • What is the workflow for an ML engineer or data scientist to build and maintain features and use them in the model development workflow?
  • What are the responsibilities of the MLOps stack that you have intentionally decided not to address? 
    • What are the interfaces and extension points that you offer for integrating with the other utilities needed to manage a full ML system?
  • You wrote a post about the need to establish a DevOps approach to ML data. In keeping with that theme, can you describe how to think about the approach to testing and validation techniques for features and their outputs?
  • What are the most interesting, innovative, or unexpected ways that you have seen Tecton/Feast used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tecton?
  • When is Tecton the wrong choice?
  • What do you have planned for the future of the Tecton feature platform?
Contact Info
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Links
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra / [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/)

Transcript


Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine learning. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected.

Get started quickly using their built in library of checks for testing and validating your model's behavior and performance and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to learn more and get started.

Do you wish you could use artificial intelligence to drive your business the way big tech does but don't have a money printer? Graft is a cloud native platform that aims to make the AI of the 1% accessible to the 99%. Wield the most advanced techniques for unlocking the value of data, including text, images, video, audio, and graphs. No machine learning skills required, no team to hire, and no infrastructure to build or maintain.

For more information on Graft or to schedule a demo, go to themachinelearningpodcast.com/graft. That's graft, and tell them Tobias sent you. Your host is Tobias Macey. And today, I'm interviewing Kevin Stumpf about the role of feature platforms in your ML engineering workflow. So, Kevin, can you start by introducing yourself? Thanks for having me. I'm Kevin. I'm the cofounder and CTO of Tecton, which offers an enterprise feature platform. And before starting Tecton,

I was at Uber working as a tech lead on Uber's machine learning platform. And do you remember how you first got started working in the area of machine learning?

Yeah. Outside of work in the field in grad school and before, the first time that I really dove in in industry was at Uber, 6, 7 years ago now, when I joined the machine learning platform team, where we built this central ML platform called Michelangelo, really with the intention of giving all the data scientists and software engineers a central system that they could go to to develop machine learning models and actually get them into production within just a couple of hours.

Because without the platform in the earlier days, it just took several months to get any machine learning model into production for a lot of reasons that I'd be happy to go into, but it just took forever. And so we needed 1 centralized,

standardized platform that could just automate away a lot of challenges around productionizing machine learning. And so we built Michelangelo, and that led to this nice Cambrian explosion of machine learning where all the different teams were now able to, with pretty low activation energy, get their models into production for use cases like fraud detection or recommendation systems,

more dynamic pricing, ETA predictions, and you name it. You mentioned the term feature platform in the introduction of the work that you're doing at Tecton. And I know that at the start, Tecton was built around this core of a feature store. I'm wondering if you can just give your definition of what you mean when you say feature platform and some of the components and supporting capabilities that are necessary to be able to fit that definition.

Yeah. Definitely. First off, a feature platform really is a piece of data infrastructure that sits in your broader data stack, where you've got all your compute engines and your data storage technologies, etcetera, to basically solve all the data problems of your organization.

And this data or this feature platform makes it easy to centrally organize your organization's machine learning features, and it does this by helping you turn basically your raw data that you have into machine learning features.

That raw data may come from batch data sources like your data warehouse, like Snowflake, could come from a data lake, could also come from streaming data sources like Kafka or Kinesis, or it could even come from real time data sources like an API endpoint or just data that you happen to have only in memory. And then it makes these machine learning features available for

training purposes so that your data scientists can actually generate training datasets in their Jupyter Notebooks where they do that training. And then it also serves these features to the model that's actually running in production that needs to fetch features at typically fairly high scale and very, very low latency. So it's a very different query pattern than the query pattern of the data scientist who needs to generate these gigantic training datasets.

And having explained where the feature platform sits and high level the problem that it solves, the components of a feature platform are, 1, it's the feature store, 2 is a feature repository, 3 is feature pipelines, and 4 is monitoring. Now let's dive into those. A feature store is really an abstraction layer on top of underlying data storage technologies. And so a feature store basically takes pre computed machine learning features

and stores them in an offline store and an online store. And now the question is, well, why would it be an abstraction layer on top of an offline store and an online store? The reason for that is that for machine learning features, you've got these 2 types of consumers that I mentioned earlier. You've got ML training consumers, and you've got model serving consumers.

And, specifically, the model serving consumers that run in the production system that need to make predictions at low latency and high scale, we typically refer to this as operational ML, they need to read these features from the online store because only the online store would actually be able to serve these features within a handful of milliseconds, while the offline store keeps a record of the features

and what they look like at any given point of time in the past. Because, typically, when you generate a training data set, you wanna know what a feature looked like over the last couple of months so you can actually train your ML model on it. And then this feature store could be an abstraction layer on, say, Snowflake as the offline store for your offline features, and DynamoDB

as your online store from which you serve the features for online serving. But that's the key. It's like an abstraction layer on an offline store and an online store, and there are different implementations, different online and offline stores that you may use here to store the offline and online features. So that's that. That's what the feature store is. An abstraction layer on top of an offline and online store to store and serve precomputed machine learning features.
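
To make that abstraction concrete, here is a minimal Python sketch (an illustration only, not Tecton's implementation) of a feature store that writes each precomputed value to both an offline history and an online key-value store, and serves each consumer from the appropriate side:

```python
import datetime

class FeatureStore:
    """Toy abstraction over an offline store (full history, for training) and an
    online store (latest values, for low-latency serving). Real deployments would
    back these with e.g. Snowflake/S3 offline and DynamoDB/Redis online."""

    def __init__(self):
        self.offline_rows = []   # append-only: (entity_id, feature, value, timestamp)
        self.online_kv = {}      # (entity_id, feature) -> latest value

    def write(self, entity_id, feature, value, ts=None):
        ts = ts or datetime.datetime.utcnow()
        self.offline_rows.append((entity_id, feature, value, ts))  # history for training
        self.online_kv[(entity_id, feature)] = value               # latest for serving

    def get_online_features(self, entity_id, features):
        # Serving path: point lookup, expected to answer in a few milliseconds.
        return {f: self.online_kv.get((entity_id, f)) for f in features}

    def get_historical_features(self, feature):
        # Training path: bulk scan of history to assemble a training dataset.
        return [r for r in self.offline_rows if r[1] == feature]

store = FeatureStore()
store.write("restaurant_42", "orders_last_30_min", 17)
print(store.get_online_features("restaurant_42", ["orders_last_30_min"]))
```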

Feature repository is basically a catalog that you can use to browse and discover existing features where you see, hey. What models actually consume which features?

Who is the owner of this feature? So you can get a sense for, well, can I actually trust this ML feature or not, which entity is an ML feature associated with, like, a user or a particular product, etcetera? It's basically a component that stores all the metadata around an ML feature. It makes it easy to discover them and identify whether they may actually be a good fit for your use case or not. And then feature pipelines is this component which actually schedules

data pipelines at the end of the day that read your raw data coming from the batch source, the stream source, or the real time source, and turns it into machine learning features and then stores them in the feature store. And, again, if it stores it in the feature store, it would store it in the offline store and the online store so that the feature is available for offline serving and online serving. And then the 4th component

is the monitoring component that basically takes a look at the features and the health of them to ensure that your model is gonna be able to continue to make healthy predictions. Because, typically, with machine learning, it's not the model that goes off the deep end. It's typically that you have data problems, where suddenly the upstream data may have an outage, and your machine learning features may start to go stale, or they may start to drift, or suddenly the value distributions

look like nothing of what you expect them to anymore. And so those are the 4 components. In terms of the overall life cycle of machine learning models and being able to run them in production, how do these features factor into going from idea to production for these ML models? Yeah. So a lot of the work of data scientists is actually involved in identifying data sources that they have access to and turning that data into features

that may have a predictive correlation with the target output that they want to predict. So let's actually make this more concrete. I probably should've done this earlier. But imagine that, like, a common Uber Eats example was trying to predict how long it's gonna take until

your order arrives at your doorstep. And to predict that ETA, the time that it takes until the order is gonna be delivered, 1 pretty important feature would be the trailing number of orders a given restaurant has received over the last 30 minutes, because that feature would be a proxy for how busy the kitchen is right now and, therefore, how long it's gonna take most likely until your order is

prepared so that it can be picked up. And, of course, there is a bunch of other features, but that is a very important 1 as a proxy for the busyness of a kitchen. And identifying these types of features, implementing them in a clean and well tested way, and implementing them in a way where they can actually be served reliably in production for serving purposes is a lot of work. And all of that work is typically

the main reason why machine learning projects fail. And so imagine you now have a central catalog of trusted ML features that run in production and already serve other use cases. If you're a data scientist tasked with solving a new ML problem, you now can actually explore this catalog of existing features. And if you're lucky, you can just have your pick of the best features there and solve your machine learning problem without having to introduce

any new features or just a much smaller number of new features that you otherwise would have to introduce. And so it can solve the cold start problem basically for a lot of machine learning problems.
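
For reference, the trailing-window feature Kevin describes can be sketched in a few lines of pandas; the column names and data here are made up for illustration:

```python
import pandas as pd

# Hypothetical raw order events; in practice these would come from a stream or warehouse.
orders = pd.DataFrame({
    "restaurant_id": ["r1", "r1", "r1", "r1", "r1"],
    "order_ts": pd.to_datetime([
        "2022-08-06 12:00", "2022-08-06 12:05", "2022-08-06 12:20",
        "2022-08-06 12:40", "2022-08-06 12:55",
    ]),
})

# Trailing count of orders in the 30 minutes up to each order, a proxy for
# how busy the kitchen is at that moment.
orders = orders.sort_values("order_ts").set_index("order_ts")
orders["one"] = 1
orders["orders_last_30_min"] = (
    orders.groupby("restaurant_id")["one"]
    .rolling("30min").sum()
    .reset_index(level=0, drop=True)
)
print(orders[["restaurant_id", "orders_last_30_min"]])
```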

So being able to have the features readily available, you have the repository to discover what features there are. You have the pipelines to make sure that the values of those features are up to date. For teams that don't have that feature platform in place, what are some of the points of friction that they're going to experience along the journey of saying, I have this use case for machine learning. I have the

raw data available to be able to build these models. Now I actually want to go from this idea through to it's running in production. I'm able to make sure that it's healthy, and I can retrain it in the event that I have concept drift because the world is shifting around me.

So you already made 1 big assumption, which is the data scientist actually has the data, and they have access to it. So that's great. That's 1 big first problem that a lot of companies first have to solve to be successful with an ML use case. But let's assume, okay. They have access to raw data. That's great. Now, what are the challenges that they have to go through to actually get their features built and get them into production? Let's go through some concrete examples.

Imagine that the raw data is actually in your data warehouse, and imagine that your ML use case is an operational ML use case where you actually make predictions in the production system, maybe every time your customer opens the mobile app or clicks a button in the app or opens up a new website. So you need to make those at low latency. Now the challenge is if this raw data just is in the data warehouse and your back end service needs to make a prediction,

the question is where does this back end service get the machine learning feature from? It cannot just execute a query against the data warehouse because data warehouses are not intended to be queried concurrently a thousand times per second and return the results in a couple milliseconds.

They're architected for a completely different type of workload. So what do you do? Well, what you now have to do is you have to basically build data pipelines that on a given schedule, say hourly or daily or weekly, run your data pipeline that

transform your raw data into your ML features. And then, crucially, now you need to move these ML features into an online store, like your DynamoDB or Redis or something like that, from which you can basically now fetch in production the ML features. You can think about this online store like a cache, and so you need to maintain and build this data pipeline

that hydrates the cache that your application in production can fetch the features from. And that's the easy case. That's only if you're transforming data from 1 batch data source, like a data warehouse or data lake, into ML features. It gets significantly more complicated if you introduce streaming features, which is important for low latency ML use cases like fraud detection or recommendation systems that take the user's activity of the last couple seconds into account.
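
A minimal sketch of that "hydrate the cache" pattern, with a plain dict standing in for the online store (in practice it would be DynamoDB or Redis, and the job would run on an hourly or daily schedule):

```python
import datetime
import pandas as pd

online_store = {}  # stand-in for DynamoDB or Redis, keyed by entity id

def run_feature_pipeline(raw_orders: pd.DataFrame, as_of: datetime.datetime) -> None:
    """Scheduled batch job: turn raw warehouse rows into per-entity feature values,
    then push the fresh values into the online store so the production service
    never has to query the warehouse directly."""
    window_start = as_of - datetime.timedelta(minutes=30)
    recent = raw_orders[(raw_orders.order_ts > window_start) & (raw_orders.order_ts <= as_of)]
    counts = recent.groupby("restaurant_id").size()
    for restaurant_id, value in counts.items():
        online_store[restaurant_id] = {
            "orders_last_30_min": int(value),
            "computed_at": as_of.isoformat(),
        }

raw = pd.DataFrame({
    "restaurant_id": ["r1", "r1", "r2"],
    "order_ts": pd.to_datetime(["2022-08-06 12:40", "2022-08-06 12:55", "2022-08-06 12:10"]),
})
run_feature_pipeline(raw, datetime.datetime(2022, 8, 6, 13, 0))
print(online_store["r1"])  # the serving path reads this instead of hitting the warehouse
```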

Now you're dealing with having to build and maintain streaming data pipelines that continuously process data on the stream, say, coming from a Kafka source or a Kinesis source,

that transform those, aggregate them, and then make them available in the online store again. And not only that, you now also need to make the stream process data available in the offline store so that later on, you can actually generate a training dataset where you have confidence that ML features were calculated the exact same way online

as they were offline. Because if you don't do this, you introduce, and that's a whole other problem area, what's called train-serve skew, which happens when you calculate machine learning features 1 way offline and a whole other way online. If you do this, then you're in trouble. And what I mean with offline is basically when you generate your training data, that's when you generate your ML features offline.

And online is the calculation of the features that you serve to your model that's running in the production system to actually drive your recommendation or fraud detection. If the features aren't calculated the exact same way offline and online, you introduce train-serve skew because you're training your model on a representation of the world that it will never see like this in production when it actually makes predictions.

Let's again look at the cumulative 30 minute order count in a restaurant. Imagine that by accident, you're actually calculating the cumulative count over the last 300 minutes offline, and online, you do it over the last 30 minutes. Now when you train the model, it's used to seeing numbers that are basically 10 times larger than anything that it would ever see in production.

So you can expect that the predictions in production are just gonna go off the deep end and are gonna be completely useless. And 300 versus 30 minutes is a pretty obvious deviation that is 1 of the easier deviations to find, but there are significantly trickier ones with different rounding errors or filling in null values completely differently. But whenever you introduce the skew between the 2 systems, you're in trouble, and it's pretty hard to debug these types of problems.
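
One common guard against that kind of skew is to derive both paths from a single feature definition so the window cannot silently diverge; here is a hedged sketch with hypothetical names:

```python
from datetime import datetime, timedelta

# Single source of truth for the feature definition. Both the offline path
# (generating training data) and the online path (serving at prediction time)
# import this, so the window cannot silently drift into 30 vs. 300 minutes.
ORDER_COUNT_WINDOW = timedelta(minutes=30)

def orders_in_window(order_timestamps, as_of):
    """Shared transformation: count orders in the trailing window ending at `as_of`."""
    return sum(1 for ts in order_timestamps if as_of - ORDER_COUNT_WINDOW < ts <= as_of)

events = [datetime(2022, 8, 6, 12, 40), datetime(2022, 8, 6, 12, 55)]

# Offline: replayed against historical prediction timestamps to build training rows.
print(orders_in_window(events, datetime(2022, 8, 6, 13, 0)))   # -> 2

# Online: called with the current clock time when the model asks for features,
# so training and serving see the same distribution.
print(orders_in_window(events, datetime.utcnow()))              # -> 0 (events are old)
```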

That sounds a lot like the issues that came out with people trying to effectively adopt the Lambda architecture from the data engineering ecosystem of having to figure out, how do I build the same logic in 2 completely different systems that are working at different time scales.

Yes. That's exactly right. That's exactly right. And there are a lot of commonalities here actually where, like, with the Lambda architecture, where off of a stream, you may be calculating representations of your transformation in real time, and you know that the accuracy is not a 100% perfect, but you know that at midnight, you're gonna run a batch job to basically pave over the stream calculated somewhat accurate information and have a much more accurate representation.

And there are a lot of similarities here where, like, our feature platform, for instance, supports, for these types of streaming features, continuous calculation of the streaming features off of the stream, loads those into the online store, but then it also connects to an offline store where we look at a log of all the events that have ever been processed on the stream,

which allows us to afterwards correct the online or the offline store if, say, you're dealing with late arriving data or something like that. It's definitely very interesting. And so in terms of the feature platform that you're building at Tekton, I'm wondering who the target personas are for who's actually going to be interacting with it and who you see as the kind of entry point in an organization to actually adopting the capabilities that you're providing at Tekton?

The main personas that use Tecton are data scientists and data engineers, or the hybrid of the 2, the ML engineers, the rare unicorns out there. The data scientists are typically the ones who engineer new features and train ML models with the features that they've engineered or that they've used from others, and the data engineers are the ones who are ultimately responsible to make sure that the data pipelines are reliable and

that they're performant and cost efficient, etcetera. They're dealing with the more heavyweight infrastructure behind the scenes. And Tecton helps both of them. The way I always look at it is that without Tecton, you typically need 1 data engineer per 1 data scientist to make them productive. Sometimes it's even more than this 1 to 1 ratio, and you need 2 data engineers per data scientist.

And with Tecton, you can drastically lower this ratio and then, say, have 1 data engineer actually support 10 data scientists because the feature platform automates away a lot of the boring work, the boilerplate work that a data engineer has to do over and over again to really productionize a data scientist's work that they've just done in a Jupyter Notebook.

And so those are the 2 target personas or, of course, the ML engineer who can do it all, engineer features, train models, put it all into production. You also ask what's the entry point to an organization.

And, typically, it's either an ML platform team, which a lot of organizations have been starting to form over the last couple of years. And whoever is the technical lead of the ML platform team typically reaches out to us and is the 1 that brings Tecton in. Or some organizations don't have a centralized ML platform team, but they just have an ML product team, which directly sits with, say, the software engineering team or whichever team it is that solves a particular

end user problem, like fraud detection or recommendation system. And then it's the technical lead, the data science lead of the team who engages with us and then brings us in and works with us. And the other interesting aspect of the overall space for machine learning is that there is a growing list of capabilities that ML teams need to be able to manage the full life cycle of their projects.

And I'm wondering what you see as the core capabilities and the stages of the ML life cycle that you're addressing with Tecton and which of those aspects of that workflow you are consciously deciding not to try and address and deferring to other tools in the space to handle those pieces? Let's look at the different stages. Basically, you wanna collect the raw data, then you want to turn the raw data into features, then you train your model, then you put your model into production,

and then it's running in production. You're monitoring it. Now you're making predictions, and you wanna look at, hey. Are those predictions actually good? Are they leading to the outcomes that you're hoping for, or are they not? And based on what you're observing in production, you may wanna tweak your features. You may wanna tweak your model, etcetera, to make sure that your model is as good as it can be. So you've got basically this loop starting with the raw data collection

all the way to making a prediction then observing again what's happening and bringing it back. Tekton helps with the training, deployment, and serving part. And specifically here, Tekton helps with the features part. So Tekton generates the ML features. It serves the ML features, and then it monitors the machine learning features.

What it doesn't do is it doesn't help you with the training of your machine learning model. Like, you should use whichever machine learning framework you're most comfortable with, the TensorFlows of the world, the PyTorches, etcetera. That's all orthogonal to Tecton. Tecton also today does not host the machine learning model; for that, you'd still use the SageMakers of the world. And then finally, Tecton today as well does not help you with the monitoring of the model performance itself.

It focuses on monitoring the performance of your features. Are those drifting? Are those starting to get stale or not? So long story short, everything about the features Tecton helps you with, everything about the model itself right now, you have to use another tool for. Digging into the platform itself, as I mentioned, you started off with the feature store as the

initial product that you were building at Tekton. I'm wondering if you can just talk to some of the overall approach that you've taken to how to design this broader feature platform and some of the feedback that you've gotten from your customers that has led you in that direction and some of the ways that just the overall goals and capabilities

of the Tekton platform have changed or evolved since you started working on it? Yeah. Definitely. So if we first look at how do we design the feature platform, we kept a handful of key principles in mind. 1st and foremost, we, from the get go, said, okay. The most important thing of the system is that it has to be entirely reliable. We cannot compromise on reliability if we want our customers to use this for fraud detection or other really, really high

importance use cases that just must not go down. So we can never compromise on reliability, and that's closely followed by simplicity. At the end of the day, we're all about making the application of machine learning easier. That's our goal. That's our mission. That's what we're doing all this work for. And if we have an extremely complicated product, then we're going to fail on that mission.

Now these 2 principles aside, as we architected Tecton, 1 thing that was extremely important for us is that Tecton is not another black box that you bring into your data stack. It's not a whole other data warehouse. It's not a whole other online store. It's not a whole other transformation engine that's only concerned with machine learning. That's not what it is. Instead, Tecton leverages

our customers' existing data infrastructure. It's really just an abstraction on top of the existing data storage technologies and data compute technologies that a customer already knows and loves and uses day to day. To make this more concrete: if a customer is a Snowflake customer, Tecton uses Snowflake as the underlying offline store. If a customer uses Redis or Dynamo, then we use either 1 of those 2 stores as the online store.

If, however, a customer is all in on Databricks or Spark, then Tecton connects to our customers' existing Spark cluster or Databricks cluster and runs the data engineering jobs there. So you can really think about it as Tecton enables your broad data stack, your generic data stack, for machine learning and brings the best practices around production machine learning to this data stack.

And then finally, talking about best practices, another core product principle of Tecton is that we want to bring software engineering best practices to machine learning. This means that everything that the Tecton platform drives is defined as code. So you can apply the DevOps or GitOps principles to machine learning as well. So all the feature definitions, they're actually stored in files in GitHub or in a Git repository, so you've got full version control.

Any changes to those configuration files are rolled out using a CLI similar to how you would manage, say, Terraform configurations and roll those out. So we're big believers in as-code approaches and really applying the DevOps principles to the development and the deployment life cycle of your entities and of your systems. And then you also ask, well, how did we evolve the capabilities of the product?

The goal always stayed the same, which is let's make the development and deployment of machine learning as easy as we possibly can based on the lessons that we learned at Uber many years ago. Initially, when we started the product, we basically only had a Spark integration, and we only supported batch features and streaming features.

And those batch features, they would always only be stored in a data lake like S3, and they could only be stored in DynamoDB as the online store. That was the first architecture. That was the first set of components and, really, the data architecture implementations that we supported for our first customer, Atlassian, back in the day. And we've since widely expanded

the data architectures that we can nicely integrate with because there isn't just 1 good data architecture out there. There are multiple modern data architectures that enterprises from SMBs to mid market to larger companies have established, and we want to make sure that we can play nice with sane, good modern data architectures. As a result, we've launched our deep integration with Snowflake. So in case our customer is just going all in on Snowflake, well, Tekton fits

in very nicely here as well. We've also added support for Redis so that for customers who care about really, really high scale use cases, making tens of thousands, hundreds of thousands of predictions per second, Redis is a significantly more cost-performant choice than DynamoDB. Another thing that we had added since then was we stumbled across this with an insurance company customer of ours where

years ago, we only supported batch and streaming features, and we realized that a lot of ML features actually need to be calculated in real time, meaning more real time and faster than you even ever could off of a stream. So you wanna turn a customer's GPS location into a geohash or

resolve their IP address to a city or something like that. Like, you wouldn't funnel all that information through a streaming system and then calculate it after. That's gonna take several seconds or minutes sometimes. And these types of transformations, you really need to make basically in memory at request time. And so we added a capability to manage and execute these real time transformations, and you can think about it in a way somewhat similar to AWS Lambda.
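
A rough sketch of that request-time pattern (the function name and the coarse location bucket standing in for a real geohash are illustrative assumptions, not Tecton's API):

```python
def request_time_features(request: dict) -> dict:
    """Hypothetical request-time transformation: computed in memory in the serving
    path, on data that only exists in the incoming request and would be far too
    slow to round-trip through a batch or streaming pipeline."""
    lat, lon = request["lat"], request["lon"]
    # Crude stand-in for a geohash: bucket the GPS coordinates into a coarse grid cell.
    location_bucket = f"{round(lat, 1)}:{round(lon, 1)}"
    cart_size = len(request.get("cart_items", []))
    return {"location_bucket": location_bucket, "cart_size": cart_size}

print(request_time_features({"lat": 37.7749, "lon": -122.4194, "cart_items": ["a", "b"]}))
```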

Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building natural language processing models to programmatically inspect, fix and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts.

Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10x faster ML iterations. Galileo is offering listeners of The Machine Learning Podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to themachinelearningpodcast.com/galileo and request a demo today.

And so for somebody who is using the Tecton feature platform as the basis for building their machine learning models and putting them into production and maintaining the features that they're using for feeding those models as they continue to run in production. What is the workflow for being able to actually go from,

I am a data engineer or an ML engineer, and I'm going to provide access to these datasets to the feature platform. And I am a data scientist. I'm going to define a feature or a set of features, and I want those to be discoverable by somebody who's going to build a model and just that overall workflow of getting to production with these various steps and the inherent complexity of the space and trying to tame that and

encapsulate that in a way that each individual party is able to get their job done without going insane. I think we can split those workflows into the initial onboarding workflows of actually getting started with Tekton and then afterwards, the day to day workflows of creating new features

using existing features. So the first one, that's just onboarding: if you're just getting started with Tecton, what you have to do is you have to connect Tecton to your main data platform, which could be Databricks or Snowflake, and that's basically as simple as giving us an API key to the data platform and then giving Tecton access to the data sources

that you want to be able to generate features from, which could be data lakes, it could be streams like Kafka or Kinesis. And for that, you basically have to register those data sources in Tekton, configure the secrets and any end points

that Tecton needs to know about so that it actually can connect to those. And then you're off to the races. Afterwards, Tecton can actually connect to those data sources and execute data pipelines on your data platform to actually turn that raw data into ML features.

Now day to day for the data scientists, they would just be working in their preferred data science environment, whether it's SageMaker Notebooks or Jupyter Notebooks or whatever it is. And in there, they'd be using a Tekton SDK, which is just a Python SDK, to pull in existing features, generate training datasets, and then use those training datasets to train a model, or they would be creating new features in their Jupyter Notebook. And once they have 1 that they are happy with,

then they would bring this into Tecton. And the way that you bring this into Tecton is, as I mentioned earlier, all the features are managed as code in files that are stored in a Git repository. And so, basically, what you have to do is you just have to edit 1 of those files, which happen to be Python files as well, and create what we call a feature view, which is just a Python function, which encapsulates your feature transformation code, which could be SQL or PySpark or simple Python code.

And then afterwards, you would register that file with Tecton, and you're then able to actually use this feature for training purposes or use it in production.
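
As a rough illustration of such a feature-view file, modeled on this description: the structure below is a sketch, and the names and the `tecton apply`-style CLI step are assumptions rather than Tecton's exact SDK:

```python
# features/restaurant_orders.py -- hypothetical feature-repository file, checked into Git.
# Decorator and parameter names are illustrative only; a real repo would import from the
# tecton package and reference registered data sources and entities.

def restaurant_order_count_30m(orders_table: str) -> str:
    """Feature view: trailing 30-minute order count per restaurant.
    The body is ordinary transformation code (SQL here; PySpark or Python also work)."""
    return f"""
        SELECT restaurant_id,
               COUNT(*) AS orders_last_30_min,
               window_end AS timestamp
        FROM {orders_table}
        GROUP BY restaurant_id, window_end
    """

# After review, the change is rolled out with a CLI command (something along the lines
# of `tecton apply`), and the feature becomes usable for training and online serving.
```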

As far as the syntax and API design for that actual feature development workflow, I'm wondering what are some of the design elements and considerations that went into figuring out how to build that interface in a way that would fit well with the approach that data scientists and ML engineers were already using for working with their model life cycle?

The most important thing for us was to ensure that our customers don't have to learn a whole new language. If they already know SQL or Python, then they should be able to use exactly those very same languages and the exact very same compute engines with Tecton as well. So they don't have to learn a whole new DSL or a whole new system to be productive. Now that said, we also decided that for some very common types of ML features, we can provide some syntactic sugar to make common features significantly

easier to express and use. So for instance, if you want to use a time window aggregation over the last 30 seconds, 30 minutes, whatever it is, expressing those types of features in SQL is actually fairly convoluted, especially if you want to express them as of the time at which you're making the prediction, not some arbitrary point of time in the past. That's fairly complex. It can be done, but it's not easy.

And you can express these types of features in Tecton with just a line of code, but at the end of the day, everything just boils down to SQL code or whatever transformation code your compute engine actually understands. And so, TLDR, it was always important for us to allow the data scientists to just use the transformation languages, etcetera, that they're already familiar with and bring those over to Tecton

without having to learn something entirely new. Because of the fact that you're able to do some of that transpiling, I'm wondering how you have had to work through some of the maybe conceptual mismatches where somebody wants to write something in SQL, but it's actually being run in Spark under the covers

or, you know, they're used to using the Pandas API, but it's actually executing against a data warehouse and just some of the overall aspect of building that abstraction layer in a way that is useful to the end user and maintainable for you. It really is only a problem when we add this syntactic sugar to make certain types of transformation significantly easier like these time window aggregations.

Now those are simple enough that we basically have not yet run into complex issues where we execute something completely different under the hood that the customer wouldn't expect and wouldn't be able to debug. And then the other side is the types of features where the customer basically just gives us exactly the type of the transformation code

that we ship as is to the compute engine that we're connected to. So they would give us a SQL query, like a select star from table group by user ID or something like that, and we execute it as is on their compute engine, on their Snowflake cluster, or on their Databricks cluster. That's literally exactly their code that we run there, and we then just take the output of that code. We then store it in the offline store and the online store.
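
To illustrate both halves of that point, here are two hypothetical queries: a simple pass-through transformation that goes to the engine as-is, and the point-in-time trailing-window feature mentioned earlier, which is the kind that gets verbose when written out manually in SQL:

```python
# A simple pass-through transformation: handed to Snowflake/Databricks essentially as-is.
simple_feature_sql = """
    SELECT user_id, COUNT(*) AS lifetime_order_count
    FROM orders
    GROUP BY user_id
"""

# The trailing-window feature written out manually: every prediction timestamp needs its
# own window scoped relative to that timestamp, which is why this gets convoluted compared
# to a one-line declarative aggregation.
point_in_time_window_sql = """
    SELECT p.restaurant_id,
           p.prediction_ts,
           COUNT(o.order_id) AS orders_last_30_min
    FROM prediction_times p
    LEFT JOIN orders o
      ON o.restaurant_id = p.restaurant_id
     AND o.order_ts >  p.prediction_ts - INTERVAL '30 MINUTES'
     AND o.order_ts <= p.prediction_ts
    GROUP BY p.restaurant_id, p.prediction_ts
"""
```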

We also do some filtering and some validation around the central transformation code that they give us. But the key here to understand is that we try to make as few modifications as possible to the feature engineering code that the customer gives us on the way to the compute engine

to exactly avoid these types of issues that you just alluded to. And you mentioned earlier that the main focus of the feature platform is to maintain responsibility of all of the data related aspects of the machine learning workflow and to leave the model related aspects to other systems.

But for teams that have to now use different tools in tandem to be able to go through that full workflow, I'm wondering what are some of the interfaces and extension points that you have designed in to be able to

integrate with some of those other utilities and provide a smooth transition from I'm working in the data layer to now maybe I'm working with, you know, COMET or Weights and Biases or something like that that actually manages the model itself and the experimentation flow and being able to go back and forth between those different modes of operation. So there are really 3 different points where you interact with Tecton, where you may want to integrate with other systems.

So those 3 points are 1 is, of course, where you generate training data. Like, you wanna go to Tekton, and you wanna be able to ask for features to generate your data frame so you can, of course train a model. That's the training dataset generation consumption point. Then there is the model serving consumption point where you ask Tekton for features in your production system. And then the 3rd interface point is how do you actually create and update feature definitions?

So those are the 3. And in all 3, we were intentional about choosing languages and principles that fit into other common environments. So let me make this more concrete and get out of this abstract language. On the training interface point, Tecton provides a Python SDK. And so you can just pip install Tecton and load it into any Python environment, like a SageMaker notebook

or a Datalab notebook or whatever else it is. As long as it speaks Python, you can pip install and import Tecton and then use it to communicate with Tecton and actually fetch features and generate training datasets that consist of a bunch of historical feature values. On the serving side, especially the online real time serving side, Tecton just exposes a REST API and a gRPC API. So any production application that needs to fetch features in real time from Tecton can just query Tecton

by submitting a simple REST call to our platform to fetch the feature values. Again, that's a very generic integration point that any application really can integrate with. And then finally, the feature configuration point where you actually create and update features, that's all done just by writing Python configuration

code in files and backing those up in Git. And so that neatly integrates with your existing stack where you manage your code already and where you're already backing up your code in Git or whatever other version control system you have. Tecton's configuration files would just be managed in your existing Git repository or some new Git repository if you like, and then you would roll those out, those updates to Tecton, using a simple CLI command.
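
A minimal sketch of the online serving integration point; the endpoint URL, auth header, and payload shape are assumptions for illustration, not Tecton's documented API:

```python
import json
import urllib.request

def get_online_features(user_id: str) -> dict:
    """Fetch feature values for one entity from a feature-serving REST endpoint.
    The URL, auth header, and payload fields below are hypothetical placeholders."""
    payload = json.dumps({
        "feature_service": "fraud_detection_features",
        "join_keys": {"user_id": user_id},
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://features.example.com/api/v1/get-features",
        data=payload,
        headers={"Authorization": "Bearer <API_KEY>", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # called by the production service per request
        return json.loads(resp.read())

# feature_vector = get_online_features("user_123")  # then fed to the model for a prediction
```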

So those are the main integration points. And so as I was preparing for this conversation and reading through some of your recent blog posts, I came across 1 where you were discussing the need to establish a DevOps style approach to machine learning data and the machine learning life cycle.

And as I was reading through that and thinking about my own experience of working in DevOps and working in software, the idea of testing and validation came to mind. And I'm wondering what your thoughts are on the overall approach to being able to test and validate the feature definitions and their outputs and ensure

that you aren't introducing errors at that level, not necessarily digging into how that data plays with the actual model itself, but just being able to do the testing and validation of these features and ensuring that they stay within whatever bounds or parameters the creator of that feature wants to set.

Yeah. That's super important. That was always front and center of our minds as we designed Tecton, and that's why we chose this features-as-code approach to Tecton, where you manage the feature definitions in your own repository and where you roll them out using your existing CICD pipelines with our CLI. Because what that allows you to do is basically you'd create a new feature, you would check it into Git. Typically, somebody would be reviewing your pull request.

And then once it's checked into Git, you'd have your CICD pipeline actually test and validate your feature before, at the very end of it, once everything is green, it would actually be rolled out to Tecton using our CLI. And now what that allows you to do is in your CICD pipeline that you already have up and running for all your software applications, you would now be able to execute unit tests against your feature definition to make sure that your feature transformation code produces

the output that you expect it to produce against certain mock data. Like, you may just have some mock Pandas data frames that you can feed into your feature definition, and then you would look at the output to make sure that everything looks as you'd expect it to. Those are simple unit tests that you should basically build for pretty much all of your features and execute every time you make a change to any existing feature definition.
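
A hedged sketch of what such a unit test might look like in CI, with a mock pandas frame standing in for the raw data (the function and column names are illustrative):

```python
import pandas as pd

def orders_last_30_min(raw: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """The feature transformation under test (illustrative implementation)."""
    recent = raw[(raw.order_ts > as_of - pd.Timedelta(minutes=30)) & (raw.order_ts <= as_of)]
    return recent.groupby("restaurant_id").size().rename("orders_last_30_min").reset_index()

def test_orders_last_30_min():
    # Mock raw data: two orders inside the trailing window, one well outside it.
    mock = pd.DataFrame({
        "restaurant_id": ["r1", "r1", "r1"],
        "order_ts": pd.to_datetime(["2022-08-06 12:40", "2022-08-06 12:55", "2022-08-06 11:00"]),
    })
    out = orders_last_30_min(mock, pd.Timestamp("2022-08-06 13:00"))
    assert out.loc[out.restaurant_id == "r1", "orders_last_30_min"].iloc[0] == 2

test_orders_last_30_min()  # in CI this would run under pytest on every feature change
```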

2nd, if you're more advanced, you could even go as far as executing integration tests against your features. And you'd actually deploy a feature definition against a staging cluster, an integration test cluster of Tecton. You'd let it go wild against some staged data that you have, let it transform the raw data into feature data, and you would ensure that the output looks exactly the way you expect it to. Those are integration tests that you can also execute as part of your CICD pipelines.

And then at production time, once you've rolled out the feature, according to the typical DevOps process, you want to continue to monitor the health of the ML features. And so Tecton automatically monitors the staleness or the freshness of your ML features. It continuously looks at the serving latencies that it observes for all of the features, and we are now actually working on also automatically noticing when the feature definitions themselves drift.

We're working on an integration with data quality monitoring tools where you can actually express certain constraints on what the feature data should look like, like whether it should be of a certain type or whether it should be in a certain range. And if suddenly the features we observe don't meet those constraints anymore, then you would get an alert from the platform.
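
A simplified sketch of those two kinds of checks, a freshness check and a value-range constraint, with made-up thresholds:

```python
from datetime import datetime, timedelta

def check_feature_health(feature_rows, max_staleness=timedelta(hours=2),
                         value_range=(0, 10_000)):
    """Return alert messages for stale or out-of-range feature values
    (thresholds here are made up; a real platform would make them configurable)."""
    alerts = []
    newest = max(row["computed_at"] for row in feature_rows)
    if datetime.utcnow() - newest > max_staleness:
        alerts.append(f"feature is stale: last computed at {newest.isoformat()}")
    lo, hi = value_range
    out_of_range = [row for row in feature_rows if not lo <= row["value"] <= hi]
    if out_of_range:
        alerts.append(f"{len(out_of_range)} values outside expected range {value_range}")
    return alerts

rows = [{"computed_at": datetime.utcnow() - timedelta(hours=3), "value": 12}]
print(check_feature_health(rows))  # -> one staleness alert
```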

The other element of DevOps that's core to its ethos is the concept of collaboration and business alignment and making sure that everybody is working in the same direction. And I'm wondering what are some of the ways that you have brought some of that philosophy into Tekton as far as being able to encourage collaboration and alignment across the different roles that are critical for being able to actually make production ML a reality?

Yeah. I think all of this comes down to the fact that the different personas all work against the same centralized tool. And so you'd have all the different data scientists seeing 1 central feature store where they can explore the catalog of available features that were created by the different teams that are used by different downstream

ML models. And so just having this central catalog, the central library is like a key piece required to enable collaboration of different teams because they have 1 central system that they can all go to to discover each other's work.

Apart from that, what we talked about earlier, the workflow of data engineers and the workflow of data scientists, where Tecton is really the interface between the 2, where a data scientist comes up with the initial feature definition, and then together with the help of the data engineer, productionizes the feature using Tecton.

In terms of your interactions with your customers and people working in your organization, what are some of the most interesting or innovative or unexpected ways that you've seen Tecton and this overall concept of a feature platform applied? So we've seen some customers actually use Tecton not just to power ML models in production, but some even power just rules engines that are running in production.

And that's interesting because rules are, at the end of the day, also just models. They're just not trained. They're hand implemented models. That's been quite fascinating. We've also seen a lot of customers actually use Tecton to manage embeddings, to store embeddings that they calculate and train outside of Tecton, but then they use Tekton to centrally manage them and serve them for prediction

purposes. And then, finally, another thing that's also been just interesting in the journey of building Tekton over the last 4 years is just seeing the diversity of good data stacks that customers have. Like, there are definitely bad data stacks and there are good data stacks, but even within the group of sane,

modern, great data stacks, there is a pretty wide variety, and there are good reasons to go with 1 or the other. And there isn't just, you know, a giant consolidation happening around just 1 data platform. There is a variety of data platforms, and I'm pretty convinced that we will continue to see a variety of data platforms similar to how in programming, there is a variety of different programming languages, and we haven't just all consolidated around

Python or C++. They all have their place. There's a variety of them out there. It's not hundreds of programming languages that companies use day to day. There is a number of languages out there, and, similarly, there is a number of sane, good data architectures and data stacks where there isn't really a right or wrong. It really depends on your use cases, how much you're willing to pay, how important performance for you is, etcetera.

That's just been a fascinating journey to see those data stacks, see how they evolve and where the industry is generally going, and how it is actually not just all consolidating around 1 super high gravity player. And so as you said, there isn't 1 monolithic stack that's going to be applicable to everybody. So what are the cases where Tecton is the wrong choice, for people who are taking a completely different tack and figuring out how to get those capabilities into their own engineering workflow?

So if you're all on prem, or if all the data related to an ML use case you care about is all on premise and not in the cloud, then Tecton is definitely the wrong choice. Then you can either use Feast or build entirely from scratch.

Also, if you are already on the cloud, but you're still modernizing your cloud data infrastructure, and you, say, are not a Databricks customer, not a Snowflake customer, or don't have a data architecture around, say, if you're on AWS, EMR or Athena and S3, but you have a hodgepodge of a variety of disparate data sources that you're trying to still bring together for data science purposes, then you'd be too early for Tecton. Then you still would have to modernize and further consolidate

the data stores in your cloud before a system like Feast or Tekton could and should be brought in. As you continue to build and iterate on the platform and work with your customers and keep track of what's happening in the broader ecosystem, what are some of the things you have planned for the near to medium term or any capabilities

or problem spaces that you're excited to dig into? Yeah. Going forward, we will continue on our mission to make it as easy as possible to develop and deploy ML models. And we have so far made it very easy to develop ML features and deploy those to production. Going forward, what you can expect to see from Tekton is much more visibility and much more insights into

the quality of these ML features and whether they're drifting or not. And then over time, what you can also expect us to do is help you more with actually getting to your predictions and getting to your business decisions. Like, at the end of the day, everything boils down to a data pipeline in machine learning. Turning raw data into an ML feature is a data pipeline. Turning a feature, or a set of features and a training data set, into a model is a data pipeline.

Making predictions from a model and some input features is a data pipeline. And Tekton is going to make the management of all these data pipelines

around machine learning significantly easier. And so in the future, you can expect us to monitor the predictions that you're actually making and the ground truths and tie those back to the actual ML features to help you gain insights into when are your predictions going off the deep end and how is that related to the ML features that Tecton is serving to you.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as the biggest barrier to adoption for machine learning today. The biggest barrier is definitely still that the data is too siloed and the data architectures are too idiosyncratic. A lot of companies still

make it very, very hard for data scientists to know where is the data. And if they have the data, they have to jump through several hoops to actually get access to the data. And then even if they do get access to the data, it's still very, very hard to productionize this data for a variety of use cases.

And I think there's still a lot more work to be done to really centralize the data discovery in an organization, centralize the data access, and to do it all in a very, very well governed and secure way. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Tekton. It's definitely a very interesting and exciting product, and it's great to see the emergence of this idea of the feature platform

and definitely excited to see where that continues to evolve to and the capabilities that it provides and how it's able to help machine learning teams go from idea to production more rapidly and with greater confidence. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Awesome. Thanks so much for having me.

Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

Transcript source: Provided by creator in RSS feed.