
Applying Declarative ML Techniques To Large Language Models For Better Results

Oct 24, 2023 · 46 min · Ep. 22

Episode description

Summary
Large language models have gained a substantial amount of attention in the area of AI and machine learning. While they are impressive, there are many applications where they are not the best option. In this episode Piero Molino explains how declarative ML approaches allow you to make the best use of the available tools across use cases and data formats.
Announcements
  • Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
  • Your host is Tobias Macey and today I'm interviewing Piero Molino about the application of declarative ML in a world being dominated by large language models
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you start by summarizing your perspective on the effect that LLMs are having on the AI/ML industry? 
    • In a world where LLMs are being applied to a growing variety of use cases, what are the capabilities that they still lack?
    • How does declarative ML help to address those shortcomings?
  • The majority of current hype is about commercial models (e.g. GPT-4). Can you summarize the current state of the ecosystem for open source LLMs? 
    • For teams who are investing in ML/AI capabilities, what are the sources of platform risk for LLMs?
    • What are the comparative benefits of using a declarative ML approach?
  • What are the most interesting, innovative, or unexpected ways that you have seen LLMs used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on declarative ML in the age of LLMs?
  • When is an LLM the wrong choice?
  • What do you have planned for the future of declarative ML and Predibase?
Contact Info
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Links
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Transcript


Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine learning. Your host is Tobias Macey, and today I'm interviewing Piero Molino about the application of declarative ML in a world being dominated by large language models. So, Piero, can you start by introducing yourself? Sure. Thank you, by the way, for having me, Tobias. I really appreciate the opportunity.

I'm Piero, and I'm the CEO of Predibase, which is a company bringing declarative machine learning to the enterprise. Previously, I've been working at many different organizations. In particular, my last stint was 4 years at Uber, where I was doing research and applications, and where I developed Ludwig, which is an open source declarative machine learning framework that is now the foundation of the technology that we are building at Predibase.

And do you remember how you first got started working in machine learning? Yeah. Well, it was quite a long time ago, honestly; more than 10 years now, probably 13 or 14 years. My introduction to machine learning was actually through recommender systems, because I was interested in information retrieval and filtering, and I was working on this little pet project of mine about how to induce serendipity in recommender systems.

That's what got me started and interested in figuring out machine learning, in particular for personalization, but then that brought me into natural language processing and, later on, many different topics surrounding machine learning. But it all got started from the perspective of information retrieval and information filtering.

As with so many projects that get people into a particular field, they usually end up getting put to the side in favor of all the other work that comes along. I'm curious if you ever actually got to a satisfactory point of being able to actually introduce that serendipity in the recommender systems. I would say not really, although I wrote a few papers about it, and my bachelor thesis was

about it. And that paper, despite the fact that it didn't really have a lot to it in terms of solution, actually posed the problem in an interesting way. For a long time, it was my most cited paper, despite being my first and worst paper, but it was then superseded by my more recent work. For a long time, though, it actually was my most cited one.

And so bringing us to the topic at hand now, for anybody who hasn't been living under a rock for the past year or two, large language models have definitely been

taking the ML and AI world by storm. They have broken out of being a subject purely for academia or industry insiders and have now been introduced into the general population. Everybody's talking about it. I'm wondering if you can just start by summarizing your perspective on the effect that these LLMs have had on the AI and ML industry.

Yeah. It's certainly a pretty big effect from many perspectives. On one hand, you have the fact that companies, and also individuals, that before were not thinking about AI at all, or were not thinking about it as something to pay attention to immediately, or were thinking about it as something futuristic that eventually they would have started to consider.

Now, the availability of something like ChatGPT and its interface makes it pretty tangible that these models have capabilities that are particularly useful for solving tasks and also for just interacting with data, and that opened up the set of possibilities and the imagination of people and companies. From the perspective of someone building machine learning tooling, that made it so that there are way more people now interested in the space, and way more demand.

At the same time, that also comes with a caveat, which is that most of these people may not have a deep understanding of the implications and the complexity of building infrastructure and tooling around machine learning. And so there's, like, a little bit more education that we need to do than what we were doing before, in order for people to fully grasp what we are actually doing.

The other aspect is that these new mechanisms for using LLMs in particular lower the barrier of entry for many tasks for people who may not have training data. Before, in order to be able to start to do some machine learning, you at least needed to start from some training data. Now you can start without it. There are limitations, obviously, but it's a much lower barrier of entry,

for nontechnical people. And that's something that obviously is changing the way the industry is thinking about it, and also the way we are thinking about it as a company.

And to that point of LLMs lowering the barrier to getting some sort of product-level capability deployed, I'm wondering what are the potential pitfalls or some of the issues that you've seen teams or individuals encounter in their haste to bring this new capability either into their product or even just into the experimentation phase, and maybe some of the ways that LLMs are failing to live up to the hype? Yeah. So that's

definitely something that we are observing. I will say, on the one hand, you have really great new capabilities, in particular around generation, and in particular around creating content and reformatting, if you want, even content like structured information into something that is textual and colloquial and understandable. And so these capabilities are really impressive and unprecedented

in the machine learning space and in the AI space. Right? That said, obviously, there are some pitfalls, as you were mentioning. One of them is absolutely the fact that there is a mismatch between the expectations of what these models can actually provide in terms of answering questions and what the questions that people really want to ask are.

Some clear examples of it: the kind of interaction that we have seen some users having with these models is kinda like trying to ask an oracle how to answer something, but that's really not what these models are. These models are generating the next token, and doing it in a smart way, but they cannot really predict the future. They cannot really answer questions if the data to support it is not given to them. Right?

Another pitfall that we are seeing is expectations about what the output will look like. These models, again, are trained to generate the next token and then fine tuned for following instructions, but still, their goal is to produce text despite everything else. Meaning that even in situations where there's no clarity, or the model is not certain about something, it will generate something anyway in most situations.

And that means that it could generate things that are untrue, it could generate things that are imprecise; there's no grounding that actually happens at the time of the generation of the outputs. And there are techniques to mitigate that, like, for instance, providing additional context, providing some evidence, and so on, but still,

it's not always the case. There are no guarantees, let me put it this way. Right? So these are definitely pitfalls. And then the last one that I would mention is the fact that the way through which some of these models are consumed right now, mostly through APIs,

may not be the best one for everyone. There are a few pitfalls with that too: lock-in, and also throttling that API providers can do, or the API providers can change the models under the hood for you, and, you know, then what worked before may not work anymore. And all of these things you have no control over. Right?

So that's part of the reason why we believe that a platform, in particular a declarative platform using open source models like ours, can be really valuable in this new landscape. On that point of open source models, the provider that brought about this initial wave of excitement, particularly in the general community, is OpenAI, with

ChatGPT and the GPT series of models. But as you mentioned, there are a number of open source offerings that have been fast followers and that are focusing on different niche use cases or different specific applications. I'm wondering if you can give a summary of your understanding of what that open source model landscape looks like and some of the ways that those models are looking to differentiate from OpenAI in particular, but also from the other commercial LLM providers.

Yeah. Absolutely. So, you know, we're strong believers in open source. All our technology is built on top of open source projects, and also the models that we serve and use are open source, so we are really invested in this space. Right? I can tell you there are a few aspects of open source that are particularly interesting. One is the speed at which the

field is moving. There's literally a new model every week, or actually more than that, but one new very relevant model every week, and the pace of progress is very, very fast, and the community is rallying together

in making these models available for everybody, right? And also some bigger players like Meta, for instance, have started to make their own models available under permissive licenses, so there's a lot of excitement building in the open source space, and different models have different skills and characteristics.

The particularly interesting thing in my mind is that, despite the fact that most of these models are actually trained more or less on the same data and more or less fine tuned on the same instruction datasets, there is still real variety among them.

The possibility of having them trained, for instance, on datasets in different languages, and the availability of models in different languages, is something that is exciting and that is not really present in the commercial offerings as much. And also the fact that models can be fine tuned for specific needs and specific customer adaptations, that is something that the, let's say, larger providers also do, but to a really small extent,

and they also don't give you ownership of whatever you've been fine tuning and adapting, and so the personalization aspect is one that makes for really exciting use cases for open source models.

For people who are using these LLMs as a component of their stack, there have been a number of different architectures that have grown up around it, in particular using things like vector databases for generating the context embeddings to be able to feed into the model to customize it to your particular application.

I'm wondering if you can talk through some of the ways that what you're building at Predibase, and the overall concept of declarative ML, can be used to simplify the operational and architectural aspects of working with these LLMs and incorporating them into the workflow and serving? Absolutely. Absolutely. I would say, just for the sake of giving a more complete answer,

the approach that we're following with declarative ML really consists of this: you can write configurations of your machine learning pipelines that are like YAML files, really short and human readable, and that makes it possible for you to go from zero to a deployed model very, very fast, without having to write low-level machine learning code and do the orchestration of putting together components like, you know, vector DBs and other things like that, right?
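To make that concrete, here is a minimal sketch of what such a configuration can look like, following the input_features/output_features convention from Ludwig's documentation; the column names are hypothetical:

```yaml
# A complete declarative pipeline: text classification from a CSV,
# with no low-level ML code. Column names are hypothetical.
input_features:
  - name: review_text
    type: text
output_features:
  - name: sentiment
    type: category
```

A configuration like this is typically run with something like `ludwig train --config config.yaml --dataset reviews.csv`, with the framework handling preprocessing, model construction, and training.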

What we have been working on is adding these kinds of capabilities into both our open source project and our platform, and basically what we want to do is handhold users from the very beginning of the journey, which is literally deploying a model to query

it, all the way to fine tuning it, and in the middle there are, let's say, different degrees of adaptation of the models. So on one hand, once you deploy them, you may want to run them both for real-time prediction but also on batches of data, and we have optimizations for running them on batches of data. Then,

the very first thing that you should try when you want to adapt a large language model to your data is actually retrieval augmented generation using something like a vector DB. The reason for that is that it's very cheap to do compared to fine tuning, so it's the obvious first thing to try, and you can write a configuration very easily to specify what the prompts are and which parts of the prompt retrieve the information

from the vector database within the configuration file, and that, you know, makes it possible for you to iterate on prompts by changing just the configuration file. Another interesting aspect of the idea of LLMs, especially in juxtaposition to the idea of declarative ML, is that they're both aiming to provide a similar level of ease of access and a shallower on-ramp to being able to take advantage of AI capabilities for product or business reasons.

I'm curious what you see as some of the different ways that people are coming into one or both of these toolchains and approaches, the types of problems that they're trying to solve, some of the misconceptions about the capabilities of LLMs that they may have, and some of the ways that your work at Predibase, and on declarative ML, helps to address some of those shortcomings. So, yeah, I would say the intent, to a certain extent, is similar. Right?

Declarative machine learning wants to lower the barrier of entry for building machine learning tools, and LLMs actually lower the barrier for using what underneath is a machine learning tool. Right?

I would say the way things unfold is slightly different, though, because with declarative machine learning, the output that you obtain from the process is actually a machine learning pipeline that you can then use for, you know, your own use cases, or adapt to your data and do whatever you want with it, right? You don't obtain a general system, really; you get a system that is task specific, right? You can also train something like an LLM, but

it's pretty costly. You can definitely do it, and it's valuable in some situations, in particular when your data is very bespoke and doesn't really jive with general data coming from the internet, but in general the goal is to build the model that you want to deploy and use. Right?

With LLMs, the model is already built for you and you're just querying it and interacting with it, and, as we were discussing, there are mechanisms for adding knowledge about your data into the process, like retrieval augmented generation or fine tuning, but the first barrier of entry is substantially different. Right? And, again, for us, these two things come together in the form of providing declarative configurations for deploying models, adding

retrieval augmented generation to them, and also fine tuning them. So basically we bridge the gap from what it takes to start using a model, to adapting it to your data by providing information in the context that you are giving to the model for generation, to using that data for adapting the model to your use case, and everything in between. Right? So we bridge the gap from zero to an entire model fine tuned for you, and everything in between.
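As a hedged illustration of bridging that gap declaratively, here is a sketch of a configuration that layers retrieval augmented generation on top of a deployed LLM. The field names here (model_type, base_model, prompt, retrieval) are assumptions for illustration, not confirmed Predibase or Ludwig schema:

```yaml
# Hypothetical sketch: a prompt template plus a retrieval section
# that pulls supporting snippets from a vector index at query time.
# Field names are illustrative assumptions, not a confirmed schema.
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
prompt:
  template: |
    Context: {context}
    Question: {question}
    Answer:
  retrieval:
    type: semantic   # embed the query and fetch the nearest snippets
    k: 3             # number of retrieved snippets injected as {context}
```

Iterating on the prompt or on the number of retrieved snippets then means editing the configuration file, not rewriting orchestration code.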

Another aspect of the hype around LLMs that is interesting is that it seems to have completely eclipsed the conversations around transformer models that were the previous hype cycle, at least within the ML ecosystem.

I'm wondering what you see as the relative use cases for transformer models as compared to LLMs and maybe some of the reasons that these LLMs have become much more widely adopted, maybe some of the challenges that existed in the transformer architecture that are simplified in LLMs and where you still might wanna use the transformer approach. I would say, you know,

LLMs are transformers, at least right now; there are, like, competing architectures, but I think the main difference is not really in terms of the details of how the models are provided, but more in the mechanism for using them. So, for instance, if you're thinking about a transformer model like BERT, those were trained on large amounts of data, but not for generating text; there was, like, a side,

let's say, task: it was basically infilling of missing pieces of text, but the goal was not really to use them for that task. That task was an auxiliary task for being able to obtain good embeddings that could then be used for downstream tasks. The model would generate embeddings, and those embeddings are then used for fine tuning or for training a smaller model, a smaller head, on a specific task. Right?

Using that obviously requires some machine learning knowledge, and, again, this is something that you can do with our declarative ML approach in Ludwig and in Predibase. You can just say: I have an input that is text and I want to use BERT for encoding it, and I want to have an output that is, whatever, a category. We made it as simple as that, but it's still different than typing text in a chat box and getting answers out of it. Right?
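In configuration terms, that "text in, BERT encoder, category out" setup is roughly the following sketch; the feature names are hypothetical, and the encoder syntax has varied across Ludwig versions (this follows the newer dictionary form):

```yaml
# Text encoded with a pretrained BERT model, feeding a category head.
# Feature names are hypothetical; encoder syntax varies by version.
input_features:
  - name: description
    type: text
    encoder:
      type: bert
output_features:
  - name: label
    type: category
```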

Underneath, between the architecture that is used for these LLMs that are deployed and the BERT architecture, there's very, very little difference. So this is something that has changed, I would say. But on the other hand, transformer models like BERT, and more recent versions of it, are actually still used, for instance, for retrieval augmentation.

When you're putting your data in a vector DB, you need to provide a vector for that data, and these kinds of models are used for that purpose. So they're still part of the conversation, although they are not the ones that people talk about the most. And, also, LLMs have been widespread for these textual and natural language

applications, but there are still a number of different modalities for which machine learning is still very useful and valuable. I'm curious what you have seen as some of the ways that people are trying to address the multimodal use cases, particularly dealing with images or sensor information, or some of the other applications where pure natural language is not sufficient? Yeah, absolutely, and again,

the declarative ML approach is particularly well suited for that. One of the reasons is that in the configuration files that we make very easy to write in Ludwig and Predibase, you can specify the data type associated with each input that you are providing to the model, and those data types could be both tabular data types, for instance,

like categories, binary, and numerical values; more complex data types like text, image, and audio; and also time series, like time series coming from sensors; geospatial data; and more complex structures like sets and bags. So multimodality is definitely an aspect that is being addressed by declarative ML, and declarative ML makes it very easy to train models on these multiple modalities all at once.
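A sketch of that multimodal, mixed-type setup might look like the following; the feature names are hypothetical, and the type names follow Ludwig's documented data types:

```yaml
# Several modalities trained jointly in one declarative model.
# Feature names are hypothetical.
input_features:
  - name: profile_photo
    type: image
  - name: bio
    type: text
  - name: sensor_readings
    type: timeseries   # e.g. a sequence of sensor values
  - name: follower_count
    type: number
output_features:
  - name: account_type
    type: category
```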

The, you know, interesting thing is that you can also reuse some of these LLM capabilities, for instance, when you decide how the text is treated within this larger model that the declarative configuration defines. You can use the LLM models, but instead of repeating predictions to generate one token at a time, you stop at the level of obtaining the embeddings from them.

There's a lot of work happening on this side. Obviously, it's now a bit less flashy and a little bit less at the forefront of people's imagination, but the industrial and enterprise applications of merging multiple modalities are potentially even stronger than the applications of just text interfaces.

Another shortcoming or constraint that I've seen brought up a number of times also is the question of the size of the context windows that you're able to achieve, and there's definitely still ongoing research to expand that. And I'm wondering what are some of the ways that that manifests in terms of what applications are viable with LLMs and which still require custom model development to be able to accommodate either larger context windows or being able to customize to your specific data set

in lieu of having that larger context window to give the model enough information to work with. Yeah. No. It is really a glaring limitation. Right?

One example of what we have seen unfolding, and why that's particularly relevant and important for real-world applications: even retrieval augmented generation, which is to a certain extent a mechanism to overcome that limitation in context, because instead of giving an entire document, maybe a document that would be longer than the window size, you can give just a few snippets from the document that are retrieved

to fill that window size with only the relevant pieces of information. Still, that may not be sufficient. I can give you some examples that we're seeing: in particular, when you need to provide information about all of the elements of something, like, for instance, all the passages of text within a document, to be able to actually obtain

the correct answer, then you're out of luck. Right? That goes beyond what most models can do. Now there are some large ones that make it possible to use very long window sizes, but still, it's not sufficient. One customer that we've been working with has been trying to answer questions over their SQL data, and they wanted to actually do aggregates over their information, despite the fact that that may not have been the best

idea to do with an LLM as opposed to, like, running a SQL query. But the kind of question that they were asking was not a question that a SQL query could answer, and they could not provide the entire dataset for the model to do aggregate reasoning on top of it, and this kind of limitation, you know, exists because the window size is not large enough.

Now there's a lot of research happening on increasing that window size, although I think the answer is a combination of better, faster models that can work on larger window sizes with better retrieval mechanisms that retrieve the relevant parts that are sufficient for answering the questions. It's still open research to find the right balance between these two. Right?

I've seen recent work coming from the Hazy Research lab at Stanford, which I was part of in the past, that really goes deep in the direction of long sequences using state space models, and I think that's a very valid line of research that could bring us improvements in this direction.

The other thing that is brought up a lot by people who are actually working with large language models, even just recreationally, is the challenge of hallucination, where the models just completely forget what it is that they're trying to do, or they base successive predictions on faulty information.

And I'm wondering what you have seen as some of the useful mechanisms for being able to identify the occurrence of those hallucinations and potentially counteract them, or ways to prevent those hallucinations from occurring in the first place? Yeah. So those are a big problem. I actually don't really like the term hallucination, because it makes it look like it's

a bug, while it's definitely a feature of these models. That's how they are trained. That's what they are supposed to do, really: generating text. So, you know, it is a bug in terms of what we want these models to do, but it's not a bug in terms of what they are supposed to do. It is a big issue, though. In particular, there's a lot of work that has been going on on the alignment side to try to limit it, but it's kinda like whack-a-mole, if you want.

You identify areas where models confidently provide wrong answers, and address them partially through, like, multi-hop reasoning and multi-hop kinds of interactions, and, like, constitutional approaches, to limit the impact of it. I don't think anyone has yet found a mechanism for guaranteeing that the problem is avoided altogether.

That said, you know, retrieval augmented mechanisms, for instance, are a way to partially address it, and sometimes you can go surprisingly far by actually specifying to the model what you want

it to do or not to do. I can give a concrete example. If you use retrieval augmented generation and you retrieve some snippets of text, you can specify: only use information from the retrieved text to answer the original question. And the model, most of the time, will try to comply with that. Again, you have no guarantees, but empirically, you can see that in most cases that actually

ends up happening. I think the problem there is, if a model like that is then exposed to end users who may be trying to stress it or be malicious about it, then that may not work anymore, because they could send prompts that contradict the request to be truthful to the retrieved information. Right? So it really depends on the final application. Right? But a combination of retrieval augmentation with constitutional training of models is, I think, the best that we have found so far.
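A hedged sketch of that kind of grounding instruction, expressed as a prompt template in the same configuration style as the earlier examples; the template and placeholders are illustrative, not any specific product's schema:

```yaml
# Illustrative grounding instruction for retrieval augmented generation.
# As noted above, compliance is empirical, not guaranteed: the model
# usually follows the instruction, but nothing enforces it.
prompt:
  template: |
    Use ONLY the information in the retrieved text below to answer the
    original question. If the retrieved text does not contain the
    answer, say that you do not know.

    Retrieved text:
    {context}

    Question: {question}
```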

The other aspect of these LLMs that has proven challenging is the specifics of the data that they're being trained on, and some of the questions around bias, inclusivity, and the types of information that they have available, particularly when you're starting to deal with non-English-language use cases.

And I'm wondering what your thoughts and experience have been around ways that we, as an industry, can try to address some of those challenges, or at least improve the visibility of those shortcomings to the end consumers of these models. This is a really complex topic, and it's, let's say, difficult to answer in a short way that is not dismissive of many of the aspects there. I'll try to do my best, but I would say, on the one hand, you have

the issue of the preselection of the data. Right? The models don't have any inherent bias, really, if it's not coming from the data that they've been trained on. Right? So that is definitely one aspect that is important: they're just reflecting, or amplifying, the issues that are already there in the original data.

If you, again, train on data that contains those issues, you get them back. Finding the right balance in large datasets that are less biased is difficult, because empirically, the more tokens you train these models on, the better, but also the higher the quality of the data that you're training them on, the better. And it was demonstrated pretty clearly by both versions of LLaMA, for instance, that exactly the same architecture, trained for longer and on better and bigger data, actually

produces better models. Right? On the other hand, there are aspects like the systematicity of the use of these models. Right? Because even if a model is biased, that bias could be mitigated in different ways: for instance, if there are multiple models and you, you know, mix them or average them,

and maybe each model is trained on different types of data. Obviously, if you train them all on the same data, the same biases will emerge. That could be a mechanism for limiting biases, and it's not super different from what seems to be the mechanism that both PaLM and the more recent GPT-4 have been trained with, using a mixture of experts. But the issue is still that if a model consistently produces a biased answer to questions, and

all people ask similar questions, they will all receive the same biased answer, and so a bias that was maybe minimal in the data can be amplified dramatically by the use of the models, and there's no clear answer there on how to address this problem.

One interesting talk, on a different topic, that I've seen given by Michael Jordan from Berkeley was about routing users of a map system towards a destination, and the idea there was: clearly, if there's a short route from point A to point B, but you give the same suggestion to all your users, then that route is not the shortest anymore, right, because there's going to be traffic induced by the fact that you are the one providing these suggestions to everyone.

I think a similar kind of dynamic happens with large language models, and so maybe a potential solution there is to actually start to give different answers and, I don't know, sample with a higher temperature from these models than what we've been doing before, to try to diversify the outputs.

In your experience of working with large language models and with your customers, and finding the ways that declarative ML and LLMs can be used in conjunction, what are the most interesting or innovative or unexpected ways that you've seen them applied? Yeah. So I think, on one hand, you have the tasks. There are many interesting tasks that, you know, I was not expecting before, so I've seen quite a few

interesting use cases. One of the latest that I've seen is applying LLMs to data coming from sport events to get an edge in playing fantasy football, which was a pretty fun application. But I would say the most surprising use cases, or more than surprising, the most interesting use cases for me, are the ones that go beyond the chat use cases. For instance, information extraction is a very interesting application.

We have been, you know, for instance, demonstrating it by applying it to datasets of CVs and extracting information that is not 100% there in the CV but can be inferred from it. That was a fun application. Or, in general, applications of mapping back and forth between data that is structured and data that is textual: generating text out of structured information and, on the opposite side, generating structured information out of unstructured, textual information.

Those kinds of applications, I think, are really great for enterprises, because then you can have systems, like, you know, databases and structured SQL queries, on top of information that is unstructured, and vice versa: you can generate human-readable versions of structured information for users to consume without having to look at tables of data. So I think both are pretty interesting applications.

And in your experience of working in the space of declarative ML, particularly in this time when LLMs have been taking over a lot of the attention, what are the most interesting or unexpected or challenging lessons you've learned in the process? Yeah. I would say probably the most interesting lesson that I've learned,

and it may be a little bit obvious, is that there's no silver bullet. We have seen applications and customers for which the pretrained models worked really, really well on their tasks; other customers that really needed to fine tune the models on their datasets to get good performance, because retrieval augmented generation was not good enough for their use cases; and also customers that had to go all the way to training models from scratch to apply them

to their use cases, because the pretrained LLMs would not, like,

really work well for them. Right? And so the, let's say, meta-learning there, which is also what we're trying to put into the platform as a consequence, is that every use case is really different, and what you want to have is a platform that supports you in being able to experiment with all of these techniques, because you cannot know beforehand which one will be the one that delivers the value in terms of capabilities, performance,

and good tradeoffs with speed, that make them usable for your use cases and your applications. The learning is that all of these things can work in different use cases, and you need to figure them out empirically on your use case before being able to apply them for real. And so for people who are interested in being able to incorporate AI into their product offerings, or even just internal capabilities, what are the cases where an LLM is the wrong choice?

Yeah. I would say an LLM is definitely the wrong choice for use cases where there's really no data in the open that relates to them. One example that we give, using openly available datasets, by the way, so it could even have been the case that the model was trained on them: we have this example that we show to users of a dataset of Twitter users, with labels telling if they are bots or humans,

and there is textual information about them, like the self-provided description; there's also some structured information, like the number of followers and the number of tweets that they post, and things like that; and also images, their profile pictures. Right?

And if you ask an LLM to provide a prediction on a user like that, whether it's a human or a bot, the best LLMs are actually the ones that tell you that they don't have enough information to be able to make the determination. That basically tells us that this is one example use case, but there are many like it, where training models from scratch is actually still the preferred way of solving your task, and

you obviously need to collect the data in order to be able to train these models from scratch, but that kind of investment is worth it if you want to have good performance at the task. Right? And so as you continue to build and iterate on Predibase and the declarative ML capabilities, and help to support the LLM ecosystem and applications, what are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to dig into?

Yeah. So there are a couple. The first one, and most important, is the fact that we're coming up with a new release of Ludwig, the 0.8 release, that includes all these LLM capabilities through the declarative configurations that we've been discussing, and we are going to release it very shortly. I'm super excited about it because it's the one-stop place where you can do all these things at once very, very easily.
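As a hedged sketch of what those declarative LLM capabilities can look like in configuration form, here is a hypothetical fine-tuning setup; the field names are assumptions based on the conventions discussed above, not the confirmed 0.8 schema:

```yaml
# Hypothetical sketch of declarative LLM fine tuning: a base model,
# instruction-style input/output columns, and a parameter-efficient
# adapter. Field names are assumptions, not a confirmed schema.
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
input_features:
  - name: instruction
    type: text
output_features:
  - name: response
    type: text
adapter:
  type: lora           # train a small set of adapter weights
trainer:
  type: finetune
  learning_rate: 0.0001
```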

And then, in terms of applications, we are working heavily on a few, including information extraction, which, you know, we have seen many companies being super interested in. It's a very promising space because it's, like, a step function in the level of performance given the amount of data that is needed, compared to previous solutions, so it's a very exciting new space for us to explore.

Are there any other aspects of this overall space of large language models and declarative ML and the work that you're doing at Predibase that we didn't discuss that you'd like to cover before we close out the show? Yeah, I would say one important topic in my mind is the data aspect.

Obviously, large language models, like every other machine learning model, and also every piece of software, really, are garbage in, garbage out, so the quality of the data is very, very important, even in a world where we are interacting with our AI models through text. Right?

And to this extent, you know, we have been doing a lot of work in connecting the dots between the place where the data lives, you know, data warehouses and databases, Snowflake, Databricks, and the place where the models live, which is a platform like ours, and basically making it so that the quality of the data that is piped through these pipelines is good enough to be able to obtain really good results,

and the outcomes of those high quality pipelines are very important to avoid situations where

there's data loss. And so monitoring the quality of the data, and having the connection between the data sources and the models all in one place, makes it possible to have very fast and very reliable pipelines, and the capability of putting these things together, the data and the models, in one place is one of the added values that our platform provides that we're really proud of, and that we think is going to be particularly valuable for customers.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption of machine learning today. I don't think that there's a huge barrier to adoption in terms of

how easy it is to use and interact with the models today. I think the barrier is more the disconnect between the expectations of what the capabilities of the models are and what the models can actually deliver, and so I think education, and conversations like this, are actually the key to being able to set the right expectations, and for people to adopt the right solution for the right problem.

Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Predibase and your thoughts on the large language model ecosystem, and the ways that declarative ML can help to simplify and improve on the application of these new capabilities

to different business problems. I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thank you for having me. Thank you for listening. And don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest on modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
