Azure API Management's GenAI Gateway with Andrei Kamenev

Speaker 1

00:12

Hey man, it's not ned. You're not why not? You know, it's just the way I feel this morning.

Speaker 2

00:23

Yeah, you've had a little too much weekend.

Speaker 1

00:25

I think I'm Carl Franklin. That's Richard Campbell, you know. Okay, I won't do that again until marijuana is completely legal in all fifty states of America. Oh, it's going to be a while and then I'll hey man, how you doing, Richard? I'm good.

Speaker 2

00:43

I did have a bit of a crazy weekend. You know, between my birthday, my wife's birthday, and many of our friends' birthdays all falling in these last two weeks of July, we brought them all up to the coast, and yeah, it's been a week of debauchery, honestly.

Speaker 1

00:57

Yeah, your nose does look a little red.

Speaker 2

00:59

Actually, I'm a little pinked up today, doing some damage. But you know, yeah, there's a couple of buddies up here that I've literally known for forty years. Like, wow, that's when you know you're getting old.

Speaker 1

01:09

That's so cool though.

Speaker 2

01:10

Yeah, it's something. Yeah, how about you? You staying out of trouble? I see your calendar. You're playing a lot this summer.

Speaker 1

01:17

Yeah. Yeah, the band is currently just on a tear right now.

Speaker 2

01:21

That's awesome, dude, And uh.

Speaker 1

01:23

We're getting our original bass player back in September. August thirty-first is our current bass player's last date. Oh, I'll tell you a little band story. So it was during COVID that Kevin, our original bass player, basically stopped coming to rehearsals because of COVID, and he had a good reason too. He was immune deficient, and it turns out he got leukemia, and his immune system was compromised. Because of all that stuff, he had to stay home, and he was like, guys, I'm out.

01:56

I can't do this, and you know, my wife and child, all that. Yeah. So then he gradually got over it, and I kept calling him once in a while and said, hey, man, how you doing? He's like, yeah, I'm okay, but I'm still not. You know, I tried playing out a couple of times with, you know, somebody else, and I'm still not. And then finally he's like, okay, I'm retiring. I'm in. Oh,

02:19

but not till September. And it turns out that I called him because our current bass player Chris basically met a girl, sold his house, and moved three hours away and married her.

Speaker 2

02:33

Hopefully not necessarily in that order, but okay.

Speaker 1

02:35

Yeah, yeah, yeah. But he said, don't worry, guys, I'm still in the band. And we're like, oh, yeah, right, you know. But props to Chris: three hours. Yeah, he didn't come to rehearsals, so, you know, we couldn't really learn any new stuff, but we could send him a song and say, hey, learn this, and he was pretty good about it. And he never had a problem on the gig. He showed up at every gig, three hours away. It's a lot of driving. But I was going to say that the whole purpose of

03:05

this was that, come September, see, Kevin knows a lot of our original tunes that Chris doesn't know, and a lot more than the Steely Dan songs which we're known for. So we're going to be playing a lot more original tunes, hopefully in bigger venues. Yeah, more Franklin Brothers. It's cool. Anyway, I've taken up enough time with my stupid stories of music and bands. Let's roll the music for Better Know a Framework. Awesome, man, what

03:37

do you got? I can't believe I've never talked about this particular project before. It's curated by the .NET Foundation. It's FluentValidation. Okay. It's a validation library for .NET that uses a fluent interface. You know what that is? Do this, do that. Mm-hmm. And lambda expressions for building strongly typed validation rules. Right, okay. So you can create these rules and then you can obviously use them, and it's very popular.

Speaker 2

04:08

Well, and it's a way of building abstract validation rules you might actually reuse, rather than keep recoding them. Exactly.

Speaker 1

04:14

Yeah. Yeah. And you know, if you decide to do it on the fly, where do you store that information? Where does it go? Does it go in a database? You know? Does it change? How does it change? Who changes it? Like, there are so many things. But yeah, eight point nine thousand stars. Yeah, learn it, love it, man. That's clearly a great tool. It's cool, and I'm going to start using it. I can't believe that I haven't even known about its existence before. So

04:43

that's what I got. Who's talking to us?

Speaker 2

04:44

I grabbed a comment off of show 1891, which we did back in March of '24 with our friend Anthony Alaribe. He'd written that open source library for API observability.

Speaker 1

04:56

Yeah right.

Speaker 2

04:57

We had a great conversation with him. APItoolkit was his tool. I know we're going to talk about APIs a bunch today. And our friend Matt Lacey actually commented on the show. He said, hey, ten-plus years ago, I built something for API, because this is your checking on the client end. I assumed there was a potential business in it, but I couldn't work out how. You know, because building good software and selling software are two different jobs, right? Like, that's a different thing.

05:21

Great to see APItoolkit in existence now and making APIs more reliable. Because that was our whole conversation, right? You change an API, and somebody with a dependency on it suddenly goes down. Yeah, you make people sad in a big hurry, and so APItoolkit was all about doing that validation. Hey, Matt,

05:38

thanks so much for being a long time listener. And I don't know if you have a copy of Music to Code By already, but you do now. If you'd like a copy of Music to Code By, write a comment on the website at dotnetrocks.com or on the Facebooks. We publish every show there, and if you comment there and I read it on the show, we'll send you a copy of Music to Code By.

Speaker 1

05:53

And Music to Code By, for those who don't know, is something you can listen to while you're coding, or to calm a restless dog, or to put your children to sleep at night.

Speaker 2

06:03

Here I thought that was the nuclear weapons geek out for putting children to sleep.

Speaker 1

06:08

Oh come on, now, that was amazing, and it always is every year, but I was that particular one.

Speaker 2

06:14

Somebody told me it's like what I was depressed. So the key, I'm a little more level in that.

Speaker 1

06:19

Yeah.

Speaker 2

06:19

And I did almost all of the talking too. So apparently it's pretty good at knocking kids out.

Speaker 1

06:23

Well, there you go. Hopefully we'll have something more positive to say.

Speaker 2

06:27

This is only positive about nuclear weapons.

Speaker 1

06:29

Go on. Well, you know, okay. Well, let's get into it. I'm going to introduce our guest right now. Andrei Kamenev is our guest, and from twenty sixteen to twenty twenty-four, Andrei worked at Microsoft in various architect roles in Europe, helping customers bring their applications to Azure. Now he works as a product manager on Azure API Management. All that stuff that we were just talking about, but in the Azure way.

07:00

Welcome, Andrei. Thanks for having me. Thanks for being here. Were you first in, like, the quick start teams, those folks that helped onboard people into the cloud? Did that digital transformation thing?

Speaker 3

07:11

Uh, you mean the service by itself?

Speaker 1

07:14

Yeah, your earlier role before you joined the product team.

Speaker 3

07:17

Yeah, yeah. So I was a part of what is called here at Microsoft the Global Black Belt team. Okay, so it's like a bunch of cloud solution architects who help local teams, like field engineers and customer solution architects in the EMEA region, to build stuff with Azure. Our team was mostly focusing on Kubernetes-related stuff. So I was working a lot with customers on bringing workloads to Azure Kubernetes Service, Azure Red Hat OpenShift, and so on.

Speaker 2

07:43

So yeah, deep. So the move over to API management makes sense, because that's a lynchpin problem when you expose stuff on the cloud that was typically just on-prem before.

Speaker 1

07:53

Yeah.

Speaker 3

07:53

Absolutely. Yeah, we've seen a lot of customers who were interested in APIM, even back then when I was not a part of the APIM team. Like, okay, I have a bunch of APIs in my Kubernetes, how do I expose them securely? Like, what can you offer me,

Speaker 1

08:06

Microsoft?

Speaker 3

08:07

And then I think back then there was the self-hosted gateway, and it is still out there. It was a solution.

Speaker 1

08:13

So yeah, so what are you working on these days? Yeah?

Speaker 3

08:16

So these days, I guess there's a lot of interest in gen AI, in large language models. ChatGPT is all over the place. So right now, we, I believe in May, yeah, in May we released the GenAI gateway capabilities in Azure API Management to help customers build intelligent applications with LLMs.

Speaker 1

08:37

So yeah, that's I have a great idea. Let's give our AI all of our API keys and let them do whatever they want to do with it.

Speaker 3

08:45

Yeah, that's actually a better approach. That's one of the things that we're actually trying to solve for customers.

Speaker 1

08:52

Right, I know. Yeah, it just sounds crazy, doesn't it, given all the AI hiccups and the fact that people don't really trust them. But I mean, let's bust that myth. You know, why would we use AI to make our API calls, manage our APIs, do all those things that we'd normally trust folks to do? Yeah.

Speaker 3

09:14

So here I think, from the APIM side, we have two different stories. First of all, we use gen AI ourselves to help customers write the policies that we have in APIM. And the second thing is, we have customers who are building intelligent applications using ChatGPT models, using other models that are out there in Azure AI Studio, and they have challenges, because if you think about Azure OpenAI, for example, it is still an API. You still have the same challenges when it comes to managing

09:46

and securing access to APIs. But they also have a number of challenges which are specific to LLMs. So this is the second part, where we help customers to secure, manage, and scale OpenAI deployments for their applications. That's what we call the GenAI gateway in API Management.

Speaker 1

10:07

So what are some of the ways in which AI can be used with APIs? Obviously creating the management stuff around it. But would you necessarily trust an AI with your API keys and say, you know, here are the rules and times under which you would make these API calls? I'm trying to wrap my head around that.

Speaker 3

10:32

Yeah, I think, as always, it depends, right? So yeah, if you have specific access controls in place, why not? Like, for example, with APIM you can provide specific keys for the LLM to enhance the experience for those who use those LLMs, and all that. Then why not? You're not giving access to the full API. You're just giving access to a subset of operations.

Speaker 1

11:03

Sure, which are, for example...

Speaker 3

11:05

Read-only, or they just have access to specific data which is not really critical. So yeah, that's definitely where I would use it.

Speaker 1

11:14

I'd be definitely okay with GETs, but POSTs and PUTs...

Speaker 3

11:18

Yeah, I don't know. Yeah.

Speaker 2

11:21

Otherwise you'd have to make these rules yourself, right, I mean that that's the point here is that you got a machine learning model essentially that's figuring out what the optimal rules are for utilization.

Speaker 3

11:28

Yeah, so there is a lot. As I mentioned, the first thing we've built in APIM, for example, is helping customers with LLMs to configure APIM. So I guess you're kind of familiar with APIM. We have these XML policies, which can be pretty long documents.

Speaker 2

11:46

Yeah, so typically you're doing the thing where you're trying to say no one user can do more than this many, or if it's growing massively, limit it so you don't knock other people off, and make sure they're using the right accounts. Like, it's just forwarding.

Speaker 1

12:01

Yeah, a lot of stuff in there.

Speaker 2

12:03

You know where stuff is, and you know what failover modes look like. Like, yeah, APIM. We've done a few shows on API management now, and it's pretty powerful stuff. You know, you're going to put an API in public, and you're paying when somebody calls it, so you kind of want to put governance around that. But writing all those rules, when you really dig into it, it's complicated. So that was my first thought when I thought about, yeah, what do I want generative

12:27

AI to do? It's like, look at what's actually going on and write me better rules.

Speaker 3

12:30

Yeah, exactly. And that's kind of the two use cases that we focused on with Copilot. First, we know that writing policies is hard. We have like fifty, sixty different policy snippets to do things like validate JWT, retry policies, rate limits,

12:45

and stuff like that. So we decided, let's have a way for customers to express it in plain English. Like, for example, I want to have a policy to rate limit this API to, you know, five requests per second, and then Copilot will, sorry, not explain, but generate a policy for you, and then you just copy and paste this into the XML

13:03

editor and that's it. And another one, the second scenario: as I mentioned, policies can become pretty long, like two hundred, three hundred lines, and sometimes you don't even understand what's going on there.
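
As a rough illustration of what a rule like "five requests per second" means in practice, here is a minimal fixed-window limiter in Python. It is only a sketch: APIM expresses this as an XML policy rather than Python, and the class and method names below are made up for illustration.

```python
class FixedWindowRateLimiter:
    """Allow at most `calls` requests per `period` seconds for each key."""

    def __init__(self, calls: int, period: float) -> None:
        self.calls = calls
        self.period = period
        # key -> (time the current window started, requests seen in it)
        self._windows = {}

    def check(self, key: str, now: float):
        """Return (allowed, retry_after_seconds) for one incoming request."""
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.period:
            # The old window has ended, so start a fresh one.
            start, count = now, 0
        if count >= self.calls:
            # Over the limit: tell the caller how long until the window resets.
            return False, self.period - (now - start)
        self._windows[key] = (start, count + 1)
        return True, 0.0
```

A caller would pass something like a subscription key or client IP as `key`, and translate a rejected check into an error response.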

Speaker 2

13:14

In XML. In XML, exactly. Another theme on the show lately is we hate XML. Yeah, there's a use for AI right there. Hey, translate this XML to

Speaker 3

13:27

me in English. Exactly. And that's actually what we do with the second scenario. So yeah, you can just select the whole XML thing or just a policy snippet, and then you ask it to explain it to you. And the fun stuff: it's not only explaining, like, oh, this policy does this, this policy does that, but it also understands the context. Like, if you have different variables, you have context,

13:49

you have policy expressions with some logic in there, it will also explain that. For example: oh, you are doing a validate JWT policy and you have this admin claim that you're checking. If you're checking the admin claim, if it exists,

14:03

then you are allowed to do this operation. So yeah, it's pretty good at explaining policies, because we are not using just the plain model. We're also using what's called the retrieval-augmented generation pattern, where we have policy snippets that are stored in a storage, and the model can also use these policy snippets that we provided to better respond with correct policies and better explanations.

Speaker 2

14:30

And so yeah, so the same way I would actually write policies is I go cut and paste from well written policies exactly. Yeah, you've trained a model on well written policy so that it has a good chance of expressing better ones.

Speaker 1

14:41

For a customer.

Speaker 2

14:42

Yeah, exactly. Okay, I mean, I could see a few different things going on here at once, because, and I'll include a link to this blog post here, you're also talking about using APIM to manage utilization of the OpenAI service, because that stuff gets expensive. Like, exactly. Yeah, those tokens run away on you, and you're having a bad day.

Speaker 3

15:09

Yeah, yeah, that's actually an interesting use case, because, as I mentioned, we have customers who are trying it out. They're building POCs, they're building small applications, and there's Azure OpenAI Service, and it makes it really easy for you to start. You just deploy an OpenAI endpoint, you select, for example, that you want to have the GPT-4 model, and then you're good to go.

15:29

That's just an API. You get your API key, you import the SDK of your choice into your application, and that's it. You're sending prompts, you're receiving completions.

15:37

Everything's fine, but then customers realize that, okay, tokens. Exactly. Yeah, there are tokens, and tokens are something which is super important in Azure OpenAI and in LLMs in general. You spend tokens for prompts, you spend tokens for completions, and when you deploy an OpenAI instance, there is a quota associated with your model, which is expressed

16:01

in TPM, which is tokens per minute. Right. And then after all of these experimentations, customers started to realize that, okay, now we need a way to manage this, because, okay, we've built our first POC. We have one team who developed this kind of private preview app, which is not fully in production right now. But now we have ten different departments, ten different teams who also want to get access to this model. And now, how can I

16:24

manage that? How can I limit the consumption per team, per department, per developer, How can I.

Speaker 2

16:31

Make sure costs are assigned out. Like, my sysadmin hat is firmly on right now. There's nothing better in this world than being able to bill out resources to the individual teams for what they do.

Speaker 3

16:42

Yeah, and that's and that's a huge issue, Like you need to figure out how many tokens were consumed by a specific team, sure what kind of model they used, And then like okay, at the beginning, you have one endpoint. But what if you want to have multiple endpoints because

16:56

like, you're going to production, you want to scale. How do you load balance? How do you create circuit breaker rules to make sure that, for example, okay, while our first instance is throttling and responding with 429, how can I fail over to a different endpoint?
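
To make the failover question concrete, here is a small Python sketch of priority-ordered backends with a simple circuit breaker that trips on a 429. It is an illustration under assumptions, not how APIM implements its load balancing: `Backend`, `call_with_failover`, and the `send` callback are hypothetical names, and `send` is assumed to return a `(status, retry_after_seconds, body)` tuple.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    open_until: float = 0.0  # the circuit is open (backend skipped) until this time


def call_with_failover(backends, send, now, default_cooldown=10.0):
    """Try backends in priority order, skipping any whose circuit is open.

    Returns (backend_name, status, body), or (None, 429, None) when every
    backend is currently throttled.
    """
    for backend in backends:
        if now < backend.open_until:
            continue  # tripped recently, so do not even try it
        status, retry_after, body = send(backend)
        if status == 429:
            # Trip the breaker for as long as the backend asked us to wait.
            wait = retry_after if retry_after is not None else default_cooldown
            backend.open_until = now + wait
            continue
        return backend.name, status, body
    return None, 429, None
```

The design choice here is that a throttled backend is left alone until its own retry hint expires, so traffic shifts to the next endpoint instead of hammering the one that is already saying no.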

Speaker 1

17:13

Right?

Speaker 3

17:13

Yeah, so there are a lot of challenges. Now, you mentioned giving access to API keys. Distributing API keys to all of these teams also doesn't sound like a good idea. So that's why we've built a lot of the stuff that is in this blog post for the GenAI announcement. We wanted to solve these challenges for customers who are scaling and trying to productize their investment in Azure OpenAI specifically, but also in other models, like LLMs and stuff.

Speaker 1

17:45

Yeah.

Speaker 2

17:46

Certainly, one of the experiences I've dealt with, with companies building software in the cloud: even when, you know, they've got authentication and they're billing back to the customer, the customer makes a mistake with the API and racks up a couple of million transactions that were test transactions. Like, they're not making money on the back end. Then you're sending them this ugly bill and they, you know,

18:07

want help. In the meantime, you've also gotten an ugly bill, you know, because you ran it on the back end. So this is this whole game of like who's.

Speaker 1

18:16

Holding the bag here?

Speaker 2

18:17

You know, you don't want to punish your customer for making a mistake. If you do, you may lose them as a customer. You're not necessarily going to get remediated, you know, back to Azure too. Although I've certainly had that experience where I've done stupid stuff in Azure and called them, like, I'm really sorry I did this, and they're like, yep, fine, we'll wipe it.

Speaker 1

18:34

Oh you're the guy. Yeah, we've been waiting for your call. What was that?

Speaker 2

18:40

But the business reality of this consumption model is you don't always get paid for the stuff that you used, right? All of these mechanisms, to me, speak to: why didn't you notice? Why didn't you catch it before it ran away? You know, after the first million tokens, why didn't you stop me? And these are the tools, right? Like, this is how you stop this from being worse. Exactly.

Speaker 1

19:11

Yeah.

Speaker 3

19:11

Yeah. So we were trying to make sure that customers have the right tools to have proper governance in place. So one of the things that you mentioned is tokens. We already had a rate limiting policy that works for requests. You can say, as I mentioned previously, five requests per second, for example. And now we had to build something for tokens, something which is aware of these tokens, which are kind

19:37

of the main currency of OpenAI, as I mentioned. So, yeah, we introduced the token limit policy. It works pretty similarly to the rate limit policy. You can say that, okay, we have this application, we have this department, we have this team. Now we assign, let's say, one thousand tokens per minute to this application to make sure that they do not consume more.

20:00

And yeah, that works pretty well. And if you want to be extra careful, you can also configure the policy to estimate the tokens which are in the prompt. So whenever there is a request coming in with a prompt, we calculate the number of tokens which are used in the prompt, and then if we on the APIM side understand that it already exceeds the limit, we will not send this to the back end.
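
The token limit with prompt estimation described here can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual policy: the names `TokenLimiter` and `estimate_tokens` are hypothetical, and the estimate uses a crude "about four characters per token" rule of thumb rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, (len(text) + 3) // 4)


class TokenLimiter:
    """Tokens-per-minute budget per key, checked before the backend is called."""

    def __init__(self, tokens_per_minute: int) -> None:
        self.tpm = tokens_per_minute
        self._usage = {}  # key -> (window start, tokens used in this window)

    def _current(self, key, now):
        start, used = self._usage.get(key, (now, 0))
        if now - start >= 60.0:
            start, used = now, 0  # a new minute, so the budget resets
        return start, used

    def admit(self, key: str, prompt: str, now: float):
        """Return (allowed, retry_after). A rejected prompt never reaches the model."""
        start, used = self._current(key, now)
        if used + estimate_tokens(prompt) > self.tpm:
            return False, 60.0 - (now - start)
        return True, 0.0

    def record(self, key: str, total_tokens: int, now: float) -> None:
        """Charge the real prompt plus completion tokens once the response is known."""
        start, used = self._current(key, now)
        self._usage[key] = (start, used + total_tokens)
```

The point of the pre-check is the one made in the conversation: if the prompt alone would already blow the budget, the request is turned away at the gateway and no tokens are spent on the back end.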

Speaker 2

20:26

Right, so you won't consume in the first place if you're already pressing against the limit. Yes. How do you bubble up that you've hit a limit?

Speaker 1

20:34

Like?

Speaker 2

20:34

What does that look like for the customer? What does it look like for the operator?

Speaker 3

20:38

Yeah, so there is a pattern with rate limiting. Typically it's a 429 returned, with a Retry-After header with a specific number of seconds.

Speaker 1

20:49

That's a message that says sorry. Yeah, the Canadian version, yeah.

Speaker 3

20:56

And that's what we've built for this token limit policy as well. So whenever the limit is hit: 429, retry after a specific number of seconds or minutes, depending on how you configure it.
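
On the client side, honoring that 429 plus Retry-After pattern might look like the Python sketch below. The helper names and the `send` callback (assumed to return `(status, headers, body)`) are hypothetical, and only the "number of seconds" form of the header is handled; the HTTP-date form falls back to a default wait.

```python
import time


def parse_retry_after(value, default=1.0):
    """Return the seconds to wait, falling back to `default` if unparseable."""
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call `send()` and retry on 429, waiting as long as the server asks."""
    status, headers, body = send()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(parse_retry_after(headers.get("Retry-After")))
        status, headers, body = send()
    return status, body
```

Passing `sleep` in as a parameter keeps the helper easy to exercise without actually waiting.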

Speaker 1

21:07

Right, if you're being rate limited. Yeah, I use the Google YouTube API, and I'm working on a new publisher and it's going to be publishing to YouTube, just like we talked about earlier. And it's weird. I work for a couple hours on this in the morning, and I make several requests and then I get the you know, quota exceeded, and I'm going to look at my quota and it's like ten thousand API calls. I'm like, I

21:35

need to make ten thousand API calls. So it's just an anecdote, but yeah, I'm looking at the response for that, you know, when I try to authenticate myself, and it'll say nope, quota exceeded. Sorry. Yeah.

Speaker 3

21:50

And we're also trying to make sure that it is fully transparent for developers, because there is a huge ecosystem of different tools for OpenAI and other LLMs, like the Azure OpenAI SDK, LangChain, Prompt flow. There are a lot of different tools, and typically developers start with direct access to OpenAI, because the SDKs expect a specific URL on the OpenAI side, they

22:16

expect the API key, and so on. So on our side, we wanted to make sure that this experience is the same for the developer. Which means that if we put APIM behind, or sorry, between OpenAI and the developer, they will never notice that something changed. So for us, it was super important to make sure that the developer experience is still the same. That's why, yes, that's why we return 429, because that's what

22:39

OpenAI does. We are trying to follow the same structure to make sure that everything works as it worked before.

Speaker 1

22:46

Isn't one of the things that APIM does that, if you have a process for the developer that includes several API calls, maybe to different services, you can make one sort of master API that then makes calls and proxies out on your behalf to these other ones and comes back with a single result? I've used that feature of APIM. There's just so much stuff, and when I got into it, there's just so much

23:17

stuff in there. We could probably spend two hours just talking about all the features of APIM. But you mentioned that Microsoft put out a white paper about, you know, some of these new features. I guess, can we get a link to that? And what are some of the other amazing things that we might not know about that are in there now?

Speaker 3

23:40

There's a lot of innovation happening in APIM, so the GenAI gateway is definitely one of those things that I mentioned. We're also currently working on enhancing, for example, the workspaces feature that we have in APIM, to make sure that each team has its own workspace with isolation, like control plane isolation, data plane isolation, and so on. Recently, we also released a couple of new SKUs for APIM which are way faster to provision. They work better, they

24:05

work on a new architecture under the hood. There is a slightly different pricing model. But yeah, we have a lot of stuff going on there. To your point, as you mentioned, it's really hard to understand what's going on in APIM, with a lot of policies and stuff like that. With the GenAI gateway that we were discussing, that's also one of the challenges that we wanted to address. Like, okay, we have

24:34

these intelligent application developers. They use ChatGPT, they know how to use that, but they're not familiar with APIM, and now we're asking them to write a bunch of XML policies to have the token limit, to have the authorization in place, load balancing in place, metrics for token consumption in place, and so on. So we wanted to address it, and we thought that it would be nice to have an easy experience for those developers in APIM to import existing

25:02

Azure OpenAI APIs. So we now have this kind of UI portal experience where you can just say, okay, I was using this OpenAI endpoint, let's configure that one. And also I want to have a token limit of, I don't know, two thousand TPM, and we can configure everything for them, so they don't really need to care about the XML policies. They don't really need to look into those.

25:23

Of course, if you need to change something later on, or you need more sophisticated policies, of course you need to learn something, but at least the getting-started experience is super optimal.

Speaker 2

25:34

Well, Microsoft, yeah. I like the Copilot aspect here of, also: I know I wrote this a month ago, but I don't know what it says anymore. Like, parse this for me. Again, with my admin hat on, often I have a service level agreement I'm making with certain customers that's written in legalese, and I'm trying to translate it into, heaven help me, XML. But the idea that I have an intermediary tool that would take that legalese and try to make the XML for me,

26:00

and then after it's done, I could ask for it back and say, like, how close have I gotten here? Have I actually hit the rules that we've agreed to in the SLA? That translation layer has always been a challenging part of it. Has this always been about the money? Like, that's the main thing that's happening here: you don't want to run it, you know. I presume you'll always sell us more cloud, you know, by the transaction.

26:22

If you just keep requesting calls, that's fine. It's just then one day you're going to have to pay for it and it's not what you intended. So is that the important part in API management? Like, I'm not worried about tipping over the cloud, am I?

Speaker 3

26:36

Well, I guess it depends on your scenario, of course. But yeah, that's one of the things that you can put into APIM. Whatever control you need, you can build it with the pretty powerful policy engine that we have in APIM.

Speaker 1

26:50

That's cool.

Speaker 2

26:51

I appreciate that. And gentlemen, I need to take a break for one moment for these very important messages. And we're back. It's .NET Rocks. I'm Richard Campbell, that's Carl Franklin. Yo, yo, yo. Talking to our friend Andrei a bit about these improvements to APIM, which we all should be using. If we're going to expose an API through the cloud to the world, don't leave it naked, give

27:14

it some armor, and this tool helps. These gen AI tools help us to configure it correctly and operate it well, but then also deal with the additional complexities when it comes to the Azure OpenAI APIs, with issuing tokens for software to utilize OpenAI and putting limits in place, all of those good things.

Speaker 1

27:36

Have I summarized that correctly?

Speaker 2

27:37

Andre?

Speaker 1

27:38

Yeah? I think so. Yeah, I think I'm starting to understand what you're doing here, man. I'm pretty excited. Richard is the human AI.

Speaker 2

27:46

I don't know that's true. It's, yeah, real, definitely created. Like, you said a very important phrase that is sticking with me now, which is: tokens are currency.

Speaker 1

27:55

Yeah? Absolutely.

Speaker 3

27:56

You can think about it as your main currency, your main resource you have with all of these models, and that's also what you're paying for, and.

Speaker 1

28:04

It's what you pay for. That's what you pay for, exactly.

Speaker 2

28:07

And so of course it's a currency, because it does ultimately translate into fiat currency of whatever form you're using. You're going to pay for that stuff, and then it pays for your models, and you have all those choices when you have these controls over top of it. Can we talk a little about the semantic caching policies? That sounds like a way to save money and potentially improve performance. That's interesting.

Speaker 3

28:30

Yes, yeah, that's actually a very interesting policy, and

28:33

a very interesting implementation from our side. So yeah, as you mentioned, first of all, we solve the latency problem just with the regular caching that has already existed in APIM for a while. You can cache requests, you can cache responses for specific requests. But with LLMs it is a little bit different, because your prompts can be different but semantically similar, right? That's what we do with the semantic cache. And so OpenAI

28:59

provides embedding models. An embedding model generates vectors which represent, you can think about it as, the semantic meaning of a specific prompt or a specific string. We generate one for a specific prompt, and then if we realize that there is a semantically similar prompt coming in, we will check the cache and we will retrieve the response from the cache

29:21

instead of hitting the OpenAI endpoints. So first of all, as I mentioned, we're solving the latency problem, so the response is getting to the client faster, but we're also saving on token consumption, because this prompt will never go to the OpenAI endpoint while we have the response cached. In our case, we're using Redis for vector search, so that's where we store the responses.

29:43

So yeah, if you're saying hi and then saying hello afterwards, they're semantically similar, so we just return the cached response.
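
The semantic cache idea (embed the prompt, look for a stored prompt whose vector is close enough, return its response) can be sketched in Python as follows. This is a toy in-memory version under assumptions: the real setup described here uses an embedding model and Redis vector search, while `SemanticCache`, its `embed` callback, and the 0.9 threshold below are illustrative choices only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored response when a new prompt is close enough in meaning."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # function: prompt text -> vector
        self.threshold = threshold  # minimum similarity that counts as a hit
        self._entries = []          # list of (vector, response)

    def store(self, prompt, response):
        self._entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        """Return the best matching cached response, or None on a miss."""
        vector = self.embed(prompt)
        best_score, best_response = 0.0, None
        for stored_vector, response in self._entries:
            score = cosine_similarity(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

On a miss the caller would send the prompt to the model and then `store` the answer, so the next similarly phrased prompt is served from the cache without spending tokens.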

Speaker 2

29:50

I immediately go to a scenario like imagine an incident that's happened that has caused a lot of flights to be canceled.

Speaker 1

29:58

That would never happen, Richard. Come on, use a real example.

Speaker 2

30:01

Folks are trying to find out if their flight's canceled, so you're going to get many requests from different sources that are essentially the same thing: is this flight canceled? You really only need to fetch that once. Now it's sitting in the cache, and you very quickly respond, yes, all flights are canceled.

Speaker 1

30:17

But you know.

Speaker 2

30:19

What I like about a caching model like that is that it will evolve over time. You know, you imagine other scenarios: once those flights are gone, there are other flights. But you're often only going to need to make that actual request back to the engine once and use it over and over again. So it's a good caching opportunity when you're going to have multiple people more or less making the same request, but in many different ways of phrasing.

Speaker 1

30:43

And also a way to bust the cache once the flights are back to normal.

Speaker 2

30:46

Yeah, rather than code it yourself, where you have to... Caching is not hard. Expiring is hard. Yeah, expiring's always hard.

Speaker 1

30:56

So wait a minute, what why is it? Oh? How many times?

Speaker 2

31:03

Although maybe, and again I'm reading here, this is an early version. This is your first sort of go with this.

Speaker 3

31:08

Yes, yeah, well, that's an early preview version for now. There are a lot of customer use cases for that. As you mentioned, that was a good example. But then basically, whenever a company builds some sort of chat service for answering questions, you always have frequently asked questions.

Speaker 2

31:33

And that's where, hey, you're going to build an FAQ table inevitably, but rather than you defining it, let utilization define it with a cache. Exactly.

Speaker 1

31:40

Yeah.

Speaker 3

31:41

Yeah, and that's where you have a lot of tokens saved just with the semantic caching policy.

Speaker 1

31:47

Yeah.

Speaker 3

31:47

Also for an internal knowledge base, that's important too. For example, we have a bunch of support engineers sitting in the call center, and sometimes problems are similar. You're just doing the search through ChatGPT, and yeah, your responses are returned from the cache and you're not hitting the OpenAI endpoint.

Speaker 1

32:07

Yeah.

Speaker 2

32:08

I was recently reading about folks that aren't securing these kinds of services properly, and people discover them and just use them as their free version of ChatGPT, basically leaving that vendor holding the bag for the token costs.

Speaker 1

32:25

It's a great idea, Richard. Yeah, nice, glad. I never I can't believe in everything, but this is what I'm thinking.

Speaker 2

32:33

It's like, I'm not even talking about, you know, improper utilization and runaway API calls and so forth, but genuine nefarious use, where somebody's like, oh, look, you've exposed chat to me and I can use it for anything. So I'm not even gonna worry about your product. I'm just going to exploit your token availability to run the queries I want to run, and you know, you get to eat it. Congratulations.

Speaker 3

32:57

Yeah, that's why, first of all, it's important to have something like APIM, where you have API keys which on the APIM side represent a specific caller or application. But there are also certain tools in Azure OpenAI itself where you can say that there's a specific filter on the content that this model is supposed to respond to. For example, if you're asking it, if you're training it.

33:20

In our case, it's trained to respond about APIM policies. If someone asks about the weather right now or something else, or asks to summarize a document which doesn't have anything to do with APIM, we will just respond: sorry, I cannot do that. I'm not trained to do that.

Speaker 1

33:37

Not my job. Go find your own chatbot.

Speaker 2

33:39

Yeah, and that document one's particularly evil because that eats a lot of tokens. When you shove a document up to summarize, that's absolutely token-intensive, and an easy mistake to make if you haven't boxed that interface properly.

Speaker 1

33:54

Talking about some more of the new awesome features, is there anything that we haven't talked about yet that customers have asked for that you've implemented in this next version?

Speaker 3

34:03

Yeah, there is an interesting one. I wouldn't say it's a specific feature, but it's kind of a challenge that we saw in APIM. We have supported Server-Sent Events technology for a while in APIM, but we had certain problems with that, because that's essentially streaming. When you send a request to ChatGPT, typically what you will see, and the experience that you're most likely used to, is that it will be responding

34:27

in chunks of text. It's not sending you the full response, it's responding in a streaming fashion. And it turns out that customers want to use streaming because that's what users are used to. They want to see the same experience in their chat experiences as well, like in their copilots and so on, whatever

34:47

applications they build. But there is a certain problem with that, because whenever you introduce some sort of buffering, the streaming experience breaks, which is the case for APIM right now. Whenever you have a logging policy or a monitoring policy, or you have a retry policy, so whenever you buffer a response or request, the streaming breaks. So we had certain challenges to make sure that the token limit and the token metric policies

35:14

work with streaming scenarios as well. So that's kind of challenging, I would say, and that's one of the things that customers request that we add support for.
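Editor's sketch (not from the episode): the buffering problem described above comes down to whether a policy inspects chunks as they pass through or waits for the whole body. The Python below is a minimal illustration under stated assumptions: both function names are hypothetical, a whitespace word count stands in for a real tokenizer, and nothing here reflects how APIM implements its token policies.

```python
from typing import Dict, Iterable, Iterator


def count_tokens_streaming(chunks: Iterable[str], usage: Dict[str, int]) -> Iterator[str]:
    """Pass each chunk through immediately, updating the token count as it goes."""
    for chunk in chunks:
        # Assumption: whitespace-separated words approximate tokens.
        usage["tokens"] = usage.get("tokens", 0) + len(chunk.split())
        yield chunk  # the client sees this chunk right away


def count_tokens_buffered(chunks: Iterable[str], usage: Dict[str, int]) -> Iterator[str]:
    """Collect the full response before counting, which breaks the streaming feel."""
    body = "".join(chunks)  # nothing reaches the client until the upstream is done
    usage["tokens"] = len(body.split())
    yield body  # the client receives one large chunk at the end


if __name__ == "__main__":
    upstream = ["Hello ", "from ", "the ", "gateway"]
    usage: Dict[str, int] = {}
    for piece in count_tokens_streaming(upstream, usage):
        print(repr(piece), usage["tokens"])
```

Both versions end with the same count, but only the first keeps the chunk-by-chunk experience the speaker says users expect. A real tokenizer would also have to handle tokens that straddle chunk boundaries, which this sketch ignores.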

Speaker 2

35:24

Yeah, for sure, there's more features still to come down the pipe, you know, like there's a lot we could be doing in here.

Speaker 1

35:30

Yeah, over time.

Speaker 2

35:32

Although honestly, when we started this conversation, I thought you guys had already done so many things. This thing as it stands is challenging, and I know there's still more that could be done.

Speaker 3

35:42

No, there are certainly a lot of scenarios. As you mentioned, one of these scenarios is content safety, just to make sure that we do not respond to specific things. I don't know, if there is a specific question in a prompt, we should not respond to this prompt.

Speaker 2

35:56

Which doesn't sound like an API responsibility. But you are at the gateway point, where doing content filtering is a logical opportunity to hit that. Yeah, that's definitely a different area. Yeah, and actually that's something that you can do today. You get access to the request, you can look at the headers, you can look at the body, and then you can write whatever regular expression you want to deny

36:17

the request. But that's to my point that policies are hard, especially for those who are not used to APIM. We just want to make sure that it's easy to use and easy to configure. So yeah, that's something that we're looking at adding. Content safety concerns are real: there might be PII data, there might be some confidential data in the request or response. You want to filter this out.

Speaker 3

36:46

And Gateway seems like a natural place to do this kind of stuff because that's the kind of single point where you see all the requests and responses.
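Editor's sketch (not from the episode): the "look at the headers, look at the body, write a regular expression to deny the request" idea above might look like this in Python. It is a sketch under stated assumptions rather than an APIM policy: `check_request` and `BLOCKED_PATTERNS` are hypothetical, the `api-key` header name is an assumption, and the two patterns are illustrative only and nowhere near a complete PII or content-safety filter.

```python
import re
from typing import Dict, Tuple

# Illustrative patterns only; real content-safety rules would be far broader.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),               # looks like a US SSN
    re.compile(r"(?i)\bsummari[sz]e\b.*\bdocument\b"),  # off-topic summarization request
]


def check_request(headers: Dict[str, str], body: str) -> Tuple[int, str]:
    """Return an (HTTP status, reason) pair for an incoming request."""
    # Assumption: callers identify themselves with an "api-key" header.
    if "api-key" not in {name.lower() for name in headers}:
        return 401, "missing api key"
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(body):
            return 403, "request blocked by content rule"
    return 200, "ok"


if __name__ == "__main__":
    print(check_request({"api-key": "abc"}, "How do I write a rate-limit policy?"))
    print(check_request({"api-key": "abc"}, "Please summarize this document for me"))
```

Doing this at the gateway means one rule set covers every model behind it, which is the point made in the conversation; the trade-off is that regular expressions are brittle compared with a dedicated content-safety service.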

Speaker 2

36:53

Because, see, Azure AI Studio has a whole mechanism for content controls and so forth. You kind of want to pick up the policies you've built over there and then push them in a hook to the API side.

Speaker 1

37:04

It's like you're saying, I only want to write one set

Speaker 2

37:07

of policies, but I want to be able to attach them in different places where it would matter.

Speaker 3

37:10

Yeah, there is also a big piece around governance and best practices within an organization. For example, you can have multiple model deployments, and they have different content safety configurations. With APIM, you have this platform engineering side of GenAI, let's say, where you can say that, oh, these are our rules, and all of the models that are deployed should

37:31

be behind APIM. And then in APIM you configure all of the rules that you have in your organization to comply with whatever policies you have in the organization. So in that case, you're basically shifting the control to APIM instead of configuring stuff at the model level.

Speaker 2

37:50

Yeah, no, and you could see that associated with particular authentication accounts too. So it's like, hey, I provide a service for medical, and so some pictures are going to be the kind that you wouldn't normally want to show anywhere. But that's the business here, so it needs a different rule set.

Speaker 1

38:06

Yeah.

Speaker 2

38:07

Interesting, interesting array of problems here, Like you guys are up against it.

Speaker 1

38:11

I appreciate this.

Speaker 2

38:12

You've got AI gateway samples on GitHub. Should I include a link to that? That looks pretty cool and super current.

Speaker 3

38:21

Yeah, that's an amazing repo that was built by one of the GBBs that we work with. It's basically a set of labs that you can try with APIM. You probably know that the typical space for an AI engineer is a Python notebook. Yeah, and that's something that we wanted to implement in those labs. So there's a bunch of Python notebooks, and then there is

38:55

code. Usually there is code that is calling OpenAI through APIM with the Azure OpenAI SDK, so it's pretty natural for AI engineers. And then we demonstrate the different policies: the token limit policy, the emit token metric policy, then a lot of additional stuff like load balancing and augmenting the response with the RAG pattern and so on.

Speaker 2

39:19

Yeah, so it'll seem familiar pretty quickly, dude. Yeah, you know, there's different people coming in from different angles, right? You've got your service builder on the back end, who wants controls and throttles and logging and that kind of thing. You've got your LLM folks who, you know, want to automate the flow and control of tokens. You've got administrators trying to keep things up and make sure billing is going to the right places. Anybody involved in cost control, which

39:49

is lots of folks. Like, my experience talking to developers when they're starting to experiment with LLMs is they want the latest and greatest of everything. But the technology may or may not be needed, and the price tag is huge for the latest versions compared to, hey, would this have worked with GPT-3.5? Because it's a tenth the price. Like, it makes a difference. Yeah, if you're not concerned about

40:14

any of that, you just don't. Nope, give me 4o, I want it all.

Speaker 1

40:17

Yeah.

Speaker 3

40:17

What's interesting with the LLMs is that it's actually the opposite. Usually, like, 4o is cheaper than 4 or than 3.5. Oh really? Yeah, that's because they're more optimized, so they say that they consume fewer resources.

Speaker 2

40:32

That's why it's cheaper.

Speaker 1

40:33

Interesting.

Speaker 3

40:34

So yeah, but that's actually a good point. Internally we think about two personas. We have the AI engineer, who's building the application, who's using all of the SDKs and wants the latest and greatest, and then we have the AI platform engineer, who is providing access to those models, and he or she cares about token consumption, cross-charging, load balancing and all this

40:57

kind of stuff. And AI engineers, they always want something new, and that's also one of the challenges for us, because the space is evolving super fast. We are just trying to keep up with different models. For example, 4o was recently announced. We're working on adding support for this model right now, because it's multimodal. It supports images and audio, not only text like GPT-4 or GPT-3.5. We're working on this right now,

41:26

and recently they announced GPT-4o mini. So it's really hard to keep up with the industry, and with a lot of open source projects building gateways, building capabilities to augment the LLMs. So yeah, it's a fascinating place.

Speaker 2

41:45

Yeah, job security for you, it sounds like, just trying to keep up, right? But I think that's part of the strength for the customers using this, to go: oh, a new model arrived, okay, well, it's in APIM, so we're okay, we can add that connection there. But certainly as they switch over to multimodal, I suspect your inputs are different. It's not just a blob of text going in, it could be almost anything.

Speaker 3

42:08

Yeah, that's actually also an interesting problem, because sometimes people think that, oh, okay, I have this GPT-4, and if GPT-4 is not available, I will go to, I don't know, some different model, for example Mistral Large. But in reality the engineers will work a lot on the prompts and make sure that these prompts work with a specific model, right? And typically if you switch the underlying model, most likely the

42:34

result will not be what you expected, right? So it's important to test it against multiple models.

Speaker 2

42:41

Well yeah, I said, this is such a moving space. Heck, let's face it, you can fire the same prompt at the same model several.

Speaker 1

42:48

times and get different results.

Speaker 2

42:49

Yeah, right, absolutely. We're not living in a land of consistency right now.

Speaker 1

42:55

Brian McKay and I used to do this show called the AI Bot Show. It was a YouTube thing, and he would have something to show, and he would be practicing it the night before we recorded, and when we recorded, the prompts would be completely different, probably because it learned overnight, or was modified. Yeah, something that he could jailbreak today, tomorrow is impossible. Crazy.

Speaker 3

43:20

Yeah, that's why prompt engineering is super important. Whatever you build, prompt engineering is always going to be very important. And the thing is that whenever you test it, it's not deterministic. You cannot say that, okay, it works. As you mentioned, it works today, but it might not work tomorrow.

Speaker 1

43:37

And as developers, that really messes with our head because we're used to absolute results. Yeah.

Speaker 2

43:42

I think we're also used to building on an existing data set, whereas so far they're pretty much tearing down these models and rebuilding them over and over and over again. So you can't expect that what worked before works again. That's just not the thing, because we don't revise models. We replace models, for better or worse. I was describing pareidolia on a walk this weekend. Pareidolia is the

44:09

tendency for humans to see faces in things, right? You look at a bowling ball and it's like, that's got a face, or the front of a car, it's got a face, right? And that's an evolved trait of humans, because if you detected the face first in the trees, you were the one running before the other people were running, so you probably lived. And the downside, of course, when you see faces that aren't there, is very low. It's not a big deal, right? So we're talking about

44:34

model variability. I'm like, so, imagine I take a shotgun and I shoot at a target, and then I say, do you like this face right now? If you say no, I want a better face, I don't take the same target. I get a new target and I shoot it again. And that's, you know, the nature of constantly rebuilding models: you typically don't get the same results again.

Speaker 1

44:54

Yeah.

Speaker 2

44:54

I'm sorry, that was a very long-winded way of going about that. But I like saying pareidolia.

Speaker 1

44:58

But I love that you, you know, introduced that word that I've already forgotten. But for me, that's like staring up at the clouds. You know, or inkblot tests, Rorschach tests.

Speaker 2

45:10

Yeah, yeah, humans see things that aren't there because it used to be useful, at least. Now, I don't know, it's creating its own set of complexities. Anyway, a question: how many versions out have you planned, Andrei? Like, I could see lots of demand from different folks for various features. We talked about the whole content management thing. But, you know, what comes next for you?

Speaker 3

45:35

Yeah, so we started with Azure OpenAI, being the one that is easy to use in Azure and the most popular one, but we also want to extend to other models, because there is definitely demand to use other models like Llama, Mistral, Cohere, Hugging

45:51

Face and others. So we're looking at how to expand our GenAI gateway capabilities to support more models, to make sure that customers can use multiple models in the same APIM instance, without the need to customize policies, write crazy policy expressions

Speaker 1

46:07

And so on.

Speaker 3

46:10

Then there's a huge demand on the logging and monitoring side. As I mentioned, we started with token tracking, but it turns out there are certain phases of intelligent application development where you actually want to collect all of the prompts and completions

46:25

to make sure that your model behaves correctly. And in general, logging is pretty easy with APIM if you are not using streaming, if you're not using Server-Sent Events, because again, as I mentioned, there's a buffering problem. So that's something where we're looking at how we can solve this

46:41

in the future. And also, in general, a focus on security and traffic management, like prompt manipulation policies. Let's say, in the example that you shared, I just found some copilot that I can now use for my personal gain. But what if I have some policy that says that, oh, for whatever context is presented in the prompt, I will rewrite it so that I know that my application works with it perfectly well?

47:10

So in that case, whatever you send us as a prompt will be rewritten on the APIM side, and you will not get the response that you wanted from this copilot that you just found on the Internet.

Speaker 1

47:20

So it just occurred to me at the end of the show here that I should have asked this question long ago. But is it possible to write two policies that contradict each other? And what happens if that's possible?

Speaker 3

47:31

I believe technically you can do that. But at the same time, the policies are executed from top to bottom, so whatever you have at the bottom will be enforced, right? So it's...

Speaker 1

47:41

The order of execution, yep, which is pretty common. Yeah, yeah, it is. It can lead to some confusion. Now, it'd be nice to have something when you're creating those policies to say, hey, you know, by the way, this contradicts this policy, you might want to take a look at that. Yeah.
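Editor's sketch (not from the episode): the "executed from top to bottom, so whatever is at the bottom wins" behavior described above can be modeled with a toy pipeline in Python. This is only an illustration of the ordering idea as the speaker states it; `allow`, `deny`, and `run_policies` are hypothetical names, and a real policy engine may stop early on a deny rather than letting later policies override it.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Policy = Callable[[Context], None]


def allow(user: str) -> Policy:
    """Build a policy that grants access to one user."""
    def policy(ctx: Context) -> None:
        if ctx["user"] == user:
            ctx["allowed"] = True
    return policy


def deny(user: str) -> Policy:
    """Build a policy that revokes access for one user."""
    def policy(ctx: Context) -> None:
        if ctx["user"] == user:
            ctx["allowed"] = False
    return policy


def run_policies(policies: List[Policy], ctx: Context) -> Context:
    """Apply policies in list order; a later policy overwrites an earlier one."""
    for policy in policies:
        policy(ctx)
    return ctx


if __name__ == "__main__":
    request = {"user": "carl", "allowed": False}
    result = run_policies([allow("carl"), deny("carl")], dict(request))
    print(result["allowed"])  # the deny at the bottom is the one enforced
```

Swapping the two policies flips the outcome, which is exactly the kind of contradiction an authoring-time warning would help catch.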

Speaker 3

47:58

We definitely validate policies, so the policy engine validates them. If there's something which doesn't make sense, there will be a validation error, and you will not be able to save the policy.

Speaker 1

48:08

Yeah.

Speaker 3

48:09

But yeah, if that's something.

Speaker 1

48:10

Allow Carl to access this API, don't allow Carl to access to.

Speaker 3

48:13

This API. Yeah, yeah, exactly, and stuff like that.

Speaker 1

48:17

Cool.

Speaker 3

48:18

So yeah, and in general, for the future of GenAI, we are good at security, traffic management, and just general ease of operations in APIM. So that's what we are focusing on, to make sure that customers have secure access control to those models, with all of the policies and governance in place. But at the same time, we want to make it easier

48:37

to build intelligent applications. So whatever we build, we are trying to give those AI engineers an easy-to-use interface if they're not familiar with APIM, to make sure that it's easy for them to set up, configure, and basically get all the benefits of APIM's GenAI gateway when they're building applications.

Speaker 1

48:56

Great, well, it sounds like the end of the show. Andrei Kamenev, thank you for being part of this. Is there anything that we missed that you wanted to mention, or a shout-out or call to action or anything?

Speaker 3

49:04

I would say, just make sure to check the Azure updates when we release new stuff, and we've got the Tech Community blogs where we publish all of the latest and greatest in APIM. And yeah, all right, that's it.

Speaker 1

49:22

Awesome. Well, it's been great talking to you.

Speaker 3

49:24

Thanks for having me.

Speaker 1

49:25

It was great talking to you too. Thank you very much. All right, we'll talk to you next time on .NET Rocks! .NET Rocks! is brought to you by Franklins.Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and of course in the cloud, online at pwop.com.

50:08

Visit our website at D-O-T-N-E-T-R-O-C-K-S dot com for RSS feeds, downloads, mobile apps, comments, and access to the full archives going back to show number one, recorded in September 2002. And make sure you check out our sponsors. They keep us in business. Now, go write some code. See you next time.

Transcript source: Provided by creator in RSS feed