Why Local AI Matters and How to Use It - podcast episode cover

Why Local AI Matters and How to Use It

Jun 21, 202645 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Summary

Nathaniel Whittemore and Nufar Gaspar discuss why companies should rethink their dependence on frontier cloud models, citing rising token costs, vendor fragility, and data control issues. They provide a practical primer on local AI, breaking down its basic layers from hardware and open models to serving tools like Ollama and agent harnesses. The episode highlights the benefits and trade-offs of running AI on controlled machines, offering guidance for executives, practitioners, and enthusiasts alike.

Episode description

In this Operator’s Cut, NLW is joined by Nufar Gaspar for a practical primer on why local AI suddenly matters and where to start. They break down the forces pushing companies to rethink full dependence on frontier cloud models — rising token costs, vendor fragility, capacity constraints, data control, and resilience — then walk through the basic layers of local AI, from hardware and open models to Ollama, LM Studio, agent harnesses, and the real tradeoffs of running AI on machines you control.

Register for our new enterprise-grade AI training programs: ⁠http://training.besuper.ai/⁠⁠

Brought to you by:

KPMG – Research from KPMG and the University of Texas at Austin shows the highest-impact AI users treat AI like a reasoning partner — and those skills can be taught at scale. Learn more at ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠kpmg.com/us/Sophisticated⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Section - Section turns AI investment into workforce transformation and ROI - ⁠⁠⁠⁠https://www.sectionai.com/⁠⁠⁠⁠

Outsystems - Stop wondering how AI will change your business and start building the agents that will lead it - ⁠⁠⁠http://outsystems.com/⁠⁠⁠

Scrunch - The AI customer experience platform - ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://scrunch.com/⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Zenflow Work - Agents for knowledge work - ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://zenflow.free/⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Blitzy - Want to accelerate enterprise software development velocity by 5x? ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://blitzy.com/⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

MissionCloud - Eliminate AWS complexity with end-to-end cloud and AI services ⁠⁠⁠https://www.missioncloud.com/⁠⁠⁠


AssemblyAI - The best way to build Voice AI apps - ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://www.assemblyai.com/brief⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Robots & Pencils - Cloud-native AI solutions that power results ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://robotsandpencils.com/⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

The AI Daily Brief helps you understand the most important news and discussions in AI. Subscribe to the podcast version of The AI Daily Brief wherever you listen: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://pod.link/1680633614⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Our Newsletter is BACK: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://aidailybrief.beehiiv.com/⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠

Interested in sponsoring the show? sponsors@aidailybrief.ai


Transcript

The Growing Imperative for Local AI

B

Today on the AI Daily Brief, how and why to use a local AI. The AI Daily Brief is a daily podcast and video about the most important.

🎵 Music

B

All right, friends, quick announcements before we dive in. First of all, thank you to today's sponsors, Robots and Pencils, section, Mission Cloud and Out Systems. To get an ad-free version of the show, go to patreon.com slash AI Daily Brief, or you can subscribe on Apple Podcasts. To learn more about sponsoring the show, visit aidalybrief.ai slash sponsors or you can email us at sponsors at aidalybrief.ai.

To learn more about the new executive agent training program that Nufar mentions at the end of this episode, go to training.bsuper.ai. And yes, we are back with another Newfar Operators Cut. Specifically, this week the conversation has been so much about the changing composition of enterprise AI strategy or just business AI strategy in general, as people deal with one rising cost from agentic workloads.

and to the new reality that our AI can be turned off on a whim at any moment. And yet there is a chasm between the idea of using alternatives to the major models. and actually being able to do so. And so what Nufar has presented today is a primer that's gonna give you a background understanding of a lot of the key concepts, terms, and steps you would need to take to even explore thinking in this new way. All right, Nufar, welcome back to the Daily Brief. How's it going?

A

Good. How are you?

B

Good. So we are at the transition point moment. I hope Fingers crossed by the time this airs, although I'm not super optimistic, but fingers crossed we might be playing with Fable Five again. But I think that this week, as I've been discussing all week, has shown why investing only in the biggest model or the best model is not necessarily the best strategy. On Thursday's episode from last week I talked about

what the alternative models and model approaches that companies are starting to think through. But there is a huge gap between just shifting thinking from Fable V to some other type of model to understanding what that actually takes. And that's the gap that you are going to fill in at least on a basic or high level for us today.

A

I'll do my best. All right. So I do think that there is a big gap between saying open source and fully understanding the implications and deciding whether you should go and buy a hardware for your company. There is a lot of understanding that needs to be done of what it all means.

So today I'll try to give a very practical overview of why you should care about open source and why you should care about running models locally. In practice, it will also include how can you do it and for whom it might be relevant. So just a quick recap of the perfect storm that makes open source so important nowadays, in my own words.

So the first force that I I see is the cost taxes. Everybody's talking about tokens, the cost and how to maximize value while minimizing the cost or an optimization of the cost. Anyway, a lot of uh conversation around tokens. That's for a good reason, but for the most part it's becoming more and more expensive. Just a f a few examples that I'm sure many of you have encountered, the price growth of GPT.

the release of Opus four point seven that changed the tokenizer and all of a sudden companies who hasn't made any change to their uh prompts has a bill that increased in uh sometimes thirty five percent. And we're seeing more and more companies leaning towards agentic workflows and as such these harnesses are also a huge cost multipliers. So the thing is that many individuals and companies are more than willing to pay the price if the return is is justified.

And we are all very excited and pleased to know that there is a new player in the block, and namely Fable Five. And then, as we all know, it was shut down and we are still kind of waiting to understand the uh events and maybe by the time it airs your optimism is gonna be in motion and we're gonna have Fable five back. But I think that the eye is becoming increasingly uh more volatile because of the geopolitical forces.

So theoretically we said it before, but I think that seeing that in action and realizing that all of a sudden you might have a high dependency on a single vendor that can be shut down by a government that create a new category of dependencies that we all need to start thinking about how to alleviate it. So beyond these two, there's also a third force and we should definitely keep paying attention to the fact that there is a capacity issue.

And data centers are being built in at a very fierce pace. However, the usage is going even faster and we may be heading towards a world where it's not just about the cost, it's about whether you can even get access to sufficient compute when you need it. So many companies and individuals have hardware that is just sitting idle, that could serve their AI tasks. So that's an untapped resource, while the resource that they do tap into might become even more scarce.

B

For reference here, every estimate that I've seen, if you watch just sort of leaders from TSMC or from NVIDIA or anyone, they're all predicting capacity shortages at least through the end of the decade. No one is looking at earlier than twenty thirty. And that might be optimistic just based on the difference in the speed that demand is growing versus capacity is growing. So this is why cost is not just a current issue. It is a leading indicator of a much bigger cost issue in my estimation.

A

I agree. And one more twist to add on that is that even if you are contemplating buying hardware for home or for your company, the hardware itself is inc increasingly becoming more and more expensive.

A lot of it is because of memory shortage. That means that there is a supply chain issue that will not become even better anytime soon. So even if you are contemplating buying, s there is something to say for buying sooner rather than later because the cost Keep going up and up, even for the purchase option.

Local AI: Benefits, Audiences, and Core Concepts

If I put together all of these sources, I think that what we should all start thinking about is that local AI deployment of open source models on a hardware that you own is very much like building a shelter for your AI capability or the equivalent of the AI bomb shelter that you should consider. Obviously on the one hand it keeps you safe from all of the forces that we just named. We also are the owner of your data. You have availability during outages if you have a fully local deployment.

On the other hand, it comes with an overhead and you might save on tokens, but you will spend on maintenance, updates, hardware and the people who keep it running. So we'll talk more about it towards the end, but If you are paying a cloud vendor, often all of these costs and implications are hidden across a very well operationalized uh company.

So if you are contemplating bringing it home, it's very important that you understand all of the implications. And that's what we're trying to do today. And m namely I wanna meet you where you are,'cause I think that everybody should care, whether you are an executive that is

steering your company's AI strategy and vendor decisions, whether you're a practitioner that will drive the actual productization deployment of local AI, or just an enthusiast that wants to experiment and then consider running at least some of your workloads. on local models to save costs or just to be more self sufficient. So bottom line it's everybody, but I wonder what you think.

B

Yeah, I so I think that one of the biggest ways that AI differs from previous technology that I've seen is It's always a priority when there's a new technology movement for companies to come in and reduce complexity as fast as possible. And what's been interesting is that with AI,

The market of people who want to actually understand the guts of these systems and really get in there and figure them out, I think is much bigger. It's not just the sort of traditional addressable market of people who are uh any of these categor or you know, the practitioners or executives or IT people. And I think open claw is a great example of this. Open claw became a phenomenon not because

there were so many people already in the IT or so many developers who were using it. It's because there was 8,000 people who ended up doing Claw Camp within the first month. And the vast majority of them weren't even technical to start. So this is kind of the same spirit where I don't anticipate. I think ninety nine percent of people who listen to this episode will not race out to go build something.

But it's a blueprint. It'll help you understand the systems you're working with. And I guarantee it'll help you understand even the systems where all of this these parts of things are obviated and behind the scenes. So it's why I wanted to put it on the show, especially right now as everyone's paying attention, is that I think The market of people for whom it's applicable is much wider than it might seem.

A

And what it is to be. I agree. All right, so let's dive right in. But before, uh very quick and important distinction just to make sure that we are all on the same page. With AI, we have two phases that require very different hardware. In the training phase, we're building the model from scratch. This is what the labs do. It's why they need billion dollar data center and tens of thousands of specialized chips.

not what we're talking about today. You shouldn't care as an AI enthusiast about what OpenAI and Tropical Adders are doing with their massive data centers. You should care about inference. This is where you use the model that was already built by the various AI labs, asking it questions, getting answers, and empowering the brains of your agents. So everything in this episode is all around the inference, running a pre-built model on your own hardware.

And the hardware requirements for inference are dramatically lower than for training. That's why we are all able to now consider doing that on our own laptop or the hardware that we have lying around. A quick note, I'm going to simplify in places throughout the episode. There are many technical nuances that matter for engineers but would just be noise for

many of the other parts of the audience. So if you are an infrastructure professional, I'm sure that you will uh identify all the areas where I'm kind of cutting corners and you'll also know why it's okay that I'm doing that and will forgive me. So that's my disclaimer. All right.

Four Deployment Levels for Local AI

So if I'm going back to the bomb shelter analogy, you don't have to go and build a full bunker on day one because there are four levels from takes ten minutes still cloud all the way to fully on your hardware, no internet needed. And I wanna walk you through each one and maybe you will find what's the right place for you to be in. So at level one, that's the simplest first step. You can use a routing service like Open Router that sits between you and all the major AI provider.

You have one account, you have one interface, and it connects you to four hundred or more models across more than sixty providers. And this gives you first of all a mix and match by task. You can route complex reasoning to one provider. You can kind of very quickly do another routing to a simpler model. you can optimize cost versus quality per workflow. So you don't have to have a contract per vendo.

And you also don't have a vendor lock-in and you can switch models or providers whenever there is something new or maybe something happened that caused you to want to consider moving between vendors. You also have a very good cost transparency so you can compare side by side and then select the model that works best for your own workflows.

And of course, if there is some kind of uh outage or uh problem with one vendor, you can enable an automatic failover to another one, which makes it more uh robust. And lastly, of course, you can experiment with models to decide whether there is a new kid on the block that is catching your attention and you wanna swap to that. The trade-off for working with something like that is that the data still leaves your network.

You are still cloud dependent and you're still paying quite a lot to a third party, but you're not dependent on a single v vendor. And obviously open router is not the only alternative. It's just the most popular one. There are other alternatives like uh Light L L M, if you want more of a uh router that is self hosted on your own machine, you have port key for enterprise governance and others, just to name a few.

Same concept, one interface to many providers with an automatic failover, very quickly to set up. The level two is if your organization is already on some kind of a cloud, whether it's AWS, Google, Azure, and so on. This level uses what you have. So we we all heard or maybe are using services like AWS Bedrock, Google Vertex, Azure A Foundry and so on.

They all let you run several vendors on your own cloud and they all let you run also open source models in a way that is secure, compliant and in a place that you already most likely operate anyway. That means that your data stays within your own virtual private cloud and for the most part it's gonna be easier to approve that with your own security. You have two ways. You can use the commercial models or you can use them for open source as noted.

And I think that this is the path where most large enterprises are already taking or will be taking first whether they're starting to contemplate uh experimenting with more open source models, just to see the option. And then we have an option that is not for the faint of heart, which is to self host a cloud. It takes everything that we just discussed one step further. So instead of using a managed server like the ones that we mentioned, you rent a GPU.

And you install your own model, your own serving. We'll explain what it means in a minute. That means that you don't have any platform, no restrictions, and you get to do everything good, bad, ugly. So For most organizations this is not very practical because it requires a lot of infrastructure engineering, ones that know how to work with GPU drivers, work well with containers and many other engineering works.

But for teams that have that capability, it gives maximum flexibility and often it's probably the lowest per query cost at high volume. Again, given that you know how to manage your own bare cloud without any help from the cloud provider. And lastly, this is where you go fully local. That means that everything is on a hardware that you physically control. No internet is needed after the initial model download. No model in the loop at all.

And this is where we'll spend most of the rest of the episode, to walk you through what it means to deploy AI fully locally. Because I think that's where most of the learning lives. And that's the level that will truly survive any internet outage, export control, or vendor going dark, or I don't know what's gonna be the future, but you have full control with that level.

By the way, that's not where I think that uh everybody should start here. I think that if you are an enterprise you should probably start at level one immediately, evaluate level two for sensitive workload and you can build towards level four if you have capabilities that must survive all of these disruptions.

Of course that's also the level where many of us, the individual practitioners, can live and build for ourselves and many people are already doing that, and we'll focus on that level from uh here on after. So it's a stack of five layers to go fully local.

Hardware Essentials for Local AI Deployment

And they all matter. At the bottom we have the hardware, where physically do we run our AI. Then we have the model, what is the intelligence that is being loaded. Then we have the serving layer, what software make it available. Then we have the agent harness or the user interface, what orchestrates the action and at the very top

we have the fully uh user-facing what you actually see and what you actually interact with. So I wanted to go from bottom to the top to make sure that you understand how to do it for yourself or at least as mentioned talk to talk. So layer one is the hardware and the question is where does it physically run? Just going very quickly to the basics because this matters, your computer has two types of brains. You have the CPU and you have the GPU. The CPU is the set the general purpose GP.

the trans your operating system, the browser, the email, every computer has one. It can run AI models, but typically more slowly, because it wasn't designed for this kind of mathematical operation. Then we have the GPU, which stands for Graphic Processing Unit, originally built for gaming and video, but it turns out that the same architecture is perfect for AI.

So GPUs do thousands of simple calculations simultaneously, which is exactly what's running an AI model requires. So GPU is typically what you need. And the key number that truly makes a difference is the memory. Specifically, how much memory your GPU has called VRAM, and the entire model needs to fit in this memory if you want to have a usable speed.

If it doesn't fit, the system typically falls back to using regular memory through the CPU and everything slows down dramatically. Just very quick hardware simplification. All right. So what does it mean for different machines? If I have a regular laptop like the PC that I have, I don't have any gaming graphic cards, I don't have so much memory, I can run on my own laptop small models.

through the CPU. It's gonna be quite slow, but still functional for simple things, primarily to learn an an experiment. So that's gonna be like the small stuff. If, however, you have a Mac with an Apple silicon, Then you have a CPU and GPU that share the same memory pool. So your Mac can probably run even larger models and that's why Macs have become so popular for local AI and as a result most of them are very hard to come across nowadays.

Another great option is if you have a desktop with gaming GPU that's gonna have a dedicated graphic card with sufficient memory, it's not gonna come cheap, it's gonna come around uh two thousand dollars. That's probably the sweet spot because that can run between medium to large models at a very good speed. We also have some interesting offering from uh NVIDIA around this category.

But you also have the option to run stuff on a phone or a tablet. So very small models can run even on your old Android machine. So don't uh be very haste to throw away old hardware.

Lastly, a server with enterprise GPUs can run any model well, but the cost structure is very simple. I'm gonna explain what um you when you see these numbers of parameters and so on in a minute, but For now, think about the t-shirt sizes, meaning that your hardware determines the largest size that you can wear or the largest model that you can run, and typically how smart or how sophisticated the use cases that you have in place.

Okay, so prices are quite diverse. They spend seven hundred dollar uh if you wanna buy at the low end some kind of a used high memory graphic card for an existing desktop. That gets you to medium sized model and that's gonna cost you less than one thousand dollars. Uh at the mid range you will have three to five thousand dollars that will buy uh purposefully built AI appliance from Nvidia or AMD.

And the cost keep going up and up if you're contemplating that's a category that becomes expensive as we go along. And I at the high end you have these numbers and if we're talking about purchasing a a server for a company where like it's a completely different degree of orders. A few things to know before you go and pull a credit card. First of all, as mentioned, the Apple products have a massive wait times right now because of the memory shortage, so it can be even months.

Second, you may not need to buy anything. You can just start with the hardware that you already have lying around. answer the ROI question, like do you have a justification to go and buy a hardware? Do you have a use case that you are able to run locally to satisfaction and it will not default back to paying the cloud vendors sooner rather than later, only to have this very expensive or fairly expensive hardware lying uh at home or at your ho office not being used.

And of course, if you are working in a regulated industry where compliance prohibits sending data to a third party API, local may be a requirement and not a choice. But if this is the case, you have to be honest because a machine on your own network

is not necessarily more secure than a well configured cloud API. So the security argument is strongest if you truly are not connected to the internet and no one can infiltrate your network. But if you are connected to the internet It's not necessarily that stuff that you have within your walls of your company are secu are more secure than what those cloud providers are doing for you in order to secure you from a cyber attack.

How different the enterprise costs uh if you want to buy a server for your data center, it starts with a quarter of a million dollars. So completely different ballgame.

🎵 Music

B

I cover the capability gap between AI potential and AI reality every day on this show. Most companies are still figuring out how to start. Robots and Pencils is already launching and scaling. Agentic and generative AI in production at large enterprises in weeks. AWS Advanced Tier Pattern Partner more than doubled in a year. And they're hiring 50 open roles.

If you're someone who knows this moment is different, who wants to be inside it, not watching it, this is worth a look. At Robots and Pencils, the best ideas win, and the team is purposefully kept super high quality. This is the kind of place you look back on as the best decision you ever made. Take a look at robotsandpencils.com slash careers.

Here's a harsh truth. Your company is probably spending thousands or millions of dollars on AI tools that are being massively underutilized. Half of companies have AI tools, but only 12% use them for business value. Most employees are still using AI to summarize meeting notes. If you're the one responsible for AI adoption at your company, you need section.

Section is a platform that helps you manage AI transformation across your entire organization. It coaches employees on real use cases, tracks who's using AI for business impact, and shows you exactly where AI is and isn't creating value. The result? You go from rolling out tools to driving measurable AI value. Your employees move from meeting summaries to solving actual business problems, and you can prove the ROI.

Stop guessing if your AI investment is working. Check out section at sectionai.com. That's S-E-C-T-I-O-N-AI.com. The average enterprise is spending eleven and a half million dollars on AI this year, and most of them can't prove a single dollar came back. What does AI actually look like when it produces ROI? Ask the healthcare company that just made their payment processing 320 times faster, or the law firm whose document research went from three months to ten minutes.

Or the contact center who reduced wait times by 99%. These are real Mission Cloud customers with real results. MissionCloud is a CDW company and an AWS premiere to your partner. They're the AI-first outcomes-obsessed AWS experts who build AI solutions that drive your business forward.

Whether you're flooded with AI ambitions but no idea where to start, or six months into a deployment that's going sideways, they've seen it and they've fixed it. Stop burning your budgets on AI that doesn't produce results. Start at missioncloud.com.

This episode of the AI Daily Brief is brought to you by OutSystems, a leading agentic systems platform built for the enterprise. Organizations all over the world are building, orchestrating, and governing agentic systems on the OutSystems platform and with good reason. OutSystems open and unified platform allows teams to architect, deliver, and scale governed agentic systems with agility.

Teams of any size and technical depth can use out systems to build, deploy, and manage AI apps and agents quickly and cost effectively without compromising reliability and security. With OutSystems, you can rapidly launch ideas from concept to completion. It's the leading agenda systems platform that is unified, agile, and enterprise-proven, allowing you to accelerate growth, reduce operational friction, and deliver real enterprise impact with AI. OutSystems. Build your agentic future.

🎵 Music

Models: The Intelligence of Local AI

A

Let's talk about layer two of the model. And the question that we're trying to answer here is what's the intelligence that we want our hardware to run? And I think that most of us never needed to think about what models were running or more importantly what their size, because if you use the Ch GPT, Claude, Gemini and so on, you were using a model and all you had to decide is between fast and thinking, basically.

because someone else chose it, hosted it, and did everything to maintain it. However, if you are contemplating deploying your own model, you need to understand that model comes in different sizes. And a model size is measured in parameters. Billions of learned values that encode the patterns from the training data. You can think of parameters like a vocabulary and experience all combined into one.

Typically more parameters means that the model can hold more nuance and can handle more complex reasoning and produce even more sophisticated output, again at a high level. The question is, if that's the case, why don't companies just make every model enormous or more and more uh big over time? And that's because bigger models need more compute power to train, to the point of billions of dollars at the frontier.

uh and uh much more memory to run. So the size spectrum is what you should understand very quickly. At a high level We have the tiny models, those will be one to uh four billion parameters. They're very fast, they can run on anything, literally, including even your Android machine, can typically hold basic chat, simple summarization, or a very like a pointed task.

We then have the seven to fourteen billion parameters. Those are the small, quite capable for everyday tasks. They can do writing, they could do some uh boilerplate code, they can do QA, they can run very well on a laptop or a basic GPU. And I believe that most of you if you are contemplating doing a local deployment will first deploy models from this family.

And the medium size we have near frontier and and as time goes by we see more and more models at this size that are providing results that are almost as good as the huge ones. they need quite a good GPU or a high end Mac in order to run. And that's kind of the sweet spot if you wanna be serious about local deployment for yourself or for a s like an immediate team.

The large ones and of course the major ones, those are good for powerful reasoning. They will typically need more expensive hardware or even a setup that involves multiple uh GPUs. And I think that one pattern that is worth watching for is that especially for well defined tasks. around coding and math for the most part, we're starting to see tiny specialized models that match frontier performance. Just this week we've seen a three billion parameter model called VibeThinker.

that match the Cloud Opus and Gemini Pro encoding benchmarks. So three billion is extremely small, such that you can run it as n as noted even on your phone. But the catch is that it only works this well on very structured, very ver verifiable tasks and not necessarily on the knowledge work task that many of us are doing.

So still if we need general general purpose, general knowledge for things that we do as part of the knowledge work, size still matters, but seeming like a future where you might run from Tri class specialized model on very modest hardware. Bottom line, we don't need the frontier level intelligence for every task.

A huge amount of what we do with AI can be done on the smaller ones or staying at the like seven to fourteen billion or seven to twenty seven billion range. Those are open, free to download. They can run on hardware that all of us has. And bigger will be when you need either a more able or a more general purpose type of thing. What you should care beyond the size because as we say there are other parameters that make the models different.

what many people are being uh caught off guard with the how the models behave is that you can download a model that benchmarks beautifully. You try to use it for agentic tasks. calling the tools, following multi step instructions, and they fail spectacularly because it was trained for chat, not for tool use.

So when you are evaluating a model, also check does it support tool calling? How large is the context window? Will it hold the amount of input and output that I plan to run on a single session? Does it handle images? Is the license commercial friendly? These are on the model card that I will explain shortly.

But you need to read it like a product spec and don't just look at the size as the deciding parameter. And if I need to call out some of the most prominent models uh in the open source ecosystem and obviously the Torn Moore, but just to name a few. five names that keep coming up. Gemma from Google, great model to mention, comes in different sizes. Quen from Alibaba and the number here is even not up to date. We have a more updated model. It's a coding champion and it fits well on a one good GPU.

We have the Deep Seek that we all heard about. It has a very strong reasoning and it's quite good and capable. We have the family of models from Meta, the Lama Scout and others, and many models that were based on Lama that are quite good. And another one that I wanted to mention is the Hermes. It w it's a fine tuned model from NOOS Research and specifically it was built for agentic work and tool calling and some of the things that I mentioned are something that you need to look into.

So if you are running an agent harness locally, it might be an interesting one to look into.

Navigating the Open Source Model Ecosystem

Just maybe one more point on fine tuning, because I mentioned that a couple of times. This means that you take a general model and train it further for a specific purpose or with a specific data that you have. Hermes is exactly that. it took another model and improved it further in order to be good at a workflow.

B

Yeah, this list is gonna be changing all the time too. Obviously, you know, on AI daily brief I'm trying to keep track of the ones that sort of transcend from developers are playing around with to it's it's maybe more broadly worth knowing. GLM five point two is the one that came up this week that more and more people are talking about, although we're still only a couple of days into it and

a lot of the the latest Chinese open weight model tends to have this pattern of people get super excited about it in the first few days and then a few weeks later no one's talking about it. So who knows if it'll stick around, but there's that. And we're also seeing even from American companies A lot more experimentation with different model approaches. Cursors, Composer, is one that I bring up a lot on the show. So there's always changes on the on the model front.

Again, which is kind of why this is less a conversation about the exact models and more the principles of running and being able to run and switch in and out these different types of models for different types of goals.

A

Yeah. We have many others, which is exactly why in general one more place that I want you to pay attention to occasionally is a hugging face. Having face is like the app store for all the AI models and the open source models up out there. And if you haven't been there, I strongly recommend that you go and check it out because everything is there open source or for free.

and every major release will go there. Currently they have almost uh more than five hundred thousand models hosted. And when you go into a specific model page, because you heard about it on the podcast or or on X or wherever you're trying to stay up to date.

you wanna understand what's under the hood, you will first encounter what is called a model card, which is basically like a spec that tells you what it is good at, what it was trained on, limitations, and just ready to make sure that if you are contemplating using a certain model that it fits

what you need to do with that. You will also be able to see the license of the model. Typically we're looking to get a model that is either an Apache 2 or MIT. That means that you can use it even for commercial stuff however you want. Some have other restrictions, so pay attention to that if you're planning to use it for a product.

And lastly, you will see a file called GGUF. That's the compressed, ready to run versions that you can download to your own hardware to start deploying and running the model. There are different files for different compression levels, on more on that in a minute. And you need to pick the one that fits your own hardware.

Another thing that I want you to use Hugging Face for is that it's a great place to see the vibes, okay? Because you will see how many downloads, what the community is saying about stuff. And while I know that we're all sometimes falling trapped to the benchmarks, which is maybe a good start, but I know, Nathaniel, that you

repeatedly say that you don't believe in benchmarks, but what you can look into is the wisdom of the crowds and that's exactly what you get in Hugging Face. Because if you see that something has been downloaded a ton of times, that means that real people are finding real value and that's why they're downloading that.

They also mentioned trusted publishers, so you should use the ones that are official and approved. Be more wary of third party unknown publisher before you download anything to avoid any incidents. One more thing to say about Hugging Face, it's not just for models. There are applications and data set and

spaces like live demos that people upload that you can explore and there is tons of inspiration to draw from Hugging Face. So even that then I'm not affiliated by any way, but I just think that it's a great source for anybody who wants to understand the art of possible to go and traverse. And whenever you are considering a specific model, I want you to go beyond the model card and ask your AI tool to do fresh whis research on the community signals. It can be X Reddit, other places.

developer forms and so on, just to see what actual practitioners are saying,'cause often what's written in the model card and the vibes from the community are completely different and you need to be aware of them.

Model Quantization and Serving Software

So that's Hugging Face. I promise that I will say what do I mean by quantization. Basically the concept is how you fit the large model on a more practical hardware, that's something that unlocks basically the entire picture because when a model is published by the creators, it stores typically at maximum quality.

that means that it uses a ton of memory in order to preserve the full accuracy. So a twenty seven billion parameter model like this original quality needs fifty four gigabytes of memory and nobody has that in a consumer grade machine. So what the companies are doing and the model labs are doing in order to make it more accessible is they do quantization and that basically compresses the model into low lower precision.

And if you need the analogy, it's like an image compression. The raw photo has a very high quality, but a JPEG look nearly identical to the human eye, but it's a fraction of the file size. So that's the simplification of the concept. You can see Q four, Q eight, Q five or Q uh six, but

Q4 means that uh that's the standard default and it cuts the model to about thirty percent of the full size. For most tasks, if you see a model with the letter Q4, it's more than enough and it will run well on your hardware. If for some reason you need a higher quality you can go to the Q eight or in between.

stuff like that. And you will see that on the file name. So maybe you will see like a quen three dot seven, twenty-seven billion, that's the number of parameters, Q4, that's mean the quantization. And the name of the file. So that's how you read all of these queues and files and so on.

Uh enough about the models, let's talk about the serving layer. That's the layer that loads the model, because we already covered the hardware and the model file, but you need software that loads the model and makes it available. It's a little bit like a waiter standing between the kitchen and the customers. That's the purpose of this software. It sits in the background, it's ready to serve when it's being asked.

Two dominant offerings here. We have Olama, that's basically the engine. It's free, it's open source, it's the most popular way for you to serve models on your machine. It's very simple to install just one command and one command to run the actual model. And what's nice about it, it will automatically detect your own hardware and configure itself.

Critically, it exposes a standard interface that other tools can talk to, which that m makes it that anything designed for a cloud AI can point to your local llama instead. So it makes it even very easy to transfer between Tools that you're currently running AI on versus the cloud to run all of a sudden locally. And it has a ton, a ton, a ton of models in the library. So it supports almost anything that matters.

The other thing that you might want to consider installing is the LM studio. It's like the showroom. This is a desktop application with visual interfaces where you can browse models, see the hardware usage in real time, you can test two models side by side, and it's very good for understanding what different models can do before you commit. So these two work very well together, the LM studio to explore and evaluate and Olama to serve in production.

And if you need to serve multiple users at scales, there are additional tools for that, but that will typically also require more technical team. So I'm not going down the route of more sophisticated serving as

Orchestrating Actions with Agent Harnesses

Moving to the layer of the agent harness or what orchestrates the AI. We have a chat interface, that's one thing. An agent is another thing. And the difference is that a chat interface lets you talk to the model, but an agent harness lets the model take actions. It can read the files or search the web. It can call the verse APIs or MCPs, send messages, run scheduled tasks and all the fun stuff that we love about our verse Genty capabilities.

If you want to go down a chat interface for your local AI, one very simple path and very useful way to do that is to use open web UI. again, very popular, self hosted web application that look and feels very much like a Chat GPT. You can point it at your local lama and then your team has a very private

chat GPT that runs entirely on your hardware. It enables multi user, it can document the upload. It has a search built in, and so it's a very good alternative if you want to create local IDs primarily for chatting. However, if you do wanna go all the way to an Agentic Harness that is hosted locally,

Obviously there are a ton of options and the list is getting longer and longer and longer over time, but two things to note one is obviously OpenClaw and the other is Hermes Agent. Those are the most dominant in the open source of agentic harnesses. Both of them will run on your own hardware. Both of them support local models to OLAMA. They do tool calling, persistent memory, as well as integrating with various messaging platforms. The difference is the philosophy.

OpenClaw gives tighter manual control because you can create the skills and define the rules and create the context. Helmet leans into autonomy. It is writing its own skills from experience. It does a lot of self evaluation to improve for you.

and it has a compound a capability over time. Because both of them are fully open source and you can install in minutes, they are both great things to explore if you haven't already. But I will say on Hermes that it at least in my opinion, becoming more and more predominant option that you sh if you haven't looked into is something to look into this June or over the summer.

And I think that if you do go and install one of them or one of the alternative agentic harnesses locally and you do all the other layers that we just talked about, all of a sudden you are in full control and running everything locally without paying anything beyond electricity.

Uh, uh one more thing to say with regard to coding specifically, is that even if you are used to working with a different coding tool, and most of them can be pointed also to local models and not too many people are doing that, but All of the major players are now integrating very well into Olama in order to run stuff locally. There is one caveat that some of the features within these tools stay cloud only, regardless.

So for example, autocomplete in some cases will not work if you're uh running on a local model and some other cases automations are run on the cloud and so on. But even if you are primarily using these tools and you don't want to go to Hermes or OpenClaw, you can work also with local models and reduce the costs and the dependency on the cloud model providers.

The Full Local AI Stack and Strategic Trade-offs

Last layer, what you actually interact with, this is the top of the stack. The one thing that you will touch day to day, it can be an open web UI chat window that you give your team. It could be the Hermes desktop that was just released a few days ago or a few weeks ago. It can be something that you interact with through Slack or Discord or wherever you're conversing with your agent.

And the point is that once the lower layers are all working, this layer is completely flexible. You can build anything on top of a locally served model that you could build on top of any cloud API. So that's Not where you should spend a lot of your energy. So I wanna bring us home. I know it was a lot. There is an honest trade-off here. Let's start with what you gain. If you are going local with AI.

you get a lot of data independency. Nothing leaves your network. You have availability. You cannot be shut off by export control, vendor decisions or internet outages. you have cost predictability because after the hardware investment, the marginal cost per query is almost zero, except electricity.

you have uh learning because running model locally teaches your your organization how AI actually works under the hood and many people who interact with the models directly all of a sudden have a ton of aha moments from the process. However you do take on a lot of responsibility and effort. Hardware, if you haven't had it lying around, is something that you'll need to buy.

Maintenance. When something breaks, it's on you. No one will fix your O Lama that is not working or your Open Cloud is not working or whatever you decided to install locally. These tools have a ton of updates. So every time there is a better model or a better software, it's on you to update and make sure that it's still running smoothly. The security integration, if there are new

things that are happening, it's on you to orchestrate or uh install them. And lastly, you might realize that you went all in on local AI in order to save a ton of tokens, but you are uh having a a few people that are working around the clock to maintain your local AI and all in all the cost of tokens versus the cost of humans

are not comparable. So that's something to pay attention to. And also the fact that security is not guaranteed if you don't know what you're doing with local AI. Especially if you are connected to the internet. That's that. I think that if you need to start somewhere, one good machine, one useful workflow, prove the quality, secure it and then decide whether to scale. So if I'm trying to be even more concrete I think that I gave you a ton of vocabulary, a mental model and the landscape.

You understand hopefully the five layers and what decisions live in each layer and so on. But what you can do immediately after depends on who you are. So if you are an executive, maybe you have enough food for thought. to ask informed questions of your technical team. What's our position on local models? Have we evaluated our vendor dependency? What would we do if our primary eye provider becomes unavailable or overly expensive? So these are some of the questions that you should be able to ask.

If you are a practitioner, you can definitely install Olama this week if you haven't had it or experiment with yet another latest and greatest open source model to see how well it serves your own workflows, to see how it feels and so on.

And also I believe that the hands-on experience is worth more than any amount of reading that you can do. And of course, if you are in a regulated industry, that's something to definitely contemplate more and more with your compliance and infrastructure team to see uh what's the right stance for you. And the core message is from my perspective is not that everyone must run AI locally. It's that the landscape has shifted enough.

on cost, on control, on access, that every organization making serious AI decisions need an informed position, at the very minimum and a very deep conversation on that. And even if the position is not for us right now, it should be a deliberate choice and not an assumption that you never go back and re examine. So it's not for everyone, but understanding is for everyone. And one last thing before I go.

I wanted to mention that we just launched the Executive Agent Leadership Program. It is the evolution of the beloved Enterprise Claw program. It was rebuilt for everything that's changed in the last few weeks and months. The token economy, the local deployments, the security, the vendor independence, all of that. It's a six week cohort.

For leaders who want to build the I agents hands on and then design how the organisation operates in the agent era. The first revised cohort will start june twenty ninth. And if this resonated with you or you wanna spend some time with others going through the same process and have fun with us, I'll be more than happy to have you there.

🎵 Music

This transcript was generated by Metacast using AI and may contain inaccuracies. Learn more about transcripts.
For the best experience, listen in Metacast app for iOS or Android