We often talk about AI like it's just one huge thing, right? One big brain. But there's actually a massive difference, a really functional difference, between, say, an AI that's read everything and can only talk about a solution versus one that can actually do the task for you. Right. It's pure theory versus action. It's like the shift from asking a large language model, an LLM, how to book a cheap flight to telling a language action model, a LAM, hey, book the cheapest
flight, email me the receipt. Beat. That's the jump from talking to actually doing. Welcome to the deep dive. Yeah. And look, if you're starting to feel a bit lost in all the acronyms flying around, you're definitely not alone. Our mission today really is to cut through that noise, give you a kind of shortcut to understanding the different specializations. We need the AI toolbox. Exactly. Think about it, right? You wouldn't use a hammer to turn a screw. Doesn't matter how great the
hammer is. We need to know the right tool for the job. So today we're diving into eight key AI model types. We'll look at what they do, their main uses, and yeah, the important watch-outs for each. Okay. Let's start with the foundation. Probably the one most people know. The LLM. Okay, let's unpack this. The LLM, the Large Language Model. This is... like the definitive model.
It's the most famous one out there. Think of it like that super smart friend who's basically read every book, every website, you know, ChatGPT, Gemini, Claude, those ones. Yeah, and they're large because the amount of text they train on is just astronomical. They absorb all these patterns of language, but how they work, it's actually pretty simple at its core. They're mostly just next-word guessers. They don't actually know truth. Right. They break down sentences into
tokens, sort of like pieces of words. Exactly. Pieces of words or even punctuation. And then they just do the math, calculating the most likely next token, the next piece, based on, well, billions of examples they've seen. If you start with, I need to put this on the..., they'll guess shelf or table or whatever fits statistically. Yeah. And this guessing ability makes them really good for text stuff. Writing content, sure. Blog posts, emails, summarizing huge articles down to like
three bullet points. Oh, yeah. Summarizing is huge, and helping programmers, right? Explaining tricky code, debugging. Yeah. Even translation or customer service chatbots. But that's also where prompting gets tricky. Yeah. And honestly, I still wrestle with getting prompts just right myself sometimes. It's not always easy. You really see the difference between just asking vaguely versus being really specific? Totally. Like, a bad prompt is just, write an email to work
from home, that's it. Vague. Super vague. A good prompt gives the goal, maybe the specific days you want off, the reasons why, like, needing deep work focus, and even the tone. Make it polite, professional, but ask firmly. You've got to guide the guesser. And this leads to the big watch out, right? The whole hallucination problem. Because they only guess what looks right statistically. They can just confidently make stuff up. Fake studies, incorrect historical facts, citations
that lead nowhere. Pure invention, but it sounds plausible. So if an LLM is fundamentally a great guesser, what's the single most critical thing a user has to remember about its output? You absolutely must check important facts. The AI only guesses what looks right. It doesn't verify. Good. OK, so LLMs are great text guessers, but they can be slow and, well, inventive. Yeah. What if we need pictures and we need them fast? Ah, speed. That takes us to the LCM, right? Exactly.
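To make the next-word guessing idea from the LLM discussion concrete, here is a minimal sketch of a toy predictor that only counts which word follows which. It is not how a real LLM is built (those use neural networks over sub-word tokens); the tiny corpus and the function names `train_bigram` and `predict_next` are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the training sentences."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    counts = follows.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical three-sentence "training set".
corpus = [
    "put this on the shelf",
    "put that on the shelf",
    "put it on the table",
]
model = train_bigram(corpus)
guess = predict_next(model, "the")
print(guess)
```

The toy model picks the follower it has seen most often after "the". It has no notion of whether the result is true, only of what is statistically likely, which is the root of the hallucination problem described above.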
The latent consistency model, LCM. Think of it as the visual cousin to the LLM, but its whole reason for being is speed in making images. Much faster than older models. Yeah. If you've ever waited like 30 seconds or more for an AI image, that old model, like Stable Diffusion, was working slowly, taking maybe 50, even 100 steps to refine the image from noise, like a really careful painter adding tiny details. Right. The LCM, though, is more like a master painter who's done this
a million times. It learns these underlying patterns, the latent consistency, and figures out how to jump huge steps. Goes from step one maybe straight to step 25. Whoa. So it can generate really decent images in just like two to eight steps total. Two to eight. Compared to 50. That's a huge difference. Massive difference. And that speed opens up doors for anything real time. Think AR, VR apps. The scene needs to update instantly when you turn your head, right? Right,
yeah. No lag. Or making pictures right there on your phone without needing to send data to the cloud and wait. Fast design tools, improving video call quality in real time. OK, so speed's the big win. What's the catch? Well, the trade-off for skipping all those steps is sometimes the images might be a bit less detailed, maybe a little too smooth compared to the slower methods. But for many uses, that near-instant result
is what matters most. So why is that master painter approach, that speed, so crucial for something like a VR headset? Because everything has to update instantly as you move. Lag breaks the whole illusion of immersion. Makes sense. Okay, so we have LLMs for talking, LCMs for fast painting, but they're both creating things. What if we need the AI to actually, you know, do a task? A multi-step task. Ah, now we get to the LAM.
The LAM. Language Action Model. This is that critical shift you mentioned from talking to doing. Exactly. If the LLM is the smart friend who just talks, the LAM is like the project manager. It takes action. How does it do that? Is it just a better LLM? Not quite. It's more like a system. It starts with an LLM. Yeah, that's the brain for understanding your request. But then it adds memory, like a notebook, and a planner to map out the steps. And the really crucial part, the ability
to use tools. This means it can connect to other things, your email, a web browser, calendars, using APIs. Ah, APIs. That's the key difference then, the connection to other services. That's the game changer. So an LLM tells you how to book a flight. A LAM connects to, say, Expedia's API, searches flights based on your criteria, finds the cheapest one, books it, and then asks you to confirm. Wow. OK. That sounds like the
future of automation, really. And it is. Think AI agents. These things can handle complex sequences. Like, you could tell it: find 10 potential customers in the software industry in California, draft a personalized intro email based on our new product launch, and show me the drafts. A lot more sophisticated
than just asking a question. Definitely. Advanced customer support, too, actually processing a refund, not just answering questions about the policy. Or just digital assistants that can actually turn on your lights or set alarms because they can connect to those systems. So what core function lets a LAM move beyond just conversation and actually access, say, your calendar or email? It's that ability to connect directly to other tools and services using APIs. Those are the
hands and feet. Got it. But building one giant LAM that can do everything sounds incredibly complex and frankly expensive to run. It absolutely is. Which leads us perfectly to the next model type, which is all about efficiency. The MoE. MoE, mixture of experts. Yeah, mixture of experts. Now, this isn't one single giant model. It's actually a whole system. Think of lots of smaller, specialized models, the experts. And they're all managed by another AI called the router.
OK, so instead of one generalist, it's like having a team of specialists. Exactly, like asking a specific group of pros. The router looks at your request, your prompt, and figures out, OK, this is about coding, and sends it only to the coding expert. Or this needs history knowledge, so it activates the history expert. Usually just one or two experts get activated for any given task. Right, so you're not running the whole giant brain all the time. Precisely. And here's the
really cool part, the efficiency and size benefit. You can build a massive model overall, let's say, a trillion parameters' worth of knowledge. But because the router only turns on maybe 5% or 10% of those experts for any single query, you're only paying the computational cost of running a much smaller model, maybe just 100 billion parameters in that example. Wait, hold on. So you get the knowledge of a potentially enormous model, but the running cost of something
much smaller. That's the magic. Wow. I mean, that just fundamentally changes the economics of building top-tier AI, doesn't it? That scaling efficiency is incredible. It's huge. That's why this architecture, MoE, is believed to be used in the really big, powerful models like Mixtral and probably GPT-4, too. It's also great for companies wanting to personalize. They can add their own private experts, trained just on their internal documents or specific industry
regulations. So you could ask it to explain, I don't know, inflation using general knowledge. Yeah. And write Python code to calculate CPI using a coding expert, all in one go. Exactly. Dual knowledge handled efficiently. OK, the efficiency is amazing. But beyond cost, what's the core risk if that router AI messes up its job in an MoE system? If the router picks the wrong expert, the whole answer will likely be poor quality or just plain wrong. Bad routing means bad results.
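The router idea and the trillion-versus-100-billion arithmetic from the MoE discussion can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular model implements it: the expert names and router scores are invented, and production MoE models typically route individual tokens inside the network rather than whole prompts by topic. The helper names `route` and `active_parameters` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k=2):
    """Keep only the top_k experts and renormalize their weights."""
    names = list(router_scores)
    probs = softmax([router_scores[n] for n in names])
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    kept = ranked[:top_k]
    kept_total = sum(weight for _, weight in kept)
    return {name: weight / kept_total for name, weight in kept}


def active_parameters(total_params, active_percent):
    """Parameters actually used per query when only a share of experts runs."""
    return total_params * active_percent // 100


# Hypothetical router scores for a prompt that is mostly about code.
scores = {"coding": 2.0, "history": 0.5, "math": 1.0, "writing": -1.0}
chosen = route(scores, top_k=2)
print(chosen)

# The example from the conversation: 10% of a trillion-parameter model.
active = active_parameters(1_000_000_000_000, 10)
print(active)
```

Only the selected experts do any work for the query, which is where the compute saving comes from. If the scores were wrong and a poorly matched expert ranked highest, the same code would confidently send the request to it, which is the routing risk mentioned above.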
Makes sense. Garbage in, garbage out. Or rather, wrong expert in, garbage out. Pretty much. And training them is complex, plus you need enough memory, VRAM, to hold all those experts, even if most are idle. OK, so MoEs give us efficient expertise. But all these models so far, they've been dealing with text or maybe creating images based on text. What if the AI needs to actually see the world? Ah, giving the AI eyes. That must be the VLM, the vision language model. You got it.
VLM. It can see and talk. It understands both pictures and text at the same time. LLMs, fundamentally, were blind. VLMs have eyes. How does that work internally? Is it merging two different kinds of models? Kind of, yeah. It uses an image encoder. Think of that as the eyes, which looks at a picture and translates it into a sort of numerical description, like orange cat sitting on a green chair. OK. A number string
that represents the image. Right. And then that numerical string gets fed into a language model, the mouth, which can then understand it and talk about it, answer questions about it. And that opens up some really intuitive uses. The classic example is taking a picture of your fridge and asking, OK, what can I make for dinner with this stuff? That's the famous one. But it goes way
beyond that. Think about visual search. Take a picture of some cool shoes you see someone wearing and ask, where can I buy these exact ones? Ooh, dangerous for my wallet. Huh. Tell me about it. Or analyzing video content. And, really powerfully, tools for the visually impaired. Imagine it describing the world around them, reading signs, maybe even recognizing faces or expressions. That's incredible, a truly helpful application. But I guess the same watch-out applies
here as with LLMs. Can VLMs hallucinate about images? Absolutely. They can misinterpret an image, describe an object that isn't there, or get an action wrong. And of course, privacy becomes a bigger concern when you're uploading images, especially of people or inside your home. Right. So besides the fridge example, what's a really powerful, maybe less obvious application of a VLM? I think describing the world in real time for visually impaired people is one of the most
profound uses. Yeah, definitely. OK, VLMs are powerful. They see and talk. But they sound like they need a lot of computing power, just like the big LLMs. What if you need AI on a smaller device, like your phone, and privacy is paramount? Great question. That's where the SLM comes in, the small language model. While everyone's chasing bigger and bigger, these SLMs, like Microsoft's Phi-3, are quietly becoming super important for
on-device AI. They're designed specifically to run efficiently on phones, laptops, maybe even smart appliances. Small. So fewer parameters, I guess. How small are we talking? Much fewer, maybe just one to three billion parameters compared to hundreds of billions or even over a trillion for the big LLMs. Okay, one to three billion versus a trillion. How do they stay capable then? How aren't they just dumb? The secret sauce is the training data. Instead of just scraping the
entire messy internet like LLMs often do. Right. Researchers focus on quality over quantity. They use what they call textbook-quality data, highly curated information, specific examples of reasoning, logic problems. It's like feeding it a very well-structured education instead of just letting it browse the web randomly. Textbook quality. Yeah. That's interesting. So why is the training
data for an SLM described that way? Because they prioritize high-quality curated knowledge and reasoning examples over just sheer volume from the internet. Quality over quantity. But doesn't that curated approach risk making them less robust? Like, they only know what's in the textbook. They can't handle the messy, unpredictable real world as well. Are they trading robustness for that efficiency? That is the trade-off, yeah. They will know less about really new or uncommon
topics. Their conversational memory might be shorter too, but the payoff is huge. True on-device AI. Meaning it runs right there on your phone. Exactly. Your phone's assistant can work instantly without needing a cloud connection. That means better privacy. Your data doesn't have to leave your device. Think about instant code completion in your programming editor or voice commands in your car that just work, even if you have no signal. Smart microwaves, maybe.
OK, so SLMs are great for efficient, private, on-device tasks. But they're still fundamentally about generating text or understanding commands, like LLMs. What if the main goal isn't generation at all, but really deep understanding of language meaning? For deep understanding, we need to look back at a slightly older but foundational model type, the MLM, the Masked Language Model. Masked Language Model. OK, how is that different from an LLM guessing the next word? So LLMs play the predict
the next word game. MLMs, like Google's famous BERT model, play fill in the blank. Fill in the blank? Yeah. During training, they take a sentence and just hide or mask about 15% of the words. Then they force the model to guess the missing words. Crucially, to do that well, the model has to look at the words before the blank and the words after the blank. Ah, so it needs context from both directions. Bidirectional context.
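The fill-in-the-blank setup described above can be sketched as a small data-preparation step. This is a minimal illustration under stated assumptions: the `[MASK]` placeholder string, the function name `mask_tokens`, and the example sentence are chosen here for demonstration, and real training pipelines such as BERT's work on sub-word tokens and do not always substitute the mask token.

```python
import random

MASK = "[MASK]"  # placeholder string assumed for this sketch


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens and remember the hidden originals."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # what the model must recover
        masked[pos] = MASK         # what the model actually sees
    return masked, labels


sentence = (
    "the model has to look at the words before the blank "
    "and the words after the blank to guess well"
)
tokens = sentence.split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

Each hidden word has visible neighbors on both sides, so a model trained to recover it has to use the context before and after the blank, unlike a next-word guesser that only sees what came earlier.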
Exactly. That forces it to develop a much deeper understanding of the sentence's actual meaning and structure, not just predicting the most likely next word in sequence. OK, so where is that deep understanding most useful if it's not writing poems? Right. They're analysis models, not creative writers. Their big strength is understanding meaning, which is critical for search engines.
When you search something like, best cameras A to Z, the MLM helps the engine understand A to Z means a complete guide, not literally the letters. Ah, understanding the intent behind the
search. Precisely. They're also great for sentiment analysis, quickly sorting thousands of customer reviews into positive, negative, neutral, and for something called Named Entity Recognition, or NER. That's pulling out specific pieces of info from text, like finding all the company names or dollar amounts mentioned in a long report. So since MLMs aren't really creative writers, what's their crucial function for something like
a search engine? They understand the deep context and the actual meaning behind your search query, not just the keywords. Got it. Pure contextual depth. OK, one more model to go. We've covered text, images, actions, efficiency, seeing, small size, deep understanding. What's left? Precision in vision, specifically identifying exact boundaries. We need the SAM, the Segment Anything Model. Segment Anything. SAM. Okay, this sounds specialized.
Highly specialized. It's a computer vision model, but its focus isn't on telling you what an object is. Its job is to draw an incredibly precise outline around everything it sees in a picture. It segments objects right down to the individual pixel. Down to the pixel. Wow, how does it learn to do that for... well, anything? It was trained on a massive dataset, something like 11 million images, containing over a billion segmentation
masks, those precise outlines. So it learned the general concept of what is an object and what is its boundary, even if it doesn't know the object's name. You usually activate it just by clicking on an object or drawing a rough box around it. So it knows this boundary belongs to object one, but not necessarily that object one is a dog or a car. Exactly. It just finds the edges. Okay. Where do you use that kind of pixel-perfect outlining? It's actually a foundational
tool for lots of applications. Think photo editing, that remove background feature. SAM is likely powering the precise selection. Or critically, medical imaging. Imagine a surgeon analyzing an MRI. SAM can draw the exact outline around a tumor or an organ, which allows for really accurate measurements. Right. Precision is key there. Absolutely. And it's vital for robotics and self-driving cars too. They need to know the exact edges of objects in the real world
to navigate and interact safely. So why would a surgeon looking at an MRI need SAM, the segmenter, maybe more than a VLM, the describer? SAM provides those precise pixel-level outlines needed to accurately measure biological structures, not just a general description. Measurement needs precision. Makes perfect sense. Is there a watch-out for SAM? The main thing is just remembering what it is. It's brilliant at finding outlines,
but it won't tell you what the object is. It often needs to be paired with other models, maybe a VLM, to get both the precise outline and the identification. Okay, wow. That's eight different types. Quite the toolbox. It really is. So let's try and pull this all together. What's the big idea here for you, the listener? I think the core takeaway is just realizing that AI isn't
one thing, it's this specialized toolbox. And now hopefully knowing the model name and the acronym actually tells you something concrete about its core function. You know which tool to think about for which job. Yeah, let's do a quick recap. If you need writing help, content generation, chatbots, you're thinking LLM. Right, need super fast images, especially for real-time stuff like AR or on your phone, that's
the LCM. Want to automate a complex task? Have the AI actually do things by connecting to other tools. That's the LAM, the action model. Need maximum efficiency and the ability to build huge models cost-effectively using specialist parts. That's the MoE architecture. Need an AI that can see and understand images and talk about them. That's the VLM. Want AI that runs directly on your device, prioritizing privacy and efficiency over knowing absolutely everything. That's the
SLM, the small one. Need deep understanding of language meaning for search or analysis, not generation. That's the MLM. And finally, need to draw those perfect pixel-level outlines around objects in images, like for background removal or medical scans. That's the SAM. Exactly. So now you have this framework, right? We really hope you can use this knowledge next time you hear about a new AI tool. Don't just ask, is
it AI? Ask yourself, okay, based on what it does, is this primarily analyzing things, creating things, or acting on things? Which tool from the toolbox does it sound like? Yeah, that's a great way to approach it. And at the pace things are moving, these tools are starting to combine in really interesting ways. We've covered models that talk, models that act, models that see.
So here's a thought to leave you with. If the next big step combines that incredible efficiency of the MoE's specialized experts with the LAM's ability to connect to tools and automate real-world tasks, well, that creates the potential for truly autonomous, hyper-specialized AI agents. So thinking about that future, what's the single most complex task, maybe in your personal life, maybe in your business, that you would finally trust an autonomous AI agent to handle completely?
Something to chew on. Definitely something to think about. We'll see you next time on the Deep Dive.
