Learning LangChain: Building AI and LLM Applications with LangChain and LangGraph

Speaker 1

00:00

Welcome to the Deep Dive. We're the show that gets you quickly and thoroughly well informed on the topics that really matter. And today we are plunging headfirst into something that, honestly, it still feels a bit like science fiction, but it's very much real now, the world of artificial intelligence, specifically large language models or LMS. I mean, think about it. I remember when chat GPT just exploded onto the scene exactly.

00:24

It wasn't just some tech news item. It went global, pulling what over one hundred million users in just two months. It really felt like these things could just conjure up text, answers, anything right out of thin air.

Speaker 2

00:35

It did feel a bit magical, didnety totally.

Speaker 1

00:38

So our mission with this Deep Dive is really to give you a shortcut into learning liang chain, building AI and LM applications. We want to go beyond the buzzwords, you know, get into the core concepts, the actual practical strategies for building powerful AI apps with this thing called

00:54

lang chain. Our goal isn't just to tell you what's happening in AI, but really show you why it matters, point out the aha moments, and crucially equip you the listener with the knowledge to maybe even apply it yourself.

Speaker 2

01:06

And to guide us through this pretty intricate, in let's be honest, rapidly evolving landscape. We're drawing directly from a really fantastic source, the book Learning lang Chain by myo Ocean and Nunocampos. And these aren't just you know, academics writing about the fields from Afar. Mayo was actually an early developer, an advocate for the lang Chain open source library itself, a real pioneer in that whole chat with data movement, and Nun is a founding software engineer at

01:32

lang Chain. So this book, it isn't just theory, it's packed with super clear explanations, actionable techniques. Industry experts are calling it the go to resource for building production ready generative AI and agents.

Speaker 1

01:45

That's fantastic, a really solid foundation. Then, So the big question we're tackling today is this, how can developers, maybe even those who don't have a deep machine learning background, how can they harness this incredible power of lllms to build genuinely production ready generative AI applications and these intelligent agents. Right, we're going to unpack the essential tools, the patterns, the thinking that transforms these powerful models from cool tech demos

02:11

into practical, usable solutions. So, okay, let's start right at the beginning. These lms, they seem almost magical. How exactly do they know the answers they give and what precisely is a token in their world?

Speaker 3

02:22

Yeah, good place to start.

Speaker 2

02:24

So at their heart, large language models are generative models specifically built for text. They're trained on just vast amounts of text, think everything publicly available, books, articles, forums, code, even cleaned up video transcripts, an immense data set. Their core function isn't really magic, though it looks like it. It's incredibly sophisticated prediction. They basically predict the most probable next word or token in a sequence based on all

02:51

the patterns they've learned. So if you feed it the capital of England is, it's learned from countless examples that the highest probability next word is.

Speaker 1

02:58

London, ok N, matching on a massive scale. But what's a token? Is it just a word?

Speaker 3

03:04

Not always?

Speaker 2

03:05

Yeah, a token is the fundamental unit the LM processes. Often it's a word, but sometimes longer or less common words get broken down, like dearest might become two tokens. Done and arrest on average. You know, for common English text, one token is roughly four characters. And the driving engine behind all this predictive power is something called the transformer neural network architecture.

Speaker 1

03:26

Right, the transformer architecture.

Speaker 2

03:27

Heard a lot about that, Yeah, it's key. Think of it as being really good at understanding context. It relates every word in a sentence to every other word, building this rich understanding of meaning and relationships. That's how they handle complex grammar and nuance, not just simple word prediction.

Speaker 1

03:43

And it was that understanding, or perhaps a limitation of that understanding, that led to lang chain, Right, I read Harrison Chase started the open source library back in October twenty twenty two. What was the key realization.

Speaker 2

03:54

Exactly, the real breakthrough, The thing that sparked lang chain was this insight in LM, as brilliant as it is with language, could totally fumble basic arithmetic, Like ask it to calculate one two hundred and thirty four module one twenty three on its own. It might just guess or get it wrong.

Speaker 1

04:09

Huh weird. Right, It can write poetry but not do simple math.

Speaker 3

04:13

It is kind of paradoxical.

Speaker 2

04:14

Yeah, And that raised this crucial question, how do you give this powerful language model capabilities it just doesn't have intrinsically. Harrison Chase's pivotal realization was, and this is key, the most interesting LLM applications needed to use lms together with

04:30

other sources of computation or knowledge. Laying Chain was essentially built to provide the building blocks, the interfaces, and the tooling to reliably combine llms with other things, like giving it the ability to call out to a calculator when it sees a math problem.

Speaker 1

04:46

That makes so much sense giving it tools. So, if the LLM isn't a calculator or a database itself, how do we even talk to it effectively? How do we guide it to do what we need?

Speaker 2

04:55

Yeah, that's all about prompting. The prompt is basically the instructions and input you provide to the model, and crucially, how you phrase that prompt significantly influences the model's output. There's also this fascinating control called temperature. You can think of it like a creativity dial. Lower temperature makes the

05:11

output more focused, more deterministic, predictable. Higher temperature it lets the model take more risks, get more creative, maybe even a bit random, useful for different tasks.

Speaker 1

05:22

Okay, so prompting is key, temperature, controls, creativity. What are the main ways we prompt these things?

Speaker 2

05:28

There are several core techniques, each kind of addressing a different need. The absolute simplest is zero shot prompting. Just give it a direct instruction like that example earlier, how Old was the thirtiest president of the United States when his wife's mother died. It's straightforward for basic questions, but you know it can often lead to inaccuracies or just making things up at the info is and baked into his training data.

Speaker 1

05:50

Hallucinations, right, the dreaded hallucinations joys lely.

Speaker 2

05:53

Then you've got chain of thought or co T. This is where you literally tell the model think step by step. It's intriguing because this often dramatically improves performance on reasoning tasks. It forces the LM to sort of show its work, break down the problem like we learned in school math pretty much. But here's a funny quirk. The book notes that sometimes for tasks where humans also tend to overthink and make mistakes, COKE can actually make the LLLM perform worse.

06:21

A good reminder they aren't just scaled up human brains.

Speaker 1

06:24

Huh interesting?

Speaker 3

06:25

What else next?

Speaker 1

06:27

Up?

Speaker 2

06:27

And this is fundamental for making lllms useful with your data is retrieval augmented generation or OURG. This means providing relevant pieces of text also known as context, directly within the prompt. So if you want the LLLM to know about your company's latest internal reporter today's news, you use OURG to feed at that specific information alongside the question.

Speaker 1

06:49

Ah okay, so ori is how you give it knowledge it wasn't trained on precisely.

Speaker 2

06:53

And then for making lllms do things, there's tool calling. This lets you give the LM a list of external functions or calculator. Example, maybe a search engine, API, a weather service, whatever. You train it to recognize when it needs a tool and to signal it's intent.

Speaker 1

07:05

To use it, so it can decide I need to search for this or I need to calculate that exactly.

Speaker 2

07:10

And often the most powerful applications combine these techniques. Maybe use chain of thought to plan or ready to fetch relevant data, and then tool calling to perform a specific action or calculation based on that data. Oh and one more few shot prompting. This is where you give the LM just a small number of examples, like here's a question,

07:28

here's the right kind of answer. It helps it learn new tasks or formats on the fly without full retraining, like showing it a few examples to get the hang of something new.

Speaker 1

07:37

Wow, Okay, that's a whole toolkit for interacting with them, and lang chain helps manage all of this.

Speaker 3

07:41

Yeah, that's the beauty of it.

Speaker 2

07:43

Lang chain was one of the earliest open source libraries to provide these core LM and prompting building blocks, and it's taken off massively. The community is huge, over seventy two thousand members, twenty eight million downloads a month.

Speaker 3

07:55

It's staggering.

Speaker 2

07:56

What lang chain does is offer these simple abstractions for all those tech niques we just discussed, zero shot solo t rrag tool calling fushot plus. It integrates seamlessly with all the major LLLM providers open Ai, Anthropic, Google, and popular open source models like Lama. This common interface is a really big deal. It means you can easily experiment, swap out different llms, and crucially avoid being locked into a single provider.

Speaker 3

08:22

That gives you a huge flexibility.

Speaker 1

08:24

That flexibility sounds key Okay, so this brings us to a really crucial challenge, especially for us the builders. If these alans are brilliant, but they fundamentally can't know everything. They weren't trained on my company's latest financials or yesterday's news, how do we stop them from just making stuff up, from hallucinating when we ask about that information.

Speaker 3

08:41

You've nailed the core problem.

Speaker 2

08:43

Just relying on the LM's pre train knowledge often isn't enough for real world apps, precisely because, like you said, they lack private data stuff not on the public Internet, and they lack knowledge of current events because of their knowledge cutoff date.

Speaker 3

08:57

When they don't have.

Speaker 2

08:58

The information they need, they tend to hallucinate, generating plausible sounding but incorrect or even totally fabricated answers.

Speaker 1

09:06

Which can be dangerous in a real application.

Speaker 2

09:08

Absolutely, and that's exactly where retrieval augmented generation are AG becomes essential. It's basically your defense mechanism against hallucination by providing the necessary context RAG.

Speaker 1

09:19

So walk us through how it actually works. How does it ground the LM in specific maybe private or very current information.

Speaker 2

09:26

Okay, So, our RAG is specifically designed to enhance the accuracy of outputs generated by llms by providing context from external sources. Meta AI actually coined the term, and their research found that RAG makes models more factual and specific. The whole process generally involves four key steps for getting your documents ready, sometimes called ingestion or indexing. First, you

09:50

extract the text from whatever documents you have. Lang chain has helpers for this, like text loater for plain text files, or pd sloader for PDFs, and many others. Simple enough, the text out step one. Step two, you split that text into manageable chunks. This is really important because, as we mentioned, LM's have a context window, a limit on how much text they can look at in one go.

10:11

Can't just feeding up five hundred page document, right, It's too big Exactly so, tools like lang teen's where cursive character text splitter, cleverly break the text down. It tries to split along natural boundaries first like paragraphs and sentences, then words. To keep things coherent, you can configure the chunk size and also add some chunk overlap, meaning consecutive chunks share a bit of text.

Speaker 1

10:30

Ah overlap helps maintain context across the breaks.

Speaker 2

10:34

Precisely, It's like making sure the end of one chapter flows smoothly into the start of the next third step. You convert these text chunks into numbers, specifically into embeddings.

Speaker 1

10:44

Embeddings. Okay, this sounds like where the magic happens.

Speaker 3

10:47

It kind of is.

Speaker 2

10:48

Think of an embedding as a long list of numbers a vector that represents the meaning of that text chunk. Now it's a lossy representation. You can't perfectly reconstruct the original words just from the numbers, like you can't get perfect ced quality back.

Speaker 3

11:02

From an MP three.

Speaker 1

11:03

But it captures the essence exactly.

Speaker 2

11:05

It captures the semantic essence, and this allows for math on words. This is a huge lead from older systems that just did keyword searching LM based embeddings or semantic embeddings understand meaning.

Speaker 1

11:16

Okay, this is fascinating. How do we teach a computer the difference between say, lion, pet and dog. I get the related but how does the computer quantify that? If we connect this to the bigger picture, this cosign similarity idea quantifying how close pet and dog are numerically versus lion. That seems powerful, But how does that number crunching actually enable better search?

Speaker 2

11:42

That's a fantastic question. Really gets to the heart of semantic search. Imagine all these words, or rather the concepts they represent, existing is points in some vast high dimensional space. The embedding vectors for pet and dog would literally be mapped closer together in this space. Then either would be to lion because they meanings are more related. Cosine similarity is just the mathematical tool we use to measure the

12:04

angle or the closeness between these vectors. It gives a score usually between natus, one opposite meaning and one identical meaning. So pet and dog would have a cosine similarity score much closer to one than pet and lion.

Speaker 1

12:16

Ah. Okay, so similarity means closer in this meaning space exactly.

Speaker 2

12:21

And this ability to turn text into embeddings that capture deep meaning lets us search based on concepts, not just keywords. You could search for happy house animal and the system could find documents talking about joyful puppies or content cats. Even if the exact words happy house or animal aren't there, it understands the underlying meaning is similar.

Speaker 1

12:40

That's incredibly powerful search. Okay, so we've extracted, split, and embedded the text. What's the final step.

Speaker 2

12:47

The fourth step is to store these embeddings in a vector store. Think of this as a specialized database designed to store these numerical vectors and perform those complex similarity calculations like cosine similarity really efficiently and quickly. There are lots of options, open source ones like pg vector, an extension for post cresscool, dedicated databases like wev eight or

13:08

pine Cone, or cloud services. When you, the user ask a question, your question is also converted into an embedding vector. The vector store then rapidly finds the stored embeddings and their corresponding text chunks that are most similar mathematically to your query embedding. Those relevant chunks are then retrieved and passed to the LLM along with your original.

Speaker 1

13:29

Question, giving it the specific context it needs to answer accurate precisely.

Speaker 2

13:32

And lane Chain also provides tools like its indexing API and record manager to help keep this vector store up to date. As your source documents change, you can efficiently track those changes, add new embeddings, remove old ones, and avoid costly reprocessing of unchanged documents, keeping the knowledge current.

Speaker 1

13:49

Okay, that makes sense. We've got the basic arget pipeline down, index the data, retrieve relevant chunks, give them to the LLM. But I imagine building something truly production ready involves more nuance. What are the common challenges and how do we refine that search for knowledge to be even more accurate and robust?

Speaker 2

14:08

Yeah, moving from a basic a RAGI demo to production definitely introduces complexity. Users ask questions in all sorts of ways, sometimes ambiguously. Your data might live in multiple different places, and you often need to translate that natural language question into something more structured for retrieval. So we need more advanced strategies.

Speaker 1

14:26

Okay, what are they?

Speaker 2

14:26

The book highlights three main categories of strategy. The first is query transformation. The idea here is to modify the user's input before you even search to improve the chances of finding the best documents.

Speaker 1

14:38

Ah, like cleaning up the question first exactly.

Speaker 2

14:40

One technique is rewrite retrieve read. Here, you actually use another LM call first just to rewrite the user's potentially vague or conversational query into a clearer, more focused search query. Then you use that rewritten query for the retrieval step.

Speaker 1

14:54

Smart like having an assistant clarify your question before searching. Does it add much delay?

Speaker 2

15:00

Yeah, it's a little bit of latency. Yeah, because it's an extra LM call, but often the improvement in retrieval quality is worth it. Another transformation technique is multi query retrieval. Instead of just one query, you have the LMM generate multiple versions of the given user question, maybe from slightly different angles or using different keywords.

Speaker 1

15:20

Oh interesting, why do that?

Speaker 2

15:21

It's great for complex questions that might need information from multiple perspectives. You run retrievals for all those generated queries in parallel, then combine the unique documents found. It casts a wider net, reducing the chance you miss something important. Building on that is RAG fusion. It starts like multiquery, generating multiple queries and retrieving results for each, but then it has a crucial final re ranking step using something called the reciprocal rank of fusion.

Speaker 3

15:47

RF algorithm.

Speaker 1

15:49

Sounds technical.

Speaker 2

15:50

It's a clever way to combine the rankings from all the different searches. Documents that consistently rank highly across multiple queries get boosted to the very top. Really effective at finding the most relevant stuff while also broadening discovery.

Speaker 1

16:03

Okay, so RF aggregates the wisdom of multiple searches. Cool any other transformation tricks?

Speaker 2

16:09

One more interesting one is hypothetical document embeddings or Heidi this kind of counterintuitive. Instead of searching with the user's query, you first have an LM create a hypothetical document that would be a perfect answer to the query.

Speaker 1

16:23

Wait, it makes up an answer first.

Speaker 3

16:24

Yeah.

Speaker 2

16:25

The intuition is that this generated ideal answer, even though it's hypothetical, is often semantically more similar to the actual relevant documents than the original maybe short or ambiguous user query. So you embed this hypothetical answer and use that embedding for the similarity search.

Speaker 1

16:42

Huh, that's sliver. Using an ideal answer is a better search query? Okay, so that's query transformation. What's the second strategy?

Speaker 2

16:49

The second strategy is query routing. This tackles the problem you mentioned earlier. What if your data lives in different places. Maybe you have Python docks in one vector store and JavaScript docs in another.

Speaker 1

17:02

Right, how do you send the query to the right place?

Speaker 2

17:04

That's exactly what quer routing does, forward a user's query to the relevant data source. There are a couple of ways logical routing uses an LLM to make the decision. You give the LM descriptions of your available data sources, like this index contains technical documentation for Python and Based on the user's query, the LM picks which of the available indexes to use. Lang chain helps ensure the LM outputs its choice in a structured way your application can understand.

Speaker 1

17:32

So the LLM acts like a switchboard operator kind of.

Speaker 3

17:35

Yeah.

Speaker 2

17:36

Alternatively, there's semantic routing.

Speaker 3

17:38

Here.

Speaker 2

17:39

You embed the descriptions of your data sources themselves. Then you compare the user's query embedding to these description embeddings. The closest match indicates the most relevant data source. This is more dynamic, doesn't require an LLLM call for every routing decision.

Speaker 1

17:52

Okay, route it logically or semantically makes sense. What's the third major strategy?

Speaker 2

17:57

The third is query construction. This is about transforming a natural language query into the query language of the database or data source you were interacting with. It goes beyond just finding similar text chunks. Oh so, well, maybe you need to combine semantic search with traditional database filters. Text to metadata filter is a technique where the LLM extracts structured information like a date, a category, a price range directly from the user's natural language query.

Speaker 1

18:24

Ah so, if I ask for sci fi movies from the eighties, it pulls out sci fi for semantic search and nineteen eighties as a metadata filter exactly.

Speaker 2

18:32

It lets you combine the power of semantic understanding with the precision of structured filters. Another big one here is text to seql. This involves having the LM translate a natural language question like what were our total sales in Q three directly into an executable SQL query to run against a traditional relational database.

Speaker 1

18:50

Wow, that's powerful. How do you make that reliable? SQL? Can be tricky?

Speaker 2

18:54

It requires careful setup. You usually need to provide the LLM with a description of the database scheme like the create table statements, maybe some example rows, and often some few shot examples of natural language questions paired with their correct SQL queries.

Speaker 3

19:09

To guide it.

Speaker 1

19:10

Got it? So, text to sqel translates language to database code.

Speaker 2

19:14

If we connect this text to SQL capability to the bigger picture. Though, while it's incredibly powerful, letting an LEBLEM generate SQL queries directly from potentially untrusted user input is one of the riskiest things you can do in a production application. Ah.

Speaker 1

19:29

Security implications, Yeah, I can see that.

Speaker 2

19:31

Absolutely. This raise is a really important question around safety. You must implement critical security measures, things like ensuring the database connection has read only permissions, strictly limiting access to only the necessary tables, maybe even views, and definitely adding query timeouts to prevent denial of service attacks or runaway queries. It's a capability that demands extreme caution and robust safeguards.

Speaker 1

19:54

Absolutely crucial point. Okay, this is getting really interesting, especially for building interactive apps. How do we tackle the fact that lllms are inherently forgetful. How do we give them memory to actually hold a conversation, especially as things get more complex.

Speaker 2

20:08

Yeah, you've hit on a fundamental aspect. Llms are stateless. Every time you interact with them. It's like a fresh start. There's no memory of the prior prompt or model response built in. It's like talking to someone who forgets everything you just said every.

Speaker 1

20:22

Single turn, which isn't great for a chatbot.

Speaker 3

20:24

Not at all.

Speaker 2

20:26

The simplest way to build memory is just to literally store the history of the chatter, all the user messages and assistant responses as a list, and then include that entire list in the prompt for the next turn.

Speaker 1

20:37

Okay, just stuff the whole conversation history back in.

Speaker 2

20:40

Basically, Yeah, yeah, but you can imagine at scale. That gets tricky. How do you update that history reliably? How do you manage state? When you have multiple things happening, maybe multiple actors or steps in your application? It gets complicated fast. And that's precisely where langgraf enters the picture. Lang graph acts as the coordination layer these more complex, multi step, potentially multi actor applications. It's what allows them to remember state and coordinate actions over time.

Speaker 1

21:08

Lang grap Okay, how does it work? Is it like a state machine?

Speaker 2

21:10

You think of it kind of like designing a flow chart for your AI app. It is three core components. First, there's the state. This is the shared data that evolves over the course of the application run. It can include the chat history, intermediate results, anything the application needs to remember. Second, you have nodes. These are the individual steps or functions

21:30

in your flow chart. A node might be a call to an LM, a call to a tool like our calculator or search engine, or just some regular Python code that processes the state. And Third, you have edges. These are the connections between the nodes, determining the flow of execution. Edges can be fixed like always go from no day to node B, or they.

Speaker 1

21:48

Can be conditional conditional edges, meaning.

Speaker 2

21:51

Meaning the next step depends on the current state. Often, an LM in one node might decide which node to go to next, making the flow dynamic. A huge benefit of lang graph is its built in persistence. It uses something called a checkpointer. You can think of it like an auto save.

Speaker 1

22:07

Function AH, so it saves the state automatically exactly.

Speaker 2

22:10

It saves the current state after each step. This means that every invocation after the first doesn't start from blank slate. If the app crashes or the user comes back later, it can pick up right where it left off from the last save state. Really important for long running or stateful interactions.

Speaker 1

22:27

That's huge for usability. What about that growing chat history though, Does lang graph help manage that so you don't overload the LLM?

Speaker 3

22:34

Yes? Absolutely.

Speaker 2

22:36

While lang graph manages the overall state persistence, you still use lane chain's utilities within your nodes to manage the specific chat history part of the state before passing.

Speaker 3

22:45

It to an LLM.

Speaker 2

22:46

You can indeligently filter messages, maybe keeping only summaries or key turns, trim the history based on the number of messages or total tokens, or even merge older parts of the conversation into a concise summary, all designed to keep the context relevant without ex eating.

Speaker 3

23:00

Those M limits.

Speaker 1

23:01

Okay, so lang graph orchestrates the flow in memory, and lang chain helps manage the conversation content. That makes sense. So we've gone from just calling an LLLM once to chains and now to the state ful lang graph applications. How should we think about the different levels of complexity and I guess intelligence in these systems.

Speaker 2

23:20

That's a great way to frame it. We can think about a progression of cognitive architectures moving towards more sophisticated applications, and as we move up this ladder, we constantly grapple with that fundamental trade off we mentioned agency, the lmm's capacity to act autonomously, and reliability the degree to which we can trust its outputs. More autonomy often means less predictability.

Speaker 1

23:41

Right the agency versus reliability balance.

Speaker 3

23:43

So progression looks something like this.

Speaker 2

23:45

At the base, you have a simple LM call, one input, one output, like asking get to summarize text. Simple, usually reliable for that specific task. Next level up is a chain. This involves multiple LELLM calls or calls to tools executed in a pre defined fixed sequence. Example, step one LLLLM generates a SEQL query. Step two different M explains that query in plain English. The sequence never changes.

Speaker 1

24:09

Okay, fixed steps like an assembly.

Speaker 2

24:11

Line pretty much. Then it gets more dynamic with the router. Here, an LOM decides the sequence of steps. At runtime, it chooses which pre defined path to take based on the input. Like an earlier example, if it's a medical question, rode to the medical index, if insurance route to the FAQ index. The LOLLM makes a choice, but the possible paths are still pre defined.

Speaker 1

24:30

So it adds a decision point. And beyond writers, we finally get to agents. That seems to be the buzzword everyone's excited about.

Speaker 2

24:38

Exactly an agent is quite simply something that acts. What makes agent architectures unique and powerful is that they use an LLLM driven loop for control. The LM isn't just executing pre defined steps. It's deciding what to do next based on the result of its previous actions, and critically, it decides when to stop. The most common pattern here is the plan do loop, often called the react to architecture reasoning plus acting plan do loop.

Speaker 1

25:04

How does that work in practice?

Speaker 2

25:05

Imagine an agent needs to answer a complex question that requires, say, searching the web and then doing a calculation. Step one, Them gets the question, observes. Step two, it thinks, for reasons, okay, answer this, I first need to search for X. Step three, it plans an action, call the search tool with query X. Step four, the system executes, the search tool does step five. Them gets the search results back, observes again. Step six, It thinks again, okay, based on these results, now we.

Speaker 3

25:35

Need to calculate why using the calculator tool.

Speaker 2

25:37

Step seven, it plans the next action, call the calculator with inputs A and B. Step eight, the system executes the calculator does. Step nine. The LOM gets the calculation result observes.

Speaker 3

25:47

Step ten.

Speaker 2

25:48

It thinks, one last time, okay, now I have all the pieces I can formulate the final answer. It decides the loop is finished. Step eleven, it outputs the final answer to the user. See how the LOM is driving the whole process, deciding which tool to use one and ultimately deciding when it's done. That iterative self directed loop

26:03

is the core of an agent. Plank graph is perfect for implementing these loops using its nodes for the LLM calls and tool executions and conditional edges to route the flow based on the LM's decisions.

Speaker 1

26:14

Wow, okay, that really clarifies it. The LLM is in the driver's seat, choosing actions and deciding when the job is done. That's a big step up in autonomy. Can we make these agents even smarter?

Speaker 2

26:26

Definitely? There are enhancements. For instance, you might design the agent to always call a tool first, maybe always starting with a search to ensure its reasoning is grounded in current information before it even tries to answer. Another challenge arise is when you have many possible tools, how does the agent pick the right one? You can actually use

26:45

our rag on the tool descriptions. Store descriptions of all your tools in a vector store, and when the agent decides it needs a tool, it first does a semantic search over the tool descriptions to find the most relevant one for the.

Speaker 1

26:56

Task at hand, using our rig to help the agent choose its own tools. That's pretty meta. This sounds incredibly powerful, giving models the ability to genuinely tackle multi step problems. Can they go even further?

Speaker 3

27:07

Like?

Speaker 1

27:08

Can agents learn or work together or maybe even critique their own work?

Speaker 3

27:12

Yes? Absolutely, to all of those.

Speaker 2

27:14

One really powerful extension is reflection or self critique. This involves setting up a kind of loop, often using multiple LLM calls that mimics how humans create and refine things. You might have a creator prompt that generates a first draft of something, say in an essay. Then you have a separate revisor prompt with the LLLM critiques that draft based on specific criteria like clarity, tone, fascial accuracy. Finally, the original lam or another one revises the draft based on that.

Speaker 1

27:42

Critique, so it acts as its own editor exactly.

Speaker 2

27:44

It allows the LM to refine its output, maybe catch errors or improved style. You can eat something of the revisor LLM to adopt a different persona while critiquing, like asking it to review the essay from the perspective of a skeptical historian. This iterative refinement can significately boost quality.

Speaker 1

28:01

That's amazing. What about teamwork? Can you have multiple agents collaborating?

Speaker 2

28:05

Yes, for really complex problems that might be too much for a single agent, maybe requiring too many different tools or too much context, you can use multi agent architectures. You literally build teams of LVM agents that work together. There are different ways to coordinate these teams, but a practical approach highlighted in the book is the supervisor architecture. Here you have a central supervisor agent often an element self, whose job is to manage the workflow based on the

28:32

overall goal and the current state. The supervisor decides which agent or agents should be called next. It wrote tasks to specialize subagents. Maybe one agent is good at research, another at writing code, another at summarizing. Their progress and results are often shared in a central place, like a list of messages in the lang graft state, allowing them to build on each other's work. It enables true collaborative problem solving among AI agents.

Speaker 1

28:57

So what does this all mean? We're talking about building digital teams that can think, plan, act, reflect on their actions, and even critique their own work. It feels like this genuinely expands the kinds of problems we can even attempt to solve with AI.

Speaker 2

29:10

It really does, and it brings us right back to the fundamental tension we keep mentioning the trade off between agency and reliability. As these agents become more autonomous, controlling them and trusting their output becomes even more critical. If we zoom out again, and connect this to the bigger picture. You can visualize this trade off as a kind of frontier on a graph. You want to push that frontier outwards, achieve more agency for the same level of reliability, or

29:36

achieve higher reliability for the same level of agency. Pushing this frontier is key to building production ready applications people can actually depend on.

Speaker 1

29:44

Right, trust is paramount, So specifically, how do we improve the actual user experience and maybe more importantly, the reliability and control in these powerful, sometimes complex, agentic systems.

Speaker 2

29:55

Great question. There are several vital techniques discussed in the book. First off, managing latency p reception with streaming an intermediate output. LLM calls, especially in complex agent loops, can take time. A few seconds of waiting can feel long to a user, so communicating progress makes that higher latency more palatable. This includes dreaming the llm's final output token by token so the text appears gradually like someone typing.

Speaker 1

30:19

Makes it feel more responsive exactly.

Speaker 2

30:21

It also includes showing intermediate steps maybe messages like okay, searching for X or now calculating Y, so the user sees the agent as working. Second, and absolutely critical for reliability is ensuring structured output. You often need the LLM to return information in a specific predictable format like Jason, not just freeform text. Linke dams with structure output method

30:43

is designed for this. It helps reduce variants and ensures downstream systems can reliably parse and use the llm's output. Using a low temperature setting often helps here too.

Speaker 1

30:52

So you get predictable data structures back, not just a chatty response precisely.

Speaker 2

30:57

Third, especially for high agency applications, you need human in the loop controls. These give essential oversight to the end user, allowing intervention and correction. One control is interrupt The user should be able to manually stop an ongoing agent process at any time. Ideally the state is saved, so they can choose to resume later, restart, or just abandon it. If the agent's going off track. A panic button basically

31:24

kind of another is authorized. The application pauses before performing a potentially critical or irreversible action, maybe sending an email, making a purchase, modifying a file, and ask the user for explicit confirmation is it okay to do this?

Speaker 1

31:38

Essential for safety absolutely.

Speaker 2

31:40

And then there's the ability to fork and replay history. This allows the user to effectively go back in time to an earlier point in the conversation or workflow state, and then start a new branch from there, trying a different approach. It's fantastic for experimentation, debugging, and recovering from errors without starting over completely.

Speaker 1

31:56

Those controls sound invaluable for making these powerful systems usable end safe in the real world. Okay, what about a common scenario. Llms can be a bit slow. What happens if a user sends a new message while the agent is still thinking about the previous one. How does systems handle that concurrency?

Speaker 2

32:14

That's the challenge of multitasking lllms. Handling concurrent inputs lms are often quite slow compared to traditional software responses. Users will send follow up messages or new requests before the first one is finished. There are different strategies. The simplest is just to refuse concurrent input maybe disable the input box while processing.

Speaker 1

32:33

Not very user friendly, Yeah, frustrating.

Speaker 3

32:36

Right.

Speaker 2

32:36

You can handle each input independently in a new thread, but that might lose conversational context. You could queue inputs, processing them one after another, or you can interrupt the current run to prioritize the new input, either abandoning the old one or trying to save its partial state. This raises an interesting question. If an LLM is mid computation and you send another message, how should it respond intelligently? The book mentions in advanced sategy called.

Speaker 3

33:00

Fork and merge.

Speaker 1

33:01

Fork and merge.

Speaker 2

33:02

Yeah, The idea is the system temporarily forks the agent's current state. When new input arrives, it processes the new input in parallel, perhaps starting from that fork state. Then somehow it intelligently merges the results or final states from both the original computation and the new parallel one. It's complex to implement correctly, requiring careful steat management, but it could allow for very fluid interruption tolerant interactions.

Speaker 1

33:28

Wow. Okay, that sounds complex but powerful for smooth interaction.

Speaker 3

33:32

Yeah.

Speaker 1

33:33

So we've designed the core logic, we've added memory with lang graph, We've considered reliability and UX controls. Now, how do we actually get this thing out of our development environment and deployed for external users, making sure it's stable and can handle real.

Speaker 2

33:46

Traffic Right the production leak? This is where things get serious. The book highlights lang graph platform as a solution here. It's essentially a managed service for deploying and hosting lang graph agents at scale. Its goal is to handle the operational headachesction. It provides things like horizontally scaling task queues and servers to handle many concurrent users, plus a robust postgress checkpointer for efficiently storing potentially large states and conversation threads.

34:13

The aim is fault tolerant scalability, so.

Speaker 1

34:16

It takes care of the scaling and reliability infrastructure.

Speaker 3

34:19

That's the idea.

Speaker 2

34:20

Now, before you deploy there, you'll need a few prerequisite setup your apikeys from your LLM provider like OpenAI, a configured vector store if you're using RWI. The book mentioned Superbase with its pg vector extension is a good option, and you'll need a langsmith account because lang graft platform is tightly integrated with langsmith for monitoring and debugging.

Speaker 1

34:41

Langsmith. What's that?

Speaker 2

34:42

Langsmith is another part of the lang sching ecosystem focus specifically on observability, debugging, testing and monitoring for LLLM applications. It's pretty crucial for the whole life cycle.

Speaker 1

34:52

Okay, got it, So you set up prerequisites, then how do you deploy to lang graph platform.

Speaker 2

34:57

The process is designed to be fairly straightforward. You typically define your lang graph application structure in a configuration file, often the langgraph dot json. You can test it locally using the lang graph command line interface the CLI with a command like lang grafh dev, and then deployment to the managed platform is often done via one click submissions directly from the langsmith user interface.

Speaker 1

35:18

Okay, seems streamlined, but deployment isn't the end, right. We keep coming back to the fact that llms are non deterministic and prone to hallucination. How do we build and maintain trust after launch? How does continuous improvement work in this AI world?

Speaker 2

35:34

This is absolutely critical. Deployment is just the beginning of the journey. You need that continuous improvement cycle. Design, test data is deployed on monitor and fixed JAD redesign. Even in the design stage, you can build in defensiveness like that self corrective our gag idea we touched on earlier. You can have an LLLM within your agent's flow whose job is to grade retrieval relevance. Did we find good documents and check the answer for hallucinations before showing it to the user.

Speaker 1

35:59

The LM double checks itself yeah.

Speaker 2

36:01

And if it decides the retrieval was poor or the answer looks suspicious, it could trigger a fallback. Maybe try a web search instead, or ask the user for clarification. Building self correction right into the design. Then comes pre production testing. This is all about measuring accuracy, latency, cost, whatever metrics matter to you before you expose the app to real users. For this, you need good data sets.

Speaker 1

36:24

Where do those data sets come from?

Speaker 2

36:26

Several sources. You can have manually curated examples humans carefully writing good questions and ideal answers. You can use application logs from early internal testing or beta users, or you can even generate synthetic data using other lolms to create diverse examples of inputs and outputs. Langsmith actually has tools specifically for creating and managing these test data sets. Once

36:47

you have data, you need evaluation criteria. You compare your app's output against some ground truth references.

Speaker 3

36:52

How do you evaluate? You can use human evaluators people.

Speaker 2

36:55

Giving quolitative feedback on nuance, tone correctness, very valuable, but slow and expensive. You can use heuristic evaluators basically simple hard code of checks like does the output contain the specific keyword? Is a below a certain length? Quick, but limited increasingly popular lmms is a judge evaluators. Here, you use another LM, give it a real quick and ask it to score or critique your application's output based.

Speaker 3

37:19

On that rubric.

Speaker 1

37:19

Using an LM to judge another LLLM exactly.

Speaker 2

37:23

And what's really fascinating here, Lanksmith has this clever feedback loop. If a human corrects or disagrees with the LM as a judge's assessment, lank Smith captures that correction and automatically turns it into a fu shot example that gets added back into the judge's prompt for future evaluations.

Speaker 1

37:40

Wow, so the judging LLM actually learns from human corrections over time.

Speaker 2

37:44

Precisely, it helps the automated evaluation along better with human preferences, reducing the need for constant manual prompt tweaking for the judge. It's a really smart self improving mechanism. You also need rigorous regression testing. As you update your code or even the underlying LLLM models change model drift, you need to constantly rerun your tests to prevent regression and sure you

38:07

haven't accidentally made things worse. Langsmith's comparison view is designed to help spot these performance changes over time, and for complex agents, evaluation needs to happen on multiple levels the final response, but also individuals single step decisions like did it pick the right tool, and even the entire trajectory, the sequence of actions it took.

Speaker 1

38:25

Okay, that's a lot of testing before launch.

Speaker 2

38:27

What about after production? Monitoring is crucial for catching bugs and weird edge cases that only emerge with real users and real world data. Lang Smith is key here again providing tracing to track exactly what happened inside your agent when it error occurred or user gave bad feedback. You need mechanisms for collecting feedback and production, maybe simple thumbs updown buttons, annotation ques where users or internal teams can flag issues, or even running those LMM as a judge

38:53

evaluators on live traffic samples. You can also implement classification and tagging on inputs and outputs, checking for things like toxicity, personally identifiable information, or even trying to detect prompt injection attacks. These act as safety guardrails and a really important practical tip. Release your app in phases. Start with a small group of Beata users, gather feedback, fix issues, then gradually expand the rollout. Don't just flip the switch for everyone on day one.

Speaker 1

39:19

So what this all really means, this continuous cycle of design, testing, evaluation, monitoring, fixing. It's not just about squashing bugs, is it. It feels like it's about systematically building confidence and trust in these incredibly powerful but inherently probabilistic systems. It's about making them reliable enough for the real world. Okay, thinking bigger picture, now,

39:41

let's unpack this idea. Llms are amazing because they're so intuitive, right, They often understand what we mean, even with typos or slightly vague questions. That's fantastic. But that same flexibility means their output isn't always perfectly predictable. It can be slightly off, and that challenges our traditional software interfaces, which we usually

39:59

build expecting very precise, deterministic results. How is this fundamental difference going to change the way we actually interact with software.

Speaker 3

40:06

That's a really profound question. You're right.

Speaker 2

40:08

Traditionally UIs think Microsoft Word, Figma spreadsheets. They have fixed tool pallettes, predictable menus canvases where actions have precise, repeatable outcomes because the underlying logic is deterministic LLLM powered applications are just different. They are more forgiving of messy input, which is great, but their output does have that inherent variability. This mismatch is pushing us to think about new interaction patterns. New UIs designed for this LLM native world. The book

40:38

outlines three really interesting emerging patterns. The first and probably the easiest lift to integrate into existing apps is the interactive chatbot. Think of this as an AI sidekick living within your application, like get up copilot, chat within your code editor, or a similar chat interface within a design tool or document editor. This chatbi can see the main application content, the code, the design, the document and interact.

Speaker 1

41:03

With it so you can talk to it about the thing you're working.

Speaker 2

41:05

On, exactly, explain this code, suggest a different layout. Summarize this section. It's conversational collaboration. The key components are a good dialogue tune chat model obviously, conversation history, streaming outputs so it feels responsive, tool calling so the chatbot can actually invoke application functions, change the font size, refactor this code, and probably human in the loop controls for safety.

Speaker 1

41:28

Okay, the AI sidekick makes sense. What's the next pattern?

Speaker 2

41:31

The second pattern pushes the collaboration idea further collaborative editing with lms. Here the LLM agent isn't just a sidekick you talked to. It becomes one of those users contributing to this shared document or shared state, right alongside human collaborators. Think Google docs, but one of the cursors belongs to an AI.

Speaker 1

41:49

Whoa in AI is a real time teammate.

Speaker 2

41:52

Potentially, Yes, what's fascinating here is how this could work. Maybe the LM acts as an asynchronous drafter, pharing sections for you overnight, or maybe it's an always on copilot subjecting improvements or cleaning up formatting in real time. If we connect this to the bigger picture, it raises really important questions about how we design systems Where human edits and AI edits merge seamlessly, how do you handle conflicts whose changes take precedence?

Speaker 1

42:19

Yeah, the merging in conflict resolution sounds tricky.

Speaker 3

42:22

Definitely.

Speaker 2

42:23

Key components here involve managing that shared state carefully, maybe using sophisticated techniques like conflict free replicated data types crdts, or just robust merging logic. You need task managers, ways to handle concurrency, and definitely a good under.

Speaker 1

42:37

Redos stack an AI teammate directly editing alongside you. That's a huge paradigm shift. What's the third emerging pattern?

Speaker 2

42:44

The third and perhaps the most futuristic feeling is ambient computing. This is where the LLM is continuously doing some kind of work in the background while you, the user, are presumably doing something else.

Speaker 3

42:56

Entirely.

Speaker 2

42:57

It's not waiting for your explicit command, proactively working on your.

Speaker 1

43:01

Behalf as silent, always on assistant kind of Think about how LLM reasoning could transform this.

Speaker 2

43:07

Old ambient computing often required setting up lots of manual rules. If I get an email from X, notify me tedious. The fascinating question now is can elms use their understanding and reasoning to proactively identify what's genuinely interesting or important to you without needing endless configuration.

Speaker 1

43:25

So it learns what matters to me and surfaces just that.

Speaker 2

43:28

That's the potential. Key components here would include triggers detecting new information like emails, news, calendar, updates, long term memory to build context about you and your priorities, reflection, the agent actively learning and updating its internal model of what you find interesting, and crucially summarized output. It doesn't bombard you. It intelligently summarizes its findings and surfaces only the noteworthy stuff.

43:54

It needs a task manager too, to keep track of his background processes.

Speaker 1

43:58

So what does this all really mean for how we'll interact with tech? Could lms truly become these quiet, proactive assistants, maybe to drafting replies, summarizing reports, alerting us to opportunities all happening in the background without constant manual setup. That feels like a fundamentally different way to experience software.

Speaker 2

44:16

It really does point towards a potential future where software is less a collection of static tools we actively wield and more of an intelligent environment that it anticipates and assists.

Speaker 1

44:26

We have covered an incredible amount of ground today, haven't we. It feels like a whirlwind.

Speaker 3

44:30

Tourth It really does.

Speaker 1

44:31

From the absolute fundamentals of llms, how they predict text, what tokens are, and how prompting is our way to steer them, then diving deep into retrieval augmented generation our RAG, that crucial technique for grounding them in real world specific data to fight hallucinations.

Speaker 2

44:47

Yeah, and then moving into land graph enabling these complex multi step agents that can actually remember conversations, plan sequences of actions using tools, and even reflect and critique their own.

Speaker 1

44:59

Out puts exactly. And then tackling the really practical side making these things production ready, ensuring reliability with structured output, giving users control with human in the loop features, and mastering that whole deployment, testing, monitoring, and continuous improvement cycle.

Speaker 2

45:16

It's a lot, It is a lot, but it's clear that llms, combined with powerful frameworks like line chain and lang graph are genuinely giving us well thing building superpowers.

Speaker 3

45:25

As the book puts it, they're.

Speaker 2

45:26

Making previously hard things easy and previously impossible things possible. This deep dive hopefully has equipped you, the listener, with the core knowledge to not just watch this revolution unfold, but to potentially participate.

Speaker 1

45:38

In it absolutely. And as we wrap up, maybe a final thought to chew on. As these llms become more capable, more edgentic, and more deeply integrated into our digital lives. Imagine that world where software isn't just a static toolbox anymore. Imagine it as a dynamic, intelligent collaborator, always learning, always adapting.

45:58

How will that fundamental shift I packed our creativity, our productivity, even our very understanding of what it means to be well informed or in control in an age of potentially ambient AI constantly working around us.

Speaker 2

46:10

That's a fascinating future to contemplate.

Speaker 1

46:11

It really is. We definitely encourage you to continue your own deep dive. Explore the open source Langschaine library, check out lang graph, maybe look into langsmith. There's so much happening, and the joy of discovery in this field right now is truly immense.

Transcript source: Provided by creator in RSS feed: download file

Episode description

Transcript