LLM as a Judge: Why Your AI Might Be Marking Its Own Homework

Apr 30, 2026 · 1 hr 7 min

Episode description

Coding Chats episode 76: John talks to Laura Dietz, a computer science professor whose work focuses on whether AI evaluation metrics actually tell the truth. She's known for her critical take on "LLM as a judge", not because she thinks it's useless, but because she wants numbers that mean something rather than numbers that just make a system look good.


The conversation tackles some uncomfortable realities for software engineers: using an LLM to write code and another to review it is a circular trap, prompt engineering shouldn't be a computer scientist's day job, and every time you reject your code AI's output, you're quietly generating the training data that shapes its successor.


Chapters

00:00 Introduction to Laura Dietz and Her Journey

03:12 Exploring LLMs as Judges

06:16 Challenges in Evaluating Search Systems

08:49 The Evolution of User Queries and Expectations

11:46 The Role of LLMs in Information Retrieval

14:44 Defining Quality in Search Results

17:27 The Complexity of User Intent

19:54 Human-AI Collaboration in Code Review

22:53 The Future of LLMs in Software Development

25:23 Balancing Human and AI Roles

28:20 Innovative Approaches to AI Evaluation

34:10 The Art of Assembling Ideas

36:39 Balancing Cost and Quality in LLMs

39:09 Evaluating LLM Performance

43:50 The Future of LLMs and Training Data

49:19 Exploring New Architectures in AI

55:16 Understanding In-Context Learning

01:00:45 The Role of AI in Creative Expression

01:06:59 Exploring Related Content


Laura's Links:

https://www.cs.unh.edu/~dietz/

https://www.linkedin.com/in/laura-dietz-47036516/

John's Links:

John's LinkedIn: https://www.linkedin.com/in/johncrickett/

John’s YouTube: https://www.youtube.com/@johncrickett

John's Twitter: https://x.com/johncrickett

John's Bluesky: https://bsky.app/profile/johncrickett.bsky.social


Check out John's software engineering newsletters: Coding Challenges: https://codingchallenges.substack.com/ shares real-world project ideas that you can use to level up your coding skills.


Developing Skills: https://read.developingskills.fyi/ covers everything from system design to soft skills, helping software engineers progress their careers from junior to staff+ or move onto a management track.


Takeaways

Using an LLM to both generate and evaluate outputs is circular — like a student grading their own homework.

If your evaluation metric can go up without your system actually improving, it's not a real metric.

A better human-in-the-loop isn't one that rubber-stamps AI suggestions — it's one that's guided to look in the right place.

LLMs don't get bored, which makes them genuinely useful for code review — but that's not the same as making them accurate.

"Faith-based engineering" — trusting AI output without validation — is a real and growing problem in software teams.

Prompt engineering is a workaround, not a discipline; real engineers should be building systems, not crafting incantations.

Every rejection you give your code AI is training signal — your frustration today is someone else's better tool tomorrow.

The transformer attention mechanism is a weighted sum, and a sum isn't always the right operation — some problems need an AND, not an OR.

AI tools are lowering the barrier to coding for people who were previously too intimidated to try, and that's worth celebrating.

The same network effect that makes a platform valuable also makes monopoly in AI training data genuinely dangerous.
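
The takeaway about attention being a weighted sum can be made concrete with a small sketch. This is an illustration only, not something shown in the episode: the toy vectors are made up, and `all_conditions_met` is a hypothetical helper that uses a minimum as one possible reading of "an AND". It computes single-query scaled dot-product attention and contrasts the weighted sum with an AND-like aggregation over the same values.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(query, keys, values):
    """Single-query scaled dot-product attention: a weighted sum of values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


def all_conditions_met(evidence):
    """Hypothetical AND-like aggregation: the weakest signal decides."""
    return min(evidence)


# Toy inputs, made up for illustration.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0], [0.0]]  # first condition present, second absent

weights, output = attention_output(query, keys, values)
and_result = all_conditions_met([v[0] for v in values])

print("attention weights:", weights)
print("weighted sum:", output)
print("AND-like result:", and_result)
```

In this toy case the weighted sum stays high because one value matches the query strongly, even though the other is absent, which is closer to an OR. The minimum reports failure as soon as any one condition is missing.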
