Deep Papers

Deep Papers is a podcast series featuring deep dives on today’s most important AI papers and research. Hosted by Arize AI founders and engineers, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning.

Last refreshed: February 18th, 2026 at 9:06 PM ⓘ

Follow this podcast in the Metacast mobile app to refresh it and see new episodes.

Follow on

Apple Podcasts

Spotify

RSS

Podcasts are better in Metacast mobile app

Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episodes

Composable Interventions for Language Models

This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and ma...

Sep 11, 2024•43 min

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which they find to have a high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prom...

Aug 16, 2024•39 min

Breaking Down Meta's Llama 3 Herd of Models

Meta just released Llama 3.1 405B–according to them, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they d...

Aug 06, 2024•45 min

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic “prompt engineering.” The paper this week introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrated their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into mo...

Jul 23, 2024•34 min

RAFT: Adapting Language Model to Domain Specific RAG

Where adapting LLMs to specialized domains is essential (e.g., recent news, enterprise private documents), we discuss a paper that asks how we adapt pre-trained LLMs for RAG in specialized domains. SallyAnn DeLucia is joined by Sai Kolasani, researcher at UC Berkeley’s RISE Lab (and Arize AI Intern), to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT (Retrieval-Augmented FineTuning) is a training recipe that improves an LLM’s ability to answer questions in a “op...

Jun 28, 2024•44 min

LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic

It’s been an exciting couple weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We’re excited to chat about this significant step forward in understanding how LLMs work and the implications it has for deeper understanding of the neural activity of language models. We take a closer look at some recent research from both OpenAI and Anthropic. These two recent papers both focus on the sparse autoencoder--an unsupervised approach for extracting interpretable featur...

Jun 14, 2024•44 min

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment

We break down the paper--Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. Ensuring alignment (aka: making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive su...

May 30, 2024•48 min

Breaking Down EvalGen: Who Validates the Validators?

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation. This week’s paper explores EvalGen, a mixed-initative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users in developing both criteria accep...

May 13, 2024•45 min

Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models

This week we explore ReAct, an approach that enhances the reasoning and decision-making capabilities of LLMs by combining step-by-step reasoning with the ability to take actions and gather information from external sources in a unified framework. Learn more about AI observability and evaluation , join the Arize AI Slack community or get the latest on LinkedIn and X ....

Apr 26, 2024•45 min

Demystifying Chronos: Learning the Language of Time Series

This week, we’ve covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts matching or exceeding purpose-built models. We dive into time series forecasting, some recent research our team...

Apr 04, 2024•45 min

Anthropic Claude 3

This week we dive into the latest buzz in the AI world – the arrival of Claude 3. Claude 3 is the newest family of models in the LLM space, and Opus Claude 3 ( Anthropic's "most intelligent" Claude model ) challenges the likes of GPT-4. The Claude 3 family of models, according to Anthropic "sets new industry benchmarks," and includes "three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus." Each of these models "allows users to select t...

Mar 25, 2024•43 min

Reinforcement Learning in the Era of LLMs

We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper, aims to link the r...

Mar 15, 2024•45 min

Sora: OpenAI’s Text-to-Video Generation Model

This week, we discuss the implications of Text-to-Video Generation and speculate as to the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI Engineer Vibhu Sapra to review OpenAI’s technical report on their Text-To-Video Generation Model: Sora. According to OpenAI, “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” At the ...

Mar 01, 2024•45 min

RAG vs Fine-Tuning

This week, we’re discussing "RAG vs Fine-Tuning: Pipelines, Tradeoff, and a Case Study on Agriculture." This paper explores a pipeline for fine-tuning and RAG, and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. Overall, the res...

Feb 08, 2024•40 min

Phi-2 Model

We dive into Phi-2 and some of the major differences and use cases for a small language model (SLM) versus an LLM. With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently-announced Google Gemini Nano 2, despite ...

Feb 02, 2024•44 min

HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels

We discuss HyDE: a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages. This leap in zero-shot learning efficiently retrieves relevant real-world in...

Feb 02, 2024•36 min

A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B)–Part I

For the last paper read of the year, Arize CPO & Co-Founder, Aparna Dhinakaran, is joined by a Dat Ngo (ML Solutions Architect) and Aman Khan (Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral-8x7B. There's a lot to cover, so this week's paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture of experts model (SMoE) that outperforms Llama 2 7...

Dec 27, 2023•48 min

How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings

We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and promp...

Dec 18, 2023•45 min

The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets

For this paper read, we’re joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets.” Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements...

Nov 30, 2023•41 min

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

In this paper read, we discuss “Towards Monosemanticity: Decomposing Language Models Into Understandable Components,” a paper from Anthropic that addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features,” (patterns of neuron activations) providing a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach ...

Nov 20, 2023•45 min

RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

We discuss RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundatio...

Oct 18, 2023•44 min

Explaining Grokking Through Circuit Efficiency

Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, Sally-Ann DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency." This paper explores novel predictions about grokking, providing significant evidence in favor of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows de...

Oct 17, 2023•36 min

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we discuss the paper, “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior.” This episode is led by SallyAnn Delucia (ML Solutions Engineer, Arize AI), and Amber Roberts (ML Solutions Engineer, Arize AI). The research they discuss highlights t...

Sep 29, 2023•42 min

Skeleton of Thought: LLMs Can Do Parallel Decoding

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this paper reading, we explore the paper ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality. This episode is led by Aparna Dhinakaran ( Chief Product Officer, Arize AI) and Sally-Ann Delucia (ML Solutions Engineer, Arize AI), with tw...

Aug 30, 2023•44 min

Llama 2: Open Foundation and Fine-Tuned Chat Models

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Aparna Dhinakaran ( Chief Product Officer, Arize AI) and Michael Schiff (Chief Technology Officer, Arize AI), as they discuss the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models." In this paper reading, we explore the paper “Developing Llama 2: Pretrained Large Langu...

Jul 31, 2023•30 min

Lost in the Middle: How Language Models Use Long Contexts

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Sally-Ann DeLucia and Amber Roberts, as they discuss the paper "Lost in the Middle: How Language Models Use Long Contexts." This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval...

Jul 26, 2023•42 min

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we talk about Orca. Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation...

Jul 21, 2023•42 min

Toolformer: Training LLMs To Use Tools

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Timo Schick and Thomas Scialom, the Research Scientists at Meta AI behind Toolformer. "Vanilla" language models cannot access information about the external world. But what ...

Mar 20, 2023•34 min

Hungry Hungry Hippos - H3

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Dan Fu and Tri Dao, inventors of " Hungry Hungry Hippos " (aka "H3"). This language modeling architecture performs comparably to transformers, while admitting much longer co...

Feb 13, 2023•42 min•Season 1Ep. 2

ChatGPT and InstructGPT: Aligning Language Models to Human Intention

Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this first episode, we’re joined by Long Ouyang and Ryan Lowe , research scientists at OpenAI and creators of InstructGPT. InstructGPT was one of the first major applications of Reinforcement Learning...

Jan 18, 2023•48 min•Season 1Ep. 1

← Prev Next →

Hosted on Buzzsprout

For the best experience, listen in Metacast app for iOS or Android