Building the Next Generation of Conversational AI

Mar 14, 2025 · 2 hr 42 min · Ep. 37

Episode description

In this episode of AI + a16z, Sesame Cofounder and CTO Ankit Kumar joins a16z general partner Anjney Midha for a deep dive into the research and engineering behind their voice technology. They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. 

They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.
Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.

Key Takeaways:
  • How Sesame AI achieves natural voice interactions through real-time speech generation.
  • The impact of open-sourcing their speech model and what it means for AI research.
  • The role of full-duplex modeling in improving AI responsiveness.
  • How computational efficiency and system latency shape AI conversation quality.
  • The growing role of natural language as a user interface in AI-driven experiences.

For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.

Learn more:

The Maya + Miles demo

Crossing the uncanny valley of conversational voice

Sesame CSM 1B model

Follow everybody on X:

Ankit Kumar

Anjney Midha

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.
