"10-Minute System Design" is your go-to podcast for quick, digestible insights into system design, AI, machine learning, and distributed systems. In each episode, your hosts break down complex tech concepts into easy-to-understand discussions, it's perfect for both beginners looking to learn the basics and experienced professionals needing a quick refresh. In just 10 minutes, we dive deep into the core ideas, offering clear explanations and practical takeaways to help you stay sharp and informed in today’s fast-evolving tech landscape.
In this 10-minute episode, we’ll explore neural networks, the core component of many modern AI models, including LLMs like ChatGPT, Gemini, and Claude. Loosely modeled on how the human brain works, neural networks are often seen as a “black box”; we’ll demystify them and reveal how they work. It’s a must-listen for new learners curious about AI.
In this episode, we'll take a look at Meta’s ambitious approach to scaling large language models. We'll explore the shift from handling many smaller models for recommendation engines to building colossal generative AI models, and the immense challenges that come with it. From hardware and software optimizations to managing power and dealing with inevitable hardware failures, we'll break down the critical pieces that make Meta's infrastructure tick. What does it take to run systems this large wit...
In this episode, let's explore how Netflix revamped their video processing pipeline, moving from a monolithic system to a microservices architecture. What drove such a major shift? You'll hear how their original platform, Reloaded, couldn’t keep up with Netflix’s rapid pace of innovation, and why Cosmos, their new system, is now the backbone of everything from streaming to studio operations. But what challenges did they face along the way? And is Cosmos truly the future-proof solution it promise...
In this episode, we explore the engineering behind Apple's iCloud and its evolution from a massive Cassandra deployment to an architecture powered by FoundationDB. The discussion highlights how Apple overcame Cassandra's limitations regarding real-time collaboration and partition constraints. Learn about the FoundationDB Record Layer, its role in enabling multi-tenancy and per-user databases, and Apple's strategies for mitigating FoundationDB's latency through asynchronous processing. The episode concludes with takeaways on scalability, stateless design, and robust conflict resolution.
In this episode, we explore the system behind Uber's driver-matching functionality, capable of handling an incredible one million requests per second. We break down the key technologies that make it work, from H3, the hexagonal grid system for location indexing, to Ringpop, which scales services across servers. You'll hear about how GPS data is transformed into road segments, and how databases like Cassandra and Redis power this high-demand platform. Whether you're curious about large-scale syst...
In this episode, we'll learn how Instagram scaled to 2.5 billion users. We'll discuss the major challenges Instagram faced — from resource constraints to data consistency and performance, and unpack the innovative strategies the team used to tackle them. From replacing Python with more performant languages to leveraging Cassandra for distributed data storage, we'll learn how Instagram managed to keep things running smoothly at such massive scale. Curious how they did it? Tune in to hear how a mi...
In this episode, we explore how Facebook engineers scaled Memcached, the open-source caching system, to handle billions of requests and trillions of items. We’ll break down the challenges they faced and the smart solutions they developed — from reducing latency to optimizing memory usage. Join us as we uncover how they transitioned from a single cluster to a distributed system spread across the globe, tackling data replication, load balancing, and more. If you’re curious about the inner workings...
In this episode, we explore another important piece of technology from Google: Spanner, a globally distributed database that reshapes how massive datasets are managed. We’ll talk about its unique architecture, including the TrueTime API, which exposes bounded clock uncertainty so that Spanner can ensure consistency across data centers. We’ll also cover Spanner’s concurrency control, two-phase commit, and lock-free read-only transactions. Plus, discover how Google’s ad platform, F1, leverages Spanner to handle millions o...
In this episode, we take a closer look at the Hadoop Distributed File System (HDFS), a key part of the Hadoop framework that helps store and manage huge amounts of data. We’ll explore how HDFS spreads data across many affordable servers, making it both scalable and cost-effective. You’ll learn about its main components, like the NameNode and DataNodes, and how they work together. We’ll also discuss features that keep your data safe and ensure it moves efficiently. Join us, we’ll touch on the cha...
This episode focuses on Kafka, the distributed messaging system born at LinkedIn. Learn how Kafka was designed to tackle the massive streams of log data driving personalized recommendations, search algorithms, and real-time security. We'll explore how it outperforms traditional systems like ActiveMQ and RabbitMQ with its streamlined architecture, decentralized coordination, and focus on efficiency. Tune in to explore Kafka's unique design and how it’s becoming essential for modern data processin...
In this episode, our hosts take a closer look at a groundbreaking research paper on Dynamo, Amazon’s innovative distributed data storage system. With a focus on availability over consistency, Dynamo employs cutting-edge techniques like consistent hashing and gossip-based failure detection to deliver high performance. Join us as we unpack the paper’s insights into its design and implementation, its real-world applications within Amazon, and the fascinating trade-offs between performance and durab...
In this episode, our hosts delve into the legendary research paper detailing the creation and implementation of Chubby, Google's innovative distributed lock service. Designed for large-scale, loosely-coupled systems, Chubby offers a reliable mechanism for synchronization, such as electing primary servers among peers. The paper explores the critical design choices prioritizing availability over raw performance, revealing the system's architecture, implementation intricacies, and essential compone...
In this 10-minute episode, we explore the Google File System (GFS), a scalable, fault-tolerant distributed file system designed for Google’s vast data needs. Built on commodity hardware, GFS ensures high performance for many clients. We’ll cover key design principles like handling frequent component failures, large file operations, and atomic appends. We’ll also dive into its architecture—featuring a master server for metadata management and chunkservers for storage—along with data handling, fau...
Join us in this episode as we dive into MapReduce. We’ll explore how it revolutionizes the way we process vast datasets on large clusters. With a focus on simplicity, the MapReduce framework abstracts complex tasks like data partitioning and fault tolerance, allowing users to easily define two essential functions: “Map” and “Reduce.” We’ll discuss real-world applications that showcase its power—from distributed grep to web link analysis. If you’re curious about how to harness the potential of di...
In this episode, our hosts delve into Cassandra, the distributed storage system developed at Facebook to tackle the immense challenges of managing structured data. Designed for high availability and scalability, Cassandra emerged from the need to support billions of daily writes for the Inbox Search feature. Join us as we explore this game-changing piece of tech that influences modern distributed systems today.
Imagine a storage system that can handle petabytes of structured data across thousands of ordinary servers. This is Bigtable, a system that redefines how structured data is managed at scale, with a design built for scalability and flexibility. Join us as we uncover its real-world applications, from Google Analytics to Personalized Search, and the vital lessons learned in designing robust, l...
Ever wondered how multiple processes can safely share resources without stepping on each other's toes? In this episode, we'll talk about Redis's distributed lock and discover how it ensures mutual exclusion for shared resources across a network of Redis servers, allowing only one process at a time to gain access. We’ll delve into its safety and liveness properties that guarantee reliable lock management, even amidst failures. Join us as we unpack potential challenges like network partitions and ...