In this episode of Disseminate, we welcome Harry Gavriilidis back to the podcast to explore his latest research on fast and scalable data transfer across systems, soon to be presented at SIGMOD 2025. Building on his work with XDB, Harry introduces XDBC, a novel data transfer framework designed to balance performance and generalizability. They dive into the challenges of moving data across heterogeneous environments, ranging from cloud systems to IoT devices, and critique the limitations of curren...
Jun 16, 2025•57 min•Season 6 Ep. 22
In this episode of the DuckDB in Research series, Harry Gavriilidis (PhD student at TU Berlin) joins us to discuss Sheet Reader — a high-performance spreadsheet parser that dramatically outpaces traditional tools in both speed and memory efficiency. By taking advantage of the standardized structure of spreadsheet files and bypassing generic XML parsers, Sheet Reader delivers fast and lightweight parsing, even on large files. Now available as a DuckDB extension, it enables users to query spreadsh...
Apr 17, 2025•41 min•Season 10 Ep. 6
In this episode of the DuckDB in Research series, we’re joined by Arjen de Vries, Professor of Data Science at Radboud University. Arjen dives into his team’s development of a DuckDB extension for FAISS, a library originally developed at Facebook for efficient similarity search and vector operations. We explore the growing importance of embeddings and dense retrieval in modern information retrieval systems, and how DuckDB’s zero-copy architecture and tight integration with the Python ecosystem m...
Apr 10, 2025•46 min•Season 10 Ep. 5
In this episode, we sit down with David Justen to discuss his work on "POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance", which was implemented in DuckDB. David shares his journey in the database space, insights into performance optimization, and the challenges of working with modern analytical workloads. We dive into the intricacies of query compilation, vectorized execution, and how DuckDB is shaping the future of in-memory databases. Tune in for a deep dive int...
Apr 03, 2025•51 min•Season 10 Ep. 4
In this episode, we sit down with Daniël ten Wolde, a PhD researcher at CWI’s Database Architectures Group, to explore DuckPGQ, an extension to DuckDB that brings powerful graph querying capabilities to relational databases. Daniël shares his journey into database research, the motivations behind DuckPGQ, and how it simplifies working with graph data. We also dive into the technical challenges of implementing SQL Property Graph Queries (SQL PGQ) in DuckDB, discuss performance benchmarks, and expl...
Mar 20, 2025•49 min•Season 10 Ep. 3
In this episode we kick off our DuckDB in Research series with Till Döhmen, a software engineer at MotherDuck, where he leads AI efforts. Till shares insights into DuckDQ, a Python library designed for efficient data quality validation in machine learning pipelines, leveraging DuckDB’s high-performance querying capabilities. We discuss the challenges of ensuring data integrity in ML workflows, the inefficiencies of existing solutions, and how DuckDQ provides a lightweight, drop-in replacement t...
Mar 13, 2025•58 min•Season 10 Ep. 2
Hey folks! We have been collaborating with everyone's favourite in-process SQL OLAP database management system, DuckDB, to bring you a new podcast series: the DuckDB in Research series! At Disseminate our mission is to bridge the gap between research and industry by exploring research that has a real-world impact. DuckDB embodies this synergy: decades of research underpin its design, and now it’s making waves in the research community as a platform for others to build on, and this is what the serie...
Mar 06, 2025•3 min•Season 10 Ep. 1
In this High Impact in Databases episode we talk to Anastasia Ailamaki. Anastasia is a Professor of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne (EPFL). Tune in to hear Anastasia's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. You can find Anastasia on: Homepage Google Scholar LinkedIn Hosted on Acast. See acast.com/privacy for more information....
Mar 03, 2025•46 min•Season 7 Ep. 10
In this episode, we chat with Anastasiia Kozar about her research on fault tolerance in resource-constrained environments. As IoT applications leverage sensors, edge devices, and cloud infrastructure, ensuring system reliability at the edge poses unique challenges. Unlike the cloud, edge devices operate without persistent backups or high availability standards, leading to increased vulnerability to failures. Anastasiia explains how traditional methods fall short, as they fail to align resource a...
Dec 16, 2024•49 min•Season 6 Ep. 21
In this episode, we chat with Liana Patel to discuss ACORN, a groundbreaking method for hybrid search in applications using mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to maintain efficiency and flexibility. Liana explains how ACORN, leveraging Hierarchical Navigable Small Worlds (HNSW), enables efficient, predicate-agnostic searches by introducing innovative predicate subgraph tra...
Nov 11, 2024•53 min•Season 6 Ep. 20
In this High Impact episode we talk to David Maier. David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. You can find David on: Homepage Google Scholar Hosted on Acast. See acast.com/privacy for more information....
Nov 04, 2024•1 hr 2 min•Season 7 Ep. 9
In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data la...
Oct 28, 2024•31 min•Season 6 Ep. 19
In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work. Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Links: EPIC Data Lab Answering Queries using Humans, Algorithms and Databases (CIDR'11) Potter’s Wheel: An Interactive Data Cleaning System (VLD...
Oct 21, 2024•59 min•Season 7 Ep. 8
In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional range filters, like Bloom filters, often result in high false positives and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-base...
Oct 14, 2024•37 min•Season 6 Ep. 18
In this High Impact episode we talk to Ali Dasdan, CTO at Zoominfo. Tune in to hear Ali's story and learn about some of his most impactful work such as his work on "Map-Reduce-Merge". The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Materials mentioned on this episode: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD'07) The Art of Doing Science and Engineering: Lear...
Oct 08, 2024•1 hr 3 min•Season 7 Ep. 7
In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from rece...
Jul 22, 2024•52 min•Season 6 Ep. 17
In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities". Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Papers mentioned on this episode: Learned Cardinalities: Estimating Correlated Joins with Deep ...
Jul 15, 2024•53 min•Season 7 Ep. 6
In this episode, we delve into the intersection of software engineering (SE) research and professional practice with experts Marvin Wyrich and Justus Bogner. As LinkedIn stands as the largest professional network globally, it serves as a critical platform for bridging the gap between SE researchers and practitioners. Marvin and Justus explore the dynamics of how research findings are shared and discussed on LinkedIn, providing both quantitative and qualitative insights into the effectiveness of ...
Jul 08, 2024•48 min•Season 6 Ep. 16
In this High Impact episode we talk to Joe Hellerstein. Joe is the Jim Gray Professor of Computer Science at UC Berkeley. Tune in to hear Joe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Hosted on Acast. See acast.com/privacy for more information....
Jul 01, 2024•53 min•Season 7 Ep. 5
In this episode, we chat with Harry Goldstein about Property-Based Testing (PBT). Harry shares insights from interviews with PBT users at Jane Street, highlighting PBT's strengths in testing complex code and boosting developer confidence. Harry also discusses the challenges of writing properties and generating random data, and the difficulties in assessing test effectiveness. He identifies key areas for future improvement, such as performance enhancements and better random input generation. This...
Jun 25, 2024•49 min•Season 6 Ep. 15
In this High Impact episode we talk to Raghu Ramakrishnan. Raghu is CTO for Data and a Technical Fellow at Microsoft. Tune in to hear Raghu's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. Hosted on Acast. See acast.com/privacy for more information....
Jun 17, 2024•24 min•Season 7 Ep. 4
Join us as we chat with Gina Yuan about her pioneering work on sidekick protocols, designed to enhance the performance of encrypted transport protocols like QUIC and WebRTC. These protocols ensure privacy but limit in-network innovations. Gina explains how sidekick protocols allow intermediaries to assist endpoints without compromising encryption. Discover how Gina tackles the challenge of referencing opaque packets with her innovative quACK tool and learn about the real-world benefits, includin...
Jun 10, 2024•55 min•Season 6 Ep. 14
Welcome to another episode of the High Impact series - today we talk with Moshe Vardi! Moshe is the Karen George Distinguished Service Professor in Computational Engineering at Rice University where his research focuses on automated reasoning. Tune in to hear Moshe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust. You can find Moshe on X, Linked...
Jun 03, 2024•48 min•Season 7 Ep. 3
In this episode, we dip our toes into the world of sustainable computing and interview Tammy Sukprasert about her research on reducing carbon emissions in cloud computing through workload scheduling. Tammy explores the concept of shifting cloud workloads across different times and locations to coincide with low-carbon energy availability. Unlike previous studies that focused on specific regions or workloads, her comprehensive analysis uses carbon intensity data from 123 regions to assess both ba...
May 27, 2024•33 min•Season 6 Ep. 13
Welcome to the first episode of the High Impact series! The High Impact series is inspired by a blog post “Most Influential Database Papers” by Ryan Marcus and today we talk to Ryan! Tune in to hear about Ryan's story so far. We chat about his current work before moving on to discuss his most impactful work. We also dig into what motivates him and how he handles setbacks, as well as getting his take on the current trends. The podcast is proudly sponsored by Pometry, the developers behind Raphto...
May 20, 2024•1 hr•Season 7 Ep. 2
In this episode, we explore the world of caching with Yazhuo Zhang, who introduces the game-changing SIEVE algorithm. Traditional eviction algorithms have long struggled with a trade-off between efficiency, throughput, and simplicity. However, SIEVE disrupts this balance by offering a simpler alternative to LRU while outperforming state-of-the-art algorithms in both efficiency and scalability for web cache workloads. Implemented in five production cache libraries with minimal code changes, SIEVE...
May 13, 2024•43 min•Season 6 Ep. 12
Introducing the High Impact Series! Hey folks, we have a new series coming soon inspired by a blog post “Most Influential Database Papers” by Ryan Marcus. The series will feature interviews with the authors of some of the most impactful work in the field of databases. We will talk about the story behind some of their most impactful work, getting them to reflect on the impact it has had over the years, as well as getting their take on the current trends in the field. Proudly sponsored by Pometry H...
May 06, 2024•3 min•Season 7 Ep. 1
In this episode, we talk to Eleni Zapridou and delve into the challenges of data processing within enterprises, where multiple applications operate concurrently on shared resources. Traditional resource boundaries between applications often lead to increased costs and resource consumption. However, as Eleni explains, the principle of functional isolation offers a solution by combining cross-task optimizations with performance isolation. We explore GroupShare, an innovative strategy that reduces C...
Apr 29, 2024•39 min•Season 6 Ep. 11
In this thought-provoking podcast episode, we dive into the world of scalable OLTP (OnLine Transaction Processing) systems with the insightful Pat Helland. As a seasoned expert in the field, Pat shares his insights on the critical role of isolation semantics in the scalability of OLTP systems, emphasizing its significance as the "BIG DEAL." By examining the interface between OLTP databases and applications, particularly through the lens of RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL databases, ...
Apr 15, 2024•1 hr 20 min•Season 6 Ep. 10
In this episode, we talk to Rui Liu and explore the transformative potential of Ratchet, a groundbreaking resource-adaptive query execution framework. We delve into the challenges posed by ephemeral resources in modern cloud environments and the innovative solutions offered by Ratchet. Rui guides us through the intricacies of Ratchet's design, highlighting its ability to enable adaptive query suspension and resumption, sophisticated resource arbitration for diverse workloads, and a fine-grained ...
Apr 01, 2024•54 min•Season 6 Ep. 9