In this episode, Yifei Yang introduces predicate transfer, a revolutionary method for optimizing join performance in databases. Predicate transfer builds on Bloom joins, extending their benefits to multi-table joins. Inspired by Yannakakis's theoretical insights, predicate transfer leverages Bloom filters to achieve significant speed improvements. Yang's evaluation shows an average 3.3× performance boost over Bloom join on the TPC-H benchmark, highlighting the potential of predicate transfer to re...
Mar 18, 2024•48 min•Season 6 Ep. 8
In this episode, Vikramank Singh introduces the Panda framework, aimed at refining Large Language Models' (LLMs) capability to address database performance issues. Vikramank elaborates on Panda's four components—Grounding, Verification, Affordance, and Feedback—illustrating how they collaborate to contextualize LLM responses and deliver actionable recommendations. By bridging the divide between technical knowledge and practical troubleshooting needs, Panda has the potential to revolutionize data...
Mar 04, 2024•1 hr 8 min•Season 6 Ep. 7
In this episode, Tamer Eldeeb sheds light on the challenges faced by geo-distributed database management systems (DBMSes) in supporting strictly-serializable transactions across multiple regions. He discusses the compromises often made between low-latency regional writes and restricted programming models in existing DBMS solutions. Tamer introduces Chablis, a groundbreaking geo-distributed, multi-versioned transactional key-value store designed to overcome these limitations. Chablis offers a gen...
Feb 12, 2024•1 hr 2 min•Season 6 Ep. 6
Summary: In this episode, we chat to Matt Butrovich about his research on database proxies. We discuss the inefficiencies of traditional database proxies, which operate in user-space, causing overhead due to buffer copying and system calls. Matt introduces "user-bypass" which leverages Linux's eBPF infrastructure to move application logic into kernel-space. Matt then tells us about Tigger, a PostgreSQL-compatible DBMS proxy, showcasing user-bypass benefits. Tune in to hear about the experiments ...
Dec 18, 2023•1 hr 4 min•Season 6 Ep. 5
Summary: In this episode, Gábor Szárnyas takes us on a journey through the LDBC Social Network Benchmark's Business Intelligence workload (SNB BI). Developed through collaboration between academia and industry, the SNB BI is a comprehensive graph OLAP benchmark. It pushes the boundaries of synthetic and scalable analytical database benchmarks, featuring a sophisticated data generator and a temporal graph with small-world phenomena. The benchmark's query workload, rooted in LDBC's innovative desig...
Dec 04, 2023•47 min•Season 6 Ep. 4
Summary: In this week's episode, we talk with Thaleia Doudali and explore the realm of cloud resource forecasting, focusing on the use of Long Short Term Memory (LSTM) neural networks, a popular machine learning model. Drawing from her research, Thaleia discusses the surprising discovery that, despite the complexity of ML models, accurate predictions often boil down to a simple shift of values by one time step. The discussion explores the nuances of time series data, encompassing resource metric...
Nov 20, 2023•49 min•Season 6 Ep. 3
Summary: In this episode, Jinkun Geng talks to us about Nezha, a high-performance consensus protocol. Nezha can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, which provide better performance but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have. Tune in...
Oct 23, 2023•55 min•Season 6 Ep. 2
Summary: In this episode, Dimitris Koutsoukos talks to us about Persistent or Non-Volatile Memory (PMEM) and we answer the question: Is it Not Very Meaningful for Databases? PMEM offers expanded memory capacity and faster access to persistent storage. However, (before Dimitris's work) there was no comprehensive empirical analysis of existing database engines under different PMEM modes, to understand how databases can benefit from the various hardware configurations. Dimitris and his colleagues ha...
Oct 09, 2023•49 min•Season 6 Ep. 1
Summary: Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on b...
Sep 11, 2023•43 min•Season 5 Ep. 10
Summary: In this episode, Cuong Nguyen tells us about Detock, a geographically replicated database system. Tune in to learn about its specialised concurrency control and deadlock resolution protocols, which enable processing strictly-serializable multi-region transactions with near-zero performance degradation at extremely high conflict and improve latency by up to a factor of 5. Links: SIGMOD Paper, Detock GitHub Repo, Cuong's Homepage
Aug 28, 2023•37 min•Season 5 Ep. 9
Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug (thread-safety violations) with low overhead, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question. In this episode, Bogdan Stoica tells us about how he answered this question by focusing...
Aug 14, 2023•56 min•Season 5 Ep. 8
Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MariusGNN works and just how fast it goes (and how much more cost-efficient it is!). Links: Marius Project, Roger's Homepage, Roger's Twitter, EuroSys'23 Paper. Support the podcast through Buy Me a Coffee.
Jul 31, 2023•1 hr 13 min•Season 5 Ep. 7
Summary: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Mad...
Jul 17, 2023•46 min•Season 5 Ep. 6
Summary: Compared to hard disk drives (HDDs), solid-state drives (SSDs) have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). However, database operators are often designed without considering storage asymmetry and concurrency, resulting in device underutilization. In this episode, Tarikul Islam Papon tells us about his work on a new Asymmetry &...
Jun 20, 2023•47 min•Season 5 Ep. 5
Summary: Snapshot isolation is supported by most commercial databases and is widely used by applications. However, checking whether a database ensures snapshot isolation for a given set of transactions is either slow or gives up soundness. In this episode, Jian Zhang tells us about VIPER, an SI checker that is sound, complete, and fast. Tune in to learn more! Links: Paper, GitHub repo, Jian's homepage
Jun 09, 2023•43 min•Season 5 Ep. 4
Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and ...
May 26, 2023•59 min•Season 5 Ep. 3
Summary: Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out ...
May 11, 2023•59 min•Season 5 Ep. 2
Summary: The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today’s data marketplaces remain mere data catalogs. In this episode, Andra tells us about her vision for marketplaces of the future which require a set of value-added services, such as advanced search and discovery. Also, she tells us about her and her team's effort to engineer and develop an open-source modular...
Apr 25, 2023•46 min•Season 5 Ep. 1
Summary: Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this episode, Laurens Kuiper tells us about his work filling this gap! Tune in to hear about micro-benchmarks that explore how to sort relational data efficiently for analytical database systems, taking into account different query execution engines as w...
Apr 12, 2023•55 min•Season 4 Ep. 10
Summary: In this episode, Semih Salihoğlu tells us about Kùzu, an in-process property graph database management system built for query speed and scalability. Listen to hear the vision for Kùzu and to learn more about Kùzu's factorized query processor! Links: Kùzu GitHub repo, CIDR paper, contact@kuzudb.com, Kùzu Slack, Kùzu Twitter, Kùzu Website (blog posts Semih mentioned can be found here), Semih's Homepage, Semih's Twitter
Apr 03, 2023•1 hr 17 min•Season 4 Ep. 9
Summary: Today’s storage landscape offers a deep and heterogeneous stack of technologies that promises to meet even the most demanding data-intensive workload needs. The diversity of technologies, however, presents a challenge. Parts of it are not controlled directly by the application, e.g., the cache layers, and the parts that are controlled often require the programmer to deal with very different transfer mechanisms, such as disk and network APIs. Combining these different abstractions prope...
Mar 28, 2023•50 min•Season 4 Ep. 8
Summary: Today’s organizations utilize a plethora of heterogeneous and autonomous DBMSes, many of those being spread across different geo-locations. It is therefore crucial to have effective and efficient cross-database query processing capabilities. In this episode, Haralampos Gavriilidis tells us about XDB, an efficient middleware system that runs cross-database analytics over existing DBMSes. Tune in to learn more! Links: Preprint, Haralampos's homepage. Support the podcast here!
Mar 20, 2023•1 hr 1 min•Season 4 Ep. 7
Summary: This week, Paras Jain and Sarah Wooders tell us how you can quickly transfer data between clouds with Skyplane. Tune in to learn more! Links: Skyplane homepage, Sarah's homepage, Paras's homepage. Support the podcast here.
Mar 13, 2023•46 min•Season 4 Ep. 6
Summary: Many database applications execute transactions under a weaker isolation level, such as READ COMMITTED. This often leads to concurrency bugs that look like race conditions in multi-threaded programs. While this problem is well known, philosophies of how to address this problem vary a lot, ranging from making a SERIALIZABLE database faster to living with weaker isolation and the consequence of concurrency bugs. In this episode, Yang talks about the consequences of these bugs, the root ca...
Mar 06, 2023•56 min•Season 4 Ep. 5
Summary: Agreement protocols have been extensively used by distributed data management systems to provide robustness and high availability. The broad spectrum of design dimensions, applications, and fault models has resulted in different flavours of agreement protocols. This has made it hard to argue their correctness and has unintentionally created a disparity in understanding their design. In this episode, Suyash Gupta tells us about a unified framework that simplifies expressing different agr...
Feb 27, 2023•1 hr 4 min•Season 4 Ep. 4
Summary: Many distributed cloud OLTP databases have settled on a shared-storage design coupled with a single writer. This design choice is remarkable since conventional wisdom promotes using a shared-nothing architecture for building scalable systems. In this episode, Tobias revisits the question of what a scalable OLTP design for the cloud should look like by analysing the data access behaviour of different systems. Tune in to find out more! Links: Paper, Website, Email, Twitter, Google Scholar
Feb 20, 2023•55 min•Season 4 Ep. 3
Summary: In this episode, Hamish Nicholson tells us about HetCache, a storage engine for analytical workloads that optimizes the data access paths and tunes data placement by co-optimizing for the combinations of different memories, compute devices, and queries. Specifically, we present how the increasingly complex storage hierarchy impacts analytical query processing in GPU-NVMe-accelerated servers. HetCache accelerates analytics on CPU-GPU servers for larger-than-memory datasets through propor...
Feb 13, 2023•51 min•Season 4 Ep. 2
Summary: Few, if any, DBMSs provide extensibility together with implementations of modern concepts, such as query compilation. This is an impeding factor in academic research. In this episode, Immanuel Haffner presents mutable, a system that is fitted to academic research and education. mutable features a modular design, where individual components can be composed to form a complete system. Check out the episode to learn more! Links: Paper, Website, mutable GitHub repo, Bobby Tables xkcd
Feb 06, 2023•1 hr 28 min•Season 4 Ep. 1
Summary: Recent shell-script parallelization systems enjoy mostly automated speedups by parallelizing scripts ahead-of-time. Unfortunately, such static parallelization is hampered by dynamic behavior pervasive in shell scripts—e.g., variable expansion and command substitution—which often requires reasoning about the current state of the shell and filesystem. Tune in to hear how Konstantinos Kallas and his colleagues overcame this issue (and others) with PaSH-JIT, a just-in-time (JIT) shell-scrip...
Jan 30, 2023•58 min•Season 3 Ep. 5
Summary: Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the memory management unit (MMU) enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large trusted computing bases in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities off...
Jan 23, 2023•36 min•Season 3 Ep. 4