Summary: Far-memory techniques that enable applications to use remote memory are increasingly appealing in modern data centers, supporting applications’ large memory footprints and improving machines’ resource utilization. In this episode Haoran Ma tells us about the problems with current far-memory techniques: they focus on OS-level optimizations and are agnostic to the managed runtimes and garbage collection (GC) underneath applications written in high-level languages. Owing to differen...
Jan 16, 2023•44 min•Season 3 Ep. 3
Summary: In this episode Lexiang Huang talks about a framework for understanding a class of failures in distributed systems called metastable failures. Lexiang tells us about his study on the prevalence of such failures in the wild and how he and his colleagues scoured publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Listen to the episode to find out about his main findings and gain a deeper understanding of metastable failures and h...
Jan 09, 2023•53 min•Season 3 Ep. 2
Summary: Debugging is time-consuming, accounting for roughly 50% of a developer's time. In this episode Andrew Quinn tells us about the OmniTable, an abstraction that captures all execution state as a large queryable data table. In his research Andrew has built a query model around an OmniTable that supports SQL to simplify debugging. An OmniTable decouples debugging logic from the original execution, which SteamDrill, Andrew's prototype, uses to reduce the performance overhead of debugging (Ste...
Jan 02, 2023•58 min•Season 3 Ep. 1
Summary: This episode features Audrey Cheng talking about TAOBench, a new benchmark that captures the social graph workload at Meta. Audrey tells us about the features of the workload, how it compares with other benchmarks, and how it fills a gap in the existing space of benchmarks. Also, we hear all about the fantastic real-world impact the benchmark has already had across a range of companies.
Dec 12, 2022•53 min•Season 2 Ep. 5
Summary: Users have the right to consent to the use of their data, but current methods are limited to very coarse-grained expressions of consent, such as “opt-in/opt-out” choices for certain uses. In this episode, George talks about how he and his group identified the need for fine-grained consent management and how they formalized how to express and manage user consent and personal contracts of data usage in relational databases. Their approach enables data owners to express the intended data usage ...
Dec 05, 2022•56 min•Season 2 Ep. 4
Summary (VLDB abstract): Despite the wide adoption of graph processing across many different application domains, there is no underlying data structure that can serve a variety of graph workloads (analytics, traversals, and pattern matching) on dynamic graphs with transactional updates. In this episode, Per talks about Sortledton, a universal graph data structure that addresses the open problem by being carefully optimized for the most relevant data access patterns used by graph computation ker...
Nov 28, 2022•41 min•Season 2 Ep. 3
Summary (VLDB abstract): Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data afte...
Nov 21, 2022•46 min•Season 2 Ep. 2
Summary: In this episode Kevin Gaffney tells us about SQLite, the most widely deployed database engine in existence. SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. Several factors are likely responsible for its ubiquity, including its in-process design, standalone codebase, extensive test suite, and cross-platform file format. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), e...
Nov 14, 2022•48 min•Season 2 Ep. 1
Summary: In this episode Matthias Jasny from TU Darmstadt talks about P4DB, a database that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. P4DB provides significant benefits compared to traditional DBMS architectures and can achieve a...
Aug 08, 2022•27 min•Season 1 Ep. 10
Summary: In this episode Tobias talks about his work on ScaleStore, a distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore stores cold data on NVMe...
Aug 01, 2022•23 min•Season 1 Ep. 9
Summary: Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. In this episode, Chuzhe tells us about these ad hoc transactions: database operations coordinated by application code. Until Chuzhe’s work, little was known about them. In this episode he chats about the first comprehensive study on ad hoc transactions. By studying 91 ad hoc transac...
Jul 25, 2022•32 min•Season 1 Ep. 8
Summary: Enterprises use distributed database systems to meet the demands of mixed or hybrid transaction/analytical processing (HTAP) workloads that contain both transactional (OLTP) and analytical (OLAP) requests. Distributed HTAP systems typically maintain a complete copy of data in row-oriented storage format that is well-suited for OLTP workloads and a second complete copy in column-oriented storage format optimised for OLAP workloads. Maintaining these data copies consumes significant stora...
Jul 18, 2022•28 min•Season 1 Ep. 7
Summary: Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache as well as allocating a suitable cluster configuration for caching these datasets play a crucial role in achieving optimal performance. In practice, both are tedious, time-consuming tasks and are often neglected by end users, who are typically not aware of workload semantics, sizes of int...
Jul 11, 2022•32 min•Season 1 Ep. 6
Summary: The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this interview, Thomas talks about how he addressed the problem of JSON similarity lookup queries: given a query document and a distance threshold, retrieve all documents that are within the threshold from the query document, i.e., “get me all similar documents!” Unlike other hierarchical formats such as XML, JSON supports both ordered and unordered ...
Jul 08, 2022•12 min•Season 1 Ep. 4
Summary: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high-quality training data, most of the fairness literature ignores this stage. In this interview Sainyam discusses why he focuses on fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. ...
Jul 08, 2022•12 min•Season 1 Ep. 5
Summary: A climate network represents the global climate system by the interactions of a set of anomaly time-series. Network science has been applied to climate data to study the dynamics of a climate network. The core task and first step to enable interactive network science on climate data is the efficient construction and update of a climate network on user-defined time-windows. In this interview Draco talks about TSUBASA, an algorithm for the efficient construction of climate networks based ...
Jul 04, 2022•17 min•Season 1 Ep. 3
Summary: In this interview Felix discusses "historical what-if queries", a novel type of what-if analysis that determines the effect of a hypothetical change to the transactional history of a database. For example, “how would revenue be affected if we would have charged an additional $6 for shipping?” In his research Felix has developed efficient techniques for answering these historical what-if queries, i.e., determining how a modified history affects the current database state. During the show...
Jul 01, 2022•19 min•Season 1 Ep. 2
Summary: Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and inter-connects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed in order to train increasingly complex models is growing. As a c...
Jun 27, 2022•25 min•Season 1 Ep. 1
Welcome to Disseminate! The podcast bringing you the cutting edge of Computer Science research in a digestible format. Each series will focus on papers published at a specific Computer Science conference, e.g., SIGMOD, CVPR, so we will cover a wide range of topics from distributed systems to computer vision. Each episode within a series will feature an interview with the author(s) of a paper published at that conference. The podcast aims to be an alternative source of information for industry p...
Jun 03, 2022•2 min