Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Zenlytic Is Building You A Better Coworker With AI Agents

Summary The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving lands...

May 19, 2024•54 min•Ep 426•Transcript available on Metacast

Release Management For Data Platform Services And Logic

Summary Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem. Announcements Hello and welcome...

May 12, 2024•20 min•Ep 425•Transcript available on Metacast

Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach

Summary Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the approach of cognitive AI. In this episode he explains his approach to building AI in a more human-like fashion and the emphasis on learning rather than st...

May 05, 2024•54 min•Ep 424•Transcript available on Metacast

Build Your Second Brain One Piece At A Time

Summary Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the colle...

Apr 28, 2024•50 min•Ep 423•Transcript available on Metacast

Making Email Better With AI At Shortwave

Summary Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers. Announcements Hello...

Apr 21, 2024•54 min•Ep 422•Transcript available on Metacast

Designing A Non-Relational Database Engine

Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episo...

Apr 14, 2024•1 hr 16 min•Ep 421•Transcript available on Metacast

Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer

Summary Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component ...

Apr 07, 2024•56 min•Ep 420•Transcript available on Metacast

Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary

Summary Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust in the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this ep...

Mar 31, 2024•51 min•Ep 419•Transcript available on Metacast

Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+

Summary A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they ena...

Mar 24, 2024•56 min•Ep 418•Transcript available on Metacast

Reconciling The Data In Your Databases With Datafold

Summary A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data. Announcements Hello and welco...

Mar 17, 2024•58 min•Ep 417•Transcript available on Metacast

Version Your Data Lakehouse Like Your Software With Nessie

Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning...

Mar 10, 2024•41 min•Ep 416•Transcript available on Metacast

When And How To Conduct An AI Program

Summary Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managem...

Mar 03, 2024•46 min•Ep 415•Transcript available on Metacast

Find Out About The Technology Behind The Latest PFAD In Analytical Database Development

Summary Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache A...

Feb 25, 2024•56 min•Ep 414•Transcript available on Metacast

Using Trino And Iceberg As The Foundation Of Your Data Lakehouse

Summary A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and sc...

Feb 18, 2024•59 min•Ep 413•Transcript available on Metacast

Data Sharing Across Business And Platform Boundaries

Summary Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building ...

Feb 11, 2024•1 hr•Ep 412•Transcript available on Metacast

Tackling Real Time Streaming Data With SQL Using RisingWave

Summary Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable. Announcements Hello and welcome to the Data Engineering Podcast, the show abo...

Feb 04, 2024•57 min•Ep 411•Transcript available on Metacast

Build A Data Lake For Your Security Logs With Scanner

Summary Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get starte...

Jan 29, 2024•1 hr 3 min•Ep 410•Transcript available on Metacast

Modern Customer Data Platform Principles

Summary Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern c...

Jan 22, 2024•1 hr 2 min•Ep 409•Transcript available on Metacast

Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel

Summary Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode h...

Jan 07, 2024•50 min•Ep 408•Transcript available on Metacast

Designing Data Platforms For Fintech Companies

Summary Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchack, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality d...

Jan 01, 2024•48 min•Ep 407•Transcript available on Metacast

Troubleshooting Kafka In Production

Summary Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitiga...

Dec 24, 2023•1 hr 15 min•Ep 406•Transcript available on Metacast

Adding An Easy Mode For The Modern Data Stack With 5X

Summary The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the...

Dec 18, 2023•56 min•Ep 405•Transcript available on Metacast

Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Summary If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project....

Dec 11, 2023•51 min•Ep 404•Transcript available on Metacast

Designing Data Transfer Systems That Scale

Summary The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud. Announcements ...

Dec 04, 2023•1 hr 4 min•Ep 403•Transcript available on Metacast

Addressing The Challenges Of Component Integration In Data Platform Architectures

Summary Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being ma...

Nov 27, 2023•30 min•Ep 402•Transcript available on Metacast

Unlocking Your dbt Projects With Practical Advice For Practitioners

Summary The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Data projects are notoriously complex. With m...

Nov 20, 2023•1 hr 16 min•Ep 401•Transcript available on Metacast

Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

Summary Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when t...

Nov 13, 2023•1 hr 8 min•Ep 400•Transcript available on Metacast

Shining Some Light In The Black Box Of PostgreSQL Performance

Summary Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly. Announcements Hello and welcome to the Data Engi...

Nov 06, 2023•55 min•Ep 399•Transcript available on Metacast

Surveying The Market Of Database Products

Summary Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection. Announcements Hello and welcome to the Data Engineering Podcast, the show ...

Oct 30, 2023•47 min•Ep 398•Transcript available on Metacast

Defining A Strategy For Your Data Products

Summary The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management...

Oct 23, 2023•1 hr 4 min•Ep 397•Transcript available on Metacast