Data Engineering Podcast

Tobias Macey•www.dataengineeringpodcast.com

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

Summary There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are fac...

Oct 01, 2018•53 min•Ep. 50

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

Summary As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of a...

Sep 24, 2018•50 min•Ep. 49

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

Summary Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the proje...

Sep 17, 2018•48 min•Ep. 48

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

Summary Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data and make it usable in S3, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, a...

Sep 10, 2018•48 min•Ep. 47

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

Summary With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark...

Sep 03, 2018•47 min•Ep. 46

Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

Summary There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data. Preamble Hello and welcome to the Data Engineering Podcast, the s...

Aug 27, 2018•25 min•Ep. 45

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

Summary The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph. However, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start us...

Aug 20, 2018•43 min•Ep. 44

Putting Airflow Into Production With James Meickle - Episode 43

Summary The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design ...

Aug 13, 2018•48 min•Ep. 43

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

Summary One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Post...

Aug 06, 2018•56 min•Ep. 42

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

Summary With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of s...

Jul 30, 2018•29 min•Ep. 41

Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

Summary When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of...

Jul 16, 2018•49 min•Ep. 40

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

Summary Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss...

Jul 08, 2018•1 hr 4 min•Ep. 39

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

Summary Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in...

Jul 02, 2018•46 min•Ep. 38

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

Summary Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collabo...

Jun 25, 2018•42 min•Ep. 37

User Analytics In Depth At Heap with Dan Robinson - Episode 36

Summary Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction, having to write custom logic to add that event, and then waiting to collect data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of ...

Jun 17, 2018•45 min•Ep. 36

CockroachDB In Depth with Peter Mattis - Episode 35

Summary With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Matt...

Jun 11, 2018•44 min•Ep. 35

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

Summary Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steeman and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today. Preamble Hello and welcome to the Data Engineering Podcast, the show ...

Jun 04, 2018•40 min•Ep. 34

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

Summary Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform address...

May 28, 2018•48 min•Ep. 33

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

Summary Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In th...

May 21, 2018•42 min•Ep. 32

Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how gra...

May 14, 2018•26 min•Ep. 31

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine...

May 07, 2018•33 min•Ep. 30

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

Summary Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used ...

Apr 30, 2018•45 min•Ep. 29

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

Summary The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting ...

Apr 23, 2018•40 min•Ep. 28

Data Engineering Weekly with Joe Crobak - Episode 27

Summary The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discus...

Apr 15, 2018•44 min•Ep. 27

Defining DataOps with Chris Bergh - Episode 26

Summary Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting practices and principles from lean manufacturing and agile software development, and the cross-functional collaboration, feedback loops, and focus on automation in the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to y...

Apr 08, 2018•55 min•Ep. 26

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

Summary Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate and monitors for unexpected anomalies in behavior that would indicate a breach and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infra...

Apr 01, 2018•52 min•Ep. 25

MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24

Summary The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operat...

Mar 25, 2018•33 min•Ep. 24

Stretching The Elastic Stack with Philipp Krenn - Episode 23

Summary Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data. Preamble Hello and...

Mar 19, 2018•51 min•Ep. 23

Database Refactoring Patterns with Pramod Sadalage - Episode 22

Summary As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he ...

Mar 12, 2018•49 min•Ep. 22

The Future Data Economy with Roger Chen - Episode 21

Summary Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an ...

Mar 05, 2018•43 min•Ep. 21
