What is Data Engineering Podcast about?

Data Engineering Podcast covers the tools, architectures, practices, and organizational challenges behind modern data management. Topics include databases, orchestration, lakehouses, AI systems, governance, observability, and analytics platforms.

Who hosts Data Engineering Podcast?

The show is hosted by Tobias Macey, who interviews engineers, founders, product leaders, and other specialists working in data infrastructure and data management.

How long are Data Engineering Podcast episodes?

Most episodes run about 45 to 70 minutes, with recent episodes commonly landing around the one-hour mark.

Where can I listen to Data Engineering Podcast?

You can listen through major podcast apps using the show's RSS feed, and episodes are also available from the official Data Engineering Podcast website.

How often does Data Engineering Podcast release?

The show generally releases weekly, although the recent episode list includes some gaps of several weeks. Each episode is a standalone interview focused on current data engineering tools and practices.

Data Engineering Podcast

Tobias Macey · 512 episodes

Education · Technology · Expert interviews · Data platforms · Weekly · 45-70 min/ep · Standalone episodes · Practitioner-focused

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Why listen

Data Engineering Podcast gives working data professionals a steady stream of practical architecture conversations with people building the tools, platforms, and operating models behind modern data systems. Tobias Macey keeps the format focused on implementation tradeoffs, from lakehouses and orchestration to AI agents, governance, observability, and platform economics. It is best for engineers, architects, analytics leaders, and technically curious builders who want to hear how real teams reason through production data problems.

Episodes

52 min

Jun 8, 2026 · Episode 511

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

Summary

In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, text-to-transformations, and text-to-dashboard workflows. He also digs into on-prem versus SaaS deployments, domain-specialized agents for privacy and accuracy, code blocks for custom Python/SQL, and the roadmap for a marketplace and desktop assistant. Shravan highlights how Kaarvi compresses weeks of work into hours and bridges the gap between business users and data engineers by turning AI into a dependable force multiplier.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idempotency, watermarks, dead letter queues, change data capture, and backpressure. Every question comes from real Data Engineer interview loops at Google, Amazon, Meta, Stripe, Databricks, Netflix, and Airbnb. Go to dataengineeringpodcast.com/datadriven today to start practicing.
Your host is Tobias Macey and today I'm interviewing Shravan Gunda about building an agent-driven data platform at Kaarvi

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Kaarvi is and the story behind it?
"AI" is a very broad term that encompasses numerous possible implementations. Can you give some more detail about the different types and applications of AI in Kaarvi's architecture?
What are some of the core assumptions of data workflows that need to be reconsidered when AI is embedded in the execution path?
What are the most interesting, innovative, or unexpected ways that you have seen Kaarvi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kaarvi?
When is Kaarvi the wrong choice?
What do you have planned for the future of Kaarvi?

Contact Info

54 min

Jun 1, 2026 · Episode 510

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Summary

In this episode Weimo Liu, co-founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB, without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications, from cybersecurity log analysis to entity resolution and agentic workflows; when to choose embedded or transactional graph databases instead; and what’s next for enterprise features and broader warehouse integrations.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Weimo Liu about the engineering behind PuppyGraph's zero-copy ETL for querying your lakehouse as a graph

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what PuppyGraph is and the story behind it?
What are some of the key use cases that people are turning to PuppyGraph and graph data models for?
Graph engines have struggled to take off for several years, not least of which is due to the difficulty of scaling them to large data volumes as a result of the topological nature of the data. Can you describe the architecture of PuppyGraph and some of the wa

58 min

May 6, 2026 · Episode 509

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

Summary

In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; where Ray fits compared to Spark and workflow orchestrators; and why Ray excels at composing heterogeneous pools of compute, handling failures, and scaling complex systems like multi-node LLM inference and reinforcement learning. He digs into practical strategies for boosting GPU utilization across training and inference, elasticity and prioritization of workloads, topology-aware scheduling, and the importance of fast failure recovery as hardware scales from nodes to racks. If you’re wrestling with expensive GPUs, multimodal data curation, or cross-node LLM inference, this conversation offers concrete mental models and architectural guidance.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Robert Nishihara about the challenges of maximizing the utility of your available hardware for AI applications

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the major contributors to wasted or idle compute?
Why does it matter if the available compute isn't being maximized?
What are some of the typical ad-hoc methods that teams might use to try to get the most out of their available hardware (especially GPUs)?
What are the most interesting, innovative, or unexpected ways that you have seen Ray used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ray and distributed compute for data and AI?
When is Ray the wrong choice?
What do you have planned for the future of Ray?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up f

59 min

Apr 7, 2026 · Episode 508

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

Summary

In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations, and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026

Interview

Introduction
How did you get involved in the area of data management?
What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
What are the new guardrails/constraints/workflows that need to be in

50 min

Mar 29, 2026 · Episode 507

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

Summary

In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations - event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers - needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption based business models

Interview

Introduction
How did you get involved in managing the data products or data management?
Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?
What are the unique operational challenges that are presented by having consumption as the unit of cost?
How does the availability and accessibility of metering data impact the lev

42 min

Mar 22, 2026 · Episode 506

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

Summary

In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what your interest is in reproducibility of scientific research?
What role does data play in the set of challenges that plague reproducibility of published research?
What are some of the notable changes in the areas of scientific process, and data systems that have contributed to the current crisis of reproducibility?
Beyond technological

1 hr 1 min

Mar 16, 2026 · Episode 505

Beyond Prompts: Practical Paths to Self‑Improving AI

Summary

In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self-improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems and how they enable AI scalability in real production environments.

Interview

Introduction
How did you get involved in AI/ML?
Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent? How would you differentiate between an agent/agentic system vs. a self-improving system?

1 hr 5 min

Mar 8, 2026 · Episode 504

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

Summary

In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics

Interview

Introduction
How did you get involved in the area of data management?
Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?
How

45 min

Mar 2, 2026 · Episode 503

From Models to Momentum: Uniting Architects and Engineers with ER/Studio

Summary

In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define logical models that translate into physical designs and code across warehouses and analytics platforms, while maintaining traceability and governance. The conversation also touches on how AI increases the tolerance for ambiguity, but doesn't fix unclear definitions - it amplifies them. Jamie and Ryan describe ER/Studio's integrations with governance tools, collaboration features like TeamServer, reverse engineering, and metadata bridges, as well as new AI-assisted modeling capabilities. They emphasize that most data problems are meaning problems, and investing in architecture and a semantic backbone can make engineering faster, governance simpler, and analytics more reliable.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Jamie Knowles and Ryan Hirsch about ER/Studio and the foundational role of enterprise data modeling in modern data engineering.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what ER/Studio is and the story behind it? How has it evolved to handle the shift from traditional on-prem databases to modern, complex, and highly regulated enterprise environments?
How do you define "Enterprise Data Architecture" today, and how does it differ from just managing a collection of pipelines in a modern data stack?
In your view, what are the distinct responsibilities of a Data Architect versus a

57 min

Feb 22, 2026 · Episode 502

From Data Models to Mind Models: Designing AI Memory at Scale

Summary

In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovic, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-tenant isolation to ensure safe knowledge sharing or protection. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applications

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the different elements of "memory" in an agentic context?
  storage and retrieval mechanisms
  how to model memories
  how does that change as you go from short-t

50 min

Feb 15, 2026 · Episode 501

Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops

Summary

In this episode of the Data Engineering Podcast, Aman Agarwal, creator of OpenLit, discusses the operational groundwork required to run LLM-powered applications reliably and cost-effectively. He highlights common blind spots that teams face, including opaque model behavior, runaway token costs, and brittle prompt management, and explains how OpenTelemetry-native observability can turn these black-box interactions into stepwise, debuggable traces across models, tools, and data stores. Aman showcases OpenLit's approach to open standards, vendor-neutral integrations, and practical features such as fleet-managed OTEL collectors, zero-code Kubernetes instrumentation, prompt and secret management, and evaluation workflows. They also explore experimentation patterns, routing across models, and closing the loop from evals to prompt/dataset improvements, demonstrating how better visibility reshapes design choices from prototype to production. Aman shares lessons learned building in the open, where OpenLit fits and doesn't, and what's next in context management, security, and ecosystem integrations, providing resources and examples of multi-database observability deployments for listeners.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm interviewing Aman Agarwal about the operational investments that are necessary to ensure you get the most out of your AI models

Interview

Introduction
How did you get involved in the area of AI/data management?
Can you start by giving your assessment of the main blind spots that are common in the existing AI application patterns?
As teams adopt agentic architectures, how common is it to fall prey to those same blind spots?
There are numerous tools/services available now focused on various elements of "LLMOps". What are the major components necessary for a min

46 min

Feb 8, 2026 · Episode 500

From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization

Summary
In this episode, Shilpa Kolhar, SVP of Product and Engineering at MongoDB, discusses using MongoDB as a unified foundation for AI-driven and agentic applications. She explains how the Application Modernization Platform (AMP) accelerates the transition from legacy relational systems to a document-first architecture, driven by the need for AI-readiness and speed of change. Shilpa highlights MongoDB's features, such as its native JSON document model, Atlas Vector Search, auto-embeddings, and integrated search, which help eliminate drift and latency across operational data, indexing, and vectors, emphasizing the importance of keeping context, transactions, and embeddings together for real-time AI use cases. She shares best practices for re-architecting legacy systems, including schema validation and versioning patterns to tame schema drift, aggregation pipelines for consistent reads, and pragmatic standardization across services, while also detailing AMP's approach to scoping large estates and the balance of LLM-powered automation with human-in-the-loop governance.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm interviewing Shilpa Kolhar about using MongoDB as the foundation for AI-driven applications.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MongoDB is and the core primitives that it offers?
The MongoDB engine has gone through substantial evolution since it was first introduced over 20 years ago. What are some of the most notable features that have been added in recent years?
You recently launched the MongoDB Application Modernization Platform (AMP). What are the key elements of modernization that it is focused on?
How do the core primitives of the MongoDB engine align with modernization objectives?
There is a lot of attenti

56 min

Feb 1, 2026 · Episode 499

Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows

Summary
In this episode Tim Sehn, founder and CEO of DoltHub, talks about Dolt - the world’s first version-controlled SQL database - and why Git-style semantics belong at the heart of data systems and AI workflows. Tim explains how Dolt combines a MySQL/Postgres-compatible interface with a novel storage engine built on a “Prollytree” to enable fast, row-level branching, merging, and diffs of both schema and data. He digs into real production use cases: powering applications that expose version control to end users, reproducible ML feature stores, managing massive configuration for games, and enabling safe agentic writes via branch-based review flows. He compares Dolt’s approach to LakeFS, Neon, and PlanetScale, and explores developer workflows unlocked by decentralized clones, full audit logs, and PR-style data reviews.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
Your host is Tobias Macey and today I'm interviewing Tim Sehn about Dolt, a version-controlled database engine, and its applications for agentic workflows.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dolt is and the story behind it?
What are the key use cases that you are focused on solving by adding version control to the database layer?
There are numerous projects related to different aspects of versioning in different data contexts (e.g. LakeFS, Datomic, etc.). What are the versioning semantics that you are focused on?
You position Dolt as "the database for AI". How does data versioning relate to AI use cases?
What types of AI systems are able to make best use of Dolt's versioning capabilities?
Can you describe how Dolt and Doltgres are implemented?
How have the design and scope of the project changed since you first started working on it?
What are some of the architec

40 min

Jan 25, 2026 · Episode 498

Logical First, Physical Second: A Pragmatic Path to Trusted Data

Summary
In this episode of the Data Engineering Podcast, Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semantic models that anchor transactional, analytical, and event-driven systems. The conversation covers strategies for evolving an architecture in tandem with delivery, including defining core concepts, aligning teams through governance, and treating the model as a living product. He also examines how generative AI can both help and harm data architecture, accelerating first drafts but amplifying risk without a human-approved ontology. Jamie emphasizes the importance of doing the hard work upfront to make meaning explicit, keeping models simple and business-aligned, and using tools and patterns to reuse that meaning everywhere.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of da

1 hr 12 min

Jan 18, 2026 · Episode 497

Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability

Summary
In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest via OpenTelemetry, Kafka-backed durability, curated/columnarized tables, and query orchestration - can deliver low-latency, interactive troubleshooting across logs, metrics, and traces at petabyte scale. He also explores the practicalities of loading and organizing telemetry by use case to reduce read amplification, the role of Iceberg (including v3’s JSON shredding) and Snowflake’s implementation, and why open table formats enable “your data in your lake” strategies.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest: we all need to Retool how we handle data requests.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hu

56 min

Jan 12, 2026 · Episode 496

Semantic Operators Meet Dataframes: Building Context for Agents with FENIC

Summary
In this episode Kostas Pardalis talks about Fenic - an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today’s data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explores how Fenic introduces semantic operators (e.g., semantic filter, extract, join) as first-class citizens in the logical plan so the optimizer can reason about inference, costs, and constraints. This enables developers to turn unstructured data into explicit schemas, compose transformations lazily, and offload LLM work safely and efficiently. He digs into Fenic’s architecture (lazy dataframe API, logical/physical plans, Polars execution, DuckDB/Arrow SQL path), how it exposes tools via MCP for agent integration, and where it fits in context engineering as a companion for memory/state management in agentic systems.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data while keeping it all secure. Type a prompt like 'Build me a self-service reporti

49 min

Jan 5, 2026 · Episode 495

Beyond Dashboards: How Data Teams Earn a Seat at the Table

Summary
In this episode Goutham Budati talks about his Data-Perspective-Action framework and how it empowers data teams to become true business partners. Goutham traces his path from automating Excel reports to leading high-impact data organizations, then breaks down why technical excellence alone isn’t enough: teams must pair reliable data systems with deliberate storytelling, clear problem framing, and concrete action plans. He digs into tactics for moving from reactive ticket-taking to proactive influence: weekly one-page narratives, design-first discovery, sampling stakeholders for real pain points, and treating dashboards as living roadmaps. He also explores how to right-size technical scope, preserve trust in core metrics, organize teams as “build” and “storytelling” duos, and translate business macros and micros into resilient system designs.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Your host is Tobias Macey and today I'm interviewing Goutham Budati about his Data-Perspective-Action framework for empowering data teams to be more influential in the business.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Data-Perspective-Action framework is and the story behind it?
What does it look like when someone operates at each of those three levels?
How does that change the day-to-day work of an individual contributor?
Why does technically excellent da

59 min

Dec 29, 2025 · Episode 494

Unfreezing The Data Lake: The Future-Proof File Format

Summary
In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, and extensible. Xinyu explains F3’s two major ideas: a decoupled, flexible layout that separates IO units, dictionary scope, and encoding choices; and self-decoding files that embed WebAssembly kernels so new encodings can be adopted without waiting on every engine to upgrade. He discusses how table formats and file formats should increasingly be decoupled, potential synergies between F3 and table layers (including centralizing and verifying WASM kernels), and future directions such as extending WASM beyond encodings to indexing or filtering.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You’re a developer who wants to innovate; instead, you’re stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It’s a flexible, unified platform that’s built for developers, by developers. MongoDB is ACID compliant, Enterprise-ready, with the capabilities you need to ship AI apps, fast. That’s why so many of the Fortune 500 trust MongoDB with their most critical workloads. Ready to think outside rows and columns? Start building at MongoDB.com/Build
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Xinyu Zeng about the future-proof file format.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the F3 project is and the story behind it?
We have several widely adopted file formats (Parquet, ORC, Avro, etc.). Why do we keep creating new ones?
Parquet is the format with perhaps the broadest adoption. What are the

1 hr 6 min

Dec 21, 2025 · Episode 493

From Context to Semantics: How Metadata Powers Agentic AI

Summary
In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observability, and governance in one workflow. They share how AI agents can now automate documentation, classification, data quality testing, and enforcement of policies, and why aligning governance with user identity and intent is essential as agentic access scales. They also dig into scalability strategies, MCP-based agent workflows, AI governance (including model/agent tracking), and the emerging convergence of big data with ontologies to deliver machine-understandable meaning.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin han

26 min

Dec 14, 2025 · Episode 492

From Data Engineering to AI Engineering: Where the Lines Blur

Summary
In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The episode covers how unstructured data is becoming more prominent, vectors and knowledge graphs are emerging as key components, and reliability expectations are changing due to interactive user-facing AI. The host also delves into process changes, including tighter collaboration, faster dataset onboarding, new governance and access controls, and the importance of treating experimentation and evaluation as fundamental testing practices.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Dataf

58 min

Dec 8, 2025 · Episode 491

Malloy: Hierarchical Data, Semantic Models, and the Future of Analytics

Summary
In this episode Michael Toy, co-creator of Malloy, talks about rethinking how we work with data beyond SQL. Michael shares the origins of Malloy from his and Lloyd Tabb’s experience at Looker, why SQL’s mental model often fights human problem solving, and how Malloy aims to be a composable, maintainable language that treats SQL as the assembly layer rather than something humans should write. He explores Malloy’s core ideas: semantic modeling tightly coupled with a query language, hierarchical data as the default mental model, and preserving context so analysis stays interactive and open-ended. He also digs into the developer experience and ecosystem: Malloy’s TypeScript implementation, VS Code integration, CLI, emerging notebook support, and how Malloy can sit alongside or replace parts of existing transformation workflows. Michael discusses practical trade-offs in language design, the surprising fit for LLM-generated queries, and near-term roadmap areas like dimensional filtering, better aggregation strategies across levels, and closing gaps that still require escaping to SQL. He closes with an invitation to contribute to the open-source project and help shape its evolution.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/dataf

1 hr

Nov 24, 2025 · Episode 490

Blurring Lines: Data, AI, and the New Playbook for Team Velocity

Summary
In this crossover episode, Max Beauchemin explores how multiplayer, multi-agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just-in-time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard-won practices from going “AI-first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear (code review, QA, async coordination) when execution accelerates 2–10x. He also dives deep into Agor, his open-source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub-sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to

51 min

Nov 16, 2025Episode 489

State, Scale, and Signals: Rethinking Orchestration with Durable Execution

Summary

In this episode, Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model (workflows, activities, task queues, and replay) and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.

51 min

Nov 9, 2025Episode 488

The AI Data Paradox: High Trust in Models, Low Trust in Data

Summary

In this episode of the Data Engineering Podcast, Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself.

50 min

Nov 2, 2025Episode 487

Bridging the AI–Data Gap: Collect, Curate, Serve

Summary

In this episode of the Data Engineering Podcast, Omri Lifshitz (CTO) and Ido Bronstein (CEO) of Upriver talk about the growing gap between AI's demand for high-quality data and organizations' current data practices. They discuss why AI accelerates both the supply and demand sides of data, highlighting that the bottleneck lies in the "middle layer" of curation, semantics, and serving. Omri and Ido outline a three-part framework for making data usable by LLMs and agents (collect, curate, serve) and share the challenges of scaling from POCs to production, including compounding error rates and reliability concerns. They also explore organizational shifts, patterns for managing context windows, pragmatic views on schema choices, and Upriver's approach to building autonomous data workflows using determinism and LLMs at the right boundaries. The conversation concludes with a look ahead to AI-first data platforms where engineers supervise business semantics while automation stitches technical details end-to-end.

1 hr 5 min

Oct 27, 2025Episode 486

Beyond the Perimeter: Practical Patterns for Fine‑Grained Data Access

Summary

In this episode of the Data Engineering Podcast, Matt Topper, president of UberEther, talks about the complex challenge of identity, credentials, and access control in modern data platforms. With the shift to composable ecosystems, integration burdens have exploded, fracturing governance and auditability across warehouses, lakes, files, vector stores, and streaming systems. Matt shares practical solutions, including propagating user identity via JWTs, externalizing policy with engines like OPA/Rego and Cedar, and using database proxies for native row/column security. He also explores catalog-driven governance, lineage-based label propagation, and OpenTDF for binding policies to data objects. The conversation covers machine-to-machine access, short-lived credentials, workload identity, and constraining access by interface choke points, as well as lessons from Zanzibar-style policy models and the human side of enforcement. Matt emphasizes the need for trust composition - unifying provenance, policy, and identity context - to answer questions about data access, usage, and intent across the entire data path.

1 hr 4 min

Oct 18, 2025Episode 485

The True Costs of Legacy Systems: Technical Debt, Risk, and Exit Strategies

Summary

In this episode, Kate Shaw, Senior Product Manager for Data and SLIM at SnapLogic, talks about the hidden and compounding costs of maintaining legacy systems, along with practical strategies for modernization. She unpacks how “legacy” is less about age and more about when a system becomes a risk: blocking innovation, consuming excess IT time, and creating opportunity costs. Kate explores technical debt, vendor lock-in, lost context from employee turnover, and the slippery notion of “if it ain’t broke,” especially when data correctness and lineage are unclear. She digs into governance, observability, and data quality as foundations for trustworthy analytics and AI, and why exit strategies for system retirement should be planned from day one. The discussion covers composable architectures to avoid monoliths and big-bang migrations, how to bridge valuable systems into AI initiatives without lock-in, and why clear success criteria matter for AI projects. Kate shares lessons from the field on discovery, documentation gaps, parallel run strategies, and using integration as the connective tissue to unlock data for modern, cloud-native and AI-enabled use cases. She closes with guidance on planning migrations, defining measurable outcomes, ensuring lineage and compliance, and building for swap-ability so teams can evolve systems incrementally instead of living with a “bowl of spaghetti.”

51 min

Oct 11, 2025Episode 484

Context Engineering as a Discipline: Building Governed AI Analytics

Summary

In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Nick Schrock, CTO and founder of Dagster Labs, to discuss Compass - a Slack-native, agentic analytics system designed to keep data teams connected with business stakeholders. Nick shares his journey from initial skepticism to embracing agentic AI as model and application advancements made it practical for governed workflows, and explores how Compass redefines the relationship between data teams and stakeholders by shifting analysts into steward roles, capturing and governing context, and integrating with Slack where collaboration already happens. The conversation covers organizational observability through Compass's conversational system of record, cost control strategies, and the implications of agentic collaboration on Conway's Law, as well as what's next for Compass and Nick's optimistic views on AI-accelerated software engineering.

Announcements

Your host is Tobias Macey and today I'm interviewing Nick Schrock about building an AI analyst that keeps data teams in the loop

Interview

Introduction
How did you get involved in the area of data man

1 hr 1 min

Oct 5, 2025Episode 483

The Data Model That Captures Your Business: Metric Trees Explained

Summary

In this episode of the Data Engineering Podcast, Vijay Subramanian, founder and CEO of Trace, talks about metric trees - a new approach to data modeling that directly captures a company's business model. Vijay shares insights from his decade-long experience building data practices at Rent the Runway and explains how the modern data stack has led to a proliferation of dashboards without a coherent way for business consumers to reason about cause, effect, and action. He explores how metric trees differ from and interoperate with other data modeling approaches, serve as a backend for analytical workflows, and provide concrete examples like modeling Uber's revenue drivers and customer journeys. Vijay also discusses the potential of AI agents operating on metric trees to execute workflows, organizational patterns for defining inputs and outputs with business teams, and a vision for analytics that becomes invisible infrastructure embedded in everyday decisions.

Announcements

Your host is Tobias Macey and today I'm interviewing Vijay Subramanian about metric trees and how they empower more effective and adaptive analytics

Interview

Introducti

56 min

Sep 28, 2025Episode 482

From GPUs-as-a-Service to Workloads-as-a-Service: Flex AI’s Path to High-Utilization AI Infra

Summary

In this crossover episode of the AI Engineering Podcast, host Tobias Macey interviews Brijesh Tripathi, CEO of Flex AI, about revolutionizing AI engineering by removing DevOps burdens through "workload as a service". Brijesh shares his expertise from leading AI/HPC architecture at Intel and deploying supercomputers like Aurora, highlighting how access friction and idle infrastructure slow progress. Join them as they discuss Flex AI's innovative approach to simplifying heterogeneous compute, standardizing on consistent Kubernetes layers, and abstracting inference across various accelerators, allowing teams to iterate faster without wrestling with drivers, libraries, or cloud-by-cloud differences. Brijesh also shares insights into Flex AI's strategies for lifting utilization, protecting real-time workloads, and spanning the full lifecycle from fine-tuning to autoscaled inference, all while keeping complexity at bay.

Pre-amble

I hope you enjoy this cross-over episode of the AI Engineering Podcast, another show that I run to act as your guide to the fast-moving world of building scalable and maintainable AI systems. As generative AI models have grown more powerful and are being applied to a broader range of use cases, the lines between data and AI engineering are becoming increasingly blurry. The responsibilities of data teams are being extended into the realm of context engineering, as well as designing and supporting new infrastructure elements that serve the needs of agentic applications. This episode is an example of the types of work that are not easily categorized into one or the other camp.

52 min

Sep 18, 2025Episode 481

From RAG to Relational: How Agentic Patterns Are Reshaping Data Architecture

Summary

In this episode of the AI Engineering Podcast, Mark Brooker, VP and Distinguished Engineer at AWS, talks about how agentic workflows are transforming database usage and infrastructure design. He discusses the evolving role of data in AI systems, from traditional models to more modern approaches like vectors, RAG, and relational databases. Mark explains why agents require serverless, elastic, and operationally simple databases, and how AWS solutions like Aurora and DSQL address these needs with features such as rapid provisioning, automated patching, geodistribution, and support for spiky usage. The conversation covers topics including tool calling, improved model capabilities, state in agents versus stateless LLM calls, and the role of Lambda and AgentCore for long-running, session-isolated agents. Mark also touches on the shift from local MCP tools to secure, remote endpoints, the rise of object storage as a durable backplane, and the need for better identity and authorization models. The episode highlights real-world patterns like agent-driven SQL fuzzing and plan analysis, while identifying gaps in simplifying data access, hardening ops for autonomous systems, and evolving serverless database ergonomics to keep pace with agentic development.

1 hr 10 min

Sep 10, 2025Episode 480

Duck Lake: Simplifying the Lakehouse Ecosystem

Summary

In this episode of the Data Engineering Podcast, Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on Duck Lake, a new entrant in the open lakehouse ecosystem. They discuss how Duck Lake focuses on simplicity and flexibility and offers a unified catalog and table format, in contrast to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how Duck Lake revolutionizes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements

Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what DuckLake is and the story behind it?
What are the particular problems that DuckLake is solving for?
How does this compare to the capabilities of MotherDuck?
Iceberg and Delta already

1 hr 6 min

Sep 1, 2025Episode 479

Aligning Business and Data: The Essential Role of Data Modeling

Summary

In this episode of the Data Engineering Podcast, Serge Gershkovich, head of product at SqlDBM, talks about the socio-technical aspects of data modeling. Serge shares his background in data modeling and highlights its importance as a collaborative process between business stakeholders and data teams. He debunks common misconceptions that data modeling is optional or secondary, emphasizing its crucial role in ensuring alignment between business requirements and data structures. The conversation covers challenges in complex environments, the impact of technical decisions on data strategy, and the evolving role of AI in data management. Serge stresses the need for business stakeholders' involvement in data initiatives and a systematic approach to data modeling, warning against relying solely on technical expertise without considering business alignment.

Announcements

Enterprises today face an enormous challenge: they’re investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern. And that’s especially evident in large, complex enterprise data environments. That’s why companies like DirecTV and Pfizer rely on SqlDBM. Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy. In today’s cloud era, that backbone needs a modern approach built natively for the cloud, with direct connections to the very platforms driving your business forward. Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale. SqlDBM ensures enterprises don’t just move to the cloud; they maximize their ROI by creating governed, scalable, and business-aligned data environments. If global enterprises are using SqlDBM to tackle the biggest challenges in data management, analytics, and AI, isn’t it worth exploring what it can do for yours? Visit dataengineeringpodcast.com/sqldbm to learn more.
Your host is Tobias Macey and today I'm interviewing Serge Gershkovich about how and why data modeling is a sociotechnical endeavor

In

50 min

Aug 26, 2025Episode 478

From Academia to Industry: Bridging Data Engineering Challenges

SummaryIn this episode of the Data Engineering Podcast Professor Paul Groth, from the University of Amsterdam, talks about his research on knowledge graphs and data engineering. Paul shares his background in AI and data management, discussing the evolution of data provenance and lineage, as well as the challenges of data integration. He explores the impact of large language models (LLMs) on data engineering, highlighting their potential to simplify knowledge graph construction and enhance data integration. The conversation covers the evolving landscape of data architectures, managing semantics and access control, and the interplay between industry and academia in advancing data engineering practices, with Paul also sharing insights into his work with the intelligent data engineering lab and the importance of human-AI collaboration in data engineering pipelines.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? 
Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Paul Groth about his research on knowledge graphs and data engineering
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the focus and scope of your academic efforts?
Given your focus on data management for machine learning as part of the INDELab, what are some of the developing trends that practitioners should be aware of?
ML architectures / systems changing (Matteo Interlandi) GPUs for data management
You have spent a large portion of your career working with knowledge graphs, which have largely been a niche area until recently. What are some of the notable changes in the knowledge graph ecosystem that have resulted from the introduction of LLMs?
What are some of the other ways that you are seeing LLMs change the methods of data engineering?
There are numerous vague and anecdotal references to the power of LLMs to unlock value from unstructured data. What are some of the realities that you are seeing in your research?
A majority of the conversations in this podcast are focused on data engineering in the context of a business organization. What are some of the ways that management of research data is disjoint fr

1 hr 1 min

Aug 18, 2025Episode 477

High Performance And Low Overhead Graphs With KuzuDB

SummaryIn this episode of the Data Engineering Podcast Prashanth Rao, an AI engineer at KuzuDB, talks about their embeddable graph database. Prashanth explains how KuzuDB addresses performance shortcomings in existing solutions through columnar storage and novel join algorithms. He discusses the usability and scalability of KuzuDB, emphasizing its open-source nature and potential for various graph applications. The conversation explores the growing interest in graph databases due to their AI and data engineering applications, and Prashanth highlights KuzuDB's potential in edge computing, ephemeral workloads, and integration with other formats like Iceberg and Parquet.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Prashanth Rao about KuzuDB, an embeddable graph databaseInterviewIntroductionHow did you get involved in the area of data management?Can you describe what KuzuDB is and the story behind it?What are the core use cases that Kuzu is focused on addressing?What is explicitly out of scope?Graph engines have been available and in use for a long time, but generally for more niche use cases. How would you characterize the current state of the graph data ecosystem?You note scalability as a feature of Kuzu, which is a phrase with many potential interpretations. 
Typically horizontal scaling of graphs has been complicated, in what sense does Kuzu make that claim?Can you describe some of the typical architecture and integration patterns of Kuzu?What are some of the more interesting or esoteric means of architecting with Kuzu?For cases where Kuzu is rendering a graph across an external data repository (e.g. Iceberg, etc.), what are the patterns for balancing data freshness with network/compute efficiency? (e.g. read and create every time or persist the Kuzu state)Can you describe the internal architecture of Kuzu and key design factors?What are the benefits and tradeoffs of using a columnar store with adjacency lists vs. a more graph-native storage format?What are the most interesting, innovative, or unexpected ways t

1 hr 10 min

Aug 12, 2025Episode 476

Bridging Data and Decision-Making: AI's Role in Modern Analytics

SummaryIn this episode of the Data Engineering Podcast Lucas Thelosen and Drew Gilson from Gravity talk about their development of Orion, an autonomous data analyst that bridges the gap between data availability and business decision-making. Lucas and Drew share their backgrounds in data analytics and how their experiences have shaped their approach to leveraging AI for data analysis, emphasizing the potential of AI to democratize data insights and make sophisticated analysis accessible to companies of all sizes. They discuss the technical aspects of Orion, a multi-agent system designed to automate data analysis and provide actionable insights, highlighting the importance of integrating AI into existing workflows with accuracy and trustworthiness in mind. The conversation also explores how AI can free data analysts from routine tasks, enabling them to focus on strategic decision-making and stakeholder management, as they discuss the future of AI in data analytics and its transformative impact on businesses.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? 
Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the engineering and impact of building an autonomous data analystInterviewIntroductionHow did you get involved in the area of data management?Can you describe what Orion is and the story behind it?How do you envision the role of an agentic analyst in an organizational context?There have been several attempts at building LLM-powered data analysis, many of which are essentially a text-to-SQL interface. How have the capabilities and architectural patterns grown in the past ~2 years to enable a more capable system?One of the key success factors for a data analyst is their ability to translate business questions into technical representations. How can an autonomous AI-powered system understand the complex nuance of the business to build effective analyses?Many agentic approaches to analytics require a substantial investment in data architecture, documentation, and semantic models to be effective. What are the gradations of effectiveness for autonomous analytics for companies who are at different points on the


50 min

Aug 5, 2025Episode 475

From Bits to Tables: The Evolution of S3 Storage

SummaryIn this episode of the Data Engineering Podcast Andy Warfield talks about the innovative functionalities of S3 Tables and Vectors and their integration into modern data stacks. Andy shares his journey through the tech industry and his role at Amazon, where he collaborates to enhance storage capabilities, discussing the evolution of S3 from a simple storage solution to a sophisticated system supporting advanced data types like tables and vectors crucial for analytics and AI-driven applications. He explains the motivations behind introducing S3 Tables and Vectors, highlighting their role in simplifying data management and enhancing performance for complex workloads, and shares insights into the technical challenges and design considerations involved in developing these features. The conversation explores potential applications of S3 Tables and Vectors in fields like AI, genomics, and media, and discusses future directions for S3's development to further support data-driven innovation.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementTired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. 
Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.Your host is Tobias Macey and today I'm interviewing Andy Warfield about S3 Tables and VectorsInterviewIntroductionHow did you get involved in the area of data management?Can you describe what your goals are with the Tables and Vector features of S3?How did the experience of building S3 Tables inform your work on S3 Vectors?There are numerous implementations of vector storage and search. How do you view the role of S3 in the context of that ecosystem?The most directly analogous implementation that I'm aware of is the Lance table format. How would you compare the implementation and capabilities of Lance with what you are building with S3 Vectors?What opportunity do you see for being able to offer a protocol compatible implementation similar to the Iceberg compatibility that you provide with S3 Tables?Can you describe the technical implementation of the Vectors functionality in S3?What are the

51 min

Jul 28, 2025Episode 474

Revolutionizing Python Notebooks with Marimo

SummaryIn this episode of the Data Engineering Podcast Akshay Agrawal from Marimo discusses the innovative new Python notebook environment, which offers a reactive execution model, full Python integration, and built-in UI elements to enhance the interactive computing experience. He discusses the challenges of traditional Jupyter notebooks, such as hidden states and lack of interactivity, and how Marimo addresses these issues with features like reactive execution and Python-native file formats. Akshay also explores the broader landscape of programmatic notebooks, comparing Marimo to other tools like Jupyter, Streamlit, and Hex, highlighting its unique approach to creating data apps directly from notebooks and eliminating the need for separate app development. The conversation delves into the technical architecture of Marimo, its community-driven development, and future plans, including a commercial offering and enhanced AI integration, emphasizing Marimo's role in bridging the gap between data exploration and production-ready applications.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementTired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. 
Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.Your host is Tobias Macey and today I'm interviewing Akshay Agrawal about Marimo, a reusable and reproducible Python notebook environmentInterviewIntroductionHow did you get involved in the area of data management?Can you describe what Marimo is and the story behind it?What are the core problems and use cases that you are focused on addressing with Marimo?What are you explicitly not trying to solve for with Marimo?Programmatic notebooks have been around for decades now. Jupyter was largely responsible for making them popular outside of academia. How have the applications of notebooks changed in recent years?What are the limitations that have been most challenging to address in production contexts?Jupyter has long had support for multi-language notebooks/notebook kernels. What is your opinion on the utility of that feature as a core concern o

55 min

Jul 21, 2025Episode 473

Warehouse Native Incremental Data Processing With Dynamic Tables And Delayed View Semantics

SummaryIn this episode of the Data Engineering Podcast Dan Sotolongo from Snowflake talks about the complexities of incremental data processing in warehouse environments. Dan discusses the challenges of handling continuously evolving datasets and the importance of incremental data processing for optimized resource use and reduced latency. He explains how delayed view semantics can address these challenges by maintaining up-to-date results with minimal work, leveraging Snowflake's dynamic tables feature. The conversation also explores the broader landscape of data processing, comparing batch and streaming systems, and highlights the trade-offs between them. Dan emphasizes the need for a unified theoretical framework to discuss semantic guarantees in data pipelines and introduces the concept of delayed view semantics, touching on the limitations of current systems and the potential of dynamic tables to simplify complex data workflows.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? 
Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Dan Sotolongo about the challenges of incremental data processing in warehouse environments and how delayed view semantics help to address the problemInterviewIntroductionHow did you get involved in the area of data management?Can you start by defining the scope of the term "incremental data processing"?What are some of the common solutions that data engineers build when creating workflows to implement that pattern?What are some common difficulties that they encounter in the pursuit of incremental data?Can you describe what delayed view semantics are and the story behind it?What are the problems that DVS explicitly doesn't address?How does the approach that you have taken in Dynamic View Semantics compare to systems like Materialize, Feldera, etc.Can you describe the technical architecture of the implementation of Dynamic Tables?What are the elements of the problem that are as-yet unsolved?How has the implementation changed/evolved as you learned more about the solution space?What would be involved in implementing the delayed view semanti

52 min

Jul 15, 2025Episode 472

Streamlining Data Pipelines with MCP Servers and Vector Engines

Summary
In this episode of the Data Engineering Podcast Kacper Łukawski from Qdrant talks about integrating MCP servers with vector databases to process unstructured data. Kacper shares his experience in data engineering, from building big data pipelines in the automotive industry to leveraging large language models (LLMs) for transforming unstructured datasets into valuable assets. He discusses the challenges of building data pipelines for unstructured data and how vector databases facilitate semantic search and retrieval-augmented generation (RAG) applications. Kacper delves into the intricacies of vector storage and search, including metadata and contextual elements, and explores the evolution of vector engines beyond RAG to applications like semantic search and anomaly detection. The conversation covers the role of Model Context Protocol (MCP) servers in simplifying data integration and retrieval processes, highlighting the need for experimentation and evaluation when adopting LLMs, and offering practical advice on optimizing vector search costs and fine-tuning embedding models for improved search quality.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks?
Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Kacper Łukawski about how MCP servers can be paired with vector databases to streamline processing of unstructured data
Interview
Introduction
How did you get involved in the area of data management?
LLMs are enabling the derivation of useful data assets from unstructured sources. What are the challenges that teams face in building the pipelines to support that work?
How has the role of vector engines grown or evolved in the past ~2 years as LLMs have gained broader adoption?
Beyond its role as a store of context for agents, RAG, etc., what other applications are common for vector databases?
In the ecosystem of vector engines, what are the distinctive elements of Qdrant?
How has the MCP specification simplified the work of processing unstructured data?
Can you describe the toolchain and workflow involved in building a data pipeline that leverages an MCP for generating embeddings?
helping data engineers gain confidence in non-deterministic


55 min

Jul 6, 2025Episode 471

Foundational Data Engineering At Two Sigma

SummaryIn this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. 
A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sectorInterviewIntroductionHow did you get involved in the area of data management?Can you start by outlining the role of data in the context of Two Sigma?What are some of the key characteristics of

54 min

Jun 29, 2025Episode 470

Enabling Agents In The Enterprise With A Platform Approach

SummaryIn this episode of the Data Engineering Podcast Arun Joseph talks about developing and implementing agent platforms to empower businesses with agentic capabilities. From leading AI engineering at Deutsche Telekom to his current entrepreneurial venture focused on multi-agent systems, Arun shares insights on building agentic systems at an organizational scale, highlighting the importance of robust models, data connectivity, and orchestration loops. Listen in as he discusses the challenges of managing data context and cost in large-scale agent systems, the need for a unified context management platform to prevent data silos, and the potential for open-source projects like LMOS to provide a foundational substrate for agentic use cases that can transform enterprise architectures by enabling more efficient data management and decision-making processes.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. 
It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. Your host is Tobias Macey and today I'm interviewing Arun Joseph about building an agent platform to empower the business to adopt agentic capabilitiesI

1 hr 1 min

Jun 18, 2025Episode 469

Dagster's New Era: Modularizing Data Transformation in the Age of AI

SummaryIn this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementThis episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. 
Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chroni

44 min

Jun 11, 2025Episode 468

AI and the Lakehouse: How Starburst is Pioneering New Workflows

SummaryIn this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.AnnouncementsHello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. 
It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again.Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th. This epis

1 hr 1 min

Jun 3, 2025Episode 467

Amazon S3: The Backbone of Modern Data Systems

Summary
In this episode of the Data Engineering Podcast Mai-Lan Tomsen Bukovec, Vice President of Technology at AWS, talks about the evolution of Amazon S3 and its profound impact on data architecture. From her work on compute systems to leading the development and operations of S3, Mai-Lan shares insights on how S3 has become a foundational element in modern data systems, enabling scalable and cost-effective data lakes since its launch alongside Hadoop in 2006. She discusses the architectural patterns enabled by S3, the importance of metadata in data management, and how S3's evolution has been driven by customer needs, leading to innovations like strong consistency and S3 Tables.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This is a pharmaceutical Ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds.
And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again.Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.Your host is Tobias Macey and today I'm interviewing Mai-Lan Tomsen Bukovec about the evolutions

42 min

May 29, 2025 · Episode 466

Scaling Data Operations With Platform Engineering

Summary
In this episode of the Data Engineering Podcast, Chakravarthy Kotaru talks about scaling data operations through standardized platform offerings. From his roots as an Oracle developer to leading the data platform at a major online travel company, Chakravarthy shares insights on managing diverse database technologies and providing databases as a service to streamline operations. He explains how his team has transitioned from DevOps to a platform engineering approach, centralizing expertise and automating repetitive tasks with AWS Service Catalog. Join them as they discuss the challenges of migrating legacy systems, integrating AI and ML for automation, and the importance of organizational buy-in in driving data platform success.

Announcements
Your host is Tobias Macey and tod

49 min

May 21, 2025 · Episode 465

From Data Discovery to AI: The Evolution of Semantic Layers

Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Shinji Kim to discuss the evolving role of semantic layers in the era of AI. As they explore the challenges of managing vast data ecosystems and providing context to data users, they delve into the significance of semantic layers for AI applications. They cover the nuances of semantic modeling, the impact of AI on data accessibility, and the importance of business logic in semantic models. Shinji shares her insights on how Select Star is helping teams navigate these complexities, and together they look at the future of semantic modeling as a native construct in data systems. Join them for an in-depth conversation on the evolving landscape of data engineering and its intersection with AI.

Announcements
Your host is Tobias Macey and today I'm interviewing Shinji Kim about the role of semantic layers in the era of AI

Interview
Introduction
How did you get involved in the area of data management?
Semantic modeling gained a lot of attention ~4-5 years ago in the context of the "modern data stack". What is your motivation for revisiting that topic today?
There are several overlapping concepts: "semantic layer," "metrics layer," "headless BI." How do you define these terms, and what are the key distinctions and overlaps?
Do you see these concepts converging, or do they serve distinct long-term purposes?
Data warehousing and business intelligence have been around for decades now. What new value does semantic modeling add beyond practices like star schemas, OLAP cubes, etc.?
What benefits does a semantic model provide when integrating your data platform into AI use cases?
How does using AI as an interface to your analytical use cases differ from powering customer-facing AI applications with your data?
The effort required to create and maintain a set of semantic models is non-zero. What role can LLMs play in helping to propose and construct those models?
For teams who have already invested in building this capability, what additional context and metadata is necessary to provide guidance to LLMs when

46 min

May 13, 2025 · Episode 464

Balancing Off-the-Shelf and Custom Solutions in Data Engineering

Summary
In this episode of the Data Engineering Podcast, Tulika Bhatt, a senior software engineer at Netflix, talks about her experiences with large-scale data processing and the future of data engineering technologies. Tulika shares her journey into the data engineering field, discussing her work at BlackRock and Verizon before joining Netflix, and explains the challenges and innovations involved in managing Netflix's impression data for personalization and user experience. She highlights the importance of balancing off-the-shelf solutions with custom-built systems using technologies like Spark, Flink, and Iceberg, and delves into the complexities of ensuring data quality and observability in high-speed environments, including robust alerting strategies and semantic data auditing.

Announcements
Your host is Tobias Macey and today I'm interviewing Tulika Bhatt about her experiences working on large-scale data processing and her insights on the future trajectory of the supporting technologies

Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the ways that operating at large scale changes how you need to think about the design of data systems?
When dealing with small-scale data systems it can be feasible to have manual processes. What are the elements of large-scale data systems that demand automation?
How can those large-scale automation principles be down-scaled to the systems that the rest of the world is operating?
A perennial problem in data engineering is that of data quality. The past 4 years have seen significant growth in the number of tools and practices available for automating the validation and verification of data. In your experience working with high-volume data flows, what are the elements of data validation that are still unsolved?
Generative AI has taken the world by storm over the past couple of years. How has that changed the ways that you approach your daily work?
What do you see as the future realities of working with data across various axes of large scale, real-time, etc.?
What are the most interesting, innovati

59 min

May 5, 2025 · Episode 463

StarRocks: Bridging Lakehouse and OLAP for High-Performance Analytics

Summary
In this episode of the Data Engineering Podcast, Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high-concurrency, low-latency queries and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.

Announcements
Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high-performance analytical database supporting shared-nothing and shared-data patterns

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what StarRocks is and the story behind it?
There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?
Can you describe the architecture of StarRocks?
What are the "-ilities" that are foundational to the design of the system?
How have the design and focus of the project evolved since it was first created?
What are the tradeoffs involved in separating the communication layer from the data layers?
The tiered architecture enables the shared-nothing and shared-data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?
The shared-data implementation has caching built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?
StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi

1 hr 12 min

Apr 28, 2025 · Episode 462

Exploring NATS: A Multi-Paradigm Connectivity Layer for Distributed Applications

Summary
In this episode of the Data Engineering Podcast, Derek Collison, creator of NATS and CEO of Synadia, talks about the evolution and capabilities of NATS as a multi-paradigm connectivity layer for distributed applications. Derek discusses the challenges and solutions in building distributed systems, and highlights the unique features of NATS that differentiate it from other messaging systems. He delves into the architectural decisions behind NATS, including its ability to handle high-speed global microservices, support for edge computing, and integration with JetStream for data persistence, and explores the role of NATS in modern data management and its use cases in industries like manufacturing and connected vehicles.

Announcements
Your host is Tobias Macey and today I'm interviewing Derek Collison about NATS, a multi-paradigm connectivity layer for distributed applications.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what NATS is and the story behind it?
How have your experiences in past roles (Cloud Foundry, TIBCO messaging systems) informed the core principles of NATS?
What other sources of inspiration have you drawn on in the design and evolution of NATS? (e.g. Kafka, RabbitMQ, etc.)
There are several patterns and abstractions that NATS can support, many of which overlap with other well-regarded technologies. When designing a system or service, what are the heuristics that should be used to determine whether NATS should act as a replacement or addition to those capabilities? (e.g. considerations of scale, speed, ecosystem compatibility, etc.)
There is often a divide in the technologies and architecture used between operational/user-facing applications and data systems. How does the unification of multiple messaging patterns in NATS shift the ways that teams think about the relationship between these use cases?
How does the shared communication layer of NATS with multiple protocol and pattern adapters reduce the need to replicate data and logic across application and data layers?
Can you describe how the core NATS system is archit