PyData Seattle 2023

Building Reliable, Open Lakehouses with Delta Lake
04-26, 11:00–12:30 (America/Los_Angeles), Rainier

Delta Lake: Building Reliable and Scalable Open Lakehouses


Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with a variety of compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and it can be used directly from Python. Its data reliability guarantees and optimized query performance make it an ideal foundation for big data use cases, including batch and streaming data ingestion, fast interactive queries, and machine learning.
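As a taste of what working with Delta Lake from Python looks like, here is a minimal sketch using the deltalake (delta-rs) package; the table path and column names are illustrative, and the tutorial notebooks may use Spark or another engine instead.

    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # Create a Delta table from a small DataFrame (path and data are illustrative).
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    write_deltalake("./events_delta", df)

    # Append more rows; each write is committed as a new table version.
    more = pd.DataFrame({"id": [4], "value": ["d"]})
    write_deltalake("./events_delta", more, mode="append")

    # Read the table back and check which version we are on.
    dt = DeltaTable("./events_delta")
    print(dt.version())    # latest version number
    print(dt.to_pandas())  # full table contents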

In this tutorial, you will learn about the requirements of modern data engineering and the challenges data engineers face in ensuring data reliability and performance. Through presentations, hands-on code examples, and notebooks, we will explore how Delta Lake helps overcome these obstacles.

By the end of the tutorial, you will have a comprehensive understanding of how Delta Lake can be applied to your data architecture and the benefits it can bring. You will also gain insight into how the wider open-source community is using Delta Lake as an open standard to build the next generation of data engineering and data science tools in Python.


Prior Knowledge Expected

No previous knowledge expected

Jim Hibbard is a Developer Advocate at Databricks. Before that, he worked at Seattle Children’s Hospital, where he developed frameworks and methods for integrating medical records with multi-omics datasets to improve care. He currently works on improving machine learning infrastructure and model management as part of the extended MLflow team.