04-26, 11:00–12:30 (America/Los_Angeles), Kodiak Theatre
When Pandas starts to become a bottleneck for data workloads, data practitioners seek out distributed computing frameworks such as Spark, Dask, and Ray. The problem is porting over existing code would take a lot of rewrites. Though drop-in replacements exist where you can just change the import statement, the resulting code is still attached to the Pandas interface, which is not a good grammar for a lot of distributed computing problems. In this tutorial, we will go over some scenarios where the Pandas interface can't scale, and we'll show how to port the existing code to distributed backend with minimal rewrites.
It is well-known Pandas does not scale to large datasets well because it is single-core and utilizes a lot of memory. In order to utilize a cluster, data practitioners would need to use another framework like Spark, Dask, or Ray capable of handling the orchestration of distributed workloads. In most cases, this requires some amount of rewriting because none of these frameworks are intended to be drop-in replacements. Even Dask, which is built on Pandas, does not exhaustively support the Pandas API because not all operations translate well to a distributed setting.
For already written code, how can we bring it to Spark, Dask, or Ray? There are solutions that propose drop-in replacements, such as Modin and Koalas (renamed to PySpark Pandas). But even if drop-in replacements perfectly matched the Pandas API, they can be suboptimal because of some assumptions the Pandas interface makes, such as having the index. Furthermore, not all distributed computing use cases lend themselves well to Pandas semantics. An example of this is executing a grid search where one model is trained per machine on a small dataset.
Fugue is different from drop-in replacement frameworks in that it provides a minimal interface to port over Python and Pandas workloads to distributed computing. Just by learning one function, users can port already existing business logic across a cluster. Fugue also is designed to break up “big data” problems into several “small data” problems, giving users finer-grained control of how to distribute work across a cluster.
In this tutorial, we’ll go over:
- an intro to distributed computing principles
- drop-in replacement frameworks
- a couple of ways to spin up a cluster for execution
- how to use Fugue to distributed existing code with just a few additional lines
No previous knowledge expected
Kevin Kho is a maintainer for the Fugue project, an abstraction layer for distributed computing. Previously, he was an Open Source Community Engineer at Prefect, an workflow orchestration management system. Before working on data tooling, he was a data scientist for 4 years.
Anthony Holten is a Senior Software Engineer at Interos, Inc. building supply chain software that calculates and tracks risk profiles for hundreds of millions of companies worldwide. Previously, as a Data Engineer at Deloitte, Anthony empowered government clients’ internal policy analysis through natural language processing. He is a published photographer whose formal education is in International Relations by way of Washington, DC and Beijing, China.