PyData Seattle 2023

Scaling Altair visualizations with VegaFusion
04-27, 11:45–12:30 (America/Los_Angeles), Kodiak Theatre

Altair is a popular Python visualization library that makes it easy to build a wide variety of statistical charts. On its own, Altair is unsuitable for visualizing large datasets (more than a few thousand rows) because it requires transferring the entire dataset to the browser for data processing and rendering. VegaFusion integrates with Altair to overcome this limitation by automatically moving data-intensive calculations from the browser to the Python kernel. With VegaFusion, many Altair charts easily scale to millions of rows, significantly increasing the utility of Altair throughout the PyData ecosystem.


Background

Altair [1] is a popular data visualization library for Python that makes it easy to build a wide variety of statistical charts with just a few lines of code. By integrating data transformations, like binning and aggregation, into chart specifications, Altair often eliminates the need to perform external data pre-processing in pandas. Altair also supports a powerful selections framework [2] that enables the creation of advanced interactive charts.

This talk will include a brief overview of Altair, with several examples of the kinds of static and interactive visualizations it can be used to create.

Problem

While Altair provides many benefits, a significant limitation is that it requires transferring the entire dataset to the browser for data processing and rendering. Consider a histogram of a one-million-row dataset. While it is convenient for the Altair chart specification to include the data transforms necessary to perform the histogram binning and aggregation, it's not practical to serialize one million rows to JSON and transfer them from the Python kernel to the browser. To protect against crashing the browser, Altair imposes a default limit of five thousand rows, raising an error when this limit is exceeded.

Solution

VegaFusion [3] integrates with Altair to automatically extract data transformations from chart specifications. The extracted transformations are then evaluated by efficient multi-threaded Rust implementations in the Python kernel. The transformed data is then sent to the browser, along with the modified chart specification, for rendering.

As an example, in the case of a twenty-bin histogram of a one-million-row dataset, this approach results in sending a twenty-row dataset to the browser rather than the one-million-row dataset that plain Altair would attempt to send.
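
The arithmetic behind this example can be sketched in plain Python. The snippet below is an illustration only: it does not use Altair or VegaFusion, the `histogram` helper is hypothetical, and it uses 100,000 rows rather than one million to keep the run quick. It bins the values on the "kernel" side and compares the size of the JSON payload that would be sent to the browser in each case.

```python
import json
import random


def histogram(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that a value equal to hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [
        {"bin_start": lo + i * width, "bin_end": lo + (i + 1) * width, "count": c}
        for i, c in enumerate(counts)
    ]


random.seed(0)
n_rows = 100_000  # stand-in for the one million rows discussed in the text
values = [random.uniform(0.0, 1.0) for _ in range(n_rows)]

# Plain approach: every row is serialized and shipped to the browser.
raw_payload = json.dumps([{"x": v} for v in values])

# Pre-aggregated approach: only the twenty bin rows are shipped.
binned = histogram(values, n_bins=20, lo=0.0, hi=1.0)
binned_payload = json.dumps(binned)

print(len(binned), "rows sent instead of", n_rows)
print(len(raw_payload), "bytes raw vs", len(binned_payload), "bytes binned")
```

The binned payload has a fixed size of twenty rows no matter how many input rows there are, which is why this approach scales.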

Features

VegaFusion provides three main features for Altair users: a mime renderer, a widget renderer, and a function to extract transformed data from Altair charts.

Mime Renderer

The VegaFusion mime renderer [4] pre-evaluates the data transformations associated with a chart, inlining the transformed data back into the chart specification. The modified chart specification is then passed to the browser for display in exactly the same way as with plain Altair. The mime renderer is compatible with every Jupyter-based environment that Altair already works in, and it's a great choice for non-interactive charts.
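
The idea of pre-evaluating transforms and inlining the result can be illustrated with a toy sketch. This is not VegaFusion's implementation (which runs in Rust and covers far more transforms); `pre_evaluate` and `is_group_count` are hypothetical helpers that handle a single case, a count grouped by one field, on a Vega-Lite-style specification held as a Python dict.

```python
import copy
from collections import Counter


def is_group_count(transform):
    """True for the one transform this toy supports: a count grouped by one field."""
    agg = transform.get("aggregate", [])
    return (
        len(agg) == 1
        and agg[0].get("op") == "count"
        and len(transform.get("groupby", [])) == 1
    )


def pre_evaluate(spec):
    """Evaluate leading supported transforms in Python and inline the result."""
    new_spec = copy.deepcopy(spec)  # leave the caller's spec untouched
    rows = new_spec["data"]["values"]
    transforms = new_spec.pop("transform", [])

    done = 0
    for transform in transforms:
        if not is_group_count(transform):
            break  # leave this and all later transforms for the browser
        field = transform["groupby"][0]
        out_name = transform["aggregate"][0]["as"]
        counts = Counter(row[field] for row in rows)
        rows = [{field: key, out_name: n} for key, n in counts.items()]
        done += 1

    # Inline the (much smaller) transformed data back into the spec.
    new_spec["data"] = {"values": rows}
    if transforms[done:]:
        new_spec["transform"] = transforms[done:]
    return new_spec


spec = {
    "data": {"values": [{"species": s} for s in ["a", "b", "a", "c", "a", "b"]]},
    "transform": [
        {"aggregate": [{"op": "count", "as": "count"}], "groupby": ["species"]}
    ],
    "mark": "bar",
    "encoding": {
        "x": {"field": "species", "type": "nominal"},
        "y": {"field": "count", "type": "quantitative"},
    },
}

inlined = pre_evaluate(spec)
print(inlined["data"]["values"])
```

The returned specification carries three aggregated rows and no `transform` entry, so the browser only has to render it.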

Widget Renderer

The VegaFusion widget renderer [5] includes a custom Jupyter widget extension [6] that maintains a live connection between the browser and the Python kernel. This makes it possible to support interactive charts that transform data in response to interactive selection events. For example, the widget renderer enables linked histogram brushing across subplots for datasets with over one million rows.
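
A rough sketch of why a live kernel connection helps with linked brushing is shown below. It is a toy model, not the widget's actual protocol: `ToyKernelSession`, `on_brush`, and `bin_counts` are hypothetical names, and the "event" is just a Python tuple. The full dataset stays in Python, and each brush on the `x` field triggers a recomputation of the `y` histogram, so only bin counts would ever travel to the browser.

```python
import random


def bin_counts(values, n_bins, lo, hi):
    """Count values into n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that a value equal to hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


class ToyKernelSession:
    """Holds the full dataset in Python; the browser would only see bin counts."""

    def __init__(self, rows, n_bins=10):
        self.rows = rows
        self.n_bins = n_bins

    def on_brush(self, brush):
        """Handle a brush on x: recompute the y histogram for the selected rows."""
        lo, hi = brush
        selected = [r["y"] for r in self.rows if lo <= r["x"] <= hi]
        return bin_counts(selected, self.n_bins, 0.0, 1.0)


random.seed(1)
rows = [{"x": random.random(), "y": random.random()} for _ in range(50_000)]
session = ToyKernelSession(rows)

full = session.on_brush((0.0, 1.0))     # brush covering the whole x range
brushed = session.on_brush((0.2, 0.4))  # brush covering part of the x range
print(sum(full), sum(brushed))
```

Each brush event returns ten numbers regardless of how many rows the session holds.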

Extract Transformed Data

In plain Altair, data transformations are evaluated in the browser, and so it's not possible to access the transformed data from Python. VegaFusion adds the ability to access transformed datasets as pandas DataFrames [7], making it possible to use Altair transforms as part of larger data analysis pipelines without needing to reimplement these data transformations in Python.

Demo

VegaFusion renderers can be enabled for any Altair visualization with the addition of two lines of code. The presentation will include a live demonstration of the interactive exploration of a dataset with over a million rows.

License

As of version 1.0, VegaFusion is available under the same license as Altair: BSD-3.

Expected Experience

To get the most out of this talk, some experience with data visualization in Python will be helpful, but experience with Altair is not required.

References


Prior Knowledge Expected

No previous knowledge expected

I care deeply about the future of the open data science technology ecosystem, and I'm a contributor to a variety of open source visualization and data science projects. I'm the creator of VegaFusion and a visualization engineer at Hex Technologies.

I hold a Master’s Degree in Computer Science from Johns Hopkins University, and Bachelor's degrees in Mathematics and Physics from Millersville University.