PyData Seattle 2023

Scaling data workloads using the best of both worlds: pandas and Spark
04-28, 10:15–11:00 (America/Los_Angeles), St. Helens

pandas is often the keystone of data wrangling and analysis workloads. However, pandas was not designed for big data processing. This presents data practitioners with a dilemma: should we downsample the data and lose information, or adopt a distributed processing framework to scale out our workloads? The mainstream distributed processing tool is Apache Spark, but moving to it means learning a new API, PySpark. Not all is bleak, though: the pandas API on Spark provides pandas-equivalent APIs in PySpark. It allows pandas users to move from a single node to a distributed environment simply by swapping the pandas package for pyspark.pandas.
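As a minimal sketch of that swap (the column names and data here are made up for illustration), the same DataFrame code runs under both libraries; only the import line changes:

```python
import pandas as pd
# On a Spark cluster (Spark 3.2+), the only change needed would be:
#   import pyspark.pandas as pd
# and the same code below would run distributed instead of on one node.

df = pd.DataFrame({"city": ["Seattle", "Portland", "Seattle"],
                   "sales": [100, 80, 120]})

# Identical groupby/aggregation syntax in pandas and pyspark.pandas.
totals = df.groupby("city")["sales"].sum()
print(totals["Seattle"])  # → 220
```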

On the other hand, existing PySpark users may wish to write custom user-defined functions (UDFs) that the built-in PySpark API does not cover. pandas function APIs, introduced in Spark 3.0, allow users to apply arbitrary native Python functions, with pandas instances as input and output, to a PySpark DataFrame. For instance, data scientists can use a pandas function API to train an ML model on each group of data with a single line of code.
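A hedged, single-node sketch of that grouped-training pattern (the column names and the per-group "model" are illustrative, not from the talk): the per-group function takes and returns a pandas DataFrame, exactly the contract that PySpark's grouped-map pandas function API expects.

```python
import pandas as pd

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Illustrative "model": least-squares slope of y on x within one group.
    x, y = pdf["x"], pdf["y"]
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    return pd.DataFrame({"device": [pdf["device"].iloc[0]], "slope": [slope]})

df = pd.DataFrame({
    "device": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 1.0, 1.0, 1.0],
})

# Single-node pandas: apply the function to each group.
results = df.groupby("device", group_keys=False).apply(fit_group)

# On Spark, the same function would train one model per group in parallel
# (sketch only, not executed here):
#   spark_df.groupBy("device").applyInPandas(
#       fit_group, schema="device string, slope double")
```

The key point is that `fit_group` itself is plain pandas code; only the one line invoking it differs between the single-node and distributed versions.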

Co-presented by a top open-source Apache Spark committer and a hands-on data science consultant, this talk equips data analysts and scientists with the knowledge to scale their data analysis workloads, with implementation details and best-practice guidance. Working knowledge of pandas, basic Spark, and machine learning is helpful.

Time 0-5 mins:
- Introduce the common challenges with using single-node pandas library for data analysis
- Introduce distributed data processing framework, Apache Spark
- Introduce pandas API on Spark

Time 5-15 mins:
- How pandas API on Spark works under the hood
- Performance comparison: pandas API on Spark vs. pandas
- Coverage of pyspark.pandas API
- Best practices with respect to choosing types of indices for pyspark.pandas dataframes
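One such best-practice knob is the default index type, which controls how pyspark.pandas generates an index for a Spark DataFrame. A configuration sketch (assumes a Spark 3.2+ environment with pyspark installed; the option name is pyspark.pandas's real setting):

```python
import pyspark.pandas as ps

# "sequence": globally sequential index, but computed on a single node.
# "distributed-sequence": sequential and computed in a distributed way.
# "distributed": cheapest; monotonically increasing but non-consecutive values.
ps.set_option("compute.default_index_type", "distributed-sequence")
```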

Time 15-30 mins:
- Introduce pandas function API for PySpark users
- Types of pandas function API
- E.g. train models on groups of data, apply custom transformation to PySpark dataframes
- Provide best practices: how best to partition PySpark dataframes

Time 30-45 mins:
- Q/A

Prior Knowledge Expected

No previous knowledge expected

Hyukjin is a tech lead on the PySpark team and a Staff Software Engineer at Databricks, as well as an Apache Spark PMC member and committer, working on many areas of Apache Spark such as PySpark, Spark SQL, and SparkR.

He is the number one contributor to Apache Spark, one of the top contributors to the pandas API on Spark (Koalas), and the maintainer of multiple open source projects such as Py4J. He mainly focuses on development, facilitating discussions, and reviewing features and changes in Apache Spark.

Chengyin Eng is a Senior Data Scientist on the Machine Learning Practice team at Databricks. She is experienced in developing end-to-end scalable machine learning solutions for cross-functional clients and works with product/engineering teams to define MLOps best practices. She also teaches ML-in-production and deep learning courses. She has spoken at the Open Data Science Conference, Data + AI Summit, Women in Data Science, and others. Outside of work, she enjoys connecting with friends, watching crime mystery films, and trying out new food recipes.