PyData Seattle 2023

From prototype to deployment: Increase productivity and simplify data operations in Python
04-28, 15:00–15:45 (America/Los_Angeles), Hood

Designing ML pipelines is a complex process that involves numerous changes along the way from prototype to deployment. It frequently means iterating over multiple models at a small scale and then converting those models to run at scale. In this talk we discuss the inefficiencies of this process and present a modern, open-source-based solution that mitigates many of them. The proposed tools and approaches help data scientists, data engineers, and machine learning engineers work more efficiently across a wide range of tasks and reduce time-to-solution. We also present future development plans.
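
The abstract does not name the solution. Purely as an illustrative assumption, the sketch below uses Ibis, an open-source dataframe API from the Apache Arrow ecosystem, to show what a "write once, run at any scale" workflow can look like; the table and column names are hypothetical.

    import ibis
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "amount": [10.0, 20.0, 5.0, 7.5, 3.0],
    })

    # Wrap the in-memory frame as an Ibis table.
    t = ibis.memtable(df)

    # Build a symbolic expression; nothing runs yet.
    expr = (
        t.group_by("customer_id")
         .aggregate(total_amount=t.amount.sum())
    )

    # Executing a memtable uses Ibis's default local backend
    # (DuckDB in recent versions). The same expression could be
    # built against a table from a PySpark or Snowflake connection
    # and pushed down to that engine instead, without a rewrite.
    print(expr.execute())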


Authoring ML pipelines has become more involved. Complexity, data size, the cost of prototyping, and productivity are but a few of the challenges facing companies that employ data scientists and machine learning engineers to deliver the cutting-edge ML models that drive revenue. A common approach to the data science process is to quickly prototype a model in pandas and scikit-learn, and then rewrite the logic and train the model on a distributed cluster using tools like (py)Spark or Snowflake. The cost of this translation, and the time it takes, can be a significant factor in the successful deployment and monitoring of an ML model.
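
As a minimal sketch of the rewrite this paragraph describes (with hypothetical column names), here is the same aggregation expressed first in pandas for prototyping and then again in PySpark for the cluster. Every such fragment of logic has to be translated, re-tested, and kept in sync by hand.

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "amount": [10.0, 20.0, 5.0, 7.5, 3.0],
    })

    # Prototype: pandas on a small sample.
    pandas_result = (
        df.groupby("customer_id", as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "total_amount"})
    )

    # Production: the same logic rewritten for PySpark.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("rewrite-example").getOrCreate()
    sdf = spark.createDataFrame(df)

    spark_result = (
        sdf.groupBy("customer_id")
           .agg(F.sum("amount").alias("total_amount"))
    )
    spark_result.show()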


Prior Knowledge Expected: Previous knowledge expected

Tom is a Field Engineer/Solutions Architect at Voltron Data. He has almost 20 years of experience working with data across multiple industries, ranging from airlines through finance and banking to high tech. Tom holds a PhD in Operations Research from UNSW. He has extensive experience presenting at international conferences (KDD, PyData Seattle, GTC). Tom is the author of three books and a video series on data analytics and data engines, and has written multiple blog posts and webinars on GPU applications for big data. He also received a patent, while working at Microsoft, for a solution that discovers patterns in extremely high-dimensional datasets. At Voltron Data, Tom builds bespoke solutions to intricate customer problems, leveraging the capabilities of the Apache Arrow data ecosystem.