PyData Seattle 2023

The Importance of Synthetic Data in Data-Centric AI
04-27, 11:00–11:45 (America/Los_Angeles), Hood

This talk covers the importance of synthetic data for the adoption and development of Data-Centric AI approaches. We’ll cover how generative models can be used to mimic real-world domains through the generation of synthetic data and demonstrate their application using the open-source Python package, ydata-synthetic. For this talk, we’ll focus on tabular data and discuss the impact of synthetic data on different industries such as healthcare and finance. Finally, we’ll explain how to validate the quality of the generated synthetic data depending on the downstream application, whether privacy preservation or boosting ML performance.

Introduced by Andrew Ng in 2021, the concept of Data-Centric AI holds that investing in data quality, rather than model optimization alone, leads to higher performance and greater return from AI initiatives.
The need for more diverse, carefully labeled datasets and higher volumes of data, combined with the challenges and costs of data collection, has driven the rise of new synthetic data generation solutions. Synthetic data generation is one of the most important tools for data science teams to master and one of the best ways to address the challenges of data collection and access. The objective of the session is to provide the audience with answers to the following questions:

  1. What is synthetic data and why should I learn how to generate it? (10 min)
    We'll start with a brief explanation of what synthetic data is and its different types. We’ll cover its importance on the path to adopting a data-centric mindset for the development of AI solutions, and why data scientists should add it to their skill set in order to overcome challenges such as limited data access or lack of data variability.

  2. What are the different methods to generate tabular synthetic data? (10 min)
    We’ll cover the different techniques for generating synthetic tabular data -- faker, Bayesian networks, and generative models such as VAEs and GANs -- and discuss the differences and benefits of each technique.

  3. How to assess the quality of synthetic data for different downstream applications? (15 min)
    We’ll explain the different aspects of synthetic data quality that can be validated: fidelity, utility, and privacy. With a practical example, we’ll discuss the concepts of holdout and overfitting from a synthetic data generation perspective, along with metrics such as category coverage and membership inference attacks.

  4. How is synthetic data being used in real-world applications and industries? (5 min)
    We’ll go over how different industries are already using synthetic data and the benefits it brings to distinct use cases. We’ll end with some directions for further development of synthetic data solutions in machine learning applications.
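To make the techniques in item 2 concrete, here is a deliberately simplified baseline (not the ydata-synthetic API): it fits each column's empirical distribution independently and samples from it. Its key weakness, that it ignores correlations between columns, is exactly what Bayesian networks, VAEs, and GANs are designed to fix.

```python
import random
from collections import Counter

def fit_marginals(rows):
    """Learn each column's empirical value distribution independently."""
    columns = list(zip(*rows))  # transpose: one tuple of values per column
    return [Counter(col) for col in columns]

def sample_rows(marginals, n, seed=0):
    """Draw n synthetic rows by sampling each column on its own.

    Because columns are sampled independently, any correlation present
    in the real data is lost -- the limitation that motivates
    generative models such as VAEs and GANs.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = tuple(
            rng.choices(list(c.keys()), weights=list(c.values()))[0]
            for c in marginals
        )
        rows.append(row)
    return rows

# Toy "real" table with a categorical and a numeric column.
real = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
synthetic = sample_rows(fit_marginals(real), n=100)
```

Every synthetic value comes from the real domain of its column, so simple fidelity checks pass, yet joint structure is not preserved.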
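One of the fidelity metrics named in item 3, category coverage, can be sketched in a few lines. This is an illustrative definition, not the exact implementation used by any particular library: the fraction of categories observed in the real column that appear at least once in the synthetic column.

```python
def category_coverage(real_values, synthetic_values):
    """Fraction of real categories that appear at least once in the
    synthetic data (1.0 = every real category was reproduced)."""
    real_cats = set(real_values)
    if not real_cats:
        return 1.0  # nothing to cover
    covered = real_cats & set(synthetic_values)
    return len(covered) / len(real_cats)

# A synthesizer that never produces "C" covers only 2 of 3 categories.
print(category_coverage(["A", "B", "C", "A"], ["A", "B", "B", "A"]))
# prints 0.6666666666666666
```

Low coverage signals mode collapse, a common failure mode of GAN-based synthesizers, where rare categories vanish from the generated data.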

Prior Knowledge Expected

Fabiana Clemente is the co-founder and CDO of YData, combining Data Understanding, Causality, and Privacy as her main fields of work and research, with the mission to make data actionable for organizations.

Passionate about data, Fabiana has vast experience leading data science teams in startups and multinational companies.

Host of the “When Machine Learning meets privacy” podcast, a guest speaker on Datacast and Privacy Please, and a previous WebSummit speaker, she was recently awarded “Founder of the Year” by the South Europe Startup Awards.
