PyData Seattle 2023

Managing a search engine for over 600 million openly licensed media records
04-27, 14:45–15:30 (America/Los_Angeles), Hood

Have you ever wanted to add an image or audio track to your blog, but didn't want to copy something off Google Images without attribution? Or wanted to remix a song, or create some art using images with the express consent of the original creator? Openverse (wp.org/openverse and openverse.org) is a search engine for openly licensed media with over 600 million indexed image and audio files. Openverse can help you find this content and give appropriate attribution for it. Managing this much data from over 30 disparate sources can be a challenge. We'll talk about how we identify, aggregate, and index CC-licensed data from across the web to make it accessible from a single search engine.


I'll give a brief overview of what media licensing is and why it's important. Then I'll discuss what Openverse is and how anyone can leverage it when searching for media to use (on a website, in slides, for making music or creating art, and so much more!). I'll dive into the details of our aggregation process and how we use Airflow, Postgres, and Elasticsearch to manage the data that backs our search engine. I'll discuss some of the challenges we face with our current data architecture, as well as opportunities for folks to get involved with this open source endeavor.
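
To make the aggregate-then-index idea concrete, here is a minimal sketch in plain Python. It is not Openverse's actual code: the record schema, the two provider formats, and every function name are hypothetical, and a toy in-memory inverted index stands in for Elasticsearch. It only illustrates the general shape of mapping differently structured source records onto one common schema and then making them searchable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaRecord:
    """Hypothetical common schema; not Openverse's actual data model."""
    identifier: str
    title: str
    creator: str
    license: str  # lowercase license code, e.g. "by-sa" or "cc0"
    source: str


def normalize_source_a(raw: dict) -> MediaRecord:
    # Hypothetical provider A exposes flat fields.
    return MediaRecord(
        identifier=f"a:{raw['id']}",
        title=raw["name"].strip(),
        creator=raw["author"].strip(),
        license=raw["license_code"].lower(),
        source="source_a",
    )


def normalize_source_b(raw: dict) -> MediaRecord:
    # Hypothetical provider B nests its metadata under "meta".
    meta = raw["meta"]
    return MediaRecord(
        identifier=f"b:{raw['uid']}",
        title=meta["title"].strip(),
        creator=meta["by"].strip(),
        license=meta["cc"].lower(),
        source="source_b",
    )


def build_index(records: list[MediaRecord]) -> dict[str, set[str]]:
    # Toy inverted index: title token -> set of record identifiers.
    index: dict[str, set[str]] = {}
    for record in records:
        for token in record.title.lower().split():
            index.setdefault(token, set()).add(record.identifier)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    # Return identifiers whose titles contain every query token.
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Made-up sample data, for illustration only.
records = [
    normalize_source_a(
        {"id": 1, "name": " Sunset over Puget Sound ", "author": "Ada", "license_code": "BY-SA"}
    ),
    normalize_source_b(
        {"uid": "x9", "meta": {"title": "Puget Sound ferry", "by": "Lin", "cc": "cc0"}}
    ),
]
index = build_index(records)
```

With this sample data, `search(index, "puget sound")` matches both records, while `search(index, "ferry")` matches only the one from the second source. In a real pipeline the normalization steps would run as scheduled jobs and the index would live in a dedicated search service; the sketch keeps everything in memory so it can be run as-is.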


Prior Knowledge Expected

Previous knowledge expected

Madison is a Senior Data Engineer out of Seattle and an avid Python user. She currently works at Automattic on the Openverse team, and has worked at Ookla (Speedtest.net), the Allen Institute for Cell Science, and the Broad Institute. In her spare time she can be found baking, building digital tools to help those battling oppression, contributing to open source, petting her cats, reading queer fiction, or playing video games.