PyData Seattle 2023

Building a Search Engine
04-26, 15:30–17:00 (America/Los_Angeles), Rainier

Most production information retrieval systems are built on top of Lucene, which uses BM25.
Current state-of-the-art techniques use embeddings for retrieval. This workshop will cover common information retrieval concepts, what companies have used in the past, and how newer systems use embeddings.

Outline:
- Non-deep-learning-based retrieval
- Embeddings and Vector Similarity Overview
- Serving Vector Similarity using Approximate Nearest Neighbors (ANN)

By the end of the session, participants will be able to build a production information retrieval system that leverages embeddings and vector similarity served with ANN. This will allow them to apply state-of-the-art techniques on top of traditional information retrieval systems.


Most companies need a search engine to help serve relevant results to their users. This workshop aims to demystify what is involved in building such a system.

The workshop will cover three main themes:
- Non-deep-learning-based retrieval systems
- Overview of embedding-based retrieval systems
- Putting an embedding-based retrieval system into production with ANN

The full outline is shared below.

Intro (10 mins)
- Search and retrieval concepts: approaches, evaluation metrics, etc. (a recall@k sketch follows this list)
- Overview of a common production retrieval stack
- Walkthrough of the notebooks and environment setup
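
To make the evaluation-metrics point concrete, here is a minimal sketch of recall@k in Python; the document ids and the metric choice are illustrative assumptions, and the workshop may also cover other metrics such as MRR or nDCG.

    # Minimal sketch of recall@k, one common retrieval evaluation metric.
    # The document ids below are made up for illustration.
    def recall_at_k(ranked_ids, relevant_ids, k):
        """Fraction of relevant documents found in the top-k ranked results."""
        top_k = set(ranked_ids[:k])
        return len(top_k & set(relevant_ids)) / len(relevant_ids)

    ranked = ["doc3", "doc7", "doc1", "doc9"]   # results returned by a retriever
    relevant = {"doc1", "doc2"}                 # ground-truth relevant documents
    print(recall_at_k(ranked, relevant, k=3))   # 0.5: doc1 found, doc2 missed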

Non-deep-learning-based retrieval (15 min)
- Overview of tf-idf and BM25
- How production systems use Elasticsearch / Solr
- Hands-on lab experience: Reviewing retrieval results from BM25 (a minimal scoring sketch follows below)
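
As a reference for the hands-on lab, below is a minimal, self-contained sketch of BM25 scoring in pure Python; the toy corpus and the parameter values (k1, b) are illustrative assumptions, and production systems would rely on Elasticsearch / Solr rather than hand-rolled code like this.

    import math
    from collections import Counter

    def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
        """Score every document in corpus_tokens against query_tokens with BM25."""
        N = len(corpus_tokens)
        avgdl = sum(len(doc) for doc in corpus_tokens) / N
        # Inverse document frequency for each distinct query term.
        idf = {}
        for term in set(query_tokens):
            n = sum(1 for doc in corpus_tokens if term in doc)
            idf[term] = math.log((N - n + 0.5) / (n + 0.5) + 1)
        scores = []
        for doc in corpus_tokens:
            tf = Counter(doc)
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            score = sum(
                idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm)
                for t in query_tokens if tf[t] > 0
            )
            scores.append(score)
        return scores

    corpus = ["the cat sat on the mat",
              "dogs chase cats",
              "a search engine ranks documents by relevance"]
    tokenized = [doc.split() for doc in corpus]
    print(bm25_scores("search engine".split(), tokenized))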

Embeddings and Vector Similarity Overview (25 min)
- Brief review of common embedding techniques: word2vec, BERT
- Brief discussion of how to train your own custom embeddings
- Vector Similarity and Evaluation metrics
- Hands-on lab experience: Comparing results of non-deep-learning retrieval and vector similarity (a retrieval sketch follows below)
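
For the comparison lab, a minimal sketch of embedding-based retrieval with cosine similarity is shown below; the sentence-transformers package and the all-MiniLM-L6-v2 model are assumptions for illustration, not necessarily what the workshop notebooks use.

    # Minimal sketch of embedding-based retrieval with cosine similarity.
    # Assumes `pip install sentence-transformers`; the model name is illustrative.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = ["the cat sat on the mat",
              "dogs chase cats",
              "a search engine ranks documents by relevance"]
    doc_emb = model.encode(corpus, normalize_embeddings=True)
    query_emb = model.encode(["how do search engines rank results?"],
                             normalize_embeddings=True)

    # With L2-normalized vectors, cosine similarity reduces to a dot product.
    scores = doc_emb @ query_emb[0]
    best = int(np.argmax(scores))
    print(corpus[best], float(scores[best]))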

Serving Vector Similarity using Approximate Nearest Neighbors (25 min)
- Why Vector Similarity needs ANN
- Review of common approximate nearest neighbor techniques in FAISS
- Hands-on lab experience: Building a FAISS index and comparing results (a minimal indexing sketch follows below)
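
As a reference for the FAISS lab, here is a minimal sketch of building and querying an IVF index; the random vectors, the dimension, and the choice of IndexIVFFlat are illustrative assumptions, since FAISS offers several ANN index families.

    # Minimal sketch of an approximate nearest neighbor index with FAISS.
    # The vectors below are random stand-ins for real document embeddings.
    import numpy as np
    import faiss

    d = 384                                           # embedding dimension
    xb = np.random.rand(10000, d).astype("float32")   # document vectors
    faiss.normalize_L2(xb)                            # normalized => inner product ~ cosine

    nlist = 100                                       # number of coarse clusters
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(xb)                                   # IVF indexes require training
    index.add(xb)

    index.nprobe = 10                                 # clusters probed per query (speed vs. recall)
    xq = xb[:1]                                       # query with the first vector
    scores, ids = index.search(xq, 5)                 # top-5 approximate neighbors
    print(ids[0], scores[0])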

By the end of the session, we hope to give participants enough information to build a production information retrieval system that leverages embeddings and vector similarity served with ANN.

Resources for the workshop
Jupyter Hub
Slides
Repo


Prior Knowledge Expected

No previous knowledge expected

Machine Learning Engineer at Walmart Search