E-commerce Search

This repository showcases a complete e-commerce search system, demonstrating the core components of modern information retrieval. The system is built using the Shopping Queries Dataset, a large-scale, human-annotated benchmark for product search. The pipeline is composed of three main stages: retrieval, pre-ranking, and re-ranking. Each stage uses a fine-tuned deep learning model to progressively refine search results, aiming for high relevance and efficiency.

Key Components

1. Retrieval

The initial stage of the search process is retrieval, which focuses on efficiently finding a large set of potentially relevant items from the entire product catalog. This stage must be fast and scalable, as it needs to handle millions of items.

Model: We use a two-tower model for this task. The model consists of two separate neural networks (towers): one for the user's query and another for the product items. By training the model to embed queries and products into the same vector space, we can retrieve candidate items by performing a simple and fast nearest-neighbor search on the product embeddings.
Implementation: The model is a fine-tuned version of a pre-trained language model, adapted specifically for the e-commerce domain.
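
The nearest-neighbor step described above can be sketched in a few lines. This is a minimal, brute-force illustration and not the repository's implementation: the 3-dimensional vectors, the product ids, and the `retrieve` helper are all hypothetical stand-ins for real tower outputs, and a production system would likely use an approximate nearest-neighbor index rather than scanning every product.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def retrieve(query_vec, product_vecs, k):
    """Return the ids of the k products whose embeddings are closest to the query."""
    q = normalize(query_vec)
    scored = []
    for product_id, vec in product_vecs.items():
        p = normalize(vec)
        score = sum(a * b for a, b in zip(q, p))  # cosine similarity
        scored.append((score, product_id))
    scored.sort(reverse=True)  # highest similarity first
    return [product_id for _, product_id in scored[:k]]

# Hypothetical embeddings standing in for the product tower's output.
product_vecs = {
    "usb_c_cable": [0.9, 0.1, 0.0],
    "phone_case": [0.2, 0.9, 0.1],
    "garden_hose": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # hypothetical output of the query tower
top = retrieve(query_vec, product_vecs, k=2)
print(top)
```

Because the two towers are independent, the product embeddings can be computed once offline, and only the query needs to be embedded at search time.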

2. Pre-ranking

After retrieval, the system has a smaller list of candidate items (e.g., a few hundred). The pre-ranking stage sifts through this list to surface the most promising candidates for the final re-ranking stage. This stage is more powerful than retrieval but still needs to be efficient.

Model: We employ another two-tower model here, but one that is more complex than the retrieval model. It is a fine-tuned transformer-based architecture from the Hugging Face transformers library, which allows the model to capture more nuanced relationships between the query and the product.
Implementation: This model is trained to score the relevance of a query-product pair, allowing us to select the top-k most relevant items from the retrieved set.
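
As a rough sketch of the top-k selection described above, the snippet below scores each candidate against the query and keeps the best k. The `toy_score` function is a hypothetical word-overlap placeholder used only so the example runs on its own; in the actual pipeline the score would come from the fine-tuned pre-ranking model.

```python
import heapq

def toy_score(query, title):
    """Placeholder scorer (assumption): fraction of query words found in the title."""
    q_words = set(query.lower().split())
    t_words = set(title.lower().split())
    return len(q_words & t_words) / len(q_words)

def select_top_k(query, candidates, score_fn, k):
    """Score every candidate against the query and return the k highest-scoring ones."""
    scored = ((score_fn(query, c), c) for c in candidates)
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [candidate for _, candidate in best]

# Hypothetical output of the retrieval stage.
retrieved = [
    "wireless phone charger",
    "phone case",
    "garden hose",
    "fast wireless charger pad",
]
shortlist = select_top_k("wireless phone charger", retrieved, toy_score, k=2)
print(shortlist)
```

`heapq.nlargest` avoids sorting the full candidate list, which helps when k is much smaller than the number of retrieved items.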

3. Re-ranking

The final stage is re-ranking, which takes the refined list from the pre-ranking stage and performs a deep, fine-grained analysis to determine the final order of the search results. This stage is the most computationally intensive and powerful, but it only needs to be applied to a small number of items.

Model: A cross-encoder model is used for re-ranking. Unlike the two-tower models, the cross-encoder processes the query and the product together in a single transformer model. This allows the model to build a rich, contextual understanding of how the query and the product relate to each other, leading to highly accurate relevance scores.
Implementation: This model is also fine-tuned using a transformer architecture from the transformers library, specifically for the task of ranking search results.
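
Taken together, the three stages form a funnel: each stage applies a more expensive scorer to a shorter list. The sketch below illustrates that control flow only. The `run_funnel` helper, the three toy scorers, and the cut-off sizes are assumptions made for illustration; the repository's stages use fine-tuned neural models rather than word overlap.

```python
def overlap(query, title):
    """Number of words shared by the query and the product title."""
    return len(set(query.split()) & set(title.split()))

def cheap_score(query, title):
    """Stand-in for retrieval: does the title share any word with the query?"""
    return 1.0 if overlap(query, title) > 0 else 0.0

def medium_score(query, title):
    """Stand-in for pre-ranking: how many query words appear in the title?"""
    return float(overlap(query, title))

def joint_score(query, title):
    """Stand-in for re-ranking: Jaccard similarity over both texts together."""
    union = set(query.split()) | set(title.split())
    return overlap(query, title) / len(union)

def run_funnel(query, catalog, stages):
    """Apply each (score_fn, keep) stage in order, shrinking the candidate list."""
    candidates = list(catalog)
    for score_fn, keep in stages:
        candidates.sort(key=lambda item: score_fn(query, item), reverse=True)
        candidates = candidates[:keep]
    return candidates

# Hypothetical catalog of product titles.
catalog = [
    "garden hose",
    "red running shoes",
    "blue running shoes",
    "red shoes",
    "running socks",
    "kitchen knife",
]
stages = [(cheap_score, 4), (medium_score, 3), (joint_score, 2)]
results = run_funnel("red running shoes", catalog, stages)
print(results)
```

The most expensive scorer here only ever sees three items, which is the property that makes a cross-encoder affordable as the last stage.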

Dataset

This project uses the publicly available Shopping Queries Dataset (full title: "Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search"). The dataset is particularly valuable because it contains a large collection of difficult, real-world search queries from Amazon. Its key feature is the ESCI relevance framework, which goes beyond simple binary relevance and assigns each query-product pair to one of four classes:

Exact (E): The product is an exact match for the query.
Substitute (S): The product is a viable alternative or substitute for the query.
Complement (C): The product complements the query (e.g., a phone charger for a phone).
Irrelevant (I): The product is not relevant to the query.

This rich annotation schema provides a strong signal for training advanced ranking models.
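
One way graded labels like these can feed into evaluation is by mapping each class to a numeric gain and computing nDCG over a ranked list. The sketch below is illustrative only: the gain values in `ESCI_GAIN` are an assumption chosen for this example, so please check the paper for the values used in the benchmark, and the `dcg` and `ndcg` helpers are not taken from this repository.

```python
import math

# Assumed gain per ESCI class (illustrative values, not confirmed by this README).
ESCI_GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}

def dcg(labels):
    """Discounted cumulative gain of a ranked list of ESCI labels."""
    return sum(
        ESCI_GAIN[label] / math.log2(rank + 2)  # rank 0 gets discount log2(2) = 1
        for rank, label in enumerate(labels)
    )

def ndcg(labels):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = sorted(labels, key=ESCI_GAIN.get, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(labels) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg(["E", "S", "C", "I"])  # already in ideal order
swapped = ndcg(["I", "E"])            # the exact match is ranked second
print(perfect, swapped)
```

With graded gains, ranking a Substitute above an Exact match is penalized less than ranking an Irrelevant item above it, which a binary relevance label cannot express.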

For more details on the dataset, please refer to the original paper:

@article{reddy2022shopping,
title={Shopping Queries Dataset: A Large-Scale {ESCI} Benchmark for Improving Product Search},
author={Chandan K. Reddy and Lluís Màrquez and Fran Valero and Nikhil Rao and Hugo Zaragoza and Sambaran Bandyopadhyay and Arnab Biswas and Anlu Xing and Karthik Subbian},
year={2022},
eprint={2206.06588},
archivePrefix={arXiv}
}

Getting Started

To run this project, follow these steps:

Clone the repository:
```
git clone [repository_url]
```
Install dependencies:
```
pip install -r requirements.txt
```
Run the training and evaluation scripts for each stage as follows:
- python retriever/retriever_train.py
- python retriever/vectorizer.py
- python retriever/index/data_inject.py
- python train_preranking.py
- python train_reranking.py

Feel free to explore the code, experiment with different models, or adapt this framework for your own search applications.
