Skip to content

yuhuifishash/NGFix

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

[SIGMOD'26] Dynamically Detect and Fix Hardness for Efficient Approximate Nearest Neighbor Search

Introduction

HNSW-NGFix* is a graph-based approximate nearest neighbor search (ANNS) algorithm. It uses historical queries to fix defective regions in the graph, and can particularly improve the performance of the graph index on Out-of-Distribution (OOD) queries.

We propose Escape Hardness, a metric to evaluate the quality of the graph structure around the query. Then we divide the graph search into two stages and dynamically identify and fix defective graph regions in each stage. (1) From the entry point to the vicinity of the query. We propose Reachability Fixing (RFix), which enhances the navigability of some key nodes. (2) Searching within the vicinity of the query. We propose Neighboring Graph Defects Fixing (NGFix) to improve graph connectivity in regions where queries are densely distributed.

Prerequisites

AVX-512

GCC 10.1.0+ with OpenMP

CMAKE 3.0+

BOOST 1.55+

Compiler

mkdir build && cd build
cmake ..
make -j

Datasets

Format

We support the *.fbin and *.ibin formats. It starts with two 32-bit integers (number of vectors and dimension), followed by all vectors stored sequentially. You can refer to test/tools/data_loader.h for more details.

The ground truth is stored as *.ibin formats. The $i$-th vector in the file represents the ground truth of the $i$-th query (i.e., the $k$ NN of $i$-th query). It is required that the vector dimension $k \ge MaxS$ (default 200).

Compute Exact KNN (Ground Truth)

We use the program provided by DiskANN to compute ground truth. You can also use GPU to accelerate the computation of the ground truth.

Remember to handle floating-point errors and duplicate vectors, otherwise they may affect the recall in testing.

Prepare Datasets

Dataset Link
Text-to-Image https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search
LAION https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-embeddings
WebVid You can download it from the repository of RoarGraph
MainSearch https://zenodo.org/records/17257137

The MainSearch Dataset (we propose) is sourced from a large e-commerce platform (Alibaba Group). The training paradigm of the embedding model consists of two stages: (1) LLM-based supervised fine-tuning (LLM-SFT), which adapts the multimodal LLM to produce deep semantic vectors from (query, document) pairs; and (2) retrieval-oriented reinforcement learning from human feedback (RO-RLHF), which establishes an iterative feedback loop for continual refinement of retrieval quality. A light weight retrieval framework is then applied, and the original 3072-dimensional embeddings generated by the multimodal LLM are compressed to 256 dimensions via Matryoshka Representation Learning (MRL)—significantly improving retrieval efficiency while preserving semantic fidelity.

Since the MainSearch dataset comes from a real production environment, there are many duplicates in both the queries and the base data. We have already deduplicated the queries, and you can decide whether to deduplicate the base data depending on your needs.

Getting Started

Quick Start

# remember to modify the path in the shell.
./build_hnsw_bottom.sh # build base graph
./build_hnsw_ngfix_aknn.sh # use approximate gt to build HNSW-NGFix*
# ./build_hnsw_ngfix.sh # use exact gt to build HNSW-NGFix*
./search_hnsw_ngfix.sh # test HNSW-NGFix*

Test Insertion

./test_hnsw_ngfix_insertion.sh

Test Deletion

./test_hnsw_ngfix_deletion.sh

Parameters

Using the default parameters provided in the scripts and codes can achieve good performance on most datasets, while keeping the index construction overhead low.

Other Optimizations

To improve performance in real-world scenarios, we provide the option of combining HNSW-NGFix* with RaBitQ-8bits. (Note that for a fair comparison, all the experiments in the paper are based on full-precision vectors.)

Test HNSW-NGFix* with RaBitQ

./build_hnsw_bottom_rabitq.sh
./build_hnsw_ngfix_rabitq.sh
./search_hnsw_ngfix_rabitq.sh
WebVid2.5M HNSW-NGFix* HNSW-NGFix*-RaBitQ-8bits HNSW-NGFix*-RaBitQ-8bits-rerank
Index size 5.2G 1.7G 6.5G
recall@100 0.9483 0.9492 0.9509
latency 1.54ms 1.25ms 0.91ms

About

[SIGMOD'26] Dynamically Detect and Fix Hardness for Efficient Approximate Nearest Neighbor Search

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages