HNSW-NGFix* is a graph-based approximate nearest neighbor search (ANNS) algorithm. It uses historical queries to fix defective regions in the graph, and is particularly effective at improving the performance of the graph index on Out-of-Distribution (OOD) queries.
We propose Escape Hardness, a metric that evaluates the quality of the graph structure around a query. We then divide graph search into two stages and dynamically identify and fix defective graph regions in each stage:
1. From the entry point to the vicinity of the query. We propose Reachability Fixing (RFix), which enhances the navigability of key nodes.
2. Searching within the vicinity of the query. We propose Neighboring Graph Defects Fixing (NGFix), which improves graph connectivity in regions where queries are densely distributed.
- AVX-512
- GCC 10.1.0+ with OpenMP
- CMake 3.0+
- Boost 1.55+
```shell
mkdir build && cd build
cmake ..
make -j
```

We support the *.fbin and *.ibin formats. Each file starts with two 32-bit integers (the number of vectors and the dimension), followed by all vectors stored sequentially. You can refer to test/tools/data_loader.h for more details.
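As a quick illustration of the layout described above, here is a minimal reader for the two formats (a sketch; the repository's own loader is test/tools/data_loader.h, and the function names below are ours):

```python
import struct
import numpy as np

def read_fbin(path):
    """Read a *.fbin file: an int32 count and int32 dimension header,
    followed by count * dimension float32 values stored row by row."""
    with open(path, "rb") as f:
        num, dim = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=np.float32, count=num * dim)
    return data.reshape(num, dim)

def read_ibin(path):
    """Read a *.ibin file: same header, followed by int32 values
    (e.g. ground-truth neighbor ids)."""
    with open(path, "rb") as f:
        num, dim = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=np.int32, count=num * dim)
    return data.reshape(num, dim)
```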
The ground truth is stored in the *.ibin format.
We use the program provided by DiskANN to compute the ground truth. You can also use a GPU to accelerate the computation.
Remember to handle floating-point errors and duplicate vectors; otherwise they may distort the recall measured in testing.
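One way the issue above shows up: duplicate vectors produce exact distance ties, so an index can return a different (but equally close) id than the stored ground truth and be unfairly counted as a miss. A tie-aware recall check such as the sketch below avoids this (the function name and epsilon are our choices, not part of the repository):

```python
import numpy as np

def tie_aware_recall(query, base, returned_ids, gt_ids, k, eps=1e-5):
    """Recall@k that tolerates floating-point ties: a returned id counts
    as a hit if its distance to the query is within eps of the k-th
    ground-truth distance, so duplicate vectors are not penalized."""
    d = np.linalg.norm(base - query, axis=1)
    threshold = np.sort(d[gt_ids[:k]])[-1] + eps
    hits = np.sum(d[returned_ids[:k]] <= threshold)
    return hits / k
```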
| Dataset | Link |
|---|---|
| Text-to-Image | https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search |
| LAION | https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-embeddings |
| WebVid | You can download it from the repository of RoarGraph |
| MainSearch | https://zenodo.org/records/17257137 |
The MainSearch dataset (which we introduce) is sourced from a large e-commerce platform (Alibaba Group). The training paradigm of the embedding model consists of two stages: (1) LLM-based supervised fine-tuning (LLM-SFT), which adapts the multimodal LLM to produce deep semantic vectors from (query, document) pairs; and (2) retrieval-oriented reinforcement learning from human feedback (RO-RLHF), which establishes an iterative feedback loop for continual refinement of retrieval quality. A lightweight retrieval framework is then applied, and the original 3072-dimensional embeddings generated by the multimodal LLM are compressed to 256 dimensions via Matryoshka Representation Learning (MRL), significantly improving retrieval efficiency while preserving semantic fidelity.
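For context, MRL trains nested embeddings so that a prefix of the full vector is itself a usable embedding; compressing 3072 dimensions down to 256 is then a truncate-and-renormalize step. A minimal sketch of that step (the function name and the L2 renormalization are our assumptions, not details of the released pipeline):

```python
import numpy as np

def mrl_truncate(embeddings, target_dim=256):
    """Keep the first target_dim coordinates of each MRL-trained embedding
    and L2-renormalize, the standard way to use Matryoshka
    representations at a smaller dimension."""
    truncated = embeddings[:, :target_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)
```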
Since the MainSearch dataset comes from a real production environment, there are many duplicates in both the queries and the base data. We have already deduplicated the queries, and you can decide whether to deduplicate the base data depending on your needs.
```shell
# remember to modify the paths in the shell scripts.
./build_hnsw_bottom.sh          # build the base graph
./build_hnsw_ngfix_aknn.sh      # use approximate ground truth to build HNSW-NGFix*
# ./build_hnsw_ngfix.sh         # use exact ground truth to build HNSW-NGFix*
./search_hnsw_ngfix.sh          # test HNSW-NGFix*
./test_hnsw_ngfix_insertion.sh
./test_hnsw_ngfix_deletion.sh
```

Using the default parameters provided in the scripts and code achieves good performance on most datasets, while keeping the index construction overhead low.
To improve performance in real-world scenarios, we provide the option of combining HNSW-NGFix* with RaBitQ-8bits. (Note that, for a fair comparison, all experiments in the paper are based on full-precision vectors.)
```shell
./build_hnsw_bottom_rabitq.sh
./build_hnsw_ngfix_rabitq.sh
./search_hnsw_ngfix_rabitq.sh
```

| WebVid2.5M | HNSW-NGFix* | HNSW-NGFix*-RaBitQ-8bits | HNSW-NGFix*-RaBitQ-8bits-rerank |
|---|---|---|---|
| Index size | 5.2G | 1.7G | 6.5G |
| recall@100 | 0.9483 | 0.9492 | 0.9509 |
| latency | 1.54ms | 1.25ms | 0.91ms |