HNSW-NGFix* is a graph-based approximate nearest neighbor search (ANNS) algorithm. It uses historical queries to fix defective regions in the graph, and is particularly effective at improving the performance of the graph index on Out-of-Distribution (OOD) queries.
We propose Escape Hardness, a metric that evaluates the quality of the graph structure around a query. We then divide graph search into two stages and dynamically identify and fix defective graph regions in each stage:
1. From the entry point to the vicinity of the query. We propose Reachability Fixing (RFix), which enhances the navigability of key nodes.
2. Searching within the vicinity of the query. We propose Neighboring Graph Defects Fixing (NGFix), which improves graph connectivity in regions where queries are densely distributed.
- AVX-512
- GCC 10.1.0+ with OpenMP
- CMake 3.0+
- Boost 1.55+
```shell
mkdir build && cd build
cmake ..
make -j
```

We support the *.fbin and *.ibin formats. Each file starts with two 32-bit integers (the number of vectors and the dimension), followed by all vectors stored sequentially. You can refer to test/tools/data_loader.h for more details.
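As a quick illustration of the layout described above, here is a minimal reader for the two formats (a sketch; the repository's own loader is test/tools/data_loader.h, and the function names below are ours):

```python
import struct
import numpy as np

def read_fbin(path):
    """Read a *.fbin file: an int32 count and int32 dimension header,
    followed by count * dimension float32 values stored row by row."""
    with open(path, "rb") as f:
        num, dim = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=np.float32, count=num * dim)
    return data.reshape(num, dim)

def read_ibin(path):
    """Read a *.ibin file: same header, followed by int32 values
    (e.g. ground-truth neighbor ids)."""
    with open(path, "rb") as f:
        num, dim = struct.unpack("<ii", f.read(8))
        data = np.fromfile(f, dtype=np.int32, count=num * dim)
    return data.reshape(num, dim)
```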
The ground truth is stored in the *.ibin format.
We use the program provided by DiskANN to compute the ground truth. You can also use a GPU to accelerate the computation.
Remember to handle floating-point errors and duplicate vectors; otherwise they may distort the recall measured in testing.
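One way the issue above shows up: duplicate vectors produce exact distance ties, so an index can return a different (but equally close) id than the stored ground truth and be unfairly counted as a miss. A tie-aware recall check such as the sketch below avoids this (the function name and epsilon are our choices, not part of the repository):

```python
import numpy as np

def tie_aware_recall(query, base, returned_ids, gt_ids, k, eps=1e-5):
    """Recall@k that tolerates floating-point ties: a returned id counts
    as a hit if its distance to the query is within eps of the k-th
    ground-truth distance, so duplicate vectors are not penalized."""
    d = np.linalg.norm(base - query, axis=1)
    threshold = np.sort(d[gt_ids[:k]])[-1] + eps
    hits = np.sum(d[returned_ids[:k]] <= threshold)
    return hits / k
```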
| Dataset | Link |
|---|---|
| Text-to-Image | https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search |
| LAION | https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-embeddings |
| WebVid | You can download it from the repository of RoarGraph |
| MainSearch | https://zenodo.org/records/17257137 |
The MainSearch dataset (which we introduce) is sourced from a large e-commerce platform (Alibaba Group). The training paradigm of the embedding model consists of two stages: (1) LLM-based supervised fine-tuning (LLM-SFT), which adapts the multimodal LLM to produce deep semantic vectors from (query, document) pairs; and (2) retrieval-oriented reinforcement learning from human feedback (RO-RLHF), which establishes an iterative feedback loop for continual refinement of retrieval quality. A lightweight retrieval framework is then applied, and the original 3072-dimensional embeddings generated by the multimodal LLM are compressed to 256 dimensions via Matryoshka Representation Learning (MRL), significantly improving retrieval efficiency while preserving semantic fidelity.
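For context, MRL trains nested embeddings so that a prefix of the full vector is itself a usable embedding; compressing 3072 dimensions down to 256 is then a truncate-and-renormalize step. A minimal sketch of that step (the function name and the L2 renormalization are our assumptions, not details of the released pipeline):

```python
import numpy as np

def mrl_truncate(embeddings, target_dim=256):
    """Keep the first target_dim coordinates of each MRL-trained embedding
    and L2-renormalize, the standard way to use Matryoshka
    representations at a smaller dimension."""
    truncated = embeddings[:, :target_dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)
```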
Since the MainSearch dataset comes from a real production environment, there are many duplicates in both the queries and the base data. We have already deduplicated the queries, and you can decide whether to deduplicate the base data depending on your needs.
```shell
# remember to modify the paths in the shell scripts.
./build_hnsw_bottom.sh          # build the base graph
./build_hnsw_ngfix_aknn.sh      # use approximate ground truth to build HNSW-NGFix*
# ./build_hnsw_ngfix.sh         # use exact ground truth to build HNSW-NGFix*
./search_hnsw_ngfix.sh          # test HNSW-NGFix*
./test_hnsw_ngfix_insertion.sh
./test_hnsw_ngfix_deletion.sh
```

Using the default parameters provided in the scripts and code achieves good performance on most datasets, while keeping the index construction overhead low.
To improve performance in real-world scenarios, we provide the option of combining HNSW-NGFix* with RaBitQ-8bits. (Note that, for a fair comparison, all experiments in the paper are based on full-precision vectors.)
```shell
./build_hnsw_bottom_rabitq.sh
./build_hnsw_ngfix_rabitq.sh
./search_hnsw_ngfix_rabitq.sh
```

| WebVid2.5M | HNSW-NGFix* | HNSW-NGFix*-RaBitQ-8bits | HNSW-NGFix*-RaBitQ-8bits-rerank |
|---|---|---|---|
| Index size | 5.2G | 1.7G | 6.5G |
| recall@100 | 0.9483 | 0.9492 | 0.9509 |
| latency | 1.54ms | 1.25ms | 0.91ms |