
fix(sindi): fix error when neither doc nor query is relevant (#1589) #1609

Merged
inabao merged 1 commit into 0.16 from cherry-pick-488225a-fix
Feb 28, 2026

Conversation

Collaborator

@inabao inabao commented Feb 27, 2026

Cherry-pick #1589 to 0.16.
Links #1581.

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
@inabao inabao self-assigned this Feb 27, 2026
@inabao inabao added the kind/bug and version/0.16 labels Feb 27, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the SINDI algorithm's sparse term data handling, specifically addressing an error that occurred when neither the document nor the query yielded relevant results. The changes introduce a more robust mechanism for tracking the total number of inserted vectors and refine the distance filtering logic, ensuring that the KnnSearch operation and its serialization behave correctly under various relevance conditions.

Highlights

  • Corrected Distance Filtering Logic: Modified the SparseTermDataCell::InsertHeapByDists method to change the condition for considering a distance relevant from dists[id] != 0 to dists[id] < 0. This change specifically addresses an error scenario where neither the document nor the query was relevant, ensuring that only truly relevant distances are processed.
  • Introduced Total Vector Count: Added a new total_count_ member variable to the SparseTermDataCell class and ensured it is incremented every time a vector is inserted. This provides an accurate count of all vectors processed by the cell.
  • Improved Loop Iteration Accuracy: Updated the iteration bounds in SparseTermDataCell::InsertHeapByDists from dists_size to the newly introduced total_count_, ensuring that loops correctly process all relevant elements.
  • Adjusted Test Case Validation: Modified a test case in src/algorithm/sindi/sindi_test.cpp to use another_result->GetDim() instead of a fixed k value when iterating through search results, improving the robustness and accuracy of test validation after serialization.

Changelog
  • src/algorithm/sindi/sindi_test.cpp
    • Updated the loop condition in the KnnSearch serialization test to use another_result->GetDim() for accurate iteration.
  • src/data_cell/sparse_term_datacell.cpp
    • Modified the distance relevance check from dists[id] != 0 to dists[id] < 0 within InsertHeapByDists.
    • Replaced dists_size with total_count_ in iteration loops within InsertHeapByDists.
    • Added an increment to total_count_ when a vector is inserted into the SparseTermDataCell.
  • src/data_cell/sparse_term_datacell.h
    • Declared a new int64_t total_count_ member variable to track the number of inserted vectors.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses an error scenario in the SINDI algorithm where search results might be incorrect or lead to out-of-bounds access. The changes correctly adjust loop bounds in SparseTermDataCell::InsertHeapByDists to use total_count_ instead of dists_size, ensuring iteration over the actual number of elements. The logic for identifying valid distances has been refined from dists[id] != 0 to dists[id] < 0, which aligns with the typical representation of inner product distances. A new total_count_ member variable was introduced and is properly maintained, contributing to the overall correctness of the index. The test case was also updated to reflect these corrections, improving the robustness of the testing framework.

  if (heap.size() < n_candidate) {
- for (; id < dists_size; id++) {
- if (dists[id] != 0) {
+ for (; id < total_count_; id++) {

critical

The change from dists_size to total_count_ is a critical fix. Using dists_size could lead to iterating over uninitialized memory or incorrect data if dists_size is larger than the actual number of elements, potentially causing crashes or incorrect search results. total_count_ accurately reflects the number of valid elements.
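
To illustrate the point about loop bounds, here is a minimal sketch, not the VSAG implementation. It assumes a distance buffer that may hold more slots than there are inserted vectors, with stale values in the unused tail. `CollectHits` and its parameters are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: collect the ids whose distance slot marks a hit,
// scanning only the first `total_count` slots of the buffer.
std::vector<int64_t>
CollectHits(const std::vector<float>& dists, int64_t total_count) {
    // The buffer may be larger than the number of inserted vectors,
    // but it must never be smaller.
    assert(total_count >= 0 &&
           static_cast<std::size_t>(total_count) <= dists.size());
    std::vector<int64_t> ids;
    for (int64_t id = 0; id < total_count; id++) {
        // Assumed convention: a strictly negative distance marks a hit.
        if (dists[static_cast<std::size_t>(id)] < 0) {
            ids.push_back(id);
        }
    }
    return ids;
}
```

With a five-slot buffer and `total_count` of 3, the last two slots are never read, so stale values there cannot turn into spurious candidates.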

  // test serialize
  auto another_result = another_index->KnnSearch(query, k, search_param_str, nullptr);
- for (int j = 0; j < k; j++) {
+ for (int j = 0; j < another_result->GetDim(); j++) {

high

Changing the loop condition from j < k to j < another_result->GetDim() is a crucial correction. It prevents out-of-bounds access when the search returns fewer than k results (that is, when another_result->GetDim() is less than k), so the test iterates over exactly the results the search returned.
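
The same idea can be sketched outside the test suite. The `FakeResult` struct below is a hypothetical stand-in for a search result; only the `GetDim()` name is taken from the diff above, and the comparison helper `SameIds` is invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a search result. GetDim() reports how many
// hits were actually returned, which may be fewer than the requested k.
struct FakeResult {
    std::vector<int64_t> ids;
    int64_t GetDim() const { return static_cast<int64_t>(ids.size()); }
};

// Compare two results over the number of hits returned, never over k.
bool SameIds(const FakeResult& a, const FakeResult& b) {
    if (a.GetDim() != b.GetDim()) {
        return false;
    }
    for (int64_t j = 0; j < b.GetDim(); j++) {
        const auto idx = static_cast<std::size_t>(j);
        if (a.ids[idx] != b.ids[idx]) {
            return false;
        }
    }
    return true;
}
```

Because the loop bound comes from the result itself, a query that matches only two documents is compared over two entries even if k was 10.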

- for (; id < dists_size; id++) {
- if (dists[id] != 0) {
+ for (; id < total_count_; id++) {
+ if (dists[id] < 0) {

high

Changing the condition from dists[id] != 0 to dists[id] < 0 is a significant logical correction. Given that inner product distances are often stored as negative values (e.g., dist_ip = -1 * query * base), checking for dists[id] < 0 correctly identifies valid, processed distances, preventing false positives or negatives in the heap insertion logic.
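
A small sketch of the sign convention described above, under the assumption that the distance is the negated inner product over shared terms. `SparseVec`, `NegInnerProduct`, and `IsHit` are hypothetical names for this example and are not taken from the SINDI code.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sparse vector: term id -> weight.
using SparseVec = std::unordered_map<uint32_t, float>;

// Negated inner product over the terms the two vectors share.
// The result is zero when they share no term.
float NegInnerProduct(const SparseVec& query, const SparseVec& doc) {
    float ip = 0.0F;
    for (const auto& kv : query) {
        auto it = doc.find(kv.first);
        if (it != doc.end()) {
            ip += kv.second * it->second;
        }
    }
    return -ip;
}

// Under the convention assumed here, only a strictly negative distance
// counts as a hit; zero and positive values are skipped.
bool IsHit(float dist) {
    return dist < 0;
}
```

The two checks differ only for positive distances: `dist != 0` would accept them, while `dist < 0` rejects them. In this sketch a positive distance arises when a shared term has weights of opposite sign; whether that is the case the fix targets is not stated in this thread.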

}

- for (; id < dists_size; id++) {
+ for (; id < total_count_; id++) {

high

Similar to the previous instance, updating the loop bound from dists_size to total_count_ here is essential for correctness. It ensures that the iteration for inserting into the heap does not go beyond the actual number of elements, preventing potential memory access issues and ensuring accurate processing.

  term_datas_[term].push_back(val);
  term_sizes_[term] += 1;
  }
+ total_count_++;

high

Incrementing total_count_ after inserting a vector is a necessary addition. This ensures that the total_count_ member accurately reflects the number of vectors stored, which is then correctly used as the iteration limit in search functions like InsertHeapByDists.
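
As a rough illustration of the bookkeeping, the following miniature cell increments a counter once per inserted vector, independent of how many terms that vector touches. `TinyTermCell` is hypothetical; the member names `term_datas_`, `term_sizes_`, and `total_count_` echo the diff, while the layout and everything else is an assumption made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical miniature of a term-indexed cell.
class TinyTermCell {
public:
    explicit TinyTermCell(std::size_t term_capacity)
        : term_datas_(term_capacity), term_sizes_(term_capacity, 0) {}

    // Insert one sparse vector given as (term, value) pairs.
    // Terms must be smaller than term_capacity; at() throws otherwise.
    void Insert(const std::vector<std::pair<uint32_t, float>>& vec) {
        for (const auto& tv : vec) {
            term_datas_.at(tv.first).push_back(tv.second);
            term_sizes_.at(tv.first) += 1;
        }
        // Counted once per vector, even when the vector has no terms.
        total_count_++;
    }

    int64_t TotalCount() const { return total_count_; }
    uint32_t TermSize(uint32_t term) const { return term_sizes_.at(term); }

private:
    std::vector<std::vector<float>> term_datas_;
    std::vector<uint32_t> term_sizes_;
    int64_t total_count_{0};
};
```

The per-term sizes and the total count answer different questions: the former says how long each posting list is, the latter how many ids a scan over all vectors has to cover.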


  Allocator* const allocator_{nullptr};

+ int64_t total_count_{0};

medium

The introduction of total_count_ as a member variable is a necessary change to correctly track the number of elements. This variable is crucial for the updated logic in sparse_term_datacell.cpp where it is used to define accurate iteration bounds.

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             0.16    #1609      +/-   ##
==========================================
+ Coverage   92.21%   92.37%   +0.15%     
==========================================
  Files         295      295              
  Lines       15677    15685       +8     
==========================================
+ Hits        14457    14489      +32     
+ Misses       1220     1196      -24     
Flag Coverage Δ
cpp 92.37% <100.00%> (+0.15%) ⬆️

Components Coverage Δ
common 92.73% <ø> (+0.11%) ⬆️
datacell 92.49% <100.00%> (+0.59%) ⬆️
index 91.47% <66.66%> (-0.02%) ⬇️
simd 100.00% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 31735f3...6c25fff.

Collaborator

@LHT129 LHT129 left a comment

LGTM

Collaborator

@wxyucs wxyucs left a comment

lgtm

@inabao inabao merged commit 3dec5ba into 0.16 Feb 28, 2026
23 checks passed
@inabao inabao deleted the cherry-pick-488225a-fix branch February 28, 2026 06:17

Labels

kind/bug, size/S, version/0.16
