
fix(sindi): fix error when neither doc nor query is relevant#1589

Merged
inabao merged 3 commits into main from fix-sindi-large
Feb 12, 2026
Collaborator

@inabao inabao commented Feb 10, 2026

close: #1581

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @inabao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request resolves an error in the sparse term data cell component, specifically targeting scenarios where neither documents nor queries yield relevant results. The solution refines the logic for handling zero-distance cases during initial heap filling and introduces a dedicated counter to accurately track the total number of inserted vectors. By leveraging this new counter, search loops are now guaranteed to process the complete dataset, enhancing the reliability and correctness of search operations in edge conditions.

Highlights

  • Corrected Distance Comparison: The condition for checking distance in fill_heap_initial was updated from dist != 0 to dist > 0. This change ensures that items with a distance of exactly zero are handled correctly and do not trigger filtering logic intended for non-zero distances, addressing an edge case in relevance calculations.
  • Introduced Total Count Tracking: A new member variable, total_count_ of type int64_t, was added to the SparseTermDataCell class. This variable is now used to accurately track the total number of vectors inserted into the data cell.
  • Updated Search Loop Iteration: The iteration bounds within the InsertHeapByDists method were modified to use the newly introduced total_count_ instead of dists_size. This ensures that search operations correctly iterate over all available data, preventing potential omissions and improving the robustness of the search algorithm.
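
The PR page does not show why dists_size was an unsuitable loop bound. As a minimal sketch of one way the two bounds could diverge, assume (hypothetically) that the distance buffer grows in fixed-size blocks, so its size can exceed the number of inserted vectors. ToyCell, kBlock, dists_ and ScanIds are illustrative names and not the real SparseTermDataCell API; only total_count_ and InsertVector are taken from the summary above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy cell; it only mirrors names from the PR summary.
struct ToyCell {
    static constexpr int64_t kBlock = 8;  // assumed growth step of the scratch buffer

    int64_t total_count_ = 0;   // number of vectors actually inserted
    std::vector<float> dists_;  // scratch distances, sized in whole blocks (assumption)

    void InsertVector() {
        ++total_count_;
        // Round the buffer size up to the next multiple of kBlock.
        const int64_t needed = (total_count_ + kBlock - 1) / kBlock * kBlock;
        dists_.resize(static_cast<std::size_t>(needed), 0.0F);
    }

    // Return every id in [0, bound). This models a scan in which no
    // distance filters anything out, so the bound alone decides the ids.
    std::vector<int64_t> ScanIds(int64_t bound) const {
        assert(bound >= 0 && bound <= static_cast<int64_t>(dists_.size()));
        std::vector<int64_t> ids;
        for (int64_t i = 0; i < bound; ++i) {
            ids.push_back(i);
        }
        return ids;
    }
};
```

In this toy, after five inserts a scan bounded by the buffer size visits ids 0 through 7, while a scan bounded by total_count_ stops at id 4, so only the second bound stays within the inserted vectors.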


Changelog
  • src/datacell/sparse_term_datacell.cpp
    • Modified the distance check in fill_heap_initial from dist != 0 to dist > 0.
    • Updated loop conditions in InsertHeapByDists to use total_count_ instead of dists_size.
    • Incremented total_count_ when a new vector is inserted in InsertVector.
  • src/datacell/sparse_term_datacell.h
    • Added a new private member variable total_count_ of type int64_t.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an error in the sindi sparse search implementation. It introduces a total_count_ member to SparseTermDataCell to robustly track the number of vectors, and uses this count in loops, which is a good improvement for correctness and robustness.

However, I've found a critical issue in the change at src/datacell/sparse_term_datacell.cpp:104. The condition for initial heap filling was changed from dist != 0 to dist > 0. As dist represents the negative inner product (-IP), this change incorrectly prioritizes dissimilar vectors (IP < 0) and will likely cause the k-NN search to fail or return incorrect results. I've left a comment with a suggested fix.

      uint32_t n_candidate,
      const FilterPtr& filter) const {
-     if (dist != 0) {
+     if (dist > 0) {

critical

The condition dist > 0 appears to be incorrect for a similarity search. Based on my understanding of the code, dist represents the negative inner product (-IP). For similarity, we seek a large IP, which means a small (negative) dist.

With dist > 0, the code selects for IP < 0, prioritizing dissimilar vectors. If all inner products are non-negative, the initial heap filling will fail.

The original dist != 0 was more correct. To be more precise and efficient for similarity search, the condition should likely be dist < 0 to only consider vectors with positive inner products.

    if (dist < 0) {
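
To make the three candidate conditions concrete, here is a small sketch that counts how many candidates each one accepts. It relies on the reviewer's stated reading that dist is the negative inner product, which the PR page does not confirm. PredicateCounts and CountPassing are hypothetical helpers for illustration only.

```cpp
#include <vector>

// How many candidates pass each condition discussed in the review.
struct PredicateCounts {
    int not_zero = 0;  // dist != 0
    int positive = 0;  // dist > 0
    int negative = 0;  // dist < 0
};

// Assumes dist = -IP, as the review comment does.
PredicateCounts CountPassing(const std::vector<float>& inner_products) {
    PredicateCounts counts;
    for (float ip : inner_products) {
        const float dist = -ip;
        if (dist != 0) ++counts.not_zero;
        if (dist > 0) ++counts.positive;
        if (dist < 0) ++counts.negative;
    }
    return counts;
}
```

Under that reading, inner products {2.0, 0.5, 0.0} pass dist != 0 and dist < 0 twice each and never pass dist > 0. When every inner product is zero, which is the case named in the PR title, all three conditions reject every candidate.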

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>

codecov bot commented Feb 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1589      +/-   ##
==========================================
- Coverage   91.18%   90.65%   -0.54%     
==========================================
  Files         329      329              
  Lines       19396    19457      +61     
==========================================
- Hits        17687    17639      -48     
- Misses       1709     1818     +109     
| Flag | Coverage Δ |
|---|---|
| cpp | 90.65% <100.00%> (-0.54%) ⬇️ |


| Components | Coverage Δ |
|---|---|
| common | 85.81% <ø> (-0.18%) ⬇️ |
| datacell | 91.92% <ø> (-1.30%) ⬇️ |
| index | 90.03% <100.00%> (-0.57%) ⬇️ |
| simd | 100.00% <ø> (ø) |
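
As a quick consistency check on the coverage diff above, the sketch below recomputes the totals from the reported Lines and Hits, assuming coverage is simply hits divided by lines. CoveragePercent and Misses are hypothetical helper names, not part of Codecov.

```cpp
#include <cassert>
#include <cstdint>

// Coverage as a percentage of lines that were hit.
double CoveragePercent(int64_t hits, int64_t lines) {
    assert(lines > 0);
    return 100.0 * static_cast<double>(hits) / static_cast<double>(lines);
}

// Lines that were never hit.
int64_t Misses(int64_t hits, int64_t lines) {
    return lines - hits;
}
```

With the reported numbers, Misses(17687, 19396) is 1709 and Misses(17639, 19457) is 1818, matching the Misses row, and the two percentages fall between 91.18 and 91.19 and between 90.65 and 90.66 respectively.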

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 765e6cb...28e33f0.

Collaborator

@wxyucs wxyucs left a comment

lgtm

Collaborator

@LHT129 LHT129 left a comment

LGTM

@inabao inabao merged commit 488225a into main Feb 12, 2026
29 of 31 checks passed
@inabao inabao deleted the fix-sindi-large branch February 12, 2026 07:56
Collaborator

wxyucs commented Feb 27, 2026

@inabao this pull request cannot be cherry-picked to branch 0.16 (CONFLICTS); please create a new pull request against branch 0.16.

Collaborator

wxyucs commented Feb 27, 2026

@inabao this pull request cannot be cherry-picked to branch 0.17 (CONFLICTS); please create a new pull request against branch 0.17.

wxyucs pushed a commit that referenced this pull request Feb 27, 2026
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
inabao added a commit that referenced this pull request Feb 27, 2026
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
inabao added a commit that referenced this pull request Feb 28, 2026
…1609)

Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>

Development

Successfully merging this pull request may close these issues.

call SINDI::KnnSearch return id is too large
