fix(sindi): fix error when neither doc nor query is relevant#1589
fix(sindi): fix error when neither doc nor query is relevant#1589
Conversation
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Summary of ChangesHello @inabao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an error in the sparse term data cell component, specifically targeting scenarios where neither documents nor queries yield relevant results. The solution refines the logic for handling zero-distance cases during initial heap filling and introduces a dedicated counter to accurately track the total number of inserted vectors. By leveraging this new counter, search loops are now guaranteed to process the complete dataset, enhancing the reliability and correctness of search operations in edge conditions. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request aims to fix an error in the sindi sparse search implementation. It introduces a total_count_ member to SparseTermDataCell to robustly track the number of vectors, and uses this count in loops, which is a good improvement for correctness and robustness.
However, I've found a critical issue in the change at src/datacell/sparse_term_datacell.cpp:104. The condition for initial heap filling was changed from dist != 0 to dist > 0. As dist represents the negative inner product (-IP), this change incorrectly prioritizes dissimilar vectors (IP < 0) and will likely cause the k-NN search to fail or return incorrect results. I've left a comment with a suggested fix.
| uint32_t n_candidate, | ||
| const FilterPtr& filter) const { | ||
| if (dist != 0) { | ||
| if (dist > 0) { |
There was a problem hiding this comment.
The condition dist > 0 appears to be incorrect for a similarity search. Based on my understanding of the code, dist represents the negative inner product (-IP). For similarity, we seek a large IP, which means a small (negative) dist.
With dist > 0, the code selects for IP < 0, prioritizing dissimilar vectors. If all inner products are non-negative, the initial heap filling will fail.
The original dist != 0 was more correct. To be more precise and efficient for similarity search, the condition should likely be dist < 0 to only consider vectors with positive inner products.
if (dist < 0) {
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## main #1589 +/- ##
==========================================
- Coverage 91.18% 90.65% -0.54%
==========================================
Files 329 329
Lines 19396 19457 +61
==========================================
- Hits 17687 17639 -48
- Misses 1709 1818 +109
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
@inabao this pull request cannot cherry-pick to the branch 0.16 (CONFLICTS), please create a new pull request to the branch 0.16 . |
|
@inabao this pull request cannot cherry-pick to the branch 0.17 (CONFLICTS), please create a new pull request to the branch 0.17 . |
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
close: #1581