fix(sindi): fix error when neither doc nor query is relevant (#1589)#1609
fix(sindi): fix error when neither doc nor query is relevant (#1589)#1609
Conversation
Signed-off-by: jinjiabao.jjb <jinjiabao.jjb@antgroup.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a bug in the SINDI algorithm's sparse term data handling, specifically addressing an error that occurred when neither the document nor the query yielded relevant results. The changes introduce a more robust mechanism for tracking the total number of inserted vectors and refine the distance filtering logic, ensuring that the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request effectively addresses an error scenario in the SINDI algorithm where search results might be incorrect or lead to out-of-bounds access. The changes correctly adjust loop bounds in SparseTermDataCell::InsertHeapByDists to use total_count_ instead of dists_size, ensuring iteration over the actual number of elements. The logic for identifying valid distances has been refined from dists[id] != 0 to dists[id] < 0, which aligns with the typical representation of inner product distances. A new total_count_ member variable was introduced and is properly maintained, contributing to the overall correctness of the index. The test case was also updated to reflect these corrections, improving the robustness of the testing framework.
| if (heap.size() < n_candidate) { | ||
| for (; id < dists_size; id++) { | ||
| if (dists[id] != 0) { | ||
| for (; id < total_count_; id++) { |
There was a problem hiding this comment.
The change from dists_size to total_count_ is a critical fix. Using dists_size could lead to iterating over uninitialized memory or incorrect data if dists_size is larger than the actual number of elements, potentially causing crashes or incorrect search results. total_count_ accurately reflects the number of valid elements.
| // test serialize | ||
| auto another_result = another_index->KnnSearch(query, k, search_param_str, nullptr); | ||
| for (int j = 0; j < k; j++) { | ||
| for (int j = 0; j < another_result->GetDim(); j++) { |
There was a problem hiding this comment.
| for (; id < dists_size; id++) { | ||
| if (dists[id] != 0) { | ||
| for (; id < total_count_; id++) { | ||
| if (dists[id] < 0) { |
There was a problem hiding this comment.
Changing the condition from dists[id] != 0 to dists[id] < 0 is a significant logical correction. Given that inner product distances are often stored as negative values (e.g., dist_ip = -1 * query * base), checking for dists[id] < 0 correctly identifies valid, processed distances, preventing false positives or negatives in the heap insertion logic.
| } | ||
|
|
||
| for (; id < dists_size; id++) { | ||
| for (; id < total_count_; id++) { |
There was a problem hiding this comment.
Similar to the previous instance, updating the loop bound from dists_size to total_count_ here is essential for correctness. It ensures that the iteration for inserting into the heap does not go beyond the actual number of elements, preventing potential memory access issues and ensuring accurate processing.
| term_datas_[term].push_back(val); | ||
| term_sizes_[term] += 1; | ||
| } | ||
| total_count_++; |
There was a problem hiding this comment.
|
|
||
| Allocator* const allocator_{nullptr}; | ||
|
|
||
| int64_t total_count_{0}; |
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## 0.16 #1609 +/- ##
==========================================
+ Coverage 92.21% 92.37% +0.15%
==========================================
Files 295 295
Lines 15677 15685 +8
==========================================
+ Hits 14457 14489 +32
+ Misses 1220 1196 -24
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
cp #1589 to 0.16
link #1581