Fix SINDI index returning no results after deserialization #1623

Closed
Copilot wants to merge 2 commits into main from copilot/fix-sindi-index-deserialization
Contributor

Copilot AI commented Mar 4, 2026

SparseTermDataCell::total_count_ is neither persisted by Serialize nor restored by Deserialize. Because InsertHeapByDists uses it as its loop bound, a deserialized index iterates over zero candidates and returns empty results, regardless of what the index contains.

Changes

  • src/datacell/sparse_term_datacell.cpp: Reconstruct total_count_ in Deserialize by scanning the loaded term_ids_ and computing max(inner_id) + 1. This is backward compatible: the serialization format is unchanged, so existing serialized files load correctly.

  • src/algorithm/sindi/sindi_test.cpp: Assert another_result->GetDim() == result->GetDim() after deserializing, so this regression is caught by unit tests.
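
The reconstruction described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PostingLists` and `ReconstructTotalCount` are hypothetical names, and the layout (one vector of inner ids per term) is an assumption about how the loaded `term_ids_` might look.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the loaded posting lists:
// one vector of inner ids per term (assumed layout).
using PostingLists = std::vector<std::vector<uint32_t>>;

// Returns max(inner_id) + 1 over all posting lists,
// or 0 when every list is empty.
uint64_t ReconstructTotalCount(const PostingLists& term_ids) {
    uint64_t total = 0;
    for (const auto& ids : term_ids) {
        for (uint32_t id : ids) {
            // Widen before adding 1 so the largest uint32_t id cannot overflow.
            total = std::max<uint64_t>(total, static_cast<uint64_t>(id) + 1);
        }
    }
    return total;
}
```

One caveat of this approach: it only counts up to the highest inner id that appears in some posting list, so trailing documents with no terms would not be included in the reconstructed count.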

Root cause

// InsertHeapByDists — called when use_term_lists_heap_insert=false
for (; id < total_count_; id++) {   // total_count_ = 0 after deserialization
    insert_candidate_into_heap(...); // never executes → empty results
}
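
To make the effect of the loop bound concrete, here is a toy model of the same pattern. It is not the vsag implementation: `CollectCandidates` is a hypothetical function, and treating a zero distance as "no match" is an assumption made only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Toy model: dists is indexed by inner id, and only ids below
// total_count are ever considered as candidates.
std::vector<uint32_t> CollectCandidates(const std::vector<float>& dists,
                                        uint32_t total_count) {
    std::vector<uint32_t> out;
    for (uint32_t id = 0; id < total_count && id < dists.size(); ++id) {
        if (dists[id] != 0.0f) {  // assumed convention: 0 means no overlap
            out.push_back(id);
        }
    }
    return out;
}
```

With `total_count` left at 0, the function returns an empty vector no matter what `dists` holds, which mirrors the empty result set seen after deserialization.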
Original prompt

This section details the original issue you should resolve

<issue_title>No results can be retrieved after deserializing from the file for the SINDI index</issue_title>
<issue_description>Describe the bug
No results can be retrieved from the SINDI index after deserializing it from a file. However, querying the same data immediately after the index is built returns results.

To Reproduce

deserialize from index

TEST_CASE_PERSISTENT_FIXTURE(fixtures::SINDITestIndex,
                             "Test Deserialize",
                             "[ft][sindi][search]") {
    auto origin_size = vsag::Options::Instance().block_size_limit();
    auto size = 1024 * 1024 * 2;
    auto dim = 768;
    vsag::Options::Instance().set_block_size_limit(size);
    std::string build_param = "{\"dtype\":\"sparse\",\"metric_type\":\"ip\",\"dim\": 1024,\"index_param\":{\"use_reorder\":false,\"doc_prune_ratio\":0.00000,\"window_size\":60000,\"deserialize_without_footer\":true,\"deserialize_without_buffer\":true,\"use_quantization\":false,\"term_id_limit\":500000,\"avg_doc_term_length\":120}}";
    auto index_result = vsag::Factory::CreateIndex("sindi", build_param);
    REQUIRE(index_result.has_value());
    auto index = index_result.value();
    
    std::ifstream infile("sindi.index", std::ios::binary);
    auto des_res = index->Deserialize(infile);
    if (! des_res.has_value()) {
      std::cerr << "Deserialize fail, error=" << (int)des_res.error().type <<std::endl;
      throw std::runtime_error("Index Deserialize fail"); 
    }


    std::string search_param = "{\"sindi\":{\"query_prune_ratio\":0.000000,\"use_term_lists_heap_insert\":false,\"n_candidate\":0}}";
    uint32_t len = 3;
    uint32_t* dims = new uint32_t[len];
    float* vals = new float[len];
    dims[0] = 1;
    dims[1] = 2;
    dims[2] = 3;
    vals[0] = 0.1;
    vals[1] = 0.2;
    vals[2] = 0.3;
    vsag::SparseVector sparse;
    sparse.len_ = len;
    sparse.ids_ = dims;
    sparse.vals_ = vals;
    auto data = vsag::Dataset::Make();
    data->NumElements(1)->SparseVectors(&sparse)->Owner(false);
    auto result = index->KnnSearch(data, 2, search_param);
    REQUIRE(result.has_value());
    const int64_t* neighbors = result.value()->GetIds();
    auto result_size = result.value()->GetDim();
    auto dist = result.value()->GetDistances();
    std::cout << "result_size = " << result_size << std::endl;
    for(int i=0; i<result_size; ++i) {
      std::cout << ", neighbors[" << i << "]=" << neighbors[i]<< ", dist=" << dist[i] << std::endl;
    }
}

same data

TEST_CASE_PERSISTENT_FIXTURE(fixtures::SINDITestIndex,
                             "SINDI Sparse SQL Example",
                             "[ft][sindi][build]") {

    std::string build_param = "{\"dtype\":\"sparse\",\"metric_type\":\"ip\",\"dim\": 1024,\"index_param\":{\"use_reorder\":false,\"doc_prune_ratio\":0.00000,\"window_size\":60000,\"deserialize_without_footer\":true,\"deserialize_without_buffer\":true,\"use_quantization\":false,\"term_id_limit\":500000,\"avg_doc_term_length\":120}}";
    auto index = TestFactory("sindi", build_param, true);

    constexpr uint64_t count = 10;
    std::vector<int64_t> ids(count);
    for (uint64_t i = 0; i < count; ++i) {
        ids[i] = static_cast<int64_t>(i + 10001);
    }

    std::vector<std::vector<uint32_t>> term_ids = {
        {1, 2, 3},
        {2, 3, 4},
        {3, 4, 5},
        {4, 5, 6},
        {5, 6, 7},
        {6, 7, 8},
        {7, 8, 9},
        {8, 9, 10},
        {9, 10, 11},
        {10, 11, 12},
    };
    std::vector<std::vector<float>> term_vals = {
        {0.1f, 0.2f, 0.3f},
        {0.2f, 0.3f, 0.4f},
        {0.3f, 0.4f, 0.5f},
        {0.4f, 0.5f, 0.6f},
        {0.5f, 0.6f, 0.7f},
        {0.6f, 0.7f, 0.8f},
        {0.7f, 0.8f, 0.9f},
        {0.8f, 0.9f, 1.0f},
        {0.9f, 1.0f, 1.1f},
        {1.0f, 1.1f, 1.2f},
    };

    std::vector<vsag::SparseVector> sparse_vectors(count);
    for (uint64_t i = 0; i < count; ++i) {
        sparse_vectors[i].len_ = static_cast<uint32_t>(term_ids[i].size());
        sparse_vectors[i].ids_ = term_ids[i].data();
        sparse_vectors[i].vals_ = term_vals[i].data();
    }

    auto base = vsag::Dataset::Make();
    base->NumElements(count)
        ->Ids(ids.data())
        ->SparseVectors(sparse_vectors.data())
        ->Owner(false);

    auto add_result = index->Add(base);
    REQUIRE(add_result.has_value());
    REQUIRE(add_result.value().empty());

    std::vector<uint32_t> query_ids = {1, 2, 3};
    std::vector<float> query_vals = {0.1f, 0.2f, 0.3f};
    vsag::SparseVector query_sv;
    query_sv.len_ = static_cast<uint32_t>(query_ids.size());
    query_sv.ids_ = query_ids.data();
    query_sv.vals_ = query_vals.data();
    auto query = vsag::Dataset::Make();
    query->NumElements(1)->SparseVectors(&query_sv)->Owner(false);

    std::string search_param = "{\"sindi\":{\"query_prune_ratio\":0.000000,\"use_term_lists_heap_inser...

</details>




- Fixes antgroup/vsag#1622

Co-authored-by: wxyucs <12595343+wxyucs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix deserialization issue for SINDI index" to "Fix SINDI index returning no results after deserialization" on Mar 4, 2026
Collaborator

wxyucs commented Mar 4, 2026

Same as #1621

wxyucs closed this on Mar 4, 2026
wxyucs deleted the copilot/fix-sindi-index-deserialization branch on March 24, 2026 03:26