Minor optimizations for getDataByLabel? #643

@LTLA

Description

In my KNN-related applications, I often create the HNSW index and pass it to downstream functions without the original data. When I need to do a KNN search within the data, I use HierarchicalNSW::getDataByLabel() to extract the data vector before passing it to searchKnn().

This works pretty well, but some examination of the source reveals that getDataByLabel() does a surprising amount of work:

hnswlib/hnswlib/hnswalg.h

Lines 826 to 847 in c1b9b79

    std::vector<data_t> getDataByLabel(labeltype label) const {
        // lock all operations with element by label
        std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));

        std::unique_lock <std::mutex> lock_table(label_lookup_lock);
        auto search = label_lookup_.find(label);
        if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
            throw std::runtime_error("Label not found");
        }
        tableint internalId = search->second;
        lock_table.unlock();

        char* data_ptrv = getDataByInternalId(internalId);
        size_t dim = *((size_t *) dist_func_param_);
        std::vector<data_t> data;
        data_t* data_ptr = (data_t*) data_ptrv;
        for (size_t i = 0; i < dim; i++) {
            data.push_back(*data_ptr);
            data_ptr += 1;
        }
        return data;
    }

That is, it takes multiple locks that might cause contention in a multi-threaded context, plus performs an allocation and a copy.

Might you consider adding a streamlined version like:

    const char* getDataByLabelRaw(labeltype label) const {
        auto search = label_lookup_.find(label);
        if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
            throw std::runtime_error("Label not found");
        }
        tableint internalId = search->second;
        return getDataByInternalId(internalId);
    }

where the pointer could be passed directly to searchKnn()? This avoids allocating and copying into a new vector, and skips the various locks under the assumption that the user is not adding or deleting points at the same time. It probably won't make much of a performance difference in the single-threaded case, but it is a bit more ergonomic.

FWIW we could then streamline getDataByLabel() itself to:

    template<typename data_t>
    std::vector<data_t> getDataByLabel(labeltype label) const {
        // lock all operations with element by label
        std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));
        
        std::unique_lock <std::mutex> lock_table(label_lookup_lock);
        auto data_ptrv = getDataByLabelRaw(label);
        lock_table.unlock();

        size_t dim = *((size_t *) dist_func_param_);
        data_t* data_ptr = (data_t*) data_ptrv;
        return std::vector<data_t>(data_ptr, data_ptr + dim);
    }

(TBH I don't really understand the motivation for the locks - are people really reading from the index at the same time as they're adding/deleting points? That seems like it would invite all kinds of race conditions. But if concurrent reads are a common operation, one might also consider switching to a std::shared_mutex + std::shared_lock so that parallel reads don't contend with each other.)
