Description
In my KNN-related applications, I often create the HNSW index and pass it to downstream functions without the original data. When I need to do a KNN search within the data, I use HierarchicalNSW::getDataByLabel() to extract the data vector before passing it to searchKnn().
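For concreteness, a minimal sketch of that pattern (the function and parameter names are illustrative, not from any particular application):

#include <vector>
#include "hnswlib/hnswlib.h"

// Illustrative only: run a KNN query using a vector already stored in the index.
auto knn_for_label(hnswlib::HierarchicalNSW<float>& index,
                   hnswlib::labeltype label, size_t k) {
    // getDataByLabel() copies the stored vector out into a fresh std::vector
    std::vector<float> query = index.getDataByLabel<float>(label);
    // searchKnn() takes a raw pointer to the query data
    return index.searchKnn(query.data(), k);
}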
This works pretty well, but some examination of the source reveals that getDataByLabel() does a surprising amount of work:
Lines 826 to 847 in c1b9b79
template<typename data_t>
std::vector<data_t> getDataByLabel(labeltype label) const {
    // lock all operations with element by label
    std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));
    std::unique_lock <std::mutex> lock_table(label_lookup_lock);
    auto search = label_lookup_.find(label);
    if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
        throw std::runtime_error("Label not found");
    }
    tableint internalId = search->second;
    lock_table.unlock();
    char* data_ptrv = getDataByInternalId(internalId);
    size_t dim = *((size_t *) dist_func_param_);
    std::vector<data_t> data;
    data_t* data_ptr = (data_t*) data_ptrv;
    for (size_t i = 0; i < dim; i++) {
        data.push_back(*data_ptr);
        data_ptr += 1;
    }
    return data;
}
i.e., multiple locks that could contend in a multi-threaded context, plus a heap allocation and an element-by-element copy on every call.
Might you consider adding a streamlined version like:
const char* getDataByLabelRaw(labeltype label) const {
    auto search = label_lookup_.find(label);
    if (search == label_lookup_.end() || isMarkedDeleted(search->second)) {
        throw std::runtime_error("Label not found");
    }
    tableint internalId = search->second;
    return getDataByInternalId(internalId);
}
where the pointer could be passed directly to searchKnn()? This avoids allocating and copying into a new vector, and skips the various locks under the assumption that the user is not adding or deleting points at the same time. It probably won't make much of a performance difference in the single-threaded case, but it is a bit more ergonomic, since I can just pass the pointer straight to searchKnn().
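For illustration, the downstream call site would then collapse to something like this (hypothetical, since getDataByLabelRaw() exists only as the sketch above):

// Hypothetical: assumes the getDataByLabelRaw() sketched above were added.
auto knn_for_label_raw(hnswlib::HierarchicalNSW<float>& index,
                       hnswlib::labeltype label, size_t k) {
    const char* query = index.getDataByLabelRaw(label);  // no copy, no allocation
    return index.searchKnn(query, k);                    // pointer passed straight to searchKnn()
}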
FWIW we could then streamline getDataByLabel() itself to:
template<typename data_t>
std::vector<data_t> getDataByLabel(labeltype label) const {
    // lock all operations with element by label
    std::unique_lock <std::mutex> lock_label(getLabelOpMutex(label));
    std::unique_lock <std::mutex> lock_table(label_lookup_lock);
    auto data_ptrv = getDataByLabelRaw(label);
    lock_table.unlock();
    size_t dim = *((size_t *) dist_func_param_);
    data_t* data_ptr = (data_t*) data_ptrv;
    return std::vector<data_t>(data_ptr, data_ptr + dim);
}
(TBH I don't really understand the motivation for the locks - are people really reading from the index at the same time as they're adding or deleting points? That seems like it would invite all kinds of race conditions. But if this is a common operation, one might also consider switching to a std::shared_mutex + std::shared_lock so that parallel reads don't contend with each other.)
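For what it's worth, a toy, self-contained sketch of the shared-lock pattern I have in mind (not hnswlib code; LabelTable, mtx, and map are stand-ins for the index, label_lookup_lock, and label_lookup_):

#include <cstddef>
#include <shared_mutex>
#include <unordered_map>

// Toy stand-in: many readers can hold the lock concurrently,
// while writers still get exclusive access.
struct LabelTable {
    mutable std::shared_mutex mtx;            // would replace the plain std::mutex label_lookup_lock
    std::unordered_map<size_t, size_t> map;   // stands in for label_lookup_

    size_t find(size_t label) const {
        std::shared_lock<std::shared_mutex> lock(mtx);  // shared: parallel reads don't contend
        return map.at(label);
    }

    void insert(size_t label, size_t internal_id) {
        std::unique_lock<std::shared_mutex> lock(mtx);  // exclusive: writers serialize as before
        map[label] = internal_id;
    }
};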