Reduce idle CPU usage by replacing polling with blocking in worker threads #1902
DocShotgun wants to merge 1 commit into kvcache-ai:main
Conversation
* Replace sleep polling with condition variable blocking
Code Review
This pull request replaces thread polling and sleeping with condition variables and mutexes to improve synchronization efficiency in the InNumaPool and NumaJobDistributor classes. Feedback focuses on optimizing atomic memory orders within condition variable predicates, where std::memory_order_relaxed is sufficient due to the existing mutex lock, and suggests refactoring duplicated initialization logic in NumaJobDistributor into a private helper function.
```cpp
thread_state_[thread_id].cv.wait(lock, [&] {
  return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
```
The std::condition_variable::wait predicate is executed while the mutex is locked. The lock and unlock operations on the mutex already provide the necessary memory synchronization (acquire-release semantics). Therefore, using std::memory_order_acquire for loading the atomic status inside the predicate is stronger than necessary. You can use std::memory_order_relaxed for a minor performance improvement and to more clearly express that synchronization relies on the mutex.
Suggested change:

```diff
 thread_state_[thread_id].cv.wait(lock, [&] {
-  return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
+  return thread_state_[thread_id].status.load(std::memory_order_relaxed) != ThreadStatus::WAITING;
 });
```
```cpp
mutexes.push_back(std::make_unique<std::mutex>());
cvs.push_back(std::make_unique<std::condition_variable>());
```
```cpp
cvs[numa_id]->wait(lock, [&] {
  return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
```
Similar to my other comment, the std::condition_variable::wait predicate is executed under a lock, which already ensures memory visibility. Using std::memory_order_acquire here is not necessary. std::memory_order_relaxed is sufficient and more idiomatic in this context, as the mutex handles the synchronization.
Suggested change:

```diff
 cvs[numa_id]->wait(lock, [&] {
-  return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
+  return status[numa_id]->load(std::memory_order_relaxed) != ThreadStatus::WAITING;
 });
```
This PR eliminates excessive idle CPU usage in the CPU backend threadpools by replacing fixed-interval polling (`sleep_for(1ms)`) with condition-variable-based blocking.

The current behavior is that worker threads spin for 50ms upon first becoming idle, and then enter a 1ms sleep-based polling loop. This results in around ~2% idle CPU utilization per thread. On my configuration with `--kt-cpuinfer 128`, this adds up to around ~256%, or ~2.5 cores' worth of constant CPU utilization while no inference requests are running.

With this PR:

* The 50ms active period is kept
* The `sleep_for(1ms)` polling loop is replaced with `std::condition_variable::wait` on a `std::mutex`
* Idle threads are woken via the `std::condition_variable` (`notify_one`) when transitioning to `WORKING` or `EXIT`

After applying these changes locally, the inference server's idle CPU usage drops from ~256% to a mere ~1-2%.