Reduce idle CPU usage by replacing polling with blocking in worker threads #1902
DocShotgun wants to merge 1 commit into kvcache-ai:main
Conversation
* Replace sleep polling with condition variable blocking
Code Review
This pull request replaces thread polling and sleeping with condition variables and mutexes to improve synchronization efficiency in the InNumaPool and NumaJobDistributor classes. Feedback focuses on optimizing atomic memory orders within condition variable predicates, where std::memory_order_relaxed is sufficient due to the existing mutex lock, and suggests refactoring duplicated initialization logic in NumaJobDistributor into a private helper function.
```cpp
thread_state_[thread_id].cv.wait(lock, [&] {
  return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
```
The std::condition_variable::wait predicate is executed while the mutex is locked. The lock and unlock operations on the mutex already provide the necessary memory synchronization (acquire-release semantics). Therefore, using std::memory_order_acquire for loading the atomic status inside the predicate is stronger than necessary. You can use std::memory_order_relaxed for a minor performance improvement and to more clearly express that synchronization relies on the mutex.
Suggested change:

```diff
 thread_state_[thread_id].cv.wait(lock, [&] {
-  return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
+  return thread_state_[thread_id].status.load(std::memory_order_relaxed) != ThreadStatus::WAITING;
 });
```
```cpp
mutexes.push_back(std::make_unique<std::mutex>());
cvs.push_back(std::make_unique<std::condition_variable>());
```
```cpp
cvs[numa_id]->wait(lock, [&] {
  return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
```
Similar to my other comment, the std::condition_variable::wait predicate is executed under a lock, which already ensures memory visibility. Using std::memory_order_acquire here is not necessary. std::memory_order_relaxed is sufficient and more idiomatic in this context, as the mutex handles the synchronization.
Suggested change:

```diff
 cvs[numa_id]->wait(lock, [&] {
-  return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
+  return status[numa_id]->load(std::memory_order_relaxed) != ThreadStatus::WAITING;
 });
```
This PR eliminates excessive idle CPU usage in the CPU backend threadpools by replacing fixed-interval polling (`sleep_for(1ms)`) with condition-variable-based blocking.

The current behavior is that worker threads spin for 50ms upon first becoming idle, and then enter a 1ms sleep-based polling loop. This results in around ~2% idle CPU utilization per thread. On my configuration with `--kt-cpuinfer 128`, this adds up to around ~256%, or ~2.5 cores' worth of constant CPU utilization while no inference requests are running.

With this PR:

* The 50ms active period is kept
* The `sleep_for(1ms)` polling loop is replaced with `std::condition_variable::wait` on a `std::mutex`
* Idle threads are woken via the `std::condition_variable` (`notify_one`) when transitioning to `WORKING` or `EXIT`

After applying these changes locally, the inference server's idle CPU usage drops from ~256% to a mere ~1-2%.