
Reduce idle CPU usage by replacing polling with blocking in worker threads #1902

Open
DocShotgun wants to merge 1 commit into kvcache-ai:main from DocShotgun:reduce_idle_cpu_usage

Conversation

@DocShotgun
Contributor

This PR eliminates excessive idle CPU usage in the CPU backend thread pools by replacing fixed-interval polling (sleep_for(1ms)) with condition-variable-based blocking.

Currently, worker threads spin for 50ms upon first becoming idle and then enter a 1ms sleep-based polling loop, which results in roughly 2% idle CPU utilization per thread. On my configuration with --kt-cpuinfer 128, this adds up to about 256%, i.e. ~2.5 cores' worth of constant CPU utilization while no inference requests are running.

With this PR:

  • Worker threads retain the existing 50ms active spin period
  • The sleep_for(1ms) polling loop is replaced with std::condition_variable::wait
  • Each thread gains a dedicated:
    • std::mutex
    • std::condition_variable
  • Threads are woken (notify_one) when transitioning to WORKING or EXIT

After applying these changes locally, the inference server's idle CPU usage drops from ~256% to a mere ~1-2%.

Before submitting

* Replace sleep polling with condition variable blocking
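
The mechanism described above can be sketched as follows. The struct and function names here (ThreadState, wait_for_work, set_status) are hypothetical stand-ins for illustration; the real members live inside kvcache-ai's InNumaPool/NumaJobDistributor and may be named differently.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical per-thread state mirroring the PR description.
enum class ThreadStatus { WAITING, WORKING, EXIT };

struct ThreadState {
    std::mutex mtx;
    std::condition_variable cv;
    std::atomic<ThreadStatus> status{ThreadStatus::WAITING};
};

// Old behavior (simplified): poll every 1ms, burning ~2% CPU per thread.
//   while (status.load() == ThreadStatus::WAITING)
//       std::this_thread::sleep_for(std::chrono::milliseconds(1));

// New behavior: block on the condition variable until woken.
void wait_for_work(ThreadState& ts) {
    std::unique_lock<std::mutex> lock(ts.mtx);
    ts.cv.wait(lock, [&] {
        return ts.status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
    });
}

// Dispatcher side: publish the new status under the mutex, then wake
// the single blocked worker.
void set_status(ThreadState& ts, ThreadStatus next) {
    {
        std::lock_guard<std::mutex> lock(ts.mtx);
        ts.status.store(next, std::memory_order_release);
    }
    ts.cv.notify_one();
}
```

Because the predicate is re-checked under the lock before blocking, a notification that arrives before the worker starts waiting is not lost.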

@gemini-code-assist bot left a comment


Code Review

This pull request replaces thread polling and sleeping with condition variables and mutexes to improve synchronization efficiency in the InNumaPool and NumaJobDistributor classes. Feedback focuses on optimizing atomic memory orders within condition variable predicates, where std::memory_order_relaxed is sufficient due to the existing mutex lock, and suggests refactoring duplicated initialization logic in NumaJobDistributor into a private helper function.

Comment on lines +228 to +230
thread_state_[thread_id].cv.wait(lock, [&] {
return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
});


medium

The std::condition_variable::wait predicate is executed while the mutex is locked. The lock and unlock operations on the mutex already provide the necessary memory synchronization (acquire-release semantics). Therefore, using std::memory_order_acquire for loading the atomic status inside the predicate is stronger than necessary. You can use std::memory_order_relaxed for a minor performance improvement and to more clearly express that synchronization relies on the mutex.

Suggested change
thread_state_[thread_id].cv.wait(lock, [&] {
return thread_state_[thread_id].status.load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
thread_state_[thread_id].cv.wait(lock, [&] {
return thread_state_[thread_id].status.load(std::memory_order_relaxed) != ThreadStatus::WAITING;
});
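
A minimal standalone illustration of why the relaxed load is safe (the variable names here are illustrative, not from the PR): the predicate runs only while the mutex is held, and the notifier stores the flag while holding the same mutex, so the notifier's unlock synchronizes-with the waiter's re-lock and orders the store before the load.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::atomic<int> ready{0};

void waiter() {
    std::unique_lock<std::mutex> lock(m);
    // Relaxed is enough: the mutex already provides acquire-release
    // ordering between the notifier's store and this load.
    cv.wait(lock, [] { return ready.load(std::memory_order_relaxed) != 0; });
}

void notifier() {
    {
        std::lock_guard<std::mutex> lock(m);
        ready.store(1, std::memory_order_relaxed);  // ordered by the mutex
    }
    cv.notify_one();
}
```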

Comment on lines +257 to +258
mutexes.push_back(std::make_unique<std::mutex>());
cvs.push_back(std::make_unique<std::condition_variable>());


medium

This loop for initializing status, mutexes, and cvs is duplicated in the other NumaJobDistributor::init overload (lines 280-281). To improve maintainability and avoid future inconsistencies, consider refactoring this common initialization logic into a private helper function.
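
A sketch of the suggested refactor. The member names (status, mutexes, cvs) follow the diff hunks, but the surrounding class shape is an assumption; in the real NumaJobDistributor these members would be private and the helper would be called from both init() overloads.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <vector>

enum class ThreadStatus { WAITING, WORKING, EXIT };

// Hypothetical slice of NumaJobDistributor, shown with public members
// for brevity.
struct NumaJobDistributorSketch {
    std::vector<std::unique_ptr<std::atomic<ThreadStatus>>> status;
    std::vector<std::unique_ptr<std::mutex>> mutexes;
    std::vector<std::unique_ptr<std::condition_variable>> cvs;

    // Shared per-node initialization, hoisted out of the two init()
    // overloads so the lists cannot drift out of sync.
    void init_sync_state(size_t numa_count) {
        status.clear();
        mutexes.clear();
        cvs.clear();
        for (size_t i = 0; i < numa_count; ++i) {
            status.push_back(
                std::make_unique<std::atomic<ThreadStatus>>(ThreadStatus::WAITING));
            mutexes.push_back(std::make_unique<std::mutex>());
            cvs.push_back(std::make_unique<std::condition_variable>());
        }
    }
};
```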

Comment on lines +404 to +406
cvs[numa_id]->wait(lock, [&] {
return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
});


medium

Similar to my other comment, the std::condition_variable::wait predicate is executed under a lock, which already ensures memory visibility. Using std::memory_order_acquire here is not necessary. std::memory_order_relaxed is sufficient and more idiomatic in this context, as the mutex handles the synchronization.

Suggested change
cvs[numa_id]->wait(lock, [&] {
return status[numa_id]->load(std::memory_order_acquire) != ThreadStatus::WAITING;
});
cvs[numa_id]->wait(lock, [&] {
return status[numa_id]->load(std::memory_order_relaxed) != ThreadStatus::WAITING;
});
