fix(shm): Add memory barriers for cross-process shared memory visibility #30407
Conversation
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte writes to signal between writer and reader processes. On multi-core systems, these writes may stay in CPU store buffers and not become visible to other processes running on different cores, causing indefinite spinning/freezes under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release (which provides full barrier semantics per POSIX.1-2008) at four critical points:

- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag

The memory barrier ensures that:

1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes a freeze observed during sustained concurrent batch inference (~500+ requests) where both writer and readers would spin indefinitely waiting for flags that had been updated but were not visible across CPU cores.

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
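For readers unfamiliar with the trick, a minimal sketch of such a fence helper is shown below. The name memory_fence matches the review comments further down; the exact implementation in shm_broadcast.py may differ, so treat this as an illustration of the lock acquire/release pattern the description refers to, not the shipped code.

```python
import threading

# A single private lock reused for every fence call. It is never contended,
# so acquire()/release() reduce to the underlying mutex lock/unlock, which
# POSIX.1-2008 defines to synchronize memory (a full barrier).
_fence_lock = threading.Lock()


def memory_fence() -> None:
    """Portable full memory barrier built from a lock acquire/release pair."""
    _fence_lock.acquire()
    _fence_lock.release()
```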
Code Review
This pull request correctly identifies and fixes a critical race condition in the shared memory broadcast mechanism. By introducing memory barriers using a memory_fence function, it ensures proper synchronization and visibility of memory writes between processes. The fences are strategically placed within the acquire_write and acquire_read methods, which is crucial for the correctness of the lock-free ring buffer on multi-core systems. The implementation of memory_fence using threading.Lock is a pragmatic approach to create a portable memory barrier. The changes are well-documented with clear comments explaining the purpose of each fence. Overall, this is a solid and necessary fix to prevent potential deadlocks and data corruption.
Thanks @kitaekatt. I benchmarked a low latency case and this also actually appears to have a slight performance benefit 🎉 [Benchmark table comparing "Before" and "This PR" omitted.]
Hi! This looks like an important fix for preventing data races. We've actually encountered similar issues before, as mentioned in #27858, so I really appreciate this effort! I have a small question about the implementation choice of using threading.Lock acquire/release as the memory fence. Thanks again for your work on this fix. It's great to see attention to this tricky issue!
|
Good question @slippersss. For context, this is a lock-free ring buffer - the design specifically avoids blocking so multiple processes can read/write concurrently using atomic flag checks and memory barriers. The threading.Lock is not used for mutual exclusion at all: acquiring and immediately releasing a lock acts as a full memory barrier (per POSIX.1-2008), so the flag updates become visible across CPU cores without ever blocking another process.
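To make the fence placement concrete, here is a simplified, hypothetical writer loop. The buffer layout (one "written" flag followed by per-reader "read" flags), the function signature, and the back-off are illustrative assumptions; the real acquire_write() in shm_broadcast.py is more involved.

```python
import threading
import time

_fence_lock = threading.Lock()


def memory_fence() -> None:
    # Lock acquire/release pair used purely as a full memory barrier.
    _fence_lock.acquire()
    _fence_lock.release()


def acquire_write(buf: memoryview, n_readers: int) -> None:
    """Hypothetical writer loop over a shared-memory metadata block.

    buf[0] is the "written" flag, buf[1:1 + n_readers] are per-reader
    "read" flags (all single bytes).
    """
    while True:
        memory_fence()                    # see the readers' latest flag writes
        written_flag = buf[0]
        read_count = sum(buf[1:1 + n_readers])
        # The block is reusable if it was never written, or every reader consumed it.
        if written_flag and read_count != n_readers:
            time.sleep(1e-5)              # still being read; back off and retry
            continue
        # ... caller writes the payload into the data region here ...
        for i in range(n_readers):
            buf[1 + i] = 0                # clear per-reader "read" flags
        buf[0] = 1                        # set the "written" flag
        memory_fence()                    # publish the stores above before returning
        return
```

A matching acquire_read() would mirror this shape: fence, check the written flag, consume the payload, set that reader's flag, fence again.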
Sorry for the late reply, and thanks a lot for the detailed explanation - I learned a lot. Thanks again!
Summary
Adds proper memory barriers for cross-process shared memory synchronization to prevent data races.
Changes
- Introduce a memory_fence() helper in shm_broadcast.py, implemented as a threading.Lock acquire/release pair that acts as a portable full memory barrier.
- Call the fence in acquire_write() and acquire_read(): before reading the flags, and again after setting the written/read flag.
Testing
Tested with multi-process inference scenarios.
Note: PR recreated due to branch rename (was fix-shm-memory-barriers, now fix/29819-shm-memory-barriers)