fix(shm): Add memory barriers for cross-process shared memory visibility #30407

Merged

njhill merged 3 commits into vllm-project:main from kitaekatt:fix/29819-shm-memory-barriers on Dec 10, 2025

Conversation

@kitaekatt
Contributor

Summary

Adds proper memory barriers for cross-process shared memory synchronization to prevent data races.

Changes

  • Add memory barriers in shared memory broadcast implementation
  • Ensure correct visibility ordering across process boundaries

Testing

Tested with multi-process inference scenarios.

Note: PR recreated due to branch rename (was fix-shm-memory-barriers, now fix/29819-shm-memory-barriers)

The shared memory ring buffer protocol in shm_broadcast.py uses plain byte
writes to signal between writer and reader processes. On multi-core systems,
these writes may stay in CPU store buffers and not be visible to other
processes running on different cores, causing indefinite spinning/freeze
under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release
(which provides full barrier semantics per POSIX.1-2008) at four critical
points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+
requests) where both writer and readers would spin indefinitely waiting
for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
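
For illustration only (a minimal sketch, not the actual shm_broadcast.py code): the fence idiom described above can be expressed as a module-level helper, here named memory_fence to match the review comment further down, with a simplified, hypothetical flag layout of one written-flag byte followed by per-reader read-flag bytes.

import threading

# Private, never-contended lock used only for its acquire/release side
# effects, which act as a full memory barrier in CPython: stores issued
# before the fence become globally visible, and loads issued after it
# are not satisfied from stale values.
_fence_lock = threading.Lock()

def memory_fence() -> None:
    """Portable full memory barrier (the stdlib exposes no direct mfence)."""
    _fence_lock.acquire()
    _fence_lock.release()

def acquire_write(flags: memoryview, n_reader: int) -> None:
    """Writer-side sketch: spin until every reader has consumed the previous
    block, then publish the new one. Hypothetical layout: flags[0] is the
    written flag, flags[1:1 + n_reader] are the per-reader read flags."""
    while True:
        memory_fence()                    # see the readers' latest read flags
        if all(flags[1 + i] for i in range(n_reader)):
            # ... caller copies the payload into the data region here ...
            for i in range(n_reader):
                flags[1 + i] = 0          # no reader has seen the new block yet
            flags[0] = 1                  # mark the block as written
            memory_fence()                # publish the flag updates to other cores
            return

The fence before the flag check keeps the spin loop from waiting on stale read flags; the fence after the updates makes the written flag (and the payload stored before it) visible to readers on other cores. acquire_read() mirrors this: a fence before checking the written flag, and a fence after setting the read flag.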
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist bot left a comment

Code Review

This pull request correctly identifies and fixes a critical race condition in the shared memory broadcast mechanism. By introducing memory barriers using a memory_fence function, it ensures proper synchronization and visibility of memory writes between processes. The fences are strategically placed within the acquire_write and acquire_read methods, which is crucial for the correctness of the lock-free ring buffer on multi-core systems. The implementation of memory_fence using threading.Lock is a pragmatic approach to create a portable memory barrier. The changes are well-documented with clear comments explaining the purpose of each fence. Overall, this is a solid and necessary fix to prevent potential deadlocks and data corruption.

@kitaekatt
Contributor Author

This PR is a re-opening of #29819. The original branch was accidentally deleted, preventing that PR from being reopened.

@njhill You had provided feedback on the previous PR - would appreciate your review on this one as well.

@njhill njhill self-assigned this Dec 10, 2025
@njhill
Member

njhill commented Dec 10, 2025

Thanks @kitaekatt. I benchmarked a low latency case and this also actually appears to have a slight performance benefit 🎉

vllm serve meta-llama/Llama-3.2-1B-Instruct --disable-log-requests --uvicorn-log-level=error --port 8033 --distributed-executor-backend=mp
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 4000 \
    --ignore-eos \
    --num-prompts 12 \
    --max-concurrency 2 \
    --seed 42

Before

============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  49.51     
Total input tokens:                      6132      
Total generated tokens:                  48000     
Request throughput (req/s):              0.24      
Output token throughput (tok/s):         969.52    
Peak output token throughput (tok/s):    1000.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1093.38   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.97     
Median TTFT (ms):                        22.70     
P99 TTFT (ms):                           34.61     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.06      
Median TPOT (ms):                        2.06      
P99 TPOT (ms):                           2.07      
---------------Inter-token Latency----------------
Mean ITL (ms):                           2.06      
Median ITL (ms):                         2.05      
P99 ITL (ms):                            2.37      
==================================================

This PR

============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  47.91     
Total input tokens:                      6132      
Total generated tokens:                  48000     
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         1001.95   
Peak output token throughput (tok/s):    1032.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1129.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          16.33     
Median TTFT (ms):                        16.12     
P99 TTFT (ms):                           18.14     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.99      
Median TPOT (ms):                        1.99      
P99 TPOT (ms):                           2.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.99      
Median ITL (ms):                         1.98      
P99 ITL (ms):                            2.27      
==================================================

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 10, 2025
@njhill njhill enabled auto-merge (squash) December 10, 2025 18:15
@njhill njhill merged commit 166ac3c into vllm-project:main Dec 10, 2025
53 checks passed
@slippersss
Contributor

Hi! This looks like an important fix for preventing data races. We've actually encountered similar issues before, as mentioned in #27858, so I really appreciate this effort!

I have a small question about the implementation choice – I was under the impression that threading.Lock works within a single process, while for cross-process shared memory we might need something like multiprocessing.Lock. I'm not entirely sure though, and I'm curious to learn how threading.Lock handles memory visibility across separate processes in this context.

Thanks again for your work on this fix. It's great to see attention to this tricky issue!

@kitaekatt
Contributor Author

Good question @slippersss. For context, this is a lock-free ring buffer - the design specifically avoids blocking so multiple processes can read/write concurrently using atomic flag checks and memory barriers.

The threading.Lock here is being used purely as a memory fence. The shared memory region is already visible across processes via mmap; the issue is ensuring ordering so that Process B observes writes from Process A in the correct sequence.

Acquiring and immediately releasing a threading.Lock forces a full memory barrier at the hardware level (flushes pending writes, invalidates stale cache lines). It's a portable Python idiom since there's no direct mfence wrapper in stdlib.

multiprocessing.Lock would introduce actual blocking between processes, which would negate the lock-free design.
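
To make that contrast concrete (an illustrative sketch, not code from the PR; the flag byte is hypothetical): a multiprocessing.Lock is a shared OS-level semaphore, so guarding every flag access with it would serialize the writer and all readers, while the threading.Lock used above is private to each process and never contended, so its acquire/release pair costs only the barrier.

import multiprocessing

# A cross-process lock would also provide the required ordering, but only
# by turning every flag access into a blocking critical section shared by
# the writer and all readers.
shared_lock = multiprocessing.Lock()      # OS-level semaphore, shared with child processes

def set_written_flag_blocking(flags: memoryview) -> None:
    # Any other process touching the flags must wait here, which defeats
    # the point of a lock-free ring buffer.
    with shared_lock:
        flags[0] = 1                      # hypothetical "written" flag byte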

@kitaekatt kitaekatt deleted the fix/29819-shm-memory-barriers branch December 15, 2025 15:29
@slippersss
Contributor

Sorry for the late reply, and thanks a lot for the detailed explanation - I learned a lot. Thanks again!

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026