fix(shm): Add memory barriers for cross-process shared memory visibility #30407

Merged

njhill merged 3 commits into vllm-project:main from kitaekatt:fix/29819-shm-memory-barriers on Dec 10, 2025

Conversation

@kitaekatt
Contributor

Summary

Adds proper memory barriers for cross-process shared memory synchronization to prevent data races.

Changes

  • Add memory barriers in shared memory broadcast implementation
  • Ensure correct visibility ordering across process boundaries

Testing

Tested with multi-process inference scenarios.

Note: PR recreated due to branch rename (was fix-shm-memory-barriers, now fix/29819-shm-memory-barriers)

The shared memory ring buffer protocol in shm_broadcast.py uses plain byte
writes to signal between writer and reader processes. On multi-core systems,
these writes may stay in CPU store buffers and not be visible to other
processes running on different cores, causing indefinite spinning/freeze
under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release
(which provides full barrier semantics per POSIX.1-2008) at four critical
points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+
requests) where both writer and readers would spin indefinitely waiting
for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
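
For illustration only (a minimal sketch, not the actual shm_broadcast.py code): the fence idiom described above can be expressed as a module-level helper, here named memory_fence to match the review comment further down, with a simplified, hypothetical flag layout of one written-flag byte followed by per-reader read-flag bytes.

import threading

# Private, never-contended lock used only for its acquire/release side
# effects, which act as a full memory barrier in CPython: stores issued
# before the fence become globally visible, and loads issued after it
# are not satisfied from stale values.
_fence_lock = threading.Lock()

def memory_fence() -> None:
    """Portable full memory barrier (the stdlib exposes no direct mfence)."""
    _fence_lock.acquire()
    _fence_lock.release()

def acquire_write(flags: memoryview, n_reader: int) -> None:
    """Writer-side sketch: spin until every reader has consumed the previous
    block, then publish the new one. Hypothetical layout: flags[0] is the
    written flag, flags[1:1 + n_reader] are the per-reader read flags."""
    while True:
        memory_fence()                    # see the readers' latest read flags
        if all(flags[1 + i] for i in range(n_reader)):
            # ... caller copies the payload into the data region here ...
            for i in range(n_reader):
                flags[1 + i] = 0          # no reader has seen the new block yet
            flags[0] = 1                  # mark the block as written
            memory_fence()                # publish the flag updates to other cores
            return

The fence before the flag check keeps the spin loop from waiting on stale read flags; the fence after the updates makes the written flag (and the payload stored before it) visible to readers on other cores. acquire_read() mirrors this: a fence before checking the written flag, and a fence after setting the read flag.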
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist bot left a comment

Code Review

This pull request correctly identifies and fixes a critical race condition in the shared memory broadcast mechanism. By introducing memory barriers using a memory_fence function, it ensures proper synchronization and visibility of memory writes between processes. The fences are strategically placed within the acquire_write and acquire_read methods, which is crucial for the correctness of the lock-free ring buffer on multi-core systems. The implementation of memory_fence using threading.Lock is a pragmatic approach to create a portable memory barrier. The changes are well-documented with clear comments explaining the purpose of each fence. Overall, this is a solid and necessary fix to prevent potential deadlocks and data corruption.

@kitaekatt
Contributor Author

This PR is a re-opening of #29819. The original branch was accidentally deleted, preventing that PR from being reopened.

@njhill You had provided feedback on the previous PR - would appreciate your review on this one as well.

@njhill njhill self-assigned this Dec 10, 2025
@njhill
Member

njhill commented Dec 10, 2025

Thanks @kitaekatt. I benchmarked a low latency case and this also actually appears to have a slight performance benefit 🎉

vllm serve meta-llama/Llama-3.2-1B-Instruct --disable-log-requests --uvicorn-log-level=error --port 8033 --distributed-executor-backend=mp
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 4000 \
    --ignore-eos \
    --num-prompts 12 \
    --max-concurrency 2 \
    --seed 42

Before

============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  49.51     
Total input tokens:                      6132      
Total generated tokens:                  48000     
Request throughput (req/s):              0.24      
Output token throughput (tok/s):         969.52    
Peak output token throughput (tok/s):    1000.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1093.38   
---------------Time to First Token----------------
Mean TTFT (ms):                          23.97     
Median TTFT (ms):                        22.70     
P99 TTFT (ms):                           34.61     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.06      
Median TPOT (ms):                        2.06      
P99 TPOT (ms):                           2.07      
---------------Inter-token Latency----------------
Mean ITL (ms):                           2.06      
Median ITL (ms):                         2.05      
P99 ITL (ms):                            2.37      
==================================================

This PR

============ Serving Benchmark Result ============
Successful requests:                     12        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  47.91     
Total input tokens:                      6132      
Total generated tokens:                  48000     
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         1001.95   
Peak output token throughput (tok/s):    1032.00   
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          1129.95   
---------------Time to First Token----------------
Mean TTFT (ms):                          16.33     
Median TTFT (ms):                        16.12     
P99 TTFT (ms):                           18.14     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.99      
Median TPOT (ms):                        1.99      
P99 TPOT (ms):                           2.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.99      
Median ITL (ms):                         1.98      
P99 ITL (ms):                            2.27      
==================================================

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 10, 2025
@njhill njhill enabled auto-merge (squash) December 10, 2025 18:15
@njhill njhill merged commit 166ac3c into vllm-project:main Dec 10, 2025
53 checks passed
@slippersss
Contributor

Hi! This looks like an important fix for preventing data races. We've actually encountered similar issues before, as mentioned in #27858, so I really appreciate this effort!

I have a small question about the implementation choice – I was under the impression that threading.Lock works within a single process, while for cross-process shared memory we might need something like multiprocessing.Lock. I'm not entirely sure though, and I'm curious to learn how threading.Lock handles memory visibility across separate processes in this context.

Thanks again for your work on this fix. It's great to see attention to this tricky issue!

@kitaekatt
Contributor Author

Good question @slippersss. For context, this is a lock-free ring buffer - the design specifically avoids blocking so multiple processes can read/write concurrently using atomic flag checks and memory barriers.

The threading.Lock here is being used purely as a memory fence. The shared memory region is already visible across processes via mmap; the issue is ensuring ordering so that Process B observes writes from Process A in the correct sequence.

Acquiring and immediately releasing a threading.Lock forces a full memory barrier at the hardware level (flushes pending writes, invalidates stale cache lines). It's a portable Python idiom since there's no direct mfence wrapper in stdlib.

multiprocessing.Lock would introduce actual blocking between processes, which would negate the lock-free design.
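
To make that contrast concrete (an illustrative sketch, not code from the PR; the flag byte is hypothetical): a multiprocessing.Lock is a shared OS-level semaphore, so guarding every flag access with it would serialize the writer and all readers, while the threading.Lock used above is private to each process and never contended, so its acquire/release pair costs only the barrier.

import multiprocessing

# A cross-process lock would also provide the required ordering, but only
# by turning every flag access into a blocking critical section shared by
# the writer and all readers.
shared_lock = multiprocessing.Lock()      # OS-level semaphore, shared with child processes

def set_written_flag_blocking(flags: memoryview) -> None:
    # Any other process touching the flags must wait here, which defeats
    # the point of a lock-free ring buffer.
    with shared_lock:
        flags[0] = 1                      # hypothetical "written" flag byte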

@kitaekatt kitaekatt deleted the fix/29819-shm-memory-barriers branch December 15, 2025 15:29
@slippersss
Contributor

Sorry for the late reply, and thanks a lot for the detailed explanation - I learned a lot. Thanks again!

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026