
Conversation

@knwng commented Jan 13, 2026

Motivation

Add an example of GEMM + ReduceScatter using workgroup specialization.

Technical Details

Test Plan

Test Result

Submission Checklist


```python
tl.store(C + local_offset, c, mask=sub_mask, cache_modifier=".wt")
tl.debug_barrier()
tl.store(locks + tile_id, 1, cache_modifier=".wt")
```
Collaborator

Use atomic_cas with release semantics.
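A minimal sketch of what that could look like, assuming plain Triton's `tl.atomic_cas(pointer, cmp, val, sem=..., scope=...)` signature (the exact helper used in this repo may differ):

```python
# Hedged sketch: publish the tile with release semantics instead of
# relying on a write-through store plus tl.debug_barrier().
tl.store(C + local_offset, c, mask=sub_mask)
tl.atomic_cas(locks + tile_id, 0, 1, sem="release", scope="sys")
```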

Collaborator

I know some other examples use this barrier/volatile-store pattern, but it is not correct and we will fix them. The correct pattern is shown in this example:

```python
import triton
import triton.language as tl

import iris


@triton.jit
def producer_kernel(
    source_buffer,  # tl.tensor: pointer to source data
    target_buffer,  # tl.tensor: pointer to target data
    flag,  # tl.tensor: pointer to flags
    buffer_size,  # int32: total number of elements
    producer_rank: tl.constexpr,
    consumer_rank: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
    heap_bases_ptr: tl.tensor,  # tl.tensor: pointer to heap bases pointers
):
    pid = tl.program_id(0)
    # Compute start index of this block
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Guard for out-of-bounds accesses
    mask = offsets < buffer_size
    # Load chunk from source buffer
    values = iris.load(source_buffer + offsets, producer_rank, producer_rank, heap_bases_ptr, mask=mask)
    # Store chunk to target buffer
    iris.store(
        target_buffer + offsets,
        values,
        producer_rank,
        consumer_rank,
        heap_bases_ptr,
        mask=mask,
    )
    # Set flag to signal completion
    iris.atomic_cas(flag + pid, 0, 1, producer_rank, consumer_rank, heap_bases_ptr, sem="release", scope="sys")


@triton.jit
def consumer_kernel(
    buffer,  # tl.tensor: pointer to shared buffer (read from target_rank)
    flag,  # tl.tensor: sync flag per block
    buffer_size,  # int32: total number of elements
    consumer_rank: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
    heap_bases_ptr: tl.tensor,  # tl.tensor: pointer to heap bases pointers
):
    pid = tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < buffer_size
    # Spin-wait until writer sets flag[pid] = 1
    done = 0
    while done == 0:
        done = iris.atomic_cas(
            flag + pid, 1, 0, consumer_rank, consumer_rank, heap_bases_ptr, sem="acquire", scope="sys"
        )
    # Read from the target buffer (written by producer)
    values = iris.load(buffer + offsets, consumer_rank, consumer_rank, heap_bases_ptr, mask=mask)
    # Do something with values...
    # (Here you might write to output, do computation, etc.)
    values = values * 2
    # Store chunk to target buffer
    iris.store(
        buffer + offsets,
        values,
        consumer_rank,
        consumer_rank,
        heap_bases_ptr,
        mask=mask,
    )
    # Optionally reset the flag for next iteration
    tl.store(flag + pid, 0)
```
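The release CAS on the producer's flag pairs with the acquire CAS in the consumer's spin loop, so the stores to `target_buffer` are guaranteed to be visible before the consumer observes the flag as set.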

@mawad-amd (Collaborator) left a comment

Thanks for the PR, Kyle! I know it is a draft but I left a couple of comments.


```python
local_offset = rm[:, None] * stride_cm + rn[None, :] * stride_cn

while tl.load(locks + tile_id, cache_modifier=".cv", volatile=True) != 1:
```
Collaborator

Use atomic_cas with acquire semantics here.
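A hedged sketch of the acquire-side spin, again assuming Triton's `tl.atomic_cas` signature; comparing and swapping 1 with 1 leaves the lock value unchanged while giving the read acquire ordering:

```python
# Hedged sketch: spin with an acquire CAS instead of a volatile load.
# CAS(1, 1) is a no-op on the value; it returns the old value with
# acquire ordering, so the loop exits once the producer has published.
done = 0
while done == 0:
    done = tl.atomic_cas(locks + tile_id, 1, 1, sem="acquire", scope="sys")
```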

args["gsize_m"],
args["num_stages"],
shmem.get_heap_bases(),
"gfx942",
Collaborator

Can we avoid the hardcoded arch here and maybe find it via torch.cuda.get_device_properties?
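Something along these lines might work (a sketch; on ROCm builds of PyTorch the device properties expose `gcnArchName`, which may carry feature suffixes):

```python
import torch

# Sketch: query the GPU arch instead of hardcoding "gfx942".
# On ROCm, gcnArchName looks like "gfx942:sramecc+:xnack-",
# so keep only the base arch token.
props = torch.cuda.get_device_properties(0)
arch = props.gcnArchName.split(":")[0]
```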
