
numaaware: add GPU NUMA topology awareness to scheduler #5095

Open
pmady wants to merge 6 commits into volcano-sh:master from pmady:feature/gpu-numa-topology-awareness

Conversation


@pmady pmady commented Mar 11, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it

This PR extends the numaaware scheduler plugin to support GPU NUMA topology awareness. Currently, the plugin only considers CPU NUMA topology when scheduling workloads. For GPU workloads (especially multi-GPU training and LLM inference), cross-NUMA GPU placement causes measurable performance degradation due to NVLink and PCIe traffic crossing NUMA boundaries.

Changes

1. API: Add GPU topology types (pkg/scheduler/api/numa_info.go)

  • Add GPUInfo and GPUDetails types for GPU-to-NUMA node mapping
  • Add NUMANodes() and GPUsInNUMANodes() helper methods on GPUDetails
  • Add GPUDetail field to NumatopoInfo struct
  • Update DeepCopy to include GPU topology data
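As a sketch of what these API additions might look like, a map keyed by GPU ID with a value-copying DeepCopy is enough for the scheduler's per-cycle snapshotting. The field names below are assumptions based on this description, not the merged code:

```go
package main

import "fmt"

// GPUInfo and GPUDetails sketch the new API types from this section; the
// exact field names are assumptions based on the description above.
type GPUInfo struct {
	ID         int
	NUMANodeID int
}

// GPUDetails maps GPU ID to its topology info.
type GPUDetails map[int]GPUInfo

// DeepCopy returns an independent copy, needed because NumatopoInfo is
// snapshotted per scheduling cycle. Values are plain structs, so copying
// map entries is sufficient.
func (d GPUDetails) DeepCopy() GPUDetails {
	if d == nil {
		return nil
	}
	out := make(GPUDetails, len(d))
	for k, v := range d {
		out[k] = v
	}
	return out
}

func main() {
	orig := GPUDetails{0: {ID: 0, NUMANodeID: 0}}
	cp := orig.DeepCopy()
	cp[0] = GPUInfo{ID: 0, NUMANodeID: 1} // mutating the copy...
	fmt.Println(orig[0].NUMANodeID)       // ...leaves the original at 0
}
```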

2. GPU HintProvider (pkg/scheduler/plugins/numaaware/provider/gpumanager/)

  • Implement policy.HintProvider interface for nvidia.com/gpu resources
  • Generate topology hints based on GPU-to-NUMA node affinity
  • Prefer allocations from the minimum number of NUMA nodes
  • Allocate GPUs from preferred NUMA nodes first, then spill over
  • Follows the same pattern as the existing cpumanager provider
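The hint-generation pattern these bullets describe (enumerate NUMA-node combinations, mark the narrowest feasible ones preferred) can be sketched in isolation. The types below are simplified stand-ins for policy.TopologyHint and bitmask.BitMask, and availablePerNuma is a hypothetical pre-computed map, not the plugin's actual input shape:

```go
package main

import "fmt"

// TopologyHint is a simplified stand-in for policy.TopologyHint: a set of
// NUMA nodes encoded as a bitmask, plus a preferred flag.
type TopologyHint struct {
	NUMAMask  uint64
	Preferred bool
}

func popcount(m uint64) int {
	c := 0
	for ; m != 0; m >>= 1 {
		c += int(m & 1)
	}
	return c
}

// gpuHints enumerates every non-empty NUMA-node combination with enough
// available GPUs for the request, then marks the narrowest combinations
// as preferred, following the pattern the provider bullets describe.
func gpuHints(availablePerNuma map[int]int, numNodes, request int) []TopologyHint {
	minSize := numNodes
	var feasible []uint64
	for mask := uint64(1); mask < uint64(1)<<numNodes; mask++ {
		avail := 0
		for id := 0; id < numNodes; id++ {
			if mask&(uint64(1)<<id) != 0 {
				avail += availablePerNuma[id]
			}
		}
		if avail >= request {
			feasible = append(feasible, mask)
			if n := popcount(mask); n < minSize {
				minSize = n
			}
		}
	}
	hints := make([]TopologyHint, 0, len(feasible))
	for _, m := range feasible {
		hints = append(hints, TopologyHint{NUMAMask: m, Preferred: popcount(m) == minSize})
	}
	return hints
}

func main() {
	// 3 GPUs free on NUMA 0, 1 free on NUMA 1; a request for 3 GPUs fits
	// entirely on NUMA 0, so {0} is preferred and {0,1} is merely feasible.
	for _, h := range gpuHints(map[int]int{0: 3, 1: 1}, 2, 3) {
		fmt.Printf("mask=%02b preferred=%v\n", h.NUMAMask, h.Preferred)
	}
}
```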

3. Plugin registration and scoring (pkg/scheduler/plugins/numaaware/numaaware.go)

  • Register gpumanager.NewProvider() alongside existing cpumanager.NewProvider()
  • Update getNodeNumaNumForTask scoring to include GPU NUMA node count
  • Add getNumaNodeCntForGPUID helper function

Example scenario (from #4998)

Node A: 8 GPUs, GPUs 0-3 on NUMA 0, GPUs 4-7 on NUMA 1, available: GPUs 0-2 on NUMA 0 + GPU 7 on NUMA 1
Node B: 8 GPUs, GPUs 0-3 on NUMA 0, GPUs 4-7 on NUMA 1, available: GPUs 0-3 all on NUMA 0

For a task requesting 4 GPUs:

  • Without this PR: scheduler might pick Node A, GPUs span 2 NUMA nodes
  • With this PR: scheduler prefers Node B, all 4 GPUs from single NUMA node
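The scenario above boils down to counting the NUMA span of each candidate allocation. A minimal sketch, mirroring the idea behind the PR's getNumaNodeCntForGPUID (names and shapes here are illustrative, not the plugin's API):

```go
package main

import "fmt"

// numaSpan counts how many distinct NUMA nodes a GPU assignment touches,
// given a GPU-ID to NUMA-node mapping.
func numaSpan(gpus []int, gpuToNuma map[int]int) int {
	seen := map[int]bool{}
	for _, g := range gpus {
		if n, ok := gpuToNuma[g]; ok {
			seen[n] = true
		}
	}
	return len(seen)
}

func main() {
	// Both nodes: GPUs 0-3 on NUMA 0, GPUs 4-7 on NUMA 1.
	topo := map[int]int{0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

	nodeA := []int{0, 1, 2, 7} // Node A's available GPUs
	nodeB := []int{0, 1, 2, 3} // Node B's available GPUs

	fmt.Println(numaSpan(nodeA, topo)) // 2: spans both NUMA nodes
	fmt.Println(numaSpan(nodeB, topo)) // 1: single NUMA node, so Node B wins
}
```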

Which issue(s) this PR fixes

Fixes #4998

Does this PR introduce a user-facing change?

Add GPU NUMA topology awareness to the numaaware scheduler plugin. The scheduler now considers GPU-to-NUMA node affinity when scheduling GPU workloads, preferring placements that minimize cross-NUMA GPU allocation for better performance.

How Has This Been Tested?

  • Unit tests for GPU HintProvider: topology hints and allocation (10 test cases)
  • Verified all existing cpumanager and policy tests still pass
  • go build ./pkg/scheduler/... passes cleanly

pmady added 4 commits March 10, 2026 20:07
Add GPUInfo, GPUDetails types and helper methods (NUMANodes, GPUsInNUMANodes)
to support GPU-to-NUMA node mapping in the scheduler API. Extend NumatopoInfo
struct with GPUDetail field and update DeepCopy to include GPU topology data.

This enables the numaaware plugin to consider GPU PCIe affinity when making
scheduling decisions for GPU workloads.

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
Add gpumanager package that implements the policy.HintProvider interface
for nvidia.com/gpu resources. The provider:

- Generates topology hints based on GPU-to-NUMA node affinity
- Prefers allocations from the minimum number of NUMA nodes
- Allocates GPUs from preferred NUMA nodes first, then spills over
- Follows the same pattern as the existing cpumanager provider

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
Register the gpumanager HintProvider alongside the existing cpumanager
provider so that GPU NUMA topology is considered during predicate and
allocation phases.

Update getNodeNumaNumForTask to include GPU NUMA node count in the
scoring function. Nodes where both CPU and GPU assignments span fewer
NUMA nodes receive higher scores, improving locality for mixed
CPU+GPU workloads.

Add getNumaNodeCntForGPUID helper to count NUMA nodes used by GPU
assignments.

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
Test coverage includes:
- 4 GPUs from single NUMA node (both NUMA 0 and NUMA 1)
- 4 GPUs when only one NUMA node has enough available
- 5 GPUs requiring cross-NUMA allocation
- 8 GPUs using all available across both NUMA nodes
- 9 GPUs requested (impossible, no hints returned)
- Container with no GPU request (nil hints)
- Allocation from preferred NUMA node
- Allocation with partial NUMA availability

All tests follow existing table-driven test pattern from cpumanager.

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
Copilot AI review requested due to automatic review settings March 11, 2026 01:11
@volcano-sh-bot added labels on Mar 11, 2026: do-not-merge/invalid-commit-message (indicates that a PR should not merge because it has an invalid commit message) and kind/feature (categorizes issue or PR as related to a new feature).
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jessestutler for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Contributor

Welcome @pmady! It looks like this is your first PR to volcano-sh/volcano 🎉

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the numaaware scheduler plugin by integrating GPU NUMA topology awareness. Previously, the plugin only considered CPU NUMA topology, which could lead to suboptimal GPU placements for performance-critical workloads. By incorporating GPU-to-NUMA node affinity into the scheduling decisions, the system can now intelligently place GPU-intensive tasks on nodes that minimize cross-NUMA communication, thereby improving performance for multi-GPU training and LLM inference.

Highlights

  • GPU Topology API Extension: New API types (GPUInfo, GPUDetails) and helper methods were introduced to represent GPU-to-NUMA node topology, enhancing the scheduler's understanding of hardware layout.
  • GPU HintProvider Implementation: A dedicated GPU HintProvider (gpumanager) was implemented to generate topology hints and allocate GPUs based on NUMA affinity, prioritizing placements that minimize cross-NUMA communication.
  • Scheduler Scoring Update: The numaaware scheduler plugin was updated to register the new GPU HintProvider and modify its scoring logic to factor in GPU NUMA node count alongside CPU NUMA count for more optimal task placement.


Changelog
  • pkg/scheduler/api/numa_info.go
    • Added GPUInfo and GPUDetails types for GPU topology information.
    • Added NUMANodes() and GPUsInNUMANodes() helper methods to GPUDetails.
    • Included a GPUDetail field in the NumatopoInfo struct.
    • Updated the DeepCopy method to correctly handle GPUDetail.
  • pkg/scheduler/plugins/numaaware/numaaware.go
    • Imported the new gpumanager package.
    • Registered gpumanager.NewProvider() alongside the existing CPU manager provider.
    • Modified getNodeNumaNumForTask to calculate and include GPU NUMA node count in the scoring logic.
    • Added a new helper function getNumaNodeCntForGPUID to count NUMA nodes for assigned GPUs.
  • pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go
    • Created a new file implementing the gpumanager provider.
    • Implemented the policy.HintProvider interface for nvidia.com/gpu resources.
    • Provided GetTopologyHints to generate NUMA-aware GPU placement hints.
    • Implemented Allocate to assign GPUs based on the best topology hint, prioritizing aligned GPUs.
  • pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng_test.go
    • Created a new file for unit tests of the gpumanager provider.
    • Included test cases for GetTopologyHints to verify correct hint generation.
    • Included test cases for Allocate to verify correct GPU assignment based on hints.
Activity
  • No human activity (comments, reviews) has been recorded for this pull request yet.

@volcano-sh-bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Mar 11, 2026.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces GPU NUMA topology awareness to the numaaware scheduler plugin, a valuable feature for performance-sensitive GPU workloads. While the changes are well-structured and follow existing patterns, two medium-severity security issues were identified: a potential Denial of Service (DoS) due to exponential complexity in bitmask iteration with many NUMA nodes, and a potential integer overflow when handling large GPU resource requests. These require addressing with appropriate input validation and consistent data types. Additionally, a significant issue in the scoring logic could lead to suboptimal scheduling decisions, and there are a few minor suggestions for improvement.

Comment on lines 241 to 256

    cpuNumaCnt := getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)

    // Include GPU NUMA node count in scoring if GPU assignments exist.
    gpuNumaCnt := 0
    assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]
    if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {
        gpuNumaCnt = getNumaNodeCntForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)
    }

    // Total NUMA node count: prefer nodes where both CPU and GPU
    // assignments span fewer NUMA nodes.
    totalNumaCnt := cpuNumaCnt + gpuNumaCnt
    nodeNumaCnts[index] = api.ScoredNode{
        NodeName: node.Name,
    -   Score:    int64(getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)),
    +   Score:    int64(totalNumaCnt),
    }


high

The current scoring logic totalNumaCnt := cpuNumaCnt + gpuNumaCnt incorrectly calculates the total number of NUMA nodes used. It sums the counts of NUMA nodes for CPUs and GPUs separately, which doesn't correctly represent the total span of NUMA nodes when resources share a NUMA node. For example, if both CPUs and GPUs are on NUMA node 0, the current logic would score it as 1+1=2, which is the same score as when CPUs are on NUMA 0 and GPUs are on NUMA 1. The score should be the count of the union of NUMA nodes used by both CPUs and GPUs.

To fix this, you should calculate the union of NUMA node bitmasks for CPUs and GPUs first, and then count the bits in the resulting mask for the score. This will require changing getNumaNodeCntForGPUID to return a bitmask.BitMask, and also modifying getNumaNodeCntForCPUID (or creating a new helper) to return a bitmask.

Suggested change — replace:

    cpuNumaCnt := getNumaNodeCntForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)
    // Include GPU NUMA node count in scoring if GPU assignments exist.
    gpuNumaCnt := 0
    assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]
    if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {
        gpuNumaCnt = getNumaNodeCntForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)
    }
    // Total NUMA node count: prefer nodes where both CPU and GPU
    // assignments span fewer NUMA nodes.
    totalNumaCnt := cpuNumaCnt + gpuNumaCnt
    nodeNumaCnts[index] = api.ScoredNode{
        NodeName: node.Name,
        Score:    int64(totalNumaCnt),
    }

with:

    cpuNumaMask := getNumaMaskForCPUID(assignCpus, node.NumaSchedulerInfo.CPUDetail)
    // Include GPU NUMA node count in scoring if GPU assignments exist.
    gpuNumaMask, _ := bitmask.NewBitMask()
    assignGPUs := resAssignMap[node.Name][string(gpumanager.NvidiaGPUResource)]
    if assignGPUs.Size() > 0 && node.NumaSchedulerInfo.GPUDetail != nil {
        gpuNumaMask = getNumaMaskForGPUID(assignGPUs, node.NumaSchedulerInfo.GPUDetail)
    }
    // Total NUMA node count: prefer nodes where both CPU and GPU
    // assignments span the minimum number of unique NUMA nodes.
    totalNumaMask := cpuNumaMask.Or(gpuNumaMask)
    nodeNumaCnts[index] = api.ScoredNode{
        NodeName: node.Name,
        Score:    int64(totalNumaMask.Count()),
    }
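The difference between the sum-based and union-based scores can be shown with a minimal, dependency-free sketch, using plain uint64 masks rather than the real bitmask.BitMask type:

```go
package main

import "fmt"

// spanUnion scores the union of the CPU and GPU NUMA-node sets, as the
// review suggests: distinct NUMA nodes are counted once, whether CPUs,
// GPUs, or both land on them.
func spanUnion(cpuMask, gpuMask uint64) int {
	union := cpuMask | gpuMask
	count := 0
	for ; union != 0; union >>= 1 {
		count += int(union & 1)
	}
	return count
}

func main() {
	// CPUs and GPUs co-located on NUMA 0: the sum-based score was 1+1=2,
	// the union-based score is 1.
	fmt.Println(spanUnion(1<<0, 1<<0)) // 1

	// CPUs on NUMA 0, GPUs on NUMA 1: score 2, now distinguishable from
	// the co-located case.
	fmt.Println(spanUnion(1<<0, 1<<1)) // 2
}
```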

Comment on lines +262 to +270

    func getNumaNodeCntForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) int {
        mask, _ := bitmask.NewBitMask()
        for _, gpuIdx := range gpus.List() {
            if info, ok := gpuDetails[gpuIdx]; ok {
                mask.Add(info.NUMANodeID)
            }
        }
        return mask.Count()
    }


high

To implement the corrected scoring logic as suggested in the other comment, this function should be renamed to getNumaMaskForGPUID and return a bitmask.BitMask instead of an int.

Suggested change — replace:

    func getNumaNodeCntForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) int {
        mask, _ := bitmask.NewBitMask()
        for _, gpuIdx := range gpus.List() {
            if info, ok := gpuDetails[gpuIdx]; ok {
                mask.Add(info.NUMANodeID)
            }
        }
        return mask.Count()
    }

with:

    func getNumaMaskForGPUID(gpus cpuset.CPUSet, gpuDetails api.GPUDetails) bitmask.BitMask {
        mask, _ := bitmask.NewBitMask()
        for _, gpuIdx := range gpus.List() {
            if info, ok := gpuDetails[gpuIdx]; ok {
                mask.Add(info.NUMANodeID)
            }
        }
        return mask
    }

    minAffinitySize := gpuDetail.NUMANodes().Size()
    hints := []policy.TopologyHint{}

    bitmask.IterateBitMasks(gpuDetail.NUMANodes().List(), func(mask bitmask.BitMask) {


Severity: medium (security)

The generateGPUTopologyHints function calls bitmask.IterateBitMasks with a list of NUMA node IDs. This function iterates over all possible non-empty combinations of the provided NUMA nodes, which has a complexity of $2^N - 1$, where $N$ is the number of unique NUMA nodes. If a node reports a large number of NUMA nodes (e.g., due to misconfiguration or a compromised node agent), the scheduler will consume excessive CPU time, leading to a Denial of Service (DoS). While the number of NUMA nodes is typically small, the lack of validation on this input makes the scheduler vulnerable to resource exhaustion.
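The growth the comment warns about, and the guard it asks for, can be sketched as follows. For comparison, kubelet's TopologyManager caps the number of NUMA nodes at 8 for the same reason; the constant and function names below are illustrative, not existing Volcano identifiers:

```go
package main

import "fmt"

// maxNUMANodesForHints caps how many NUMA nodes hint generation will
// consider; kubelet's TopologyManager uses 8 for the same reason.
const maxNUMANodesForHints = 8

// combinations is the number of non-empty NUMA-node subsets that
// IterateBitMasks visits: 2^N - 1.
func combinations(numNodes int) int {
	return (1 << numNodes) - 1
}

// validNUMATopology is a sketch of the input validation the review asks
// for: reject topologies whose node count would make hint enumeration
// intractable.
func validNUMATopology(numNodes int) bool {
	return numNodes > 0 && numNodes <= maxNUMANodesForHints
}

func main() {
	fmt.Println(combinations(2))      // 3
	fmt.Println(combinations(8))      // 255
	fmt.Println(combinations(32))     // over four billion masks on a 64-bit build
	fmt.Println(validNUMATopology(9)) // false
}
```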

    if !ok {
        return 0
    }
    return int(gpuQuantity.Value())


Severity: medium (security)

The requestedGPUs function converts a GPU resource request value from int64 to int. On 32-bit systems, or with extremely large request values, this can result in an integer overflow. A negative value resulting from such an overflow would cause incorrect behavior in the hint generation and allocation logic, potentially leading to suboptimal or incorrect scheduling decisions.
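A bounds-checked conversion addresses this. The sketch below rejects values outside a plausible range before converting; the function name is illustrative, not the PR's API:

```go
package main

import (
	"fmt"
	"math"
)

// safeGPUCount converts an int64 resource quantity to int with bounds
// checking, instead of a bare int(v). Values outside [0, MaxInt32] can
// never be valid GPU counts, so they are rejected up front, which also
// avoids overflow on 32-bit builds.
func safeGPUCount(v int64) (int, bool) {
	if v < 0 || v > math.MaxInt32 {
		return 0, false
	}
	return int(v), true
}

func main() {
	n, ok := safeGPUCount(4)
	fmt.Println(n, ok) // 4 true

	_, ok = safeGPUCount(int64(1) << 40) // absurdly large request
	fmt.Println(ok)                      // false
}
```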

    @@ -0,0 +1,186 @@
    /*
    Copyright 2026 The Volcano Authors.


medium

The copyright year is set to 2026, which is in the future. Please correct it to the current year.

Suggested change

    - Copyright 2026 The Volcano Authors.
    + Copyright 2024 The Volcano Authors.

Contributor

You are drunk gemini 🗡️

Comment on lines +69 to +77

    // Count how many available GPUs fall within this NUMA node combination.
    numMatching := 0
    for _, gpuIdx := range availableGPUs.List() {
        if gpuInfo, ok := gpuDetail[gpuIdx]; ok {
            if mask.IsSet(gpuInfo.NUMANodeID) {
                numMatching++
            }
        }
    }


medium

The calculation of numMatching is performed inside the IterateBitMasks loop, which iterates over all available GPUs for each NUMA node combination. This can be inefficient. You can optimize this by pre-calculating the number of available GPUs per NUMA node before the loop.

For example:

// Before IterateBitMasks
availablePerNuma := make(map[int]int)
for _, gpuIdx := range availableGPUs.List() {
    if gpuInfo, ok := gpuDetail[gpuIdx]; ok {
        availablePerNuma[gpuInfo.NUMANodeID]++
    }
}

bitmask.IterateBitMasks(..., func(mask bitmask.BitMask) {
    // ...

    // Replace the loop with this:
    numMatching := 0
    for _, numaID := range mask.GetBits() {
        numMatching += availablePerNuma[numaID]
    }

    // ...
})

    @@ -0,0 +1,305 @@
    /*
    Copyright 2026 The Volcano Authors.


medium

The copyright year is set to 2026, which is in the future. Please correct it to the current year.

Suggested change

    - Copyright 2026 The Volcano Authors.
    + Copyright 2024 The Volcano Authors.

Contributor

Copilot AI left a comment

Pull request overview

Adds GPU NUMA-topology awareness to the numaaware scheduler plugin so GPU workloads are preferentially placed on nodes where GPU allocations span fewer NUMA nodes (reducing cross-NUMA PCIe/NVLink traffic).

Changes:

  • Extends scheduler NUMA topology API (NumatopoInfo) with GPU topology (GPUDetail) and helper methods.
  • Introduces a new gpumanager hint provider to generate/allocate GPU topology hints for nvidia.com/gpu.
  • Updates numaaware scoring to account for GPU NUMA node span in addition to CPU.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
pkg/scheduler/api/numa_info.go Adds GPU topology types/helpers and deep-copy support via GPUDetail.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng.go Implements GPU topology hint generation and allocation for nvidia.com/gpu.
pkg/scheduler/plugins/numaaware/provider/gpumanager/gpu_mng_test.go Adds unit tests for GPU hint generation and allocation behavior.
pkg/scheduler/plugins/numaaware/numaaware.go Registers GPU provider and includes GPU NUMA span in node scoring.


Comment on lines +58 to +67

    func generateGPUTopologyHints(availableGPUs cpuset.CPUSet, gpuDetail api.GPUDetails, request int) []policy.TopologyHint {
        minAffinitySize := gpuDetail.NUMANodes().Size()
        hints := []policy.TopologyHint{}

        bitmask.IterateBitMasks(gpuDetail.NUMANodes().List(), func(mask bitmask.BitMask) {
            // Count the total GPUs in the NUMA nodes covered by this mask.
            gpusInMask := gpuDetail.GPUsInNUMANodes(mask.GetBits()...).Size()
            if gpusInMask >= request && mask.Count() < minAffinitySize {
                minAffinitySize = mask.Count()
            }

Copilot AI Mar 11, 2026


In generateGPUTopologyHints, minAffinitySize is updated based on the total GPUs in a NUMA-mask (gpuDetail.GPUsInNUMANodes), not on the available GPUs. If each single NUMA node has enough total GPUs but not enough available GPUs to satisfy the request, all feasible hints will have mask.Count() > minAffinitySize and will end up with Preferred=false (i.e., no preferred hint), which can degrade/incorrectly influence policy merging. Compute minAffinitySize from the masks that actually satisfy numMatching >= request (or compute min over the constructed hints) and add a unit test for the ‘must span NUMA nodes due to availability’ case.
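The fix Copilot describes, deriving the minimum affinity size from masks whose available GPUs satisfy the request, can be sketched with plain uint64 masks; the function name and the availablePerNuma map are illustrative stand-ins, not the plugin's API:

```go
package main

import "fmt"

// minAffinityFromAvailable returns the smallest NUMA-node-set size among
// masks whose available (not total) GPU count satisfies the request.
func minAffinityFromAvailable(availablePerNuma map[int]int, totalNodes, request int) int {
	minSize := totalNodes
	for mask := uint64(1); mask < uint64(1)<<totalNodes; mask++ {
		avail, bits := 0, 0
		for id := 0; id < totalNodes; id++ {
			if mask&(uint64(1)<<id) != 0 {
				avail += availablePerNuma[id]
				bits++
			}
		}
		if avail >= request && bits < minSize {
			minSize = bits
		}
	}
	return minSize
}

func main() {
	// Each NUMA node has 4 GPUs total but only 2 available: a request for
	// 4 GPUs must span both nodes, so the preferred affinity size is 2.
	// A total-GPU-based computation would wrongly conclude 1, leaving no
	// hint marked preferred.
	fmt.Println(minAffinityFromAvailable(map[int]int{0: 2, 1: 2}, 2, 4)) // 2
}
```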

Comment on lines 87 to 96

    // NumatopoInfo is the information about topology manager on the node
    type NumatopoInfo struct {
        Namespace   string
        Name        string
        Policies    map[nodeinfov1alpha1.PolicyName]string
        NumaResMap  map[string]*ResourceInfo
        CPUDetail   topology.CPUDetails
        GPUDetail   GPUDetails
        ResReserved v1.ResourceList
    }

Copilot AI Mar 11, 2026


NumatopoInfo now includes GPUDetail, and gpumanager/scoring depend on it, but in this repo the scheduler cache conversion (SchedulerCache.getNumaInfo in pkg/scheduler/cache/event_handlers.go) currently only populates CPUDetail/NumaResMap/ResReserved and does not set GPUDetail. With the current wiring, topoInfo.GPUDetail will remain empty and the GPU NUMA-awareness logic will never be exercised. Please extend the Numatopology source (CRD/spec + cache conversion) to carry GPU topology data into NumatopoInfo (and ensure the resource name keys match, e.g. "nvidia.com/gpu").

Contributor

This is fully missing, the feature basically has no runtime plumbing.

The previous scoring logic summed CPU and GPU NUMA node counts separately
(cpuNumaCnt + gpuNumaCnt), which incorrectly double-counted shared NUMA
nodes. For example, if CPUs and GPUs were both on NUMA 0, the score was
1+1=2 — the same as if they were on different NUMA nodes.

Fix by computing the union of CPU and GPU NUMA bitmasks before counting.
Co-located CPU+GPU on the same NUMA node now correctly scores 1.

Rename helper functions to return bitmask.BitMask instead of int:
- getNumaNodeCntForCPUID -> getNumaMaskForCPUID
- getNumaNodeCntForGPUID -> getNumaMaskForGPUID

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
@JesseStutler
Member

Thanks for your contribution! Will take a look 👍, also /cc @zjj2wry

Remove redundant nil check before len() on GPUDetail map.
len() for nil maps is defined as zero in Go.

Fixes volcano-sh#4998

Signed-off-by: pmady <pavan4devops@gmail.com>
@volcano-sh-bot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 672fbfb scheduler: add GPUDetail to NumatopoInfo for GPU NUMA topology
  • 2139528 numaaware: implement GPU topology HintProvider
  • 1861519 numaaware: register GPU provider and add GPU-aware scoring
  • bff8b2d numaaware: add unit tests for GPU topology HintProvider
  • 4dbbdf2 numaaware: fix scoring to use NUMA bitmask union for CPU+GPU
  • 2f71592 numaaware: fix staticcheck S1009 lint in gpumanager
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hajnalmt hajnalmt left a comment

Hello 👋 ,
Thank you for the contribution!
This is quite a wasp nest you put your hand in. 😄 I don't think that an implementation without any runtime plumbing makes sense here currently (without proper NumaTopology CRD extension, population from exporter, scheduler cache translation...). Without this, the new provider cannot be effective in real clusters.

Also, maybe we can avoid hardcoding "nvidia.com/gpu" in the provider design and make it resource-name driven, so that other accelerators (e.g. Ascend) can be supported within the same framework? An NVIDIA-first implementation is fine, but the abstraction should remain generic.

But first things first. Probably we should update, refactor and implement this feature in resource-exporter at least for nvidia GPUs, and maybe in the scheduler. After that we can revisit this PR.

    @@ -0,0 +1,186 @@
    /*
    Copyright 2026 The Volcano Authors.

Contributor

You are drunk gemini 🗡️

Comment on lines 87 to 96

    // NumatopoInfo is the information about topology manager on the node
    type NumatopoInfo struct {
        Namespace   string
        Name        string
        Policies    map[nodeinfov1alpha1.PolicyName]string
        NumaResMap  map[string]*ResourceInfo
        CPUDetail   topology.CPUDetails
        GPUDetail   GPUDetails
        ResReserved v1.ResourceList
    }
Contributor

This is fully missing, the feature basically has no runtime plumbing.


pmady commented Mar 13, 2026

Hi @hajnalmt, thank you for the thorough review — you are absolutely right, and this is exactly the feedback I needed.

Acknowledging the Gap

I agree that the scheduler-side logic alone does not make sense without the full runtime plumbing. After reviewing the resource-exporter codebase, I can see the complete pipeline that is needed:

  1. volcano-sh/apis — Extend NumatopoSpec to include a GPUDetail field so the Numatopology CRD carries GPU topology from nodes to the scheduler.
  2. volcano-sh/resource-exporter — Add a GPUNumaInfo provider that implements the NumaInfo interface, discovers GPU-to-NUMA topology via sysfs (/sys/bus/pci/devices/<BDF>/numa_node), and registers it alongside the existing CPU provider.
  3. volcano-sh/volcano (this PR) — Consume GPUDetail in the scheduler hint provider and scoring.
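The sysfs discovery in step 2 can be sketched as below, assuming the standard numa_node PCI attribute; the actual resource-exporter PR may structure this differently, and the BDF in main is a hypothetical example:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNUMANode parses the contents of a sysfs numa_node attribute.
// The kernel reports -1 when the device has no known NUMA affinity
// (common on single-socket machines).
func parseNUMANode(data string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(data))
}

// pciNUMANode reads /sys/bus/pci/devices/<BDF>/numa_node for one device.
func pciNUMANode(bdf string) (int, error) {
	data, err := os.ReadFile("/sys/bus/pci/devices/" + bdf + "/numa_node")
	if err != nil {
		return 0, err
	}
	return parseNUMANode(string(data))
}

func main() {
	node, err := pciNUMANode("0000:3b:00.0") // hypothetical BDF; varies per host
	if err != nil {
		fmt.Println("device not present on this host:", err)
		return
	}
	fmt.Println("NUMA node:", node)
}
```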

I should have started from the data-source side. I have now implemented the missing pieces:

PRs Created

On the Vendor-Agnostic Question

Good point about not hard-coding nvidia.com/gpu. I see two options:

  1. Generic DeviceNumaInfo — Discover all PCI devices with NUMA affinity via sysfs, then let the scheduler match by extended resource name. This would support AMD, Intel, and any future accelerators.
  2. Start with NVIDIA, abstract later — Ship a working NVIDIA implementation first, but behind a DeviceProvider interface that can be extended for ROCm/Intel later.

Which direction do you prefer? I am leaning toward option 2 (concrete first, abstract later) but happy to go either way.

I will keep this PR open and rebase it once the apis and resource-exporter PRs are merged. Please let me know if the approach looks good!

/cc @JesseStutler @zjj2wry

@volcano-sh-bot
Copy link
Contributor

@pmady: GitHub didn't allow me to request PR reviews from the following users: zjj2wry.

Note that only volcano-sh members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this: the comment quoted above.


Labels

do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhancement: Add GPU NUMA Topology Awareness to Scheduler

5 participants