fix: queue allocated metric consistency for jobless queues in capacit…#5050
fix: queue allocated metric consistency for jobless queues in capacit…#5050Aliexe-code wants to merge 1 commit intovolcano-sh:masterfrom
Conversation
|
Welcome @Aliexe-code! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of ChangesHello @Aliexe-code, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a bug in the capacity scheduler plugin that led to inconsistent allocated resource metrics for jobless queues, such as the root queue, when queue hierarchy was not enabled. Previously, these queues would incorrectly show zero allocated resources. The changes ensure that the "Queue.Status.Allocated" value is correctly used for these queues, providing accurate resource reporting and improving the reliability of resource management within the scheduler. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request correctly addresses the issue where jobless queues were missing from metrics when hierarchy is disabled. By ensuring all queues are populated in queueOpts using their status as a fallback, it provides consistent reporting. The simplification of the metrics recording logic is also a welcome improvement. I have identified one maintainability issue regarding code duplication.
| for queueID, queueInfo := range ssn.Queues { | ||
| if _, exists := cp.queueOpts[queueID]; !exists { | ||
| attr := &queueAttr{ | ||
| queueID: queueInfo.UID, | ||
| name: queueInfo.Name, | ||
| deserved: api.NewResource(queueInfo.Queue.Spec.Deserved), | ||
| allocated: api.EmptyResource(), | ||
| request: api.EmptyResource(), | ||
| elastic: api.EmptyResource(), | ||
| inqueue: api.EmptyResource(), | ||
| guarantee: api.EmptyResource(), | ||
| } | ||
| if len(queueInfo.Queue.Spec.Capability) != 0 { | ||
| attr.capability = api.NewResource(queueInfo.Queue.Spec.Capability) | ||
| if attr.capability.MilliCPU <= 0 { | ||
| attr.capability.MilliCPU = math.MaxFloat64 | ||
| } | ||
| if attr.capability.Memory <= 0 { | ||
| attr.capability.Memory = math.MaxFloat64 | ||
| } | ||
| } | ||
| if len(queueInfo.Queue.Spec.Guarantee.Resource) != 0 { | ||
| attr.guarantee = api.NewResource(queueInfo.Queue.Spec.Guarantee.Resource) | ||
| } | ||
| realCapability := api.ExceededPart(cp.totalResource, cp.totalGuarantee).Add(attr.guarantee) | ||
| if attr.capability == nil { | ||
| attr.capability = api.EmptyResource() | ||
| attr.realCapability = realCapability | ||
| } else { | ||
| realCapability.MinDimensionResource(attr.capability, api.Infinity) | ||
| attr.realCapability = realCapability | ||
| } | ||
| // Use Queue.Status.Allocated for jobless queues to ensure metric consistency | ||
| if queueInfo.Queue.Status.Allocated != nil { | ||
| attr.allocated = api.NewResource(queueInfo.Queue.Status.Allocated) | ||
| } | ||
| cp.queueOpts[queueID] = attr | ||
| klog.V(4).Infof("Added jobless Queue <%s> attributes with allocated from status: <%v>", queueInfo.Name, attr.allocated) | ||
| } | ||
| } |
There was a problem hiding this comment.
The logic for initializing queueAttr (calculating deserved, capability, guarantee, and realCapability) is almost identical to the logic used in the previous loop (lines 418-448). This duplication makes the code harder to maintain and prone to drift if the initialization logic changes in the future. Consider refactoring this common initialization logic into a helper method that can be used in both places.
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where jobless queues (such as the root queue) had their allocated metrics incorrectly reported as zero when hierarchy was disabled, instead of using the actual Queue.Status.Allocated value. The fix ensures that all queues in the session are added to the capacity plugin's internal queueOpts map, with jobless queues using their Status.Allocated values for consistency with the Queue CRD status.
Changes:
- Added logic to populate
queueOptsfor jobless queues usingQueue.Status.Allocatedwhen hierarchy is disabled - Simplified metrics recording to iterate over
queueOptsinstead of handling jobless queues separately - Added test case to verify jobless queue handling
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/scheduler/plugins/capacity/capacity.go | Added logic to populate queueOpts for jobless queues using Status.Allocated (lines 487-528), simplified metrics recording to handle all queues uniformly (lines 541-550) |
| pkg/scheduler/plugins/capacity/capacity_test.go | Added TestBuildQueueAttrsMetricConsistency to verify jobless queue allocated values are read from Status.Allocated |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| if actualAllocated.Memory != expectedAllocated.Memory { | ||
| t.Errorf("Root queue allocated Memory: got %v, want %v", actualAllocated.Memory, expectedAllocated.Memory) | ||
| } |
There was a problem hiding this comment.
The test verifies that rootQueue.Queue.Status.Allocated contains the expected values, but this only confirms the test setup, not the fix itself. The actual fix ensures that jobless queues are added to queueOpts with their allocated values from Status.Allocated, which would then be used for metrics reporting. However, the test doesn't verify that the capacity plugin's queueOpts was populated correctly for the jobless queue, or that metrics would be reported with the correct values. Consider accessing the plugin instance to verify queueOpts contains an entry for the root queue with the correct allocated value, or trigger and verify metric updates.
| } | |
| } | |
| // Additionally verify that the capacity plugin populated queueOpts for the jobless root queue | |
| pluginIface, ok := ssn.Plugins[PluginName] | |
| if !ok { | |
| t.Fatal("capacity plugin should be registered in the session") | |
| } | |
| capacityPlugin, ok := pluginIface.(*CapacityPlugin) | |
| if !ok { | |
| t.Fatalf("plugin registered under %q is not *CapacityPlugin", PluginName) | |
| } | |
| rootQueueOpt, ok := capacityPlugin.queueOpts[rootQueueID] | |
| if !ok { | |
| t.Fatalf("queueOpts should contain an entry for root queue %q", rootQueueID) | |
| } | |
| if rootQueueOpt.allocated == nil { | |
| t.Fatal("queueOpts entry for root queue should have allocated resources set") | |
| } | |
| // queueOpts for the jobless root queue must reflect Status.Allocated | |
| pluginAllocated := rootQueueOpt.allocated | |
| if pluginAllocated.MilliCPU != expectedAllocated.MilliCPU { | |
| t.Errorf("queueOpts allocated CPU for root queue: got %v, want %v", pluginAllocated.MilliCPU, expectedAllocated.MilliCPU) | |
| } | |
| if pluginAllocated.Memory != expectedAllocated.Memory { | |
| t.Errorf("queueOpts allocated Memory for root queue: got %v, want %v", pluginAllocated.Memory, expectedAllocated.Memory) | |
| } |
There was a problem hiding this comment.
Yeah this would be good, but not a necessity the test verifies root at least 👍
| attr := &queueAttr{ | ||
| queueID: queueInfo.UID, | ||
| name: queueInfo.Name, | ||
| deserved: api.NewResource(queueInfo.Queue.Spec.Deserved), | ||
| allocated: api.EmptyResource(), | ||
| request: api.EmptyResource(), | ||
| elastic: api.EmptyResource(), | ||
| inqueue: api.EmptyResource(), | ||
| guarantee: api.EmptyResource(), | ||
| } | ||
| if len(queueInfo.Queue.Spec.Capability) != 0 { | ||
| attr.capability = api.NewResource(queueInfo.Queue.Spec.Capability) | ||
| if attr.capability.MilliCPU <= 0 { | ||
| attr.capability.MilliCPU = math.MaxFloat64 | ||
| } | ||
| if attr.capability.Memory <= 0 { | ||
| attr.capability.Memory = math.MaxFloat64 | ||
| } | ||
| } | ||
| if len(queueInfo.Queue.Spec.Guarantee.Resource) != 0 { | ||
| attr.guarantee = api.NewResource(queueInfo.Queue.Spec.Guarantee.Resource) | ||
| } | ||
| realCapability := api.ExceededPart(cp.totalResource, cp.totalGuarantee).Add(attr.guarantee) | ||
| if attr.capability == nil { | ||
| attr.capability = api.EmptyResource() | ||
| attr.realCapability = realCapability | ||
| } else { | ||
| realCapability.MinDimensionResource(attr.capability, api.Infinity) | ||
| attr.realCapability = realCapability | ||
| } |
There was a problem hiding this comment.
The code for initializing queue attributes here (lines 491-520) is nearly identical to the code for job-based queues (lines 418-448). Consider extracting this logic into a helper function to reduce code duplication and improve maintainability. The function could accept a queueInfo parameter and return a queueAttr, handling the capability, guarantee, and realCapability initialization in one place.
6d220dd to
d1f90fd
Compare
|
/cc hajnalmt @Aliexe-code Can I review this one? :) |
d1f90fd to
a3bac86
Compare
|
Hello @Aliexe-code , |
|
@hajnalmt Please go ahead , sorry for late responding :) |
|
/retest |
hajnalmt
left a comment
There was a problem hiding this comment.
Thanks looks Overall good to me 😊
I had minor comments only can you please address them?
| // initQueueAttr initializes a queueAttr from queueInfo. | ||
| // This helper method ensures consistent initialization for both queues with jobs and jobless queues. | ||
| func (cp *capacityPlugin) initQueueAttr(queueInfo *api.QueueInfo) *queueAttr { | ||
| attr := &queueAttr{ |
There was a problem hiding this comment.
I think we can reuse newQueueAttr here to avoid code duplication.
attr := cp.newQueueAttr(queue)
| } | ||
| if len(queueInfo.Queue.Spec.Capability) != 0 { | ||
| attr.capability = api.NewResource(queueInfo.Queue.Spec.Capability) | ||
| if attr.capability.MilliCPU <= 0 { |
There was a problem hiding this comment.
If you switch to newQueueAttr you still need to handle this as a defensive measure, but attr.capability will be set to an EmptyResource so it's never nil, the same is true for attr.realCapability. You can even merge the two if blocks (this and the one at the end of this function) and handle the realCapability settings and the <=0 negative guard rails in one go.
| metrics.UpdateQueueDeserved(attr.name, attr.deserved.MilliCPU, attr.deserved.Memory, attr.deserved.ScalarResources) | ||
| metrics.UpdateQueueAllocated(attr.name, attr.allocated.MilliCPU, attr.allocated.Memory, attr.allocated.ScalarResources) | ||
| metrics.UpdateQueueRequest(attr.name, attr.request.MilliCPU, attr.request.Memory, attr.request.ScalarResources) | ||
| if attr.capability != nil { |
There was a problem hiding this comment.
This is never nil at this point. Can you delete this branch please?
| } | ||
|
|
||
| for _, attr := range cp.queueOpts { | ||
| if attr.realCapability != nil { |
There was a problem hiding this comment.
This will never be nil at this point. Can you delete this branch please?
| metrics.UpdateQueueAllocated(attr.name, attr.allocated.MilliCPU, attr.allocated.Memory, attr.allocated.ScalarResources) | ||
| metrics.UpdateQueueRequest(attr.name, attr.request.MilliCPU, attr.request.Memory, attr.request.ScalarResources) | ||
| if attr.capability != nil { | ||
| metrics.UpdateQueueCapacity(attr.name, attr.capability.MilliCPU, attr.capability.Memory, attr.capability.ScalarResources) |
There was a problem hiding this comment.
We have that defensive measure that sets capability for maxFloat64 if it's <=0 for cpu and memory, it will be a very big number in these updates. I wonder if we should have that or just 0. Let's leave it this way now.
(This is true for the hierachy case too just wanted to mention this here.)
hajnalmt
left a comment
There was a problem hiding this comment.
Thanks for the change!
Please sign the DCO on your commit 😊 We are getting there, the capability/realCapability logic is not right in the new change, can you fix that please?
| attr := cp.newQueueAttr(queueInfo) | ||
|
|
||
| // Set MaxFloat64 for CPU/Memory if <= 0 (defensive measure for unlimited capacity) | ||
| if attr.capability.MilliCPU <= 0 { |
There was a problem hiding this comment.
This will set these to be MaxFloat64 always when they are set to 0 or missed too, this is two different cases! We want to set them only when they are not explicitly set to 0.
|
|
||
| // Calculate realCapability: the effective resource limit for this queue | ||
| realCapability := api.ExceededPart(cp.totalResource, cp.totalGuarantee).Add(attr.guarantee) | ||
| realCapability.MinDimensionResource(attr.capability, api.Infinity) |
There was a problem hiding this comment.
if Queue.Spec.Capability is set to nil this will override it! You need to distinguish these two settings via if len(queue.Queue.Spec.Capability) != 0 like in newQueueAttr.
The right code here:
realCapability := api.ExceededPart(cp.totalResource, cp.totalGuarantee).Add(attr.guarantee)
if len(queueInfo.Queue.Spec.Capability) != 0 {
if attr.capability.MilliCPU <= 0 {
attr.capability.MilliCPU = math.MaxFloat64
}
if attr.capability.Memory <= 0 {
attr.capability.Memory = math.MaxFloat64
}
realCapability.MinDimensionResource(attr.capability, api.Infinity)
}
attr.realCapability = realCapability
| // TestBuildQueueAttrsMetricConsistency tests Bug 1 fix: | ||
| // Queue allocated metric should be consistent with Queue.Status.Allocated | ||
| // when hierarchy is disabled, including for jobless queues (like root queue). | ||
| // Before the fix: jobless queues were not added to queueOpts, causing metrics to report 0 allocated. |
There was a problem hiding this comment.
causing metrics to report stale values as allocated.
|
Hello @Aliexe-code, |
072566e to
96168fb
Compare
|
@hajnalmt Thank you so much for your patience and thorough review! The changes are now pushed. I really appreciate your guidance in helping me get this right! |
hajnalmt
left a comment
There was a problem hiding this comment.
/area scheduling
/kind bug
/ok-to-test
Thank you so much! Let's see how the testing goes, and I will put an lgtm on this if it goes through.
| // Distinguish between Capability not set (nil) vs Capability explicitly set to 0. | ||
| // When Capability is nil, we use MaxFloat64 as a default for unlimited capacity. | ||
| // When Capability is explicitly set to 0, we respect the user's intent and keep it as 0. | ||
| if queueInfo.Queue.Spec.Capability != nil && len(queueInfo.Queue.Spec.Capability) != 0 { |
There was a problem hiding this comment.
nitpick: The != nil check here is technically redundant since len(nil) == 0 in Go, so len(queueInfo.Queue.Spec.Capability) != 0 alone would suffice. That said, I think it's fine to keep it nicely together with the comment, it makes the explicit distinction between "not set" and "set to 0" very readable.
There was a problem hiding this comment.
@Aliexe-code The static check failed here exactly with this error.
It looks like we need to check only len here.
level=info msg="Execution took 5m13.131287087s"
Error: pkg/scheduler/plugins/capacity/capacity.go:460:5: S1009: should omit nil check; len() for nil maps is defined as zero (staticcheck)
if queueInfo.Queue.Spec.Capability != nil && len(queueInfo.Queue.Spec.Capability) != 0 {
| } | ||
| if actualAllocated.Memory != expectedAllocated.Memory { | ||
| t.Errorf("Root queue allocated Memory: got %v, want %v", actualAllocated.Memory, expectedAllocated.Memory) | ||
| } |
There was a problem hiding this comment.
Yeah this would be good, but not a necessity the test verifies root at least 👍
hajnalmt
left a comment
There was a problem hiding this comment.
Oh I see you have a merge conflict, can you rebase on master please?
|
@hajnalmt Rebased on latest upstream/master. |
hajnalmt
left a comment
There was a problem hiding this comment.
Thanks!
The PR failed on the static check, can you please fix this line? otherwise lgtm.
level=info msg="Execution took 5m13.131287087s"
Error: pkg/scheduler/plugins/capacity/capacity.go:460:5: S1009: should omit nil check; len() for nil maps is defined as zero (staticcheck)
if queueInfo.Queue.Spec.Capability != nil && len(queueInfo.Queue.Spec.Capability) != 0 {
| // Distinguish between Capability not set (nil) vs Capability explicitly set to 0. | ||
| // When Capability is nil, we use MaxFloat64 as a default for unlimited capacity. | ||
| // When Capability is explicitly set to 0, we respect the user's intent and keep it as 0. | ||
| if queueInfo.Queue.Spec.Capability != nil && len(queueInfo.Queue.Spec.Capability) != 0 { |
There was a problem hiding this comment.
@Aliexe-code The static check failed here exactly with this error.
It looks like we need to check only len here.
level=info msg="Execution took 5m13.131287087s"
Error: pkg/scheduler/plugins/capacity/capacity.go:460:5: S1009: should omit nil check; len() for nil maps is defined as zero (staticcheck)
if queueInfo.Queue.Spec.Capability != nil && len(queueInfo.Queue.Spec.Capability) != 0 {
|
@hajnalmt Fixed the static check issue. Removed the redundant nil check as suggested. The CI should pass now. Thanks! |
hajnalmt
left a comment
There was a problem hiding this comment.
Hello,
@Aliexe-code I don't see a commit, and the code is the same. Did you forget to push?
e65bfba to
af787ec
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hajnalmt The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
…y plugin When hierarchy is disabled, jobless queues (like root) were not added to queueOpts, causing their metrics to report zero allocated instead of using Queue.Status.Allocated. This fix: - Adds jobless queues to queueOpts with allocated from Status.Allocated - Refactors queueAttr initialization into helper method to avoid duplication - Distinguishes between Capability not set vs explicitly set to 0 - Adds test to verify jobless queue allocated consistency Related: volcano-sh#5048 (Bug 1) Signed-off-by: Aliexe-code <terminalexe1@gmail.com>
4048a70 to
e3f339e
Compare
/kind bug
What this PR does / why we need it:
Fixes a bug where jobless queues (like root) had their allocated metrics incorrectly
reported as zero when hierarchy was disabled, instead of using the actual
Queue.Status.Allocatedvalue.Which issue(s) this PR fixes:
Related to #5048 (Bug 1)
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?