
[Proposal] Gate-Controlled Scheduling for Cluster Autoscalers Compatibility#4727

Open
devzizu wants to merge 30 commits into volcano-sh:master from devzizu:proposal-4710

Conversation

@devzizu

@devzizu devzizu commented Nov 15, 2025

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

This design proposal addresses an issue where Volcano incorrectly signals cluster autoscalers (e.g., CA or Karpenter) to scale up nodes even when pods are only waiting for queue capacity, not cluster resources. Currently, Volcano marks all unallocated pods as Unschedulable regardless of the reason, causing autoscalers to interpret queue constraints as insufficient node capacity.

This proposal introduces an opt-in feature using Kubernetes schedulingGates to hide queue-constrained pods from autoscalers, ensuring scale-up operations only trigger for legitimate node-fit failures. This aims to prevent unnecessary infrastructure costs and resource waste.
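The gating/ungating mechanics the proposal relies on can be sketched as follows. This is a minimal illustration using dict-shaped pod objects, and the gate name `volcano.sh/queue-capacity` is a hypothetical placeholder, not necessarily the identifier the proposal uses:

```python
# Simplified sketch (not the actual Volcano implementation): pods are plain
# dicts, and the gate name below is an assumed placeholder.
VOLCANO_GATE = "volcano.sh/queue-capacity"

def gate_for_queue(pod):
    """Add the Volcano scheduling gate so autoscalers ignore this pod."""
    gates = pod.setdefault("spec", {}).setdefault("schedulingGates", [])
    if not any(g["name"] == VOLCANO_GATE for g in gates):
        gates.append({"name": VOLCANO_GATE})
    return pod

def ungate(pod):
    """Remove the Volcano gate once queue capacity becomes available."""
    gates = pod.get("spec", {}).get("schedulingGates", [])
    pod["spec"]["schedulingGates"] = [g for g in gates if g["name"] != VOLCANO_GATE]
    return pod

pod = {"metadata": {"name": "job-pod-0"}, "spec": {}}
gate_for_queue(pod)
print(pod["spec"]["schedulingGates"])  # [{'name': 'volcano.sh/queue-capacity'}]
ungate(pod)
print(pod["spec"]["schedulingGates"])  # []
```

Because a gated pod is not considered for scheduling at all (per KEP-3521), autoscalers such as CA or Karpenter never see it as Unschedulable while it waits for queue capacity.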

Which issue(s) this PR fixes:

Fixes #4710

@volcano-sh-bot volcano-sh-bot added kind/documentation Categorizes issue or PR as related to documentation. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 15, 2025
@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 15, 2025
@volcano-sh-bot
Contributor

Welcome @devzizu! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 15, 2025
@devzizu devzizu marked this pull request as ready for review November 15, 2025 22:30
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 15, 2025
@JesseStutler
Member

@kingeasternsun @hajnalmt Do you have time to help @devzizu take a look at this proposal?

@hajnalmt
Contributor

hajnalmt commented Dec 4, 2025

/cc @hajnalmt

Sure I am starting to review it 👍

Contributor

@hajnalmt hajnalmt left a comment


Thank you for this great proposal @devzizu.
I am looking forward to the implementation, great idea and solid design change.

I had mostly minor remarks, except for the capacity plugin change.
What I see is that, after this change, the pods Volcano marks as scheduling-gated (via the annotation) could perhaps be considered Inqueue, but treating them outright as Allocated seems like too significant a change.

Wouldn't it be sufficient to adjust the DeductSchGatedResources function so that tasks that are only Volcano scheduling-gated are not deducted when hasOnlyVolcanoSchedulingGate returns true? This approach would also maintain compatibility with the overcommit plugin, which I'm slightly concerned about since it heavily depends on scheduling gate behavior.
One remark: this would mean that these annotated pods would move to Inqueue and would never be in scope of the enqueue action once every other scheduling gate is removed, which is probably what we want.

Thank you for the contribution and the idea once more!
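A rough sketch of the adjustment suggested above, assuming a hypothetical gate name and dict-shaped pods; the real DeductSchGatedResources operates on Volcano's internal task and resource types, so this only illustrates the control flow:

```python
# Illustrative only: the gate name and pod shape are assumptions, and CPU is
# modeled as a plain number rather than Volcano's Resource type.
VOLCANO_GATE = "volcano.sh/queue-capacity"

def has_only_volcano_scheduling_gate(pod):
    """True when the pod's sole scheduling gate is the Volcano queue gate."""
    gates = pod.get("spec", {}).get("schedulingGates", [])
    return len(gates) == 1 and gates[0]["name"] == VOLCANO_GATE

def deduct_sch_gated_resources(idle_cpu, pods):
    """Deduct gated pods' requests from idle capacity, skipping pods that
    are gated only by Volcano (i.e., waiting on queue capacity)."""
    for pod in pods:
        if not pod.get("spec", {}).get("schedulingGates"):
            continue  # not gated at all, handled elsewhere
        if has_only_volcano_scheduling_gate(pod):
            continue  # queue-constrained only: don't deduct
        idle_cpu -= pod["requests_cpu"]
    return idle_cpu
```

With this shape, a pod carrying only the Volcano gate leaves idle capacity untouched, while a pod with any externally-owned gate is still deducted as before.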

Contributor

@hajnalmt hajnalmt left a comment


Thank you so much for the updates @devzizu !
The reserved cache is a good idea, but I still had so many questions, please find them below!

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 21, 2025
Contributor

@hajnalmt hajnalmt left a comment


We are getting there! Keep up the great work 😊

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign william-wang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@hajnalmt hajnalmt left a comment


Great Work!

Please squash your commits, and sign the DCO! I think this is quite ready.
Do you need any help with the implementation? We will need to add an e2e test for this too.
I think it should happen in a different PR, though, as the design is quite big.

@hajnalmt
Contributor

/cc @JesseStutler

I think we can pull this in release-1.15

Signed-off-by: devzizu <jazevedo960@gmail.com>
@JesseStutler
Member

@devzizu Hi Pedro, sorry, I want to revisit why we chose to use a scheduling gate instead of introducing a new condition, such as Unallocatable, rather than letting pods with insufficient queue resources keep the Unschedulable condition. The schedulingGate modification feels a bit too intrusive, even though we can control it with a switch, and it is inevitable that there will be more switch-control code in the future, so I am not sure this is the better solution. Sorry to revisit our original design at this point, but since we have already changed so much code, why not try to solve it from the condition side? WDYT? I know this is not very friendly to your current work; it's just a suggestion of mine, and you're welcome to disagree with me at any time.

@devzizu
Author

devzizu commented Mar 6, 2026


IIRC, we did not consider following that path because it would introduce a new non-standard reason name in the condition (i.e., Unallocatable or similar). I also don't know how autoscalers would react to that, especially if we would be flipping the condition between Unschedulable and Unallocatable depending on whether the Pod passed the allocatable check in each scheduling cycle. Besides, we would also need to reserve space, right? If a Pod is Unschedulable and the queue can only fit one pod, we don't want other Pods to also pass the check, so maybe we would not be saving much complexity there. I also believe scheduling gates are more semantically correct, as introduced in KEP-3521. I'm happy to follow the approach that makes more sense, so let's try to reach an agreement on the right path 🙏

WDYT?
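The reservation concern raised above can be made concrete with a toy sketch: once one gated pod is admitted against remaining queue capacity, that capacity must be reserved immediately so a second pod does not pass the same check within the same cycle. Names and shapes here are illustrative, not Volcano's actual cache types:

```python
# Toy illustration of in-cycle reservation. "queue_free" is the queue's
# remaining capacity (a plain number here), and pods are (name, request)
# pairs; both are simplifications for the sake of the example.
def admit_pods(queue_free, pods):
    admitted, reserved = [], 0
    for name, request in pods:
        if request <= queue_free - reserved:
            reserved += request   # reserve immediately, before the next check
            admitted.append(name)
    return admitted

# With 4 units free, pod "a" (3) reserves most of the capacity, so "b" (3)
# fails the check even though it would have passed against the unreserved value.
print(admit_pods(4, [("a", 3), ("b", 3), ("c", 1)]))  # ['a', 'c']
```

Without the `reserved` accumulator, both "a" and "b" would pass the capacity check in the same cycle and jointly exceed the queue, which is the over-admission problem the reserved cache in the proposal is meant to avoid.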


Labels

kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cluster Autoscaler node scale-up for Pods that exceed Queue's capability

5 participants