fix: backfill action does not commit statements for pipelined tasks#5057

Open
rayo1uo wants to merge 3 commits into volcano-sh:master from rayo1uo:fix_backfill_commit

Conversation


@rayo1uo rayo1uo commented Feb 25, 2026

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

part of #5048
Bug 4: Backfill action does not commit statements for pipelined tasks

Special notes for your reviewer:

  • Refactored the backfill action to use allocate and unpipeline at the statement level instead of the session level.

  • Added Commit and Discard handling in TestPickUpPendingTasks.

  • Added the unit test TestBackfillReallocateBestEffortTask for the backfill action.

Does this PR introduce a user-facing change?

NONE

@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2026
@volcano-sh-bot
Contributor

Welcome @rayo1uo! It looks like this is your first PR to volcano-sh/volcano 🎉

@gemini-code-assist

Summary of Changes

Hello @rayo1uo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue in the scheduler's backfill action where transactional statements were not being correctly finalized. By explicitly committing successful UnPipeline operations and discarding failed ones, it ensures data consistency and proper state management within the scheduler. The accompanying test updates validate this corrected behavior.

Highlights

  • Transaction Management Fix: Implemented proper transaction handling within the backfill action by ensuring stmt.Commit() is called upon successful UnPipeline of a task and stmt.Discard() is called if an error occurs during UnPipeline.
  • Test Case Enhancement: Updated backfill_test.go to include klog/v2 import, added a dummy node for testing, and modified the Pipeline calls to correctly commit or discard statements, mirroring the logic applied in the main backfill action.


Changelog
  • pkg/scheduler/actions/backfill/backfill.go
    • Added stmt.Commit() after a successful UnPipeline operation.
    • Added stmt.Discard() if UnPipeline fails.
  • pkg/scheduler/actions/backfill/backfill_test.go
    • Imported klog/v2.
    • Added a test node to the scheduler cache.
    • Modified Pipeline calls to include error handling with stmt.Discard() and stmt.Commit().
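The commit-or-discard handling described in the changelog can be sketched as follows. This is a minimal stand-in model, not the actual framework code: `Statement`, `UnPipeline`, `Commit`, and `Discard` here are hypothetical simplifications of the real `framework.Statement` API.

```go
package main

import "fmt"

// Hypothetical stand-in for framework.Statement: the real type batches
// scheduling operations and applies or rolls them back as a unit.
type Statement struct {
	ops       []string
	committed bool
}

// UnPipeline records an un-pipeline operation; it fails for an empty task.
func (s *Statement) UnPipeline(task string) error {
	if task == "" {
		return fmt.Errorf("no task to unpipeline")
	}
	s.ops = append(s.ops, "unpipeline "+task)
	return nil
}

// Commit applies the recorded operations; Discard drops them.
func (s *Statement) Commit()  { s.committed = true }
func (s *Statement) Discard() { s.ops = nil }

// backfillTask mirrors the fix: commit on success, discard on error, so
// operations are never left dangling in an uncommitted statement.
func backfillTask(stmt *Statement, task string) bool {
	if err := stmt.UnPipeline(task); err != nil {
		stmt.Discard()
		return false
	}
	stmt.Commit()
	return true
}

func main() {
	fmt.Println(backfillTask(&Statement{}, "job1/task1")) // prints: true
}
```

The point of the fix is exactly this shape: every statement opened for a task ends in either `Commit()` or `Discard()`, never in an open, uncommitted state.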

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 25, 2026
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 73792e9 to 1ecdd75 Compare February 25, 2026 04:24

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix an issue in the backfill action by adding a statement commit. The changes in the test file correctly set up pipelined tasks for testing, which is a good improvement. However, the change in pkg/scheduler/actions/backfill/backfill.go appears to be based on a misunderstanding of how framework.Statement and UnPipeline work. The added Commit and Discard calls are ineffective, and the error handling is for an unreachable code block. I've provided a suggestion to correct and simplify this.

@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 1ecdd75 to 2802109 Compare February 25, 2026 04:46
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 25, 2026
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 2802109 to 34b866b Compare February 25, 2026 07:24
@rayo1uo rayo1uo marked this pull request as ready for review February 25, 2026 07:25
Copilot AI review requested due to automatic review settings February 25, 2026 07:25
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 25, 2026
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes Bug 4 from issue #5048, where the backfill action failed to commit statements for pipelined tasks, preventing them from being properly unpipelined and reallocated.

Changes:

  • Refactored UnPipeline implementation from Statement to Session for direct execution without requiring commit
  • Updated backfill action to call ssn.UnPipeline() directly instead of through an uncommitted statement
  • Added comprehensive test coverage for backfill reallocation of pipelined best-effort tasks

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File | Description
pkg/scheduler/framework/statement.go | Simplified UnPipeline to delegate to the session method
pkg/scheduler/framework/session.go | Added UnPipeline method moved from Statement for direct execution
pkg/scheduler/actions/backfill/backfill.go | Fixed the bug by calling ssn.UnPipeline() directly instead of an uncommitted stmt.UnPipeline()
pkg/scheduler/actions/backfill/backfill_test.go | Added comprehensive tests, including a test for the pipelined task reallocation scenario
Comments suppressed due to low confidence (1)

pkg/scheduler/framework/session.go:827

  • The error message says "when pipeline" but should say "when unpipeline" since this is in the UnPipeline method. This appears to be a copy-paste error from the original implementation.
				klog.Errorf("Failed to exec deallocate callback functions for task <%v/%v> to node <%v> when pipeline in Session <%v>: %v",
					task.Namespace, task.Name, task.NodeName, ssn.UID, eventInfo.Err)


@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 34b866b to 8e1ca0e Compare February 25, 2026 07:29
@rayo1uo rayo1uo changed the title fix: add statement commit in backfill action fix: backfill action does not commit statements for pipelined tasks Feb 25, 2026
@rayo1uo
Author

rayo1uo commented Feb 25, 2026

cc @hajnalmt Please take a look and let me know if this PR looks correct, thanks!

@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 8e1ca0e to ca80c89 Compare February 25, 2026 07:51
Contributor

@hajnalmt hajnalmt left a comment


Thanks for the change! Nice work.
The tests are a nice addition.

I had minor comments only. Can you modify the description of the PR too? You have added unit tests, not e2e tests.

@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from ca80c89 to 3654de1 Compare February 25, 2026 11:02
@rayo1uo
Author

rayo1uo commented Feb 25, 2026

@hajnalmt Thanks a lot for your thorough code review. I've addressed all your comments – feel free to check it out whenever you're free.

Contributor

@hajnalmt hajnalmt left a comment


/ok-to-test

Thanks superb!

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 25, 2026
Contributor

@hajnalmt hajnalmt left a comment


/lgtm

@JesseStutler
Member

@hajnalmt @JesseStutler Could you kindly review this again? The backfill action has been refactored to use allocate and unpipeline at the statement level (replacing the session-level implementation), and this change is now complete. Thanks~

Are you trying to keep the style consistent with the Allocate action? Why such a big change all of a sudden? I think changing session.Allocate to statement.Allocate, paired with statement.Commit, is enough :)

@rayo1uo
Author

rayo1uo commented Feb 28, 2026

@hajnalmt @JesseStutler Could you kindly review this again? The backfill action has been refactored to use allocate and unpipeline at the statement level (replacing the session-level implementation), and this change is now complete. Thanks~

Are you trying to keep the style consistent with the Allocate action? Why such a big change all of a sudden? I think changing session.Allocate to statement.Allocate, paired with statement.Commit, is enough :)

Yes, I think keeping the same style as the Allocate action is more reasonable.
This achieves the goal of fair-share allocation for best-effort tasks.
BTW, I don't think simply changing session.Allocate to statement.Allocate is sufficient, because we also need to determine when to commit or discard the statements.

Contributor

@hajnalmt hajnalmt left a comment


Thanks for the changes, but this seems really risky and far-reaching.
I am not against rewriting the whole action, but we will need more people to take a look at it. The intention to align this implementation with allocate is something I can absolutely support, along with queue-order honoring in the backfill action. The latter is an exceptionally great addition, I think.

@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch 3 times, most recently from 7560695 to 84c6f06 Compare March 2, 2026 06:32
@rayo1uo
Author

rayo1uo commented Mar 2, 2026

@hajnalmt Thanks for your thorough review. This PR is ready for further review. 🚀

@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 84c6f06 to 2b16554 Compare March 2, 2026 08:07
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch 2 times, most recently from e150e7d to 29f41a1 Compare March 3, 2026 07:13
Contributor

@hajnalmt hajnalmt left a comment


/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2026
@hajnalmt
Contributor

hajnalmt commented Mar 3, 2026

/label tide/merge-method-squash
/area scheduling
/unhold

@volcano-sh-bot volcano-sh-bot added area/scheduling tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 3, 2026
klog.Errorf("JobInfo transition failed, ignore it.")
continue

job := jobs.Pop().(*api.JobInfo)
Member


Previously it popped all the jobs until the jobQueue was empty; now it only pops one?

Author

@rayo1uo rayo1uo Mar 4, 2026


Yes, it adopts a fair-share manner here.
Imagine there are two queues, q1 and q2, each with two jobs (q1-j1, q1-j2, q2-j1, q2-j2).
The previous allocation sequence might be q1-j1, q1-j2, q2-j1, q2-j2,
while the current sequence is q1-j1, q2-j1, q1-j2, q2-j2.
Is this implementation reasonable, or do you prefer to keep the original allocation order?

Member


I prefer keeping the original order because we have already sorted the queues.

Contributor

@hajnalmt hajnalmt Mar 4, 2026


Please don't change this back; this is working correctly.
The point here, @hzxuzhonghu, is that after a job has been processed the queue order can change, because the share changes for that queue (since its allocation is now higher). ssn.QueueOrder will properly reprioritize the queue after it is pushed back if it has more jobs, and pop the right prioritized queue with a new job, so we process jobs one by one to adjust to the priority change.

Contributor


The example below shows how the queue order changes because of task allocation. Suppose we have 2 queues:

Queue | Deserved CPU | Allocated CPU (initial) | Share (initial)
Q1 | 100m | 10m | 10/100 = 0.10
Q2 | 100m | 20m | 20/100 = 0.20

Initial queue order: Q1 (0.10) → Q2 (0.20) (Q1 popped first, lower share = higher priority)

  • Q1 has two jobs: Q1-J1 (requests 50m CPU), Q1-J2 (requests 30m CPU)
  • Q2 has two jobs: Q2-J1 (requests 40m CPU), Q2-J2 (requests 20m CPU)

Old Code (pop ALL jobs from queue, process them all)
Pop Q1 → process Q1-J1 and Q1-J2 back-to-back
Pop Q2 → process Q2-J1 and Q2-J2
Allocation sequence: Q1-J1 → Q1-J2 → Q2-J1 → Q2-J2

After Q1-J1 is allocated (50m), Q1's share is now 60/100 = 0.60. Higher than Q2's 0.20. But Q1-J2 still gets allocated next because we already popped Q1 and are draining all its jobs. Q2 is starved even though it should now have higher priority.

New Code (pop ONE job per queue, re-push queue, re-sort):

Iteration 1:
Queue order: Q1 (0.10) → Q2 (0.20)
Pop Q1 → allocate Q1-J1 (50m CPU)
Q1's share becomes: 60/100 = 0.60
Re-push Q1 into priority queue

Iteration 2:
Queue order: Q2 (0.20) → Q1 (0.60) ← Q2 is now higher priority!
Pop Q2 → allocate Q2-J1 (40m CPU)
Q2's share becomes: 60/100 = 0.60
Re-push Q2
Iteration 3:
Queue order: Q1 (0.60) == Q2 (0.60) (tie-broken arbitrarily)
Pop Q1 → allocate Q1-J2 (30m CPU)
Q1's share becomes: 90/100 = 0.90
Re-push Q1
Iteration 4:
Queue order: Q2 (0.60) → Q1 (0.90)
Pop Q2 → allocate Q2-J2 (20m CPU)
Allocation sequence: Q1-J1 → Q2-J1 → Q1-J2 → Q2-J2

This is the same pattern the allocate action already uses: pop one job, re-push the queue, and let the priority queue re-sort.

Author


@hajnalmt Really appreciate your detailed explanations!

Member


@rayo1uo @hajnalmt

This way is fairer, but what is the performance impact? I can see there is a degradation because of the increased sorting.

Contributor


There is definitely a performance decrease here, but it is minor or even negligible.

The priority queues use heap-based sorting (container/heap). Each Push/Pop is O(log n), where n is the number of queues or jobs (in this case, queues). In practice n is small: the number of queues in a cluster is typically in the single digits (1-10), and jobs per queue number in the low hundreds at most during a single scheduling cycle (we filter at several points, so only pending best-effort tasks remain here).

The old code was O(Q × J) with one flat pass. The new code is O(Q × log Q) for queue re-prioritization plus O(J × log J) for job re-prioritization per queue. For small Q and J, this is inconsequential. The overhead is microseconds or less; this trade-off for fair-share queue ordering is well worth it.

Member


A microsecond-level increase is not acceptable.
Can we add some benchmark tests, before and after, covering different Q/J values? Many users are concerned about performance.

metrics.UpdateE2eSchedulingDurationByJob(job.Name, string(job.Queue), job.Namespace, metrics.Duration(job.CreationTimestamp.Time))
metrics.UpdateE2eSchedulingLastTimeByJob(job.Name, string(job.Queue), job.Namespace, time.Now())

if ssn.JobReady(job) {
Member


Sorry, I don't follow here. Breaking the loop means we don't allocate for other tasks of the job, which seems wrong; we should try to schedule all best-effort tasks here.

Author

@rayo1uo rayo1uo Mar 4, 2026


I only adjust the allocation sequence here: first try to allocate resources to make all jobs ready, then allocate resources to the extra best-effort tasks within ready jobs. The relevant code is as follows:

if len(stmt.Operations()) > 0 && ssn.JobReady(job) {
	stmt.Commit()
	if !tasks.Empty() { // remaining best effort tasks would be allocated in later loop
		jobs.Push(job)
	}
} else {
	stmt.Discard()
}

Member


Is there a case where this falls into a dead loop, if a specific task cannot be allocated?

Author


Is there a case where this falls into a dead loop, if a specific task cannot be allocated?

@hzxuzhonghu Actually no: if a specific task cannot be allocated, it is popped from the queue and is no longer pushed back into it.

Member


This may cause a performance issue too: if there are many other tasks exceeding minAvailable, we will re-enqueue many times and re-run this loop: https://github.com/volcano-sh/volcano/pull/5057/changes#diff-8600dd48131f07698f5fd074b3979a15a990a348cb433ea74a8b104aed378584R139

Contributor


@hzxuzhonghu I think this is valid in theory, but negligible in practice.

Removing this would regress the fairness behavior, as in the previous case but just above minAvailable. For example, if you have 2 jobs in different queues with the same priority (share) and the same task requests, both with minAvailable=2, but one with 100 best-effort tasks and the other with just 3, then the first can get its 98 tasks scheduled before the second gets to schedule one, possibly starving it for a long time.

For T best-effort tasks beyond minAvailable, there are T extra iterations of the outer loop, each doing O(log Q + log J) priority-queue operations. For 100 extra tasks across 5 queues, averaging 10 jobs per queue, that's ~100 × (log 5 + log 10) ≈ 100 × 5 = 500 extra comparisons. Still microseconds.

If you want to address the re-enqueue concern without losing fairness, a better approach would be batching: allocate multiple tasks per iteration (e.g., up to N tasks after the job is ready before breaking) rather than breaking after every single task once ready. But that's an optimization for a later PR, if profiling shows it matters.
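The batching idea could be sketched roughly as follows. This is purely illustrative: `allocateJob`, `minAvailable`, and `batchSize` are hypothetical names for the sake of the example, not code from this PR.

```go
package main

import "fmt"

// allocateJob sketches the proposed optimization: allocate the
// minAvailable tasks that make the job ready, then up to batchSize
// extra best-effort tasks, and only then re-enqueue the remainder;
// this avoids re-enqueueing after every single extra task.
func allocateJob(tasks []string, minAvailable, batchSize int) (done, rest []string) {
	limit := minAvailable + batchSize
	if limit > len(tasks) {
		limit = len(tasks)
	}
	return tasks[:limit], tasks[limit:]
}

func main() {
	tasks := []string{"t1", "t2", "t3", "t4", "t5", "t6"}
	// minAvailable=2 makes the job ready; batchSize=2 extra tasks
	// are allocated before the job goes back on the queue.
	done, rest := allocateJob(tasks, 2, 2)
	fmt.Println(done, rest) // prints: [t1 t2 t3 t4] [t5 t6]
}
```

A larger `batchSize` trades fairness granularity for fewer re-enqueues; `batchSize=1` degenerates to the per-task behavior discussed above.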

Member


Thanks for the deep look. As above, can we test with some constructed cases?

Author


@hzxuzhonghu @hajnalmt
I’ve run a benchmark here, and the related code can be found at https://github.com/rayo1uo/volcano/blob/fix_backfill_bench/pkg/scheduler/actions/backfill/backfill_benchmark_test.go. The results are as follows:

goos: darwin
goarch: arm64
pkg: volcano.sh/volcano/pkg/scheduler/actions/backfill
cpu: Apple M4 Pro
BenchmarkBackfillAllocateResources/q2_j4_t16/job_strategy/pop_all_jobs-14         	     236	    962353 ns/op	  668331 B/op	   10197 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/job_strategy/pop_one_job_requeue_queue-14         	     236	    979088 ns/op	  669209 B/op	   10216 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/task_strategy/pop_all_tasks-14                    	     285	    899015 ns/op	  611616 B/op	    8748 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/task_strategy/job_ready_break_requeue-14          	     229	   1018833 ns/op	  667623 B/op	   10183 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/job_strategy/pop_all_jobs-14                      	      54	   3853279 ns/op	 2758488 B/op	   42076 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/job_strategy/pop_one_job_requeue_queue-14         	      60	   3959456 ns/op	 2769588 B/op	   42217 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/task_strategy/pop_all_tasks-14                    	      68	   3592999 ns/op	 2459904 B/op	   35230 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/task_strategy/job_ready_break_requeue-14          	      58	   4904244 ns/op	 2677400 B/op	   40878 allocs/op
PASS
ok  	volcano.sh/volcano/pkg/scheduler/actions/backfill	488.108s

Job Strategy

pop_one_job_requeue_queue is slightly slower compared to pop_all_jobs.

Workload | pop_all_jobs | pop_one_job_requeue_queue | Change
q2_j4_t16 | 962353 ns/op | 979088 ns/op | ~1.7% slower (0.0167 ms)
q4_j8_t16 | 3853279 ns/op | 3959456 ns/op | ~2.8% slower (0.106 ms)

Task Strategy

pop_all_tasks significantly outperforms job_ready_break_requeue.

Workload | pop_all_tasks | job_ready_break_requeue | Improvement
q2_j4_t16 | 899015 ns/op | 1018833 ns/op | ~11.8% faster (0.120 ms)
q4_j8_t16 | 3592999 ns/op | 4904244 ns/op | ~26.7% faster (1.31 ms)

Contributor


Wow, cool benchmark, nice. I think this is acceptable. Do I understand correctly that the sample task counts are 128 and 512? It's arguable how representative this is, but in practice we won't have more even for really large clusters; we are talking about best-effort tasks only, and a single scheduling cycle.

Author


Wow, cool benchmark, nice. I think this is acceptable. Do I understand correctly that the sample task counts are 128 and 512? It's arguable how representative this is, but in practice we won't have more even for really large clusters; we are talking about best-effort tasks only, and a single scheduling cycle.

Yes, the sample task counts are 128 and 512.

rayo1uo added 2 commits March 4, 2026 18:12
Signed-off-by: rayo1uo <luoyu19@mails.ucas.edu.cn>
Signed-off-by: rayo1uo <luoyu19@mails.ucas.edu.cn>
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 29f41a1 to 2d896bd Compare March 4, 2026 10:12
@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 4, 2026
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from 2d896bd to 1b322eb Compare March 4, 2026 10:20
Signed-off-by: rayo1uo <luoyu19@mails.ucas.edu.cn>
@rayo1uo rayo1uo force-pushed the fix_backfill_commit branch from d4fcde6 to 72b87e7 Compare March 4, 2026 10:30
Contributor

@hajnalmt hajnalmt left a comment


/lgtm


@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hajnalmt
Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rayo1uo
Author

rayo1uo commented Mar 10, 2026

Hi @hzxuzhonghu @hajnalmt, could we proceed with merging this PR? If there are any pending requirements, please let me know what I can do to help move it forward.

@hajnalmt
Contributor

From my side there is nothing to add; the PR is fine. Maybe @JesseStutler can help with a final review too.


Labels

area/scheduling lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
