fix: backfill action does not commit statements for pipelined tasks #5057
rayo1uo wants to merge 3 commits into volcano-sh:master
Conversation
Welcome @rayo1uo! It looks like this is your first PR to volcano-sh/volcano 🎉
Force-pushed 73792e9 to 1ecdd75
Code Review
This pull request aims to fix an issue in the backfill action by adding a statement commit. The changes in the test file correctly set up pipelined tasks for testing, which is a good improvement. However, the change in pkg/scheduler/actions/backfill/backfill.go appears to be based on a misunderstanding of how framework.Statement and UnPipeline work. The added Commit and Discard calls are ineffective, and the error handling is for an unreachable code block. I've provided a suggestion to correct and simplify this.
Force-pushed 1ecdd75 to 2802109
Force-pushed 2802109 to 34b866b
Pull request overview
This pull request fixes Bug 4 from issue #5048, where the backfill action failed to commit statements for pipelined tasks, preventing them from being properly unpipelined and reallocated.
Changes:
- Refactored `UnPipeline` implementation from Statement to Session for direct execution without requiring a commit
- Updated the backfill action to call `ssn.UnPipeline()` directly instead of going through an uncommitted statement
- Added comprehensive test coverage for backfill reallocation of pipelined best-effort tasks
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/scheduler/framework/statement.go | Simplified UnPipeline to delegate to session method |
| pkg/scheduler/framework/session.go | Added UnPipeline method moved from Statement for direct execution |
| pkg/scheduler/actions/backfill/backfill.go | Fixed bug by calling ssn.UnPipeline() directly instead of uncommitted stmt.UnPipeline() |
| pkg/scheduler/actions/backfill/backfill_test.go | Added comprehensive tests including test for pipelined task reallocation scenario |
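A rough sketch of the refactor's shape, using simplified stand-in types rather than the real `framework` API: `UnPipeline` now lives on the session and executes immediately, while the statement method just delegates to it, so no `Commit` is needed for the effect to take hold.

```go
package main

import "fmt"

// TaskInfo, Session, and Statement below are simplified stand-ins for the
// scheduler's real framework types, used only to illustrate the delegation.
type TaskInfo struct{ Name, NodeName string }

type Session struct{ pipelined map[string]*TaskInfo }

// UnPipeline executes immediately against session state, so callers no
// longer need a Commit for the effect to take hold.
func (ssn *Session) UnPipeline(task *TaskInfo) error {
	if _, ok := ssn.pipelined[task.Name]; !ok {
		return fmt.Errorf("task %s is not pipelined", task.Name)
	}
	delete(ssn.pipelined, task.Name)
	return nil
}

// Statement keeps an UnPipeline method for compatibility, but it now just
// delegates to the session instead of buffering an operation that would
// only apply on a later Commit.
type Statement struct{ ssn *Session }

func (s *Statement) UnPipeline(task *TaskInfo) error {
	return s.ssn.UnPipeline(task)
}

func main() {
	task := &TaskInfo{Name: "t1", NodeName: "n1"}
	ssn := &Session{pipelined: map[string]*TaskInfo{"t1": task}}
	stmt := &Statement{ssn: ssn}
	fmt.Println(stmt.UnPipeline(task)) // <nil>
	fmt.Println(len(ssn.pipelined))    // 0
}
```

This mirrors the intent described in the table above: the buggy path called `stmt.UnPipeline()` on a statement that was never committed, so making the operation direct on the session removes the commit dependency entirely.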
Comments suppressed due to low confidence (1)
pkg/scheduler/framework/session.go:827
- The error message says "when pipeline" but should say "when unpipeline" since this is in the UnPipeline method. This appears to be a copy-paste error from the original implementation.
```go
klog.Errorf("Failed to exec deallocate callback functions for task <%v/%v> to node <%v> when pipeline in Session <%v>: %v",
	task.Namespace, task.Name, task.NodeName, ssn.UID, eventInfo.Err)
```
Force-pushed 34b866b to 8e1ca0e
cc @hajnalmt Please take a look and let me know if this PR looks correct, thanks!
Force-pushed 8e1ca0e to ca80c89
hajnalmt
left a comment
Thanks for the change! Nice work.
The tests are a nice addition.
I had minor comments only. Can you modify the description of the PR too? You have added unit tests not e2e tests.
Force-pushed ca80c89 to 3654de1
@hajnalmt Thanks a lot for your thorough code review. I've addressed all your comments – feel free to check it out whenever you're free.
hajnalmt
left a comment
/ok-to-test
Thanks superb!
Are you trying to keep the written style consistent with the Allocate action? Why such a big change all of a sudden? I think changing session.Allocate to statement.Allocate, paired with statement.Commit, is enough :)
Yes, I think keeping the same style as the Allocate action is more reasonable.
Thanks for the changes, but this seems really risky and far-reaching.
I am not against rewriting the whole action, but we will need more people to take a look at it. The intention to align this implementation with allocate is something I can absolutely support, along with queue-order honoring in the backfill action. The latter is an exceptionally great addition, I think.
Force-pushed 7560695 to 84c6f06
@hajnalmt Thanks for your thorough review. This PR is ready for further review. 🚀
Force-pushed 84c6f06 to 2b16554
Force-pushed e150e7d to 29f41a1
/label tide/merge-method-squash
```go
klog.Errorf("JobInfo transition failed, ignore it.")
continue
// ...
job := jobs.Pop().(*api.JobInfo)
```
Previously it popped all the jobs until the jobQueue was empty; now it only pops one?
Yes, this adopts a fair-share manner here.
Imagine there are two queues, q1 and q2, each with two jobs (q1-j1, q1-j2, q2-j1, q2-j2).
The previous allocation sequence might be q1-j1, q1-j2, q2-j1, q2-j2,
while the current sequence is q1-j1, q2-j1, q1-j2, q2-j2.
Is this implementation reasonable, or do you prefer to keep the original allocation order?
I prefer to keep the original order, because we have already sorted the queues.
Please don't change this back, this is working correctly.
The point here @hzxuzhonghu is that after a job has been processed, the queue order can change, because the share changes for that queue (since its allocated resources are now higher). ssn.QueueOrder will properly reprioritize that queue after it is pushed back if it has more jobs, and pop the rightly prioritized queue with a new job, so we process jobs one by one to adjust to the priority change.
The example here shows how the queue order changes because of task allocation. If we have 2 queues:
| Queue | Deserved CPU | Allocated CPU (initial) | Share (initial) |
|---|---|---|---|
| Q1 | 100m | 10m | 10/100 = 0.10 |
| Q2 | 100m | 20m | 20/100 = 0.20 |
Initial queue order: Q1 (0.10) → Q2 (0.20) (Q1 popped first, lower share = higher priority)
- Q1 has two jobs: Q1-J1 (requests 50m CPU), Q1-J2 (requests 30m CPU)
- Q2 has two jobs: Q2-J1 (requests 40m CPU), Q2-J2 (requests 20m CPU)
Old Code (pop ALL jobs from queue, process them all)
Pop Q1 → process Q1-J1 and Q1-J2 back-to-back
Pop Q2 → process Q2-J1 and Q2-J2
Allocation sequence: Q1-J1 → Q1-J2 → Q2-J1 → Q2-J2
After Q1-J1 is allocated (50m), Q1's share is now 60/100 = 0.60. Higher than Q2's 0.20. But Q1-J2 still gets allocated next because we already popped Q1 and are draining all its jobs. Q2 is starved even though it should now have higher priority.
New Code (pop ONE job per queue, re-push queue, re-sort):
Iteration 1:
Queue order: Q1 (0.10) → Q2 (0.20)
Pop Q1 → allocate Q1-J1 (50m CPU)
Q1's share becomes: 60/100 = 0.60
Re-push Q1 into priority queue
Iteration 2:
Queue order: Q2 (0.20) → Q1 (0.60) ← Q2 is now higher priority!
Pop Q2 → allocate Q2-J1 (40m CPU)
Q2's share becomes: 60/100 = 0.60
Re-push Q2
Iteration 3:
Queue order: Q1 (0.60) == Q2 (0.60) (tie-broken arbitrarily)
Pop Q1 → allocate Q1-J2 (30m CPU)
Q1's share becomes: 90/100 = 0.90
Re-push Q1
Iteration 4:
Queue order: Q2 (0.60) → Q1 (0.90)
Pop Q2 → allocate Q2-J2 (20m CPU)
Allocation sequence: Q1-J1 → Q2-J1 → Q1-J2 → Q2-J2
This is the same pattern the allocate action already uses, pop one job, re-push the queue, let the priority queue re-sort.
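The loop above can be sketched with `container/heap`. The types and numbers below are simplified stand-ins for the scheduler's real queue bookkeeping (not the actual framework API), seeded with the Q1/Q2 figures from the worked example; the tie at share 0.60 is broken deterministically by queue name here, whereas the real heap breaks it arbitrarily.

```go
package main

import (
	"container/heap"
	"fmt"
)

// job and queueState are illustrative stand-ins: share = allocated/deserved,
// and jobs is a FIFO of pending jobs for the queue.
type job struct {
	name    string
	request float64
}

type queueState struct {
	name      string
	deserved  float64
	allocated float64
	jobs      []job
}

func (q *queueState) share() float64 { return q.allocated / q.deserved }

// queueHeap orders queues by share (lower share = higher priority), with the
// queue name as a deterministic tie-breaker for this sketch.
type queueHeap []*queueState

func (h queueHeap) Len() int { return len(h) }
func (h queueHeap) Less(i, j int) bool {
	if h[i].share() != h[j].share() {
		return h[i].share() < h[j].share()
	}
	return h[i].name < h[j].name
}
func (h queueHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *queueHeap) Push(x any)   { *h = append(*h, x.(*queueState)) }
func (h *queueHeap) Pop() any {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// allocateFairShare pops ONE job per queue, then re-pushes the queue so the
// heap re-sorts on the updated share, mirroring the new backfill loop.
func allocateFairShare(queues []*queueState) []string {
	h := &queueHeap{}
	for _, q := range queues {
		if len(q.jobs) > 0 {
			heap.Push(h, q)
		}
	}
	var order []string
	for h.Len() > 0 {
		q := heap.Pop(h).(*queueState)
		j := q.jobs[0]
		q.jobs = q.jobs[1:]
		q.allocated += j.request // allocation changes the queue's share
		order = append(order, j.name)
		if len(q.jobs) > 0 {
			heap.Push(h, q) // re-enqueue so the heap reprioritizes
		}
	}
	return order
}

func main() {
	q1 := &queueState{name: "Q1", deserved: 100, allocated: 10,
		jobs: []job{{"Q1-J1", 50}, {"Q1-J2", 30}}}
	q2 := &queueState{name: "Q2", deserved: 100, allocated: 20,
		jobs: []job{{"Q2-J1", 40}, {"Q2-J2", 20}}}
	fmt.Println(allocateFairShare([]*queueState{q1, q2}))
	// [Q1-J1 Q2-J1 Q1-J2 Q2-J2]
}
```

Running the sketch reproduces the fair-share sequence from the iterations above, as opposed to the old drain-one-queue-at-a-time order Q1-J1 → Q1-J2 → Q2-J1 → Q2-J2.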
@hajnalmt Really appreciate your detailed explanations!
There is definitely a performance decrease which this brings in, but it is minor or even negligible.
The priority queues use heap-based sorting (container/heap). Each Push/Pop is O(log n) where n is the number of queues or jobs (in this case queues). But n in practice is small: the number of queues in a cluster is typically single digits (1-10), and jobs per queue is in the low hundreds at most during a single scheduling cycle. (we filter this at several points, so Pending BestEffort tasks we are having here only.)
The old code was O(Q × J) with one flat pass. The new code is O(Q × log Q) for queue re-prioritization and O(J × log J) for job re-prioritization per queue. For small Q and J, I think this is inconsequential. The overhead is microseconds or less; this trade-off for fair-share queue ordering is well worth it.
A microsecond increase is not acceptable.
Can we add some bench tests before and after, covering different Q/J? Many users are concerned about performance.
```go
metrics.UpdateE2eSchedulingDurationByJob(job.Name, string(job.Queue), job.Namespace, metrics.Duration(job.CreationTimestamp.Time))
metrics.UpdateE2eSchedulingLastTimeByJob(job.Name, string(job.Queue), job.Namespace, time.Now())
// ...
if ssn.JobReady(job) { // ...
```
Sorry, I don't quite follow here: breaking the loop means we don't allocate for the other tasks of the job, which seems wrong; we should try to schedule all best-effort tasks here.
I only adjust the allocation sequence here: first try to allocate resources to make all jobs ready, and then allocate resources to the extra best-effort tasks within ready jobs. The relevant code is as follows:
```go
if len(stmt.Operations()) > 0 && ssn.JobReady(job) {
	stmt.Commit()
	if !tasks.Empty() { // remaining best-effort tasks would be allocated in a later loop
		jobs.Push(job)
	}
} else {
	stmt.Discard()
}
```
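A minimal stand-in for that commit-or-discard pattern; the `Statement` type here is a simplified sketch, not the real `framework.Statement` API, and only illustrates that buffered operations take effect on Commit and are dropped on Discard.

```go
package main

import "fmt"

// Statement is a simplified stand-in: operations are buffered and only take
// effect on Commit; Discard drops them without applying anything.
type Statement struct{ ops []string }

func (s *Statement) Allocate(task string) { s.ops = append(s.ops, task) }
func (s *Statement) Operations() []string { return s.ops }

// Commit applies the buffered allocations to the shared state.
func (s *Statement) Commit(state map[string]bool) {
	for _, t := range s.ops {
		state[t] = true
	}
}

// Discard drops the buffered operations without applying them.
func (s *Statement) Discard() { s.ops = nil }

func main() {
	state := map[string]bool{}
	stmt := &Statement{}
	stmt.Allocate("task-1")
	jobReady := true // stand-in for ssn.JobReady(job)
	if len(stmt.Operations()) > 0 && jobReady {
		stmt.Commit(state)
	} else {
		stmt.Discard()
	}
	fmt.Println(state["task-1"]) // true
}
```

The bug this PR fixes was exactly the missing Commit step: operations buffered on a statement that is never committed have no effect on the session.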
Is there a case where this falls into a dead loop if a specific task cannot be allocated?
> Is there a case where this falls into a dead loop if a specific task cannot be allocated?
@hzxuzhonghu Actually no - if a specific task cannot be allocated, it will be popped from the queue and will no longer be pushed back into it.
This may cause a performance issue too: if there are many other tasks exceeding minAvailable, we will re-enqueue many times and re-run this loop https://github.com/volcano-sh/volcano/pull/5057/changes#diff-8600dd48131f07698f5fd074b3979a15a990a348cb433ea74a8b104aed378584R139
@hzxuzhonghu I think this is valid in theory, but negligible in practice.
Removing this would regress the fairness behavior, as in the previous case, just above minAvailable this time. For example, if you have 2 jobs in different queues with the same priority (share) and the same task requests, both with minAvailable=2 but one with 100 best-effort tasks and the other with just 3, then the first can get its 98 tasks scheduled before the second gets to schedule 1, possibly starving it for a long time.
For T best-effort tasks beyond minAvailable, there are T extra iterations of the outer loop, each doing O(log Q + log J) priority queue operations. This means that for 100 extra tasks across 5 queues, averaging 10 jobs per queue, that's ~100 × (log 5 + log 10) ≈ 100 × 5 = 500 extra comparisons. Still microseconds.
If you want to address the re-enqueue concern without losing fairness, a better approach would be batching: allocate multiple tasks per iteration (e.g., allocate up to N tasks after the job is ready before breaking) rather than breaking after every single task once ready. But that's an optimization for a later PR if profiling shows it matters.
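The batching idea can be sketched as follows; `allocateBatch`, `batchSize`, and the string task names are all hypothetical illustrations for this suggestion, not code from the PR.

```go
package main

import "fmt"

// allocateBatch is a hypothetical sketch of the batching idea: once a job is
// ready, allocate up to batchSize extra best-effort tasks before breaking
// and re-enqueueing, instead of breaking after every single task.
func allocateBatch(pending []string, jobReady bool, batchSize int) (allocated, remaining []string) {
	if !jobReady {
		// Job not ready yet: nothing is committed, all tasks stay pending.
		return nil, pending
	}
	n := batchSize
	if n > len(pending) {
		n = len(pending)
	}
	return pending[:n], pending[n:]
}

func main() {
	alloc, rest := allocateBatch([]string{"t1", "t2", "t3", "t4", "t5"}, true, 3)
	fmt.Println(alloc) // [t1 t2 t3]
	fmt.Println(rest)  // [t4 t5]
}
```

With batchSize=1 this degenerates to the current break-after-every-task behavior; larger values trade a little fairness granularity for fewer re-enqueues.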
Thanks for the deep look. As above, can we test with some constructed cases?
@hzxuzhonghu @hajnalmt
I’ve run a benchmark here, and the related code can be found at https://github.com/rayo1uo/volcano/blob/fix_backfill_bench/pkg/scheduler/actions/backfill/backfill_benchmark_test.go. The results are as follows:
```
goos: darwin
goarch: arm64
pkg: volcano.sh/volcano/pkg/scheduler/actions/backfill
cpu: Apple M4 Pro
BenchmarkBackfillAllocateResources/q2_j4_t16/job_strategy/pop_all_jobs-14                 236    962353 ns/op    668331 B/op   10197 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/job_strategy/pop_one_job_requeue_queue-14    236    979088 ns/op    669209 B/op   10216 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/task_strategy/pop_all_tasks-14               285    899015 ns/op    611616 B/op    8748 allocs/op
BenchmarkBackfillAllocateResources/q2_j4_t16/task_strategy/job_ready_break_requeue-14     229   1018833 ns/op    667623 B/op   10183 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/job_strategy/pop_all_jobs-14                  54   3853279 ns/op   2758488 B/op   42076 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/job_strategy/pop_one_job_requeue_queue-14     60   3959456 ns/op   2769588 B/op   42217 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/task_strategy/pop_all_tasks-14                68   3592999 ns/op   2459904 B/op   35230 allocs/op
BenchmarkBackfillAllocateResources/q4_j8_t16/task_strategy/job_ready_break_requeue-14      58   4904244 ns/op   2677400 B/op   40878 allocs/op
PASS
ok      volcano.sh/volcano/pkg/scheduler/actions/backfill       488.108s
```
Job Strategy
pop_one_job_requeue_queue is slightly slower compared to pop_all_jobs.
| Workload | pop_all_jobs | pop_one_job_requeue_queue | Change |
|---|---|---|---|
| q2_j4_t16 | 962353 ns/op | 979088 ns/op | ~1.7% slower (0.0167 ms slower) |
| q4_j8_t16 | 3853279 ns/op | 3959456 ns/op | ~2.8% slower (0.106 ms slower) |
Task Strategy
pop_all_tasks significantly outperforms job_ready_break_requeue.
| Workload | pop_all_tasks | job_ready_break_requeue | Improvement |
|---|---|---|---|
| q2_j4_t16 | 899015 ns/op | 1018833 ns/op | ~11.8% faster (0.120 ms faster) |
| q4_j8_t16 | 3592999 ns/op | 4904244 ns/op | ~26.7% faster (1.31 ms faster) |
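For readers who want to reproduce this kind of comparison outside a `_test.go` file, Go's `testing.Benchmark` can drive the same measurement programmatically. The two functions below are placeholder workloads standing in for the real backfill strategies, not the benchmark code linked above; 512 mirrors the larger sample task count from the run.

```go
package main

import (
	"fmt"
	"testing"
)

// allocPopAll and allocPopOne are placeholder workloads standing in for the
// two backfill loop variants being measured.
func allocPopAll(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

func allocPopOne(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

func main() {
	for _, bench := range []struct {
		name string
		fn   func(int) int
	}{
		{"pop_all_jobs", allocPopAll},
		{"pop_one_job_requeue_queue", allocPopOne},
	} {
		// testing.Benchmark runs the closure with increasing b.N until the
		// timing is stable, like `go test -bench` would.
		r := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				bench.fn(512)
			}
		})
		fmt.Printf("%s: %d ns/op over %d iterations\n", bench.name, r.NsPerOp(), r.N)
	}
}
```

Comparing `ns/op` between variants at several Q/J/T sizes, as the tables above do, is what makes the fairness-versus-throughput trade-off concrete.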
Whao, cool benchmark. Nice. I think this is acceptable. Do I understand this properly, that the sample task counts are 128 and 512? It's arguable how right this is, but in practice we won't have more even for really large clusters; we are speaking about best-effort tasks only, and a single scheduling cycle.
> Do I understand this properly, that the sample task counts are 128 and 512?
Yes, the sample task counts are 128 and 512.
Signed-off-by: rayo1uo <luoyu19@mails.ucas.edu.cn>
Force-pushed 29f41a1 to 2d896bd
Force-pushed 2d896bd to 1b322eb
Signed-off-by: rayo1uo <luoyu19@mails.ucas.edu.cn>
Force-pushed d4fcde6 to 72b87e7
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: hajnalmt. It still needs approval from an approver for each of the changed files.
Hi @hzxuzhonghu @hajnalmt, could we proceed with merging this PR? If there are any pending requirements, please let me know what I can do to help move it forward.
From my side there is nothing to add; the PR is fine. Maybe @JesseStutler can help with a final review too.
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
part of #5048
Bug 4: Backfill action does not commit statements for pipelined tasks
Special notes for your reviewer:
- Refactored the backfill action to utilize `allocate` and `unpipeline` at the statement level instead of the session level.
- Added Commit and Discard function calls in `TestPickUpPendingTasks`.
- Added unit test `TestBackfillReallocateBestEffortTask` for the backfill action.
Does this PR introduce a user-facing change?