fix: panic and restart of volcano scheduler pods on install #5077

Tau721 wants to merge 1 commit into volcano-sh:master
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical concurrency bug in the Volcano scheduler that caused pods to panic and restart during installation when multiple scheduler instances were deployed. The fix makes the queue creation process resilient to concurrent creation of the same queue.
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Code Review
This pull request addresses a critical bug where multiple Volcano scheduler pods could panic on startup due to a race condition when creating default and root queues. The fix correctly handles the AlreadyExists error within a retry loop, ensuring that concurrent initializations do not lead to fatal errors. The change is well-reasoned and includes a new concurrent test case to validate the fix. I have one minor suggestion to improve observability by restoring a log message for retryable errors.
```go
if err == nil {
	klog.V(2).Infof("Successfully created queue %s", name)
	return nil
}
// If this queue has just been created by others and already exists, skip creating it.
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
return err
```
The logic to handle IsAlreadyExists errors is correct and fixes the race condition. However, the log message for other retryable errors was removed. It would be good to restore it for better observability into why retries are happening.
```go
if err == nil {
	klog.V(2).Infof("Successfully created queue %s", name)
	return nil
}
// If this queue has just been created by others and already exists, skip creating it.
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
return err
```
Pull request overview
Fixes a race during Volcano scheduler startup where multiple scheduler replicas concurrently creating the root and default queues could hit an AlreadyExists error path and crash-loop.
Changes:
- Make default/root queue initialization tolerate concurrent creation by treating `AlreadyExists` on create as success.
- Add a concurrent unit test covering creation/skip scenarios for the `root` and `default` queues.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pkg/scheduler/cache/cache.go` | Adjusts queue creation retry logic to gracefully handle concurrent `AlreadyExists` results. |
| `pkg/scheduler/cache/cache_test.go` | Adds a concurrency-focused test validating queue initialization behavior. |
```diff
 	Jitter: 0.1,
 }, func(err error) bool {
-	return !apierrors.IsAlreadyExists(err)
+	return true
```
In the retry.OnError call, the retry predicate now unconditionally returns true. This will retry for the full backoff even on non-retriable errors (e.g., Forbidden/Invalid), delaying scheduler startup before ultimately fatal-ing. Consider restricting retries to transient API errors (timeouts, server errors, too-many-requests) and returning immediately for non-retriable errors.
```diff
-	return true
+	// Retry only on transient API errors.
+	return apierrors.IsTimeout(err) ||
+		apierrors.IsServerTimeout(err) ||
+		apierrors.IsTooManyRequests(err)
```
/cc @Tau721 Do we need to retry until timeout if the error is not an already-exists error?
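The effect of the suggested predicate can be illustrated with a stripped-down retry loop. `retryOnError` below is a simplified, hypothetical stand-in for client-go's `retry.OnError` (no backoff), and the sentinel errors are substitutes for the `apierrors.IsTimeout`/`IsForbidden` checks in the real code.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for the apierrors classification helpers.
var (
	errTimeout   = errors.New("timeout")
	errForbidden = errors.New("forbidden")
)

// retriable mirrors the suggested predicate: only transient API errors
// are worth retrying; RBAC or validation failures are not.
func retriable(err error) bool {
	return errors.Is(err, errTimeout)
}

// retryOnError is a minimal stand-in for retry.OnError: it retries fn
// up to steps times while retriable(err) is true.
func retryOnError(steps int, retriable func(error) bool, fn func() error) error {
	var err error
	for i := 0; i < steps; i++ {
		if err = fn(); err == nil || !retriable(err) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	// A forbidden error fails immediately instead of burning the backoff.
	err := retryOnError(5, retriable, func() error {
		attempts++
		return errForbidden
	})
	fmt.Println(attempts, err)
}
```

With the unconditional `return true`, the same forbidden error would be retried for all five steps before surfacing.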
```go
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
```
When Create returns a non-nil, non-AlreadyExists error, the code returns the error without logging it. Since this sits inside a retry loop, losing the per-attempt error detail can make diagnosing startup issues harder; consider logging the failure (and that it will retry) before returning the error.
```diff
 	}
 }
+klog.Errorf("Failed to create queue %s, will retry: %v", name, err)
```
`pkg/scheduler/cache/cache_test.go` (outdated)
```go
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
```
The goroutine calls wg.Done() at the end rather than deferring it. If newDefaultAndRootQueue ever panics or early-returns due to a fatal path, this can deadlock the test. Prefer defer wg.Done() at the top of the goroutine body.
```diff
-	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
-	wg.Done()
+	defer wg.Done()
+	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
```
```go
var wg sync.WaitGroup
wg.Add(tt.concurrentNumber)
for i := 0; i < tt.concurrentNumber; i++ {
	go func() {
		newDefaultAndRootQueue(tt.vcClient, defaultQueue)
		wg.Done()
	}()
}
wg.Wait()
```
This test calls newDefaultAndRootQueue, which uses klog.Fatalf on error (os.Exit), including when invoked inside a goroutine. A failure here would terminate the entire go test process abruptly and can be hard to debug. Consider refactoring the queue-init logic to return an error (and keep the Fatalf at the top-level scheduler startup), so the unit test can assert on errors instead of exiting.
```diff
-var wg sync.WaitGroup
-wg.Add(tt.concurrentNumber)
-for i := 0; i < tt.concurrentNumber; i++ {
-	go func() {
-		newDefaultAndRootQueue(tt.vcClient, defaultQueue)
-		wg.Done()
-	}()
-}
-wg.Wait()
+for i := 0; i < tt.concurrentNumber; i++ {
+	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
+}
```
Force-pushed from 364a0d7 to 6bc83ff.
/kind bug
…rently create the `root` or `default` queue

Signed-off-by: caotuo721 <caotuo721@yeah.net>
Force-pushed from 6bc83ff to 3023275.
JesseStutler left a comment:

Could you also specify multiple replicas for the volcano-scheduler when installing (in the e2e test)? I'm wondering because we always deploy a single scheduler replica and never hit this error before, and we should never accept a panic.
```go
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
```
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix the bug that, when installing Volcano with two or more volcano-scheduler pods, some of those pods may panic and restart.
Which issue(s) this PR fixes:
Fixes #5074
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?