Skip to content

fix: panic and restart of volcano scheduler pods on install#5077

Open
Tau721 wants to merge 1 commit intovolcano-sh:masterfrom
Tau721:fix_scheduler_restart_on_install
Open

fix: panic and restart of volcano scheduler pods on install#5077
Tau721 wants to merge 1 commit intovolcano-sh:masterfrom
Tau721:fix_scheduler_restart_on_install

Conversation

@Tau721
Copy link
Contributor

@Tau721 Tau721 commented Mar 4, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix the bug that when installing volcano with two or more volcano scheduler pods, some of these pods may panic and restart.

Which issue(s) this PR fixes:

Fixes #5074

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

NONE

Copilot AI review requested due to automatic review settings March 4, 2026 01:58
@volcano-sh-bot
Copy link
Contributor

@Tau721: The label(s) kind/bugfix cannot be applied, because the repository doesn't have them.

Details

In response to this:

What type of PR is this?

/kind bugfix

What this PR does / why we need it:

Fix the bug that when installing volcano with two or more volcano scheduler pods, some of these pods may panic and restart.

Which issue(s) this PR fixes:

Fixes #5074

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical concurrency bug in the Volcano scheduler that caused pods to panic and restart during installation when multiple scheduler instances were deployed. The fix enhances the queue creation process by making it resilient to AlreadyExists errors, ensuring stable operation in multi-scheduler environments.

Highlights

  • Concurrency Bug Fix: Resolved an issue where Volcano scheduler pods would panic and restart when multiple instances concurrently attempted to create default and root queues during installation.
  • Improved Queue Creation Logic: Modified the queue creation retry mechanism to gracefully handle AlreadyExists errors, preventing unnecessary retries and panics when a queue has already been created by another scheduler instance.
  • Enhanced Test Coverage: Added new test cases to specifically validate the concurrent creation of default and root queues, ensuring the robustness of the updated logic.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/scheduler/cache/cache.go
    • Updated the newDefaultAndRootQueue function to explicitly check for apierrors.IsAlreadyExists when creating queues. If the queue already exists, the error is now handled gracefully by logging and returning nil, preventing retries and panics. The retry condition was also simplified to always retry for other transient errors.
  • pkg/scheduler/cache/cache_test.go
    • Introduced a new test suite, TestNewDefaultAndRootQueue, which includes multiple scenarios to verify the correct behavior of concurrent default and root queue creation, including cases where queues already exist.
Activity
  • No specific activity (comments, reviews, etc.) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 4, 2026
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical bug where multiple Volcano scheduler pods could panic on startup due to a race condition when creating default and root queues. The fix correctly handles the AlreadyExists error within a retry loop, ensuring that concurrent initializations do not lead to fatal errors. The change is well-reasoned and includes a new concurrent test case to validate the fix. I have one minor suggestion to improve observability by restoring a log message for retryable errors.

Comment on lines +483 to +492
if err == nil {
klog.V(2).Infof("Successfully created queue %s", name)
return nil
}

klog.V(2).Infof("Successfully created queue %s", name)
return nil
// If this queue has just been created by others and already exists, skip creating it.
if apierrors.IsAlreadyExists(err) {
klog.V(2).Infof("Queue %s already exists, skip creating.", name)
return nil
}
return err

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to handle IsAlreadyExists errors is correct and fixes the race condition. However, the log message for other retryable errors was removed. It would be good to restore it for better observability into why retries are happening.

            if err == nil {
                klog.V(2).Infof("Successfully created queue %s", name)
                return nil
            }
            // If this queue has just been created by others and already exists, skip creating it.
            if apierrors.IsAlreadyExists(err) {
                klog.V(2).Infof("Queue %s already exists, skip creating.", name)
                return nil
            }
            klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
            return err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race during Volcano scheduler startup where multiple scheduler replicas concurrently creating the root and default queues could hit an AlreadyExists error path and crash-loop.

Changes:

  • Make default/root queue initialization tolerate concurrent creation by treating AlreadyExists on create as success.
  • Add a concurrent unit test covering creation/skip scenarios for root and default queues.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/scheduler/cache/cache.go Adjusts queue creation retry logic to gracefully handle concurrent AlreadyExists results.
pkg/scheduler/cache/cache_test.go Adds a concurrency-focused test validating queue initialization behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Jitter: 0.1,
}, func(err error) bool {
return !apierrors.IsAlreadyExists(err)
return true
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the retry.OnError call, the retry predicate now unconditionally returns true. This will retry for the full backoff even on non-retriable errors (e.g., Forbidden/Invalid), delaying scheduler startup before ultimately fatal-ing. Consider restricting retries to transient API errors (timeouts, server errors, too-many-requests) and returning immediately for non-retriable errors.

Suggested change
return true
// Retry only on transient API errors.
return apierrors.IsTimeout(err) ||
apierrors.IsServerTimeout(err) ||
apierrors.IsTooManyRequests(err)

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @Tau721 Do we need to retry until timeout? (If the error is not already exists error)

if apierrors.IsAlreadyExists(err) {
klog.V(2).Infof("Queue %s already exists, skip creating.", name)
return nil
}
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Create returns a non-nil, non-AlreadyExists error, the code returns the error without logging it. Since this sits inside a retry loop, losing the per-attempt error detail can make diagnosing startup issues harder; consider logging the failure (and that it will retry) before returning the error.

Suggested change
}
}
klog.Errorf("Failed to create queue %s, will retry: %v", name, err)

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines +470 to +471
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goroutine calls wg.Done() at the end rather than deferring it. If newDefaultAndRootQueue ever panics or early-returns due to a fatal path, this can deadlock the test. Prefer defer wg.Done() at the top of the goroutine body.

Suggested change
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
defer wg.Done()
newDefaultAndRootQueue(tt.vcClient, defaultQueue)

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines +466 to +474
var wg sync.WaitGroup
wg.Add(tt.concurrentNumber)
for i := 0; i < tt.concurrentNumber; i++ {
go func() {
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
}()
}
wg.Wait()
Copy link

Copilot AI Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test calls newDefaultAndRootQueue, which uses klog.Fatalf on error (os.Exit), including when invoked inside a goroutine. A failure here would terminate the entire go test process abruptly and can be hard to debug. Consider refactoring the queue-init logic to return an error (and keep the Fatalf at the top-level scheduler startup), so the unit test can assert on errors instead of exiting.

Suggested change
var wg sync.WaitGroup
wg.Add(tt.concurrentNumber)
for i := 0; i < tt.concurrentNumber; i++ {
go func() {
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
}()
}
wg.Wait()
for i := 0; i < tt.concurrentNumber; i++ {
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
}

Copilot uses AI. Check for mistakes.
@Tau721 Tau721 force-pushed the fix_scheduler_restart_on_install branch from 364a0d7 to 6bc83ff Compare March 4, 2026 02:17
@Tau721
Copy link
Contributor Author

Tau721 commented Mar 4, 2026

/kind bug

@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 4, 2026
…rently create the `root` or `default` queue

Signed-off-by: caotuo721 <caotuo721@yeah.net>
@Tau721 Tau721 force-pushed the fix_scheduler_restart_on_install branch from 6bc83ff to 3023275 Compare March 4, 2026 03:35
Copy link
Member

@JesseStutler JesseStutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also specify multiple replicas for the volcano-scheduler when installing(In the e2e test)? I'm wondering that we always deploy 1 replica scheduler but never found this error before, we could never accept panic

klog.V(2).Infof("Queue %s already exists, skip creating.", name)
return nil
}
klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better change to Errorf

Jitter: 0.1,
}, func(err error) bool {
return !apierrors.IsAlreadyExists(err)
return true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @Tau721 Do we need to retry until timeout? (If the error is not already exists error)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. priority/high size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Concurrent creation of the default or root queue leads to panic and restart of the scheduler pod

4 participants