fix: panic and restart of volcano scheduler pods on install #5077

Tau721 wants to merge 1 commit into volcano-sh:master
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! Here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical concurrency bug in the Volcano scheduler that caused pods to panic and restart during installation when multiple scheduler instances were deployed. The fix makes the queue creation process resilient to concurrent creation of the same queue.
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Code Review
This pull request addresses a critical bug where multiple Volcano scheduler pods could panic on startup due to a race condition when creating default and root queues. The fix correctly handles the AlreadyExists error within a retry loop, ensuring that concurrent initializations do not lead to fatal errors. The change is well-reasoned and includes a new concurrent test case to validate the fix. I have one minor suggestion to improve observability by restoring a log message for retryable errors.
```go
if err == nil {
	klog.V(2).Infof("Successfully created queue %s", name)
	return nil
}
// If this queue has just been created by others and already exists, skip creating it.
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
return err
```
The logic to handle IsAlreadyExists errors is correct and fixes the race condition. However, the log message for other retryable errors was removed. It would be good to restore it for better observability into why retries are happening.
```go
if err == nil {
	klog.V(2).Infof("Successfully created queue %s", name)
	return nil
}
// If this queue has just been created by others and already exists, skip creating it.
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
return err
```
Pull request overview
Fixes a race during Volcano scheduler startup where multiple scheduler replicas concurrently creating the root and default queues could hit an AlreadyExists error path and crash-loop.
Changes:
- Make default/root queue initialization tolerate concurrent creation by treating `AlreadyExists` on create as success.
- Add a concurrent unit test covering creation/skip scenarios for the `root` and `default` queues.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pkg/scheduler/cache/cache.go` | Adjusts queue creation retry logic to gracefully handle concurrent `AlreadyExists` results. |
| `pkg/scheduler/cache/cache_test.go` | Adds a concurrency-focused test validating queue initialization behavior. |
```diff
 	Jitter: 0.1,
 }, func(err error) bool {
-	return !apierrors.IsAlreadyExists(err)
+	return true
```
In the retry.OnError call, the retry predicate now unconditionally returns true. This will retry for the full backoff even on non-retriable errors (e.g., Forbidden/Invalid), delaying scheduler startup before ultimately fatal-ing. Consider restricting retries to transient API errors (timeouts, server errors, too-many-requests) and returning immediately for non-retriable errors.
```diff
-	return true
+	// Retry only on transient API errors.
+	return apierrors.IsTimeout(err) ||
+		apierrors.IsServerTimeout(err) ||
+		apierrors.IsTooManyRequests(err)
```
/cc @Tau721 Do we need to retry until timeout if the error is not an already-exists error?
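The effect of the suggested predicate can be illustrated with a stripped-down retry loop. `retryOnError` below is a simplified, hypothetical stand-in for client-go's `retry.OnError` (no backoff), and the sentinel errors are substitutes for the `apierrors.IsTimeout`/`IsForbidden` checks in the real code.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for the apierrors classification helpers.
var (
	errTimeout   = errors.New("timeout")
	errForbidden = errors.New("forbidden")
)

// retriable mirrors the suggested predicate: only transient API errors
// are worth retrying; RBAC or validation failures are not.
func retriable(err error) bool {
	return errors.Is(err, errTimeout)
}

// retryOnError is a minimal stand-in for retry.OnError: it retries fn
// up to steps times while retriable(err) is true.
func retryOnError(steps int, retriable func(error) bool, fn func() error) error {
	var err error
	for i := 0; i < steps; i++ {
		if err = fn(); err == nil || !retriable(err) {
			return err
		}
	}
	return err
}

func main() {
	attempts := 0
	// A forbidden error fails immediately instead of burning the backoff.
	err := retryOnError(5, retriable, func() error {
		attempts++
		return errForbidden
	})
	fmt.Println(attempts, err)
}
```

With the unconditional `return true`, the same forbidden error would be retried for all five steps before surfacing.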
```go
if apierrors.IsAlreadyExists(err) {
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
```
When Create returns a non-nil, non-AlreadyExists error, the code returns the error without logging it. Since this sits inside a retry loop, losing the per-attempt error detail can make diagnosing startup issues harder; consider logging the failure (and that it will retry) before returning the error.
```diff
 	}
 }
+klog.Errorf("Failed to create queue %s, will retry: %v", name, err)
```
`pkg/scheduler/cache/cache_test.go` (outdated)
```go
newDefaultAndRootQueue(tt.vcClient, defaultQueue)
wg.Done()
```
The goroutine calls wg.Done() at the end rather than deferring it. If newDefaultAndRootQueue ever panics or early-returns due to a fatal path, this can deadlock the test. Prefer defer wg.Done() at the top of the goroutine body.
```diff
-	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
-	wg.Done()
+	defer wg.Done()
+	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
```
```go
var wg sync.WaitGroup
wg.Add(tt.concurrentNumber)
for i := 0; i < tt.concurrentNumber; i++ {
	go func() {
		newDefaultAndRootQueue(tt.vcClient, defaultQueue)
		wg.Done()
	}()
}
wg.Wait()
```
This test calls newDefaultAndRootQueue, which uses klog.Fatalf on error (os.Exit), including when invoked inside a goroutine. A failure here would terminate the entire go test process abruptly and can be hard to debug. Consider refactoring the queue-init logic to return an error (and keep the Fatalf at the top-level scheduler startup), so the unit test can assert on errors instead of exiting.
```diff
-var wg sync.WaitGroup
-wg.Add(tt.concurrentNumber)
-for i := 0; i < tt.concurrentNumber; i++ {
-	go func() {
-		newDefaultAndRootQueue(tt.vcClient, defaultQueue)
-		wg.Done()
-	}()
-}
-wg.Wait()
+for i := 0; i < tt.concurrentNumber; i++ {
+	newDefaultAndRootQueue(tt.vcClient, defaultQueue)
+}
```
Force-pushed from 364a0d7 to 6bc83ff.
/kind bug
…rently create the `root` or `default` queue

Signed-off-by: caotuo721 <caotuo721@yeah.net>
Force-pushed from 6bc83ff to 3023275.
JesseStutler left a comment:

Could you also specify multiple replicas for the volcano-scheduler when installing (in the e2e test)? I'm wondering because we always deploy a single scheduler replica and never hit this error before, and we should never accept a panic.
```go
	klog.V(2).Infof("Queue %s already exists, skip creating.", name)
	return nil
}
klog.V(2).Infof("Failed to create queue %s: %v, will retry", name, err)
```
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix the bug that, when installing Volcano with two or more volcano-scheduler pods, some of those pods may panic and restart.
Which issue(s) this PR fixes:
Fixes #5074
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?