Skip to content

[train] Make controller resilient to errors in all lifecycle hooks#60900

Merged
justinvyu merged 22 commits intoray-project:masterfrom
liulehui:shutdown-hook
Apr 10, 2026
Merged

[train] Make controller resilient to errors in all lifecycle hooks#60900
justinvyu merged 22 commits intoray-project:masterfrom
liulehui:shutdown-hook

Conversation

@liulehui
Copy link
Copy Markdown
Contributor

@liulehui liulehui commented Feb 10, 2026

Description

  1. Follow-up to [train] Add a CallbackManager and guardrail some callback hooks #60117. Makes the train controller resilient to errors in all controller lifecycle hooks (_start, _shutdown, after_controller_state_update, after_controller_finish, etc.) — previously only the callback manager's invoke path was hardened.
    Key changes:
  • Centralize error handling in _run_control_loop_iteration: Replace 4 scattered make_decision_execute_failure_decision call sites with a single catch-all that routes all errors through the failure policy. Preserves WorkerGroupError type (no double-wrapping in ControllerError). Falls back to force ErroredState if the failure policy itself fails.
  • Route all lifecycle hooks through CallbackManager: Add async_invoke for async hooks (e.g. before_controller_shutdown). Replace direct for callback in ... loops with _run_controller_hook / CallbackManager, which catches and wraps callback errors as ControllerError.
  • Harden _shutdown: Guard _shutdown_worker_group() and before_controller_shutdown callback with try/except. Prioritize original training error over shutdown errors. Never retry during shutdown.
  • Harden _set_state: Suppress callback failures during terminal state transitions to preserve the root-cause error. Prevent re-entry into after_controller_state_update when handling a callback failure.
  • Simplify _start_worker_group: Raise on failure instead of returning Optional[ControllerError] — let errors propagate naturally to the catch-all.
  • Guard AsyncioActorExit: Re-raise at _run_control_loop_iteration level so it's handled correctly regardless of which state the controller is in.
  • Reduce log noise: Log detailed tracebacks once at the point of capture; use logger.warning for suppression messages instead of logger.exception

Please refer to this diagram:
6eeb7df37cfc0f6acc09fb329d863fe8

  1. see the difference table below:
Aspect Before (master) After (this PR)
Callback crashes 6 hooks crash the controller All hooks caught and handled
Error routing 4 separate make_decision call sites + catch-all 1 centralized catch-all in _run_control_loop_iteration + _run_controller_hook for callback-specific guard
WorkerGroupError type Preserved (handled inline) Preserved (catch-all checks isinstance(TrainingFailedError))
AsyncioActorExit Only caught in _step RunningState Caught at _run_control_loop_iteration level (covers all states)
Shutdown errors Crash controller Logged + suppressed or → ErroredState (never retry)
_shutdown_worker_group in resize Unguarded (crashes) Logged + bubbles to catch-all
Duplicate tracebacks logger.exception at multiple layers Traceback logged once at capture point; suppression messages use logger.warning
_start_worker_group Returns Optional[ControllerError] Raises on failure (simpler API)

Additional information

  1. see repro script and corresponding logs sample here: https://gist.github.com/liulehui/b3c01a6a061250fd1b3da6ab07acd1e5
  2. an example of error log ares:
class ReproCallback(ControllerCallback, WorkerGroupCallback):
    def before_controller_shutdown(self):
        raise Exception("error raised in ReproCallback with before_controller_shutdown")

def train_func(config):
    # ray.train.report({"success": 1})
    raise Exception("fail train_func")

trainer = TorchTrainer(
    train_func, 
    scaling_config=ray.train.ScalingConfig(num_workers=2),
    run_config=ray.train.RunConfig(
        callbacks=[ReproCallback()]
    )
)
trainer.fit()

corresponding log:

(TrainController pid=23510) [FailurePolicy] RAISE
(TrainController pid=23510)   Source: worker group
(TrainController pid=23510)   Error count: 1 (max allowed: 0)
(TrainController pid=23510) Error: Training failed due to worker errors:
(TrainController pid=23510) [Rank 0 Error Snippet]:
(TrainController pid=23510) Traceback (most recent call last):
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
(TrainController pid=23510) Exception: fail train_func
(TrainController pid=23510) ray.train.WorkerGroupError: Training failed due to worker errors:
(TrainController pid=23510) [Rank 0 Error Snippet]:
(TrainController pid=23510) Traceback (most recent call last):
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
(TrainController pid=23510) Exception: fail train_func
(TrainController pid=23510) 
(RayTrainWorker pid=23729) Error in training function:
(RayTrainWorker pid=23729) 
(RayTrainWorker pid=23796) 
(TrainController pid=23510) Exception raised in callback hook 'before_controller_shutdown' from callback 'ReproCallback'.
(TrainController pid=23510)   File "/Users/lehui/Desktop/ray/python/ray/train/v2/_internal/execution/callback_manager.py", line 44, in async_invoke
(TrainController pid=23510)     result = method(*args, **context)
(TrainController pid=23510)              ^^^^^^^^^^^^^^^^^^^^^^^^
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 22, in before_controller_shutdown
(TrainController pid=23510) Exception: error raised in ReproCallback with before_controller_shutdown
(TrainController pid=23510) Another error occurred during shutdown after a training error. This error is being ignored to preserve the original training error. Error: Training failed due to controller error:
(TrainController pid=23510) error raised in ReproCallback with before_controller_shutdown
---------------------------------------------------------------------------
WorkerGroupError                          Traceback (most recent call last)
Cell In[1], line 40
     27     raise Exception("fail train_func")
     31 trainer = TorchTrainer(
     32     train_func, 
     33     scaling_config=ray.train.ScalingConfig(num_workers=2),
   (...)     38 
     39 )
---> 40 trainer.fit()

File ~/Desktop/ray/python/ray/train/v2/api/data_parallel_trainer.py:194, in DataParallelTrainer.fit(self)
    179 result = self._initialize_and_run_controller(
    180     train_fn_ref=train_fn_ref,
    181     scaling_policy=create_scaling_policy(self.scaling_config),
   (...)    185     validation_config=self.validation_config,
    186 )
    188 if result.error:
    189     # NOTE: If the training run errored out, raise an error back to the
    190     # user's driver script.
    191     # For example, if the Train `FailurePolicy` runs out of retries,
    192     # and one of the workers errors. The controller will exit, and
    193     # the error will be raised here.
--> 194     raise result.error
    196 return result

WorkerGroupError: Training failed due to worker errors:
[Rank 0 Error Snippet]:
Traceback (most recent call last):
  File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
Exception: fail train_func

@liulehui liulehui requested a review from a team as a code owner February 10, 2026 01:02
@liulehui liulehui added the go add ONLY when ready to merge, run all tests label Feb 10, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively hardens the controller lifecycle hooks by integrating them with the CallbackManager and improving error handling. The changes introduce more robust failure management, especially during shutdown sequences, by routing all lifecycle callback invocations through _run_controller_hook. Key improvements include the extraction of _make_failure_decision for centralized failure policy logic, adding try/except blocks around worker group shutdowns, and special handling for errors that occur when the controller is already in a ShuttingDownState. The new tests in test_controller_callback_behaviour.py are comprehensive and validate the new error handling paths. My main feedback is a suggestion to refactor some duplicated code to improve maintainability.

Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this PR and #60117 make controller callback handling much cleaner! FYI I only looked at controller.py and can take a look at the tests in a future review. I wanted to make two high level points first.

First, these two PR's actually changed the meaning of ERRORED and ABORTED. In the past:

  • ABORTED: user abort or controller exception (could be from our callbacks, user callbacks, or our non-callback error)
  • ERRORED: user training error

But after these PR's:

  • ABORTED: user abort or controller exception (can only be from our non-callback error)
  • ERRORED: user training error or controller exception (could be from our callbacks or user callbacks)

This is out of scope for this PR, but I think our train states should distinguish between the following cases:

  • User training errors
  • User callback errors
  • Errors that are our fault i.e. our callback errors or our non-callback errors. I think user aborts can also go here since aborts create an invalid state.

Second, I have feedback on how ShuttingDownState(ErroredState) is handled. Iiuc, we only reach make_failure_decision/execute_failure_decision's ShuttingDownState(ErroredState) conditional blocks if we try and fail to shut down the worker group. Should we retry shutdowns? If so, we should change this branch to also go through the failure policy. If not, I think it would be cleaner to move the ShuttingDownState(ErroredState) checks from make_failure_decision/execute_failure_decision to the isinstance(controller_state, ShuttingDownState) branch.

@liulehui
Copy link
Copy Markdown
Contributor Author

Nice, this PR and #60117 make controller callback handling much cleaner! FYI I only looked at controller.py and can take a look at the tests in a future review. I wanted to make two high level points first.

First, these two PR's actually changed the meaning of ERRORED and ABORTED. In the past:

  • ABORTED: user abort or controller exception (could be from our callbacks, user callbacks, or our non-callback error)
  • ERRORED: user training error

But after these PR's:

  • ABORTED: user abort or controller exception (can only be from our non-callback error)
  • ERRORED: user training error or controller exception (could be from our callbacks or user callbacks)

This is out of scope for this PR, but I think our train states should distinguish between the following cases:

  • User training errors
  • User callback errors
  • Errors that are our fault i.e. our callback errors or our non-callback errors. I think user aborts can also go here since aborts create an invalid state.

Second, I have feedback on how ShuttingDownState(ErroredState) is handled. Iiuc, we only reach make_failure_decision/execute_failure_decision's ShuttingDownState(ErroredState) conditional blocks if we try and fail to shut down the worker group. Should we retry shutdowns? If so, we should change this branch to also go through the failure policy. If not, I think it would be cleaner to move the ShuttingDownState(ErroredState) checks from make_failure_decision/execute_failure_decision to the isinstance(controller_state, ShuttingDownState) branch.

I agree to distinguish different error states makes sense.
this is PR is more for preventing the uncaught error in shutdown phase or the callback hooks crashing the controller, i.e. if there is exception in callback error, it is not aborting the controller, instead of just crashing the rest of the controller process.

  1. I refactored the shutting down logic a bit to avoid branches in different places, we should not retry when in shutting down state.

@TimothySeah
Copy link
Copy Markdown
Contributor

  1. I refactored the shutting down logic a bit to avoid branches in different places, we should not retry when in shutting down state.

Sounds good, I think this is cleaner now.

Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Feb 27, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 13, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Mar 27, 2026
@liulehui liulehui reopened this Mar 30, 2026
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Mar 30, 2026
@liulehui liulehui removed the unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. label Mar 30, 2026
@liulehui liulehui changed the title [train] Harden remaining controller lifecycle hooks with error handling [train] Make controller resilient to errors in all lifecycle hooks Mar 30, 2026
Copy link
Copy Markdown
Contributor

@TimothySeah TimothySeah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from a few nits and clarifications.

I also noticed that the following methods are called slightly differently in multiple places but confirmed that they make sense:

  • Run_controller_hook: This never raises. When it's called in the _step loop it returns the result because the loop sets the state. When it's called outside the loop (start, run after step, abort) we have to manually set the state.
  • shutdown_worker_group: When we call this in the _step loop (_make_and_handle_scaling_decision_for_non_running_worker_group, _execute_resize_decision) we bubble up the raise to _run_control_loop_iteration to go through the failure policy. But when we call this in shutdown we intentionally avoid retrying.

if self.get_state().is_terminal():
return

# Use a manual for-loop instead of CallbackManager here because abort
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind

  • Every before_controller_abort callback must be called
  • Other callbacks are "fail fast"
    ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our callbacks in the codebase are all fail fast, i.e. if one callback fails and throw exception, the rest won't get executed.
I think it is ok in most of the cases, but for abort, there are a few places, e.g. the PlacementGroupCleaner is a detached worker, which needs to be killed by the abort callbacks, otherwise it will always be there, thus it is needed to make the resource cleaning related callback to called to the best effort., see this issue: #61689

Copy link
Copy Markdown
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! pretty much looks good to me, but wondering if we can still simplify a bit more. There's still 3 layers of try catches that I think we can remove. See my main comment here.

Happy to merge first and we can address some of these as followups.

Each of these has a layer of try catch:

run_controller_step() -> run_controller_hook() -> callback_manager.invoke()

callback.after_controller_finish(result)
failure_result = self._run_controller_hook(
"after_controller_finish", result, invoke_failure_decision_callbacks=False
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually don't even use the before_controller_execute_failure_decision hook anywhere. We could remove that hook as well as the "invoke_failure_decision_callbacks" logic in a followup to simplify things.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit worried that if some advanced user defined their own ControllerCallback and leverage this hook might facing breaking change, e.g. use this hook to emit some metrics about failure count/fleet error number

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh these callbacks are all DeveloperAPI so we can change them. We already have a public UserCallback which has an after_exception method that users should use instead

@liulehui liulehui requested a review from justinvyu April 9, 2026 19:30
liulehui added 20 commits April 9, 2026 13:36
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit e2c4708. Configure here.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
@justinvyu justinvyu merged commit 5159cb6 into ray-project:master Apr 10, 2026
6 checks passed
goutamvenkat-anyscale pushed a commit to goutamvenkat-anyscale/ray that referenced this pull request Apr 10, 2026
…ay-project#60900)

1. Follow-up to ray-project#60117. Makes the train controller resilient to errors
in **all** controller lifecycle hooks (`_start`, `_shutdown`,
`after_controller_state_update`, `after_controller_finish`, etc.) —
previously only the callback manager's `invoke` path was hardened.

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests train Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants