[train] Make controller resilient to errors in all lifecycle hooks by liulehui · Pull Request #60900 · ray-project/ray

liulehui · 2026-02-10T01:02:27Z

Description

Follow-up to [train] Add a CallbackManager and guardrail some callback hooks #60117. Makes the train controller resilient to errors in all controller lifecycle hooks (_start, _shutdown, after_controller_state_update, after_controller_finish, etc.) — previously only the callback manager's invoke path was hardened.
Key changes:

Centralize error handling in _run_control_loop_iteration: Replace 4 scattered make_decision → _execute_failure_decision call sites with a single catch-all that routes all errors through the failure policy. Preserves WorkerGroupError type (no double-wrapping in ControllerError). Falls back to force ErroredState if the failure policy itself fails.
Route all lifecycle hooks through CallbackManager: Add async_invoke for async hooks (e.g. before_controller_shutdown). Replace direct for callback in ... loops with _run_controller_hook / CallbackManager, which catches and wraps callback errors as ControllerError.
Harden _shutdown: Guard _shutdown_worker_group() and before_controller_shutdown callback with try/except. Prioritize original training error over shutdown errors. Never retry during shutdown.
Harden _set_state: Suppress callback failures during terminal state transitions to preserve the root-cause error. Prevent re-entry into after_controller_state_update when handling a callback failure.
Simplify _start_worker_group: Raise on failure instead of returning Optional[ControllerError] — let errors propagate naturally to the catch-all.
Guard AsyncioActorExit: Re-raise at _run_control_loop_iteration level so it's handled correctly regardless of which state the controller is in.
Reduce log noise: Log detailed tracebacks once at the point of capture; use logger.warning for suppression messages instead of logger.exception

Please refer to this diagram:

see the difference table below:

Aspect	Before (master)	After (this PR)
Callback crashes	6 hooks crash the controller	All hooks caught and handled
Error routing	4 separate `make_decision` call sites + catch-all	1 centralized catch-all in `_run_control_loop_iteration` + `_run_controller_hook` for callback-specific guard
`WorkerGroupError` type	Preserved (handled inline)	Preserved (catch-all checks `isinstance(TrainingFailedError)`)
`AsyncioActorExit`	Only caught in `_step` RunningState	Caught at `_run_control_loop_iteration` level (covers all states)
Shutdown errors	Crash controller	Logged + suppressed or → ErroredState (never retry)
`_shutdown_worker_group` in resize	Unguarded (crashes)	Logged + bubbles to catch-all
Duplicate tracebacks	`logger.exception` at multiple layers	Traceback logged once at capture point; suppression messages use `logger.warning`
`_start_worker_group`	Returns `Optional[ControllerError]`	Raises on failure (simpler API)

Additional information

see repro script and corresponding logs sample here: https://gist.github.com/liulehui/b3c01a6a061250fd1b3da6ab07acd1e5
an example of error log ares:

class ReproCallback(ControllerCallback, WorkerGroupCallback):
    def before_controller_shutdown(self):
        raise Exception("error raised in ReproCallback with before_controller_shutdown")

def train_func(config):
    # ray.train.report({"success": 1})
    raise Exception("fail train_func")

trainer = TorchTrainer(
    train_func, 
    scaling_config=ray.train.ScalingConfig(num_workers=2),
    run_config=ray.train.RunConfig(
        callbacks=[ReproCallback()]
    )
)
trainer.fit()

corresponding log:

(TrainController pid=23510) [FailurePolicy] RAISE
(TrainController pid=23510)   Source: worker group
(TrainController pid=23510)   Error count: 1 (max allowed: 0)
(TrainController pid=23510) Error: Training failed due to worker errors:
(TrainController pid=23510) [Rank 0 Error Snippet]:
(TrainController pid=23510) Traceback (most recent call last):
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
(TrainController pid=23510) Exception: fail train_func
(TrainController pid=23510) ray.train.WorkerGroupError: Training failed due to worker errors:
(TrainController pid=23510) [Rank 0 Error Snippet]:
(TrainController pid=23510) Traceback (most recent call last):
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
(TrainController pid=23510) Exception: fail train_func
(TrainController pid=23510) 
(RayTrainWorker pid=23729) Error in training function:
(RayTrainWorker pid=23729) 
(RayTrainWorker pid=23796) 
(TrainController pid=23510) Exception raised in callback hook 'before_controller_shutdown' from callback 'ReproCallback'.
(TrainController pid=23510)   File "/Users/lehui/Desktop/ray/python/ray/train/v2/_internal/execution/callback_manager.py", line 44, in async_invoke
(TrainController pid=23510)     result = method(*args, **context)
(TrainController pid=23510)              ^^^^^^^^^^^^^^^^^^^^^^^^
(TrainController pid=23510)   File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 22, in before_controller_shutdown
(TrainController pid=23510) Exception: error raised in ReproCallback with before_controller_shutdown
(TrainController pid=23510) Another error occurred during shutdown after a training error. This error is being ignored to preserve the original training error. Error: Training failed due to controller error:
(TrainController pid=23510) error raised in ReproCallback with before_controller_shutdown
---------------------------------------------------------------------------
WorkerGroupError                          Traceback (most recent call last)
Cell In[1], line 40
     27     raise Exception("fail train_func")
     31 trainer = TorchTrainer(
     32     train_func, 
     33     scaling_config=ray.train.ScalingConfig(num_workers=2),
   (...)     38 
     39 )
---> 40 trainer.fit()

File ~/Desktop/ray/python/ray/train/v2/api/data_parallel_trainer.py:194, in DataParallelTrainer.fit(self)
    179 result = self._initialize_and_run_controller(
    180     train_fn_ref=train_fn_ref,
    181     scaling_policy=create_scaling_policy(self.scaling_config),
   (...)    185     validation_config=self.validation_config,
    186 )
    188 if result.error:
    189     # NOTE: If the training run errored out, raise an error back to the
    190     # user's driver script.
    191     # For example, if the Train `FailurePolicy` runs out of retries,
    192     # and one of the workers errors. The controller will exit, and
    193     # the error will be raised here.
--> 194     raise result.error
    196 return result

WorkerGroupError: Training failed due to worker errors:
[Rank 0 Error Snippet]:
Traceback (most recent call last):
  File "/var/folders/gh/t3_93fyn3gvfwyxq_29m82440000gp/T/ipykernel_23459/1460648966.py", line 27, in train_func
Exception: fail train_func

gemini-code-assist

Code Review

This pull request effectively hardens the controller lifecycle hooks by integrating them with the CallbackManager and improving error handling. The changes introduce more robust failure management, especially during shutdown sequences, by routing all lifecycle callback invocations through _run_controller_hook. Key improvements include the extraction of _make_failure_decision for centralized failure policy logic, adding try/except blocks around worker group shutdowns, and special handling for errors that occur when the controller is already in a ShuttingDownState. The new tests in test_controller_callback_behaviour.py are comprehensive and validate the new error handling paths. My main feedback is a suggestion to refactor some duplicated code to improve maintainability.

python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah

Nice, this PR and #60117 make controller callback handling much cleaner! FYI I only looked at controller.py and can take a look at the tests in a future review. I wanted to make two high level points first.

First, these two PR's actually changed the meaning of ERRORED and ABORTED. In the past:

ABORTED: user abort or controller exception (could be from our callbacks, user callbacks, or our non-callback error)
ERRORED: user training error

But after these PR's:

ABORTED: user abort or controller exception (can only be from our non-callback error)
ERRORED: user training error or controller exception (could be from our callbacks or user callbacks)

This is out of scope for this PR, but I think our train states should distinguish between the following cases:

User training errors
User callback errors
Errors that are our fault i.e. our callback errors or our non-callback errors. I think user aborts can also go here since aborts create an invalid state.

Second, I have feedback on how ShuttingDownState(ErroredState) is handled. Iiuc, we only reach make_failure_decision/execute_failure_decision's ShuttingDownState(ErroredState) conditional blocks if we try and fail to shut down the worker group. Should we retry shutdowns? If so, we should change this branch to also go through the failure policy. If not, I think it would be cleaner to move the ShuttingDownState(ErroredState) checks from make_failure_decision/execute_failure_decision to the isinstance(controller_state, ShuttingDownState) branch.

python/ray/train/v2/_internal/execution/controller/controller.py

liulehui · 2026-02-13T23:09:51Z

Nice, this PR and #60117 make controller callback handling much cleaner! FYI I only looked at controller.py and can take a look at the tests in a future review. I wanted to make two high level points first.

First, these two PR's actually changed the meaning of ERRORED and ABORTED. In the past:

ABORTED: user abort or controller exception (could be from our callbacks, user callbacks, or our non-callback error)

ERRORED: user training error

But after these PR's:

ABORTED: user abort or controller exception (can only be from our non-callback error)

ERRORED: user training error or controller exception (could be from our callbacks or user callbacks)

This is out of scope for this PR, but I think our train states should distinguish between the following cases:

User training errors

User callback errors

Errors that are our fault i.e. our callback errors or our non-callback errors. I think user aborts can also go here since aborts create an invalid state.

Second, I have feedback on how ShuttingDownState(ErroredState) is handled. Iiuc, we only reach make_failure_decision/execute_failure_decision's ShuttingDownState(ErroredState) conditional blocks if we try and fail to shut down the worker group. Should we retry shutdowns? If so, we should change this branch to also go through the failure policy. If not, I think it would be cleaner to move the ShuttingDownState(ErroredState) checks from make_failure_decision/execute_failure_decision to the isinstance(controller_state, ShuttingDownState) branch.

I agree to distinguish different error states makes sense.
this is PR is more for preventing the uncaught error in shutdown phase or the callback hooks crashing the controller, i.e. if there is exception in callback error, it is not aborting the controller, instead of just crashing the rest of the controller process.

I refactored the shutting down logic a bit to avoid branches in different places, we should not retry when in shutting down state.

TimothySeah · 2026-02-13T23:53:06Z

I refactored the shutting down logic a bit to avoid branches in different places, we should not retry when in shutting down state.

Sounds good, I think this is cleaner now.

TimothySeah

LGTM

python/ray/train/v2/tests/test_controller.py

python/ray/train/v2/tests/test_controller_callback_behaviour.py

python/ray/train/v2/_internal/execution/controller/controller.py

github-actions · 2026-03-13T12:26:48Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions · 2026-03-27T12:29:42Z

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/callback_manager.py

python/ray/train/v2/tests/test_controller_callback_behaviour.py

python/ray/train/v2/tests/test_controller.py

TimothySeah

LGTM aside from a few nits and clarifications.

I also noticed that the following methods are called slightly differently in multiple places but confirmed that they make sense:

Run_controller_hook: This never raises. When it's called in the _step loop it returns the result because the loop sets the state. When it's called outside the loop (start, run after step, abort) we have to manually set the state.
shutdown_worker_group: When we call this in the _step loop (_make_and_handle_scaling_decision_for_non_running_worker_group, _execute_resize_decision) we bubble up the raise to _run_control_loop_iteration to go through the failure policy. But when we call this in shutdown we intentionally avoid retrying.

python/ray/train/v2/_internal/execution/controller/controller.py

TimothySeah · 2026-03-31T22:32:05Z

python/ray/train/v2/_internal/execution/controller/controller.py

        if self.get_state().is_terminal():
            return

+        # Use a manual for-loop instead of CallbackManager here because abort


What's the rationale behind

Every before_controller_abort callback must be called

Other callbacks are "fail fast"
?

I think our callbacks in the codebase are all fail fast, i.e. if one callback fails and throw exception, the rest won't get executed.
I think it is ok in most of the cases, but for abort, there are a few places, e.g. the PlacementGroupCleaner is a detached worker, which needs to be killed by the abort callbacks, otherwise it will always be there, thus it is needed to make the resource cleaning related callback to called to the best effort., see this issue: #61689

python/ray/train/v2/tests/test_controller_callback_behaviour.py

justinvyu

Thanks! pretty much looks good to me, but wondering if we can still simplify a bit more. There's still 3 layers of try catches that I think we can remove. See my main comment here.

Happy to merge first and we can address some of these as followups.

Each of these has a layer of try catch:

run_controller_step() -> run_controller_hook() -> callback_manager.invoke()

python/ray/train/v2/_internal/execution/controller/controller.py

justinvyu · 2026-04-07T19:03:07Z

python/ray/train/v2/_internal/execution/controller/controller.py

-            callback.after_controller_finish(result)
+        failure_result = self._run_controller_hook(
+            "after_controller_finish", result, invoke_failure_decision_callbacks=False
+        )


We actually don't even use the before_controller_execute_failure_decision hook anywhere. We could remove that hook as well as the "invoke_failure_decision_callbacks" logic in a followup to simplify things.

I am a bit worried that if some advanced user defined their own ControllerCallback and leverage this hook might facing breaking change, e.g. use this hook to emit some metrics about failure count/fleet error number

Oh these callbacks are all DeveloperAPI so we can change them. We already have a public UserCallback which has an after_exception method that users should use instead

python/ray/train/v2/_internal/execution/controller/controller.py

python/ray/train/v2/_internal/execution/callback_manager.py

Signed-off-by: Lehui Liu <lehui@anyscale.com>

python/ray/train/v2/_internal/execution/controller/controller.py

Signed-off-by: Lehui Liu <lehui@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit e2c4708. Configure here.}

python/ray/train/v2/_internal/execution/controller/controller.py

Signed-off-by: Lehui Liu <lehui@anyscale.com>

…ay-project#60900) 1. Follow-up to ray-project#60117. Makes the train controller resilient to errors in **all** controller lifecycle hooks (`_start`, `_shutdown`, `after_controller_state_update`, `after_controller_finish`, etc.) — previously only the callback manager's `invoke` path was hardened. --------- Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui requested a review from a team as a code owner February 10, 2026 01:02

liulehui added the go add ONLY when ready to merge, run all tests label Feb 10, 2026

gemini-code-assist bot reviewed Feb 10, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

cursor bot reviewed Feb 10, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Outdated Show resolved Hide resolved

TimothySeah reviewed Feb 10, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Outdated Show resolved Hide resolved

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

liulehui force-pushed the shutdown-hook branch from 7dafd64 to 5d01369 Compare February 13, 2026 23:02

TimothySeah approved these changes Feb 14, 2026

View reviewed changes

python/ray/train/v2/tests/test_controller.py Outdated Show resolved Hide resolved

python/ray/train/v2/tests/test_controller_callback_behaviour.py Outdated Show resolved Hide resolved

cursor bot reviewed Feb 17, 2026

View reviewed changes

python/ray/train/v2/tests/test_controller_callback_behaviour.py Show resolved Hide resolved

justinvyu reviewed Feb 18, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

liulehui force-pushed the shutdown-hook branch from 1f32cc9 to 339ef5b Compare February 19, 2026 21:55

ray-gardener bot added the train Ray Train Related Issue label Feb 27, 2026

github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Mar 13, 2026

TimothySeah mentioned this pull request Mar 18, 2026

[train][torchft] Ray Train manages replica group restarts #61475

Merged

github-actions bot closed this Mar 27, 2026

liulehui reopened this Mar 30, 2026

github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Mar 30, 2026

liulehui force-pushed the shutdown-hook branch from 7df5ff6 to f3f6b0b Compare March 30, 2026 21:16

cursor bot reviewed Mar 30, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Outdated Show resolved Hide resolved

liulehui removed the unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. label Mar 30, 2026

cursor bot reviewed Mar 30, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Outdated Show resolved Hide resolved

liulehui changed the title ~~[train] Harden remaining controller lifecycle hooks with error handling~~ [train] Make controller resilient to errors in all lifecycle hooks Mar 30, 2026

cursor bot reviewed Mar 31, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/callback_manager.py Show resolved Hide resolved

cursor bot reviewed Mar 31, 2026

View reviewed changes

python/ray/train/v2/tests/test_controller_callback_behaviour.py Show resolved Hide resolved

cursor bot reviewed Mar 31, 2026

View reviewed changes

python/ray/train/v2/tests/test_controller.py Show resolved Hide resolved

TimothySeah approved these changes Mar 31, 2026

View reviewed changes

justinvyu reviewed Apr 8, 2026

View reviewed changes

liulehui requested a review from justinvyu April 9, 2026 19:30

cursor bot reviewed Apr 9, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

python/ray/train/v2/_internal/execution/callback_manager.py Show resolved Hide resolved

liulehui added 20 commits April 9, 2026 13:36

rework shutting down state

da6a253

Signed-off-by: Lehui Liu <lehui@anyscale.com>

added shutdown hooks

ec57ae1

Signed-off-by: Lehui Liu <lehui@anyscale.com>

add run_controller_hook for start and finish

36660e1

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing ai comments

ae66935

Signed-off-by: Lehui Liu <lehui@anyscale.com>

adding all unit tests

932a70d

Signed-off-by: Lehui Liu <lehui@anyscale.com>

merge the shutdown logic in _shutdown

1cd70fd

Signed-off-by: Lehui Liu <lehui@anyscale.com>

better order for logging exception

85a3cb0

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressed comments

eb5ff5e

Signed-off-by: Lehui Liu <lehui@anyscale.com>

remove redundant _start call

9205f56

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing comments

30bc238

Signed-off-by: Lehui Liu <lehui@anyscale.com>

rework some logic

0cb66cb

Signed-off-by: Lehui Liu <lehui@anyscale.com>

avoid spammy exceptions

256b72f

Signed-off-by: Lehui Liu <lehui@anyscale.com>

remove unused async hook call

17d4922

Signed-off-by: Lehui Liu <lehui@anyscale.com>

revamp error propagation in controller

d008bef

Signed-off-by: Lehui Liu <lehui@anyscale.com>

revision

fe4cde2

Signed-off-by: Lehui Liu <lehui@anyscale.com>

change callback bq test to medium

e83f1a0

Signed-off-by: Lehui Liu <lehui@anyscale.com>

change to use async def

be0467b

Signed-off-by: Lehui Liu <lehui@anyscale.com>

a missing one for async def

a7a3a2e

Signed-off-by: Lehui Liu <lehui@anyscale.com>

addressing comments

726aa12

Signed-off-by: Lehui Liu <lehui@anyscale.com>

address comments

d2764e3

Signed-off-by: Lehui Liu <lehui@anyscale.com>

liulehui force-pushed the shutdown-hook branch from 3a7cc28 to d2764e3 Compare April 9, 2026 20:37

justinvyu reviewed Apr 9, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Outdated Show resolved Hide resolved

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

move the set state into try/except

e2c4708

Signed-off-by: Lehui Liu <lehui@anyscale.com>

cursor bot reviewed Apr 9, 2026

View reviewed changes

python/ray/train/v2/_internal/execution/controller/controller.py Show resolved Hide resolved

set state to aborted

d39cefe

Signed-off-by: Lehui Liu <lehui@anyscale.com>

justinvyu merged commit 5159cb6 into ray-project:master Apr 10, 2026
6 checks passed

Conversation

liulehui commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Additional information

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

TimothySeah left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

liulehui commented Feb 13, 2026

Uh oh!

TimothySeah commented Feb 13, 2026

Uh oh!

TimothySeah left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TimothySeah left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

TimothySeah Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

liulehui Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

justinvyu left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

justinvyu Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

liulehui Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

justinvyu Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

liulehui commented Feb 10, 2026 •

edited

Loading

TimothySeah left a comment •

edited

Loading

justinvyu left a comment •

edited

Loading