[train] Make controller resilient to errors in all lifecycle hooks#60900
justinvyu merged 22 commits into ray-project:master
Conversation
Code Review
This pull request effectively hardens the controller lifecycle hooks by integrating them with the CallbackManager and improving error handling. The changes introduce more robust failure management, especially during shutdown sequences, by routing all lifecycle callback invocations through _run_controller_hook. Key improvements include the extraction of _make_failure_decision for centralized failure policy logic, adding try/except blocks around worker group shutdowns, and special handling for errors that occur when the controller is already in a ShuttingDownState. The new tests in test_controller_callback_behaviour.py are comprehensive and validate the new error handling paths. My main feedback is a suggestion to refactor some duplicated code to improve maintainability.
Nice, this PR and #60117 make controller callback handling much cleaner! FYI I only looked at controller.py and can take a look at the tests in a future review. I wanted to make two high level points first.
First, these two PRs actually changed the meaning of ERRORED and ABORTED. In the past:
- ABORTED: user abort or controller exception (could be from our callbacks, user callbacks, or our non-callback error)
- ERRORED: user training error
But after these PRs:
- ABORTED: user abort or controller exception (can only be from our non-callback error)
- ERRORED: user training error or controller exception (could be from our callbacks or user callbacks)
This is out of scope for this PR, but I think our train states should distinguish between the following cases:
- User training errors
- User callback errors
- Errors that are our fault, i.e. our callback errors or our non-callback errors. I think user aborts can also go here, since aborts create an invalid state.
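The three buckets could be sketched as a simple taxonomy. None of these names exist in Ray Train today; they are purely illustrative of the proposal:

```python
from enum import Enum

class FailureCategory(Enum):
    # Hypothetical categories mirroring the three buckets proposed above.
    USER_TRAINING_ERROR = "user_training_error"   # raised inside the user's training loop
    USER_CALLBACK_ERROR = "user_callback_error"   # raised by a user-defined callback
    FRAMEWORK_ERROR = "framework_error"           # our callbacks, our non-callback bugs, aborts

def categorize(from_user_code: bool, from_callback: bool) -> FailureCategory:
    """Map where an error originated to a failure category (illustrative only)."""
    if from_user_code:
        return (FailureCategory.USER_CALLBACK_ERROR if from_callback
                else FailureCategory.USER_TRAINING_ERROR)
    return FailureCategory.FRAMEWORK_ERROR
```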
Second, I have feedback on how `ShuttingDownState(ErroredState)` is handled. IIUC, we only reach the `ShuttingDownState(ErroredState)` conditional blocks in `make_failure_decision`/`execute_failure_decision` if we try and fail to shut down the worker group. Should we retry shutdowns? If so, we should change this branch to also go through the failure policy. If not, I think it would be cleaner to move the `ShuttingDownState(ErroredState)` checks from `make_failure_decision`/`execute_failure_decision` to the `isinstance(controller_state, ShuttingDownState)` branch.
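The second option (handling the already-shutting-down case up front rather than inside the failure-policy helpers) might look roughly like this. The state classes here are stand-ins, not the real controller states:

```python
class ControllerState:
    pass

class ShuttingDownState(ControllerState):
    def __init__(self, error=None):
        self.error = error  # set when shutdown was triggered by an error

def handle_controller_error(controller_state, error):
    """Decide what to do with an error based on the current state (sketch)."""
    if isinstance(controller_state, ShuttingDownState):
        # Shutdown already failed once: skip the failure policy entirely and
        # force a terminal errored state instead of retrying the shutdown.
        return "force_errored"
    # Every other state routes the error through the normal failure policy.
    return "run_failure_policy"
```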
I agree that distinguishing the different error states makes sense.
Sounds good, I think this is cleaner now.
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.

This pull request has been automatically closed because there has been no further activity in the last 14 days. Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for your contribution!
TimothySeah left a comment
LGTM aside from a few nits and clarifications.
I also noticed that the following methods are called slightly differently in multiple places, but confirmed that they make sense:
- `_run_controller_hook`: This never raises. When it's called in the `_step` loop, it returns the result because the loop sets the state. When it's called outside the loop (start, run after step, abort), we have to set the state manually.
- `shutdown_worker_group`: When we call this in the `_step` loop (`_make_and_handle_scaling_decision_for_non_running_worker_group`, `_execute_resize_decision`), we bubble the raise up to `_run_control_loop_iteration` to go through the failure policy. But when we call this in shutdown, we intentionally avoid retrying.
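The never-raises contract described above can be sketched like this (a hypothetical simplification of `_run_controller_hook`; the real signature and wrapping differ):

```python
import logging

logger = logging.getLogger(__name__)

class ControllerError(Exception):
    """Stand-in for the wrapper the controller uses for callback failures."""

def run_controller_hook(callbacks, hook_name, *args):
    """Invoke hook_name on each callback; return a wrapped error instead of raising."""
    for cb in callbacks:
        try:
            getattr(cb, hook_name)(*args)
        except Exception as exc:
            # Never let a callback error escape; the caller decides whether
            # to set the state itself or let the _step loop do it.
            logger.warning("Callback %r failed in %s: %s", cb, hook_name, exc)
            return ControllerError(str(exc))
    return None
```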
```python
if self.get_state().is_terminal():
    return
```

```python
# Use a manual for-loop instead of CallbackManager here because abort
```
What's the rationale behind:
- every `before_controller_abort` callback must be called, but
- other callbacks are "fail fast"?
I think our callbacks in the codebase are all fail-fast, i.e. if one callback fails and throws an exception, the rest won't get executed.
I think that is OK in most cases, but for abort there are a few places where cleanup must happen. For example, the PlacementGroupCleaner is a detached worker that needs to be killed by the abort callbacks, otherwise it will stay around forever. So the resource-cleanup callbacks need to be called on a best-effort basis. See this issue: #61689
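The best-effort loop being described (as opposed to a fail-fast `CallbackManager.invoke` path) can be sketched as follows; the names are illustrative:

```python
def abort_all(callbacks):
    """Run every before_controller_abort hook, collecting errors instead of raising.

    Unlike a fail-fast loop, a later cleanup callback (e.g. one that kills a
    detached worker) still runs even if an earlier callback raised.
    """
    errors = []
    for cb in callbacks:
        try:
            cb.before_controller_abort()
        except Exception as exc:
            errors.append(exc)  # record the failure and keep going
    return errors
```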
Thanks! Pretty much looks good to me, but wondering if we can still simplify a bit more. There are still 3 layers of try/except that I think we can remove; see my main comment here.
Happy to merge first and address some of these as followups.
Each of these has a layer of try/except:
`run_controller_step()` -> `run_controller_hook()` -> `callback_manager.invoke()`
```diff
- callback.after_controller_finish(result)
+ failure_result = self._run_controller_hook(
+     "after_controller_finish", result, invoke_failure_decision_callbacks=False
+ )
```
We actually don't even use the `before_controller_execute_failure_decision` hook anywhere. We could remove that hook, as well as the `invoke_failure_decision_callbacks` logic, in a followup to simplify things.
I am a bit worried that an advanced user who defined their own ControllerCallback and leveraged this hook might face a breaking change, e.g. if they use this hook to emit metrics about failure counts or fleet error numbers.
Oh, these callbacks are all DeveloperAPI, so we can change them. We already have a public UserCallback, which has an `after_exception` method that users should use instead.
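For illustration, a user-side callback might look like this. `UserCallback` here is a local stand-in rather than the class imported from Ray, and the real `after_exception` signature may differ:

```python
class UserCallback:
    # Local stand-in for Ray Train's public user-facing callback interface.
    def after_exception(self, exception: Exception) -> None:
        """Called when training fails; override to emit metrics, alerts, etc."""

class FailureCounter(UserCallback):
    """Example subclass that counts failures (e.g. to feed a metrics backend)."""
    def __init__(self):
        self.failure_count = 0

    def after_exception(self, exception: Exception) -> None:
        self.failure_count += 1
```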
Signed-off-by: Lehui Liu <lehui@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e2c4708.
…ay-project#60900)

Follow-up to ray-project#60117. Makes the train controller resilient to errors in **all** controller lifecycle hooks (`_start`, `_shutdown`, `after_controller_state_update`, `after_controller_finish`, etc.); previously only the callback manager's `invoke` path was hardened.

Signed-off-by: Lehui Liu <lehui@anyscale.com>

Description

Makes the train controller resilient to errors in all controller lifecycle hooks (`_start`, `_shutdown`, `after_controller_state_update`, `after_controller_finish`, etc.); previously only the callback manager's `invoke` path was hardened.

Key changes:
- `_run_control_loop_iteration`: Replace 4 scattered `make_decision` → `_execute_failure_decision` call sites with a single catch-all that routes all errors through the failure policy. Preserves the `WorkerGroupError` type (no double-wrapping in `ControllerError`). Falls back to forcing `ErroredState` if the failure policy itself fails.
- `CallbackManager`: Add `async_invoke` for async hooks (e.g. `before_controller_shutdown`). Replace direct `for callback in ...` loops with `_run_controller_hook`/`CallbackManager`, which catches and wraps callback errors as `ControllerError`.
- `_shutdown`: Guard `_shutdown_worker_group()` and the `before_controller_shutdown` callback with try/except. Prioritize the original training error over shutdown errors. Never retry during shutdown.
- `_set_state`: Suppress callback failures during terminal state transitions to preserve the root-cause error. Prevent re-entry into `after_controller_state_update` when handling a callback failure.
- `_start_worker_group`: Raise on failure instead of returning `Optional[ControllerError]`; let errors propagate naturally to the catch-all.
- `AsyncioActorExit`: Re-raise at the `_run_control_loop_iteration` level so it's handled correctly regardless of which state the controller is in.
- Use `logger.warning` for suppression messages instead of `logger.exception`.

Please refer to this diagram:

Summary of before → after:
- Failure decisions: scattered `make_decision` call sites → single catch-all in `_run_control_loop_iteration`, plus `_run_controller_hook` as the callback-specific guard
- Error type: `WorkerGroupError` preserved (checked via `isinstance(TrainingFailedError)`)
- `AsyncioActorExit`: handled in `_step` / `RunningState` → re-raised at the `_run_control_loop_iteration` level (covers all states)
- `_shutdown_worker_group` in resize: bubbles up through the failure policy
- Logging: `logger.exception` at multiple layers → `logger.warning`
- `_start_worker_group`: returns `Optional[ControllerError]` → raises

Additional information
corresponding log:
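The catch-all described under "Key changes" can be sketched as a single guard around the step, with a fallback if the failure policy itself fails (all names here are illustrative, not the real controller API):

```python
def run_control_loop_iteration(step, make_failure_decision,
                               execute_failure_decision, force_errored):
    """One control-loop iteration: route any error through the failure policy,
    forcing an errored state if the policy itself raises (sketch)."""
    try:
        step()
    except Exception as err:
        try:
            decision = make_failure_decision(err)
            execute_failure_decision(decision, err)
        except Exception:
            # Last resort: the failure policy failed too, so force ErroredState.
            force_errored(err)
```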