Summary
Ray Data currently lacks built-in checkpointing functionality, which makes it challenging to recover from failures in long-running data processing pipelines. This feature request proposes adding checkpoint and resume capabilities to Ray Data to improve fault tolerance and reduce the cost of restarting large-scale data processing jobs.
Motivation
Large Ray Data pipelines can take hours or days to complete. When a failure occurs due to an exception that is not configured as retryable, or due to a bug, the entire pipeline must restart from the beginning, resulting in:
- High costs
  - Significant GPU resource waste
  - Extended time-to-completion
- Operational complexity
  - Users currently need to manually segment large jobs (e.g., splitting a single large job into 10 parts)
  - No built-in mechanism to preserve progress when jobs are interrupted
  - Cross-cluster job migration is not supported: jobs must be migrated to other clusters/data centers when high-priority workloads urgently require the resources

Proposed Solution
Requirements Overview
- Job State Persistence to External Storage
- Cross-Cluster Resume
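Ray Data does not expose such an API today, so the sketch below is only an illustration of the requested behavior, not an existing Ray interface. It shows the core pattern both requirements imply: record completed partitions in a manifest on external storage, and on restart skip everything already recorded. All function and file names here are hypothetical.

```python
import json
import os


def load_completed(manifest_path):
    # Read the set of partition ids already processed, if a manifest exists.
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            return set(json.load(f))
    return set()


def checkpoint(manifest_path, completed):
    # Persist progress atomically so a crash mid-write cannot corrupt it.
    tmp = manifest_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(completed), f)
    os.replace(tmp, manifest_path)


def run_pipeline(partitions, process, manifest_path):
    # Skip partitions recorded in the manifest; checkpoint after each one.
    completed = load_completed(manifest_path)
    for pid in partitions:
        if pid in completed:
            continue
        process(pid)
        completed.add(pid)
        checkpoint(manifest_path, completed)
    return completed
```

Because the manifest lives on external storage (e.g., object storage or a shared filesystem), a fresh cluster pointed at the same path can resume where the failed one stopped, which is what cross-cluster resume would require.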
Use case
No response