Summary
Ray Data currently lacks built-in checkpointing functionality, which makes it challenging to recover from failures in long-running data processing pipelines. This feature request proposes adding checkpoint and resume capabilities to Ray Data to improve fault tolerance and reduce the cost of restarting large-scale data processing jobs.
Motivation
Large Ray Data pipelines can take hours or days to complete. When a failure occurs due to an exception that is not configured as retryable, or due to a bug, the entire pipeline must restart from the beginning, resulting in:
- High costs
  - Significant GPU resource waste
  - Extended time-to-completion
- Operational complexity
  - Users currently need to manually segment large jobs (e.g., splitting a single large job into 10 parts)
  - No built-in mechanism to preserve progress when jobs are interrupted
  - Cross-cluster job migration is not supported: jobs must be migrated to other clusters/data centers when high-priority workloads urgently require the resources

Proposed Solution
Requirements Overview
- Job State Persistence to External Storage
- Cross-Cluster Resume
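Ray Data does not expose such an API today, so the sketch below is only an illustration of the requested behavior, not an existing Ray interface. It shows the core pattern both requirements imply: record completed partitions in a manifest on external storage, and on restart skip everything already recorded. All function and file names here are hypothetical.

```python
import json
import os


def load_completed(manifest_path):
    # Read the set of partition ids already processed, if a manifest exists.
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            return set(json.load(f))
    return set()


def checkpoint(manifest_path, completed):
    # Persist progress atomically so a crash mid-write cannot corrupt it.
    tmp = manifest_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(completed), f)
    os.replace(tmp, manifest_path)


def run_pipeline(partitions, process, manifest_path):
    # Skip partitions recorded in the manifest; checkpoint after each one.
    completed = load_completed(manifest_path)
    for pid in partitions:
        if pid in completed:
            continue
        process(pid)
        completed.add(pid)
        checkpoint(manifest_path, completed)
    return completed
```

Because the manifest lives on external storage (e.g., object storage or a shared filesystem), a fresh cluster pointed at the same path can resume where the failed one stopped, which is what cross-cluster resume would require.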
Use case
No response