[Data] 2 phase commit for checkpointing to avoid duplicates by xinyuangui2 · Pull Request #60983 · ray-project/ray

xinyuangui2 · 2026-02-11T21:03:51Z

Summary

Reproduce script: https://gist.github.com/xinyuangui2/0be60bb7fd629afdf462a480519be86c

Current Issue

This is a typical Ray Data pipeline: read_parquet -> map_batches -> write. In the checkpointing introduced in #59409, every block is first written and then checkpointed. If a failure occurs after data is written but before the checkpoint is saved, there will be duplicates in the final result.
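The failure window can be illustrated with a toy model. This is only a sketch: `run_write_then_checkpoint`, the in-memory `output` list, and the `checkpointed` set are hypothetical stand-ins for the datasink and the checkpoint store, not Ray Data code.

```python
def run_write_then_checkpoint(blocks, output, checkpointed, crash_after_write_of=None):
    """Write each block, then checkpoint it (the pre-2PC ordering)."""
    for block_id, rows in blocks:
        if block_id in checkpointed:
            continue  # already checkpointed by an earlier run, so skip it
        output.extend(rows)  # step 1: data is written
        if block_id == crash_after_write_of:
            # Simulated failure after the write but before the checkpoint.
            raise RuntimeError("crash between write and checkpoint")
        checkpointed.add(block_id)  # step 2: checkpoint is saved


def demo():
    blocks = [(0, ["a"]), (1, ["b"]), (2, ["c"])]
    output, checkpointed = [], set()
    try:
        run_write_then_checkpoint(blocks, output, checkpointed, crash_after_write_of=1)
    except RuntimeError:
        pass  # the job fails here and is later restarted
    # The restart skips block 0 but rewrites block 1, whose checkpoint was never saved.
    run_write_then_checkpoint(blocks, output, checkpointed)
    return output
```

In this model, `demo()` returns `["a", "b", "b", "c"]`: the rows of block 1 appear twice because the first attempt wrote them without recording a checkpoint.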

Solution: 2-Phase Commit

For now, this applies only to file-based datasinks (_FileDatasink and its subclasses).

Write Stage

1. Fetch the result filenames using FilenameProvider. Create pending checkpoint files ({id}.pending.parquet) and store the result filenames in the parquet metadata.
2. Write data to the output path.
3. Commit the pending checkpoint files by renaming {id}.pending.parquet to {id}.parquet.
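The write stage can be sketched with the standard library alone. Several things here are assumptions made for illustration: the pending checkpoint is a JSON file rather than a parquet file with metadata, the helper name `write_block` is hypothetical, and `os.replace` stands in for whatever rename the real filesystem layer performs.

```python
import json
import os
import tempfile
from pathlib import Path


def write_block(ckpt_dir, out_dir, ckpt_id, filename, payload, crash_before_commit=False):
    """Prepare a pending checkpoint, write the data, then commit by rename."""
    pending = Path(ckpt_dir) / f"{ckpt_id}.pending.json"
    committed = Path(ckpt_dir) / f"{ckpt_id}.json"

    # Prepare: record the planned output filename before any data is written.
    pending.write_text(json.dumps({"filenames": [filename]}))

    # Write the data to the output path.
    (Path(out_dir) / filename).write_bytes(payload)

    if crash_before_commit:
        raise RuntimeError("simulated failure before commit")

    # Commit: rename the pending checkpoint to its final name.
    os.replace(pending, committed)
    return committed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir, out_dir = Path(tmp, "ckpt"), Path(tmp, "out")
        ckpt_dir.mkdir()
        out_dir.mkdir()
        write_block(ckpt_dir, out_dir, "0", "part-0.bin", b"rows")
        try:
            write_block(ckpt_dir, out_dir, "1", "part-1.bin", b"rows", crash_before_commit=True)
        except RuntimeError:
            pass
        # Checkpoint 0 is committed; checkpoint 1 is left pending for restore to find.
        return sorted(p.name for p in ckpt_dir.iterdir())
```

Because the pending file is created before the data, any data file that exists without a committed checkpoint is always listed in some pending checkpoint.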

Restore Stage

1. Find any pending parquet files (*.pending.parquet).
2. Fetch the result filenames from the parquet metadata.
3. Delete any files that match the filenames (pattern-based matching).
4. Delete the pending parquet files.
5. Continue with the existing checkpoint loading stage.
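A minimal sketch of the restore steps follows, under assumptions: pending checkpoints are JSON files holding a `filenames` list, `fnmatch` stands in for the PR's pattern-based matching, and `restore_pending` is a hypothetical helper name.

```python
import fnmatch
import json
import tempfile
from pathlib import Path


def restore_pending(ckpt_dir, out_dir):
    """Delete data files listed by pending checkpoints, then the checkpoints."""
    removed = []
    for pending in sorted(Path(ckpt_dir).glob("*.pending.json")):
        patterns = json.loads(pending.read_text())["filenames"]
        for data_file in sorted(Path(out_dir).iterdir()):
            if any(fnmatch.fnmatch(data_file.name, p) for p in patterns):
                data_file.unlink()  # orphaned output from an uncommitted write
                removed.append(data_file.name)
        # The pending file goes last, so a crash during cleanup leaves it
        # in place and the next restore retries the same deletions.
        pending.unlink()
    return removed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_dir, out_dir = Path(tmp, "ckpt"), Path(tmp, "out")
        ckpt_dir.mkdir()
        out_dir.mkdir()
        (ckpt_dir / "0.json").write_text(json.dumps({"filenames": ["part-0*"]}))
        (ckpt_dir / "1.pending.json").write_text(json.dumps({"filenames": ["part-1*"]}))
        (out_dir / "part-0.bin").write_bytes(b"committed")
        (out_dir / "part-1.bin").write_bytes(b"orphan")
        first = restore_pending(ckpt_dir, out_dir)
        second = restore_pending(ckpt_dir, out_dir)  # nothing left to clean up
        remaining = sorted(p.name for p in out_dir.iterdir())
        return first, second, remaining
```

The second call in `demo()` removes nothing, which is the idempotence property the restore stage relies on, and the committed output `part-0.bin` is left untouched.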

Correctness Guarantees

The 2-phase commit ensures correctness because:

- A failure at any step of the write stage does not affect the final result: if it occurs before the commit, the restore stage deletes the data files listed in the pending checkpoint and then the pending checkpoint itself.
- The restore stage is idempotent: running it multiple times produces the same result.

Other Changes

Deprecate `block_index` Parameter in `FilenameProvider`

The block_index parameter in FilenameProvider.get_filename_for_block() is always 0 in _FileDatasink because datasinks merge all blocks into one before writing. Additionally, this parameter makes fetching filenames inside pending checkpoints tricky (we need to predict the filename before writing).

A new method get_filename_for_task() is introduced without the block_index parameter. Custom FilenameProvider implementations should not depend on block_index to ensure checkpointing correctness.
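To show why a task-level filename is easier to record ahead of the write, here is a standalone sketch. The class `TaskFilenameProviderSketch` is hypothetical and does not subclass Ray's FilenameProvider; the `(write_uuid, task_index)` signature for `get_filename_for_task()` is an assumption, since the description above does not show the real one.

```python
class TaskFilenameProviderSketch:
    """Hypothetical provider whose filenames depend only on task-level inputs."""

    def __init__(self, file_format):
        self.file_format = file_format

    def get_filename_for_task(self, write_uuid, task_index):
        # No block or block_index is needed, so the name can be computed
        # (and stored in a pending checkpoint) before any data is written.
        return f"{write_uuid}_{task_index:06}.{self.file_format}"


provider = TaskFilenameProviderSketch("parquet")
planned = provider.get_filename_for_task("abc", 3)
```

Calling the method twice with the same arguments yields the same name, which is the property the pending checkpoint needs.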

Release tests (to add)

Signed-off-by: xgui <xgui@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a robust 2-phase commit mechanism for checkpointing file-based datasinks, effectively preventing data duplication upon failures. The implementation is well-structured, separating the logic for file-based and non-file-based datasinks, and includes pre-write (prepare), write, and post-write (commit) phases. The recovery logic correctly handles the cleanup of pending checkpoints and associated orphaned data files, even for complex scenarios like partitioned outputs. Additionally, the deprecation of block_index in FilenameProvider in favor of the more deterministic get_filename_for_task is a sensible improvement. The accompanying tests are comprehensive, covering a wide range of failure scenarios and edge cases, which instills confidence in the correctness of this critical feature. I have one suggestion for a minor performance optimization.

python/ray/data/checkpoint/checkpoint_writer.py

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-02-11T21:23:37Z

python/ray/data/_internal/planner/checkpoint/plan_write_op.py

-        data_context, op
-    )
+
+    if isinstance(datasink, _FileDatasink):


2PC path crashes RowBasedFileDatasink with custom FilenameProvider

Medium Severity

The isinstance(datasink, _FileDatasink) check at line 114 routes RowBasedFileDatasink (used by write_images) into the 2-phase commit path. This path calls _generate_base_filename, which invokes get_filename_for_task(). The base class default delegates to get_filename_for_block(None, ...), which raises NotImplementedError for custom FilenameProvider implementations that only override get_filename_for_row() — the only method the RowBasedFileDatasink contract actually requires. This causes a runtime crash when checkpointing is enabled with write_images and a custom provider (like the ImageFilenameProvider example in the docs).

Additional Locations (1)

python/ray/data/_internal/planner/checkpoint/plan_write_op.py#L176-L179
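The reported failure mode can be reproduced with stand-in classes. Everything below is a sketch: `FilenameProviderSketch` only mimics the delegation described in the comment (the arguments after `None` are assumed), `RowOnlyProvider` plays the role of a custom provider that overrides only `get_filename_for_row()`, and `supports_task_filenames` is one possible guard, not the fix adopted in the PR.

```python
class FilenameProviderSketch:
    """Stand-in base class mirroring the delegation described above."""

    def get_filename_for_block(self, block, write_uuid, task_index, block_index):
        raise NotImplementedError

    def get_filename_for_row(self, row, write_uuid, task_index, block_index, row_index):
        raise NotImplementedError

    def get_filename_for_task(self, write_uuid, task_index):
        # Default: delegate to the block-level method with no block.
        return self.get_filename_for_block(None, write_uuid, task_index, 0)


class RowOnlyProvider(FilenameProviderSketch):
    """Custom provider that implements only the row-level method."""

    def get_filename_for_row(self, row, write_uuid, task_index, block_index, row_index):
        return f"{write_uuid}_{task_index}_{row_index}.png"


def supports_task_filenames(provider):
    """Probe whether the provider can name a file without a block."""
    try:
        provider.get_filename_for_task("probe", 0)
    except NotImplementedError:
        return False
    return True
```

With a check like this, a planner could route row-only providers to the non-2PC path instead of failing at write time.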

xinyuangui2 added 2 commits February 11, 2026 20:57

port turbo checkpointing change

2f8b221

Signed-off-by: xgui <xgui@anyscale.com>

fix the lazy execution error

7a112db

Signed-off-by: xgui <xgui@anyscale.com>

xinyuangui2 requested a review from a team as a code owner February 11, 2026 21:03

xinyuangui2 added the "go" label (add ONLY when ready to merge, run all tests) Feb 11, 2026

gemini-code-assist bot reviewed Feb 11, 2026

python/ray/data/checkpoint/checkpoint_writer.py (resolved)

cursor bot reviewed Feb 11, 2026

xinyuangui2 closed this Feb 12, 2026

xinyuangui2 wants to merge 2 commits into ray-project:master from xinyuangui2:checkpointing-fault-tolerance
