[Core] Use TaskAttempt as the unique id for inflight actor task by jjyao · Pull Request #52812 · ray-project/ray

jjyao · 2025-05-06T04:52:20Z

Why are these changes needed?

Currently, ActorTaskSubmitter tracks all inflight actor tasks in an inflight_task_callbacks map keyed by TaskID. This can cause the following issue:

Task_1_Attempt_0 is submitted and added to inflight_task_callbacks.
GCS tells the caller that the actor is dead before the PushTask RPC callback for Task_1_Attempt_0 is called.
ActorTaskSubmitter fails all inflight tasks and clears inflight_task_callbacks.
Task_1_Attempt_1 is submitted (actor task retry) and added to inflight_task_callbacks. The key is Task_1.
The PushTask RPC callback for Task_1_Attempt_0 is finally called (with a network error status). We look up Task_1 in inflight_task_callbacks, find an entry, and invoke its callback. However, that callback belongs to Task_1_Attempt_1, not Task_1_Attempt_0, so we end up failing the wrong task attempt.

Solution: track each inflight task by its TaskAttempt, which is unique per attempt.
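
The sequence above can be replayed in a small, self-contained C++ sketch. This is not the Ray implementation: InflightTracker, ReplayKeyedByTaskId, and ReplayKeyedByTaskAttempt are hypothetical names, TaskID is reduced to a string, and TaskAttempt is assumed to be a (task id, attempt number) pair. The sketch only illustrates why a late reply for attempt 0 matches the entry of attempt 1 when the key is the task id alone.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; the real Ray types are richer.
using TaskID = std::string;
using TaskAttempt = std::pair<TaskID, int32_t>;  // (task id, attempt number)

// Toy tracker of inflight reply callbacks, parameterized on the key type.
template <typename Key>
class InflightTracker {
 public:
  void Add(const Key &key, std::function<void()> on_reply) {
    callbacks_[key] = std::move(on_reply);
  }

  // Actor reported dead: forget every inflight entry.
  void Clear() { callbacks_.clear(); }

  // A PushTask reply arrived. Runs the callback only if the key is tracked.
  bool OnReply(const Key &key) {
    auto it = callbacks_.find(key);
    if (it == callbacks_.end()) {
      return false;  // Stale reply; nothing to do.
    }
    auto callback = std::move(it->second);
    callbacks_.erase(it);
    callback();
    return true;
  }

 private:
  std::map<Key, std::function<void()>> callbacks_;
};

// Each function replays the steps and returns the attempt numbers whose
// callbacks fired when the late reply for attempt 0 arrived.
std::vector<int32_t> ReplayKeyedByTaskId() {
  std::vector<int32_t> fired;
  InflightTracker<TaskID> tracker;
  tracker.Add("task_1", [&fired] { fired.push_back(0); });  // attempt 0
  tracker.Clear();                                          // actor died
  tracker.Add("task_1", [&fired] { fired.push_back(1); });  // retry
  tracker.OnReply("task_1");  // late reply that belongs to attempt 0
  return fired;
}

std::vector<int32_t> ReplayKeyedByTaskAttempt() {
  std::vector<int32_t> fired;
  InflightTracker<TaskAttempt> tracker;
  tracker.Add({"task_1", 0}, [&fired] { fired.push_back(0); });
  tracker.Clear();
  tracker.Add({"task_1", 1}, [&fired] { fired.push_back(1); });
  tracker.OnReply({"task_1", 0});  // no longer tracked, so it is ignored
  return fired;
}
```

Under these assumptions, ReplayKeyedByTaskId() reports that the callback of attempt 1 fired, while ReplayKeyedByTaskAttempt() reports that nothing fired, because the stale reply no longer matches any tracked key.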

TODO: Investigate whether we can remove inflight_task_callbacks (#19354) altogether after #51904 and rely purely on gRPC to call the callback when the actor is dead or restarted.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao · 2025-05-08T18:00:29Z

src/ray/core_worker/test/BUILD.bazel

 ray_cc_test(
-    name = "direct_actor_transport_test",
-    srcs = ["direct_actor_transport_test.cc"],
+    name = "task_receiver_test",


Split direct_actor_transport_test into task_receiver_test and actor_task_submitter_test.

jjyao · 2025-05-08T18:00:52Z

src/ray/core_worker/test/actor_task_submitter_test.cc

@@ -618,29 +596,30 @@ TEST_P(ActorTaskSubmitterTest, TestActorRestartOutOfOrderGcs) {
 }

 TEST_P(ActorTaskSubmitterTest, TestActorRestartFailInflightTasks) {


This is the only test that's changed.

I think it's best practice to only test against the interaction/public API as much as possible. The biggest exception is when dealing with code that was already written and is hard to test.

Here are a few arguments I've seen in favour of this:

The tests tell you what the contract for the API is so they only break when the contract is broken

You can refactor the implementation and if the tests pass you don't have to worry about bugs

The tests are not brittle i.e. they don't break due to implementation details

In this case, the API invariants being tested are:

Actor tasks without inflight retries succeed without a problem

Actor task retries succeed correctly while retries that have been discarded do not succeed

It's fair game to add a fake object like worker_client_ and check its state to see if the correct side-effects happen (e.g. a callback is added correctly), but these tests shouldn't inspect the private state of ActorTaskSubmitter since that is the API under test.

TL;DR: we should keep all the assertions against fake objects (such as worker_client_) that test the side-effects of ActorTaskSubmitter, but we should remove assertions like submitter_.NumInflightTasks(actor_id), which look at the implementation of ActorTaskSubmitter and not its public API.

Sorry for the very long response. My two cents on testing.

Completely agree with this ^ same motivation for my comment below about NumInflightTasks()
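
To make the distinction concrete, here is a minimal sketch of a test that asserts only on a fake collaborator and on observable outcomes. Everything in it is hypothetical (WorkerClientInterface, FakeWorkerClient, ToySubmitter and their signatures are invented for illustration and are far simpler than the real ActorTaskSubmitter and its worker client); it only shows the shape of the approach.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical dependency of the class under test.
class WorkerClientInterface {
 public:
  virtual ~WorkerClientInterface() = default;
  virtual void PushTask(const std::string &task,
                        std::function<void(bool)> reply_callback) = 0;
};

// Fake owned by the test. It records reply callbacks so the test can decide
// when, and with which status, each reply is delivered.
class FakeWorkerClient : public WorkerClientInterface {
 public:
  void PushTask(const std::string & /*task*/,
                std::function<void(bool)> reply_callback) override {
    callbacks.push_back(std::move(reply_callback));
  }

  // Delivers the oldest outstanding reply; returns false if there is none.
  bool ReplyPushTask(bool ok) {
    if (callbacks.empty()) {
      return false;
    }
    auto callback = std::move(callbacks.front());
    callbacks.pop_front();
    callback(ok);
    return true;
  }

  std::deque<std::function<void(bool)>> callbacks;
};

// Toy class under test. Its only public API is Submit(); how it tracks
// outstanding work internally is deliberately not exposed.
class ToySubmitter {
 public:
  using DoneCallback = std::function<void(const std::string &, bool)>;

  ToySubmitter(WorkerClientInterface &client, DoneCallback on_done)
      : client_(client), on_done_(std::move(on_done)) {}

  void Submit(const std::string &task) {
    client_.PushTask(task, [this, task](bool ok) { on_done_(task, ok); });
  }

 private:
  WorkerClientInterface &client_;
  DoneCallback on_done_;
};
```

A test written this way checks client.callbacks.size() before and after a reply and checks which tasks were reported as done, without asking the submitter how many entries it holds internally.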

jjyao · 2025-05-08T18:01:58Z

src/ray/core_worker/test/task_receiver_test.cc

@@ -0,0 +1,286 @@
+// Copyright 2017 The Ray Authors.


nothing changed, pure refactoring

jjyao · 2025-05-08T18:02:16Z

src/ray/core_worker/transport/actor_task_submitter.cc

          RAY_CHECK(it != client_queues_.end());
          auto &queue = it->second;
-          auto callback_it = queue.inflight_task_callbacks.find(task_id);
+          auto callback_it = queue.inflight_task_callbacks.find(task_attempt);


Is there a way to cancel an inflight gRPC request in addition to this? I don't have a lot of gRPC experience, but if we're going to discard the response, we might as well cancel the request so this is never called.

I'm also not familiar with this part. Need to do some investigation.

I think it might be a useful follow-up, but looking at our gRPC implementation, it doesn't look straightforward to implement.

gRPC has a notion of cancellation, but our C++ code doesn't handle it, so it would only be useful if it happens before the RPC handler begins on the server side.


edoakes

The fix LGTM, stylistic comments to improve the tests

Have you audited other maps keyed on TaskID to check if there are other similar issues?

edoakes · 2025-05-09T15:46:23Z

src/ray/core_worker/test/actor_task_submitter_test.cc

                                        TaskID caller_id = TaskID::Nil()) {
  TaskSpecification task;
  task.GetMutableMessage().set_task_id(TaskID::FromRandom(actor_id.JobId()).Binary());
+  task.GetMutableMessage().set_attempt_number(0);


This isn't necessary; protobuf has well-defined zero values. But it's fine to do if you think it improves readability.

Yea, I write it out for readability to emphasize it's attempt 0 of the task since attempt number is a key in this test.


edoakes · 2025-05-09T16:07:54Z

src/ray/core_worker/transport/actor_task_submitter.h

+  /// Return the number of inflight actor tasks for the given actor id.
+  size_t NumInflightTasks(const ActorID &actor_id) const;


What exactly are "inflight" tasks from the perspective of the caller of this interface? Is "inflight" a specific state or a set of states?

We also have the notion of "pending" tasks in the API. Are these the same thing?

Inflight tasks are actor tasks whose PushTask RPC has been sent but whose response has not been received yet. Pending tasks are tasks that are still queued (no PushTask RPC has been sent yet).

Added comment to explain what inflight means.
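
As a rough illustration of the two states, the sketch below models a per-actor queue in which a task is pending until its RPC is sent and inflight until its reply arrives. ToyActorQueue, Enqueue, SendPending, and OnReply are hypothetical names chosen for this example, and the real submitter does considerably more (ordering, retries, failure handling).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <set>
#include <string>
#include <utility>

// Assumed shape for illustration: (task id, attempt number).
using TaskAttempt = std::pair<std::string, int32_t>;

class ToyActorQueue {
 public:
  // Pending: queued locally, no PushTask RPC sent yet.
  void Enqueue(TaskAttempt attempt) { pending_.push_back(std::move(attempt)); }

  // Simulates sending PushTask for every queued task, for example once the
  // actor becomes reachable. Each task moves from pending to inflight.
  void SendPending() {
    while (!pending_.empty()) {
      inflight_.insert(std::move(pending_.front()));
      pending_.pop_front();
    }
  }

  // A reply arrived. Returns false if the attempt was not inflight.
  bool OnReply(const TaskAttempt &attempt) {
    return inflight_.erase(attempt) > 0;
  }

  size_t NumPendingTasks() const { return pending_.size(); }
  size_t NumInflightTasks() const { return inflight_.size(); }

 private:
  std::deque<TaskAttempt> pending_;
  std::set<TaskAttempt> inflight_;
};
```

In this model a task is counted in exactly one of the two sets at any time, which matches the definitions given in the reply above.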

edoakes · 2025-05-09T16:09:20Z

src/ray/core_worker/test/actor_task_submitter_test.cc

  // Submit a task.
  ASSERT_TRUE(CheckSubmitTask(task1));
  EXPECT_CALL(*task_finisher_, CompletePendingTask(task1.TaskId(), _, _, _)).Times(1);
  ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK()));
+  ASSERT_EQ(worker_client_->callbacks.size(), 0);
+  ASSERT_EQ(submitter_.NumInflightTasks(actor_id), 0);


what's the point of this codeblock in the test? it doesn't seem relevant

This makes sure we have a clean state for the rest of the test: i.e. Task 1 should be completely finished.


jjyao · 2025-05-12T16:11:26Z

Have you audited other maps keyed on TaskID to check if there are other similar issues?

I checked and some are suspicious. Created #52940 as the follow-up.

israbbani

The fix LGTM. I left a pretty detailed comment about writing tests against public APIs vs implementation details. I think it'll help improve our tests and our code if we try to follow that as a general guideline. 🚢


jjyao added 6 commits May 5, 2025 21:51

Debug data task cancellation

56aafbe

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

af85f3f

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

82b6e05

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

81768a6

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

c440f20

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

e284bf4

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao added the go add ONLY when ready to merge, run all tests label May 7, 2025

jjyao added 4 commits May 7, 2025 14:40

up

5476328

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

Merge branch 'master' of github.com:ray-project/ray into jjyao/dcancel

4059e4a

up

31b93b7

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

58cb32c

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao changed the title ~~Debug data task cancellation~~ [Core] Use TaskAttempt has the unique id for inflight actor task May 8, 2025

jjyao changed the title ~~[Core] Use TaskAttempt has the unique id for inflight actor task~~ [Core] Use TaskAttempt as the unique id for inflight actor task May 8, 2025

jjyao marked this pull request as ready for review May 8, 2025 17:59

jjyao commented May 8, 2025


jjyao assigned edoakes and kevin85421 May 8, 2025

jjyao requested review from edoakes and kevin85421 May 8, 2025 18:02

kevin85421 reviewed May 8, 2025


up

eec49c3

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

kevin85421 approved these changes May 9, 2025


edoakes reviewed May 9, 2025


comments

e56ff76

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao requested a review from edoakes May 9, 2025 21:51

jjyao added 2 commits May 9, 2025 16:45

up

9b49b0e

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

up

488660b

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

format

e2bc5a2

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

jjyao mentioned this pull request May 12, 2025

[Core] Check incorrect use of TaskID as map key instead of TaskAttempt #52940

Open

israbbani approved these changes May 12, 2025


up

8791d72

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

edoakes approved these changes May 12, 2025


jjyao merged commit 078e6e6 into ray-project:master May 12, 2025
4 of 5 checks passed

jjyao deleted the jjyao/dcancel branch May 12, 2025 23:28

ran1995data pushed a commit to ran1995data/ray that referenced this pull request May 13, 2025

[Core] Use TaskAttempt as the unique id for inflight actor task (ray-…

7c792c6

…project#52812) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: weiran11 <weiran11@baidu.com>

jjyao mentioned this pull request May 13, 2025

[Autoscaler][V2] Fix autoscaler terminating more nodes than exist of a type #52760

Merged


israbbani mentioned this pull request May 19, 2025

[core][refactor] Move to_resubmit_ from CoreWorker to TaskManager to avoid an abstraction leak #52779

Closed


hainesmichaelc added the community-backlog label May 22, 2025

israbbani mentioned this pull request Jun 19, 2025

[core] fix detached actor being unexpectedly killed #53562

Merged

