[Core] Use TaskAttempt as the unique id for inflight actor task#52812
[Core] Use TaskAttempt as the unique id for inflight actor task#52812jjyao merged 16 commits intoray-project:masterfrom
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
| ray_cc_test( | ||
| name = "direct_actor_transport_test", | ||
| srcs = ["direct_actor_transport_test.cc"], | ||
| name = "task_receiver_test", |
There was a problem hiding this comment.
Split direct_actor_transport_test into task_receiver_test and actor_task_submitter_test.
| @@ -618,29 +596,30 @@ TEST_P(ActorTaskSubmitterTest, TestActorRestartOutOfOrderGcs) { | |||
| } | |||
|
|
|||
| TEST_P(ActorTaskSubmitterTest, TestActorRestartFailInflightTasks) { | |||
There was a problem hiding this comment.
This is the only test that's changed.
There was a problem hiding this comment.
I think it's best practice to only test against the interaction/public API as much as possible. The biggest exception is when dealing with code that was already written and is hard to test.
Here are a few arguments I've seen in favour of this:
- The tests tell you what the contract for the API is so they only break when the contract is broken
- You can refactor the implementation and if the tests pass you don't have to worry about bugs
- The tests are not brittle i.e. they don't break due to implementation details
In this case, the API invariants we are being tested are:
- Actor tasks without inflight retries succeed without a problem
- Actor task retries succeed correctly while retries that have been discarded do not succeed
It's fair game to add a fake object like worker_client_ and check it's state to see if the correct side-effects happen (e.g. callback is added correctly), but these tests shouldn't inspect the private state of ActorTaskSubmitter since that is the API under test.
TL/DR: we should keep all the assertions against fake objects (such as worker_client_) that are testing side-effects of ActorTaskSubmitter but we should remove assertions like submitter_.NumInflightTasks(actor_id) which look at the implementation of ActionTaskSubmitter and not it's public API.
There was a problem hiding this comment.
Sorry for the very long response. My two cents on testing.
There was a problem hiding this comment.
Completely agree with this ^ same motivation for my comment below about NumInflightTasks()
| @@ -0,0 +1,286 @@ | |||
| // Copyright 2017 The Ray Authors. | |||
There was a problem hiding this comment.
nothing changed, pure refactoring
| RAY_CHECK(it != client_queues_.end()); | ||
| auto &queue = it->second; | ||
| auto callback_it = queue.inflight_task_callbacks.find(task_id); | ||
| auto callback_it = queue.inflight_task_callbacks.find(task_attempt); |
There was a problem hiding this comment.
Is there a way to cancel an inflight GRPC request in addition to this? I don't have a lot of GRPC experience, but if we're going to discard the response, might as well cancel the request so this is never called.
There was a problem hiding this comment.
I'm also not familiar with this part. Need to do some investigation.
There was a problem hiding this comment.
I think it might be a useful follow up, but looking at our GRPC implementation, it doesn't look straightforward to implement.
There was a problem hiding this comment.
gRPC has a notion of cancellation but our c++ code doesn't handle it so it would only be useful if it happens before the RPC handler begins on the server side
edoakes
left a comment
There was a problem hiding this comment.
The fix LGTM, stylistic comments to improve the tests
Have you audited other maps keyed on TaskID to check if there are other similar issues?
| TaskID caller_id = TaskID::Nil()) { | ||
| TaskSpecification task; | ||
| task.GetMutableMessage().set_task_id(TaskID::FromRandom(actor_id.JobId()).Binary()); | ||
| task.GetMutableMessage().set_attempt_number(0); |
There was a problem hiding this comment.
this isn't necessary; protobuf has well-defined zero values. but fine to do if you think it improves readability
There was a problem hiding this comment.
Yea, I write it out for readability to emphasize it's attempt 0 of the task since attempt number is a key in this test.
| /// Return the number of inflight actor tasks for the given actor id. | ||
| size_t NumInflightTasks(const ActorID &actor_id) const; |
There was a problem hiding this comment.
What exactly are "inflight" tasks from the perspective of the caller of this interface? Is "inflight" a specific state or a set of states?
We also have the notion of "pending" tasks in the API. Are these the same thing?
There was a problem hiding this comment.
Inflight are those actor tasks whose PushTask RPC is sent and inflight (response not received yet). Pending are tasks that are queued (no PushTask RPC yet).
Added comment to explain what inflight means.
| // Submit a task. | ||
| ASSERT_TRUE(CheckSubmitTask(task1)); | ||
| EXPECT_CALL(*task_finisher_, CompletePendingTask(task1.TaskId(), _, _, _)).Times(1); | ||
| ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK())); | ||
| ASSERT_EQ(worker_client_->callbacks.size(), 0); | ||
| ASSERT_EQ(submitter_.NumInflightTasks(actor_id), 0); |
There was a problem hiding this comment.
what's the point of this codeblock in the test? it doesn't seem relevant
There was a problem hiding this comment.
This makes sure we have a clean state for the rest of the test: i.e. Task 1 should be completely finished.
I checked and some are suspicious. Created #52940 as the follow-up. |
israbbani
left a comment
There was a problem hiding this comment.
The fix LGTM. I left a pretty detailed comment about writing tests against public APIs vs implementation details. I think it'll help improve our tests and our code if we try to follow that as a general guideline. 🚢
…project#52812) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: weiran11 <weiran11@baidu.com>
Why are these changes needed?
Currently
ActorTaskSubmitterhasinflight_task_callbacksmap whose key is TaskID to track all inflight actor tasks. It can cause the following issue:inflight_task_callbacks.ActorTaskSubmitterfailed all inflight tasks and clearedinflight_task_callbacksinflight_task_callbacks. The key is Task_1.inflight_task_callbacksand we can find it and the callback is called. The problem is that this callback is for Task_1_Attempt_1 not Task_1_Attempt_0 so we end up failing the wrong task attempt.Solution: use
TaskAttemptto track each inflight task which is unique.TODO: Investigate whether we can remove
inflight_task_callbacks(#19354) all together after #51904 and purely rely on GRPC to call the callback when the actor is dead or restarted.Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.