Conversation
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces dead-letter queue (DLQ) functionality for the task processor. Failed tasks and unknown tasks are now moved to configurable DLQs for later inspection. The changes include:
- Using Celery signals (
task_failure,task_unknown) to handle task failures. - New helper functions for creating queue configs and moving tasks.
- Configuration options for failed and unprocessable task queues.
- New tests to verify the DLQ logic for various failure scenarios.
My review focuses on ensuring the robustness of the DLQ mechanism, particularly around data serialization, and on improving the maintainability of the new tests. I've identified a couple of critical issues where failed task data could be lost due to serialization errors, and I've provided suggestions to fix them. I also recommend refactoring the new tests to reduce code duplication.
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
|
In my opinion, the tests can be rewritten for simplicity; they are too verbose right now and kind of hard to read. |
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
refactored them to make it less verbose but retain all the checks & functionalities |
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
Signed-off-by: harshit <harshit@anyscale.com>
abrarsheikh
left a comment
There was a problem hiding this comment.
the current implementation of the test is not deterministic and need to be improved. Let's work on that in the follow up PR.
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: sampan <sampan@anyscale.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
### Summary This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs: 1. `failed_task_queue` – for tasks that fail during normal execution. 2. `unprocessable_task_queue` – for tasks that cannot be processed (e.g., deserialization failures or missing handlers). All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the [RFC document](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0). ### Changes in this PR 1. Integrated Celery signals (task_failure, task_unknown) to handle task failures. 2. Added helper functions for moving tasks into the correct DLQ. 3. Introduced tests to verify DLQ routing logic across different failure scenarios. 4. Added a persistence test to ensure tasks are retried at-least-once as per the [RFC’s NFR requirements](https://docs.google.com/document/d/1Ix7uKrP3Q5LCjJ5wZG47ncUi5ScbYzyrtFXsYSlGnwg/edit?tab=t.0#heading=h.4om3bw49w03x). ### Follow-up work (to be added in a separate PR) Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: 1. Task processor metrics 2. Task processor health checks 3. Task cancellation (cancel_task) 4. Multiple task consumers in a single Serve application 5. Ensuring failed tasks are retried exactly max_retry + 1 times --------- Signed-off-by: harshit <harshit@anyscale.com>
Summary
This pull request introduces Dead-Letter Queue (DLQ) functionality for async inference. Users can configure two DLQs:
failed_task_queue– for tasks that fail during normal execution.unprocessable_task_queue– for tasks that cannot be processed (e.g., deserialization failures or missing handlers).All unprocessable tasks will automatically be routed to the unprocessable_task_queue, while other failures will go to the failed_task_queue. The detailed behavior is defined in the RFC document.
Changes in this PR
Follow-up work (to be added in a separate PR)
Additional tests will be added in the next PR to keep this one focused and manageable. These will cover: