
[Serve] Calls to a Serve Deployment's .remote(), hang after some amount of time / requests. #47870

@JadenFiotto-Kaufman

Description


What happened + What you expected to happen

After some amount of time and some number of calls to .remote(), my deployment stops receiving requests. When this happens, the deployment that is calling .remote() logs warnings like:

"WARNING 2024-09-24 09:51:14,270 serve 20 pow_2_scheduler.py:536 - Failed to get queue length from Replica(id='was0z9gr', deployment='ModelDeployment', app='Model:nnsight-models-languagemodel-languagemodel-repo-id-eleutherai-gpt-j-6b') within 1.0s. If this happens repeatedly it's likely caused by high network latency in the cluster. You can configure the deadline using the RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S environment variable."

Here is my setup. I have three services in my architecture:
1.) My own FastAPI server.
2.) A request pre-processing Ray Serve deployment (the "Request deployment").
3.) A compute Ray Serve deployment (the "Compute deployment").

The FastAPI server is my own custom ingress endpoint; it does not use Ray Serve's ingress functionality. It connects to an existing Ray cluster via ray.init() on startup. When the async request endpoint is hit, it calls serve.get_app_handle("RequestApplication").remote(request) to send the request to the Request deployment.
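For concreteness, the call site looks roughly like this (a minimal sketch: the endpoint path, payload handling, and return value are illustrative, not the exact production code):

```python
# Minimal sketch of the custom FastAPI ingress described above.
# The endpoint path, payload shape, and return value are illustrative.
import ray
from fastapi import FastAPI, Request
from ray import serve

app = FastAPI()
ray.init(address="auto")  # connect to the existing Ray cluster on startup

@app.post("/request")
async def submit(request: Request):
    payload = await request.json()
    handle = serve.get_app_handle("RequestApplication")
    handle.remote(payload)  # fire-and-forget: the response is never awaited
    return {"status": "submitted"}
```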

The Request deployment runs on the Ray head node and, based on the request received, forwards it to the correct compute deployment via serve.get_app_handle().remote(request). This happens in the deployment's async __call__ method.

The Compute deployment is on a different Ray node, on a different machine from both the FastAPI server and the Ray head node. It receives the request, also in its async __call__ method, executes the compute-heavy work, and handles sending the data back to the FastAPI server itself.
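Putting the two deployments together, the flow is roughly as follows (a sketch of the setup described above; the class names and the two helper functions are my own shorthand, not the real code, which is linked below):

```python
# Illustrative sketch of the two deployments. `run_heavy_compute` and
# `send_result_to_fastapi_server` are hypothetical placeholders.
from ray import serve

@serve.deployment
class RequestDeployment:
    async def __call__(self, request: dict):
        # Pick the target compute app from the request and forward it.
        handle = serve.get_app_handle(request["target_app"])
        handle.remote(request)  # again fire-and-forget, never awaited

@serve.deployment
class ComputeDeployment:
    async def __call__(self, request: dict):
        result = run_heavy_compute(request)    # the expensive model call
        send_result_to_fastapi_server(result)  # returns data outside of Ray
```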

Things to note:

- Neither the FastAPI server nor the Request deployment waits for the response from .remote(request); the Compute deployment handles the result separately from Ray.
- The request is a Pydantic object that contains a large object as one of its attributes. The FastAPI server calls ray.put on that attribute, and the Compute deployment later calls ray.get on the resulting reference.
- In the logs for the Request deployment, I consistently see "LongPollClient polling timed out. Retrying." Not sure if this is a problem.
- In the FastAPI server I sometimes see: "WARNING 2024-09-24 09:51:14,270 serve 20 pow_2_scheduler.py:536 - Failed to get queue length from Replica(id='was0z9gr', deployment='ModelDeployment', app='Model:nnsight-models-languagemodel-languagemodel-repo-id-eleutherai-gpt-j-6b') within 1.0s. If this happens repeatedly it's likely caused by high network latency in the cluster. You can configure the deadline using the RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S environment variable."
- This is often followed by: "concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7f118c190910 state=cancelled>"

Is there anything I need to do, given that I'm not waiting for the results of .remote() calls? Should I be "closing" the DeploymentHandles in some way after I've sent data via their .remote() method?
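The large-attribute handoff from the second note above can be sketched like this (the attribute name `data` is illustrative):

```python
# Sketch of the ObjectRef handoff for the large attribute.
import ray

# FastAPI server side: replace the large attribute with an ObjectRef
# before sending the Pydantic object through .remote().
request.data = ray.put(request.data)

# Compute deployment side: resolve the ObjectRef back into the object.
data = ray.get(request.data)
```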

Sorry for the long post. Please let me know if there are any logs or other information I can provide.

Versions / Dependencies

Ray: '2.36.0'
Python: '3.10.15'
OS: 'Ubuntu 20.04.6 LTS (Focal Fossa)'

Reproduction script

Not something I can reproduce in one script. Here is a link to our software, specifically the Ray Deployment that calls .remote() and hangs: https://github.com/ndif-team/ndif/blob/dev/ray/deployments/request.py

Issue Severity

High: It blocks me from completing my task.

Labels

author-action-required, P2 (Important issue, but not time-critical), bug (Something that is supposed to be working, but isn't), serve (Ray Serve related issue)
