
[Serve] Calls to a Serve Deployment's .remote(), hang after some amount of time / requests. #47870

@JadenFiotto-Kaufman

Description


What happened + What you expected to happen

After some amount of time and some number of calls to .remote(), my deployment stops receiving requests. When this happens, the deployment that is calling .remote() logs warnings like:

"WARNING 2024-09-24 09:51:14,270 serve 20 pow_2_scheduler.py:536 - Failed to get queue length from Replica(id='was0z9gr', deployment='ModelDeployment', app='Model:nnsight-models-languagemodel-languagemodel-repo-id-eleutherai-gpt-j-6b') within 1.0s. If this happens repeatedly it's likely caused by high network latency in the cluster. You can configure the deadline using the RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S environment variable."

Here is my setup. I have three services in my architecture:
1.) My own FastAPI server.
2.) A request pre-processing Ray Serve deployment (the "Request deployment").
3.) A compute Ray Serve deployment (the "Compute deployment").

The FastAPI server is my own custom ingress endpoint; it does not use Ray Serve's ingress functionality. It connects to an existing Ray cluster via ray.init() on startup. When the async request endpoint is hit, it calls serve.get_app_handle("RequestApplication").remote(request) to send the request to the Request deployment.
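For concreteness, the call site looks roughly like this (a minimal sketch: the endpoint path, payload handling, and return value are illustrative, not the exact production code):

```python
# Minimal sketch of the custom FastAPI ingress described above.
# The endpoint path, payload shape, and return value are illustrative.
import ray
from fastapi import FastAPI, Request
from ray import serve

app = FastAPI()
ray.init(address="auto")  # connect to the existing Ray cluster on startup

@app.post("/request")
async def submit(request: Request):
    payload = await request.json()
    handle = serve.get_app_handle("RequestApplication")
    handle.remote(payload)  # fire-and-forget: the response is never awaited
    return {"status": "submitted"}
```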

The Request deployment runs on the Ray head node and, based on the request received, forwards it to the correct compute deployment via serve.get_app_handle().remote(request). This happens in the deployment's async __call__ method.

The Compute deployment is on a different Ray node, on a different machine from both the FastAPI server and the Ray head node. It receives the request, also in its async __call__ method, executes the compute-heavy work, and handles sending the data back to the FastAPI server itself.
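Putting the two deployments together, the flow is roughly as follows (a sketch of the setup described above; the class names and the two helper functions are my own shorthand, not the real code, which is linked below):

```python
# Illustrative sketch of the two deployments. `run_heavy_compute` and
# `send_result_to_fastapi_server` are hypothetical placeholders.
from ray import serve

@serve.deployment
class RequestDeployment:
    async def __call__(self, request: dict):
        # Pick the target compute app from the request and forward it.
        handle = serve.get_app_handle(request["target_app"])
        handle.remote(request)  # again fire-and-forget, never awaited

@serve.deployment
class ComputeDeployment:
    async def __call__(self, request: dict):
        result = run_heavy_compute(request)    # the expensive model call
        send_result_to_fastapi_server(result)  # returns data outside of Ray
```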

Things to note:

- Neither the FastAPI server nor the Request deployment waits for the response from .remote(request); the Compute deployment handles the result separately from Ray.
- The request is a Pydantic object that contains a large object as one of its attributes. The FastAPI server calls ray.put on that attribute, and the Compute deployment later calls ray.get on the resulting reference.
- In the logs for the Request deployment, I consistently see "LongPollClient polling timed out. Retrying." Not sure if this is a problem.
- In the FastAPI server I sometimes see: "WARNING 2024-09-24 09:51:14,270 serve 20 pow_2_scheduler.py:536 - Failed to get queue length from Replica(id='was0z9gr', deployment='ModelDeployment', app='Model:nnsight-models-languagemodel-languagemodel-repo-id-eleutherai-gpt-j-6b') within 1.0s. If this happens repeatedly it's likely caused by high network latency in the cluster. You can configure the deadline using the RAY_SERVE_QUEUE_LENGTH_RESPONSE_DEADLINE_S environment variable."
- This is often followed by: "concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x7f118c190910 state=cancelled>"

Is there anything I need to do, given that I'm not waiting for the results of .remote() calls? Should I be "closing" the DeploymentHandles in some way after I've sent data via their .remote() method?
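The large-attribute handoff from the second note above can be sketched like this (the attribute name `data` is illustrative):

```python
# Sketch of the ObjectRef handoff for the large attribute.
import ray

# FastAPI server side: replace the large attribute with an ObjectRef
# before sending the Pydantic object through .remote().
request.data = ray.put(request.data)

# Compute deployment side: resolve the ObjectRef back into the object.
data = ray.get(request.data)
```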

Sorry for the long post. Please let me know if there are any logs or other information I can provide.

Versions / Dependencies

Ray: '2.36.0'
Python: '3.10.15'
OS: 'Ubuntu 20.04.6 LTS (Focal Fossa)'

Reproduction script

Not something I can reproduce in one script. Here is a link to our software, specifically the Ray Deployment that calls .remote() and hangs: https://github.com/ndif-team/ndif/blob/dev/ray/deployments/request.py

Issue Severity

High: It blocks me from completing my task.

Labels

author-action-required, P2 (Important issue, but not time-critical), bug (Something that is supposed to be working, but isn't), serve (Ray Serve related issue)
