Merged
82 changes: 80 additions & 2 deletions doc/source/serve/advanced-guides/performance.md
@@ -17,7 +17,7 @@ This section offers some tips and tricks to improve your Ray Serve application's

Ray Serve is built on top of Ray, so its scalability is bounded by Ray’s scalability. See Ray’s [scalability envelope](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md) to learn more about the maximum number of nodes and other limitations.

## Debugging performance issues
## Debugging performance issues in the request path

The performance issue you're most likely to encounter is high latency or low throughput for requests.

@@ -75,7 +75,85 @@ Ray Serve allows you to fine-tune the backoff behavior of the request router, wh
- `RAY_SERVE_ROUTER_RETRY_MAX_BACKOFF_S`: The maximum backoff time (in seconds) between retries. Default is `0.5`.

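For example, you can tighten the retry backoff by exporting the variable before starting your application. This is a sketch; the `0.1` value is illustrative, not a recommendation:

```shell
# Cap the router's retry backoff at 0.1 s to retry more aggressively.
# (Illustrative value; the documented default is 0.5.)
export RAY_SERVE_ROUTER_RETRY_MAX_BACKOFF_S=0.1
```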

### Give the Serve Controller more time to process requests
### Enable throughput-optimized serving

:::{note}
In Ray v2.54.0, the defaults for `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` will change to `0` for improved performance.
**Contributor:** but not the logging ones?

**Contributor (author):** Logging can be made default after #57850 is implemented.

**Contributor:** Since we're pushing this out to the future 2.54 release, should we just include the logging one as well then?

**Contributor (author):** I am inclined to leave it out, because when we implement the time-based logger, we can immediately roll it out without warning to the developer, since it has no perceivable impact.

**Contributor:** Suggested change:

> ~~In Ray v2.54.0, the defaults for `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` will change to `0` for improved performance.~~
> A breaking change to this functionality will go live with Ray version 2.54.0. The defaults for `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` will change to `0`, disabling existing default functionality to improve serving throughput.
> You should update your code to explicitly set these properties to `1` if your workloads require legacy behavior.

Typically, I avoid mentioning future state. Since this is a known planned migration, we should announce it.

We should also draft customer comms and work with @tg-anyscale to address Anyscale customers. (I understand this is technically opt-in for the breaking change because it's a new Ray version, but still nice to encourage customers to start testing now.)

**Contributor (author):**

> A breaking change to this functionality will go live with

This is not a breaking change from the customer POV; they don't need to take any action to opt into these optimizations.

**Contributor:** Sorry, I'm confused: won't running the user code in the same loop as the serve code break customer workloads with blocking logic once the default changes (assuming upgrade to Ray 2.54.0)?

Or will users not experience a performance degradation relative to now, they just won't see an improvement?

**Contributor (author):**

> will users not experience a performance degradation relative to now, they just won't see an improvement?

This ^

:::

This section details how to enable Ray Serve options focused on improving throughput and reducing latency. These configurations focus on the following:

- Reducing overhead associated with frequent logging.
- Disabling behavior that allowed Serve applications to include blocking operations.

If your Ray Serve code includes thread-blocking operations, you must refactor your code to achieve enhanced throughput. The following table shows examples of blocking and non-blocking code:

<table>
<tr>
<th>Blocking operation (❌)</th>
<th>Non-blocking operation (✅)</th>
</tr>
<tr>
<td>

```python
from ray import serve
from fastapi import FastAPI
import time

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class BlockingDeployment:
@app.get("/process")
async def process(self):
# ❌ Blocking operation
time.sleep(2)
return {"message": "Processed (blocking)"}

serve.run(BlockingDeployment.bind())
```

</td>
<td>

```python
from ray import serve
from fastapi import FastAPI
import asyncio

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class NonBlockingDeployment:
@app.get("/process")
async def process(self):
# ✅ Non-blocking operation
await asyncio.sleep(2)
return {"message": "Processed (non-blocking)"}

serve.run(NonBlockingDeployment.bind())
```

</td>
</tr>
</table>
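When a blocking call can't be rewritten as async (for example, a synchronous client library), a common fix is to offload it to a worker thread so the event loop stays responsive. The following is a minimal sketch of that pattern using `asyncio.to_thread`, shown outside of Serve for brevity; `blocking_io` is a hypothetical stand-in for your blocking call:

```python
import asyncio
import time


def blocking_io() -> str:
    # Hypothetical stand-in for a synchronous, blocking call
    # (for example, a blocking HTTP request or file read).
    time.sleep(0.1)
    return "done"


async def handler() -> str:
    # Offload the blocking call to a worker thread so the event
    # loop can keep serving other requests in the meantime.
    return await asyncio.to_thread(blocking_io)


print(asyncio.run(handler()))  # prints "done"
```

Inside a Serve deployment, the same `await asyncio.to_thread(...)` call can wrap the blocking portion of a request handler.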

To configure all options to the recommended settings, set the environment variable `RAY_SERVE_THROUGHPUT_OPTIMIZED=1`.
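For example, assuming an application named `main` in a module `app.py` (both names are placeholders), you might set the flag before launching:

```shell
# Enable all throughput-optimized settings for this process.
export RAY_SERVE_THROUGHPUT_OPTIMIZED=1

# Launch the Serve application (module and application names are placeholders).
serve run app:main
```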

You can also configure each option individually. The following table details the recommended configurations and their impact:

| Configured value | Impact |
| --- | --- |
| `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0` | Your code runs in the replica's main event loop. You must avoid blocking operations in your request path. Set this configuration to `1` to run your code in a separate event loop, which protects the replica's ability to communicate with the Serve Controller if your code has blocking operations. |
| `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0` | The request router runs in the same event loop as your code. You must avoid blocking operations in your request path. Set this configuration to `1` to run the router in a separate event loop, which protects Ray Serve's request routing ability when your code has blocking operations. |
| `RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE=1000` | Buffers logs and writes them in batches of `1000` entries. The system always flushes the buffer and writes logs immediately when it detects a line with level ERROR. Set the buffer size to `1` to disable buffering and write each log line immediately. |
| `RAY_SERVE_LOG_TO_STDERR=0` | Writes logs only to files under the `logs/serve/` directory. Proxy, Controller, and Replica logs no longer appear in the console, worker log files, or the Actor Logs section of the Ray Dashboard. Set this property to `1` to also write logs to stderr. |
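As a sketch, applying the recommended values from the table individually looks like this:

```shell
# Recommended throughput-optimized settings, applied one at a time.
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0
export RAY_SERVE_REQUEST_PATH_LOG_BUFFER_SIZE=1000
export RAY_SERVE_LOG_TO_STDERR=0
```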


## Debugging performance issues in the controller

The Serve Controller runs on the Ray head node and is responsible for a variety of tasks,
including receiving autoscaling metrics from other Ray Serve components.