diff --git a/doc/source/serve/advanced-guides/performance.md b/doc/source/serve/advanced-guides/performance.md
index 1cdff7ab7862..cf9e63f44454 100644
--- a/doc/source/serve/advanced-guides/performance.md
+++ b/doc/source/serve/advanced-guides/performance.md
@@ -17,7 +17,7 @@ This section offers some tips and tricks to improve your Ray Serve application's
 
 Ray Serve is built on top of Ray, so its scalability is bounded by Ray’s scalability. See Ray’s [scalability envelope](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md) to learn more about the maximum number of nodes and other limitations.
 
-## Debugging performance issues
+## Debugging performance issues in the request path
 
 The performance issue you're most likely to encounter is high latency or low throughput for requests.
 
@@ -75,7 +75,85 @@ Ray Serve allows you to fine-tune the backoff behavior of the request router, wh
 
 - `RAY_SERVE_ROUTER_RETRY_MAX_BACKOFF_S`: The maximum backoff time (in seconds) between retries. Default is `0.5`.
 
-### Give the Serve Controller more time to process requests
+### Enable throughput-optimized serving
+
+:::{note}
+In Ray v2.54.0, the defaults for `RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD` and `RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP` will change to `0` for improved performance.
+:::
+
+This section details how to enable Ray Serve options that improve throughput and reduce latency. These configurations focus on the following:
+
+- Reducing the overhead associated with frequent logging.
+- Disabling behavior that allowed Serve applications to include blocking operations.
+
+If your Ray Serve code includes thread-blocking operations, you must refactor it to benefit from the enhanced throughput. The following table shows examples of blocking and non-blocking code:
+
+<table>
+<tr>
+<th>Blocking operation (❌)</th>
+<th>Non-blocking operation (✅)</th>
+</tr>
+<tr>
+<td>
+
+```python
+from ray import serve
+from fastapi import FastAPI
+import time
+
+app = FastAPI()
+
+@serve.deployment
+@serve.ingress(app)
+class BlockingDeployment:
+    @app.get("/process")
+    async def process(self):
+        # ❌ Blocking operation
+        time.sleep(2)
+        return {"message": "Processed (blocking)"}
+
+serve.run(BlockingDeployment.bind())
+```
+
+</td>
+<td>
+
+```python
+from ray import serve
+from fastapi import FastAPI
+import asyncio
+
+app = FastAPI()
+
+@serve.deployment
+@serve.ingress(app)
+class NonBlockingDeployment:
+    @app.get("/process")
+    async def process(self):
+        # ✅ Non-blocking operation
+        await asyncio.sleep(2)
+        return {"message": "Processed (non-blocking)"}
+
+serve.run(NonBlockingDeployment.bind())
+```
+
+</td>
+</tr>
+</table>
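A note on the refactor the table above asks for: when a blocking call can't be rewritten as a native `await` (for example, a synchronous third-party client), one common pattern is to offload it to a worker thread with `asyncio.to_thread` so the event loop stays responsive. This is a minimal sketch of the pattern in plain asyncio, not Serve-specific code; `blocking_io` and `handler` are hypothetical names:

```python
import asyncio
import time


def blocking_io() -> str:
    # Stand-in for a synchronous call you can't make async.
    time.sleep(0.1)
    return "done"


async def handler() -> dict:
    # Offload the blocking call to a thread; the event loop stays free
    # to serve other requests while the thread sleeps.
    result = await asyncio.to_thread(blocking_io)
    return {"message": result}


print(asyncio.run(handler()))
```

Inside a Serve deployment method, the same `await asyncio.to_thread(...)` call replaces the direct blocking call.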
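To opt in to the throughput-optimized behavior before the v2.54.0 default change, the note above implies setting the two environment variables to `0` in the environment that starts Serve. A sketch, assuming a shell launch; the `serve run` target `app:main` is a placeholder for your own application:

```shell
# Opt in to throughput-optimized serving ahead of the Ray v2.54.0
# default change by setting both variables to 0.
export RAY_SERVE_RUN_USER_CODE_IN_SEPARATE_THREAD=0
export RAY_SERVE_RUN_ROUTER_IN_SEPARATE_LOOP=0

# Then start the application with the settings applied, e.g.:
# serve run app:main
```

These variables must be set before the Serve processes start; exporting them after the fact has no effect on running replicas.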