14 changes: 11 additions & 3 deletions python/ray/llm/_internal/serve/engines/vllm/vllm_engine.py
```diff
@@ -1,6 +1,7 @@
 import argparse
 import dataclasses
 import inspect
+import json
 import typing
 from typing import TYPE_CHECKING, Any, AsyncGenerator, List, Optional, Tuple, Union
```
```diff
@@ -329,7 +330,7 @@ async def start(self) -> None:
         self._oai_models = getattr(state, "openai_serving_models", None)
         self._oai_serving_chat = getattr(state, "openai_serving_chat", None)
         self._oai_serving_completion = getattr(state, "openai_serving_completion", None)
-        self._oai_serving_embedding = getattr(state, "openai_serving_embedding", None)
+        self._oai_serving_embedding = getattr(state, "serving_embedding", None)
         self._oai_serving_transcription = getattr(
             state, "openai_serving_transcription", None
         )
```
```diff
@@ -579,7 +580,8 @@ async def embeddings(
         raw_request: Optional[Request] = RawRequestInfo.to_starlette_request_optional(
             raw_request_info
         )
-        embedding_response = await self._oai_serving_embedding.create_embedding(  # type: ignore[attr-defined]
+        # vLLM's ServingEmbedding is a callable, not a class with create_embedding
+        embedding_response = await self._oai_serving_embedding(
             request,
             raw_request=raw_request,
         )
```
```diff
@@ -588,8 +590,14 @@
             yield ErrorResponse(
                 error=ErrorInfo(**embedding_response.error.model_dump())
             )
+            return

-        yield EmbeddingResponse(**embedding_response.model_dump())
+        # vLLM returns a Starlette Response object, extract the JSON content
+        if hasattr(embedding_response, 'body'):
+            content = json.loads(embedding_response.body)
+            yield EmbeddingResponse(**content)
+        else:
+            yield embedding_response
```
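For context on the pattern the new code relies on: a Starlette `JSONResponse` keeps its serialized payload in `.body` as `bytes`, which `json.loads` can decode back into a dict. A minimal, stdlib-only sketch of the diff's extraction logic, using a hypothetical `FakeResponse` stand-in rather than the real Starlette or vLLM classes:

```python
import json
from dataclasses import dataclass


# Hypothetical stand-in for a Starlette Response: the serialized
# payload lives in `.body` as bytes (not the real class).
@dataclass
class FakeResponse:
    body: bytes


def extract_embedding_payload(response):
    # Mirrors the diff's logic: if the object carries a raw body,
    # decode the JSON bytes; otherwise pass the object through as-is.
    if hasattr(response, "body"):
        return json.loads(response.body)
    return response


resp = FakeResponse(body=json.dumps({"object": "list", "data": []}).encode())
payload = extract_embedding_payload(resp)
print(payload["object"])  # prints "list"
```

In the real code path, `payload` would then be fed to `EmbeddingResponse(**content)` to rebuild the Pydantic model.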
Comment on lines +596 to +600 (Contributor, severity: medium)

Using hasattr for duck typing can be brittle. A more robust way to check whether embedding_response is a Starlette-like response object is to also check the type of the body attribute: Starlette Response objects have a body attribute of type bytes. This avoids potential issues if another type of object with a body attribute of a different type is returned in the future.
Suggested change:

```diff
-if hasattr(embedding_response, 'body'):
-    content = json.loads(embedding_response.body)
-    yield EmbeddingResponse(**content)
-else:
-    yield embedding_response
+if isinstance(getattr(embedding_response, "body", None), bytes):
+    content = json.loads(embedding_response.body)
+    yield EmbeddingResponse(**content)
+else:
+    yield embedding_response
```
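The reviewer's point is easy to demonstrate in isolation: a bare `hasattr` check fires on any object with a `body` attribute, while the `isinstance(getattr(...), bytes)` form only matches objects whose body is raw bytes, as on a real Starlette Response. A small sketch with two hypothetical classes (neither is a real vLLM or Starlette type):

```python
class RawResponse:
    # Starlette-like: serialized payload as bytes.
    body = b'{"object": "list", "data": []}'


class WeirdResponse:
    # Also has `body`, but already parsed — hasattr alone would misfire.
    body = {"already": "parsed"}


def is_starlette_like(obj):
    # Reviewer's suggestion: require that `body` exists AND is bytes.
    return isinstance(getattr(obj, "body", None), bytes)


print(is_starlette_like(RawResponse()))    # prints True
print(is_starlette_like(WeirdResponse()))  # prints False
print(hasattr(WeirdResponse(), "body"))    # prints True — the brittle check
```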
Error responses mishandled as embedding success responses

Medium Severity

vLLM's PoolingServing.__call__ always returns a Starlette Response object (never a VLLMErrorResponse instance), so the isinstance(embedding_response, VLLMErrorResponse) check on line 589 is dead code. When vLLM returns an error, it will be a JSONResponse with an error status code and an error body. That error response falls through to the hasattr(embedding_response, 'body') branch, where EmbeddingResponse(**content) will crash because the error JSON does not match the EmbeddingResponse schema. The response's status_code needs to be checked to distinguish success from error responses before parsing the body.
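The fix this comment asks for can be sketched independently of vLLM: branch on `status_code` before assuming the body matches the success schema. `StubResponse` below is a hypothetical stand-in for a Starlette `JSONResponse` (bytes body plus status code), not the real class:

```python
import json


class StubResponse:
    # Hypothetical stand-in for a Starlette JSONResponse.
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self.body = json.dumps(payload).encode()


def parse_embedding_response(response):
    # Per the review: inspect status_code before treating the body as a
    # successful embedding payload; error bodies have a different schema.
    content = json.loads(response.body)
    if response.status_code >= 400:
        raise RuntimeError(f"embedding backend error: {content}")
    return content


ok = parse_embedding_response(StubResponse(200, {"object": "list", "data": []}))
print(ok["object"])  # prints "list"
```

In the actual handler, the error branch would yield an ErrorResponse built from `content` instead of raising.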


```diff
     async def transcriptions(
         self,
```