Merged
1 change: 1 addition & 0 deletions doc/source/data/api/llm.rst
@@ -34,3 +34,4 @@ Processor configs
~ProcessorConfig
~HttpRequestProcessorConfig
~vLLMEngineProcessorConfig
~SGLangEngineProcessorConfig
doc/source/data/doc_code/working-with-llms/minimal_quickstart.py (new file)
@@ -0,0 +1,59 @@
"""
Quickstart: vLLM + Ray Data batch inference.

1. Installation
2. Dataset creation
3. Processor configuration
4. Running inference
5. Getting results
"""

# __minimal_vllm_quickstart_start__
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Initialize Ray
ray.init()
Contributor comment (medium):

    It's a good practice in documentation examples to use ray.init(ignore_reinit_error=True). This prevents errors if a user runs the script multiple times in an interactive environment like a Jupyter notebook, where Ray might have already been initialized.

    Suggested change:
    -ray.init()
    +ray.init(ignore_reinit_error=True)

# Create a simple dataset
ds = ray.data.from_items([
    {"prompt": "What is machine learning?"},
    {"prompt": "Explain neural networks in one sentence."},
])

# Minimal vLLM configuration
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,  # 1 vLLM engine replica
    batch_size=32,  # 32 samples per batch
Contributor comment (medium):

    A batch_size of 32 might be too large for some GPUs when running an 8B model, potentially leading to out-of-memory errors. For a quickstart example, it's safer to start with a smaller batch size, for example 16, and let users increase it if their hardware allows.

    Suggested change:
    -batch_size=32,  # 32 samples per batch
    +batch_size=16,  # 16 samples per batch
)

# Build the processor.
# preprocess: converts each input row to the format expected by vLLM (OpenAI chat format)
# postprocess: extracts the generated text from the vLLM output
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
        "prompt": row["prompt"],  # carried through so postprocess can access it
    },
    postprocess=lambda row: {
        "prompt": row["prompt"],
        "response": row["generated_text"],
    },
)
Comment:

    Bug: Missing "prompt" Field Causes Processing Error

    The postprocess function accesses row["prompt"], but the preprocess function doesn't preserve the original "prompt" field from the input row. This causes a KeyError because the "prompt" field is lost during preprocessing.
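The field passthrough flagged above can be checked with plain dicts, with no GPU or Ray required. The sketch below simulates the processor's row flow; the `generated_text` value is a made-up stand-in for real engine output.

```python
# Simulate the processor's row flow to show why preprocess must
# preserve the "prompt" field for postprocess to read later.

def preprocess(row):
    return {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
        "prompt": row["prompt"],  # preserved for postprocess
    }

def postprocess(row):
    return {"prompt": row["prompt"], "response": row["generated_text"]}

row = preprocess({"prompt": "What is machine learning?"})
row["generated_text"] = "A field of AI that learns from data."  # stand-in engine output
result = postprocess(row)
print(result["prompt"])  # the original prompt survives the round trip
```

Without the `"prompt": row["prompt"]` line in preprocess, the final line raises a KeyError.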


# inference
ds = processor(ds)

# Iterate through the results
for result in ds.iter_rows():
    print(f"Q: {result['prompt']}")
    print(f"A: {result['response']}\n")

# Alternative ways to get results:
# results = ds.take(10) # Get first 10 results
# ds.show(limit=5) # Print first 5 results
# ds.write_parquet("output.parquet") # Save to file
# __minimal_vllm_quickstart_end__

54 changes: 37 additions & 17 deletions doc/source/data/working-with-llms.rst
@@ -7,30 +7,57 @@ The :ref:`ray.data.llm <llm-ref>` module integrates with key large language mode

This guide shows you how to use :ref:`ray.data.llm <llm-ref>` to:

* :ref:`Quickstart: vLLM batch inference <vllm_quickstart>`
* :ref:`Perform batch inference with LLMs <batch_inference_llm>`
* :ref:`Configure vLLM for LLM inference <vllm_llm>`
* :ref:`Batch inference with embedding models <embedding_models>`
* :ref:`Query deployed models with an OpenAI compatible API endpoint <openai_compatible_api_endpoint>`

.. _batch_inference_llm:
.. _vllm_quickstart:

-Perform batch inference with LLMs
+Quickstart: vLLM batch inference
 ---------------------------------

-At a high level, the :ref:`ray.data.llm <llm-ref>` module provides a :class:`Processor <ray.data.llm.Processor>` object which encapsulates
-logic for performing batch inference with LLMs on a Ray Data dataset.
+Get started with vLLM batch inference in just a few steps. This example shows the minimal setup needed to run batch inference on a dataset.

-You can use the :func:`build_llm_processor <ray.data.llm.build_llm_processor>` API to construct a processor.
-The following example uses the :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.
+.. note::
+    This quickstart requires a GPU as vLLM is GPU-accelerated.

-To start, install Ray Data + LLMs. This also installs vLLM, which is a popular and optimized LLM inference engine.
+First, install Ray Data with LLM support:

 .. code-block:: bash

     pip install -U "ray[data, llm]>=2.49.1"

-The :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` is a configuration object for the vLLM engine.
-It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations.
+Here's a complete minimal example that runs batch inference:

.. literalinclude:: doc_code/working-with-llms/minimal_quickstart.py
:language: python
:start-after: __minimal_vllm_quickstart_start__
:end-before: __minimal_vllm_quickstart_end__

This example:

1. Creates a simple dataset with prompts
2. Configures a vLLM processor with minimal settings
3. Builds a processor that handles preprocessing (converting prompts to OpenAI chat format) and postprocessing (extracting generated text)
4. Runs inference on the dataset
5. Iterates through results

The processor expects input rows with a ``prompt`` field and outputs rows with both ``prompt`` and ``response`` fields. You can consume results using ``iter_rows()``, ``take()``, ``show()``, or save to files with ``write_parquet()``.

For more configuration options and advanced features, see the sections below.

.. _batch_inference_llm:
Comment on lines +46 to 51

Contributor:

    Can you also deduplicate the content from the section below?

    https://anyscale-ray--58330.com.readthedocs.build/en/58330/data/working-with-llms.html#perform-batch-inference-with-llms

    Feels like there's some redundancy, like the installation and the basic explanation of the configuration.

Contributor (author):

    Done, and added an SGLang engine pointer.


Perform batch inference with LLMs
---------------------------------

At a high level, the :ref:`ray.data.llm <llm-ref>` module provides a :class:`Processor <ray.data.llm.Processor>` object which encapsulates
logic for performing batch inference with LLMs on a Ray Data dataset.

You can use the :func:`build_llm_processor <ray.data.llm.build_llm_processor>` API to construct a processor.
The following example uses the :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.
Upon execution, the Processor object instantiates replicas of the vLLM engine (using :meth:`map_batches <ray.data.Dataset.map_batches>` under the hood).

.. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py
@@ -52,14 +79,7 @@ The configuration includes detailed comments explaining:
- **`max_num_batched_tokens`**: Maximum tokens processed simultaneously (reduce if CUDA OOM occurs)
- **`accelerator_type`**: Specify GPU type for optimal resource allocation
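
A hedged sketch of a configuration using the parameters above. The values and the GPU type are illustrative assumptions, not recommendations; check the `vLLMEngineProcessorConfig` reference for your Ray version.

```python
# Illustrative configuration sketch; values are assumptions to adapt
# to your hardware, not tuned recommendations.
from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={"max_num_batched_tokens": 4096},  # reduce if CUDA OOM occurs
    accelerator_type="L4",  # assumed GPU type; set to your hardware
    concurrency=1,
    batch_size=16,
)
```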

-Each processor requires specific input columns based on the model and configuration. The vLLM processor expects input in OpenAI chat format with a 'messages' column.
-
-This basic configuration pattern is used throughout this guide and includes helpful comments explaining key parameters.
-
-This configuration creates a processor that expects:
-
-- **Input**: Dataset with 'messages' column (OpenAI chat format)
-- **Output**: Dataset with 'generated_text' column containing model responses
+The vLLM processor expects input in OpenAI chat format with a 'messages' column and outputs a 'generated_text' column containing model responses.
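
The column contract can be illustrated with plain dicts. The field names come from the text above; the content strings are made up.

```python
# Shape of one input row and one output row for the vLLM processor.
# The "generated_text" value is a made-up stand-in for a model response.
input_row = {
    "messages": [  # OpenAI chat format
        {"role": "user", "content": "What is machine learning?"},
    ]
}
output_row = {**input_row, "generated_text": "Machine learning is ..."}

print(sorted(output_row))  # ['generated_text', 'messages']
```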

Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.
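
A hedged sketch of passing a token through `runtime_env`. The `HF_TOKEN` environment variable name and the gated model name are assumptions; adapt them to your setup.

```python
# Sketch: supply a Hugging Face token via runtime_env so engine replicas
# can download gated models. HF_TOKEN must be set in the driver's environment.
import os

from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # assumed gated model
    runtime_env={"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}},
)
```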
