Merged
1 change: 1 addition & 0 deletions doc/source/data/api/llm.rst
@@ -34,3 +34,4 @@ Processor configs
~ProcessorConfig
~HttpRequestProcessorConfig
~vLLMEngineProcessorConfig
~SGLangEngineProcessorConfig
doc/source/data/doc_code/working-with-llms/minimal_quickstart.py (new file)
@@ -0,0 +1,59 @@
"""
Quickstart: vLLM + Ray Data batch inference.

1. Installation
2. Dataset creation
3. Processor configuration
4. Running inference
5. Getting results
"""

# __minimal_vllm_quickstart_start__
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Initialize Ray
ray.init()
Contributor comment (medium):

    It's a good practice in documentation examples to use ray.init(ignore_reinit_error=True). This prevents errors if a user runs the script multiple times in an interactive environment like a Jupyter notebook, where Ray might have already been initialized.

    Suggested change:
    -ray.init()
    +ray.init(ignore_reinit_error=True)

# Create a simple dataset
ds = ray.data.from_items([
    {"prompt": "What is machine learning?"},
    {"prompt": "Explain neural networks in one sentence."},
])

# Minimal vLLM configuration
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,  # 1 vLLM engine replica
    batch_size=32,  # 32 samples per batch
Contributor comment (medium):

    A batch_size of 32 might be too large for some GPUs when running an 8B model, potentially leading to out-of-memory errors. For a quickstart example, it's safer to start with a smaller batch size, for example 16, and let users increase it if their hardware allows.

    Suggested change:
    -batch_size=32,  # 32 samples per batch
    +batch_size=16,  # 16 samples per batch
)

# Build the processor.
# preprocess: converts each input row to the format expected by vLLM (OpenAI chat format)
# postprocess: extracts the generated text from the vLLM output
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
        "prompt": row["prompt"],  # carried through so postprocess can access it
    },
    postprocess=lambda row: {
        "prompt": row["prompt"],
        "response": row["generated_text"],
    },
)
Comment:

    Bug: Missing "prompt" Field Causes Processing Error

    The postprocess function accesses row["prompt"], but the preprocess function doesn't preserve the original "prompt" field from the input row. This causes a KeyError because the "prompt" field is lost during preprocessing.
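The field passthrough flagged above can be checked with plain dicts, with no GPU or Ray required. The sketch below simulates the processor's row flow; the `generated_text` value is a made-up stand-in for real engine output.

```python
# Simulate the processor's row flow to show why preprocess must
# preserve the "prompt" field for postprocess to read later.

def preprocess(row):
    return {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
        "prompt": row["prompt"],  # preserved for postprocess
    }

def postprocess(row):
    return {"prompt": row["prompt"], "response": row["generated_text"]}

row = preprocess({"prompt": "What is machine learning?"})
row["generated_text"] = "A field of AI that learns from data."  # stand-in engine output
result = postprocess(row)
print(result["prompt"])  # the original prompt survives the round trip
```

Without the `"prompt": row["prompt"]` line in preprocess, the final line raises a KeyError.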


# inference
ds = processor(ds)

# Iterate through the results
for result in ds.iter_rows():
    print(f"Q: {result['prompt']}")
    print(f"A: {result['response']}\n")

# Alternative ways to get results:
# results = ds.take(10) # Get first 10 results
# ds.show(limit=5) # Print first 5 results
# ds.write_parquet("output.parquet") # Save to file
# __minimal_vllm_quickstart_end__

54 changes: 37 additions & 17 deletions doc/source/data/working-with-llms.rst
@@ -7,30 +7,57 @@ The :ref:`ray.data.llm <llm-ref>` module integrates with key large language mode

This guide shows you how to use :ref:`ray.data.llm <llm-ref>` to:

* :ref:`Quickstart: vLLM batch inference <vllm_quickstart>`
* :ref:`Perform batch inference with LLMs <batch_inference_llm>`
* :ref:`Configure vLLM for LLM inference <vllm_llm>`
* :ref:`Batch inference with embedding models <embedding_models>`
* :ref:`Query deployed models with an OpenAI compatible API endpoint <openai_compatible_api_endpoint>`

.. _batch_inference_llm:
.. _vllm_quickstart:

-Perform batch inference with LLMs
+Quickstart: vLLM batch inference
 ---------------------------------

-At a high level, the :ref:`ray.data.llm <llm-ref>` module provides a :class:`Processor <ray.data.llm.Processor>` object which encapsulates
-logic for performing batch inference with LLMs on a Ray Data dataset.
+Get started with vLLM batch inference in just a few steps. This example shows the minimal setup needed to run batch inference on a dataset.

-You can use the :func:`build_llm_processor <ray.data.llm.build_llm_processor>` API to construct a processor.
-The following example uses the :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.
+.. note::
+    This quickstart requires a GPU as vLLM is GPU-accelerated.

-To start, install Ray Data + LLMs. This also installs vLLM, which is a popular and optimized LLM inference engine.
+First, install Ray Data with LLM support:

 .. code-block:: bash

     pip install -U "ray[data, llm]>=2.49.1"

-The :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` is a configuration object for the vLLM engine.
-It contains the model name, the number of GPUs to use, and the number of shards to use, along with other vLLM engine configurations.
+Here's a complete minimal example that runs batch inference:

.. literalinclude:: doc_code/working-with-llms/minimal_quickstart.py
:language: python
:start-after: __minimal_vllm_quickstart_start__
:end-before: __minimal_vllm_quickstart_end__

This example:

1. Creates a simple dataset with prompts
2. Configures a vLLM processor with minimal settings
3. Builds a processor that handles preprocessing (converting prompts to OpenAI chat format) and postprocessing (extracting generated text)
4. Runs inference on the dataset
5. Iterates through results

The processor expects input rows with a ``prompt`` field and outputs rows with both ``prompt`` and ``response`` fields. You can consume results using ``iter_rows()``, ``take()``, ``show()``, or save to files with ``write_parquet()``.

For more configuration options and advanced features, see the sections below.

.. _batch_inference_llm:
Comment on lines +46 to 51

Contributor:

    Can you also deduplicate the content from the section below?

    https://anyscale-ray--58330.com.readthedocs.build/en/58330/data/working-with-llms.html#perform-batch-inference-with-llms

    Feels like there's some redundancy, like the installation and the basic explanation of the configuration.

Contributor (author):

    Done, and added an SGLang engine pointer.


Perform batch inference with LLMs
---------------------------------

At a high level, the :ref:`ray.data.llm <llm-ref>` module provides a :class:`Processor <ray.data.llm.Processor>` object which encapsulates
logic for performing batch inference with LLMs on a Ray Data dataset.

You can use the :func:`build_llm_processor <ray.data.llm.build_llm_processor>` API to construct a processor.
The following example uses the :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.
Upon execution, the Processor object instantiates replicas of the vLLM engine (using :meth:`map_batches <ray.data.Dataset.map_batches>` under the hood).

.. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py
@@ -52,14 +79,7 @@ The configuration includes detailed comments explaining:
- **`max_num_batched_tokens`**: Maximum tokens processed simultaneously (reduce if CUDA OOM occurs)
- **`accelerator_type`**: Specify GPU type for optimal resource allocation
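
A hedged sketch of a configuration using the parameters above. The values and the GPU type are illustrative assumptions, not recommendations; check the `vLLMEngineProcessorConfig` reference for your Ray version.

```python
# Illustrative configuration sketch; values are assumptions to adapt
# to your hardware, not tuned recommendations.
from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={"max_num_batched_tokens": 4096},  # reduce if CUDA OOM occurs
    accelerator_type="L4",  # assumed GPU type; set to your hardware
    concurrency=1,
    batch_size=16,
)
```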

-Each processor requires specific input columns based on the model and configuration. The vLLM processor expects input in OpenAI chat format with a 'messages' column.
-
-This basic configuration pattern is used throughout this guide and includes helpful comments explaining key parameters.
-
-This configuration creates a processor that expects:
-
-- **Input**: Dataset with 'messages' column (OpenAI chat format)
-- **Output**: Dataset with 'generated_text' column containing model responses
+The vLLM processor expects input in OpenAI chat format with a 'messages' column and outputs a 'generated_text' column containing model responses.
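
The column contract can be illustrated with plain dicts. The field names come from the text above; the content strings are made up.

```python
# Shape of one input row and one output row for the vLLM processor.
# The "generated_text" value is a made-up stand-in for a model response.
input_row = {
    "messages": [  # OpenAI chat format
        {"role": "user", "content": "What is machine learning?"},
    ]
}
output_row = {**input_row, "generated_text": "Machine learning is ..."}

print(sorted(output_row))  # ['generated_text', 'messages']
```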

Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.
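
A hedged sketch of passing a token through `runtime_env`. The `HF_TOKEN` environment variable name and the gated model name are assumptions; adapt them to your setup.

```python
# Sketch: supply a Hugging Face token via runtime_env so engine replicas
# can download gated models. HF_TOKEN must be set in the driver's environment.
import os

from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",  # assumed gated model
    runtime_env={"env_vars": {"HF_TOKEN": os.environ["HF_TOKEN"]}},
)
```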
