[docs][data.llm] simplify / add ray data.llm quickstart example #58330
doc_code/working-with-llms/minimal_quickstart.py (new file):

@@ -0,0 +1,59 @@
"""
Quickstart: vLLM + Ray Data batch inference.

1. Installation
2. Dataset creation
3. Processor configuration
4. Running inference
5. Getting results
"""

# __minimal_vllm_quickstart_start__
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Initialize Ray
ray.init()

# Simple dataset
ds = ray.data.from_items([
    {"prompt": "What is machine learning?"},
    {"prompt": "Explain neural networks in one sentence."},
])

# Minimal vLLM configuration
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=1,  # 1 vLLM engine replica
    batch_size=32,  # 32 samples per batch
)

# Build processor
# preprocess: converts input row to format expected by vLLM (OpenAI chat format)
# postprocess: extracts generated text from vLLM output
processor = build_llm_processor(
    config,
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
    },
    postprocess=lambda row: {
        "prompt": row["prompt"],
        "response": row["generated_text"],
    },
)
# Run inference
ds = processor(ds)

# Iterate through the results
for result in ds.iter_rows():
    print(f"Q: {result['prompt']}")
    print(f"A: {result['response']}\n")

# Alternative ways to get results:
# results = ds.take(10)  # Get first 10 results
# ds.show(limit=5)  # Print first 5 results
# ds.write_parquet("output.parquet")  # Save to file
# __minimal_vllm_quickstart_end__
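The `preprocess` and `postprocess` hooks in this file are ordinary row-to-row functions. A plain-Python sketch of what they compute, with no Ray dependency (the sample response string is made up):

```python
def preprocess(row: dict) -> dict:
    # Convert a {"prompt": ...} row into OpenAI chat format plus sampling params
    return {
        "messages": [{"role": "user", "content": row["prompt"]}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 100},
    }


def postprocess(row: dict) -> dict:
    # Keep the original prompt alongside the engine's generated text
    return {"prompt": row["prompt"], "response": row["generated_text"]}


pre = preprocess({"prompt": "What is machine learning?"})
post = postprocess({
    "prompt": "What is machine learning?",
    "generated_text": "A field of AI that learns patterns from data.",
})
```

In the real pipeline, vLLM fills in the `generated_text` column between these two hooks; here both transformations are shown on hand-written dicts only.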
@@ -7,30 +7,57 @@ The :ref:`ray.data.llm <llm-ref>` module integrates with key large language mode
This guide shows you how to use :ref:`ray.data.llm <llm-ref>` to:

* :ref:`Quickstart: vLLM batch inference <vllm_quickstart>`
* :ref:`Perform batch inference with LLMs <batch_inference_llm>`
* :ref:`Configure vLLM for LLM inference <vllm_llm>`
* :ref:`Batch inference with embedding models <embedding_models>`
* :ref:`Query deployed models with an OpenAI compatible API endpoint <openai_compatible_api_endpoint>`

.. _vllm_quickstart:

Quickstart: vLLM batch inference
--------------------------------

Get started with vLLM batch inference in just a few steps. This example shows the minimal setup needed to run batch inference on a dataset.

.. note::
   This quickstart requires a GPU because vLLM is GPU-accelerated.

First, install Ray Data with LLM support:

.. code-block:: bash

    pip install -U "ray[data, llm]>=2.49.1"

Here's a complete minimal example that runs batch inference:

.. literalinclude:: doc_code/working-with-llms/minimal_quickstart.py
   :language: python
   :start-after: __minimal_vllm_quickstart_start__
   :end-before: __minimal_vllm_quickstart_end__

This example:

1. Creates a simple dataset with prompts
2. Configures a vLLM processor with minimal settings
3. Builds a processor that handles preprocessing (converting prompts to OpenAI chat format) and postprocessing (extracting the generated text)
4. Runs inference on the dataset
5. Iterates through the results

The processor expects input rows with a ``prompt`` field and outputs rows with both ``prompt`` and ``response`` fields. You can consume results with ``iter_rows()``, ``take()``, or ``show()``, or save them to files with ``write_parquet()``.
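The row shape involved can be sketched with plain dicts standing in for the processor's output (values are made up; real rows come from running the processor on a GPU-backed cluster):

```python
# Stand-in rows with the shape the processor produces (illustrative values)
rows = [
    {"prompt": "What is machine learning?", "response": "A branch of AI."},
    {"prompt": "Explain neural networks.", "response": "Layered function approximators."},
]

# iter_rows()-style consumption: handle one dict at a time
lines = [f"Q: {r['prompt']} / A: {r['response']}" for r in rows]

# take(n)-style consumption: materialize the first n rows as a list
first = rows[:1]
```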
For more configuration options and advanced features, see the sections below.

.. _batch_inference_llm:
Comment on lines +46 to 51

Contributor: Can you also deduplicate the content from the section below? Feels like there's some redundancy, like the installation and basic explanation of the configuration.

Author: done + sglang engine pointer added
Perform batch inference with LLMs
---------------------------------

At a high level, the :ref:`ray.data.llm <llm-ref>` module provides a :class:`Processor <ray.data.llm.Processor>` object which encapsulates
logic for performing batch inference with LLMs on a Ray Data dataset.

You can use the :func:`build_llm_processor <ray.data.llm.build_llm_processor>` API to construct a processor.
The following example uses the :class:`vLLMEngineProcessorConfig <ray.data.llm.vLLMEngineProcessorConfig>` to construct a processor for the `unsloth/Llama-3.1-8B-Instruct` model.
Upon execution, the Processor object instantiates replicas of the vLLM engine (using :meth:`map_batches <ray.data.Dataset.map_batches>` under the hood).

.. .. literalinclude:: doc_code/working-with-llms/basic_llm_example.py
@@ -52,14 +79,7 @@ The configuration includes detailed comments explaining:

- **`max_num_batched_tokens`**: Maximum tokens processed simultaneously (reduce if CUDA OOM occurs)
- **`accelerator_type`**: Specify GPU type for optimal resource allocation
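A sketch of how these knobs might appear in a config (assumptions: `max_num_batched_tokens` is forwarded to the vLLM engine through `engine_kwargs`, and the GPU type and values shown are illustrative; running this requires `ray[data,llm]` and a GPU):

```python
from ray.data.llm import vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    accelerator_type="L4",  # pin a specific GPU type for scheduling
    engine_kwargs={
        # Lower this if you hit CUDA out-of-memory errors
        "max_num_batched_tokens": 4096,
    },
    concurrency=1,
    batch_size=32,
)
```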
The vLLM processor expects input in OpenAI chat format with a 'messages' column and outputs a 'generated_text' column containing model responses.

Some models may require a Hugging Face token to be specified. You can specify the token in the `runtime_env` argument.
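The token-passing pattern described above can be sketched as follows (the env-var plumbing is illustrative; running this requires `ray[data,llm]`, a GPU, and a valid `HF_TOKEN` in your environment):

```python
import os

from ray.data.llm import vLLMEngineProcessorConfig

# Forward the Hugging Face token to the engine workers via runtime_env
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    runtime_env={"env_vars": {"HF_TOKEN": os.environ.get("HF_TOKEN", "")}},
    concurrency=1,
    batch_size=32,
)
```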
Comment: It's a good practice in documentation examples to use ``ray.init(ignore_reinit_error=True)``. This prevents errors if a user runs the script multiple times in an interactive environment like a Jupyter notebook, where Ray might have already been initialized.