vllm-FL

A vLLM plugin built on the FlagOS unified multi-chip backend.

Quick Start

Setup

  1. Install vLLM, either from the official v0.13.0 release or from the vllm-FL fork (this step is optional if a matching version is already installed).

  2. Install FlagGems

    2.1 Install Build Dependencies

    pip install -U scikit-build-core==0.11 pybind11 ninja cmake

    2.2 Install FlagGems

    git clone https://github.com/flagos-ai/FlagGems
    cd FlagGems
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
  3. Install FlagCX

    3.1 Clone the repository:

    git clone https://github.com/flagos-ai/FlagCX.git
    cd FlagCX
    git checkout v0.7.0
    git submodule update --init --recursive

    3.2 Build the library, using the flag that targets your platform:

    make USE_NVIDIA=1

    3.3 Set the environment variable

    export FLAGCX_PATH="$PWD"

    3.4 Install FlagCX

    cd plugin/torch/
    python setup.py develop --adaptor nvidia  # or --adaptor ascend, depending on your platform
  4. Install vllm-plugin-fl

    4.1 Clone the repository:

    git clone https://github.com/flagos-ai/vllm-plugin-FL

    4.2 Install

    cd vllm-plugin-FL
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
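
After completing the steps above, a quick sanity check can confirm that the core pieces import cleanly. This is a minimal sketch; the FlagGems module name (flag_gems) is an assumption, so adjust it if your installation differs.

python -c "import vllm; print('vllm', vllm.__version__)"   # check the vLLM install
python -c "import flag_gems"                               # assumed FlagGems module name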

If multiple vLLM plugins are installed in the current environment, you can select vllm-plugin-FL explicitly by setting VLLM_PLUGINS='fl', for example:
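
export VLLM_PLUGINS='fl'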

Run a Task

Offline Batched Inference

With vLLM and vllm-plugin-FL installed, you can start generating text for a list of input prompts (i.e. offline batched inference). See the example script offline_inference, or use the Python script below directly.

from vllm import LLM, SamplingParams


if __name__ == '__main__':
    prompts = [
        "Hello, my name is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="Qwen/Qwen3-4B", max_num_batched_tokens=16384, max_num_seqs=2048)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Advanced Use

Using the native CUDA communication library

If you want to use the original CUDA communication backend instead of FlagCX, unset the following environment variable.

unset FLAGCX_PATH
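
To switch back to FlagCX later, point FLAGCX_PATH at your FlagCX checkout again, as in step 3.3 of the setup (the path below is illustrative):

export FLAGCX_PATH=/path/to/FlagCX   # example path; use your actual FlagCX checkout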

Using native CUDA operators

If you want to use the native CUDA operators instead of FlagGems, unset the following environment variable.

unset USE_FLAGGEMS
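
To turn the FlagGems operators back on, set USE_FLAGGEMS again; the value below is an assumption, so check what your environment or launcher expects:

export USE_FLAGGEMS=1   # assumed truthy value; the exact expected value may differ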
