vllm-FL

A vLLM plugin built on the FlagOS unified multi-chip backend.

Quick Start

Setup

  1. Install vLLM, either from the official v0.13.0 release or from the vllm-FL fork (this step is optional if a matching version is already installed).

  2. Install FlagGems

    2.1 Install Build Dependencies

    pip install -U scikit-build-core==0.11 pybind11 ninja cmake

    2.2 Install FlagGems

    git clone https://github.com/flagos-ai/FlagGems
    cd FlagGems
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
  3. Install FlagCX

    3.1 Clone the repository:

    git clone https://github.com/flagos-ai/FlagCX.git
    cd FlagCX
    git checkout v0.7.0
    git submodule update --init --recursive

    3.2 Build the library, using the flag that targets your platform:

    make USE_NVIDIA=1

    3.3 Set the environment variable

    export FLAGCX_PATH="$PWD"

    3.4 Install FlagCX

    cd plugin/torch/
    python setup.py develop --adaptor nvidia  # or --adaptor ascend, depending on your platform
  4. Install vllm-plugin-fl

    4.1 Clone the repository:

    git clone https://github.com/flagos-ai/vllm-plugin-FL

    4.2 Install

    cd vllm-plugin-FL
    pip install --no-build-isolation .
    # or editable install
    pip install --no-build-isolation -e .
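
After completing the steps above, a quick sanity check can confirm that the core pieces import cleanly. This is a minimal sketch; the FlagGems module name (flag_gems) is an assumption, so adjust it if your installation differs.

python -c "import vllm; print('vllm', vllm.__version__)"   # check the vLLM install
python -c "import flag_gems"                               # assumed FlagGems module name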

If multiple vLLM plugins are installed in the current environment, you can select vllm-plugin-FL explicitly by setting VLLM_PLUGINS='fl', for example:
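
export VLLM_PLUGINS='fl'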

Run a Task

Offline Batched Inference

With vLLM and vllm-plugin-FL installed, you can start generating text for a list of input prompts (i.e. offline batched inference). See the example script offline_inference, or use the Python script below directly.

from vllm import LLM, SamplingParams


if __name__ == '__main__':
    prompts = [
        "Hello, my name is",
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=10, temperature=0.0)
    # Create an LLM.
    llm = LLM(model="Qwen/Qwen3-4B", max_num_batched_tokens=16384, max_num_seqs=2048)
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Advanced Use

Using the native CUDA communication library

If you want to use the original CUDA communication backend instead of FlagCX, unset the following environment variable.

unset FLAGCX_PATH
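
To switch back to FlagCX later, point FLAGCX_PATH at your FlagCX checkout again, as in step 3.3 of the setup (the path below is illustrative):

export FLAGCX_PATH=/path/to/FlagCX   # example path; use your actual FlagCX checkout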

Using native CUDA operators

If you want to use the native CUDA operators instead of FlagGems, unset the following environment variable.

unset USE_FLAGGEMS
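
To turn the FlagGems operators back on, set USE_FLAGGEMS again; the value below is an assumption, so check what your environment or launcher expects:

export USE_FLAGGEMS=1   # assumed truthy value; the exact expected value may differ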
