- Introduction: A framework leveraging code-as-tool and comprehensive SFT/RL datasets for "thinking with images".
- Features: Supports multi-turn agent loops for the Qwen2.5-VL and Qwen3-VL series.
- Datasets: Includes an SFT dataset constructed using GPT-5-High and an RL dataset covering diverse domains.
These cases demonstrate the agent's multi-turn and emergent tool usage.
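The multi-turn loop can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the `stub_model` policy, the `<code>...</code>` action format, and the message roles are all assumptions standing in for a real Qwen2.5-VL/Qwen3-VL endpoint operating on images.

```python
import io
import contextlib

def run_code(code: str) -> str:
    """Execute a Python snippet and capture stdout as the tool observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def stub_model(messages):
    """Hypothetical policy: emit a code action first, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return "<code>print(224 * 224)</code>"
    return "Final answer: " + messages[-1]["content"]

def agent_loop(question: str, max_turns: int = 4) -> str:
    """Alternate model turns and tool executions until the model stops acting."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = stub_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "<code>" in reply:
            code = reply.split("<code>")[1].split("</code>")[0]
            obs = run_code(code)  # tool call: run the emitted code
            messages.append({"role": "tool", "content": obs})
        else:
            return reply  # no tool call: the loop terminates
    return messages[-1]["content"]

print(agent_loop("How many pixels in a 224x224 crop?"))
# → Final answer: 50176
```

The key design point is that the environment feeds tool output back as a new message, so the model can condition each turn on all previous observations.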
For RL training, install the dependencies:

```bash
pip install "torch==2.8.0" "torchvision==0.23.0"

# vLLM >= 0.11.0 or SGLang >= 0.5.3 is required for Qwen3-VL series support.
# Pick one inference stack: vLLM OR SGLang (install only the one you need).
pip install vllm==0.11.0            # option 1: vLLM stack
pip install "sglang[all]==0.5.3"    # option 2: SGLang stack

# transformers >= 4.57.0 for Qwen3-VL series support
pip install transformers==4.57.0

# FlashAttention
pip install --no-cache-dir --use-pep517 flash-attn==2.8.3 --no-build-isolation

# Other dependencies
pip install -r requirements-runtime.txt
```

The construction pipeline of the SFT dataset: waiting for internal approval, coming soon.

For SFT training, first download the CodeVision-SFT dataset. We use LLaMA-Factory for SFT training. Update the config files `qwen2_5vl_full_sft.yaml` and `qwen3vl.yaml`, and update the data file path in `dataset_info.json`. Then run the following commands to launch training:

```bash
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen3vl.yaml
```
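LLaMA-Factory discovers datasets through `dataset_info.json`, so the downloaded data needs an entry there. A hypothetical entry is sketched below; the dataset key, file name, and column names are placeholders to adapt to the actual CodeVision-SFT layout.

```python
import json

# Hypothetical dataset_info.json entry registering the SFT data with
# LLaMA-Factory. "codevision_sft", the file name, and the column mapping
# are assumptions; match them to the downloaded dataset's actual fields.
entry = {
    "codevision_sft": {
        "file_name": "codevision_sft.json",   # path to the downloaded data
        "formatting": "sharegpt",             # multi-turn conversation format
        "columns": {
            "messages": "conversations",      # field holding the turn list
            "images": "images",               # field holding image paths
        },
    }
}
print(json.dumps(entry, indent=2))
```

Merging this dict into the existing `dataset_info.json` (rather than replacing the file) keeps the other registered datasets available.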




