- Introduction: A framework leveraging code-as-tool and comprehensive SFT/RL datasets for "thinking with images".
- Features: Supports multi-turn agent loops for the Qwen2.5-VL and Qwen3-VL series.
- Datasets: Includes an SFT dataset constructed using GPT-5-High and an RL dataset covering diverse domains.
These cases demonstrate the agent's multi-turn and emergent tool usage.
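The multi-turn loop can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the `stub_model` policy, the `<code>...</code>` action format, and the message roles are all assumptions standing in for a real Qwen2.5-VL/Qwen3-VL endpoint operating on images.

```python
import io
import contextlib

def run_code(code: str) -> str:
    """Execute a Python snippet and capture stdout as the tool observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def stub_model(messages):
    """Hypothetical policy: emit a code action first, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return "<code>print(224 * 224)</code>"
    return "Final answer: " + messages[-1]["content"]

def agent_loop(question: str, max_turns: int = 4) -> str:
    """Alternate model turns and tool executions until the model stops acting."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = stub_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "<code>" in reply:
            code = reply.split("<code>")[1].split("</code>")[0]
            obs = run_code(code)  # tool call: run the emitted code
            messages.append({"role": "tool", "content": obs})
        else:
            return reply  # no tool call: the loop terminates
    return messages[-1]["content"]

print(agent_loop("How many pixels in a 224x224 crop?"))
# → Final answer: 50176
```

The key design point is that the environment feeds tool output back as a new message, so the model can condition each turn on all previous observations.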
For RL training, install the dependencies:

```bash
pip install "torch==2.8.0" "torchvision==0.23.0"

# vLLM >= 0.11.0 or SGLang >= 0.5.3 is required for Qwen3-VL series support.
# Pick one inference stack: vLLM OR SGLang (install only the one you need).
pip install vllm==0.11.0            # option 1: vLLM stack
pip install "sglang[all]==0.5.3"    # option 2: SGLang stack

# transformers >= 4.57.0 for Qwen3-VL series support
pip install transformers==4.57.0

# FlashAttention
pip install --no-cache-dir --use-pep517 flash-attn==2.8.3 --no-build-isolation

# Other dependencies
pip install -r requirements-runtime.txt
```

The construction pipeline of the SFT dataset: waiting for internal approval, coming soon.

For SFT training, first download the CodeVision-SFT dataset. We use LLaMA-Factory for SFT training. Update the config files `qwen2_5vl_full_sft.yaml` and `qwen3vl.yaml`, and update the data file path in `dataset_info.json`. Then run the following commands to launch training:

```bash
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen3vl.yaml
```
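LLaMA-Factory discovers datasets through `dataset_info.json`, so the downloaded data needs an entry there. A hypothetical entry is sketched below; the dataset key, file name, and column names are placeholders to adapt to the actual CodeVision-SFT layout.

```python
import json

# Hypothetical dataset_info.json entry registering the SFT data with
# LLaMA-Factory. "codevision_sft", the file name, and the column mapping
# are assumptions; match them to the downloaded dataset's actual fields.
entry = {
    "codevision_sft": {
        "file_name": "codevision_sft.json",   # path to the downloaded data
        "formatting": "sharegpt",             # multi-turn conversation format
        "columns": {
            "messages": "conversations",      # field holding the turn list
            "images": "images",               # field holding image paths
        },
    }
}
print(json.dumps(entry, indent=2))
```

Merging this dict into the existing `dataset_info.json` (rather than replacing the file) keeps the other registered datasets available.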




