ByteDance-BandAI/CodeVision


Thinking with Programming Vision: Towards a Unified View for Thinking with Images


Overview

  • Introduction: A framework that combines code-as-tool execution with comprehensive SFT/RL datasets for "thinking with images".

  • Features: Supports multi-turn agent loops for the Qwen2.5-VL and Qwen3-VL model series.

  • Datasets: Includes an SFT dataset constructed with GPT-5-High and an RL dataset covering diverse domains.
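The multi-turn "code-as-tool" loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `call_model`, `run_code`, and `toy_model` are hypothetical stand-ins (in the real framework the model would be a Qwen2.5-VL/Qwen3-VL checkpoint and the code would run in a sandbox over images).

```python
import re

# Pattern for a fenced Python tool call emitted by the model
# (an assumed convention for this sketch).
CODE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_code(snippet):
    """Execute a tool-call snippet and return its `result` variable as text."""
    scope = {}
    exec(snippet, {}, scope)  # in the real system: a proper sandbox
    return str(scope.get("result", ""))

def agent_loop(call_model, question, max_turns=4):
    """Alternate model turns and tool executions until a plain-text answer."""
    history = [("user", question)]
    reply = ""
    for _ in range(max_turns):
        reply = call_model(history)
        match = CODE_RE.search(reply)
        if match is None:              # no tool call -> treat as final answer
            return reply
        observation = run_code(match.group(1))
        history += [("assistant", reply), ("tool", observation)]
    return reply

# Toy model: first turn issues a code tool call, second turn uses the result.
def toy_model(history):
    if history[-1][0] == "tool":
        return f"The crop is {history[-1][1]} pixels wide."
    return "```python\nresult = 224 * 2\n```"

print(agent_loop(toy_model, "How wide is the zoomed crop?"))
```

The loop terminates either when the model replies without a tool call or when `max_turns` is exhausted; the tool observation is appended to the history so the next model turn can condition on it.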

(Figure: framework overview)

Cases

These cases show the multi-turn and emergent tool usage of the agent.

Case 1

(Figure: Case 1 demo)

Case 2

(Figure: Case 2 demo)

Getting Started

Environment Setup

For RL training:

pip install "torch==2.8.0" "torchvision==0.23.0"

# vllm >= 0.11.0 or sglang >= 0.5.3 for Qwen3-VL series support
# Pick one stack: vLLM OR SGLang (install the one you need)
pip install vllm==0.11.0          # option 1: vLLM stack
pip install "sglang[all]==0.5.3"  # option 2: SGLang stack

# transformers >= 4.57.0 for Qwen3-VL series support
pip install transformers==4.57.0

# FlashAttention
pip install --no-cache-dir --use-pep517 flash-attn==2.8.3 --no-build-isolation

# Other dependencies
pip install -r requirements-runtime.txt
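A quick way to check the version floors quoted above is a small comparison helper. This is an illustrative sketch using pure-Python version parsing so it runs without the packages installed; in a live environment you could feed it `importlib.metadata.version("torch")` and friends.

```python
# Minimum versions quoted in the setup instructions above.
MINIMUMS = {
    "torch": "2.8.0",
    "transformers": "4.57.0",   # Qwen3-VL series support
    "vllm": "0.11.0",           # or sglang >= 0.5.3 instead
}

def as_tuple(version):
    """'4.57.0' -> (4, 57, 0); non-numeric suffixes in a piece are ignored."""
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)

def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("4.57.1", MINIMUMS["transformers"]))  # True
print(meets_minimum("4.56.0", MINIMUMS["transformers"]))  # False
```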

Running

Stage 1: SFT

The construction pipeline of the SFT dataset:

(Figure: SFT data construction pipeline)

First, download the CodeVision-SFT dataset. We use LLaMA-Factory for SFT training. Update the config files qwen2_5vl_full_sft.yaml and qwen3vl.yaml, set the data file path in dataset_info.json, and then launch training with the following commands:

cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/qwen3vl.yaml
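For the dataset_info.json update, an entry along these lines registers the dataset with LLaMA-Factory. The dataset key and file name below are placeholders, and the exact schema varies across LLaMA-Factory versions, so check the data/README in your checkout before copying this:

```json
{
  "codevision_sft": {
    "file_name": "codevision_sft.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "images"
    }
  }
}
```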

Stage 2: RL

The RL training code and dataset are awaiting internal approval and will be released soon.
