Video2Layout is a framework for reconstructing metric-grounded spatial layouts from video. It leverages continuous object boundary coordinates to quantify inter-object physical distances and object sizes, equipping the model with quantitative spatial computation capabilities. This effectively mitigates the inherent ambiguity in describing spatial relationships through natural language. Additionally, the framework adopts a two-stage SFT-to-RL training paradigm, which enhances the model's spatial reasoning abilities.
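As a minimal sketch of this quantitative spatial computation (the coordinate layout and helper names below are illustrative assumptions, not the repository's actual API), object sizes and inter-object distances can be computed directly from continuous boundary coordinates on a reconstructed layout:

```python
import math

def object_size(box):
    """Width and depth of an object from boundary coordinates (x_min, y_min, x_max, y_max), in meters."""
    x_min, y_min, x_max, y_max = box
    return x_max - x_min, y_max - y_min

def center_distance(box_a, box_b):
    """Euclidean distance between two object centers on the bird's-eye-view plane."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by)

# Hypothetical objects on a reconstructed metric-grounded layout (coordinates in meters)
sofa  = (1.0, 0.5, 3.0, 1.5)
table = (4.0, 2.0, 5.0, 3.0)
print(object_size(sofa))             # (2.0, 1.0)
print(center_distance(sofa, table))  # ~2.92 m
```

Because the coordinates are metric-grounded, such arithmetic yields explicit distances rather than vague relational phrases like "near" or "next to".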
🎯 Key Benefits:
- Metric-Grounded Cognitive Map — an accurate bird's-eye view that reflects the precise position of each object in the scene
- Spatial reasoning computation — rigorous mathematical calculation replaces the fuzziness of spatial-relationship reasoning expressed through traditional natural-language CoT descriptions
- Generalization to real scenes — training relies solely on simulation data, with no annotations required from real-world scenes
QVS-Bench is a diagnostic benchmark for systematically evaluating how the number of image inputs affects spatial reasoning accuracy. Its samples are distributed roughly uniformly across five input-scale configurations (1, 4, 8, 12, and 16 frames), enabling a fair, unbiased analysis of how frame count influences model performance.
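For illustration, here is a minimal sketch of how a video might be subsampled to each of the five input-scale configurations; the frame counts come from the benchmark description above, while the uniform-sampling helper is an assumption rather than QVS-Bench's released code:

```python
FRAME_CONFIGS = [1, 4, 8, 12, 16]  # input-scale configurations evaluated by QVS-Bench

def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Uniformly select `num_frames` frame indices from a video of `total_frames` frames."""
    if num_frames == 1:
        return [total_frames // 2]  # single-image case: take the middle frame
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

# Example: build the five evaluation inputs for a 300-frame video
for n in FRAME_CONFIGS:
    print(n, sample_frame_indices(300, n))
```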
```bash
conda create -n v2lo python=3.10 -y
conda activate v2lo
pip install -r requirements.txt
```

```bash
# SFT training
bash src/ms-swift/sft.sh
# Merge model
bash src/ms-swift/merge_lora.sh
# RL training
bash src/EasyR1/examples/rl.sh
# Merge model
cd src/EasyR1
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

If you find our work useful for your research, please consider citing:
```bibtex
@misc{2511.16160,
  Author = {Yibin Huang and Wang Xu and Wanyue Zhang and Helu Zhi and Jingjing Huang and Yangbin Xu and Yangang Sun and Conghui Zhu and Tiejun Zhao},
  Title = {Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning},
  Year = {2025},
  Eprint = {arXiv:2511.16160},
}
```

- The training code is built on ms-swift and EasyR1.
- The base model is Qwen2.5-VL-7B-Instruct.


