Jiaqi Liao†1, Zhengyuan Yang1, Linjie Li1, Dianqi Li, Kevin Lin1, Yu Cheng2, Lijuan Wang1✉
1Microsoft, 2The Chinese University of Hong Kong
†Intern at Microsoft
Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process, called ImageGen-CoT, prior to image generation. To avoid generating unstructured and ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. To further improve performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach, which first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset yields a substantial 80% performance gain for SEED-X on T2I-ICL tasks.
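For illustration, here is a minimal sketch of the hybrid test-time scaling idea described above. All names in it (`generate_cot`, `generate_image`, `scorer`) are hypothetical placeholders, not this repository's API:

```python
# Hypothetical sketch of hybrid test-time scaling: sample several
# ImageGen-CoT chains, then several images per chain, and keep the
# image that a scorer ranks highest. Placeholder API, not the repo's.

def hybrid_scale(model, prompt, scorer, num_chains=4, images_per_chain=4):
    best_image, best_score = None, float("-inf")
    for _ in range(num_chains):
        # Stage 1: sample one ImageGen-CoT reasoning chain for the prompt.
        cot = model.generate_cot(prompt, temperature=0.9)
        for _ in range(images_per_chain):
            # Stage 2: sample an image conditioned on prompt + chain.
            image = model.generate_image(prompt, cot, temperature=0.9)
            score = scorer(prompt, image)
            if score > best_score:
                best_image, best_score = image, score
    return best_image
```

The point of diversifying at both stages is that sampling several reasoning chains explores different interpretations of the in-context pattern, while sampling several images per chain explores different renderings of each interpretation.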
- 🎯 Chain-of-Thought Prompting: We propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation in T2I-ICL tasks; a schematic inference sketch follows the news items below.
- [07/08/25] Our paper has been accepted to ICCV 2025.
- [07/08/25] Our dataset is available on 🤗 Hugging Face!
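The two-stage flow referenced above can be pictured with a short illustrative sketch. The prompt format and the `mllm.generate` / `mllm.generate_image` calls are assumptions for illustration, not this repository's actual interface:

```python
# Illustrative two-stage ImageGen-CoT inference (hypothetical API).

def imagegen_cot_inference(mllm, icl_examples, query_text):
    # Stage 1: build a T2I-ICL prompt from the in-context examples and
    # ask the unified MLLM to write out its reasoning (ImageGen-CoT) first.
    prompt = "".join(
        f"Input: {ex['text']}\nOutput image: {ex['image']}\n"
        for ex in icl_examples
    )
    prompt += f"Input: {query_text}\nThink step by step before generating."
    cot = mllm.generate(prompt)  # textual reasoning chain

    # Stage 2: condition image generation on both the query and the chain,
    # so the image reflects the contextual pattern inferred in Stage 1.
    return mllm.generate_image(prompt, reasoning=cot)
```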
- Clone this repository and required models
```bash
# Clone main repository
git clone https://github.com/CurryX-001/ImageGen-CoT
cd ImageGen-CoT/models

# Clone and setup LLaVA
git clone https://github.com/yzeng58/LLaVA
mv LLaVA llava

# Clone and setup SEED-X (requires git-lfs)
git lfs install
git clone https://huggingface.co/spaces/tttoaster/SEED-X-17B
mv SEED-X-17B SEED_X
```
- Install Packages (Linux)
```bash
# Create the environment for LLaVA
conda create -n llava python=3.10.13
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install git+https://github.com/yzeng58/LLaVA/@a61aae093656922fe16ec2152b031dd1de72fe92
pip install -r conda_env/llava_requirements.txt

# Create the environment for SEED-X
conda env create -f conda_env/seedx_environment.yml
```
- Download the CoBSAT dataset.
```bash
wget "https://huggingface.co/datasets/yzeng58/CoBSAT/resolve/main/datasets.zip"
```
- Extract the `datasets.zip` file using `unzip datasets.zip` and move the `datasets` folder into your `ImageGen-CoT` directory.
Please refer to the evaluation scripts in the `scripts` directory:

- `scripts/baseline.sh`: for running the baseline
- `scripts/evaluate.sh`: for running the main evaluations
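Assuming the standard layout, these can be invoked from the repository root, e.g. `bash scripts/evaluate.sh`; see each script for the configurable model and dataset paths.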
If you find this work helpful, please cite our paper:
@article{liao2025imagegen,
  title={ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning},
author={Liao, Jiaqi and Yang, Zhengyuan and Li, Linjie and Li, Dianqi and Lin, Kevin and Cheng, Yu and Wang, Lijuan},
journal={arXiv preprint arXiv:2503.19312},
year={2025}
}

This codebase is based on CoBSAT. We thank them for their great work!
