Jiaqi Liao†1, Zhengyuan Yang1, Linjie Li1, Dianqi Li, Kevin Lin1, Yu Cheng2, Lijuan Wang1✉
1Microsoft, 2The Chinese University of Hong Kong
†Intern at Microsoft
Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process, called ImageGen-CoT, prior to image generation. To avoid generating unstructured and ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs on this dataset to enhance their contextual reasoning capabilities. To further improve performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach, which first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset yields a substantial 80% performance gain for SEED-X on T2I-ICL tasks.
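For illustration, here is a minimal sketch of the hybrid test-time scaling idea described above. All names in it (`generate_cot`, `generate_image`, `scorer`) are hypothetical placeholders, not this repository's API:

```python
# Hypothetical sketch of hybrid test-time scaling: sample several
# ImageGen-CoT chains, then several images per chain, and keep the
# image that a scorer ranks highest. Placeholder API, not the repo's.

def hybrid_scale(model, prompt, scorer, num_chains=4, images_per_chain=4):
    best_image, best_score = None, float("-inf")
    for _ in range(num_chains):
        # Stage 1: sample one ImageGen-CoT reasoning chain for the prompt.
        cot = model.generate_cot(prompt, temperature=0.9)
        for _ in range(images_per_chain):
            # Stage 2: sample an image conditioned on prompt + chain.
            image = model.generate_image(prompt, cot, temperature=0.9)
            score = scorer(prompt, image)
            if score > best_score:
                best_image, best_score = image, score
    return best_image
```

The point of diversifying at both stages is that sampling several reasoning chains explores different interpretations of the in-context pattern, while sampling several images per chain explores different renderings of each interpretation.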
- 🎯 Chain-of-Thought Prompting: We propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation in T2I-ICL tasks; a schematic inference sketch follows the news items below.
- [07/08/25] Our paper has been accepted to ICCV 2025.
- [07/08/25] Our dataset is available on 🤗 Hugging Face!
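The two-stage flow referenced above can be pictured with a short illustrative sketch. The prompt format and the `mllm.generate` / `mllm.generate_image` calls are assumptions for illustration, not this repository's actual interface:

```python
# Illustrative two-stage ImageGen-CoT inference (hypothetical API).

def imagegen_cot_inference(mllm, icl_examples, query_text):
    # Stage 1: build a T2I-ICL prompt from the in-context examples and
    # ask the unified MLLM to write out its reasoning (ImageGen-CoT) first.
    prompt = "".join(
        f"Input: {ex['text']}\nOutput image: {ex['image']}\n"
        for ex in icl_examples
    )
    prompt += f"Input: {query_text}\nThink step by step before generating."
    cot = mllm.generate(prompt)  # textual reasoning chain

    # Stage 2: condition image generation on both the query and the chain,
    # so the image reflects the contextual pattern inferred in Stage 1.
    return mllm.generate_image(prompt, reasoning=cot)
```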
- Clone this repository and required models
```bash
# Clone main repository
git clone https://github.com/CurryX-001/ImageGen-CoT
cd ImageGen-CoT/models

# Clone and setup LLaVA
git clone https://github.com/yzeng58/LLaVA
mv LLaVA llava

# Clone and setup SEED-X (requires git-lfs)
git lfs install
git clone https://huggingface.co/spaces/tttoaster/SEED-X-17B
mv SEED-X-17B SEED_X
```
- Install Packages (Linux)
```bash
# Create the environment for LLaVA
conda create -n llava python=3.10.13
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install git+https://github.com/yzeng58/LLaVA/@a61aae093656922fe16ec2152b031dd1de72fe92
pip install -r conda_env/llava_requirements.txt

# Create the environment for SEED-X
conda env create -f conda_env/seedx_environment.yml
```
- Download the CoBSAT dataset.
```bash
wget "https://huggingface.co/datasets/yzeng58/CoBSAT/resolve/main/datasets.zip"
```
- Extract the `datasets.zip` file using `unzip datasets.zip` and move the `datasets` folder into your `ImageGen-CoT` directory.
Please refer to the evaluation scripts in the `scripts` directory:

- `scripts/baseline.sh`: for running the baseline
- `scripts/evaluate.sh`: for running the main evaluations
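Assuming the standard layout, these can be invoked from the repository root, e.g. `bash scripts/evaluate.sh`; see each script for the configurable model and dataset paths.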
If you find this work helpful, please cite our paper:
@article{liao2025imagegen,
  title={ImageGen-CoT: Enhancing Text-to-Image In-Context Learning with Chain-of-Thought Reasoning},
author={Liao, Jiaqi and Yang, Zhengyuan and Li, Linjie and Li, Dianqi and Lin, Kevin and Cheng, Yu and Wang, Lijuan},
journal={arXiv preprint arXiv:2503.19312},
year={2025}
}

This codebase is based on CoBSAT. We thank them for their great work!
