
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

Jiaqi Liao†1, Zhengyuan Yang1, Linjie Li1, Dianqi Li, Kevin Lin1, Yu Cheng2, Lijuan Wang1✉

1Microsoft, 2The Chinese University of Hong Kong

†Interns at Microsoft


Abstract: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks.

🔍 Key Contributions:

  • 🎯 Chain-of-Thought Prompting: We propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation in T2I-ICL tasks.

News 🚀

  • [07/08/25] Our paper has been accepted by ICCV 2025!
  • [07/08/25] Our dataset is available on 🤗 Hugging Face!

Contents

Step 1: Set Up Environment

  1. Clone this repository and required models

    # Clone main repository
    git clone https://github.com/CurryX-001/ImageGen-CoT
    cd ImageGen-CoT/models
    
    # Clone and set up LLaVA
    git clone https://github.com/yzeng58/LLaVA 
    mv LLaVA llava
    
    # Clone and set up SEED-X (requires git-lfs)
    git lfs install
    git clone https://huggingface.co/spaces/tttoaster/SEED-X-17B
    mv SEED-X-17B SEED_X
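    
    After cloning, your current directory (models) should contain both renamed folders; a quick sanity check:
    
    # Both llava and SEED_X should be listed
    ls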
  2. Install Packages

    Linux
    # Create the conda environment for LLaVA
    conda create -n llava python=3.10.13
    conda activate llava
    pip install --upgrade pip  # enable PEP 660 support
    pip install git+https://github.com/yzeng58/LLaVA/@a61aae093656922fe16ec2152b031dd1de72fe92
    pip install -r conda_env/llava_requirements.txt
    
    # Create the conda environment for SEED-X
    conda env create -f conda_env/seedx_environment.yml
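    
    To verify the setup, you can confirm that both environments exist and that LLaVA imports. This is a minimal sanity check; it assumes the LLaVA package installs under the name llava, and the SEED-X environment name is whatever conda_env/seedx_environment.yml defines:
    
    # Both environments should appear in the list
    conda env list
    
    # Quick import check for the llava environment
    conda activate llava
    python -c "import llava; print('llava imported successfully')"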

Step 2: Download Dataset

  1. Download the CoBSAT dataset.

    wget "https://huggingface.co/datasets/yzeng58/CoBSAT/resolve/main/datasets.zip"
  2. Extract the datasets.zip file with unzip datasets.zip and move the datasets folder into your ImageGen-CoT directory, for example:
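    
    # A minimal sketch, assuming datasets.zip was downloaded into the repository root
    unzip datasets.zip
    
    # If you downloaded the archive elsewhere, move the extracted folder in
    # (the path below is a placeholder for your checkout location)
    # mv datasets /path/to/ImageGen-CoT/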

Step 3: Evaluation

Please refer to the evaluation scripts in the scripts directory (example invocations are shown below):

  • scripts/baseline.sh: For running baseline
  • scripts/evaluate.sh: For running the main evaluations
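
A minimal way to invoke them, assuming the scripts run without required arguments (check each script for configurable model and dataset paths):

    # Run the baseline
    bash scripts/baseline.sh
    
    # Run the main evaluations
    bash scripts/evaluate.sh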

Step 4: Citation

If you find this work helpful, please cite our paper:

@article{liao2025imagegen,
  title={ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning},
  author={Liao, Jiaqi and Yang, Zhengyuan and Li, Linjie and Li, Dianqi and Lin, Kevin and Cheng, Yu and Wang, Lijuan},
  journal={arXiv preprint arXiv:2503.19312},
  year={2025}
}

Acknowledgments

This codebase is built on CoBSAT. We thank the authors for their great work!
