Kaihang Pan1* · Weile Chen1* · Haiyi Qiu1* · Qifan Yu1 · Wendong Bu1 · Zehan Wang1 ·
Yun Zhu2 · Juncheng Li1 · Siliang Tang1
1Zhejiang University 2Shanghai Artificial Intelligence Laboratory
*Equal contribution.
WiseEdit is a knowledge-intensive benchmark for cognition- and creativity-informed image editing. It decomposes instruction-based editing into three stages, Awareness, Interpretation, and Imagination, and provides 1,220 bilingual test cases together with a GPT-4o–based automatic evaluation pipeline. Using WiseEdit, we benchmark 22 state-of-the-art image editing models and reveal clear limitations in knowledge-based reasoning and compositional creativity.
- [2025.11.29] 📄 WiseEdit paper released on arXiv.
- [2025.11.29] 📊 WiseEdit project page released.
- More updates coming soon – stay tuned and ⭐ star the repo!
- Release paper and project page.
- Release WiseEdit benchmark data.
- Release automatic evaluation code & prompts.
- Release baseline results & model outputs.
WiseEdit is built around task depth and knowledge breadth.
WiseEdit includes:
- **Awareness Task**
  - Focus: where to edit.
  - No explicit spatial coordinates are given in the instruction.
  - Requires comparative reasoning, reference matching, or fine-grained perception.
- **Interpretation Task**
  - Focus: how to edit at the perception level.
  - Instructions often encode implicit intent, demanding world knowledge.
- **Imagination Task**
  - Focus: subject-driven creative generation.
  - Requires complex composition and identity-preserving transformations.
- **WiseEdit-Complex**
  - Combines Awareness + Interpretation + Imagination.
  - Multi-image, multi-step reasoning with conditional logic and compositional generation.
WiseEdit organizes cases by knowledge type:
- **Declarative Knowledge** – “knowing what”
  - Facts, concepts, perceptual cues.
- **Procedural Knowledge** – “knowing how”
  - Multi-step skills or procedures.
- **Metacognitive Knowledge** – “knowing about knowing”
  - When and how to apply declarative / procedural knowledge; conditional reasoning, rule stacking, etc.
These are grounded in Cultural Common Sense, Natural Sciences, and Spatio-Temporal Logic, stressing culturally appropriate, physically consistent, and logically coherent edits.
We adopt a VLM-based automatic evaluation pipeline:
- Backbone evaluator: GPT-4o.
- Metrics (rated 1–10, linearly mapped to 0–100):
  - **IF** – Instruction Following
  - **DP** – Detail Preserving
  - **VQ** – Visual Quality
  - **KF** – Knowledge Fidelity (for knowledge-informed cases)
  - **CF** – Creative Fusion (for imagination / complex cases)
The overall score of each case combines the metrics that apply to it (KF and CF are included only where relevant); see the paper for the exact formula and weights.
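As a rough, unofficial illustration of how such a per-case aggregate could be computed (the 1→0, 10→100 linear mapping and the equal weighting below are assumptions for illustration, not the paper's formula):

```python
def map_score(raw: float) -> float:
    """Linearly map a 1-10 rating onto the 0-100 scale (assumed mapping)."""
    return (raw - 1.0) / 9.0 * 100.0


def overall_score(ratings: dict) -> float:
    """Average the mapped scores of whichever metrics apply to a case.

    `ratings` maps metric names (IF/DP/VQ, plus KF/CF where applicable)
    to raw 1-10 ratings. Equal weighting is assumed for illustration only.
    """
    mapped = [map_score(v) for v in ratings.values()]
    return sum(mapped) / len(mapped)


# Example: a knowledge-informed case rated IF=9, DP=8, VQ=7, KF=10
print(round(overall_score({"IF": 9, "DP": 8, "VQ": 7, "KF": 10}), 1))  # → 83.3
```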
Our benchmark data is hosted on Hugging Face:
- WiseEdit-Benchmark: https://huggingface.co/datasets/123123chen/WiseEdit-Benchmark
The folder structure for WiseEdit-Benchmark is organized as follows:
```
WiseEdit-Benchmark/
├── WiseEdit/
│   ├── Awareness/
│   │   ├── Awareness_1/
│   │   │   ├── imgs/               # input images for this subset
│   │   │   ├── img_ref/            # reference images (if any)
│   │   │   ├── Awareness_1.csv     # metadata + instructions in CSV format
│   │   │   └── ins.json            # same annotations in JSON format (used by code)
│   │   └── Awareness_2/
│   │       ├── imgs/
│   │       ├── img_ref/
│   │       ├── Awareness_2.csv     # metadata + instructions in CSV format
│   │       └── ins.json            # same annotations in JSON format
│   ├── Imagination/
│   │   └── ...                     # similar structure for Imagination subsets
│   └── Interpretation/
│       └── ...                     # similar structure for Interpretation subsets
└── WiseEdit-Complex/
    ├── WiseEdit_Complex_2/
    │   ├── imgs/
    │   ├── img_ref/
    │   ├── WiseEdit_Complex_2.csv  # metadata + instructions in CSV format
    │   └── ins.json                # same annotations in JSON format
    ├── WiseEdit_Complex_3/
    │   └── ...
    └── WiseEdit_Complex_4/
        └── ...
```
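After downloading, the per-subset `ins.json` annotation files can be located with a small `pathlib` walk; this sketch only finds and parses them, since the exact `ins.json` schema is not reproduced here:

```python
import json
from pathlib import Path


def load_subsets(benchmark_root: str) -> dict:
    """Map each subset name (e.g. 'Awareness_1') to its parsed ins.json."""
    subsets = {}
    for ins_path in Path(benchmark_root).rglob("ins.json"):
        # the parent directory name doubles as the subset name
        subsets[ins_path.parent.name] = json.loads(
            ins_path.read_text(encoding="utf-8"))
    return subsets
```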
All our model evaluation results are also released at:
- WiseEdit-Results: https://huggingface.co/datasets/midbee/WiseEdit-Results
This project requires Python 3.10. First install dependencies (from requirements.txt):
```bash
pip install -r requirements.txt
```
Set your API credentials (the evaluator calls an OpenAI-compatible Chat Completions API):
```bash
# required
export API_KEY="YOUR_API_KEY"

# optional: if not set, the default https://api.openai.com/v1 will be used
export BASE_URL="https://api.openai.com/v1"
```
If BASE_URL is not set, it will automatically fall back to https://api.openai.com/v1.
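In your own scripts, the same required/optional behavior can be reproduced with a plain `os.environ` lookup (a sketch; only the variable names `API_KEY` and `BASE_URL` come from the instructions above):

```python
import os


def get_api_config() -> tuple[str, str]:
    """Read evaluator credentials, falling back to the default endpoint."""
    api_key = os.environ["API_KEY"]  # required; raises KeyError if unset
    base_url = os.environ.get("BASE_URL", "https://api.openai.com/v1")
    return api_key, base_url
```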
```bash
git clone https://github.com/beepkh/WiseEdit
cd WiseEdit

# 1) create and activate env
conda create -n wiseedit python=3.10 -y
conda activate wiseedit

# 2) install requirements
pip install -r requirements.txt

# 3) set env vars
export API_KEY="YOUR_API_KEY"
# optional
export BASE_URL="https://api.openai.com/v1"
```
Before running the evaluation, organize all generated images as `result_img_root/<MODEL_NAME>/<SUBSET>/<LANG>/<ID>.png`, where:

- `<MODEL_NAME>` is the model tag passed to `--name`;
- `<SUBSET>` is the CSV/JSON subset name (e.g. `Awareness_1`, `Imagination_2`, `WiseEdit_Complex_3`);
- `<LANG>` is `cn` or `en`;
- `<ID>.png` is the sample id in the corresponding CSV/JSON (e.g. `1.png`, `2.png`, …).
You can refer to the WiseEdit-Results for an example of this directory layout.
```
/path/to/result_images_root/
└── <MODEL_NAME>/               # e.g. Nano-banana-pro, GPT, etc.
    ├── Awareness_1/
    │   ├── cn/
    │   │   ├── 1.png           # id = 1 in Awareness_1.csv / ins.json (cn)
    │   │   ├── 2.png
    │   │   └── ...
    │   └── en/
    │       ├── 1.png           # id = 1 in Awareness_1.csv / ins.json (en)
    │       ├── 2.png
    │       └── ...
    ├── Awareness_2/
    │   ├── cn/
    │   └── en/
    ├── Imagination_1/
    │   ├── cn/
    │   └── en/
    ├── Imagination_2/
    │   └── ...
    ├── Interpretation_1/
    │   └── ...
    ├── WiseEdit_Complex_2/
    │   └── ...
    └── ...
```
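Before launching the evaluator, it can save API calls to verify that every expected image exists. A minimal checker for the layout above (`missing_images` is a hypothetical helper, not part of the repo; the ids would come from each subset's CSV/JSON):

```python
from pathlib import Path


def missing_images(result_root: str, model: str, subset: str,
                   lang: str, ids: list) -> list:
    """Return the expected <ID>.png paths that are absent on disk."""
    base = Path(result_root) / model / subset / lang
    return [base / f"{i}.png" for i in ids
            if not (base / f"{i}.png").exists()]
```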
Evaluation/generate_image_example.py uses FLUX.2-Dev as an example to demonstrate how to generate the corresponding images for each test case in WiseEdit.
Run run_eval.py to score all subsets and produce score_*.csv:
```bash
python run_eval.py \
    --name Nano-banana-pro \
    --dataset_dir /path/to/WiseEdit-Benchmark \
    --result_img_root /path/to/result_images_root \
    --score_output_root /path/to/score_output_root \
    --num_workers 5   # number of threads used for evaluation
```
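Conceptually, `--num_workers` fans the per-case scoring calls out over a thread pool, which suits API-bound work. A sketch (not the actual run_eval.py internals; `score_case` stands in for the GPT-4o request):

```python
from concurrent.futures import ThreadPoolExecutor


def score_all(cases, score_case, num_workers=5):
    """Score every case concurrently while preserving input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(score_case, cases))
```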
To evaluate only specific CSVs (e.g. Imagination_1.csv and Awareness_1.csv):
```bash
python run_eval.py \
    --name Nano-banana-pro \
    --dataset_dir /path/to/WiseEdit-Benchmark \
    --result_img_root /path/to/result_images_root \
    --score_output_root /path/to/score_output_root \
    --num_workers 5 \
    --target_csv Imagination_1.csv Awareness_1.csv
```
run_eval.py will write files like:
```
/score_output_root/Nano-banana-pro/score_Imagination_1.csv
/score_output_root/Nano-banana-pro/score_Awareness_1.csv
...
```
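statistic.py aggregates these files for you; for a quick manual spot-check, one column of a score_*.csv can be averaged with the standard csv module (the column name "IF" is an assumption about the CSV schema):

```python
import csv


def column_mean(csv_path: str, column: str) -> float:
    """Average one numeric column of a score_*.csv file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    return sum(values) / len(values)
```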
After all score_*.csv are ready, run statistic.py:
```bash
python statistic.py \
    --dataset_dir /path/to/WiseEdit-Benchmark \
    --score_root /path/to/score_output_root \
    --name Nano-banana-pro \
    --statistic_output_dir /path/to/statistic_output
```
This will generate:
```
/statistic_output/Nano-banana-pro_cn.csv
/statistic_output/Nano-banana-pro_en.csv
/statistic_output/Nano-banana-pro_complex.csv
```
and print per-task, per-language averages to the console (replace Nano-banana-pro with your own model name).
To evaluate only the single-image results (as in Table 5 and Table 6 of our paper), run statistic_single.py once all score_*_1.csv files are ready (note that there is no score_WiseEdit_Complex_1.csv):
```bash
python statistic_single.py \
    --dataset_dir /path/to/WiseEdit-Benchmark \
    --score_root /path/to/score_output_root \
    --name Nano-banana-pro \
    --statistic_output_dir /path/to/statistic_output
```
This will generate:
```
/statistic_output/Nano-banana-pro_cn_sing.csv
/statistic_output/Nano-banana-pro_en_sing.csv
```
and print per-task, per-language averages to the console.
If you find WiseEdit helpful, please cite:
```bibtex
@article{pan2025wiseedit,
  title={WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing},
  author={Pan, Kaihang and Chen, Weile and Qiu, Haiyi and Yu, Qifan and Bu, Wendong and Wang, Zehan and Zhu, Yun and Li, Juncheng and Tang, Siliang},
  journal={arXiv preprint arXiv:2512.00387},
  year={2025}
}
```
