🍇 [Read our arXiv Paper] 🍎 [Project Page]
Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao
* Core Contributors ⚑ Project Lead
We present Set-of-Mark (SoM) prompting, which simply overlays a set of spatial, speakable marks on an image to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
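To illustrate the idea, here is a minimal sketch that overlays alpha-blended masks and numeric marks at mask centroids using OpenCV. It is a rough approximation of the marking step, not the released SoM toolbox; the function name and defaults are our own.

```python
import cv2
import numpy as np

def overlay_marks(image, masks, alpha=0.4):
    """Overlay alpha-blended masks and numeric marks on an image.

    image: HxWx3 uint8 array; masks: list of HxW boolean arrays.
    Illustrative sketch only, not the official SoM implementation.
    """
    annotated = image.copy()
    rng = np.random.default_rng(0)
    for idx, mask in enumerate(masks, start=1):
        color = rng.integers(0, 256, size=3)
        # Alpha-blend the masked region with a distinct color.
        annotated[mask] = (alpha * color + (1 - alpha) * annotated[mask]).astype(np.uint8)
        # Place the numeric ("speakable") mark at the mask centroid.
        ys, xs = np.nonzero(mask)
        cv2.putText(annotated, str(idx), (int(xs.mean()), int(ys.mean())),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 255, 255), 2)
    return annotated
```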
- [10/18] We are going to release the SoM toolbox very soon. Stay tuned!
Users can select the granularity of the masks to generate, and choose between automatic (top) and interactive (bottom) modes. A higher alpha-blending value (0.4) is used for better visualization.
SoM enables interleaved prompts that include both textual and visual content. The visual content can be referenced by its region indices.
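For concreteness, below is a hedged sketch of such an interleaved prompt using the OpenAI Python client (v1.x). The model name and file names are assumptions that may need updating for your setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("annotated.png", "rb") as f:  # the SoM-annotated image
    b64 = base64.b64encode(f.read()).decode()

# The text part refers to visual content purely by region indices.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable model name at the time of writing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is in region 1, and how does it relate to region 3?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```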

In comparison to GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning on detailed contents of the image (left). Clear cross-image object references are observed on the right.
Case study on solving a CAPTCHA. GPT-4V gives a wrong answer with an incorrect number of squares, but finds the correct squares with their corresponding marks after SoM prompting.
Case study on an image of a dish for GPT-4V. GPT-4V does not produce a grounded answer with the original image. With SoM prompting, GPT-4V not only names the ingredients but also links them to their regions.
SoM-prompted GPT-4V gives very precise suggestions, while the original one fails and even hallucinates foods, e.g., soft drinks.
Likewise, GPT-4V with SoM can provide thorough tool-usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can also provide information about the unlabeled buttons.
GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.
We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V + SoM outperforms specialists on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.

Our method adopts the following models to propose masks (a usage sketch for one backend follows the list):
- Mask DINO
- SEEM
- Semantic-SAM
- Segment Anything (SAM), trained on the SA-1B data.
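As an example of plugging in one of these backends, the sketch below uses the public Segment Anything API to generate class-agnostic masks that can then be marked. The checkpoint filename is the officially released ViT-H weights (downloaded separately); the surrounding glue is our own.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load the released ViT-H SAM checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)
# Each proposal is a dict; "segmentation" holds an HxW boolean mask.
masks = [m["segmentation"] for m in generator.generate(image)]
```

The resulting boolean masks can be fed directly into an overlay routine like the one sketched earlier.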
We also thank GPT-4V for providing a strong foundation model!
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@article{yang2023setofmark,
  title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V},
  author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
  journal={arXiv preprint arXiv:2310.11441},
  year={2023}
}

