FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
ICCV 2025
1Harbin Institute of Technology, Shenzhen
2Huawei Noah's Ark Lab
†Corresponding author
- [01/2026] 🔥 The extended paper of FALCON++ is released on TechRxiv.
- [12/2025] 🔥 Checkpoint released. Enjoy it!
- [07/2025] 🔥 The code and project page are released. Enjoy it!
- [06/2025] 🔥 The arXiv paper is updated.
- [06/2025] FALCON is accepted to ICCV 2025!
- [01/2025] arXiv paper released.
This is the GitHub repository for FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers. In this work, we propose FALCON, a model that introduces a novel visual register technique to simultaneously address visual redundancy and fragmentation in the high-resolution visual encoding of MLLMs.
- Clone this repository and navigate to the folder
git clone git@github.com:JiuTian-VL/JiuTian-FALCON.git
cd JiuTian-FALCON
- Install Package
conda create -n falcon python=3.10 -y
conda activate falcon
pip install --upgrade pip
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
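After the steps above, a quick check can confirm that the environment is usable before loading a checkpoint. The following is a minimal sketch, not part of this repository: the helper name `check_environment` is hypothetical, and the assumption that `torch` is among the installed dependencies is ours. It only tests whether each package can be located, without importing it.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "jiutian", "flash_attn"), min_python=(3, 10)):
    """Report whether the interpreter and the given packages look usable.

    Hypothetical helper: the default package list is an assumption
    (`flash_attn` is only needed for the training extras).
    """
    # Compare the running interpreter against the minimum version tuple.
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

A `MISSING` entry for `flash_attn` is expected if only the inference install was performed.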
The JiutianHDInfer class in jiutian/eval/model_infer.py encapsulates model inference.
Below is an example of how to use it. Calling the inference method on an image and a question yields the model's inference result.
from jiutian.eval.model_infer import JiutianHDInfer
model_infer = JiutianHDInfer(
model_path='/path/to/ckpt',
model_base='/path/to/base_ckpt or None',
conv_mode='llama_3_1',
)
image_file = '/path/to/image'
question = 'question'
model_infer.inference(image_file, question)
See docs/Evaluation.md for details.
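To run the same call over several image/question pairs, a small loop around `inference` may be enough. The sketch below is an illustration, not code from this repository: `run_batch` and `SupportsInference` are hypothetical names, and it assumes that `inference(image_file, question)` returns the answer as a value, which the example above does not confirm.

```python
from typing import Any, Dict, Iterable, List, Protocol, Tuple


class SupportsInference(Protocol):
    """Any object exposing inference(image_file, question), e.g. JiutianHDInfer."""

    def inference(self, image_file: str, question: str) -> Any: ...


def run_batch(
    model_infer: SupportsInference,
    samples: Iterable[Tuple[str, str]],
) -> List[Dict[str, Any]]:
    """Call inference once per (image_file, question) pair, preserving order."""
    results = []
    for image_file, question in samples:
        # Assumption: the return value is the model's answer.
        answer = model_infer.inference(image_file, question)
        results.append({"image": image_file, "question": question, "answer": answer})
    return results
```

With a loaded model this could be used as `run_batch(model_infer, [('/path/to/image', 'question')])`, which keeps the model in memory across all pairs instead of reloading it per call.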
If you find this work useful for your research, please cite our paper:
@inproceedings{zhang2025falcon,
title={Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers},
author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={23530--23540},
year={2025}
}
