FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
ICCV 2025

Renshan Zhang¹, Rui Shao¹†, Gongwei Chen¹, Miao Zhang¹, Weili Guan¹, Kaiwen Zhou², Liqiang Nie¹†

¹Harbin Institute of Technology, Shenzhen
²Huawei Noah's Ark Lab
†Corresponding author

If you find this work useful for your research, please kindly cite our paper and star our repo.

Updates

[01/2026] 🔥 The extended paper of FALCON++ is released on TechRxiv.
[12/2025] 🔥 Checkpoint released. Enjoy it!
[07/2025] 🔥 The code and project page are released. Enjoy it!
[06/2025] 🔥 The arXiv paper is updated.
[06/2025] FALCON is accepted to ICCV 2025!
[01/2025] arXiv paper released.

Introduction

This is the GitHub repository of FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers. In this work, we propose FALCON, a model that introduces a novel visual register technique to simultaneously address visual redundancy and fragmentation in the high-resolution visual encoding of multimodal large language models (MLLMs).

Installation

Clone this repository and navigate to the folder

git clone git@github.com:JiuTian-VL/FALCON.git
cd FALCON

Install the package

conda create -n falcon python=3.10 -y
conda activate falcon
pip install --upgrade pip
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
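
After installation, a quick import check can confirm that the environment is usable. The snippet below is a minimal sketch and not part of this repository: `missing_modules` is a hypothetical helper, and it assumes the editable install exposes the `jiutian` package (the name imported in Quick Start) and that flash-attn is importable as `flash_attn`.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: `jiutian` (this repo), `flash_attn` (training extra).
    missing = missing_modules(["jiutian", "flash_attn"])
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If `flash_attn` is reported missing and you only need inference, that may be acceptable, since it is listed above as a training dependency.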

Quick Start

The JiutianHDInfer class in jiutian/eval/model_infer.py wraps model loading and inference behind a single interface.

The example below creates a JiutianHDInfer instance and calls its inference method with an image path and a question to obtain the model's output.

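from jiutian.eval.model_infer import JiutianHDInfer

# Load the model once; all paths below are placeholders.
model_infer = JiutianHDInfer(
    model_path='/path/to/ckpt',
    model_base=None,  # or '/path/to/base_ckpt' if the checkpoint needs a base model
    conv_mode='llama_3_1',
)

# Run inference on a single image/question pair.
image_file = '/path/to/image'
question = 'question'
model_infer.inference(image_file, question)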
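
To query several images with one loaded model, the same call can be wrapped in a loop. This is a rough sketch under stated assumptions, not code from the repository: `run_batch` is a hypothetical helper, and it assumes that `inference(image_file, question)` returns the model's answer rather than only printing it.

```python
from typing import Any, Iterable, List, Tuple


def run_batch(model_infer: Any, samples: Iterable[Tuple[str, str]]) -> List[dict]:
    """Call model_infer.inference(image_file, question) for each sample.

    `model_infer` is any object exposing that method, e.g. a JiutianHDInfer
    instance. The return value is stored as-is (assumed to be the answer).
    """
    results = []
    for image_file, question in samples:
        answer = model_infer.inference(image_file, question)
        results.append({"image": image_file, "question": question, "answer": answer})
    return results
```

For example, `run_batch(model_infer, [('/path/to/image', 'question')])` would return a one-element list holding the image path, the question, and whatever `inference` returned.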

Evaluations

See docs/Evaluation.md for details.

Citation

If you find this work useful for your research, please kindly cite our paper:

@inproceedings{zhang2025falcon,
  title={Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers},
  author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={23530--23540},
  year={2025}
}
