- 🤗🤗🤗 We release Dynamic Pyramid Network (DPN), a novel approach to enhance computational efficiency in Multimodal Large Language Models (MLLMs) by incorporating Dynamic Pooling Experts (DPE) layers.
Multimodal large language models (MLLMs) have demonstrated impressive performance on various vision-language (VL) tasks, but their expensive computation still limits real-world applications. To address this issue, recent efforts aim to compress visual features to reduce the computational cost of MLLMs. However, direct visual compression methods, such as efficient projectors, and token pruning approaches like FastV inevitably destroy visual semantics in MLLMs, a problem that becomes more serious on difficult samples.
DPN formulates the MLLM as a hierarchical structure in which visual features are gradually compressed with increasing depth. In this case, even at a high compression ratio, fine-grained visual information can still be perceived in the shallow layers. To maximize the benefit of DPN, we further propose innovative Dynamic Pooling Experts (DPE) that can dynamically choose the optimal visual compression ratio according to the input features. With this design, harder samples are assigned more computation, thus preserving model performance.
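To make the hierarchical idea concrete, here is a minimal sketch (our illustration, not the released code; the stage boundaries and 2x pooling ratios are made-up numbers) of how a pyramid schedule shrinks the visual token count as depth increases:

```python
# Illustrative DPN-style token schedule (hypothetical layer indices and ratios).
# At each "compression layer", visual tokens are pooled by the given ratio;
# every other layer keeps the token count from the previous layer.

def token_schedule(n_layers, n_visual_tokens, compress_at):
    """compress_at: {layer_index: pooling_ratio}. Returns tokens per layer."""
    tokens, schedule = n_visual_tokens, []
    for layer in range(n_layers):
        if layer in compress_at:
            tokens = max(1, tokens // compress_at[layer])
        schedule.append(tokens)
    return schedule

# 32-layer LLM, 576 visual tokens (LLaVA-1.5), 2x pooling at layers 8/16/24
sched = token_schedule(32, 576, {8: 2, 16: 2, 24: 2})
print(sched[0], sched[8], sched[16], sched[24])  # 576 288 144 72
```

Shallow layers still see all 576 tokens, so fine-grained details remain available before compression kicks in.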
- Dynamic Pyramid Network: Visual features are compressed as the depth of the MLLM increases, maintaining performance while improving inference efficiency.
- Dynamic Pooling Experts: Choose the optimal compression ratio based on the complexity of the image and the visual question, balancing performance and efficiency.
- Routing Loss: Improves the acceleration effect without affecting training cost or performance.
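The routing idea can be sketched as follows. This is a toy illustration with made-up dimensions and candidate ratios, not the released DPE implementation: a small router scores candidate compression ratios from pooled input features, the highest-scoring ratio is picked at inference, and a routing loss penalizes expected compute so easy samples drift toward cheaper routes:

```python
import numpy as np

# Toy DPE-style router (illustrative only): choose a compression ratio per
# sample from its visual features, and penalize expected compute cost.

RATIOS = [1, 2, 4, 8]  # candidate pooling ratios (assumed, not from the paper)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(4096, len(RATIOS)))  # router weights, d -> #ratios

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def route(features):
    """features: (batch, tokens, d). Returns chosen ratios and a routing loss."""
    pooled = features.mean(axis=1)          # summarize each sample, (batch, d)
    probs = softmax(pooled @ W)             # routing distribution, (batch, #ratios)
    choice = probs.argmax(-1)               # hard pick at inference time
    # Routing loss: expected relative compute, assuming cost ~ 1/ratio.
    cost = np.array([1.0 / r for r in RATIOS])
    routing_loss = (probs * cost).sum(-1).mean()
    return [RATIOS[c] for c in choice], routing_loss

feats = rng.normal(size=(2, 576, 4096))     # 2 samples, 576 visual tokens each
ratios, loss = route(feats)
```

Minimizing this loss alongside the task loss pushes probability mass toward larger ratios unless the task loss demands the finer-grained route.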
| Model | FLOPs Reduction | Accuracy Change |
|---|---|---|
| DPN-LLaVA-1.5-7B | 56% | +0.7% |
| DPN-LLaVA-HR-7B | 40% | -0.4% |
| DPN-LLaVA-HR-X-13B | 44% | +0.6% |
For more details, check the full report.
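As a rough sanity check on numbers like these (illustrative arithmetic only, using a made-up pyramid schedule and standard rough per-layer cost terms, not the paper's exact accounting), per-layer decoder FLOPs scale roughly linearly with token count in the projections and MLP, so shrinking visual tokens at depth cuts total FLOPs almost proportionally:

```python
# Back-of-envelope FLOPs ratio for a token pyramid vs. a flat baseline.
# Per-layer cost ~ 12*n*d^2 (projections + MLP) + 2*n^2*d (attention).

def layer_flops(n_tokens, d=4096):
    return 12 * n_tokens * d * d + 2 * n_tokens * n_tokens * d

def total_flops(schedule, n_text=64, d=4096):
    # each entry in `schedule` is the visual-token count at that layer
    return sum(layer_flops(nv + n_text, d) for nv in schedule)

flat = total_flops([576] * 32)                                   # no compression
pyramid = total_flops([576] * 8 + [288] * 8 + [144] * 8 + [72] * 8)
reduction = 1 - pyramid / flat
print(f"FLOPs reduction: {reduction:.0%}")  # → FLOPs reduction: 48%
```

With this hypothetical schedule the estimate lands in the same ballpark as the table above; the reported numbers of course depend on the actual learned routing.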
Our DPN approach dynamically compresses visual tokens according to the complexity of the visual task to accelerate MLLM inference, and it significantly outperforms existing acceleration methods on fine-grained visual perception tasks.
- Efficient Inference Structure: As shown in Figure 4a, in DPN-LLaVA the deeper the layer, the fewer visual tokens need to be processed, enabling efficient inference.
- Dynamic Optimal Compression: As shown in Figures 3 and 4a, DPE dynamically selects an appropriate compression ratio based on the complexity of the image and the question, minimizing performance loss.
- Superior Performance over Existing Methods: As shown in Figure 4b, compared to sparsification methods like FastV and efficient projector approaches like TokenPacker, our dynamic compression method demonstrates significant advantages on image-detail perception tasks.
- Clone the repository and navigate to the DPN folder:

```shell
git clone git@github.com:aihao2000/DPN-LLaVA.git
cd DPN-LLaVA
```

- Create and activate a new conda environment:

```shell
conda create -n dpn-llava python=3.10 -y
conda activate dpn-llava
```

- Upgrade pip and install the package:

```shell
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages for training:

```shell
pip install ninja
pip install flash-attn==2.6.3 --no-build-isolation
```

Please refer to the original LLaVA-HR repository for data preparation, or to the official repository of whichever MLLM you are using.
We recommend directly using a pre-trained projector; below are the links from the official LLaVA and LLaVA-HR releases. You can use the following script to download the pre-trained projectors used in this project.
| Version | Vision Encoder | Projection | Pretrain Data | Pretraining schedule | Download |
|---|---|---|---|---|---|
| LLaVA-7b | CLIP-L | MLP-2x | LCS-558K | 1e | projector |
| LLaVA-HR-7b | CLIP-L & ConvNeXt-L | MLP-2x | LCS-558K | 1e | projector |
| LLaVA-HR-X-13b | CLIP-L & ConvNeXt-XXL | MLP-2x | LCS-558K | 1e | projector |
```shell
huggingface-cli download --local-dir ./checkpoints/vicuna-7b-v1.5 lmsys/vicuna-7b-v1.5

# for dpn-llava
huggingface-cli download --local-dir ./checkpoints/llava-v1.5-7b liuhaotian/llava-v1.5-7b mm_projector.bin

# for dpn-llava-hr
huggingface-cli download --local-dir ./checkpoints/llava-hr-7b-pretrain-384 favor123/llava-hr-7b-pretrain-384 mm_projector.bin

# for dpn-llava-hr-x
huggingface-cli download --local-dir ./checkpoints/vicuna-13b-v1.5 lmsys/vicuna-13b-v1.5
huggingface-cli download --local-dir ./checkpoints/llava-hr-13b-x-pretrain-384 favor123/llava-hr-13b-x-pretrain-384 mm_projector.bin
```

For the 13B model, you may need to modify the parameters used by the `hack_llava` function in `llava_hr/train/train.py`.
```shell
bash scripts/v1_5/train_dpn_llava.sh
bash scripts/v1_5/train_dpn_llava_hr.sh
bash scripts/v1_5/train_dpn_llava_hr_x.sh
```

We follow LLaVA-v1.5 to conduct evaluations. Download `eval.zip` and unzip it to `./playground/data/eval`. Please refer to Evaluation.md to prepare the data.
Then, you can run our evaluation script:

```shell
bash scripts/v1_5/eval.sh
```
For questions, please reach out to aihao2000@outlook.com.
This project is licensed under the MIT License - see the LICENSE file for details.
Special thanks to all contributors and to the LLaVA and LLaVA-HR projects for their codebases.



