Tempo is an efficient, query-aware framework that natively compresses hour-long videos for downstream Multimodal LLMs. Instead of blindly dropping frames, Tempo acts as an intelligent temporal compressor, dynamically adapting the video's temporal rhythm to user intent.
Project Page | Paper | Demo
- 🧠 Intent-Driven Compression (ATA): Uses a Small VLM as an O(1) dynamic router to allocate dense token bandwidth to query-critical moments while rapidly fast-forwarding redundant backgrounds.
- ⚡ Extreme Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), bypassing the lost-in-the-middle phenomenon without breaking causality.
- 🏆 State-of-the-Art Performance: Our compact Tempo-6B model scores 52.3 on the extremely long-video benchmark LVBench under a strict 8K visual token budget (53.7 with a 12K budget), outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.
(Click play to see our interactive UI, dynamic token allocation visualization, and real-time inference)
demo.mp4
- [2026.04] 📦 We have released the Intermediate Checkpoints for Stages 0, 1, and 2! You can find them in our Hugging Face Collection.
- [2026.04] 📊 The full Evaluation Pipeline is now open-sourced! Our evaluation scripts, integrated with the standard `lmms-eval` framework for LVBench, Video-MME, MLVU, and LongVideoBench, are ready to use. Please refer to the Evaluation Section for detailed instructions.
- [2026.04] 📄 Our paper is officially out! You can read it on arXiv and check out our page on Hugging Face Papers.
- [2026.04] 🚀 We have released the Tempo-6B inference code, interactive Gradio UI, and the final checkpoints (Stage 3)!
- [TODO] 🛠️ Training Code: The complete training scripts for all 4 stages will be open-sourced in the following weeks. Stay tuned!

⭐ Tip: Please Watch or Star this repository to keep an eye on our latest updates and code releases!
Create a new conda environment and install all required dependencies:
# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo
# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo
# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt

Since flash-attn installation can be highly environment-dependent, please install it manually using one of the methods below:
# Method 1
pip install flash-attn==2.7.4.post1
# Method 2: Without Build Isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation
# Method 3: If you are unable to build from source, you can directly download and install the pre-built wheel:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl

💡 If you are unable to install flash-attn, you can still run Tempo by disabling it:

- Set `use_flash_attn=False` when calling `load_pretrained_model`.
- Comment out the line `self.config._attn_implementation = "flash_attention_2"` in `qwen3vl_encoder.py`.

Theoretically, the numerical differences should be minimal. We have visually verified that the model produces excellent qualitative results without Flash-Attention. However, please note that we have not rigorously evaluated its impact on the benchmarks reported in the paper.
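If you wire this fallback into your own loading code, a minimal sketch could look like the following. Note that the helper name and the choice of PyTorch's built-in `sdpa` backend as the fallback are our assumptions for illustration, not part of the released code:

```python
def pick_attn_implementation(use_flash_attn: bool = True) -> str:
    """Return the attention backend to request from the model config.

    Hypothetical helper: falls back to PyTorch's "sdpa" attention when
    flash-attn is disabled or simply not importable in the environment.
    """
    if use_flash_attn:
        try:
            import flash_attn  # noqa: F401  # only probes availability
            return "flash_attention_2"
        except ImportError:
            pass
    return "sdpa"
```

This mirrors the two manual steps above: passing `use_flash_attn=False` short-circuits to the fallback, and a missing `flash_attn` package is handled gracefully instead of raising at model load time.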
To fully support the open-source community and facilitate future research, we have released the weights for our final model alongside the intermediate checkpoints from all 4 stages of our training pipeline.
💡 Note on Token Budgets: Tempo's Adaptive Token Allocation (ATA) is dynamically controlled at inference time. The 4K and 8K budget configurations reported in our paper use the exact same final weights (Stage 3). You simply adjust the budget hyperparameter during inference.
| Training Stage | Description | Weights |
|---|---|---|
| Stage 0 | Modality Alignment | 🤗 HF Link |
| Stage 1 | Pre-training | 🤗 HF Link |
| Stage 2 | Broad Supervised Fine-Tuning | 🤗 HF Link |
| Stage 3 | Long-Context SFT (Final Tempo-6B) | 🤗 HF Link |
(Note: If you only want to run inference or evaluate our model, simply download the Stage 3 weights. The intermediate checkpoints for Stages 0, 1, and 2 are provided for researchers who wish to reproduce our training pipeline, conduct ablation studies, or perform custom fine-tuning.)
To run the inference script successfully, you need to download two components: our final Tempo-6B weights, and the base Qwen3-VL-2B-Instruct model (for Tempo initialization).
We highly recommend using the huggingface-cli for fast and resumable downloads:
mkdir -p checkpoints
# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B
# 2. Download the base Qwen3-VL model (Required for architecture initialization)
# 💡 Note: To avoid caching Qwen3-VL in the default system drive during inference,
# you can modify Tempo-6B's `config.json`: change "Qwen/Qwen3-VL-2B-Instruct" to "./checkpoints/Qwen3-VL-2B-Instruct" and run:
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct

We provide multiple ways to interact with Tempo, from a web UI to batch scripts.
To launch the local Gradio application with interactive visualizations of the Token Allocation distribution:
python app.py

Navigate to the generated local or public URL in your browser. Our UI features dynamic token compression visualization and one-click example testing.
⚠️ Note: We use Git LFS to manage the demo videos in this repository. If you encounter a "Video not playable" error, it means the actual video files were not downloaded during `git clone` (this will not happen if your system already has Git LFS installed).
- Option 1 (Recommended): Install Git LFS (`git lfs install`) and run `git lfs pull` inside the repository to fetch the real video files.
- Option 2: Manually download the specific `.mp4` files directly from our repository web interface.
Run the default example: We provide a quick-start script to test a pre-configured example.
sh ./scripts/infer/infer.sh

Run your own custom video:
To test your own videos, call the Python script directly. Make sure to point the --model_path to your downloaded local checkpoint.
python infer.py \
--model_path "./checkpoints/Tempo-6B" \
--video_path "/path/to/your/custom_video.mp4" \
--query "Your detailed question here."

Run all default examples: To sequentially reproduce all the qualitative examples shown on our Project Page, run:
sh ./scripts/infer/infer_all_demos.sh

Run a custom batch:
For testing across multiple custom videos, we highly recommend our JSON-based pipeline.
Simply edit the test configurations in `./examples/demo_cases.json`:
[
{
"video_path": "/path/to/custom1.mp4",
"query": "Question 1"
},
{
"video_path": "/path/to/custom2.mp4",
"query": "Question 2"
}
]

Once configured, execute the same batch script to automatically process them all:
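Conceptually, the batch script expands each JSON entry into one `infer.py` call with the same flags as the single-video example above. The following is a hypothetical Python equivalent of that loop, not the released script's actual contents:

```python
import json

# Entries in the same shape as ./examples/demo_cases.json
cases = json.loads("""[
  {"video_path": "/path/to/custom1.mp4", "query": "Question 1"},
  {"video_path": "/path/to/custom2.mp4", "query": "Question 2"}
]""")

# One infer.py invocation per case (sketch of what the batch script does)
cmds = [
    ["python", "infer.py",
     "--model_path", "./checkpoints/Tempo-6B",
     "--video_path", c["video_path"],
     "--query", c["query"]]
    for c in cases
]
for cmd in cmds:
    print(" ".join(cmd))
```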
sh ./scripts/infer/infer_all_demos.sh

Here are some unedited inference results generated directly by our Tempo-6B model. These examples demonstrate the model's capability in video summarization and complex QA.
🧠 Example 1: Complex Multi-Task Analysis & Grounding (Honkai: Star Rail)
hsr_helloworld.mp4
🔷 Query:
Task: Please analyze the provided video and answer the following 7 questions precisely. Q1: How many performers are visible on the stage? Q2: Describe the architectural elements in the background. What historical civilization do they remind you of? Q3: What is happening in the night sky above the performers, and what does this suggest about the event? Q4: List the hair colors of the performers in order from left to right. Q5: Identify the specific musical instrument being played by the performer located on the far left of the stage. Q6: What is the specific time interval (in seconds, e.g., XX-XXs) during which this fireworks performance scene occurs in the video? Q7: Look at the audience in the foreground. How does their silhouette-like depiction affect the viewer's perspective of the stage?
🔴 Tempo Response: ⏱️
Refer to 1:27 - 1:30

Q1: There are four performers visible on the stage.
Q2: The architectural elements in the background resemble ancient Greek columns, suggesting a historical setting reminiscent of ancient Greece.
Q3: The night sky above the performers is filled with fireworks, suggesting a celebratory event or performance.
Q4: The hair colors of the performers from left to right are purple, pink, blonde, and blue.
Q5: The performer on the far left is playing a harp.
Q6: The fireworks performance scene occurs between 88.0-91.0 seconds.
Q7: The silhouette-like depiction of the audience in the foreground creates a sense of depth and focus on the performers on stage, enhancing the viewer's perspective of the stage.
🎥 Example 2: Video Summarization (Honkai Impact 3rd)
honkai3_becauseofyou.mp4
🔷 Query: Describe the video in detail.
🔴 Tempo Response: ⏱️
Global Summary

The video opens with a high-tech, neon-lit cityscape at night, filled with floating structures and digital screens. A character with pink hair and a blue eye patch is seen running through the city, dodging obstacles. The environment is dark with vibrant neon colors, primarily purples, blues, and greens. The character is dressed in a futuristic outfit with a weapon. As the character moves, they are surrounded by digital elements and symbols, suggesting a cyberpunk setting. The scene shifts to a close-up of the character's face, showing determination and focus. The character is then seen in a combat stance, ready to fight. The lighting is dynamic, with flashes of light and energy. The video ends with a black screen displaying credits in white text.
Tempo bridges the gap between hour-long videos and bounded LLM context windows by casting long video understanding as an end-to-end, query-aware cross-modal distillation process.
Instead of blindly sampling frames, our pipeline operates in three highly efficient phases:
- 1. Local Compressor (Early Distillation): A Small VLM (SVLM) processes video segments alongside the user query. It uses learnable memory tokens to inherently distill visual semantics, dropping query-irrelevant backgrounds early on.
- 2. Adaptive Token Allocation (ATA): Operating as a training-free, O(1) dynamic router during inference, ATA intercepts zero-shot relevance scores from the SVLM. It allocates dense token bandwidth to query-critical segments while rapidly fast-forwarding redundancies into minimal temporal anchors.
- 3. Global Decoder (Synthesis): The highly compressed, filtered memory tokens are assembled with explicit temporal tags (e.g., `<t=2.0s>`). A large global LLM then synthesizes this condensed storyline to generate precise answers without suffering from attention dilution.
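This README does not spell out ATA's exact allocation rule, so as a rough, hypothetical illustration of the idea only (proportional-to-relevance splitting of the budget, clamped to the 0.5–16 tokens-per-frame dynamic range; the function name and formula are our assumptions):

```python
def allocate_tokens(relevance, frames_per_segment, budget,
                    min_tpf=0.5, max_tpf=16.0):
    """Illustrative ATA-style allocator: split a total visual-token
    budget across segments in proportion to SVLM relevance scores,
    clamping each segment to the dynamic tokens-per-frame range."""
    total = sum(relevance) or 1.0
    alloc = []
    for score, n in zip(relevance, frames_per_segment):
        tpf = budget * (score / total) / n    # raw tokens-per-frame share
        tpf = max(min_tpf, min(max_tpf, tpf))  # clamp to [0.5, 16]
        alloc.append(tpf * n)                  # tokens for this segment
    return alloc

# A query-critical segment gets dense tokens; background is fast-forwarded.
print(allocate_tokens([0.9, 0.1], [50, 50], budget=1000))  # → [800.0, 100.0]
```

Note how the clamp enforces the compression range described above: the highly relevant segment saturates at 16 tokens/frame, while low-relevance segments never drop below the minimal temporal anchors.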
Tempo achieves state-of-the-art performance on long video benchmarks while using a fraction of the token budget compared to traditional models.
| Model | Size | Tokens / Frame | LongVideoBench | MLVU | Video-MME | LVBench |
|---|---|---|---|---|---|---|
| Proprietary Models | ||||||
| GPT-4o | - | - | 66.7 | 64.6 | 71.9 | 30.8 |
| Gemini 1.5 Pro | - | - | 64.0 | - | 75.0 | 33.1 |
| General Open-Source | ||||||
| VideoLLaMA3* | 7B | ≤ 91 | 59.8 | 73.0 | 66.2 | 45.3 |
| Qwen2.5-VL | 7B | 1924 | 56.0 | 70.2 | 65.1 | 45.3 |
| Qwen3-VL* | 8B | ≤ 640 | - | 78.1 | 71.4 | 58.0 |
| Specialized Long Video | ||||||
| LongVA | 7B | 144 | - | 56.3 | 52.6 | - |
| Kangaroo | 8B | 256 | 54.8 | 61.0 | 56.0 | 39.4 |
| LongVU | 7B | 64 | - | 65.4 | 60.6 | - |
| VideoChat-Flash | 7B | 16 | 64.7 | 74.7 | 65.3 | 48.2 |
| Tempo (4K Budget)* | 6B | 0.5–16 | 64.5 | 75.6 | 67.8 | 52.7 |
| ↳ actual avg. toks/frame | | | 2.8 | 2.8 | 3.6 | 2.9 |
| Tempo (8K Budget)* | 6B | 0.5–16 | 65.1 | 75.2 | 67.7 | 52.3 |
| ↳ actual avg. toks/frame | | | 3.1 | 3.3 | 4.3 | 3.5 |
💡 Note: While configured with a theoretical dynamic range of 0.5–16 tokens per frame, Tempo's Adaptive Token Allocation (ATA) operates substantially below the maximum limits in practice (see the actual avg. rows). For the complete leaderboard and metrics, please visit our Project Page.
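As a back-of-the-envelope check of what these averages imply (our own arithmetic, assuming an "8K budget" means 8192 tokens; not a number reported in the paper):

```python
# Rough frame capacity implied by a token budget (illustrative arithmetic)
def frames_covered(budget_tokens: int, avg_tokens_per_frame: float) -> int:
    return int(budget_tokens / avg_tokens_per_frame)

# At the 8K budget with ~3.5 avg tokens/frame (LVBench row above),
# the budget spans on the order of two thousand frames.
print(frames_covered(8192, 3.5))  # → 2340
```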
Our evaluation scripts are organized under ./scripts/eval/.
💡 Note: By default, the scripts are configured for a 4-GPU setup. Before running, please ensure you adjust `NUM_GPUS` and `CUDA_VISIBLE_DEVICES` inside the scripts to match your local machine environment.
To reproduce the main results across all benchmarks, run the script for your desired visual token budget:
# For 4K budget
sh ./scripts/eval/4k_budgets/eval_all.sh
# For 8K budget
sh ./scripts/eval/8k_budgets/eval_all.sh

To run evaluation on a specific benchmark (e.g., LVBench under the 4K budget), execute the corresponding script:
sh ./scripts/eval/4k_budgets/eval_lvbench.sh

For a quick sanity check to ensure your environment and model loading are properly configured:
sh ./scripts/eval/eval_debug.sh

The lmms-eval framework only reports the Overall accuracy for Video-MME. To obtain the detailed breakdown across different video lengths (Short, Medium, Long), use our provided script:
python ./scripts/eval/split_videomme.py
⚠️ Before running, please open `split_videomme.py` and modify the `jsonl_file` path to point to your recently generated `*_samples_videomme.jsonl` log file. The overall score calculated by this script matches the `lmms-eval` output.
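If you prefer to script the per-length breakdown yourself, here is a minimal sketch of the idea. The `duration` and `score` field names are our assumptions for illustration, so check your `*_samples_videomme.jsonl` for the actual keys written by `lmms-eval`:

```python
import json
from collections import defaultdict

def accuracy_by_duration(jsonl_lines):
    """Group per-sample scores by video length and average each bucket.
    Field names are hypothetical; adapt them to your actual log schema."""
    buckets = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        buckets[rec["duration"]].append(rec["score"])
    return {d: sum(s) / len(s) for d, s in buckets.items()}

lines = [
    '{"duration": "short", "score": 1}',
    '{"duration": "short", "score": 0}',
    '{"duration": "long",  "score": 1}',
]
print(accuracy_by_duration(lines))  # → {'short': 0.5, 'long': 1.0}
```

The overall score is just the mean over all samples, which is why it should agree with the `lmms-eval` Overall number regardless of how the buckets are split.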
We are actively cleaning up and organizing the training codebase for public release.
- Training Scripts: The complete source code and configuration files for all 4 stages of our progressive training curriculum.
Stay tuned! Please watch/star the repository to get notified as soon as the training code drops.
While Tempo provides a strong foundation for long video understanding, it opens up several exciting possibilities for the community. Potential avenues for future research include:
- Routing Post-Training: Enhancing the SVLM's zero-shot routing precision via RL to elicit stronger relevance priors.
- Autoregressive Compression: Exploring reasoning-driven, dynamic length token generation for query-aware segment compression.
- Multi-Turn Efficiency: Implementing hierarchical, on-demand visual extraction to support extremely fast multi-turn dialogue.
📖 For more detailed insights, please refer to the Discussion and Future Works section in our Paper.
🤝 Call for Collaboration: We warmly welcome community contributions! If you are interested in exploring these directions, building upon Tempo, or collaborating on future research, please feel free to reach out to us directly at junjiefei@outlook.com. Let's push the boundaries of long video understanding together!
If you find our work useful for your research and applications, please consider citing our paper:
@article{fei2026small,
title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
journal={arXiv preprint arXiv:2604.08120},
year={2026}
}

Junjie Fei, Mingchen Zhuge, Shuming Liu, and Mohamed Elhoseiny were supported by funding from the KAUST Center of Excellence for Generative AI.
We extend our sincere gratitude to the open-source community for their invaluable contributions that made this research possible:
- Evaluation Benchmarks: LVBench, Video-MME, MLVU, and LongVideoBench.
- Codebase Foundations: Our codebase is built upon the excellent architectures of LongVU, VideoChat-Flash, LLaVA, and LMMs-Eval.
- Pre-trained Weights: Our models are initialized using the powerful foundational models from Qwen3-VL and Qwen3-LM.
- Project Page: Our website template is adapted from the Nerfies project.
- ⚖️ Framework & Code: The Tempo framework's source code is open-sourced under the Apache-2.0 License to foster research and community development.
- 🚫 Model Weights: The pre-trained model weights and checkpoints are distributed strictly for academic and non-commercial research purposes. Any commercial use is explicitly prohibited without prior written consent.
- 🎬 Video & Data Assets: All visual media (video clips, images) showcased on our project page and within evaluation pipelines are utilized under the doctrine of Fair Use, exclusively for non-profit academic research and scientific illustration. We claim no ownership of the original, copyrighted media assets.
Take-down Notice: We deeply respect the intellectual property rights of creators. If you are a copyright holder and believe that any content hosted in this repository or our project page infringes upon your rights, please contact us at junjiefei@outlook.com or open an Issue. We will promptly investigate and remove the identified content upon verification.
