🎥 Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding

Tempo is an efficient, query-aware framework that natively compresses hour-long videos for downstream Multimodal LLMs. Instead of blindly dropping frames, Tempo acts as an intelligent temporal compressor, dynamically adjusting the video's temporal rhythm according to user intent.

Project Page | Paper | Demo

✨ Highlights

  • 🧠 Intent-Driven Compression (ATA): Uses a Small VLM as an O(1) dynamic router to allocate dense token bandwidth to query-critical moments while rapidly fast-forwarding redundant backgrounds.
  • ⚡ Extreme Efficiency: Achieves aggressive dynamic compression (0.5–16 tokens/frame), bypassing the lost-in-the-middle phenomenon without breaking causality.
  • 🏆 State-of-the-Art Performance: Our compact Tempo-6B model scores 52.3 on the extreme-long LVBench under a strict 8K visual token budget (53.7 with a 12K budget), outperforming proprietary baselines like GPT-4o and Gemini 1.5 Pro.

🎬 Watch Tempo in Action

(Click play to see our interactive UI, dynamic token allocation visualization, and real-time inference)

demo.mp4

🔥 News

  • [2026.04] 📦 We have released the Intermediate Checkpoints for Stages 0, 1, and 2! You can find them in our Hugging Face Collection.
  • [2026.04] 📊 The full Evaluation Pipeline is now open-sourced! Our evaluation scripts, integrated with the standard lmms-eval framework for LVBench, Video-MME, MLVU, and LongVideoBench, are ready to use. Please refer to the Evaluation Section for detailed instructions.
  • [2026.04] 📄 Our paper is officially out! You can read it on arXiv and check out our page on Hugging Face Papers.
  • [2026.04] 🚀 We have released the Tempo-6B inference code, interactive Gradio UI, and the final checkpoints (Stage 3)!
  • [TODO] 🛠️ Training Code: The complete training scripts for all 4 stages will be open-sourced in the coming weeks. Stay tuned!

⭐ Tip: Please Watch or Star this repository to keep an eye on our latest updates and code releases!


🚀 Quick Start

1. Installation

Create a new conda environment and install all required dependencies:

# Clone our repository
git clone https://github.com/FeiElysia/Tempo.git
cd Tempo

# Create environment
conda create -n tempo python=3.12 -y
conda activate tempo

# Install all packages (PyTorch 2.6.0 + CUDA 12.4)
pip install -r requirements.txt

⚡ Installing Flash-Attention

Since flash-attn installation can be highly environment-dependent, please install it manually using one of the methods below:

# Method 1
pip install flash-attn==2.7.4.post1

# Method 2: Without Build Isolation
pip install flash-attn==2.7.4.post1 --no-build-isolation

# Method 3: If you are unable to build from source, you can directly download and install the pre-built wheel:
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
rm flash_attn*.whl

💡 If you are unable to install flash-attn, you can still run Tempo by disabling it:

  1. Set use_flash_attn=False when calling load_pretrained_model.
  2. Comment out the line self.config._attn_implementation = "flash_attention_2" in qwen3vl_encoder.py.

Theoretically, the numerical differences should be minimal. We have visually verified that the model produces excellent qualitative results without Flash-Attention. However, please note that we have not rigorously evaluated its impact on the benchmarks we reported in the paper.
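If you want to pick the attention backend programmatically, a minimal sketch follows. The helper name `pick_attn_implementation` is ours, not part of this repo; `load_pretrained_model(use_flash_attn=...)` is the repo's actual switch, and `"sdpa"` is PyTorch's built-in fallback kernel:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" when the flash-attn package is importable,
    otherwise fall back to PyTorch's built-in SDPA kernel."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# Hypothetical usage with the repo's loader:
# model = load_pretrained_model(
#     ..., use_flash_attn=(pick_attn_implementation() == "flash_attention_2"))
```

This keeps a single code path working on machines with and without a Flash-Attention build.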

2. Model Zoo

To fully support the open-source community and facilitate future research, we have released the weights for our final model alongside the intermediate checkpoints from all 4 stages of our training pipeline.

💡 Note on Token Budgets: Tempo's Adaptive Token Allocation (ATA) is dynamically controlled at inference time. The 4K and 8K budget configurations reported in our paper use the exact same final weights (Stage 3). You simply adjust the budget hyperparameter during inference.

| Training Stage | Description | Weights |
| --- | --- | --- |
| Stage 0 | Modality Alignment | 🤗 HF Link |
| Stage 1 | Pre-training | 🤗 HF Link |
| Stage 2 | Broad Supervised Fine-Tuning | 🤗 HF Link |
| Stage 3 | Long-Context SFT (Final Tempo-6B) | 🤗 HF Link |

(Note: If you only want to run inference or evaluate our model, simply download the Stage 3 weights. The intermediate checkpoints for Stages 0, 1, and 2 are provided for researchers who wish to reproduce our training pipeline, conduct ablation studies, or perform custom fine-tuning.)

3. Prepare Checkpoints

To run the inference script successfully, you need to download two components: our final Tempo-6B weights, and the base Qwen3-VL-2B-Instruct model (for Tempo initialization).

We highly recommend using the huggingface-cli for fast and resumable downloads:

mkdir -p checkpoints

# 1. Download the final Tempo-6B model
huggingface-cli download --resume-download Vision-CAIR/Tempo-6B --local-dir ./checkpoints/Tempo-6B

# 2. Download the base Qwen3-VL model (Required for architecture initialization)
# 💡 Note: To avoid caching Qwen3-VL in the default system drive during inference,
# you can modify Tempo-6B's `config.json`: change "Qwen/Qwen3-VL-2B-Instruct" to "./checkpoints/Qwen3-VL-2B-Instruct" and run:
huggingface-cli download --resume-download Qwen/Qwen3-VL-2B-Instruct --local-dir ./checkpoints/Qwen3-VL-2B-Instruct
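The config.json edit above can also be scripted. Since we do not know which key(s) in Tempo-6B's config.json reference the base model, this sketch scans all top-level string values and rewrites any exact match; the helper name is ours:

```python
import json
from pathlib import Path

def point_config_to_local_base(config_path,
                               hub_id="Qwen/Qwen3-VL-2B-Instruct",
                               local_dir="./checkpoints/Qwen3-VL-2B-Instruct"):
    """Replace hub references to the base model in a config.json with a local
    directory, so inference loads from disk instead of the default HF cache."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    for key, value in config.items():
        if value == hub_id:  # only exact string matches are rewritten
            config[key] = local_dir
    path.write_text(json.dumps(config, indent=2))
    return config
```

Run it once against `./checkpoints/Tempo-6B/config.json` after both downloads finish.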

💻 Inference & Demos

We provide multiple ways to interact with Tempo, from web UI to batch scripts.

1. Web UI (Gradio)

To launch the local Gradio application with interactive visualizations of the Token Allocation distribution:

python app.py

Navigate to the generated local or public URL in your browser. Our UI features dynamic token compression visualization and one-click example testing.

โš ๏ธ Note: We use Git LFS to manage the demo videos in this repository. If you encounter a "Video not playable" error, it means the actual video files were not downloaded during git clone (this will not happen if your system has already installed Git LFS).

  • Option 1 (Recommended): Install Git LFS (git lfs install) and run git lfs pull inside the repository to fetch the real video files.
  • Option 2: Manually download the specific .mp4 files directly from our repository web interface.

2. Single Video Inference

Run the default example: We provide a quick-start script to test a pre-configured example:

sh ./scripts/infer/infer.sh

Run your own custom video: To test your own videos, call the Python script directly. Make sure to point the --model_path to your downloaded local checkpoint.

python infer.py \
    --model_path "./checkpoints/Tempo-6B" \
    --video_path "/path/to/your/custom_video.mp4" \
    --query "Your detailed question here."

3. Batch Inference

Run all default examples: To sequentially reproduce all the qualitative examples shown on our Project Page, run:

sh ./scripts/infer/infer_all_demos.sh

Run a custom batch: For testing across multiple custom videos, we highly recommend our JSON-based pipeline. Simply edit the test configurations in ./examples/demo_cases.json:

[
  {
    "video_path": "/path/to/custom1.mp4",
    "query": "Question 1"
  },
  {
    "video_path": "/path/to/custom2.mp4",
    "query": "Question 2"
  }
]

Once configured, execute the same batch script to automatically process them all:

sh ./scripts/infer/infer_all_demos.sh
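If you prefer driving the batch from Python rather than the shell script, the entries can be expanded into the same `infer.py` command lines shown in the single-video example. The helper below is a sketch of ours, not part of the repo:

```python
import json
from pathlib import Path

def build_infer_commands(cases_path, model_path="./checkpoints/Tempo-6B"):
    """Expand each entry of a demo_cases.json-style file into an infer.py
    command line, mirroring the flags from the single-video example."""
    cases = json.loads(Path(cases_path).read_text())
    return [
        ["python", "infer.py",
         "--model_path", model_path,
         "--video_path", case["video_path"],
         "--query", case["query"]]
        for case in cases
    ]

# Hypothetical driver:
# import subprocess
# for cmd in build_infer_commands("./examples/demo_cases.json"):
#     subprocess.run(cmd, check=True)
```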

✨ See Tempo in Action

Here are some unedited inference results generated directly by our Tempo-6B model. These examples demonstrate the model's capability in video summarization and complex QA.

🧠 Example 1: Complex Multi-Task Analysis & Grounding (Honkai: Star Rail)
hsr_helloworld.mp4

🔷 Query:

Task: Please analyze the provided video and answer the following 7 questions precisely. Q1: How many performers are visible on the stage? Q2: Describe the architectural elements in the background. What historical civilization do they remind you of? Q3: What is happening in the night sky above the performers, and what does this suggest about the event? Q4: List the hair colors of the performers in order from left to right. Q5: Identify the specific musical instrument being played by the performer located on the far left of the stage. Q6: What is the specific time interval (in seconds, e.g., XX-XXs) during which this fireworks performance scene occurs in the video? Q7: Look at the audience in the foreground. How does their silhouette-like depiction affect the viewer's perspective of the stage?

๐Ÿ”ด Tempo Response: โฑ๏ธ Refer to 1:27 - 1:30

Q1: There are four performers visible on the stage.
Q2: The architectural elements in the background resemble ancient Greek columns, suggesting a historical setting reminiscent of ancient Greece.
Q3: The night sky above the performers is filled with fireworks, suggesting a celebratory event or performance.
Q4: The hair colors of the performers from left to right are purple, pink, blonde, and blue.
Q5: The performer on the far left is playing a harp.
Q6: The fireworks performance scene occurs between 88.0-91.0 seconds.
Q7: The silhouette-like depiction of the audience in the foreground creates a sense of depth and focus on the performers on stage, enhancing the viewer's perspective of the stage.

🔥 Example 2: Video Summarization (Honkai Impact 3rd)
honkai3_becauseofyou.mp4

🔷 Query: Describe the video in detail.

๐Ÿ”ด Tempo Response: โฑ๏ธ Global Summary

The video opens with a high-tech, neon-lit cityscape at night, filled with floating structures and digital screens. A character with pink hair and a blue eye patch is seen running through the city, dodging obstacles. The environment is dark with vibrant neon colors, primarily purples, blues, and greens. The character is dressed in a futuristic outfit with a weapon. As the character moves, they are surrounded by digital elements and symbols, suggesting a cyberpunk setting. The scene shifts to a close-up of the character's face, showing determination and focus. The character is then seen in a combat stance, ready to fight. The lighting is dynamic, with flashes of light and energy. The video ends with a black screen displaying credits in white text.


🧠 Methodology: The Tempo Framework

Tempo bridges the gap between hour-long videos and bounded LLM context windows by casting long video understanding as an end-to-end, query-aware cross-modal distillation process.

Tempo Architecture

Instead of blindly sampling frames, our pipeline operates in three highly efficient phases:

  • 1. Local Compressor (Early Distillation): A Small VLM (SVLM) processes video segments alongside the user query. It uses learnable memory tokens to inherently distill visual semantics, dropping query-irrelevant backgrounds early on.
  • 2. Adaptive Token Allocation (ATA): Operating as a training-free, O(1) dynamic router during inference, ATA intercepts zero-shot relevance scores from the SVLM. It allocates dense token bandwidth to query-critical segments while rapidly fast-forwarding redundancies into minimal temporal anchors.
  • 3. Global Decoder (Synthesis): The highly compressed, filtered memory tokens are assembled with explicit temporal tags (e.g., <t=2.0s>). A large global LLM then synthesizes this condensed storyline to generate precise answers without suffering from attention dilution.
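The budgeted allocation in step 2 can be pictured with a toy sketch. This is illustrative only: the paper's actual router, score normalization, and handling of clipped budget differ, and `allocate_tokens` is a name of ours:

```python
def allocate_tokens(relevance, frames_per_segment, budget,
                    min_tpf=0.5, max_tpf=16.0):
    """Split a visual-token budget across segments in proportion to their
    query-relevance scores, clamping each segment to the 0.5-16
    tokens-per-frame dynamic range. A real router would redistribute any
    budget freed by clamping; this sketch simply clips."""
    total = sum(relevance) or 1.0
    tokens = []
    for score, n_frames in zip(relevance, frames_per_segment):
        tpf = (budget * score / total) / n_frames  # ideal tokens per frame
        tpf = min(max(tpf, min_tpf), max_tpf)      # clamp to dynamic range
        tokens.append(tpf * n_frames)
    return tokens
```

Highly relevant segments keep dense token streams while irrelevant ones collapse toward the 0.5 tokens/frame floor, i.e. minimal temporal anchors.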

📊 Quantitative Results

Tempo achieves state-of-the-art performance on long video benchmarks while using a fraction of the token budget compared to traditional models.

| Model | Size | Tokens / Frame | LongVideoBench | MLVU | Video-MME | LVBench |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | - | 66.7 | 64.6 | 71.9 | 30.8 |
| Gemini 1.5 Pro | - | - | 64.0 | - | 75.0 | 33.1 |
| **General Open-Source** | | | | | | |
| VideoLLaMA3* | 7B | ≤ 91 | 59.8 | 73.0 | 66.2 | 45.3 |
| Qwen2.5-VL | 7B | 1924 | 56.0 | 70.2 | 65.1 | 45.3 |
| Qwen3-VL* | 8B | ≤ 640 | - | 78.1 | 71.4 | 58.0 |
| **Specialized Long Video** | | | | | | |
| LongVA | 7B | 144 | - | 56.3 | 52.6 | - |
| Kangaroo | 8B | 256 | 54.8 | 61.0 | 56.0 | 39.4 |
| LongVU | 7B | 64 | - | 65.4 | 60.6 | - |
| VideoChat-Flash | 7B | 16 | 64.7 | 74.7 | 65.3 | 48.2 |
| Tempo (4K Budget)* | 6B | 0.5–16 | 64.5 | 75.6 | 67.8 | 52.7 |
| ↳ actual avg. toks/frame | | | 2.8 | 2.8 | 3.6 | 2.9 |
| Tempo (8K Budget)* | 6B | 0.5–16 | 65.1 | 75.2 | 67.7 | 52.3 |
| ↳ actual avg. toks/frame | | | 3.1 | 3.3 | 4.3 | 3.5 |

💡 Note: While configured with a theoretical dynamic range of 0.5–16 tokens per frame, Tempo's Adaptive Token Allocation (ATA) operates substantially below the maximum limit in practice (see the actual avg. rows). For the complete leaderboard and metrics, please visit our Project Page.


🧪 Evaluation

Our evaluation scripts are organized under ./scripts/eval/.

💡 Note: By default, the scripts are configured for a 4-GPU setup. Before running, please ensure you adjust NUM_GPUS and CUDA_VISIBLE_DEVICES inside the scripts to match your local machine environment.

1. Evaluate All Benchmarks

To reproduce the main results across all benchmarks, run the script for your desired visual token budget:

# For 4K budget
sh ./scripts/eval/4k_budgets/eval_all.sh

# For 8K budget
sh ./scripts/eval/8k_budgets/eval_all.sh

2. Evaluate a Single Benchmark

To run evaluation on a specific benchmark (e.g., LVBench under the 4K budget), execute the corresponding script:

sh ./scripts/eval/4k_budgets/eval_lvbench.sh

3. Debugging

For a quick sanity check to ensure your environment and model loading are properly configured:

sh ./scripts/eval/eval_debug.sh

📊 Special Note on Video-MME Scores

The lmms-eval framework only reports the Overall accuracy for Video-MME. To obtain the detailed breakdown across different video lengths (Short, Medium, Long), use our provided script:

python ./scripts/eval/split_videomme.py

โš ๏ธ Before running, please open split_videomme.py and modify the jsonl_file path to point to your recently generated *_samples_videomme.jsonl log file. The overall score calculated by this script matches the lmms-eval output.


๐Ÿ› ๏ธ Training (Coming Soon)

We are actively cleaning up and organizing the training codebase for public release.

  • Training Scripts: The complete source code and configuration files for all 4 stages of our progressive training curriculum.

Stay tuned! Please watch/star the repository to get notified as soon as the training code drops.


🔮 Next Steps

While Tempo provides a strong foundation for long video understanding, it opens up several exciting possibilities for the community. Potential avenues for future research include:

  • Routing Post-Training: Enhancing the SVLM's zero-shot routing precision via RL to elicit stronger relevance priors.
  • Autoregressive Compression: Exploring reasoning-driven, dynamic length token generation for query-aware segment compression.
  • Multi-Turn Efficiency: Implementing hierarchical, on-demand visual extraction to support extremely fast multi-turn dialogue.

📖 For more detailed insights, please refer to the Discussion and Future Works section in our Paper.

๐Ÿค Call for Collaboration: We warmly welcome community contributions! If you are interested in exploring these directions, building upon Tempo, or collaborating on future research, please feel free to reach out to us directly at junjiefei@outlook.com. Let's push the boundaries of long video understanding together!


๐Ÿ“ Citation

If you find our work useful for your research and applications, please consider citing our paper:

@article{fei2026small,
  title={Small Vision-Language Models are Smart Compressors for Long Video Understanding},
  author={Fei, Junjie and Chen, Jun and Liu, Zechun and Xiong, Yunyang and Zhou, Chong and Wen, Wei and Han, Junlin and Zhuge, Mingchen and Suri, Saksham and Qian, Qi and others},
  journal={arXiv preprint arXiv:2604.08120},
  year={2026}
}

๐Ÿค Acknowledgements

Junjie Fei, Mingchen Zhuge, Shuming Liu, and Mohamed Elhoseiny were supported by funding from the KAUST Center of Excellence for Generative AI.

We extend our sincere gratitude to the open-source community for their invaluable contributions that made this research possible.


📄 License & Disclaimer

  • โš–๏ธ Framework & Code: The Tempo framework's source code is open-sourced under the Apache-2.0 License to foster research and community development.
  • ๐Ÿšซ Model Weights: The pre-trained model weights and checkpoints are distributed strictly for academic and non-commercial research purposes. Any commercial use is explicitly prohibited without prior written consent.
  • ๐ŸŽฌ Video & Data Assets: All visual media (video clips, images) showcased on our project page and within evaluation pipelines are utilized under the doctrine of Fair Use exclusively for non-profit academic research and scientific illustration. We claim no ownership over the original, copyrighted media assets.

Take-down Notice: We deeply respect the intellectual property rights of creators. If you are a copyright holder and believe that any content hosted in this repository or our project page infringes upon your rights, please contact us at junjiefei@outlook.com or open an Issue. We will promptly investigate and remove the identified content upon verification.
