TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

🌟 Overview

TimeChat-Captioner is a multimodal model designed to generate detailed, time-aware, and structurally coherent captions for multi-scene videos. It coordinates visual and audio information so that each description covers both what is seen and what is heard.

🌐 Project Page: timechat-captioner.github.io (coming soon)
🏠 Model: TimeChat-Captioner (7B)
📚 Train Dataset: TimeChatCap-40K
🏆 Benchmark: OmniDCBench

🚀 Quick Start

The example below shows how to use TimeChat-Captioner-GRPO-7B with 🤗 Transformers.

Installation

conda create -n timechatcap python=3.12
conda activate timechatcap
pip install torch torchvision
pip install transformers==4.57.1
pip install accelerate
pip install flash-attn --no-build-isolation
# The `[decord]` extra is highly recommended for faster video loading.
# Quoting the requirement keeps shells such as zsh from expanding the brackets.
pip install "qwen-omni-utils[decord]" -U
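
After installation, a quick check of the environment can save a failed model load later. The following is a minimal sketch, not part of the repository: the `installed_versions` helper and the `REQUIRED` list are hypothetical names, and the distribution names are simply taken from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above (assumed to match PyPI names).
REQUIRED = [
    "torch",
    "torchvision",
    "transformers",
    "accelerate",
    "flash-attn",
    "qwen-omni-utils",
]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'MISSING'}")
```

Any package reported as MISSING can be reinstalled with the corresponding pip command before moving on to Usage.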

Usage

Note: For high-quality timestamps and captions, keep each input video to about 1 minute. Split longer videos into clips of roughly 60 seconds before processing.
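
The segmentation suggested in the note can be sketched as below. This is an illustrative helper rather than code from the repository: `clip_windows` and `ffmpeg_commands` are hypothetical names, the video duration is assumed to be known already, and the ffmpeg commands are only built, not executed.

```python
import math


def clip_windows(duration_s, clip_len_s=60.0):
    """Split [0, duration_s] into consecutive windows of at most clip_len_s seconds."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    n_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(n_clips)
    ]


def ffmpeg_commands(src, duration_s, clip_len_s=60.0, stem="clip"):
    """Build (but do not run) one ffmpeg command per window."""
    cmds = []
    for i, (start, end) in enumerate(clip_windows(duration_s, clip_len_s)):
        out = f"{stem}_{i:03d}.mp4"
        cmds.append([
            "ffmpeg", "-y",
            "-ss", f"{start:.3f}",        # seek to the window start
            "-i", src,
            "-t", f"{end - start:.3f}",   # window length in seconds
            "-c", "copy",                 # stream copy: fast, but cuts snap to keyframes
            out,
        ])
    return cmds


if __name__ == "__main__":
    for cmd in ffmpeg_commands("long_video.mp4", duration_s=150):
        print(" ".join(cmd))
```

Each resulting clip can then be used as VIDEO_PATH in the script that follows. Timestamps in the generated captions would presumably be relative to the start of each clip, so the window start needs to be added back to place them on the original timeline.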

import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# 1. Configuration
MODEL_ID = "yaolily/TimeChat-Captioner-GRPO-7B"
VIDEO_PATH = "example_video.mp4"  # <--- Replace with your video path

MAX_PIXELS = 297920
VIDEO_MAX_PIXELS = 297920


print(f"🚀 Processing video: {VIDEO_PATH}")

# 2. Load Model & Processor
print("⏳ Loading model...")
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.disable_talker()

# 3. Construct Conversation
# The prompt encourages detailed, time-aware audio-visual description.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text", 
                "text": "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
            },
            {
                "type": "video", 
                "video": VIDEO_PATH, 
                "max_pixels": MAX_PIXELS, 
                "max_frames": 160, 
                "fps": 2.0,
                "video_max_pixels": VIDEO_MAX_PIXELS
            }
        ],
    },
]

# 4. Process Inputs
print("⚙️  Processing inputs...")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)

inputs = processor(
    text=text, 
    audio=audios, 
    images=images, 
    videos=videos, 
    return_tensors="pt", 
    padding=True, 
    use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)

# 5. Generate Description
print("✨ Generating description...")
with torch.inference_mode():
    text_ids = model.generate(
        **inputs, 
        use_audio_in_video=True, 
        return_audio=False,
        thinker_max_new_tokens=9216,
        talker_max_tokens=9216
    )

response = processor.decode(text_ids[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)

print("\n" + "="*50)
print("🎬 VIDEO DESCRIPTION:")
print("="*50)
print(response)
print("="*50)
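
The video entry in the conversation above sets fps to 2.0 and max_frames to 160. The sketch below works out what that implies for clips of different lengths. It assumes the loader samples at the requested rate and caps the count at max_frames, which simplifies whatever qwen_omni_utils actually does; `frame_budget` is a hypothetical helper, not part of any library.

```python
def frame_budget(duration_s, fps=2.0, max_frames=160):
    """Estimate sampled frame count and effective sampling rate for one clip.

    Assumption: frames are taken at `fps` until `max_frames` is reached, after
    which the same number of frames is spread over the whole clip.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    requested = int(duration_s * fps)
    sampled = min(requested, max_frames)
    effective_fps = sampled / duration_s
    return sampled, effective_fps


if __name__ == "__main__":
    for seconds in (30, 60, 80, 160, 300):
        frames, rate = frame_budget(seconds)
        print(f"{seconds:>4}s -> {frames:>3} frames, about {rate:.2f} fps")
```

Under these assumptions the budget covers 80 seconds at the full 2 fps, and longer inputs are sampled more sparsely, which is consistent with the recommendation to keep clips at around 1 minute.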

📊 Inference on OmniDCBench

We provide a multi-GPU batch inference pipeline to evaluate TimeChat-Captioner on the OmniDCBench benchmark.

Step 1. Download and extract the benchmark videos (see Infer/readme.md for full instructions):

# Clone the dataset
git clone https://huggingface.co/datasets/yaolily/OmniDCBench OmniDCBench

# Extract videos into Video/ directory
cd OmniDCBench && mkdir -p Video
cat Movie.tar.gz.*   | tar -xzf - -C Video/
mkdir -p Video/Youtube
cat Youtube.tar.gz.* | tar -xf - -C Video/Youtube
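
On systems without cat and tar, the same extraction can be sketched in Python. This is an illustrative alternative, not the repository's procedure: `ChainedParts` and `extract_split_archive` are hypothetical names, and the part-file patterns are taken from the shell commands above. The `r|*` stream mode lets tarfile detect the compression by itself.

```python
import glob
import io
import tarfile


class ChainedParts(io.RawIOBase):
    """Read-only stream that presents several part files as one continuous file."""

    def __init__(self, paths):
        super().__init__()
        self._paths = list(paths)
        self._index = 0
        self._current = None

    def readable(self):
        return True

    def readinto(self, buffer):
        # Read from the current part; move on to the next one when it is exhausted.
        while self._index < len(self._paths):
            if self._current is None:
                self._current = open(self._paths[self._index], "rb")
            n = self._current.readinto(buffer)
            if n:
                return n
            self._current.close()
            self._current = None
            self._index += 1
        return 0  # all parts consumed: signal end of stream

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None
        super().close()


def extract_split_archive(pattern, dest):
    """Concatenate the parts matching `pattern` in sorted order and extract into `dest`."""
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no parts match {pattern!r}")
    with io.BufferedReader(ChainedParts(parts)) as stream:
        # "r|*" reads the archive as a forward-only stream with transparent compression.
        with tarfile.open(fileobj=stream, mode="r|*") as tar:
            # On recent Python versions, consider passing filter="data" here.
            tar.extractall(dest)
    return parts
```

Assuming the working directory is OmniDCBench and the Video/ and Video/Youtube directories already exist, the calls would be `extract_split_archive("Movie.tar.gz.*", "Video")` and `extract_split_archive("Youtube.tar.gz.*", "Video/Youtube")`.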

Step 2. Edit Infer/infer.sh to set your paths and run settings (MODEL_PATH, VIDEO_DIR, INPUT_PATH, GPU_NUM, etc.).

Step 3. Run inference:

cd Infer
bash infer.sh

Results will be merged into <OUTPUT_DIR>/merged_result.jsonl. See Infer/readme.md for detailed configuration options and output format.
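
To take a first look at the merged output, a generic JSON Lines reader is enough. The sketch below assumes only that each non-empty line of merged_result.jsonl is one JSON object; the field names are not assumed here and are documented in Infer/readme.md. `read_jsonl` is a hypothetical helper.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return records
```

For example, after `records = read_jsonl("<OUTPUT_DIR>/merged_result.jsonl")` (with `<OUTPUT_DIR>` replaced by your own path), `len(records)` gives the number of results and `sorted(records[0].keys())` lists the fields available for the first one.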

🔧 Train

Training can be launched using the scripts provided in Train/script/*.sh. Please refer to Train/readme.md for detailed instructions.

📝 TODOs

Upload eval code to calculate SODA_M and F1.
Integrate eval code into lmms-eval.

📖 Citation

@misc{yao2026timechatcaptioner,
      title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions}, 
      author={Linli Yao and Yuancheng Wei and Yaojie Zhang and Lei Li and Xinlong Chen and Feifan Song and Ziyue Wang and Kun Ouyang and Yuanxin Liu and Lingpeng Kong and Qi Liu and Pengfei Wan and Kun Gai and Yuanxing Zhang and Xu Sun},
      year={2026},
      eprint={2602.08711},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.08711}
}
