TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
TimeChat-Captioner is a multimodal model designed to generate detailed, time-aware, and structurally coherent captions for multi-scene videos. It effectively coordinates visual and audio information to provide comprehensive video descriptions.
- 🌐 Project Page: timechat-captioner.github.io (coming soon)
- 🏠 Model: TimeChat-Captioner (7B)
- 📚 Train Dataset: TimeChatCap-40K
- 🏆 Benchmark: OmniDCBench
Below, we provide simple examples to show how to use TimeChat-Captioner-GRPO-7B with 🤗 Transformers.
conda create -n timechatcap python=3.12
conda activate timechatcap
pip install torch torchvision
pip install transformers==4.57.1
pip install accelerate
pip install flash-attn --no-build-isolation
# Installing the `[decord]` extra is highly recommended for faster video loading.
pip install qwen-omni-utils[decord] -U
Note: To annotate high-quality timestamps and captions, limit video input to around 1 minute. Please segment longer videos into clips of around 60 seconds before processing.
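The segmentation step in the note above could be sketched as follows. This is a minimal illustration, not part of the TimeChat-Captioner codebase: the helper names `plan_clips` and `ffmpeg_args` are hypothetical, the video duration is assumed to be known already (for example from `ffprobe`), and `ffmpeg` is assumed to be installed if the generated commands are actually run.

```python
import math


def plan_clips(duration_s, clip_len_s=60.0):
    """Split [0, duration_s] into consecutive windows of at most clip_len_s seconds.

    Hypothetical helper for illustration; returns a list of (start, end) pairs.
    """
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    num_clips = math.ceil(duration_s / clip_len_s)
    return [
        (i * clip_len_s, min((i + 1) * clip_len_s, duration_s))
        for i in range(num_clips)
    ]


def ffmpeg_args(src, start, end, dst):
    """Build an ffmpeg argument list that cuts [start, end) from src into dst.

    The clip is re-encoded with ffmpeg's default codecs (no stream copy),
    so the cut does not depend on keyframe positions.
    """
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",        # seek to the clip start
        "-i", src,
        "-t", f"{end - start:.3f}",   # clip duration in seconds
        dst,
    ]
```

Each argument list can be passed to `subprocess.run(args, check=True)`, and the resulting clip paths can then be used as `VIDEO_PATH` one at a time. Note that timestamps in the generated captions would be relative to each clip, so the clip's start offset has to be added back to recover positions in the original video.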
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# 1. Configuration
MODEL_ID = "yaolily/TimeChat-Captioner-GRPO-7B"
VIDEO_PATH = "example_video.mp4" # <--- Replace with your video path
MAX_PIXELS = 297920
VIDEO_MAX_PIXELS = 297920
print(f"🚀 Processing video: {VIDEO_PATH}")
# 2. Load Model & Processor
print("⏳ Loading model...")
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.disable_talker()
# 3. Construct Conversation
# The prompt encourages detailed, time-aware audio-visual description.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated."
            },
            {
                "type": "video",
                "video": VIDEO_PATH,
                "max_pixels": MAX_PIXELS,
                "max_frames": 160,
                "fps": 2.0,
                "video_max_pixels": VIDEO_MAX_PIXELS
            }
        ],
    },
]
# 4. Process Inputs
print("⚙️ Processing inputs...")
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True
)
inputs = inputs.to(model.device).to(model.dtype)
# 5. Generate Description
print("✨ Generating description...")
with torch.inference_mode():
    text_ids = model.generate(
        **inputs,
        use_audio_in_video=True,
        return_audio=False,
        thinker_max_new_tokens=9216,
        talker_max_tokens=9216
    )
response = processor.decode(text_ids[0][inputs.input_ids[0].size(0):], skip_special_tokens=True)
print("\n" + "="*50)
print("🎬 VIDEO DESCRIPTION:")
print("="*50)
print(response)
print("="*50)
We provide a multi-GPU batch inference pipeline to evaluate TimeChat-Captioner on the OmniDCBench benchmark.
Step 1. Download and extract the benchmark videos (see Infer/readme.md for full instructions):
# Clone the dataset
git clone https://huggingface.co/datasets/yaolily/OmniDCBench OmniDCBench
# Extract videos into Video/ directory
cd OmniDCBench && mkdir -p Video
cat Movie.tar.gz.* | tar -xzf - -C Video/
mkdir -p Video/Youtube
cat Youtube.tar.gz.* | tar -xf - -C Video/Youtube
Step 2. Edit Infer/infer.sh to set your paths (MODEL_PATH, VIDEO_DIR, INPUT_PATH, GPU_NUM, etc.).
Step 3. Run inference:
cd Infer
bash infer.sh
Results will be merged into <OUTPUT_DIR>/merged_result.jsonl. See Infer/readme.md for detailed configuration options and output format.
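To sanity-check the merged output before computing metrics, a generic JSONL reader along the following lines may help. It is only a sketch: `load_jsonl` and `summarize` are hypothetical helper names, and no particular record schema is assumed, since the actual fields are documented in Infer/readme.md.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Count the records and how often each top-level field appears."""
    field_counts = Counter(
        key for rec in records if isinstance(rec, dict) for key in rec
    )
    return {"num_records": len(records), "field_counts": dict(field_counts)}


# Example usage (replace the path with your own <OUTPUT_DIR>):
# print(summarize(load_jsonl("path/to/output_dir/merged_result.jsonl")))
```

Comparing `num_records` with the number of benchmark videos is a quick way to spot samples that failed or were skipped on one of the GPUs.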
Training can be launched using the scripts provided in Train/script/*.sh.
Please refer to Train/readme.md for detailed instructions.
- Upload eval code to calculate SODA_M and F1.
- Integrate eval code to lmms-eval.
@misc{yao2026timechatcaptioner,
title={TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions},
author={Linli Yao and Yuancheng Wei and Yaojie Zhang and Lei Li and Xinlong Chen and Feifan Song and Ziyue Wang and Kun Ouyang and Yuanxin Liu and Lingpeng Kong and Qi Liu and Pengfei Wan and Kun Gai and Yuanxing Zhang and Xu Sun},
year={2026},
eprint={2602.08711},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.08711}
}