- twelve-labs
- Seoul
Stars
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Implementations and experiments on mHC by DeepSeek - https://arxiv.org/abs/2512.24880
A Paper List for Humanoid Robot Learning.
Official repository of the paper "Does audio matter for modern video-LLMs and their benchmarks?"
An open-source implementation for fine-tuning Qwen-VL series by Alibaba Cloud.
Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
slime is an LLM post-training framework for RL Scaling.
A framework for efficient model inference with omni-modality models
ChatDev 2.0: Dev All through LLM-powered Multi-Agent Collaboration
Ring attention implementation with flash attention
🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
[NeurIPS 2025] HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models
Lightning-Fast RL for LLM Reasoning and Agents. Made Simple & Flexible.
GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
An efficient video loader for deep learning with smart shuffling that's super easy to digest
Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
video-SALMONN 2 is a powerful audio-visual large language model (LLM) that generates high-quality audio-visual video captions, which is developed by the Department of Electronic Engineering at Tsin…
Native Multimodal Models are World Learners
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More!


