Stars
Ovis-Image is a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints.
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
MTEB: Massive Text Embedding Benchmark
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
Official Implementation of ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
[MathCoder, MathCoder-VL] Family of LLMs/LMMs for mathematical reasoning.
OpenThinkIMG is an end-to-end open-source framework that empowers LVLMs to think with images.
Awesome Unified Multimodal Models
Training VLM agents with multi-turn reinforcement learning
An Open-source RL System from ByteDance Seed and Tsinghua AIR
This is the first paper to explore how to effectively use R1-like RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-start initialization and RL training to incentivize reas…
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
[NeurIPS 2024] Official Implementation of Hawk: Learning to Understand Open-World Video Anomalies
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
verl: Volcano Engine Reinforcement Learning for LLMs
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
Train transformer language models with reinforcement learning.
This repo contains the code for "MEGA-Bench Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR 2025]
A high-throughput and memory-efficient inference and serving engine for LLMs
This repository contains datasets and baselines for benchmarking Chinese text recognition.
Use PEFT or full-parameter training to run CPT/SFT/DPO/GRPO on 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
Arena-Hard-Auto: An automatic LLM benchmark.
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
A flexible and efficient training framework for large-scale alignment tasks
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.


