Ovis-Image is a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints.
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
A unified model that seamlessly integrates multimodal understanding, text-to-image generation, and image editing within a single powerful framework.
Awesome Unified Multimodal Models
Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
[NeurIPS 2024] Official Implementation of Hawk: Learning to Understand Open-World Video Anomalies
Code for the paper "Harnessing Webpage UIs for Text-Rich Visual Understanding"
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
Official Repo for the paper: VCR: Visual Caption Restoration. Check arxiv.org/pdf/2406.06462 for details.
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
AIDC-AI / AutoGPTQ
Forked from AutoGPTQ/AutoGPTQ. An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
Agentic ADK is an Agent application development framework launched by Alibaba International AI Business, based on Google-ADK and Ali-LangEngine.
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Steer LLM outputs towards a certain topic/subject and enhance response capabilities using activation engineering by adding steering vectors
Use PEFT or full-parameter training to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3, Qwen3-MoE, DeepSeek-R1, GLM4.5, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Llava, …
[ECCV 2024] Official Implementation of An Incremental Unified Framework for Small Defect Inspection
Programmer's guide to cooking at home (content in Simplified Chinese only).
[ICML 2024] | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
[NeurIPS 2024] MATH-Vision dataset and code to measure multimodal mathematical reasoning capabilities.
Code for Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
When do we not need larger vision models?
Data and Code for ACL 2021 Paper "Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning"