I am a third-year Ph.D. student in Computer Science at Fudan University, advised by Prof. Pengfei Liu and Prof. Yu Qiao. My research focuses on multimodal agents in partially observed environments, especially how they perceive, reason, act, and predict over long-horizon tasks.
So far, my work has mainly studied how reinforcement learning changes multimodal models, including training dynamics and generalization (e.g., MAYE), joint reasoning-and-perception training (e.g., One-RL-to-See-Them-All), and agentic vision tool use (e.g., Med). Prior to this, I also worked on unified multimodal models (e.g., ANOLE) and text generation (e.g., MoPS).
My current focus is predictive multimodal learning from video, especially how video data can support the anticipation of future observations.
I have had valuable internship experiences at MiniMax and HailuoAI (2025.01 - 2026.02), where I contributed to the M-series foundation models and the Hailuo video generation models, and at Shanghai AI Laboratory (2023.07 - 2025.01), where I focused on text generation and multimodal foundation models.

The full listing can also be found on my Google Scholar profile, but here you can find links to related material such as code repositories.
Website template adapted from joschu.net.