I am a third-year Ph.D. student in Computer Science at Fudan University, advised by Prof. Pengfei Liu and Prof. Yu Qiao. My research focuses on multimodal agents in partially observed environments, especially how they perceive, reason, act, and predict over long-horizon tasks.
So far, my work has mainly studied how reinforcement learning changes multimodal models, including training dynamics and generalization (e.g., MAYE), joint reasoning-and-perception training (e.g., One-RL-to-See-Them-All), and agentic vision tool use (e.g., Med). Prior to this, I also worked on unified multimodal models (e.g., ANOLE) and text generation (e.g., MoPS).
My current focus is predictive multimodal learning from video, especially how video data can support the anticipation of future observations.
I have had valuable internship experiences at MiniMax and HailuoAI (2025.01 - 2026.02), where I contributed to the M-series foundation models and the Hailuo video generation models, and at Shanghai AI Laboratory (2023.07 - 2025.01), where I focused on text generation and multimodal foundation models.

The full listing can also be found on my Google Scholar profile, but here you can find links to related material such as code repositories.
Website template adapted from joschu.net.