SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan1*†, Sheng Ye1*†, Zhuoyi Yang1,2‡, Jiayan Teng1,2, ZhenHui Dong1, Kairui Wen1, Xiaotao Gu2, Yong-Jin Liu, Jie Tang
1Tsinghua University, 2Z.ai
*Equal contribution. †Work done during internship at Z.ai. ‡Project leader. §Corresponding author.

SCAIL enables high-fidelity character animation under diverse and challenging conditions.

Abstract

Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but they often fail to preserve structural fidelity and temporal consistency in in-the-wild scenarios involving complex motion and cross-identity animation. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework that addresses these challenges through two key innovations. First, we propose a novel 3D pose representation that provides a robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline that ensures both diversity and quality, and we establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

Method

3D-Consistent Pose Representation

Exploration of Different Injection Methods

Full-Context Pose Injection with P-RoPE within DiT Architecture

SCAIL builds on Wan-I2V models and incorporates a 3D-consistent pose representation to learn precise, identity-agnostic motion. After comparing different injection methods, we adopt full-context pose injection so that the model learns spatio-temporal motion characteristics. We use Pose-shifted RoPE (P-RoPE) to facilitate learning of the spatio-temporal relations between video tokens and pose tokens.
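The idea of placing pose tokens in the same sequence as video tokens, with shifted rotary position indices, can be illustrated with a minimal sketch. This is not the SCAIL implementation: the function names, the choice of shifting along the width axis, and the shift size of one frame width are all assumptions made for illustration only. The sketch only shows how position ids might be assigned so that pose tokens stay aligned with video tokens in time and height while never sharing a position with them.

```python
from itertools import product


def grid_positions(T, H, W, shift=(0, 0, 0)):
    """Return (t, h, w) position ids for a T x H x W token grid, offset by `shift`."""
    dt, dh, dw = shift
    return [
        (t + dt, h + dh, w + dw)
        for t, h, w in product(range(T), range(H), range(W))
    ]


def full_context_positions(T, H, W, pose_shift=None):
    """Position ids for the concatenated sequence [video tokens ; pose tokens].

    Hypothetical helper: the default shift (one frame width along the
    width axis) is an assumption, not a value taken from the source.
    """
    if pose_shift is None:
        pose_shift = (0, 0, W)
    video = grid_positions(T, H, W)
    pose = grid_positions(T, H, W, shift=pose_shift)
    # Full-context injection: both token sets share one attention sequence.
    return video + pose


T, H, W = 2, 2, 3
positions = full_context_positions(T, H, W)
n_video = T * H * W
video_pos, pose_pos = positions[:n_video], positions[n_video:]
```

Under these assumptions, every pose token sits at a constant offset from its corresponding video token, so a rotary embedding that encodes relative position would see the same video-to-pose displacement everywhere in the clip.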

Results Gallery



Community Works

❤️ Heartfelt thanks to our friends in the community for their creativity! The results below are community works in GIF format; the original videos are shared with their gracious consent. We were surprised by the emergent abilities our model exhibited: understanding the 3D spatial relationships of 2D characters, driving hand-drawn artwork, and even controlling quadrupeds despite having no animal training data at all.


Chibi Gotham Battle

Homer in Slowmo (w/ Uni3c)

Anime Art Animation

Street Fighter 6 Motion Mimic

Doodle Art Animation

Dual Dance

Group Dance

Quadrupeds Animation (w/ ViTPose)

Comparison on Self-Driven Complex Motion

Ballet

Straddle

Comparison on Cross-Driven Complex Motion


Transfer human motion to human characters.

Acrobats

Expressive Body Movements

Occluded Postures

Fighting Scenes


Transfer human motion to anime characters.

Motion of Nonstandard Figures

Anime Characters' Interactions

Examples on Studio-Bench

BibTeX

@article{yan2025scail,
  title={SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations},
  author={Yan, Wenhao and Ye, Sheng and Yang, Zhuoyi and Teng, Jiayan and Dong, ZhenHui and Wen, Kairui and Gu, Xiaotao and Liu, Yong-Jin and Tang, Jie},
  journal={arXiv preprint arXiv:2512.05905},
  year={2025}
}