Zhiyao Sun
·
Ziqiao Peng
·
Yifeng Ma
·
Yi Chen
·
Zhengguang Zhou
·
Zixiang Zhou
·
Guozhen Zhang
·
Youliang Zhang
·
Yuan Zhou
·
Qinglin Lu
·
Yong-Jin Liu
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods have achieved remarkable success, their non-causal architectures and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to the head-and-shoulder region, restricting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness.
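
The abstract above does not specify implementation details, so the following is only a rough, hypothetical sketch of how a streaming generator might keep a persistent reference entry in its attention cache (one reading of the Reference Sink) and re-index positions relative to that anchor (one reading of RAPR). The names StreamingCache and reanchored_positions, and all shapes, are illustrative assumptions, not the authors' code.

import torch

class StreamingCache:
    """Sliding-window KV cache that never evicts the reference frame's
    entries (a 'reference sink'). Window size and tensor shapes are
    illustrative only."""

    def __init__(self, window: int):
        self.window = window          # number of recent frames to keep
        self.ref_kv = None            # reference-frame keys/values, never evicted
        self.recent_kv = []           # most recent frames' keys/values

    def set_reference(self, kv: torch.Tensor) -> None:
        self.ref_kv = kv

    def append(self, kv: torch.Tensor) -> None:
        self.recent_kv.append(kv)
        if len(self.recent_kv) > self.window:
            self.recent_kv.pop(0)     # evict the oldest non-reference frame

    def context(self) -> torch.Tensor:
        # Reference entries are always prepended to the attention context.
        return torch.cat([self.ref_kv, *self.recent_kv], dim=0)


def reanchored_positions(num_recent: int) -> torch.Tensor:
    """Reference-anchored position ids: the reference frame is always
    position 0 and recent frames are indexed relative to it, so the ids
    stay bounded however long the stream runs (an assumed scheme; the
    paper's RAPR may differ)."""
    return torch.arange(num_recent + 1)


# Toy streaming loop: at each step a (placeholder) generator attends to the
# reference sink plus a short window of recent latents and emits one frame.
if __name__ == "__main__":
    dim, window = 16, 4
    cache = StreamingCache(window)
    cache.set_reference(torch.randn(1, dim))        # reference image latent

    for step in range(10):                          # stand-in for audio-driven steps
        ctx = cache.context()                       # [1 + len(recent), dim]
        pos = reanchored_positions(len(cache.recent_kv))
        new_frame = ctx.mean(dim=0, keepdim=True)   # placeholder for the distilled
                                                    # few-step generator call
        cache.append(new_frame)
        print(step, ctx.shape, pos.tolist())

In this sketch the cache stays constant-size, so per-step cost does not grow with stream length, which is the property a real-time streaming avatar would need.
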
@article{streamavatar2025,
title={StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars},
author={Zhiyao Sun and Ziqiao Peng and Yifeng Ma and Yi Chen and Zhengguang Zhou and Zixiang Zhou and Guozhen Zhang and Youliang Zhang and Yuan Zhou and Qinglin Lu and Yong-Jin Liu},
year={2025},
eprint={2512.22065},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.22065},
}