Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu * , Tianxiang Ma * , Bingchuan Li * †, Zhuowei Chen * , Jiawei Liu, Qian He, Xinglong Wu
* Equal contribution, † Project lead
Intelligent Creation Team, ByteDance
Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It redesigns the joint text-image injection model and uses text-image-video triplet data to achieve cross-modal alignment. It also emphasizes subject consistency in human generation and improves ID-preserving video generation.
- Identity-Preserving Video Generation.
- Single-Reference Subject-to-Video Generation.
- Multi-Reference Subject-to-Video Generation.

@article{liu2025phantom,
title={Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment},
author={Liu, Lijie and Ma, Tianxiang and Li, Bingchuan and Chen, Zhuowei and Liu, Jiawei and He, Qian and Wu, Xinglong},
journal={arXiv preprint arXiv:2502.11079},
year={2025}
}