Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis (CVPR 2025)
Our method is training-free: it combines T2I and T2V models to leverage their respective advantages. A high-quality video should encompass the following elements: (1) imaging quality, (2) style diversity, (3) effective motion, and (4) temporal consistency. These aspects are reflected in the various metrics of the VBench benchmark (Figure 2).
Effective motion is the essence of a video, and Dynamic Degree is one of its key dimensions. Capturing it effectively requires training on high-FPS videos, which often comes at the cost of temporal consistency, i.e., Motion Smoothness and Subject Consistency. A notable example of this trade-off is VideoCrafter-2.0. Building on this foundation, we enhance such videos by leveraging T2V models known for their strong temporal consistency, such as AnimateDiff-Lightning and AnimateLCM. Additionally, we incorporate T2I (Text-to-Image) models renowned for their exceptional imaging quality, like epiCRealism and realisticVisionV60B1_v51VAE, as well as LoRA-based styles (e.g., watercolor).
Instead of a basic composition that alternates between noising and denoising to transition between models, we switch between T2I and T2V during the denoising process to eliminate redundant steps.
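The sketch below gives one possible reading of this idea; the helper names and exact call sequence are assumptions for illustration, not code from this repo. The partially denoised latents are handed from the T2I model to the T2V model and back within a single generation, rather than completing a full denoise/re-noise cycle per model.

```python
# Illustrative sketch of the IDM -> VDM -> IDM composition (hypothetical helpers;
# see the hyperparameter section below for the switch point and step counts).
def evs_sketch(src_latents, idm, vdm, idm_strength, vdm_strength, stable_steps, stable_num):
    # IDM stage: partially noise the source latents, then denoise spatially with
    # the T2I model, stopping stable_steps before a fully clean latent.
    z = idm.invert(src_latents, strength=idm_strength)
    z = idm.denoise(z, stop_before_clean=stable_steps)

    # VDM stage: noise to a comparable strength, then run only stable_num temporal
    # denoising steps with the T2V model to restore cross-frame consistency.
    z = vdm.invert(z, strength=vdm_strength)
    z = vdm.denoise(z, num_steps=stable_num)

    # Final IDM stage: hand back to the T2I model to finish denoising for imaging quality.
    z = idm.invert(z, strength=idm_strength)
    return idm.denoise(z)
```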
Models are downloaded under ./pretrained_models. Here we list each "vdm_type"/"idm_type" with its download URL. You can change "vdm_type"/"idm_type" in the config.
- "SD": stable-diffusion-v1-5 from HERE.
- "EP": epiCRealism from HERE.
- "RV": realisticVisionV60B1_v51VAE from HERE
- "watercolor": from HERE
We resize the latents during the T2I stage to increase resolution. Thus, our "idm_type" is formatted as "{model_name}_X{upscale_rate}", for example "EP_X2.0".
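As a rough illustration of how such an identifier could be parsed and applied, assuming a simple split on "_X" and bilinear latent resizing (the helpers below are illustrative, not code from this repo):

```python
import torch
import torch.nn.functional as F

def parse_idm_type(idm_type: str):
    """Split an identifier such as "EP_X2.0" into the T2I model name and the upscale rate."""
    model_name, upscale_rate = idm_type.split("_X")
    return model_name, float(upscale_rate)

def upscale_latents(latents: torch.Tensor, rate: float) -> torch.Tensor:
    """Resize (frames, channels, H, W) latents before the T2I stage to raise resolution."""
    return F.interpolate(latents, scale_factor=rate, mode="bilinear", align_corners=False)

model_name, rate = parse_idm_type("EP_X2.0")   # -> ("EP", 2.0)
```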
- "lightningEP"/"lightningSD": Download from HERE. Specifically, we use animatediff_lightning_8step_diffusers.safetensors. Since it trains a temporal module that supports different T2I spatial modules, we specify "vdm_type" as "lightningEP" or "lightningSD" (a rough loading sketch follows this list).
- "animatelcm": Download from HERE. The base T2I module is SD-v1.5.
conda create -n EVS python==3.9.0
conda activate EVS
pip install -r requirements.txt
Hyperparameters are set in the config files. We provide default demo configs under ./configs for different tasks: (1) enhance: enhance the quality of VideoCrafter-2.0 generated videos; (2) edit: edit a VideoCrafter-2.0 generated video; (3) real_edit: edit a real video; and (4) style: change the style to watercolor.
Here is an example config for the enhance task:
{
    "mp4_files": [
        source_video1,
        source_video2,
        ...
    ],
    "prompts": [
        target_prompt1,
        target_prompt2,
        ...
    ],
    "using_attn": true,
    "pipeline_mode": "IDM_VDM_IDM",
    "idm_invert_mode": "random",
    "vdm_invert_mode": "random",
    "idm_invert_mode_2": "ddim",
    "vdm_type": "lightningEP",
    "idm_type": "EP_X2.0",
    "idm_inverse_strength": 0.4,
    "vdm_inverse_strength": 0.4,
    "stable_num": 2,
    "stable_steps": [8],
    "seed": 23
}
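Each source video is processed with the target prompt at the same index. One way to consume a config of this shape is sketched below; the file name and the repo's actual entry script are assumptions for illustration.

```python
import json

# Illustrative only: the config file name may differ in your setup.
with open("configs/enhance.json") as f:
    cfg = json.load(f)

# Pair each source video with the target prompt at the same index.
for mp4_file, prompt in zip(cfg["mp4_files"], cfg["prompts"]):
    print(f"{mp4_file} -> {prompt}")
```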
We introduce several key hyperparameters involved in the encapsulated composition process and provide guidance on how to adjust them effectively.
- "pipeline_mode": different composition shown in Figure.3.
- "idm_invert_mode"/"vdm_invert_mode"/"idm_invert_mode_2": for
pipeline_mode="IDM_VDM_IDM", different invert modes are used. - "idm_inverse_strength": noising strength of T2I. For enhance task, we use a smaller value; for edit task, we use a larger value.
- "stable_steps": during T2I denoising, at which step we switch to T2V. The smaller of stable_steps, the later we switch (stable_steps=0 means we switch to T2V after T2I has been totally denoised to clean latent). It is important to note that, stable_steps should be smaller than
T_I * idm_inverse_strength. - "vdm_inverse_strength": noising strength of T2V. It should be comparable to "idm_inverse_strength" to ensure that inconsistencies can be effectively eliminated.
- "stable_num": how many T2V denoising step is used. Since the one-step process of T2V is much slower than that of T2I, even though we use an acclerating T2V model with T_V=10, we still do not wait for T2V to fully denoise before switching to T2I.
For ease of implementation, we extract the Selective Feature Injection process into a separate script, SFI.py.
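As a very rough, generic illustration of the feature-injection idea (this is not the repo's SFI.py; the class and blending scheme below are assumptions), features captured on selected attention blocks during one pass can be blended back in during another:

```python
class FeatureBank:
    """Cache intermediate features from one pass and blend them into another pass."""
    def __init__(self):
        self.features = {}

    def save_hook(self, name):
        # Forward hook that stores the module output under a given name.
        def hook(module, inputs, output):
            self.features[name] = output.detach()
        return hook

    def inject_hook(self, name, alpha=0.5):
        # Forward hook that blends the cached feature with the freshly computed one.
        def hook(module, inputs, output):
            cached = self.features.get(name)
            if cached is not None and cached.shape == output.shape:
                return alpha * cached + (1 - alpha) * output
            return output
        return hook

# Usage idea: register save_hook on selected attention blocks during the source pass,
# then register inject_hook on the same blocks during the target pass.
```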

