
Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis (CVPR 2025)

[Teaser figure]

Introduction

Our method is training-free and combines T2I and T2V models to leverage their respective advantages. A high-quality video should encompass the following elements: (1) imaging quality, (2) style diversity, (3) effective motion, and (4) temporal consistency. These aspects are reflected in various metrics of the VBench benchmark (Figure 2).

[Figure 2: VBench metrics]

Effective motion is the essence of a video, and Dynamic Degree is one of its key dimensions. Capturing it effectively requires training on high-FPS videos, which often comes at the cost of temporal consistency, i.e., Motion Smoothness and Subject Consistency. A notable example of this trade-off is VideoCrafter-2.0. Building on this observation, we enhance such videos by leveraging T2V models known for their strong temporal consistency, such as AnimateDiff-Lightning and AnimateLCM. Additionally, we incorporate T2I (Text-to-Image) models renowned for their exceptional imaging quality, like epiCRealism and realisticVisionV60B1_v51VAE, as well as LoRA-based styles (e.g., watercolor).

Instead of a basic composition that alternates between noising and denoising to transition between different models, we switch between T2I and T2V during the denoising process to eliminate the redundant steps.
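To make the difference concrete, here is an illustrative sketch (pseudocode only; the function names, arguments, and step counts are placeholders, not the repository's actual API):

# Illustrative pseudocode: t2i_denoise, t2v_denoise and add_noise are placeholders.

def basic_composition(latent, t2i_denoise, t2v_denoise, add_noise):
    # Fully denoise with T2I, then re-noise so that T2V can denoise once more.
    latent = t2i_denoise(latent)                 # clean image latent
    latent = add_noise(latent, strength=0.4)     # redundant re-noising step
    latent = t2v_denoise(latent)                 # clean video latent
    return latent

def encapsulated_composition(latent, t2i_denoise, t2v_denoise, switch_step=8):
    # Hand the partially denoised T2I latent to T2V mid-trajectory,
    # skipping the extra noising/denoising round trip.
    latent = t2i_denoise(latent, stop_at_step=switch_step)
    latent = t2v_denoise(latent, start_at_step=switch_step)
    return latent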

Setup

Models are downloaded to ./pretrained_models. Below we list each "vdm_type"/"idm_type" together with its download URL. You can change "vdm_type"/"idm_type" in the config.

T2I

  • "SD": stable-diffusion-v1-5 from HERE.
  • "EP": epiCRealism from HERE.
  • "RV": realisticVisionV60B1_v51VAE from HERE
  • "watercolor": from HERE

We resize the latents during the T2I stage to increase resolution. Thus, our "idm_type" is formatted as "{model_name}_X{upscale_rate}", for example "EP_X2.0".
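As a quick check that a downloaded T2I checkpoint loads, a minimal diffusers snippet is sketched below (the local paths are assumptions; adapt them to wherever the weights sit under ./pretrained_models):

import torch
from diffusers import StableDiffusionPipeline

# Assumed local layout; adjust to your actual download paths under ./pretrained_models.
# If the checkpoint is a single .safetensors file, use
# StableDiffusionPipeline.from_single_file(...) instead.
pipe = StableDiffusionPipeline.from_pretrained(
    "./pretrained_models/epiCRealism",        # the "EP" checkpoint in diffusers format
    torch_dtype=torch.float16,
).to("cuda")

# Optionally attach a style LoRA such as watercolor (path is an assumption).
# pipe.load_lora_weights("./pretrained_models/watercolor")

image = pipe("a boat on a lake, watercolor style", num_inference_steps=25).images[0]
image.save("t2i_check.png")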

T2V

  • "lightningEP"/"lightningSD": Download from HERE. Specifically, we use animatediff_lightning_8step_diffusers.safetensors. Since it trains temporal module which support different T2I spatial module, we specify "vdm_type" as lightningEP or lightningSD.

  • "animatelcm": Donwload from HERE. The base T2I module is SD-v1.5.

Creating a Conda Environment

conda create -n EVS python==3.9.0
conda activate EVS
pip install -r requirements.txt

Inference

Hyperparameters are set in the config. We provide default configs under ./configs for different tasks: (1) enhance: enhance the video quality of VideoCrafter-2.0 outputs; (2) edit: edit a VideoCrafter-2.0 generated video; (3) real_edit: edit a real video; and (4) style: change the style to watercolor.

Here is an example config for the enhance task:

{
    "mp4_files": [
        source_video1,
        source_video2,
        ...

    ],
    "prompts": [
        target_prompt1,
        target_prompt2,
        ...

    ],
    "using_attn": true,
    "pipeline_mode": "IDM_VDM_IDM",
    "idm_invert_mode": "random",
    "vdm_invert_mode": "random",
    "idm_invert_mode_2": "ddim",
    "vdm_type": "lightningEP",
    "idm_type": "EP_X2.0",

    "idm_inverse_strength": 0.4,
    "vdm_inverse_strength": 0.4,
    "stable_num": 2,
    "stable_steps": [8],
    "seed": 23
}

We introduce several key hyperparameters involved in the encapsulated composition process and provide guidance on how to adjust them effectively.

Encapsulated Composition

  • "pipeline_mode": different composition shown in Figure.3.
  • "idm_invert_mode"/"vdm_invert_mode"/"idm_invert_mode_2": for pipeline_mode="IDM_VDM_IDM", different invert modes are used.
  • "idm_inverse_strength": noising strength of T2I. For enhance task, we use a smaller value; for edit task, we use a larger value.
  • "stable_steps": during T2I denoising, at which step we switch to T2V. The smaller of stable_steps, the later we switch (stable_steps=0 means we switch to T2V after T2I has been totally denoised to clean latent). It is important to note that, stable_steps should be smaller than T_I * idm_inverse_strength.
  • "vdm_inverse_strength": noising strength of T2V. It should be comparable to "idm_inverse_strength" to ensure that inconsistencies can be effectively eliminated.
  • "stable_num": how many T2V denoising step is used. Since the one-step process of T2V is much slower than that of T2I, even though we use an acclerating T2V model with T_V=10, we still do not wait for T2V to fully denoise before switching to T2I.

Leveraging Temporal-Only Prior of T2V

For ease of implementation, we extract the Selective Feature Injection process into a separate script, SFI.py.
