
Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis (CVPR 2025)

[Teaser figure]

Introduction

Our method is training-free and combines T2I and T2V models to leverage their respective advantages. A high-quality video should encompass the following elements: (1) imaging quality, (2) style diversity, (3) effective motion, and (4) temporal consistency. These aspects are reflected in various metrics of the VBench benchmark (Figure 2).

[Figure 2: VBench metrics]

Effective motion is the essence of a video, and Dynamic Degree is one of its key dimensions. Capturing it effectively requires training on high-FPS videos, which often comes at the cost of temporal consistency, i.e., Motion Smoothness and Subject Consistency. A notable example of this trade-off is VideoCrafter-2.0. Building on this observation, we enhance such videos by leveraging T2V models known for their strong temporal consistency, such as AnimateDiff-Lightning and AnimateLCM. Additionally, we incorporate T2I (Text-to-Image) models renowned for their exceptional imaging quality, like epiCRealism and realisticVisionV60B1_v51VAE, as well as LoRA-based styles (e.g., watercolor).

Instead of a basic composition that alternates between noising and denoising to transition between different models, we switch between T2I and T2V during the denoising process to eliminate the redundant steps.
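To make the difference concrete, here is an illustrative sketch (pseudocode only; the function names, arguments, and step counts are placeholders, not the repository's actual API):

# Illustrative pseudocode: t2i_denoise, t2v_denoise and add_noise are placeholders.

def basic_composition(latent, t2i_denoise, t2v_denoise, add_noise):
    # Fully denoise with T2I, then re-noise so that T2V can denoise once more.
    latent = t2i_denoise(latent)                 # clean image latent
    latent = add_noise(latent, strength=0.4)     # redundant re-noising step
    latent = t2v_denoise(latent)                 # clean video latent
    return latent

def encapsulated_composition(latent, t2i_denoise, t2v_denoise, switch_step=8):
    # Hand the partially denoised T2I latent to T2V mid-trajectory,
    # skipping the extra noising/denoising round trip.
    latent = t2i_denoise(latent, stop_at_step=switch_step)
    latent = t2v_denoise(latent, start_at_step=switch_step)
    return latent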

Setup

Models are downloaded to ./pretrained_models. Below we list each "vdm_type"/"idm_type" together with its download URL. You can change "vdm_type"/"idm_type" in the config.

T2I

  • "SD": stable-diffusion-v1-5 from HERE.
  • "EP": epiCRealism from HERE.
  • "RV": realisticVisionV60B1_v51VAE from HERE
  • "watercolor": from HERE

We resize the latents during the T2I stage to increase resolution. Thus, our "idm_type" is formatted as "{model_name}_X{upscale_rate}", for example "EP_X2.0".
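As a quick check that a downloaded T2I checkpoint loads, a minimal diffusers snippet is sketched below (the local paths are assumptions; adapt them to wherever the weights sit under ./pretrained_models):

import torch
from diffusers import StableDiffusionPipeline

# Assumed local layout; adjust to your actual download paths under ./pretrained_models.
# If the checkpoint is a single .safetensors file, use
# StableDiffusionPipeline.from_single_file(...) instead.
pipe = StableDiffusionPipeline.from_pretrained(
    "./pretrained_models/epiCRealism",        # the "EP" checkpoint in diffusers format
    torch_dtype=torch.float16,
).to("cuda")

# Optionally attach a style LoRA such as watercolor (path is an assumption).
# pipe.load_lora_weights("./pretrained_models/watercolor")

image = pipe("a boat on a lake, watercolor style", num_inference_steps=25).images[0]
image.save("t2i_check.png")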

T2V

  • "lightningEP"/"lightningSD": Download from HERE. Specifically, we use animatediff_lightning_8step_diffusers.safetensors. Since it trains temporal module which support different T2I spatial module, we specify "vdm_type" as lightningEP or lightningSD.

  • "animatelcm": Donwload from HERE. The base T2I module is SD-v1.5.

Creating a Conda Environment

conda create -n EVS python==3.9.0
conda activate EVS
pip install -r requirements.txt

Inference

Hyperparameters are set in the config. We provide default configs under ./configs for different tasks: (1) enhance: enhance the video quality of VideoCrafter-2.0 outputs; (2) edit: edit a VideoCrafter-2.0 generated video; (3) real_edit: edit a real video; and (4) style: change the style to watercolor.

Here is an example config for the enhance task:

{
    "mp4_files": [
        source_video1,
        source_video2,
        ...

    ],
    "prompts": [
        target_prompt1,
        target_prompt2,
        ...

    ],
    "using_attn": true,
    "pipeline_mode": "IDM_VDM_IDM",
    "idm_invert_mode": "random",
    "vdm_invert_mode": "random",
    "idm_invert_mode_2": "ddim",
    "vdm_type": "lightningEP",
    "idm_type": "EP_X2.0",

    "idm_inverse_strength": 0.4,
    "vdm_inverse_strength": 0.4,
    "stable_num": 2,
    "stable_steps": [8],
    "seed": 23
}

We introduce several key hyperparameters involved in the encapsulated composition process and provide guidance on how to adjust them effectively.

Encapsulated Composition

  • "pipeline_mode": different composition shown in Figure.3.
  • "idm_invert_mode"/"vdm_invert_mode"/"idm_invert_mode_2": for pipeline_mode="IDM_VDM_IDM", different invert modes are used.
  • "idm_inverse_strength": noising strength of T2I. For enhance task, we use a smaller value; for edit task, we use a larger value.
  • "stable_steps": during T2I denoising, at which step we switch to T2V. The smaller of stable_steps, the later we switch (stable_steps=0 means we switch to T2V after T2I has been totally denoised to clean latent). It is important to note that, stable_steps should be smaller than T_I * idm_inverse_strength.
  • "vdm_inverse_strength": noising strength of T2V. It should be comparable to "idm_inverse_strength" to ensure that inconsistencies can be effectively eliminated.
  • "stable_num": how many T2V denoising step is used. Since the one-step process of T2V is much slower than that of T2I, even though we use an acclerating T2V model with T_V=10, we still do not wait for T2V to fully denoise before switching to T2I.

Leveraging Temporal-Only Prior of T2V

For ease of implementation, we extract the Selective Feature Injection process into a separate script, SFI.py.
