PoseAnything is a universal pose-guided video generation framework that enables high-quality video generation for both human and non-human characters from arbitrary skeletal inputs.
| No. | Content | State |
|---|---|---|
| 1 | Model Enhanced Using Human Data | ✅ |
| 2 | Release Training Code | ✅ |
| 3 | XPose Dataset Release | Hugging Face |
```shell
git clone https://github.com/Ryan-w2024/PoseAnything.git
cd PoseAnything
```

Install with conda:

```shell
conda create -n poseanything python=3.10
conda activate poseanything
pip install -e .
pip install flash_attn --no-build-isolation
```

Use the following commands to download the model weights:
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./models/Wan2.2-TI2V-5B
huggingface-cli download Ryan241005/PoseAnything --local-dir ./models/Pony
```

After downloading, the weight files should be organized as follows:
```
PoseAnything/
├── models/
│   ├── Wan2.2-TI2V-5B/
│   │   ├── models_t5_umt5-xxl-enc-bf16.pth
│   │   ├── Wan2.2_VAE.pth
│   │   └── ...
│   └── Pony/
│       ├── diffusion_pytorch_model-00001-of-00002.safetensors
│       ├── diffusion_pytorch_model-00002-of-00002.safetensors
│       └── ...
└── ...
```

To run PoseAnything, you need to extract a masked image of the target subject based on the first frame and the skeleton. You can either place the masked image directly in `DATA_DIR/video`, or use the following example script for automatic extraction:
```shell
cd Extractor
bash mask.sh  # May need to downgrade transformers to 4.40.2
```

The data will then be formatted as follows:
```
DATA_DIR/
├── first_frame/
│   └── {file_name}.png
├── skeleton_image/
│   └── {file_name}/
│       ├── 000.png
│       ├── 001.png
│       └── 002.png
├── video/
│   └── {file_name}_id.png
└── ...
```
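The masked image under `video/` can also be produced manually. A minimal sketch, assuming a binary subject mask (white = target subject) is already available from a segmentation tool; the function name and paths are illustrative, not part of the repo:

```python
# Illustrative sketch (not the repo's mask.sh): combine the first frame with a
# binary subject mask to produce DATA_DIR/video/{file_name}_id.png.
import numpy as np
from PIL import Image

def make_masked_image(first_frame_path, mask_path, out_path):
    frame = np.asarray(Image.open(first_frame_path).convert("RGB"))
    # Non-zero mask pixels mark the target subject.
    mask = np.asarray(Image.open(mask_path).convert("L")) > 0
    masked = frame * mask[..., None]  # zero out the background
    Image.fromarray(masked.astype(np.uint8)).save(out_path)
```

The automatic path (`bash mask.sh`) does this for you; the sketch is only meant to show the expected content of the `video/` entry.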
You can then use the provided example script to run the demo:

```shell
bash test.sh
```

If you wish to test the version that does not include the PTC module, run the following command (the masked image is not required):
```shell
bash test_without_ptc.sh
```

💡 Tip: PoseAnything supports arbitrary skeleton inputs. For strong skeletal conditions (large motion / high-density input), we suggest using a smaller CFG scale, or no CFG, for natural output. For weak skeletal conditions (small motion / low-density input), increase the CFG scale to strengthen adherence to the pose.
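The tip above can be captured as a rough heuristic. The function, inputs, and thresholds below are all assumptions for illustration; the repo does not ship such a helper, and suitable values depend on your data:

```python
# Illustrative heuristic only: map condition strength to a CFG scale,
# following the tip above. Thresholds and the strength proxy are guesses.
def suggest_cfg_scale(mean_motion: float, keypoint_density: float) -> float:
    strength = mean_motion * keypoint_density  # crude condition-strength proxy
    if strength > 1.0:    # strong condition: little or no CFG
        return 1.0
    elif strength > 0.5:  # moderate condition
        return 3.0
    else:                 # weak condition: push harder toward the pose
        return 6.0
```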
To test the TikTok dataset, refer to the script below:

```shell
bash test_tiktok.sh
```

| Skeleton | Result |
|---|---|
| skeleton_1.mp4 | result_1.mp4 |
| skeleton_2.mp4 | result_2.mp4 |
| skeleton_3.mp4 | result_3.mp4 |
| skeleton_4.mp4 | result_4.mp4 |
We provide training scripts for the DiT module, which exclude the PTC (Part-aware Temporal Coherence) module to reduce VRAM consumption. You need to modify the checkpoint beforehand to adapt the channel count of the patchify module to the newly added skeleton input:

```shell
python update_weight.py
bash train_without_ptc.sh
```

Since adding the PTC module leads to high VRAM overhead, we suggest using the DeepSpeed framework for optimization.
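To clarify what the checkpoint modification does: the general idea of adapting a patchify layer to extra input channels is to widen its weight tensor and zero-initialize the new channels, so the pretrained output is unchanged for the original inputs. The sketch below is a toy illustration with assumed shapes, not the actual `update_weight.py`:

```python
# Illustrative sketch: widen a patch-embedding weight for extra input
# channels (e.g., an added skeleton latent). Zero-init preserves the
# pretrained behavior on the original channels. Shapes are assumptions.
import numpy as np

def expand_in_channels(weight: np.ndarray, extra: int) -> np.ndarray:
    # weight: [out_ch, in_ch, kt, kh, kw] as in a Conv3d patchify layer
    out_ch, in_ch, *kernel = weight.shape
    pad = np.zeros((out_ch, extra, *kernel), dtype=weight.dtype)
    return np.concatenate([weight, pad], axis=1)

w = np.random.randn(8, 16, 1, 2, 2).astype(np.float32)
w_new = expand_in_channels(w, extra=16)
print(w_new.shape)  # (8, 32, 1, 2, 2)
```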
We also provide the code for automated skeleton extraction, which is built on BlumNet and Grounded-SAM-2.
To avoid conflicts, we highly recommend creating a new Conda environment.
```shell
cd Extractor
conda create -n extractor python=3.10
conda activate extractor
pip install -r requirement.txt
# Compile CUDA operators (as required by BlumNet)
cd BlumNet/models/ops
sh ./make.sh
cd ../../../
```

Please download the weights for BlumNet and Grounded-SAM-2 following the instructions in the corresponding repositories, then run:
```shell
cp -r Grounded/sam2 ./
```

To automatically extract skeletons from your own video data, first provide the paths to the videos to be processed and their corresponding captions, as shown in `../data/example/raw_metadata.csv`.
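As an illustration only, a metadata file listing video paths and captions might be assembled as below; the column names are assumptions, so copy the actual layout from `../data/example/raw_metadata.csv`:

```python
# Illustrative only: write a raw_metadata.csv of videos and captions.
# Match the real column names to ../data/example/raw_metadata.csv.
import csv

rows = [
    {"video_path": "videos/cat.mp4", "caption": "a cat walking on grass"},
    {"video_path": "videos/robot.mp4", "caption": "a robot waving its arm"},
]
with open("raw_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```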
To run the example data:

```shell
bash run.sh
```

Our implementation is based on DiffSynth-Studio, BlumNet, and Grounded-SAM-2. Thanks for their remarkable contributions and released code! If we missed any open-source projects or related articles, we will promptly add them to the acknowledgements.