PoseAnything is a universal pose-guided video generation framework that enables high-quality video generation for both human and non-human characters from arbitrary skeletal inputs.
| No. | Content | State |
|---|---|---|
| 1 | Model Enhanced Using Human Data | ✅ |
| 2 | Release Training Code | ✅ |
| 3 | XPose Dataset Release | Hugging Face |
```shell
git clone https://github.com/Ryan-w2024/PoseAnything.git
cd PoseAnything
```

Install with conda:

```shell
conda create -n poseanything python=3.10
conda activate poseanything
pip install -e .
pip install flash_attn --no-build-isolation
```

Use the following commands to download the model weights:
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./models/Wan2.2-TI2V-5B
huggingface-cli download Ryan241005/PoseAnything --local-dir ./models/Pony
```

After downloading, the weight files should be organized as follows:
```
PoseAnything/
├── models/
│   ├── Wan2.2-TI2V-5B/
│   │   ├── models_t5_umt5-xxl-enc-bf16.pth
│   │   ├── Wan2.2_VAE.pth
│   │   └── ...
│   └── Pony/
│       ├── diffusion_pytorch_model-00001-of-00002.safetensors
│       ├── diffusion_pytorch_model-00002-of-00002.safetensors
│       └── ...
└── ...
```

To run PoseAnything, you need to extract a masked image of the target subject based on the first frame and the skeleton. You can either place the masked image directly in `DATA_DIR/video`, or use the following example script for automatic extraction:
```shell
cd Extractor
bash mask.sh  # May need to downgrade transformers to 4.40.2
```

The data will then be formatted as follows:
```
DATA_DIR/
├── first_frame/
│   └── {file_name}.png
├── skeleton_image/
│   └── {file_name}/
│       ├── 000.png
│       ├── 001.png
│       └── 002.png
├── video/
│   └── {file_name}_id.png
└── ...
```
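The masked image under `video/` can also be produced manually. A minimal sketch, assuming a binary subject mask (white = target subject) is already available from a segmentation tool; the function name and paths are illustrative, not part of the repo:

```python
# Illustrative sketch (not the repo's mask.sh): combine the first frame with a
# binary subject mask to produce DATA_DIR/video/{file_name}_id.png.
import numpy as np
from PIL import Image

def make_masked_image(first_frame_path, mask_path, out_path):
    frame = np.asarray(Image.open(first_frame_path).convert("RGB"))
    # Non-zero mask pixels mark the target subject.
    mask = np.asarray(Image.open(mask_path).convert("L")) > 0
    masked = frame * mask[..., None]  # zero out the background
    Image.fromarray(masked.astype(np.uint8)).save(out_path)
```

The automatic path (`bash mask.sh`) does this for you; the sketch is only meant to show the expected content of the `video/` entry.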
You can then use the provided example script to run the demo:

```shell
bash test.sh
```

If you wish to test the version that does not include the PTC module, run the following command (the masked image is not required):
```shell
bash test_without_ptc.sh
```

💡 Tip: PoseAnything supports arbitrary skeleton inputs. For strong skeletal conditions (large motion / high-density input), we suggest using a smaller CFG scale, or no CFG, for natural output. For weak skeletal conditions (small motion / low-density input), increase the CFG scale to strengthen adherence to the pose.
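The tip above can be captured as a rough heuristic. The function, inputs, and thresholds below are all assumptions for illustration; the repo does not ship such a helper, and suitable values depend on your data:

```python
# Illustrative heuristic only: map condition strength to a CFG scale,
# following the tip above. Thresholds and the strength proxy are guesses.
def suggest_cfg_scale(mean_motion: float, keypoint_density: float) -> float:
    strength = mean_motion * keypoint_density  # crude condition-strength proxy
    if strength > 1.0:    # strong condition: little or no CFG
        return 1.0
    elif strength > 0.5:  # moderate condition
        return 3.0
    else:                 # weak condition: push harder toward the pose
        return 6.0
```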
To test the TikTok dataset, refer to the script below:

```shell
bash test_tiktok.sh
```

| Skeleton | Result |
|---|---|
| skeleton_1.mp4 | result_1.mp4 |
| skeleton_2.mp4 | result_2.mp4 |
| skeleton_3.mp4 | result_3.mp4 |
| skeleton_4.mp4 | result_4.mp4 |
We provide training scripts for the DiT module, which exclude the PTC (Part-aware Temporal Coherence) module to reduce VRAM consumption. You need to modify the checkpoint beforehand to adapt the channel count of the patchify module to the newly added skeleton input:

```shell
python update_weight.py
bash train_without_ptc.sh
```

Since adding the PTC module leads to high VRAM overhead, we suggest using the DeepSpeed framework for optimization.
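To clarify what the checkpoint modification does: the general idea of adapting a patchify layer to extra input channels is to widen its weight tensor and zero-initialize the new channels, so the pretrained output is unchanged for the original inputs. The sketch below is a toy illustration with assumed shapes, not the actual `update_weight.py`:

```python
# Illustrative sketch: widen a patch-embedding weight for extra input
# channels (e.g., an added skeleton latent). Zero-init preserves the
# pretrained behavior on the original channels. Shapes are assumptions.
import numpy as np

def expand_in_channels(weight: np.ndarray, extra: int) -> np.ndarray:
    # weight: [out_ch, in_ch, kt, kh, kw] as in a Conv3d patchify layer
    out_ch, in_ch, *kernel = weight.shape
    pad = np.zeros((out_ch, extra, *kernel), dtype=weight.dtype)
    return np.concatenate([weight, pad], axis=1)

w = np.random.randn(8, 16, 1, 2, 2).astype(np.float32)
w_new = expand_in_channels(w, extra=16)
print(w_new.shape)  # (8, 32, 1, 2, 2)
```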
We also provide the code for automated skeleton extraction, which is built on BlumNet and Grounded-SAM-2.
To avoid conflicts, we highly recommend creating a new Conda environment.
```shell
cd Extractor
conda create -n extractor python=3.10
conda activate extractor
pip install -r requirement.txt
# Compile CUDA operators (as required by BlumNet)
cd BlumNet/models/ops
sh ./make.sh
cd ../../../
```

Please download the weights for BlumNet and Grounded-SAM-2 following the instructions in the corresponding repositories, then run:
```shell
cp -r Grounded/sam2 ./
```

To automatically extract skeletons from your own video data, first provide the paths to the videos to be processed and their corresponding captions, as shown in `../data/example/raw_metadata.csv`.
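As an illustration only, a metadata file listing video paths and captions might be assembled as below; the column names are assumptions, so copy the actual layout from `../data/example/raw_metadata.csv`:

```python
# Illustrative only: write a raw_metadata.csv of videos and captions.
# Match the real column names to ../data/example/raw_metadata.csv.
import csv

rows = [
    {"video_path": "videos/cat.mp4", "caption": "a cat walking on grass"},
    {"video_path": "videos/robot.mp4", "caption": "a robot waving its arm"},
]
with open("raw_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video_path", "caption"])
    writer.writeheader()
    writer.writerows(rows)
```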
To run the example data:

```shell
bash run.sh
```

Our implementation is based on DiffSynth-Studio, BlumNet, and Grounded-SAM-2. Thanks for their remarkable contributions and released code! If we missed any open-source projects or related articles, we will promptly add them to the acknowledgements.