This repository is an official implementation for:
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, Hao Zhao
You can set up the environment using the provided environment.yaml file. We recommend using Conda to manage the dependencies. This project requires CUDA 12.1.
conda env create -f environment.yaml
conda activate PAM
You should also install manopth from https://github.com/hassony2/manopth
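After activating the environment, a quick sanity check can confirm the key packages are importable. This helper is not part of the repo; the module names below are the usual ones and may need adjusting:

```python
# Not repo code: verify that key dependencies are importable after
# `conda activate PAM`. Adjust the module names if your setup differs.
import importlib.util

def check_modules(names):
    """Return a dict mapping each module name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_modules(["torch", "manopth"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING -- please install it'}")
```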
Please download the necessary datasets from their official sources:
- DexYCB: Visit the official website to download the dataset. You might also need the toolkit for processing.
- OakInk2: Visit the official website and their GitHub repository for download instructions and data setup.
To enable appearance and motion generation, you need to prepare the rendering conditions by following these steps:
Step 1: Convert DexYCB to Videos
Convert the DexYCB dataset into video format for subsequent processing:
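Conceptually, this conversion stitches each camera's frame sequence into an mp4. The sketch below is not repo code; the ffmpeg route, the `color_%06d.jpg` frame naming, and the 30 fps default are our assumptions:

```python
# Illustration only: one plausible way to turn a DexYCB camera directory of
# color_%06d.jpg frames into a video via ffmpeg (frame naming is an assumption).
import subprocess
from pathlib import Path

def ffmpeg_cmd(frame_dir, out_path, fps=30):
    """Build an ffmpeg command that encodes color_%06d.jpg frames into an mp4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", str(Path(frame_dir) / "color_%06d.jpg"),
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        str(out_path),
    ]

def convert_camera_dir(frame_dir, out_path):
    """Run the encode; requires ffmpeg on PATH."""
    subprocess.run(ffmpeg_cmd(frame_dir, out_path), check=True)
```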
python appearance_gen/utils/dexycb_to_videos.py --input_root /path/to/dexycb --output_root /path/to/dexycb_videos
Step 2: Extract Depth Conditions
Extract depth maps from the video sequences as control signals. We use DepthCrafter to extract depth from the videos. Note that you need to mask out the background, keeping only the foreground, using the segmentation mask.
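The foreground masking step amounts to zeroing the depth wherever the segmentation mask marks background. A minimal sketch, assuming a nonzero-means-foreground mask convention:

```python
# Sketch of the foreground masking described above (the mask convention,
# nonzero = foreground, is an assumption): zero out background depth.
import numpy as np

def mask_depth(depth, seg_mask):
    """Keep depth only where seg_mask is nonzero (foreground); zero elsewhere."""
    depth = np.asarray(depth, dtype=np.float32)
    fg = np.asarray(seg_mask) > 0
    return np.where(fg, depth, 0.0)
```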
Step 3: Generate Hand Keypoints and Semantic Masks
Generate hand keypoint annotations and semantic segmentation masks to guide the generation process. These conditions are processed in the dataloader.
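For intuition, a keypoint condition of this kind can be rasterized into a per-frame map that a dataloader consumes alongside the semantic mask. This is an illustration only, not the repo's actual representation:

```python
# Illustration (not repo code): rasterize 2D hand keypoints into a binary
# HxW map, one plausible form for a keypoint condition in a dataloader.
import numpy as np

def keypoint_map(keypoints_xy, height, width):
    """Return an HxW uint8 map with a 1 at each in-bounds (x, y) keypoint."""
    out = np.zeros((height, width), dtype=np.uint8)
    for x, y in keypoints_xy:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            out[yi, xi] = 1
    return out
```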
Step 4: Extract Video Captions
Generate descriptive captions for the video sequences to provide textual guidance for the appearance generation model. We provide a caption extraction tool at appearance_gen/utils/description_extract.py, and pre-extracted prompts for DexYCB are available in dexycb_prompt.jsonl.
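Since dexycb_prompt.jsonl is JSON Lines, each line holds one record. A minimal reader, assuming each record is a dict (the exact field names, e.g. a video path and a caption, are guesses):

```python
# Minimal JSON Lines reader for caption files like dexycb_prompt.jsonl.
# The per-record field names are assumptions, not taken from the repo.
import json

def load_prompts(path):
    """Return one dict per non-empty line of a .jsonl file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```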
Step 5: Generate Filelist
Generate the file list that specifies input data paths for training and inference:
python appearance_gen/utils/gen_s0_split_filelist.py --root /path/to/dexycb_videos --save_path_root /path/to/filelist/dir
The final dataset structure should look like:
data/
├── dexycb_videos/ # Step 1: Converted videos
│ └── {subject}/ # e.g., 20200709-subject-01
│ └── {sequence}/ # e.g., 20200709_141754
│ └── {camera_id}/ # e.g., 836212060125
│ ├── video.mp4
│ └── descriptions.json
├── dexycb_depth/ # Step 2: Depth conditions
│ └── {subject}/
│ └── {sequence}/
│ └── {camera_id}/
│ └── depth.mp4
├── dexycb_fore_tracking/ # Step 3: Hand keypoints & semantic masks
│ └── {subject}/
│ └── {sequence}/
│ └── {camera_id}/
│ └── ...
├── dexycb_filelist/ # Step 5: Generated filelists
└── dexycb_prompt.jsonl # Step 4: Video captions
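Before generating filelists, it can help to verify the layout above. This optional helper is not part of the repo; it checks the per-camera files for the videos and depth branches of the tree:

```python
# Optional helper (not repo code) to sanity-check the dataset layout above
# for one subject/sequence/camera before generating filelists.
from pathlib import Path

EXPECTED = {
    "dexycb_videos": ["video.mp4", "descriptions.json"],
    "dexycb_depth": ["depth.mp4"],
}

def missing_files(data_root, subject, sequence, camera_id):
    """Return the expected files that are absent for the given camera dir."""
    root = Path(data_root)
    absent = []
    for top, names in EXPECTED.items():
        cam_dir = root / top / subject / sequence / camera_id
        absent += [str(cam_dir / n) for n in names if not (cam_dir / n).exists()]
    return absent
```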
Step 1: Generate Filelist
Generate the video and conditions filelist for OakInk2:
python appearance_gen/utils/gen_oakink2_filelist.py
We also provide pre-generated training video paths in appearance_gen/utils/oakink2_videos.txt.
Step 2: Extract Depth Conditions
Extract depth maps from the rendered sequences as control signals using DepthCrafter. Note that you need to mask the background and keep only the foreground using the segmentation mask.
Step 3: Generate Hand Keypoints and Semantic Masks
Generate segmentation masks and hand keypoint annotations:
python appearance_gen/utils/gen_hand_seg_oakink2.py \
--filelist_path /path/to/video/filelist \
--video_len 49 \
--root_dir /path/to/dataset/root
Step 4: Extract Video Captions
Extract descriptive captions using LLaVA as described in the DexYCB section above.
Once you have prepared the conditions and filelists, you can train the appearance generation model using the provided scripts. You can also download the prepared dataset from our Hugging Face repository.
Configuration
Before training, you need to modify the training script appearance_gen/script/train.sh to specify the following paths:
--data_root: Root directory containing your dataset
--train_caption_column: Path to the training video captions file
--train_video_column: Path to the training video list file
--train_depth_column: Path to the training depth conditions file
--train_label_column: Path to the training label file
--valid_caption_column: Path to the validation video captions file
--valid_video_column: Path to the validation video list file
--valid_depth_column: Path to the validation depth conditions file
--valid_label_column: Path to the validation label file
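For concreteness, the edited flags inside appearance_gen/script/train.sh might look like the fragment below. Every path here is a placeholder, not a real default from the repo:

```shell
# Placeholder values for the flags listed above; edit inside script/train.sh
# to point at your own dataset root and filelist files.
--data_root /data/dexycb \
--train_caption_column /data/dexycb_filelist/train_captions.txt \
--train_video_column /data/dexycb_filelist/train_videos.txt \
--train_depth_column /data/dexycb_filelist/train_depth.txt \
--train_label_column /data/dexycb_filelist/train_labels.txt \
--valid_caption_column /data/dexycb_filelist/valid_captions.txt \
--valid_video_column /data/dexycb_filelist/valid_videos.txt \
--valid_depth_column /data/dexycb_filelist/valid_depth.txt \
--valid_label_column /data/dexycb_filelist/valid_labels.txt
```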
Training
The training uses DeepSpeed for distributed training with mixed precision (bf16). You can adjust the training configuration in appearance_gen/config.yaml if needed.
cd appearance_gen
# Modify script/train.sh with your data paths first
bash script/train.sh
Note: The default configuration uses 4 GPUs with DeepSpeed ZeRO stage 2. Adjust num_processes in config.yaml according to your available GPU resources.
Inference
After training, you can run inference using the trained ControlNet model. First, modify script/infer_double.sh with the following parameters:
--local_path: Path to the trained ControlNet checkpoint (e.g., controlnet.bin)
--save_path: Directory to save the generated results
Also update the validation dataset filelist paths.
bash script/infer_double.sh
The inference script will load the trained model and generate appearance results for the input conditions.
To accelerate the training process, we pre-compute the training latents offline. Run the following scripts:
# For DexYCB dataset
bash scripts/prepare_dataset_dexycb.sh
# For OakInk2 dataset
bash scripts/prepare_dataset_oakink2.sh
After generating the latents, merge them using:
python appearance_gen/utils/merge_latents.py \
--src_dir /path/to/latents/src/dir \
--dst_dir /path/to/latents/dst/dir
We provide training scripts based on DeepSpeed for efficient distributed training.
# For DexYCB dataset
bash scripts/train_dexycb_motion.sh
# For OakInk2 dataset
bash scripts/train_oakink2_motion.sh
You can customize the conditioning inputs via the --used_conditions argument. By default, we use three conditions (hand_keypoints, depth, seg_mask) as proposed in the paper.
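Assuming the training scripts forward extra flags to the underlying trainer (check the scripts; this and the space-separated value format are guesses on our part), a customized run might look like:

```shell
# Hypothetical invocation; verify how --used_conditions is parsed in the
# scripts before relying on this exact form.
bash scripts/train_dexycb_motion.sh --used_conditions hand_keypoints depth seg_mask
```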
Before running inference, convert the trained checkpoint to the standard format:
python convert_ckpt.py \
/path/to/checkpoint/dir \
/path/to/save/dir \
--safe_serialization
Then run the evaluation scripts:
# For DexYCB dataset
bash scripts/evaluate_dexycb.sh
# For OakInk2 dataset
bash scripts/evaluate_oakink2.sh
Note: Make sure to specify the validation file list in the corresponding script before running.
First, prepare the motion sequence using GraspXL. The data directory should follow this structure:
sim_data_root/
├── rendered_rgb_*.png
├── rendered_normal_*.png
├── rendered_depth_*.png
├── rendered_seg_mask_*.png
└── rendered_hand_keypoints_*.png
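A small helper can group the rendered GraspXL outputs above by frame index across modalities. The `rendered_<modality>_<index>.png` naming is inferred from the patterns above; this is an illustration, not repo code:

```python
# Illustration (not repo code): map frame index -> {modality: path} for the
# rendered_<modality>_<index>.png files in sim_data_root shown above.
import re
from pathlib import Path

MODALITIES = ["rgb", "normal", "depth", "seg_mask", "hand_keypoints"]

def frames_by_index(sim_data_root):
    """Group rendered files by integer frame index and modality name."""
    frames = {}
    for p in Path(sim_data_root).glob("rendered_*.png"):
        m = re.match(r"rendered_(.+)_(\d+)\.png$", p.name)
        if m and m.group(1) in MODALITIES:
            frames.setdefault(int(m.group(2)), {})[m.group(1)] = p
    return frames
```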
Generate the first frame with realistic appearance:
bash appearance_gen/script/infer_double_wild.sh
Note: Specify the relevant keywords in the script before running. After generation, select the desired first frame for motion synthesis.
Run the motion generation pipeline:
python testing/evaluation_sim.py \
--root_dir /path/to/sim/root \
--start_frame_filelist /path/to/selected/filelist \
--transformer_path /path/to/checkpoint \
--output_dir /path/to/output
This project builds upon the following excellent works:
