This repository is an official implementation for:
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
Authors: Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, Hao Zhao
You can set up the environment using the provided environment.yaml file. We recommend using Conda to manage the dependencies. This project requires CUDA 12.1.
conda env create -f environment.yaml
conda activate PAM
You should also install manopth from https://github.com/hassony2/manopth
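After activating the environment, a quick sanity check can confirm the key packages are importable. This helper is not part of the repo; the module names below are the usual ones and may need adjusting:

```python
# Not repo code: verify that key dependencies are importable after
# `conda activate PAM`. Adjust the module names if your setup differs.
import importlib.util

def check_modules(names):
    """Return a dict mapping each module name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_modules(["torch", "manopth"])
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'MISSING -- please install it'}")
```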
Please download the necessary datasets from their official sources:
- DexYCB: Visit the official website to download the dataset. You might also need the toolkit for processing.
- OakInk2: Visit the official website and their GitHub repository for download instructions and data setup.
To enable appearance and motion generation, you need to prepare the rendering conditions by following these steps:
Step 1: Convert DexYCB to Videos
Convert the DexYCB dataset into video format for subsequent processing:
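Conceptually, this conversion stitches each camera's frame sequence into an mp4. The sketch below is not repo code; the ffmpeg route, the `color_%06d.jpg` frame naming, and the 30 fps default are our assumptions:

```python
# Illustration only: one plausible way to turn a DexYCB camera directory of
# color_%06d.jpg frames into a video via ffmpeg (frame naming is an assumption).
import subprocess
from pathlib import Path

def ffmpeg_cmd(frame_dir, out_path, fps=30):
    """Build an ffmpeg command that encodes color_%06d.jpg frames into an mp4."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", str(Path(frame_dir) / "color_%06d.jpg"),
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        str(out_path),
    ]

def convert_camera_dir(frame_dir, out_path):
    """Run the encode; requires ffmpeg on PATH."""
    subprocess.run(ffmpeg_cmd(frame_dir, out_path), check=True)
```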
python appearance_gen/utils/dexycb_to_videos.py --input_root /path/to/dexycb --output_root /path/to/dexycb_videos
Step 2: Extract Depth Conditions
Extract depth maps from the video sequences as control signals. We use DepthCrafter to extract depth from the videos. Note that you need to mask out the background, keeping only the foreground, using the segmentation mask.
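The foreground masking step amounts to zeroing the depth wherever the segmentation mask marks background. A minimal sketch, assuming a nonzero-means-foreground mask convention:

```python
# Sketch of the foreground masking described above (the mask convention,
# nonzero = foreground, is an assumption): zero out background depth.
import numpy as np

def mask_depth(depth, seg_mask):
    """Keep depth only where seg_mask is nonzero (foreground); zero elsewhere."""
    depth = np.asarray(depth, dtype=np.float32)
    fg = np.asarray(seg_mask) > 0
    return np.where(fg, depth, 0.0)
```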
Step 3: Generate Hand Keypoints and Semantic Masks
Generate hand keypoint annotations and semantic segmentation masks to guide the generation process. These conditions are processed in the dataloader.
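For intuition, a keypoint condition of this kind can be rasterized into a per-frame map that a dataloader consumes alongside the semantic mask. This is an illustration only, not the repo's actual representation:

```python
# Illustration (not repo code): rasterize 2D hand keypoints into a binary
# HxW map, one plausible form for a keypoint condition in a dataloader.
import numpy as np

def keypoint_map(keypoints_xy, height, width):
    """Return an HxW uint8 map with a 1 at each in-bounds (x, y) keypoint."""
    out = np.zeros((height, width), dtype=np.uint8)
    for x, y in keypoints_xy:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            out[yi, xi] = 1
    return out
```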
Step 4: Extract Video Captions
Generate descriptive captions for the video sequences to provide textual guidance for the appearance generation model. We provide a caption extraction tool at appearance_gen/utils/description_extract.py, and pre-extracted prompts for DexYCB are available in dexycb_prompt.jsonl.
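Since dexycb_prompt.jsonl is JSON Lines, each line holds one record. A minimal reader, assuming each record is a dict (the exact field names, e.g. a video path and a caption, are guesses):

```python
# Minimal JSON Lines reader for caption files like dexycb_prompt.jsonl.
# The per-record field names are assumptions, not taken from the repo.
import json

def load_prompts(path):
    """Return one dict per non-empty line of a .jsonl file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```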
Step 5: Generate Filelist
Generate the file list that specifies input data paths for training and inference:
python appearance_gen/utils/gen_s0_split_filelist.py --root /path/to/dexycb_videos --save_path_root /path/to/filelist/dir
The final dataset structure should look like:
data/
├── dexycb_videos/ # Step 1: Converted videos
│ └── {subject}/ # e.g., 20200709-subject-01
│ └── {sequence}/ # e.g., 20200709_141754
│ └── {camera_id}/ # e.g., 836212060125
│ ├── video.mp4
│ └── descriptions.json
├── dexycb_depth/ # Step 2: Depth conditions
│ └── {subject}/
│ └── {sequence}/
│ └── {camera_id}/
│ └── depth.mp4
├── dexycb_fore_tracking/ # Step 3: Hand keypoints & semantic masks
│ └── {subject}/
│ └── {sequence}/
│ └── {camera_id}/
│ └── ...
├── dexycb_filelist/ # Step 5: Generated filelists
└── dexycb_prompt.jsonl # Step 4: Video captions
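Before generating filelists, it can help to verify the layout above. This optional helper is not part of the repo; it checks the per-camera files for the videos and depth branches of the tree:

```python
# Optional helper (not repo code) to sanity-check the dataset layout above
# for one subject/sequence/camera before generating filelists.
from pathlib import Path

EXPECTED = {
    "dexycb_videos": ["video.mp4", "descriptions.json"],
    "dexycb_depth": ["depth.mp4"],
}

def missing_files(data_root, subject, sequence, camera_id):
    """Return the expected files that are absent for the given camera dir."""
    root = Path(data_root)
    absent = []
    for top, names in EXPECTED.items():
        cam_dir = root / top / subject / sequence / camera_id
        absent += [str(cam_dir / n) for n in names if not (cam_dir / n).exists()]
    return absent
```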
Step 1: Generate Filelist
Generate the video and conditions filelist for OakInk2:
python appearance_gen/utils/gen_oakink2_filelist.py
We also provide pre-generated training video paths in appearance_gen/utils/oakink2_videos.txt.
Step 2: Extract Depth Conditions
Extract depth maps from the rendered sequences as control signals using DepthCrafter. Note that you need to mask the background and keep only the foreground using the segmentation mask.
Step 3: Generate Hand Keypoints and Semantic Masks
Generate segmentation masks and hand keypoint annotations:
python appearance_gen/utils/gen_hand_seg_oakink2.py \
--filelist_path /path/to/video/filelist \
--video_len 49 \
--root_dir /path/to/dataset/root
Step 4: Extract Video Captions
Extract descriptive captions using LLaVA as described in the DexYCB section above.
Once you have prepared the conditions and filelists, you can train the appearance generation model using the provided scripts. You can also download the prepared dataset from our Hugging Face repository.
Configuration
Before training, you need to modify the training script appearance_gen/script/train.sh to specify the following paths:
--data_root: Root directory containing your dataset
--train_caption_column: Path to the training video captions file
--train_video_column: Path to the training video list file
--train_depth_column: Path to the training depth conditions file
--train_label_column: Path to the training label file
--valid_caption_column: Path to the validation video captions file
--valid_video_column: Path to the validation video list file
--valid_depth_column: Path to the validation depth conditions file
--valid_label_column: Path to the validation label file
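For concreteness, the edited flags inside appearance_gen/script/train.sh might look like the fragment below. Every path here is a placeholder, not a real default from the repo:

```shell
# Placeholder values for the flags listed above; edit inside script/train.sh
# to point at your own dataset root and filelist files.
--data_root /data/dexycb \
--train_caption_column /data/dexycb_filelist/train_captions.txt \
--train_video_column /data/dexycb_filelist/train_videos.txt \
--train_depth_column /data/dexycb_filelist/train_depth.txt \
--train_label_column /data/dexycb_filelist/train_labels.txt \
--valid_caption_column /data/dexycb_filelist/valid_captions.txt \
--valid_video_column /data/dexycb_filelist/valid_videos.txt \
--valid_depth_column /data/dexycb_filelist/valid_depth.txt \
--valid_label_column /data/dexycb_filelist/valid_labels.txt
```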
Training
The training uses DeepSpeed for distributed training with mixed precision (bf16). You can adjust the training configuration in appearance_gen/config.yaml if needed.
cd appearance_gen
# Modify script/train.sh with your data paths first
bash script/train.sh
Note: The default configuration uses 4 GPUs with DeepSpeed ZeRO stage 2. Adjust num_processes in config.yaml according to your available GPU resources.
Inference
After training, you can run inference using the trained ControlNet model. First, modify script/infer_double.sh with the following parameters:
--local_path: Path to the trained ControlNet checkpoint (e.g., controlnet.bin)
--save_path: Directory to save the generated results
Also update the validation dataset filelist paths.
bash script/infer_double.sh
The inference script will load the trained model and generate appearance results for the input conditions.
To accelerate the training process, we pre-compute the training latents offline. Run the following scripts:
# For DexYCB dataset
bash scripts/prepare_dataset_dexycb.sh
# For OakInk2 dataset
bash scripts/prepare_dataset_oakink2.sh
After generating the latents, merge them using:
python appearance_gen/utils/merge_latents.py \
--src_dir /path/to/latents/src/dir \
--dst_dir /path/to/latents/dst/dir
We provide training scripts based on DeepSpeed for efficient distributed training.
# For DexYCB dataset
bash scripts/train_dexycb_motion.sh
# For OakInk2 dataset
bash scripts/train_oakink2_motion.sh
You can customize the conditioning inputs via the --used_conditions argument. By default, we use three conditions (hand_keypoints, depth, seg_mask) as proposed in the paper.
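Assuming the training scripts forward extra flags to the underlying trainer (check the scripts; this and the space-separated value format are guesses on our part), a customized run might look like:

```shell
# Hypothetical invocation; verify how --used_conditions is parsed in the
# scripts before relying on this exact form.
bash scripts/train_dexycb_motion.sh --used_conditions hand_keypoints depth seg_mask
```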
Before running inference, convert the trained checkpoint to the standard format:
python convert_ckpt.py \
/path/to/checkpoint/dir \
/path/to/save/dir \
--safe_serialization
Then run the evaluation scripts:
# For DexYCB dataset
bash scripts/evaluate_dexycb.sh
# For OakInk2 dataset
bash scripts/evaluate_oakink2.sh
Note: Make sure to specify the validation file list in the corresponding script before running.
First, prepare the motion sequence using GraspXL. The data directory should follow this structure:
sim_data_root/
├── rendered_rgb_*.png
├── rendered_normal_*.png
├── rendered_depth_*.png
├── rendered_seg_mask_*.png
└── rendered_hand_keypoints_*.png
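A small helper can group the rendered GraspXL outputs above by frame index across modalities. The `rendered_<modality>_<index>.png` naming is inferred from the patterns above; this is an illustration, not repo code:

```python
# Illustration (not repo code): map frame index -> {modality: path} for the
# rendered_<modality>_<index>.png files in sim_data_root shown above.
import re
from pathlib import Path

MODALITIES = ["rgb", "normal", "depth", "seg_mask", "hand_keypoints"]

def frames_by_index(sim_data_root):
    """Group rendered files by integer frame index and modality name."""
    frames = {}
    for p in Path(sim_data_root).glob("rendered_*.png"):
        m = re.match(r"rendered_(.+)_(\d+)\.png$", p.name)
        if m and m.group(1) in MODALITIES:
            frames.setdefault(int(m.group(2)), {})[m.group(1)] = p
    return frames
```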
Generate the first frame with realistic appearance:
bash appearance_gen/script/infer_double_wild.sh
Note: Specify the relevant keywords in the script before running. After generation, select the desired first frame for motion synthesis.
Run the motion generation pipeline:
python testing/evaluation_sim.py \
--root_dir /path/to/sim/root \
--start_frame_filelist /path/to/selected/filelist \
--transformer_path /path/to/checkpoint \
--output_dir /path/to/output
This project builds upon the following excellent works:
