A Digital Human Project with Generative AI.
- Related papers
- MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation
Paper links: https://arxiv.org/abs/2403.19144 https://paperswithcode.com/paper/moditalker-motion-disentangled-diffusion https://ku-cvlab.github.io/MoDiTalker/ https://github.com/KU-CVLAB/MoDiTalker
- AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
Paper links: https://arxiv.org/abs/2403.17694 https://github.com/scutzzj/AniPortrait
1 Environment setup
1.1 Python environment
<1> chameleon
cd chameleon
pip install -r requirements.txt
pip install -U edge-tts==6.1.12
<2> MoDiTalker
pip install p_tqdm
<3> AniPortrait
pip install -U diffusers==0.24.0 imageio==2.33.0 imageio-ffmpeg==0.4.9 omegaconf==2.2.3 ffmpeg-python==0.2.0
(If a package turns out to be missing at run time, install it from Zejun-Yang_AniPortrait/requirements.txt.)
1.2 Pretrained models
<1> chameleon
* Pretrained face_alignment / 3drecon models (copy them to the local machine):
user@192.168.58.238:
~/lk/mdls/digitalhuman/deep_3drecon/
~/lk/mdls/digitalhuman/fa/
Copy the models into the expected directories:
cd chameleon/processes
cp -R ~/lk/mdls/digitalhuman/deep_3drecon/BFM deep_3drecon/
cp -R ~/lk/mdls/digitalhuman/deep_3drecon/checkpoints deep_3drecon/
cp -R ~/lk/mdls/digitalhuman/fa/fan deep_3drecon/
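Before running anything that depends on these models, it can help to confirm that the copy succeeded. The following is a minimal sketch, assuming that the three directories named in the `cp` commands above (`BFM`, `checkpoints`, `fan`) are what must exist under `deep_3drecon/`. The helper name `missing_model_dirs` is hypothetical and not part of chameleon, and the contents of the directories are not checked.

```python
from pathlib import Path

# Directory names taken from the cp commands above; whether further
# files are required inside them is not checked here.
EXPECTED_DIRS = ("BFM", "checkpoints", "fan")


def missing_model_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected subdirectories that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # Run from chameleon/processes, matching the cd in the copy step.
    missing = missing_model_dirs("deep_3drecon")
    print("missing:", missing if missing else "none")
```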
<2> MoDiTalker
Pretrained HuBERT model (copy it to the local machine):
user@192.168.58.238:
/home/user/mnt/sdg1/mdls/models--facebook--hubert-large-ls960-ft
(Update the code to point at this path: KU-CVLAB_MoDiTalker/data/data_utils/preprocess/process_audio_lk.py L:14
chameleon/projects/animake/process_audio_hubert.py L:14)
<3> AniPortrait
Pretrained models (copy them to the local machine):
user@192.168.58.238:
pretrained_base_model_path: '/home/user/mnt/sdg1/mdls/models--runwayml--stable-diffusion-v1-5'
pretrained_vae_path: '/home/user/mnt/sdg1/mdls/models--stabilityai--sd-vae-ft-mse'
image_encoder_path: '/home/user/mnt/sdg1/mdls/models--lambdalabs--sd-image-variations-diffusers/image_encoder'
mm_path: '/home/user/mnt/sdg1/mdls/models--guoyww--animatediff/mm_sd_v15_v2.ckpt'
(Remember to update the corresponding paths in Zejun-Yang_AniPortrait/configs/prompts/animation_trnfa41.yaml.)
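After editing the config, a quick check for paths that do not exist on the current machine can catch copy mistakes early. This is a minimal sketch, assuming the config keeps these entries as flat `key: '/absolute/path'` lines, as in the listing above. The function name `find_missing_paths` is hypothetical, and the regex-based scan is a stand-in for a real YAML parser.

```python
import re
from pathlib import Path

# Matches flat lines such as  key: '/some/path'  or  key: "/some/path".
# Only values that start with "/" are treated as paths.
_LINE = re.compile(r"""^\s*([\w.-]+)\s*:\s*['"]?(/[^'"#\n]*?)['"]?\s*$""")


def find_missing_paths(config_text):
    """Return {key: path} for path-like values that do not exist locally."""
    missing = {}
    for line in config_text.splitlines():
        match = _LINE.match(line)
        if match and not Path(match.group(2)).exists():
            missing[match.group(1)] = match.group(2)
    return missing


if __name__ == "__main__":
    sample = "pretrained_vae_path: '/no/such/dir/sd-vae-ft-mse'\nfps: 25\n"
    print(find_missing_paths(sample))
```

For nested or multi-line configs, loading the file with a YAML parser (for example PyYAML's `yaml.safe_load`) and walking the result would be more robust than scanning lines.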
1.3 Self-trained models
<1> chameleon
user@192.168.58.238:
/data/likun/outs/chameleon/aniptrt/trn_fa_51/
/data/likun/outs/chameleon/aniptrt/trn_fa_51/stage1_bk/*-600000.pth
/data/likun/outs/chameleon/aniptrt/trn_fa_51/stage2_bk/*.pth
<2> MoDiTalker
AToM:
user@192.168.58.238:
/home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt
<3> AniPortrait
user@192.168.58.238:
denoising_unet_path: "/home/user/mnt/sdg1/outs/digitalhuman/aniportrait/trn_fa_41/stage1_41_bk/denoising_unet-300000.pth"
reference_unet_path: "/home/user/mnt/sdg1/outs/digitalhuman/aniportrait/trn_fa_41/stage1_41_bk/reference_unet-300000.pth"
pose_guider_path: "/home/user/mnt/sdg1/outs/digitalhuman/aniportrait/trn_fa_41/stage1_41_bk/pose_guider-300000.pth"
motion_module_path: "/home/user/mnt/sdg1/outs/digitalhuman/aniportrait/trn_fa_41/stage2_bk/motion_module-400000.pth"
(Remember to update the corresponding paths in chameleon/configs/prompts/animation_trnfa41.yaml.)
Note: for any other models, likewise update the corresponding paths in chameleon/configs/prompts/animation_trnfa41.yaml
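The checkpoint files listed above follow a `<module>-<step>.pth` naming pattern (for example `denoising_unet-300000.pth`). When a directory holds several steps, the newest file per module can be picked automatically. This is a minimal sketch under that naming assumption; `latest_checkpoints` is a hypothetical helper, not part of chameleon or AniPortrait.

```python
import re
from pathlib import Path

# File names above follow the pattern <module>-<step>.pth,
# e.g. denoising_unet-300000.pth.
_CKPT = re.compile(r"^(?P<module>.+)-(?P<step>\d+)\.pth$")


def latest_checkpoints(ckpt_dir):
    """Return {module: path} holding the highest-step file per module."""
    best = {}
    for path in Path(ckpt_dir).glob("*.pth"):
        match = _CKPT.match(path.name)
        if not match:
            continue  # ignore files that do not follow the pattern
        module, step = match.group("module"), int(match.group("step"))
        if module not in best or step > best[module][0]:
            best[module] = (step, path)
    return {module: path for module, (step, path) in best.items()}
```

The returned paths can then be pasted into the `denoising_unet_path`, `reference_unet_path`, `pose_guider_path` and `motion_module_path` entries of the config.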
2 Data preprocessing
Several open-source datasets are used: HDTF, CelebV-HQ, VFHQ
Data locations: user@192.168.58.238:/data/likun/data/data/HDTF/ user@192.168.58.238:/data/likun/data/data/celebv_hq/ user@192.168.58.238:/data/likun/data/data/VFHQ/
Image and video data must be preprocessed before training. For the full set of commands, see: chameleon/processes/README.md
Unify the video fps
python -u preprocess/unify_fps_vid.py \
--load_video_path /data/likun/data/data/HDTF/HDTF-FACE \
--save_video_path /data/likun/data/data/HDTF/HDTF_fps25 \
--start_idx 0 --end_idx 9999 \
--fps 25 \
--batch_size 1 \
--num_workers 10 \
&> /data/likun/data/data/HDTF/HDTF_fps25__log__mp.txt
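The internals of `preprocess/unify_fps_vid.py` are not shown here. To illustrate the operation itself, the following is a minimal sketch that re-encodes one video at a constant frame rate using ffmpeg's `fps` video filter. It assumes ffmpeg is installed and on `PATH`; the function names are hypothetical, and the project script may work differently.

```python
import subprocess


def fps_command(src, dst, fps=25):
    """Build an ffmpeg command that resamples src to a constant frame rate."""
    return [
        "ffmpeg",
        "-y",                       # overwrite dst if it already exists
        "-i", str(src),             # input video
        "-filter:v", f"fps={fps}",  # drop or duplicate frames to reach fps
        str(dst),
    ]


def unify_fps(src, dst, fps=25):
    """Run the command and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(fps_command(src, dst, fps), check=True)
```

Keeping command construction separate from execution makes the command easy to print and inspect before a long batch run.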
Extract audio from the videos
python -u preprocess/pick_audio_from_video.py \
--load_video_path /data/likun/data/data/HDTF/HDTF_fps25 \
--save_audio_path /data/likun/data/data/HDTF/HDTF_fps25_wavs \
--start_idx 0 --end_idx 9999 \
--batch_size 1 \
--num_workers 10 \
&> /data/likun/data/data/HDTF/HDTF_fps25_wavs__log.txt
Extract frames from the videos and save them as images
python -u preprocess/video2frame_celebv.py \
--load_video_path /data/likun/data/data/HDTF/HDTF_fps25 \
--save_images_path /data/likun/data/data/HDTF/HDTF_fps25_frame \
&> /data/likun/data/data/HDTF/HDTF_fps25_frame__log.txt
Resize the images
python -u preprocess/prcs_img_resize.py \
/data/likun/data/data/HDTF/HDTF_fps25_frame \
/data/likun/data/data/HDTF/HDTF_fps25_frame_256 \
256 ".jpg" \
&> /data/likun/data/data/HDTF/HDTF_fps25_frame_256__log.txt
Extract 2D and 3D facial keypoints from the images
python -u preprocess/process_video_3dmm_rollback_fa3drec_mp.py \
--audio_ok_file_path "" \
--hdtf_frames_path /data/likun/data/data/HDTF/HDTF_fps25_frame_256 \
--image_wh 256 \
--if_save_img 1 \
--mp_np 8 \
--saving_path /data/likun/data/data/HDTF/HDTF_fps25_frame_256_kps_23_unfcpsok_img \
&> /data/likun/data/data/HDTF/HDTF_fps25_frame_256_kps_23_unfcpsok_img__log_1.txt
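The commands above write each stage to a sibling directory whose name extends the previous one (`HDTF_fps25`, `HDTF_fps25_wavs`, `HDTF_fps25_frame`, `HDTF_fps25_frame_256`). A small helper can derive these names in one place so that the commands stay consistent. This is a minimal sketch; `stage_dirs` is a hypothetical helper, and whether CelebV-HQ and VFHQ follow the same naming scheme is an assumption. The keypoint output directory is left out because its suffix (`_kps_23_unfcpsok_img`) looks run-specific.

```python
from pathlib import Path


def stage_dirs(dataset_root, name="HDTF", fps=25, size=256):
    """Return the per-stage directories, following the names used above."""
    root = Path(dataset_root)
    base = f"{name}_fps{fps}"
    return {
        "video": root / base,                      # unify_fps_vid.py output
        "audio": root / f"{base}_wavs",            # pick_audio_from_video.py output
        "frames": root / f"{base}_frame",          # video2frame_celebv.py output
        "resized": root / f"{base}_frame_{size}",  # prcs_img_resize.py output
    }


if __name__ == "__main__":
    for stage, path in stage_dirs("/data/likun/data/data/HDTF").items():
        print(f"{stage:8s} {path}")
```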
3 Inference
- Code updates
<1> chameleon
chameleon/projects/animake/main_1.py L:51 L:59
<2> MoDiTalker
<3> AniPortrait
Zejun-Yang_AniPortrait/scripts/pose2vid_fa_tr_lk.py L:41
3.1 Step-by-step execution (see chameleon/projects/animake/CMDS_1.md for details)
3.1.1 Generate facial keypoints
This step generates the facial keypoints for an input image.
(pt210g118) user@8a100svr3:~/lk/proj/mypj/chameleon/processes$
python preprocess/prcs_img_ldmk_fa_diff.py \
--image_path "/home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/00000000.png" \
--saving_path "/home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/" \
--if_save_img 1 \
--image_wh 256
Parameters:
--image_path: input image
--saving_path: output directory
--if_save_img 1: whether to save the related result images
--image_wh: image resolution
3.1.2 Generate audio features
(pt210g118) user@8a100svr3:~/lk/proj/pydl/digitalhuman/KU-CVLAB_MoDiTalker/data/data_utils$
python preprocess/process_audio_lk.py \
--audio /home/user/mnt/sdg1/data/wavs/237-134500-0003.wav \
--save_sample_dir /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_smpl \
--save_hubert_dir /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_hbrt
3.1.3 Generate the keypoint sequence from the audio features
(pt210g118) user@8a100svr3:~/lk/proj/pydl/digitalhuman/KU-CVLAB_MoDiTalker/AToM$
CUDA_VISIBLE_DEVICES=3 python inference_lk.py \
--data_root /home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/for_atom \
--cond_kps_path /home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/face-centric/unposed/00000000.png.npy \
--hubert_path /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_hbrt/16000/237-134500-0003.npy \
--audio_wav_path /home/user/mnt/sdg1/data/wavs/237-134500-0003.wav \
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt
3.1.4 Generate the video from the keypoint sequence
(pt210g118) user@8a100svr3:~/lk/proj/pydl/digitalhuman/Zejun-Yang_AniPortrait$
python -m scripts.pose2vid_fa_tr_lk \
--config configs/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--ref_image_path "/home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "/home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_kpts/frontalized_npy/00000000.png/atom_0.npy" \
--tgt_audio_path /home/user/mnt/sdg1/data/wavs/237-134500-0003.wav \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/aniportrait/inf_p2v_trn32_21
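Each step above consumes a file written by an earlier step, and the intermediate paths follow a visible pattern in the example commands. The sketch below derives them from the step inputs so they do not have to be typed by hand. It is inferred only from the paths shown above: the `16000` subdirectory, the `face-centric/unposed` layout and the `atom_0.npy` file name are copied from those commands and may differ for other inputs. `derived_paths` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def derived_paths(image_path, kps_saving_path, audio_path, hubert_dir, kpts_dir):
    """Derive the intermediate file paths passed between steps 3.1.1 to 3.1.4."""
    image = PurePosixPath(image_path)
    audio = PurePosixPath(audio_path)
    return {
        # written by 3.1.1, passed as --cond_kps_path in 3.1.3
        "cond_kps_path": PurePosixPath(kps_saving_path)
        / "face-centric" / "unposed" / f"{image.name}.npy",
        # written by 3.1.2, passed as --hubert_path in 3.1.3
        "hubert_path": PurePosixPath(hubert_dir) / "16000" / f"{audio.stem}.npy",
        # written by 3.1.3, passed as --tgt_pose_path in 3.1.4
        "tgt_pose_path": PurePosixPath(kpts_dir)
        / "frontalized_npy" / image.name / "atom_0.npy",
    }
```

With the example inputs from the commands above, the returned values match the `--cond_kps_path`, `--hubert_path` and `--tgt_pose_path` arguments used there.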
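3.2 Single-command execution
(see chameleon/projects/animake/CMDS_3.md for details)
Main script: chameleon/projects/animake/main_1.py
(Note: update main_1.py L:51 and L:59 to match your MoDiTalker and AniPortrait code directories)
3.2.1 Generate a video from existing audio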
python main_1.py \
\
--input_audio /home/user/mnt/sdg1/data/wavs/121-121726-0000.wav \
--save_sample_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_smpl \
--save_hubert_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_hbrt \
\
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config /home/user/lk/proj/pydl/digitalhuman/Zejun-Yang_AniPortrait/configs/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--device "cuda" \
--ref_image_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256/Clip+_OUh5xHwjqs+P0+C2+F23393-23551/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12
3.2.2 Generate speech from text, then generate the video
python main_1.py \
\
--tts_txt "a novel framework for generating high-quality animation driven by audio and a reference portrait image." \
--save_sample_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_smpl \
--save_hubert_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_hbrt \
\
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config /home/user/lk/proj/pydl/digitalhuman/Zejun-Yang_AniPortrait/configs/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--device "cuda" \
--ref_image_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256/Clip+_OUh5xHwjqs+P0+C2+F23393-23551/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12
3.2.3 Generate a video from a preprocessed keypoint sequence
python main_1.py \
\
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config /home/user/lk/proj/pydl/digitalhuman/Zejun-Yang_AniPortrait/configs/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--device "cuda" \
--ref_image_path "/home/user/mnt/sdg1/data/imgs/face_11/ref_images/solo.png" \
--ref_pose_path "" \
--tgt_pose_type "fcup" \
--tgt_pose_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256_kps_42_unfcpsok_img/face-centric/unposed/Clip+_pcoGxTYEKk+P0+C0+F106-276/" \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12
3.3 Single-command execution (see chameleon/projects/animake/CMDS_21.md for details)
Main script: chameleon/projects/animake/main_2.py
3.3.1 Generate a video from existing audio
python main_2.py \
\
--input_audio /home/user/mnt/sdg1/data/wavs/121-121726-0000.wav \
--save_sample_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_smpl \
--save_hubert_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_hbrt \
\
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config ../../configs/aniptrt/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--device "cuda" \
--ref_image_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256/Clip+_OUh5xHwjqs+P0+C2+F23393-23551/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12
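Parameters:
--input_audio: input audio file (wav)
--save_sample_dir: directory for the audio after sample-rate conversion
--save_hubert_dir: directory for the extracted audio features
--save_dir: directory for the generated keypoint sequence
--checkpoint: path to the audio-to-keypoint-sequence model
--config: config file of the keypoint-sequence-to-image-sequence model; the model weights are specified inside it
--ref_image_path: path to the reference image
--out_save_path: path where the generated video is saved
Or use the shorter command: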
python main_2.py \
\
--input_audio /data/likun/data/data/wavs/LJ050-0180.wav \
\
--checkpoint /data/likun/outs/chameleon/moditalk/trn_31_lrs3/exp/weights/train-2000.pt \
\
--config ../../configs/aniptrt/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--ref_image_path "/data/likun/data/data/face/FFHQ/FFHQ512x512_p1/00004.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /data/likun/outs/chameleon/animake/main_2_42_41
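The long and short forms above differ only in which optional flags are passed. When launching many runs from Python, the command can be assembled from a dictionary so that unset options are simply omitted. This is a minimal sketch; `build_command` is a hypothetical helper, and the option values are copied from the shorter command above. It only builds and prints the command, it does not run `main_2.py`.

```python
import shlex


def build_command(script, options):
    """Turn {flag: value} into an argv list, skipping flags set to None."""
    argv = ["python", script]
    for flag, value in options.items():
        if value is None:
            continue  # flag omitted from the command entirely
        argv += [flag, str(value)]
    return argv


options = {
    "--input_audio": "/data/likun/data/data/wavs/LJ050-0180.wav",
    "--save_dir": None,  # not passed in the shorter command
    "--checkpoint": "/data/likun/outs/chameleon/moditalk/trn_31_lrs3/exp/weights/train-2000.pt",
    "--config": "../../configs/aniptrt/prompts/animation_trnfa41.yaml",
    "-W": 256,
    "-H": 256,
    "-L": 0,
    "--fps": 25,
    "--ref_image_path": "/data/likun/data/data/face/FFHQ/FFHQ512x512_p1/00004.png",
    "--ref_pose_path": "",  # empty strings are kept and quoted as ''
    "--tgt_pose_path": "",
    "--out_save_path": "/data/likun/outs/chameleon/animake/main_2_42_41",
}

print(shlex.join(build_command("main_2.py", options)))
```

To execute the result, the argv list can be handed to `subprocess.run(argv, check=True, cwd=...)`, with `cwd` set to the directory that contains `main_2.py` so that the relative `--config` path resolves.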
3.3.2 Generate speech from text, then generate the video
python main_2.py \
\
--tts_txt "a novel framework for generating high-quality animation driven by audio and a reference portrait image." \
--save_sample_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_smpl \
--save_hubert_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_hbrt \
\
--save_dir /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12/12_kpts \
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config /home/user/lk/proj/pydl/digitalhuman/Zejun-Yang_AniPortrait/configs/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--device "cuda" \
--ref_image_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256/Clip+_OUh5xHwjqs+P0+C2+F23393-23551/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /home/user/mnt/sdg1/outs/digitalhuman/chameleon/animake/main_1_41_12
Parameters:
--tts_txt: input text
Or use the shorter command:
python main_2.py \
\
--tts_txt "a novel framework for generating high-quality animation driven by audio and a reference portrait image." \
\
--checkpoint /data/likun/outs/chameleon/moditalk/trn_31_lrs3/exp/weights/train-2000.pt \
\
--config ../../configs/aniptrt/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--ref_image_path "/home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/00000000.png" \
--ref_pose_path "" \
--tgt_pose_path "" \
--out_save_path /data/likun/outs/chameleon/animake/main_2_43_24
3.3.3 Generate a video from a preprocessed keypoint sequence
python main_2.py \
\
--checkpoint /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/trn_19_atm_1/exp/weights/train-2000.pt \
\
--config ../../configs/aniptrt/prompts/animation_trnfa41.yaml \
-W 256 \
-H 256 \
-L 0 \
--fps 25 \
--ref_image_path "/home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/00000000.png" \
--ref_pose_path "" \
--tgt_pose_type "fcup" \
--tgt_pose_path "/home/user/mnt/sdh1/data/digiman/VFHQ_512_p100_256_kps_42_unfcpsok_img/face-centric/unposed/Clip+_pcoGxTYEKk+P0+C0+F106-276/" \
--out_save_path /data/likun/outs/chameleon/animake/main_2_51_11
Parameters:
--tgt_pose_type: type of the keypoint sequence
--tgt_pose_path: directory of the pre-generated keypoint sequence
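4 Model training
4.1 Audio-to-keypoint-sequence model
The audio-to-keypoint-sequence model is trained with the architecture and method of the MoDiTalker project.
Project directory: chameleon/projects/moditalk/
For details see: chameleon/projects/moditalk/README.md
Data: lrs3 (preprocessed)
Location: user@192.168.58.238:/data/likun/data/data/wav2lib/lrs3/lrs3_tmp/
- Training command: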
CUDA_VISIBLE_DEVICES=4,5,6,7 \
torchrun --nproc_per_node=4 --master-port=30021 \
train_ddp.py \
--batch_size 128 \
--epochs 2000 \
--feature_type jukebox \
--save_interval 1 \
--processed_data_dir /data/likun/data/data/wav2lib/lrs3/lrs3_tmp \
--project /data/likun/outs/chameleon/moditalk/trn_31_lrs3
- Test command:
CUDA_VISIBLE_DEVICES=1 \
python inference_lk.py \
--cond_kps_path /home/user/mnt/sdg1/data/imgs/VFHQ_512_p100_256_kps_fortest_31/Clip+_pcoGxTYEKk+P0+C0+F106-276/face-centric/unposed/00000000.png.npy \
--hubert_path /home/user/mnt/sdg1/outs/digitalhuman/mobitalker/inf_ddp_atom_en_21/12_hbrt/16000/237-134500-0003.npy \
--audio_wav_path /home/user/mnt/sdg1/data/wavs/237-134500-0003.wav \
--save_dir /data/likun/outs/chameleon/moditalk/inf_ddp_atom_en_31/ \
--checkpoint /data/likun/outs/chameleon/moditalk/trn_31_lrs3/exp/weights/train-2000.pt
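4.2 Keypoint-sequence-to-video model
The keypoint-sequence-to-video model is trained with the architecture and method of the AniPortrait project.
Project directory: chameleon/projects/aniptrt/
For details see: chameleon/projects/aniptrt/README.md
Training is split into two stages so that different capabilities of the model are trained step by step; see the AniPortrait technical paper for details.
4.2.1 Data
Several open-source datasets are used: HDTF, CelebV-HQ, VFHQ
Data locations: user@192.168.58.238:/data/likun/data/data/HDTF/ user@192.168.58.238:/data/likun/data/data/celebv_hq/ user@192.168.58.238:/data/likun/data/data/VFHQ/
For details see the config files: chameleon/configs/aniptrt/train/stage1_fa_mlt.yaml chameleon/configs/aniptrt/train/stage2_fa_mlt.yaml
4.2.2 Train stage 1
First edit the config file chameleon/configs/aniptrt/train/stage1_fa_mlt.yaml, mainly the dataset directories and the locations where training logs and outputs are saved.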
CUDA_VISIBLE_DEVICES=4,5,6,7 \
accelerate launch --main_process_port 29502 \
train_stage_1_fa_tm.py --config ../../configs/aniptrt/train/stage1_fa_mlt.yaml
4.2.3 Train stage 2
First edit the config file chameleon/configs/aniptrt/train/stage2_fa_mlt.yaml, mainly the dataset directories and the locations where training logs and outputs are saved.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
accelerate launch --main_process_port 29522 \
train_stage_2_fa.py --config ../../configs/aniptrt/train/stage2_fa_mlt.yaml