DmitrMakeev committed
Commit c626b55
Parent: 183ec7c

Upload 7 files

Files changed (7)
  1. .gitattributes +59 -0
  2. README.md +59 -0
  3. config.py +3 -0
  4. image_preprocess.py +57 -0
  5. phindex.json +1 -0
  6. requirements.txt +8 -0
  7. test_script.py +180 -0
.gitattributes ADDED
@@ -0,0 +1,59 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ OpenFace/FaceLandmarkVidMulti filter=lfs diff=lfs merge=lfs -text
+ OpenFace/FeatureExtraction filter=lfs diff=lfs merge=lfs -text
+ OpenFace/FaceLandmarkVid filter=lfs diff=lfs merge=lfs -text
+ OpenFace/FaceLandmarkImg filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/detection_validation/validator_cnn.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/detection_validation/validator_cnn_68.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/model_inner/patch_experts/ccnf_patches_1.00_inner.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.5_wild.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_1_wild.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.35_multi_pie.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.35_wild.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.25_wild.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.5_multi_pie.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.25_multi_pie.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.5_general.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.25_general.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/ccnf_patches_0.35_general.txt filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/mtcnn_detector/ONet.dat filter=lfs diff=lfs merge=lfs -text
+ samples/audios/trump.wav filter=lfs diff=lfs merge=lfs -text
+ samples/audios/abstract.wav filter=lfs diff=lfs merge=lfs -text
+ samples/audios/obama2.wav filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/cen_patches_0.35_of.dat filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/cen_patches_0.25_of.dat filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/cen_patches_1.00_of.dat filter=lfs diff=lfs merge=lfs -text
+ OpenFace/model/patch_experts/cen_patches_0.50_of.dat filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,59 @@
+ # One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning (AAAI 2022)
+
+ #### [Paper](https://arxiv.org/pdf/2112.02749.pdf) | [Demo](https://www.youtube.com/watch?v=HHj-XCXXePY)
+
+ #### Requirements
+
+ - Python >= 3.6, PyTorch >= 1.8, and ffmpeg
+ - Set up [OpenFace](https://github.com/TadasBaltrusaitis/OpenFace)
+   - We use the OpenFace tools to extract the initial pose of the reference image.
+   - Make sure this tool is installed, then set `OPENFACE_POSE_EXTRACTOR_PATH` in `config.py`. On Windows, for example, it should be the absolute path of `FeatureExtraction.exe`.
+ - The other Python dependencies are listed in `requirements.txt`.
+
+ #### Pretrained Checkpoint
+
+ Please download the pretrained checkpoint from [google-drive](https://drive.google.com/file/d/1mjFEozPR_2vMaVRMd9Agk_sU1VaiUYMl/view?usp=sharing) and unzip it into the `checkpoints/` directory, or manually set `GENERATOR_CKPT` and `AUDIO2POSE_CKPT` in `config.py`.
+
+ #### Extract phoneme
+
+ We employ the [CMU phoneset](https://github.com/cmusphinx/cmudict) to represent phonemes; the extra 'SIL' token means silence. The full phoneme set is listed in `phindex.json`.
+
+ We have already extracted the phonemes for the audio files in the `samples/audios` directory. For other audio, you can extract phonemes with another ASR tool and then map them to the CMU phoneset, or email [email protected] for help.
+
+ #### Generate Demo Results
+
+ ```
+ python test_script.py --img_path xxx.jpg --audio_path xxx.wav --phoneme_path xxx.json --save_dir "YOUR_DIR"
+ ```
+
+ Note that the input image must have equal height and width, and the face should be appropriately cropped as in `samples/imgs`. You can preprocess your own images with `image_preprocess.py`.
+
+ #### License and Citation
+
+ ```
+ @InProceedings{wang2021one,
+   author    = {Suzhen Wang and Lincheng Li and Yu Ding and Xin Yu},
+   title     = {One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning},
+   booktitle = {AAAI 2022},
+   year      = {2022},
+ }
+ ```
+
+ #### Acknowledgement
+
+ This codebase is built on [First Order Motion Model](https://github.com/AliaksandrSiarohin/first-order-model) and [imaginaire](https://github.com/NVlabs/imaginaire); thanks to the authors for their contributions.
+
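The README's demo command assumes an already-cropped square image and a phoneme JSON that matches the audio. A minimal end-to-end sketch of that workflow, assuming the repository root is the working directory; the photo and phoneme file names below are placeholders, while `samples/audios/obama2.wav` is one of the LFS-tracked sample audios in this commit:

```
# Sketch: crop a source photo with image_preprocess.py, then run the documented CLI.
import subprocess
from image_preprocess import crop_src_image

crop_src_image("my_photo.jpg", "my_photo_cropped.jpg")    # hypothetical input photo

subprocess.run([
    "python", "test_script.py",
    "--img_path", "my_photo_cropped.jpg",
    "--audio_path", "samples/audios/obama2.wav",          # sample audio shipped via LFS
    "--phoneme_path", "obama2_phonemes.json",             # hypothetical phoneme file (see "Extract phoneme")
    "--save_dir", "samples/results",
], check=True)
```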
config.py ADDED
@@ -0,0 +1,3 @@
+ OPENFACE_POSE_EXTRACTOR_PATH = "/content/one-shot-talking-face/OpenFace/FeatureExtraction"
+ GENERATOR_CKPT = "/content/one-shot-talking-face/checkpoints/generator.ckpt"
+ AUDIO2POSE_CKPT = "/content/one-shot-talking-face/checkpoints/audio2pose.ckpt"
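These constants are written for a Colab-style `/content/...` checkout. If the repository lives somewhere else (or on Windows, where `OPENFACE_POSE_EXTRACTOR_PATH` should point at `FeatureExtraction.exe`), the paths need to be adapted; a hedged sketch of a relative-path variant with a simple existence check:

```
# Hypothetical local variant of config.py; adjust the paths to your own checkout.
import os

REPO_ROOT = os.path.dirname(os.path.abspath(__file__))

# On Windows this would be the absolute path to FeatureExtraction.exe instead.
OPENFACE_POSE_EXTRACTOR_PATH = os.path.join(REPO_ROOT, "OpenFace", "FeatureExtraction")
GENERATOR_CKPT = os.path.join(REPO_ROOT, "checkpoints", "generator.ckpt")
AUDIO2POSE_CKPT = os.path.join(REPO_ROOT, "checkpoints", "audio2pose.ckpt")

# Warn early if a required file is missing instead of failing mid-run.
for name, path in [("OpenFace extractor", OPENFACE_POSE_EXTRACTOR_PATH),
                   ("generator checkpoint", GENERATOR_CKPT),
                   ("audio2pose checkpoint", AUDIO2POSE_CKPT)]:
    if not os.path.exists(path):
        print("Warning: %s not found at %s" % (name, path))
```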
image_preprocess.py ADDED
@@ -0,0 +1,57 @@
+ import dlib
+ import cv2
+
+
+ def compute_aspect_preserved_bbox(bbox, increase_area, h, w):
+     # Enlarge the detected face box by `increase_area` while preserving its aspect,
+     # clamping the result to the image bounds (h, w).
+     left, top, right, bot = bbox
+     width = right - left
+     height = bot - top
+
+     width_increase = max(increase_area, ((1 + 2 * increase_area) * height - width) / (2 * width))
+     height_increase = max(increase_area, ((1 + 2 * increase_area) * width - height) / (2 * height))
+
+     left_t = int(left - width_increase * width)
+     top_t = int(top - height_increase * height)
+     right_t = int(right + width_increase * width)
+     bot_t = int(bot + height_increase * height)
+
+     left_oob = -min(0, left_t)
+     right_oob = right - min(right_t, w)
+     top_oob = -min(0, top_t)
+     bot_oob = bot - min(bot_t, h)
+
+     if max(left_oob, right_oob, top_oob, bot_oob) > 0:
+         # Part of the enlarged box falls outside the image: shrink it symmetrically.
+         max_w = max(left_oob, right_oob)
+         max_h = max(top_oob, bot_oob)
+         if max_w > max_h:
+             return left_t + max_w, top_t + max_w, right_t - max_w, bot_t - max_w
+         else:
+             return left_t + max_h, top_t + max_h, right_t - max_h, bot_t - max_h
+     else:
+         return (left_t, top_t, right_t, bot_t)
+
+
+ def crop_src_image(src_img, save_img, detector=None):
+     # Detect the face, crop an enlarged region around it, and resize to 256x256.
+     if detector is None:
+         detector = dlib.get_frontal_face_detector()
+
+     img = cv2.imread(src_img)
+     faces = detector(img, 0)
+     h, width, _ = img.shape
+     if len(faces) > 0:
+         bbox = [faces[0].left(), faces[0].top(), faces[0].right(), faces[0].bottom()]
+         l = bbox[3] - bbox[1]
+         bbox[1] = bbox[1] - l * 0.1
+         bbox[3] = bbox[3] - l * 0.1
+         bbox[1] = max(0, bbox[1])
+         bbox[3] = min(h, bbox[3])
+         bbox = compute_aspect_preserved_bbox(tuple(bbox), 0.5, img.shape[0], img.shape[1])
+         img = img[bbox[1]:bbox[3], bbox[0]:bbox[2]]
+         img = cv2.resize(img, (256, 256))
+         cv2.imwrite(save_img, img)
+     else:
+         # No face detected: fall back to resizing the whole image.
+         img = cv2.resize(img, (256, 256))
+         cv2.imwrite(save_img, img)
+
+
+ if __name__ == '__main__':
+     src_img = ""
+     out_img = ""
+     crop_src_image(src_img, out_img)
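The `__main__` block above leaves `src_img` and `out_img` empty. A short usage sketch (file names are placeholders); the output is a 256x256 face crop suitable as `--img_path` for `test_script.py`:

```
# Illustrative call; replace the paths with your own files.
from image_preprocess import crop_src_image

crop_src_image("samples/imgs/my_photo.jpg", "samples/imgs/my_photo_cropped.jpg")
```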
phindex.json ADDED
@@ -0,0 +1 @@
+ {"AA": 0, "AE": 1, "AH": 2, "AO": 3, "AW": 4, "AY": 5, "B": 6, "CH": 7, "D": 8, "DH": 9, "EH": 10, "ER": 11, "EY": 12, "F": 13, "G": 14, "HH": 15, "IH": 16, "IY": 17, "JH": 18, "K": 19, "L": 20, "M": 21, "N": 22, "NG": 23, "NSN": 24, "OW": 25, "OY": 26, "P": 27, "R": 28, "S": 29, "SH": 30, "SIL": 31, "T": 32, "TH": 33, "UH": 34, "UW": 35, "V": 36, "W": 37, "Y": 38, "Z": 39, "ZH": 40}
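`phindex.json` maps each CMU phoneme symbol (plus 'SIL' for silence) to an integer index; index 31 is the 'SIL' value that `test_script.py` uses for padding. The frame-level JSON layout expected by `tools.interface.parse_phoneme_file` is not part of this commit, so the sketch below only shows the symbol-to-index lookup one would apply to an external aligner's output:

```
# Sketch: map a CMU phoneme sequence (e.g. from a forced aligner) to phindex indices.
import json

with open("phindex.json") as f:
    phindex = json.load(f)

phones = ["SIL", "HH", "AH", "L", "OW", "SIL"]                 # hypothetical aligner output
indices = [phindex.get(p, phindex["SIL"]) for p in phones]     # unknown symbols fall back to silence
print(indices)  # [31, 15, 2, 20, 25, 31]
```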
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ scikit-image
+ python_speech_features
+ pyworld
+ pyyaml
+ imageio
+ scipy
+ opencv-python
test_script.py ADDED
@@ -0,0 +1,180 @@
+ import os
+ import numpy as np
+ import torch
+ import yaml
+ from models.generator import OcclusionAwareGenerator
+ from models.keypoint_detector import KPDetector
+ import argparse
+ import imageio
+ from models.util import draw_annotation_box
+ from models.transformer import Audio2kpTransformer
+ from scipy.io import wavfile
+ from tools.interface import read_img, get_img_pose, get_pose_from_audio, get_audio_feature_from_audio, \
+     parse_phoneme_file, load_ckpt
+ import config
+
+
+ def normalize_kp(kp_source, kp_driving, kp_driving_initial,
+                  use_relative_movement=True, use_relative_jacobian=True):
+     # Express the driving keypoints relative to the first driving frame,
+     # then transfer that relative motion onto the source keypoints.
+     kp_new = {k: v for k, v in kp_driving.items()}
+     if use_relative_movement:
+         kp_value_diff = (kp_driving['value'] - kp_driving_initial['value'])
+         kp_new['value'] = kp_value_diff + kp_source['value']
+
+     if use_relative_jacobian:
+         jacobian_diff = torch.matmul(kp_driving['jacobian'], torch.inverse(kp_driving_initial['jacobian']))
+         kp_new['jacobian'] = torch.matmul(jacobian_diff, kp_source['jacobian'])
+
+     return kp_new
+
+
+ def test_with_input_audio_and_image(img_path, audio_path, phs, generator_ckpt, audio2pose_ckpt, save_dir="samples/results"):
+     with open("config_file/vox-256.yaml") as f:
+         config = yaml.full_load(f)  # note: shadows the imported `config` module inside this function
+     cur_path = os.getcwd()
+
+     # Resample the audio to 16 kHz mono PCM if necessary.
+     sr, _ = wavfile.read(audio_path)
+     if sr != 16000:
+         temp_audio = os.path.join(cur_path, "samples", "temp.wav")
+         command = "ffmpeg -y -i %s -async 1 -ac 1 -vn -acodec pcm_s16le -ar 16000 %s" % (audio_path, temp_audio)
+         os.system(command)
+     else:
+         temp_audio = audio_path
+
+     opt = argparse.Namespace(**yaml.full_load(open("config_file/audio2kp.yaml")))
+
+     img = read_img(img_path).cuda()
+
+     # Initial head pose of the reference image, estimated with OpenFace.
+     first_pose = get_img_pose(img_path)
+
+     audio_feature = get_audio_feature_from_audio(temp_audio)
+     frames = len(audio_feature) // 4
+     frames = min(frames, len(phs["phone_list"]))
+
+     # Predict a head-pose sequence from the audio, conditioned on the initial pose.
+     tp = np.zeros([256, 256], dtype=np.float32)
+     draw_annotation_box(tp, first_pose[:3], first_pose[3:])
+     tp = torch.from_numpy(tp).unsqueeze(0).unsqueeze(0).cuda()
+     ref_pose = get_pose_from_audio(tp, audio_feature, audio2pose_ckpt)
+     torch.cuda.empty_cache()
+     trans_seq = ref_pose[:, 3:]
+     rot_seq = ref_pose[:, :3]
+
+     audio_seq = audio_feature
+     ph_seq = phs["phone_list"]
+
+     # Build per-frame windows of phonemes, audio features and pose maps.
+     ph_frames = []
+     audio_frames = []
+     pose_frames = []
+     name_len = frames
+
+     pad = np.zeros((4, audio_seq.shape[1]), dtype=np.float32)
+
+     for rid in range(0, frames):
+         ph = []
+         audio = []
+         pose = []
+         for i in range(rid - opt.num_w, rid + opt.num_w + 1):
+             if i < 0:
+                 rot = rot_seq[0]
+                 trans = trans_seq[0]
+                 ph.append(31)  # 31 = 'SIL' in phindex.json
+                 audio.append(pad)
+             elif i >= name_len:
+                 ph.append(31)
+                 rot = rot_seq[name_len - 1]
+                 trans = trans_seq[name_len - 1]
+                 audio.append(pad)
+             else:
+                 ph.append(ph_seq[i])
+                 rot = rot_seq[i]
+                 trans = trans_seq[i]
+                 audio.append(audio_seq[i * 4:i * 4 + 4])
+             tmp_pose = np.zeros([256, 256])
+             draw_annotation_box(tmp_pose, np.array(rot), np.array(trans))
+             pose.append(tmp_pose)
+
+         ph_frames.append(ph)
+         audio_frames.append(audio)
+         pose_frames.append(pose)
+
+     audio_f = torch.from_numpy(np.array(audio_frames, dtype=np.float32)).unsqueeze(0)
+     poses = torch.from_numpy(np.array(pose_frames, dtype=np.float32)).unsqueeze(0)
+     ph_frames = torch.from_numpy(np.array(ph_frames)).unsqueeze(0)
+     bs = audio_f.shape[1]
+     predictions_gen = []
+
+     kp_detector = KPDetector(**config['model_params']['kp_detector_params'],
+                              **config['model_params']['common_params'])
+     generator = OcclusionAwareGenerator(**config['model_params']['generator_params'],
+                                         **config['model_params']['common_params'])
+     kp_detector = kp_detector.cuda()
+     generator = generator.cuda()
+
+     ph2kp = Audio2kpTransformer(opt).cuda()
+
+     load_ckpt(generator_ckpt, kp_detector=kp_detector, generator=generator, ph2kp=ph2kp)
+
+     ph2kp.eval()
+     generator.eval()
+     kp_detector.eval()
+
+     with torch.no_grad():
+         for frame_idx in range(bs):
+             t = {}
+
+             t["audio"] = audio_f[:, frame_idx].cuda()
+             t["pose"] = poses[:, frame_idx].cuda()
+             t["ph"] = ph_frames[:, frame_idx].cuda()
+             t["id_img"] = img
+
+             kp_gen_source = kp_detector(img, True)
+
+             gen_kp = ph2kp(t, kp_gen_source)
+             if frame_idx == 0:
+                 drive_first = gen_kp
+
+             norm = normalize_kp(kp_source=kp_gen_source, kp_driving=gen_kp, kp_driving_initial=drive_first)
+             out_gen = generator(img, kp_source=kp_gen_source, kp_driving=norm)
+
+             predictions_gen.append(
+                 (np.transpose(out_gen['prediction'].data.cpu().numpy(), [0, 2, 3, 1])[0] * 255).astype(np.uint8))
+
+     # Write the silent video, then mux the original audio back in with ffmpeg.
+     log_dir = save_dir
+     os.makedirs(os.path.join(log_dir, "temp"), exist_ok=True)
+
+     f_name = os.path.basename(img_path)[:-4] + "_" + os.path.basename(audio_path)[:-4] + ".mp4"
+     video_path = os.path.join(log_dir, "temp", f_name)
+     print("save video to: ", video_path)
+     imageio.mimsave(video_path, predictions_gen, fps=25.0)
+
+     save_video = os.path.join(log_dir, f_name)
+     cmd = r'ffmpeg -y -i "%s" -i "%s" -vcodec copy "%s"' % (video_path, audio_path, save_video)
+     os.system(cmd)
+     os.remove(video_path)
+
+
+ if __name__ == '__main__':
+     argparser = argparse.ArgumentParser()
+     argparser.add_argument("--img_path", type=str, default=None, help="path of the input image (.jpg), preprocessed by image_preprocess.py")
+     argparser.add_argument("--audio_path", type=str, default=None, help="path of the input audio (.wav)")
+     argparser.add_argument("--phoneme_path", type=str, default=None, help="path of the input phoneme file; it must be consistent with the input audio")
+     argparser.add_argument("--save_dir", type=str, default="samples/results", help="directory for the output video")
+     args = argparser.parse_args()
+
+     phoneme = parse_phoneme_file(args.phoneme_path)
+     test_with_input_audio_and_image(args.img_path, args.audio_path, phoneme, config.GENERATOR_CKPT, config.AUDIO2POSE_CKPT, args.save_dir)
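For completeness, the same pipeline can be driven from Python instead of the CLI. A hedged sketch using the checkpoints from `config.py`; the image and phoneme paths are placeholders, and the phoneme file must be in whatever format `parse_phoneme_file` expects:

```
# Programmatic variant of the CLI entry point above; paths are placeholders.
import config
from tools.interface import parse_phoneme_file
from test_script import test_with_input_audio_and_image

phoneme = parse_phoneme_file("samples/phonemes/trump.json")    # hypothetical phoneme file
test_with_input_audio_and_image(
    img_path="samples/imgs/trump_cropped.jpg",                 # hypothetical preprocessed image
    audio_path="samples/audios/trump.wav",                     # LFS-tracked sample audio
    phs=phoneme,
    generator_ckpt=config.GENERATOR_CKPT,
    audio2pose_ckpt=config.AUDIO2POSE_CKPT,
    save_dir="samples/results",
)
```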