Model Summary
This model takes as input a speaker embedding and the content representation of an audio file, together with the initial static landmarks generated by another model that predicts facial landmark points as a speaker says specific words. From these inputs it generates speaker-aware landmark displacements. The self-attention network is used in speaker-aware audio-driven animation, which takes in a single image and generates a video animation. Most machine-learning talking-head models only take into account the lips and jaw of the speaker. With this model, the motion of the head and the subtle correlation between mouth and eyebrows become crucial cues for generating plausible talking heads. The model is therefore trained on both the content and the nuances of the speaker, producing a self-attention encoder that is specific to the speaker.
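The sketch below shows one way such a speaker-aware displacement predictor could be wired together in PyTorch. Every class name, argument, and dimension here is an illustrative assumption rather than the actual MakeItTalk implementation; it only demonstrates the flow from (audio content features, speaker embedding, static landmarks) to per-frame landmark displacements.

```python
# Minimal sketch of a speaker-aware landmark-displacement predictor.
# All names and dimensions are assumptions, not the MakeItTalk code.
import torch
import torch.nn as nn


class SpeakerAwareDisplacementNet(nn.Module):
    def __init__(self, content_dim=256, speaker_dim=256, n_landmarks=68,
                 d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Fuse per-frame audio content features with the (repeated) speaker embedding.
        self.input_proj = nn.Linear(content_dim + speaker_dim, d_model)
        # Self-attention over the time axis can capture long-range dynamics
        # such as head motion, not just lip/jaw articulation.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Predict a displacement (dx, dy, dz) for each landmark at each frame.
        self.head = nn.Linear(d_model, n_landmarks * 3)

    def forward(self, content, speaker, static_landmarks):
        # content:          (batch, frames, content_dim) audio content features
        # speaker:          (batch, speaker_dim)          speaker identity embedding
        # static_landmarks: (batch, n_landmarks, 3)       initial neutral landmarks
        b, t, _ = content.shape
        speaker_seq = speaker.unsqueeze(1).expand(-1, t, -1)
        x = self.input_proj(torch.cat([content, speaker_seq], dim=-1))
        x = self.encoder(x)
        disp = self.head(x).view(b, t, -1, 3)          # per-frame displacements
        return static_landmarks.unsqueeze(1) + disp    # animated landmark sequence


# Example with random tensors, just to show the expected shapes.
net = SpeakerAwareDisplacementNet()
content = torch.randn(1, 100, 256)        # ~100 audio frames
speaker = torch.randn(1, 256)
static = torch.randn(1, 68, 3)
animated = net(content, speaker, static)  # -> (1, 100, 68, 3)
```

Repeating the speaker embedding across time lets the self-attention encoder condition every frame on the speaker identity while still attending over the whole audio sequence.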
This specific pre-trained implementation of the model is sourced from the MakeItTalk project. To see it work in live action, take a look at the space here.
Training
The model was trained on the audio-visual dataset VoxCeleb2, which contains video segments from a wide variety of speakers [Chung et al. 2018]. VoxCeleb2 was originally designed for speaker verification. A subset of 67 speakers with a total of 1,232 video clips from VoxCeleb2 was used, amounting on average to 5-10 minutes of video per speaker [Zhou et al. 2020].
Performance
The model was evaluated against two baselines: “retrieve-same ID” and “retrieve-random ID”. These baselines retrieve the head pose and position sequence from another video clip randomly picked from the training set; the facial landmarks are then translated and rotated to reproduce the copied head poses and positions. The first baseline, “retrieve-same ID”, uses a training video of the same speaker as in the test video, which makes it a stronger baseline since it re-uses dynamics from the same speaker. The second baseline, “retrieve-random ID”, uses a video from a different, randomly chosen speaker; it is useful for examining whether the method and its alternatives produce head poses and facial expressions that are better than random.
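As a rough illustration of these retrieval baselines, the sketch below copies a head-pose sequence (Euler angles and positions) from another clip and re-applies it to the test speaker's landmarks with a rigid transform. The function name, angle convention, and data shapes are assumptions for illustration only.

```python
# Hedged sketch of a "retrieve" baseline: head pose is copied from another clip
# and imposed on the test landmarks via rotation + translation.
import numpy as np
from scipy.spatial.transform import Rotation


def apply_retrieved_pose(landmarks, retrieved_euler, retrieved_pos):
    """landmarks:       (frames, n_landmarks, 3) frontalized facial landmarks
    retrieved_euler:    (frames, 3) head rotation (degrees) copied from another clip
    retrieved_pos:      (frames, 3) head position copied from another clip
    Returns landmarks rotated and translated to follow the retrieved head motion."""
    out = np.empty_like(landmarks)
    for t in range(landmarks.shape[0]):
        R = Rotation.from_euler("xyz", retrieved_euler[t], degrees=True).as_matrix()
        out[t] = landmarks[t] @ R.T + retrieved_pos[t]
    return out


# "retrieve-random ID": copy the pose from a clip of a random, different speaker;
# "retrieve-same ID":   copy the pose from another clip of the same speaker.
rng = np.random.default_rng(0)
test_landmarks = rng.normal(size=(100, 68, 3))
random_clip_euler = rng.uniform(-15, 15, size=(100, 3))
random_clip_pos = rng.uniform(-0.1, 0.1, size=(100, 3))
baseline = apply_retrieved_pose(test_landmarks, random_clip_euler, random_clip_pos)
```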
The model achieves much smaller errors than both baselines, indicating that the speaker-aware prediction is more faithful than merely copying head motion from another video. It produces 2.7× less error in head pose (D-Rot) and 1.7× less error in head position (D-Pos) compared to using a random speaker identity (see “retrieve-random ID”). This result also confirms that the head motion dynamics of random speakers largely differ from the ground truth.
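For reference, the sketch below shows one plausible way such head-pose errors could be computed. The exact definitions of D-Rot and D-Pos follow the MakeItTalk paper; the mean-absolute-difference form used here is an assumption.

```python
# Hedged sketch of rotation/position error metrics in the spirit of D-Rot and D-Pos,
# assumed here to be mean absolute differences over all frames.
import numpy as np


def head_pose_errors(pred_euler, gt_euler, pred_pos, gt_pos):
    """All inputs are (frames, 3) arrays: Euler angles in degrees and head positions."""
    d_rot = np.mean(np.abs(pred_euler - gt_euler))  # average rotation error
    d_pos = np.mean(np.abs(pred_pos - gt_pos))      # average position error
    return d_rot, d_pos


# Trivial usage example with placeholder arrays.
pred, gt = np.zeros((100, 3)), np.ones((100, 3))
d_rot, d_pos = head_pose_errors(pred, gt, pred, gt)  # -> (1.0, 1.0)
```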
Ethical Considerations
“Deepfake videos” are becoming more prevalent in everyday life, yet the general public might still think that talking-head videos are hard or impossible to generate synthetically. As a result, algorithms for talking-head generation can be misused to spread misinformation or for other malicious acts. This code can help people understand that generating such videos is entirely feasible; the main intention of distributing the model is to spread awareness and demystify this technology. The main code repository adds a watermark to the generated videos, making it clear that they are synthetic. The code can be altered to remove the watermark, but users are always encouraged to use the technology responsibly and ethically.