Sunday, November 28, 2021

A new algorithm to animate faces with raw speech recordings

Aakash Molpariya

Developers from Microsoft Research have introduced an algorithm that can animate static images of faces using unprocessed recordings of human speech. The model they created is context-sensitive: it extracts from the audio not only phonetic characteristics but also emotional tone and background noise, which allows it to superimpose all of these aspects of speech onto a static frame.

Most approaches to animating still images transfer information from a video to the target frame. Developers have already achieved significant success with this task: there are now models that can reliably transfer speech from a video sequence onto a static frame, recreating the speaker's facial expressions.

Difficulties arise, however, when an image needs to be "revived" using only audio. Existing algorithms that map audio onto a static frame to produce a natural-looking animation, or even a video, of the speaking process share a limitation: they work only with clear, clearly audible speech delivered in a neutral voice, without emotional coloring. Human speech, however, is multifaceted, and ideally such algorithms should learn to recreate all of its aspects.

Gaurav Mittal and Baoyuan Wang of Microsoft Research set out to do exactly this. Their algorithm takes an audio file as input and, using a variational autoencoder built on long short-term memory (LSTM) neural networks, identifies its key aspects: the phonetic and emotional components (the algorithm recognizes six basic emotions) as well as background noise. Based on this information, the speaker's facial expressions are reconstructed (video data is used for this step) and superimposed onto the initially static image.
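The idea of running an LSTM over audio features and projecting the result into separate phonetic, emotional, and noise factors can be sketched roughly as below. This is an illustrative toy, not the paper's actual architecture: all dimensions, weight initializations, and the three "head" projections are assumptions, the weights are random rather than trained, and the variational (sampling) machinery of the real model is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
N_FRAMES, N_MEL = 50, 80       # audio frames x mel-spectrogram bins
HIDDEN, CONTENT, NOISE = 64, 32, 8
N_EMOTIONS = 6                 # the six basic emotions the model recognizes

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (input, forget, cell, output gates)."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

# Randomly initialized weights stand in for trained parameters.
W = rng.normal(0, 0.1, (4 * HIDDEN, N_MEL))
U = rng.normal(0, 0.1, (4 * HIDDEN, HIDDEN))
b = np.zeros(4 * HIDDEN)
heads = {name: rng.normal(0, 0.1, (dim, HIDDEN))
         for name, dim in [("content", CONTENT),
                           ("emotion", N_EMOTIONS),
                           ("noise", NOISE)]}

def encode(mel):
    """Run the LSTM over the spectrogram, then project the final hidden
    state into three separate factors: phonetic content, emotion, noise."""
    h = c = np.zeros(HIDDEN)
    for frame in mel:
        h, c = lstm_step(frame, h, c, W, U, b)
    logits = heads["emotion"] @ h
    emotion = np.exp(logits - logits.max())
    emotion /= emotion.sum()                  # softmax over the 6 emotions
    return heads["content"] @ h, emotion, heads["noise"] @ h

mel = rng.normal(size=(N_FRAMES, N_MEL))      # stand-in for real audio features
content, emotion, noise = encode(mel)
print(content.shape, emotion.shape, noise.shape)  # (32,) (6,) (8,)
```

In the actual system these disentangled factors would condition a separate decoder that reconstructs facial motion; here the sketch stops at the encoding step the paragraph describes.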

To train the algorithm, the researchers used three datasets: GRID, which consists of thousands of video recordings of 34 people speaking with a neutral expression; 7,400 video recordings of emotionally colored speech from the CREMA-D dataset; and more than one hundred thousand TED video clips.

As a result, the researchers were able to animate static images even from audio with up to 40 decibels of background noise, and to successfully carry the emotional components of the speaker's speech into the animation. The authors do not provide the animations themselves, but they do compare the resulting frames with the output of one of the first such algorithms.

Comparison of the results of one of the original algorithms and the new algorithm on pronouncing individual sounds, with various emotions and noise added. Gaurav Mittal & Baoyuan Wang / arXiv.org

The authors also note that their algorithm can be plugged into any existing system that animates static images from audio: doing so only requires replacing the audio-processing component of the third-party algorithm.

Speech, of course, carries a great deal of information about the speaker, not only about emotions and intentions but also, for example, about appearance. Recently, American developers trained an algorithm to reconstruct a person's approximate appearance from a recording of their speech: the system accurately conveys the speaker's gender, age, and race.

Via | arXiv



