EMO
EMO (Emote Portrait Alive) is an AI-driven system that generates expressive talking-portrait videos from a single still image and an audio clip. Developed by researchers at Alibaba's Institute for Intelligent Computing, EMO synthesizes realistic facial animation, including lip movements, head motion, and emotional expressions, that aligns with the rhythm, content, and tone of the voice input. The result is a dynamic, photo-realistic video in which the subject of the reference image appears to speak or sing naturally.
The system achieves this through a two-stage architecture: it first encodes visual features from the reference image, then uses a diffusion-based model, conditioned on those features and on the audio, to generate a coherent sequence of video frames over time. EMO can produce videos of extended duration while preserving the subject's identity, and it supports diverse vocal inputs across languages and emotional styles. It represents a significant step forward in audio-driven facial animation, combining photo-realism with flexibility in the inputs it accepts.
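The two-stage flow described above can be sketched in miniature. The snippet below is a toy illustration only, not EMO's actual networks: the "reference encoder" and conditioning layers are random matrices rather than trained models, the per-frame audio features are placeholders, and the "denoising" loop is a simple blend that stands in for a real diffusion sampler. It only shows the data flow: one identity code from the image, then one generated frame per audio window, each conditioned on both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, chosen only for this sketch).
D_IMG, D_FEAT, D_AUDIO = 64, 16, 8

# Random stand-ins for trained networks.
W_REF = rng.standard_normal((D_FEAT, D_IMG)) / np.sqrt(D_IMG)
W_ID = rng.standard_normal((D_IMG, D_FEAT)) / np.sqrt(D_FEAT)
W_AUDIO = rng.standard_normal((D_IMG, D_AUDIO)) / np.sqrt(D_AUDIO)


def encode_reference(image_vec):
    """Stage 1: project a flattened reference image to an identity code."""
    return np.tanh(W_REF @ image_vec)


def denoise_step(frame, identity, audio_feat, t, n_steps):
    """One toy 'denoising' step: pull the noisy frame toward a target
    built from identity and audio conditioning (not a real diffusion update)."""
    target = W_ID @ identity + W_AUDIO @ audio_feat
    alpha = (t + 1) / n_steps  # blend more strongly as the loop progresses
    return (1 - alpha) * frame + alpha * target


def generate_video(image_vec, audio_feats, n_steps=10):
    identity = encode_reference(image_vec)  # stage 1: shared across all frames
    frames = []
    for audio_feat in audio_feats:          # one frame per audio window
        frame = rng.standard_normal(D_IMG)  # stage 2: start from noise
        for t in range(n_steps):            # iteratively refine toward the target
            frame = denoise_step(frame, identity, audio_feat, t, n_steps)
        frames.append(frame)
    return np.stack(frames)


video = generate_video(rng.standard_normal(D_IMG),
                       rng.standard_normal((5, D_AUDIO)))
print(video.shape)  # (5, 64): 5 frames, one per audio window
```

The key property the sketch preserves is that every frame is conditioned on the same identity code (keeping the subject consistent) while the audio features vary per frame, which is what lets the output track the voice over a long sequence.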