EMO2
EMO2 is an audio-driven talking head framework that generates expressive facial expressions and hand gestures simultaneously. Whereas existing approaches primarily target full-body or half-body pose generation, EMO2 examines the specific challenges of audio-driven gesture synthesis and identifies a key limitation: audio features correlate only weakly with full-body gestures. To address this, the task is reformulated as a two-stage process. In the first stage, hand poses are generated directly from the audio input, exploiting the stronger correlation between audio signals and hand movements. In the second stage, a diffusion model synthesizes video frames, using the hand poses from the first stage to guide the generation of realistic facial expressions and body movements.
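The data flow of the two-stage process can be illustrated with a minimal sketch. This is not the EMO2 implementation: every name here (HandPose, audio_to_hand_poses, synthesize_frames) is hypothetical, the audio features are plain lists of numbers, and a toy refinement loop stands in for the diffusion model. It only shows how hand poses produced in stage one could condition frame synthesis in stage two.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class HandPose:
    # Hypothetical representation: flat list of keypoint values for one frame.
    keypoints: List[float]


def audio_to_hand_poses(audio_features: List[List[float]],
                        n_keypoints: int = 4) -> List[HandPose]:
    """Stage 1 stand-in: produce one hand pose per audio frame.

    A real system would use a learned model; here the mean absolute
    value of the audio frame simply scales each keypoint.
    """
    poses = []
    for frame in audio_features:
        energy = sum(abs(x) for x in frame) / max(len(frame), 1)
        poses.append(HandPose([energy * (k + 1) for k in range(n_keypoints)]))
    return poses


def synthesize_frames(poses: List[HandPose],
                      audio_features: List[List[float]],
                      steps: int = 3,
                      frame_size: int = 8,
                      seed: int = 0) -> List[List[float]]:
    """Stage 2 stand-in: refine noise into a frame, conditioned on pose and audio.

    This is NOT a diffusion model. It only mimics the shape of one:
    start from random noise and iteratively move toward a value
    derived from the conditioning signals.
    """
    rng = random.Random(seed)
    frames = []
    for pose, audio in zip(poses, audio_features):
        frame = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        cond = pose.keypoints + audio
        target = sum(cond) / len(cond)
        for _ in range(steps):
            # Each step halves the distance between every pixel and the target.
            frame = [0.5 * px + 0.5 * target for px in frame]
        frames.append(frame)
    return frames


# Two audio frames with three (made-up) features each.
audio = [[0.1, -0.2, 0.3], [0.0, 0.0, 0.0]]
poses = audio_to_hand_poses(audio)         # stage 1: audio -> hand poses
frames = synthesize_frames(poses, audio)   # stage 2: poses + audio -> frames
```

The point of the split is that stage two never has to infer gestures from audio alone; it receives explicit hand poses as guidance, which is where the stronger audio correlation is exploited.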
See also EMO.