Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony

* equal technical contribution. This work was completed in collaboration with Carnegie Mellon University.

Abstract

In this paper, we propose a novel framework, Combo, for harmonious co-speech holistic 3D human motion generation and efficient customizable adaptation. In particular, we identify one fundamental challenge as the multiple-input-multiple-output (MIMO) nature of the generative model of interest. More concretely, on the input end, the model typically consumes both speech signals and character guidance (e.g., identity and emotion), which not only strains learning capacity but also hinders further adaptation to varying guidance; on the output end, holistic human motions mainly consist of facial expressions and body movements, which are inherently correlated but non-trivial to coordinate in current data-driven generation processes. In response to these challenges, we propose tailored designs for both ends. For the former, we pre-train on data from a fixed identity with neutral emotion, and defer the incorporation of customizable conditions (identity and emotion) to the fine-tuning stage, which is boosted by our novel X-Adapter for parameter-efficient fine-tuning. For the latter, we propose a simple yet effective transformer design, DU-Trans, which first divides into two branches to learn individual features of facial expressions and body movements, and then unites them to learn a joint bi-directional distribution and directly predict combined coefficients. Evaluated on the BEAT2 and SHOW datasets, Combo is not only highly effective in generating high-quality motions but also efficient in transferring identity and emotion. The code will be released at https://github.com/modelscope/facechain.

Demo Video

Framework


Overview of Combo. The basic architecture, DU-Trans (a), first introduces two transformer encoders Ψ_F and Ψ_B, each equipped with an auxiliary loss (L_F, L_B) and Bi-Flow to help model its respective distribution, yielding two sets of discriminative features f_F^l and f_B^l. It then merges these two features and feeds them into the decoder Φ_FB to learn the joint distribution, directly using a single head to predict synchronized and coordinated face and body coefficients. X-Adapter (b) is the central module for achieving identity customization and emotion transfer; it is simply inserted in parallel with the MHA and FFN layers of the two encoders. Note that this adapter is a general structure suitable for both identity and emotion, enabling more controllable generation through the conditions z_e and z_id.

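To make the figure concrete, here is a minimal PyTorch sketch of the two modules. It is written under stated assumptions rather than taken from the released facechain code: the class names (XAdapter, EncoderLayerWithXAdapter, DUTrans), the hidden/condition/output dimensions, and the FiLM-style modulation by z_e / z_id are all illustrative choices, and the auxiliary losses L_F, L_B and the Bi-Flow module are omitted for brevity.

import torch
import torch.nn as nn


class XAdapter(nn.Module):
    """Bottleneck adapter attached in parallel to an MHA or FFN sub-layer.

    Only these small modules are tuned during adaptation; the condition z
    (identity z_id or emotion z_e) modulates the bottleneck features.
    (FiLM-style modulation is an assumption, not the paper's exact design.)
    """

    def __init__(self, dim: int, cond_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * bottleneck)
        # Zero-init the up-projection so the adapter starts as a no-op.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x, z):              # x: (B, T, dim), z: (B, cond_dim)
        h = self.act(self.down(x))
        scale, shift = self.to_scale_shift(z).unsqueeze(1).chunk(2, dim=-1)
        h = h * (1 + scale) + shift       # inject identity/emotion condition
        return self.up(h)                 # added to the frozen sub-layer output


class EncoderLayerWithXAdapter(nn.Module):
    """Transformer encoder layer with X-Adapters in parallel to MHA and FFN."""

    def __init__(self, dim=512, heads=8, cond_dim=256, ffn_mult=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim)
        )
        self.adapter_mha = XAdapter(dim, cond_dim)
        self.adapter_ffn = XAdapter(dim, cond_dim)

    def forward(self, x, z):
        h = self.norm1(x)
        attn, _ = self.mha(h, h, h)
        x = x + attn + self.adapter_mha(h, z)          # parallel to MHA
        h = self.norm2(x)
        x = x + self.ffn(h) + self.adapter_ffn(h, z)   # parallel to FFN
        return x


class DUTrans(nn.Module):
    """Divide-then-unite: encoders Psi_F / Psi_B learn face and body features
    separately, the decoder Phi_FB learns the joint distribution, and a single
    head predicts combined face + body coefficients."""

    def __init__(self, dim=512, cond_dim=256, face_dim=100, body_dim=165, n_layers=4):
        super().__init__()
        self.enc_face = nn.ModuleList(
            [EncoderLayerWithXAdapter(dim, 8, cond_dim) for _ in range(n_layers)]
        )
        self.enc_body = nn.ModuleList(
            [EncoderLayerWithXAdapter(dim, 8, cond_dim) for _ in range(n_layers)]
        )
        self.fuse = nn.Linear(2 * dim, dim)
        self.dec = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 8, batch_first=True), num_layers=n_layers
        )
        self.head = nn.Linear(dim, face_dim + body_dim)  # one head, combined output

    def forward(self, audio_feat, z):      # audio_feat: (B, T, dim)
        f_f = f_b = audio_feat
        for layer_f, layer_b in zip(self.enc_face, self.enc_body):
            f_f, f_b = layer_f(f_f, z), layer_b(f_b, z)
        joint = self.dec(self.fuse(torch.cat([f_f, f_b], dim=-1)))
        return self.head(joint)            # synchronized face + body coefficients

In this sketch, customization would proceed by freezing the pre-trained encoders, decoder, and head, and updating only the XAdapter parameters conditioned on z_e or z_id, which is what makes the fine-tuning in the efficiency study below parameter-efficient.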

Verification on DU-Trans


Qualitative comparison with TalkSHOW and EMAGE on the BEAT2 dataset. The left part shows the holistic motions, while the right presents a close-up of the expressions. Our method generates expressions and gestures that are synchronized with the audio, particularly producing accurate and diverse gestures for rhythm, semantics, and specific concepts.


Quantitative comparison with SOTA methods on the BEAT2 dataset. The * indicates training from scratch (pre-training), while the other marker signifies fine-tuning the emotional model from the neutral pre-trained one. For simplicity, we report MSE (×10⁻⁸) and LVD (×10⁻⁵), following EMAGE.


Emotion Control and Identity Personalization


Visualization of multi-modal emotion control and identity personalization. The left part shows manipulated outputs guided by sad images, happy audio, and surprised motion clips. The right part displays motions from various source identities, as well as the motions after fine-tuning, which transfers them to the target identity, Scott.


Efficiency Analysis on X-Adapter


Tuning efficiency of X-Adapter. We choose MSE for the face and BC for the body in this visualization. Values below 0 on the y-axis (gray fill) indicate inferior performance compared to EMAGE, and vice versa. Our design exhibits exceptional tuning efficiency in terms of training time and data, achieving SOTA performance within 45 minutes with full data (Ours) or half the data (Ours-50), or even within 90 minutes with only 25% of the training data (Ours-25). Ours-FPFT denotes full-parameter fine-tuning of DU-Trans (w/o X-Adapter) with full data.


BibTeX

@article{xu2024combo,
  title={Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony},
  author={Xu, Chao and Sun, Mingze and Cheng, Zhi-Qi and Wang, Fei and Liu, Yang and Sun, Baigui and Huang, Ruqi and Hauptmann, Alexander},
  journal={arXiv preprint arXiv:2408.09397},
  year={2024}
}