diff --git a/docs/Bibtex.txt b/docs/Bibtex.txt
new file mode 100644
index 0000000..0382ee3
--- /dev/null
+++ b/docs/Bibtex.txt
@@ -0,0 +1,6 @@
+@inproceedings{chuan2022tm2t,
+  title={TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts},
+  author={Guo, Chuan and Zuo, Xinxin and Wang, Sen and Cheng, Li},
+  booktitle={ECCV},
+  year={2022}
+}
\ No newline at end of file
diff --git a/docs/eccv_paper.png b/docs/eccv_paper.png
new file mode 100644
index 0000000..02d8e28
Binary files /dev/null and b/docs/eccv_paper.png differ
diff --git a/docs/framework.png b/docs/framework.png
new file mode 100644
index 0000000..7941ecb
Binary files /dev/null and b/docs/framework.png differ
diff --git a/docs/index.html b/docs/index.html
index 6966784..2543dc3 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -183,7 +183,7 @@
- Automated generation of 3D human motions from text is a challenging problem. The generated motions are expected to be sufficiently diverse to explore the text-grounded motion space, and more importantly, accurately depicting the content in prescribed text descriptions. Here we tackle this problem with a two-stage approach: text2length sampling and text2motion generation. Text2length involves sampling from the learned distribution function of motion lengths conditioned on the input text. This is followed by our text2motion module using temporal variational autoencoder to synthesize a diverse set of human motions of the sampled lengths. Instead of directly engaging with pose sequences, we propose motion snippet code as our internal motion representation, which captures local semantic motion contexts and is empirically shown to facilitate the generation of plausible motions faithful to the input text. Moreover, a large-scale dataset of scripted 3D Human motions, HumanML3D, is constructed, consisting of 14,616 motion clips and 44,970 text descriptions.
+ Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing ground for the two signals, which are handled as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to effectively improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting a neural machine translation (NMT) model to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences, of variable lengths, from an input text. Our approach is flexible and can be used for both the text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods.
@@ -214,13 +214,13 @@
Acknowledgements |
diff --git a/docs/teaser_image.png b/docs/teaser_image.png
new file mode 100644
index 0000000..3a1ffc5
Binary files /dev/null and b/docs/teaser_image.png differ
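
Note on the updated abstract above: the central mechanism it describes, autoregressively sampling discrete motion tokens conditioned on text so that variable-length, non-deterministic pose sequences can be produced, might be sketched roughly as below. This is a minimal, hypothetical illustration and not code from this repository; the class name TextToMotionTokens and all hyperparameters (vocab_size, eos_id, max_len, etc.) are assumptions for demonstration only.

    import torch
    import torch.nn as nn

    # Minimal sketch (not the repo's code): a transformer decoder that models
    # p(motion_token_t | motion_tokens_<t, text) and samples tokens until an
    # end-of-motion symbol, yielding variable-length, stochastic outputs.
    class TextToMotionTokens(nn.Module):
        def __init__(self, vocab_size=1024, d_model=512, n_layers=4, n_heads=8,
                     eos_id=1023, max_len=196):
            super().__init__()
            self.eos_id = eos_id
            self.max_len = max_len
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.head = nn.Linear(d_model, vocab_size)

        @torch.no_grad()
        def sample(self, text_memory, bos_id=0, temperature=1.0):
            # text_memory: (1, T_text, d_model) features from a text encoder.
            tokens = torch.tensor([[bos_id]], device=text_memory.device)
            for _ in range(self.max_len - 1):
                x = self.token_emb(tokens) + self.pos_emb(
                    torch.arange(tokens.size(1), device=tokens.device))
                h = self.decoder(x, text_memory)            # cross-attend to text
                logits = self.head(h[:, -1]) / temperature
                next_tok = torch.multinomial(logits.softmax(-1), 1)  # stochastic draw
                tokens = torch.cat([tokens, next_tok], dim=1)
                if next_tok.item() == self.eos_id:          # variable-length stop
                    break
            return tokens[:, 1:]                            # drop the BOS token

In the pipeline the abstract describes, such a sampled token sequence would then be decoded back into a pose sequence by a learned motion token decoder, and the motion2text module would re-translate generated motions into text during training to compute the inverse-alignment penalty.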