Phonetic representations are used when recording spoken languages, but no equivalent exists for recording signed languages. As a result, linguists have proposed several annotation systems that operate on the gloss or sub-unit level; however, these resources are notably irregular and scarce.
Sign Language Production (SLP) aims to automatically translate spoken language sentences into continuous sequences of sign language. However, current state-of-the-art approaches rely on scarce linguistic resources, which has limited progress in the field. This paper introduces a solution that transforms the continuous pose generation problem into a discrete sequence generation problem, removing the need for costly annotation. Where annotation is available, we leverage the additional information to enhance our approach.
By applying Vector Quantisation (VQ) to sign language data, we first learn a codebook of short motions that can be combined to create a natural sequence of sign. The codebook can be thought of as the lexicon of our representation, with each token an entry in it. Then, using a transformer, we translate from spoken language text to a sequence of codebook tokens. Each token maps directly to a sequence of poses, allowing the translation to be performed by a single network. Furthermore, we present a sign stitching method to effectively join tokens together. We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) and the more challenging Meine DGS Annotated (mDGS) datasets. An extensive evaluation shows our approach outperforms previous methods, increasing the BLEU-1 back translation score by up to 72%.
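The quantisation and token-to-pose mapping described above can be illustrated with a minimal sketch. It assumes each short motion is flattened into a single vector and that the codebook is already learned; the function names `quantise` and `decode` and the toy values are hypothetical, not taken from the paper's code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(chunks: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each motion chunk to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(chunk, codebook[k]))
        for chunk in chunks
    ]


def decode(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up each token to recover the motion it stands for."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook of three "motions" (assumed values, 2-D for readability).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
chunks = [[0.9, 1.2], [3.5, 4.1], [0.1, -0.2]]

tokens = quantise(chunks, codebook)
reconstruction = decode(tokens, codebook)
print(tokens)  # [1, 2, 0]
```

Because decoding is a plain table lookup, a model that predicts token indices from text yields poses without a separate pose-regression network.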
Previous approaches to SLP attempt to regress pose directly from the spoken language. This leads to under-articulated signing, as the generated motion regresses to the mean. In contrast, by first learning a codebook, we can ensure our new lexicon is expressive.
Here we present example tokens from the Codebooks.
Here we present translation examples.
Left skeleton - the ground truth extracted from the original videos.
Middle skeleton - applying the codebook to quantize the ground truth sequence.
Right skeleton - the translation output from the Text-to-Tokens transformer.
Note: the following examples show the baseline model without the stitching module. In "Comparison to Progressive Transformer" below we add the stitching module to show its effectiveness in creating smoother, more natural signing sequences.
Failure case:
The PHOENIX14T dataset contains only a single view of the signer. As a result, our pose estimator struggles to capture some high-frequency movements, as can be seen in the ground truth data. In addition, for longer sequences, the model can struggle to capture all the fine-grained detail in the handshape.
Failure case:
In the following examples, the model is able to capture the motion, but the fine detail in the hands is lost during the quantization step.
Here we compare our full approach to the Progressive Transformer. We apply both contrastive learning and the stitching module, so the examples show smooth, continuous signing.
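To give a rough sense of what joining tokens involves, the sketch below bridges two decoded motions with linearly interpolated frames. This page does not describe the internals of the stitching module, so linear interpolation is a stand-in chosen for illustration, and `stitch` and `n_blend` are hypothetical names.

```python
from typing import List, Sequence

Frame = Sequence[float]


def stitch(a: Sequence[Frame], b: Sequence[Frame], n_blend: int = 2) -> List[List[float]]:
    """Join two motions, inserting n_blend frames interpolated between
    the last frame of `a` and the first frame of `b`.

    Assumption: plain linear interpolation, used here only as a stand-in.
    """
    last, first = a[-1], b[0]
    bridge = []
    for i in range(1, n_blend + 1):
        w = i / (n_blend + 1)  # blend weight moves from a towards b
        bridge.append([(1 - w) * p + w * q for p, q in zip(last, first)])
    return [list(f) for f in a] + bridge + [list(f) for f in b]


# Two toy one-joint motions with a visible jump between them.
motion_a = [[0.0], [0.0]]
motion_b = [[3.0], [3.0]]
joined = stitch(motion_a, motion_b, n_blend=2)
print(len(joined))  # 6
```

Without the bridge, the pose would jump from 0.0 to 3.0 in a single frame; the inserted frames spread that change over several steps.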
As shown by the first plot ("Without replacement"), without replacement the codebook collapses to a single token, meaning it cannot accurately quantise a sequence of sign language. The second plot ("With Replacement") shows that our aggressive replacement strategy helps distribute tokens evenly within the embedding space, allowing the accurate quantization shown in the videos above. Finally, as the last plot shows, our contrastive loss has a significant impact on the embedding space. We suggest that the non-uniform distribution of the tokens reflects the model collapsing lexical variants and overcoming signer-dependent features.
Without replacement | With Replacement | With Contrastive Learning |
---|---|---|
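One common way to counter codebook collapse is to re-initialise rarely used entries from recent encoder outputs. The sketch below shows that idea under stated assumptions: the usage threshold, the random choice of replacement, and the name `replace_unused_codes` are illustrative and may differ from the replacement strategy actually used here.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def replace_unused_codes(
    codebook: Sequence[Vector],
    batch: Sequence[Vector],
    assignments: Sequence[int],
    min_usage: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[int]]:
    """Re-initialise codebook entries used fewer than `min_usage` times.

    Each under-used entry is overwritten with a randomly chosen vector
    from `batch` (assumed to be encoder outputs), which moves dead codes
    back into the region of the space where the data lies.
    """
    rng = rng or random.Random(0)
    usage = Counter(assignments)
    new_codebook = [list(code) for code in codebook]
    replaced = []
    for k in range(len(new_codebook)):
        if usage[k] < min_usage:
            new_codebook[k] = list(rng.choice(batch))
            replaced.append(k)
    return new_codebook, replaced


# Collapsed toy codebook: every batch vector is assigned to token 0.
codebook = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
batch = [[0.1, 0.0], [0.2, 0.1], [-0.1, 0.0]]
assignments = [0, 0, 0]

new_codebook, replaced = replace_unused_codes(codebook, batch, assignments)
print(replaced)  # [1, 2]
```

After replacement, the previously unused tokens sit near the data, so subsequent batches can be assigned to them and usage spreads across the codebook.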
Distributed under the Attribution-NonCommercial-ShareAlike 4.0 International License. See LICENSE.txt
for more information.