PictoBERT: Transformers for Next Pictogram Prediction

Original code implementation of the paper "PictoBERT: Transformers for Next Pictogram Prediction".

Pictogram is the term used by the Augmentative and Alternative Communication (AAC) community for an image with a label that represents a place, person, action, object, or animal. AAC systems like the one shown below allow message construction and communication by arranging pictograms in sequence.

image

Pictogram prediction is an important task for AAC systems because it can facilitate communication. Previous work used n-gram statistical models or knowledge bases to accomplish this task. Our proposal adapts the BERT (Bidirectional Encoder Representations from Transformers) model to perform pictogram prediction. We changed the BERT vocabulary and input embeddings to allow the use of word-senses, considering that a word-sense is a better representation of a pictogram. We call our version PictoBERT.

using_flow_2_hl

We trained the model on the CHILDES (Child Language Data Exchange System) corpora. We annotated the North American English version of CHILDES with word-senses using supWSD. We compared PictoBERT's performance with that of n-gram models, and it achieved good results, as shown in the table below.

image

PictoBERT can predict pictograms in different contexts. Its main characteristic is its capacity for transfer learning, which allows other models focused on users' specific needs to be trained from it.

image

Software requirements

Execution

You can run the PictoBERT scripts in Google Colab, or clone the repository to your machine and open the notebooks locally.

git clone https://github.com/jayralencar/pictoBERT.git

We present each of the notebooks below, along with its relationship to the paper's content. You can execute the notebooks in the sequence given below; downloadable versions of the resources are also available at each step.

1. PictoBERT

In the paper, we present PictoBERT construction (Section 4.1) in three steps: corpus construction, BERT adaptation and pretraining.

1.2 Dataset Creation

The dataset creation is described in Section 4.1.1 of the paper and consists of downloading and annotating the North American English part of the CHILDES dataset.

SemCHILDES.ipynb | Run in Google Colab | View source on GitHub | NA-EN SemCHILDES



In addition, we also annotated the British English part of CHILDES with semantic roles to use for fine-tuning PictoBERT to perform pictogram prediction based on a grammatical structure.

Create_SRL_semCHILDES.ipynb | Run in Google Colab | View source on GitHub | UK-EN SemCHILDES



1.3 Updating BERT Vocabulary and Embeddings Layer

To update the BERT vocabulary and embeddings layer, as described in Section 4.1.2 of the paper, we first trained a word-level tokenizer and prepared the dataset for training.

Train_Tokenizer_and_Prepare_Dataset.ipynb | Run in Google Colab | View source on GitHub | PictoBERT Tokenizer, Train dataset, Test dataset, Val dataset
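As a rough illustration of what a word-level tokenizer over sense-annotated text does, here is a minimal pure-Python sketch. It is not the notebook's code: the function names, the special-token list, and the sense-key-like strings in the sample corpus are assumptions made for this example only.

```python
from collections import Counter

# Assumption: BERT-style special tokens; the actual tokenizer may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(sentences, min_freq=1):
    """Map each whitespace-separated token to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len=32):
    """Encode one sentence as [CLS] ... [SEP], truncated and padded to max_len."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.split()]
    ids = [vocab["[CLS]"]] + ids[: max_len - 2] + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


# Illustrative sense-key-like tokens; not taken from SemCHILDES.
corpus = [
    "i want%2:37:00:: water%1:27:00::",
    "i want%2:37:00:: juice%1:13:00::",
]
vocab = build_vocab(corpus)
ids = encode("i want%2:37:00:: milk", vocab)
```

Because every sense-annotated token is kept whole, each word-sense gets its own vocabulary entry, and tokens not seen during vocabulary construction fall back to `[UNK]`.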



Then, we created the models by changing the BERT embeddings and vocabulary:

Create_Models.ipynb | Run in Google Colab | View source on GitHub | PictoBERT contextualized, PictoBERT gloss-based



1.4 Pre-Training PictoBERT

As described in Section 4.1.3 of the paper, we split SemCHILDES 98/1/1 into training, validation, and test sets. We used a batch size of 128 sequences with 32 tokens each. Each data batch was collated so that 15% of the tokens were chosen for prediction. We used a learning rate of $1 \times 10^{-4}$, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, and linear learning-rate decay. Each version of PictoBERT was trained for 500 epochs on a single 16GB NVIDIA Tesla V100 GPU.

Training_PictoBERT.ipynb | Run in Google Colab | View source on GitHub | PictoBERT contextualized, PictoBERT gloss-based
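The data handling described above can be sketched in plain Python. This is a simplified illustration, not the training notebook's code: the helper names are hypothetical, the shuffle before splitting is an assumption, the masking picks a fixed count per sequence rather than sampling each token independently, and the schedule assumes no warmup.

```python
import random


def split_98_1_1(sentences, seed=0):
    """Shuffle and split into train/validation/test with a 98/1/1 ratio."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumption: shuffled before splitting
    n_val = len(items) // 100
    n_test = len(items) // 100
    n_train = len(items) - n_val - n_test
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def choose_masked_positions(token_ids, special_ids, rate=0.15, rng=random):
    """Pick about `rate` of the non-special positions as prediction targets."""
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_pick = max(1, round(len(candidates) * rate)) if candidates else 0
    return sorted(rng.sample(candidates, n_pick))


def linear_decay_lr(step, total_steps, base_lr=1e-4):
    """Learning rate decaying linearly from base_lr to 0 (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


train, val, test = split_98_1_1(range(1000))
positions = choose_masked_positions(
    list(range(32)), special_ids={0, 31}, rng=random.Random(0)
)
```

With 32-token sequences and two special tokens, this selects 4 of the 30 remaining positions per sequence, close to the 15% target.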



1.5 Training n-gram models

As mentioned in the paper (Section 5.1), we compare PictoBERT's performance against that of n-gram models. Using the notebook below, we trained n-gram models with orders varying from 2 to 7.

N-gram models.ipynb | Run in Google Colab | View source on GitHub | N-gram models
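For readers unfamiliar with n-gram baselines, the following sketch shows the basic idea using plain maximum-likelihood counts for orders 2 to 7. It is an assumption-laden illustration: the notebook may use a dedicated library and smoothing, and the function names and toy corpus here are hypothetical.

```python
from collections import Counter, defaultdict


def train_ngram(sentences, order):
    """Count next-token frequencies for every context of length order - 1."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] * (order - 1) + sent.split() + ["</s>"]
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            counts[context][tokens[i]] += 1
    return counts


def predict_next(counts, context_tokens, order, k=3):
    """Return the k most frequent continuations of the last order - 1 tokens."""
    padded = ["<s>"] * (order - 1) + list(context_tokens)
    context = tuple(padded[len(padded) - (order - 1):])
    return [tok for tok, _ in counts.get(context, Counter()).most_common(k)]


# Toy corpus for illustration only.
corpus = ["i want water", "i want juice", "i want water"]
models = {n: train_ngram(corpus, n) for n in range(2, 8)}
suggestions = predict_next(models[2], ["i", "want"], order=2)
```

A model of order n conditions only on the previous n - 1 tokens, which is the limitation that motivates comparing against a Transformer-based model.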



2. Fine-tuning PictoBERT

As described in Section 5.2 of PictoBERT's paper, we fine-tuned two versions of the model: one for pictogram prediction based on a grammatical structure and the other for making predictions based on the ARASAAC vocabulary.

2.1. Pictogram Prediction Based on a Grammatical Structure

This section corresponds to Section 5.2.1 of the PictoBERT paper.

To fine-tune the model, we used the UK-EN SemCHILDES presented in Section 1.2 of this document as the basis.

All the procedures for fine-tuning are described in the following notebook:

Fine_tuning_PictoBERT_(colourful_semantics).ipynb | Run in Google Colab | View source on GitHub | Fine-tuned PictoBERT (contextualized), Fine-tuned PictoBERT (gloss-based), Tokenizer



In addition, we replicated the method proposed by Pereira et al. (2020) for constructing semantic grammars to compare with PictoBERT. Semantic grammars are generally represented using OWL ontologies. We opted to represent them in relational databases to allow faster queries.

Semantic_Grammar.ipynb | Run in Google Colab | View source on GitHub | Semantic Grammars (db versions)



2.2 Using ARASAAC vocabulary

This section corresponds to Section 5.2.2 of the PictoBERT paper.

The notebook presents:

  1. The procedure for mapping ARASAAC pictograms to WordNet word-senses
  2. The procedure for reducing SemCHILDES to keep only sentences in which all tokens are also in the vocabulary generated in step 1
  3. Training the tokenizer
  4. Training the models

ARASAAC_fine_tuned_PictoBERT.ipynb | Run in Google Colab | View source on GitHub | Pictogram to word-sense mappings, Reduced SemCHILDES (corpus), Tokenizer, ARASAAC PictoBERT (contextualized), ARASAAC PictoBERT (gloss-based)
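Step 2 of the list above, keeping only sentences whose tokens are all covered by the pictogram vocabulary, can be sketched as follows. This is a minimal illustration under stated assumptions: the mapping, the pictogram ids, the sense-key-like strings, and the function name are all hypothetical and do not come from the notebook.

```python
def filter_corpus(sentences, allowed_vocab):
    """Keep only non-empty sentences whose tokens are all in allowed_vocab."""
    allowed = set(allowed_vocab)
    kept = []
    for sent in sentences:
        tokens = sent.split()
        if tokens and all(tok in allowed for tok in tokens):
            kept.append(sent)
    return kept


# Hypothetical pictogram-id -> word-sense mapping (illustrative values only).
pictogram_to_sense = {
    101: "want%2:37:00::",
    102: "water%1:27:00::",
    103: "i",
}

sentences = [
    "i want%2:37:00:: water%1:27:00::",
    "i want%2:37:00:: juice%1:13:00::",  # dropped: juice sense not in mapping
]
reduced = filter_corpus(sentences, pictogram_to_sense.values())
```

Filtering at the sentence level keeps the reduced corpus free of unknown tokens, at the cost of discarding any sentence with even one unmapped word-sense.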



We also trained n-gram models to compare with the fine-tuned models.

N-gram models.ipynb | Run in Google Colab | View source on GitHub | N-gram models



Evaluation

The evaluation scripts, as well as the results, are in the following notebook.

evaluation_codeocean.ipynb | Run in Google Colab | View source on GitHub



Cite

@article{PEREIRA2022117231,
	title = {Picto{BERT}: Transformers for next pictogram prediction},
	journal = {Expert Systems with Applications},
	volume = {202},
	pages = {117231},
	year = {2022},
	issn = {0957-4174},
	doi = {https://doi.org/10.1016/j.eswa.2022.117231},
	url = {https://www.sciencedirect.com/science/article/pii/S095741742200611X},
	author = {Jayr Alencar Pereira and David Macêdo and Cleber Zanchettin and Adriano Lorena Inácio {de Oliveira} and Robson do Nascimento Fidalgo},
	keywords = {Augmentative and alternative communication, Language modeling, Pictogram prediction},
}