Original code implementation of the paper "PictoBERT: Transformers for Next Pictogram Prediction".
Pictogram is the term used by the Augmentative and Alternative Communication (AAC) community for an image with a label that represents a place, person, action, object, or animal. AAC systems like the one shown below allow message construction and communication by arranging pictograms in sequence.
Pictogram prediction is an important task for AAC systems because it can facilitate communication. Previous works used n-gram statistical models or knowledge bases to accomplish this task. Our proposal is an adaptation of the BERT (Bidirectional Encoder Representations from Transformers) model to perform pictogram prediction. We changed the BERT vocabulary and input embeddings to allow the use of word-senses, considering that a word-sense better represents a pictogram. We call our version PictoBERT.
We trained the model using the CHILDES (Child Language Data Exchange System) corpora as a dataset. We annotated the North American English version of CHILDES with word-senses using supWSD. PictoBERT's performance was compared to that of n-gram models and achieved good results, as shown in the table below.
PictoBERT is capable of predicting pictograms in different contexts. Its main characteristic is its capacity for transfer learning, which allows other models focused on users' specific needs to be trained from it.
You can run the PictoBERT scripts using Google Colab, or clone the repository to your machine and open the notebooks.
git clone https://github.com/jayralencar/pictoBERT.git
We present each of the notebooks below and their relationship with the paper's content. You can execute the notebooks in the sequence given below. Downloadable versions of the resources are also available at each step.
In the paper, we present PictoBERT construction (Section 4.1) in three steps: corpus construction, BERT adaptation and pretraining.
The dataset creation is described in Section 4.1.1 of the paper and consists of downloading and annotating the North American English part of the CHILDES dataset.
SemCHILDES.ipynb | Run in Google Colab | View source on GitHub | NA-EN SemCHILDES |
We also annotated the British English part of CHILDES with semantic roles, to use for fine-tuning PictoBERT to perform pictogram prediction based on a grammatical structure.
Create_SRL_semCHILDES.ipynb | Run in Google Colab | View source on GitHub | UK-EN SemCHILDES |
To update the BERT vocabulary and embedding layer, as described in Section 4.1.2 of the paper, we first trained a word-level tokenizer and prepared the dataset for training.
Train_Tokenizer_and_Prepare_Dataset.ipynb | Run in Google Colab | View source on GitHub |
PictoBERT Tokenizer
Train dataset Test dataset Val dataset |
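To make the word-level idea concrete, here is a minimal, standard-library-only sketch of what a word-level tokenizer does: each whitespace-separated token (here, a word-sense key) maps to exactly one integer id, with no subword splitting. This is not the notebook's code. The special-token set, the toy sense keys, and the helper names `build_vocab` and `encode` are assumptions made for illustration.

```python
from collections import Counter

# Assumed BERT-style special tokens; the notebook may use a different set.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(sentences, min_frequency=1):
    """Map every whitespace-separated token to an integer id."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent tokens get the lowest ids; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_frequency:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to ids, wrapped in [CLS] ... [SEP]."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(tok, unk) for tok in sentence.split()]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]


# Toy sense-annotated sentences; these sense keys are made up.
corpus = [
    "i%1 want%2 water%1",
    "i%1 want%2 cookie%1",
]
vocab = build_vocab(corpus)
ids = encode("i%1 want%2 juice%1", vocab)  # "juice%1" is unseen, so it maps to [UNK]
```

Because every sense key is a single vocabulary entry, the embedding layer needs one row per key, which is why the vocabulary and the embeddings have to be changed together.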
Then, we created the models by changing the BERT embeddings and vocabulary:
Create_Models.ipynb | Run in Google Colab | View source on GitHub |
PictoBERT contextualized
PictoBERT gloss-based |
As described in Section 4.1.3 of the paper, we split SemCHILDES 98/1/1 into training, validation, and test sets. We used a batch size of 128 sequences with 32 tokens. Each data batch was collated to choose 15% of the tokens for prediction. We used a learning rate of
Training_PictoBERT.ipynb | Run in Google Colab | View source on GitHub |
PictoBERT contextualized
PictoBERT gloss-based |
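The 98/1/1 split and the selection of 15% of the tokens for prediction can be sketched in plain Python as follows. This is an illustration under assumptions, not the training notebook's code: the function names are hypothetical, the shuffle seed is arbitrary, and every selected token is simply replaced by `[MASK]`, which is a simplification of the usual BERT collation.

```python
import random


def split_98_1_1(sentences, seed=0):
    """Shuffle, then split into train/validation/test with a 98/1/1 ratio."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_val = len(items) // 100
    n_test = len(items) // 100
    n_train = len(items) - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def mask_tokens(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Select each position for prediction with probability `mask_prob`.

    Returns (inputs, labels). A label is None where no prediction is made.
    Simplification: every selected token is replaced by `mask_token`.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


sentences = [f"sentence {i}" for i in range(1000)]
train, val, test = split_98_1_1(sentences)

rng = random.Random(42)
tokens = ["tok"] * 32  # one sequence of 32 tokens
inputs, labels = mask_tokens(tokens, rng)
```

Selection is per position and random, so the number of masked tokens in a single 32-token sequence varies around 15% rather than being fixed.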
As mentioned in the paper (Section 5.1), we compare PictoBERT's performance with that of n-gram models. Using the notebook below, we trained n-gram models with orders varying from 2 to 7.
N-gram models.ipynb | Run in Google Colab | View source on GitHub | N-gram models |
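For readers unfamiliar with n-gram baselines, the sketch below trains count-based models of orders 2 to 7 on a toy corpus and scores a held-out sentence by perplexity. It is a standard-library illustration only: the toolkit and smoothing method used in the notebook are not stated here, so add-one (Laplace) smoothing, the function names, and the toy sentences are all assumptions.

```python
import math
from collections import Counter


def train_ngram(sentences, order):
    """Count n-grams and their (order - 1)-token contexts."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] * (order - 1) + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab


def perplexity(sentences, model, order):
    """Perplexity under add-one smoothing (an assumed choice)."""
    ngrams, contexts, vocab = model
    log_prob, n_tokens = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] * (order - 1) + sentence.split() + ["</s>"]
        for i in range(order - 1, len(tokens)):
            context = tuple(tokens[i - order + 1:i])
            count = ngrams[context + (tokens[i],)]
            prob = (count + 1) / (contexts[context] + len(vocab))
            log_prob += math.log(prob)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


train_sentences = ["i want water", "i want cookie", "you want water"]
held_out = ["i want water"]

# One model per order, from bigrams (2) up to 7-grams.
results = {
    order: perplexity(held_out, train_ngram(train_sentences, order), order)
    for order in range(2, 8)
}
```

Lower perplexity means the model assigns higher probability to the held-out text, which is what makes it usable as a common yardstick across model types.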
As described in Section 5.2 of PictoBERT's paper, we fine-tuned two versions of the model: one for pictogram prediction based on a grammatical structure and the other for making predictions based on the ARASAAC vocabulary.
This section refers to Section 5.2.1 of the PictoBERT paper.
For fine-tuning the model, we used the UK-EN SemCHILDES, presented in Section 1.2 of this document, as the basis.
All the procedures for fine-tuning are described in the following notebook:
Fine_tuning_PictoBERT_(colourful_semantics).ipynb | Run in Google Colab | View source on GitHub | Fine-tuned PictoBERT (contextualized) Fine-tuned PictoBERT (gloss-based) Tokenizer |
In addition, we replicated the method proposed by Pereira et al. (2020) for constructing semantic grammars to compare with PictoBERT. Semantic grammars are generally represented using OWL ontologies. We opted to represent them in relational databases to allow faster queries.
Semantic_Grammar.ipynb | Run in Google Colab | View source on GitHub | Semantic Grammars (db versions) |
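As a rough illustration of why a relational representation makes lookups fast, the sketch below stores grammar entries in an in-memory SQLite table and retrieves candidates with an indexed query. The schema is a toy of our own for this example: the table name, the column names, the role labels, and the sense identifiers are all hypothetical and do not describe the layout of the released database files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE grammar (
        verb_sense TEXT NOT NULL,  -- hypothetical column names
        role       TEXT NOT NULL,
        filler     TEXT NOT NULL
    )
    """
)
# An index on the lookup columns avoids a full table scan per query.
conn.execute("CREATE INDEX idx_verb_role ON grammar (verb_sense, role)")

# Made-up entries: which senses may fill which role of a verb sense.
rows = [
    ("eat.v.01", "agent", "child.n.01"),
    ("eat.v.01", "patient", "apple.n.01"),
    ("eat.v.01", "patient", "bread.n.01"),
    ("drink.v.01", "patient", "water.n.06"),
]
conn.executemany("INSERT INTO grammar VALUES (?, ?, ?)", rows)


def candidates(verb_sense, role):
    """Return the senses allowed for `role` of `verb_sense`, sorted by name."""
    cur = conn.execute(
        "SELECT filler FROM grammar WHERE verb_sense = ? AND role = ? ORDER BY filler",
        (verb_sense, role),
    )
    return [row[0] for row in cur]


patients = candidates("eat.v.01", "patient")
```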
This section refers to Section 5.2.2 of the PictoBERT paper.
The notebook presents:
- The procedure for mapping ARASAAC pictograms to WordNet word-senses
- The procedure for changing SemCHILDES to keep only sentences in which all tokens are also in the vocabulary generated in the first step
- Train tokenizer
- Train models
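The second step above, keeping only sentences whose tokens all belong to the mapped vocabulary, can be sketched as follows. The function name, the toy vocabulary, and the sense keys are hypothetical and serve only to illustrate the filtering rule.

```python
def filter_in_vocabulary(sentences, vocabulary):
    """Keep only non-empty sentences whose tokens are all in `vocabulary`."""
    vocabulary = set(vocabulary)
    kept = []
    for sentence in sentences:
        tokens = sentence.split()
        if tokens and all(tok in vocabulary for tok in tokens):
            kept.append(sentence)
    return kept


# Toy vocabulary standing in for the senses mapped from pictograms.
pictogram_vocabulary = {"i%1", "want%2", "water%1", "cookie%1"}

sentences = [
    "i%1 want%2 water%1",
    "i%1 want%2 telescope%1",  # dropped: "telescope%1" has no pictogram here
    "want%2 cookie%1",
]
filtered = filter_in_vocabulary(sentences, pictogram_vocabulary)
```

Dropping whole sentences rather than replacing unknown tokens keeps every training example fully expressible with the available pictograms, at the cost of a smaller corpus.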
We also trained n-gram models to compare with the fine-tuned models.
N-gram models.ipynb | Run in Google Colab | View source on GitHub | N-gram models |
The evaluation scripts, as well as the results, are in the following notebook.
evaluation_codeocean.ipynb | Run in Google Colab | View source on GitHub |
@article{PEREIRA2022117231,
title = {Picto{BERT}: Transformers for next pictogram prediction},
journal = {Expert Systems with Applications},
volume = {202},
pages = {117231},
year = {2022},
issn = {0957-4174},
doi = {https://doi.org/10.1016/j.eswa.2022.117231},
url = {https://www.sciencedirect.com/science/article/pii/S095741742200611X},
author = {Jayr Alencar Pereira and David Macêdo and Cleber Zanchettin and Adriano Lorena Inácio {de Oliveira} and Robson do Nascimento Fidalgo},
keywords = {Augmentative and alternative communication, Language modeling, Pictogram prediction},
}