Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization despite their relatively limited expressiveness.
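For context, the sensitivity of a Boolean function f at an input x is the number of bit positions whose flip changes f(x); average sensitivity is the mean of this quantity over all inputs. The following is a small self-contained illustration of the notion (ours, not code from this repository):

```python
# Illustration of average sensitivity for two Boolean functions on 3 bits:
# parity flips its output under every single-bit flip (sensitivity 3), while
# a "dictator" function depends on one bit only (sensitivity 1).
from itertools import product

def avg_sensitivity(f, n):
    total = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            x_flip = list(x)
            x_flip[i] ^= 1  # flip the i-th bit
            total += f(tuple(x_flip)) != f(x)
    return total / 2 ** n

print(avg_sensitivity(lambda x: sum(x) % 2, 3))  # parity   -> 3.0
print(avg_sensitivity(lambda x: x[0], 3))        # dictator -> 1.0
```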
- Compatible with Python 3.
- Dependencies can be installed using Transformer-Simplicity/requirements.txt.
Install virtualenv using the following command (optional):
$ [sudo] pip install virtualenv
Create and activate your virtual environment (optional):
$ virtualenv -p python3 venv
$ source venv/bin/activate
Install all the required packages:
At Transformer-Simplicity/:
$ pip install -r requirements.txt
This repository includes four directories implementing different models and settings:
- Training Transformers on Boolean functions: Transformer-Simplicity/FLTAtt
- Training LSTMs on Boolean functions: Transformer-Simplicity/FLTClassifier
- Experiments with random Transformers: Transformer-Simplicity/RandFLTAtt
- Experiments with random LSTMs: Transformer-Simplicity/RandFLTClassifier
The full set of available command-line arguments can be found in the respective args.py files. Below, we illustrate how to train a Transformer on sparse parities; follow the same procedure to run any of the experiments with LSTMs.
At Transformer-Simplicity/FLTAtt:
$ python -m src.main -mode train -gpu 0 -dataset sparity40_5k -run_name trafo_sparity_40_5k -depth 4 -lr 0.001
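For intuition about the task, here is a minimal sketch of how a sparse parity dataset like sparity40_5k can be generated. We assume length-40 binary inputs, 5k examples, and a label given by the parity of k = 5 hidden coordinates; the exact file format expected by the repo's dataloader may differ.

```python
# Sketch of sparse parity data generation (assumed parameters: 40-bit inputs,
# 5000 examples, parity over k=5 randomly chosen coordinates).
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_samples, k = 40, 5000, 5

# Fixed hidden subset of coordinates that determines the label.
support = rng.choice(n_bits, size=k, replace=False)

X = rng.integers(0, 2, size=(n_samples, n_bits))
y = X[:, support].sum(axis=1) % 2  # parity restricted to the k support bits

print(support, X.shape, y[:10])
```

Because the label depends on only k of the 40 bits, the target function is sparse and has low sensitivity, which is what makes it a useful probe of the simplicity bias studied in the paper.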
To compute the sensitivity of randomly initialized Transformers:
At Transformer-Simplicity/RandFLTAtt:
$ python rand_sensi.py -gpu 0 -sample_size 1000 -len 20 -trials 100
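The underlying estimator can be sketched as follows: sample random inputs, flip each bit in turn, and count how often the model's prediction changes. Here `predict` is a hypothetical stand-in for a randomly initialized model mapping a binary vector to a {0, 1} label; rand_sensi.py may implement this differently.

```python
# Sketch of estimating average sensitivity of a Boolean predictor via
# single-bit flips, averaged over a sample of random inputs.
import numpy as np

def avg_sensitivity(predict, n_bits=20, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(sample_size, n_bits))
    base = np.array([predict(x) for x in X])
    flips = 0
    for i in range(n_bits):
        X_flip = X.copy()
        X_flip[:, i] ^= 1  # flip the i-th bit of every sampled input
        flips += np.sum(np.array([predict(x) for x in X_flip]) != base)
    # Average number of bit positions whose flip changes the prediction.
    return flips / sample_size

# Sanity checks: parity attains the maximum (n_bits); a constant function gives 0.
print(avg_sensitivity(lambda x: int(x.sum() % 2)))  # -> 20.0
print(avg_sensitivity(lambda x: 0))                 # -> 0.0
```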
If you use our data or code, please cite our work:
@inproceedings{bhattamishra-etal-2023-simplicity,
    title = "Simplicity Bias in Transformers and their Ability to Learn Sparse {B}oolean Functions",
    author = "Bhattamishra, Satwik and
      Patel, Arkil and
      Kanade, Varun and
      Blunsom, Phil",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.317",
    pages = "5767--5791",
abstract = "Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence that suggests differences in the inductive biases of Transformers and recurrent models which may help explain Transformer{'}s effective generalization performance despite relatively limited expressiveness.",
}
For any clarifications, comments, or suggestions, please contact Satwik or Arkil.