SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.
(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings
(10/04) SentEval example scripts for three sentence encoders: SkipThought-LN/GenSen/Google-USE
This code is written in Python. The dependencies are:
- Python 2/3 with NumPy/SciPy
- PyTorch>=0.4
- scikit-learn>=0.18.0
SentEval allows you to evaluate your sentence embeddings as features for the following downstream tasks:
Task | Type | #train | #test | needs_train | set_classifier |
---|---|---|---|---|---|
MR | movie review | 11k | 11k | 1 | 1 |
CR | product review | 4k | 4k | 1 | 1 |
SUBJ | subjectivity status | 10k | 10k | 1 | 1 |
MPQA | opinion-polarity | 11k | 11k | 1 | 1 |
SST | binary sentiment analysis | 67k | 1.8k | 1 | 1 |
SST | fine-grained sentiment analysis | 8.5k | 2.2k | 1 | 1 |
TREC | question-type classification | 6k | 0.5k | 1 | 1 |
SICK-E | natural language inference | 4.5k | 4.9k | 1 | 1 |
SNLI | natural language inference | 550k | 9.8k | 1 | 1 |
MRPC | paraphrase detection | 4.1k | 1.7k | 1 | 1 |
STS 2012 | semantic textual similarity | N/A | 3.1k | 0 | 0 |
STS 2013 | semantic textual similarity | N/A | 1.5k | 0 | 0 |
STS 2014 | semantic textual similarity | N/A | 3.7k | 0 | 0 |
STS 2015 | semantic textual similarity | N/A | 8.5k | 0 | 0 |
STS 2016 | semantic textual similarity | N/A | 9.2k | 0 | 0 |
STS B | semantic textual similarity | 5.7k | 1.4k | 1 | 0 |
SICK-R | semantic textual similarity | 4.5k | 4.9k | 1 | 0 |
COCO | image-caption retrieval | 567k | 5*1k | 1 | 0 |
where needs_train means a model with parameters is learned on top of the sentence embeddings, and set_classifier means you can define the parameters of the classifier in the case of a classification task (see below).
Note: COCO comes with ResNet-101 2048d image embeddings. More details on the tasks.
SentEval also includes a series of probing tasks to evaluate what linguistic properties are encoded in your sentence embeddings:
Task | Type | #train | #test | needs_train | set_classifier |
---|---|---|---|---|---|
SentLen | Length prediction | 100k | 10k | 1 | 1 |
WC | Word Content analysis | 100k | 10k | 1 | 1 |
TreeDepth | Tree depth prediction | 100k | 10k | 1 | 1 |
TopConst | Top Constituents prediction | 100k | 10k | 1 | 1 |
BShift | Word order analysis | 100k | 10k | 1 | 1 |
Tense | Verb tense prediction | 100k | 10k | 1 | 1 |
SubjNum | Subject number prediction | 100k | 10k | 1 | 1 |
ObjNum | Object number prediction | 100k | 10k | 1 | 1 |
SOMO | Semantic odd man out | 100k | 10k | 1 | 1 |
CoordInv | Coordination Inversion | 100k | 10k | 1 | 1 |
To get all the transfer task datasets, run (in data/downstream/):
./get_transfer_data.bash
This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.
In examples/bow.py, we evaluate the quality of the average of word embeddings.
To download state-of-the-art GloVe or fastText embeddings:
curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
curl -Lo crawl-300d-2M.vec.zip https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip
To reproduce the results for bag-of-vectors, run (in examples/):
python bow.py
As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.
To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):
curl -Lo examples/infersent1.pkl https://s3.amazonaws.com/senteval/infersent/infersent1.pkl
curl -Lo examples/infersent2.pkl https://s3.amazonaws.com/senteval/infersent/infersent2.pkl
We also provide example scripts for three other encoders:
- SkipThought with Layer-Normalization in Theano
- GenSen encoder in PyTorch
- Google encoder in TensorFlow
Note that for SkipThought and GenSen, you must first follow the setup steps in their respective GitHub repositories. The Google encoder script should work as-is.
To evaluate your sentence embeddings, SentEval requires that you implement two functions:
- prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
- batcher (transforms a batch of text sentences into sentence embeddings)
batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.
prepare(params, samples)
- params: senteval parameters.
- samples: list of all sentences from the transfer task.
- output: No output. Arguments stored in "params" can further be used by batcher.
Example: in bow.py, prepare is used to build the vocabulary of words and construct the params.word_vec dictionary of word vectors.
batcher(params, batch)
- params: senteval parameters.
- batch: numpy array of text sentences (of size params.batch_size)
- output: numpy array of sentence embeddings (of size params.batch_size)
Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.
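As a minimal sketch of the two functions described above, the following averages word vectors in the spirit of bow.py. It is not the code from bow.py: the random toy vectors, the `DotDict` stand-in for the params object, and the `wvec_dim` attribute are assumptions made for illustration, and each sentence is assumed to arrive as a list of tokens. A real run would load pretrained vectors and receive params from SentEval.

```python
import numpy as np


class DotDict(dict):
    """Hypothetical stand-in for the params object: a dict with attribute access."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__


def prepare(params, samples):
    # Build the vocabulary from every sentence of the task.
    vocab = sorted({word for sent in samples for word in sent})
    # Assumption: random toy vectors replace real pretrained embeddings here.
    rng = np.random.RandomState(0)
    params.wvec_dim = 8
    params.word_vec = {w: rng.randn(params.wvec_dim) for w in vocab}


def batcher(params, batch):
    embeddings = []
    for sent in batch:
        vectors = [params.word_vec[w] for w in sent if w in params.word_vec]
        if not vectors:
            # Sentences with no known word fall back to a zero vector.
            vectors = [np.zeros(params.wvec_dim)]
        embeddings.append(np.mean(vectors, axis=0))
    # One row per sentence: shape (len(batch), wvec_dim).
    return np.vstack(embeddings)
```

To plug in your own encoder, keep the two signatures and replace the body of batcher with a call to your model.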
After having implemented the batcher and prepare functions for your own sentence encoder,
- to perform the actual evaluation, first import senteval and set its parameters:
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
- (optional) set the parameters of the classifier (when applicable):
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
You can choose nhid=0 (Logistic Regression) or nhid>0 (MLP) and define the parameters for training.
- create an instance of the class SE:
se = senteval.engine.SE(params, batcher, prepare)
- define the set of transfer tasks and run the evaluation:
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)
The current list of available tasks is:
['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']
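The task identifiers passed to se.eval differ from the names used in the tables above (for instance, the SICK-E row corresponds to 'SICKEntailment'), so it can help to check a task list before starting a long run. The sketch below copies the list above; `check_tasks` is a hypothetical helper written for this example, not part of SentEval.

```python
# Task identifiers copied from the list of available tasks.
AVAILABLE_TASKS = [
    'CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
    'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
    'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
    'Length', 'WordContent', 'Depth', 'TopConstituents', 'BigramShift', 'Tense',
    'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion',
]


def check_tasks(requested):
    """Return the requested tasks as a list, raising ValueError on unknown names."""
    unknown = [task for task in requested if task not in AVAILABLE_TASKS]
    if unknown:
        raise ValueError("Unknown SentEval task(s): " + ", ".join(unknown))
    return list(requested)


transfer_tasks = check_tasks(['MR', 'SICKEntailment', 'STS14', 'STSBenchmark'])
```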
Global parameters of SentEval:
# senteval parameters
task_path # path to SentEval datasets (required)
seed # seed
usepytorch # use cuda-pytorch (else scikit-learn) where possible
kfold # k-fold validation for MR/CR/SUBJ/MPQA.
Parameters of the classifier:
nhid: # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity: # how many times dev acc does not increase before training stops
epoch_size: # each epoch corresponds to epoch_size passes over the train set
max_epoch: # max number of epochs
dropout: # dropout for MLP
Note that to get a proxy of the results while dramatically reducing computation time, we suggest the prototyping config:
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
'tenacity': 3, 'epoch_size': 2}
which will result in a 5 times speedup for classification tasks.
To produce results that are comparable to the literature, use the default config:
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
which takes longer but will produce better and comparable results.
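One way to switch between the two configurations above without duplicating them in every script is a small helper. This is only a sketch: `make_params` and its `prototyping` flag are hypothetical names introduced here, and the values are copied from the two configs above.

```python
def make_params(task_path, prototyping=False):
    """Build SentEval params for the prototyping config or the default config."""
    if prototyping:
        # Faster proxy configuration: fewer folds, larger batches, earlier stopping.
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 5}
        params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                'tenacity': 3, 'epoch_size': 2}
    else:
        # Default configuration, for results comparable to the literature.
        params = {'task_path': task_path, 'usepytorch': True, 'kfold': 10}
        params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                'tenacity': 5, 'epoch_size': 4}
    return params
```

For example, `make_params(PATH_TO_DATA, prototyping=True)` could be used while developing an encoder, and `make_params(PATH_TO_DATA)` for the final reported numbers.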
For probing tasks, we used an MLP with a Sigmoid nonlinearity and tuned the nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set.
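The search space described above has nine combinations, which can be enumerated as classifier dictionaries. This is a sketch rather than the authors' tuning script: `probing_classifier_grid` is a hypothetical helper, the base values are borrowed from the default config as an assumption, and the Sigmoid nonlinearity is not set here because this section does not name a parameter for it. Running each config and picking the best dev accuracy is left to the caller.

```python
from itertools import product

NHID_GRID = [50, 100, 200]
DROPOUT_GRID = [0.0, 0.1, 0.2]


def probing_classifier_grid(base):
    """Yield one classifier config per (nhid, dropout) pair, leaving base untouched."""
    for nhid, dropout in product(NHID_GRID, DROPOUT_GRID):
        config = dict(base)  # shallow copy so each config is independent
        config.update(nhid=nhid, dropout=dropout)
        yield config


# Assumption: remaining settings reuse the default classifier config.
base = {'optim': 'adam', 'batch_size': 64, 'tenacity': 5, 'epoch_size': 4}
configs = list(probing_classifier_grid(base))
```

Each element of `configs` can then be assigned to params['classifier'] before creating a new SE instance.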
Please consider citing [1] if using this code for evaluating sentence embedding methods.
[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations
@article{conneau2018senteval,
title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
author={Conneau, Alexis and Kiela, Douwe},
journal={arXiv preprint arXiv:1803.05449},
year={2018}
}
Contact: [email protected], [email protected]
- J. R Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler - SkipThought Vectors, NIPS 2015
- S. Arora, Y. Liang, T. Ma - A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017
- Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, Y. Goldberg - Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, ICLR 2017
- A. Conneau, D. Kiela, L. Barrault, H. Schwenk, A. Bordes - Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, EMNLP 2017
- S. Subramanian, A. Trischler, Y. Bengio, C. J Pal - Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning, ICLR 2018
- A. Nie, E. D. Bennett, N. D. Goodman - DisSent: Sentence Representation Learning from Explicit Discourse Relations, 2018
- D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil - Universal Sentence Encoder, 2018
- A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni - What you can cram into a single vector: Probing sentence embeddings for linguistic properties, ACL 2018