BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [paper link]
Welcome to this repository, where you'll find everything you need to evaluate your language model at:
- the lexical level using a spot-the-word task (available in audio or phonetic form; see Table 1)
- the syntactic level using a grammatical acceptability judgment task (available in audio, phonetic or orthographic form; see Table 2)
You'll probably want to start here:
- How to download the evaluation data? How to evaluate my own model?
- How to download the training sets?
Examples of the stimuli can be listened to on this web page.
| Word | Pseudo-words |
|---|---|
| hello | lello pello sero dello sello |
| cookie | kootie koonie roodie rootie boonie |
Table 1: Minimal pairs of real and pseudo-words used in the spot-the-word lexical task.
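For intuition, here is a minimal sketch of how a spot-the-word task of this kind is typically scored: the model should assign a higher (pseudo) log-probability to the real word than to each matched pseudo-word, and accuracy is the fraction of pairs where it does. The `score` callable below is a hypothetical stand-in for your model's scoring function, not part of this repository's API; see the evaluation instructions linked above for the actual interface.

```python
def spot_the_word_accuracy(pairs, score):
    """Fraction of (word, pseudo-word) pairs in which the real word
    receives a higher score than its matched pseudo-word.

    pairs: iterable of (word_stimulus, pseudo_word_stimulus) tuples
    score: callable mapping a stimulus to a pseudo log-probability
           (hypothetical stand-in for your model's scoring function)
    """
    results = [score(word) > score(pseudo) for word, pseudo in pairs]
    return sum(results) / len(results)

# Toy usage with hard-coded scores standing in for model outputs:
pairs = [("hello", "lello"), ("cookie", "kootie")]
toy_scores = {"hello": -3.2, "lello": -5.1, "cookie": -2.8, "kootie": -4.4}
print(spot_the_word_accuracy(pairs, score=toy_scores.get))  # -> 1.0
```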
| Phenomenon | Sentence example |
|---|---|
| Adjective-noun order | ✓ The good mom. ✗ The mom good. |
| Noun-verb order | ✓ The dragon says. ✗ The says dragon. |
| Anaphor-gender agreement | ✓ The dad cuts himself. ✗ The dad cuts herself. |
| Anaphor-number agreement | ✓ The boys told themselves. ✗ The boys told himself. |
| Determiner-noun agreement | ✓ Each good sister. ✗ Many good sister. |
| Noun-verb agreement | ✓ The prince needs the princess. ✗ The prince need the princess. |
Table 2: Minimal pairs of grammatical (✓) and ungrammatical (✗) sentences used in the syntactic task.
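The syntactic task is scored the same way at the sentence level: the model should score the grammatical member of each minimal pair higher than the ungrammatical one. The sketch below assumes you already have per-token log-probabilities from your model, and normalizes by sentence length before comparing, a common (but not mandatory) choice when paired sentences differ in length.

```python
def sentence_score(token_logprobs):
    """Length-normalized pseudo log-probability of a sentence,
    given per-token log-probabilities from your model."""
    return sum(token_logprobs) / len(token_logprobs)

def acceptability_accuracy(pairs):
    """pairs: iterable of (good_logprobs, bad_logprobs) tuples,
    each a list of per-token log-probabilities."""
    hits = [sentence_score(good) > sentence_score(bad) for good, bad in pairs]
    return sum(hits) / len(hits)

# Toy per-token log-probabilities for one grammatical/ungrammatical pair:
good = [-1.2, -0.8, -1.5]   # e.g. "The good mom."
bad = [-1.2, -2.9, -3.1]    # e.g. "The mom good."
print(acceptability_accuracy([(good, bad)]))  # -> 1.0
```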
If you want to go further:
- How to download the pre-trained models used in the paper and evaluate them?
- How to retrain the models used in the paper?
- How to prepare the training sets used in the paper?
- How to recreate the lexical evaluation?
- How to recreate the syntactic evaluation?
If you use BabySLM, please cite:
@inproceedings{lavechin2023baby,
title={BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models},
author={Lavechin, Marvin and Sy, Yaya and Titeux, Hadrien and Bland{\'o}n, Mar{\'\i}a Andrea Cruz and R{\"a}s{\"a}nen, Okko and Bredin, Herv{\'e} and Dupoux, Emmanuel and Cristia, Alejandrina},
year={2023},
booktitle={Interspeech}
}
Additionally, if you use BabyBERTa, please cite:
@inproceedings{huebner2021babyberta,
title={BabyBERTa: Learning more grammar with small-scale child-directed language},
author={Huebner, Philip A and Sulem, Elior and Fisher, Cynthia and Roth, Dan},
booktitle={Proceedings of the 25th Conference on Computational Natural Language Learning},
pages={624--646},
year={2021}
}
If you use the Providence corpus, please cite:
@inproceedings{borschinger2013joint,
title={A joint model of word segmentation and phonological variation for English word-final /t/-deletion},
author={B{\"o}rschinger, Benjamin and Johnson, Mark and Demuth, Katherine},
booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={1508--1516},
year={2013}
}
If you use the LibriVox corpus, please cite:
@article{kearns2014librivox,
title={Librivox: Free public domain audiobooks},
author={Kearns, Jodi},
journal={Reference Reviews},
volume={28},
number={1},
pages={7--8},
year={2014},
publisher={Emerald Group Publishing Limited}
}