[Tracking] Speech Translation Evaluation: CoVoST 2 #50

Closed
farzadab opened this issue Jul 17, 2024 · 1 comment · Fixed by #54
Idea: use ST as a zero-shot task to evaluate model understanding.

farzadab self-assigned this Jul 17, 2024

farzadab commented Jul 17, 2024

Issue: The current CoVoST 2 dataset on Hugging Face (facebook/covost2) is not easy to use: the underlying Common Voice data has to be downloaded separately and in bulk. I'm creating a native version.

For now, I'm excluding the train set because of its size.

Code:

# Download the Common Voice 4 data from https://commonvoice.mozilla.org/en/datasets and untar it:
#   wget .... for en zh-CN fr es
# Extract each archive into common_voice_4/<language> (run from the directory that holds the tarballs):
#   for x in *.tar.gz; do y="${x%.tar.gz}"; mkdir -p "common_voice_4/$y"; pushd "common_voice_4/$y"; tar -xf "../../$x"; popd; done

import datasets

# EN_X subsets
subsets = ['en_de', 'en_tr', 'en_fa', 'en_sv-SE', 'en_mn', 'en_zh-CN', 'en_cy', 'en_ca', 'en_sl', 'en_et', 'en_id', 'en_ar', 'en_ta', 'en_lv', 'en_ja']
# X_EN subsets
subsets += ['fr_en', 'es_en', 'zh-CN_en', 'ar_en', 'de_en', 'it_en', 'ru_en', 'pt_en', 'fa_en', 'ca_en', 'et_en', 'mn_en', 'nl_en', 'tr_en', 'sv-SE_en', 'lv_en', 'sl_en', 'ta_en', 'ja_en', 'id_en', 'cy_en']

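for subset in subsets:
    # The source language (e.g. 'zh-CN' in 'zh-CN_en') selects which Common Voice directory to read from.
    source = subset.split('_')[0]
    ds = datasets.load_dataset('facebook/covost2', subset, data_dir=f"/home/farzad/common_voice_4/{source}")
    # Upload each subset as its own config, writing every split as 8 shards.
    ds.push_to_hub('fixie-ai/covost2', subset, token='...', num_shards={k: 8 for k in ds.keys()})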
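The loop above relies on two conventions: the source language is whatever precedes the underscore in the subset name, and every split gets the same shard count. A minimal sketch of those pieces as small helpers follows, along with one possible way to leave out the train split as mentioned earlier. The helper names (`split_subset`, `common_voice_dir`, `drop_splits`, `shard_plan`) are hypothetical and not part of the script in this issue, and the split name `train` is an assumption about how the loader labels its splits.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Mapping, Tuple, TypeVar

T = TypeVar("T")


def split_subset(subset: str) -> Tuple[str, str]:
    """Split a config name such as 'zh-CN_en' into (source, target).

    Language codes here use hyphens ('zh-CN', 'sv-SE'), so the first
    underscore is taken as the separator.
    """
    source, sep, target = subset.partition("_")
    if not sep or not source or not target:
        raise ValueError(f"expected '<source>_<target>', got {subset!r}")
    return source, target


def common_voice_dir(root: str, subset: str) -> str:
    """Directory with the untarred Common Voice data for the subset's source language."""
    source, _ = split_subset(subset)
    return str(PurePosixPath(root) / source)


def drop_splits(splits: Mapping[str, T], exclude: Iterable[str] = ("train",)) -> Dict[str, T]:
    """Return a copy of a split-name -> data mapping without the excluded splits.

    Assumption: the split to skip is literally named 'train'.
    """
    excluded = set(exclude)
    return {name: data for name, data in splits.items() if name not in excluded}


def shard_plan(splits: Mapping[str, T], shards_per_split: int = 8) -> Dict[str, int]:
    """Build a per-split shard-count mapping, mirroring {k: 8 for k in ds.keys()}."""
    return {name: shards_per_split for name in splits}
```

Since `datasets.DatasetDict` is a `dict` subclass, the result of `drop_splits(ds)` could be wrapped back with `datasets.DatasetDict(...)` before calling `push_to_hub`, and `shard_plan(...)` passed as `num_shards`. Whether the published dataset was filtered this way is not stated in the issue.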

Update: all subsets have now been added.

Final dataset on Hugging Face: fixie-ai/covost2

farzadab changed the title from "Speech Translation Evaluation: CoVoST 2" to "[Tracking] Speech Translation Evaluation: CoVoST 2" on Jul 18, 2024
This was referenced Jul 23, 2024
zqhuang211 pushed a commit that referenced this issue Feb 12, 2025
* old references

* fixing filenames

* Update README.md