Issue: The current CoVoST 2 dataset on Hugging Face is not easy to use (the data needs to be downloaded separately and in batch). I'm creating a native version.
For now, I'm excluding the train set because of its size.
Code:
# Download the Common Voice 4 data from https://commonvoice.mozilla.org/en/datasets and untar it:
# wget .... for en zh-CN fr es
# for x in *.tar.gz; do y=$(echo $x | sed 's/\(.*\)\.tar\.gz/\1/'); mkdir -p common_voice_4/$y; pushd common_voice_4/$y; tar -xf ../../$x; popd; done

import datasets

# EN_X subsets
subsets = ['en_de', 'en_tr', 'en_fa', 'en_sv-SE', 'en_mn', 'en_zh-CN', 'en_cy', 'en_ca', 'en_sl', 'en_et', 'en_id', 'en_ar', 'en_ta', 'en_lv', 'en_ja']
# X_EN subsets
subsets += ['fr_en', 'es_en', 'zh-CN_en', 'ar_en', 'de_en', 'it_en', 'ru_en', 'pt_en', 'fa_en', 'ca_en', 'et_en', 'mn_en', 'nl_en', 'tr_en', 'sv-SE_en', 'lv_en', 'sl_en', 'ta_en', 'ja_en', 'id_en', 'cy_en']

for subset in subsets:
    # The source language is the part before the underscore, e.g. 'zh-CN' for 'zh-CN_en'.
    source = subset.split('_')[0]
    ds = datasets.load_dataset('facebook/covost2', subset, data_dir=f"/home/farzad/common_voice_4/{source}")
    ds.push_to_hub('fixie-ai/covost2', subset, token='...', num_shards={k: 8 for k in ds.keys()})
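The script above pushes every split that `load_dataset` returns. One possible way to leave out the train set, as the description says, is to filter the split mapping before pushing. This is only a sketch: `drop_splits` and `shard_plan` are hypothetical helper names, not part of the `datasets` library, and the split names in the demo are assumed.

```python
def drop_splits(splits, exclude=("train",)):
    """Return a plain dict copy of a split mapping without the excluded split names."""
    return {name: data for name, data in splits.items() if name not in exclude}


def shard_plan(splits, shards_per_split=8):
    """Build a num_shards-style mapping: the same shard count for every remaining split."""
    return {name: shards_per_split for name in splits}


if __name__ == "__main__":
    # Stand-in for the DatasetDict returned by load_dataset (split names assumed).
    fake_splits = {"train": "train-data", "validation": "val-data", "test": "test-data"}
    kept = drop_splits(fake_splits)
    print(sorted(kept))       # split names that would be pushed
    print(shard_plan(kept))   # value that could be passed as num_shards
```

In the loop, this could be used as `datasets.DatasetDict(drop_splits(ds)).push_to_hub(..., num_shards=shard_plan(drop_splits(ds)))`, since a `DatasetDict` behaves like a dict of splits.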
Idea: use speech translation (ST) as a zero-shot task to evaluate model understanding.