Describe the bug
When instantiating an OpusParallelCorpus with language pairs such as ("de", "en") or ("en", "de"), the resulting corpus consistently has German as the first language and English as the second language, regardless of the input order.
Upon reviewing the source code of OpusParallelCorpus, it appears that the following code:
if l1 > l2:
    l1, l2 = l2, l1
forces the languages to be ordered lexicographically based on their language codes. This results in German being treated as the first language in both ("de", "en") and ("en", "de").
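To make the effect of that comparison concrete, here is a minimal standalone sketch. The function name `order_codes` is hypothetical and not part of Flair; it only reproduces the two quoted lines so that the collapse of both input orders into one can be seen without downloading any data.

```python
def order_codes(l1: str, l2: str) -> tuple[str, str]:
    """Hypothetical helper that mirrors the comparison quoted above."""
    # Plain string comparison: "en" > "de" is True, "de" > "en" is False.
    if l1 > l2:
        l1, l2 = l2, l1
    return l1, l2


# Both call orders end up with the same (first, second) language codes.
print(order_codes("de", "en"))  # ('de', 'en')
print(order_codes("en", "de"))  # ('de', 'en')
```

Because the swap happens before anything else uses `l1` and `l2`, the information about which language the caller asked for first is no longer available afterwards.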
To Reproduce
from flair.datasets import OpusParallelCorpus

corpus_de_en = OpusParallelCorpus(
    dataset="tatoeba",
    l1="de",
    l2="en",
    max_tokens_per_doc=512,
)
corpus_en_de = OpusParallelCorpus(
    dataset="tatoeba",
    l1="en",
    l2="de",
    max_tokens_per_doc=512,
)

# Both corpora consist of (German, English) pairs
print(corpus_de_en.train[0])
# >> DataPair: 'Sentence[5]: "Ich muss schlafen gehen."' + 'Sentence[7]: "I have to go to sleep."'
print(corpus_en_de.train[0])
# >> DataPair: 'Sentence[5]: "Ich muss schlafen gehen."' + 'Sentence[7]: "I have to go to sleep."'
Expected behavior
Sentence pairs in corpus_de_en should be (German, English), while sentence pairs in corpus_en_de should be (English, German).
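Until the ordering is handled inside the library, one possible user-side workaround is to remember whether the requested codes would be swapped and flip each pair back when reading it. The sketch below is an illustration only: `Pair` is a hypothetical stand-in for the corpus's pair objects, and it assumes those objects expose `first` and `second` attributes, which should be checked against the installed Flair version.

```python
from typing import Iterable, Iterator, NamedTuple


class Pair(NamedTuple):
    """Hypothetical stand-in for a sentence pair with `first` and `second`."""

    first: str
    second: str


def in_requested_order(pairs: Iterable[Pair], l1: str, l2: str) -> Iterator[Pair]:
    """Yield pairs so that the `l1` side comes first, as the caller requested."""
    # Same comparison as in the quoted source: True means the corpus
    # stores the pair in the opposite order to the one requested.
    swapped = l1 > l2
    for pair in pairs:
        if swapped:
            yield Pair(pair.second, pair.first)
        else:
            yield Pair(pair.first, pair.second)


stored = [Pair("Ich muss schlafen gehen.", "I have to go to sleep.")]
print(next(in_requested_order(stored, "en", "de")))
```

With `l1="en"` and `l2="de"` the English sentence is yielded first; with `l1="de"` and `l2="en"` the stored order is left untouched.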
Logs and Stack traces
Screenshots
No response
Additional Context
No response
Environment
Versions:
Flair: 0.15.1
Pytorch: 2.6.0+cu124
Transformers: 4.49.0
GPU: False