Identical Sentence Pairs in German Train Set #10

Open
JonathanSchaber opened this issue Oct 1, 2020 · 1 comment
JonathanSchaber commented Oct 1, 2020

Hello,

I am using the PAWS-X train dataset for German. While analysing the German translated_train.tsv, I found 3,209 sentence pairs in which both sentences are identical. Of these, 84 are labeled as non-paraphrases; the rest are, as expected, labeled as paraphrases (sentence pair indices are attached in the text file GER_duplicates.txt).
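For anyone who wants to reproduce a check like this, here is a minimal sketch, not the exact script used for the numbers above. It assumes the TSV has a header row with the columns `id`, `sentence1`, `sentence2` and `label`, and that labels are stored as the strings `0` and `1`. The function name `find_identical_pairs` and the sample rows are made up for illustration.

```python
import csv
import io


def find_identical_pairs(tsv_file):
    """Return (pair_id, label) for every row whose two sentences are identical.

    Assumes a header row with the columns id, sentence1, sentence2, label.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    identical = []
    for row in reader:
        s1, s2 = row.get("sentence1"), row.get("sentence2")
        if s1 is None or s2 is None:
            continue  # skip malformed rows with missing columns
        if s1.strip() == s2.strip():
            identical.append((row["id"], row["label"]))
    return identical


# Tiny in-memory sample standing in for translated_train.tsv (made-up rows).
sample = io.StringIO(
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tDas ist ein Test.\tDas ist ein Test.\t1\n"
    "2\tEr kam aus Bern.\tSie kam aus Bern.\t0\n"
    "3\tEin Satz.\tEin Satz.\t0\n"
)
pairs = find_identical_pairs(sample)
# Identical pairs labeled as non-paraphrases are the suspicious ones.
suspicious = [pair_id for pair_id, label in pairs if label == "0"]
print(len(pairs), suspicious)
```

To run it on the real file, pass a file object instead of the sample, for example one opened with `open("translated_train.tsv", encoding="utf-8", newline="")`. The comparison only strips surrounding whitespace, so pairs that differ in casing or punctuation are not counted as identical.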

I had assumed these were introduced accidentally by translation errors, so I was surprised to find at least one identical sentence pair in the English train set as well (sentence pair ID 1288). There may be more, since I have not checked all of them.

Could this be caused by a bug?

@PhilipMay

Nice one. Thanks for reporting @JonathanSchaber
