
Use the display title - critical for low-resource languages #4

Open
bittlingmayer opened this issue Jan 17, 2020 · 0 comments
bittlingmayer commented Jan 17, 2020

Some Wikipedias use DISPLAYTITLE to override the titles of almost all articles. This typically happens for the low variety in a high-low diglossia; a good example is Alemannic (~"Swiss German").

For example, see https://als.wikipedia.org/wiki/Zürich:

[Screenshot, 2020-01-17 15:33:24]

(The URL using the Standard German (de) Zürich instead of the actual Alemannic (als) Züri is a workaround for the fact that Alemannic has no single standardised orthography, so it's more practical to allow searches and lookups in the standard language.)

Currently, the actual output extracted is Zürich, but the expected output is Züri.

So in order to build a viable parallel titles corpus for such a language, we need to prefer DISPLAYTITLE and only take the underlying title if DISPLAYTITLE is unset.

(Not sure what the default should be, but it's probably good to make it an option rather than a hard rule. For example, when building a corpus for translation from als to en, it's often useful to additionally include the de to en data, because de segments occur so often in real als data.)
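A rough sketch of the proposed behaviour, in Python for illustration only. It assumes the display title is set directly in the page's wikitext via `{{DISPLAYTITLE:...}}` (it may instead come from a template, which this would miss). The names `extract_display_title`, `titles_for_page`, `prefer_display_title` and `include_underlying` are hypothetical and not part of this project:

```python
import re
from typing import List, Optional

# Assumption: DISPLAYTITLE appears literally in the page wikitext,
# e.g. "{{DISPLAYTITLE:Züri}}". Titles set through templates are not handled.
_DISPLAYTITLE_RE = re.compile(
    r"\{\{\s*DISPLAYTITLE\s*:\s*([^{}|]+?)\s*\}\}", re.IGNORECASE
)


def extract_display_title(wikitext: str) -> Optional[str]:
    """Return the DISPLAYTITLE value found in the wikitext, or None if unset."""
    match = _DISPLAYTITLE_RE.search(wikitext)
    if match is None:
        return None
    # Drop simple bold/italic quote markup that sometimes wraps display titles.
    title = match.group(1).replace("'''", "").replace("''", "").strip()
    return title or None


def titles_for_page(
    page_title: str,
    wikitext: str,
    prefer_display_title: bool = True,   # hypothetical option name
    include_underlying: bool = False,    # hypothetical option name
) -> List[str]:
    """Pick the title(s) to emit for one page.

    Prefers DISPLAYTITLE when enabled and set, and falls back to the
    underlying page title otherwise. Optionally also emits the underlying
    title as an extra entry.
    """
    display = extract_display_title(wikitext) if prefer_display_title else None
    titles = [display or page_title]
    if include_underlying and page_title not in titles:
        titles.append(page_title)
    return titles


if __name__ == "__main__":
    print(titles_for_page("Zürich", "{{DISPLAYTITLE:Züri}}\nZüri isch e Stadt."))
    print(titles_for_page("Zürich", "{{DISPLAYTITLE:Züri}}", include_underlying=True))
```

Keeping the two switches separate means the default can stay undecided: one controls which title is preferred, the other whether the standard-language title is kept alongside it for the als to en use case.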
