Some Wikipedias use DISPLAYTITLE to override the titles of almost all articles. This typically happens for the low language in a high-low diglossia. A good example is Alemannic (~"Swiss German"): see https://als.wikipedia.org/wiki/Zürich, where the displayed title is Züri.
(The URL uses the Standard German (de) Zürich instead of the actual Alemannic (als) Züri as a workaround: Alemannic has no single standardised orthography, so it's more practical to allow searches and lookups in the standard language.)
Currently, the extracted title is Zürich, but the expected title is Züri.
So in order to build a viable parallel titles corpus for such a language, we need to prefer DISPLAYTITLE and only take the underlying title if DISPLAYTITLE is unset.
(I'm not sure what the default should be, but it's probably good to make this an option rather than a hard rule. For example, when building a corpus for translation from als to en, it's often useful to additionally include the de to en data, because de segments occur so often in real als data.)
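
To make the proposal concrete, here is a minimal sketch of the fallback logic, not a patch against the actual extractor. It assumes the override appears literally as `{{DISPLAYTITLE:...}}` in the page wikitext, which may not hold if a wiki sets it through a template or uses a localized alias of the magic word. The names `choose_title` and `prefer_display_title` are hypothetical.

```python
import re

# Assumption: the override is written literally in the wikitext as
# {{DISPLAYTITLE:...}}. Nested templates and localized aliases of the
# magic word are not handled in this sketch.
_DISPLAYTITLE_RE = re.compile(r"\{\{\s*DISPLAYTITLE\s*:\s*(.*?)\s*\}\}", re.DOTALL)


def choose_title(page_title, wikitext, prefer_display_title=True):
    """Return the DISPLAYTITLE value if present and preferred, else the page title.

    `prefer_display_title` is a hypothetical option name for the switch
    discussed above.
    """
    if prefer_display_title:
        match = _DISPLAYTITLE_RE.search(wikitext)
        if match:
            display = match.group(1)
            # Drop simple HTML tags and wiki bold/italic quote markup.
            display = re.sub(r"<[^>]+>", "", display)
            display = re.sub(r"'{2,}", "", display).strip()
            if display:
                return display
    # DISPLAYTITLE is unset, empty, or not preferred: use the underlying title.
    return page_title
```

With the option switched off, the same call returns the underlying title, so a combined als and de corpus could be built by running the extraction both ways.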