Skip to content

Language dependent configuration

dnmilne edited this page Aug 22, 2013 · 5 revisions

The languages.xml file (located in the config directory of the toolkit) describes language-dependent variables that are required for processing Wikipedia dumps. It is often necessary to modify and update this file to process new dumps of Wikipedia.

##Format

The language configuration file contains one Language element per language version of Wikipedia. Each of these contains RootCategory, DisambiguationCategory, and DisambiguationTemplate elements as children.

###Language

The Language element must provide the following attributes:

attribute description
code A short (typically two letter) code for the language (e.g. en, de, fr)
name A name (in English) for the language version
localName A name (in the given language) for the language version

###RootCategory

Describes the root category, from which all other content-based categories descend (e.g. this one for the en version)

There should be only one root category per language.

###DisambiguationCategory

Describes a category to which disambiguation pages often belong to (e.g. this one for the en version)

There may be multiple disambiguation categories per language.

###DisambiguationTemplate

Describes a template that is often invoked on disambiguation pages (e.g. this one for the en version)

There may be multiple disambiguation templates per language.

###RedirectIdentifier

Describes the syntax used to create redirects. For example, if redirects are identified like this:

#REDIRECT [[target]]

then you should use the following redirect identifier

<RedirectIdentifier>REDIRECT</RedirectIdentifier>

There may be multiple redirect identifiers per language.

##Current languages

Below are examples of Language elements for processing different versions of Wikipedia. Please feel free to add and modify these.

###Full English Wikipedia

<Language code="en" name="English" localName="English">
        
    <RootCategory>Fundamental categories</RootCategory>
               
    <DisambiguationCategory>disambiguation</DisambiguationCategory>
               
    <DisambiguationTemplate>disambiguation</DisambiguationTemplate>
    <DisambiguationTemplate>disambig</DisambiguationTemplate>
    <DisambiguationTemplate>geodis</DisambiguationTemplate>
    <DisambiguationTemplate>hndis</DisambiguationTemplate>
    <DisambiguationTemplate>hospitaldis</DisambiguationTemplate>
    <DisambiguationTemplate>mathdab</DisambiguationTemplate>
    <DisambiguationTemplate>mountianindex</DisambiguationTemplate>
    <DisambiguationTemplate>numberdis</DisambiguationTemplate>
    <DisambiguationTemplate>roaddis</DisambiguationTemplate>
    <DisambiguationTemplate>schooldis</DisambiguationTemplate>
    <DisambiguationTemplate>shipindex</DisambiguationTemplate>
    <DisambiguationTemplate>SIA</DisambiguationTemplate>
               
    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
               
</Language>

###Simple English Wikipedia

<Language code="simple" name="Simple English" localName="Simple English">
        
    <RootCategory>articles</RootCategory>
               
    <DisambiguationCategory>disambiguation</DisambiguationCategory>
               
    <DisambiguationTemplate>disambiguation</DisambiguationTemplate>
    <DisambiguationTemplate>disambig</DisambiguationTemplate>
    <DisambiguationTemplate>2CC</DisambiguationTemplate>
    <DisambiguationTemplate>3CC</DisambiguationTemplate>
    <DisambiguationTemplate>hndis</DisambiguationTemplate>
               
    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
               
</Language>

###German Wikipedia

<Language code="de" name="German" localName="Deutsch">

    <RootCategory>!Hauptkategorie</RootCategory>

    <DisambiguationCategory>Begriffsklärung</DisambiguationCategory>

    <DisambiguationTemplate>Begriffsklärung</DisambiguationTemplate>
		
    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
    <RedirectIdentifier>WEITERLEITUNG</RedirectIdentifier>
		
</Language>