Added Translation normalisation, updated readme
Translations are now handled as follows:

- The db model has been updated with `Translation` and `TranslationSet` classes.
- `TranslationSet` accounts for groups of translations (sets are separated by
  semicolons, with the translations within a set separated by commas), and
  records particular word areas where found (matching a `\s[A-Z]:` pattern).
- The `parse_dictionary.py` module has been updated with a `parse_translation`
  method, which creates the `Translation` and `TranslationSet` objects from the
  translation elements of the `DICTLINE.GEN` file in *Words*.

README.md has been updated with a bit more explanation of the three parts of
the project (db, input parser, word parser)
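The normalisation described in the commit message can be sketched as follows (a standalone illustration of the splitting rules, not the committed code; the sample translation string is invented):

```python
import re


def split_translation(translation):
    """Split a Words-style translation string into sets and translations.

    Sets are separated by ';'; translations within a set by ','.
    Returns a list of (area_code, [translations]) tuples, where area_code
    is None when the set has no leading 'X:' word-area marker.
    """
    sets = []
    for ts in filter(None, map(str.strip, translation.split(';'))):
        match = re.match(r'([A-Z]):', ts)
        area = match.group(1) if match else None
        if match:
            # Drop the area marker so it is not stored as translation text
            ts = ts[match.end():]
        translations = [t for t in map(str.strip, ts.split(',')) if t]
        sets.append((area, translations))
    return sets


# Invented sample in the DICTLINE.GEN translation style
print(split_translation('wage war, fight; G: quarrel'))
# → [(None, ['wage war', 'fight']), ('G', ['quarrel'])]
```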
Matthew Badger authored and Matthew Badger committed Sep 1, 2015
1 parent 0f71807 commit 0540c31
Showing 3 changed files with 81 additions and 13 deletions.
26 changes: 23 additions & 3 deletions README.md
@@ -10,7 +10,27 @@ The basic idea behind the DoLL is to separate out the lexical *meaning* of a wor

## DoLL

The DoLL is a continuation of the work of William Whitaker, who created the Latin-English-Latin dictionary [Whitaker’s Words](http://archives.nd.edu/whitaker/words.htm). It comprises three parts. The first is a python program which ingests the input files for *Words* (originally written in ADA) and creates the instances of the classes in its object model. The second is a sqlite database created from that object model using [sqlalchemy](http://www.sqlalchemy.org/); the third is a basic word parser (also written in python) for querying the database.
The DoLL is a continuation of the work of William Whitaker, who created the Latin-English-Latin dictionary [Whitaker’s Words](http://archives.nd.edu/whitaker/words.htm). It comprises three parts. The first is a sqlite database created from that object model using [sqlalchemy](http://www.sqlalchemy.org/). The second is a python program which ingests the input files for *Words* (originally written in ADA) and creates the instances of the classes in its object model. The third is a basic word parser (also written in python) for querying the database.

#### Database

The database code, in the `doll/db` directory, defines the model for the database using sqlalchemy (`model.py`) and holds basic configuration elements in `config.py`, which determine the name of the database file and whether sqlalchemy prints output to the console (echo).
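A minimal sketch of the kind of configuration this implies (the variable names and values here are assumptions for illustration, not the repository's actual `config.py`):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Hypothetical configuration values; the real ones live in doll/db/config.py
DATABASE_FILE = 'doll.db'   # name of the sqlite database file
ECHO = False                # whether sqlalchemy prints its SQL to the console

# sqlite URL built from the configured file name
engine = create_engine('sqlite:///' + DATABASE_FILE, echo=ECHO)
Session = sessionmaker(bind=engine)
```

Sessions created from `Session()` would then be passed to the parser methods that add records to the database.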

#### Input parser

The input parser, in the `doll/input_parser` directory, is itself in three parts:

* `add_database_types.py` adds the basic type elements to the database, equivalent to the codes in *Words*, though with more detail (names and descriptions) for use in user interfaces

* `parse_dictionary.py` parses the `DICTLINE.GEN` file from the *Words* source code and creates the dictionary entries themselves

* `parse_inflections.py` parses the `INFLECTS.LAT` file from the *Words* source code and creates the inflections records

In `__init__.py` the method `parse_all_inputs` takes the location of the *Words* source code as an input, and runs the methods in the other modules in the directory. It also checks that the required input files are present; currently this means just `DICTLINE.GEN` and `INFLECTS.LAT`, but in future it will need to look for the addons input file.
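The required-files check can be sketched like this (an approximation for illustration; the real signature and behaviour of `parse_all_inputs` may differ):

```python
import os

# Input files currently required from the Words source code
REQUIRED_FILES = ['DICTLINE.GEN', 'INFLECTS.LAT']


def check_words_inputs(words_path):
    """Return the names of any required Words input files missing from words_path."""
    return [f for f in REQUIRED_FILES
            if not os.path.isfile(os.path.join(words_path, f))]


missing = check_words_inputs('.')
if missing:
    print('Missing input files:', ', '.join(missing))
```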

#### Word parser

The word parser occupies the `parse_test.py` module, where the `parse_word` method rather presumptuously says *Welcome to Words!*. While the intention is that this will become a facsimile of *Words* proper, for now it is intended as an example of how to use the database model to achieve a goal.

## Current status

@@ -21,7 +41,7 @@ Firstly, two things should be noted about the software:

At this point, the DoLL has no version number, to highlight the fact that it is currently an exploration rather than being on a path to success. *Words* is hugely impressive, but its architecture is limited by the structure of its inputs, and creating a normalised database from those inputs is less than straightforward. The following things are currently identified as significant challenges:

- **parse_test.py currently handles only nouns, verbs, adjectives, and pronouns**
- **`parse_test.py` currently handles only nouns, verbs, adjectives, and pronouns**
- This is just a case of creating the queries for the other part of speech codes, but as these are created, other problems arise that need considering.
- **There is no English-Latin translation**
- In *Words*, this is much more structurally straightforward as this is a word search on the dictionary, and for English lexemes that do inflect (verbs and pronouns) there is no attempt to link the various inflections between the two languages. Given how irregular English is, that seems completely sensible, but it does mean that the amount of effort to create the English-Latin part of a parser would be very slight.
@@ -30,6 +50,6 @@ At this point, the DoLL has no version number, to highlight the fact that it is
- **Translations are not handled well**
- Translations are currently stored in a single column in the dictionary_entry table (the Entry class). This needs sorting soon, as Whitaker left clear definitions of the structure of translations (`,`, `;`, and `:` all have different meanings). It also limits the use of the DoLL to English, which, while popular, is not universal.
- **Addons are ignored**
- Prefixes, suffixes, and the like, are handled in *Words* by the addons_package and generated from the ADDONS.LAT file. These are not used at all by the DoLL, but are definitely something we want to add support for.
- Prefixes, suffixes, and the like, are handled in *Words* by the `addons_package` and generated from the `ADDONS.LAT` file. These are not used at all by the DoLL, but are definitely something we want to add support for.
- **qu/cu pronouns are a mess**
- These pronouns have multiple dictionary and inflection entries, created for computational convenience, but this leads to duplicate results when parsing words
36 changes: 26 additions & 10 deletions doll/db/model.py
@@ -510,6 +510,7 @@ class Entry(Base):
source = relationship('WordSource', backref=backref('dictionary_entry'))

stems = relationship('Stem', backref=backref('dictionary_stem'))
translation_sets = relationship('TranslationSet', backref=backref('dictionary_translation_set'))


# Stem of a dictionary entry
@@ -529,10 +530,28 @@ class Stem(Base):
entry = relationship('Entry', backref=backref('dictionary_stem'))


# Set of translations
class TranslationSet(Base):
    """A set of related translations"""
    __tablename__ = 'dictionary_translation_set'

    id = Column(Integer, primary_key=True, autoincrement=True)

    entry_id = Column(Integer, ForeignKey('dictionary_entry.id',
                                          name='FK_dictionary_translation_set_entry_id'))
    language_id = Column(Integer, ForeignKey('type_language.id',
                                             name='FK_dictionary_translation_set_language_id'))
    area_code = Column(String(10), ForeignKey('type_wordarea.code',
                                              name='FK_dictionary_translation_set_wordarea_code'))

    # Relationships
    entry = relationship('Entry', backref=backref('dictionary_translation_set'))
    language = relationship('Language', backref=backref('dictionary_translation_set'))
    area = relationship('WordArea', backref=backref('dictionary_translation_set'))

    translations = relationship('Translation', backref=backref('dictionary_translation'))

'''

Start of a translation model, but we need groups...

# Translation
class Translation(Base):
@@ -541,16 +560,13 @@ class Translation(Base):

id = Column(Integer, primary_key=True, autoincrement=True)

entry_id = Column(Integer, ForeignKey('dictionary_entry.id',
name='FK_dictionary_translation_entry_id'))
language_id = Column(Integer, ForeignKey('type_language.id',
name='FK_dictionary_translation_language_id'))
translation_set_id = Column(Integer, ForeignKey('dictionary_translation_set.id',
name='FK_dictionary_translation_translation_set_id'))
translation = Column(Unicode(4096, collation='BINARY'))

# Relationships
entry = relationship('Entry', backref=backref('dictionary_translation'))
language = relationship('Language', backref=backref('dictionary_translation'))
'''
translation_set = relationship('TranslationSet', backref=backref('dictionary_translation'))



# Noun Entry
32 changes: 32 additions & 0 deletions doll/input_parser/parse_dictionary.py
@@ -2,6 +2,31 @@

from doll.db import Connection
from doll.db.model import *
import re


def parse_translation(session, language, entry, translation):
    """Parses the translation line and creates the Translation and TranslationSet objects"""

    for ts in [ts for ts in map(str.strip, translation.split(';')) if len(ts) > 0]:

        # Check if the translation set starts with a word area code such as 'G:'
        # (ts has been stripped, so the area code is anchored at the start)
        area_regex = re.match(r'([A-Z]):', ts)
        if area_regex is not None:
            area = session.query(WordArea).filter(WordArea.code == area_regex.group(1)).first()
            translation_set = TranslationSet(entry=entry,
                                             area=area,
                                             language=language)
            # Drop the area marker so it is not stored as translation text
            ts = ts[area_regex.end():]
        else:
            translation_set = TranslationSet(entry=entry,
                                             language=language)

        session.add(translation_set)

        for t in [t for t in map(str.strip, ts.split(',')) if len(t) > 0]:
            session.add(Translation(translation_set=translation_set,
                                    translation=t))


def parse_dict_file(dict_file, commit_changes=False):

@@ -44,6 +69,8 @@ def parse_dict_file(dict_file, commit_changes=False):

translation = line[110:].strip()

language = session.query(Language).filter(Language.code == 'E').first()

# Create the list of stems, ignoring those that are empty or zzz
stems = [Stem(stem_number=i, stem_word=s) for i, s in enumerate(stem_list, 1) if len(s) > 0 and s != 'zzz']

@@ -57,6 +84,11 @@ def parse_dict_file(dict_file, commit_changes=False):
translation=translation,
stems=stems)

parse_translation(session=session,
language=language,
entry=entry,
translation=translation)

# Create the specific entry given the part of speech
if entry.part_of_speech_code == 'N':
noun_entry = NounEntry(declension_code=part_of_speech_data[0],
