Added Translation normalisation, updated readme
Translations are now handled as follows:

- The db model has been updated with `Translation` and `TranslationSet` classes.
- `TranslationSet` accounts for groups of translations (sets are separated by
  semicolons, with the translations within a set separated by commas), and
  records particular word areas where found (matching a `\s[A-Z]:` pattern).
- The `parse_dictionary.py` module has been updated with a `parse_translation`
  method, which creates the `Translation` and `TranslationSet` objects from the
  translation elements of the `DICTLINE.GEN` file in *Words*.

README.md has been updated with a bit more explanation of the three parts of
the project (db, input parser, word parser)
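The normalisation described in the commit message can be sketched as follows (a standalone illustration of the splitting rules, not the committed code; the sample translation string is invented):

```python
import re


def split_translation(translation):
    """Split a Words-style translation string into sets and translations.

    Sets are separated by ';'; translations within a set by ','.
    Returns a list of (area_code, [translations]) tuples, where area_code
    is None when the set has no leading 'X:' word-area marker.
    """
    sets = []
    for ts in filter(None, map(str.strip, translation.split(';'))):
        match = re.match(r'([A-Z]):', ts)
        area = match.group(1) if match else None
        if match:
            # Drop the area marker so it is not stored as translation text
            ts = ts[match.end():]
        translations = [t for t in map(str.strip, ts.split(',')) if t]
        sets.append((area, translations))
    return sets


# Invented sample in the DICTLINE.GEN translation style
print(split_translation('wage war, fight; G: quarrel'))
# → [(None, ['wage war', 'fight']), ('G', ['quarrel'])]
```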
Matthew Badger authored and Matthew Badger committed Sep 1, 2015
1 parent 0f71807 commit 0540c31
Showing 3 changed files with 81 additions and 13 deletions.
26 changes: 23 additions & 3 deletions README.md
@@ -10,7 +10,27 @@ The basic idea behind the DoLL is to separate out the lexical *meaning* of a wor

## DoLL

The DoLL is a continuation of the work of William Whitaker, who created the Latin-English-Latin dictionary [Whitaker’s Words](http://archives.nd.edu/whitaker/words.htm). It comprises three parts. The first is a python program which ingests the input files for *Words* (originally written in ADA) and creates the instances of the classes in its object model. The second is a sqlite database created from that object model using [sqlalchemy](http://www.sqlalchemy.org/); the third is a basic word parser (also written in python) for querying the database.
The DoLL is a continuation of the work of William Whitaker, who created the Latin-English-Latin dictionary [Whitaker’s Words](http://archives.nd.edu/whitaker/words.htm). It comprises three parts. The first is a sqlite database created from that object model using [sqlalchemy](http://www.sqlalchemy.org/). The second is a python program which ingests the input files for *Words* (originally written in ADA) and creates the instances of the classes in its object model. The third is a basic word parser (also written in python) for querying the database.

#### Database

The database code, in the `doll/db` directory, defines the model for the database using sqlalchemy (`model.py`) and holds basic configuration elements in `config.py`, which determine the name of the database file and whether sqlalchemy prints output to the console (echo).
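A minimal sketch of the kind of configuration this implies (the variable names and values here are assumptions for illustration, not the repository's actual `config.py`):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Hypothetical configuration values; the real ones live in doll/db/config.py
DATABASE_FILE = 'doll.db'   # name of the sqlite database file
ECHO = False                # whether sqlalchemy prints its SQL to the console

# sqlite URL built from the configured file name
engine = create_engine('sqlite:///' + DATABASE_FILE, echo=ECHO)
Session = sessionmaker(bind=engine)
```

Sessions created from `Session()` would then be passed to the parser methods that add records to the database.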

#### Input parser

The input parser, in the `doll/input_parser` directory, is itself in three parts:

* `add_database_types.py` adds the basic type elements to the database, equivalent to the codes in *Words*, though with more detail (names and descriptions) for use in user interfaces

* `parse_dictionary.py` parses the `DICTLINE.GEN` file from the *Words* source code and creates the dictionary entries themselves

* `parse_inflections.py` parses the `INFLECTS.LAT` file from the *Words* source code and creates the inflections records

In `__init__.py` the method `parse_all_inputs` takes the location of the *Words* source code as an input, and runs the methods in the other modules in the directory. It also checks that the required input files are present; currently this means just `DICTLINE.GEN` and `INFLECTS.LAT`, but in future it will need to look for the addons input file.
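The required-files check can be sketched like this (an approximation for illustration; the real signature and behaviour of `parse_all_inputs` may differ):

```python
import os

# Input files currently required from the Words source code
REQUIRED_FILES = ['DICTLINE.GEN', 'INFLECTS.LAT']


def check_words_inputs(words_path):
    """Return the names of any required Words input files missing from words_path."""
    return [f for f in REQUIRED_FILES
            if not os.path.isfile(os.path.join(words_path, f))]


missing = check_words_inputs('.')
if missing:
    print('Missing input files:', ', '.join(missing))
```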

#### Word parser

The word parser occupies the `parse_test.py` module, where the `parse_word` method rather presumptuously says *Welcome to Words!*. While the intention is that this will become a facsimile of *Words* proper, for now it is intended as an example of how to use the database model to achieve a goal.

## Current status

@@ -21,7 +41,7 @@ Firstly, two things should be noted about the software:

At this point, the DoLL has no version number, to highlight the fact that it is currently an exploration rather than being on a path to success. *Words* is hugely impressive, but its architecture is limited by the structure of its inputs, and creating a normalised database from those inputs is less than straightforward. The following things are currently identified as significant challenges:

- **parse_test.py currently handles only nouns, verbs, adjectives, and pronouns**
- **`parse_test.py` currently handles only nouns, verbs, adjectives, and pronouns**
- This is just a case of creating the queries for the other part of speech codes, but as these are created, other problems arise that need considering.
- **There is no English-Latin translation**
- In *Words*, this is much more structurally straightforward as this is a word search on the dictionary, and for English lexemes that do inflect (verbs and pronouns) there is no attempt to link the various inflections between the two languages. Given how irregular English is, that seems completely sensible, but it does mean that the amount of effort to create the English-Latin part of a parser would be very slight.
@@ -30,6 +50,6 @@ At this point, the DoLL has no version number, to highlight the fact that it is
- **Translations are not handled well**
- Translations are currently stored in a single column in the dictionary_entry table (the Entry class). This needs sorting soon, as Whitaker left clear definitions of the structure of translations (`,`, `;`, and `:` all have different meanings). It also limits the use of the DoLL to English, which, while popular, is not universal.
- **Addons are ignored**
- Prefixes, suffixes, and the like, are handled in *Words* by the addons_package and generated from the ADDONS.LAT file. These are not used at all by the DoLL, but are definitely something we want to add support for.
- Prefixes, suffixes, and the like, are handled in *Words* by the `addons_package` and generated from the `ADDONS.LAT` file. These are not used at all by the DoLL, but are definitely something we want to add support for.
- **qu/cu pronouns are a mess**
- These pronouns have multiple dictionary and inflection entries, created for computational convenience, but this leads to duplicate results when parsing words
36 changes: 26 additions & 10 deletions doll/db/model.py
@@ -510,6 +510,7 @@ class Entry(Base):
source = relationship('WordSource', backref=backref('dictionary_entry'))

stems = relationship('Stem', backref=backref('dictionary_stem'))
translation_sets = relationship('TranslationSet', backref=backref('dictionary_translation_set'))


# Stem of a dictionary entry
@@ -529,10 +530,28 @@ class Stem(Base):
entry = relationship('Entry', backref=backref('dictionary_stem'))


# Set of translations
class TranslationSet(Base):
    """A set of related translations"""
    __tablename__ = 'dictionary_translation_set'

    id = Column(Integer, primary_key=True, autoincrement=True)

    entry_id = Column(Integer, ForeignKey('dictionary_entry.id',
                                          name='FK_dictionary_translation_set_entry_id'))
    language_id = Column(Integer, ForeignKey('type_language.id',
                                             name='FK_dictionary_translation_set_language_id'))
    area_code = Column(String(10), ForeignKey('type_wordarea.code',
                                              name='FK_dictionary_translation_set_wordarea_code'))

    # Relationships
    entry = relationship('Entry', backref=backref('dictionary_translation_set'))
    language = relationship('Language', backref=backref('dictionary_translation_set'))
    area = relationship('WordArea', backref=backref('dictionary_translation_set'))

    translations = relationship('Translation', backref=backref('dictionary_translation'))

'''

Start of a translation model, but we need groups...

# Translation
class Translation(Base):
@@ -541,16 +560,13 @@ class Translation(Base):

id = Column(Integer, primary_key=True, autoincrement=True)

entry_id = Column(Integer, ForeignKey('dictionary_entry.id',
name='FK_dictionary_translation_entry_id'))
language_id = Column(Integer, ForeignKey('type_language.id',
name='FK_dictionary_translation_language_id'))
translation_set_id = Column(Integer, ForeignKey('dictionary_translation_set.id',
name='FK_dictionary_translation_translation_set_id'))
translation = Column(Unicode(4096, collation='BINARY'))

# Relationships
entry = relationship('Entry', backref=backref('dictionary_translation'))
language = relationship('Language', backref=backref('dictionary_translation'))
'''
translation_set = relationship('TranslationSet', backref=backref('dictionary_translation'))



# Noun Entry
32 changes: 32 additions & 0 deletions doll/input_parser/parse_dictionary.py
@@ -2,6 +2,31 @@

from doll.db import Connection
from doll.db.model import *
import re


def parse_translation(session, language, entry, translation):
    """Parses the translation line and creates the Translation and TranslationSet objects"""

    for ts in [ts for ts in map(str.strip, translation.split(';')) if len(ts) > 0]:

        # Check if the translation set starts with a word area code such as 'G:'
        # (ts has been stripped, so the area code is anchored at the start)
        area_regex = re.match(r'([A-Z]):', ts)
        if area_regex is not None:
            area = session.query(WordArea).filter(WordArea.code == area_regex.group(1)).first()
            translation_set = TranslationSet(entry=entry,
                                             area=area,
                                             language=language)
            # Drop the area marker so it is not stored as translation text
            ts = ts[area_regex.end():]
        else:
            translation_set = TranslationSet(entry=entry,
                                             language=language)

        session.add(translation_set)

        for t in [t for t in map(str.strip, ts.split(',')) if len(t) > 0]:
            session.add(Translation(translation_set=translation_set,
                                    translation=t))


def parse_dict_file(dict_file, commit_changes=False):

@@ -44,6 +69,8 @@ def parse_dict_file(dict_file, commit_changes=False):

translation = line[110:].strip()

language = session.query(Language).filter(Language.code == 'E').first()

# Create the list of stems, ignoring those that are empty or zzz
stems = [Stem(stem_number=i, stem_word=s) for i, s in enumerate(stem_list, 1) if len(s) > 0 and s != 'zzz']

@@ -57,6 +84,11 @@ def parse_dict_file(dict_file, commit_changes=False):
translation=translation,
stems=stems)

parse_translation(session=session,
language=language,
entry=entry,
translation=translation)

# Create the specific entry given the part of speech
if entry.part_of_speech_code == 'N':
noun_entry = NounEntry(declension_code=part_of_speech_data[0],
