Term matching
The dictionary lookup is performed as follows:
- Every sentence is segmented into tokens.
- Each token is transformed to a normalized version.
- Candidate spans (sequences of 1 or more tokens) are generated for lookup.
- Based on a stopword list, normalization is suspended for certain candidates.
The resulting candidates are then compared to the dictionary.
OGER performs term matching at the sentence level.
For matching, each sentence is split into a sequence of token-like segments by a RegEx.
The default RegEx is `\d+|[^\W\d_]+`, i.e. any contiguous sequence of either digits or letters is extracted.
Punctuation symbols and whitespace are removed.
For example, the sentence
SMC proliferation via p21Waf1/Cip1 signaling.
is segmented into the following 10 tokens:
SMC proliferation via p 21 Waf 1 Cip 1 signaling
This behaviour can be changed by adjusting the token RegEx through the termlist parameter `term-token`.
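The segmentation described above can be reproduced with Python's `re` module. This is a minimal sketch using the default RegEx, not OGER's actual tokenizer code:

```python
import re

# Default token RegEx: contiguous runs of digits, or of letters
# (word characters excluding digits and underscore).
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+", re.UNICODE)

def tokenize(sentence):
    """Split a sentence into digit runs and letter runs,
    dropping punctuation and whitespace."""
    return TOKEN_RE.findall(sentence)

print(tokenize("SMC proliferation via p21Waf1/Cip1 signaling."))
# ['SMC', 'proliferation', 'via', 'p', '21', 'Waf', '1', 'Cip', '1', 'signaling']
```

Note how "p21Waf1" is broken at every digit/letter boundary, while the slash in "Waf1/Cip1" is simply discarded.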
Every token is normalized individually by applying a series of transformation operations.
Through the termlist parameter `normalize`, any ordered combination of the following operations can be specified:
- lowercase: All characters are converted to lower case. This is the default operation.
- greektranslit: Letters of the Greek alphabet are transliterated to letter names in Latin script, e.g. "α" → "alpha". This operation assumes that "lowercase" was applied first. Also, if compatibility characters like the Kappa symbol (ϰ) are an issue, prepending "unicode-NFKC" or "unicode-NFKD" should help.
- stem[-ARG]: Apply a standard stemming algorithm to each token. Valid values for ARG are "Porter" and "Lancaster" (default), referring to the NLTK implementation of the respective stemmer.
- unicode[-ARG]: Apply Unicode normalization to the input. Valid values for ARG are "NFC", "NFD", "NFKC", "NFKD".
- mask[-REPL-TARGET]: Replace all occurrences of a certain token class with a placeholder. This operation takes two optional arguments: REPL defines the placeholder string and TARGET specifies the token class. REPL can be any string (but note that there is no escaping mechanism for encoding dashes). TARGET can take one of the following values:
- "digits": Mask tokens that consist entirely of digits.
- "numeric": Mask tokens that consist entirely of numeric characters. For the distinction between this and the former, cf. the Python docs on str.isdigit and str.isnumeric.
- "punct": Mask tokens that consist entirely of punctuation characters. This is approximated with the RegEx `[^\w\s]+`, i.e. all characters that are neither alphanumeric nor blank.
- Any other value is interpreted as a path/URL to a text file which lists all target tokens (one per line). Please note that the tokens of this list are not preprocessed by OGER in any way, so the user needs to take care that they match preceding normalization steps.

The defaults for REPL and TARGET are "0" and "digits", respectively.
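Chained together, these operations act as a per-token pipeline. The sketch below illustrates the idea only; it is not OGER's implementation, and the transliteration table is truncated to a few letters:

```python
import unicodedata

# Tiny illustrative transliteration table; OGER covers the full Greek alphabet.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "κ": "kappa"}

def unicode_nfkc(tok):
    # Folds compatibility characters, e.g. the Kappa symbol "ϰ" → "κ".
    return unicodedata.normalize("NFKC", tok)

def lowercase(tok):
    return tok.lower()

def greektranslit(tok):
    return "".join(GREEK.get(c, c) for c in tok)

def mask_digits(tok, repl="0"):
    # "mask" with defaults: REPL="0", TARGET="digits".
    return repl if tok.isdigit() else tok

def normalize(token, ops):
    """Apply the configured operations to a single token, in order."""
    for op in ops:
        token = op(token)
    return token

# Corresponds roughly to: normalize = unicode-NFKC lowercase greektranslit mask
OPS = [unicode_nfkc, lowercase, greektranslit, mask_digits]

print(normalize("ϰ", OPS))   # "kappa": NFKC folds ϰ to κ, then transliteration
print(normalize("21", OPS))  # "0": all-digit token is masked
```

The order matters: running greektranslit before NFKC would miss the Kappa symbol, since "ϰ" is not a key in the table.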
Typically, multiple normalization operations are applied in series. The operations can be specified as a space-separated list or as a JSON array. On the command line, shell escaping will be needed. Bash example:
-c termlist-normalize '["unicode-NFKC", "lowercase", "greektranslit", "mask-STOP-/path/to/stopwords.txt"]'
During term matching, OGER looks for first-token triggers in order to construct lookup candidates. For example, if the termlist contains the (tokenized) terms "p 53" and "p 53 signaling complex", the above example text would trigger the two candidate spans "p 21" and "p 21 Waf 1", since terms starting with the token "p" can have length 2 and 4.
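The trigger mechanism can be sketched as follows, assuming the term dictionary is indexed by first token and term length (an illustration, not OGER's actual data structures):

```python
def candidate_spans(tokens, trigger_lengths):
    """Generate lookup candidates: whenever a token is a known
    first token, emit spans of every term length it can start."""
    candidates = []
    for i, tok in enumerate(tokens):
        for length in trigger_lengths.get(tok, ()):
            if i + length <= len(tokens):
                candidates.append(tuple(tokens[i:i + length]))
    return candidates

tokens = ["SMC", "proliferation", "via", "p", "21",
          "Waf", "1", "Cip", "1", "signaling"]
# With the terms "p 53" and "p 53 signaling complex" in the termlist,
# "p" triggers candidate spans of lengths 2 and 4.
print(candidate_spans(tokens, {"p": [2, 4]}))
# [('p', '21'), ('p', '21', 'Waf', '1')]
```

Neither candidate actually matches a dictionary entry here; triggering only decides which spans are worth comparing at all.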
The stopword list is a user-defined list of words and multi-word expressions that should be exempt from normalization. Upon loading, each entry (text line) of this list is tokenized and normalized as described above. During term matching, if a candidate span is found in this list, an exact match is enforced, i.e. the unnormalized version of the span and the dictionary entry are compared.
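This exemption is useful for terms like the gene symbol "WAS", whose lowercased form collides with the English verb. A hedged sketch of the logic (the dictionary layout, entry fields, and identifier are assumptions for illustration):

```python
def match(raw_span, norm_span, dictionary, stop_spans):
    """Look up a candidate span, enforcing an exact (unnormalized)
    match when the span appears on the stopword list."""
    entry = dictionary.get(norm_span)
    if entry is None:
        return None
    if norm_span in stop_spans:
        # Exempt from normalization: the surface form must equal
        # the original dictionary term exactly.
        return entry if raw_span == entry["raw_term"] else None
    return entry

# Hypothetical dictionary with one gene term; "EX:1" is a made-up ID.
dictionary = {("was",): {"raw_term": ("WAS",), "id": "EX:1"}}
stop_spans = {("was",)}  # "WAS" listed in the stopword file

print(match(("WAS",), ("was",), dictionary, stop_spans))  # matches the gene
print(match(("was",), ("was",), dictionary, stop_spans))  # None: plain "was"
```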