Skip to content

Latest commit

 

History

History
166 lines (107 loc) · 10.6 KB

architecture.md

File metadata and controls

166 lines (107 loc) · 10.6 KB

Noodle Architecture

Noodle is a tool for searching for words and phrases that match complex constraints.

"Noodle expressions" are similar to regular expressions, but have extensions for common wordplay operations (like anagramming) and approximate matches (fuzzy matching).

Note: This document is still a work-in-progress; and is not complete.

Prior Art

tools.qhex.org Word Play

The Word Play utility on tools.qhex.org is one of my personal favorite puzzle tools (even though it is not as full-featured as other tools).

The tool is open source as a command-line tool, but the wordlist & web UI are not.

Nutrimatic

Nutrimatic is a search tool based on Wikipedia n-grams.

The tool is fully open source under GPLv3.

Qat

Qat is a very powerful constraint/search tool.

The tool is free to use online, supports both English & French, but is not open-source.

Qat's author also maintains Qxw, an very good open-source (GPL) tool for constructing crosswords. In theory this tool could also be useful for solving puzzles, but I've never used it that way.

Other tools

There are also other, simpler, tools that are often used when solving wordplay puzzles. However, they are less directly comparable to Noodle.

High-Level Overview

NFAs

A key part of Noodle's architecture are Nondeterministic Finite Automata (NFAs). These are a type of state machine that are commonly used to implement regular expressions.

Unlike simpler Deterministic Fniite Automata (DFAs), each state can have multiple transitions for the same input. This means that evaluating an NFA requires working with sets of reachable states, whereas evaluating a DFA only requires working with a single state at a time.

All DFAs are technically NFAs. Conversely, you can convert an NFA with n states into a DFA with O(2^n) states.

// For a given number of states, NFAs are more complex than DFAs, but are more closely related Thompson's construction is a straightforward algorithm to build an NFA corresponding to a regular expression.

Internally, Noodle uses an NFA with ε-moves representation. Each NFA has a set of initial states, a success state, and intermediate states. Transitions between states can either match and consume exactly 1 character from the input, or be an ε-move which can always be taken and consumes no input. Unlike DFAs, evaluating an NFA requires tracking the set of reachable states from a given input string. A input string is considered "matching" if there is at least one path from any initial state to the success state.

NFAs are more complex and, in general, slower to evaluate than DFAs. Many regular expression libraries try to convert NFA representations to DFAs to improve evaluation speed.

Noodle's goal is to find words and phrases that match a Noodle Expression query.

The query is parsed, and each expression is translated into one or more NFAs, as described above. "Simple" expressions that are analogous to regular expressions can be represented as a single NFA. More complex constraints, like anagram segments (<...>), are represented by multiple NFAs.

Finding single words that exactly match the query is straightforward, and Noodle acts like a typical regular expression engine. Each string in the input wordlist is run through the query's NFAs, and it's emitted as a match if the string matches all the NFAs.

When Noodle searches for phrase matches, it takes advantage of the NFA internals to make the search more efficient. A naïve search for a k-word phrase from a wordlist with n words could be done by checking all n^k sequences of words

Process

  • User input: High-level Query
    • Queries provide syntactic sugar, and are easy to compile down into a set of Expressions
    • 1 Query with an anagram constraint would be turned into multiple Expressions
  • Regex-like Low-level Expressions
    • Based on POSIX Extended Regular Expressions, but does not support backreferences
    • Adds support fuzzy matching (Levenshtein edit distance)
    • Represented internally as an NFA
      • Each state in the NFA has (up to) two types of transitions:
        • Epsilon transitions (that do not consume a character), to a set of next states
        • Character transitions (that match & consume 1 character), with a set of characters that can transition to one next state
      • The transitive closure over epsilon transitions is pre-computed.
      • This form guarantees that each input character can be consumed in O(1) time (follow the character transition, followed by epsilon transitions)
      • Fuzzy matches are implemented by tracking set of states reachable within a given number of edits
        • Since we are tracking state sets for fuzzy matching, there's less benefit from transforming the NFA into a DFA
  • Wordlist
    • Reduced alphabet: only considers letters A-Z (case-insensitive), spaces, and punctuation
      • Any non-letter, non-space character is translated into "punctuation"
  • Compute a "transition table" for each word in the wordlist, for each Expressions in the Query
    • For each (Expression, word) compute the transition table [from_state, to_state] -> min_edit_distance
      • If min_edit_distance is more than the allowed fuzz, use infty
      • This transition table is currently represented as an array of bitsets: [from_state][edit_distance][to_state] -> true/false
    • If the transition table is all infty, the word isn't useful and can be ignored for the rest of the query processing
    • This step is roughly O(n_words * word_len * n_states^3 * (max_fuzz + 1) * O(bitset)) for each Expression
      • O(bitset) is O(1) if n_states is compile-time assumed to be small, e.g. <= 64. Otherwise, it is O(n_states)
      • n_words goes down for each successive Expression processed as words get pruned
      • Computation can be reused between words with shared stems
      • In pratice, max_fuzz has a super-linear effect on the runtime, because it ~quadratically increases the number of reachable states.
  • Use the transition table to follow an iterative-deepening depth-first search (IDDFS), up to a given maximum depth (maximum number of words)
    • Find a series of words which can connect the initial state to the success state (within the maximum edit distance) across every Expression
    • Do not include words that do not advance the state of any Expression
    • This step is roughly O((n_words * n_expressions * (max_fuzz + 1) * n_states * O(bitset)) ^ max_words)