Skip to content
Ilya Sher edited this page Feb 8, 2022 · 25 revisions

UPM Language Feature Design

Universal Pattern Matching is an attempt of generalization of different pattern matching facilities in NGS and other programming languages.

Reasoning

Many programming languages have pattern matching facilities. They are presented as separate language facilities. It might be beneficial to look at all these facilities and use cases in a more generic way, allowing simpler and consistent mental model.

This work is a mental exploration (currently mostly a brain dump) of the idea.

Requirements

  • Simple and consistent mental model
  • Fit well into NGS
    • Fit into overall language design
    • Minimal amount of new syntax
    • Integration with the test facility (todo)
    • Integration with assert method (implemented)
    • Maximal similarity to how the data would be constructed
    • Consider error messages for patten matching failure to be attributes of the predicate.
  • Support multiple data structures
    • Scalars
    • Eachable1 and Arr containers (Array, List, etc in other languages)
    • Hash and Namespace containers (Associative arrays, Dictionaries, etc in other languages)
  • Explore possibility of bidirectional use (like deserialize and serialize / unpack and pack):
    • Direction one: data + pattern -> interesting data
    • Direction two: interesting data + pattern -> data
  • Unconditional execution of code whenever particular sub-pattern matches (implemented: pattern -> action)
  • Execution of code as matching predicates (postponed. todo: see how frequently this is needed)
    • Consider special casing a function of one parameter and expected result. This way automatic reporting (of failure) will be easier: "The len of sub-pattern blah expected to be 2".
  • In cases where the result is needed (resulting data structure in cases similar to PEG parsers), support multiple strategies for aggregating/constructing the results:
    • Bool
    • Arr
    • Hash
    • (etc)
  • A way to disable backtracking after particular sub-pattern matches (postponed. todo: see how frequently this is needed)
  • Capturing and assignment to variables (todo)
  • Convenient way to express current object pattern as well as dive into the object. Example: expressing that current object is expected to be of type MyType and have the field field1 matching pattern pat1. (postponed. todo: see how frequently this is needed)
  • Back references
    • The "regular" way, to be able to use matched data as part of another pattern
    • Think parsing WAV or other TLV encoded data
  • Operate on streams (maybe with limited functionality to prevent unlimited memory consumption)
    • Example: process lines of a File (for line in File("blah"))
  • Convenient way to provide custom parsing errors messages
  • Consider easy expression of pattern / action and pattern / value pairs.
    • Pattern / action - awk like
    • Pattern / value - for result aggregation/construction
  • Default values for results/variables corresponding to unmatched (sub-)patterns
  • Try to unify if, cond, match, while
  • References to original data (position in a string, container+key, etc) so it could potentially be modified or new data can be constructed where the matched parts are replaced with something else.
  • Ability to include sub-patterns (somewhat similar to referencing other rules in PEG)
  • A way to exclude from capturing (for some sub-patterns such as whitespace)
  • Ability to match low-level values such as bytes and bits
  • Ability to parse streams

Nice to Haves

  • Ability to express a delimiter (when a sub-pattern repeats)

Prior Art

Pattern matching concept is well-established (TODO: continue here)

Following is a non-comprehensive list of facilities to be considered for generalization, forming the Universal Pattern Matching facility.

In no particular order:

NGS Facilities

Following is the list of facilities to be considered for generalization, forming the Universal Pattern Matching facility.

Following NGS methods and facilities could be expressed as pattern matching (in some abstract, not very well thought through notation).

  • any(container, pattern) -> container matches: start repeat(anything) pattern repeat(anything) end
  • all(container, pattern) -> container matches: start repeat(pattern) end
  • each(container, callback) -> container matches: start repeat(always_succeed(run(callback))) end
  • take(container, pattern) (also known as take_while() in other languages) -> start capture(repeat(pattern)) repeat(anything) end
  • drop(container, pattern) (also known as drop_while() in other languages) -> start repeat(pattern) capture(repeat(anything)) end
  • if a == 1 then - the a == 1 can be seen as a degenerated case of pattern matching (scalar data vs constant pattern)
  • filter(container, pattern) -> container matches: start repeat(always_succeed(pattern then_run collect)) end
  • reject(container, pattern) -> container matches: start repeat(always_succeed(not(pattern) then_run collect)) end
  • map(container, mapper) -> container matches: start repeat(always_succeed(collect(run(mapper)))) end.
  • my_array[pred1..pred2] -> my_array matches: start repeat_nongreedy(Any) pred1 capture(repeat_nongreedy(Any)) pred2 repeat(Any) end
  • first(container, pattern) -> container matches: start repeat_nongreedy(Any) pattern repeat(Any) end

Note that NGS already has =~ pattern matching facility but the idea here is to make something much bigger and general.

Use Cases

Very incomplete list:

  • Extract next page link (from JSON API response)
    • See how UPM can be combined with pagination facility (which does not exist yet).
    • Can be full page link or some pagination token to be used for next request parameter
  • File lines

Open Issues

  • Whole match or partial match by default?
  • Success and failure return values (currently, MatchSuccess and MatchFailure). This might need to change for PEG-like parser use case.
  • How exactly capturing should work?
  • {"path-to": {"my-element": HERE}}
    • It's desirable to place code at HERE but ...
    • It will be looking cluttered
    • It can be perceived as only a predicate should go there, not any arbitrary processing of my-element
    • If HERE will be my_func, it makes sense that my_func will be defined below, not above but NGS does not have hoisting.
  • Capturing
    • It's desirable to be able to capture values directly into local variables but ...
    • NGS does not have this language feature yet and it's not clear how it will look like
    • In case of something mentally straighforward like {"path-to": {"my-element": @my_var}}, the @my_var acts roughly like &my_var reference in C, which might not be how NGS should work.
    • The @my_var might not fit well if "variables extraction" facility will be otherwise using a, b, c <- my_obj or similar syntax
    • Side thought: "variables extraction" syntax should probably also fit for ... in syntax to extract key and value when iterating a Hash.
  • How to express that a key is a must?

Design

  • Patterns will be represented as objects. Patterns syntax will only be a shortcut for constructing the said objects.
  • Capturing, into MatchSuccess returned by =~, will have the following options:
    • flat capturing - captured value, doesn't matter how nested it is in the data structure, will be top-level accessible in the result
    • nested capturing
  • Capturing into variables
    • TBD How
  • Integration with the test facility and assert()
    • test - TBD
    • assert(data, pattern, message) - already implemented
  • Function parameters pattern matching / destructuring?
    • Pattern matching such that F my_func(1, blah) do_something() would work
  • (If use cases arise; not sure) Facility to return specific result from =~, not MatchResult

Pattern Primitives

  • Scalar - exists, matches itself
  • Type - exists, matches values of given type
  • RegExp - exists
  • Range - exists

Means of combination

  • AllOf - exists. Doesn't have similar facility in RegExp I think.
  • AnyOf - exists
  • Repeat - Repeat(pattern, Int) and Repeat(pattern, range)
  • (UNKNOWN) - something that would match the object and would have a pattern(s) to match the children
  • Hash - exists, matches field-wise
  • Arr - exists, matches element-wise.

Specials

  • Assert(pattren, error_message) - throw exception if the pattern did not match. Default error message would be something like: "expected (pattern) but found (actual data)"
  • Capture - Capture(pattern) and Capture('my_name', pattern)
    • Allow capturing the index of pattern somehow (for example index of an element in an array)
  • DontCapture - exclude from capturing
  • Lit - to match the item literally so that for example Lit(Int) would match the Int type, not an instance. That's an escape from recursive =~ to regular ==. (implemented)
  • Not? - Negate match (todo)
  • Pattern - As a container for a pattern (postponed. todo: see real life use cases)
  • Reference? - Reference previously captured sub-pattern
  • Instance(type, field1=pat1, field2=pat2, ...) - Object instance with given fields (postponed. todo: see real life use cases)
  • Check(fun, pattern) - Run fun on the subject and check the returned value against pattern (postponed. todo: see real life use cases; also can be easily done with {fun(A) =~ patten} sans the context functionality)
  • (UNKNOWN) - Prevent backtracking (postponed. todo: see real life use cases)
  • (UNKNOWN) - Match start (postponed. todo: see real life use cases)
  • (UNKNOWN) - Match end (postponed. todo: see real life use cases)

Destructuring assignment preliminary thoughts

Postponed

  • Is it really that useful?
  • [a, b, c] = my_arr
    • [a, (b), c] - to use the value of b as a sub-pattern and not assign
    • [a, *b, c] - capture several elements as an array into b
      • Maybe [a, Repeat(b), c] ? Or at least internally?
  • {a, b, c} = my_hash - extract similarly named values
    • How about renaming?
  • Alternatively, my_arr =~ [@a, @b, @c]. Downside - destination variables are not on the left, harder to spot.
  • Alternatively, [@a, @b, @c] = my_arr. Works as above but throws exception if there is no match.
  • "Extract values" operator: a, b, c <- my_something
    • for k, v <- in my_hash ?
    • for k, v <- x in my_hash ?
    • for k, v <- my_hash ?
    • I would really prefer a <- [1] and a, b <- [1, 2] over Python's (a,) = [1] and a, b = [1, 2] (even though the single-element tuple is probably rare). JavaScript's [a] = [1] looks fine.

Syntax Preliminary Thoughts

Postponed

All of the following syntax would be confined to the // ... //, thus having minimal impact on other parts of the language.

  • //UPM pattern//
  • Elements
    • An element - literal scalar, Arr, Hash, (Range?). Probably can not be arbitrary expression because of the syntax collisions with the below.
    • . - Any?
  • Means of combination
    • elt1 elt2 ... eltN - sequential match
    • elt1? - Repeat(elt1, 0...1)
    • elt1* - Repeat(elt1, 0..null)
    • elt1+ - Repeat(elt1, 1..null)
    • (elt1 & elt2 & ... & eltN) - AllOf([elt1, elt2, ..., eltN])
      • Not very elegant if you need to check a type and fields: (MyType & {"x": 1})
    • (elt1 | elt2 | ... | eltN) - AnyOf([elt1, elt2, ..., eltN]) name:elt - Capture('name', elt), consistent with PEG and parameters in function definitions in NGS
  • Specials
    • { code } - consistent with top-level syntax switch to "expressions syntax", zero width match (and non match?)
    • ${ code } - consistent with top-level syntax switch to "expressions syntax", returned value is used as a sub-pattern
  • TODO
    • Access previous captures
    • Repeat with given min and max
    • Non short-circuit checks maybe (elt1, elt2, ...,eltN)
    • Match at arbitrary depth?
    • Step inside a field (data {"a": {"b": 1}})
      • maybe Hash -> "a" -> "b" -> Int? Would construct (Hash & {"a": {"b": Int}}. What's the transformation rule?
      • maybe CSS selector resembling style, something like Hash["a"["b"[Int]]]? Is it even correct? What's the rule? Doesn't look very good. I do like the use of [...] to denote going inside the element.
      • maybe Hash["a"]["b"][Int] or Hash{"a"}{"b"}:Int -- @rdje
    • Unnamed captures? Maybe :elt? Maybe don't support unnamed captures?
    • Maybe syntax to indicate top-level (or simpler than find somewhere in the matching tree) access to captures

Note that all existing patterns return Bool when used with =~.