-
Notifications
You must be signed in to change notification settings - Fork 44
UPM Design
Universal Pattern Matching is an attempt of generalization of different pattern matching facilities in NGS and other programming languages.
Many programming languages have pattern matching facilities. They are presented as separate language facilities. It might be beneficial to look at all these facilities and use cases in a more generic way, allowing simpler and consistent mental model.
This work is a mental exploration (currently mostly a brain dump) of the idea.
- Simple and consistent mental model
- Fit well into NGS
- Fit into overall language design
- Minimal amount of new syntax
- Integration with the
test
facility (todo) - Integration with
assert
method (implemented) - Maximal similarity to how the data would be constructed
- Consider error messages for patten matching failure to be attributes of the predicate.
- Support multiple data structures
- Scalars
- Eachable1 and Arr containers (Array, List, etc in other languages)
- Hash and Namespace containers (Associative arrays, Dictionaries, etc in other languages)
- Explore possibility of bidirectional use (like deserialize and serialize / unpack and pack):
- Direction one: data + pattern -> interesting data
- Direction two: interesting data + pattern -> data
- Unconditional execution of code whenever particular sub-pattern matches (implemented:
pattern -> action
) - Execution of code as matching predicates (postponed. todo: see how frequently this is needed)
- Consider special casing a function of one parameter and expected result. This way automatic reporting (of failure) will be easier: "The
len
of sub-pattern blah expected to be2
".
- Consider special casing a function of one parameter and expected result. This way automatic reporting (of failure) will be easier: "The
- In cases where the result is needed (resulting data structure in cases similar to PEG parsers), support multiple strategies for aggregating/constructing the results:
- Bool
- Arr
- Hash
- (etc)
- A way to disable backtracking after particular sub-pattern matches (postponed. todo: see how frequently this is needed)
- Capturing and assignment to variables (todo)
- Convenient way to express current object pattern as well as dive into the object. Example: expressing that current object is expected to be of type
MyType
and have the fieldfield1
matching patternpat1
. (postponed. todo: see how frequently this is needed) - Back references
- The "regular" way, to be able to use matched data as part of another pattern
- Think parsing WAV or other TLV encoded data
- Operate on streams (maybe with limited functionality to prevent unlimited memory consumption)
- Example: process lines of a File (
for line in File("blah")
)
- Example: process lines of a File (
- Convenient way to provide custom parsing errors messages
- Consider easy expression of pattern / action and pattern / value pairs.
- Pattern / action -
awk
like - Pattern / value - for result aggregation/construction
- Pattern / action -
- Default values for results/variables corresponding to unmatched (sub-)patterns
- Try to unify
if
,cond
,match
,while
- References to original data (position in a string, container+key, etc) so it could potentially be modified or new data can be constructed where the matched parts are replaced with something else.
- Ability to include sub-patterns (somewhat similar to referencing other rules in PEG)
- A way to exclude from capturing (for some sub-patterns such as whitespace)
- Ability to match low-level values such as bytes and bits
- Ability to parse streams
- Ability to express a delimiter (when a sub-pattern repeats)
Pattern matching concept is well-established (TODO: continue here)
Following is a non-comprehensive list of facilities to be considered for generalization, forming the Universal Pattern Matching facility.
In no particular order:
- Regular Expressions
- Multiple Dispatch
- Raku Grammars
- PEG
- Destructuring assignment
- Expect
- AWK
- pack and unpack from Perl
-
scanf()
from C - Pattern Matching in Scala
- Different pattern matchers in Common Lisp
- Pattern Matching in Elixir
- CSS selectors
- Structural Pattern Matching in Python
- Awesome Pattern Matching (apm) for Python
- Pattern Matching in SNOBOL4
- JSON Schema
- globbing
- Pampy: Pattern Matching for Python
- Patterns (C# reference)
Following is the list of facilities to be considered for generalization, forming the Universal Pattern Matching facility.
Following NGS methods and facilities could be expressed as pattern matching (in some abstract, not very well thought through notation).
-
any(container, pattern)
->container
matches:start repeat(anything) pattern repeat(anything) end
-
all(container, pattern)
->container
matches:start repeat(pattern) end
-
each(container, callback)
->container
matches:start repeat(always_succeed(run(callback))) end
-
take(container, pattern)
(also known astake_while()
in other languages) ->start capture(repeat(pattern)) repeat(anything) end
-
drop(container, pattern)
(also known asdrop_while()
in other languages) ->start repeat(pattern) capture(repeat(anything)) end
-
if a == 1 then
- thea == 1
can be seen as a degenerated case of pattern matching (scalar data vs constant pattern) -
filter(container, pattern)
->container
matches:start repeat(always_succeed(pattern then_run collect)) end
-
reject(container, pattern)
->container
matches:start repeat(always_succeed(not(pattern) then_run collect)) end
-
map(container, mapper)
->container
matches:start repeat(always_succeed(collect(run(mapper)))) end
. -
my_array[pred1..pred2]
->my_array
matches:start repeat_nongreedy(Any) pred1 capture(repeat_nongreedy(Any)) pred2 repeat(Any) end
-
first(container, pattern)
->container
matches:start repeat_nongreedy(Any) pattern repeat(Any) end
Note that NGS already has =~
pattern matching facility but the idea here is to make something much bigger and general.
Very incomplete list:
- Extract next page link (from JSON API response)
- See how UPM can be combined with pagination facility (which does not exist yet).
- Can be full page link or some pagination token to be used for next request parameter
- File lines
- Whole match or partial match by default?
- Success and failure return values (currently,
MatchSuccess
andMatchFailure
). This might need to change for PEG-like parser use case. - How exactly capturing should work?
-
{"path-to": {"my-element": HERE}}
- It's desirable to place code at
HERE
but ... - It will be looking cluttered
- It can be perceived as only a predicate should go there, not any arbitrary processing of
my-element
- If
HERE
will bemy_func
, it makes sense thatmy_func
will be defined below, not above but NGS does not have hoisting.
- It's desirable to place code at
- Capturing
- It's desirable to be able to capture values directly into local variables but ...
- NGS does not have this language feature yet and it's not clear how it will look like
- In case of something mentally straighforward like
{"path-to": {"my-element": @my_var}}
, the@my_var
acts roughly like&my_var
reference in C, which might not be how NGS should work. - The
@my_var
might not fit well if "variables extraction" facility will be otherwise usinga, b, c <- my_obj
or similar syntax - Side thought: "variables extraction" syntax should probably also fit
for ... in
syntax to extract key and value when iterating a Hash.
- How to express that a key is a must?
- Patterns will be represented as objects. Patterns syntax will only be a shortcut for constructing the said objects.
- Capturing, into
MatchSuccess
returned by=~
, will have the following options:- flat capturing - captured value, doesn't matter how nested it is in the data structure, will be top-level accessible in the result
- nested capturing
- Capturing into variables
- TBD How
- Integration with the
test
facility andassert()
-
test
- TBD -
assert(data, pattern, message)
- already implemented
-
- Function parameters pattern matching / destructuring?
- Pattern matching such that
F my_func(1, blah) do_something()
would work
- Pattern matching such that
- (If use cases arise; not sure) Facility to return specific result from
=~
, notMatchResult
-
Scalar
- exists, matches itself -
Type
- exists, matches values of given type -
RegExp
- exists -
Range
- exists
-
AllOf
- exists. Doesn't have similar facility in RegExp I think. -
AnyOf
- exists -
Repeat
-Repeat(pattern, Int)
andRepeat(pattern, range)
- (UNKNOWN) - something that would match the object and would have a pattern(s) to match the children
- Hash - exists, matches field-wise
- Arr - exists, matches element-wise.
-
Assert(pattren, error_message)
- throw exception if thepattern
did not match. Default error message would be something like: "expected (pattern) but found (actual data)" -
Capture
-Capture(pattern)
andCapture('my_name', pattern)
- Allow capturing the index of
pattern
somehow (for example index of an element in an array)
- Allow capturing the index of
-
DontCapture
- exclude from capturing -
Lit
- to match the item literally so that for exampleLit(Int)
would match theInt
type, not an instance. That's an escape from recursive=~
to regular==
. (implemented) -
Not
? - Negate match (todo) -
Pattern
- As a container for a pattern (postponed. todo: see real life use cases) -
Reference
? - Reference previously captured sub-pattern -
Instance(type, field1=pat1, field2=pat2, ...)
- Object instance with given fields (postponed. todo: see real life use cases) -
Check(fun, pattern)
- Runfun
on the subject and check the returned value againstpattern
(postponed. todo: see real life use cases; also can be easily done with{fun(A) =~ patten}
sans the context functionality) - (UNKNOWN) - Prevent backtracking (postponed. todo: see real life use cases)
- (UNKNOWN) - Match start (postponed. todo: see real life use cases)
- (UNKNOWN) - Match end (postponed. todo: see real life use cases)
Postponed
- Is it really that useful?
-
[a, b, c] = my_arr
-
[a, (b), c]
- to use the value ofb
as a sub-pattern and not assign -
[a, *b, c]
- capture several elements as an array intob
- Maybe
[a, Repeat(b), c]
? Or at least internally?
- Maybe
-
-
{a, b, c} = my_hash
- extract similarly named values- How about renaming?
- Alternatively,
my_arr =~ [@a, @b, @c]
. Downside - destination variables are not on the left, harder to spot. - Alternatively,
[@a, @b, @c] = my_arr
. Works as above but throws exception if there is no match. - "Extract values" operator:
a, b, c <- my_something
-
for k, v <- in my_hash
? -
for k, v <- x in my_hash
? -
for k, v <- my_hash
? - I would really prefer
a <- [1]
anda, b <- [1, 2]
over Python's(a,) = [1]
anda, b = [1, 2]
(even though the single-element tuple is probably rare). JavaScript's[a] = [1]
looks fine.
-
Postponed
All of the following syntax would be confined to the // ... //
, thus having minimal impact on other parts of the language.
//UPM pattern//
- Elements
- An element - literal scalar, Arr, Hash, (Range?). Probably can not be arbitrary expression because of the syntax collisions with the below.
-
.
- Any?
- Means of combination
-
elt1 elt2 ... eltN
- sequential match -
elt1?
-Repeat(elt1, 0...1)
-
elt1*
-Repeat(elt1, 0..null)
-
elt1+
-Repeat(elt1, 1..null)
-
(elt1 & elt2 & ... & eltN)
-AllOf([elt1, elt2, ..., eltN])
- Not very elegant if you need to check a type and fields:
(MyType & {"x": 1})
- Not very elegant if you need to check a type and fields:
-
(elt1 | elt2 | ... | eltN)
-AnyOf([elt1, elt2, ..., eltN])
name:elt
-Capture('name', elt)
, consistent with PEG and parameters in function definitions in NGS
-
- Specials
-
{ code }
- consistent with top-level syntax switch to "expressions syntax", zero width match (and non match?) -
${ code }
- consistent with top-level syntax switch to "expressions syntax", returned value is used as a sub-pattern
-
- TODO
- Access previous captures
- Repeat with given min and max
- Non short-circuit checks maybe
(elt1, elt2, ...,eltN)
- Match at arbitrary depth?
- Step inside a field (data
{"a": {"b": 1}}
)- maybe
Hash -> "a" -> "b" -> Int
? Would construct(Hash & {"a": {"b": Int}}
. What's the transformation rule? - maybe CSS selector resembling style, something like
Hash["a"["b"[Int]]]
? Is it even correct? What's the rule? Doesn't look very good. I do like the use of[...]
to denote going inside the element. - maybe
Hash["a"]["b"][Int]
orHash{"a"}{"b"}:Int
-- @rdje
- maybe
- Unnamed captures? Maybe
:elt
? Maybe don't support unnamed captures? - Maybe syntax to indicate top-level (or simpler than find somewhere in the matching tree) access to captures
Note that all existing patterns return Bool
when used with =~
.
NGS official website is at https://ngs-lang.org/