-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(remap): re-implement VRL parser/compiler #6353
Conversation
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
lib/remap-tests/src/main.rs
Outdated
@@ -1,107 +1,164 @@ | |||
use ansi_term::Colour; | |||
use chrono::{DateTime, Utc}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably ignore most changes in this file, as it's mostly just a quick way to run our test harness. No attention was given to performance/best-practices.
lib/remap-tests/tests/diagnostics/path_based_variable_assignment.vrl
Outdated
Show resolved
Hide resolved
# │ | ||
# = see function documentation at: https://master.vector.dev/docs/reference/remap/#parse_json | ||
# = see language documentation at: https://vector.dev/docs/reference/vrl/ | ||
# function call error for "parse_json" at (1:27): unable to parse json: key must be a string at line 1 column 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was changed from a pretty-printed diagnostic message, to a single-line compact variant because runtime errors are captured in Vector logs, and so we want to keep them more succinct.
# │ | ||
# = see function documentation at: https://master.vector.dev/docs/reference/remap/#to_string | ||
# = see language documentation at: https://vector.dev/docs/reference/vrl/ | ||
# function call error for "to_string" at (1:15): unable to coerce array into string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, re: succinct runtime error messages.
# │ | ||
# = hint: assign to "ok", without assigning to "err" | ||
# = see error handling documentation at: https://vector.dev/docs/reference/vrl/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ignore these links, I'll update them to point to the correct documentation sections before the PR is merged.
lib/remap-parser/src/parser.lalrpop
Outdated
// The main entrypoint into a VRL program. | ||
// | ||
// A program consists of one or more expressions. | ||
pub Program: Program = NonterminalNewline* <RootExprs> => Program(<>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick guide on how to read this syntax:
pub Program: Program = NonterminalNewline* <RootExprs> => Program(<>);
pub
— make this rule public, basically generating a<Rule>Parser::parse()
function (e.g.ProgramParser
)Program
— the name of the non-terminal: Program
— the (Rust) type returned by this non-terminal= NonterminalNewline* <RootExprs>
— the matching terminal (e.g. a sequence of lexed tokens that should match for this non-terminal to match)- furthermore, the
<...>
indicates that onlyRootExprs
is provided to the action of the match arm
- furthermore, the
=>
start the action of the match arm, anything after this token is regular Rust code*.- the exception being
<>
, which is short for "whatever is captured by the match arm".
- the exception being
For more details, see: https://lalrpop.github.io/lalrpop/tutorial/003_type_inference.html
lib/remap-parser/src/parser.lalrpop
Outdated
|
||
// Root expressions are allowed to fail. The parser will continue with the | ||
// next expression in the program. | ||
Sp<!> => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!
is a special terminal that captures any error generated by the parser. This is what allows the compiler to continue compiling root expressions even if one of them fails.
lib/remap-parser/src/parser.lalrpop
Outdated
|
||
|
||
Expr: Node<Expr> = { | ||
Sp<IfStatement> => Node::new(<>.span(), Expr::IfStatement(<>)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sp<...>
is a macro, it expands to a bigger set of matching rules. See the end of this file for its implementation. The TL;DR is that it enriches the if statement rule with span details pointing to start/end locations in the source code. This is what allows our diagnostic messages to give detailed help messages.
lib/remap-parser/src/parser.lalrpop
Outdated
|
||
#[inline] | ||
FunctionArgument: FunctionArgument = { | ||
<ident: (<Sp<AnyIdent>> ":")?> <expr: ArithmeticExpr> => FunctionArgument { <> }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that LR(1) parsers aren't allowed to be ambiguous, so this isn't allowed:
foo(ok, err)
foo(ok, err = ...)
Both cases would be parsed the same, but the first is two positional arguments, the second is one error-assignment expression for the first positional argument.
To work around this for now, using error-assignments in cases like these requires you to wrap it in parenthesis:
foo((ok, err = ...), second_arg)
single assignments still work without:
foo(ok = ..., second_arg)
lib/remap-compiler/src/expression.rs
Outdated
pub trait Expression: Send + Sync + fmt::Debug { | ||
/// Resolve an expression to a concrete [`Value`]. | ||
/// | ||
/// This method is executed at runtime. | ||
/// | ||
/// An expression is allowed to fail, which aborts the running program. | ||
fn resolve(&self, ctx: &mut Context) -> Resolved; | ||
|
||
/// Resolve an expression to its [`TypeDef`] type definition. | ||
/// | ||
/// This method is executed at compile-time. | ||
fn type_def(&self, state: &crate::State) -> TypeDef; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trait is slightly changed compared to the previous variant:
execute
renamed toresolve
- return type wrapped in type alias
Resolved
, which now also returns anExpressionError
instead of the previous genericremap::Error
- it no longer takes a separate
state
andtarget
(previously namedobject
), but instead acontext
which includes both
lib/remap-compiler/src/expression.rs
Outdated
impl From<String> for ExpressionError { | ||
fn from(message: String) -> Self { | ||
ExpressionError { | ||
message, | ||
..Default::default() | ||
} | ||
} | ||
} | ||
|
||
impl From<&str> for ExpressionError { | ||
fn from(message: &str) -> Self { | ||
message.to_owned().into() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows functions to return a string as their error message, instead of requiring them to implement Into<ExpressionError>
for different error types they might return.
This makes the error less specific (e.g. it doesn't include custom labels and notes), but in most use-cases, this makes no difference, since any fallible function will have to be handeld at compile-time anyway, so these error messages are only exposed if someone uses !
to abort at runtime or captures the error message via ok, err = ...
.
.into_iter() | ||
.fold(TypeDef::empty(), |acc, td| acc.merge(td)); | ||
|
||
TypeDef::new().array(Some(inner)).with_fallibility(fallible) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I overhauled the TypeDef
API to be more descriptive. You'll see in the actual implementation while reviewing. I still need to add the new inner_type_def
feature that @FungusHumungus implemented.
Also, because function argument type definitions (or fallibility) is now checked before the argument is passed to the function call, most functions no longer need to concern themselves with merging their type def with that of their arguments (some still do, though).
impl fmt::Display for Array { | ||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { | ||
let exprs = self | ||
.inner | ||
.iter() | ||
.map(|e| e.to_string()) | ||
.collect::<Vec<_>>() | ||
.join(", "); | ||
|
||
write!(f, "[{}]", exprs) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we actually start using this for some kind of "pretty formatter" (or perhaps already in the CLI), we should make some improvements to these, for example to split the array over multiple lines if they contain more than X elements.
// Fallible expressions require infallible assignment. | ||
if type_def.is_fallible() { | ||
return Err(Error { | ||
variant: ErrorVariant::FallibleAssignment( | ||
target.to_string(), | ||
expr.to_string(), | ||
), | ||
span, | ||
expr_span, | ||
assignment_span, | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, the compiler gathers multiple errors during compilation of a program and returns all of them to be displayed by the formatter.
However, I opted to return the first error encountered for any specific expression, as otherwise you have the chance to get nonsensical errors because a previous error check triggered, but allowed compilation to continue.
We still capture multiple errors across different expressions though.
lib/remap-lang/src/prelude.rs
Outdated
@@ -1,17 +1,28 @@ | |||
// commonly used modules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this prelude as I migrate over existing functions.
accepts: |v| matches!(v, Value::Map(_) | Value::Array(_)), | ||
kind: kind::OBJECT | kind::ARRAY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows the change in Parameter
signature to support compile-time argument type checking, as mentioned before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much nicer, and I think it provides a better hook for building in self-documentation features.
lib/remap-functions/src/compact.rs
Outdated
#[derive(Debug, Clone)] | ||
#[derive(Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll probably have to make this Clone
again once the changes are integrated back into Vector, but we'll see.
} | ||
} | ||
|
||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll re-add these tests in this PR as I start moving functions over.
lib/remap-functions/src/string.rs
Outdated
#[derive(Clone, Copy, Debug)] | ||
pub struct STring; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function that's part of our story to enhance type safety.
Basically, because functions now expect arguments to have the correct type, you need a way to guarantee such a type.
We already have to_string(.foo)
, but that's a coercion function, whereas the new string(.foo)
is a type checking functions.
So for example, instead of doing this:
parse_json(.foo)
You could do this:
parse_json(to_string(.foo) ?? "{}")
But that still leaves the possibility of any non-string being coerced into a string, which might not be what you expect.
So now you can do this:
parse_json(string(.foo) ?? "{}")
Which guarantees that .foo
is only parsed as JSON if it is actually a string and no other type.
Note that like all functions, you still need to store the resulting value for the type check to be tracked.
So, this won't work:
string!(.foo)
parse_json(.foo) // type error: .foo is of unknown type
But this will:
.foo = string!(.foo)
parse_json(.foo)
The reason for this is that string
is a regular function like any other, and we don't want functions to mutate the type state of the program, as that can lead to bad results (e.g. it would invalidate all our type safety guarantees) if a function is badly implemented.
Instead, the resulting value is captured by the assignment expression, and it handles the type definition of .foo
for the rest of the program.
@binarylogic and I discussed providing a built-in solution to type checking, but we decided against that for now as we like the simplicity of everything being done through functions for now, and we also don't have a nice syntactic solution yet to add type information into the language. We might still do this in the future, but not for now.
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Stephen Wakely <[email protected]>
Signed-off-by: Jean Mertz <[email protected]>
These were commented out two years ago by #6353 . I was able to revive them and fix the compilation errors. Signed-off-by: Jesse Szwedko <[email protected]>
* chore(vrl): Revive old remap tests These were commented out two years ago by #6353 . I was able to revive them and fix the compilation errors. Signed-off-by: Jesse Szwedko <[email protected]> * Fix tests Signed-off-by: Jesse Szwedko <[email protected]> --------- Signed-off-by: Jesse Szwedko <[email protected]>
review notes
As our BDFL @binarylogic once said to me; “I’m Charlie in the back”. Well, him and me both.
This PR overhauls the VRL parser with the goal of:
It does so by making three substantial changes:
vrl-parser
crate now generates a syntactically correct ASTvrl-compiler
crate generates a semantically correct VRL programThis means that this is now basically a “two-stage compiler”
(note: while I’m using the terms “stage”, “compiler” and others in this description, they might not fit the exact CS terms, but serve the purpose of getting the meaning across)
Testing
Before I started this rewrite (let’s call it what it is), I added a whole heap of new UI tests to our test harness. This ensured that the end-result should not cause any regressions, and fix tests that were previously broken.
See the first commit for updates to the test harness, and the second commit for the added tests.
LALRPOP vs. Pest vs. …
I don’t plan on writing an extensive essay on why I chose the former over others, as the decision has already been made, and the result is empirically better than the old implementation.
Suffice it to say, LALRPOP enjoys widespread use, including in robust production-grade languages. There has been talk about replacing the Rust parser with LALRPOP, although that has stalled out as far as I know, mostly because of time constraints. The original author of the library is Niko (Rust compiler/language/core team lead), and it is now in the capable hands of the author of the Gluon language (itself parsed using LALRPOP), together with others in the community providing contributions.
I believe it strikes the right balance between (1) having syntax tailored to make it easy to parse tokens, and (2) staying close to Rust, allowing you to write Rust as you implement the parser. It is also fully typed, unlike Pest. In terms of performance, I haven’t done any benchmarking, but given that we have a custom-built lexer, I assume it’s either as-fast, or faster than our previous implementation.
Parser (stage 1)
The third commit implements the
vrl-parser
crate.This first stage generates an AST form a source program. It only cares about the validity of the syntax. It is a forgiving-parser, meaning it tries to parse each root expression in the program, but swaps out any badly formatted root expression with a noop, before recording the relevant error and continuing with the next root expression in the source program.
Because the parser only cares about syntax, not about semantics, it is perfectly valid to:
This stage also encodes span information into the AST, pointing to the byte start/end positions of individual parts of the program. Each element in the AST is a
Node
, which contains the span information, and holds the actual type generated by the parser.The parser is stateless, it goes from top to bottom, and extends the AST as it generates more nodes.
This stage also includes a custom lexer. Originally, I used the built-in lexer provided by LALRPOP, which works based on string and regex matching, but that turned out to be insufficient for our use-case. Specifically, we want the parser to remain whitespace agnostic, but this requires special handling of our path syntax (
.foo[2].(bar | baz)
). I started solving this with another regex, but things became unwieldy and unmaintainable, so I opted to implement our own lexer. This also comes with an added advantage of better performance, unit tests for the lexing phase, and allowing us to fine-tune the lexer to our exact needs.Compiler (stage 2)
The fourth commit implements the
vrl-compiler
crate.This second stage takes the generated AST and compiles it into a useable VRL program.
In this stage, all of the above “acceptable errors” are turned into diagnostic messages for the caller to solve before the compiler is satisfied. This means once the compiler is done, the resulting program is both syntactically and semantically correct.
The compiler stage holds state during compilation, it keeps track of known value types (progressive type checking), it tracks initialized variables (guarding against variable typos), and in the future it will also make sure all root-level expressions contribute to the result of the program (dead-code detector).
diagnostics
The fifth commit adds the new
vrl-diagnostic
crate, which serves the purpose of turning any VRL error into a readable diagnostic message, printing the relevant source code and providing help to solve the given issue.remap-lang
The sixth commit ties it all back together in the existing
remap-lang
crate.This crate now mostly serves as a single public entry-point into VRL. Its API is mostly similar to what was there before, with some ergonomic improvements.
It also still houses the diagnostic module. I plan on extracting that one out into a separate
remap-diagnostic
crate, but that can be tackled separately, and isn’t as important as the other changes done in this PR.Other
The seventh commit updates the CLI to work with the newest VRL changes.
The eighth commit updates the first batch of stdlib functions to make sure all
vrl-test
tests pass as expected. It also adds a newstring
function, which is the first of a new batch of “type enforcement” functions people can use in their programs (we’ll be addingfloat()
etc, as well asis_string()
etc).The ninth commitmoves all VRL related crates into the new
lib/vrl
top-level directory, to make it easier to work on VRL locally by opening that top-level directory in your editor, and having access to all relevant crates. In the process it also renames the crates with their appropriate names.Conclusion
To conclude. I realize this is a big PR, but I hope splitting it up into multiple logical commits makes it easier to review. Four commits are additive, (adding tests, adding the new lexer and parser, the compiler, and diagnostics), making it easy to review them without any diff noise.
It was a big effort (and a bit of a risk) to take so close to cutting 0.12, but as we found more corner-cases that weren’t correctly parsed, and more hacks were piled on to the old Pest grammar to resolve those issue, I was losing enough confidence and noticed a reduction in development speed due to that loss, that I thought it important enough to rebuild the foundation of VRL before we make it available to a bigger audience.
Nothing is perfect, as they say, so I’m sure there’s still bugs to squash and improvements to make, but this PR passes all existing tests and fixes previously failing tests, so it is empirically better, regardless of any further improvements we can make going forward.
Sorry for the surprise. Enjoy reviewing 😃 I left some comments in the PR to hopefully guide reviewers.
TODO
There’s still some things to do before I can merge this PR. This shouldn’t take more than a day, but meanwhile, reviewing of the core changes in this PR can commence already.
closes #3728
closes #4706
closes #5246
closes #5350
closes #5453
closes #5479
closes #5530
closes #5546
closes #5677
closes #5905
closes #5940
closes #6054
closes #6072
closes #6080
closes #6125
closes #6126
closes #6129
closes #6139
closes #6142
closes #6146
closes #6148
closes #6152
closes #6174
closes #6266
closes #6319
closes #6326
ref vectordotdev/vrl#138
ref vectordotdev/vrl#140
ref #5780