Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement serde::Serialize on AST types via #[generate_derive] #6347

Open
overlookmotel opened this issue Oct 4, 2024 · 33 comments
Open

Implement serde::Serialize on AST types via #[generate_derive] #6347

overlookmotel opened this issue Oct 4, 2024 · 33 comments
Assignees
Milestone

Comments

@overlookmotel
Copy link
Contributor

We currently use serde's derive macros to implement Serialize on AST types.

We could use #[generate_derive] to generate these impls instead.

Why is that a good thing?

1. Reduce compile time

serde's macro is pretty expensive at compile time for the NAPI build. We can remove it.

2. Reduce boilerplate

serde's derive macro is less powerful than ast_tools. Because Serialize is a macro, all it knows about is the type that #[derive(Serialize)] is on. Whereas ast_tools builds a schema of the entire AST, so it knows not just about the type it's deriving impl for, but also all the other types too, and how they link to each other.

Currently we have to put #[serde] attributes everywhere:

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[serde(tag = "type")]
pub struct ClassBody<'a> {
    #[serde(flatten)]
    pub span: Span,
    pub body: Vec<'a, ClassElement<'a>>,
}

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[serde(tag = "type")]
pub struct PrivateIdentifier<'a> {
    #[serde(flatten)]
    pub span: Span,
    pub name: Atom<'a>,
}

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
pub struct Span {
    start: u32,
    end: u32,
}

Instead, we can use ast_tools in 2 ways to remove this boilerplate:

  1. Make things that we implement on every type the defaults, so they don't need to be stated over and over.
  2. Use ast_tools's knowledge of the whole AST to move the instruction to flatten Span onto Span type itself. "flatten this" instruction does not need to be repeated on every type that contains Span.
#[ast]
#[generate_derive(ESTree)]
pub struct ClassBody<'a> { // <-- no `#[serde(tag = "type")]` attr
    pub span: Span, // <-- no `#[serde(flatten)]` attr
    pub body: Vec<'a, ClassElement<'a>>,
}

#[ast]
#[generate_derive(ESTree)]
pub struct PrivateIdentifier<'a> { // <-- no `#[serde(tag = "type")]` attr
    pub span: Span, // <-- no `#[serde(flatten)]` attr
    pub name: Atom<'a>,
}

#[ast]
#[generate_derive(ESTree)]
#[estree(flatten)] // <-- `flatten` is here now
pub struct Span {
    start: u32,
    end: u32,
}

I think this is an improvement. How types are serialized is not core to the function of the AST. I don't see moving the serialization logic elsewhere as "hiding it away", but rather a nice separation of concerns.

3. Open the door to different serializations

In example above Serialize has been replaced by ESTree. This is to allow for different serialization methods in future. For example:

Different serializers for plain JS AST and TS AST

When serializing a plain JS file, could produce JSON which skips all the TS fields, to make an AST which exactly aligns with canonical ESTree. We'd add #[ts] attribute to all TS-related fields, and ESTreeJS serializer would skip those fields. This would make the AST faster to deserialize on JS side.

The other advantage is the TS-less AST should perfectly match classic ESTree, so we can test it in full using Acorn's test suite.

Users who are not interested in type info can also request the cheaper JS-only AST, even when parsing TS code.

Serialize to other AST variants

e.g. #[generate_derive(Babel)] to serialize to a Babel-compatible JSON AST.

const {program} = parse(code, {flavor: 'babel'});

Not sure if this is useful, but this change makes it a possibility if we want to.

4. Simplify implementation of custom serialization

Currently we have pretty complex custom Serialize impls for massaging Oxc's AST into ESTree-compatible shape in oxc_ast/src/serialize.rs.

We can remove most of them if we use ast_tools to generate Serialize impls for us, guiding it with attributes on the AST types themselves:

#[ast]
#[generate_derive(ESTree)]
pub struct ObjectPattern<'a> {
    pub span: Span,
    pub properties: Vec<'a, BindingProperty<'a>>,
    #[estree(append_to_previous)]
    pub rest: Option<Box<'a, BindingRestElement<'a>>>,
}

5. Simply AST transfer code

AST transfer's JS-side deserializer (and eventually serializer too) can be simplified in same way, generating code for JS-side deserializer which matches the Rust-side one exactly, without writing the same logic twice and having to keep them in sync.

6. TS type generation

What "massaging" of the Rust AST we do to turn it into an ESTree-compatible JSON AST is now encoded as static attributes. We can use this to generate TS types, and we can get rid of Tsify.

How difficult is this?

serde's derive macro looks forbiddingly complex. But this is because it handles every conceivable case, almost all of which we don't use. The output it generates for our AST types is actually not so complicated.

So creating a codegen for impl Serialize I don't think would be too difficult.

@overlookmotel
Copy link
Contributor Author

JS-only AST (as discussed in point 3 above) has been requested by a user: #6284

Personally, I think it's a completely reasonable ask.

@overlookmotel overlookmotel transferred this issue from oxc-project/backlog Oct 7, 2024
@overlookmotel
Copy link
Contributor Author

@Boshen we have a contributor (@ottomated - see #6284) keen to work on this. Before he gets going, do you see any problem with my proposal above?

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 8, 2024

I spoke to Boshen. He's happy with the direction of this PR. Sounds like @ottomated is ready to get stuck in to implementation.

I suggest doing this in phases:

1. Generate Serialize impls with oxc_ast_tools

Replace:

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize))]
#[serde(tag = "type", rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

with:

#[ast]
#[generate_derive(Serialize)]
#[serde(tag = "type", rename = "RestElement")]
struct AssignmentTargetRest {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}
  • Leave the custom Serialize impls alone.
  • Don't rename Serialize to ESTree yet.
  • Don't change #[serde] attributes.
  • Leave Tsify alone for now.
  • We should then be able to disable derive feature on serde dependency.

2. Remove #[serde] attrs boilerplate

#[ast]
#[generate_derive(Serialize)]
#[serde(rename = "RestElement")] // <-- `tag = "type"` removed
struct AssignmentTargetRest {
    // `#[serde(flatten)]` removed - `#[serde(flatten)]` on `Span` struct instead
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

Handle these in oxc_ast_tools codegen instead.

3. Replace Tsify

4. Rename trait to ESTree

#[ast]
#[generate_derive(ESTree)]
#[estree(rename = "RestElement")]
struct AssignmentTargetRest {
    pub span: Span,
    #[estree(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

I'm actually not quite sure how to do this, while still using serde::Serialize under the hood.

5. Remove the custom Serialize impls

#[ast]
#[generate_derive(ESTree)]
pub struct ObjectPattern<'a> {
    pub span: Span,
    pub properties: Vec<'a, BindingProperty<'a>>,
    #[estree(append_to_previous)]
    pub rest: Option<Box<'a, BindingRestElement<'a>>>,
}

This is the tricky/interesting part. The idea is create a kind of domain-specific language (DSL) to cover the various transformations needed to go from Rust AST to JS ESTree AST. That DSL is the #[estree(...)] attributes.

The advantage of a DSL which is static is that we can generate multiple things from it:

  1. Serialize impls.
  2. Deserialize impls (so we can provide an oxc-codegen NPM package).
  3. TS type defs.
  4. "raw" transfer serializer/deserializer.

I'm not completely sure how far we can get with the DSL approach. "append to previous" is a pattern that's used in several types, so it makes sense to make an #[estree(append_to_previous)] attr for it.

But for odd transforms which are only used in one place, we may prefer something like this:

#[ast]
#[generate_derive(ESTree)]
#[estree(via(MyTypeShim))]
pub struct MyType {
    one: u32,
    two: u32,
}

struct MyTypeShim {
    sum: u32,
}

impl From<&MyType> for MyTypeShim {
    fn from(mt: &MyType) -> Self {
        MyTypeShim { sum: mt.one + mt.two }
    }
}

#[estree(via(...))] is analogous to #[serde(from)] and #[serde(into)]. But I'm hoping we can use just 1 "intermediary" type to go in both directions.

Is this a good plan?

The above "mini-roadmap" is a suggestion rather than a list of demands! Am totally open to different ways to split up the work.

But I do think we should split it up into multiple steps somehow, because (a) smaller PRs are easier to review and (b) if the effort doesn't reach the finish line, we'll at least get part of the way, and others can continue it later on.

overlookmotel added a commit that referenced this issue Oct 19, 2024
Beginning of #6347. Instead of using serde-derive, we generate
`Serialize` impls manually.

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: overlookmotel <[email protected]>
@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 21, 2024

Where we're up to

#6404 and the smaller PRs that followed it has got us to this stage:

  • Serialize impls are now codegen-ed by oxc_ast_tools (phase 1 on "roadmap" above)
  • Ditto TS types (phase 3 on "roadmap" above)

derive_estree.rs

We still use #[derive(Serialize)] on a few custom Serialize impls in serialize.rs. Tsify is completely gone.

Next steps

In my opinion the next steps are:

1. Generate TS type defs for oxc-parser package

  • Completed in feat(ast_tools): output typescript to a separate package #6755.

  • Generate all the types as a single .d.ts file.

  • Include it in oxc-parser package (not sure how to combine them with the types generated by napi-rs).

  • Continue to use wasm-bindgen for WASM. But replace all the short const TS_APPEND_CONTENT statements with one giant one including all the types as a single string.

The reason I think we should do this first is it'd be great to get all the type defs checked into git as a single file, so we'll notice if the types mistakenly get changed during further work.

2. Fix serialization of RegExpLiteral

Currently JSON AST for RegExpLiteral contains the entire parsed regexp Pattern. This is a huge deviation from ESTree, and the serialization of RegExps is generally a mess.

JSON AST should just contain strings for pattern and flags, as ESTree does.

We can remove the EmptyObject hack. That type only exists to produce a value field in the JSON AST, and is otherwise a pointless annoyance!

3. Improve TS type defs

Previously, type defs were in this style:

export interface BooleanLiteral extends Span {
    type: "BooleanLiteral";
    value: boolean;
}

Now they're like this:

export type BooleanLiteral = ({
    type: 'BooleanLiteral';
    value: boolean;
}) & Span;

I am no TypeScript expert, but I understand from Boshen that the two are almost equivalent, but that there is a slight difference - the interface style gives nicer error messages.

Our ts types came from typescript-eslint, I would model them as such https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/ast-spec/src/expression/ArrayExpression/spec.ts

Are we able to go back to interface?

4. Clean up #[estree] attrs

  • Remove #[estree(rename_all = "camelCase")] from enums. Just make camelCase the default. It only seems to be present on fieldless enums.
  • Remove #[estree(untagged)]. I think we can decide whether to tag based on if enum is fieldless or not?
  • Replace #[tsify(type = "...")] with #[estree(type(...))] (or #[estree(type = "...")]). That's just a renaming thing.

5. Reduce #[estree(flatten)] boilerplate

See "phase 2" in previous comment.

Primarily I'm talking about Span here. It'd be great not to need #[estree(flatten)] on every single span: Span field.

Note: TSThisParameter has a this_span: Span field which should not be flattened. typescript-eslint doesn't include a span for this, so we can just skip serializing that field, rather than needing an #[estree(no_flatten)] workaround.

6. DSL

🤷 Open to suggestions on how to approach this one!

@ottomated
Copy link
Contributor

@overlookmotel re: RegExp serialization - this could be a pretty big issue. oxc_codegen handles printing regex patterns by accessing the source text, but I don't know of a way to do this from within a Serialize impl.
image

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 23, 2024

Yes, that's not easy within a Serialize impl. It's possible to do with a custom Serializer which holds a reference to the source text, but it'd be a pain.

But personally, I don't think codegen is doing the right thing anyway. It should be converting RegExpPattern back to a string with pattern.to_string(). Ditto flags.to_string().

And that we can do in serialize.

RegExp parsing was only added fairly recently, so there are still some kinks in how it integrates / doesn't integrate with the rest of Oxc - that's why it's a bit messy at present.

Boshen pushed a commit that referenced this issue Oct 24, 2024
Part of #6347.

Moves typescript logic from derive_estree into a new ast_tools generator.
@Boshen
Copy link
Member

Boshen commented Oct 25, 2024

What's our next steps?

@overlookmotel
Copy link
Contributor Author

I've updated comments above to tick off the items which are now complete. I believe next steps are:

@ottomated
Copy link
Contributor

Step 2 – I'm not sure where the code for that should live. In oxc_codegen, I guess? Or do we want to avoid creating an entire codegen context for serialization, and it should live in oxc_regular_expression?

Step 4 – I can get on that, should be pretty quick

Step 5 – also should be pretty easy, wonder if any field named span: Span should just be hardcoded.

@Boshen
Copy link
Member

Boshen commented Oct 26, 2024

@ottomated feel free to work on 4 and 5, I can take a look at 2 later.

@ottomated
Copy link
Contributor

@ottomated feel free to work on 4 and 5, I can take a look at 2 later.

Sounds good. Let me know if there's anything else I need to do for the initial npm release before then, I think I got all the CI right but not sure.

@ottomated
Copy link
Contributor

@overlookmotel The cases where enums don't have rename_all = "camelCase" are Type and Kind enums: FunctionType, FormalParameterKind, ClassType, MethodDefinitionType, PropertyDefinitionType, and AccessorPropertyType.

@overlookmotel
Copy link
Contributor Author

The cases where enums don't have rename_all = "camelCase" are Type and Kind enums

Ah ha! I hadn't realized there are so many. If they're still in the minority, could we make camelCase the default, and add #[estree(no_rename_variants)] to these odd ones?

This isn't so important either way, just a thought. It's just always nice to remove repeated #[estree] boilerplate where we can.

@Boshen
Copy link
Member

Boshen commented Oct 26, 2024

I noticed that we still have tsify, we can try and get rid of it.

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 26, 2024

Step 5 – also should be pretty easy, wonder if any field named span: Span should just be hardcoded.

Personally I think it's preferable to make everything explicit in attributes on the AST types, rather than hiding away what's happening in ast_tools. ast_tools contains the implementation of generating Serialize impls, but how it does that is guided by the #[estree] attrs. Ideally we don't want anyone reading the code to have to delve into ast_tools. We should (at some point) document what all the #[estree] attributes mean, and just reading those docs should be sufficient to understand.

This will be particularly important if we introduce other JSON output "flavors" in future e.g. #[estree_js] (i.e. without TS fields) or #[babel]. At that point, being able to read the AST types and see "oh in ESTree this type is flattened, but in Babel it's not" will be helpful.

I noticed that we still have tsify, we can try and get rid of it.

You mean the #[tsify] attrs? tsify crate has been fully removed, and ast_tools now processes those attrs. It's included in "step 4" above to rename them to #[estree] attrs. At present the word "tsify" in these attrs is misleading.

@ottomated
Copy link
Contributor

Steps 4 and 5 should be completed by #6933, #6934, and #6935. I think the next step should be some kind of testbed for identifying differences between our serialization and acorn's, to help identify what features the DSL needs.

overlookmotel pushed a commit that referenced this issue Oct 28, 2024
Part of #6347

Other changes:
- added #[estree(skip)] to thisSpan in TSThisParameter
- Flattened the span in BoundaryAssertion (regex)
@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 28, 2024

Those 3 PRs are merged (great!). I agree next step is to add conformance tests.

However, I think probably we should start with checking that our AST conforms to typescript-eslint, rather than Acorn. Obviously our AST won't conform to Acorn's because we have all the additional TS fields. And to make it so, we're going to have to mark all the TS fields with #[ts] and build a 2nd serializer which skips them.

So, in my view, it'd be preferable to get the existing AST conforming to standard, and then tackle the "plain JS" serializer after. What do you think?

The odd thing looking at typescript-eslint is that I couldn't find a lot of tests. I was expecting to find tons of test fixtures like e.g. Babel has. Any idea where they're hiding?

One problem we'll run into with comparing to Acorn/typescript-eslint is that our spans are counted in UTF-8 bytes, instead of UTF-16 characters. Probably this will affect very few tests, as the two are equivalent when source text is all ASCII, which is the common case. But it's something we'll need to tackle to pass all tests (related: #959).

Orenbek pushed a commit to Orenbek/oxc that referenced this issue Oct 28, 2024
…#6755)

Part of oxc-project#6347.

Moves typescript logic from derive_estree into a new ast_tools generator.
Orenbek pushed a commit to Orenbek/oxc that referenced this issue Oct 28, 2024
)

Part of oxc-project#6347

Other changes:
- added #[estree(skip)] to thisSpan in TSThisParameter
- Flattened the span in BoundaryAssertion (regex)
@ottomated
Copy link
Contributor

Obviously our AST won't conform to Acorn's because we have all the additional TS fields.

Once these are marked with an attribute macro, we can test against acorn by ignoring them and running on pure js only to begin with. No need for a second serializer yet, we can just make the current serializer turn None into undefined instead of null, which should be implemented anyway.

Are the typescript-eslint tests here?

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 28, 2024

Once these are marked with an attribute macro, we can test against acorn by ignoring them and running on pure js only to begin with. No need for a second serializer yet, we can just make the current serializer turn None into undefined instead of null, which should be implemented anyway.

I'm not sure that approach is ideal, for performance reasons.

If None fields are null in JSON AST, then every object representing an AST node of a certain kind always has the same "shape" i.e. the same fields, in same order.

But if None translates to undefined, that has to mean that field is not present at all in the object when it's None, because JSON does not support {"foo":undefined}. JSON.stringify({foo: undefined}) === '{}'. So then the shape of those objects is variable. For example, a CallExpression will sometimes have a typeParameters field, and sometimes it won't. As far as JS engine is concerned, those are 2 completely different "shapes".

So then a function which processes CallExpressions will not be monomorphic, because it can get called with 2 different "shapes". This impedes the JS engine's ability to optimize that function, which can have quite a large negative effect on performance. https://mathiasbynens.be/notes/shapes-ics

We can produce 2 different "flavors" of AST, each of which consistently has TS fields, or doesn't have TS fields. But producing an AST which flips between the two patterns is going to make writing performant code using the AST impossible.

So... the question, in my opinion, is which "flavor" of AST do we want to test first? I was proposing the with-TS version, since that's what we already have.

If your particular interest is the TS-less AST, and that's what you want to work on, then it's totally legitimate to say you want to do that first. But I do think we'd need a 2nd serializer for that.

Are the typescript-eslint tests here?

Yes, those are the ones I found. But there aren't very many!

I was hoping to find something more like this with loads of .js files and a corresponding .json file containing the parsed AST for each: https://github.com/babel/babel/tree/main/packages/babel-parser/test/fixtures

It seems hard to believe that there isn't a large set of fixtures for typescript-eslint somewhere, but I just can't locate them!

If there aren't any, we do have another option. We have all the Test262 and TypeScript fixtures already. We could parse them all with typescript-eslint and dump the ASTs to JSON files. Then we run Oxc's parser on all the same fixtures, serialize each to JSON using our serializer, and diff against the typescript-eslint JSON ASTs. If our AST matches typescript-eslint's, then they should all be identical.

@ottomated
Copy link
Contributor

Hmm, the monomorphism argument is interesting. However, I think it might be overoptimizing at this point:

  • Right now, while we're using JSON, serializing these fields as undefined saves a LOT of deserialization time and FFI overhead by making the JSON smaller
  • When we're at the raw transfer stage, and the AST uses either Proxies or getters, the point is moot because we're not passing raw objects into functions (I believe). We probably won't get any optimization from JIT at all. The internals might look something like this:
const raw_ast = new Uint8Array(...);
const types = ["Program", "Expression", "Literal", ...];
const node = {
  get type() {
    return types[raw_ast[0]];
  }
  // ...
}

If the optimization you're worried about is JIT rather than transfer speeds, then it would be more effective to serialize to a compact binary format and fully deserialize on the JS side, rather than raw transfer.

Looking at acorn, they don't seem to have a lot of tests either. I think Test262 is the way to go.

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 28, 2024

Right now, while we're using JSON, serializing these fields as undefined saves a LOT of deserialization time and FFI overhead by making the JSON smaller

You might be right. But I think you also might not be! It may well be that JSON.parse also benefits from predictable object shapes. I don't know the details of V8's internals well enough to say for sure either way. But, while working on the initial POC impl of raw transfer (#2457), I found that consistent object shapes had a very significant effect - removing a single ... spread operation from the deserializer code yielded a 40% speed-up.

So personally I don't think we can assume necessarily that shorter JSON = faster deserialization.

This (overly simple) benchmark seems to hint that my guess may be correct: https://jsbench.me/1sm2tbh3lb/1

JSON.parse on the JSON with unpredicable object shapes (3rd test case) is 27% slower than the first two. Even the one where the JSON string is shorter (4th case) is still 19% slower.

bench2  

When we're at the raw transfer stage...

The initial implementation of raw transfer will just deserialize the whole AST to JS objects, so it replaces JSON.parse, but should produce identical output to current JSON-based serializer. So essentially it is a compact binary format, just using Rust's native type layouts as the binary format instead of e.g. protobuff.

Later on, we'll produce an AST visitor which performs lazy deserialization. I'd imagine that some fields (e.g. type) would be eagerly deserialized, and only more expensive fields (other nodes) would be behind getters. Either way, as you say, at that point the issue goes away - it will always produce objects with the same shape, and JIT optimizer will get to work at least some of its magic. But all of that is some way off, and raw transfer will be introduced initially behind an "experimental" flag, with the current JSON-based scheme still the default. So current JSON-based scheme will be around for some time, and I'm not keen to hobble its performance.

It sounds like we're maybe not going to agree on this. I feel quite strongly about consistent object shapes, and it sounds like you are coming from a different direction.

So... can we put that aside for a minute and talk about other ways to crack this nut?

If what you're really interested in working on is the TS-less AST, let's consider how hard it'd be to build a 2nd serializer, which can then be tested against Acorn's fixtures. I am guessing we'd need to:

  1. Mark all TS-related fields #[ts].
  2. Create 2 custom serializers ESTreeJsSerializer and ESTreeTsSerializer which both implement serde::Serializer. They'd both be just thin wrappers around serde's original serializer.
  3. Generate for all AST types both impl Serialize for T { fn serialize(&self, serializer: ESTreeTsSerializer) { ... } } and impl Serialize for T { fn serialize(&self, serializer: ESTreeJsSerializer) { ... } }. The 2nd would skip serializing fields marked #[ts].

Do you think that'd work? If so, do you think it's achievable without a huge amount of effort?

@ottomated
Copy link
Contributor

Wow, very unintuitive result to me but I verified it by parsing checker.ts, naively replacing all null fields with undefined, and benchmarking those
image

It looks like even though the json is a million characters longer, it still deserializes faster. Of course, this doesn't measure FFI transfer or serde's serialization, but it looks like your point is correct. I still believe my point in #6284 about working with the AST objects is valid but that's less of a priority.

The thing we absolutely will need to do is marking the fields with #[ts]. After that, not sure what the best way to have multiple serialization options is. Feels like having double the impls would increase binary size by quite a lot. Other possibilities:

  • Some kind of global context like how SWC does it
  • Custom json serializer (shouldn't be that hard, right? The worst part is probably what we've already done with manually generating the Serialize impls)

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 29, 2024

Thanks loads for testing out my hunch properly. Unintuitive results are what keeps life interesting!

You are right that the extra fields may cost something on serialization side (although also sometimes doing more work unconditionally is cheaper than an unpredictable branch). And maybe larger strings cost more to pass from Rust to JS. On the other hand, I am as sure as I can be that for the consuming code on JS side, consistent object shapes is a definite perf win. Whether all these various factors add up to a gain or loss in total is hard to say, and may depend on the use case, size of ASTs, and number of ASTs processed in a session (do the functions receiving AST nodes get called enough for JS engine to optimize them?). But my guess is that on balance consistent object shapes probably wins.

Some kind of global context like how SWC does it

You're right that 2 serializers would affect binary size. How much I don't know. If we want to reduce that, we could instead use a runtime flag e.g.:

struct ESTreeSerializer {
    inner: serde_json::ser::Serializer<Vec<u8>>,
    include_ts_fields: bool,
}

impl<'s> Serializer for &'s mut ESTreeSerializer {
    fn serialize_bool(self, value: bool) -> Result<()> {
        self.inner.serialize_bool(value)
    }
    // ... etc ...
}

impl Serialize for CallExpression {
    fn serialize(&self, serializer: &mut ESTreeSerializer) -> Result {
        let mut map = serializer.serialize_map(None)?;
        // ...
        if serializer.include_ts_fields {
            map.serialize_entry("typeParameters", &self.type_parameters)?;
        }
        // ...
    }
}

Is that what you meant by "global context"? I'm not familiar with SWC's serialization method.

Custom json serializer (shouldn't be that hard, right?)

Serde's implementation is designed to be flexible, supporting multiple serialization formats, so not particularly optimized for JSON. So yes, we could very likely produce a more efficient serializer designed specifically for JSON format, and with static knowledge of the object shapes, to condense what is currently many individual writes to the buffer down to a single write for chunks of statically knowable text e.g. {"type":"CallExpression","start":. And yes, maybe it's not so difficult - you already did the hardest part.

This is definitely worth trying, and might be a significant speed-up. But... I think we should leave that until later. I suggest we stick with serde for now - as the old saying goes: get it right first, then optimize.

I still believe my point in #6284 about working with the AST objects is valid but that's less of a priority.

Can you explain this a bit more please?

@ottomated
Copy link
Contributor

I like the runtime flag more than multiple impls. I think SWC uses a global static cell that configures serialization options.

#6284 is more related to typescript DX when doing codegen in JS. If you're constructing an AST node and have to manually specify every single null field, it'll be annoying. We actually should be able to solve this by generating multiple type definition files, i.e. ast-js.d.ts and ast-ts.d.ts.

Another idea: we could type the fields like this:

interface Foo {
  typeAnnotation?: TypeAnnotation | null;
}

notice the ?: - we would still send null over the wire but if the consumer is building the nodes they wouldn't have to provide every field. Would take a bit more effort during deserialization maybe, but that's way later.

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Oct 30, 2024

I like the runtime flag more than multiple impls.

OK cool. Let's go with that.

I think SWC uses a global static cell that configures serialization options.

Why on earth do they do that? That's a dreadful idea! How then do you serialize multiple ASTs on different threads simultaneously with different options? Plus the cost of synchronization on every access. Ahem... I'll stop being rude about SWC now...

#6284 is more related to typescript DX when doing codegen in JS.

OK, now I get it. As I think I mentioned, I don't use TypeScript myself, so I forget that this kind of thing is a pain when you're not in wild west anything-goes type-less JS land.

We actually should be able to solve this by generating multiple type definition files, i.e. ast-js.d.ts and ast-ts.d.ts.

I'm in favour of that. The JS and TS "flavoured" ASTs are different structures, and should have separate types.

typeAnnotation?: TypeAnnotation | null;

I'm less keen on this, but maybe we could do it too if it makes a huge difference to ergonomics. As you say, getting data back into Rust side to run codegen is a way off.

We could also at some point introduce utility builder methods like @babel/types. They put the optional params last, and if you leave out those params they default to null/false. e.g. t.arrayPattern(elements) returns {type: "ArrayPattern", elements, decorators: [], optional: false, typeAnnotation: null}.

Of course, this is not ideal from perspective of writing monomorphic code, but it's certainly convenient.

@ottomated
Copy link
Contributor

@overlookmotel Running into a few issues trying to take the approach we talked about. The Serialize trait requires the serialize function to take a type parameter, making it impossible to restrict the type of serializer to &mut ESTreeSerializer. If I make a custom ESTreeSerialize trait, we run into issues with nested serialization:

  let mut map = serializer.serialize_map(None)?;
  map.serialize_entry("language", &self.language)?;
  // ^ the trait bound `source_type::Language: Serialize` is not satisfied

There might be a way to make this work within serde's infrastructure, but it's beyond me if so.

@overlookmotel
Copy link
Contributor Author

overlookmotel commented Nov 2, 2024

Damn it! I thought I did this before, but now I realize it was rkyv that I built a custom serializer for, not serde. rykv helpfully defines their Serialize trait as trait Serialize<S: Serializer> which allows this kind of thing.

I got some of the way here.

The sticking points are:

  1. FlatMapSerializer
  2. SerializerMap::serialize_entry requiring the value param to implement Serialize.

I can see a few possible ways forwards:

  1. Stay within serde's infrastructure. Reimplement SerializerMap + get rid of FlatMapSerializer.
  2. Fork serde_json and alter it to make the Serializer have state.
  3. Pass state in values something like serde-serialize-seed.
  4. Store state in thread local storage
  5. Ditch serde altogether!
  6. Something else?

To expand a bit on these options, as I see them:

1. Reimplement

A good first step would be to get rid of FlatMapSerializer. This is mostly used for Span, and because our codegen already knows what Span's field are, we could replace this:

self.span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?;

with:

map.serialize_entry("start", &self.span.start)?;
map.serialize_entry("end", &self.span.end)?;

Then we'd expand on my starting point (or probably you did much the same). We'd need to add a WrappedSerializerMap, and probably implement WrappedSerialize on Option, and some other bits.

2. Fork serde_json

Maybe this isn't so bad. We could use existing serde_json::ser::Formatter which contains some of the harder stuff, rather than forking it.

Maybe this option ends up quite similar to option (1).

3. Pass state in values

Smuggle state into serialize methods in values instead of storing it in the Serializer.

Generate wrapper types for all AST types and implement Serialize on them:

#[derive(Clone, Copy)]
struct State { include_ts_fields: bool }

struct UnaryExpressionWithState<'a, 'b>(&'b UnaryExpression<'a>, State);
struct ExpressionWithState<'a, 'b>(&'b Expression<'a>, State);
struct UnaryOperatorWithState<'b>(&'b UnaryOperator, State);
struct SpanWithState<'b>(&'b Span, State);

impl<'a, 'b> Serialize for UnaryExpressionWithState<'a, 'b> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let value = self.0;
        let state = self.1;
        let mut map = serializer.serialize_map(None)?;
        map.serialize_entry("type", "UnaryExpression")?;
        let span = SpanWithState(&value.span, state);
        span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?;
        let operator = UnaryOperatorWithState(&value.operator, state);
        map.serialize_entry("operator", &operator)?;
        let argument = ExpressionWithState(&value.argument, state);
        map.serialize_entry("argument", &argument)?;
        if state.include_ts_fields {
            // Different logic for TS
        }
        map.end()
    }
}

impl<'a, 'b> Serialize for ExpressionWithState<'a, 'b> { /* ... */ }
impl<'b> Serialize for UnaryOperatorWithState<'b> { /* ... */ }
impl<'b> Serialize for SpanWithState<'b> { /* ... */ }

Advantage: No need to reimplement any of serde_json.

4. Store state in thread local storage

Similar to what you said SWC does, but using thread local storage rather than a global static.

5. Ditch serde

Completely reimplement serde. As you said above, maybe this isn't so difficult. And probably we can make it more efficient anyway by combining a bunch of separate string pushes into one. So it's tempting.

However, I can see a large downside: Eventually we'll want to implement Deserialize on types too (for getting AST back into Rust from JS), and that's a lot more complex than Serialize. Not having serde's help with that might be pretty daunting.

6. Something else?

serde_state crate looks interesting, though appears unmaintained.

@overlookmotel
Copy link
Contributor Author

On balance, I think my preference is option 3 (pass state in values). This would be a real chore if we had to write it by hand, but luckily we can generate all the wrappers with the codegen, and it's the least intrusive.

@ottomated
Copy link
Contributor

I think I would lean towards option 1, which seems like the least amount of work for something that won't impact performance. 2 seems like a lot of work, 3 feels ugly to me, and 4 is I think what swc does actually use.

Alternatively, option 5 is also attractive. It's the kind of thing that could start as simple code and be incrementally optimized. Deserialization could even still use serde — it's not like we've implemented any Deserialize traits yet.

@ottomated
Copy link
Contributor

A good first step would be to get rid of FlatMapSerializer

Harder than it sounds because of BindingPatternKind (an enum) being flattened.

@overlookmotel
Copy link
Contributor Author

(3) is indeed ugly, but I thought it'd be the easiest to do.

I'm not a fan of using thread local storage - it has a cost every time you access it. But, beyond that, whichever method you prefer.

If you do want to go for option (5) and remove serde_json entirely, then go for it. It'd probably be good to use CodeBuffer as the string builder, which is fairly well-optimized (and I intend to optimize it further in future). If you want to do that, please move CodeBuffer into oxc_data_structures crate (in a separate PR).

Yes BindingPatternKind is a pain! But maybe we could at least simplify flattening structs. That'd cover Span, which is the majority of cases, and would avoid having to deal with BindingPatternKind.

overlookmotel pushed a commit that referenced this issue Nov 6, 2024
…zation code (#7149)

Removed custom logic for ObjectPattern, ArrayPattern, etc. in favor of a custom attribute macro `#[estree(append_to = "foo")]` as part of #6347
@overlookmotel
Copy link
Contributor Author

We may need to get other external state (source text) into the serializer - #7211 (comment). Option (3) was not so bad when State was just a single bool, but looks much less attractive if we also need it to contain a &str (16 bytes).

@Boshen Boshen closed this as completed Dec 1, 2024
@overlookmotel
Copy link
Contributor Author

We have more work to do on this. I'll break out the remaining tasks into separate issues, but re-opening for now.

@overlookmotel overlookmotel reopened this Dec 2, 2024
@Boshen Boshen added this to the estree milestone Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants