-
-
Notifications
You must be signed in to change notification settings - Fork 496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement serde::Serialize
on AST types via #[generate_derive]
#6347
Comments
JS-only AST (as discussed in point 3 above) has been requested by a user: #6284 Personally, I think it's a completely reasonable ask. |
@Boshen we have a contributor (@ottomated - see #6284) keen to work on this. Before he gets going, do you see any problem with my proposal above? |
I spoke to Boshen. He's happy with the direction of this PR. Sounds like @ottomated is ready to get stuck in to implementation. I suggest doing this in phases: 1. Generate
|
Beginning of #6347. Instead of using serde-derive, we generate `Serialize` impls manually. --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: overlookmotel <[email protected]>
Where we're up to#6404 and the smaller PRs that followed it has got us to this stage:
We still use Next stepsIn my opinion the next steps are: 1. Generate TS type defs for
|
@overlookmotel re: RegExp serialization - this could be a pretty big issue. oxc_codegen handles printing regex patterns by accessing the source text, but I don't know of a way to do this from within a Serialize impl. |
Yes, that's not easy within a But personally, I don't think codegen is doing the right thing anyway. It should be converting And that we can do in RegExp parsing was only added fairly recently, so there are still some kinks in how it integrates / doesn't integrate with the rest of Oxc - that's why it's a bit messy at present. |
Part of #6347. Moves typescript logic from derive_estree into a new ast_tools generator.
What's our next steps? |
I've updated comments above to tick off the items which are now complete. I believe next steps are:
|
Step 2 – I'm not sure where the code for that should live. In oxc_codegen, I guess? Or do we want to avoid creating an entire codegen context for serialization, and it should live in oxc_regular_expression? Step 4 – I can get on that, should be pretty quick Step 5 – also should be pretty easy, wonder if any field named |
@ottomated feel free to work on 4 and 5, I can take a look at 2 later. |
Sounds good. Let me know if there's anything else I need to do for the initial npm release before then, I think I got all the CI right but not sure. |
@overlookmotel The cases where enums don't have rename_all = "camelCase" are Type and Kind enums: FunctionType, FormalParameterKind, ClassType, MethodDefinitionType, PropertyDefinitionType, and AccessorPropertyType. |
Ah ha! I hadn't realized there are so many. If they're still in the minority, could we make camelCase the default, and add This isn't so important either way, just a thought. It's just always nice to remove repeated |
I noticed that we still have |
Personally I think it's preferable to make everything explicit in attributes on the AST types, rather than hiding away what's happening in This will be particularly important if we introduce other JSON output "flavors" in future e.g.
You mean the |
Part of #6347 Other changes: - added #[estree(skip)] to thisSpan in TSThisParameter - Flattened the span in BoundaryAssertion (regex)
Those 3 PRs are merged (great!). I agree next step is to add conformance tests. However, I think probably we should start with checking that our AST conforms to So, in my view, it'd be preferable to get the existing AST conforming to standard, and then tackle the "plain JS" serializer after. What do you think? The odd thing looking at One problem we'll run into with comparing to Acorn/ |
…#6755) Part of oxc-project#6347. Moves typescript logic from derive_estree into a new ast_tools generator.
) Part of oxc-project#6347 Other changes: - added #[estree(skip)] to thisSpan in TSThisParameter - Flattened the span in BoundaryAssertion (regex)
Once these are marked with an attribute macro, we can test against acorn by ignoring them and running on pure js only to begin with. No need for a second serializer yet, we can just make the current serializer turn Are the typescript-eslint tests here? |
I'm not sure that approach is ideal, for performance reasons. If But if So then a function which processes We can produce 2 different "flavors" of AST, each of which consistently has TS fields, or doesn't have TS fields. But producing an AST which flips between the two patterns is going to make writing performant code using the AST impossible. So... the question, in my opinion, is which "flavor" of AST do we want to test first? I was proposing the with-TS version, since that's what we already have. If your particular interest is the TS-less AST, and that's what you want to work on, then it's totally legitimate to say you want to do that first. But I do think we'd need a 2nd serializer for that.
Yes, those are the ones I found. But there aren't very many! I was hoping to find something more like this with loads of It seems hard to believe that there isn't a large set of fixtures for If there aren't any, we do have another option. We have all the Test262 and TypeScript fixtures already. We could parse them all with |
Hmm, the monomorphism argument is interesting. However, I think it might be overoptimizing at this point:
const raw_ast = new Uint8Array(...);
const types = ["Program", "Expression", "Literal", ...];
const node = {
get type() {
return types[raw_ast[0]];
}
// ...
} If the optimization you're worried about is JIT rather than transfer speeds, then it would be more effective to serialize to a compact binary format and fully deserialize on the JS side, rather than raw transfer. Looking at acorn, they don't seem to have a lot of tests either. I think Test262 is the way to go. |
You might be right. But I think you also might not be! It may well be that So personally I don't think we can assume necessarily that shorter JSON = faster deserialization. This (overly simple) benchmark seems to hint that my guess may be correct: https://jsbench.me/1sm2tbh3lb/1
The initial implementation of raw transfer will just deserialize the whole AST to JS objects, so it replaces Later on, we'll produce an AST visitor which performs lazy deserialization. I'd imagine that some fields (e.g. It sounds like we're maybe not going to agree on this. I feel quite strongly about consistent object shapes, and it sounds like you are coming from a different direction. So... can we put that aside for a minute and talk about other ways to crack this nut? If what you're really interested in working on is the TS-less AST, let's consider how hard it'd be to build a 2nd serializer, which can then be tested against Acorn's fixtures. I am guessing we'd need to:
Do you think that'd work? If so, do you think it's achievable without a huge amount of effort? |
Wow, very unintuitive result to me but I verified it by parsing It looks like even though the json is a million characters longer, it still deserializes faster. Of course, this doesn't measure FFI transfer or serde's serialization, but it looks like your point is correct. I still believe my point in #6284 about working with the AST objects is valid but that's less of a priority. The thing we absolutely will need to do is marking the fields with
|
Thanks loads for testing out my hunch properly. Unintuitive results are what keeps life interesting! You are right that the extra fields may cost something on serialization side (although also sometimes doing more work unconditionally is cheaper than an unpredictable branch). And maybe larger strings cost more to pass from Rust to JS. On the other hand, I am as sure as I can be that for the consuming code on JS side, consistent object shapes is a definite perf win. Whether all these various factors add up to a gain or loss in total is hard to say, and may depend on the use case, size of ASTs, and number of ASTs processed in a session (do the functions receiving AST nodes get called enough for JS engine to optimize them?). But my guess is that on balance consistent object shapes probably wins.
You're right that 2 serializers would affect binary size. How much I don't know. If we want to reduce that, we could instead use a runtime flag e.g.: struct ESTreeSerializer {
inner: serde_json::ser::Serializer<Vec<u8>>,
include_ts_fields: bool,
}
impl<'s> Serializer for &'s mut ESTreeSerializer {
fn serialize_bool(self, value: bool) -> Result<()> {
self.inner.serialize_bool(value)
}
// ... etc ...
}
impl Serialize for CallExpression {
fn serialize(&self, serializer: &mut ESTreeSerializer) -> Result {
let mut map = serializer.serialize_map(None)?;
// ...
if serializer.include_ts_fields {
map.serialize_entry("typeParameters", &self.type_parameters)?;
}
// ...
}
} Is that what you meant by "global context"? I'm not familiar with SWC's serialization method.
Serde's implementation is designed to be flexible, supporting multiple serialization formats, so not particularly optimized for JSON. So yes, we could very likely produce a more efficient serializer designed specifically for JSON format, and with static knowledge of the object shapes, to condense what is currently many individual writes to the buffer down to a single write for chunks of statically knowable text e.g. This is definitely worth trying, and might be a significant speed-up. But... I think we should leave that until later. I suggest we stick with serde for now - as the old saying goes: get it right first, then optimize.
Can you explain this a bit more please? |
I like the runtime flag more than multiple impls. I think SWC uses a global static cell that configures serialization options. #6284 is more related to typescript DX when doing codegen in JS. If you're constructing an AST node and have to manually specify every single null field, it'll be annoying. We actually should be able to solve this by generating multiple type definition files, i.e. Another idea: we could type the fields like this:
notice the |
OK cool. Let's go with that.
Why on earth do they do that? That's a dreadful idea! How then do you serialize multiple ASTs on different threads simultaneously with different options? Plus the cost of synchronization on every access. Ahem... I'll stop being rude about SWC now...
OK, now I get it. As I think I mentioned, I don't use TypeScript myself, so I forget that this kind of thing is a pain when you're not in wild west anything-goes type-less JS land.
I'm in favour of that. The JS and TS "flavoured" ASTs are different structures, and should have separate types.
I'm less keen on this, but maybe we could do it too if it makes a huge difference to ergonomics. As you say, getting data back into Rust side to run codegen is a way off. We could also at some point introduce utility builder methods like @babel/types. They put the optional params last, and if you leave out those params they default to Of course, this is not ideal from perspective of writing monomorphic code, but it's certainly convenient. |
@overlookmotel Running into a few issues trying to take the approach we talked about. The let mut map = serializer.serialize_map(None)?;
map.serialize_entry("language", &self.language)?;
// ^ the trait bound `source_type::Language: Serialize` is not satisfied There might be a way to make this work within serde's infrastructure, but it's beyond me if so. |
Damn it! I thought I did this before, but now I realize it was rkyv that I built a custom serializer for, not I got some of the way here. The sticking points are:
I can see a few possible ways forwards:
To expand a bit on these options, as I see them: 1. ReimplementA good first step would be to get rid of self.span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?; with: map.serialize_entry("start", &self.span.start)?;
map.serialize_entry("end", &self.span.end)?; Then we'd expand on my starting point (or probably you did much the same). We'd need to add a 2. Fork
|
On balance, I think my preference is option 3 (pass state in values). This would be a real chore if we had to write it by hand, but luckily we can generate all the wrappers with the codegen, and it's the least intrusive. |
I think I would lean towards option 1, which seems like the least amount of work for something that won't impact performance. 2 seems like a lot of work, 3 feels ugly to me, and 4 is I think what swc does actually use. Alternatively, option 5 is also attractive. It's the kind of thing that could start as simple code and be incrementally optimized. Deserialization could even still use serde — it's not like we've implemented any Deserialize traits yet. |
Harder than it sounds because of BindingPatternKind (an enum) being flattened. |
(3) is indeed ugly, but I thought it'd be the easiest to do. I'm not a fan of using thread local storage - it has a cost every time you access it. But, beyond that, whichever method you prefer. If you do want to go for option (5) and remove Yes |
We may need to get other external state (source text) into the serializer - #7211 (comment). Option (3) was not so bad when |
We have more work to do on this. I'll break out the remaining tasks into separate issues, but re-opening for now. |
We currently use
serde
's derive macros to implementSerialize
on AST types.We could use
#[generate_derive]
to generate these impls instead.Why is that a good thing?
1. Reduce compile time
serde
's macro is pretty expensive at compile time for the NAPI build. We can remove it.2. Reduce boilerplate
serde
's derive macro is less powerful thanast_tools
. BecauseSerialize
is a macro, all it knows about is the type that#[derive(Serialize)]
is on. Whereasast_tools
builds a schema of the entire AST, so it knows not just about the type it's deriving impl for, but also all the other types too, and how they link to each other.Currently we have to put
#[serde]
attributes everywhere:Instead, we can use
ast_tools
in 2 ways to remove this boilerplate:ast_tools
's knowledge of the whole AST to move the instruction to flattenSpan
ontoSpan
type itself. "flatten this" instruction does not need to be repeated on every type that containsSpan
.I think this is an improvement. How types are serialized is not core to the function of the AST. I don't see moving the serialization logic elsewhere as "hiding it away", but rather a nice separation of concerns.
3. Open the door to different serializations
In example above
Serialize
has been replaced byESTree
. This is to allow for different serialization methods in future. For example:Different serializers for plain JS AST and TS AST
When serializing a plain JS file, could produce JSON which skips all the TS fields, to make an AST which exactly aligns with canonical ESTree. We'd add
#[ts]
attribute to all TS-related fields, andESTreeJS
serializer would skip those fields. This would make the AST faster to deserialize on JS side.The other advantage is the TS-less AST should perfectly match classic ESTree, so we can test it in full using Acorn's test suite.
Users who are not interested in type info can also request the cheaper JS-only AST, even when parsing TS code.
Serialize to other AST variants
e.g.
#[generate_derive(Babel)]
to serialize to a Babel-compatible JSON AST.Not sure if this is useful, but this change makes it a possibility if we want to.
4. Simplify implementation of custom serialization
Currently we have pretty complex custom
Serialize
impls for massaging Oxc's AST into ESTree-compatible shape in oxc_ast/src/serialize.rs.We can remove most of them if we use
ast_tools
to generateSerialize
impls for us, guiding it with attributes on the AST types themselves:5. Simply AST transfer code
AST transfer's JS-side deserializer (and eventually serializer too) can be simplified in same way, generating code for JS-side deserializer which matches the Rust-side one exactly, without writing the same logic twice and having to keep them in sync.
6. TS type generation
What "massaging" of the Rust AST we do to turn it into an ESTree-compatible JSON AST is now encoded as static attributes. We can use this to generate TS types, and we can get rid of
Tsify
.How difficult is this?
serde
's derive macro looks forbiddingly complex. But this is because it handles every conceivable case, almost all of which we don't use. The output it generates for our AST types is actually not so complicated.So creating a codegen for
impl Serialize
I don't think would be too difficult.The text was updated successfully, but these errors were encountered: