Skip to content
This repository has been archived by the owner on Jan 21, 2023. It is now read-only.

Box most NodeData variants #74

Closed
wants to merge 1 commit into from
Closed

Conversation

jyn514
Copy link
Contributor

@jyn514 jyn514 commented Jul 1, 2020

This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).

Closes #73

This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).
@SimonSapin
Copy link
Collaborator

This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes Box<RefCell<String>> is 8 bytes but the 32 bytes for RefCell<String> are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

The idea of boxing some enum variants is to reduce unused padding for other, smaller variants. The overall effect is very dependent on which variants are more frequent in some concrete use case. Here DocumentFragment is likely the variant that profits the most from this PR, but I suspect it’s very uncommon in real-world document.

I suspect that unfortunately in this enum the largest variant, Element, is typically also one of the most frequent. So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.

@jyn514
Copy link
Contributor Author

jyn514 commented Jul 2, 2020

So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.

Would this do? #73 (comment) (see servo/html5ever#420 (comment) to replicate)

@jyn514
Copy link
Contributor Author

jyn514 commented Jul 2, 2020

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes Box<RefCell> is 8 bytes but the 32 bytes for RefCell are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

This means that Text now uses a total of 40 bytes of memory, but that's still half of the 80 it was taking up before. All of the variants besides Element Doctype now use less memory overall than they did before. Element Doctype has now grown from 80 -> 88 which seems like a reasonable tradeoff.

@SimonSapin
Copy link
Collaborator

Here is the size of the fields of each variant as of 596ecac. (For the size of the enum we need to add the discriminant, which is 8 bytes with padding because of field alignment.) I think this should have been the very first step of this discussion.

    assert_eq!(64, size_of::<crate::ElementData>());
    assert_eq!(32, size_of::<std::cell::RefCell<String>>()); // Text, Comment
    assert_eq!(56, size_of::<std::cell::RefCell<(String, String)>>());  // ProcessingInstruction
    assert_eq!(72, size_of::<crate::Doctype>());
    assert_eq!(1, size_of::<crate::DocumentData>());

I’m gonna claim without backing data that elements and text are very common, comments somewhat uncommon, and everything else exceptional. (Processing instructions are an XML thing and are not generated by the HTML parser. Each document typically has one document node, and zero or one doctype.)

For uncommon enum variants we care about the direct size_of which increases the padding for other variants. For common variants we care about the total memory use, including separate allocations.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

For elements and text I think it’s much less obvious, since they are common. But I think we can explore ideas other than just adding Box in a few places.

For example, how common are elements with zero attributes? If it’s more than 33% it would be a win to replace Attributes in ElementData with Option<Box<Attributes>>, and store None for those elements. However that’s ideally an implementation detail hidden from the public API.

    assert_eq!(24, size_of::<crate::Attributes>());

Ultimately though, if memory usage for large documents is important enough you should consider using something other than Kuchiki.

In my view Kuchiki is mostly a proof-of-concept to show what integrating the html5ever with and selectors crates look like. Designing a tree data structure for this is full of trade-offs. I tried to make Kuchicki is not terrible at any use case, but that means it’s also often not great.

In particular:

  • If you don’t care about removing some nodes from the tree and reclaiming memory before you drop the whole tree, use an area allocator and &'arena Node or a per-document Vec and integer indices. Rc<Node> has 16 bytes overhead per node.
  • If you don’t care about documents with more than four billion nodes, 32-bit integer indices
  • If you don’t care about mutating any node after the initial parsing, remove all RefCells. Those have 8 bytes overhead each.
  • If you do, replacing RefCell with Cell also removes that overhead at the cost of a less convenient API. Cell::update is still unstable but it’s fairly simple and doesn’t use unsafe, so duplicating it (or a variation of it that relies on T: Default and Cell::take instead of T: Copy and Cell::get) is easy and not risky.
  • If you don’t need CSS selector matching or anything else that goes "up" the tree from a given node, consider children: Vec<NodeRef> or children: Box<[NodeRef]> instead of five links for parent/previous sibling/next sibling/first child/last child.
  • If you do, that last one might still be viable if you manipulate not just references to a given node but "cursors" that contain a stack of indices into children sequences, forming the path from the root of the tree to a given node.

And I’m sure there are other interesting data structure ideas I haven’t thought of.

@jyn514
Copy link
Contributor Author

jyn514 commented Jul 2, 2020

I think this should have been the very first step of this discussion.

That would have been a good start, sorry about that.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

🎉 🎉

If memory usage for large documents is important enough you should consider using something other than Kuchiki. In my view Kuchiki is mostly a proof-of-concept.

I understand your viewpoint. I'm not aware of any alternatives to Kuchiki though and it's about 2k lines of code - It'd be a shame to rewrite all that just for docs.rs. I'm not sure what the best way forward is.

@jdm
Copy link
Contributor

jdm commented Jul 2, 2020

@jyn514 To be fair, based on https://github.com/rust-lang/docs.rs/blob/2edd72e2ad317ff8725db1dfe303cb8bc4918255/src/utils/html.rs#L6-L36 there's a lot of kuchiki's implementation that you wouldn't need to replicate. You're basically looking at:

Any remaining bits could be replaced by manually walking the resulting parsed tree to find elements matching head and body, as far as I can see.

@jdm
Copy link
Contributor

jdm commented Jul 2, 2020

Alternatively, https://crates.io/crates/lol_html might be an option that better meets your needs out of the box.

@jyn514
Copy link
Contributor Author

jyn514 commented Jul 2, 2020

lol_html looks like exactly what we want. Thanks!

@jyn514 jyn514 closed this Jul 2, 2020
@jyn514 jyn514 deleted the moving-time branch April 20, 2021 17:15
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodeData is unnecessarily large, bloating the size of all Nodes
3 participants