-
Notifications
You must be signed in to change notification settings - Fork 54
Conversation
This reduces memory usage in NodeData by a factor of 5 (80 -> 16), and usage in Node by a factor of 2 (120 -> 56).
This analysis is definitely too simplistic because it ignores the memory usage of The idea of boxing some enum variants is to reduce unused padding for other, smaller variants. The overall effect is very dependent on which variants are more frequent in some concrete use case. Here I suspect that unfortunately in this enum the largest variant, |
Would this do? #73 (comment) (see servo/html5ever#420 (comment) to replicate) |
This means that |
Here is the size of the fields of each variant as of 596ecac. (For the size of the enum we need to add the discriminant, which is 8 bytes with padding because of field alignment.) I think this should have been the very first step of this discussion. assert_eq!(64, size_of::<crate::ElementData>());
assert_eq!(32, size_of::<std::cell::RefCell<String>>()); // Text, Comment
assert_eq!(56, size_of::<std::cell::RefCell<(String, String)>>()); // ProcessingInstruction
assert_eq!(72, size_of::<crate::Doctype>());
assert_eq!(1, size_of::<crate::DocumentData>()); I’m gonna claim without backing data that elements and text are very common, comments somewhat uncommon, and everything else exceptional. (Processing instructions are an XML thing and are not generated by the HTML parser. Each document typically has one document node, and zero or one doctype.) For uncommon enum variants we care about the direct With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes. For elements and text I think it’s much less obvious, since they are common. But I think we can explore ideas other than just adding For example, how common are elements with zero attributes? If it’s more than 33% it would be a win to replace assert_eq!(24, size_of::<crate::Attributes>()); Ultimately though, if memory usage for large documents is important enough you should consider using something other than Kuchiki. In my view Kuchiki is mostly a proof-of-concept to show what integrating the In particular:
And I’m sure there are other interesting data structure ideas I haven’t thought of. |
That would have been a good start, sorry about that.
🎉 🎉
I understand your viewpoint. I'm not aware of any alternatives to Kuchiki though and it's about 2k lines of code - It'd be a shame to rewrite all that just for docs.rs. I'm not sure what the best way forward is. |
@jyn514 To be fair, based on https://github.com/rust-lang/docs.rs/blob/2edd72e2ad317ff8725db1dfe303cb8bc4918255/src/utils/html.rs#L6-L36 there's a lot of kuchiki's implementation that you wouldn't need to replicate. You're basically looking at:
Any remaining bits could be replaced by manually walking the resulting parsed tree to find elements matching head and body, as far as I can see. |
Alternatively, https://crates.io/crates/lol_html might be an option that better meets your needs out of the box. |
lol_html looks like exactly what we want. Thanks! |
This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).
Closes #73