-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add configurability for unicode and skipping parsers #14
Add configurability for unicode and skipping parsers #14
Conversation
- Unicode support greatly increases memory and latency and is not generally needed. - Compiling regex for unused parsers is just a waste of memory. Both unicode support and device/os/user_agent parsers can now be configured via `UserAgentParserBuilder`.
I dug in a bit after this discussion about memory usage: rust-lang/regex#1027 (comment) I took at a look at other uaparser implementations and none of them are done with full unicode support. The uaparser-core test suite passes without it just fine. So I moved everything to use Additionally, we at Remind are only interested in OS for our use case. What we've been doing has been to take the core regex yaml file and delete the Device and User Agent sections, but it would be nice to not have to curate that ourselves and instead just pass in flags to only parse and use the os parsers. So I added the ability to disable any of Some performance numbers with these changes: First, memory usage.
Disabling Unicode takes us from 300MB to 60MB, and only doing the OS parsers takes that all the way to 10MB. Beyond memory, the
A huge speedup on the I left the unicode support in as the default to make this a non-breaking change, but honestly it's just not needed and I'd recommend we rip it out and just make everything not support unicode. Anyway, let me know what you think. |
captures.expand(raw_replacement, &mut target); | ||
std::str::from_utf8(&target) | ||
.map(|s| Cow::Owned(s.trim().to_owned())) | ||
// What is the behavior if we can't parse a string??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only big question mark in behavior. If a replacement has a group and the result of the capture expands into a &[u8]
that can't be turned into a string, then we have to do something. Rather than panic or try and surface this, I chose to just skip the replacement.
This may never happen. It likely won't. I'm not sure if a panic is better or worse here and am very open to feedback.
This example is a good example of how the new builder works and the configuration available https://github.com/davidarmstronglewis/uap-rs/pull/14/files#diff-f2849e4e13ac1633666e5ef659e790e06feb21fa5848bf7268b80e7e7de39a94 |
user_agent: bool, | ||
unicode: bool, | ||
) -> Result<UserAgentParser, Error> { | ||
let device_matchers = if device { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explored a few options on how to disable/enable a specific class of parser. Using crate features was neat, but felt overkill for something that could just be provided as config. The most correct approach is probably making the Vec<device::Matcher>
an Option<Vec<device::Matcher>>
and making Client have optional fields for all the matches, but that was a bigger footprint and I wasn't sure if that was worth it. An empty Vec is what we're getting now with our stripped down regex yaml file and it works well enough.
Hey @hamiltop, thank you for the lovely PR and for linking out to context around these changes. Solid change, my only feedback would be to hide the new Once this is in I'll cut a new release. Thanks again ❤️ |
Great. Updated from your branch and then fixed up a few tests to use the new syntax. Should be g2g here. |
src/parser/builder.rs
Outdated
// fn foo() { | ||
// /// ```rust | ||
// /// # use uaparser::*; | ||
// let _parser = UserAgentParser::builder() | ||
// .unicode(false) | ||
// .user_agent() | ||
// /// ``` | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I think I left this in here erroneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with a fixup
commit and rebased.
2216e6c
to
d7a94b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I'll merge this and cut a new release later tonight.
Thank you for the contribution ❤️
Two main changes
Both unicode support and device/os/user_agent parsers can now be configured via
UserAgentParserBuilder
.I tried very hard to keep this backwards compatible, but if we're willing to break backwards compatibility we can take these in a different direction.