-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: unbuffered text matching #852
Comments
This looks like a special case of #425 to me. And yes, I would expect that this would show up in regex-automata. Or rather, I think what I would prefer for the time being is that folks would develop these sorts of matching routines by writing their own regex engines. It sounds a little crazy maybe, but that's actually one of my goals with
Hmmm, yes, indeed. The start state computation does require doing look-behind (for start anchors and one half of the word boundary assertion). But this is at most one byte and it only happens when I do in part think that #425 subsumes this, but I would like to keep this open since there is a very clear and actionable use case that drives things here. It's also a bit simpler than the general case.
Right, the DFAs cannot and probably will never be able to handle Unicode word boundaries. You can use its heuristic support for it, and if the search quits, you could in theory give up and switch over to the PikeVM. But... doing this without a buffer seems tricky/impossible. So the heuristic might not be possible. It seems likely then that you'll need to do: if nfa.has_unicode_word_boundary() {
// do streaming version of PikeVM
} else {
// do streaming version of lazy DFA
} The PikeVM has its own problems with streaming, regrettably. For example, it will try to look up to 4 bytes behind and up to 4 bytes ahead in order to deal with Unicode word boundaries. (Up to 1 byte behind or ahead for other types of assertions, currently.) So there are problems to be worked out... One simple way forward for y'all would be to do this instead: if nfa.has_unicode_word_boundary() {
// copy string to buffer, run normal regex
} else {
// use lazy DFA
} And that might be good enough. Unless, of course, you find that most queries in practice really want Unicode word boundaries. Although, it seems likely that you are currently blocked by the Finally, I feel compelled to remind you of the warning at the top of the |
matchers currently uses [email protected], so yeah, this is more a "nice future" kind of feature request, nothing's really blocked. :-) Once [email protected] is more cooked, matchers can implement these patterns on top of the building blocks provided. (I'd like to remove the |
Aye. And to be clear, I hope there are no more
Hmmm. I don't believe |
On the topic of footguns, if you haven't read the security advisory, you might enjoy doing so: GHSA-m5pq-gvj9-9vr8 In summary, there are always ways of making regex compilation take quite a bit of time. The |
Yeah, regex-automata suffers from a similar issue; I just timed But you also very explicitly don't offer mitigation for this in regex-automata,
This might break the lazy DFA's guarantees, though, since it sort of implies being linear on pattern length / text length, and I don't think this is.
|
OK... so much to untangle here. Haha. This is why I didn't want to publish
So for example, if I do end up sticking with no size limits by default for |
I'm going to close this out in favor of the monster #425 issue. Namely, I don't think we need more than one issue for the streaming use case. Once #656 is done, my hope is that it will enable some experimentation. Note that this may include writing your own PikeVM if you need Unicode word boundary support. it is possibly easier than it sounds, given that |
(Probably blocked on #656 restructuring, and potentially better exposed as part of regex-automata.)
TL;DR:
In the advanced/"lower level" impl section, or perhaps on applicable
Regex
types provided by regex-automata:Motivation
TL;DR: the matchers crate, as used in tracing-subscriber.
tracing
spans/events structured logging facilities capture non-primitive types by theirDebug
formatting. It's expected for such formatting to be relatively cheap — cheaper to do multiple times than to allocate a buffer and do so a singular time, anyway. In most cases, the captured values will only be formatted once, into a large buffer used for recording the event, with all recorded fields printing into the same buffer.Where the regex comes in is filtering. Filters can observe the recorded field, and filter spans/events in/out depending on the value of the field, and this is primarily done via regex matching. When doing this filtering, tracing-subscriber wants to avoid allocating a buffer for the fields to test, just to throw it away afterwards. This is done by writing the formatting machinery directly into driving a regex-automata DFA byte-by-byte.
tracing-subscriber filters are not supposed to be resilient to untrusted construction, but using the regex crate instead of regex-automata would reduce the failure case from regex compilation taking near-unbounded CPU/RAM to filter application taking O(n) time on the length of the filter string.
Implementation sketch
For simplicity's sake, I'm providing an implementation sketch using
[email protected]::hybrid::dfa::DFA
. This implementation is derived frommatchers
', which is MIT licensed1.Implementation sketch
(I do not know how such an implementation would handle unicode word boundaries, as AIUI those are handled by annoying special cases and not the bytewise (hybrid) DFA.)
By my quick scanning through the regex implementation, the pikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.
Footnotes
I believe the code as presented here is simple/axiomatic enough that it is uncopyrightable. As much as is permissible by law, I release copyright on my modifications to the code, to the detriment of my heirs. (My modifications can be considered licensed under the Unlicense.) ↩
The text was updated successfully, but these errors were encountered: