-
-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limited Multi-Regex as the keys in the FST #58
Comments
I think you have an interesting use case. My suspicion at this point is that it would be better to prototype this out of tree, and maybe even maintain a separate crate and get experience with it. If it works well, then we can see about merging your changes back in. There are certainly some public API questions to answer here, and that's hard to do without actually playing with an implementation. I am in part saying this because I just personally don't have the bandwidth to work on something like this or even code review it thoroughly in a PR. My suspicion is that it will be a pretty gnarly change. The structure of the state machine is itself embedded into the binary format of the FST, so changes would need to trickle down to that level. I'd be happy to mentor this at a distance and answer questions, especially about parts of the code that are opaque or difficult to understand. |
Actually, I need this by tomorrow! You don't have time? This is a disaster :) Totally fair, I get that this would be a tonne of work. Mostly, at this stage, given I have a limited knowledge of FSTs, I'm trying to evaluate if theoretically FSTs can even handle this usecase. You're clearly an expert on string matching, and so I was just trying to get your opinion on the problem (possible other approaches like Regarding mentoring: I don't know how far I'll go with this, but if I do, I want to be cognizant of your time. So know I'm never in a rush for answers, take your time, and please just let me know if my questions ever become too much, too often. Hopefully, you have time to think about some of these questions:
|
:-) Hmm, so I basically just don't have any intuition on this because I've never really given it any thought. Part of the point of FSTs is that they do prefix and suffix compression, which is only really useful by taking advantage of the redundancy in the keys you store. If you have a million regexes, is there still redundant structure? I think the only thing I can say is to try and experiment with this. You might consider experimenting on it from scratch with a simpler in-memory data structure. Jumping straight into this crate will be difficult, because you'll really need to have a firm grasp on the structure of the machine in order to know how to encode it to disk. I do think you're on the right track with respect to limiting yourself to specific aspects of regexes. People have actually done some remarkable work in characterizing exactly what it is that causes DFAs to exhibit exponential memory growth. (There's a lot of fancy lingo there, but I think that if you read it carefully, you'll find that their results probably match your intuition. If your eyes glaze over, page 3 contains some comparatively light material with some examples.) With respect to your specific use case, I don't think you can say "alternations are fine" or "alternations aren't fine," but rather, it is the nature of those alternations that matter. Do they have significant overlap? If so, then you start getting state blow up. If not, then maybe you're fine. Apologies for not answering your questions directly, but I just don't have access to those answers at my fingertips. :) One other thought here: if you're looking at matching many regexes, you might consider checking out Hyperscan, which is supposed to specialize in handling huge pattern databases. I don't know if a million regexes is considered "huge" or "really fucking huge," but it might be an interesting baseline to test against. (Among open source software, Hyperscan is, without a doubt, one of the world's most sophisticated regex engines. Up there with PCRE2. So if any regex engine could handle your use case out of the box, it would be Hyperscan.) |
I'm going to close this request for now. |
Note: I know I can search with a regex,
fst.search(Regex::new(...))
, but it doesn't satisfy my usecase.Note: This suggestion is very similar to multi-regex, but hopefully by only introducing a limited feature set from regex we can keep the state machine small, and therefore really fast. I think an FST with "complex states" would be more efficient than
RegexSet
, however, you're the expert, so I'd trust you if you said otherwise.Summary: My usecase basically needs a
RegexSet
with a few millionRegex
s, I know its crazy. But this is why I'd prefer the FST toRegexSet
, I'm hoping for better memory efficiency (Ideally, improved throughput too).What are your thoughts around adding limited regex features into the keys of the FST?
For example: A system which parses log files for a 500+ route API
The only real change would be to create states which can transition back to itself (i.e. the graph becomes kind-of cyclic), but "transitioning back to itself" is equivalent to "the state remains unchanged" (i.e. the graph remains acyclic).
From my minimal understanding, the features (this might not be exhaustive) which cause
regex
s to be significantly less performant than an FST:Edit: The big performance consideration I didn't take into account actually was that because states can be shared, any state which can loop back to itself will basically get stuck on the traversal "frontier", and need to be re-checked for each character streaming through the machine. But I suppose this is alright, because its "pay for what you use"?
The text was updated successfully, but these errors were encountered: