Fails to recognize delimiters that fall on the boundary between two buffers #4
Comments
Note: the better algorithm above also requires the ability to read multiple buffers at a time, so it is not an improvement in that sense. It is only better because the complexity of the regex matching is Ω(buffer size) while the second choice is Ω(max{(result string size), (buffer size)}). Combined with the fact that even the pseudocode is twice as long, I think this is premature optimization in the case of trailing delimiters. In the case of precedent delimiters, it has a difference of Ω(buffer size) vs Ω(file size) in the case that there is no precedent delimiter, so it is still necessary for that case, especially in use cases where one scans an infinite stream (e.g., network programming) or files too large to fit in memory. However, since they are precedent delimiters, we change the algorithm slightly in that case:
That ignores greedy operators. Test case:

You can fix this with a little reordering:
I think that in order to read multiple buffer widths, we will have to implement our own buffer type. The buffer should function identically to BufReader in the general case, but it should use a vector instead of a fixed slice. Further, we will provide a method for reading additional data into the buffer without consuming what is already there; as a consequence, a call that returns the buffered data may yield more than one buffer's worth at a time. After extending the buffer, it should be allowed to shrink back to normal size over time as consuming calls drain it.
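A minimal sketch of that shape, in which the names `GrowBufReader`, `with_capacity`, and `read_more` are placeholders for illustration rather than an actual API (a real version would also implement `Read` and `BufRead`):

```rust
use std::io::{self, Read};

/// Hypothetical growable buffer: behaves like BufReader until lookahead is
/// requested, then grows temporarily and shrinks back as data is consumed.
struct GrowBufReader<R> {
    inner: R,
    buf: Vec<u8>,         // grows past `base_capacity` on lookahead
    pos: usize,           // how much of `buf` has already been consumed
    base_capacity: usize, // the "normal" size we shrink back toward
}

impl<R: Read> GrowBufReader<R> {
    fn with_capacity(base_capacity: usize, inner: R) -> Self {
        GrowBufReader { inner, buf: Vec::new(), pos: 0, base_capacity }
    }

    /// Like `BufRead::fill_buf()`: return buffered data, refilling if empty.
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        if self.pos == self.buf.len() {
            // Everything was consumed: shrink back toward the base capacity
            // before refilling, so lookahead growth is only temporary.
            self.buf.clear();
            self.buf.shrink_to(self.base_capacity);
            self.pos = 0;
            self.read_more()?;
        }
        Ok(&self.buf[self.pos..])
    }

    /// Like `BufRead::consume()`.
    fn consume(&mut self, amt: usize) {
        self.pos = (self.pos + amt).min(self.buf.len());
    }

    /// The new part: read up to `base_capacity` more bytes into the buffer
    /// *without* consuming anything already buffered.
    fn read_more(&mut self) -> io::Result<&[u8]> {
        let old_len = self.buf.len();
        self.buf.resize(old_len + self.base_capacity, 0);
        let n = self.inner.read(&mut self.buf[old_len..])?;
        self.buf.truncate(old_len + n);
        Ok(&self.buf[self.pos..])
    }
}
```

The essential difference from BufReader is that `read_more()` can be called repeatedly to look further ahead, while `consume()` continues to work as usual.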
That buffer data structure is starting to sound kinda complicated to handle ad-hoc. I've created a separate issue #5 dealing with the data structure. This issue is on hold pending that issue's resolution.
See abonander/buf_redux, which provides the necessary lookahead features. We will not have to implement our own.
The last commit I pushed (36c64a3) contains breaking changes. I don't think they were necessary, but I'll have to reevaluate that in the morning.
We still fail to detect precedent delims longer than the input buffer, but all trailing delims work.
This is related to Issue #3 in that both solutions require that we are able to read multiple buffers of information without consuming them.
Terminating Delimiters
Currently we search for terminating delimiters using the following algorithm:
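Schematically, the search amounts to something like the following (a sketch only, assuming a `BufRead` source and the `regex` crate; `read_until_delim` is an illustrative name, and the actual implementation may differ):

```rust
use std::io::BufRead;
use regex::bytes::Regex;

/// Read from `src` until `delim` matches inside the *current* buffer,
/// returning everything up to and including the delimiter (or the whole
/// input if no delimiter is ever seen). Because each buffer is searched in
/// isolation, a delimiter straddling two buffers is never found.
fn read_until_delim<R: BufRead>(src: &mut R, delim: &Regex) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    loop {
        let buf = src.fill_buf()?;
        if buf.is_empty() {
            return Ok(out); // EOF without a delimiter
        }
        if let Some(m) = delim.find(buf) {
            // Delimiter found entirely inside this buffer.
            out.extend_from_slice(&buf[..m.end()]);
            let used = m.end();
            src.consume(used);
            return Ok(out);
        }
        // No match here: take the whole buffer and move on to the next one.
        out.extend_from_slice(buf);
        let used = buf.len();
        src.consume(used);
    }
}
```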
Note that in the first test case below, the delimiter is two consecutive spaces. The first buffer ends with one space and the second buffer begins with one space. Together they form a delimiter, but that delimiter is not contained in either individual buffer, so the entirety of both buffers is read.
The best algorithm I can come up with, which follows, depends on rust-lang/regex#425; otherwise we must write our own regex engine and implement
`hits_end() -> bool`
which returns true if the DFA is not in a dead state after we consume the last input character. This would imply that the buffer ends with a prefix of a word in that regex's language, and if we consume additional input it may or may not become a delimiter. Such functionality would also help us with the issue of parsing arbitrarily long starting delims. However, since there is nothing (simple) we can do to remedy that, we must consider an alternative. I propose the following:
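In rough, hypothetical Rust (a sketch only: `PeekMore`/`peek_more()` stand in for whatever multi-buffer lookahead mechanism is adopted, `hits_end()` is the capability described above and is not implementable with the regex crate's public API today, and `find_trailing_delim` is an illustrative name):

```rust
use std::io::BufRead;
use regex::bytes::Regex;

/// Hypothetical lookahead: grow the internal buffer and read more data
/// without consuming anything. Assumed for this sketch only.
trait PeekMore: BufRead {
    fn peek_more(&mut self) -> std::io::Result<&[u8]>;
}

/// Stand-in for `hits_end()`: true if the buffer ends while the delimiter
/// DFA is still alive, i.e. the buffer ends with a prefix of a possible
/// match. Needs rust-lang/regex#425 or a custom engine.
fn hits_end(_delim: &Regex, _haystack: &[u8]) -> bool {
    unimplemented!("requires DFA introspection not exposed by the regex crate")
}

/// Sketch of the proposed search: only trust a match (or the absence of one)
/// once the end of the buffer cannot be the prefix of a further match.
fn find_trailing_delim<R: PeekMore>(
    src: &mut R,
    delim: &Regex,
) -> std::io::Result<Option<(usize, usize)>> {
    let mut prev_len = 0;
    loop {
        let buf = src.peek_more()?;
        // If peek_more() could not grow the buffer, we are at EOF and the
        // current contents are all we will ever see.
        let at_eof = buf.len() == prev_len;
        prev_len = buf.len();
        match delim.find(buf) {
            Some(m) if m.end() < buf.len() || at_eof || !hits_end(delim, buf) => {
                // The match cannot extend past the end of the buffer.
                return Ok(Some((m.start(), m.end())));
            }
            None if at_eof || !hits_end(delim, buf) => {
                // No match, and the buffer does not end in a match prefix:
                // the caller may safely consume this whole buffer.
                return Ok(None);
            }
            // Ambiguous: the buffer ends in a potential delimiter prefix,
            // so peek at more input before deciding.
            _ => continue,
        }
    }
}
```

As noted in the comments, this treatment is too naive for greedy operators, where a match that does not touch the buffer end can still grow given more input.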
Note that this requires reading multiple buffers without consuming them. By default, `BufRead::fill_buf()` will happily return the same buffer over and over again if you do not `consume()` between calls.
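For example, with plain std types:

```rust
use std::io::{BufRead, BufReader};

fn main() {
    let data: &[u8] = b"hello world";
    let mut reader = BufReader::with_capacity(4, data);

    // Without consume(), fill_buf() keeps returning the same bytes.
    let first = reader.fill_buf().unwrap().to_vec();
    let second = reader.fill_buf().unwrap().to_vec();
    assert_eq!(first, second);
    assert_eq!(first, b"hell".to_vec());

    // Only after consuming do we see fresh data.
    reader.consume(first.len());
    assert_eq!(reader.fill_buf().unwrap(), &b"o wo"[..]);
}
```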
Precedent Delimiters
This has all of the same problems as the trailing-delimiter case, with one added caveat: without the ability to check for delimiter prefixes, it is impossible to tell the difference between a string that has no precedent delimiter and a string with a very long precedent delimiter, except by reading to EOF. (For example, with a delimiter regex like a+b, an input beginning with a long run of a's cannot be classified until we see whether a b ever follows.)
This issue technically still exists in the case of trailing delimiters, but there it carries no efficiency hit, because reading to EOF when there is no trailing delimiter is the desired behavior anyway.
Test Cases
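A self-contained reproduction of the first case described above (two-space delimiter straddling a 4-byte buffer boundary); the buffer size and input are chosen purely for illustration, and the search loop is the naive per-buffer one sketched earlier:

```rust
use std::io::{BufRead, BufReader};
use regex::bytes::Regex;

fn main() {
    // Delimiter: two consecutive spaces. With a 4-byte buffer the input
    // "abc  def" is seen as "abc " followed by " def": each buffer contains
    // only one of the two spaces, so a per-buffer search never matches.
    let delim = Regex::new(r"  ").unwrap();
    let input: &[u8] = b"abc  def";
    let mut reader = BufReader::with_capacity(4, input);

    // Naive per-buffer search, as described in this issue.
    let mut record = Vec::new();
    loop {
        let buf = reader.fill_buf().unwrap();
        if buf.is_empty() {
            break; // EOF without ever seeing the delimiter
        }
        if let Some(m) = delim.find(buf) {
            record.extend_from_slice(&buf[..m.end()]);
            let used = m.end();
            reader.consume(used);
            break;
        }
        record.extend_from_slice(buf);
        let used = buf.len();
        reader.consume(used);
    }

    // Desired: stop after the delimiter, i.e. record == b"abc  ".
    // Actual: the straddling delimiter is never found, so the whole input is read.
    assert_eq!(record, b"abc  def".to_vec());
}
```

With the prefix-aware search proposed above, the loop would instead ask for more data whenever a buffer ends in a single space.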