bug(wc): Add a test for unexpected behavior (#1723)
1 parent f595164, commit 05d8cc5
Showing 2 changed files with 11 additions and 0 deletions.
05d8cc5
I'm not sure GNU is "right" :) Here are my notes comparing all the major UTF-8 parsers: https://gist.github.com/chadbrewbaker/5ec5fbe06d294da95b15d17b70b4d4a3
05d8cc5
The easy solution, conformant but not great for performance, would be to check the locale and call the corresponding system library mbrtowc. The right solution is to get support for the major locales into upstream Rust mainline or a library. To do that I'm noodling on a compatibility test suite for the major UTF-8 parsers.
The major specification issues for 'wc' are:
1) What are valid space characters for the locale? For newline it was settled that only the newline character would be parsed.
2) How are invalid UTF-8 characters handled while parsing? Do we reject the entire 1-4 bytes, reject the first byte and re-parse, or reject bytes based only on the number of high one bits in the first byte? Should a warning be written to stderr at the end of the parse with the count of bad UTF-8 reads?
wc is expected to run in O(n) time with constant memory overhead.
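To make the difference between two of those rejection policies concrete, here is a rough sketch. It is not what uutils or GNU actually does; `Policy`, `Counts`, and `count_chars` are hypothetical names invented for illustration, and it assumes the input is already a byte slice in memory (a real wc would read in fixed-size chunks and carry a partial sequence across chunk boundaries). It only uses `std::str::from_utf8`, `Utf8Error::valid_up_to`, and `u8::leading_ones` from the standard library.

```rust
use std::str;

/// Hypothetical policies for skipping past an invalid byte.
#[derive(Debug, Clone, Copy)]
enum Policy {
    /// Reject only the offending byte, then re-parse from the next byte.
    RejectFirstByte,
    /// Reject as many bytes as the high one bits of the offending byte claim.
    RejectByLeadingOnes,
}

/// Hypothetical result type: decoded characters and rejected bytes.
#[derive(Debug, Default, PartialEq)]
struct Counts {
    chars: usize,
    invalid_bytes: usize,
}

/// Counts characters in one forward pass with constant extra memory.
/// Each byte is examined a bounded number of times, so the scan is O(n).
fn count_chars(mut input: &[u8], policy: Policy) -> Counts {
    let mut counts = Counts::default();
    while !input.is_empty() {
        let err = match str::from_utf8(input) {
            Ok(s) => {
                // The rest of the input is valid: count it and stop.
                counts.chars += s.chars().count();
                break;
            }
            Err(e) => e,
        };
        // Everything before `valid` is well-formed UTF-8.
        let valid = err.valid_up_to();
        let prefix = str::from_utf8(&input[..valid]).expect("prefix was validated");
        counts.chars += prefix.chars().count();

        // Decide how many bytes to throw away, starting at the bad byte.
        let skip = match policy {
            Policy::RejectFirstByte => 1,
            Policy::RejectByLeadingOnes => match input[valid].leading_ones() {
                n @ 2..=4 => n as usize, // lead byte claiming a 2-4 byte sequence
                _ => 1,                  // stray continuation or illegal lead byte
            },
        };
        // Never skip past the end of the input (truncated sequences).
        let skip = skip.min(input.len() - valid);
        counts.invalid_bytes += skip;
        input = &input[valid + skip..];
    }
    counts
}
```

For the bytes `a 0xE2 0x28 0xA1 b`, `RejectFirstByte` reports 3 characters and 2 invalid bytes (the `(` at 0x28 survives), while `RejectByLeadingOnes` reports 2 characters and 3 invalid bytes, because 0xE2 claims a three-byte sequence. The `invalid_bytes` total is the number that could be reported on stderr at the end of the parse.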