bug(wc): Add a test for unexpected behavior (#1723)
chadbrewbaker authored Feb 16, 2021
1 parent f595164 commit 05d8cc5
Showing 2 changed files with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions tests/by-util/test_wc.rs
@@ -8,6 +8,17 @@ fn test_stdin_default() {
        .stdout_is(" 13 109 772\n");
}

#[test]
fn test_utf8() {
    new_ucmd!()
        .args(&["-lwmcL"])
        .pipe_in_fixture("UTF_8_test.txt")
        .run()
        .stdout_is(" 0 0 0 0 0\n");
    // GNU returns " 300 2086 22219 22781 79"
    // TODO: we should fix that to match GNU's behavior
}

#[test]
fn test_stdin_line_len_regression() {
    new_ucmd!()
Binary file added tests/fixtures/wc/UTF_8_test.txt
Binary file not shown.

2 comments on commit 05d8cc5

@chadbrewbaker

I'm not sure GNU is "right" :) My notes on comparing all the major UTF-8 parsers. https://gist.github.com/chadbrewbaker/5ec5fbe06d294da95b15d17b70b4d4a3

@chadbrewbaker chadbrewbaker commented on 05d8cc5 Feb 18, 2021

The easy, conformant, but not especially fast solution would be to check the locale and call the corresponding system library function, mbrtowc. The right solution is to get support for the major locales into upstream Rust, either in mainline or in a library. To do that, I'm noodling on a compatibility test suite for the major UTF-8 parsers. The major specification issues for 'wc' are:

1) What are the valid space characters for the locale? For newlines, it was settled that only the newline character would be treated as a line break.
2) How are invalid UTF-8 sequences handled while parsing? Do we reject the entire 1-4 byte sequence, reject the first byte and re-parse, or reject a number of bytes based only on the count of high one bits in the first byte? And should a warning with the number of bad UTF-8 reads be written to stderr at the end of the parse?
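
To make the three resynchronisation options above concrete, here is a rough Rust sketch, not what uutils or GNU actually do. The `Resync` enum and `count_chars` function are hypothetical names invented for illustration; only `std::str::from_utf8`, `Utf8Error::valid_up_to`, `Utf8Error::error_len`, and `u8::leading_ones` come from the standard library. `InvalidPrefix` stands in for "reject the whole sequence" by trusting whatever invalid prefix the std decoder reports, which is an assumption about what that option should mean.

```rust
/// How to resynchronise after an invalid sequence (hypothetical names).
#[derive(Clone, Copy)]
pub enum Resync {
    /// Skip the invalid prefix reported by the std decoder.
    InvalidPrefix,
    /// Skip only the first offending byte, then re-parse.
    FirstByte,
    /// Skip as many bytes as the lead byte's high one bits claim.
    LeadByteLength,
}

/// Returns (valid characters, rejected sequences).
pub fn count_chars(mut input: &[u8], policy: Resync) -> (usize, usize) {
    let (mut chars, mut rejected) = (0, 0);
    loop {
        match std::str::from_utf8(input) {
            Ok(s) => return (chars + s.chars().count(), rejected),
            Err(e) => {
                let good = e.valid_up_to();
                // The prefix up to `good` is guaranteed to be valid UTF-8.
                chars += std::str::from_utf8(&input[..good])
                    .unwrap()
                    .chars()
                    .count();
                rejected += 1;
                // `bad` is never empty: the error sits before the end of input.
                let bad = &input[good..];
                let skip = match policy {
                    // `None` means the input ended in the middle of a sequence.
                    Resync::InvalidPrefix => e.error_len().unwrap_or(bad.len()),
                    Resync::FirstByte => 1,
                    Resync::LeadByteLength => {
                        (bad[0].leading_ones() as usize).clamp(1, 4)
                    }
                };
                // Always advance by at least one byte, so the loop terminates.
                input = &bad[skip.min(bad.len())..];
            }
        }
    }
}
```

On the input `b"a\xE2\x82z"` (a three-byte lead, one continuation byte, then ASCII), the three policies disagree on both the character count and the number of rejects, which is exactly the kind of difference a compatibility test suite would need to pin down.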

wc is expected to run in O(n) time with constant memory overhead.
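
A minimal sketch of what that budget looks like in Rust, assuming a fixed-size read buffer and a single pass over the bytes. `Counts` and `count_stream` are hypothetical names, not the uutils implementation. Characters are approximated by counting every byte that is not a UTF-8 continuation byte (`10xxxxxx`), which is exact for valid UTF-8 but performs no validation, so it is only one possible answer to the invalid-input question.

```rust
use std::io::{self, Read};

/// Running totals; memory use does not grow with the input.
#[derive(Debug, Default, PartialEq)]
pub struct Counts {
    pub lines: usize,
    pub chars: usize,
    pub bytes: usize,
}

/// Counts lines, characters, and bytes in one pass over `reader`.
pub fn count_stream<R: Read>(mut reader: R) -> io::Result<Counts> {
    // Fixed 64 KiB buffer: constant memory regardless of input size.
    let mut buf = [0u8; 64 * 1024];
    let mut counts = Counts::default();
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => return Ok(counts), // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        for &b in &buf[..n] {
            counts.bytes += 1;
            if b == b'\n' {
                counts.lines += 1;
            }
            // Every byte except a continuation byte (10xxxxxx) starts a character.
            if (b & 0xC0) != 0x80 {
                counts.chars += 1;
            }
        }
    }
}
```

Because each byte is classified on its own, a multi-byte character split across two reads needs no carried state. A validating decoder would have to carry at most three bytes across reads, which still fits the constant-memory requirement.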