Detect invalid UTF-8 data at end of file when using PerlIO :encoding(utf-8) #59
Comments
See https://metacpan.org/pod/PerlIO::encoding. There is a variable, $PerlIO::encoding::fallback, and by default its WARN_ON_ERR bit is set. So yes, it is a bug that you did not get a warning.
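To make the point about the fallback bits concrete, here is a minimal sketch that prints the current value of $PerlIO::encoding::fallback and reports which of the Encode check flags are set in it. The list of flag names and the lookup through Encode->can are choices made for this illustration, not code taken from the thread, and the printed values may differ between Perl and Encode versions.

```perl
use strict;
use warnings;
use Encode ();
use PerlIO::encoding;

# Bitmask consulted by the :encoding() layer when decoding fails.
my $fallback = $PerlIO::encoding::fallback;
printf "fallback = 0x%04x\n", $fallback;

# Report each Encode check flag that this Encode version provides.
for my $name (qw(DIE_ON_ERR WARN_ON_ERR RETURN_ON_ERR LEAVE_SRC PERLQQ STOP_AT_PARTIAL)) {
    my $const = Encode->can($name) or next;   # skip flags that are not available
    printf "%-16s %s\n", $name, ($fallback & $const->()) ? 'set' : 'clear';
}
```

If WARN_ON_ERR is reported as set, a silent read of malformed input points at the layer rather than at the configuration.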
@pali Yes. When I add this to the code above (before starting to read the file):
The output is
which shows that the bitmask constants
Interestingly, if I try to change the value to a code ref before reading:
The code hangs at readline (i.e. :
Look at the PerlIO::encoding source code; by default these bits are set:
A coderef check is supported only by some XS Encode modules, and probably not by PerlIO::encoding.
It looks like this is not an Encode bug but a PerlIO::encoding bug! And PerlIO is part of Perl itself. Please report this bug directly to Perl. I used this test script:
It turns out this is partly an Encode issue too. PerlIO::encoding "renew"s the encoding object to ensure it has its own encoding object (per Encode::Encoding), but Encode::decode_xs() treats such a renewed object as always stop_at_partial, which means that PerlIO::encoding can't use that encoding object to process the small amount of excess data at EOF. So I'm stuck trying to fix this on the PerlIO::encoding side. Unfortunately, simply removing that renewed -> stop_at_partial behaviour would break PerlIO::encoding on validly encoded files on older perls, so I don't see a simple fix.
The bug is in PerlIO::scalar and was fixed in perl 5.25.8 by this commit:
There's an issue in PerlIO::encoding and the way it interacts with Encode too:
$ ./perl -e 'print "\xef\xbe"' >shortuni.txt
but it should be outputting a warning and \x{00EF}, like the following does:
$ ./perl -e 'print "\xef\xbeA"' >shortuni.txt
This is blead at v5.25.9-35-g32207c6, which includes the (irrelevant) PerlIO::scalar fix.
The PerlIO layer :encoding(utf-8) seems to fail to report malformed data at the end of a file.

Suppose a file $fn contains valid UTF-8, except for the final character in the file, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but strangely this does not seem possible to achieve.

For example:

Now $fn contains invalid UTF-8 (the last byte). If I now try to read the file using the PerlIO layer :encoding(utf-8):

The output is

Note that there is no warning "\xE5" does not map to Unicode in this case.

However, if I read the file as bytes and then use Encode::decode() on the raw data, the warning is printed:

Why can't the same thing be achieved with PerlIO::encoding? Is it a bug?
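
The approach described above, reading the file as bytes and decoding with Encode::decode(), could be sketched roughly as follows. This is an illustration rather than the code originally posted: the helper name read_utf8_strict, the temporary file, and the sample bytes "abc\xE5" are all assumptions, and FB_CROAK is used instead of a warning so that the failure can be caught with eval.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file as raw bytes and decode it strictly.
# Returns ($text, undef) on success or (undef, $error) on malformed input.
sub read_utf8_strict {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    $bytes = '' unless defined $bytes;

    # FB_CROAK makes decode() die on malformed data, including a truncated
    # sequence at the very end; LEAVE_SRC keeps $bytes unmodified.
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? ($text, undef) : (undef, $@);
}

# Sample file: valid ASCII followed by a lone lead byte of a multi-byte sequence.
my ($tmp, $fn) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "abc\xE5";
close $tmp;

my ($text, $err) = read_utf8_strict($fn);
print defined $text ? "decoded OK\n" : "invalid UTF-8: $err";
```

Because the whole file is decoded in one call without STOP_AT_PARTIAL, the trailing incomplete sequence is treated as an error instead of being held back for more input, which is the case the :encoding layer misses.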