-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid UTF-8 byte sequence is accepted as valid UTF-8 by text-2.0.1 #575
Comments
This is actually a problem with UTF-8 decoding in import qualified Data.ByteString as BS
main :: IO ()
main = do
-- This byte has the high bit set and thus is not valid UTF 8 by itself
let simpleInvalidUtf8 = BS.pack [216]
-- Correctly says False
print $ BS.isValidUtf8 simpleInvalidUtf8
-- Pad the invalid character with spaces
let paddedInvalidUtf8 = BS.pack $ replicate 127 32 <> [216] <> replicate 128 32
-- Incorrectly says True
print $ BS.isValidUtf8 paddedInvalidUtf8 CC @kozross |
I'm currently on ARM so referring to The input string is structured that way that a) the first chunk is a valid partial input, ending with a start of multibyte sequence b) the second chunk is completely ASCII. bytestring/cbits/aarch64/is-valid-utf8.c Lines 218 to 222 in 0602eab
I don't quite comprehend this, but my suspicion is that we are so happy to consume a full chunk of ASCII bytes that we eagerly forget the decoding state at which the previous chunk has ended. |
@Bodigrim - that's my suspicion as well. I'll take a look when I can. |
Thanks for looking into it! I did some further experimentation, and it seems that there is something else going on, besides the chunk-boundary case you suspect in bytestring. In both reproducers, I increased the number of leading padding spaces to 128. Now, the following bytestring reproducer expectedly prints import qualified Data.ByteString as BS
main :: IO ()
main = do
-- This byte has the high bit set and thus is not valid UTF 8 by itself
let simpleInvalidUtf8 = BS.pack [216]
-- Correctly says False
print $ BS.isValidUtf8 simpleInvalidUtf8
-- Pad the invalid character with spaces
let paddedInvalidUtf8 = BS.pack $ replicate 128 32 <> [216] <> replicate 128 32
-- Incorrectly says True
print $ BS.isValidUtf8 paddedInvalidUtf8 However, this reproducer using the text library still claims to have decoded the second invalid string: import Control.Exception
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS
main :: IO ()
main = do
-- This byte has the high bit set and thus is not valid UTF 8 by itself
let simpleInvalidUtf8 = BS.pack [216]
-- Correctly prints error
testDecoding simpleInvalidUtf8
-- Pad the invalid character with spaces
let paddedInvalidUtf8 = BS.pack $ replicate 128 32 <> [216] <> replicate 128 32
testDecoding paddedInvalidUtf8
testDecoding :: BS.ByteString -> IO ()
testDecoding input = do
let
decode = do
result <- evaluate $ Enc.decodeUtf8 input
putStrLn $ "Decoded " <> show input <> " to " <> show result
catch decode $ \exc -> do
putStrLn $ "Decoding " <> show input <> " failed with: " <> show (exc :: SomeException) That behavior persists even with more padding, but the amount of padding in my initial example was the least amount of padding necessary to trigger this bug. Is that something you can reproduce as well? |
@fatho I can reproduce your example, but the root cause is the same. When |
@Bodigrim Thanks for your explanation. The "parse one byte, then retry the rest"-part sounded like an accidentally-quadratic algorithm to me, so I wrote a reproducer for that and opened another issue to not derail this comment thread: haskell/text#495 |
@Bodigrim - sorry for my lack of response: work's been busy. I've got approval for two full workdays to chase this, but I'm just stitching up something more urgent. I'll let you know once I'm available to deal with this. Thanks for the regression tests - they'll help get this sorted. |
* Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Repair invalid UTF-8 issue, more tests * Fix NEON ASCII check * Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Use ctzll to ensure 8 bytes get processed * Set -fno-strict-aliasing --------- Co-authored-by: Bodigrim <[email protected]>
* Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Repair invalid UTF-8 issue, more tests * Fix NEON ASCII check * Refactor isValidUtf8 tests to enable shrinking * Test isValidUtf8 on strings which are almost valid, with long ranges of ASCII * Use ctzll to ensure 8 bytes get processed * Set -fno-strict-aliasing --------- Co-authored-by: Bodigrim <[email protected]>
I found a weird quirk/bug in the UTF-8 decoding function of
text-2.0.1
. In some cases (given enough padding around it), thedecodeUtf8
function will accept an invalid byte sequence as valid UTF-8, as demonstrated by the following reproducer I ran with GHC 9.4.4 (which comes with text 2.0.1).As you can see, the input contains a byte with the highest bit set, meaning it should be part of a multi-byte UTF-8 sequence. This byte alone is invalid. And indeed, if we try to decode a singleton bytestring consisting of just that byte, we'll get the expected exception.
However, adding sufficient padding around the invalid byte appears to make the validation pass, as witnessed by the second decoding call.
Reproducer output:
Since I saw that text employs some fancy SIMD algorithms, my CPUID flags might also be relevant, so here they are.
For comparison, running the same reproducer with GHC 9.2.5 (which ships text
text-1.2.5.0
), it prints two errors (as I would expect):The text was updated successfully, but these errors were encountered: