Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isValidUtf8ByteArray returns false positives #540

Closed
sol opened this issue Oct 22, 2023 · 6 comments · Fixed by #553
Closed

isValidUtf8ByteArray returns false positives #540

sol opened this issue Oct 22, 2023 · 6 comments · Fixed by #553

Comments

@sol
Copy link
Member

sol commented Oct 22, 2023

Repro:

ghci> :set -XOverloadedLists
ghci> import Data.Text.Internal.Validate
ghci> isValidUtf8ByteArray [128] 0 1
False
ghci> isValidUtf8ByteArray [0, 128] 1 1
True

(GHC 9.8.1 / Linux x64)

@sol
Copy link
Member Author

sol commented Oct 22, 2023

  1. Apparently my GHC provided text-2.1 is not built with simdutf.
  2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

isValidUtf8ByteArrayHaskell# b !off !len = start off
where
indexWord8 :: ByteArray# -> Int -> Word8
indexWord8 !x (I# i) = W8# (indexWord8Array# x i)
start ix
| ix >= len = True

| ix >= len = False

@andrewthad where did you take this code from? Are there any tests?

@Bodigrim
Copy link
Contributor

2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

Good catch. Could you possibly prepare a PR please?

@sol
Copy link
Member Author

sol commented Oct 23, 2023

Good catch. Could you possibly prepare a PR please?

I am more inclined to fix this by using bytestring_is_valid_utf8. However, after spending some time on it, I am close to the conclusion that bytestring_is_valid_utf8 is broken as well.

@sol
Copy link
Member Author

sol commented Oct 23, 2023

haskell/bytestring#620

@andrewthad
Copy link
Contributor

@sol The implementation of isValidUtf8ByteArrayHaskell# is copied from one of the shims for Data.Text.Internal.Validate.isValidUtf8ByteString:

isValidUtf8ByteString :: ByteString -> Bool
#ifdef SIMDUTF
isValidUtf8ByteString bs = withBS bs $ \fp len -> unsafeDupablePerformIO $
  unsafeWithForeignPtr fp $ \ptr -> (/= 0) <$> c_is_valid_utf8_ptr_unsafe ptr (fromIntegral len)
#else
#if MIN_VERSION_bytestring(0,11,2)
isValidUtf8ByteString = B.isValidUtf8
#else
isValidUtf8ByteString bs = start 0
  where
    start ix
      | ix >= B.length bs = True
      | otherwise = case utf8DecodeStart (B.unsafeIndex bs ix) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st _ -> step (ix + 1) st
    step ix st
      | ix >= B.length bs = False
      -- We do not use decoded code point, so passing a dummy value to save an argument.
      | otherwise = case utf8DecodeContinue (B.unsafeIndex bs ix) st (CodePoint 0) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st' _ -> step (ix + 1) st'
#endif
#endif

I've investigated a little to figure out where isValidUtf8ByteString came from. Before I added anything, isValidUtf8ByteString was already there, but it was named isValidBS instead. But, between the time that I put up my PR and the time it was merged, commit 7ef771d made it in, which removed isValidBS. So, from commit history on master, it looks like isValidUtf8ByteString just came out of nowhere, but if you go back to 6f1917d, you can its predecessor isValidBS.

I botched the adaptation though, and the comparison should be ix >= off + len instead of ix >= len. (Or computing end = off + len in a where clause and then using ix >= end instead.

There are not any tests for this function that I'm aware of.

@Bodigrim
Copy link
Contributor

I botched the adaptation though, and the comparison should be ix >= off + len instead of ix >= len. (Or computing end = off + len in a where clause and then using ix >= end instead.

We need to fix this before the next release, which is soon. Any volunteers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants