Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] isValidUtf8 function, tests #423

Merged
merged 24 commits into from
Nov 3, 2021
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
b218286
isValidUtf8 function, tests
kozross Sep 14, 2021
1c19b83
Merge branch 'master' of github.com:haskell/bytestring
kozross Sep 16, 2021
3b168de
Cleanup, change to forAll
kozross Sep 16, 2021
800fa95
Fix accidentally-correct generation of wrong test cases
kozross Oct 10, 2021
32c6048
Increase test count, fix segfault on Clang
kozross Oct 10, 2021
0f513c3
Try fixing Windows
kozross Oct 11, 2021
331c9fa
Eliminate block-checking for ASCII in fallback mode
kozross Oct 16, 2021
374caa4
Use GHC 8.10 for CI on Windows
kozross Oct 16, 2021
c293c1c
Namespace externally-visible C function better
kozross Oct 20, 2021
4bfab53
Link to libgcc on Windows
kozross Oct 21, 2021
8be0375
Fix error accumulation in AVX2 check
kozross Oct 23, 2021
4c5271a
Merge branch 'master' of github.com:haskell/bytestring
kozross Oct 24, 2021
a2c8083
Fix use of <> on older GHCs
kozross Oct 24, 2021
9eff4d6
Repair Drone config to deal with older Cabal
kozross Oct 25, 2021
2584a5d
Merge branch 'master' of github.com:haskell/bytestring
kozross Oct 28, 2021
e3a9f21
Ensure non-SSE2 x86 can still build
kozross Oct 28, 2021
786151f
License information and contact for C code
kozross Oct 29, 2021
abb9a10
Fix test name typo, allow more than pre-set count of tests
kozross Oct 29, 2021
50d61e6
Stop using MagicHash for FFI wrapper
kozross Oct 29, 2021
df2a8b9
Revert change to 8.10 for Windows CI
kozross Oct 29, 2021
073499f
Use existing license verbatim for C code
kozross Oct 30, 2021
d8def8c
Explain the use of gcc and gcc_s
kozross Oct 30, 2021
121c4fc
More concise implementation of isValidUtf8
kozross Oct 30, 2021
aaeb320
Document AVX2 block-checking function
kozross Nov 2, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,4 @@ cabal.sandbox.config
.hsenv
*~
dist-newstyle/
.nvimrc
kozross marked this conversation as resolved.
Show resolved Hide resolved
4 changes: 4 additions & 0 deletions Changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,10 @@

[0.12.0.0]: https://github.com/haskell/bytestring/compare/0.11.1.0...0.12.0.0

[0.11.2.0] - September 2021

* Add `Data.ByteString.isValidUtf8`
kozross marked this conversation as resolved.
Show resolved Hide resolved

[0.11.1.0] — February 2021

* [Add `Data.ByteString.Char8.findIndexEnd` and `Data.ByteString.Lazy.Char8.{elemIndexEnd,findIndexEnd,unzip}`](https://github.com/haskell/bytestring/pull/342)
Expand Down
17 changes: 16 additions & 1 deletion Data/ByteString.hs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE TupleSections #-}
{-# OPTIONS_HADDOCK prune #-}
Expand Down Expand Up @@ -141,6 +142,9 @@ module Data.ByteString (
isSuffixOf,
isInfixOf,

-- ** Encoding validation
isValidUtf8,

-- ** Search for arbitrary substrings
breakSubstring,

Expand Down Expand Up @@ -238,7 +242,7 @@ import Control.Exception (IOException, catch, finally, assert, throwIO)
import Control.Monad (when, void)

import Foreign.C.String (CString, CStringLen)
import Foreign.C.Types (CSize)
import Foreign.C.Types (CSize (CSize), CInt (CInt))
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr, touchForeignPtr)
import Foreign.ForeignPtr.Unsafe(unsafeForeignPtrToPtr)
import Foreign.Marshal.Alloc (allocaBytes)
Expand Down Expand Up @@ -1523,6 +1527,17 @@ stripSuffix bs1@(BS _ l1) bs2@(BS _ l2)
isInfixOf :: ByteString -> ByteString -> Bool
isInfixOf p s = null p || not (null $ snd $ breakSubstring p s)

-- | /O(n)/ Check whether a 'ByteString' represents valid UTF-8.
--
-- @since 0.11.2.0
isValidUtf8 :: ByteString -> Bool
Bodigrim marked this conversation as resolved.
Show resolved Hide resolved
isValidUtf8 (BS ptr len) = accursedUnutterablePerformIO $ unsafeWithForeignPtr ptr $ \p -> do
CInt i <- isValidUtf8# p (CSize . fromIntegral $ len)
pure $ i /= 0

foreign import ccall unsafe "is_valid_utf8" isValidUtf8#
:: Ptr Word8 -> CSize -> IO CInt

-- | Break a string on a substring, returning a pair of the part of the
-- string prior to the match, and the rest of the string.
--
Expand Down
20 changes: 17 additions & 3 deletions bytestring.cabal
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
Name: bytestring
Version: 0.11.1.0
Version: 0.11.2.0
Synopsis: Fast, compact, strict and lazy byte strings with a list interface
Description:
An efficient compact, immutable byte string type (both strict and lazy)
Expand Down Expand Up @@ -113,9 +113,15 @@ library
-fmax-simplifier-iterations=10
-fdicts-cheap
-fspec-constr-count=6
if (arch(x86_64) || arch(i386))
c-sources: cbits/fpstring.c
cbits/itoa.c
cbits/x86/is-valid-utf8.c
else
c-sources: cbits/fpstring.c
cbits/itoa.c
cbits/is-valid-utf8.c
kozross marked this conversation as resolved.
Show resolved Hide resolved

c-sources: cbits/fpstring.c
cbits/itoa.c
cc-options: -std=c11
include-dirs: include
includes: fpstring.h
Expand Down Expand Up @@ -172,6 +178,14 @@ test-suite bytestring-th
ghc-options: -Wall -fwarn-tabs -threaded -rtsopts
default-language: Haskell2010

test-suite is-valid-utf8
type: exitcode-stdio-1.0
hs-source-dirs: tests/is-valid-utf8
main-is: Main.hs
build-depends: base, bytestring, tasty, tasty-quickcheck, QuickCheck
ghc-options: -Wall -fwarn-tabs -threaded -rtsopts
default-language: Haskell2010

benchmark bytestring-bench
main-is: BenchAll.hs
other-modules: BenchBoundsCheckFusion
Expand Down
2 changes: 2 additions & 0 deletions cabal.project.local
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
ignore-project: False
with-compiler: ghc-9.0.1
108 changes: 108 additions & 0 deletions cbits/is-valid-utf8.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
#pragma GCC push_options
sjakobi marked this conversation as resolved.
Show resolved Hide resolved
#pragma GCC optimize("-O2")
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// 0x80 in every 'lane'.
static uint64_t const high_bits_mask = 0x8080808080808080ULL;

int is_valid_utf8(uint8_t const *const src, size_t const len) {
Bodigrim marked this conversation as resolved.
Show resolved Hide resolved
uint8_t const *ptr = (uint8_t const *)src;
// This is 'one past the end' to make loop termination and bounds checks
// easier.
uint8_t const *const end = ptr + len;
while (ptr < end) {
uint8_t const byte = *ptr;
// Check if the byte is ASCII.
if (byte <= 0x7F) {
ptr++;
// If we saw one ASCII byte, as long as it's not whitespace, it's quite
// likely we'll see more.
bool is_not_whitespace = byte > 32;
// If possible, do a block-check ahead.
if ((ptr + 32 < end) && is_not_whitespace) {
uint64_t const *big_ptr = (uint64_t const *)ptr;
// Non-ASCII bytes have a set MSB. Thus, if we AND with 0x80 in every
// 'lane', we will get 0 if everything is ASCII, and something else
// otherwise.
uint64_t results[4] = {(*big_ptr) & high_bits_mask,
(*(big_ptr + 1)) & high_bits_mask,
(*(big_ptr + 2)) & high_bits_mask,
(*(big_ptr + 3)) & high_bits_mask};
Comment on lines +59 to +62
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C is not a lazy language, I don't think it makes sense to precompute all 4 strides, when just below you don't use the remaining strides when one of the masked results is non-zero... The start pointer here is not aligned to a 32-byte multiple, so I don't think we get particular cache line efficiency here by preloading these? Am I missing something?

if (results[0] == 0) {
ptr += 8;
if (results[1] == 0) {
ptr += 8;
if (results[2] == 0) {
ptr += 8;
if (results[3] == 0) {
ptr += 8;
} else {
ptr += (__builtin_ctzl(results[3]) / 8);
Bodigrim marked this conversation as resolved.
Show resolved Hide resolved
}
} else {
ptr += (__builtin_ctzl(results[2]) / 8);
}
} else {
ptr += (__builtin_ctzl(results[1]) / 8);
}
} else {
ptr += (__builtin_ctzl(results[0]) / 8);
}
}
}
// Check for a valid 2-byte sequence.
//
// We use a signed comparison to avoid an extra comparison with 0x80, since
// _signed_ 0x80 is -128.
else if (ptr + 1 < end && byte >= 0xC2 && byte <= 0xDF &&
((int8_t) * (ptr + 1)) <= (int8_t)0xBF) {
ptr += 2;
}
// Check for a valid 3-byte sequence.
else if (ptr + 2 < end && byte >= 0xE0 && byte <= 0xEF) {
uint8_t const byte2 = *(ptr + 1);
bool byte2_valid = (int8_t)byte2 <= (int8_t)0xBF;
bool byte3_valid = ((int8_t) * (ptr + 2)) <= (int8_t)0xBF;
if (byte2_valid && byte3_valid &&
// E0, A0..BF, 80..BF
((byte == 0xE0 && byte2 >= 0xA0) ||
// E1..EC, 80..BF, 80..BF
(byte >= 0xE1 && byte <= 0xEC) ||
// ED, 80..9F, 80..BF
(byte == 0xED && byte2 <= 0x9F) ||
// EE..EF, 80..BF, 80..BF
(byte >= 0xEE && byte <= 0xEF))) {
ptr += 3;
} else {
return 0;
}
}
// Check for a valid 4-byte sequence.
else if (ptr + 3 < end) {
uint8_t const byte2 = *(ptr + 1);
bool byte2_valid = (int8_t)byte2 <= (int8_t)0xBF;
bool byte3_valid = ((int8_t) * (ptr + 2)) <= (int8_t)0xBF;
bool byte4_valid = ((int8_t) * (ptr + 3)) <= (int8_t)0xBF;
if (byte2_valid && byte3_valid && byte4_valid &&
// F0, 90..BF, 80..BF, 80..BF
((byte == 0xF0 && byte2 >= 0x90) ||
// F1..F3, 80..BF, 80..BF, 80..BF
(byte >= 0xF1 && byte <= 0xF3) ||
// F4, 80..8F, 80..BF, 80..BF
(byte == 0xF4 && byte2 <= 0x8F))) {
ptr += 4;
} else {
return 0;
}
}
// Otherwise, invalid.
else {
return 0;
}
}
// If we got this far, we're valid.
return 1;
}
#pragma GCC pop_options
Loading