Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] isValidUtf8 function, tests #423

Merged
merged 24 commits into from
Nov 3, 2021
Merged

[WIP] isValidUtf8 function, tests #423

merged 24 commits into from
Nov 3, 2021

Conversation

kozross
Copy link
Contributor

@kozross kozross commented Sep 15, 2021

Closes #417, using work done previously for text's UTF-8ing efforts. This is marked WIP for now until we can get comparative benches of this solution versus the simdjson-based one currently in text.

Currently, there is no Aarch64-specific solution - I will provide one in a follow-up PR. The fallback C version should be quite fast though.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for nicely commented C-code!

.gitignore Outdated Show resolved Hide resolved
Changelog.md Show resolved Hide resolved
bytestring.cabal Outdated Show resolved Hide resolved
tests/is-valid-utf8/Main.hs Outdated Show resolved Hide resolved
@Bodigrim
Copy link
Contributor

@kozross As witnessed by CI, this does not quite work at least on Mac, unfortunately.

$ cabal run is-valid-utf8
Resolving dependencies...
Build profile: -w ghc-8.10.6 -O1
In order, the following will be built (use -v for more details):
 - bytestring-0.11.2.0 (test:is-valid-utf8) (additional components to build)
Preprocessing test suite 'is-valid-utf8' for bytestring-0.11.2.0..
Building test suite 'is-valid-utf8' for bytestring-0.11.2.0..
zsh: segmentation fault  cabal run is-valid-utf8

@kozross
Copy link
Contributor Author

kozross commented Sep 16, 2021

This is difficult for me to debug, as I don't have a Mac.

@Bodigrim
Copy link
Contributor

@kozross FreeBSD build also fails: https://cirrus-ci.com/task/4802955137253376?logs=main#L715
Does it suggest an issue with clang?

@kozross
Copy link
Contributor Author

kozross commented Sep 20, 2021

This is possible. I can test this, but this involves building a clang-only toolchain. I'll try doing that and see if it reproduces.

@kozross
Copy link
Contributor Author

kozross commented Oct 3, 2021

@Bodigrim - I have narrowed the issue down. It occurs only in the combination of Clang and tasty. GHCi works fine, a toy executable using isValidUtf8 also works fine, and both behave exactly as we would expect. As I don't know the internals of tasty, I can only assume that there's something going on there that causes this to segfault.

Interestingly, in my case (GHC 9.0.1, built with Clang), I don't even get notified of a segfault: the tests just fail outright without explanation.

Would setting up a catch of SomeException in the test bodies possibly tell us something? I genuinely do not know if segfaults in FFI code trigger the throwing of some exception (pun not intended) or not.

@Bodigrim
Copy link
Contributor

Bodigrim commented Oct 3, 2021

Sorry, AFK, cannot take a closer look atm.
This is more or less typical for segmentation faults: different heap layouts lead to different outcomes and may mask an issue. I've seen this couple of times before and it rarely signals any issue with tasty itself.

@Bodigrim
Copy link
Contributor

Bodigrim commented Oct 9, 2021

$ cabal run is-valid-utf8 -- -p '/Invalid UTF-8/' --quickcheck-replay=431466
Up to date
UTF-8 validation
  Invalid UTF-8: FAIL
    *** Failed! Falsified (after 49 tests):
    InvalidUtf8 {prefix = "", invalid = "\214", suffix = "\130{", asBS = "\214\130{", length = 3}
    True /= False
    Use --quickcheck-replay=431466 to reproduce.

1 out of 1 tests failed (0.00s)

@Bodigrim Bodigrim mentioned this pull request Oct 10, 2021
@kozross
Copy link
Contributor Author

kozross commented Oct 10, 2021

@Bodigrim That's a generator bug - that's not invalid UTF-8, since that's the sequence 0xD6, 0x82 (which is 'ւ') followed by a '{'. Will fix too.

@Bodigrim
Copy link
Contributor

Windows failure sounds like https://gitlab.haskell.org/ghc/ghc/-/issues/19900, please try if os(windows) extra-libraries: gcc_s.

@Bodigrim
Copy link
Contributor

@kozross I recall that there is a subtle issue with TemplateHaskell and C sources introduced in GHC 9.0.1, already fixed in GHC HEAD. Could you please manually downgrade GHC, used in Windows build, to 8.10 in your branch?

- os: windows-latest
ghc: 'latest'

@Bodigrim
Copy link
Contributor

Ah, big endian is biting again:

UTF-8 validation
  Valid UTF-8:   FAIL (0.05s)
    *** Failed! Falsified (after 40 tests):
    "\233\128\154\231\148\160\231\141\173/\235\178\151eB\221\183g\217\154\228\164\189\208\154\242\142\168\153\205\146\231\135\176\229\162\189\219\145\240\185\158\159\236\150\141\207\173B\224\175\143\243\151\164\148\225\134\178"
    False /= True
    Use --quickcheck-replay=410548 to reproduce.
    Use -p '/Valid UTF-8/' to rerun this test only.
  Invalid UTF-8: OK (16.27s)
    +++ OK, passed 100000 tests.

You can reproduce as follows:

apt install -y docker.io && \
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes && \
docker run --rm -i -t s390x/ubuntu

and then inside of the virtual machine:

apt update && \
apt install -y ghc cabal-install git build-essential && \
git clone https://github.com/mlabs-haskell/bytestring && \
cd bytestring && \
git checkout master && \
cabal new-update && \
cabal -j1 new-test is-valid-utf8

@kozross
Copy link
Contributor Author

kozross commented Oct 11, 2021

@Bodigrim Sure. I'll need an 8.10 with Clang to make sure I catch breakages, but that's workable.

Endianness is... a tricky business for sure. I need to do some refactors to make my code endian-safe, which will likely cause regressions in performance. I could probably do some checks statically for this, as I don't think we need to be runtime endianness portable. This should only be an issue in the fallback though.

@Bodigrim
Copy link
Contributor

@kozross If you could possibly resolve big endian issue soon, we might potentially merge as is and leave performance polishing for later. It’s important to get API in place and make a release of bytestring, so that text can be built upon it.

@kozross
Copy link
Contributor Author

kozross commented Oct 16, 2021

@Bodigrim Will get that done tomorrow.

@Bodigrim
Copy link
Contributor

This is blocked now by Windows build, unfortunately. I thought that we are dealing with https://gitlab.haskell.org/ghc/ghc/-/issues/19417, which is specific to 9.0 series, but apparently it is something else. @kozross if you get time to raise this build failure at GHC bug tracker, it would be much appreciated.

@kozross
Copy link
Contributor Author

kozross commented Oct 18, 2021

@Bodigrim Do we need a minimal reproducing example, or is it enough just to aim people at this PR, or specifically, the build failure?

@Bodigrim
Copy link
Contributor

I don't really understand why 32-bit build was fine before, but now cabal-1.24 struggles to derive a build plan. Is it because there is one more build target?.. Anyways, please add --constraint 'QuickCheck +old-random' and raise a separate PR for it.

@Bodigrim
Copy link
Contributor

@kozross please rebase atop the latest master.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kozross! I rely on tests to validate UTF-8 validation logic, so just some nitpicking. Sorry that it takes so long.

@vdukhovni would you be interested to take a look at C?

.github/workflows/ci.yml Outdated Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
cbits/is-valid-utf8.c Show resolved Hide resolved
tests/is-valid-utf8/Main.hs Outdated Show resolved Hide resolved
tests/is-valid-utf8/Main.hs Outdated Show resolved Hide resolved
cbits/is-valid-utf8.c Outdated Show resolved Hide resolved
@Bodigrim Bodigrim requested a review from sjakobi October 29, 2021 21:29
Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick, initial review. I don't understand C very well, so I can't say much about it.

bytestring.cabal Show resolved Hide resolved
Data/ByteString.hs Outdated Show resolved Hide resolved
Copy link
Contributor

@vdukhovni vdukhovni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions/comments.

Comment on lines +59 to +62
uint64_t results[4] = {(*big_ptr) & high_bits_mask,
(*(big_ptr + 1)) & high_bits_mask,
(*(big_ptr + 2)) & high_bits_mask,
(*(big_ptr + 3)) & high_bits_mask};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C is not a lazy language, I don't think it makes sense to precompute all 4 strides, when just below you don't use the remaining strides when one of the masked results is non-zero... The start pointer here is not aligned to a 32-byte multiple, so I don't think we get particular cache line efficiency here by preloading these? Am I missing something?

cbits/is-valid-utf8.c Show resolved Hide resolved
cbits/is-valid-utf8.c Show resolved Hide resolved
Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Bodigrim Bodigrim merged commit dad6dd1 into haskell:master Nov 3, 2021
@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 3, 2021

Thanks @kozross, this is great! Would you be able to work on aarch64 solution and restoration of happy path for ASCII any time soon?

Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Nov 3, 2021
* isValidUtf8 function, tests

* Cleanup, change to forAll

* Fix accidentally-correct generation of wrong test cases

* Increase test count, fix segfault on Clang

* Try fixing Windows

* Eliminate block-checking for ASCII in fallback mode

* Use GHC 8.10 for CI on Windows

* Namespace externally-visible C function better

* Link to libgcc on Windows

* Fix error accumulation in AVX2 check

* Fix use of <> on older GHCs

* Repair Drone config to deal with older Cabal

* Ensure non-SSE2 x86 can still build

* License information and contact for C code

* Fix test name typo, allow more than pre-set count of tests

* Stop using MagicHash for FFI wrapper

* Revert change to 8.10 for Windows CI

* Use existing license verbatim for C code

* Explain the use of gcc and gcc_s

* More concise implementation of isValidUtf8

* Document AVX2 block-checking function
@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Nov 3, 2021
@kozross
Copy link
Contributor Author

kozross commented Nov 3, 2021

@Bodigrim I can do this within the next day or two. The Aarch64 solution using SIMD instructions is fairly easy (and the fallback is pretty fast in the meantime), whereas the 'happy path' is only disabled on the fallback, and will take a bit longer. I have the tools to test this properly now at least, so it shouldn't be very long.

noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
* isValidUtf8 function, tests

* Cleanup, change to forAll

* Fix accidentally-correct generation of wrong test cases

* Increase test count, fix segfault on Clang

* Try fixing Windows

* Eliminate block-checking for ASCII in fallback mode

* Use GHC 8.10 for CI on Windows

* Namespace externally-visible C function better

* Link to libgcc on Windows

* Fix error accumulation in AVX2 check

* Fix use of <> on older GHCs

* Repair Drone config to deal with older Cabal

* Ensure non-SSE2 x86 can still build

* License information and contact for C code

* Fix test name typo, allow more than pre-set count of tests

* Stop using MagicHash for FFI wrapper

* Revert change to 8.10 for Windows CI

* Use existing license verbatim for C code

* Explain the use of gcc and gcc_s

* More concise implementation of isValidUtf8

* Document AVX2 block-checking function
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide isValidUtf8 function
5 participants