-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] isValidUtf8 function, tests #423
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for nicely commented C-code!
@kozross As witnessed by CI, this does not quite work at least on Mac, unfortunately.
|
This is difficult for me to debug, as I don't have a Mac. |
@kozross FreeBSD build also fails: https://cirrus-ci.com/task/4802955137253376?logs=main#L715 |
This is possible. I can test this, but this involves building a |
@Bodigrim - I have narrowed the issue down. It occurs only in the combination of Clang and Interestingly, in my case (GHC 9.0.1, built with Clang), I don't even get notified of a segfault: the tests just fail outright without explanation. Would setting up a catch of |
Sorry, AFK, cannot take a closer look atm. |
|
@Bodigrim That's a generator bug - that's not invalid UTF-8, since that's the sequence |
Windows failure sounds like https://gitlab.haskell.org/ghc/ghc/-/issues/19900, please try |
@kozross I recall that there is a subtle issue with TemplateHaskell and C sources introduced in GHC 9.0.1, already fixed in GHC HEAD. Could you please manually downgrade GHC, used in Windows build, to 8.10 in your branch? bytestring/.github/workflows/ci.yml Lines 22 to 23 in bd3c32c
|
Ah, big endian is biting again:
You can reproduce as follows: apt install -y docker.io && \
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes && \
docker run --rm -i -t s390x/ubuntu and then inside of the virtual machine:
|
@Bodigrim Sure. I'll need an 8.10 with Clang to make sure I catch breakages, but that's workable. Endianness is... a tricky business for sure. I need to do some refactors to make my code endian-safe, which will likely cause regressions in performance. I could probably do some checks statically for this, as I don't think we need to be runtime endianness portable. This should only be an issue in the fallback though. |
@kozross If you could possibly resolve big endian issue soon, we might potentially merge as is and leave performance polishing for later. It’s important to get API in place and make a release of |
@Bodigrim Will get that done tomorrow. |
This is blocked now by Windows build, unfortunately. I thought that we are dealing with https://gitlab.haskell.org/ghc/ghc/-/issues/19417, which is specific to 9.0 series, but apparently it is something else. @kozross if you get time to raise this build failure at GHC bug tracker, it would be much appreciated. |
@Bodigrim Do we need a minimal reproducing example, or is it enough just to aim people at this PR, or specifically, the build failure? |
I don't really understand why 32-bit build was fine before, but now |
@kozross please rebase atop the latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kozross! I rely on tests to validate UTF-8 validation logic, so just some nitpicking. Sorry that it takes so long.
@vdukhovni would you be interested to take a look at C?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick, initial review. I don't understand C very well, so I can't say much about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions/comments.
uint64_t results[4] = {(*big_ptr) & high_bits_mask, | ||
(*(big_ptr + 1)) & high_bits_mask, | ||
(*(big_ptr + 2)) & high_bits_mask, | ||
(*(big_ptr + 3)) & high_bits_mask}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C is not a lazy language, I don't think it makes sense to precompute all 4 strides, when just below you don't use the remaining strides when one of the masked results is non-zero... The start pointer here is not aligned to a 32-byte multiple, so I don't think we get particular cache line efficiency here by preloading these? Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thanks @kozross, this is great! Would you be able to work on aarch64 solution and restoration of happy path for ASCII any time soon? |
* isValidUtf8 function, tests * Cleanup, change to forAll * Fix accidentally-correct generation of wrong test cases * Increase test count, fix segfault on Clang * Try fixing Windows * Eliminate block-checking for ASCII in fallback mode * Use GHC 8.10 for CI on Windows * Namespace externally-visible C function better * Link to libgcc on Windows * Fix error accumulation in AVX2 check * Fix use of <> on older GHCs * Repair Drone config to deal with older Cabal * Ensure non-SSE2 x86 can still build * License information and contact for C code * Fix test name typo, allow more than pre-set count of tests * Stop using MagicHash for FFI wrapper * Revert change to 8.10 for Windows CI * Use existing license verbatim for C code * Explain the use of gcc and gcc_s * More concise implementation of isValidUtf8 * Document AVX2 block-checking function
@Bodigrim I can do this within the next day or two. The Aarch64 solution using SIMD instructions is fairly easy (and the fallback is pretty fast in the meantime), whereas the 'happy path' is only disabled on the fallback, and will take a bit longer. I have the tools to test this properly now at least, so it shouldn't be very long. |
* isValidUtf8 function, tests * Cleanup, change to forAll * Fix accidentally-correct generation of wrong test cases * Increase test count, fix segfault on Clang * Try fixing Windows * Eliminate block-checking for ASCII in fallback mode * Use GHC 8.10 for CI on Windows * Namespace externally-visible C function better * Link to libgcc on Windows * Fix error accumulation in AVX2 check * Fix use of <> on older GHCs * Repair Drone config to deal with older Cabal * Ensure non-SSE2 x86 can still build * License information and contact for C code * Fix test name typo, allow more than pre-set count of tests * Stop using MagicHash for FFI wrapper * Revert change to 8.10 for Windows CI * Use existing license verbatim for C code * Explain the use of gcc and gcc_s * More concise implementation of isValidUtf8 * Document AVX2 block-checking function
Closes #417, using work done previously for
text
's UTF-8ing efforts. This is marked WIP for now until we can get comparative benches of this solution versus thesimdjson
-based one currently intext
.Currently, there is no Aarch64-specific solution - I will provide one in a follow-up PR. The fallback C version should be quite fast though.