Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch between safe and unsafe calls to isValidUtf8 depending on size #451

Closed
Bodigrim opened this issue Dec 11, 2021 · 5 comments · Fixed by #470
Closed

Switch between safe and unsafe calls to isValidUtf8 depending on size #451

Bodigrim opened this issue Dec 11, 2021 · 5 comments · Fixed by #470
Milestone

Comments

@Bodigrim
Copy link
Contributor

See https://www.reddit.com/r/haskell/comments/rc0us9/comment/hnrx0nr/?utm_source=share&utm_medium=web2x&context=3

CC @noughtmare @kozross

@Bodigrim Bodigrim added this to the 0.11.3.0 milestone Dec 11, 2021
@kozross
Copy link
Contributor

kozross commented Dec 11, 2021

This is an interesting observation. I can make that change easily enough - what do you think we should set the 'breakpoint' at which we separate safe and unsafe calls?

@noughtmare
Copy link
Contributor

I think around 1MB is reasonable. I've made a branch with benchmarks here. The results on my machine are:

All
  isValidUtf8
    1000
      unsafe: OK (0.61s)
        35.2 ns ± 1.2 ns
      safe:   OK (0.42s)
        98.4 ns ± 6.0 ns
      mix:    OK (0.63s)
        36.4 ns ± 1.1 ns
    10000
      unsafe: OK (0.37s)
        177  ns ± 6.9 ns
      safe:   OK (0.25s)
        231  ns ±  22 ns
      mix:    OK (0.19s)
        173  ns ±  17 ns
    100000
      unsafe: OK (0.29s)
        2.14 μs ± 198 ns
      safe:   OK (0.14s)
        2.22 μs ± 189 ns
      mix:    OK (0.28s)
        2.14 μs ± 149 ns
    1000000
      unsafe: OK (0.19s)
        24.3 μs ± 1.8 μs
      safe:   OK (0.20s)
        24.4 μs ± 1.7 μs
      mix:    OK (0.40s)
        24.5 μs ± 1.9 μs
    10000000
      unsafe: OK (0.13s)
        506  μs ±  47 μs
      safe:   OK (0.13s)
        507  μs ±  43 μs
      mix:    OK (0.13s)
        510  μs ±  39 μs
    100000000
      unsafe: OK (0.18s)
        5.97 ms ± 401 μs
      safe:   OK (0.19s)
        5.99 ms ± 342 μs
      mix:    OK (0.19s)
        6.01 ms ± 364 μs
    1000000000
      unsafe: OK (0.78s)
        59.5 ms ± 2.0 ms
      safe:   OK (0.18s)
        59.5 ms ± 4.0 ms
      mix:    OK (0.18s)
        59.1 ms ± 3.7 ms

All 21 tests passed (6.27s)

The input is just S.replicate (10 ^ i) 65 where i is between 3 and 9.

@Bodigrim
Copy link
Contributor Author

Yeah, 1M sounds reasonable to me. Please patch an instance for ShortByteString as well:

foreign import ccall unsafe "bytestring_is_valid_utf8" cIsValidUtf8
:: ByteArray# -> CSize -> IO CInt

@Bodigrim
Copy link
Contributor Author

@noughtmare @kozross could one of you contribute a patch please?

@noughtmare
Copy link
Contributor

Does this require a benchmark? I can't really imagine that this might regress in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants