Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use safe isValidUtf8 for large inputs #470

Merged
merged 3 commits into from
Feb 24, 2022

Conversation

noughtmare
Copy link
Contributor

Fixes #451

@Bodigrim
Copy link
Contributor

@noughtmare could you please take care of ShortByteString as well?

isValidUtf8 :: ShortByteString -> Bool
isValidUtf8 sbs@(SBS ba#) = accursedUnutterablePerformIO $ do
i <- cIsValidUtf8 ba# (fromIntegral (length sbs))
return $ i /= 0
foreign import ccall unsafe "bytestring_is_valid_utf8" cIsValidUtf8
:: ByteArray# -> CSize -> IO CInt

@noughtmare
Copy link
Contributor Author

noughtmare commented Jan 20, 2022

Done! Edit: now really...

@Bodigrim Bodigrim requested a review from sjakobi January 20, 2022 20:46
Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good amount of duplication – would it be difficult to refactor this code so it's shared between the ByteString and the ShortByteString versions?

Data/ByteString.hs Show resolved Hide resolved
@noughtmare
Copy link
Contributor Author

The FFI imports are very subtly different in their type and they are also used in a slightly different way, so I would say that it is hard to deduplicate.

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2022

The FFI imports are very subtly different in their type and they are also used in a slightly different way, so I would say that it is hard to deduplicate.

Alright. Please add cross-references then that remind us to update both versions in sync that changes to one version should probably also be made to the other version.

@noughtmare
Copy link
Contributor Author

@sjakobi I have incorporated your suggestions.

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Bodigrim Bodigrim added this to the 0.11.3.0 milestone Jan 20, 2022
@clyring
Copy link
Member

clyring commented Jan 20, 2022

Are we sure that every supported version of GHC will always make large ByteArray#s pinned? If not, then the safe FFI call for ShortByteString is a segfault waiting to happen. I think it's very cheap insurance to also check isByteArrayPinned# in the large ShortByteString case.

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2022

@clyring Good point. There's some documentation on this at https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/ffi.html#pinned-byte-arrays, but it's unclear what the guarantees are with older GHCs, or to what extent this may change in the future. Let's add the pinnedness check.

let n = length sbs
i <- if n < 1000000
then cIsValidUtf8 ba# (fromIntegral n)
else cIsValidUtf8Safe ba# (fromIntegral n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is ba# guaranteed to be kept alive for the duration of this call? Or do we need some touch# or with... stuff here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right that some kind of touch is necessary here, although I don't know what would be the best way to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think checking isPinnedByteArray# is sufficient here. We do not extract a pointer to byteArrayContent#, so it's GHC's responsibility not to garbage collect ba# before time.

Copy link
Contributor Author

@noughtmare noughtmare Jan 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so the touch# only has to be used if you actually use an Addr#, not with ByteArray#. @clyring do you agree that this doesn't need any further action?

Copy link
Member

@clyring clyring Jan 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My quick testing does suggest that GHC currently does keep ByteArray# foreign function parameters alive for the duration of even a safe FFI call. But if this behavior is documented anywhere I wasn't able to find it. (I should raise a ticket on GHC's issue tracker about this. see also this GHC issue) I would personally prefer some touch# anyway to be safe since it should have no runtime cost and if it turns out to have been necessary in some situation the resulting bug will be near-impossible for a user to track down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure what exactly you suggest to touch. byteArrayContent#?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had in mind the ByteArray# reference itself, ba#. Though I concede it's possible that this won't do what I expect.

But, now that I check, the quick test I hacked together also fails to detect the unsafety of Addr# passing without any form of touch# or keepAlive#. So my previous comment should not be seen as persuasive evidence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not follow. We have a thunk depending on ba# - how could it be garbage collected before the thunk is fully evaluated?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my concern is that there might not be a thunk depending on ba# at runtime, if ba# is not subsequently used in Haskell code. I don't know the exact inner workings of GHC's implementation of safe FFI calls at runtime. If it involves creating a pinned thunk that holds on to the arguments of the FFI call, then that would obviously keep alive the ByteArray# argument. But this isn't made clear by the documentation, and it's easy to imagine an implementation that doesn't provide such a guarantee.

@noughtmare
Copy link
Contributor Author

I've added the pinnedness check.

@noughtmare
Copy link
Contributor Author

I've disabled the safe isValidUtf8 for short bytestrings for base versions before 4.10, because those don't have isByteArrayPinned# yet.

@AndreasPK
Copy link
Contributor

I haven't looked at the code but do note my comment on the ghc issue tracker: https://gitlab.haskell.org/ghc/ghc/-/issues/21025#note_405881

This might need touch# in some places in light of this.

@noughtmare
Copy link
Contributor Author

I've added the touch. It is a bit awkward. Does anyone know a better way?

@Bodigrim
Copy link
Contributor

@clyring any opinion about touch# please?
@noughtmare could you please rebase?

i <- if n < 1000000 || not (isPinned ba#)
then cIsValidUtf8 ba# (fromIntegral n)
else cIsValidUtf8Safe ba# (fromIntegral n)
IO (\s -> (# touch# ba# s, () #))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know of a better way to do this. As direct primop uses go, this is very clean.

Consider adding a comment linking to the earlier discussion about this or to the related GHC ticket.

@Bodigrim Bodigrim merged commit 38889aa into haskell:master Feb 24, 2022
@Bodigrim
Copy link
Contributor

Rebased and merged, thanks @noughtmare!

Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Feb 24, 2022
* Use safe isValidUtf8 for large inputs

* Use compat function for checking pinnedness

* Touch byte array after passing it to safe FFI call
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Switch between safe and unsafe calls to isValidUtf8 depending on size
5 participants