-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use safe isValidUtf8 for large inputs #470
Conversation
@noughtmare could you please take care of bytestring/Data/ByteString/Short/Internal.hs Lines 616 to 622 in 6ff6ed4
|
8b910ed
to
e4b5545
Compare
Done! Edit: now really... |
e4b5545
to
00b2240
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good amount of duplication – would it be difficult to refactor this code so it's shared between the ByteString
and the ShortByteString
versions?
The FFI imports are very subtly different in their type and they are also used in a slightly different way, so I would say that it is hard to deduplicate. |
Alright. Please add cross-references then that remind us |
00b2240
to
cd14a11
Compare
@sjakobi I have incorporated your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Are we sure that every supported version of GHC will always make large |
@clyring Good point. There's some documentation on this at https://ghc.gitlab.haskell.org/ghc/doc/users_guide/exts/ffi.html#pinned-byte-arrays, but it's unclear what the guarantees are with older GHCs, or to what extent this may change in the future. Let's add the pinnedness check. |
let n = length sbs | ||
i <- if n < 1000000 | ||
then cIsValidUtf8 ba# (fromIntegral n) | ||
else cIsValidUtf8Safe ba# (fromIntegral n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is ba#
guaranteed to be kept alive for the duration of this call? Or do we need some touch#
or with...
stuff here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right that some kind of touch is necessary here, although I don't know what would be the best way to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking isPinnedByteArray#
is sufficient here. We do not extract a pointer to byteArrayContent#
, so it's GHC's responsibility not to garbage collect ba#
before time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the touch#
only has to be used if you actually use an Addr#
, not with ByteArray#
. @clyring do you agree that this doesn't need any further action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My quick testing does suggest that GHC currently does keep But if this behavior is documented anywhere I wasn't able to find it. (ByteArray#
foreign function parameters alive for the duration of even a safe FFI call.I should raise a ticket on GHC's issue tracker about this. see also this GHC issue) I would personally prefer some touch#
anyway to be safe since it should have no runtime cost and if it turns out to have been necessary in some situation the resulting bug will be near-impossible for a user to track down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m not sure what exactly you suggest to touch. byteArrayContent#?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had in mind the ByteArray#
reference itself, ba#
. Though I concede it's possible that this won't do what I expect.
But, now that I check, the quick test I hacked together also fails to detect the unsafety of Addr#
passing without any form of touch#
or keepAlive#
. So my previous comment should not be seen as persuasive evidence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not follow. We have a thunk depending on ba#
- how could it be garbage collected before the thunk is fully evaluated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my concern is that there might not be a thunk depending on ba#
at runtime, if ba#
is not subsequently used in Haskell code. I don't know the exact inner workings of GHC's implementation of safe FFI calls at runtime. If it involves creating a pinned thunk that holds on to the arguments of the FFI call, then that would obviously keep alive the ByteArray#
argument. But this isn't made clear by the documentation, and it's easy to imagine an implementation that doesn't provide such a guarantee.
cd14a11
to
a4ef5d9
Compare
I've added the pinnedness check. |
a4ef5d9
to
e3cc90a
Compare
I've disabled the safe isValidUtf8 for short bytestrings for base versions before 4.10, because those don't have |
I haven't looked at the code but do note my comment on the ghc issue tracker: https://gitlab.haskell.org/ghc/ghc/-/issues/21025#note_405881 This might need touch# in some places in light of this. |
I've added the touch. It is a bit awkward. Does anyone know a better way? |
@clyring any opinion about |
i <- if n < 1000000 || not (isPinned ba#) | ||
then cIsValidUtf8 ba# (fromIntegral n) | ||
else cIsValidUtf8Safe ba# (fromIntegral n) | ||
IO (\s -> (# touch# ba# s, () #)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know of a better way to do this. As direct primop uses go, this is very clean.
Consider adding a comment linking to the earlier discussion about this or to the related GHC ticket.
466d5d5
to
631c93d
Compare
Rebased and merged, thanks @noughtmare! |
* Use safe isValidUtf8 for large inputs * Use compat function for checking pinnedness * Touch byte array after passing it to safe FFI call
Fixes #451