Speed up Data.ByteString.Short.unpack #526
Conversation
Git history reveals the relevant previous implementation:

- bytestring/Data/ByteString/Internal.hs, lines 454 to 457 at 2b9416d
- bytestring/Data/ByteString/Internal.hs, lines 494 to 502 at 2b9416d
This is a reasonable strategy for

My laptop might not be too reliable about performance tests, but I ran it twice and it seems that the strict

Core:

Could you change

I don't find it very surprising that a

But why is it faster with the previous version?

Comparing the Core between the old and new versions, both for

Here are the cores: https://gist.github.com/hasufell/653dea9cc5ad0f0d6cd35e47442fe35b
Even regular

I was able to speed them up by writing explicit versions:

```haskell
foldr :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = k (w n) (go (n + 1))
  in go 0

-- TODO: could write tail recursive right to left version
foldr' :: (Word8 -> a -> a) -> a -> ShortByteString -> a
foldr' k v = \sbs ->
  let l  = length sbs
      ba = asBA sbs
      w  = indexWord8Array ba
      go !n | n >= l    = v
            | otherwise = let !acc = go (n + 1)
                          in k (w n) acc
  in go 0
```

but doesn't that indicate that there's a problem with list fusion?
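The fusion question above can be made concrete with a small, self-contained sketch (the names `myUnpack` and the pure `index` argument are illustrative stand-ins, not the library's API): expressing a producer via `GHC.Exts.build` is what allows good list consumers such as `sum` or `take` to eliminate the intermediate list.

```haskell
{-# LANGUAGE BangPatterns #-}
import GHC.Exts (build)
import Data.Word (Word8)

-- Illustrative producer: 'index' stands in for reading a byte at an offset.
-- Because the loop is wrapped in 'build', the build/foldr rewrite rule can
-- fuse it with any consumer written in terms of 'foldr'.
myUnpack :: Int -> (Int -> Word8) -> [Word8]
myUnpack len index = build (\k z ->
  let go !n | n >= len  = z
            | otherwise = index n `k` go (n + 1)
  in go 0)
{-# INLINE myUnpack #-}
```

With this shape, a pipeline like `sum (myUnpack n f)` can compile down to a single counting loop with no list allocated at all.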
Ok, so lazy

That only leaves strict

That leaves the question whether we have other functions that don't fuse/inline properly as well.
So, here are the benchmark results of the latest patchset, compared with the baseline: https://gist.github.com/hasufell/3710e86fcdfcd5f2c641e7d0a57e30ce
The need to use

The performance characteristics of

If I'm reading the benchmark results @hasufell shared in the previous comment correctly, the proposed implementation takes about 75% longer in this case, which is a rather large regression. One alternative worth considering is adapting the existing fusion mechanism for

Also, the benchmark results @hasufell has reported for 'unpack and look at a small number of leading elements' here and in #524 seem unbelievably large. Reading 100 bytes, allocating a hundred cons cells, and inspecting a few of them should take on the order of a microsecond, not on the order of ten milliseconds. (And it could reasonably be argued that even starting with 100 bytes is too much.) That's four orders of magnitude! I haven't tried reproducing or investigating yet, but something is very wrong somewhere.
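As a back-of-envelope check of that claim (a rough sketch, not the package's actual benchmark suite): with a lazy unpack, looking at a short prefix of a 100-byte string should allocate only a handful of cons cells, so a ~10ms measurement would point at the whole list being forced, or at fusion failing entirely.

```haskell
import qualified Data.ByteString.Short as SBS
import Data.Word (Word8)

sample :: SBS.ShortByteString
sample = SBS.pack (replicate 100 (42 :: Word8))

-- With a lazy unpack, this demands only the first few cons cells; the
-- remaining ~97 bytes need never be materialized as list nodes.
leading :: [Word8]
leading = take 3 (SBS.unpack sample)
```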
@clyring when I use this implementation:

```haskell
-- | /O(n)/. Convert a 'ShortByteString' into a list.
unpack :: ShortByteString -> [Word8]
unpack sbs = build (unpackFoldr sbs)
{-# INLINE unpack #-}

--
-- Have unpack fuse with good list consumers
--
unpackFoldr :: ShortByteString -> (Word8 -> a -> a) -> a -> a
unpackFoldr sbs k z = foldr k z sbs
{-# INLINE [0] unpackFoldr #-}

{-# RULES
"ShortByteString unpack-list" [1] forall bs .
    unpackFoldr bs (:) [] = unpackBytes bs
 #-}
```

with the existing 100 chunk
however... "unpack and get last element" from this PR is still orders of magnitude faster than the version above with unpackFoldr:
This really terrible performance is caused by the bang on
I looked into this; it stems from a quirk of the current implementation of

In more typical situations, there is no advantage to peeling off the last iteration of the fused loop in this way. Indeed, it typically results mainly in the loop body being duplicated once.
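For the "unpack and get last element" pattern specifically, no list is needed at all; a direct indexed read (a sketch using the public `Data.ByteString.Short` API, not code from this PR) sidesteps the question of how the fused loop's last iteration is handled:

```haskell
import qualified Data.ByteString.Short as SBS
import Data.Word (Word8)

-- O(1): read the final byte directly instead of unpacking to a list.
lastByte :: SBS.ShortByteString -> Maybe Word8
lastByte sbs
  | n == 0    = Nothing
  | otherwise = Just (SBS.index sbs (n - 1))
  where n = SBS.length sbs
```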
Alright... so I changed the implementation to be similar to Data.ByteString, utilizing the

The results look pretty good... only the first few tests seem to have regressed, but everything else is apparently faster:

Bench result
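The chunked strategy borrowed from Data.ByteString can be sketched roughly as follows (`chunkedUnpack` and `chunkSize` are illustrative names, not the PR's actual definitions, and the real implementation builds each chunk strictly from right to left for efficiency): the list is produced range by range, and the recursive tail stays a lazy thunk, so consumers that stop early pay for at most one extra chunk.

```haskell
import qualified Data.ByteString.Short as SBS
import Data.Word (Word8)

chunkSize :: Int
chunkSize = 100

chunkedUnpack :: SBS.ShortByteString -> [Word8]
chunkedUnpack sbs = go 0
  where
    len = SBS.length sbs
    go base
      | base >= len = []
      | otherwise   =
          let end   = min len (base + chunkSize)
              chunk = [ SBS.index sbs i | i <- [base .. end - 1] ]
          in chunk ++ go end   -- 'go end' is a thunk, forced only on demand
```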
However... it's still (a little) slower for the last 2 benchmarks than my original unpack that uses

```haskell
unpack sbs = let ix = length sbs - 1
             in List.map (unsafeIndex sbs) [0..ix]
{-# INLINE unpack #-}
```

which runs
vs the latest implementation:
But that's quite an improvement.
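The reason the `List.map`-over-a-range formulation above performs well is that both `map` and the `[0 .. ix]` enumeration participate in build/foldr fusion, so a good consumer can eliminate the intermediate index list entirely. A sketch of the same shape using the safe `index` in place of `unsafeIndex` (`unpackViaMap` is an illustrative name):

```haskell
import qualified Data.ByteString.Short as SBS
import Data.Word (Word8)

-- 'map' over '[0 .. ix]' fuses: a good consumer of the result can see a
-- single counting loop rather than two intermediate lists.
unpackViaMap :: SBS.ShortByteString -> [Word8]
unpackViaMap sbs = map (SBS.index sbs) [0 .. SBS.length sbs - 1]
{-# INLINE unpackViaMap #-}
```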
Sorry for the delays, @hasufell. This seems to have an excellent power-to-weight ratio and I'd love to see it merged soon. I have one nit-pick to make, though.
Co-authored-by: Matthew Craven <[email protected]>
Drone CI seems broken
```haskell
{-# OPTIONS_HADDOCK not-home #-}

{-# OPTIONS_GHC -fno-warn-name-shadowing -fexpose-all-unfoldings #-}
```
I think we should probably stop using -fexpose-all-unfoldings. Originally I thought that this flag would have the same effect as marking every definition as INLINABLE, but it actually exposes the optimized unfoldings, which may not interact with RULES etc. the way the unoptimized unfoldings would. The GHC issues https://gitlab.haskell.org/ghc/ghc/-/issues/22202 and https://gitlab.haskell.org/ghc/ghc/-/issues/22203 are relevant in this context.
This doesn't need to happen in this PR though.
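A minimal illustration of the alternative being suggested (the function name is made up for the example): marking a definition `INLINABLE` records its unoptimized unfolding in the interface file, so downstream modules can still specialize it and rewrite rules can still match against the original right-hand side, which `-fexpose-all-unfoldings` does not guarantee since it exposes the optimized unfoldings.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Illustrative: an INLINABLE strict left fold whose unoptimized
-- unfolding is kept in the .hi file for cross-module specialization.
myFoldl' :: (b -> a -> b) -> b -> [a] -> b
myFoldl' f = go
  where
    go !acc []       = acc
    go !acc (x : xs) = go (f acc x) xs
{-# INLINABLE myFoldl' #-}
```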
* Add benchmarks for pack/unpack (ShortByteString)
* Add benchmarks for lazy folds
* Improve performance of pack/unpack/folds in ShortByteString
* Use `GHC.Exts.build (unpackFoldr sbs)` for unpack and fix laziness
* Don't inline 'break' with zero arguments

Co-authored-by: Matthew Craven <[email protected]>
Following the discussion from #524 (comment)
Before:
After:

I'm not sure what the benefits of the old implementation are. This one seems to be better in every regard and probably fuses better.