
dropWhile is slow #141

Closed
charles-cooper opened this issue Oct 20, 2016 · 9 comments

charles-cooper commented Oct 20, 2016

The following example shows `dropWhile` running roughly 60x slower than `findIndex` followed by `drop n` when compiled with -O3, and roughly 560x slower without -O3!

https://github.com/charles-cooper/vector-perf
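The comparison being timed is roughly of this shape (a hedged sketch with illustrative names and a placeholder predicate, not the actual benchmark code from the linked repository):

```haskell
import qualified Data.Vector.Unboxed as U

-- Streaming version: at the time, dropWhile went through the fusion
-- framework and rebuilt the suffix, copying it into a fresh array.
viaDropWhile :: U.Vector Int -> U.Vector Int
viaDropWhile = U.dropWhile (< 5)

-- findIndex followed by drop: 'drop' only adjusts the slice's offset
-- and length, sharing the underlying array instead of copying it.
viaFindIndex :: U.Vector Int -> U.Vector Int
viaFindIndex v = maybe U.empty (`U.drop` v) (U.findIndex (>= 5) v)

main :: IO ()
main = do
  let v = U.generate 10 id
  print (viaDropWhile v)  -- [5,6,7,8,9]
  print (viaFindIndex v)  -- [5,6,7,8,9]
```

Both compute the same suffix; the difference is purely whether the suffix is copied or shared.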

$ ./run_perf 
+ stack clean
+ stack build
vector-perf-0.1.0.0: configure
Configuring vector-perf-0.1.0.0...
vector-perf-0.1.0.0: build
Preprocessing executable 'vector-perf' for vector-perf-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf ...
vector-perf-0.1.0.0: copy/register
Installing executable(s) in
/home/ubuntu/vector-perf/.stack-work/install/x86_64-linux/lts-7.4/8.0.1/bin
+ stack exec vector-perf
took 0.026105s
took 1.566821s
+ stack clean
+ stack build --fast
vector-perf-0.1.0.0: configure
Configuring vector-perf-0.1.0.0...
vector-perf-0.1.0.0: build
Preprocessing executable 'vector-perf' for vector-perf-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf ...
vector-perf-0.1.0.0: copy/register
Installing executable(s) in
/home/ubuntu/vector-perf/.stack-work/install/x86_64-linux/lts-7.4/8.0.1/bin
+ stack exec vector-perf
took 0.12741s
took 71.967015s

$ tail -n25 /proc/cpuinfo
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 1209.375
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 31
initial apicid  : 31
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 4801.74
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:        16.04
Codename:       xenial
nmk commented Dec 22, 2016

This was discussed briefly in #108.

charles-cooper added a commit to charles-cooper/vector that referenced this issue Dec 22, 2016
The `unstream` version results in a copy of the remainder of the vector
which is rarely what you want to do. This commit uses `drop` directly
after `findIndex` which results in a big speed gain if the remaining
part of the vector is large.

Cf. haskell#108
and haskell#141
dolio (Contributor) commented Jan 3, 2017

So, I feel like this is an example of a broader problem.

In your example, you will have a big vector materialized in memory. Streaming this to drop some elements and then making a new vector is a lot of work compared to changing the offset/length and reusing the underlying array.
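The offset/length point can be seen directly with the public API: `drop` on an immutable vector is O(1), returning a slice that shares the original array (a minimal illustration, not the issue's benchmark):

```haskell
import qualified Data.Vector as V

main :: IO ()
main = do
  let v = V.generate 1000000 id
      -- O(1): only the slice's offset and length change; the million
      -- elements are not copied.
      s = V.drop 999990 v
  print (V.length s)  -- 10
  print (V.head s)    -- 999990
```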

However, another use case of vector is stream fusion, and if stream fusion is happening, this new implementation (in pull request #146) actually breaks it up in sub-optimal ways. And it's kind of accidental that drop does not do streaming, too.

However, I think that maybe the generalized stream fusion can do something better than it is doing right now. dropWhileM in the bundle fusion is just forcing the full stream. However, I think it could actually be a bundle modification that modifies the three things in the bundle differently, and puts off the decision of whether to stream or reuse the vector to other things in the fusion pipeline. This would hopefully give the best answer for both use cases.

I think this was the point of the generalized fusion, so I don't really understand why it wasn't used here, or in lots of other places (takeWhile would be similarly bad in an example like yours, I think). I'll see if I can make the generalized fusion implement something that runs well for your example, but is also good (or, better, at least) for streaming before merging #146.

cartazio (Contributor) commented Jan 3, 2017

Yeah, sharing vs. slicing is a tricky trade-off!

charles-cooper (Author) commented
I see. So ideally we would have dropWhile f use sharing and dropWhile f . map g (or similar) use stream fusion.

Thanks for the insight by the way. Up to this point I did not understand what Bundle does -- apparently it is for exactly this purpose of choosing whether to stream or slice?

dolio (Contributor) commented Jan 6, 2017

That's sort of the idea, yes. A simpler example than bundle is that you can use the type:

data CoYo f a = forall e. CY (e -> a) (f e)

to fuse up fmap calls on a Functor f by collecting them into the stored function and only applying the function once on extract :: CoYo f a -> f a. However, this actually does extra work (in the form of fmap id) for 0 composed functions. So you can fix this by instead using:

data BundleYo f a = forall e. BY (e -> a) (f e) (f a)

Where the extra f a field stores the original f a at first (so no additional work is done), and when fmap is used, we put what we would extract at that point there. Due to laziness, only the final result actually gets used, so we never do (much) more work than CoYo.

The only tricky part I can think of is that the selection of algorithms kind of flows one way in bundles. So, I think map g . dropWhile p will use streaming dropWhile (because map will always choose to stream), whereas what dropWhile p . map g does would be determined on how the bundle it produces is consumed, even though streaming is again probably the best option (because you can't do much better when a map is involved). Maybe there's some mechanism that allows a better selection, though. I'm not sure.
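The CoYo sketch above can be made concrete; the following compiles as-is and shows the composed functions being applied in a single traversal at extract (the helper names liftCY and mapCY are mine, not from vector):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- dolio's CoYo: collect composed fmaps into a stored function, and
-- apply them in one pass only at extraction time.
data CoYo f a = forall e. CY (e -> a) (f e)

-- Wrap a value; stores 'id', which is the "fmap id for 0 composed
-- functions" extra work that BundleYo avoids.
liftCY :: f a -> CoYo f a
liftCY = CY id

-- Mapping just composes functions; the structure f is not traversed.
mapCY :: (a -> b) -> CoYo f a -> CoYo f b
mapCY g (CY h fe) = CY (g . h) fe

-- The single real fmap happens here.
extract :: Functor f => CoYo f a -> f a
extract (CY h fe) = fmap h fe

main :: IO ()
main = print (extract (mapCY (* 2) (mapCY (+ 1) (liftCY [1, 2, 3]))))
-- [4,6,8]: two mapCY calls, one list traversal
```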

gksato (Contributor) commented Jun 5, 2020

For this matter, why don't we just have dropWhile be like the current implementations of index, head, or tail? That is, we could define dropWhile f = snd . span f with INLINE_FUSED and have a rule

dropWhile f (new (New.unstream s)) = new (New.unstream (Bundle.dropWhile f s))

I'll submit a pull request for this soon.
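The proposed definition can be sketched against the public API (the actual patch also needs the INLINE_FUSED pragma and the rewrite rule above, which live in vector's internal modules):

```haskell
import qualified Data.Vector as V

-- Sketch of the proposal: 'span' splits at the first element that
-- fails the predicate, and both halves are O(1) slices of the
-- original array, so the suffix is never copied.
dropWhileSliced :: (a -> Bool) -> V.Vector a -> V.Vector a
dropWhileSliced f = snd . V.span f

main :: IO ()
main = print (dropWhileSliced even (V.fromList [2, 4, 6, 1, 2, 3]))
-- [1,2,3]
```

On an already-materialized vector this behaves like `drop` after `findIndex`, while the rewrite rule restores streaming when the input is itself a fused stream.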

cartazio (Contributor) commented Jun 5, 2020 via email

@Shimuuar Shimuuar added this to the 0.13 milestone Jun 11, 2020
gksato added a commit to gksato/vector that referenced this issue Aug 2, 2020
Now dropWhile's implementation uses stream
only in case the stream fusion is already running.
This is expected to reduce unnecessary re-allocation.

cf. haskell#141
haskell#108
gksato (Contributor) commented Aug 2, 2020

I've submitted a pull request for this, finally! #327

Shimuuar pushed a commit that referenced this issue Aug 13, 2020
* make dropWhile fusion adaptive

Now dropWhile's implementation uses stream
only in case the stream fusion is already running.
This is expected to reduce unnecessary re-allocation.

cf. #141
#108

* Adaptive dropWhile Implementation: Implementation comment modified

Modified the comment for the RULE "dropWhile/unstream [Vector]"
in order to make the intention of the implementation there clear.
Shimuuar (Contributor) commented
Fixed by #327

lehins pushed a commit that referenced this issue Jan 16, 2021