
dropWhile is slow #141

Closed
charles-cooper opened this issue Oct 20, 2016 · 9 comments

charles-cooper commented Oct 20, 2016

The following example shows `dropWhile` running roughly 60x slower than `findIndex` followed by `drop n` when compiled with -O3, and roughly 560x slower without -O3!

https://github.com/charles-cooper/vector-perf
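The comparison being timed is roughly of this shape (a hedged sketch with illustrative names and a placeholder predicate, not the actual benchmark code from the linked repository):

```haskell
import qualified Data.Vector.Unboxed as U

-- Streaming version: at the time, dropWhile went through the fusion
-- framework and rebuilt the suffix, copying it into a fresh array.
viaDropWhile :: U.Vector Int -> U.Vector Int
viaDropWhile = U.dropWhile (< 5)

-- findIndex followed by drop: 'drop' only adjusts the slice's offset
-- and length, sharing the underlying array instead of copying it.
viaFindIndex :: U.Vector Int -> U.Vector Int
viaFindIndex v = maybe U.empty (`U.drop` v) (U.findIndex (>= 5) v)

main :: IO ()
main = do
  let v = U.generate 10 id
  print (viaDropWhile v)  -- [5,6,7,8,9]
  print (viaFindIndex v)  -- [5,6,7,8,9]
```

Both compute the same suffix; the difference is purely whether the suffix is copied or shared.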

$ ./run_perf 
+ stack clean
+ stack build
vector-perf-0.1.0.0: configure
Configuring vector-perf-0.1.0.0...
vector-perf-0.1.0.0: build
Preprocessing executable 'vector-perf' for vector-perf-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf ...
vector-perf-0.1.0.0: copy/register
Installing executable(s) in
/home/ubuntu/vector-perf/.stack-work/install/x86_64-linux/lts-7.4/8.0.1/bin
+ stack exec vector-perf
took 0.026105s
took 1.566821s
+ stack clean
+ stack build --fast
vector-perf-0.1.0.0: configure
Configuring vector-perf-0.1.0.0...
vector-perf-0.1.0.0: build
Preprocessing executable 'vector-perf' for vector-perf-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/vector-perf/vector-perf ...
vector-perf-0.1.0.0: copy/register
Installing executable(s) in
/home/ubuntu/vector-perf/.stack-work/install/x86_64-linux/lts-7.4/8.0.1/bin
+ stack exec vector-perf
took 0.12741s
took 71.967015s

$ tail -n25 /proc/cpuinfo
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 1209.375
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 31
initial apicid  : 31
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 4801.74
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:        16.04
Codename:       xenial
nmk commented Dec 22, 2016

This was discussed briefly in #108.

charles-cooper added a commit to charles-cooper/vector that referenced this issue Dec 22, 2016
The `unstream` version results in a copy of the remainder of the vector
which is rarely what you want to do. This commit uses `drop` directly
after `findIndex` which results in a big speed gain if the remaining
part of the vector is large.

Cf. haskell#108
and haskell#141
dolio (Contributor) commented Jan 3, 2017

So, I feel like this is an example of a broader problem.

In your example, you will have a big vector materialized in memory. Streaming this to drop some elements and then making a new vector is a lot of work compared to changing the offset/length and reusing the underlying array.
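The offset/length point can be seen directly with the public API: `drop` on an immutable vector is O(1), returning a slice that shares the original array (a minimal illustration, not the issue's benchmark):

```haskell
import qualified Data.Vector as V

main :: IO ()
main = do
  let v = V.generate 1000000 id
      -- O(1): only the slice's offset and length change; the million
      -- elements are not copied.
      s = V.drop 999990 v
  print (V.length s)  -- 10
  print (V.head s)    -- 999990
```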

However, another use case of vector is stream fusion, and if stream fusion is happening, this new implementation (in pull request #146) actually breaks it up in sub-optimal ways. And it's kind of accidental that drop does not do streaming, too.

However, I think that maybe the generalized stream fusion can do something better than it is doing right now. dropWhileM in the bundle fusion is just forcing the full stream. However, I think it could actually be a bundle modification that modifies the three things in the bundle differently, and puts off the decision of whether to stream or reuse the vector to other things in the fusion pipeline. This would hopefully give the best answer for both use cases.

I think this was the point of the generalized fusion, so I don't really understand why it wasn't used here, or in lots of other places (takeWhile would be similarly bad in an example like yours, I think). I'll see if I can make the generalized fusion implement something that runs well for your example, but is also good (or, better, at least) for streaming before merging #146.

cartazio (Contributor) commented Jan 3, 2017

Yeah, sharing vs. slicing is a tricky trade-off!

charles-cooper (Author) commented
I see. So ideally we would have dropWhile f use sharing and dropWhile f . map g (or similar) use stream fusion.

Thanks for the insight by the way. Up to this point I did not understand what Bundle does -- apparently it is for exactly this purpose of choosing whether to stream or slice?

dolio (Contributor) commented Jan 6, 2017

That's sort of the idea, yes. A simpler example than bundle is that you can use the type:

data CoYo f a = forall e. CY (e -> a) (f e)

to fuse up fmap calls on a Functor f by collecting them into the stored function and only applying the function once on extract :: CoYo f a -> f a. However, this actually does extra work (in the form of fmap id) for 0 composed functions. So you can fix this by instead using:

data BundleYo f a = forall e. BY (e -> a) (f e) (f a)

Where the extra f a field stores the original f a at first (so no additional work is done), and when fmap is used, we put what we would extract at that point there. Due to laziness, only the final result actually gets used, so we never do (much) more work than CoYo.

The only tricky part I can think of is that the selection of algorithms kind of flows one way in bundles. So, I think map g . dropWhile p will use streaming dropWhile (because map will always choose to stream), whereas what dropWhile p . map g does would be determined on how the bundle it produces is consumed, even though streaming is again probably the best option (because you can't do much better when a map is involved). Maybe there's some mechanism that allows a better selection, though. I'm not sure.
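The CoYo sketch above can be made concrete; the following compiles as-is and shows the composed functions being applied in a single traversal at extract (the helper names liftCY and mapCY are mine, not from vector):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- dolio's CoYo: collect composed fmaps into a stored function, and
-- apply them in one pass only at extraction time.
data CoYo f a = forall e. CY (e -> a) (f e)

-- Wrap a value; stores 'id', which is the "fmap id for 0 composed
-- functions" extra work that BundleYo avoids.
liftCY :: f a -> CoYo f a
liftCY = CY id

-- Mapping just composes functions; the structure f is not traversed.
mapCY :: (a -> b) -> CoYo f a -> CoYo f b
mapCY g (CY h fe) = CY (g . h) fe

-- The single real fmap happens here.
extract :: Functor f => CoYo f a -> f a
extract (CY h fe) = fmap h fe

main :: IO ()
main = print (extract (mapCY (* 2) (mapCY (+ 1) (liftCY [1, 2, 3]))))
-- [4,6,8]: two mapCY calls, one list traversal
```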

gksato (Contributor) commented Jun 5, 2020

For this matter, why don't we just have dropWhile be like the current implementations of index, head, or tail? That is, we could define dropWhile f = snd . span f with INLINE_FUSED and have a rule

dropWhile f (new (New.unstream s)) = new (New.unstream (Bundle.dropWhile f s))

I'll submit a pull request for this soon.
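The proposed definition can be sketched against the public API (the actual patch also needs the INLINE_FUSED pragma and the rewrite rule above, which live in vector's internal modules):

```haskell
import qualified Data.Vector as V

-- Sketch of the proposal: 'span' splits at the first element that
-- fails the predicate, and both halves are O(1) slices of the
-- original array, so the suffix is never copied.
dropWhileSliced :: (a -> Bool) -> V.Vector a -> V.Vector a
dropWhileSliced f = snd . V.span f

main :: IO ()
main = print (dropWhileSliced even (V.fromList [2, 4, 6, 1, 2, 3]))
-- [1,2,3]
```

On an already-materialized vector this behaves like `drop` after `findIndex`, while the rewrite rule restores streaming when the input is itself a fused stream.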

cartazio (Contributor) commented Jun 5, 2020 via email

@Shimuuar Shimuuar added this to the 0.13 milestone Jun 11, 2020
gksato added a commit to gksato/vector that referenced this issue Aug 2, 2020
Now dropWhile's implementation uses stream
only in case the stream fusion is already running.
This is expected to reduce unnecessary re-allocation.

cf. haskell#141
haskell#108
gksato (Contributor) commented Aug 2, 2020

I've submitted a pull request for this, finally! #327

Shimuuar pushed a commit that referenced this issue Aug 13, 2020
* make dropWhile fusion adaptive

Now dropWhile's implementation uses stream
only in case the stream fusion is already running.
This is expected to reduce unnecessary re-allocation.

cf. #141
#108

* Adaptive dropWhile Implementation: Implementation comment modified

Modified the comment for the RULE "dropWhile/unstream [Vector]"
in order to make the intention of the implementation there clear.
Shimuuar (Contributor) commented
Fixed by #327

lehins pushed a commit that referenced this issue Jan 16, 2021