
Switching from 1.2.1 to 2.x.x drastically slows my app #912

Open
MindStudioOfficial opened this issue Oct 19, 2022 · 19 comments
@MindStudioOfficial

MindStudioOfficial commented Oct 19, 2022

In my app I allocate realtime video frames (24 MB per frame at 60 fps) and time the processing of each frame in milliseconds.

Using ffi: ^1.2.1 I get 8-11ms per frame.
Using ffi: ^2.0.0 I get 70 to 200ms per frame in DEBUG mode and 20 to 30ms in RELEASE mode, which makes my app unusable.

I suspect that it has something to do with the switch mentioned in the changelog of 2.0.0

Switch Windows memory allocation to use CoTaskMemAlloc and CoTaskMemFree

I've also found an article about a bug in Windows 10 that supposedly makes this a lot slower:
https://randomascii.wordpress.com/2022/07/11/slower-memory-zeroing-through-parallelism/

For now I can't upgrade to ffi 2.x.x since I can't use my app with this version.

Is this a known thing?

EDIT: switching from ffi.calloc to ffi.malloc reduces the time immensely, but of course that is not always applicable

@dcharkes
Collaborator

calloc and malloc both use CoTaskMemAlloc.

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L30-L32

The only difference is zero-ing out the memory.

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L137-L139

Are you allocating large chunks of memory?

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L105-L110

For large chunks, using memset would probably be much faster than the manual loop above.

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/memset-wmemset?view=msvc-170

Could you check if using memset instead of a manual loop is faster? If so we can make a PR for that. (We never ran into this on the other platforms, because they support allocating zeroed out memory. I don't see something similar for Windows.)

@dcharkes
Collaborator

cc @timsneath who introduced the zeroing out of memory for Windows.

@MindStudioOfficial
Author

Are you allocating large chunks of memory?

Yes. It's about 24 MB per frame at, for example, 60 fps, which would be 1.44 GB per second. (Of course every frame gets freed.)

For large chunks, using memset would probably be much faster than the manual loop above.

How do I access memset in dart?

@dcharkes
Collaborator

Try

typedef _dart_memset = void Function(Pointer<Uint8>, int, int);
typedef _c_memset = Void Function(Pointer<Uint8>, Int32, IntPtr);

    fbMemset = DynamicLibrary.process()
        .lookupFunction<_c_memset, _dart_memset>('memset');

You probably need to be on the master branch to get access to DynamicLibrary.process() on Windows as I added that recently to dart:ffi.

@MindStudioOfficial
Author

I've written a short dart example to test this:

import 'dart:ffi';
import 'package:ffi/ffi.dart';

typedef _dart_memset = void Function(Pointer<Uint8> dest, int value, int count);
typedef _c_memset = Void Function(Pointer<Uint8> dest, Int32 value, IntPtr count);

/// Sets a buffer to a specified character.
///
/// [dest] Pointer to destination.
/// [value] value to set.
/// [count] Number of characters.
final memset = DynamicLibrary.process().lookupFunction<_c_memset, _dart_memset>('memset');

void main(List<String> arguments) {
  Stopwatch sw = Stopwatch()..start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    malloc.free(p);
  }
  sw.stop();
  int us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc took \t\t$usμs.");

  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = calloc.call<Uint8>(1024 * 1024); // 1 MiB
    calloc.free(p);
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with calloc took \t\t$usμs.");

  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    memset(p, 0, 1024 * 1024);
    malloc.free(p); // free with the same allocator that allocated
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc + memset took \t$usμs.");
}

ffi: ^2.0.1

Allocating 1000 * 1MiB with malloc took 7.028μs. (7ms)
Allocating 1000 * 1MiB with calloc took 3.294.333μs. (3294ms)
Allocating 1000 * 1MiB with malloc + memset took 187.259μs. (187ms)

ffi: ^1.2.1

Allocating 1000 * 1MiB with malloc took 7.338μs. (7ms)
Allocating 1000 * 1MiB with calloc took 6.357μs. (6ms)
Allocating 1000 * 1MiB with malloc + memset took 191.216μs. (191ms)

  1. In ffi 1.2.1, calloc takes roughly the same amount of time as malloc.
  2. In ffi 2.0.1, calloc is ~17× slower than malloc + memset.

@MindStudioOfficial
Author

@dcharkes btw:
[screenshot of the DynamicLibrary.process() doc comment]

Still says "not available on Windows" in the doc comments.

Still works though :D

@dcharkes
Collaborator

Thanks for the benchmarks!

We migrated from

result = winHeapAlloc(processHeap, /*flags=*/ HEAP_ZERO_MEMORY, byteCount).cast();

to

result = winCoTaskMemAlloc(byteCount).cast();
_zeroMemory(result, byteCount);

to enable using NativeFinalizers. HeapFree does not have the same signature as free, while CoTaskMemFree does.

We should probably replace the zeroing loop with memset once DynamicLibrary.process() is available on Windows in Dart/Flutter stable.

@MindStudioOfficial could you check at which allocation size the loop and memset are the same speed? (I suspect that for very small allocations, the loop is faster.)

@timsneath looks like our migration to CoTaskMemAlloc has a severe performance regression. I suspect that by now you're already using NativeFinalizers and depending on being able to use CoTaskMemFree.

@MindStudioOfficial as a workaround, you could define your own Allocator which copies the sources from the older package:ffi.

I wonder if we should include a setting/parameter or an extra allocator on Windows to re-expose allocating with HeapAlloc, which would then not support NativeFinalizers.

@mkustermann
Member

We should probably replace the zeroing loop with memset once DynamicLibrary.process() windows availability is available in Dart/Flutter stable.

Why not in Dart itself? memory.asTypedList(...).fillRange(...)?

@dcharkes
Collaborator

That's another good thing to try. Though I would expect memset to be faster for 1 MB chunks.

@dcharkes
Collaborator

@dcharkes btw: [screenshot]

Still says "not available on Windows" in the doc comments.

Still works though :D

Oopsie 🙃

https://dart-review.googlesource.com/c/sdk/+/264940

@MindStudioOfficial
Author

Some more benchmarks

| Size in KiB | malloc (μs) | calloc (μs) | malloc+memset (μs) |
| ---: | ---: | ---: | ---: |
| 2000 | 28 | 8037 | 62 |
| 4000 | 20 | 12065 | 87 |
| 6000 | 20 | 18784 | 126 |
| 8000 | 19 | 25437 | 154 |
| 10000 | 23 | 28993 | 186 |
| 12000 | 23 | 34216 | 203 |
| 14000 | 24 | 41588 | 243 |
| 16000 | 167 | 50309 | 1758 |
| 18000 | 124 | 56043 | 1994 |
| 20000 | 140 | 61813 | 2129 |
| 22000 | 130 | 72455 | 2280 |
| 24000 | 139 | 74347 | 2749 |
| 26000 | 131 | 80171 | 2710 |
| 28000 | 130 | 88833 | 2406 |
| 30000 | 120 | 100929 | 3311 |
| 32000 | 122 | 102998 | 3013 |
| 34000 | 110 | 104624 | 2139 |
| 36000 | 103 | 111944 | 2351 |
| 38000 | 100 | 122905 | 2520 |
| 40000 | 122 | 121536 | 6003 |
| 42000 | 144 | 127213 | 2850 |
| 44000 | 133 | 140059 | 3196 |
| 46000 | 136 | 156606 | 2843 |
| 48000 | 94 | 165196 | 2157 |
| 50000 | 95 | 167505 | 2077 |
| 52000 | 94 | 213821 | 2271 |
| 54000 | 90 | 179461 | 2338 |
| 56000 | 96 | 176380 | 2398 |
| 58000 | 103 | 186485 | 2941 |
| 60000 | 170 | 191335 | 2809 |
| 62000 | 97 | 187990 | 2787 |
| 64000 | 69 | 192022 | 2861 |
| 66000 | 86 | 206934 | 2685 |
| 68000 | 90 | 208057 | 2921 |
| 70000 | 73 | 224153 | 2753 |
| 72000 | 112 | 219760 | 3309 |
| 74000 | 86 | 222953 | 3671 |
| 76000 | 210 | 235642 | 3556 |
| 78000 | 96 | 253633 | 3273 |
| 80000 | 100 | 294777 | 3754 |
| 82000 | 104 | 309105 | 3297 |
| 84000 | 114 | 328592 | 3682 |
| 86000 | 104 | 300506 | 3382 |
| 88000 | 112 | 343889 | 3383 |
| 90000 | 114 | 354755 | 3845 |
| 92000 | 113 | 289370 | 4120 |
| 94000 | 120 | 284351 | 4024 |
| 96000 | 127 | 306265 | 4069 |
| 98000 | 87 | 300042 | 4237 |
| 100000 | 129 | 312252 | 4147 |

@dcharkes
Collaborator

Thanks @MindStudioOfficial! ❤️

What about smaller values? (In a longer loop.)

| Size in KiB | malloc (μs) | calloc (μs) | malloc+memset (μs) |
| ---: | ---: | ---: | ---: |
| 2000 | 28 | 8037 | 62 |

Does calloc ever drop under malloc+memset? For a single byte? For 10 bytes?

And could you verify whether .asTypedList().fillRange() is slower than memset? (I could check on a Linux machine easily, but it's better to be sure for your Windows setup.)

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

I did more benchmarks, this time using a logarithmic approach for the sizes.
Every chunk size gets allocated 1000 times for statistical stability, hence the "1000 *".

| Size in Bytes | malloc (μs) | calloc (μs) | malloc+memset (μs) | fillRange (μs) |
| ---: | ---: | ---: | ---: | ---: |
| 1000 * 2 | 68 | 81 | 79 | 85 |
| 1000 * 4 | 65 | 85 | 78 | 119 |
| 1000 * 8 | 78 | 88 | 77 | 114 |
| 1000 * 16 | 67 | 91 | 75 | 123 |
| 1000 * 32 | 64 | 97 | 80 | 212 |
| 1000 * 64 | 69 | 159 | 74 | 254 |
| 1000 * 128 | 70 | 178 | 79 | 441 |
| 1000 * 256 | 68 | 205 | 82 | 777 |
| 1000 * 512 | 143 | 409 | 91 | 1564 |
| 1000 * 1024 | 54 | 666 | 75 | 2804 |
| 1000 * 2048 | 116 | 1133 | 95 | 5238 |
| 1000 * 4096 | 93 | 1828 | 128 | 10560 |
| 1000 * 8192 | 101 | 3949 | 201 | 21530 |
| 1000 * 16384 | 106 | 7676 | 314 | 40361 |
| 1000 * 32768 | 95 | 14229 | 558 | 80580 |
| 1000 * 65536 | 128 | 31589 | 1058 | 160970 |
| 1000 * 131072 | 117 | 58080 | 1879 | 331348 |
| 1000 * 262144 | 151 | 123197 | 4144 | 654644 |
| 1000 * 524288 | 128 | 231918 | 7360 | 1319933 |
| 1000 * 1048576 | 4911 | 626603 | 191911 | 2888054 |
| 1000 * 2097152 | 6616 | 1210723 | 346860 | 5590725 |
| 1000 * 4194304 | 7037 | 2395864 | 669042 | 11304400 |

[chart of the benchmark results]

Y axis is logarithmic (base 10), X axis is logarithmic (base 2).

@dcharkes
Collaborator

Sweet! ❤️

So CoTaskMemAlloc with memset is always faster than CoTaskMemAlloc with .asTypedList().fillRange() and CoTaskMemAlloc with the loop.
And CoTaskMemAlloc with memset is an order of magnitude slower than HeapAlloc with the HEAP_ZERO_MEMORY flag for 100 KB+ allocations.

I've filed a fix to use memset.

For the large allocations maybe we should provide another allocator for windows. (But that will not allow us to use NativeFinalizers to clean up the memory.)

Thanks so much for the benchmarks @MindStudioOfficial! Are you unblocked by either using malloc+memset or a copy of the older package:ffi calloc (without support for native finalizers)?

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

Does calloc in μs ever drop under malloc+memset? For a single byte? For 10 bytes?

Doesn't look like it.

And could you verify whether .asTypedList().fillRange is slower than memset?

It's the slowest.

Are you unblocked by either using a malloc+memset[...]

Yes, I've made sure to use malloc whenever possible (mostly the data gets overwritten anyway).

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

I've also tried using the C-style calloc:

typedef _c_calloc = Pointer<Void> Function(IntPtr number, IntPtr size);
typedef _dart_calloc = Pointer<Void> Function(int number, int size);

final ccalloc = DynamicLibrary.process().lookupFunction<_c_calloc, _dart_calloc>('calloc');

Which came out faster than malloc+memset.

[chart comparing C calloc to malloc+memset]

@dcharkes
Collaborator

dcharkes commented Oct 20, 2022

We don't use malloc and calloc on Windows, because every .dll has its own implementation of malloc/calloc/free, and the process will segfault if you pass memory from one dll to another and try to free it there.

@mahesh-hegde
Contributor

I've also tried using the c style calloc: Which came out faster than malloc+memset.

Calloc can be faster than malloc + memset because the OS zeroes out virtual memory anyway before handing out pages, and a clever implementation of calloc can avoid repeating that work. I'd imagine that if calloc gets fresh pages from the OS via sbrk() or an equivalent syscall, it wouldn't zero out that memory again.

The difference between winHeapAlloc and winCoTaskMemAlloc could be along the same lines.


Footnote: the speedup from using memset is also expected. memset almost always uses SIMD or special instructions under the hood. I once benchmarked a naive AVX-256 memset against setting Uint64s to 0; the AVX one was ~2.5 times faster. I assume the libc implementation is much more clever.

@dcharkes dcharkes transferred this issue from dart-archive/ffi Jan 16, 2024
@alexmercerind

I experience something similar... media_kit bundles ffi: ^1.2.1 to continue working properly.
