
Switching from 1.2.1 to 2.x.x drastically slows my app #912

Open
MindStudioOfficial opened this issue Oct 19, 2022 · 19 comments
@MindStudioOfficial

MindStudioOfficial commented Oct 19, 2022

In my app I allocate realtime video frames (24 MB per frame at 60 fps) and time the processing of each frame in milliseconds.

Using ffi: ^1.2.1 I get 8-11ms per frame.
Using ffi: ^2.0.0 I get 70 to 200ms per frame in DEBUG mode and 20 to 30ms in RELEASE mode, which makes my app unusable.

I suspect that it has something to do with the switch mentioned in the changelog of 2.0.0

Switch Windows memory allocation to use CoTaskMemAlloc and CoTaskMemFree

I've also found an article about a bug in Windows 10 that supposedly makes this a lot slower:
https://randomascii.wordpress.com/2022/07/11/slower-memory-zeroing-through-parallelism/

For now I can't upgrade to ffi 2.x.x since I can't use my app with this version.

Is this a known thing?

EDIT: switching from ffi.calloc to ffi.malloc reduces the time immensely, but of course that is not always applicable

@dcharkes
Collaborator

calloc and malloc both use CoTaskMemAlloc.

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L30-L32

The only difference is zero-ing out the memory.

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L137-L139

Are you allocating large chunks of memory?

https://github.com/dart-lang/ffi/blob/18b2b549d55009ff594600b04705ff6161681e07/lib/src/allocation.dart#L105-L110

For large chunks, using memset would probably be much faster than the manual loop above.

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/memset-wmemset?view=msvc-170

Could you check if using memset instead of a manual loop is faster? If so we can make a PR for that. (We never ran into this on the other platforms, because they support allocating zeroed out memory. I don't see something similar for Windows.)

@dcharkes
Collaborator

cc @timsneath who introduced the zeroing out of memory for Windows.

@MindStudioOfficial
Author

Are you allocating large chunks of memory?

Yes. It's about 24 MB per frame at, for example, 60 fps, which would be 1.44 GB per second. (Of course every frame gets freed.)

For large chunks, using memset would probably be much faster than the manual loop above.

How do I access memset in dart?

@dcharkes
Collaborator

Try

typedef _dart_memset = void Function(Pointer<Uint8>, int, int);
typedef _c_memset = Void Function(Pointer<Uint8>, Int32, IntPtr);

    fbMemset = DynamicLibrary.process()
        .lookupFunction<_c_memset, _dart_memset>('memset');

You probably need to be on the master branch to get access to DynamicLibrary.process() on Windows as I added that recently to dart:ffi.

@MindStudioOfficial
Author

I've written a short dart example to test this:

import 'dart:ffi';
import 'package:ffi/ffi.dart';

typedef _dart_memset = void Function(Pointer<Uint8> dest, int value, int count);
typedef _c_memset = Void Function(Pointer<Uint8> dest, Int32 value, IntPtr count);

/// Sets a buffer to a specified character.
///
/// [dest] Pointer to destination.
/// [value] value to set.
/// [count] Number of characters.
final memset = DynamicLibrary.process().lookupFunction<_c_memset, _dart_memset>('memset');

void main(List<String> arguments) {
  Stopwatch sw = Stopwatch()..start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    malloc.free(p);
  }
  sw.stop();
  int us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc took \t\t$usμs.");

  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = calloc.call<Uint8>(1024 * 1024); // 1 MiB
    calloc.free(p);
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with calloc took \t\t$usμs.");

  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    memset(p, 0, 1024 * 1024);
    malloc.free(p); // free with the same allocator that allocated
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc + memset took \t$usμs.");
}

ffi: ^2.0.1

Allocating 1000 * 1MiB with malloc took 7.028μs. (7ms)
Allocating 1000 * 1MiB with calloc took 3.294.333μs. (3294ms)
Allocating 1000 * 1MiB with malloc + memset took 187.259μs. (187ms)

ffi: ^1.2.1

Allocating 1000 * 1MiB with malloc took 7.338μs. (7ms)
Allocating 1000 * 1MiB with calloc took 6.357μs. (6ms)
Allocating 1000 * 1MiB with malloc + memset took 191.216μs. (191ms)

  1. In ffi 1.2.1, calloc takes roughly the same amount of time as malloc.
  2. In ffi 2.0.1, calloc is ~17× slower than malloc + memset.

@MindStudioOfficial
Author

@dcharkes btw:
[screenshot of the DynamicLibrary.process() doc comment]

Still says "not available on Windows" in the doc comments.

Still works though :D

@dcharkes
Collaborator

Thanks for the benchmarks!

We migrated from

result = winHeapAlloc(processHeap, /*flags=*/ HEAP_ZERO_MEMORY, byteCount).cast();

to

result = winCoTaskMemAlloc(byteCount).cast();
_zeroMemory(result, byteCount);

to enable using NativeFinalizers. HeapFree does not have the same signature as free, while CoTaskMemFree does.

We should probably replace the zeroing loop with memset once DynamicLibrary.process() is available on Windows in Dart/Flutter stable.

@MindStudioOfficial could you check at which allocation size the loop and memset are the same speed? (I suspect that for very small allocations, the loop is faster.)

@timsneath looks like our migration to CoTaskMemAlloc has a severe performance regression. I suspect that by now you're already using NativeFinalizers and depending on being able to use CoTaskMemFree.

@MindStudioOfficial as a workaround, you could define your own Allocator which copies the sources from the older package:ffi.

I wonder if we should include a setting/parameter or an extra allocator on Windows to re-expose allocating with HeapAlloc, which would then not support NativeFinalizers.

@mkustermann
Member

We should probably replace the zeroing loop with memset once DynamicLibrary.process() windows availability is available in Dart/Flutter stable.

Why not in Dart itself? memory.asTypedList(...).fillRange(...)?

@dcharkes
Collaborator

That's another good thing to try. Though I would expect memset to be faster for 1 MB chunks.

@dcharkes
Collaborator

@dcharkes btw: [screenshot]

Still says "not available on Windows" in the doc comments.

Still works though :D

Oopsie 🙃

https://dart-review.googlesource.com/c/sdk/+/264940

@MindStudioOfficial
Author

Some more benchmarks

| Size in KiB | malloc (μs) | calloc (μs) | malloc+memset (μs) |
| ---: | ---: | ---: | ---: |
| 2000 | 28 | 8037 | 62 |
| 4000 | 20 | 12065 | 87 |
| 6000 | 20 | 18784 | 126 |
| 8000 | 19 | 25437 | 154 |
| 10000 | 23 | 28993 | 186 |
| 12000 | 23 | 34216 | 203 |
| 14000 | 24 | 41588 | 243 |
| 16000 | 167 | 50309 | 1758 |
| 18000 | 124 | 56043 | 1994 |
| 20000 | 140 | 61813 | 2129 |
| 22000 | 130 | 72455 | 2280 |
| 24000 | 139 | 74347 | 2749 |
| 26000 | 131 | 80171 | 2710 |
| 28000 | 130 | 88833 | 2406 |
| 30000 | 120 | 100929 | 3311 |
| 32000 | 122 | 102998 | 3013 |
| 34000 | 110 | 104624 | 2139 |
| 36000 | 103 | 111944 | 2351 |
| 38000 | 100 | 122905 | 2520 |
| 40000 | 122 | 121536 | 6003 |
| 42000 | 144 | 127213 | 2850 |
| 44000 | 133 | 140059 | 3196 |
| 46000 | 136 | 156606 | 2843 |
| 48000 | 94 | 165196 | 2157 |
| 50000 | 95 | 167505 | 2077 |
| 52000 | 94 | 213821 | 2271 |
| 54000 | 90 | 179461 | 2338 |
| 56000 | 96 | 176380 | 2398 |
| 58000 | 103 | 186485 | 2941 |
| 60000 | 170 | 191335 | 2809 |
| 62000 | 97 | 187990 | 2787 |
| 64000 | 69 | 192022 | 2861 |
| 66000 | 86 | 206934 | 2685 |
| 68000 | 90 | 208057 | 2921 |
| 70000 | 73 | 224153 | 2753 |
| 72000 | 112 | 219760 | 3309 |
| 74000 | 86 | 222953 | 3671 |
| 76000 | 210 | 235642 | 3556 |
| 78000 | 96 | 253633 | 3273 |
| 80000 | 100 | 294777 | 3754 |
| 82000 | 104 | 309105 | 3297 |
| 84000 | 114 | 328592 | 3682 |
| 86000 | 104 | 300506 | 3382 |
| 88000 | 112 | 343889 | 3383 |
| 90000 | 114 | 354755 | 3845 |
| 92000 | 113 | 289370 | 4120 |
| 94000 | 120 | 284351 | 4024 |
| 96000 | 127 | 306265 | 4069 |
| 98000 | 87 | 300042 | 4237 |
| 100000 | 129 | 312252 | 4147 |

@dcharkes
Collaborator

Thanks @MindStudioOfficial! ❤️

What about smaller values? (In a longer loop.)

| Size in KiB | malloc (μs) | calloc (μs) | malloc+memset (μs) |
| ---: | ---: | ---: | ---: |
| 2000 | 28 | 8037 | 62 |

Does calloc ever drop under malloc+memset? For a single byte? For 10 bytes?

And could you verify whether .asTypedList().fillRange() is slower than memset? (I could check on a Linux machine easily, but it's better to be sure for your Windows setup.)

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

I did more benchmarks, this time using a logarithmic approach for the sizes.
Every chunk size gets allocated 1000 times for statistical stability, hence the "1000 *".

| Size in Bytes | malloc (μs) | calloc (μs) | malloc+memset (μs) | fillRange (μs) |
| ---: | ---: | ---: | ---: | ---: |
| 1000 * 2 | 68 | 81 | 79 | 85 |
| 1000 * 4 | 65 | 85 | 78 | 119 |
| 1000 * 8 | 78 | 88 | 77 | 114 |
| 1000 * 16 | 67 | 91 | 75 | 123 |
| 1000 * 32 | 64 | 97 | 80 | 212 |
| 1000 * 64 | 69 | 159 | 74 | 254 |
| 1000 * 128 | 70 | 178 | 79 | 441 |
| 1000 * 256 | 68 | 205 | 82 | 777 |
| 1000 * 512 | 143 | 409 | 91 | 1564 |
| 1000 * 1024 | 54 | 666 | 75 | 2804 |
| 1000 * 2048 | 116 | 1133 | 95 | 5238 |
| 1000 * 4096 | 93 | 1828 | 128 | 10560 |
| 1000 * 8192 | 101 | 3949 | 201 | 21530 |
| 1000 * 16384 | 106 | 7676 | 314 | 40361 |
| 1000 * 32768 | 95 | 14229 | 558 | 80580 |
| 1000 * 65536 | 128 | 31589 | 1058 | 160970 |
| 1000 * 131072 | 117 | 58080 | 1879 | 331348 |
| 1000 * 262144 | 151 | 123197 | 4144 | 654644 |
| 1000 * 524288 | 128 | 231918 | 7360 | 1319933 |
| 1000 * 1048576 | 4911 | 626603 | 191911 | 2888054 |
| 1000 * 2097152 | 6616 | 1210723 | 346860 | 5590725 |
| 1000 * 4194304 | 7037 | 2395864 | 669042 | 11304400 |

[chart of the benchmark results]

Y axis is logarithmic (base 10), X axis is logarithmic (base 2).

@dcharkes
Collaborator

Sweet! ❤️

So CoTaskMemAlloc with memset is always faster than CoTaskMemAlloc with .asTypedList().fillRange() and CoTaskMemAlloc with the loop.
And CoTaskMemAlloc with memset is an order of magnitude slower than HeapAlloc with the HEAP_ZERO_MEMORY flag for 100 KB+ allocations.

I've filed a fix to use memset.

For the large allocations maybe we should provide another allocator for windows. (But that will not allow us to use NativeFinalizers to clean up the memory.)

Thanks so much for the benchmarks @MindStudioOfficial! Are you unblocked by either using malloc+memset or a copy of the older package:ffi calloc (without support for native finalizers)?

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

Does calloc in μs ever drop under malloc+memset? For a single byte? For 10 bytes?

Doesn't look like it.

And could you verify whether .asTypedList().fillRange is slower than memset?

It's the slowest.

Are you unblocked by either using a malloc+memset[...]

Yes, I've made sure to use malloc whenever possible (mostly the data gets overwritten anyway).

@MindStudioOfficial
Author

MindStudioOfficial commented Oct 20, 2022

I've also tried using the C-style calloc:

typedef _c_calloc = Pointer<Void> Function(IntPtr number, IntPtr size);
typedef _dart_calloc = Pointer<Void> Function(int number, int size);

final ccalloc = DynamicLibrary.process().lookupFunction<_c_calloc, _dart_calloc>('calloc');

Which came out faster than malloc+memset.

[chart comparing C calloc to malloc+memset]

@dcharkes
Collaborator

dcharkes commented Oct 20, 2022

We don't use malloc and calloc on Windows, because every .dll has its own implementation of malloc/calloc/free, and the process will segfault if you pass memory from one dll to another and try to free it there.

@mahesh-hegde
Contributor

I've also tried using the c style calloc: Which came out faster than malloc+memset.

Calloc can be faster than malloc + memset because the OS zeroes out virtual memory anyway before handing out pages, and a clever implementation of calloc can avoid repeating that work. I'd imagine that if calloc gets fresh pages from the OS via sbrk() or an equivalent syscall, it wouldn't zero out that memory again.

The difference between winHeapAlloc and winCoTaskMemAlloc could be along the same lines.


Footnote: the speedup from using memset is also expected. memset almost always uses SIMD or special instructions under the hood. I once benchmarked a naive AVX-256 memset against setting Uint64s to 0; the AVX one was ~2.5 times faster. I assume the libc implementation is much more clever.

@dcharkes dcharkes transferred this issue from dart-archive/ffi Jan 16, 2024
@alexmercerind

I experience something similar... media_kit bundles ffi: ^1.2.1 to continue working properly.
