Switching from 1.2.1 to 2.x.x drastically slows my app #912
Comments
The only difference is zero-ing out the memory. Are you allocating large chunks of memory? For large chunks, using https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/memset-wmemset?view=msvc-170 Could you check if using |
cc @timsneath who introduced the zeroing out of memory for Windows. |
Yes. It is about 24 MB per frame at, for example, 60 fps. That would be 1.44 GB per second (of course, every frame gets freed).
How do I access memset in Dart? |
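As a quick sanity check of the figures quoted above, the sketch below recomputes the allocation rate and the per-frame time budget. It assumes decimal megabytes (1 MB = 10^6 bytes), and the function names are made up for illustration.

```dart
/// Allocation rate in GB/s for a given frame size and frame rate.
/// Assumes decimal units (1 GB = 1e9 bytes).
double allocationRateGBps(int bytesPerFrame, int framesPerSecond) =>
    bytesPerFrame * framesPerSecond / 1e9;

/// Time available per frame, in milliseconds, at a given frame rate.
double frameBudgetMs(int framesPerSecond) => 1000 / framesPerSecond;

void main() {
  const bytesPerFrame = 24 * 1000 * 1000; // 24 MB per frame
  const fps = 60;

  // Prints "Allocation rate: 1.44 GB/s".
  print('Allocation rate: '
      '${allocationRateGBps(bytesPerFrame, fps).toStringAsFixed(2)} GB/s');

  // Prints "Frame budget: 16.67 ms".
  print('Frame budget: ${frameBudgetMs(fps).toStringAsFixed(2)} ms');
}
```

At 60 fps the budget is roughly 16.7 ms per frame, so any zeroing cost that pushes frame processing beyond that will drop frames.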
Try
typedef _dart_memset = void Function(Pointer<Uint8>, int, int);
typedef _c_memset = Void Function(Pointer<Uint8>, Int32, IntPtr);
fbMemset = DynamicLibrary.process()
    .lookupFunction<_c_memset, _dart_memset>('memset');
You probably need to be on the master branch to get access to |
I've written a short Dart example to test this:
import 'dart:ffi';
import 'package:ffi/ffi.dart';
typedef _dart_memset = void Function(Pointer<Uint8> dest, int value, int count);
typedef _c_memset = Void Function(Pointer<Uint8> dest, Int32 value, IntPtr count);
/// Sets a buffer to a specified character.
///
/// [dest] Pointer to destination.
/// [value] value to set.
/// [count] Number of characters.
final memset = DynamicLibrary.process().lookupFunction<_c_memset, _dart_memset>('memset');
void main(List<String> arguments) {
  // Case 1: malloc only (memory is left uninitialized).
  Stopwatch sw = Stopwatch()..start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    malloc.free(p);
  }
  sw.stop();
  int us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc took \t\t$usμs.");

  // Case 2: calloc (memory is zeroed by package:ffi).
  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = calloc.call<Uint8>(1024 * 1024); // 1 MiB
    calloc.free(p);
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with calloc took \t\t$usμs.");

  // Case 3: malloc followed by an explicit memset to zero.
  sw.reset();
  sw.start();
  for (int i = 0; i < 1000; i++) {
    Pointer<Uint8> p = malloc.call<Uint8>(1024 * 1024); // 1 MiB
    memset(p, 0, 1024 * 1024);
    malloc.free(p); // free with the same allocator that allocated
  }
  sw.stop();
  us = sw.elapsedMicroseconds;
  print("Allocating 1000 * 1MiB with malloc + memset took \t$usμs.");
}
|
@dcharkes btw: It still says "not available on Windows" in the doc comments. Still works though :D |
Thanks for the benchmarks! We migrated from
result = winHeapAlloc(processHeap, /*flags=*/ HEAP_ZERO_MEMORY, byteCount).cast();
to
result = winCoTaskMemAlloc(byteCount).cast();
_zeroMemory(result, byteCount);
in to enable using
We should probably replace the zeroing loop with memset once
@MindStudioOfficial could you check at which allocation size the loop and memset are the same speed? (I suspect that for very small allocations, the loop is faster.)
@timsneath looks like our migration to
@MindStudioOfficial as a workaround, you could define your own
I wonder if we should include a setting/parameter or an extra allocator on Windows to re-expose allocating with |
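One way to read the workaround suggested above is a small custom allocator that pairs `malloc` with `memset`. The following is only a sketch under assumptions: the class name `MemsetCalloc` is hypothetical, it relies on `DynamicLibrary.process()` being able to resolve `memset` (which, as noted earlier in the thread, may require a recent SDK on Windows), and it is not the allocator that package:ffi ships.

```dart
import 'dart:ffi';

import 'package:ffi/ffi.dart';

typedef MemsetNative = Void Function(
    Pointer<Uint8> dest, Int32 value, IntPtr count);
typedef MemsetDart = void Function(Pointer<Uint8> dest, int value, int count);

// Assumption: memset is resolvable from the current process.
final MemsetDart _memset = DynamicLibrary.process()
    .lookupFunction<MemsetNative, MemsetDart>('memset');

/// Hypothetical allocator: allocates with package:ffi's `malloc`,
/// then zeroes the block with the C runtime's `memset`.
class MemsetCalloc implements Allocator {
  const MemsetCalloc();

  @override
  Pointer<T> allocate<T extends NativeType>(int byteCount, {int? alignment}) {
    final Pointer<T> result =
        malloc.allocate<T>(byteCount, alignment: alignment);
    _memset(result.cast<Uint8>(), 0, byteCount);
    return result;
  }

  // Release through the same allocator that produced the block.
  @override
  void free(Pointer pointer) => malloc.free(pointer);
}

const memsetCalloc = MemsetCalloc();

void main() {
  const frameBytes = 24 * 1024 * 1024;
  final Pointer<Uint8> frame = memsetCalloc<Uint8>(frameBytes);
  try {
    print('first byte: ${frame[0]}, last byte: ${frame[frameBytes - 1]}');
  } finally {
    memsetCalloc.free(frame);
  }
}
```

Because it implements `Allocator`, such a class can be passed anywhere an allocator is accepted, so call sites that currently use `calloc` would only need to swap the allocator instance.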
Why not in Dart itself? |
That's another good thing to try. Though I would expect |
Oopsie 🙃 |
Some more benchmarks
|
Thanks @MindStudioOfficial! ❤️ What about smaller values? (In a longer loop)
Does And could you verify whether |
I did more benchmarks this time using a logarithmic approach for the sizes.
Y axis is logarithmic base 10 |
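A size sweep of the kind described above could be set up roughly as follows. This is a minimal sketch, not the benchmark that produced the posted charts: the helper names, the exponent range, and the iteration cap are all assumptions, and the byte-wise loop only stands in for the package's internal zeroing loop.

```dart
import 'dart:ffi';
import 'dart:math' show min;

import 'package:ffi/ffi.dart';

typedef MemsetNative = Void Function(
    Pointer<Uint8> dest, Int32 value, IntPtr count);
typedef MemsetDart = void Function(Pointer<Uint8> dest, int value, int count);

// Assumption: memset is resolvable from the current process.
final MemsetDart memset = DynamicLibrary.process()
    .lookupFunction<MemsetNative, MemsetDart>('memset');

/// Allocation sizes 2^minExponent .. 2^maxExponent (inclusive), in bytes.
List<int> powerOfTwoSizes(int minExponent, int maxExponent) =>
    [for (var e = minExponent; e <= maxExponent; e++) 1 << e];

/// Zeroes a buffer one byte at a time from Dart.
void zeroWithLoop(Pointer<Uint8> p, int byteCount) {
  for (var i = 0; i < byteCount; i++) {
    p[i] = 0;
  }
}

/// Runs [body] [iterations] times and returns elapsed microseconds.
int timeMicros(int iterations, void Function() body) {
  final sw = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    body();
  }
  sw.stop();
  return sw.elapsedMicroseconds;
}

void main() {
  const totalBytes = 1 << 28; // aim to touch about 256 MiB per measurement
  for (final size in powerOfTwoSizes(4, 24)) {
    // More iterations for small sizes, capped to keep the run short.
    final iterations = min(totalBytes ~/ size, 100000);
    final loopUs = timeMicros(iterations, () {
      final p = malloc<Uint8>(size);
      zeroWithLoop(p, size);
      malloc.free(p);
    });
    final memsetUs = timeMicros(iterations, () {
      final p = malloc<Uint8>(size);
      memset(p, 0, size);
      malloc.free(p);
    });
    print('$size bytes x $iterations: loop ${loopUs}us, memset ${memsetUs}us');
  }
}
```

Comparing the two columns per size shows where, if anywhere, the loop and `memset` cross over on a given machine; results in debug (JIT) and release (AOT) mode can differ considerably.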
Sweet! ❤️ So I've filed to use For the large allocations maybe we should provide another allocator for Windows. (But that will not allow us to use Thanks so much for the benchmarks @MindStudioOfficial! Are you unblocked by either using malloc+memset or a copy of the older ffi calloc (without support for native finalizers)? |
Doesn't look like it.
It's the slowest.
Yes, I've made sure to use malloc whenever possible (mostly the data is overwritten anyway). |
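To illustrate when plain `malloc` is safe, here is a small sketch in which every byte of the buffer is written before anything reads it, so zeroing it first would be wasted work. The function name `copyToNative` is hypothetical.

```dart
import 'dart:ffi';
import 'dart:typed_data';

import 'package:ffi/ffi.dart';

/// Copies [source] into freshly allocated native memory.
/// The whole block is overwritten, so uninitialized memory is never read.
Pointer<Uint8> copyToNative(Uint8List source) {
  final Pointer<Uint8> dest = malloc<Uint8>(source.length);
  dest.asTypedList(source.length).setAll(0, source);
  return dest;
}

void main() {
  final frame = Uint8List.fromList([1, 2, 3, 4]);
  final Pointer<Uint8> native = copyToNative(frame);
  try {
    print(native.asTypedList(frame.length)); // prints [1, 2, 3, 4]
  } finally {
    malloc.free(native);
  }
}
```

If only part of the buffer is written, or native code reads it before Dart fills it, a zeroing allocator is still needed.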
I've also tried using the C-style calloc:
typedef _c_calloc = Pointer<Void> Function(IntPtr number, IntPtr size);
typedef _dart_calloc = Pointer<Void> Function(int number, int size);
final ccalloc = DynamicLibrary.process().lookupFunction<_c_calloc, _dart_calloc>('calloc');
Which came out faster than malloc+memset. |
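When calling the C runtime's `calloc` directly like this, the block should be released with the matching C `free`, not with package:ffi's `malloc.free`, since the thread notes that package:ffi 2.x allocates through `CoTaskMemAlloc` on Windows. A minimal sketch, assuming both symbols resolve through `DynamicLibrary.process()`; the typedef and helper names are made up for illustration.

```dart
import 'dart:ffi';

typedef CallocNative = Pointer<Void> Function(IntPtr count, IntPtr size);
typedef CallocDart = Pointer<Void> Function(int count, int size);
typedef FreeNative = Void Function(Pointer<Void> block);
typedef FreeDart = void Function(Pointer<Void> block);

// Assumption: the C runtime's calloc and free are visible in the process.
final DynamicLibrary _process = DynamicLibrary.process();
final CallocDart cCalloc =
    _process.lookupFunction<CallocNative, CallocDart>('calloc');
final FreeDart cFree = _process.lookupFunction<FreeNative, FreeDart>('free');

/// Allocates [byteCount] zeroed bytes with the C runtime's calloc.
Pointer<Uint8> allocateZeroed(int byteCount) {
  final Pointer<Void> block = cCalloc(byteCount, 1);
  if (block.address == 0) {
    throw StateError('calloc failed for $byteCount bytes');
  }
  return block.cast<Uint8>();
}

void main() {
  final Pointer<Uint8> buffer = allocateZeroed(1024 * 1024); // 1 MiB
  try {
    print('first byte: ${buffer[0]}');
  } finally {
    // Pair calloc with free from the same C runtime.
    cFree(buffer.cast<Void>());
  }
}
```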
We don't use |
Calloc can be faster than malloc + memset because the OS zeroes out virtual memory anyway before handing out pages, and a clever implementation of calloc can avoid repeating that work. I'd imagine that if calloc calls sbrk() or an equivalent syscall, it wouldn't zero out that memory again. The difference between winHeapAlloc and winCoTaskMemAlloc could be along the same lines. Footnote: a speedup from using memset is also expected. memset almost always uses SIMD or special instructions under the hood. I once benchmarked a naive 256-bit AVX memset against setting Uint64s to 0, and the AVX one was ~2.5 times faster. I assume a libc implementation will be much more clever. |
I experience something similar... media_kit bundles ffi: ^1.2.1 to continue working properly. |
In my app I allocate realtime video frames (24MB per frame * 60fps)
In my app I time the processing of a frame in milliseconds.
Using ffi: ^1.2.1, I get 8-11 ms per frame.
Using ffi: ^2.0.0, I get 70 to 200 ms per frame in DEBUG mode and 20-30 ms in RELEASE mode, which makes my app unusable.
I suspect that it has something to do with the switch mentioned in the changelog of 2.0.0.
I've also found an article that talks about a bug in Windows 10 which supposedly makes this a lot slower:
https://randomascii.wordpress.com/2022/07/11/slower-memory-zeroing-through-parallelism/
For now I can't upgrade to ffi 2.x.x since I can't use my app with this version.
Is this a known thing?
EDIT: switching from ffi.calloc to ffi.malloc reduces the time immensely, but of course that is not always applicable.