Implement StaticBlockAllocator for HashMap, List, RBMap, RBSet. #97016
base: master
Conversation
I did the godot-benchmark test.
Something I noticed is that RSA key generation has become 5 times slower according to these benchmarks. Might be worth looking at?
I did find a regression introduced by this PR. Multithreaded performance degraded significantly because data from different threads, but of the same structure type, ended up next to each other in memory, so we were effectively only working out of the L3 processor cache. I have now not only fixed this but improved on it: separate allocators are used for each thread, which also gives each thread its own mutex and greatly improves multithreaded performance. It can be tested with this code:
func _some_job(n):
	var dict : Dictionary
	var start : int = Time.get_ticks_usec()
	for i in 1000000:
		dict[i] = i
	var end : int = Time.get_ticks_usec()
	for res in dict:
		pass
	print("Time insert:", end - start)

func _ready() -> void:
	var start : int = Time.get_ticks_usec()
	var task_id := WorkerThreadPool.add_group_task(_some_job, 8, -1, true)
	WorkerThreadPool.wait_for_group_task_completion(task_id)
	var end : int = Time.get_ticks_usec()
	print("Full Time insert:", end - start)
Results
Before fixing:
Now:
Edit: Actually the multithreading bottleneck is the atomic variable in the Memory class, not malloc (line 112 in a40fc23).
nice
About this PR
This PR improves performance and reduces RAM usage of the entire engine wherever HashMap, List, RBMap, and RBSet are used. I implemented a new StaticBlockAllocator for these structures, and also a BlockAllocator which StaticBlockAllocator uses internally.
How are elements in these structures allocated?
Structures like HashMap, List, RBMap, and RBSet currently use malloc, wrapped in the Memory class, to allocate and free memory for each new element. As many know, malloc is a general-purpose allocator: for each element it allocates a lot of metadata and also a little more space than we asked for, in case we later want to use realloc. That is, if we want to allocate 16 bytes of data, malloc will allocate 32, of which 8 bytes store the size we allocated and 8 bytes are reserved for realloc. Obviously we are never going to allocate more than the size of a List element, so this overhead is wasted. ("Memory" here means the Memory class from core.)
Custom allocators
The solution to this is to use custom allocators. Currently, Godot uses PagedAllocator for RID allocation and for some HashMaps. But it is a poor choice as a default allocator: it has to be initialized for every new container instance, and the allocator object itself is large, so it is not suitable for scene use; nodes would weigh twice as much, and adding the first element to a Node would take a long time.
So, as the default, it turned out to be best to implement a static allocator that has its own allocator for each element size. However, PagedAllocator turned out not to be suitable for this purpose, so I used my BlockAllocator instead.
What is BlockAllocator?
BlockAllocator is a hybrid of PagedAllocator and StackAllocator. I created it about half a year ago as my first allocator. When an element is deleted, it is written to a free_list; this list is stored in place of the deleted items themselves, which saves a lot of memory. While the free list is not empty, the allocator's pointer does not move further. With each new page, the page size is doubled. It turned out to be even better than PagedAllocator.
How free and alloc work.
How blocks of blocks are interconnected.
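To make the free-list mechanism concrete, here is a minimal sketch of a fixed-size block allocator of this kind. This is not the PR's actual code; the class name SketchBlockAllocator, the default page size, and the layout details are illustrative assumptions based on the description above.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Minimal sketch of a fixed-size block allocator with an intrusive free list.
// Freed slots store the "next free" link in place of the element itself.
class SketchBlockAllocator {
	struct FreeNode {
		FreeNode *next;
	};

	size_t block_size; // Size of one element (at least sizeof(FreeNode)).
	size_t page_elems; // Number of elements in the current page.
	size_t used_in_page = 0; // Bump pointer inside the current page.
	uint8_t *current_page = nullptr;
	std::vector<uint8_t *> pages; // All pages, released in the destructor.
	FreeNode *free_list = nullptr; // Singly linked list of freed slots.

public:
	explicit SketchBlockAllocator(size_t p_block_size, size_t p_first_page_elems = 64) :
			block_size(p_block_size < sizeof(FreeNode) ? sizeof(FreeNode) : p_block_size),
			page_elems(p_first_page_elems) {}

	void *alloc() {
		// Reuse freed slots first; the bump pointer does not advance
		// while the free list is non-empty.
		if (free_list) {
			FreeNode *node = free_list;
			free_list = node->next;
			return node;
		}
		if (!current_page || used_in_page == page_elems) {
			if (current_page) {
				page_elems *= 2; // Each new page is twice as large as the previous one.
			}
			current_page = (uint8_t *)malloc(block_size * page_elems);
			pages.push_back(current_page);
			used_in_page = 0;
		}
		return current_page + (used_in_page++) * block_size;
	}

	void free(void *p_ptr) {
		// The freed slot itself becomes the new head of the free list.
		FreeNode *node = (FreeNode *)p_ptr;
		node->next = free_list;
		free_list = node;
	}

	~SketchBlockAllocator() {
		for (uint8_t *page : pages) {
			::free(page);
		}
	}
};
```

The key point is that a freed slot stores the link to the next free slot inside itself, so the free list costs no extra memory, and new pages are only touched once all previously freed slots have been reused.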
What is StaticBlockAllocator?
StaticBlockAllocator is a static allocator. For a given element size it hands out an ID, which we then use to allocate and free data. When an ID is requested, the allocator checks whether it already has an allocator of the required size; if there is none, it creates one. Each thread has its own array of BlockAllocators with unique sizes, so if two threads both use size 32, two separate allocators of that size will be created.
What is TypedStaticBlockAllocator?
This is the final templated allocator for our structures. It weighs 8 bytes and stores the ID provided by StaticBlockAllocator. Using this ID it allocates and frees data, and it also calls the constructor and destructor for our classes.
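As a rough illustration of how these two pieces could fit together, here is a simplified sketch. It is not the PR's actual implementation; it reuses the SketchBlockAllocator from the previous example, and the thread_local registry, the ID scheme, and the method names are assumptions based on the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <utility>
#include <vector>

// Sketch of a static, per-thread registry that hands out one ID per element size.
class SketchStaticBlockAllocator {
	struct Entry {
		size_t size;
		SketchBlockAllocator *allocator;
	};

	static std::vector<Entry> &table() {
		// thread_local: every thread gets its own allocator table, so two threads
		// allocating the same size use two independent allocators (and locks).
		static thread_local std::vector<Entry> allocators;
		return allocators;
	}

public:
	static uint32_t get_allocator_id(size_t p_size) {
		std::vector<Entry> &allocators = table();
		for (uint32_t i = 0; i < allocators.size(); i++) {
			if (allocators[i].size == p_size) {
				return i; // An allocator of this size already exists on this thread.
			}
		}
		allocators.push_back({ p_size, new SketchBlockAllocator(p_size) });
		return (uint32_t)allocators.size() - 1;
	}

	static void *alloc(uint32_t p_id) { return table()[p_id].allocator->alloc(); }
	static void free(uint32_t p_id, void *p_ptr) { table()[p_id].allocator->free(p_ptr); }
};

// Sketch of the typed wrapper: it stores only the allocator ID and forwards
// construction/destruction of T (the PR describes the real wrapper as 8 bytes).
template <typename T>
class SketchTypedStaticBlockAllocator {
	uint32_t id = SketchStaticBlockAllocator::get_allocator_id(sizeof(T));

public:
	template <typename... Args>
	T *alloc(Args &&...p_args) {
		void *mem = SketchStaticBlockAllocator::alloc(id);
		return ::new (mem) T(std::forward<Args>(p_args)...); // placement-new runs the constructor
	}

	void free(T *p_obj) {
		p_obj->~T(); // Run the destructor manually before returning the slot.
		SketchStaticBlockAllocator::free(id, p_obj);
	}
};
```

With this kind of layout, HashMap, List, RBMap, and RBSet only need to carry the small typed allocator and call it per element, instead of going through malloc for every node.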
Performance Tests
Let's start by testing List<int>.
Master branch: malloc implementation, List<int>, 10 000 000 elements.
TypedStaticBlockAllocator:
Direct Usage BlockAllocator:
Direct Usage PagedAllocator:
Memory Usage List<int> for 100 000 000 elements:
So, as you can see, my allocator works even better than PagedAllocator (although I thought it would be slower). It is worth paying attention to the improvement in iteration time, which indicates much faster access to the allocated data, since it is more cache-friendly.
Code used:
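For reference, a minimal sketch of such a benchmark is shown below. This is not the author's original test code; the function name and output format are assumptions, and it only times insertion, iteration, and clearing of 10 000 000 elements using Godot's List and OS::get_singleton()->get_ticks_usec().

```cpp
#include "core/os/os.h"
#include "core/string/print_string.h"
#include "core/templates/list.h"

// Hypothetical benchmark: time insertion, iteration, and clearing of 10,000,000 ints.
void benchmark_list_int() {
	List<int> list;

	uint64_t t0 = OS::get_singleton()->get_ticks_usec();
	for (int i = 0; i < 10'000'000; i++) {
		list.push_back(i);
	}
	uint64_t t1 = OS::get_singleton()->get_ticks_usec();

	int64_t sum = 0;
	for (const int &E : list) {
		sum += E; // Touch every element so iteration is not optimized away.
	}
	uint64_t t2 = OS::get_singleton()->get_ticks_usec();

	list.clear();
	uint64_t t3 = OS::get_singleton()->get_ticks_usec();

	print_line("insert us: " + itos(t1 - t0));
	print_line("iterate us: " + itos(t2 - t1) + " (sum " + itos(sum) + ")");
	print_line("clear us: " + itos(t3 - t2));
}
```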
Dictionary test.
Dictionary[int, String]
Master branch:
Current PR:
Code Used:
Memory Usage Dictionary[int, String] for 100 000 000 elements:
Code used:
The speed of adding children to a node.
Adding 1 000 000 Nodes as children:
Performance improvements in some MRPs
MRP from #85871.
Improves performance and reduces memory usage in the editor.
Memory usage can be checked via htop or task manager.
Until #92554 is merged, this PR improves animation performance by 5%.
Also, the MRP from #93568 loads 5% faster, and in general projects load 5-10% faster.
Improving mesh generation
I used this demo: https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel.
Here, the chart shows the before and after average mesh generation time for a chunk.
Better multithreading performance
See this comment: #97016 (comment).
Godot Benchmarks