Implement StaticBlockAllocator for HashMap, List, RBMap, RBSet. #97016

Open
wants to merge 1 commit into master

Conversation

@Nazarwadim (Contributor) commented Sep 14, 2024

About this PR

This PR improves performance and reduces the RAM usage of the entire engine wherever HashMap, List, RBMap, and RBSet are used. I implemented a new StaticBlockAllocator for these structures, as well as a BlockAllocator that StaticBlockAllocator uses internally.

How are elements in these structures allocated?

Structures like HashMap, List, RBMap, and RBSet currently use malloc (wrapped in the Memory class) to allocate and free memory for each new element. As many know, malloc is a general-purpose allocator: for every element it stores bookkeeping metadata, and it typically reserves a little more space than requested in case the block is later grown with realloc. That is, if we ask for 16 bytes, malloc may hand out 32: 8 bytes to store the allocated size and 8 bytes of slack for realloc. This is wasted here, because a List element is never reallocated to a larger size.

Here, Memory refers to the Memory class from core.
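
To illustrate the overhead described above, here is a minimal sketch of a malloc-style wrapper that stores a size header in front of each allocation. This is illustrative only, not the engine's actual Memory implementation; alloc_with_header and free_with_header are made-up names.

#include <cstdint>
#include <cstdlib>

void *alloc_with_header(size_t p_bytes) {
	// A general-purpose allocator keeps bookkeeping (here, the block size)
	// in front of the pointer it returns, so a 16-byte request consumes
	// noticeably more than 16 bytes.
	uint8_t *mem = (uint8_t *)malloc(p_bytes + sizeof(uint64_t));
	if (!mem) {
		return nullptr;
	}
	*(uint64_t *)mem = p_bytes; // Header: remember the requested size.
	return mem + sizeof(uint64_t);
}

void free_with_header(void *p_ptr) {
	// Step back over the header to recover the pointer malloc gave us.
	free((uint8_t *)p_ptr - sizeof(uint64_t));
}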

Custom allocators

The solution is to use custom allocators. Godot currently uses PagedAllocator for RID allocation and for some HashMaps. However, it is a poor fit as a default allocator: it must be initialized for each new instance, and the allocator itself is large, so it is unsuitable for scene use; nodes would weigh twice as much, and adding the first element to a Node would take a long time.

So, as a default, it turned out best to implement a static allocator that keeps its own allocator for each element size. PagedAllocator proved unsuitable for this purpose, so I used my BlockAllocator instead.

What is BlockAllocator?

BlockAllocator is a hybrid of PagedAllocator and a stack allocator. I created it about half a year ago as my first allocator. When an element is deleted, it is appended to a free list. This list is stored in place of the deleted elements themselves, which saves a lot of memory. As long as the free list is not empty, the allocator's bump pointer does not advance. Each new page doubles the previous page's size. It turned out to perform even better than PagedAllocator.

[Diagram: how free and alloc work.]

[Diagram: how blocks of blocks are interconnected.]
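
For clarity, here is a minimal sketch of this allocation scheme. BlockAllocatorSketch is a made-up name, and the PR's real implementation differs in details; for instance, it keeps track of all its pages, whereas this sketch simply abandons old pages.

#include <cstdint>
#include <cstdlib>

class BlockAllocatorSketch {
	size_t block_size; // Fixed element size; must be >= sizeof(void *).
	uint8_t *page = nullptr;
	size_t page_size = 0;
	size_t used = 0; // Bump pointer within the current page.
	void *free_list = nullptr; // Linked list threaded through freed blocks.

	void grow() {
		page_size = page_size > 0 ? page_size * 2 : 64 * block_size; // Each new page doubles in size.
		page = (uint8_t *)malloc(page_size); // Sketch only: old pages are never released.
		used = 0;
	}

public:
	explicit BlockAllocatorSketch(size_t p_block_size) :
			block_size(p_block_size) {}

	void *alloc() {
		if (free_list) { // Reuse freed blocks first; the bump pointer stays put.
			void *mem = free_list;
			free_list = *(void **)free_list;
			return mem;
		}
		if (page == nullptr || used + block_size > page_size) {
			grow();
		}
		void *mem = page + used; // Free list empty: advance the bump pointer.
		used += block_size;
		return mem;
	}

	void free(void *p_mem) {
		// The freed block itself stores the next pointer, so the free
		// list costs no extra memory.
		*(void **)p_mem = free_list;
		free_list = p_mem;
	}
};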

What is StaticBlockAllocator?

StaticBlockAllocator is a static allocator. Given an element size, it hands out an id, which is then used to allocate and free data. When an id is requested, the allocator checks whether it already has an allocator of the required size; if not, it creates one. Each thread has its own array of BlockAllocators with unique sizes, so if two threads both use size 32, two separate allocators of that size are created.
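
A hedged sketch of that size-to-id lookup, building on the BlockAllocatorSketch above (all names are assumptions, not the PR's actual API):

#include <vector>

struct SizeEntry {
	size_t size;
	BlockAllocatorSketch *allocator;
};

// One table per thread, so the same size in two threads yields two
// separate allocators.
static thread_local std::vector<SizeEntry> allocators;

size_t get_allocator_id(size_t p_size) {
	for (size_t i = 0; i < allocators.size(); i++) {
		if (allocators[i].size == p_size) {
			return i; // An allocator of this size already exists.
		}
	}
	allocators.push_back({ p_size, new BlockAllocatorSketch(p_size) });
	return allocators.size() - 1;
}

void *alloc_by_id(size_t p_id) {
	return allocators[p_id].allocator->alloc();
}

void free_by_id(size_t p_id, void *p_mem) {
	allocators[p_id].allocator->free(p_mem);
}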

What is TypedStaticBlockAllocator?

This is the final templated allocator used by our structures. It is 8 bytes in size and stores the id provided by StaticBlockAllocator. Through this id it allocates and frees data, and it also runs the constructor and destructor for our classes.
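
A sketch of such a typed wrapper, reusing the helpers from the previous sketch (again, assumed names rather than the PR's exact code):

#include <new>
#include <utility>

template <typename T>
class TypedStaticBlockAllocatorSketch {
	size_t alloc_id = get_allocator_id(sizeof(T)); // The only member: 8 bytes on 64-bit.

public:
	template <typename... Args>
	T *alloc(Args &&...p_args) {
		void *mem = alloc_by_id(alloc_id);
		return new (mem) T(std::forward<Args>(p_args)...); // Placement new runs the constructor.
	}

	void free(T *p_ptr) {
		p_ptr->~T(); // Run the destructor before handing the block back.
		free_by_id(alloc_id, (void *)p_ptr);
	}
};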

Performance Tests

Let's start by testing List<int>.

Master branch (malloc implementation), List<int>, 10 000 000 elements:

  • Time insert: 236674 usec
  • Time iteration: 39287 usec
  • Time clear: 141672 usec
  • Time total: 417736 usec

TypedStaticBlockAllocator:

  • Time insert: 141104 usec
  • Time iteration: 25416 usec
  • Time clear: 76181 usec
  • Time total: 242765 usec

Direct usage of BlockAllocator:

  • Time insert: 96636 usec
  • Time iteration: 26894 usec
  • Time clear: 39474 usec
  • Time total: 163112 usec

Direct usage of PagedAllocator:

  • Time insert: 106321 usec
  • Time iteration: 31373 usec
  • Time clear: 59099 usec
  • Time total: 196889 usec

Memory Usage List<int> for 100 000 000 elements:

  • BlockAllocator: 2.95 GB
  • PagedAllocator: 3.62 GB
  • Malloc: 4.63 GB

So, as you can see, my allocator performs even better than PagedAllocator (although I expected it to be slower). Note the improvement in iteration time in particular: it means much faster access to the allocated data, since the layout is more cache-friendly.

Code used:

#include <chrono>

#include "core/string/print_string.h" // print_line
#include "core/templates/list.h" // List

using namespace std;

int var = 0; // Sink for the iteration loop, so it is not optimized away.

template <typename T>
void test_list() {
	T list_standard;
	auto start = chrono::system_clock::now();
	{
		auto clock_start = chrono::system_clock::now();
		for (int i = 0; i < 10000000; i++) {
			list_standard.push_back(i);
		}
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time insert:", current_time);
	}

	{
		auto clock_start = chrono::system_clock::now();
		for (const int num : list_standard) {
			var = num;
		}
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time iteration:", current_time);
	}

	{
		auto clock_start = chrono::system_clock::now();
		list_standard.clear();
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time clear:", current_time);
	}
	auto end = chrono::system_clock::now();
	auto current_time = chrono::duration_cast<chrono::microseconds>(end - start).count();
	print_line("Time total:", current_time);
}

void main_benchmark() {
	test_list<List<int>>();
}

void main_benchmark() {
	test_list<List<int>>();
}

Dictionary test.

Dictionary[int, String]

Master branch:

  • Time insert: 2655 msec
  • Time iteration: 325 msec
  • Time clear: 801 msec
  • Time total: 5116 msec

Current PR:

  • Time insert: 2472 msec
  • Time iteration: 288 msec
  • Time clear: 500 msec
  • Time total: 3260 msec

Code used:

func _ready() -> void:
	var start_test :int = Time.get_ticks_msec()
	var dict : Dictionary[int, String]
	if true:
		var start :int = Time.get_ticks_msec()
		for i in 10000000:
			dict[i] = " "
		var end :int = Time.get_ticks_msec()
		print("Time insert:", end - start)
	
	if true:
		var start :int = Time.get_ticks_msec()
		for n in dict.keys():
			pass
		var end :int = Time.get_ticks_msec()
		print("Time iteration:", end - start)
		
	if true:
		var start :int = Time.get_ticks_msec()
		dict.clear()
		var end :int = Time.get_ticks_msec()
		print("Time clear:", end - start)
	var end_test :int = Time.get_ticks_msec()
	print("Time total:", end_test - start_test)

Memory Usage Dictionary[int, String] for 100 000 000 elements:

  • Master: 11.37 GB
  • Current PR: 8.4 GB

Code used:

for i in 100000000:
	dict[i] = " "
while true:
    pass

The speed of adding children to the node.

Adding 1 000 000 Nodes as children:

  • Master: 3225423 usec
  • Current PR: 2825762 usec

Code used:

func _ready() -> void:
	var start :int = Time.get_ticks_usec()
	for i in 1000000:
		add_child(Node.new())
	
	var end :int = Time.get_ticks_usec()
	print(end - start)

Some improved performance in MRPs

With the MRP from #85871, this PR improves performance and reduces memory usage in the editor:

  • Master: 3.73 GB
  • Current PR: 2.97 GB

Memory usage can be checked via htop or task manager.

Until #92554 is merged, this PR improves animation performance by 5%.

Also, the MRP from #93568 loads 5% faster, and in general projects load 5-10% faster.

Improving mesh generation

I used this demo: https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel.
The chart below shows the average mesh generation time for a chunk, before and after.

[Chart: average chunk mesh generation time, before vs. after]

Better multithreading performance

See this comment: #97016 (comment)

Godot Benchmarks

@Nazarwadim (Contributor, Author) commented Sep 16, 2024

I ran the godot-benchmarks suite.
Compile command:

scons CC=gcc-12 CXX=g++-12 scu_build=yes scu_limit=1024 lto=full linker=mold

master.json
StaticBlockAllocator.json
Edit: These are old benchmarks; the new ones are in the PR description.

@MewPurPur (Contributor) commented:

Something I noticed is that RSA key generation has become 5 times slower according to these benchmarks. Might be worth looking at?

@Nazarwadim (Contributor, Author) commented:

It looks like the RSA benchmark is not deterministic.

Master:
[Three benchmark screenshots of RSA key generation times]

Current PR:
[Three benchmark screenshots of RSA key generation times]

Anyway, I don't think these structures are used there, or that they could affect that benchmark's performance.

@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 443bcb6 to 5382acb on September 17, 2024 16:30
@Nazarwadim (Contributor, Author) commented Sep 17, 2024

I did find a regression created by this PR: a significant multithreading degradation. Data of the same structure type from different threads ended up next to each other in memory, so the threads were effectively limited to the shared L3 cache.

Now I have not only fixed it, but even improved on master: separate allocators are used for each thread, which lets each thread use its own mutex and greatly improves multithreaded performance.
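
A sketch of the per-thread locking scheme this describes, reusing BlockAllocatorSketch from above (assumed structure; a mutex is still needed because an element allocated on one thread may be freed from another):

#include <mutex>

struct PerThreadAllocator {
	std::mutex mutex; // Per-thread lock: no contention with other threads' allocations.
	BlockAllocatorSketch blocks{ 32 };

	void *alloc() {
		std::lock_guard<std::mutex> lock(mutex);
		return blocks.alloc();
	}

	void free(void *p_mem) {
		std::lock_guard<std::mutex> lock(mutex);
		blocks.free(p_mem);
	}
};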

The fix can be tested with this code:

func _some_job(n):
	var dict : Dictionary
	var start :int = Time.get_ticks_usec()
	for i in 1000000:
		dict[i] = i
	var end :int = Time.get_ticks_usec()
	for res in dict:
		pass
	print("Time insert:", end - start)

func _ready() -> void:
	var start :int = Time.get_ticks_usec()
	var task_id := WorkerThreadPool.add_group_task(_some_job, 8, -1, true)
	
	WorkerThreadPool.wait_for_group_task_completion(task_id)
	var end :int = Time.get_ticks_usec()
	print("Full Time insert:", end - start)

Results
Master:

Time insert:509230
Time insert:592604
Time insert:559632
Time insert:577218
Time insert:600717
Time insert:661088
Time insert:591868
Time insert:634930
Full Time insert:1361299

Before fixing:

Time insert:2104979
Time insert:2193414
Time insert:2261067
Time insert:2261572
Time insert:2256125
Time insert:2301050
Time insert:2307332
Time insert:2352035
Full Time insert:5498995

Now:

Time insert:335565
Time insert:338475
Time insert:374039
Time insert:483497
Time insert:484436
Time insert:485592
Time insert:493170
Time insert:490287
Full Time insert:790270

Edit: Actually, the multithreading bottleneck is the atomic variable in the Memory class, not malloc itself:

alloc_count.increment();

@Nazarwadim force-pushed the implement_static_block_allocator branch 4 times, most recently from 1ac12d4 to 800bd55 on September 19, 2024 16:26
@a527919 commented Sep 30, 2024

nice

@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 60bd85b to c9bb60a on October 2, 2024 14:58
@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 85aab54 to 2ffd3ee on November 22, 2024 16:41