Implement StaticBlockAllocator for HashMap, List, RBMap, RBSet. #97016

Open
wants to merge 1 commit into master

Conversation

@Nazarwadim (Contributor) commented Sep 14, 2024

About this PR

This PR improves performance and reduces the RAM usage of the entire engine wherever HashMap, List, RBMap, and RBSet are used. I implemented a new StaticBlockAllocator for these structures, as well as a BlockAllocator that StaticBlockAllocator uses internally.

How are elements in these structures allocated?

Structures like HashMap, List, RBMap, and RBSet currently use malloc (wrapped in the Memory class) to allocate and free memory for each new element. As many know, malloc is a general-purpose allocator: for every element it stores bookkeeping metadata, and it typically reserves a little more space than requested in case the block is later grown with realloc. That is, if we ask for 16 bytes, malloc may hand out 32: 8 bytes to store the allocated size and 8 bytes of slack for realloc. This is wasted here, because a List element is never reallocated to a larger size.

Here, Memory refers to the Memory class from core.
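
To illustrate the overhead described above, here is a minimal sketch of a malloc-style wrapper that stores a size header in front of each allocation. This is illustrative only, not the engine's actual Memory implementation; alloc_with_header and free_with_header are made-up names.

#include <cstdint>
#include <cstdlib>

void *alloc_with_header(size_t p_bytes) {
	// A general-purpose allocator keeps bookkeeping (here, the block size)
	// in front of the pointer it returns, so a 16-byte request consumes
	// noticeably more than 16 bytes.
	uint8_t *mem = (uint8_t *)malloc(p_bytes + sizeof(uint64_t));
	if (!mem) {
		return nullptr;
	}
	*(uint64_t *)mem = p_bytes; // Header: remember the requested size.
	return mem + sizeof(uint64_t);
}

void free_with_header(void *p_ptr) {
	// Step back over the header to recover the pointer malloc gave us.
	free((uint8_t *)p_ptr - sizeof(uint64_t));
}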

Custom allocators

The solution is to use custom allocators. Godot currently uses PagedAllocator for RID allocation and for some HashMaps. However, it is a poor fit as a default allocator: it must be initialized for each new instance, and the allocator itself is large, so it is unsuitable for scene use; nodes would weigh twice as much, and adding the first element to a Node would take a long time.

So, as a default, it turned out best to implement a static allocator that keeps its own allocator for each element size. PagedAllocator proved unsuitable for this purpose, so I used my BlockAllocator instead.

What is BlockAllocator?

BlockAllocator is a hybrid of PagedAllocator and a stack allocator. I created it about half a year ago as my first allocator. When an element is deleted, it is appended to a free list. This list is stored in place of the deleted elements themselves, which saves a lot of memory. As long as the free list is not empty, the allocator's bump pointer does not advance. Each new page doubles the previous page's size. It turned out to perform even better than PagedAllocator.

[Diagram: how free and alloc work.]

[Diagram: how blocks of blocks are interconnected.]
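
For clarity, here is a minimal sketch of this allocation scheme. BlockAllocatorSketch is a made-up name, and the PR's real implementation differs in details; for instance, it keeps track of all its pages, whereas this sketch simply abandons old pages.

#include <cstdint>
#include <cstdlib>

class BlockAllocatorSketch {
	size_t block_size; // Fixed element size; must be >= sizeof(void *).
	uint8_t *page = nullptr;
	size_t page_size = 0;
	size_t used = 0; // Bump pointer within the current page.
	void *free_list = nullptr; // Linked list threaded through freed blocks.

	void grow() {
		page_size = page_size > 0 ? page_size * 2 : 64 * block_size; // Each new page doubles in size.
		page = (uint8_t *)malloc(page_size); // Sketch only: old pages are never released.
		used = 0;
	}

public:
	explicit BlockAllocatorSketch(size_t p_block_size) :
			block_size(p_block_size) {}

	void *alloc() {
		if (free_list) { // Reuse freed blocks first; the bump pointer stays put.
			void *mem = free_list;
			free_list = *(void **)free_list;
			return mem;
		}
		if (page == nullptr || used + block_size > page_size) {
			grow();
		}
		void *mem = page + used; // Free list empty: advance the bump pointer.
		used += block_size;
		return mem;
	}

	void free(void *p_mem) {
		// The freed block itself stores the next pointer, so the free
		// list costs no extra memory.
		*(void **)p_mem = free_list;
		free_list = p_mem;
	}
};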

What is StaticBlockAllocator?

StaticBlockAllocator is a static allocator. Given an element size, it hands out an id, which is then used to allocate and free data. When an id is requested, the allocator checks whether it already has an allocator of the required size; if not, it creates one. Each thread has its own array of BlockAllocators with unique sizes, so if two threads both use size 32, two separate allocators of that size are created.
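
A hedged sketch of that size-to-id lookup, building on the BlockAllocatorSketch above (all names are assumptions, not the PR's actual API):

#include <vector>

struct SizeEntry {
	size_t size;
	BlockAllocatorSketch *allocator;
};

// One table per thread, so the same size in two threads yields two
// separate allocators.
static thread_local std::vector<SizeEntry> allocators;

size_t get_allocator_id(size_t p_size) {
	for (size_t i = 0; i < allocators.size(); i++) {
		if (allocators[i].size == p_size) {
			return i; // An allocator of this size already exists.
		}
	}
	allocators.push_back({ p_size, new BlockAllocatorSketch(p_size) });
	return allocators.size() - 1;
}

void *alloc_by_id(size_t p_id) {
	return allocators[p_id].allocator->alloc();
}

void free_by_id(size_t p_id, void *p_mem) {
	allocators[p_id].allocator->free(p_mem);
}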

What is TypedStaticBlockAllocator?

This is the final templated allocator used by our structures. It is 8 bytes in size and stores the id provided by StaticBlockAllocator. Through this id it allocates and frees data, and it also runs the constructor and destructor for our classes.
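
A sketch of such a typed wrapper, reusing the helpers from the previous sketch (again, assumed names rather than the PR's exact code):

#include <new>
#include <utility>

template <typename T>
class TypedStaticBlockAllocatorSketch {
	size_t alloc_id = get_allocator_id(sizeof(T)); // The only member: 8 bytes on 64-bit.

public:
	template <typename... Args>
	T *alloc(Args &&...p_args) {
		void *mem = alloc_by_id(alloc_id);
		return new (mem) T(std::forward<Args>(p_args)...); // Placement new runs the constructor.
	}

	void free(T *p_ptr) {
		p_ptr->~T(); // Run the destructor before handing the block back.
		free_by_id(alloc_id, (void *)p_ptr);
	}
};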

Performance Tests

Let's start by testing List<int>.

Master branch (malloc implementation), List<int>, 10 000 000 elements:

  • Time insert: 236674 usec
  • Time iteration: 39287 usec
  • Time clear: 141672 usec
  • Time total: 417736 usec

TypedStaticBlockAllocator:

  • Time insert: 141104 usec
  • Time iteration: 25416 usec
  • Time clear: 76181 usec
  • Time total: 242765 usec

Direct usage of BlockAllocator:

  • Time insert: 96636 usec
  • Time iteration: 26894 usec
  • Time clear: 39474 usec
  • Time total: 163112 usec

Direct usage of PagedAllocator:

  • Time insert: 106321 usec
  • Time iteration: 31373 usec
  • Time clear: 59099 usec
  • Time total: 196889 usec

Memory Usage List<int> for 100 000 000 elements:

  • BlockAllocator: 2.95 GB
  • PagedAllocator: 3.62 GB
  • Malloc: 4.63 GB

So, as you can see, my allocator performs even better than PagedAllocator (although I expected it to be slower). Note the improvement in iteration time in particular: it means much faster access to the allocated data, since the layout is more cache-friendly.

Code used:

#include <chrono>

#include "core/string/print_string.h" // print_line
#include "core/templates/list.h" // List

using namespace std;

int var = 0; // Sink for the iteration loop, so it is not optimized away.

template <typename T>
void test_list() {
	T list_standard;
	auto start = chrono::system_clock::now();
	{
		auto clock_start = chrono::system_clock::now();
		for (int i = 0; i < 10000000; i++) {
			list_standard.push_back(i);
		}
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time insert:", current_time);
	}

	{
		auto clock_start = chrono::system_clock::now();
		for (const int num : list_standard) {
			var = num;
		}
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time iteration:", current_time);
	}

	{
		auto clock_start = chrono::system_clock::now();
		list_standard.clear();
		auto clock_now = chrono::system_clock::now();
		auto current_time = chrono::duration_cast<chrono::microseconds>(clock_now - clock_start).count();
		print_line("Time clear:", current_time);
	}
	auto end = chrono::system_clock::now();
	auto current_time = chrono::duration_cast<chrono::microseconds>(end - start).count();
	print_line("Time total:", current_time);
}

void main_benchmark() {
	test_list<List<int>>();
}

void main_benchmark() {
	test_list<List<int>>();
}

Dictionary test.

Dictionary[int, String]

Master branch:

  • Time insert: 2655 msec
  • Time iteration: 325 msec
  • Time clear: 801 msec
  • Time total: 5116 msec

Current PR:

  • Time insert: 2472 msec
  • Time iteration: 288 msec
  • Time clear: 500 msec
  • Time total: 3260 msec

Code used:

func _ready() -> void:
	var start_test :int = Time.get_ticks_msec()
	var dict : Dictionary[int, String]
	if true:
		var start :int = Time.get_ticks_msec()
		for i in 10000000:
			dict[i] = " "
		var end :int = Time.get_ticks_msec()
		print("Time insert:", end - start)
	
	if true:
		var start :int = Time.get_ticks_msec()
		for n in dict.keys():
			pass
		var end :int = Time.get_ticks_msec()
		print("Time iteration:", end - start)
		
	if true:
		var start :int = Time.get_ticks_msec()
		dict.clear()
		var end :int = Time.get_ticks_msec()
		print("Time clear:", end - start)
	var end_test :int = Time.get_ticks_msec()
	print("Time total:", end_test - start_test)

Memory Usage Dictionary[int, String] for 100 000 000 elements:

  • Master: 11.37 GB
  • Current PR: 8.4 GB

Code used:

for i in 100000000:
	dict[i] = " "
while true:
    pass

The speed of adding children to the node.

Adding 1 000 000 Nodes as children:

  • Master: 3225423 usec
  • Current PR: 2825762 usec

Code used:

func _ready() -> void:
	var start :int = Time.get_ticks_usec()
	for i in 1000000:
		add_child(Node.new())
	
	var end :int = Time.get_ticks_usec()
	print(end - start)

Some improved performance in MRPs

With the MRP from #85871, this PR improves performance and reduces memory usage in the editor:

  • Master: 3.73 GB
  • Current PR: 2.97 GB

Memory usage can be checked via htop or task manager.

Until #92554 is merged, this PR improves animation performance by 5%.

Also, the MRP from #93568 loads 5% faster, and in general projects load 5-10% faster.

Improving mesh generation

I used this demo: https://github.com/godotengine/godot-demo-projects/tree/master/3d/voxel.
The chart below shows the average mesh generation time for a chunk, before and after.

[Chart: average chunk mesh generation time, before vs. after]

Better multithreading performance

See this comment: #97016 (comment)

Godot Benchmarks

@Nazarwadim (Contributor, Author) commented Sep 16, 2024

I ran the godot-benchmarks suite.
Compile command:

scons CC=gcc-12 CXX=g++-12 scu_build=yes scu_limit=1024 lto=full linker=mold

master.json
StaticBlockAllocator.json
Edit: These are old benchmarks; the new ones are in the PR description.

@MewPurPur (Contributor) commented:

Something I noticed is that RSA key generation has become 5 times slower according to these benchmarks. Might be worth looking at?

@Nazarwadim (Contributor, Author) commented:

It looks like the RSA benchmark is not deterministic.

Master:
[Three benchmark screenshots of RSA key generation times]

Current PR:
[Three benchmark screenshots of RSA key generation times]

Anyway, I don't think these structures are used there, or that they could affect that benchmark's performance.

@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 443bcb6 to 5382acb on September 17, 2024 16:30
@Nazarwadim (Contributor, Author) commented Sep 17, 2024

I did find a regression created by this PR: a significant multithreading degradation. Data of the same structure type from different threads ended up next to each other in memory, so the threads were effectively limited to the shared L3 cache.

Now I have not only fixed it, but even improved on master: separate allocators are used for each thread, which lets each thread use its own mutex and greatly improves multithreaded performance.
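
A sketch of the per-thread locking scheme this describes, reusing BlockAllocatorSketch from above (assumed structure; a mutex is still needed because an element allocated on one thread may be freed from another):

#include <mutex>

struct PerThreadAllocator {
	std::mutex mutex; // Per-thread lock: no contention with other threads' allocations.
	BlockAllocatorSketch blocks{ 32 };

	void *alloc() {
		std::lock_guard<std::mutex> lock(mutex);
		return blocks.alloc();
	}

	void free(void *p_mem) {
		std::lock_guard<std::mutex> lock(mutex);
		blocks.free(p_mem);
	}
};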

The fix can be tested with this code:

func _some_job(n):
	var dict : Dictionary
	var start :int = Time.get_ticks_usec()
	for i in 1000000:
		dict[i] = i
	var end :int = Time.get_ticks_usec()
	for res in dict:
		pass
	print("Time insert:", end - start)

func _ready() -> void:
	var start :int = Time.get_ticks_usec()
	var task_id := WorkerThreadPool.add_group_task(_some_job, 8, -1, true)
	
	WorkerThreadPool.wait_for_group_task_completion(task_id)
	var end :int = Time.get_ticks_usec()
	print("Full Time insert:", end - start)

Results
Master:

Time insert:509230
Time insert:592604
Time insert:559632
Time insert:577218
Time insert:600717
Time insert:661088
Time insert:591868
Time insert:634930
Full Time insert:1361299

Before fixing:

Time insert:2104979
Time insert:2193414
Time insert:2261067
Time insert:2261572
Time insert:2256125
Time insert:2301050
Time insert:2307332
Time insert:2352035
Full Time insert:5498995

Now:

Time insert:335565
Time insert:338475
Time insert:374039
Time insert:483497
Time insert:484436
Time insert:485592
Time insert:493170
Time insert:490287
Full Time insert:790270

Edit: Actually, the multithreading bottleneck is the atomic variable in the Memory class, not malloc itself:

alloc_count.increment();

@Nazarwadim force-pushed the implement_static_block_allocator branch 4 times, most recently from 1ac12d4 to 800bd55 on September 19, 2024 16:26
@a527919 commented Sep 30, 2024

nice

@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 60bd85b to c9bb60a on October 2, 2024 14:58
@Nazarwadim force-pushed the implement_static_block_allocator branch 2 times, most recently from 85aab54 to 2ffd3ee on November 22, 2024 16:41