Allocation performance with limited variable scope #52
Comments
(just a personal third-party opinion here) It's unlikely you can have a generic recommendation in such a case, for a simple reason: there is no single answer that fits all, or even the majority of, use cases. In the quoted code a constructor will surely be invoked. But in the general case you cannot tell whether the |
Well, I'm bringing it up since it's in the runtime performance section, after all. If there's a significant difference in performance when reinitializing large objects over and over, it might be worth at least pointing out. With small stack-allocated stuff I trust the compiler to do things nicely, but I'm less sure about big heap-allocated stuff. At least in a simple case like this, allocating the memory outside the loop and then running the loop on that instance isn't terribly confusing; it's something like one extra line of code with some pretty large potential gains. Not being able to tell in the general case, under what conditions could it re-initialise? Maybe something like declaring the object static in its own function could help in some cases, although I suspect that would raise thread safety issues. Then again, it's not like running multiple threads with memory allocated elsewhere would make things simple either... |
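(A minimal sketch of the function-local static idea mentioned above; the names are hypothetical. Since C++11 the initialization of the static itself is thread-safe, but concurrent use of the shared buffer is not, which is the thread-safety concern raised here.)

#include <cstddef>
#include <vector>

struct Image_buffer {
    std::vector<unsigned char> pixels;
    Image_buffer(std::size_t h, std::size_t w) : pixels(h * w) {}
};

// Hypothetical helper: the buffer is allocated once, on the first call,
// and the same storage is handed back on every later call.
Image_buffer& scratch_buffer(std::size_t h, std::size_t w)
{
    static Image_buffer buf(h, w); // constructed exactly once (thread-safe since C++11)
    return buf;                    // two threads writing into it will still race, though
}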
Just provide your improvement for the code from the initial message. |
Well, supposing MyObject allocates a bunch of memory dynamically, something like

MyObject obj();
for(int i = 0; i < 15; ++i) {
obj.some_func(i); //depends on what happens here, really
// maybe some_func fills some sort of array with i
//Maybe do some other stuff with obj
}

I can totally see your point in how it's very case-specific. |
(Side note: you have invoked "the most vexing parse" in your example.) If you are performing multiple operations on the same object over and over, then not making a bunch of copies is clearly better. The case here, though, is one where you are performing different operations on different objects. The cases we should be comparing are:
vs
If "case 1" is more efficient than "case 2" then there is arguably a bug in your library. I'm not aware of systems where reassignment is better than construction. |
Besides that, you also force the compiler to perform the |
I have closed this issue and linked back to the doc. Discussion can continue here, even in the closed issue. |
Oops! That must be why the compiler gets angry so often when I instantiate objects! I should stop doing that. As for the subject matter, consider that I have a loop reading pictures from a camera into some image container. If I'm not totally off, running the constructor would first need to allocate memory and then copy the data into that memory, while assignment would just need to copy the data, without any allocation. I imagine it could keep the data allocation in place in a simple example like this, but I have doubts that would be the case if the data is allocated in a function called by that loop, possibly located in a separate compilation unit. Maybe I'm just swaying too far away from the original practice I'm commenting on here, but I think it's interesting to think about where to draw the line. The actual code I had in the profiling I was talking about was something along the lines of this (obviously hugely simplified):

int main (int argc, char ** argv)
{
while(true) {
Image_container image = get_img(); //get image from camera
Area_vector text_areas = detect_text(image); //Find areas in image with text in them
draw_rectangles(image, text_areas); //make pretty boxes around text
write_image_to_screen(image);
}
return 0;
}
// In separate text_detection.cpp
Area_vector detect_text(const Image_container& input)
{
Image_container filter1_image(height, width), filter2_image(height, width);
// ok, the comma operator must have saved me from the most vexing parse here!
filter1(filter1_image, input); //run filter1 on input and output to filter1_image
// filter1 is O(n), but straightforward, no new object instances
filter2(filter2_image, filter1_image, input); //use filter1_image and input to generate
// filter2_image. filter2 is itself pretty large and internally runs several filter
// operations, these creating in total 8 local instances per call to filter2
Area_vector result = find_letters(input, filter1_image, filter2_image);
return result;
}

Essentially 11 instances get created each iteration, each about 100-1000 kB big. Also, they don't need to be initialized, since they are always written to before they are read, although the compiler should be able to figure that out easily too, I suppose. It seemed like the program would let go of the memory each time the instances exited their scope, though, with a lot of time spent allocating memory: essentially allocating and deallocating several MB of memory 20-30 times per second.

I implemented the Image_container myself, and maybe I just did it poorly, and that's the reason why it does this. I'll post it on pastebin (.hpp and .cpp) in case it's of interest for the discussion. It was written for Visual C++ 4 years ago, and I wasn't that aware C++11 was much of a thing back then, so it's not necessarily going to be a pinnacle of wise design decisions. :)

To me it seems likely that I could gain a lot of performance just by instantiating all my 11 instances once outside the main loop and using them over and over again by reference (see the sketch below). This would probably eat into readability a lot, but by collecting them into a struct it shouldn't get too bad. Now I'm curious; I probably need to get out my profiler, play around again, and get some hard data!

So this got long and pretty specific, and the algorithm itself could probably use a lot of polishing. Sorry if it feels too off-topic, but the point is to show what I meant in a more 'full scale' example. Maybe I'm too far into some sort of optimization swamp here, but sometimes performance really matters, and I can totally imagine similar patterns emerging in all sorts of high-performance computing: games, video, audio, and image-processing programs that deal with large amounts of data. I'm typically a lot more concerned with memory access speeds than with memory usage for this sort of thing. |
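(A hedged sketch of the "instantiate once and reuse by reference" idea described above; all names here are hypothetical stand-ins for the pipeline in the comment, not the actual project code.)

#include <cstddef>
#include <vector>

// Hypothetical working buffers, collected into one struct so they can be
// allocated once up front and reused for every frame.
struct Working_buffers {
    std::vector<unsigned char> filter1_image;
    std::vector<unsigned char> filter2_image;
    Working_buffers(std::size_t h, std::size_t w)
        : filter1_image(h * w), filter2_image(h * w) {}
};

// The processing function borrows the buffers instead of constructing its
// own temporaries on every call.
void detect_text(const std::vector<unsigned char>& frame, Working_buffers& buf)
{
    // ... run the filters, writing into buf.filter1_image / buf.filter2_image ...
    (void)frame; (void)buf;
}

int main()
{
    const std::size_t height = 480, width = 640;        // assumed frame size
    Working_buffers buffers(height, width);              // single up-front allocation
    std::vector<unsigned char> frame(height * width);    // stand-in for a camera frame
    for (int i = 0; i < 30; ++i)                          // stand-in for the video loop
        detect_text(frame, buffers);                      // same memory reused each frame
}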
RVO (Return Value Optimization) should make any copying of objects or extra allocation simply "go away" on this line, which I believe is the main line of interest:
This is an optimization that virtually every compiler has implemented since, well, forever. However, since you are doing things like manual memory management with copy constructors and custom destructors, you are possibly interfering with RVO and, in general, with the compiler's ability to optimize some cases. You'd almost certainly get a much larger gain by following the "Rule of 0", so that move operations are enabled, than you'd ever get from 'caching' the objects outside of the loop (which, I strongly believe, should not be a gain at all, but probably also not a loss; I'd expect a 0 gain). See here https://github.com/lefticus/cppbestpractices/blob/master/08-Considering_Performance.md#enable-move-operations and here https://blog.feabhas.com/2015/01/the-rule-of-zero/ |
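(A minimal sketch of what the Rule of 0 looks like for an image container; this is a hypothetical example, not the poster's actual class. With the pixel data held in a std::vector and no user-declared destructor or copy/assignment operators, the compiler generates correct copy and move operations, so returning the container from a function can be elided or moved instead of reallocated and copied.)

#include <cstddef>
#include <vector>

// Rule of 0: no user-declared destructor, copy, or move operations.
// std::vector owns the pixel memory, so the compiler-generated special
// member functions (including moves) already do the right thing.
class Image_container {
public:
    Image_container(std::size_t height, std::size_t width)
        : height_(height), width_(width), pixels_(height * width) {}

    unsigned char& at(std::size_t row, std::size_t col) {
        return pixels_[row * width_ + col];
    }

private:
    std::size_t height_;
    std::size_t width_;
    std::vector<unsigned char> pixels_;
};

// Returning by value is cheap: RVO usually elides the copy entirely, and
// otherwise the vector is moved rather than reallocated.
Image_container get_img()
{
    Image_container img(480, 640); // assumed frame size
    // ... fill img from the camera ...
    return img;
}

int main()
{
    Image_container img = get_img(); // no extra allocation or deep copy here
    img.at(0, 0) = 255;
}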
Hm, I'll have to try that out! Although it will be a while. Will report back if I get any interesting numbers. |
So after some quick testing, there are mixed results. On the one hand, in the original code it didn't seem that allocations were a big problem after all. On the other hand, changing the container to use std::vector (I don't know the sizes at compile time, so I can't use std::array) to follow the Rule of Zero turned out to slow down the code massively (6.7 s run time compared to my earlier 4.3 s for my test set of data). This was caused by a lot of time spent in memset, since the containers are being initialized to 0. This is mostly due to how the code is written, though (essentially refactored C code), and could probably be fixed. But I believe "caching" an object would be more efficient in this specific scenario, where vectors end up being 0-initialized; if that can be avoided, less so. Essentially:

// bad idea
for(size_t n=0; n < t_areas.size(); ++n) {
image_container filtered_image(height, width);
run_filter(filtered_image, input_image, t_areas[n]);
find_stuff(filtered_image, t_areas[n]);
}
// better, but deals with symptoms, not the cause
image_container filtered_image(height, width);
for(size_t n=0; n < t_areas.size(); ++n) {
run_filter(filtered_image, input_image, t_areas[n]);
find_stuff(filtered_image, t_areas[n]);
}

Now, I strongly suspect the compiler could get rid of the 0-initialization if all the values were actually written to in run_filter, but since it only runs over a subarea this is not done, and so the initialization happens. Even better would be if the object could be initialized from the filter output, but that would require pretty large rewrites to work nicely with vectors. Now, this is essentially stupid anyway, since it would be much better to just use an output container the same size as the output in question (instead of mapping it directly from the input data), so that all values get written to, we use less memory, and we get better cache performance:

// best
for(size_t n=0; n < t_areas.size(); ++n) {
image_container filtered_image(t_areas[n].get_height(), t_areas[n].get_width());
run_filter(filtered_image, input_image, t_areas[n]);
// however, understand that run_filter gets somewhat
// more complex in this scenario
find_stuff(filtered_image);
}

Takeaway: Be careful that you don't start unnecessarily initializing large objects inside loops; they will need to be reinitialized in every iteration. If this happens, take a look at why they get initialized! Can you use better constructors? Can you find other ways to let the optimizer eat them up? |
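(A small hedged sketch of the memset cost described above, with hypothetical sizes: a std::vector constructed with an element count value-initializes every element, which for large buffers typically shows up in the profiler as memset. Sizing the buffer to exactly the region that will be written, or reusing one buffer, avoids most of that wasted work.)

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t height = 1024, width = 1024;

    // Zero-initializes all height*width bytes (this is the memset the profiler
    // shows), even if a filter later overwrites only a small subarea of it.
    std::vector<unsigned char> full_frame(height * width);

    // Sizing the buffer to the area actually produced means every byte gets
    // written anyway, so the up-front zeroing is no longer wasted work, and
    // the smaller buffer is friendlier to the cache.
    const std::size_t area_h = 64, area_w = 256; // hypothetical text area
    std::vector<unsigned char> area_buffer(area_h * area_w);

    (void)full_frame; (void)area_buffer;
}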
So this is mostly from a discussion perspective. I'm not experienced when it comes to C++, but I'm thinking about this discussion of limited scope.
I'm mostly concerned that we might spend too much time initializing objects. Assume I have a large object (say an image) that I want to run an operation on to get some value, and I further run several operations per picture (different sorts of filters, etc.), with each of these operations needing its own copy of the image that it operates on. Further, all of this happens several times a second, since the data comes from a video stream.
I would have some function treat_frame() working with each frame, and have that function call functions for each operation. Wouldn't limiting the scope of the working objects (having them local to each function) cause constructors and destructors for these quite large objects to run several times every second? From my perspective it seems wiser to allocate them once outside the scope of the loop, and then reuse the same memory over and over.
Not sure how far this goes though. I did something like the above in a university project, and in profiling it turned out a lot of time was spent in constructors and allocators. Not sure how it would be with something like the example code. Like, would the constructor get invoked every iteration in the below code?
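(The code block that followed did not survive the copy here; judging from the replies, it was presumably something along the lines of this hypothetical snippet, with the object declared inside the loop as the limited-scope guideline suggests.)

for (int i = 0; i < 15; ++i) {
    MyObject obj;      // constructed (and destroyed) on every iteration
    obj.some_func(i);  // e.g. fill some array with i, do other work with obj
}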