Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocation performance with limited variable scope #52

Closed
rubicus opened this issue Sep 27, 2016 · 11 comments
Closed

Allocation performance with limited variable scope #52

rubicus opened this issue Sep 27, 2016 · 11 comments

Comments

@rubicus
Copy link

rubicus commented Sep 27, 2016

So this is mostly from a discussion perspective. I'm not experienced when it comes to C++, but I'm thinking about this discussion of limited scope.

I'm mostly concerned that we might spend too much time initializing objects. Assume I have a large object (say an image) that I want to run an operation on to get some value, and I further run several operations per picture (running different sorts of filters, etc. etc.) each of these operations needing their own copy of the image that they run their operations on. Further, all of this happens several times a second, since it comes from a video stream.

I would have some function treat_frame() working with each frame, and have that function call functions for each operation. Wouldn't limiting the scope of the working objects (having them local to each function) cause constructors and destructors for these quite large objects several times every second? From my perspective it seems wiser to allocate them once outside the scope of the loop, and then have them use the same memory over and over.

Not sure how far this goes though. I did something like the above in a university project, and in profiling it turned out a lot of time was spent in constructors and allocators. Not sure how it would be with something like the example code. Like, would the constructor get invoked every iteration in the below code?

// Good Idea
for (int i = 0; i < 15; ++i)
{
  MyObject obj(i);
  // do something with obj
}
@zerkms
Copy link

zerkms commented Sep 27, 2016

(just a personal 3rd party opinion here)

It's unlikely you may have a generic recommendation in such a case for a simple reason - there is no one that fits all or at least majority of use cases.

In the quoted code it surely will be a constructor invoked.

But in general case you cannot tell whether the ByObject type can re-initialise its state. Not to say that such optimisation would make code more confusing (as almost every other optimisation) and must be done after careful research & profiling, not by default in every case.

@rubicus
Copy link
Author

rubicus commented Sep 27, 2016

Well, I'm thinking since it's in the runtime performance section after all. If there's a significant difference in performance when reinitializing large objects over and over it might be worth at least pointing out. With small stack allocated stuff I trust the compiler to do things nicely, but I'm less sure about big heap allocated stuff. At least in a simple case like this allocating the memory outside the loop and then running the loop on that instance isn't terribly confusing after all, it's something like one extra line of code with some pretty large potential gains.

Not being able to tell in the generel case, under what conditions could it re-initialise?

Maybe something like declaring the object static in its own function could help in some cases, although I suspect that would raise thread safety issues. Then again, it's not like running multiple threads with memory allocated elsewhere would make it simple either...

@zerkms
Copy link

zerkms commented Sep 27, 2016

Not being able to tell in the generel case, under what conditions could it re-initialise?

Just provide your improvement for the code from the initial message.

@rubicus
Copy link
Author

rubicus commented Sep 27, 2016

Well, supposing MyObject allocates a bunch of memory dynamically, something like

MyObject obj();
for(int i = 0; i < 15; ++i) {
    obj.some_func(i); //depends on what happens here, really
    // maybe some_func fills some sort of array with i
    //Maybe do some other stuff with obj
}

I can totally see your point in how it's very case specific.

@lefticus
Copy link
Member

(Side note, you have invoked "the most vexing parse" in your example) If you are performing multiple operations on the same object over and over, then not making a bunch of copies is clearly better. This is the case where you are performing different operations on different objects.

The cases we should be comparing are:

//case 1
MyObject obj;
for(int i = 0; i < 15; ++i) {
    obj = get_object(i);
}

vs

//case 2
for(int i = 0; i < 15; ++i) {
    MyObject obj = get_object(i);
}

If "case 1" is more efficient than "case 2" then there is arguably a bug in your library. I'm not aware of systems where reassignment is better than construction.

@lefticus
Copy link
Member

Besides that you also force the compiler to perform the operator= call when in the case of construction the compiler may be able to eliminate the object entirely at runtime, depending on various things.

@lefticus
Copy link
Member

I have closed this issue and linked back to the doc. Discussion can continue here, even in the closed issue.

@rubicus
Copy link
Author

rubicus commented Sep 28, 2016

Oops! That's must be why the compiler gets angry so often when I instantiate objects! I should stop doing that.

As for the subject matter, consider I have loop reading pictures from a camera into some image container. If I'm not totally off, running the constructor it would first need to allocate memory, and then copy the data to that memory. Assignment would just need to copy the data, without any allocation. I imagine it could keep the data allocation in place in a simple example like this, but have doubts that'd be the case if data is allocated in a fucntion called by that loop, possibly located in a separate compilation unit. Maybe I'm just swaying away too far from the original practice I'm commenting on here. But I think it's interesting to think about where to draw the line.

The actual code I had in the profiling I was talking about was something along the lines of this (obviously huugely simplified):

int main (int argc, char ** argv) 
{
    while(true) {
        Image_container image = get_img(); //get image from camera
        Area_vector text_areas = detect_text(image); //Find areas in image with text in them
        draw_rectangles(image, text_areas); //make pretty boxes around text
        write_image_to_screen(image);
    }
    return 0;
}

// In separate text_detection.cpp
Area_vector detect_text(const Image_container& input) 
{
    Image_container filter1_image(height, width), filter2_image(height, width); 
    // ok, the comma operator must have saved me from the most vexing parse here!

    filter1(filter1_image, input); //run filter1 on input and output to filter1_image
    // filter1 is O(n), but straightforward, no new object instances
    filter2(filter2_image, filter1_image, input); //use filter1_image and input to generate 
    // filter2_image. filter2 is itself pretty large and internally runs several filter 
    // operations, these creating in total 8 local instances per call to filter2

    Area_vector result = find_letters(input, filter1_image, filter2_image);
}

Essentially 11 instances get created each iteration, each about 100-1000 kB big. Also, they don't need to be initialized since they are always written to before they are read. Although the compiler should be able to figure out that easily too I suppose. Seemed like it would let go of memory each time it exited scope their scope though, with a lot of time spent allocating memory. Essentially allocating and deallocating several MB of memory 20-30 times/second.

I implemented the Imagecontainer myself, and maybe I just did it poorly, and that's the reason why it does this. I'll post it on pastebin (.hpp and .cpp) in case it's of interest for the discussion. Written for visual c++ 4 years ago, and I wasn't that avare C++11 was much of a thing back then. It's not necessarily going to be a pinnacle of wise design decisions. :)

To me it seems likely that I could gain a lot of performance just instantiating all my 11 instances once outside the main loop and just use them over and over again by reference. Now this would probably eat into readability a lot, but collecting them into a struct, it shouldn't get too bad. Now I'm curious. Probably need to get out my profiler and play around again, and get some hard data!

So this got long and pretty specific, and the algorithm itself could probably need a lot of polishing itself. Sorry if it feels too off-topic, but the point is to show what I meant in a more 'full scale' example. Maybe I'm too far into some sort of optimization swamp here, but sometimes performance really matters. And I can totally imagine similar patterns emerging in all sorts of high performance computing like games, video, audio and image treating programs etc. that deal with large amounts of data. I'm typically a lot more concerned with memory access speeds than I am with memory usage for this sort of thing.

@lefticus
Copy link
Member

RVO (Return Value Optimization) should make any copying of objects or extra allocation simply "go away" on this line, which I believe is the main line of interest:

Image_container image = get_img(); //get image from camera

This is an optimization that virtually every compiler has implemented since, forever.

However, since you are doing things like manual memory management with copy constructors and custom destructors you are possibly interfering with RVO and in general, the compiler's ability to optimize some cases.

You'd almost certainly get a much larger gain by following the "Rule of 0" so that move operations are enabled than you'd ever get from 'caching' the objects outside of the loop (which, I strongly believe should not be a gain at all, but probably also not a loss, I'd expect a 0 game). See here https://github.com/lefticus/cppbestpractices/blob/master/08-Considering_Performance.md#enable-move-operations and here https://blog.feabhas.com/2015/01/the-rule-of-zero/

@rubicus
Copy link
Author

rubicus commented Sep 29, 2016

Hm, I'll have to try that out! Although it will be a while. Will report back if I get any interesting numbers.

@rubicus
Copy link
Author

rubicus commented Oct 1, 2016

So after some quick testing, there are mixed results. On the one hand, in the original code it didn't seem that allocations were a big problem after all. On the other hand, changing the container to use std::vector (don't know sizes at compilation time, so can't use std::array) to follow the rule of zero, turned out to slow down the code massively (6.7s run time compared to my earlier 4.3s for my test set of data). This was caused by a lot of time spent in memset, since containers are being initialized to 0.

This is mostly due to how the code is written though (essentially refactored C code) and could probably be fixed. But I believe "caching" an object would be more efficient in this specific scenario where vectors end up being 0 initialized. If that can be avoided, less so. Essentially:

// bad idea
for(size_t n=0; n < t_areas.size(); ++n) {
    image_container filtered_image(height, width);
    run_filter(binary_image, input_image, t_areas[n]);
    find_stuff(binary_image, t_areas[n]);
}

// better, but deals with symptoms, not the cause
image_container filtered_image(height, width);
for(size_t n=0; n < t_areas.size(); ++n) {
    run_filter(binary_image, input_image, t_areas[n]);
    find_stuff(binary_image, t_areas[n]);
}

Now, I strongly suspect the compiler could get rid of the 0 initialization if all the values were actually written to in run_filter, but since it only runs over a subarea this is not done, and so initialization happens. Even better would be if the object could be initialized from the filter output, but that would require pretty large rewrites to work nicely with vectors.

Now this is essentially stupid, since it would be much better to just use an output container the same size as the output in question (instead of mapping directly from input data), so that all values get written to, we use less memory and get better cache performance.

// best
for(size_t n=0; n < t_areas.size(); ++n) {
    image_container filtered_image(t_areas[n].get_height(), t_areas[n].get_width());
    run_filter(binary_image, input_image, t_areas[n]);
    // however, understand that run_filter gets somewhat 
    // more complex in this scenario
    find_stuff(binary_image);
}

Takeaway

Be careful that you don't start unnecessarily initializing large objects inside loops. They will need to be reinitialized in every iteration. If this happens, take a look on why it gets initialized! Can you use better constructors? Can you find other ways for letting the optimizer to eat them up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants