Performance Tuning
Bringing the C++ EnergyPlus into the range of ~5x faster than the Fortran EnergyPlus should be achievable by continuing the tuning work performed by Objexx in 2014 and the performance work of Geof Sawaya and others. An overview of performance tuning methods is presented here with the hope that developers will use and add to this.
Performance improvement can be slow and difficult work and there is no cookbook for it: effective results depend on understanding where the time is being spent and finding the right changes to make it faster. Tuning the code without breaking it is also a challenge: expanding the EnergyPlus testing "safety net" should be a priority in addition to performance work. There can be many causes of slowness and each has its own solution, but some of the typical tuning changes are outlined here.
Auto-Parallelization and Auto-Vectorization
Performance tuning should normally be based on profiling the code for the cases of interest. Guesses about where the code's performance-critical "hot spots" are will rarely be accurate. Tuning parts of the code that are not hot spots is at best wasted effort and at worst may introduce bugs and/or unnecessary code complexity. (In some cases code refactored for speed can also become more clear and elegant, but that is not our focus here.)
Profilers vary in the information provided, their run-time efficiency, and the accuracy of their results. Using multiple profilers on multiple platforms is best. Here are some notes and recommendations on profilers:
- gprof doesn't report system (i/o, string, etc.) calls, which are currently heavily used in EnergyPlus.
- prof (Linux) provides system call information and good drill-down capability but the source-level annotation does not appear to be accurate/helpful in most cases.
- VTune (Intel) provides a lot of information and a GUI but its source-level annotation is also not helpful.
"Keyhole" tuning includes localized changes that don't alter the program design in a broader way. Some of these found useful with EnergyPlus are outlined below.
EnergyPlus can spend a lot of time in string processing for some inputs. A lot of string overhead was reduced in the 2014 tuning work but more can be done.
- Avoid case-insensitive operations where not needed.
- Reduce passing string literals to std::string arguments (which requires string construction) in hot spots by making static std::strings from those literals. Alternatively, provide function overloads that have C style (char *) string arguments (and operate on them without creating std::strings from them).
- Use efficient std::string and ObjexxFCL string.functions functions where appropriate.
- C++ stream i/o is slow and the Fortran FORMAT system emulation provided by ObjexxFCL on top of it is, by necessity, even slower. I/o hot spots should be migrated to lower level C-style (sprintf, etc.) calls, taking care to avoid i/o on files already opened by the stream system.
- Some fast C++ stream-like implementations can be considered but are probably not necessary.
- Use ObjexxFCL gio::Fmt format objects instead of std::string for reused formats (avoid reparsing overhead). They should normally be static to avoid the cost of generating the format data structure on each call. They should not be const, to allow the faster i/o API that reuses them.
- InputProcessor is complex and slow in C++ due to the i/o and string processing.
- If replacement by an XML system is pending it is not worth expending effort on it, but the XML i/o system should be written with performance in mind and leveraging fast XML libraries.
- Case-sensitive input files are suggested but with a warning/fixup layer so that internal input processing code can use faster case-sensitive operations.
- Inline small, hot spot functions by moving their implementations into header files where the declaration was and adding the inline keyword before the return type.
- When a function is too large/complex to get compilers to inline it you may be able to extract a small core inlineable function that handles most calls and have it call the (non-inline) larger, rarely used code for error or special case handling.
- Make function-local objects that are expensive to construct (arrays, strings, anything that does heap allocation, ...) static when they are const or are re-initialized on each call anyway.
- Change proxy arguments to pass by reference (or value if small) if possible:
- FArrayNA and FArrayNS proxies aren't needed if all callers pass the expected rank/dimension FArrayND.
- Eliminate Optional argument proxies in favor of overloaded functions with and without the arguments or C++ style default values that can flag "not present".
- Don't pass string literals to std::string arguments: they force string construction on each call.
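The string literal advice above can be sketched as follows. This is only an illustration: same_key, same_key_c and the count_zone_* functions are hypothetical names invented for the example, not EnergyPlus or ObjexxFCL functions. In real code the C-style variant could simply be an overload of same_key; it has a distinct name here so that all three call patterns can sit side by side.

```cpp
#include <string>
#include <vector>

// Hypothetical hot-spot function that only accepts std::string arguments.
inline bool same_key( std::string const & a, std::string const & b )
{
	return a == b;
}

// Hypothetical C-style variant: compares against the char array in place,
// so no std::string is constructed for the second argument.
inline bool same_key_c( std::string const & a, char const * b )
{
	return a.compare( b ) == 0;
}

// Slow pattern: the literal is converted to a temporary std::string on every call.
inline int count_zone_literal( std::vector< std::string > const & keys )
{
	int n = 0;
	for ( std::string const & key : keys ) {
		if ( same_key( key, "ZONE" ) ) ++n;
	}
	return n;
}

// Static std::string: constructed once, on the first call, and reused afterwards.
inline int count_zone_static( std::vector< std::string > const & keys )
{
	static std::string const zone( "ZONE" );
	int n = 0;
	for ( std::string const & key : keys ) {
		if ( same_key( key, zone ) ) ++n;
	}
	return n;
}

// C-style variant: the literal is passed as a char const * with no construction at all.
inline int count_zone_cstring( std::vector< std::string > const & keys )
{
	int n = 0;
	for ( std::string const & key : keys ) {
		if ( same_key_c( key, "ZONE" ) ) ++n;
	}
	return n;
}
```

All three functions return the same count; only the number of std::string constructions differs. Whether the difference matters in a given routine is something a profiler should confirm.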
Heap allocations are very expensive. FArrays are not designed as grow/shrink-friendly data structures, which means that they do a heap allocation every time you resize them. In the long run, containers that need to grow and shrink a lot are probably better moved to suitable containers such as std::vector or std::set (depending on the usage expected). Until then, the Fortranic method used to resize arrays in EnergyPlus can be done twice as efficiently by using FArray redimension operations. This means changing code that might be of this form:
TempArray.allocate( nArray + 1 ); // Allocation
TempArray( {1,nArray} ) = Array; // Array copy (and slice creation)
Array.deallocate();
++nArray;
Array.allocate( nArray ); // Allocation
Array = TempArray; // Array copy
TempArray.deallocate();
to
Array.redimension( ++nArray ); // Allocation + array copy
The redimension method uses C++ swap operations internally to avoid the second allocation and copy. Reducing the lines of code is also beneficial for code quality and clarity.
- The redimension call can also take a second argument that is a value to fill in any new elements created by the redimensioning.
- If you don't need to preserve existing values use the (faster) dimension operation.
- The EnergyPlus functions that encapsulated the Fortran style array resizing were changed to 1-line inline functions that call redimension to keep the interface that developers are used to but for new code there is no reason not to call redimension directly.
- An FArray variant that is grow/shrink friendly (at the cost of some extra space) like std::vector may be added.
This change to existing code may be done by developers over the next few months so focus on using redimension in new code.
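To make the "one allocation plus a swap" idea concrete, here is a minimal sketch of a redimension-style operation. ToyArray is a hypothetical class written for this page; it is not the ObjexxFCL FArray implementation, it is 0-based rather than 1-based, and it ignores everything (ranks, proxies, index ranges) that a real FArray handles.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>

// Hypothetical minimal 1-D array of doubles, for illustration only.
class ToyArray
{
public:

	explicit ToyArray( std::size_t n = 0 ) :
	 size_( n ),
	 data_( n > 0 ? new double[ n ]() : nullptr ) // zero-initialized buffer
	{}

	std::size_t size() const { return size_; }
	double & operator []( std::size_t i ) { return data_[ i ]; }
	double const & operator []( std::size_t i ) const { return data_[ i ]; }

	// Resize while preserving existing values; new elements get the fill value.
	// Cost: one allocation and one copy, then a constant-time buffer swap.
	void redimension( std::size_t n, double fill = 0.0 )
	{
		std::unique_ptr< double[] > fresh( n > 0 ? new double[ n ] : nullptr );
		std::size_t const keep = std::min( size_, n );
		std::copy( data_.get(), data_.get() + keep, fresh.get() ); // preserve old values
		std::fill( fresh.get() + keep, fresh.get() + n, fill ); // initialize new elements
		data_.swap( fresh ); // old buffer is released when fresh goes out of scope
		size_ = n;
	}

private:

	std::size_t size_;
	std::unique_ptr< double[] > data_;
};
```

Compared with the temporary-array pattern shown earlier, there is no second allocation and no second copy. For containers that really are grow/shrink heavy, std::vector::resize( n, fill ) offers the same preserve-and-fill behavior with amortized growth.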
- FArray "linear indexing" can speed up hot spots: see the ObjexxFCL FArray documentation for details and/or ask for help the first time you want to try this.
- Migrate Fortran-isms to array method calls. For example, ubound( array, 2 ) should become array.u2() which is faster.
- Avoid heap allocations whenever possible: they are very slow! Array functions that generate arrays will do heap allocation so avoid them where possible.
- Pass std::string (and other non-builtin types) by reference unless you are sure passing by value makes sense.
- Use local loop accumulators/variables in hot spots: pass-by-reference arguments require an extra dereference step on each access.
- Hoist redundant expensive computations out of performance-critical loops.
- Unrolling hot spot loops with small bodies may benefit performance (compilers can do some unrolling for you).
- Use efficient utility functions:
- ObjexxFCL pow_N and root_N in place of std::pow calls with integer exponents.
- Conditionals:
- A switch statement is often faster than an if block with many conditions.
- Avoid non-cheap expressions in for loop stop criteria: unlike Fortran the stop criteria are evaluated on every pass through the loop. For example, replace:
for ( int i = 1; i <= expression; ++i )
with
for ( int i = 1, i_end = expression; i <= i_end; ++i )
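Several of the items above (local accumulators, hoisting, cheap stop criteria, avoiding std::pow for integer exponents) can be combined in one small before/after sketch. The function names and the square helper are hypothetical and written for this page; square only stands in for an integer-power utility and is not the ObjexxFCL function. Note that v.size() is itself cheap: it is used here as a placeholder for a genuinely expensive stop expression.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: a plain multiply instead of a general std::pow call.
inline double square( double x )
{
	return x * x;
}

// Before: the stop expression is evaluated on every pass, std::sqrt( scale ) is
// recomputed although it never changes, and total is updated through the
// reference on every pass.
inline void sum_scaled_squares_slow( std::vector< double > const & v, double scale, double & total )
{
	total = 0.0;
	for ( std::size_t i = 0; i < v.size(); ++i ) {
		total += std::sqrt( scale ) * std::pow( v[ i ], 2 );
	}
}

// After: loop-invariant work is hoisted, the stop value is computed once, and a
// local accumulator is written back to the reference argument a single time.
inline void sum_scaled_squares_fast( std::vector< double > const & v, double scale, double & total )
{
	double const root_scale = std::sqrt( scale ); // hoisted out of the loop
	double sum = 0.0; // local accumulator
	for ( std::size_t i = 0, i_end = v.size(); i < i_end; ++i ) {
		sum += root_scale * square( v[ i ] );
	}
	total = sum; // single write through the reference
}
```

An optimizing compiler may already perform some of these transformations on its own, so the benefit should be checked with a profiler rather than assumed.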
Sometimes more invasive, cross-cutting changes can be made to data structures and/or algorithms to obtain a big performance improvement. This can arise when a simpler approach is used for prototype code or code that wasn't expected to be a hot spot turns out to be used heavily. In EnergyPlus we also have the Fortran legacy style where all data structures are arrays, which affects data and algorithm performance: often there are faster data structures and algorithms we can build in C++.
Arrays are inefficient as general purpose containers for data structures that are more naturally linked lists, sets, queues, hashes/maps, and growable/shrinkable vectors. The C++ Standard Library offers a number of good data structures and others can be built by using them or writing special purpose containers. Each container has its own big-O complexity and smaller scale performance profile. Choosing or designing the right container can depend on the relative frequency of different types of operations: add/remove, lookup, sort, etc. The performance aspects of different containers may take time to learn but often a good choice is apparent.
When reviewing a performance-critical function, there are clues to help identify data structures that may benefit from refactoring:
- Arrays of objects being looped over in hot spots to find a subset of interest: containers of shared-ownership smart pointers for the specific subset could be more efficient.
- Arrays being repeatedly copied to temporary arrays, de/reallocated, and copied back in are probably best converted to containers that can grow/shrink more efficiently. At the least, using more efficient FArray redimension calls will improve performance.
- Arrays that act as linked lists, queues, sets, maps are obvious candidates for replacement by the appropriate C++ Standard Library container.
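As a small example of an array acting as a map, consider looking up an object by name. The sketch below assumes a hypothetical Surface record and hypothetical helper functions; none of these are EnergyPlus types. The linear search is O(n) per lookup, while the std::unordered_map index is built once and then gives average constant-time lookups.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type for illustration.
struct Surface
{
	std::string name;
	double area;
};

// Array-style lookup: scans the whole container until the name matches.
// Returns the index of the first match or -1 if the name is not present.
inline int find_linear( std::vector< Surface > const & surfaces, std::string const & name )
{
	for ( std::size_t i = 0, i_end = surfaces.size(); i < i_end; ++i ) {
		if ( surfaces[ i ].name == name ) return static_cast< int >( i );
	}
	return -1;
}

// Build a name -> index map once, outside the hot spot.
// emplace keeps the first entry for a duplicated name, matching find_linear.
inline std::unordered_map< std::string, int > build_index( std::vector< Surface > const & surfaces )
{
	std::unordered_map< std::string, int > index;
	for ( std::size_t i = 0, i_end = surfaces.size(); i < i_end; ++i ) {
		index.emplace( surfaces[ i ].name, static_cast< int >( i ) );
	}
	return index;
}

// Map-style lookup: average constant time per call.
inline int find_indexed( std::unordered_map< std::string, int > const & index, std::string const & name )
{
	auto const it = index.find( name );
	return it != index.end() ? it->second : -1;
}
```

The index costs extra memory and must be rebuilt or updated when the underlying container changes, so it pays off when lookups greatly outnumber modifications.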
Algorithms with lower big-O complexity and/or higher performance may be possible for many EnergyPlus hot spots. It is not practical to list faster algorithms here but good algorithms books and other resources should be consulted.
Algorithm areas that might be relevant for EnergyPlus (please add/revise!) could include:
- Computational Geometry algorithms may be helpful in speeding up surface-related computations.
- Ray tracing and scene graph approaches to avoiding unnecessary computations by skipping ("culling") those for inactive or not visible objects may be useful.
- Spatial sorting data structures and lookup algorithms may reduce the effort of Zone X Surface nested loops by avoiding expending effort for non-contributing surface pairs.
Refactoring procedural code to an object-oriented design will not, in general, provide major performance gains on its own. But, for example, simply replacing the repeated deep if blocks that perform dispatch based on what are effectively types with a single virtual function call could be a performance benefit (as well as a code maintenance win). And a side effect of a clean OO design may be to enable more efficient operations in a more natural way. Common design patterns may be applicable to EnergyPlus computations and their use could bring speed gains.
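The dispatch point can be illustrated with a toy example. The Kind enum, the Component hierarchy and the power formulas below are all hypothetical and chosen only to keep the sketch short; they do not correspond to EnergyPlus components.

```cpp
#include <memory>
#include <vector>

// Before: a type code and an if chain that has to be repeated (and kept in
// sync) at every place that dispatches on the kind of component.
enum class Kind { Fan, Pump };

inline double power_if_chain( Kind kind, double flow )
{
	if ( kind == Kind::Fan ) {
		return 2.0 * flow;
	} else if ( kind == Kind::Pump ) {
		return 3.0 * flow;
	}
	return 0.0;
}

// After: each type carries its own behavior and the call site makes a single
// virtual function call.
struct Component
{
	virtual ~Component() = default;
	virtual double power( double flow ) const = 0;
};

struct Fan : Component
{
	double power( double flow ) const override { return 2.0 * flow; }
};

struct Pump : Component
{
	double power( double flow ) const override { return 3.0 * flow; }
};

// The loop no longer needs to know which component types exist.
inline double total_power( std::vector< std::unique_ptr< Component > > const & parts, double flow )
{
	double sum = 0.0;
	for ( auto const & part : parts ) {
		sum += part->power( flow );
	}
	return sum;
}
```

With only two types the if chain is unlikely to be slower than the virtual call. The potential gain comes with long chains, string-based type tests, or dispatch that is repeated in many places, and it should be measured rather than assumed.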
- The Release configuration enables auto-vectorization by default.
- Auto-parallelization is enabled with /Qpar
- To enable diagnostic reporting for both of these, use /Qpar-report:2 and /Qvec-report:2 to output the status for every loop at compile time.
- Since this is quite verbose and can slow down the build significantly, it's best to add these options per file rather than for the whole project.
- Right-click a source file in the solution explorer, and go to Properties --> C/C++ --> All Options and paste one or both of these options into the "Additional Options" field. This is only useful in the Release configuration, of course.
- This will produce messages like this:
\src\EnergyPlus\HeatBalanceIntRadExchange.cc(2008) : info C5001: loop vectorized
\src\EnergyPlus\HeatBalanceIntRadExchange.cc(2008) : info C5012: loop not parallelized due to reason '1000'
\src\EnergyPlus\HeatBalanceIntRadExchange.cc(1893) : info C5002: loop not vectorized due to reason '1203'
- Reason codes may be found here.
- General info on MS C++ Auto-Parallelization and Auto-Vectorization.
- HINT: While experimenting to satisfy the auto-vectorizer, it's much faster to just compile the single source file that you're working on. No need to build the entire project. This may be obvious to some, but wasn't to me at first . . .
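As a rough guide to what the reports distinguish, the two hypothetical functions below show a loop shape that is a typical vectorization candidate and one that is not. This is a sketch under general assumptions about auto-vectorizers, not a statement about what any particular compiler version will do: the diagnostic output for the actual source file is the authority.

```cpp
#include <cstddef>
#include <vector>

// Typical candidate: unit-stride access to contiguous data, no function calls
// in the body, and no dependence between iterations.
inline void axpy( double a, std::vector< double > const & x, std::vector< double > & y )
{
	std::size_t const n = x.size() < y.size() ? x.size() : y.size();
	double const * const px = x.data();
	double * const py = y.data();
	for ( std::size_t i = 0; i < n; ++i ) {
		py[ i ] += a * px[ i ];
	}
}

// Typical non-candidate: each iteration reads the result of the previous one
// (a loop-carried dependence), so iterations cannot simply run side by side.
inline void running_sum( std::vector< double > & v )
{
	for ( std::size_t i = 1, i_end = v.size(); i < i_end; ++i ) {
		v[ i ] += v[ i - 1 ];
	}
}
```

Compiling just the file that contains such loops with the reporting options enabled is a quick way to see how the compiler classifies each one.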
??? Need to add info for other platforms.