-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce compile time of variadic functions #243
Conversation
It appears that the defects on Clang and GCC in Release Binary Size may be worth it for the compile time, especially with the erratic release time of Clang and GCC (according to the benchmark graphs, but how many tests were performed to get those graphs?) |
@Spartan322 In general, each data point is measured 3 times, but for those outliers I did a lot of extra passes and different configurations and it always returned the same, at least on my system. The benchmark is here if anyone wants to try it. The Clang results seem reasonable to me: it probably switches to a different optimization path after certain threshold. But GCC really is erratic. |
Extra BenchmarksI think this might be interesting. Here is the data for all the formatting libraries. C++ Format does really well:
|
Thanks for such a detailed PR. This looks like a nice improvement and I think that regression in binary size for 16 arguments is not a problem because it is not a very common case. I'll probably have more comments and/or questions after I look in more details at the implementation and do some tests. |
ArgArray construction depends on this behavior. Note that this is only relevant in the C++11 code path.
Alias templates may not be supported even if variadic templates are.
@dean0x7d I looked a bit into the implementation and have a few questions:
|
|
Accidentally merged this PR into master, but "undid" it by moving to the |
No worries. Let me know if I should open a new PR for any changes. |
I've done a few tests myself using gcc 4.8.4 confirming the improvement in compile time for large number of arguments: There are no weird spikes at 20 and 22 in this version of gcc. Unfortunately the results for smaller number of arguments which I care more about are somewhat erratic: and vary noticeably between the runs with no clear winner. Overall I think it's a nice improvement and will be happy to merge this in. But I'd prefer to keep the inheritance of |
I can do an interactive rebase to get rid of the composition (and also squash that alias template fix). I'll submit a new PR. I have an older Ubuntu with GCC 4.8 on a virtual machine. I'll try to see what's up with that strange spike at 5 arguments. It's entirely possible that the benchmark is a bit too artificial (consists only of |
Sounds good.
Don't worry about it, it's just an artifact of this particular run. I've rerun the test on 5 arguments a few times and there was no spike in other runs. Thanks for the info about the |
I reworked this to keep the inheritance, but there are a couple of issues I didn't realize before I started. The modified variadic2 branch is here. First off, the tests that I added earlier are now failing, specifically:
Basically, // A formatting argument. It is a POD type to allow storage in
// internal::MemoryBuffer.
struct Arg : Value {
Type type;
}; By extension
Without these two properties, If the inheritance is mainly to avoid repeating |
Good point about POD. I think the correct term here is trivially copyable although I need to check if
That's an option, but ideally it would be nice to find a way not to rely on the "ugly" chain of conversions. I did a quick and dirty implementation in https://github.com/cppformat/cppformat/tree/variadic-test to test an alternative and got similar compile time improvement on gcc: but haven't tested clang yet. The compile code size for large (>16) number of arguments is not as good as in your implementation, but better than in the current one. |
I experimented a bit more. The results are below:
All 3 versions solve the 8-12 arg slowdown on Clang, but v2 has a negative effect for >=16 args. I had to cut off the last point (42 seconds) to keep a reasonable scale. Note that all versions a have negligible effect (< 5%) for 4 or less arguments, but it's pretty nice for 6-12. Versions 2 and 3 see a small regression (~2.5 %) for >= 16 args, but I don't think it's too bad. GCC doesn't have any trouble with v2 like Clang did. There's a small regression in binary size for v3. As far as I can tell from the assembly, this is because the inlining behavior changes for certain arg counts. Actually, v3 is close to what I initially did but dropped it because of this binary size regression. Let me know what you think. |
Thanks for another detailed analysis. I'd go with v3 because it solves the slowdown on Clang and doesn't rely on the chain of conversions. Since the binary size regression is not too big and only affects a large number of arguments (>= 9), I wouldn't worry about it. I think you are right about inlining because making There is also a question of API safety. In the current version one has to pass
while in the new version types and typename ArgArray::Type array{
ArgArray::template make<fmt::BasicFormatter<Char> >(args)...};
call(FMT_FOR_EACH(FMT_GET_ARG_NAME, __VA_ARGS__),
fmt::ArgList(fmt::internal::make_type(args...), array)); But hopefully this can be addressed separately. Anyway, if you agree with using v3, please submit a PR. |
The proposed PR would reduce the compile time of variadic templates like
fmt::print
andfmt::format
. The improvement on bloat-test is small but measurable (reduced from 35 to 33 seconds on my system), but that doesn't tell the whole story.In order to get a more detailed measurement I modified bloat-test to generate tests for any number of arguments. The resulting variadic-test can be found here. When invoked with
make variadic-test ARGS="1 25 10"
it will generatefmt::print
statements with1 <= x < 25
arguments and it will compile each with10
translation units.Results
The following figures show the results for Clang and GCC, before and after the proposed PR changes. The data is gathered from
make variadic-test ARGS="1 25 10"
. Clang 7 was tested on OS X 10.11, while GCC 5.2 was on Antergos (Arch).Compile time
Binary size
The Good
Clang benefits the most. With the current implementation there is a weird spike between 8 and 12 arguments in release mode. This spike gets ironed out with the proposed changes and there is a nice overall speed up.
GCC also sees some strange compile time spikes at 20 and 22 arguments - these show up consistently for me. The proposed changes clear that up and there is an overall speedup, although a bit hard to see in the figure because the two spikes mess with the scale.
There is a small binary size improvement for GCC with
> 16
arguments in release mode.Debug mode sees benefits across the board. Not that important, but nice.
The Bad
Both GCC and Clang see a binary size regression at exactly 16 arguments (clearly visible in the figures). This is because the proposed implementation must generate all unpacked types for
>=16
arguments, while currently 16 is a special case with part packed and part unpacked types.The Ugly
The proposed implementation replaces the
ArgArray
type fromValue[]
/Arg[]
toMakeValue[]
/MakeArg[]
. When the argument array is passed, it's implicitly converted in the chainMakeValue[]
->MakeValue*
->Value*
, andValue*
is then used asValue[]
. This works fine becauseMakeValue
andValue
are equivalent, standard layout types, but I feel bad writing it...If
Value
andMakeValue
change so they are not equivalent, the code will still compile, but it will not work correctly. I added tests to trigger in this case.Alternatives
I attempted a few alternatives for the ugly bit, but they didn't fare too well. This
could be avoided by writing this
where
is_packed_tag
would select if aValue
orArg
should be created. Unfortunately, this results in a huge increase in binary size on GCC, which refuses to inline more than 6 or 7 arguments.The array could be placed into a struct:
This restores proper inlining on GCC, but the compile time becomes much worse.
Epilogue
This started because I noticed that compile times were slower on Clang than on GCC which is not something I usually see. I may have fallen into a rabbit hole, but it's been interesting. There are tradeoffs as outlined above, so I'd understand if the proposed changes aren't completely desirable, but I hope at least some part will be useful.