Optimize floating-point formatting #1426
Related to #1559 (There is a three times faster double-to-string implementation).
Has this materialized? I use a LOT of those conversions, so this would be good news.
No, but {fmt}'s floating-point formatter is already quite fast (see e.g. the benchmark results in https://github.com/fmtlib/fmt/releases/tag/6.1.0), so if you switch from
Thanks.
I recommend using 6.1.2 or master, which includes the same optimizations as 6.1.0 plus some fixes.
Yes.
@vitaut, maybe fmt could use the new charconv in the STL? It uses Ryu, which (according to Stephan T. Lavavej, its implementer) is 2x to 3x as fast as double-conversion (which you seem to be more or less on par with).
It could be possible once charconv becomes widely available. Note that Ryu is only marginally faster than optimized Grisu (or slower than some implementations) on independent benchmarks with random input, and it has several times larger data tables. Grisu also has the nice property of being faster on short output. In real applications the difference will be negligible because of other factors.
Ok, awesome. I've started making the switch to fmtlib. Thanks a BUNCH for this library!
{fmt}'s Grisu3 is now on par with or better than Milo Yip's and the original Grisu2 implementations despite providing shortness guarantees:
@data-man, sorry for deleting your comment - trying to avoid links to spammy repos.
Another option is to integrate https://github.com/jk-jeon/dragonbox by @jk-jeon. The only problem is that it would introduce another license.
FMT currently has the permissive MIT license, which means just about everyone in the world can use it. If more licenses are piled on, this project will become largely useless to many people in the world (I'm looking at you, GPL3).
@SanderBouwhuis I don't have any good knowledge of licensing, but the BSL (Boost Software License) is an even more permissive license than MIT. Still, having to depend on a library with a different license can be a problem and might degrade the current ease of integration.
@vitaut I'm afraid dragonbox is not a perfect fit for fmt, as it does not provide an option for fixed-precision conversion. Unlike Grisu3, it does not perform any iterative search for the shortest representation; rather, it somehow finds the shortest decimal number with just one trial-and-error, so it would be more difficult (if not impossible) to modify it for fixed-precision conversion. If you want separate implementations for shortest-roundtrip and fixed-precision, I can consider rewriting the whole thing (excluding the features fmt does not want) and licensing it under MIT, but having two implementations feels like a big trade-off.
That's OK, {fmt} can continue using Grisu3+Dragon4 for this.
If dragonbox or some part of it could be distributed under the same license as {fmt} (MIT with an optional exception for binary distribution), that would be awesome.
Great 😁 Let's start discussing what's needed and what's not needed, then. Let's cross out the big things first. Could you let me know which of these you think are worth keeping?
(I didn't implement the compressed cache version for float, but it would not be difficult, so you can consider using that version as well. I expect the size-vs-speed trade-offs to be smaller, percentage-wise, than those for double.)

There is also material I wrote for possible extensions to 16-bit, 128-bit, or 80-bit floating-point numbers. The thing is, it is not possible to just look at the type and figure out what encoding format the type uses internally, because that depends very, very much on the specific platform; 128-bit and 80-bit floats especially are kind of messy. Hence, the idea was to eliminate explicit mentions to

But if you care only about "sane" platforms where

Given that lots of things in the list above will be removed, I expect it would not be so painful to port the code so that the list you wrote here can stay the same.

And about licensing: I believe it should be okay to copy-paste-and-modify my own code and distribute it under a different license, but to be sure, I'll ask a question on Software Engineering Stack Exchange. Thanks!
I did some experiments on how much fmt can be made faster by adapting dragonbox. The following (plus some namespace hacks to avoid name clashes) is the change I made to fmt for this experiment:

if (precision < 0) {
#if 0
fp fp_value;
auto boundaries = specs.binary32
? fp_value.assign_float_with_boundaries(value)
: fp_value.assign_with_boundaries(value);
fp_value = normalize(fp_value);
// Find a cached power of 10 such that multiplying value by it will bring
// the exponent in the range [min_exp, -32].
const fp cached_pow = get_cached_power(
min_exp - (fp_value.e + fp::significand_size), cached_exp10);
// Multiply value and boundaries by the cached power of 10.
fp_value = fp_value * cached_pow;
boundaries.lower = multiply(boundaries.lower, cached_pow.f);
boundaries.upper = multiply(boundaries.upper, cached_pow.f);
assert(min_exp <= fp_value.e && fp_value.e <= -32);
--boundaries.lower; // \tilde{M}^- - 1 ulp -> M^-_{\downarrow}.
++boundaries.upper; // \tilde{M}^+ + 1 ulp -> M^+_{\uparrow}.
// Numbers outside of (lower, upper) definitely do not round to value.
grisu_shortest_handler handler{buf.data(), 0,
boundaries.upper - fp_value.f};
auto result =
grisu_gen_digits(fp(boundaries.upper, fp_value.e),
boundaries.upper - boundaries.lower, exp, handler);
if (result == digits::error) {
exp += handler.size - cached_exp10 - 1;
fallback_format(value, -1, specs.binary32, buf, exp);
return exp;
}
buf.try_resize(to_unsigned(handler.size));
#else
// full cache version
auto dec = jkj::dragonbox::to_decimal(value,
jkj::dragonbox::policy::sign::ignore);
// compressed cache version
/*auto dec = jkj::dragonbox::to_decimal(value,
jkj::dragonbox::policy::sign::ignore,
jkj::dragonbox::policy::cache::compressed);*/
int length = detail::count_digits(dec.significand);
buf.try_resize(to_unsigned(length));
uint32_t dec_significand;
int i = 0;
if (dec.significand >= 100000000) {
dec_significand = uint32_t(dec.significand / 100000000);
uint32_t r = uint32_t(dec.significand) - 100000000 * dec_significand;
uint32_t r1 = r / 10000;
uint32_t r2 = r % 10000;
uint32_t a = r1 / 100;
uint32_t b = r1 % 100;
uint32_t c = r2 / 100;
uint32_t d = r2 % 100;
copy2(buf.data() + length - 8, data::digits[a]);
copy2(buf.data() + length - 6, data::digits[b]);
copy2(buf.data() + length - 4, data::digits[c]);
copy2(buf.data() + length - 2, data::digits[d]);
i += 8;
}
else {
dec_significand = uint32_t(dec.significand);
}
while (dec_significand >= 10000) {
uint32_t r = dec_significand % 10000;
dec_significand /= 10000;
copy2(buf.data() + length - i - 4, data::digits[r / 100]);
copy2(buf.data() + length - i - 2, data::digits[r % 100]);
i += 4;
}
if (dec_significand >= 100) {
uint32_t r = dec_significand % 100;
dec_significand /= 100;
copy2(buf.data() + length - i - 2, data::digits[r]);
i += 2;
}
if (dec_significand >= 10) {
copy2(buf.data() + length - i - 2, data::digits[dec_significand]);
}
else {
*buf.data() = char('0' + dec_significand);
}
return dec.exponent;
#endif
}

The following is the result (source code: https://github.com/jk-jeon/dtoa-benchmark/tree/fmt_dragonbox_test):

So the speed-up is about 30% for the compressed version and about 50% for the full version. I think the main reason it is slower than the corresponding upstream versions of dragonbox is unnecessary intermediate buffer copies for prettifying, but fixing that would probably require changing the code a lot.
@jk-jeon, it would be great if you added my implementation to your benchmarks.
Good. I included yours and reran the benchmark with clang-cl (which optimizes virtually all algorithms in the benchmark better than msvc does). But I'm not sure why yours does not perform as advertised in your repo. Did you use some intrinsics that GCC knows but Clang doesn't? Or it might be because your code wrongly assumes that the compiler is msvc.
FYI, here is the result of the same benchmark compiled with msvc:
Thanks a lot.
Hmm, well, I was definitely not trying to cheat... I ran with gcc 9.3.0 (Ubuntu 9.3.0-10ubuntu2) on an i7-4600U CPU @ 2.10GHz:
So I see fmt is now ~10x faster than sprintf, rather than the ~7.7x before (gcc 9.2, Fedora 30, i7-7820HQ).
I don't use MSVC for my daily work, much less clang-cl (an additional toolchain for MSVC 2019).
clang version 10.0.0 (Fedora 10.0.0-2.fc32) on Fedora 32, i7-7820HQ CPU:
gcc version 10.2.1 20200723 (Red Hat 10.2.1-1) on the same machine:
So the clang code is ~1.5x slower than the gcc code.
Ah, I remember clang has a bug regarding modulo operations. That is probably the prime reason, I guess. Could you update your code and rerun the benchmark with clang? The fix should be simple; just replace
Also,
The first try. It's too late for me now (3am local time).
@erthink Yeah, c u 2mr.
EDIT: never mind, it was my mistake. The performance improvement was about
These two look essential:
{fmt} always uses round to nearest even mode (e.g. when resolving ties). The full cache is not essential but could be included as an option if it's not too difficult. Right now only IEEE 754 binary32 and binary64 are handled since those are most widely used. I wouldn't worry about exotic formats right now and let them be handled by the fallback.
That's a cool speedup. The prettification code and its integration with the digit generator have a lot of room for improvement and can be optimized separately.
If you are the author, you can definitely distribute your code under any license. Your contribution to {fmt} will of course be acknowledged.
I think it depends on how you provide the option to the users. Maybe the easiest (and dirtiest) way is to use a macro and conditional compilation. But I'm concerned about ODR violation hell if we choose that path. Exactly for this reason, I personally dislike any library using macros and conditional compilation for the configuration, but it would be very easy to include the full cache version if we do that.
(To be precise, the speed-up was about 20% for the compressed version and about 30-40% for the full version when compiled with clang, and I think these are probably the more relevant numbers, as clang currently optimizes the code better, which means the improvement for msvc might shrink in the future.) Currently, {fmt} first generates all the digit characters and then prettifies them, but I think there is not much reason to do that if you adapt dragonbox. But that's probably another topic.
Yeah, I was over-worried lol. Anyway, I think I can now start porting the code :)
@jk-jeon
@SanderBouwhuis I think using multiple headers is no better than the macro/conditional-compilation solution as far as the ODR issue is concerned. But maybe I'm too sensitive to that issue.
@vitaut I have a question regarding 128-bit integer types. It seems that currently,
Is this correct? Dragonbox requires some 128-bit integer arithmetic, so I will define a struct for that. Can I assume that if
More precisely, here is the definition of

#if (defined(__GNUC__) || defined(__clang__)) && defined(__SIZEOF_INT128__) && defined(__x86_64__)

with

#if FMT_USE_INT128

and

struct uint128 {
uint128() = default;
#if (defined(__GNUC__) || defined(__clang__)) && defined(__SIZEOF_INT128__) && defined(__x86_64__)
unsigned __int128 internal_;
constexpr uint128(std::uint64_t high, std::uint64_t low) noexcept :
internal_{ ((unsigned __int128)low) | (((unsigned __int128)high) << 64) } {}
constexpr uint128(unsigned __int128 u) : internal_{ u } {}
constexpr std::uint64_t high() const noexcept {
return std::uint64_t(internal_ >> 64);
}
constexpr std::uint64_t low() const noexcept {
return std::uint64_t(internal_);
}
uint128& operator+=(std::uint64_t n) & noexcept {
internal_ += n;
return *this;
}
#else
std::uint64_t high_;
std::uint64_t low_;
constexpr uint128(std::uint64_t high, std::uint64_t low) noexcept :
high_{ high }, low_{ low } {}
constexpr std::uint64_t high() const noexcept {
return high_;
}
constexpr std::uint64_t low() const noexcept {
return low_;
}
uint128& operator+=(std::uint64_t n) & noexcept {
#if defined(_MSC_VER) && defined(_M_X64)
auto carry = _addcarry_u64(0, low_, n, &low_);
_addcarry_u64(carry, high_, 0, &high_);
return *this;
#else
auto sum = low_ + n;
high_ += (sum < low_ ? 1 : 0);
low_ = sum;
return *this;
#endif
}
#endif
};
I think a macro is fine, there is a bunch of configuration macros already which are mainly for targeting resource-constrained systems. It's possible to turn ODR violations into compile (or link) errors if desired but we can address it later. In general users are expected to not mix configurations.
Looks correct.
Yes.
Yes. Thanks for working on this!
@vitaut I think I'm almost done, but here is another question. As I understand it, the function
A couple of other small issues:
Sorry for spamming.
It falls back to Line 1131 in 45da432
but we should probably separate
Yes, please.
Sure.
No, it's accidental. Feel free to change the MSVC version to return
Thanks to @jk-jeon, I guess we are done here. I'll open a new issue to separate shortest and fixed FP formatting (and make integration of the former more efficient).
{fmt} is now ~10% faster than double-conversion and 15x faster than libc++ std::ostringstream on floating-point formatting according to the dtoa-benchmark, but there is still a lot of room for improvement. #1097 gives some optimization ideas.