
Optimize floating-point formatting #1426

Closed
vitaut opened this issue Nov 27, 2019 · 32 comments

Comments

@vitaut
Contributor

vitaut commented Nov 27, 2019

{fmt} is now ~10% faster than double-conversion and 15x faster than libc++ std::ostringstream on floating-point formatting according to the dtoa-benchmark, but there is still a lot of room for improvement.

[screenshot of dtoa-benchmark results]

#1097 gives some optimization ideas.

@erthink
Contributor

erthink commented Feb 19, 2020

Related to #1559 (a double-to-string implementation that is three times faster).

@SanderBouwhuis

SanderBouwhuis commented Mar 30, 2020

Has this materialized? I use a LOT of those conversions, so this would be good news.
Having said that, I'm currently using MSVC printf, so presumably converting to fmt would give me a speed boost even without this(?)

@vitaut
Contributor Author

vitaut commented Mar 30, 2020

No, but {fmt}'s floating-point formatter is already quite fast (see e.g. the benchmark results in https://github.com/fmtlib/fmt/releases/tag/6.1.0), so if you switch from printf you should see a performance improvement.

@SanderBouwhuis

SanderBouwhuis commented Mar 30, 2020

Thanks.

  1. Should I use v6.1.0 which is 'somewhat' old, or can I use the nightlies?
    I.e., are the nightlies safe, or are there regressions or known bugs?

  2. Is fmt still fast if I supply a wchar_t buffer that fmt should use? Everything in my source code makes use of wchar_t arrays, and I want a 'smooth' / nearly in-place transition of sprintf to fmt.

  3. I asked Erthink whether (s)he would consider donating the faster code to fmt.

@vitaut
Contributor Author

vitaut commented Mar 31, 2020

Should I use v6.1.0 which is 'somewhat' old, or can I use the nightlies?
I.e., are the nightlies safe, or are there regressions or known bugs?

I recommend using 6.1.2 or master, which include the same optimization as 6.1.0 plus some fixes.

Is fmt still fast if I supply a wchar_t buffer that fmt should use?

Yes.
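
For anyone landing here, a minimal sketch of what that wchar_t usage can look like (illustrative only, not code from this thread; the helper name is made up): fmt::format_to_n writes directly into an existing wide-character buffer and guards against overflowing it.

#include <cstddef>
#include <fmt/format.h>

// Write a double into a caller-provided wchar_t array, sprintf-style.
void write_value(wchar_t* out, std::size_t capacity, double value) {
  // Reserve one slot for the terminating null that sprintf-style callers expect.
  auto result = fmt::format_to_n(out, capacity - 1, L"{}", value);
  *result.out = L'\0';
}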

@SanderBouwhuis

SanderBouwhuis commented Mar 31, 2020

@vitaut, maybe fmt could use the new charconv in the STL? It uses Ryu, which is (according to Stephan T. Lavavej, the implementer) 2x to 3x as fast as doubleconv (which you seem to be more or less on par with).
This would make the implementation quite clean, albeit a bit slower than Erthink's optimization. Also, because the STL gets vetted MUCH better than any other library, I trust it more.
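
For reference, this is roughly what the suggested <charconv> interface looks like (a minimal sketch, assuming a standard library that ships the C++17 floating-point overloads, e.g. a recent MSVC STL):

#include <charconv>
#include <cstdio>

// Shortest round-trip formatting of a double via std::to_chars
// (implemented with Ryu in the MSVC STL).
void print_shortest(double value) {
  char buf[32];
  auto result = std::to_chars(buf, buf + sizeof(buf), value);
  if (result.ec == std::errc()) std::fwrite(buf, 1, result.ptr - buf, stdout);
}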

@vitaut
Contributor Author

vitaut commented Mar 31, 2020

maybe fmt could use the new charconv in STL?

It could be possible once charconv becomes widely available. Note that on independent benchmarks with random input Ryu is only marginally faster than optimized Grisu (or slower than some implementations) and has several times larger data tables. Grisu also has the nice property that it's faster on short output. In real applications the difference will be negligible because of other factors.

@SanderBouwhuis

Ok, awesome. I've started making the switch to fmtlib. Thanks a BUNCH for this library!

@vitaut
Contributor Author

vitaut commented Jun 20, 2020

{fmt}'s Grisu3 is now on par with or better than Milo Yip's and the original Grisu2 implementations despite providing shortness guarantees:
[benchmark results chart]
but there are still a lot of optimization opportunities.

fmtlib deleted a comment from data-man Jun 20, 2020
@vitaut
Contributor Author

vitaut commented Jun 21, 2020

@data-man, sorry for deleting your comment - trying to avoid links to spammy repos.

@vitaut
Contributor Author

vitaut commented Sep 14, 2020

Another option is to integrate https://github.com/jk-jeon/dragonbox by @jk-jeon. The only problem here is introducing another license.

@SanderBouwhuis

FMT currently has the permissive MIT license, which means just about everyone in the world can use it. If more licenses are piled on, this project will become largely useless to many people in the world (I'm looking at you GPL3).
So, please keep the MIT license if at all possible!

@jk-jeon
Contributor

jk-jeon commented Sep 15, 2020

@SanderBouwhuis I don't have any good knowledge of licensing, but the BSL (Boost Software License) is an even more permissive license than MIT. Still, having to depend on a library with a different license can be a problem and might reduce the current ease of integration.

@vitaut I'm afraid dragonbox is not a perfect fit for fmt as it does not provide the option for fixed-precision conversion. Unlike Grisu3, it does not perform any iterative search for the shortest representation. Rather, it finds the shortest decimal number in essentially a single trial, so it would be more difficult (if not impossible) to modify it for fixed-precision conversion. If you want separate implementations for shortest-roundtrip and fixed-precision, then I can consider rewriting the whole thing (excluding the features that fmt does not want) and licensing it under MIT, but having two implementations feels like a big trade-off.

@vitaut
Contributor Author

vitaut commented Sep 15, 2020

it does not provide the option for fixed-precision conversion.

That's OK, {fmt} can continue using Grisu3+Dragon4 for this.

If you want to have separate implementations for shortest-roundtrip and fixed-precision, then I can consider rewriting the whole thing (excluding the features that fmt does not want) and license it under MIT

If dragonbox or some part of it could be distributed under the same license as {fmt} (MIT with an optional exception for binary distribution) that would be awesome.

@jk-jeon
Contributor

jk-jeon commented Sep 15, 2020

That's OK, {fmt} can continue using Grisu3+Dragon4 for this.

Great 😁 Let's start discussing what's needed and what's not needed, then. Let's cross out the big things first. Could you let me know which of these you think are worth keeping?

  1. Support for various rounding modes.
  2. Various options for what to do when there is a tie for correct rounding
  3. Full (10kb) cache table for double
  4. Compressed (1.1kb) cache table for double
  5. Conversion routine for float

(I didn't implement a compressed cache version for float, but it would not be difficult, so you can consider using that version as well. I expect the size vs. speed trade-off to be smaller in percentage terms than for double.)

There is also some machinery I made for possible extensions to 16-bit, 128-bit, or 80-bit floating-point numbers. The thing is, it is not possible to just look at the type and figure out what encoding format it uses internally, because that depends very much on the specific platform. 128-bit and 80-bit floats in particular are messy. Hence, the idea was to eliminate explicit mentions of float and double completely and allow users to provide the details of their types' encoding formats themselves.
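
To illustrate the idea, a hypothetical sketch (not dragonbox's actual interface): a traits struct describing the IEEE-754 layout lets the algorithm avoid naming float or double directly, and users of exotic types specialize it themselves.

#include <cstdint>

// Hypothetical traits describing an encoding format.
template <class Float> struct float_format;

template <> struct float_format<double> {  // IEEE-754 binary64
  using carrier_uint = std::uint64_t;      // unsigned type holding the bit pattern
  static constexpr int significand_bits = 52;
  static constexpr int exponent_bits = 11;
  static constexpr int exponent_bias = 1023;
};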

But if you care only about "sane" platforms where float is IEEE-754 binary32 and double is IEEE-754 binary64, and you are happy to just use the fallback for other formats, then this machinery can be removed as well.

Given that lots of the things in the list above will be removed, I expect it would not be too painful to port the code so that the list you wrote here can stay the same.

And about licensing: I believe it should be okay to copy, paste, and modify my own code and distribute it under a different license. But to be sure, I'll ask a question on Software Engineering Stack Exchange.

Thanks!

@jk-jeon
Contributor

jk-jeon commented Sep 15, 2020

I did some experiments on how much fmt can be made faster by adopting dragonbox. The following (plus a small namespace hack to avoid a name clash) is the change I made to fmt for this experiment:

if (precision < 0) {
#if 0
    fp fp_value;
    auto boundaries = specs.binary32
                          ? fp_value.assign_float_with_boundaries(value)
                          : fp_value.assign_with_boundaries(value);
    fp_value = normalize(fp_value);
    // Find a cached power of 10 such that multiplying value by it will bring
    // the exponent in the range [min_exp, -32].
    const fp cached_pow = get_cached_power(
        min_exp - (fp_value.e + fp::significand_size), cached_exp10);
    // Multiply value and boundaries by the cached power of 10.
    fp_value = fp_value * cached_pow;
    boundaries.lower = multiply(boundaries.lower, cached_pow.f);
    boundaries.upper = multiply(boundaries.upper, cached_pow.f);
    assert(min_exp <= fp_value.e && fp_value.e <= -32);
    --boundaries.lower;  // \tilde{M}^- - 1 ulp -> M^-_{\downarrow}.
    ++boundaries.upper;  // \tilde{M}^+ + 1 ulp -> M^+_{\uparrow}.
    // Numbers outside of (lower, upper) definitely do not round to value.
    grisu_shortest_handler handler{buf.data(), 0,
                                   boundaries.upper - fp_value.f};
    auto result =
        grisu_gen_digits(fp(boundaries.upper, fp_value.e),
                         boundaries.upper - boundaries.lower, exp, handler);
    if (result == digits::error) {
      exp += handler.size - cached_exp10 - 1;
      fallback_format(value, -1, specs.binary32, buf, exp);
      return exp;
    }
    buf.try_resize(to_unsigned(handler.size));
#else
    // full cache version
    auto dec = jkj::dragonbox::to_decimal(value,
      jkj::dragonbox::policy::sign::ignore);
    // compressed cache version
    /*auto dec = jkj::dragonbox::to_decimal(value,
      jkj::dragonbox::policy::sign::ignore,
      jkj::dragonbox::policy::cache::compressed);*/
    int length = detail::count_digits(dec.significand);
    buf.try_resize(to_unsigned(length));
    uint32_t dec_significand;
    int i = 0;
    if (dec.significand >= 100000000) {
      dec_significand = uint32_t(dec.significand / 100000000);
      uint32_t r = uint32_t(dec.significand) - 100000000 * dec_significand;
      uint32_t r1 = r / 10000;
      uint32_t r2 = r % 10000;
      uint32_t a = r1 / 100;
      uint32_t b = r1 % 100;
      uint32_t c = r2 / 100;
      uint32_t d = r2 % 100;
      
      copy2(buf.data() + length - 8, data::digits[a]);
      copy2(buf.data() + length - 6, data::digits[b]);
      copy2(buf.data() + length - 4, data::digits[c]);
      copy2(buf.data() + length - 2, data::digits[d]);
      i += 8;
    }
    else {
      dec_significand = uint32_t(dec.significand);
    }
    while (dec_significand >= 10000) {
        uint32_t r = dec_significand % 10000;
        dec_significand /= 10000;

        copy2(buf.data() + length - i - 4, data::digits[r / 100]);
        copy2(buf.data() + length - i - 2, data::digits[r % 100]);
        i += 4;
    }
    if (dec_significand >= 100) {
        uint32_t r = dec_significand % 100;
        dec_significand /= 100;

        copy2(buf.data() + length - i - 2, data::digits[r]);
        i += 2;
    }
    if (dec_significand >= 10) {
        copy2(buf.data() + length - i - 2, data::digits[dec_significand]);
    }
    else {
        *buf.data() = char('0' + dec_significand);
    }
    
    return dec.exponent;
#endif
  }

(format-inl.h, in function format_float)

The following is the result:

[benchmark result charts]

(Source code: https://github.com/jk-jeon/dtoa-benchmark/tree/fmt_dragonbox_test)

So the speed-up is about 30% for the compressed version and about 50% for the full version. I think the main reason it is still slower than the corresponding upstream versions of dragonbox is the unnecessary intermediate buffer copies for prettifying, but fixing that would probably require changing the code a lot.
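
To make the "prettifying" cost concrete, here is a rough sketch (illustrative only, not {fmt}'s actual code) of the second pass that turns the decimal significand/exponent pair into scientific notation; doing this as a separate pass over an intermediate buffer is where the extra copies come from.

#include <cstdio>
#include <string>

// digits: the significand's decimal digits; exponent10: exponent of the last digit,
// so that value = digits * 10^exponent10 (the form shortest-digit generators return).
std::string prettify_scientific(const char* digits, int num_digits, int exponent10) {
  int printed_exp = exponent10 + num_digits - 1;  // exponent relative to the first digit
  std::string out;
  out += digits[0];
  if (num_digits > 1) {
    out += '.';
    out.append(digits + 1, num_digits - 1);
  }
  char exp_buf[8];
  std::snprintf(exp_buf, sizeof(exp_buf), "e%+03d", printed_exp);
  return out + exp_buf;
}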

@erthink
Contributor

erthink commented Sep 15, 2020

@jk-jeon, it would be great if you added my implementation to your benchmarks.
This would allow me and other participants to evaluate my code (currently it seems to me to be the most compact and about the same performance as the best competitors).

@jk-jeon
Contributor

jk-jeon commented Sep 15, 2020

@jk-jeon, it would be great if you added my implementation to your benchmarks.
This would allow me and other participants to evaluate my code (currently it seems to me to be the most compact and about the same performance as the best competitors).

Good. I included yours and reran the benchmark with clang-cl (which optimizes virtually all of the algorithms in the benchmark better than msvc does):

[benchmark result charts (clang-cl)]

But I'm not sure why yours does not perform as advertised in your repo. Did you use some intrinsics that GCC knows but Clang doesn't? Or might it be because your code wrongly assumes that the compiler is msvc when _MSC_VER is defined?

FYI: here is the result of the same benchmark compiled with msvc:

[benchmark result charts (msvc)]

@erthink
Contributor

erthink commented Sep 15, 2020

Good. I included yours and reran benchmark

Thanks a lot.

But I'm not sure why yours do not perform as advertised in your repo. Did you use some intrinsics that GCC knows but Clang doesn't?

Hmm, well, I was definitely not trying to cheat...
I think it depends significantly on the compiler (and the specific version), libc version, and the CPU.

I just ran it with gcc 9.3.0 (Ubuntu 9.3.0-10ubuntu2) on an i7-4600U CPU @ 2.10GHz:

Function Min ns RMS ns Max ns Sum ns Speedup
null 1.5 1.506 1.6 25.6 ×609.2
erthink 25.4 40.358 53.1 671.0 ×23.2
erthink_shodan 27.8 43.428 60.3 723.8 ×21.5
ryu 45.2 60.106 73.6 1010.0 ×15.4
milo 41.2 62.357 75.5 1050.1 ×14.9
emyg 41.8 62.714 75.8 1056.2 ×14.8
floaxie 29.5 76.898 101.1 1242.8 ×12.5
grisu2 71.5 88.986 108.4 1502.6 ×10.4
fmt 62.0 92.026 121.2 1544.6 ×10.1
doubleconv 72.2 113.108 145.3 1898.0 ×8.2
fpconv 107.8 156.169 179.4 2634.1 ×5.9
stb 157.0 160.634 163.8 2730.6 ×5.7
sprintf 842.0 918.346 979.0 15594.7 ×1.0
ostrstream 1286.4 1359.673 1419.4 23103.9 ×0.7
ostringstream 1408.3 1499.231 1580.1 25467.3 ×0.6

So, I see fmt is now ×10 faster than sprintf, rather than ×7.7 as before (gcc 9.2, Fedora 30, i7-7820HQ).

Or might be because your code wrongly assumes that the compiler is msvc as _MSC_VER is defined?

I don't use MSVC for my daily work, much less clang-cl (an additional toolchain from MSVC 2019).
However, as far as I remember, I made some corrections somewhere to define the compiler as MSVC if _MSC_VER is defined and __clang__ is not defined.

@erthink
Contributor

erthink commented Sep 15, 2020

clang version 10.0.0 (Fedora 10.0.0-2.fc32) on Fedora 32, i7-7820HQ CPU:

Function Min ns RMS ns Max ns Sum ns Speedup
null 1.2 1.200 1.2 20.4 ×651.0
erthink 37.1 50.521 62.6 849.9 ×15.6
ryu 38.1 50.891 64.2 854.2 ×15.5
erthink_shodan 38.9 53.804 70.6 904.1 ×14.7
emyg 46.9 65.674 79.9 1108.4 ×12.0
floaxie 27.0 72.207 97.5 1157.8 ×11.5
fmt 57.2 74.883 100.5 1262.0 ×10.5
milo 53.4 77.634 99.7 1303.7 ×10.2
doubleconv 65.1 99.226 129.3 1664.6 ×8.0
grisu2 99.2 116.581 136.0 1974.6 ×6.7
stb 126.6 129.911 133.9 2208.3 ×6.0
fpconv 103.6 144.810 167.9 2445.6 ×5.4
sprintf 710.3 782.257 840.9 13279.6 ×1.0
ostrstream 1011.0 1079.121 1136.0 18332.3 ×0.7
ostringstream 1159.5 1247.095 1326.8 21177.7 ×0.6

gcc version 10.2.1 20200723 (Red Hat 10.2.1-1) on the same machine (Fedora 32, i7-7820HQ CPU):

Function Min ns RMS ns Max ns Sum ns Speedup
null 1.2 1.206 1.3 20.5 ×648.2
erthink 22.8 34.331 44.8 573.6 ×23.2
erthink_shodan 26.5 40.953 58.4 680.5 ×19.5
ryu 36.6 49.335 62.8 828.7 ×16.0
emyg 40.1 59.694 71.9 1005.6 ×13.2
milo 39.9 62.649 81.2 1048.5 ×12.7
floaxie 24.9 68.544 91.9 1099.9 ×12.1
fmt 56.6 82.959 110.2 1393.4 ×9.5
doubleconv 74.6 115.409 145.6 1937.9 ×6.9
grisu2 95.7 117.441 137.5 1985.5 ×6.7
stb 126.0 128.549 131.6 2185.2 ×6.1
fpconv 101.6 143.204 163.0 2418.3 ×5.5
sprintf 710.8 782.770 841.2 13288.4 ×1.0
ostrstream 1024.5 1095.819 1155.5 18615.3 ×0.7
ostringstream 1056.4 1136.396 1204.2 19300.7 ×0.7

So, the clang code is ~1.5x slower than the gcc code.
This is probably the result of some flaw in the optimizer (like the % bug, i.e. the modulus operation).

@jk-jeon
Contributor

jk-jeon commented Sep 15, 2020

Ah, I remember clang has a bug regarding the modulo operation. That should be the main reason, I guess. Could you update your code and rerun the benchmark with clang? The fix should be simple; just replace a % b with a - b * (a / b), like here.
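
For illustration, the workaround looks something like this (a hypothetical helper, not code from the benchmark): when both the quotient and the remainder are needed, deriving the remainder from the already-computed quotient lets the compiler reuse one division instead of emitting a second (or unoptimized) div for the % operator.

#include <cstdint>

// Compute n / 100 and n % 100 without writing the % operator.
inline void divmod_100(std::uint32_t n, std::uint32_t& q, std::uint32_t& r) {
  q = n / 100;
  r = n - q * 100;  // equivalent to n % 100
}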

Also,

  1. It should be easy to include dragonbox in your benchmark yourself.
  2. I think there is no reason to test ostringstream or sprintf, as they are obviously much, much, much slower. Why not temporarily disable them in the benchmark and save your precious time?

EDIT:
I don't think the OS, libc version, or CPU matter a lot here. We are not allocating memory or threads or anything, and not writing to files/network/etc., so there is no reason to depend much on the OS. The optimized code doesn't seem to call any standard library functions, so there is no reason to depend on the libc version. Also, mobile Haswell (4600U) and mobile Kaby Lake (7820HQ, 7700HQ (mine)) should not be that different.

EDIT2:
I've just confirmed that clang issues the dreaded div instruction in your code. Something must be terribly wrong!

@erthink
Contributor

erthink commented Sep 16, 2020

Ah. I remember clang has a bug regarding modular operation. That should be the prime reason I guess. Could you update your code and rerun the benchmark with clang? The fix should be simple; just replace a % b to a - b * (a/b), like here.

The first try.
Speedup for clang-10: ×15.5 -> ×18.2.
In the same env: gcc-9.3 ×23.4, gcc-10 ×22.6.

It's too late now for me (3am by my local time).
c u 2mr.

@jk-jeon
Contributor

jk-jeon commented Sep 16, 2020

@erthink Yeah, c u 2mr.

FYI: the performance has improved, but clang still issues a div instruction; I don't know why. It would be great if you could check whether gcc also issues div. (msvc doesn't.)

EDIT: never mind, that was my mistake. The performance improvement was about 4-5 ns compared to the original version.

@vitaut
Contributor Author

vitaut commented Sep 16, 2020

Could you let me know the things you think worth keeping among these?

These two look essential:

  • Compressed (1.1kb) cache table for double
  • Conversion routine for float

{fmt} always uses round to nearest even mode (e.g. when resolving ties). The full cache is not essential but could be included as an option if it's not too difficult.

Right now only IEEE 754 binary32 and binary64 are handled since those are the most widely used. I wouldn't worry about exotic formats right now and would let them be handled by the fallback.

So the speed-up is about 30% for the compressed version, and is about 50% for the full version.

That's a cool speedup. The prettification code and its integration with the digit generator have a lot of room for improvement and can be optimized separately.

And about licensing. I believe it should be okay to copy-and-paste-and-modify my own code and distribute it under a different license.

If you are the author, you can definitely distribute your code under any license. Your contribution to {fmt} will of course be acknowledged.

@jk-jeon
Contributor

jk-jeon commented Sep 16, 2020

The full cache is not essential but could be included as an option if it's not too difficult.

I think it depends on how you provide the option to users. Maybe the easiest (and dirtiest) way is to use a macro and conditional compilation, but I'm concerned about ODR-violation hell if we choose that path. Exactly for this reason I personally dislike any library that uses macros and conditional compilation for configuration, but it would be very easy to include the full-cache version if we do that.

So the speed-up is about 30% for the compressed version, and is about 50% for the full version.

That's a cool speedup. Prettification code and integration with digit generator have a lot of room for improvement and can be optimized separately.

(To be precise, the speed-up was about 20% for the compressed version and about 30-40% for the full version when compiled with clang, and I think these are probably the more relevant numbers since clang currently optimizes the code better, which means the amount of improvement for msvc might decrease in the future.)

Currently, {fmt} first generates all the digit characters and then prettifies them, but I think there are not many reasons to do that if you adopt dragonbox. But that's probably another topic.

If you are the author, you can definitely distribute your code under any license. Your contribution to {fmt} will of course be acknowledged.

Yeah, I was too over-worried lol

Anyway, I think now I can start porting the code:)

@SanderBouwhuis

SanderBouwhuis commented Sep 16, 2020

@jk-jeon
I would definitely prefer to have the option to use the full cache for performance reasons on beefy machines, while also having the option of the less performant but smaller compressed-cache version. Maybe the choice could depend on which of two header files you include? E.g., fmt_dragon.h vs fmt_dragon_full.h.

@jk-jeon
Contributor

jk-jeon commented Sep 16, 2020

@SanderBouwhuis I think using multiple headers is no better than the macro/conditional-compilation solution with regard to the ODR issue. But maybe I'm too sensitive to that issue.

@jk-jeon
Contributor

jk-jeon commented Sep 17, 2020

@vitaut I have a question regarding 128-bit integer types. It seems that currently,

  1. If FMT_USE_INT128 is defined and is not 0, then assume the user has defined int128_t and uint128_t with the correct semantics.
  2. If FMT_USE_INT128 is not defined or is 0, but the compiler supports 128-bit integer types, then declare int128_t and uint128_t as aliases to those types.
  3. Otherwise, define int128_t and uint128_t as empty structs.

Is this correct?

Dragonbox requires some 128-bit integer arithmetic so I will define a struct for that. Can I assume that if FMT_USE_INT128 is 1, then uint128_t is a correctly functioning type?

More precisely, here is the definition of uint128 in the upstream dragonbox implementation. Can I replace the line

#if (defined(__GNUC__) || defined(__clang__)) && defined(__SIZEOF_INT128__) && defined(__x86_64__)

with

#if FMT_USE_INT128

and __int128 with fmt::detail::uint128_t?

struct uint128 {
	uint128() = default;

#if (defined(__GNUC__) || defined(__clang__)) && defined(__SIZEOF_INT128__) && defined(__x86_64__)
	unsigned __int128	internal_;

	constexpr uint128(std::uint64_t high, std::uint64_t low) noexcept :
		internal_{ ((unsigned __int128)low) | (((unsigned __int128)high) << 64) } {}

	constexpr uint128(unsigned __int128 u) : internal_{ u } {}

	constexpr std::uint64_t high() const noexcept {
		return std::uint64_t(internal_ >> 64);
	}
	constexpr std::uint64_t low() const noexcept {
		return std::uint64_t(internal_);
	}

	uint128& operator+=(std::uint64_t n) & noexcept {
		internal_ += n;
		return *this;
	}
#else
	std::uint64_t	high_;
	std::uint64_t	low_;

	constexpr uint128(std::uint64_t high, std::uint64_t low) noexcept :
		high_{ high }, low_{ low } {}

	constexpr std::uint64_t high() const noexcept {
		return high_;
	}
	constexpr std::uint64_t low() const noexcept {
		return low_;
	}

	uint128& operator+=(std::uint64_t n) & noexcept {
#if defined(_MSC_VER) && defined(_M_X64)
		auto carry = _addcarry_u64(0, low_, n, &low_);
		_addcarry_u64(carry, high_, 0, &high_);
		return *this;
#else
		auto sum = low_ + n;
		high_ += (sum < low_ ? 1 : 0);
		low_ = sum;
		return *this;
#endif
	}
#endif
};
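
For context, the 128-bit arithmetic in question is essentially a 64×64 → 128-bit multiplication. A minimal sketch of how it can be written against FMT_USE_INT128 (illustrative only, not the dragonbox or {fmt} source; the function name is made up):

#include <cstdint>

// Returns the high 64 bits of a * b and stores the low 64 bits in `low`.
inline std::uint64_t umul128(std::uint64_t a, std::uint64_t b, std::uint64_t& low) {
#if FMT_USE_INT128
  unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
  low = static_cast<std::uint64_t>(p);
  return static_cast<std::uint64_t>(p >> 64);
#else
  // Portable fallback: schoolbook multiplication on 32-bit halves.
  std::uint64_t a_lo = static_cast<std::uint32_t>(a), a_hi = a >> 32;
  std::uint64_t b_lo = static_cast<std::uint32_t>(b), b_hi = b >> 32;
  std::uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi, p2 = a_hi * b_lo, p3 = a_hi * b_hi;
  std::uint64_t mid = p1 + (p0 >> 32) + static_cast<std::uint32_t>(p2);
  low = (mid << 32) | static_cast<std::uint32_t>(p0);
  return p3 + (mid >> 32) + (p2 >> 32);
#endif
}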

@vitaut
Contributor Author

vitaut commented Sep 17, 2020

Maybe the easiest (and dirtiest) way is to use a macro and conditional compilation.

I think a macro is fine, there is a bunch of configuration macros already which are mainly for targeting resource-constrained systems. It's possible to turn ODR violations into compile (or link) errors if desired but we can address it later. In general users are expected to not mix configurations.

I have a question regarding 128-bit integer types ... Is this correct?

Looks correct.

Can I assume that if FMT_USE_INT128 is 1, then uint128_t is a correctly functioning type?

Yes.

Can I replace the line ... with ... and __int128 with fmt::detail::uint128_t?

Yes.

Thanks for working on this!

@jk-jeon
Contributor

jk-jeon commented Sep 17, 2020

@vitaut I think I'm almost done, but here is another question. As I understand it, the function format_float<T> is instantiated for float, double, and long double. First of all, I currently don't have an implementation for 80-bit/128-bit floats, so platforms where long double is not IEEE-754 binary64 will be a problem. But as far as I understand, your original Grisu3+Dragon4 also works only for IEEE-754 binary32/64, so what exactly happens here? Do you just ignore 80/128-bit floats and treat long double as double, or are there some fallback mechanisms outside of format_float<T> that I couldn't find?

Couple of other small issues:

  1. Should I use basic_data<T> and data for static data? If so, then to avoid ambiguously similar names, I may want to change the names of some existing tables, e.g., pow10_significands to grisu_pow10_significands.
  2. I need TZCNT intrinsics, so I copied/modified the LZCNT detection code in format.h (roughly along the lines of the sketch below). But I realized that while __builtin_clz and friends return int, the MSVC version returns uint32_t. Is this intended?
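
A rough sketch of such a wrapper (illustrative, not {fmt}'s actual macros), mirroring the existing LZCNT detection and returning int in both cases:

#include <cstdint>
#ifdef _MSC_VER
#  include <intrin.h>
#endif

// Count trailing zero bits; precondition: x != 0. (The MSVC path assumes x64/ARM64.)
inline int ctzll(std::uint64_t x) {
#ifdef _MSC_VER
  unsigned long index;
  _BitScanForward64(&index, x);
  return static_cast<int>(index);
#else
  return __builtin_ctzll(x);
#endif
}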

Sorry for spamming..

@vitaut
Contributor Author

vitaut commented Sep 18, 2020

what exactly happens here?

It falls back to snprintf for exotic floating point:

if (!specs.use_grisu) return snprintf_float(value, precision, specs, buf);

but we should probably separate the format_float instantiation for long double.

Should I use basic_data and data for static data?

Yes, please.

If that is the case, then to avoid ambiguously similar names, maybe I want to change names of some existing tables

Sure.

Is this intended?

No, it's accidental. Feel free to change the MSVC version to return int.

@vitaut
Contributor Author

vitaut commented Sep 20, 2020

Thanks to @jk-jeon, I guess we are done here. I'll open a new issue to separate shortest and fixed FP formatting (and make integration of the former more efficient).

[benchmark results chart]

vitaut closed this as completed Sep 20, 2020