UTF8/ASCII-based parser #49

lemire · 2021-02-23T20:46:35Z

We add support inside the library for UTF-8 parsing with some accompanying optimizations.

Performance results:

| Method | FileName | Mean | Error | StdDev | Min | Ratio | MFloat/s | MB/s |
|---|---|---|---|---|---|---|---|---|
| Utf8Parser | data/canada.txt | 21.605 ms | 0.0656 ms | 0.0512 ms | 21.541 ms | 0.84 | 5.16 | 96.93 |
| 'FastFloat.ParseDouble() - UTF8' | data/canada.txt | 4.624 ms | 0.0026 ms | 0.0021 ms | 4.619 ms | 0.18 | 24.06 | 452.01 |
| FastFloat.ParseDouble() | data/canada.txt | 4.878 ms | 0.0085 ms | 0.0075 ms | 4.869 ms | 0.19 | 22.82 | 428.87 |
| 'ParseNumberString() only' | data/canada.txt | 2.026 ms | 0.0086 ms | 0.0067 ms | 2.022 ms | 0.08 | 54.95 | 1032.54 |
| Double.Parse() | data/canada.txt | 25.709 ms | 0.0482 ms | 0.0403 ms | 25.652 ms | 1.00 | 4.33 | 81.40 |
| Utf8Parser | data/mesh.txt | 2.761 ms | 0.0016 ms | 0.0015 ms | 2.759 ms | 0.54 | 26.46 | 224.70 |
| 'FastFloat.ParseDouble() - UTF8' | data/mesh.txt | 1.359 ms | 0.0038 ms | 0.0032 ms | 1.357 ms | 0.27 | 53.82 | 457.00 |
| FastFloat.ParseDouble() | data/mesh.txt | 1.504 ms | 0.0009 ms | 0.0008 ms | 1.503 ms | 0.30 | 48.59 | 412.60 |
| 'ParseNumberString() only' | data/mesh.txt | 1.060 ms | 0.0165 ms | 0.0155 ms | 1.037 ms | 0.21 | 70.39 | 597.71 |
| Double.Parse() | data/mesh.txt | 5.074 ms | 0.0067 ms | 0.0056 ms | 5.067 ms | 1.00 | 14.41 | 122.36 |

cc @LordJZ @buybackoff @CarlVerret

Would fix #43

Related to: #42

lemire · 2021-02-24T17:23:30Z

@mburbea and @gfoidl : you guys should chime in if you can. The idea here is that if we know that we have a little endian system, we can do some neat optimizations. (The same optimizations are possible under big endian systems, but they may require different code paths.) Given that big endian systems are... rare... to say the least... this seems like a very fruitful thing to do.

It is also very likely that my PR leaves performance on the table.
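To make the little-endian idea above concrete, here is a minimal sketch of the well-known SWAR ("SIMD within a register") trick for checking and parsing eight ASCII digits at once. It is an illustration under stated assumptions, not the code in this PR: the class and method names (`SwarDigits`, `IsMadeOfEightDigits`, `ParseEightDigits`, `TryParseEightDigits`) are hypothetical, and `BinaryPrimitives.ReadUInt64LittleEndian` is used so that the sketch stays correct even on a big-endian machine.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper class, for illustration only.
public static class SwarDigits
{
    // True when every byte of val is an ASCII digit ('0'..'9').
    public static bool IsMadeOfEightDigits(ulong val)
    {
        unchecked
        {
            // A byte above '9' sets its high bit after adding 0x46;
            // a byte below '0' sets its high bit after subtracting 0x30.
            return (((val + 0x4646464646464646UL) | (val - 0x3030303030303030UL))
                    & 0x8080808080808080UL) == 0;
        }
    }

    // Converts eight ASCII digits (first digit in the lowest byte) to an integer.
    // The caller must have checked IsMadeOfEightDigits first.
    public static uint ParseEightDigits(ulong val)
    {
        unchecked
        {
            const ulong mask = 0x000000FF000000FFUL;
            const ulong mul1 = 0x000F424000000064UL; // 100 + (1000000 << 32)
            const ulong mul2 = 0x0000271000000001UL; // 1 + (10000 << 32)
            val -= 0x3030303030303030UL;          // ASCII -> digit values
            val = (val * 10UL) + (val >> 8);      // combine adjacent digit pairs
            val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
            return (uint)val;
        }
    }

    // Reads the first eight bytes as a little-endian word and parses them.
    public static bool TryParseEightDigits(ReadOnlySpan<byte> ascii, out uint value)
    {
        value = 0;
        if (ascii.Length < 8) return false;
        ulong word = BinaryPrimitives.ReadUInt64LittleEndian(ascii);
        if (!IsMadeOfEightDigits(word)) return false;
        value = ParseEightDigits(word);
        return true;
    }
}
```

For example, `SwarDigits.TryParseEightDigits(Encoding.ASCII.GetBytes("12345678"), out uint v)` should return `true` with `v == 12345678`, while an input such as `"12345.78"` is rejected by the digit check.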

gfoidl · 2021-02-24T17:32:10Z

> you guys should chime in if you can

Happy to chime in -- I need to familiarize myself with the code a bit first 😃

lemire · 2021-02-24T17:33:40Z

@gfoidl Great. Keep in mind that it is a draft. It is not something I would ever merge into the main branch. It is to spur discussions.

gfoidl

This is a great starting point. I just had a quick look over the code... I need to check it in more depth tomorrow (it's late here).
(I expect to be slow to respond tomorrow, as I will be mostly out of the office.)

csFastFloat/FastDoubleParser.cs

csFastFloat/FastFloatParser.cs

csFastFloat/Structures/DecimalInfo.cs

lemire · 2021-02-25T17:22:24Z

I have added span-based public functions so that one could reasonably test these functions.

Utf8Parser benchmark

lemire · 2021-02-25T18:18:01Z

Tests and benchmarks are in.

lemire · 2021-02-25T18:47:26Z

@CarlVerret @gfoidl @mburbea

This is now sensible and ready for review.

It includes @LordJZ's UTF-8 benchmark, so we can now compare one function that parses UTF-8 input against another function that also parses UTF-8 input, which makes the comparison fair.

Of course, the real benefit of UTF-8 parsing is that you can "parse in place" if your input was UTF-8 to begin with... instead of having to convert it.
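As a rough illustration of the "parse in place" point, the sketch below contrasts the two paths using only BCL calls: converting the UTF-8 bytes to a `string` before calling `double.Parse`, versus parsing the bytes directly with `System.Buffers.Text.Utf8Parser`. The BCL parser stands in for the UTF-8 entry point added by this PR, whose exact signature is not shown in this thread; the class name `ParseInPlaceDemo` is hypothetical.

```csharp
using System;
using System.Buffers.Text;
using System.Globalization;
using System.Text;

// Hypothetical demo class; uses only BCL APIs.
public static class ParseInPlaceDemo
{
    // Path 1: allocate a UTF-16 string, then parse it.
    public static double ParseViaString(ReadOnlySpan<byte> utf8)
    {
        string s = Encoding.UTF8.GetString(utf8);
        return double.Parse(s, CultureInfo.InvariantCulture);
    }

    // Path 2: parse the UTF-8 bytes directly, with no intermediate string.
    public static double ParseInPlace(ReadOnlySpan<byte> utf8)
    {
        if (!Utf8Parser.TryParse(utf8, out double value, out int consumed)
            || consumed != utf8.Length)
        {
            throw new FormatException("Input is not a valid number.");
        }
        return value;
    }
}
```

Both paths should return the same value for an input such as `Encoding.UTF8.GetBytes("3.14159")`; the second one avoids the conversion and the allocation.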

lemire · 2021-02-25T18:59:07Z

Generally speaking and not just for @gfoidl, pull requests on this branch are invited. I am not assuming that this is the cleanest or fastest implementation.

csFastFloat/FastDoubleParser.cs

Benchmark/FastParserBenchmark.cs

csFastFloat/Utils/Utils.cs

gfoidl · 2021-02-25T20:41:17Z

Well, from what I've seen, this code looks really good 👍 and I couldn't spot anything to improve (for the SWAR there's a separate issue).
Nice.

lemire · 2021-02-25T20:44:48Z

@gfoidl Thanks for the review!

buybackoff · 2021-02-25T23:01:26Z

Thanks for tagging and adding UTF8 support so fast! 👍

Any chance to support netstandard2.0? I wanted to run Utf8Json bench, but BitOperations class is not present on netstandard2.0. Also intrinsics usage is not #ifdef-ed in Utils.

I used a "polyfill" below.

Also, is it possible to support an API that takes the starting pointer and returns read count? Such as:

Double ParseNumber(byte* first, int maxLength, out int readCount, chars_format expectedFormat = chars_format.is_general, byte decimal_separator = (byte)'.')

Double ParseNumber(ReadOnlySpan<byte> span, out int readCount, chars_format expectedFormat = chars_format.is_general, byte decimal_separator = (byte)'.')

E.g. in Utf8Json, we know where to expect a double, start at first non-whitespace byte, and consume bytes while they could be a part of a double. We do not know the length in advance. Doing the length calculation before double parsing will be quite expensive.

Utf8Parser.TryParse also returns bytesConsumed. bool TryParse... overloads are also very useful to avoid exception handling when exceptions are expected.
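To show why a consumed-byte count matters, here is a small sketch of the calling pattern described above: walk a UTF-8 buffer, skip separators, and let the parser report how far it read, so the caller never has to find the end of each number first. It uses the BCL `Utf8Parser.TryParse` overload with `bytesConsumed` as a stand-in for the requested `readCount` API, which does not exist yet at this point in the thread; `NumberScanner` and `ParseAll` are hypothetical names.

```csharp
using System;
using System.Buffers.Text;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class NumberScanner
{
    // Parses every number in a comma/whitespace-separated UTF-8 buffer.
    public static List<double> ParseAll(ReadOnlySpan<byte> input)
    {
        var result = new List<double>();
        int pos = 0;
        while (pos < input.Length)
        {
            byte b = input[pos];
            // Skip separators between numbers.
            if (b == (byte)' ' || b == (byte)',' || b == (byte)'\n' || b == (byte)'\t')
            {
                pos++;
                continue;
            }
            // The parser tells us how many bytes belonged to the number.
            if (!Utf8Parser.TryParse(input.Slice(pos), out double value, out int consumed)
                || consumed == 0)
            {
                throw new FormatException($"Invalid number at byte offset {pos}.");
            }
            result.Add(value);
            pos += consumed;
        }
        return result;
    }
}
```

The `TryParse` shape also avoids exception handling on the hot path; the sketch only throws because it has no other way to report a malformed buffer.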

The "polyfill":

  <PropertyGroup Condition="'$(TargetFramework)'=='net5.0' OR '$(TargetFramework)'=='netcoreapp3.1'">
    <DefineConstants>$(DefineConstants);HAS_INTRINSICS;</DefineConstants>
  </PropertyGroup>

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int LeadingZeroCount(long value)
        {
#if HAS_INTRINSICS
            if (X86.Lzcnt.X64.IsSupported)
                return (int) X86.Lzcnt.X64.LeadingZeroCount((ulong) value);

            if (Arm.ArmBase.Arm64.IsSupported)
                return Arm.ArmBase.Arm64.LeadingZeroCount((ulong) value);
#endif
#if NET5_0
            return System.Numerics.BitOperations.LeadingZeroCount((ulong) value);
#else
            // Portable fallback: binary search for the highest set bit.
            unchecked
            {
                if (value == 0L)
                {
                    return 64;
                }

                int n = 1;
                if ((long) ((ulong) value >> 32) == 0)
                {
                    n += 32;
                    value <<= 32;
                }

                if ((long) ((ulong) value >> 48) == 0)
                {
                    n += 16;
                    value <<= 16;
                }

                if ((long) ((ulong) value >> 56) == 0)
                {
                    n += 8;
                    value <<= 8;
                }

                if ((long) ((ulong) value >> 60) == 0)
                {
                    n += 4;
                    value <<= 4;
                }

                if ((long) ((ulong) value >> 62) == 0)
                {
                    n += 2;
                    value <<= 2;
                }

                n -= (int) ((ulong) value >> 63);
                return n;
            }
#endif
        }

pkese · 2021-02-26T01:27:16Z

> Also, is it possible to support an API that takes the starting pointer and returns read count?

^^^^^^^
+1 !!! That would be extremely useful.

lemire · 2021-02-26T01:34:22Z

@pkese @buybackoff

I have opened an issue...

#61

It should actually be quite easy to return a pointer to the end of the pattern.

lemire · 2021-02-26T01:35:25Z

@buybackoff

> Any chance to support netstandard2.0?

Would you be willing to contribute a pull request?

buybackoff · 2021-02-26T07:11:55Z

> @buybackoff
>
> > Any chance to support netstandard2.0?
>
> Would you be willing to contribute a pull request?

Over this weekend, yes

gfoidl · 2021-02-26T10:08:34Z

@buybackoff here are some notes on your "polyfill" to keep in mind when you set up the PR.

- `#if HAS_INTRINSICS` isn't needed, as BitOperations.LeadingZeroCount does this implicitly and is supported from .NET Core 3.0 onwards
- .NET has a software fallback based on De Bruijn sequences that you could use as "inspiration" (license-wise it can be used)
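For reference, a De Bruijn-style fallback along the lines mentioned above could look like the sketch below. It is a minimal illustration for targets without `BitOperations` (such as netstandard2.0), not a copy of the .NET source: the lookup table and the multiplier `0x07C4ACDD` are the widely published constants for a 32-bit integer log2, and the class name `BitOps` is hypothetical.

```csharp
using System;

// Hypothetical netstandard2.0-friendly helper, for illustration only.
public static class BitOps
{
    // Widely published De Bruijn lookup table for 32-bit floor(log2(x)).
    private static readonly byte[] Log2DeBruijn =
    {
         0,  9,  1, 10, 13, 21,  2, 29,
        11, 14, 16, 18, 22, 25,  3, 30,
         8, 12, 20, 28, 15, 17, 24,  7,
        19, 27, 23,  6, 26,  5,  4, 31
    };

    // floor(log2(value)) for value != 0.
    private static int Log2(uint value)
    {
        // Smear the highest set bit into every lower position.
        value |= value >> 1;
        value |= value >> 2;
        value |= value >> 4;
        value |= value >> 8;
        value |= value >> 16;
        // The top five bits of the product index the table.
        return Log2DeBruijn[(int)(unchecked(value * 0x07C4ACDDu) >> 27)];
    }

    // Number of leading zero bits in a 64-bit value (64 for zero).
    public static int LeadingZeroCount(ulong value)
    {
        uint hi = (uint)(value >> 32);
        if (hi == 0)
        {
            uint lo = unchecked((uint)value);
            return lo == 0 ? 64 : 32 + (31 ^ Log2(lo));
        }
        return 31 ^ Log2(hi);
    }
}
```

Compared with the shift-and-test cascade in the earlier polyfill, this version has at most two branches and a single table lookup, which is the main reason the De Bruijn approach is commonly used as a software fallback.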

buybackoff · 2021-02-26T10:11:54Z

> @buybackoff here are some notes on your "polyfill" to keep in mind when you set up the PR.
>
> - `#if HAS_INTRINSICS` isn't needed, as BitOperations.LeadingZeroCount does this implicitly and is supported from .NET Core 3.0 onwards
> - .NET has a software fallback based on De Bruijn sequences that you could use as "inspiration" (license-wise it can be used)

Yes, that was some old code I had at hand and had always wanted to clean up to use the BitOperations internals instead.

lemire · 2021-02-26T15:47:44Z

I am going to merge this. It can be improved later.

LordJZ and others added 2 commits February 23, 2021 04:57

Utf8Parser benchmark #40

da338ee

Some prototype.

b69e4c1

CarlVerret self-assigned this Feb 23, 2021

gfoidl reviewed Feb 24, 2021

csFastFloat/FastDoubleParser.cs (outdated)

csFastFloat/FastDoubleParser.cs

csFastFloat/FastFloatParser.cs

mburbea reviewed Feb 25, 2021

csFastFloat/Structures/DecimalInfo.cs (outdated)

lemire added 2 commits February 25, 2021 12:08

Merge branch 'master' into dlemire/fast_utf8

d468efe

Switching to Unsafe.ReadUnaligned<ulong>(chars).

3f674e7

lemire added 6 commits February 25, 2021 12:23

Merge pull request #42 from LordJZ/utf8parser

d5c52b0

Utf8Parser benchmark

Merge branch 'master' into dlemire/fast_utf8

c005f89

Finishing merge with main branch.

f4db282

With benchmarks.

2f3ae69

Adding tests.

750f01e

Fixed typo

7099b9f

lemire marked this pull request as ready for review February 25, 2021 18:32

lemire added 2 commits February 25, 2021 13:35

Explaining the ASCII strings.

6f79ca7

Fixing exception handling.

c108400

lemire changed the title ~~Draft of a UTF8/ASCII-based parser~~ UTF8/ASCII-based parser Feb 25, 2021

lemire linked an issue Feb 25, 2021 that may be closed by this pull request

Compare to Utf8Parser #40

Closed

mburbea reviewed Feb 25, 2021

csFastFloat/FastDoubleParser.cs

CarlVerret removed their assignment Feb 25, 2021

lemire mentioned this pull request Feb 25, 2021

Consider using csFastFloat for faster FP parsing dotnet/runtime#48646

Closed

gfoidl reviewed Feb 25, 2021

Benchmark/FastParserBenchmark.cs

csFastFloat/Utils/Utils.cs

gfoidl mentioned this pull request Feb 26, 2021

Might fix issue 58 #62

Merged

lemire merged commit 73c31a9 into master Feb 26, 2021

buybackoff mentioned this pull request Feb 26, 2021

Support netstandard2.0 target #66

Merged
