Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8/ASCII-based parser #49

Merged
merged 12 commits into from
Feb 26, 2021
Merged

UTF8/ASCII-based parser #49

merged 12 commits into from
Feb 26, 2021

Conversation

lemire
Copy link
Collaborator

@lemire lemire commented Feb 23, 2021

We add support inside the library for UTF-8 parsing with some accompanying optimizations.

Performance results:

Method FileName Mean Error StdDev Min Ratio MFloat/s MB/s
Utf8Parser data/canada.txt 21.605 ms 0.0656 ms 0.0512 ms 21.541 ms 0.84 5.16 96.93
'FastFloat.ParseDouble() - UTF8' data/canada.txt 4.624 ms 0.0026 ms 0.0021 ms 4.619 ms 0.18 24.06 452.01
FastFloat.ParseDouble() data/canada.txt 4.878 ms 0.0085 ms 0.0075 ms 4.869 ms 0.19 22.82 428.87
'ParseNumberString() only' data/canada.txt 2.026 ms 0.0086 ms 0.0067 ms 2.022 ms 0.08 54.95 1032.54
Double.Parse() data/canada.txt 25.709 ms 0.0482 ms 0.0403 ms 25.652 ms 1.00 4.33 81.40
Utf8Parser data/mesh.txt 2.761 ms 0.0016 ms 0.0015 ms 2.759 ms 0.54 26.46 224.70
'FastFloat.ParseDouble() - UTF8' data/mesh.txt 1.359 ms 0.0038 ms 0.0032 ms 1.357 ms 0.27 53.82 457.00
FastFloat.ParseDouble() data/mesh.txt 1.504 ms 0.0009 ms 0.0008 ms 1.503 ms 0.30 48.59 412.60
'ParseNumberString() only' data/mesh.txt 1.060 ms 0.0165 ms 0.0155 ms 1.037 ms 0.21 70.39 597.71
Double.Parse() data/mesh.txt 5.074 ms 0.0067 ms 0.0056 ms 5.067 ms 1.00 14.41 122.36

cc @LordJZ @buybackoff @CarlVerret

Would fix #43

Related to: #42

@CarlVerret CarlVerret self-assigned this Feb 23, 2021
@lemire
Copy link
Collaborator Author

lemire commented Feb 24, 2021

@mburbea and @gfoidl : you guys should chime in if you can. The idea here is that if we know that we have a little endian system, we can do some neat optimizations. (The same optimizations are possible under big endian systems, but they may require different code paths.) Given that big endian systems are... rare... to say the least... this seems like a very fruitful thing to do.

It is also very likely that my PR leaves performance on the table.

@gfoidl
Copy link
Contributor

gfoidl commented Feb 24, 2021

you guys should chime in if you can

Happy to chime in -- need to familiarize with the code a bit first 😃

@lemire
Copy link
Collaborator Author

lemire commented Feb 24, 2021

@gfoidl Great. Keep in mind that it is a draft. It is not something I would ever merge into the main branch. It is to spur discussions.

Copy link
Contributor

@gfoidl gfoidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great starting-point. Just had fligh over the code...need to check more in depth tomorrow (it's late here).
(I assume to respond slow tomorrow, being mostly out of office).

@lemire
Copy link
Collaborator Author

lemire commented Feb 25, 2021

I have added span-based public functions so that one could reasonably test these functions.

@lemire
Copy link
Collaborator Author

lemire commented Feb 25, 2021

Tests and benchmarks are in.

@lemire lemire marked this pull request as ready for review February 25, 2021 18:32
@lemire
Copy link
Collaborator Author

lemire commented Feb 25, 2021

@CarlVerret @gfoidl @mburbea

This is now sensible and ready for review.

It includes @LordJZ 's UTF-8 benchmark and now we can compare a function that parses UTF-8 inputs with a function that compares UTF8 so it is fair.

Of course, the real benefit of UTF-8 parsing is that you can "parse in place" if your input was UTF-8 to begin with... instead of having to convert it.

@lemire lemire changed the title Draft of a UTF8/ASCII-based parser UTF8/ASCII-based parser Feb 25, 2021
@lemire lemire linked an issue Feb 25, 2021 that may be closed by this pull request
@lemire
Copy link
Collaborator Author

lemire commented Feb 25, 2021

Generally speaking and not just for @gfoidl, pull requests on this branch are invited. I am not assuming that this is the cleanest or fastest implementation.

@gfoidl
Copy link
Contributor

gfoidl commented Feb 25, 2021

Well, what I've seen this code looks really good 👍 and I couldn't spot anything to improve (for the SWAR there's a separate issue).
Nice.

@lemire
Copy link
Collaborator Author

lemire commented Feb 25, 2021

@gfoidl Thanks for the review!

@buybackoff
Copy link
Contributor

buybackoff commented Feb 25, 2021

Thanks for tagging and adding UTF8 support so fast! 👍

Any chance to support netstandard2.0? I wanted to run Utf8Json bench, but BitOperations class is not present on netstandard2.0. Also intrinsics usage is not #ifdef-ed in Utils.

I used a "polyfill" below.

Also, is it possible to support an API that takes the starting pointer and returns read count? Such as:

Double ParseNumber(byte* first, int maxLength, out int readCount, chars_format expectedFormat = chars_format.is_general, byte decimal_separator = (byte)'.')

Double ParseNumber(ReadOnlySpan<byte> span, out int readCount, chars_format expectedFormat = chars_format.is_general, byte decimal_separator = (byte)'.')

E.g. in Utf8Json, we know where to expect a double, start at first non-whitespace byte, and consume bytes while they could be a part of a double. We do not know the length in advance. Doing the length calculation before double parsing will be quite expensive.

Utf8Parser.TryParse also returns bytesConsumed. bool TryParse... overloads are also very useful to avoid exception handling when exceptions are expected.

The "polyfill":

  <PropertyGroup Condition="'$(TargetFramework)'=='net5.0' OR '$(TargetFramework)'=='netcoreapp3.1'">
    <DefineConstants>$(DefineConstants);HAS_INTRINSICS;</DefineConstants>
  </PropertyGroup>

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static int LeadingZeroCount(long value)
        {
#if HAS_INTRINSICS
            if (X86.Lzcnt.X64.IsSupported)
                return (int) X86.Lzcnt.X64.LeadingZeroCount((ulong) value);

            if (Arm.ArmBase.Arm64.IsSupported)
                return Arm.ArmBase.Arm64.LeadingZeroCount((ulong) value);
#endif
#if NET5_0
            return System.Numerics.BitOperations.LeadingZeroCount((ulong) value);
#else
            unchecked
            {
                if (value == 0L)
                {
                    return 64;
                }

                int n = 1;
                if ((long) ((ulong) value >> 32) == 0)
                {
                    n += 32;
                    value <<= 32;
                }

                if ((long) ((ulong) value >> 48) == 0)
                {
                    n += 16;
                    value <<= 16;
                }

                if ((long) ((ulong) value >> 56) == 0)
                {
                    n += 8;
                    value <<= 8;
                }

                if ((long) ((ulong) value >> 60) == 0)
                {
                    n += 4;
                    value <<= 4;
                }

                if ((long) ((ulong) value >> 62) == 0)
                {
                    n += 2;
                    value <<= 2;
                }

                n -= (int) ((ulong) value >> 63);
                return n;
            }
#endif
        }

@pkese
Copy link

pkese commented Feb 26, 2021

Also, is it possible to support an API that takes the starting pointer and returns read count?

^^^^^^^
+1 !!! That would be extremely useful.

@lemire
Copy link
Collaborator Author

lemire commented Feb 26, 2021

@pkese @buybackoff

I have opened an issue...

#61

It should actually be quite easy to return a pointer to the end of the pattern.

@lemire
Copy link
Collaborator Author

lemire commented Feb 26, 2021

@buybackoff

Any chance to support netstandard2.0?

Would you be willing to contribute a pull request?

@buybackoff
Copy link
Contributor

@buybackoff

Any chance to support netstandard2.0?

Would you be willing to contribute a pull request?

Over this weekend, yes

@gfoidl
Copy link
Contributor

gfoidl commented Feb 26, 2021

@buybackoff for your "polyfill" some notes you can keep in mind when you setup the PR.

  • #if HAS_INTRINSICS isn't needed, as BitOperations.LeadingZeroCount does this implicitely and supported .NET 3.0 onwards
  • .NET has a software fallback based on De Bruijn sequences that you could use as "inspiration" (license-wise it can be used)

@buybackoff
Copy link
Contributor

@buybackoff for your "polyfill" some notes you can keep in mind when you setup the PR.

  • #if HAS_INTRINSICS isn't needed, as BitOperations.LeadingZeroCount does this implicitely and supported .NET 3.0 onwards
  • .NET has a software fallback based on De Bruijn sequences that you could use as "inspiration" (license-wise it can be used)

Yes, that was some old code I had at hands and always wanted to cleanup to use BitOperations internals instead.

@gfoidl gfoidl mentioned this pull request Feb 26, 2021
@lemire
Copy link
Collaborator Author

lemire commented Feb 26, 2021

I am going to merge this. It can be improved later.

@lemire lemire merged commit 73c31a9 into master Feb 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide UTF-8 based parser Compare to Utf8Parser
7 participants