UTF8/ASCII-based parser #49
Conversation
@mburbea and @gfoidl: you guys should chime in if you can. The idea here is that if we know we are on a little-endian system, we can do some neat optimizations. (The same optimizations are possible on big-endian systems, but they may require different code paths.) Given that big-endian systems are... rare... to say the least... this seems like a very fruitful thing to do. It is also very likely that my PR leaves performance on the table.
Happy to chime in -- need to familiarize myself with the code a bit first 😃
@gfoidl Great. Keep in mind that it is a draft. It is not something I would ever merge into the main branch. It is meant to spur discussion.
This is a great starting point. I just took a quick flight over the code... I need to check it in more depth tomorrow (it's late here).
(Expect slow responses tomorrow; I'll be mostly out of the office.)
I have added span-based public functions so that one could reasonably test these functions.
Tests and benchmarks are in.
This is now sensible and ready for review. It includes @LordJZ's UTF-8 benchmark, and we can now compare a function that parses UTF-8 inputs against one that parses UTF-16, so the comparison is fair. Of course, the real benefit of UTF-8 parsing is that you can "parse in place" if your input was UTF-8 to begin with... instead of having to convert it.
Generally speaking, and not just for @gfoidl: pull requests on this branch are invited. I am not assuming that this is the cleanest or fastest implementation.
Well, from what I've seen the code looks really good 👍 and I couldn't spot anything to improve (for the SWAR there's a separate issue).
@gfoidl Thanks for the review!
Thanks for tagging me and for adding UTF-8 support so fast! 👍 Any chance of supporting netstandard2.0? I wanted to run the Utf8Json benchmark, but had to use the "polyfill" below. Also, is it possible to support an API that takes the starting pointer and returns the read count? Such as:
E.g., in Utf8Json we know where to expect a double: we start at the first non-whitespace byte and consume bytes while they could be part of a double. We do not know the length in advance, and computing it before parsing the double would be quite expensive.
The "polyfill":
I have opened an issue... It should actually be quite easy to return a pointer to the end of the pattern.
Would you be willing to contribute a pull request?
Over this weekend, yes.
@buybackoff for your "polyfill", here are some notes you can keep in mind when you set up the PR.
Yes, that was some old code I had at hand and always wanted to clean up to use
I am going to merge this. It can be improved later.
We add support inside the library for UTF-8 parsing with some accompanying optimizations.
Performance results:
cc @LordJZ @buybackoff @CarlVerret
Would fix #43
Related to: #42