Use vector operations in text decoding #272

Closed
Fuuzetsu opened this issue Dec 11, 2019 · 5 comments

Comments

@Fuuzetsu
Member

Fuuzetsu commented Dec 11, 2019

There's a fair amount of literature about decoding UTF-8 as well as conversion to UTF-16. It seems like cbits.c could probably benefit massively from a modern update.

See https://github.com/cyb70289/utf8 for example, and also the u8u16 library/paper, etc.
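
For readers who have not seen what a vector operation in this context looks like, here is a minimal sketch of the simplest case: skipping over a run of pure-ASCII bytes 16 at a time with SSE2. It is not taken from cbits.c or from the libraries linked above; the function name `ascii_prefix_len` is hypothetical, and a real decoder would also have to handle multi-byte sequences and validation.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper (an illustration, not part of text's cbits.c):
   returns the length of the leading run of ASCII bytes (< 0x80)
   in buf[0 .. len). */
static size_t ascii_prefix_len(const uint8_t *buf, size_t len)
{
    size_t i = 0;
#ifdef __SSE2__
    /* Examine 16 bytes per iteration. */
    while (i + 16 <= len) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Bit k of mask is the top bit of byte k, i.e. set iff byte k >= 0x80. */
        int mask = _mm_movemask_epi8(chunk);
        if (mask != 0) {
            /* Locate the first non-ASCII byte inside this chunk. */
            size_t k = 0;
            while (((mask >> k) & 1) == 0)
                k++;
            return i + k;
        }
        i += 16;
    }
#endif
    /* Scalar tail, and the whole input when SSE2 is unavailable. */
    while (i < len && buf[i] < 0x80)
        i++;
    return i;
}
```

A caller could copy or widen the first `ascii_prefix_len(buf, len)` bytes directly and fall back to the ordinary byte-by-byte decoder from that offset onwards.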

@Fuuzetsu
Member Author

After some more research, https://bitbucket.org/stgatilov/utf8lut seems quite good: it can validate and convert between UTF-8 and UTF-16/32, which would allow replacing a bunch of decoders/encoders in Text.

@ethercrow
Contributor

@Fuuzetsu ohai

Just saw this issue. Lately I've been applying SSE2 to obvious places in encoding and decoding:

I haven't yet read the state of the art in SIMD conversions, thanks for the links.

If we're to go into SSE4 and AVX, I wonder if GHC should support a -march=native flag so that users don't have to manually pick individual instruction set flags like -msse4.2 specifically for their machines.

@Fuuzetsu
Member Author

@Lysxia can you comment on why the issue was closed? Are we happy enough with the current state of vectorisation in modern text? If so, great; it would just be nice to have a couple of words.

@Lysxia
Contributor

Lysxia commented May 16, 2022

Sure thing. Given @ethercrow's SSE2 patches in text-1.2.5.0 and @Bodigrim's switch to UTF-8 (PR #365), which bundled the C++ simdutf library with text-2.0, it's fair to say that the situation regarding vectorization in text has significantly evolved in the last two years. I'm closing this issue based on that assessment.

I realize I tend to be trigger-happy in closing issues. I won't be opposed if you or @haskell/text maintainers would prefer this issue to remain open to keep track of long-term progress on vectorization in text. But my view is that it's a given that comparisons against the state of the art and further improvements in the area are always welcome. Hence I think issues have more utility when they are driven by more focused discussion and more actionable goals.

@Fuuzetsu
Member Author

Thank you. I don't have a problem with closing this. When I initially created the ticket, I think there was about zero vectorisation code and, as you mention, a bunch has been added. It should be much easier to continue the trend now, and we probably don't need an explicit issue anymore.
