Use vector operations in text decoding #272

Fuuzetsu · 2019-12-11T03:40:39Z

There's a fair amount of literature about decoding UTF8 as well as conversion to UTF16. It seems like cbits.c could probably benefit massively from a modern update.

See https://github.com/cyb70289/utf8 for example, also u8u16 library/paper &c.

Fuuzetsu · 2019-12-12T05:38:51Z

After some more research, https://bitbucket.org/stgatilov/utf8lut seems quite good: it can validate and convert between UTF-8 and UTF-16/32, which would allow replacing a bunch of decoders/encoders in Text.

ethercrow · 2020-10-08T20:28:02Z

@Fuuzetsu ohai

Just saw this issue. Lately I've been applying SSE2 to obvious places in encoding and decoding.

I haven't yet read the state of the art in SIMD conversions, thanks for the links.

If we're to go into SSE4 and AVX, I wonder if GHC should support a -march=native flag so that users don't have to manually pick individual instruction set flags like -msse4.2 specifically for their machines.
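For readers who want a concrete picture of what an SSE2 "ASCII fast path" of the kind mentioned above can look like, here is a minimal sketch in C. It is not the code from text's cbits or from any of the linked pull requests; the function name `ascii_prefix_len` is hypothetical and chosen only for illustration. The idea is to test 16 bytes at a time for a set high bit and fall back to a byte-by-byte loop for the tail, or for the whole input when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper, not part of the text library: returns the length of
   the leading run of ASCII bytes (values below 0x80) in buf[0..len). */
size_t ascii_prefix_len(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t i = 0;

#if defined(__SSE2__)
    /* Vector loop: inspect 16 bytes per iteration. */
    while (len - i >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* movemask gathers the top bit of every byte; a nonzero result
           means at least one byte in this chunk is not ASCII. */
        if (_mm_movemask_epi8(chunk) != 0)
            break; /* let the scalar loop locate the exact byte */
        i += 16;
    }
#endif

    /* Scalar loop: handles the tail and serves as the non-SSE2 fallback. */
    while (i < len && buf[i] < 0x80)
        i++;

    return i;
}
```

A decoder could copy or widen the first `ascii_prefix_len(buf, len)` bytes without any multi-byte handling and only then switch to the general UTF-8 state machine. SSE2 is part of the x86_64 baseline, which is why this particular path needs no extra instruction set flags there.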

Fuuzetsu · 2022-05-16T12:12:55Z

@Lysxia can you comment on why the issue was closed? Are we happy enough with the current state of vectorisation in modern text? If so, great, it would just be nice to have a couple of words.

Lysxia · 2022-05-16T12:55:49Z

Sure thing. Given @ethercrow's SSE2 patches in text-1.2.5.0 and @Bodigrim's switch to UTF-8 (PR #365) which bundled the C++ simdutf library with text-2.0, it's fair to say that the situation regarding vectorization in text has significantly evolved in the last two years. I'm closing this issue based on that assessment.

I realize I tend to be trigger-happy in closing issues. I won't be opposed if you or @haskell/text maintainers would prefer this issue to remain open to keep track of long-term progress on vectorization in text. But my view is that it's a given that comparisons against the state of the art and further improvements in the area are always welcome. Hence I think issues have more utility when they are driven by more focused discussion and more actionable goals.

Fuuzetsu · 2022-05-17T00:13:52Z

Thank you. I don't have a problem with closing this. When I initially created the ticket, I think there was about zero vectorisation code and, as you mention, a bunch has been added. It should be much easier to continue the trend now, and we probably don't need an issue explicitly anymore.

Lysxia added the feature request label Mar 7, 2021

This was linked to pull requests Mar 8, 2021

Use SSE2 in the x86_64 C version of decodeLatin1 #297

Closed

Use SSE2 in the ASCII fast path of decodeUtf8 #298

Closed

Use SSE2 in the x86_64 C version of encodeUtf8 #300

Closed

SSE2 patches for encoding and decoding functions #302

Merged

This was unlinked from pull requests Mar 8, 2021

Use SSE2 in the x86_64 C version of decodeLatin1 #297

Closed

Use SSE2 in the ASCII fast path of decodeUtf8 #298

Closed

Use SSE2 in the x86_64 C version of encodeUtf8 #300

Closed

SSE2 patches for encoding and decoding functions #302

Merged

Fuuzetsu mentioned this issue Aug 22, 2021

Switch internal representation to UTF8 #365

Merged

Lysxia closed this as completed May 16, 2022
