This repository has been archived by the owner on Jun 17, 2022. It is now read-only.

Unicode support #44

Open
benman1 opened this issue Mar 30, 2020 · 12 comments
Labels
enhancement New feature or request need-realworld-feedback feedback from users needed on how the decision would affect real world use

Comments

@benman1

benman1 commented Mar 30, 2020

Please implement unicode string support.

In C++, std::wstring is to wchar_t what std::string is to char: a container that owns a buffer of that character type. wchar_t is defined in C as well [1]. A similar API is Glib::ustring, from glibmm (the C++ bindings for GLib), which stores UTF-8.

The major difference to std::string is that each element is a wchar_t, which is typically 4 bytes wide (2 on Windows), rather than 1.
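To make the size difference concrete, here is a small C sketch that stores the same four characters once as UTF-8 bytes and once as a wide string. The helper names (`utf8_bytes`, `wide_units`, `wide_bytes`) are made up for this illustration, and the wide-string result depends on the platform's `sizeof(wchar_t)`.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* "你好世界" spelled out as UTF-8 bytes, so the source file encoding does not matter. */
static const char utf8_text[] = "\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c";

/* The same four characters as a wide string (hypothetical example data). */
static const wchar_t wide_text[] = L"\u4f60\u597d\u4e16\u754c";

/* Bytes used by the UTF-8 form, excluding the terminator. */
size_t utf8_bytes(void) { return strlen(utf8_text); }

/* Number of wchar_t units in the wide form, excluding the terminator. */
size_t wide_units(void) { return wcslen(wide_text); }

/* Bytes used by the wide form; this varies with sizeof(wchar_t). */
size_t wide_bytes(void) { return wide_units() * sizeof(wchar_t); }
```

The UTF-8 form takes 3 bytes per character here, while the wide form takes `sizeof(wchar_t)` bytes per character regardless of which character it is.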

@aep aep added the enhancement New feature or request label Apr 1, 2020
@aep aep added the bounty sponsorship by devguard available label Jul 22, 2020
@jacereda

jacereda commented Sep 9, 2020

I would stick to UTF-8 encoded strings and just implement a glyphs function for getting an array of Unicode code points.

@aep
Collaborator

aep commented Sep 9, 2020

that sounds like a good idea to me, assuming by array you mean an iterator.
is there a portable C library for doing that?
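For reference, a code point iterator over UTF-8 can be sketched in portable C without any library. This is only an illustration, not the decoder from either project linked in this thread; `Utf8Iter`, `utf8_iter` and `utf8_next` are hypothetical names. It rejects overlong forms, surrogates and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const unsigned char *pos;  /* next byte to read */
    const unsigned char *end;  /* one past the last byte */
    uint32_t val;              /* code point produced by the last successful next */
} Utf8Iter;

Utf8Iter utf8_iter(const char *s, size_t len) {
    Utf8Iter it = { (const unsigned char *)s, (const unsigned char *)s + len, 0 };
    return it;
}

/* Advance by one code point. Returns false at end of input or on malformed UTF-8. */
bool utf8_next(Utf8Iter *it) {
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (it->pos >= it->end) return false;

    unsigned char lead = it->pos[0];
    uint32_t cp;
    size_t need;
    if (lead < 0x80)                { cp = lead;        need = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; need = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; need = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; need = 4; }
    else return false;              /* stray continuation or invalid lead byte */

    if ((size_t)(it->end - it->pos) < need) return false;  /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((it->pos[i] & 0xC0) != 0x80) return false;     /* not a continuation byte */
        cp = (cp << 6) | (it->pos[i] & 0x3Fu);
    }
    if (cp < min_cp[need] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;               /* overlong, out of range, or surrogate */

    it->val = cp;
    it->pos += need;
    return true;
}
```

A caller loops with `while (utf8_next(&it)) { use(it.val); }` and can tell a clean end from an error by checking `it.pos == it.end` afterwards.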

@aep
Collaborator

aep commented Sep 9, 2020

is this sane? https://github.com/adricoin2010/UTF8-Iterator

looks suspiciously simple

@jacereda

jacereda commented Sep 9, 2020

Not so simple, I prefer this one:

https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Scroll down to the bottom, there's a better implementation than the one on top.

@aep
Collaborator

aep commented Sep 15, 2020

ok i'm implementing this now, but i need everyone to comment on the api and its features.

here's my first draft.

  • string::String is similar to slice::Slice, except there's no MutSlice and the iterators are on utf8 codepoints rather than bytes
  • string::buffer::StringBuffer is similar to buffer::Buffer and autocasts to String

pub fn main() {

   /// construct a new string reference from a borrowed null terminated char*
   let s = string::from_cstr("你好世界");

   /// len() counts codepoints or scalar values?
   err::assert(s.len() == 4);

   /// iterator over codepoints
   for let mut it = s.iter(); it.next(); {
       /// each codepoint is returned as a 4 byte u32
       u32 ch = it.val;
   }

   /// note the lack of char indexing.
   /// this is not possible:
   /// u32 ch = s[2];

   // but you can convert it to a slice for byte indexing
   let sl = s.as_utf8_slice();
   u8 meh = sl.mem[2];

   // or copy to a vec
   new[item = u32, +100] v = s.to_vec();
   u32 bleh = v.items[2];

   /// return string as null terminated utf8 char*
   printf("%s", s.cstr());

   /// concat two strings using a string buffer
   new[+1000] b = string::buffer::make();
   b.append(string::from_cstr("hello world"));
   b.append(string::from_cstr("  "));
   b.append(string::from_cstr("你好世界"));
   
   /// borrow a buffer as str
   let x = b.as_str();   

   /// split
   usize mut iterator = 0;
   let s1 = x.split(" ", &iterator);
   let s2 = x.split(" ", &iterator);

   /// compare
   err::assert(!s1.eq(s2));

   /// substring comparison
   err::assert(s2.starts_with(string::from_cstr("")));

}
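On the open question in the draft about what len() counts: in well-formed UTF-8, code points and scalar values coincide, because surrogates cannot be encoded. A minimal C sketch of a code point count follows, assuming the input has already been validated; `utf8_len` is a hypothetical name, not part of the draft API.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated string that is assumed to be valid UTF-8.
   Every code point has exactly one byte that is not a continuation byte (10xxxxxx). */
size_t utf8_len(const char *s) {
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    }
    return n;
}
```

Note that this is still not a count of user-perceived characters: "e" followed by the combining acute accent U+0301 is two code points but displays as one character.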

we MIGHT also completely replace char* with string::String some day, removing the explicit calls to from_cstr, but not until we're sure string is ready

@aep aep added need-realworld-feedback feedback from users needed on how the decision would affect real world use and removed bounty sponsorship by devguard available labels Sep 15, 2020
@jwerle
Member

jwerle commented Sep 15, 2020

Give me a few

@jwerle
Member

jwerle commented Sep 15, 2020

This API looks pretty straightforward and absolutely needed. I am happy that we took the approach to rewrite the string module with utf8 in mind!

My only (unrelated) question is:

// or copy to a vec
new[item = u32, +100] v = s.to_vec();

Can we do this now? (new constructor from an "instance" method)

@aep
Collaborator

aep commented Sep 15, 2020

oh right, i actually forgot that's broken, thanks for the reminder

opened #123

@jacereda

Why is string needed? Wouldn't a uiter() for iterating over unicode code points on a slice suffice?

@aep
Collaborator

aep commented Sep 15, 2020

technically yes, but string manipulation behaves differently on unicode vs bytes. having to prefix all functions with unicode_split etc seems awkward, and the type is effectively free as it's just emitted as a fat pointer to C

also slice holds any arbitrary binary data, string holds null terminated utf8. this distinction is useful in api contracts and automatic mapping to other type systems
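One example of an operation that behaves differently on bytes and on UTF-8 is truncation: cutting a byte slice at an arbitrary offset can split a multi-byte sequence. A rough C sketch, assuming valid UTF-8 input; `utf8_truncate` is a hypothetical helper, not something from the draft API.

```c
#include <stddef.h>

/* Return the largest length <= max_bytes at which s[0..len) can be cut
   without splitting a code point. Assumes s holds valid UTF-8. */
size_t utf8_truncate(const char *s, size_t len, size_t max_bytes) {
    if (len <= max_bytes) return len;
    size_t cut = max_bytes;
    /* If the byte at the cut is a continuation byte (10xxxxxx), the cut falls
       inside a sequence, so back up to the lead byte of that sequence. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80) cut--;
    return cut;
}
```

A plain byte slice would simply return `max_bytes`, which is exactly the kind of difference a separate string type can encode in its contract.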

@aep
Collaborator

aep commented Sep 15, 2020

actually i wonder if we can use attached type aliases to implement it as specialized slice.

type String = slice::Slice[nullterm(self.mem), utf8(self.mem)];

edit: never mind, still would have to prefix utf8 specific functions, which is weird. but String can just inherit from slice by first-member rule, so you can use it as if it was a slice.
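For readers unfamiliar with the first-member rule, this is roughly how it could look in the emitted C. It is a guess at the shape, not the actual generated code; the `Slice` and `String` layouts and all function names are assumptions made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout of a borrowed byte slice: a fat pointer. */
typedef struct {
    const unsigned char *mem;
    size_t size;
} Slice;

/* Assumed layout of a string: a slice plus the promise (not checked in this
   sketch) that mem is NUL-terminated, valid UTF-8. */
typedef struct {
    Slice slice;  /* must stay the first member */
} String;

/* Borrow a NUL-terminated char* as a String; size excludes the terminator. */
String string_from_cstr(const char *cstr) {
    String s = { { (const unsigned char *)cstr, strlen(cstr) } };
    return s;
}

/* C guarantees that a pointer to a struct, suitably converted, points to its
   first member, so a String can be passed wherever a Slice is expected. */
const Slice *string_as_slice(const String *s) {
    return (const Slice *)s;
}

/* A function written against plain slices, usable on strings via the cast. */
size_t slice_count_byte(const Slice *sl, unsigned char b) {
    size_t n = 0;
    for (size_t i = 0; i < sl->size; i++) {
        if (sl->mem[i] == b) n++;
    }
    return n;
}
```

The conversion only goes one way: treating an arbitrary Slice as a String would skip the NUL-termination and UTF-8 guarantees.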

@sternenseemann

I'd recommend using Julia's utf8proc, which is reasonably lightweight, supports UTF-8 decoding and encoding (from and to code points), and provides other features that are definitely needed for proper Unicode handling, like normalization and grapheme clustering.
