Unicode support #44
I would stick to UTF-8 encoded strings and just implement a |
that sounds like a good idea to me, assuming by array you mean an iterator. |
is this sane? https://github.com/adricoin2010/UTF8-Iterator looks suspiciously simple |
Not so simple, I prefer this one: https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ Scroll down to the bottom, there's a better implementation than the one on top. |
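For discussion, here is a minimal sketch of what a codepoint iterator over UTF-8 bytes could look like in C. It is a plain branch-based decoder, not the table-driven DFA from the link above, and the names `Utf8Iter`, `utf8_iter`, `utf8_next` and `utf8_count` are made up for this example, not part of any existing API. It rejects overlong encodings, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator state; all names here are illustrative only. */
typedef struct {
    const uint8_t *mem; /* UTF-8 encoded bytes */
    size_t len;         /* length in bytes */
    size_t pos;         /* byte offset of the next sequence to decode */
    uint32_t val;       /* last successfully decoded codepoint */
} Utf8Iter;

static Utf8Iter utf8_iter(const char *s, size_t len) {
    Utf8Iter it = { (const uint8_t *)s, len, 0, 0 };
    return it;
}

/* Decodes the next codepoint into it->val and advances it->pos.
 * Returns false at end of input or on malformed input; in the
 * malformed case it->pos is left pointing at the bad sequence. */
static bool utf8_next(Utf8Iter *it) {
    assert(it != NULL);
    if (it->pos >= it->len) return false;

    uint8_t b0 = it->mem[it->pos];
    uint32_t cp;   /* codepoint being assembled */
    size_t need;   /* number of continuation bytes expected */
    uint32_t min;  /* smallest value allowed for this length */

    if (b0 < 0x80)                { cp = b0;        need = 0; min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; need = 1; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; need = 2; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; need = 3; min = 0x10000; }
    else return false; /* stray continuation byte or invalid lead byte */

    if (it->len - it->pos - 1 < need) return false; /* truncated sequence */

    for (size_t i = 1; i <= need; i++) {
        uint8_t b = it->mem[it->pos + i];
        if ((b & 0xC0) != 0x80) return false; /* not a continuation byte */
        cp = (cp << 6) | (b & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond Unicode. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

    it->pos += need + 1;
    it->val = cp;
    return true;
}

/* Counts codepoints; returns (size_t)-1 if the input is malformed. */
static size_t utf8_count(const char *s, size_t len) {
    Utf8Iter it = utf8_iter(s, len);
    size_t n = 0;
    while (utf8_next(&it)) n++;
    return it.pos == len ? n : (size_t)-1;
}
```

With this shape, a `len()` that counts codepoints is a full pass over the bytes, which is one reason to keep byte length and codepoint count as separate operations.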
ok i'm implementing this now, but i need everyone to comment on the api and its features. here's my first draft.
pub fn main() {
    /// construct a new string reference from a borrowed null terminated char*
    let s = string::from_cstr("你好世界");
    /// len() counts codepoints or scalar values?
    err::assert(s.len() == 4);
    /// iterator over codepoints
    for let mut it = s.iter(); it.next(); {
        /// one codepoint is 4 byte long
        u32 ch = it.val;
    }
    /// note the lack of char indexing.
    /// this is not possible
    u32 ch = s[2];
    // but you can convert it to a slice for byte indexing
    let sl = s.as_utf8_slice();
    u8 meh = sl.mem[2];
    // or copy to a vec
    new[item = u32, +100] v = s.to_vec();
    u32 bleh = v.items[2];
    /// return string as null terminated utf8 char*
    printf("%s", s.cstr());
    /// concat two strings using a string buffer
    new[+1000] b = string::buffer::make();
    b.append(string::from_cstr("hello world"));
    b.append(string::from_cstr(" "));
    b.append(string::from_cstr("你好世界"));
    /// borrow a buffer as str
    let x = b.as_str();
    /// split
    usize mut iterator = 0;
    let s1 = x.split(" ", &iterator);
    let s2 = x.split(" ", &iterator);
    /// compare
    err::assert(!s1.eq(s2));
    /// substrings compares
    err::assert(s2.starts_with(string::from_cstr("你")));
}
we MIGHT also completely replace char* with string::String some day, removing the explicit calls to from_cstr, but not until we're sure string is ready |
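To make the split, compare and prefix parts of the draft concrete, here is a rough C sketch of how they could work on a pointer-plus-length string. The `Str` type and the `str_*` functions are hypothetical names for this example, and the cursor semantics (each call returns the next piece and moves the cursor past the separator) are an assumption about the draft, not something confirmed in this thread. Byte-wise search and comparison are enough here because a valid UTF-8 sequence never matches in the middle of another character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical borrowed string: pointer plus byte length. */
typedef struct {
    const char *mem;
    size_t len;
} Str;

static Str str_from_cstr(const char *s) {
    Str r = { s, strlen(s) };
    return r;
}

/* Equality is a byte comparison; no normalization is applied. */
static bool str_eq(Str a, Str b) {
    return a.len == b.len && memcmp(a.mem, b.mem, a.len) == 0;
}

static bool str_starts_with(Str s, Str prefix) {
    return prefix.len <= s.len && memcmp(s.mem, prefix.mem, prefix.len) == 0;
}

/* Returns the piece from *cursor up to the next separator and moves
 * *cursor past that separator. When no separator is left, returns the
 * remainder and sets *cursor to s.len. */
static Str str_split(Str s, Str sep, size_t *cursor) {
    assert(cursor != NULL && sep.len > 0);
    size_t start = *cursor < s.len ? *cursor : s.len;
    size_t i = start;

    /* Scan forward until the separator matches at offset i. */
    while (i + sep.len <= s.len && memcmp(s.mem + i, sep.mem, sep.len) != 0) i++;

    if (i + sep.len > s.len) { /* no more separators */
        Str rest = { s.mem + start, s.len - start };
        *cursor = s.len;
        return rest;
    }

    Str piece = { s.mem + start, i - start };
    *cursor = i + sep.len;
    return piece;
}
```

Under these assumed semantics, splitting "hello world 你好世界" on " " yields "hello", then "world", then "你好世界" on the third call.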
Give me a few |
This API looks pretty straightforward and absolutely needed. I am happy that we took the approach to rewrite the My only (unrelated) question is:
// or copy to a vec
new[item = u32, +100] v = s.to_vec();
Can we do this now? ( |
oh right, i actually forgot that's broken, thanks for the reminder. opened #123 |
Why is |
technically yes, but string manipulation behaves differently on unicode vs bytes. having to prefix all functions with unicode_split etc seems awkward, and the type is effectively free as it's just emitted as a fat pointer to C. also, slice holds any arbitrary binary data, while string holds null terminated utf8. this distinction is useful in api contracts and automatic mapping to other type systems |
actually i wonder if we can use attached type aliases to implement it as a specialized slice: type String = slice::Slice[nullterm(self.mem), utf8(self.mem)]; edit: never mind, we would still have to prefix utf8 specific functions, which is weird. but String can just inherit from slice by the first-member rule, so you can use it as if it were a slice. |
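The first-member rule mentioned above can be illustrated in the emitted C roughly like this. This is only a sketch: the struct layouts and the names `Slice`, `String`, `string_as_slice` and `slice_at` are assumptions for the example, not what the compiler actually generates. C guarantees that a pointer to a struct, suitably converted, points to its first member, so a `String` whose first member is a `Slice` can be passed to slice functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fat pointer over arbitrary binary data. */
typedef struct {
    const uint8_t *mem;
    size_t size;
} Slice;

/* Hypothetical string type: same layout as Slice, but by contract the
 * bytes are valid UTF-8 followed by a NUL terminator. */
typedef struct {
    Slice slice; /* must stay the first member */
} String;

static String string_from_cstr(const char *s) {
    String r = { { (const uint8_t *)s, strlen(s) } };
    return r;
}

/* Works on any slice, with bounds checked byte indexing. */
static uint8_t slice_at(const Slice *sl, size_t i) {
    assert(i < sl->size);
    return sl->mem[i];
}

/* A pointer to String is also a valid pointer to its first member. */
static const Slice *string_as_slice(const String *s) {
    return (const Slice *)s;
}
```

The conversion only goes one way: a `Slice` carries no UTF-8 or NUL-termination guarantee, so turning one into a `String` would need validation.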
I'd recommend using Julia's utf8proc, which is reasonably lightweight and supports UTF-8 decoding and encoding (from and to codepoints) as well as other features that are definitely needed for proper Unicode handling, like Unicode normalization and grapheme clustering. |
Please implement unicode string support.
In C++, std::wstring is a wrapper for wchar_t*, similar to std::string, which is a wrapper for char*. wchar_t is defined in C as well [1]. A similar API is Glib::ustring from glibmm (C++), which stores UTF-8 internally.
The major difference from std::string is that each character is a wchar_t, which is 4 bytes on most Unix-like platforms (2 bytes on Windows), rather than a 1-byte char.