Tcl bindings for ICU, providing enhanced Unicode support. It mostly
tries to provide things that aren't in core Tcl, but there is some
overlap between the core string functions and the icu::string ones.
Much more useful with Tcl 8.7.
Requires Tcl 8.6, the ICU libraries and headers, and Critcl (at least for now; this might turn into a pure C extension later).
Run tclsh build.tcl [LIBRARY_PATH], possibly with sudo. If a path is
not given, uses info library.
MIT.
package require icu
Package version.
Version of ICU being used.
Version of Unicode being used.
Ensemble with various character related commands. Unless otherwise
specified, arguments are numeric codepoints. Also see icu::string is
for classification functions that act on characters.
Returns the codepoint for the given character.
Returns the character corresponding to the given codepoint.
Returns the name of the codepoint.
Returns the codepoint corresponding to the given name, or -1 on unknown names.
Returns the script the given codepoint belongs to.
Returns the upper-case version of the character if there is one, otherwise the character.
Returns the lower-case version of the character if there is one, otherwise the character.
Returns the title-case version of the character if there is one, otherwise the character.
Tests properties of a single codepoint. Also see icu::string is.
Is this codepoint assigned a character?
Does the codepoint have the Bidi_Mirrored property?
Return the mirror codepoint of the character, or the character if it doesn't have one.
Return the character's paired bracket codepoint, or itself if there isn't one.
Returns the decimal digit value of a decimal digit character, or -1.
Returns the decimal digit value of the character in the specified radix (Defaults to 10), or -1. Radix can be between 2 and 36.
Returns the floating-point value of the character, or NaN.
Ensemble with various string-related commands. Unlike the ones in
::string, these handle UTF-16 strings with characters outside of the
BMP correctly. As a consequence, most of them have O(N) complexity.
Anything that refers to indexes uses codepoint indexes, not code unit
indexes like the core string functions. These are the same for
characters in the BMP, but not for ones outside of it. Don't mix and
match between the two ensembles.
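The difference between the two kinds of index only shows up once a string contains a character outside the BMP. The sketch below is a rough illustration, not a definitive reference: it assumes icu::string index takes its arguments in the same order as the core string index (string, then index), which this document does not state. The icu call is guarded so the script still runs where the package is not installed, and no output is shown because the core results depend on the Tcl version and build.

```tcl
# A string with one character outside the BMP (U+1F600) in the middle.
set s "a\U0001F600b"

# Core commands count code units, so on UTF-16 based builds index 2
# may land inside a surrogate pair instead of on "b".
puts "core string length:  [string length $s]"
puts "core string index 2: [string index $s 2]"

# Guarded: only attempted when the icu package can be loaded.
if {![catch {package require icu}]} {
    # Assumption: argument order is (string, index), as in core Tcl.
    if {[catch {icu::string index $s 2} ch]} {
        puts "icu::string index failed: $ch"
    } else {
        puts "icu::string index 2: $ch"
    }
}
```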
Returns the number of codepoints in the string.
Compares s1 and s2 in code point order, returning a number less than
0, 0, or greater than 0 if s1 is less than, equal to, or greater than
s2. -nocase does case-insensitive comparison, and -exclude-special-i
special-cases the Turkish dotted I (U+0130) and dotless i (U+0131)
characters (only meaningful with -nocase).
The -equivalence option does Unicode equivalence comparison. Canonical
equivalence between two strings is defined as their normalized forms
(NFD or NFC) being identical.
For locale-specific string comparison, see icu::collator.
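To make the -equivalence option more concrete, here is a minimal sketch that compares a precomposed and a decomposed spelling of the same word. It assumes the options come before the two strings, as with the core string compare; that ordering is not stated here, so treat it as an assumption. The icu calls are wrapped in catch so the script still runs if the package is missing or the assumed syntax is wrong.

```tcl
# Two spellings of the same text: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
set composed   "caf\u00e9"
set decomposed "cafe\u0301"

# The core command compares the stored characters, so these differ.
puts "core string equal: [string equal $composed $decomposed]"

# Guarded: only attempted when the icu package can be loaded.
if {![catch {package require icu}]} {
    # Assumption: options precede the two strings.
    foreach opts {{} -equivalence} {
        if {[catch {icu::string compare {*}$opts $composed $decomposed} r]} {
            puts "icu::string compare $opts failed: $r"
        } else {
            puts "icu::string compare $opts -> $r"
        }
    }
}
```

With -equivalence the two spellings would be expected to compare as equal, since their normalized forms are identical.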
Returns 1 if the two strings are equal, 0 if not. Options are the same
as for compare.
Return the character at the ith code point of s. If the string is not
that long, returns an empty string.
Returns the substring of s starting with index first and ending with
index last. If first and last are the same, it's equivalent to index.
Returns the index of the first occurrence of needleString in haystackString, or -1 if not found.
Returns the index of the last occurrence of needleString in haystackString, or -1 if not found.
Returns the index of the first character in s that is also in chars. Returns -1 if none are.
Returns the index of the first character in s that is not in chars. Returns -1 if all are.
Returns an upper-cased version of s, according to the optional locale rules. If the locale is an empty string, uses the root locale. If no locale is given, uses the default one.
Returns a lower-cased version of s, according to the optional locale rules. If the locale is an empty string, uses the root locale. If no locale is given, uses the default one.
Returns a title-cased version of s, according to the optional locale rules. If the locale is an empty string, uses the root locale. If no locale is given, uses the default one.
Returns a case-folded version of s. If -exclude-special-i is given,
excludes mappings for the Turkish dotted I (U+0130) and dotless i
(U+0131), etc.
Returns all its arguments concatenated together and normalized in NFC.
Returns all its arguments concatenated together and normalized in NFD.
Returns all its arguments concatenated together and normalized in NFKC.
Returns all its arguments concatenated together and normalized in NFKD.
Returns true if all codepoints of the string match some condition. An
empty string is true unless -strict is given, in which case it's false.
Is the string in NFC normalized form?
Is the string in NFD normalized form?
Is the string in NFKC normalized form?
Is the string in NFKD normalized form?
Is every codepoint in the string titlecased?
Is every codepoint in the string a base character?
Returns a list of components of the string, broken up according to the
subcommand and optional locale. If -rule is given, the words and
sentences breaks return lists of pairs, with the second elements
giving extra information about the reason for the break.
Subcommands are:
Split up into individual extended grapheme clusters.
Split up into individual numeric codepoints.
Split up into individual words. If -all is given, includes the spaces
between words as their own entries, otherwise leaves them out of the
results.
Split up into sentences.
Find line breaks. The list it returns is of pairs, with the first
element being text, and the second the type of break that follows that
text: soft for suggested break points, hard for mandatory break points.
Creates and returns the name of a new command that collates strings. With no arguments, uses the default locale's collator. If the locale is an empty string, uses the root collator. If a name for the collator is not given, one is generated. A single argument is interpreted as the locale argument.
Returns the name of the locale used by the collator command.
Returns -1, 0 or 1 depending on if s1 is less than, equal to, or greater than s2 according to the collator's rules.
Returns true or false depending on if s1 and s2 are equal.
Returns true or false depending on if s1 is greater than s2 or not.
Returns true or false depending on if s1 is greater than or equal to s2 or not.
An ensemble with various locale-related commands.
Return the default locale. With an argument, also sets the default to that.
Returns information about a given locale, or the default one if no locale is specified.
Return the language code for the locale.
Return the script used by the locale.
Return the locale's country code.
Return the locale's variant code.
Return the full name of the locale.
Return the canonicalized full name of the locale.
Returns 1 if the locale's script is read right to left, otherwise 0.
Returns left-to-right, right-to-left, or unknown.
Returns top-to-bottom, bottom-to-top, or unknown.
Returns a list of known ISO language codes.
Returns a list of known ISO country codes.
Returns a list of known locales.
An ensemble with commands for formatting data.
TODO: Numbers, dates, times, etc.
Formats a list according to the rules specified by the options. If a
locale is not given, uses the default one. -type defaults to and, and
-width defaults to wide. The -type and -width options have no effect
unless using ICU 67 or newer; older versions always act as if the
defaults are used.