-
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathutf8.txt
109 lines (86 loc) · 3.85 KB
/
utf8.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
*utf8.nvim* *utf8.txt*
Copyright © 2020–2022 Lua.org, PUC-Rio.
==============================================================================
INTRODUCTION
This library provides basic support for UTF-8 encoding. It provides all its
functions inside the table utf8. This library does not provide any support for
Unicode other than the handling of the encoding. Any operation that needs the
meaning of a character, such as character classification, is outside its
scope.
Unless stated otherwise, all functions that expect a byte position as a
parameter assume that the given position is either the start of a byte
sequence or one plus the length of the subject string. As in the string
library, negative indices count from the end of the string.
Functions that create byte sequences accept all values up to 0x10FFFF, as
defined in the UTF-8 specification (RFC 3629); that implies byte sequences of
up to four bytes.
Functions that interpret byte sequences only accept valid sequences (well
formed and not overlong). By default, they only accept byte sequences that
result in valid Unicode code points, rejecting values greater than 10FFFF and
surrogates.
==============================================================================
FUNCTIONS
A Global variable is not added and must be called in `require()`.
>lua
local utf8 = require("utf8")
<
*utf8.char()*
utf8.char([{int}...])
Receives zero or more integers, converts each one to its corresponding
UTF-8 byte sequence and returns a string with the concatenation of all
these sequences.
Parameters: ~
• {int} (integer) Code points
*utf8.charpattern*
utf8.charpattern
The pattern (a string, not a function)
"[\0-\x7F\xC2-\xFD][\x80-\xBF]*", which matches exactly one UTF-8 byte
sequence, assuming that the subject is a valid UTF-8 string.
*utf8.codes()*
utf8.codes({s})
Returns values so that the construction
>lua
for p, c in utf8.codes(s) do body end
<
will iterate over all UTF-8 characters in string s, with p being the
position (in bytes) and c the code point of each character. It raises
an error if it meets any invalid byte sequence.
Parameters: ~
• {s} (string) UTF-8 string
*utf8.codepoint()*
utf8.codepoint({s} [, {i} [, {j}]])
Returns the code points (as integers) from all characters in s that
start between byte position i and j (both included). The default for i
is 1 and for j is i. It raises an error if it meets any invalid byte
sequence.
Parameters: ~
• {s} (string) UTF-8 string
• {i} (integer) Start byte position (included)
• {j} (integer) End byte position (included)
*utf8.len()*
utf8.len({s} [, {i} [, {j}]])
Returns the number of UTF-8 characters in string s that start between
positions i and j (both inclusive). The default for i is 1 and for j
is -1. If it finds any invalid byte sequence, returns fail plus the
position of the first invalid byte.
Parameters: ~
• {s} (string) UTF-8 string
• {i} (integer) Start byte position (included)
• {j} (integer) End byte position (included)
*utf8.offset()*
utf8.offset({s}, {n} [, {i}])
Returns the position (in bytes) where the encoding of the n-th
character of s (counting from position i) starts. A negative n gets
characters before position i. The default for i is 1 when n is
non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the
offset of the n-th character from the end of the string. If the
specified character is neither in the subject nor right after its end,
the function returns fail.
As a special case, when n is 0 the function returns the start of the
encoding of the character that contains the i-th byte of s.
This function assumes that s is a valid UTF-8 string.
Parameters: ~
• {s} (string) UTF-8 string
• {n} (integer) The n-th character from {i} of {s}
• {i} (integer) The byte position
vim:tw=78:ts=8:noet:ft=help:norl: