PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision$
Last-Modified: $Date$
Author: Walter Dörwald <[email protected]>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 18-Jun-2002
Python-Version: 2.3
Post-History: 19-Jun-2002
Abstract
========
This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback based approach.

Python currently uses a fixed set of error handling schemes for
codecs. This PEP describes a mechanism which allows Python to
use function callbacks as error handlers. With these more
flexible error handlers it is possible to add new functionality to
existing codecs, e.g. by providing fallback solutions or different
encodings for cases where the standard codec mapping does not
apply.

Specification
=============
Currently the set of codec error handling algorithms is fixed to
either "strict", "replace" or "ignore" and the semantics of these
algorithms are implemented separately for each codec.
The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps
handler names to handler functions. This registry consists of the
following two C functions::

    int PyCodec_RegisterError(const char *name, PyObject *error)
    PyObject *PyCodec_LookupError(const char *name)

and their Python counterparts::

    codecs.register_error(name, error)
    codecs.lookup_error(name)

``PyCodec_LookupError`` raises a ``LookupError`` if no callback function
has been registered under this name.
Similar to the encoding name registry there is no way of
unregistering callback functions or iterating through the
available functions.
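
As a rough illustration of the registry, the following sketch uses present-day Python 3 rather than the Python 2.3 dialect of this PEP. The handler function ``question_mark`` and the names ``pep293-demo-question`` and ``pep293-demo-unregistered`` are made up for the example.

```python
import codecs

def question_mark(exc):
    """Hypothetical handler: replace each unencodable character with '?'."""
    if isinstance(exc, UnicodeEncodeError):
        return ("?" * (exc.end - exc.start), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)

# Register the handler under an illustrative name, then fetch it back.
codecs.register_error("pep293-demo-question", question_mark)
found = codecs.lookup_error("pep293-demo-question")

# Looking up a name that was never registered raises LookupError.
try:
    codecs.lookup_error("pep293-demo-unregistered")
    lookup_failed = False
except LookupError:
    lookup_failed = True

# The registered name can now be passed as the errors argument.
encoded = "a\u00e4b".encode("ascii", "pep293-demo-question")
```

Because the registry is global and has no unregister operation, choosing a distinctive name avoids clashing with handlers registered elsewhere.
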
The callback functions will be used in the following way by the
codecs: when the codec encounters an encoding/decoding error, the
callback function is looked up by name, the information about the
error is stored in an exception object and the callback is called
with this object. The callback returns information about how to
proceed (or raises an exception).
For encoding, the exception object will look like this::

    class UnicodeEncodeError(UnicodeError):
        def __init__(self, encoding, object, start, end, reason):
            # Adjacent string literals are joined before "%" is applied,
            # so all four values are formatted into the message.
            UnicodeError.__init__(self,
                "encoding '%s' can't encode characters "
                "in positions %d-%d: %s" % (encoding,
                start, end-1, reason))
            self.encoding = encoding
            self.object = object
            self.start = start
            self.end = end
            self.reason = reason

This type will be implemented in C with the appropriate setter and
getter methods for the attributes, which have the following
meaning:

* ``encoding``: The name of the encoding;
* ``object``: The original unicode object for which ``encode()`` has
  been called;
* ``start``: The position of the first unencodable character;
* ``end``: (The position of the last unencodable character)+1 (or
  the length of object, if all characters from start to the end
  of object are unencodable);
* ``reason``: The reason why ``object[start:end]`` couldn't be encoded.

If object has consecutive unencodable characters, the encoder
should collect those characters for one call to the callback if
those characters can't be encoded for the same reason. The
encoder is not required to implement this behaviour and may
instead call the callback for every single character, but
implementing the collecting method is strongly suggested.
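
To make the attributes concrete, here is a small Python 3 sketch that triggers an encoding error with the default ``strict`` handling and reads the fields of the resulting exception. The sample string is arbitrary. Whether the two adjacent non-ASCII characters are reported in one exception or one at a time depends on the encoder, as described above.

```python
info = None
try:
    # Two consecutive characters that ASCII cannot represent.
    "a\u00e4\u00f6b".encode("ascii")
except UnicodeEncodeError as exc:
    # Capture the attributes described in the list above.
    info = (exc.encoding, exc.object, exc.start, exc.end, exc.reason)

if info is not None:
    encoding, obj, start, end, reason = info
    # The slice object[start:end] is the part that could not be encoded.
    unencodable = obj[start:end]
```
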
The callback must not modify the exception object. If the
callback does not raise an exception (either the one passed in, or
a different one), it must return a tuple::

    (replacement, newpos)

``replacement`` is a unicode object that the encoder will encode and
emit instead of the unencodable ``object[start:end]`` part. ``newpos``
specifies a new position within object, where (after encoding the
replacement) the encoder will continue encoding.

Negative values for ``newpos`` are treated as being relative to the
end of object. If ``newpos`` is out of bounds the encoder will raise
an ``IndexError``.
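
The return convention can be sketched in Python 3 as follows. Both handlers and their registered names (``pep293-demo-tag``, ``pep293-demo-oob``) are hypothetical. The first returns a replacement string and resumes right after the unencodable run; the second deliberately returns a position past the end of the input to show the out-of-bounds case.

```python
import codecs

def codepoint_tag(exc):
    """Hypothetical handler: emit '[U+XXXX]' for each unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(c) for c in bad)
    # Resume encoding right after the unencodable run.
    return (replacement, exc.end)

def out_of_bounds(exc):
    """Hypothetical handler that returns an invalid resume position."""
    return ("", len(exc.object) + 10)

codecs.register_error("pep293-demo-tag", codepoint_tag)
codecs.register_error("pep293-demo-oob", out_of_bounds)

tagged = "x\u20acy".encode("ascii", "pep293-demo-tag")

try:
    "x\u20acy".encode("ascii", "pep293-demo-oob")
    raised_index_error = False
except IndexError:
    raised_index_error = True
```
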
If the replacement string itself contains an unencodable character
the encoder raises the exception object (but may set a different
reason string before raising).
Should further encoding errors occur, the encoder is allowed to
reuse the exception object for the next call to the callback.
Furthermore, the encoder is allowed to cache the result of
``codecs.lookup_error``.
If the callback does not know how to handle the exception, it must
raise a ``TypeError``.
Decoding works similarly to encoding, with the following differences:

* The exception class is named ``UnicodeDecodeError`` and the attribute
  object is the original 8bit string that the decoder is currently
  decoding.

* The decoder will call the callback with those bytes that
  constitute one undecodable sequence, even if there is more than
  one undecodable sequence that is undecodable for the same reason
  directly after the first one. E.g. for the "unicode-escape"
  encoding, when decoding the illegal string ``\u00\u01x``, the
  callback will be called twice (once for ``\u00`` and once for
  ``\u01``). This is done to be able to generate the correct number
  of replacement characters.

* The replacement returned from the callback is a unicode object
  that will be emitted by the decoder as-is without further
  processing instead of the undecodable ``object[start:end]`` part.

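
A decoding callback might look like the following Python 3 sketch, in which the handler name ``pep293-demo-latin1`` and the fallback policy (reinterpreting undecodable bytes as Latin-1) are assumptions made for illustration. In Python 3 the ``object`` attribute is a ``bytes`` object rather than an 8bit ``str``.

```python
import codecs

def latin1_fallback(exc):
    """Hypothetical handler: reinterpret undecodable bytes as Latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    # The returned text is emitted as-is; decoding resumes at exc.end.
    return (bad.decode("latin-1"), exc.end)

codecs.register_error("pep293-demo-latin1", latin1_fallback)

# Latin-1 encoded input that is not valid UTF-8.
text = b"\xe9t\xe9".decode("utf-8", "pep293-demo-latin1")
```

The handler is invoked separately for each of the two invalid bytes, so each one yields exactly one replacement character.
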
There is a third API that uses the old strict/ignore/replace error
handling scheme::

    PyUnicode_TranslateCharmap/unicode.translate

The proposed patch will enhance ``PyUnicode_TranslateCharmap``, so
that it also supports the callback registry. This has the
additional side effect that ``PyUnicode_TranslateCharmap`` will
support multi-character replacement strings (see SF feature
request #403100 [1]_).
For ``PyUnicode_TranslateCharmap`` the exception class will be named
``UnicodeTranslateError``. ``PyUnicode_TranslateCharmap`` will collect
all consecutive untranslatable characters (i.e. those that map to
``None``) and call the callback with them. The replacement returned
from the callback is a unicode object that will be put in the
translated result as-is, without further processing.
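
In Python 3, ``str.translate`` takes no errors argument, so the translate path cannot be triggered from a script in the way described here. As a rough stand-in, the sketch below builds a ``UnicodeTranslateError`` by hand and passes it to two of the handlers shipped in the ``codecs`` module, to show that they accept this exception class. The sample string and the reason text are arbitrary.

```python
import codecs

# object, start, end, reason: characters 1 and 2 count as untranslatable.
exc = UnicodeTranslateError("abcd", 1, 3, "demo: no mapping")

# "replace" yields one U+FFFD per untranslatable character.
replacement, newpos = codecs.replace_errors(exc)

# "ignore" drops the untranslatable characters.
ignored, ignored_pos = codecs.ignore_errors(exc)
```
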
All encoders and decoders are allowed to implement the callback
functionality themselves, if they recognize the callback name
(i.e. if it is a system callback like "strict", "replace" and
"ignore"). The proposed patch will add two additional system
callback names: "backslashreplace" and "xmlcharrefreplace", which
can be used for encoding and translating and which will also be
implemented in-place for all encoders and
``PyUnicode_TranslateCharmap``.
The Python equivalent of these five callbacks will look like this::

    def strict(exc):
        raise exc

    def ignore(exc):
        if isinstance(exc, UnicodeError):
            return (u"", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def replace(exc):
        if isinstance(exc, UnicodeEncodeError):
            return ((exc.end-exc.start)*u"?", exc.end)
        elif isinstance(exc, UnicodeDecodeError):
            return (u"\ufffd", exc.end)
        elif isinstance(exc, UnicodeTranslateError):
            return ((exc.end-exc.start)*u"\ufffd", exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def backslashreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                if ord(c)<=0xff:
                    s += u"\\x%02x" % ord(c)
                elif ord(c)<=0xffff:
                    s += u"\\u%04x" % ord(c)
                else:
                    s += u"\\U%08x" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    def xmlcharrefreplace(exc):
        if isinstance(exc,
            (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                s += u"&#%d;" % ord(c)
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

These five callback handlers will also be accessible to Python as
``codecs.strict_errors``, ``codecs.ignore_errors``, ``codecs.replace_errors``,
``codecs.backslashreplace_errors`` and ``codecs.xmlcharrefreplace_errors``.
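
For comparison, this is how the system callback names behave when used from Python 3, where they are available without any registration. The sample text is arbitrary.

```python
# One character below U+0100 and one above it.
text = "\u00e4\u20ac"

escaped = text.encode("ascii", "backslashreplace")
xml = text.encode("ascii", "xmlcharrefreplace")
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
```

``backslashreplace`` picks the shortest escape form that fits each code point, while ``xmlcharrefreplace`` always emits a decimal character reference.
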
Rationale
=========
Most legacy encodings do not support the full range of Unicode
characters. For these cases many high level protocols support a
way of escaping a Unicode character (e.g. Python itself supports
the ``\x``, ``\u`` and ``\U`` convention, XML supports character references
via &#xxx; etc.).
When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: For determining which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so

::

    # (1)
    us = u"xxx"
    s = us.encode(encoding)

has to be replaced by

::

    # (2)
    us = u"xxx"
    v = []
    for c in us:
        try:
            v.append(c.encode(encoding))
        except UnicodeError:
            v.append("&#%d;" % ord(c))
    s = "".join(v)

This slows down encoding dramatically as now the loop through the
string is done in Python code and no longer in C code.
Furthermore, this solution poses problems with stateful encodings.
For example, UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM in front of every
character.
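
The BOM problem is easy to reproduce. The following Python 3 sketch compares encoding a two-character string in one call with encoding it character by character, as approach (2) does. The variable names are illustrative.

```python
import codecs

text = "ab"

# One call: a single BOM followed by two 16-bit code units.
whole = text.encode("utf-16")

# Character by character: every call emits its own BOM.
piecewise = b"".join(c.encode("utf-16") for c in text)

# The BOM is whichever of the two byte orders the platform uses.
bom = whole[:2]
```
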
To work around this problem, a stream writer - which keeps state
between calls to the encoding function - has to be used::

    # (3)
    us = u"xxx"
    import codecs, cStringIO as StringIO
    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"&#%d;" % ord(c))
    s = v.getvalue()

To compare the speed of (1) and (3) the following test script has
been used::

    # (4)
    import time
    us = u"äa"*1000000
    encoding = "ascii"
    import codecs, cStringIO as StringIO

    t1 = time.time()

    s1 = us.encode(encoding, "replace")

    t2 = time.time()

    writer = codecs.getwriter(encoding)

    v = StringIO.StringIO()
    uv = writer(v)
    for c in us:
        try:
            uv.write(c)
        except UnicodeError:
            uv.write(u"?")
    s2 = v.getvalue()

    t3 = time.time()

    assert(s1==s2)
    print "1:", t2-t1
    print "2:", t3-t2
    print "factor:", (t3-t2)/(t2-t1)

On Linux this gives the following output (with Python 2.3a0)::

    1: 0.274321913719
    2: 51.1284689903
    factor: 186.381278466

i.e. (3) is roughly 186 times slower than (1).
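
A Python 3 port of the comparison might look like the sketch below. It uses ``timeit`` and ``io.BytesIO`` in place of ``time.time`` and ``cStringIO``, and a much shorter input so that it finishes quickly. The function names and repetition count are arbitrary, and the measured ratio on a modern interpreter will differ from the figures quoted above.

```python
import codecs
import io
import timeit

us = "\u00e4a" * 1000
encoding = "ascii"

def builtin_replace():
    """Approach (1) with the built-in "replace" handler."""
    return us.encode(encoding, "replace")

def python_loop():
    """Approach (3): per-character writes through a stream writer."""
    buffer = io.BytesIO()
    writer = codecs.getwriter(encoding)(buffer)
    for c in us:
        try:
            writer.write(c)
        except UnicodeError:
            writer.write("?")
    return buffer.getvalue()

# Time both approaches; the absolute numbers depend on the machine.
t_builtin = timeit.timeit(builtin_replace, number=10)
t_loop = timeit.timeit(python_loop, number=10)
```
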
Callbacks must be stateless, because as soon as a callback is
registered it is available globally and can be called by multiple
``encode()`` calls. To be able to use stateful callbacks, the errors
parameter for encode/decode/translate would have to be changed
from ``char *`` to ``PyObject *``, so that the callback could be used
directly, without the need to register the callback globally. As
this requires changes to lots of C prototypes, this approach was
rejected.
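
One consequence of passing the handler by name is that any configuration has to be fixed at registration time. A possible pattern, sketched here in Python 3 with a hypothetical helper ``register_placeholder`` and made-up handler names, is to register one closure per configuration. Each closure captures an immutable value, so the handlers remain safe to share between ``encode()`` calls.

```python
import codecs

def register_placeholder(name, placeholder):
    """Hypothetical helper: register a handler that substitutes `placeholder`."""
    def handler(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise TypeError("can't handle %s" % type(exc).__name__)
        # One placeholder per unencodable character; no mutable state.
        return (placeholder * (exc.end - exc.start), exc.end)
    codecs.register_error(name, handler)
    return name

star = register_placeholder("pep293-demo-star", "*")
dash = register_placeholder("pep293-demo-dash", "-")

starred = "a\u00e4\u00f6".encode("ascii", star)
dashed = "a\u00e4\u00f6".encode("ascii", dash)
```
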
Currently all encoding/decoding functions have arguments

::

    const Py_UNICODE *p, int size

or

::

    const char *p, int size

to specify the unicode characters/8bit characters to be
encoded/decoded. So in case of an error the codec has to create a
new unicode or str object from these parameters and store it in
the exception object. The callers of these encoding/decoding
functions extract these parameters from str/unicode objects
themselves most of the time, so it could speed up error handling
if these objects were passed directly. As this again requires
changes to many C functions, this approach has been rejected.
For stream readers/writers the errors attribute must be changeable
to be able to switch between different error handling methods
during the lifetime of the stream reader/writer. This is currently
the case for ``codecs.StreamReader`` and ``codecs.StreamWriter`` and
all their subclasses. All core codecs and probably most of the
third party codecs (e.g. ``JapaneseCodecs``) derive their stream
readers/writers from these classes so this already works,
but the attribute errors should be documented as a requirement.
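
Switching the error handling of a stream writer during its lifetime can be sketched in Python 3 like this. The in-memory ``io.BytesIO`` target is only a stand-in for a real stream.

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer, errors="replace")

# First write uses the handler given at construction time.
writer.write("\u00e4")

# Changing the attribute affects all subsequent writes.
writer.errors = "xmlcharrefreplace"
writer.write("\u00e4")

result = buffer.getvalue()
```
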
Implementation Notes
====================
A sample implementation is available as SourceForge patch #432401
[2]_ including a script for testing the speed of various
string/encoding/error combinations and a test script.
Currently the new exception classes are old style Python
classes. This means that accessing attributes results
in a dict lookup. The C API is implemented in a way
that makes it possible to switch to new style classes
behind the scenes, if ``Exception`` (and ``UnicodeError``) are
changed to new style classes implemented in C for
improved performance.
The class ``codecs.StreamReaderWriter`` uses the errors parameter for
both reading and writing. To be more flexible this should
probably be changed to two separate parameters for reading and
writing.
The errors parameter of ``PyUnicode_TranslateCharmap`` is not
available to Python, which makes testing of the new functionality
of ``PyUnicode_TranslateCharmap`` impossible with Python scripts. The
patch should add an optional argument errors to unicode.translate
to expose the functionality and make testing possible.
Codecs that do something other than encoding/decoding from/to
unicode and want to use the new machinery can define their own
exception classes, and the strict handler will automatically work
with them. The other predefined error handlers are unicode specific
and expect to get a ``Unicode(Encode|Decode|Translate)Error``
exception object, so they won't work.
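
The difference between the strict handler and the unicode specific handlers can be observed directly in Python 3. ``CompressionError`` below is a hypothetical exception class standing in for one that a non-unicode codec might define.

```python
import codecs

class CompressionError(Exception):
    """Hypothetical exception for a codec that is not unicode based."""

# The strict handler simply re-raises whatever exception it is given.
try:
    codecs.strict_errors(CompressionError("bad block"))
    strict_outcome = "returned"
except CompressionError:
    strict_outcome = "re-raised"

# The replace handler only understands the Unicode*Error classes.
try:
    codecs.replace_errors(CompressionError("bad block"))
    replace_outcome = "returned"
except TypeError:
    replace_outcome = "TypeError"
```
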
Backwards Compatibility
=======================
The semantics of unicode.encode with errors="replace" has changed:
The old version always stored a ? character in the output string
even if no character was mapped to ? in the mapping. With the
proposed patch, the replacement string from the callback will
again be looked up in the mapping dictionary. But as all
supported encodings are ASCII based, and thus map ? to ?, this
should not be a problem in practice.
Illegal values for the errors argument raised ``ValueError`` before,
now they will raise ``LookupError``.
References
==========

.. [1] SF feature request #403100
   "Multicharacter replacements in PyUnicode_TranslateCharmap"
   https://bugs.python.org/issue403100

.. [2] SF patch #432401 "unicode encoding error callbacks"
   https://bugs.python.org/issue432401

Copyright
=========
This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: