Message36789
Logged In: YES
user_id=89016
Changing the decoding API is done now. There
are new functions
codecs.register_unicodedecodeerrorhandler and
codecs.lookup_unicodedecodeerrorhandler.
Only the standard handlers for 'strict',
'ignore' and 'replace' are preregistered.
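For readers trying this on a current interpreter: the two function names above belong to the patch under discussion. The sketch below assumes Python 3, where the comparable registry is reached through codecs.lookup_error, and only checks that the three standard handlers are present. The helper is_registered is a hypothetical name used for illustration.

```python
import codecs

# Assumption: Python 3, where codecs.lookup_error plays the role of the
# lookup function named in the message. These three names are the
# standard handlers the message says are preregistered.
builtin_names = ["strict", "ignore", "replace"]
handlers = {name: codecs.lookup_error(name) for name in builtin_names}


def is_registered(name):
    """Return True if an error handler is registered under this name.

    Hypothetical helper, not part of the patch or of the codecs module.
    """
    try:
        codecs.lookup_error(name)
    except LookupError:
        # lookup_error raises LookupError for unknown handler names.
        return False
    return True
```

Looking up a name that was never registered raises LookupError, which is what the helper relies on.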
There may be many reasons for decoding errors
in the byte string, so I added an additional
argument to the decoding API: reason, which
gives the reason for the failure, e.g.:
>>> "\\U1111111".decode("unicode_escape")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte
0x31 in position 8: truncated \UXXXXXXXX escape
>>> "\\U11111111".decode("unicode_escape")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte
0x31 in position 9: illegal Unicode character
For symmetry I added this to the encoding API too:
>>> u"\xff".encode("ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in
position 0: ordinal not in range(128)
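As a rough illustration of the reason argument, the sketch below assumes Python 3, where the same information is carried as attributes of the raised exception rather than passed as a separate parameter. describe_failure is a hypothetical helper, not something from the patch.

```python
def describe_failure(operation):
    """Run a zero-argument callable and summarise any codec failure.

    Hypothetical helper: returns (encoding, start, reason) taken from the
    exception, or None if the operation succeeded.
    """
    try:
        operation()
    except (UnicodeEncodeError, UnicodeDecodeError) as exc:
        # Assumption: Python 3, where encoding, start and reason are
        # attributes of the exception object.
        return (exc.encoding, exc.start, exc.reason)
    return None


# Encoding a non-ASCII character with the ascii codec.
encode_failure = describe_failure(lambda: "\xff".encode("ascii"))

# Decoding a \U escape that has seven hex digits instead of eight.
decode_failure = describe_failure(
    lambda: b"\\U1111111".decode("unicode_escape")
)
```

The reason string is the part after the final colon in the tracebacks shown above.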
The parameters passed to the callbacks now are:
encoding, unicode, position, reason, state.
The encoding and decoding API for strings has been
adapted too, so now the new API should be usable
everywhere:
>>> unicode("a\xffb\xffc", "ascii",
... lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
>>> "a\xffb\xffc".decode("ascii",
... lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
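The five-argument callbacks above are specific to this patch. As a minimal sketch of the same replacement behaviour on Python 3 (an assumption, not the API described here), a handler can be registered by name with codecs.register_error; it receives the exception object and returns the replacement text together with the position at which to resume. The handler name mark_undecodable is made up for the example.

```python
import codecs


def mark_undecodable(exc):
    """Replace one undecodable byte with '<?>' and resume after it."""
    if isinstance(exc, UnicodeDecodeError):
        # Same idea as (u"<?>", pos+1) in the lambda: replacement text
        # plus the index where decoding should continue.
        return ("<?>", exc.start + 1)
    # Anything else (e.g. an encoding error) is not handled here.
    raise exc


# Assumption: Python 3 registry; "mark_undecodable" is a hypothetical name.
codecs.register_error("mark_undecodable", mark_undecodable)

result = b"a\xffb\xffc".decode("ascii", "mark_undecodable")
print(result)  # a<?>b<?>c
```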
I had a problem with the decoding API: all the
functions in _codecsmodule.c used the t# format
specifier. I changed that to O! with
&PyString_Type, because otherwise the decoding
API would have to pass buffer objects around
instead of strings, and
the callback would have to call str() on the
buffer anyway to access a specific character, so
this wouldn't be any faster than calling str()
on the buffer before decoding. It seems that
buffers aren't used anyway.
I changed all the old functions to call the new
ones so bugfixes don't have to be done in two
places. There are two exceptions: I didn't
change PyString_AsEncodedString and
PyString_AsDecodedString because they are
documented as deprecated anyway (although they
are called in a few spots). This means that I
duplicated part of their functionality in
PyString_AsEncodedObjectEx and
PyString_AsDecodedObjectEx.
There are still a few spots that call the old API:
E.g. PyString_Format still calls PyUnicode_Decode
(but with strict decoding) because it passes the
rest of the format string to PyUnicode_Format
when it encounters a Unicode object.
Should we switch to the new API everywhere even
if strict encoding/decoding is used?
The size of this patch begins to scare me. I
guess we need an extensive test script for all the
new features and documentation. I hope you have time
to do that, as I'll be busy with other projects in
the next weeks. (BTW, I haven't touched
PyUnicode_TranslateCharmap yet.)
Date | User | Action | Args
2007-08-23 15:06:06 | admin | link | issue432401 messages
2007-08-23 15:06:06 | admin | create |