Author doerwalter
Date 2001-07-27.03:55:36

Changing the decoding API is done now. There are new functions codecs.register_unicodedecodeerrorhandler and codecs.lookup_unicodedecodeerrorhandler. Only the standard handlers for 'strict', 'ignore' and 'replace' are preregistered.
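
A rough sketch of how a custom handler might be registered and retrieved with these functions (assuming they live in the codecs module, that registration takes a handler name plus the callback, and that lookup returns the callback by name; none of these details are spelled out above):

>>> import codecs
>>> def questionmark(enc, uni, pos, rea, sta):
...     # replace the undecodable byte and resume after it
...     return (u"?", pos+1)
...
>>> # assumed signature: register under a name, then look the callback up by name
>>> codecs.register_unicodedecodeerrorhandler("questionmark", questionmark)
>>> handler = codecs.lookup_unicodedecodeerrorhandler("questionmark")
>>> "a\xffb".decode("ascii", handler)
u'a?b'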

There may be many reasons for decoding errors in the byte string, so I added an additional argument to the decoding API: reason, which gives the reason for the failure, e.g.:

>>> "\\U1111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 8: truncated \UXXXXXXXX escape
>>> "\\U11111111".decode("unicode_escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 9: illegal Unicode character

For symmetry I added this to the encoding API too:
>>> u"\xff".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in position 0: ordinal not in range(128)

The parameters passed to the callbacks now are:
encoding, unicode, position, reason, state.

The encoding and decoding API for strings has been adapted too, so now the new API should be usable everywhere:

>>> unicode("a\xffb\xffc", "ascii", 
...    lambda enc, uni, pos, rea, sta: (u"<?>", pos+1))
u'a<?>b<?>c'
>>> "a\xffb\xffc".decode("ascii",
...    lambda enc, uni, pos, rea, sta: (u"<?>", 
pos+1))            
u'a<?>b<?>c'

I had a problem with the decoding API: all the functions in _codecsmodule.c used the t# format specifier. I changed that to O! with &PyString_Type, because otherwise the decoding API would have to pass buffer objects around instead of strings, and the callback would have to call str() on the buffer anyway to access a specific character, so this wouldn't be any faster than calling str() on the buffer before decoding. It seems that buffers aren't used anyway.

I changed all the old functions to call the new ones so bugfixes don't have to be done in two places. There are two exceptions: I didn't change PyString_AsEncodedString and PyString_AsDecodedString because they are documented as deprecated anyway (although they are called in a few spots). This means that I duplicated part of their functionality in PyString_AsEncodedObjectEx and PyString_AsDecodedObjectEx.

There are still a few spots that call the old API. E.g. PyString_Format still calls PyUnicode_Decode (but with strict decoding) because it passes the rest of the format string to PyUnicode_Format when it encounters a Unicode object.

Should we switch to the new API everywhere even if strict encoding/decoding is used?
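
At the Python level this is the familiar case of %-formatting a byte string that contains non-ASCII bytes with a Unicode argument, which fails under the default strict ASCII decoding (sketch; the exact error text depends on which decoding path is taken):

>>> "a\xffb %s" % u"c"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'ascii' can't decode byte 0xff in position 1: ordinal not in range(128)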

The size of this patch begins to scare me. I guess we need an extensive test script for all the new features and documentation. I hope you have time to do that, as I'll be busy with other projects in the next weeks. (BTW, I haven't touched PyUnicode_TranslateCharmap yet.)
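
A first cut at such a test script could simply assert the behaviour shown above (a minimal sketch, covering only the Python-level API demonstrated in this message; the C API paths would need separate coverage):

# minimal sketch: exercises the preregistered handlers and a custom callback
def handler(enc, uni, pos, rea, sta):
    return (u"<?>", pos+1)

assert "a\xffb\xffc".decode("ascii", handler) == u"a<?>b<?>c"
assert unicode("a\xffb\xffc", "ascii", handler) == u"a<?>b<?>c"
assert "a\xffb".decode("ascii", "ignore") == u"ab"
assert "a\xffb".decode("ascii", "replace") == u"a\ufffdb"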