
Author doerwalter
Date 2002-03-07.23:09:58
Content

I'm thinking about extending the API a little bit:

Consider the following example:
>>> "\\u1".decode("unicode-escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 2: truncated \uXXXX escape

The error message is misleading: the problem 
is not the '1' in position 2, but the 
complete truncated sequence '\\u1'. 
To report this, the decoder should pass both 
a start and an end position to the handler.
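
For illustration only: Python later gained codecs.register_error, 
whose handlers receive the exception object with start and end 
attributes. The sketch below uses that Python 3 API, not the 
callback signature discussed in this message. The handler name 
"report-range" and the seen list are made up for the example.

```python
import codecs

seen = []  # (start, end, offending bytes) recorded for each handler call


def report_range(exc):
    # Only decoding errors are handled in this sketch.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    seen.append((exc.start, exc.end, exc.object[exc.start:exc.end]))
    # Substitute U+FFFD and resume decoding after the reported range.
    return ("\ufffd", exc.end)


codecs.register_error("report-range", report_range)

# b"\\u1" is the three bytes: backslash, 'u', '1'.
result = b"\\u1".decode("unicode-escape", "report-range")
```

With a start/end pair the handler sees the whole truncated 
escape (all three bytes), not just the '1'.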

For encoding this would be useful too. 
Suppose I want an encoder that colors 
unencodable characters via ANSI escape 
sequences. Then I could do the following:
>>> import codecs
>>> def color(enc, uni, pos, why, sta):
...    return (u"\033[1m<%d>\033[0m" % ord(uni[pos]), pos+1)
... 
>>> codecs.register_unicodeencodeerrorhandler("color", color)
>>> u"aäüöo".encode("ascii", "color")
'a\x1b[1m<228>\x1b[0m\x1b[1m<252>\x1b[0m\x1b[1m<246>\x1b[0mo'

But here the "\x1b[0m\x1b[1m" sequences between adjacent 
unencodable characters are not needed.

To fix this problem the encoder could collect as many
unencodable characters as possible and pass those to 
the error callback in one go (passing a start and 
end+1 position).
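
A minimal pure-Python sketch of this run-collecting idea, 
assuming an ASCII target and a handler that takes 
(text, start, end) and returns (replacement, new position). 
The function names and the handler signature are hypothetical 
and are not the API proposed in this message.

```python
def encode_ascii_grouped(text, handler):
    """Encode to ASCII, calling handler once per run of unencodable chars."""
    out = []
    pos = 0
    while pos < len(text):
        if ord(text[pos]) < 128:
            out.append(text[pos])
            pos += 1
            continue
        # Extend the run over every adjacent unencodable character.
        end = pos + 1
        while end < len(text) and ord(text[end]) >= 128:
            end += 1
        # end is exclusive, i.e. the "end+1 position" of the run.
        replacement, pos = handler(text, pos, end)
        out.append(replacement)
    # The handler is assumed to return ASCII-only replacement text.
    return "".join(out).encode("ascii")


def color(text, start, end):
    # Wrap the whole run in a single pair of ANSI bold on/off sequences.
    codes = "".join("<%d>" % ord(ch) for ch in text[start:end])
    return "\033[1m" + codes + "\033[0m", end


encoded = encode_ascii_grouped("a\u00e4\u00fc\u00f6o", color)
```

The three unencodable characters are handed over in one call, 
so only one escape pair surrounds the whole run.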

This fixes the above problem and reduces the number of 
calls to the callback, so it should speed up the 
algorithms when custom error handlers are used. 
(And it makes the implementation very interesting ;))

What do you think?
History
Date User Action Args
2007-08-23 15:06:07 admin link issue432401 messages
2007-08-23 15:06:07 admin create