Message36794
Logged In: YES
user_id=89016
I'm thinking about extending the API a little bit.
Consider the following example:
>>> "\\u1".decode("unicode-escape")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: encoding 'unicodeescape' can't decode byte 0x31 in position 2: truncated \uXXXX escape
The error message is misleading: the problem is not
the '1' in position 2, but the complete truncated
sequence '\\u1'.
For this the decoder should pass a start
and an end position to the handler.
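To illustrate what a start and an end position would give the handler, here is a minimal sketch. It assumes a modern Python 3 interpreter, where handlers are registered with codecs.register_error and receive an exception object carrying start and end attributes. That is not the callback signature used in the examples in this message, and the handler name "show-range" is made up for the illustration.

```python
import codecs

# (start, end, offending bytes) recorded for each handler call
seen = []

def show_range(exc):
    # Hypothetical handler: record the range the decoder reports and
    # substitute the raw offending bytes wrapped in angle brackets.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    seen.append((exc.start, exc.end, bad))
    # Resume decoding right after the reported range.
    return ("<" + bad.decode("latin-1") + ">", exc.end)

codecs.register_error("show-range", show_range)

result = b"\\u1".decode("unicode-escape", "show-range")
print(seen)
print(result)
```

With a range, the handler sees the whole truncated sequence rather than only the byte at which decoding stopped.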
This would be useful for encoding too:
Suppose I want to have an encoder that
highlights each unencodable character with
ANSI escape sequences. Then I could do
the following:
>>> import codecs
>>> def color(enc, uni, pos, why, sta):
...     return (u"\033[1m<%d>\033[0m" % ord(uni[pos]), pos+1)
...
>>> codecs.register_unicodeencodeerrorhandler("color", color)
>>> u"aäüöo".encode("ascii", "color")
'a\x1b[1m<228>\x1b[0m\x1b[1m<252>\x1b[0m\x1b[1m<246>\x1b[0mo'
But here the sequences "\x1b[0m\x1b[1m" between adjacent
unencodable characters are not needed.
To fix this problem the encoder could collect as many
unencodable characters as possible and pass those to
the error callback in one go (passing a start and
end+1 position).
This fixes the above problem and reduces the number of
calls to the callback, so it should speed up the
algorithms in the case of custom error handling names.
(And it makes the implementation very interesting ;))
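A minimal sketch of the "one call per run" idea, again assuming a modern Python 3 interpreter and codecs.register_error rather than the registration function used in the example above. Here the ASCII encoder hands the handler the whole run of unencodable characters through exc.start and exc.end, so the escape sequences are emitted once per run.

```python
import codecs

# Each run of unencodable characters the handler was called with
calls = []

def color(exc):
    # Hypothetical handler: wrap a whole run of unencodable characters
    # in a single pair of ANSI bold on/off sequences.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    run = exc.object[exc.start:exc.end]
    calls.append(run)
    body = "".join("<%d>" % ord(ch) for ch in run)
    # Resume encoding right after the run.
    return ("\033[1m" + body + "\033[0m", exc.end)

codecs.register_error("color", color)

# "\xe4\xfc\xf6" is the run of a-umlaut, u-umlaut, o-umlaut from the example
encoded = "a\xe4\xfc\xf6o".encode("ascii", "color")
print(calls)
print(repr(encoded))
```

The handler is called once for the three adjacent characters, and the redundant "\x1b[0m\x1b[1m" pairs disappear from the output.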
What do you think?
Date | User | Action | Args
2007-08-23 15:06:07 | admin | link | issue432401 messages
2007-08-23 15:06:07 | admin | create |