This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author doerwalter
Date 2002-04-17.20:50:06

> About the difference between encoding 
> and decoding: you shouldn't just look 
> at the case where you work with Unicode 
> and strings, e.g. take the rot-13 codec
> which works on strings only or other
> codecs which translate objects into 
> strings and vice-versa.

unicode.encode encodes to str and 
str.decode decodes to unicode,
even for rot-13:

>>> u"gürk".encode("rot13")
't\xfcex'
>>> "gürk".decode("rot13")
u't\xfcex'
>>> u"gürk".decode("rot13")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'unicode' object has no attribute 'decode'
>>> "gürk".encode("rot13")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/home/walter/Python-current-readonly/dist/src/Lib/encodings/rot_13.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeError: ASCII decoding error: ordinal not in range(128)

Here the str is converted to unicode
first, before encode is called, but the
conversion to unicode fails.

Is there an example where something
else happens?
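(An aside for later readers: in today's Python the rot-13 codec is a str-to-str transform, reachable only through codecs.encode/codecs.decode, so the implicit str/unicode coercion shown above can no longer happen. A minimal sketch:)

```python
import codecs

# rot-13 is a str-to-str transform in current Python; characters
# outside a-z/A-Z, such as "ü", pass through unchanged
encoded = codecs.encode("gürk", "rot13")
decoded = codecs.decode(encoded, "rot13")
```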

> Error handling has to be flexible enough 
> to handle all these situations. Since 
> the codecs know best how to handle the
> situations, I'd make this an implementation 
> detail of the codec and leave the
> behaviour undefined in the general case.

OK, but we should suggest that for encoding, unencodable characters are collected, and that for decoding, the separate byte sequences that the codec considers broken are passed to the callback one at a time: i.e. for decoding the handler will never get all the broken data in one call. E.g. for "\\u30\\Uffffffff".decode("unicode-escape") the handler will be called twice (once for "\\u30" with "truncated \\u escape" as the reason, and once for "\\Uffffffff" with "illegal character" as the reason).
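(This two-call behaviour can be observed with the exception-based callback API that eventually shipped; the handler name "logerrors" below is invented for the sketch:)

```python
import codecs

calls = []

def log_errors(exc):
    # record each broken span separately, substitute U+FFFD, and
    # tell the decoder to resume after the bad sequence
    calls.append((exc.object[exc.start:exc.end], exc.reason))
    return ("\ufffd", exc.end)

codecs.register_error("logerrors", log_errors)  # made-up handler name

# the two broken escapes trigger two separate handler calls
b"\\u30\\Uffffffff".decode("unicode-escape", "logerrors")
```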

> For the existing codecs, backward 
> compatibility should be maintained, 
> if at all possible. If the patch gets 
> overly complicated because of this, 
> we may have to provide a downgrade solution
> for this particular problem (I don't think 
> replace is used in any computational context, 
> though, since you can never be sure how 
> many replacement character do get 
> inserted, so the case may not be 
> that realistic).
> 
> Raising an exception for the charmap codec 
> is the right way to go, IMHO. I would 
> consider the current behaviour a bug.

OK, this is implemented in PyUnicode_EncodeCharmap now, 
and collecting unencodable characters works too.

I completely changed the implementation,
because the stack approach would have
gotten much more complicated when
unencodable characters are collected.
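(The collecting behaviour survives in today's C implementation; a quick sketch with the later codecs.register_error API, where the handler name "collect" is invented:)

```python
import codecs

spans = []

def collect(exc):
    # the encoder hands the whole run of unencodable characters
    # to the handler in a single call
    spans.append((exc.start, exc.end))
    return ("?", exc.end)

codecs.register_error("collect", collect)  # made-up handler name

# "ü" and "ö" are adjacent and unencodable: one call covering both
result = "aüöb".encode("ascii", "collect")
```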

> For new codecs, I think we should 
> suggest that replace tries to collect 
> as much illegal data as possible before
> invoking the error handler. The handler 
> should be aware of the fact that it 
> won't necessarily get all the broken 
> data in one call.

OK for encoders, for decoders see
above.

> About the codec error handling 
> registry: You seem to be using a 
> Unicode specific approach here. 
> I'd rather like to see a generic 
> approach which uses the API 
> we discussed earlier. Would that be possible?

The handlers in the registry are all Unicode specific, and they are different for encoding and for decoding.

I renamed the function because of your
comment from 2001-06-13 10:05 (which 
becomes exceedingly difficult to find on
this long page! ;)).

> In that case, the codec API should 
> probably be called 
> codecs.register_error('myhandler', myhandler).
> 
> Does that make sense ?

We could require that unique names
are used for custom handlers, but
for the standard handlers we do have
name collisions. To prevent them, we
could either remove them from the registry
and require that the codec implements
the error handling for those itself,
or we could do some fiddling, so that
u"üöä".encode("ascii", "replace")
becomes 
u"üöä".encode("ascii", "unicodeencodereplace")
behind the scenes.

But I think two unicode specific 
registries are much simpler to handle.
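(For reference, the codecs module eventually settled on a single generic registry keyed by name; one handler can serve both directions by dispatching on the exception type. A sketch, with "myhandler" as a made-up name:)

```python
import codecs

def myhandler(exc):
    # one handler for both directions: encoders raise
    # UnicodeEncodeError, decoders raise UnicodeDecodeError
    if isinstance(exc, UnicodeEncodeError):
        return ("?", exc.end)
    elif isinstance(exc, UnicodeDecodeError):
        return ("\ufffd", exc.end)
    raise exc

codecs.register_error("myhandler", myhandler)  # made-up handler name
```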

> BTW, the patch which uses the callback 
> registry does not seem to be available 
> on this SF page (the last patch still 
> converts the errors argument to a 
> PyObject, which shouldn't be needed
> anymore with the new approach). 
> Can you please upload your 
> latest version?

OK, I'll upload a preliminary version
tomorrow. PyUnicode_EncodeDecimal and
PyUnicode_TranslateCharmap are still
missing, but otherwise the patch seems
to be finished. All decoders work and
the encoders collect unencodable characters
and implement the handling of known
callback handler names themselves.

As PyUnicode_EncodeDecimal is only used
by the int, long, float, and complex constructors,
I'd love to get rid of the errors argument,
but for completeness' sake, I'll implement
the callback functionality.
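(PyUnicode_EncodeDecimal is what lets the numeric constructors accept non-ASCII decimal digits; the effect is visible from pure Python:)

```python
# the int constructor accepts any Unicode decimal digits,
# here ARABIC-INDIC DIGIT ONE, TWO, THREE
value = int("\u0661\u0662\u0663")
```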

> Note that the highlighting codec 
> would make a nice example
> for the new feature.

This could be part of the codec callback test
script, which I've started to write. We could
kill two birds with one stone here:
1. Test the implementation.
2. Document and advocate what is 
   possible with the patch.

Another idea: we could have as an example
a decoding handler that relaxes the
UTF-8 minimal encoding restriction, e.g.

def relaxedutf8(enc, input, startpos, endpos, reason, data):
    if input[startpos:startpos+2] == "\xc0\x80":
        return (u"\x00", startpos+2)
    else:
        raise UnicodeError(...)
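(With the exception-based handler signature that eventually shipped, the same idea looks like this; the handler name is invented, and C0 80 is the overlong two-byte encoding of U+0000 that strict UTF-8 rejects:)

```python
import codecs

def relaxed_utf8(exc):
    # accept the overlong sequence C0 80 as U+0000 (as in
    # "modified UTF-8"); re-raise anything else unchanged
    if isinstance(exc, UnicodeDecodeError) and \
            exc.object[exc.start:exc.start + 2] == b"\xc0\x80":
        return ("\x00", exc.start + 2)
    raise exc

codecs.register_error("relaxedutf8", relaxed_utf8)  # made-up name

text = b"ab\xc0\x80cd".decode("utf-8", "relaxedutf8")
```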
History
Date                 User   Action  Args
2007-08-23 15:06:07  admin  link    issue432401 messages
2007-08-23 15:06:07  admin  create