Author doerwalter
Date 2001-06-13.13:57:07
Content

> > [...]
> > raise an exception). U+FFFD characters in the replacement
> > string will be replaced with a character that the encoder
> > chooses ('?' in all cases).
>
> Nice.

But the special casing of U+FFFD makes the interface somewhat
less clean than it could be. It was only done to be 100%
backwards compatible. With the original "replace" error
handling the codec chose the replacement character. But as
far as I can tell none of the codecs uses anything other
than '?', so I guess we could change the replace handler
to always return u'?'. This would make the implementation a
little bit simpler, but the explanation of the callback
feature *a lot* simpler. And if you still want to handle
an unencodable U+FFFD, you can write a special callback for
that, e.g.

def FFFDreplace(enc, uni, pos):
    if uni[pos] == u"\ufffd":
        return u"?"
    else:
        raise UnicodeError(...)

> > The implementation of the loop through the string is done
> > in the following way. A stack with two strings is kept
> > and the loop always encodes a character from the string
> > at the stacktop. If an error is encountered and the stack
> > has only one entry (during encoding of the original string)
> > the callback is called and the unicode object returned is
> > pushed on the stack, so the encoding continues with the
> > replacement string. If the stack has two entries when an
> > error is encountered, the replacement string itself has
> > an unencodable character and a normal exception is raised.
> > When the encoder has reached the end of its current string
> > there are two possibilities: when the stack contains two
> > entries, this was the replacement string, so the replacement
> > string will be popped from the stack and encoding continues
> > with the next character from the original string. If the
> > stack had only one entry, encoding is finished.
>
> Very elegant solution !

I'll put it as a comment in the source.
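
For reference, here's a rough Python sketch of that loop (the names,
the per-character encode_char() helper and the callback signature are
made up for illustration; the real code is C and works on buffers):

def encode_with_callback(encode_char, encname, input, callback):
    # Stack of [string, position] pairs. The bottom entry is the
    # original string; a second entry, if present, is a replacement
    # string returned by the callback.
    stack = [[input, 0]]
    result = []
    while stack:
        string, pos = stack[-1]
        if pos >= len(string):
            # End of the current string: pop the replacement string,
            # or finish if this was the original string.
            stack.pop()
            continue
        try:
            result.append(encode_char(string[pos]))
            stack[-1][1] = pos + 1
        except UnicodeError:
            if len(stack) == 2:
                # Unencodable character inside the replacement string:
                # raise a normal exception.
                raise
            replacement = callback(encname, string, pos)
            stack[-1][1] = pos + 1          # skip the offending character
            stack.append([replacement, 0])  # encode the replacement next
    return "".join(result)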

> > (I hope that's enough explanation of the API and
> > implementation)
>
> Could you add these docs to the Misc/unicode.txt file ? I
> will eventually take that file and turn it into a PEP which
> will then serve as general documentation for these things.

I could, but first we should work out how the decoding
callback API will work.

> > I have renamed the static ...121 function to all lowercase
> > names.
>
> Ok.
>
> > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> > replacement callback.
>
> Hmm, wouldn't that result in a slowdown ? If so, I'd rather
> leave the special encoder in place, since it is being used a
> lot in Python and probably some applications too.

It would be a slowdown. But callbacks open many possibilities.

For example:

   Why can't I print u"gürk"?

is probably one of the most frequently asked questions in
comp.lang.python. For printing Unicode stuff, print could be
extended to use an error handling callback for Unicode
strings (or objects whose __str__ or tp_str returns a
Unicode object) instead of using str(), which always returns
an 8bit string and uses strict encoding. There might even be a
sys.setprintencodehandler()/sys.getprintencodehandler().
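
(With something like the proposed callbacks, printing could behave the
way manual encoding with "replace" already does today, e.g.:

    >>> print u"g\xfcrk".encode("ascii", "replace")
    g?rk

only without the explicit encode() call.)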

> [...]
> I think it would be worthwhile to rename the callbacks to
> include "Unicode" somewhere, e.g.
> PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
> then it points out the application field of the callback
> rather well. Same for the callbacks exposed through the
> _codecsmodule.

OK, done (and PyCodec_XMLCharRefReplaceUnicodeEncodeErrors
really is a long name ;))
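
At the Python level that callback would boil down to something like this
(a sketch, using the same argument convention as the FFFDreplace example
above):

def xmlcharrefreplace(enc, uni, pos):
    # Replace the unencodable character with an XML character reference.
    return u"&#%d;" % ord(uni[pos])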

> > I have not touched PyUnicode_TranslateCharmap yet,
> > should this function also support error callbacks? Why
> > would one want to insert None into the mapping to call
> > the callback?
>
> 1. Yes.
> 2. The user may want to e.g. restrict usage of certain
> character ranges. In this case the codec would be used to
> verify the input and an exception would indeed be useful
> (e.g. say you want to restrict input to Hangul + ASCII).

OK, do we want TranslateCharmap to work exactly like encoding,
i.e. in case of an error should the returned replacement
string again be mapped through the translation mapping or
should it be copied to the output directly? The former would
be more in line with encoding, but IMHO the latter would
be much more useful.

BTW, when I implement it I can implement patch #403100
("Multicharacter replacements in PyUnicode_TranslateCharmap")
along the way.
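
That feature would let a single character in the mapping expand to
several characters, e.g. (hypothetical at the time of writing):

    mapping = {ord(u"\xe4"): u"ae", ord(u"\xdf"): u"ss"}
    # u"Stra\xdfe".translate(mapping) would then give u"Strasse"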

Should the old TranslateCharmap map to the new TranslateCharmapEx
and inherit the "multicharacter replacement" feature, or
should I leave it as it is?

> > A remaining problem is how to implement decoding error
> > callbacks. In Python 2.1 encoding and decoding errors are
> > handled in the same way with a string value. But with
> > callbacks it doesn't make sense to use the same callback
> > for encoding and decoding (like codecs.StreamReaderWriter
> > and codecs.StreamRecoder do). Decoding callbacks have a
> > different API. Which arguments should be passed to the
> > decoding callback, and what is the decoding callback
> > supposed to do?
>
> I'd suggest adding another set of PyCodec_UnicodeDecode...()
> APIs for this. We'd then have to augment the base classes of
> the StreamCodecs to provide two attributes for .errors with
> a fallback solution for the string case (i.e. "strict" can
> still be used for both directions).

Sounds good. Now what is the decoding callback supposed to do?
I guess it will be called in the same way as the encoding
callback, i.e. with encoding name, original string and
position of the error. It might return a Unicode string
(i.e. an object of the decoding target type), that will be
emitted from the codec instead of the one offending byte. Or
it might return a tuple with a replacement Unicode object and
a resynchronisation offset, i.e. returning (u"?", 1) means
emit a '?' and skip the offending character. But to make
the offset really useful the callback has to know something
about the encoding, perhaps the codec should be allowed to
pass an additional state object to the callback?
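
Under the tuple variant a decoding callback could look like this
(purely a sketch, nothing in the patch implements decoding callbacks
yet):

def decode_replace(enc, input, pos):
    # 'input' is the 8bit input string, 'pos' the offset of the
    # offending byte: emit U+FFFD and resynchronise one byte further.
    return (u"\ufffd", 1)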

Maybe the same should be added to the encoding callbacks too?
Maybe the encoding callback should be able to tell the
encoder whether the replacement returned should be reencoded
(in which case it's a Unicode object), or directly emitted
(in which case it's an 8bit string)?

> > One additional note: It is vital that errors is an
> > assignable attribute of the StreamWriter.
>
> It is already !

I know, but IMHO it should be documented that an assignable
errors attribute must be supported as part of the official
codec API.

Misc/unicode.txt is not clear on that:
"""
It is not required by the Unicode implementation to use 
these base classes, only the interfaces must match; this 
allows writing Codecs as extension types.
"""