Message 36780 - Python tracker

Author	lemburg
Date	2001-06-13.08:05:58

Content

Logged In: YES 
user_id=38388

> How the callbacks work:
>
> A PyObject * named errors is passed in. This may be NULL,
> Py_None, 'strict', u'strict', 'ignore', u'ignore',
> 'replace', u'replace' or a callable object.
> PyCodec_EncodeHandlerForObject maps all of these objects to
> one of the three builtin error callbacks
> PyCodec_RaiseEncodeErrors (raises an exception),
> PyCodec_IgnoreEncodeErrors (returns an empty replacement
> string, in effect ignoring the error),
> PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
> replacement character to signify to the encoder that it
> should choose a suitable replacement character) or directly
> returns errors if it is a callable object. When an
> unencodable character is encountered the error handling
> callback will be called with the encoding name, the original
> unicode object and the error position and must return a
> unicode object that will be encoded instead of the offending
> character (or the callback may of course raise an
> exception). U+FFFD characters in the replacement string will
> be replaced with a character that the encoder chooses ('?'
> in all cases).

Nice.
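
For readers following along, here is a minimal Python sketch of the dispatch described in the quote above. It is not the patch's C code: the function names are hypothetical Python stand-ins for the PyCodec_* functions, the callback signature (encoding name, unicode object, error position) is taken from the description, and mapping NULL/None to the raising handler is an assumption.

```python
REPLACEMENT = "\ufffd"  # U+FFFD: tells the encoder to pick its own substitute


class EncodeError(ValueError):
    """Stand-in exception for this sketch (hypothetical name)."""


def raise_encode_errors(encoding, text, pos):
    raise EncodeError(f"{encoding}: cannot encode {text[pos]!r} at position {pos}")


def ignore_encode_errors(encoding, text, pos):
    return ""  # empty replacement, in effect ignoring the error


def replace_encode_errors(encoding, text, pos):
    return REPLACEMENT


_BUILTIN_HANDLERS = {
    "strict": raise_encode_errors,
    "ignore": ignore_encode_errors,
    "replace": replace_encode_errors,
}


def encode_handler_for_object(errors):
    """Map None, a handler name, or a callable to an error callback."""
    if errors is None:  # assumption: NULL/None behaves like 'strict'
        return raise_encode_errors
    if callable(errors):
        return errors  # callables are returned directly
    try:
        return _BUILTIN_HANDLERS[errors]
    except (KeyError, TypeError):
        raise ValueError(f"unknown error handling: {errors!r}") from None
```

A custom callable passes through unchanged, so an encoder only ever has to deal with one kind of object: a function taking (encoding, text, pos) and returning the replacement string.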
 
> The implementation of the loop through the string is done in
> the following way. A stack with two strings is kept and the
> loop always encodes a character from the string at the
> stacktop. If an error is encountered and the stack has only
> one entry (during encoding of the original string) the
> callback is called and the unicode object returned is pushed
> on the stack, so the encoding continues with the replacement
> string. If the stack has two entries when an error is
> encountered, the replacement string itself has an
> unencodable character and a normal exception is raised. When
> the encoder has reached the end of its current string there
> are two possibilities: when the stack contains two entries,
> this was the replacement string, so the replacement string
> will be popped from the stack and encoding continues with
> the next character from the original string. If the stack
> had only one entry, encoding is finished.

Very elegant solution !
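
The two-entry stack can be sketched in a few lines of Python. This is an illustration under assumptions rather than the patch's implementation: the encoder is plain ASCII, the function name `encode_ascii` is made up for the example, and the callback takes (encoding, text, pos) as described above.

```python
def encode_ascii(text, handler, encoding="ascii"):
    """Encode text to ASCII, using a stack of at most two [string, position] entries."""
    out = bytearray()
    stack = [[text, 0]]  # entry 0: original string; entry 1: replacement string
    while stack:
        current, pos = stack[-1]
        if pos == len(current):
            # End of the current string: popping a replacement resumes the
            # original; popping the original empties the stack and ends the loop.
            stack.pop()
            continue
        ch = current[pos]
        stack[-1][1] = pos + 1  # advance past this character
        if ord(ch) < 128:
            out.append(ord(ch))
        elif ch == "\ufffd" and len(stack) == 2:
            out += b"?"  # U+FFFD in a replacement: the encoder chooses '?'
        elif len(stack) == 1:
            # Error in the original string: push the callback's replacement.
            stack.append([handler(encoding, current, pos), 0])
        else:
            # Error inside the replacement string itself: give up.
            raise ValueError("replacement string contains an unencodable character")
    return bytes(out)
```

For example, a callback returning `"&#%d;" % ord(text[pos])` turns an unencodable character into a character reference, and the loop then encodes that reference like any other ASCII text before returning to the original string.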
 
> (I hope that's enough explanation of the API and
> implementation)

Could you add these docs to the Misc/unicode.txt file ? I
will eventually take that file and turn it into a PEP which
will then serve as general documentation for these things.
 
> I have renamed the static ...121 function to all lowercase
> names.

Ok.
 
> BTW, I guess PyUnicode_EncodeUnicodeEscape could be
> reimplemented as PyUnicode_EncodeASCII with a \uxxxx
> replacement callback.

Hmm, wouldn't that result in a slowdown ? If so, I'd rather
leave the special encoder in place, since it is being used a
lot in Python and probably some applications too.
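
To make the trade-off concrete, here is a rough Python sketch of the callback route. Both function names are hypothetical, the callback signature is the one described earlier in this thread, and the escape format only covers characters up to U+FFFD.

```python
def uescape_encode_errors(encoding, text, pos):
    """Replacement callback: return a \\uxxxx escape for the offending character."""
    return "\\u%04x" % ord(text[pos])  # assumes a code point that fits in 4 hex digits


def encode_ascii_with_callback(text, handler):
    """Plain ASCII encoder that calls the handler once per unencodable character."""
    pieces = []
    for pos, ch in enumerate(text):
        if ord(ch) < 128:
            pieces.append(ch)
        else:
            pieces.append(handler("ascii", text, pos))
    return "".join(pieces).encode("ascii")
```

Each unencodable character costs one callback invocation plus a temporary string, which is where a slowdown relative to a dedicated encoder would come from.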
 
> PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
> PyCodec_ReplaceEncodeErrors are globally visible because
> they have to be available in _codecsmodule.c to wrap them as
> Python function objects, but they can't be implemented in
> _codecsmodule, because they need to be available to the
> encoders in unicodeobject.c (through
> PyCodec_EncodeHandlerForObject), but importing the codecs
> module might result in an endless recursion, because
> importing a module requires unpickling of the bytecode,
> which might require decoding utf8, which ... (but this will
> only happen, if we implement the same mechanism for the
> decoding API)

I think that codecs.c is the right place for these APIs.
_codecsmodule.c is only meant as a Python access wrapper for
the internal codecs and nothing more.

One thing I noted about the callbacks: they assume that they
will always get Unicode objects as input. This is certainly
not true in the general case (it is for the codecs you touch
in the patch). 

I think it would be worthwhile to rename the callbacks to
include "Unicode" somewhere, e.g.
PyCodec_UnicodeReplaceEncodeErrors(). It's a long name, but
then it points out the application field of the callback
rather well. Same for the callbacks exposed through the
_codecsmodule.

> I have not touched PyUnicode_TranslateCharmap yet,
> should this function also support error callbacks? Why would
> one want to insert None into the mapping to call the
> callback?

1. Yes.
2. The user may want to e.g. restrict usage of certain
character ranges. In this case the codec would be used to
verify the input and an exception would indeed be useful
(e.g. say you want to restrict input to Hangul + ASCII).
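
A small Python sketch of that use case follows. Everything here is illustrative: `translate_charmap`, `HangulAsciiOnly` and `raise_errors` are made-up names, "Hangul" is assumed to mean the Hangul Syllables block (U+AC00 to U+D7A3), and a None entry in the mapping is what triggers the callback.

```python
class HangulAsciiOnly:
    """Mapping for a translate-style codec; None marks a forbidden character."""

    def __getitem__(self, codepoint):
        if codepoint < 0x80 or 0xAC00 <= codepoint <= 0xD7A3:
            raise LookupError(codepoint)  # not in the mapping: keep unchanged
        return None  # forbidden: the error callback decides what happens


def raise_errors(encoding, text, pos):
    raise ValueError(f"{encoding}: character {text[pos]!r} at position {pos} not allowed")


def translate_charmap(text, mapping, handler):
    """Translate text through mapping, calling handler for entries mapped to None."""
    out = []
    for pos, ch in enumerate(text):
        try:
            target = mapping[ord(ch)]
        except LookupError:
            out.append(ch)  # unmapped characters pass through
            continue
        if target is None:
            out.append(handler("charmap", text, pos))
        elif isinstance(target, int):
            out.append(chr(target))
        else:
            out.append(target)
    return "".join(out)
```

With the raising callback the call doubles as input validation: `translate_charmap(user_input, HangulAsciiOnly(), raise_errors)` either returns the text unchanged or raises at the first character outside the allowed ranges.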
 
> A remaining problem is how to implement decoding error
> callbacks. In Python 2.1 encoding and decoding errors are
> handled in the same way with a string value. But with
> callbacks it doesn't make sense to use the same callback for
> encoding and decoding (like codecs.StreamReaderWriter and
> codecs.StreamRecoder do). Decoding callbacks have a
> different API. Which arguments should be passed to the
> decoding callback, and what is the decoding callback
> supposed to do?

I'd suggest adding another set of PyCodec_UnicodeDecode...()
APIs for this. We'd then have to augment the base classes of
the StreamCodecs to provide two attributes for .errors with
a fallback solution for the string case (i.e. "strict" can
still be used for both directions).
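
One possible shape for the two attributes, sketched in Python. The class and attribute names (`DirectionalErrors`, `encode_errors`, `decode_errors`) are purely hypothetical and nothing in the patch defines them; the point is only the fallback rule for string values.

```python
class DirectionalErrors:
    """Hold separate error handling for encoding and decoding.

    A string name such as "strict" is valid in both directions and serves
    as the shared fallback; callbacks must be given per direction.
    """

    def __init__(self, errors="strict", encode_errors=None, decode_errors=None):
        if not isinstance(errors, str):
            raise TypeError("the shared 'errors' value must be a string name")
        self.encode_errors = errors if encode_errors is None else encode_errors
        self.decode_errors = errors if decode_errors is None else decode_errors
```

Existing code that passes a single string keeps working, while a caller with an encode-only callback sets just `encode_errors` and leaves decoding on the string default.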

> One additional note: It is vital that errors is an
> assignable attribute of the StreamWriter.

It is already !
 
> Consider the XML example: For writing an XML DOM tree one
> StreamWriter object is used. When a text node is written,
> the error handling has to be set to
> codecs.xmlreplace_encode_errors, but inside a comment or
> processing instruction replacing unencodable characters with
> charrefs is not possible, so here codecs.raise_encode_errors
> should be used (or better a custom error handler that raises
> an error that says "sorry, you can't have unencodable
> characters inside a comment")

Sure.
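
The switching pattern can be tried with the codecs module as it exists in current Python, where the string name "xmlcharrefreplace" plays the role that the quote gives to codecs.xmlreplace_encode_errors. The helper names `write_text` and `write_comment` are made up for this sketch, and markup escaping of the text itself is left out.

```python
import codecs
import io


def write_text(writer, text):
    """Text nodes: unencodable characters become character references."""
    writer.errors = "xmlcharrefreplace"
    writer.write(text)


def write_comment(writer, text):
    """Comments: character references are not possible, so fail loudly."""
    writer.errors = "strict"
    try:
        writer.write("<!--" + text + "-->")
    except UnicodeEncodeError as exc:
        raise ValueError(
            "sorry, you can't have unencodable characters inside a comment"
        ) from exc


buffer = io.BytesIO()
writer = codecs.getwriter("ascii")(buffer)  # one StreamWriter for the whole document
write_text(writer, "price: \u20ac5")
write_comment(writer, " ok ")
```

Because StreamWriter.write() reads self.errors on every call, reassigning the attribute between nodes is enough; no second writer is needed.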
 
> BTW, should we continue the discussion in the i18n SIG
> mailing list? An email program is much more comfortable than
> a HTML textarea! ;)

I'd rather keep the discussions on this patch here --
forking it off to the i18n sig will make it very hard to
follow up on it. (This HTML area is indeed damn small ;-)

History
Date	User	Action	Args
2007-08-23 15:06:04	admin	link	issue432401 messages
2007-08-23 15:06:04	admin	create