Message 36778 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	doerwalter
Recipients
Date	2001-06-12.18:59:24
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to

Content
Logged In: YES user_id=89016 How the callbacks work: A PyObject * named errors is passed in. This may by NULL, Py_None, 'strict', u'strict', 'ignore', u'ignore', 'replace', u'replace' or a callable object. PyCodec_EncodeHandlerForObject maps all of these objects to one of the three builtin error callbacks PyCodec_RaiseEncodeErrors (raises an exception), PyCodec_IgnoreEncodeErrors (returns an empty replacement string, in effect ignoring the error), PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode replacement character to signify to the encoder that it should choose a suitable replacement character) or directly returns errors if it is a callable object. When an unencodable character is encounterd the error handling callback will be called with the encoding name, the original unicode object and the error position and must return a unicode object that will be encoded instead of the offending character (or the callback may of course raise an exception). U+FFFD characters in the replacement string will be replaced with a character that the encoder chooses ('?' in all cases). The implementation of the loop through the string is done in the following way. A stack with two strings is kept and the loop always encodes a character from the string at the stacktop. If an error is encountered and the stack has only one entry (during encoding of the original string) the callback is called and the unicode object returned is pushed on the stack, so the encoding continues with the replacement string. If the stack has two entries when an error is encountered, the replacement string itself has an unencodable character and a normal exception raised. When the encoder has reached the end of it's current string there are two possibilities: when the stack contains two entries, this was the replacement string, so the replacement string will be poppep from the stack and encoding continues with the next character from the original string. If the stack had only one entry, encoding is finished. (I hope that's enough explanation of the API and implementation) I have renamed the static ...121 function to all lowercase names. BTW, I guess PyUnicode_EncodeUnicodeEscape could be reimplemented as PyUnicode_EncodeASCII with a \uxxxx replacement callback. PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors, PyCodec_ReplaceEncodeErrors are globally visible because they have to be available in _codecsmodule.c to wrap them as Python function objects, but they can't be implemented in _codecsmodule, because they need to be available to the encoders in unicodeobject.c (through PyCodec_EncodeHandlerForObject), but importing the codecs module might result in an endless recursion, because importing a module requires unpickling of the bytecode, which might require decoding utf8, which ... (but this will only happen, if we implement the same mechanism for the decoding API) I have not touched PyUnicode_TranslateCharmap yet, should this function also support error callbacks? Why would one want the insert None into the mapping to call the callback? A remaining problem is how to implement decoding error callbacks. In Python 2.1 encoding and decoding errors are handled in the same way with a string value. But with callbacks it doesn't make sense to use the same callback for encoding and decoding (like codecs.StreamReaderWriter and codecs.StreamRecoder do). Decoding callbacks have a different API. Which arguments should be passed to the decoding callback, and what is the decoding callback supposed to do?

Logged In: YES 
user_id=89016

How the callbacks work:

A PyObject * named errors is passed in. This may by NULL,
Py_None, 'strict', u'strict', 'ignore', u'ignore',
'replace', u'replace' or a callable object.
PyCodec_EncodeHandlerForObject maps all of these objects to
one of the three builtin error callbacks
PyCodec_RaiseEncodeErrors (raises an exception),
PyCodec_IgnoreEncodeErrors (returns an empty replacement
string, in effect ignoring the error),
PyCodec_ReplaceEncodeErrors (returns U+FFFD, the Unicode
replacement character to signify to the encoder that it
should choose a suitable replacement character) or directly
returns errors if it is a callable object. When an
unencodable character is encounterd the error handling
callback will be called with the encoding name, the original
unicode object and the error position and must return a
unicode object that will be encoded instead of the offending
character (or the callback may of course raise an
exception). U+FFFD characters in the replacement string will 
be replaced with a character that the encoder chooses ('?'
in all cases).

The implementation of the loop through the string is done in
the following way. A stack with two strings is kept and the
loop always encodes a character from the string at the
stacktop. If an error is encountered and the stack has only
one entry (during encoding of the original string) the
callback is called and the unicode object returned is pushed
on the stack, so the encoding continues with the replacement
string. If the stack has two entries when an error is
encountered, the replacement string itself has an
unencodable character and a normal exception raised. When
the encoder has reached the end of it's current string there
are two possibilities: when the stack contains two entries,
this was the replacement string, so the replacement string
will be poppep from the stack and encoding continues with
the next character from the original string. If the stack
had only one entry, encoding is finished.

(I hope that's enough explanation of the API and implementation)

I have renamed the static ...121 function to all lowercase
names.

BTW, I guess PyUnicode_EncodeUnicodeEscape could be
reimplemented as PyUnicode_EncodeASCII with a \uxxxx
replacement callback.

PyCodec_RaiseEncodeErrors, PyCodec_IgnoreEncodeErrors,
PyCodec_ReplaceEncodeErrors are globally visible because
they have to be available in _codecsmodule.c to wrap them as
Python function objects, but they can't be implemented in
_codecsmodule, because they need to be available to the
encoders in unicodeobject.c (through
PyCodec_EncodeHandlerForObject), but importing the codecs
module might result in an endless recursion, because
importing a module requires unpickling of the bytecode,
which might require decoding utf8, which ... (but this will
only happen, if we implement the same mechanism for the
decoding API)

I have not touched PyUnicode_TranslateCharmap yet, 
should this function also support error callbacks? Why would
one want the insert None into the mapping to call the callback?

A remaining problem is how to implement decoding error
callbacks. In Python 2.1 encoding and decoding errors are
handled in the same way with a string value. But with
callbacks it doesn't make sense to use the same callback for
encoding and decoding (like codecs.StreamReaderWriter and
codecs.StreamRecoder do). Decoding callbacks have a
different API. Which arguments should be passed to the
decoding callback, and what is the decoding callback
supposed to do?

History
Date	User	Action	Args
2007-08-23 15:06:03	admin	link	issue432401 messages
2007-08-23 15:06:03	admin	create