
Author lemburg
Date 2001-07-10.12:29:11

Ok, here we go...

>    > > raise an exception). U+FFFD characters in the replacement
>    > > string will be replaced with a character that the encoder
>    > > chooses ('?' in all cases).
>    >
>    > Nice.
> 
>    But the special casing of U+FFFD makes the interface somewhat
>    less clean than it could be. It was only done to be 100%
>    backwards compatible. With the original "replace" error
>    handling the codec chose the replacement character. But as
>    far as I can tell none of the codecs uses anything other
>    than '?', 

True.

>    so I guess we could change the replace handler
>    to always return u'?'. This would make the implementation a
>    little bit simpler, but the explanation of the callback
>    feature *a lot* simpler. 

Go for it.
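[For reference, this is exactly how the built-in "replace" handler
ended up behaving in today's Python: encoding always substitutes '?'.
A minimal demonstration:

```python
# The built-in "replace" error handler substitutes '?' for every
# character the codec cannot encode.
print(u"gürk".encode("ascii", "replace"))  # b'g?rk'
```
]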

>    And if you still want to handle
>    an unencodable U+FFFD, you can write a special callback for
>    that, e.g.
> 
>    def FFFDreplace(enc, uni, pos):
>        if uni[pos] == u"\ufffd":
>            return u"?"
>        else:
>            raise UnicodeError(...)
>
>    > ...docs...
>    >
>    > Could you add these docs to the Misc/unicode.txt file ? I
>    > will eventually take that file and turn it into a PEP which
>    > will then serve as general documentation for these things.
> 
>    I could, but first we should work out how the decoding
>    callback API will work.

Ok. BTW, Barry Warsaw already did the work of converting the
unicode.txt to PEP 100, so the docs should eventually go there.
 
>    > > BTW, I guess PyUnicode_EncodeUnicodeEscape could be
>    > > reimplemented as PyUnicode_EncodeASCII with a \uxxxx
>    > > replacement callback.
>    >
>    > Hmm, wouldn't that result in a slowdown ? If so, I'd rather
>    > leave the special encoder in place, since it is being used a
>    > lot in Python and probably some applications too.
> 
>    It would be a slowdown. But callbacks open many
>    possibilities.

True, but in this case I believe that we should stick with
the native implementation for "unicode-escape". A standard
error callback which does the \uXXXX replacement would be
nice to have, though, since it would also be usable with lots
of other codecs (e.g. all the code page ones).
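[In today's Python this idea exists as the built-in "backslashreplace"
error handler; a hand-rolled sketch of such a \uXXXX-replacement
callback (the handler name and function are illustrative, not part of
the stdlib) could look like:

```python
import codecs

def uxxxx_replace(exc):
    # Replace each unencodable character with its \uXXXX escape and
    # resume encoding after the failing span.
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        return u"".join(u"\\u%04x" % ord(ch) for ch in bad), exc.end
    raise exc

codecs.register_error("uxxxx_replace", uxxxx_replace)

print(u"gürk".encode("ascii", "uxxxx_replace"))  # b'g\\u00fcrk'
```

Because the replacement is pure ASCII, it works with any of the code
page codecs as well.]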
 
>    For example:
> 
>       Why can't I print u"gürk"?
> 
>    is probably one of the most frequently asked questions in
>    comp.lang.python. For printing Unicode stuff, print could be
>    extended to use an error handling callback for Unicode
>    strings (or objects where __str__ or tp_str returns a
>    Unicode object) instead of using str() which always returns
>    an 8bit string and uses strict encoding. There might even be a
>    sys.setprintencodehandler()/sys.getprintencodehandler()

There already is a print callback in Python (forgot the name of the
hook though), so this should be possible by providing the
encoding logic in the hook.
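[A hedged sketch of the idea in modern terms: no such sys hook was
ever added, but the effect can be had today by wrapping the output
stream itself with a lenient error handler so printing can never
raise on unencodable characters:

```python
import io
import sys

# Wrap the underlying byte stream in a writer that applies a lenient
# error handler; unencodable characters degrade to '?' instead of
# raising UnicodeEncodeError.
lenient = io.TextIOWrapper(sys.stdout.buffer, encoding="ascii",
                           errors="replace")
print(u"gürk", file=lenient)   # writes "g?rk"
lenient.flush()
```
]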
 
>    > > I have not touched PyUnicode_TranslateCharmap yet,
>    > > should this function also support error callbacks? Why
>    > > would one want the insertion of None into the mapping
>    > > to call the callback?
>    >
>    > 1. Yes.
>    > 2. The user may want to e.g. restrict usage of certain
>    > character ranges. In this case the codec would be used to
>    > verify the input and an exception would indeed be useful
>    > (e.g. say you want to restrict input to Hangul + ASCII).
> 
>    OK, do we want TranslateCharmap to work exactly like encoding,
>    i.e. in case of an error should the returned replacement
>    string again be mapped through the translation mapping or
>    should it be copied to the output directly? The former would
>    be more in line with encoding, but IMHO the latter would
>    be much more useful.

It's better to take the second approach (copy the callback
output directly to the output string) to avoid endless
recursion and other pitfalls.

I suppose this will also simplify the implementation somewhat.
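[A toy illustration of the pitfall (all names hypothetical; this is
not the real PyUnicode_TranslateCharmap): if the callback's output
were mapped again, a replacement that is itself unmapped would never
terminate.

```python
def translate(text, mapping, on_error):
    # Unmapped characters go through the error callback; its output is
    # appended verbatim, never fed back through the mapping.
    out = []
    for ch in text:
        out.append(mapping[ch] if ch in mapping else on_error(ch))
    return u"".join(out)

# '?' is not in the mapping either; copying it to the output directly
# is what keeps this from recursing forever.
print(translate(u"abc", {u"a": u"x"}, lambda ch: u"?"))  # x??
```
]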
 
>    BTW, when I implement it I can implement patch #403100
>    ("Multicharacter replacements in PyUnicode_TranslateCharmap")
>    along the way.

I've seen it; will comment on it later.
 
>    Should the old TranslateCharmap map to the new TranslateCharmapEx
>    and inherit the "multicharacter replacement" feature,
>    or should I leave it as it is?

If possible, please also add the multichar replacement
to the old API. I think it is very useful and since the
old APIs work on raw buffers it would be a benefit to have
the functionality in the old implementation too.
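[For what it's worth, multicharacter replacement is exactly what
str.translate supports in today's Python: a mapping entry may be a
whole string rather than a single code point.

```python
# A charmap entry may map one code point to a multi-character string.
table = {ord(u"ß"): u"ss", ord(u"æ"): u"ae"}
print(u"straße".translate(table))  # strasse
```
]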
 
[Decoding error callbacks]

>    > > A remaining problem is how to implement decoding error
>    > > callbacks. In Python 2.1 encoding and decoding errors are
>    > > handled in the same way with a string value. But with
>    > > callbacks it doesn't make sense to use the same callback
>    > > for encoding and decoding (like codecs.StreamReaderWriter
>    > > and codecs.StreamRecoder do). Decoding callbacks have a
>    > > different API. Which arguments should be passed to the
>    > > decoding callback, and what is the decoding callback
>    > > supposed to do?
>    >
>    > I'd suggest adding another set of PyCodec_UnicodeDecode...()
>    > APIs for this. We'd then have to augment the base classes of
>    > the StreamCodecs to provide two attributes for .errors with
>    > a fallback solution for the string case (i.e. "strict" can
>    > still be used for both directions).
> 
>    Sounds good. Now what is the decoding callback supposed to do?
>    I guess it will be called in the same way as the encoding
>    callback, i.e. with encoding name, original string and
>    position of the error. It might return a Unicode string
>    (i.e. an object of the decoding target type), that will be
>    emitted from the codec instead of the one offending byte. Or
>    it might return a tuple with replacement Unicode object and
>    a resynchronisation offset, i.e. returning (u"?", 1) means
>    emit a '?' and skip the offending character. But to make
>    the offset really useful the callback has to know something
>    about the encoding, perhaps the codec should be allowed to
>    pass an additional state object to the callback?
> 
>    Maybe the same should be added to the encoding callbacks too?
>    Maybe the encoding callback should be able to tell the
>    encoder if the replacement returned should be reencoded
>    (in which case it's a Unicode object), or directly emitted
>    (in which case it's an 8bit string)?

I like the idea of having an optional state object (basically
this should be a codec-defined arbitrary Python object)
which then allows the callback to apply additional tricks.
The object should be documented to be modifiable in place
(this simplifies the interface).

About the return value:

I'd suggest always using the same tuple interface, e.g.

    callback(encoding, input_data, input_position, state) -> 
        (output_to_be_appended, new_input_position)

(I think it's better to use absolute values for the position 
rather than offsets.)

Perhaps the encoding callbacks should use the same 
interface... what do you think ?
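[What eventually landed (PEP 293) is close in spirit, though the
callback receives a single exception object carrying the encoding,
the input, and the absolute error positions, rather than separate
arguments plus a state object. A decoding handler using the
(replacement, new-absolute-position) return convention discussed
above (the handler name is illustrative):

```python
import codecs

def mark_and_skip(exc):
    # Decoding handler: emit U+FFFD for the offending bytes and
    # resume at an absolute input position, matching the tuple
    # interface sketched above.
    if isinstance(exc, UnicodeDecodeError):
        return (u"\ufffd", exc.end)
    raise exc

codecs.register_error("mark_and_skip", mark_and_skip)

print(b"g\xffrk".decode("ascii", "mark_and_skip"))  # 'g\ufffdrk'
```
]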

>    > > One additional note: It is vital that errors is an
>    > > assignable attribute of the StreamWriter.
>    >
>    > It is already !
> 
>    I know, but IMHO it should be documented that an assignable
>    errors attribute must be supported as part of the official
>    codec API.
> 
>    Misc/unicode.txt is not clear on that:
>    """
>    It is not required by the Unicode implementation to use 
>    these base classes, only the interfaces must match; this 
>    allows writing Codecs as extension types.
>    """

Good point. I'll add that to PEP 100.