Author doerwalter
Date 2001-07-12.11:03:15

> >    [...]
> >    so I guess we could change the replace handler
> >    to always return u'?'. This would make the
> >    implementation a little bit simpler, but the 
> >    explanation of the callback feature *a lot* 
> >    simpler. 
> 
> Go for it.

OK, done!

> [...]
> >    > Could you add these docs to the Misc/unicode.txt
> >    > file ? I will eventually take that file and turn 
> >    > it into a PEP which will then serve as general 
> >    > documentation for these things.
> > 
> >    I could, but first we should work out how the 
> >    decoding callback API will work.
> 
> Ok. BTW, Barry Warsaw already did the work of converting
> the unicode.txt to PEP 100, so the docs should eventually 
> go there.

OK. I guess it would be best to do this when everything 
is finished.

> >    > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> >    > > could be reimplemented as PyUnicode_EncodeASCII 
> >    > > with \uxxxx replacement callback.
> >    >
> >    > Hmm, wouldn't that result in a slowdown ? If so,
> >    > I'd rather leave the special encoder in place, 
> >    > since it is being used a lot in Python and 
> >    > probably some applications too.
> > 
> >    It would be a slowdown. But callbacks open many 
> >    possibilities.
> 
> True, but in this case I believe that we should stick with
> the native implementation for "unicode-escape". Having
> a standard callback error handler which does the \uXXXX
> replacement would be nice to have though, since this would
> also be usable with lots of other codecs (e.g. all the
> code page ones).

OK, done. There is now a
PyCodec_EscapeReplaceUnicodeEncodeErrors/
codecs.escapereplace_unicodeencode_errors
handler that uses \uXXXX escapes (or \UXXXXXXXX if the
code point is above 0xffff, which can only happen on a
wide build of Python).
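
For illustration, a rough pure-Python sketch of that
replacement logic, using the callback signature discussed
further down (this is an assumption for readability, not
the actual C implementation):
---
def escapereplace(encoding, unicode, pos, state):
    # Hypothetical sketch, not the real implementation: emit a
    # \uXXXX escape for the unencodable character, or \UXXXXXXXX
    # for code points above 0xffff (only reachable on a wide build).
    char = ord(unicode[pos])
    if char > 0xffff:
        return (u"\\U%08x" % char, pos + 1)
    return (u"\\u%04x" % char, pos + 1)
---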

> >    For example:
> > 
> >       Why can't I print u"gürk"?
> > 
> >    is probably one of the most frequently asked
> >    questions in comp.lang.python. For printing 
> >    Unicode stuff, print could be extended to use an 
> >    error handling callback for Unicode strings (or 
> >    objects where __str__ or tp_str returns a Unicode 
> >    object) instead of using str() which always 
> >    returns an 8bit string and uses strict encoding. 
> >    There might even be a
> >    sys.setprintencodehandler()/sys.getprintencodehandler()
> 
> There already is a print callback in Python (forgot the
> name of the hook though), so this should be possible by 
> providing the encoding logic in the hook.

True: sys.displayhook
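
For illustration only, a minimal sketch of hooking it; the
function name, the encoding choice and the "replace" handler
are assumptions, not an agreed design:
---
import sys, __builtin__

def unicode_displayhook(value):
    # Hypothetical sketch: display interactive results, encoding
    # Unicode objects with a forgiving error handler instead of
    # failing on strict ASCII conversion.
    if value is None:
        return
    __builtin__._ = value
    if isinstance(value, unicode):
        enc = getattr(sys.stdout, "encoding", None) or "ascii"
        print value.encode(enc, "replace")
    else:
        print repr(value)

sys.displayhook = unicode_displayhook
---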

> [...]
> >    Should the old TranslateCharmap map to the new 
> >    TranslateCharmapEx and inherit the 
> >    "multicharacter replacement" feature,
> >    or should I leave it as it is?
> 
> If possible, please also add the multichar replacement
> to the old API. I think it is very useful and since the
> old APIs work on raw buffers it would be a benefit to have
> the functionality in the old implementation too.

OK! I will try to find the time to implement that in the 
next days.
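
For context, "multicharacter replacement" means that a
single mapping entry may replace one character with a
string of several characters; hypothetically, once the
feature is in place, something like:
---
# Hypothetical usage once multicharacter replacement is available:
# a single character maps to a two-character replacement string.
print u"g\xfcrk".translate({0xfc: u"ue"})   # -> guerk
---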

> [Decoding error callbacks]
>
> About the return value:
> 
> I'd suggest to always use the same tuple interface, e.g.
> 
>     callback(encoding, input_data, input_position, state)
>         -> (output_to_be_appended, new_input_position)
> 
> (I think it's better to use absolute values for the 
> position rather than offsets.)
> 
> Perhaps the encoding callbacks should use the same 
> interface... what do you think ?

This would make the callback feature hypergeneric and a
little slower, because tuples have to be created, but it
(almost) unifies the encoding and decoding APIs ("almost"
because for the encoder the output_to_be_appended will be
reencoded, while for the decoder it will simply be
appended), so I'm for it.
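
Under that tuple interface a decoding handler could look
like the following sketch (the name and the U+FFFD choice
are assumptions, not part of the proposal):
---
def replace_decode_errors(encoding, input_data, input_position, state):
    # Hypothetical sketch of a decoding callback under the tuple
    # interface: skip the undecodable byte and append U+FFFD.
    return (u"\ufffd", input_position + 1)
---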

I implemented this and changed the encoders to only look
up the error handler on the first error. The UCS1 encoder
no longer uses the two-item stack strategy. (That strategy
only makes sense for those encoders where the encoding
itself is much more complicated than the looping/callback
machinery.) Memory overflow tests are now only done when
an unencodable character occurs, so the UCS1 encoder
should be as fast as it was without error callbacks.
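
In rough Python-level terms (the real code is C, and the
lookup helper below is assumed, not an existing function),
the lazy lookup amounts to something like:
---
def encode_ucs1(input, errors):
    # Hypothetical Python-level sketch of the encoder loop: the
    # error handler is only resolved (and overflow checks only
    # made) once an unencodable character actually occurs.
    output = []
    handler = None
    pos = 0
    while pos < len(input):
        char = ord(input[pos])
        if char <= 0xff:
            output.append(chr(char))
            pos += 1
        else:
            if handler is None:
                handler = lookup_error_handler(errors)  # assumed helper
            replacement, pos = handler("latin-1", input, pos, None)
            # the encoder reencodes the handler's output
            output.append(replacement.encode("latin-1", errors))
    return "".join(output)
---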

Do we want to enforce new_input_position>input_position,
or should jumping back be allowed?

> >    > > One additional note: It is vital that errors
> >    > > is an assignable attribute of the StreamWriter.
> >    >
> >    > It is already !
> > 
> >    I know, but IMHO it should be documented that an
> >    assignable errors attribute must be supported 
> >    as part of the official codec API.
> > 
> >    Misc/unicode.txt is not clear on that:
> >    """
> >    It is not required by the Unicode implementation
> >    to use these base classes, only the interfaces must 
> >    match; this allows writing Codecs as extension types.
> >    """
> 
> Good point. I'll add that to the PEP 100.

OK.

Here is the current todo list:
1. Implement a new TranslateCharmap and fix the old one.
2. New encoding API for string objects too.
3. Decoding
4. Documentation
5. Test cases

I'm thinking about a different strategy for implementing
callbacks (see
http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html).

We could have an error handler registry, which maps names
to error handlers; then it would be possible to keep the
errors argument as "const char *" instead of "PyObject *".
Currently PyCodec_UnicodeEncodeHandlerForObject is a
backwards compatibility hack that will never go away,
because it's always more convenient to type
   u"...".encode("...", "strict")
instead of
   import codecs
   u"...".encode("...", codecs.raise_encode_errors)

But with an error handler registry this function would 
become the official lookup method for error handlers. 
(PyCodec_LookupUnicodeEncodeErrorHandler?)
Python code would look like this:
---
def xmlreplace(encoding, unicode, pos, state):
    # replace the unencodable character with an XML character reference
    return (u"&#%d;" % ord(unicode[pos]), pos + 1)

import codec

codec.registerError("xmlreplace", xmlreplace)
---
and then the following call can be made:
	u"äöü".encode("ascii", "xmlreplace")
As soon as the first error is encountered, the encoder uses
its builtin error handling method if it recognizes the name 
("strict", "replace" or "ignore") or looks up the error 
handling function in the registry if it doesn't. In this way
the speed for the backwards compatible features is the same 
as before and "const char *error" can be kept as the 
parameter to all encoding functions. For speed, common
error handler names could even be implemented in the
encoder itself.
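
A minimal Python sketch of such a registry (all names here
are hypothetical):
---
_error_handlers = {}

def registerError(name, handler):
    # Hypothetical registry: map an error handler name to a callback.
    _error_handlers[name] = handler

def lookupError(name):
    # Builtin names ("strict", "replace", "ignore") would normally
    # be short-circuited inside the encoder itself; everything else
    # is looked up here.
    try:
        return _error_handlers[name]
    except KeyError:
        raise LookupError("unknown error handler name %r" % (name,))
---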

But for special one-shot error handlers, it might still be 
useful to pass the error handler directly, so maybe we 
should leave error as PyObject *, but implement the 
registry anyway?