Author lemburg
Date 2001-07-13.11:26:07

> > >    > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> > >    > > could be reimplemented as PyUnicode_EncodeASCII
> > >    > > with \uxxxx replacement callback.
> > >    >
> > >    > Hmm, wouldn't that result in a slowdown ? If so,
> > >    > I'd rather leave the special encoder in place,
> > >    > since it is being used a lot in Python and
> > >    > probably some applications too.
> > >
> > >    It would be a slowdown. But callbacks open many
> > >    possibilities.
> >
> > True, but in this case I believe that we should stick with
> > the native implementation for "unicode-escape". Having
> > a standard callback error handler which does the \uXXXX
> > replacement would be nice to have though, since this would
> > also be usable with lots of other codecs (e.g. all the
> > code page ones).
> 
> OK, done, now there's a
> PyCodec_EscapeReplaceUnicodeEncodeErrors/
> codecs.escapereplace_unicodeencode_errors
> that uses \u (or \U if x>0xffff (with a wide build
> of Python)).

Great !
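
A minimal sketch of what such a \uXXXX replacement handler could look like. It is written against the present-day codecs.register_error() interface of Python 3, not against the patch discussed here, so the handler receives an exception object instead of separate arguments. The names escape_replace and "escape_replace_sketch" are made up for illustration; Python 3 ships a comparable built-in handler named "backslashreplace".

```python
import codecs


def escape_replace(exc):
    """Replace unencodable characters with \\uXXXX or \\UXXXXXXXX escapes."""
    # This sketch only knows how to handle encoding errors.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        # Code points above 0xFFFF need the eight-digit \U form.
        pieces.append("\\U%08x" % cp if cp > 0xFFFF else "\\u%04x" % cp)
    # Return the replacement text and the position at which to resume.
    return "".join(pieces), exc.end


# Hypothetical handler name, chosen so it cannot clash with a built-in one.
codecs.register_error("escape_replace_sketch", escape_replace)

result = "a\u20acb\U0001f600".encode("ascii", "escape_replace_sketch")
print(result)
```

Because the handler only emits ASCII characters, the same name should be usable with other codecs too, such as the code page ones.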
 
> > [...]
> > >    Should the old TranslateCharmap map to the new
> > >    TranslateCharmapEx and inherit the
> > >    "multicharacter replacement" feature,
> > >    or should I leave it as it is?
> >
> > If possible, please also add the multichar replacement
> > to the old API. I think it is very useful and since the
> > old APIs work on raw buffers it would be a benefit to have
> > the functionality in the old implementation too.
> 
> OK! I will try to find the time to implement that in the
> next days.

Good.
 
> > [Decoding error callbacks]
> >
> > About the return value:
> >
> > I'd suggest always using the same tuple interface, e.g.
> >
> >     callback(encoding, input_data, input_position, state) ->
> >         (output_to_be_appended, new_input_position)
> >
> > (I think it's better to use absolute values for the
> > position rather than offsets.)
> >
> > Perhaps the encoding callbacks should use the same
> > interface... what do you think ?
> 
> This would make the callback feature hypergeneric and a
> little slower, because tuples have to be created, but it
> (almost) unifies the encoding and decoding API. ("almost"
> because, for the encoder output_to_be_appended will be
> reencoded, for the decoder it will simply be appended.),
> so I'm for it.

That's the point. 

Note that I don't think the tuple creation
will hurt much (see the make_tuple() API in codecs.c)
since small tuples are cached by Python internally.
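
To make the proposed tuple interface concrete, here is a toy pure-Python decoder that calls such a callback. It is only a sketch of the idea: decode_ascii and hex_escape are hypothetical names, the real codecs are written in C, and the state argument is simply passed through unused.

```python
def decode_ascii(data, callback, state=None):
    """Toy ASCII decoder that delegates undecodable bytes to a callback."""
    out = []
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte < 128:
            out.append(chr(byte))
            pos += 1
        else:
            # The callback returns the text to append and the absolute
            # input position at which decoding should continue.
            replacement, pos = callback("ascii", data, pos, state)
            # On the decoding side the replacement is appended as-is.
            out.append(replacement)
    return "".join(out)


def hex_escape(encoding, input_data, input_position, state):
    """Hypothetical callback: show the offending byte as a \\xNN escape."""
    return ("\\x%02x" % input_data[input_position], input_position + 1)


decoded = decode_ascii(b"a\xffb", hex_escape)
print(decoded)
```

The loop does not check the returned position in any way, so a callback may also return a position before the current one; a callback that never advances would make this sketch loop forever.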
 
> I implemented this and changed the encoders to only
> lookup the error handler on the first error. The UCS1
> encoder now no longer uses the two-item stack strategy.
> (This strategy only makes sense for those encoders where
> the encoding itself is much more complicated than the
> looping/callback etc.) Memory overflow tests are now
> only done when an unencodable character occurs, so the
> UCS1 encoder should be as fast as it was without
> error callbacks.
> 
> Do we want to enforce new_input_position>input_position,
> or should jumping back be allowed?

No; moving backwards should be allowed (this may be useful
in order to resynchronize with the input data).
 
> Here is the current todo list:
> 1. implement a new TranslateCharmap and fix the old.
> 2. New encoding API for string objects too.
> 3. Decoding
> 4. Documentation
> 5. Test cases
> 
> I'm thinking about a different strategy for implementing
> callbacks
> (see http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)
> 
> We could have an error handler registry, which maps names
> to error handlers, then it would be possible to keep the
> errors argument as "const char *" instead of "PyObject *".
> Currently PyCodec_UnicodeEncodeHandlerForObject is a
> backwards compatibility hack that will never go away,
> because it's always more convenient to type
>    u"...".encode("...", "strict")
> instead of
>    import codecs
>    u"...".encode("...", codecs.raise_encode_errors)
> 
> But with an error handler registry this function would
> become the official lookup method for error handlers.
> (PyCodec_LookupUnicodeEncodeErrorHandler?)
> Python code would look like this:
> ---
> def xmlreplace(encoding, unicode, pos, state):
>    return (u"&#%d;" % ord(unicode[pos]), pos+1)
> 
> import codecs
> 
> codecs.registerError("xmlreplace",xmlreplace)
> ---
> and then the following call can be made:
>         u"äöü".encode("ascii", "xmlreplace")
> As soon as the first error is encountered, the encoder uses
> its builtin error handling method if it recognizes the name
> ("strict", "replace" or "ignore") or looks up the error
> handling function in the registry if it doesn't. In this way
> the speed for the backwards compatible features is the same
> as before and "const char *error" can be kept as the
> parameter to all encoding functions. For speed common error
> handling names could even be implemented in the encoder
> itself.
> 
> But for special one-shot error handlers, it might still be
> useful to pass the error handler directly, so maybe we
> should leave error as PyObject *, but implement the
> registry anyway?

Good idea !

One minor nit: codecs.registerError() should be named
codecs.register_errorhandler() to be more in line with
the Python coding style guide.
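
The registry idea can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the patch itself: register_errorhandler follows the name suggested above, while _ERROR_HANDLERS and encode_ascii are hypothetical helpers. Built-in names are handled inline, and the registry is only consulted when the first error occurs.

```python
_ERROR_HANDLERS = {}  # hypothetical registry: name -> callback


def register_errorhandler(name, handler):
    """Register a callback under a name usable as the errors argument."""
    _ERROR_HANDLERS[name] = handler


def encode_ascii(text, errors="strict"):
    """Toy ASCII encoder with built-in and registered error handling."""
    out = []
    pos = 0
    handler = None  # looked up lazily, on the first error only
    while pos < len(text):
        ch = text[pos]
        if ord(ch) < 128:
            out.append(ch)
            pos += 1
        elif errors == "strict":
            raise UnicodeEncodeError("ascii", text, pos, pos + 1,
                                     "ordinal not in range(128)")
        elif errors == "ignore":
            pos += 1
        elif errors == "replace":
            out.append("?")
            pos += 1
        else:
            if handler is None:
                handler = _ERROR_HANDLERS[errors]  # KeyError if unknown
            replacement, pos = handler("ascii", text, pos, None)
            out.append(replacement)
    # The replacement text is re-encoded together with everything else,
    # so a handler must return ASCII-only text in this sketch.
    return "".join(out).encode("ascii")


def xmlreplace(encoding, text, pos, state):
    """Callback in the style of the example quoted above."""
    return ("&#%d;" % ord(text[pos]), pos + 1)


register_errorhandler("xmlreplace", xmlreplace)
encoded = encode_ascii("\xe4\xf6\xfc", "xmlreplace")
print(encoded)
```

With this layout the errors argument stays a plain string, and the cost of the registry lookup is only paid by input that actually contains an unencodable character.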
History
Date User Action Args
2007-08-23 15:06:06 admin link issue432401 messages
2007-08-23 15:06:06 admin create