Message36787
Logged In: YES
user_id=38388
> > > > > BTW, I guess PyUnicode_EncodeUnicodeEscape
> > > > > could be reimplemented as PyUnicode_EncodeASCII
> > > > > with \uxxxx replacement callback.
> > > >
> > > > Hmm, wouldn't that result in a slowdown ? If so,
> > > > I'd rather leave the special encoder in place,
> > > > since it is being used a lot in Python and
> > > > probably some applications too.
> > >
> > > It would be a slowdown. But callbacks open many
> > > possibilities.
> >
> > True, but in this case I believe that we should stick with
> > the native implementation for "unicode-escape". Having
> > a standard callback error handler which does the \uXXXX
> > replacement would be nice to have though, since this would
> > also be usable with lots of other codecs (e.g. all the
> > code page ones).
>
> OK, done, now there's a
> PyCodec_EscapeReplaceUnicodeEncodeErrors/
> codecs.escapereplace_unicodeencode_errors
> that uses \u (or \U if x>0xffff (with a wide build
> of Python)).
Great !
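
For illustration, a \uXXXX replacement handler of the kind discussed above might look roughly like the sketch below. It is written against the codecs.register_error() function found in present-day Python, whose handlers receive the exception object rather than the arguments proposed in this thread. The handler name "escapereplace_sketch" and the function name are made up for the example and are not the names used in the patch.

```python
import codecs

def escape_replace(exc):
    # Hypothetical handler: only encoding errors are handled here.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        # \U for code points above 0xffff, \u otherwise.
        parts.append("\\U%08x" % cp if cp > 0xFFFF else "\\u%04x" % cp)
    # Return the replacement text and the absolute position to resume at.
    return ("".join(parts), exc.end)

# "escapereplace_sketch" is an illustrative name, not one from the patch.
codecs.register_error("escapereplace_sketch", escape_replace)

result = "a\u20ac\U0001f600".encode("ascii", "escapereplace_sketch")
```

Because the handler is registered by name, the same replacement can be reused with other codecs, such as the code page ones mentioned above.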
> > [...]
> > > Should the old TranslateCharmap map to the new
> > > TranslateCharmapEx and inherit the
> > > "multicharacter replacement" feature,
> > > or should I leave it as it is?
> >
> > If possible, please also add the multichar replacement
> > to the old API. I think it is very useful and since the
> > old APIs work on raw buffers it would be a benefit to have
> > the functionality in the old implementation too.
>
> OK! I will try to find the time to implement that in the
> next days.
Good.
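
As a rough picture of what "multicharacter replacement" means at the Python level, the sketch below uses the translate() method of present-day str objects, where a mapping value may be a string of any length or None to delete the character. Whether the C-level TranslateCharmap API ends up behaving exactly like this is an assumption; the sample table is invented for the example.

```python
# Mapping from code point to replacement; values may be longer than
# one character, or None to drop the character entirely.
table = {
    ord("\xe4"): "ae",   # a-umlaut  -> two characters
    ord("\xdf"): "ss",   # sharp s   -> two characters
    ord("\xfc"): None,   # u-umlaut  -> deleted
}

result = "Stra\xdfe, K\xe4se, M\xfcll".translate(table)
```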
> > [Decoding error callbacks]
> >
> > About the return value:
> >
> > I'd suggest to always use the same tuple interface, e.g.
> >
> > callback(encoding, input_data, input_position, state) ->
> > (output_to_be_appended, new_input_position)
> >
> > (I think it's better to use absolute values for the
> > position rather than offsets.)
> >
> > Perhaps the encoding callbacks should use the same
> > interface... what do you think ?
>
> This would make the callback feature hypergeneric and a
> little slower, because tuples have to be created, but it
> (almost) unifies the encoding and decoding API, so I'm for
> it. ("Almost" because for the encoder output_to_be_appended
> will be reencoded, while for the decoder it will simply be
> appended.)
That's the point.
Note that I don't think the tuple creation
will hurt much (see the make_tuple() API in codecs.c)
since small tuples are cached by Python internally.
> I implemented this and changed the encoders to only
> look up the error handler on the first error. The UCS1
> encoder now no longer uses the two-item stack strategy.
> (This strategy only makes sense for those encoders where
> the encoding itself is much more complicated than the
> looping/callback etc.) Memory overflow tests are now
> only done when an unencodable character occurs, so the
> UCS1 encoder should be as fast as it was without
> error callbacks.
>
> Do we want to enforce new_input_position>input_position,
> or should jumping back be allowed?
No; moving backwards should be allowed (this may be useful
in order to resynchronize with the input data).
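
To make the tuple interface concrete, here is a minimal pure-Python sketch of a decoder loop driving such a callback. The toy ASCII decoder and the helper names decode_ascii and hex_escape are invented for the example; only the callback signature and the (output_to_be_appended, new_input_position) return value are taken from the discussion above.

```python
def decode_ascii(data, callback, state=None):
    """Toy ASCII decoder that defers to `callback` on invalid bytes."""
    out = []
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte < 0x80:
            out.append(chr(byte))
            pos += 1
        else:
            # The callback returns the text to append and the absolute
            # position at which decoding resumes. Nothing here prevents
            # the new position from being smaller than the old one, so a
            # careless callback could loop forever.
            appended, pos = callback("ascii", data, pos, state)
            out.append(appended)
    return "".join(out)

def hex_escape(encoding, data, pos, state):
    # Replace the offending byte with a \xNN escape and skip past it.
    return ("\\x%02x" % data[pos], pos + 1)

decoded = decode_ascii(b"a\xffb", hex_escape)
```

For the decoder the returned text is appended as is; an encoder using the same interface would re-encode it first.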
> Here is the current todo list:
> 1. implement a new TranslateCharmap and fix the old.
> 2. New encoding API for string objects too.
> 3. Decoding
> 4. Documentation
> 5. Test cases
>
> I'm thinking about a different strategy for implementing
> callbacks (see
> http://mail.python.org/pipermail/i18n-sig/2001-July/001262.html)
>
> We could have an error handler registry, which maps names
> to error handlers, then it would be possible to keep the
> errors argument as "const char *" instead of "PyObject *".
> Currently PyCodec_UnicodeEncodeHandlerForObject is a
> backwards compatibility hack that will never go away,
> because it's always more convenient to type
> u"...".encode("...", "strict")
> instead of
> import codecs
> u"...".encode("...", codecs.raise_encode_errors)
>
> But with an error handler registry this function would
> become the official lookup method for error handlers.
> (PyCodec_LookupUnicodeEncodeErrorHandler?)
> Python code would look like this:
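> ---
> import codecs
>
> def xmlreplace(encoding, unicode, pos, state):
>     # Replace the unencodable character with an XML character
>     # reference and continue after it.
>     return (u"&#%d;" % ord(unicode[pos]), pos+1)
>
> codecs.registerError("xmlreplace", xmlreplace)
> ---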
> and then the following call can be made:
> u"äöü".encode("ascii", "xmlreplace")
> As soon as the first error is encountered, the encoder uses
> its builtin error handling method if it recognizes the name
> ("strict", "replace" or "ignore") or looks up the error
> handling function in the registry if it doesn't. In this way
> the speed for the backwards compatible features is the same
> as before and "const char *error" can be kept as the
> parameter to all encoding functions. For speed common error
> handling names could even be implemented in the encoder
> itself.
>
> But for special one-shot error handlers, it might still be
> useful to pass the error handler directly, so maybe we
> should leave error as PyObject *, but implement the
> registry anyway?
Good idea !
One minor nit: codecs.registerError() should be named
codecs.register_errorhandler() to be more in line with
the Python coding style guide.
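
The lookup order described in the quoted proposal (built-in names handled by the encoder itself, anything else fetched from the registry on the first error) could be sketched in pure Python as follows. Everything here is a stand-in: the module-level dict, the toy ASCII encoder and the function names are assumptions made for the example, not the patch's implementation.

```python
_error_handlers = {}  # hypothetical registry: name -> handler

def register_errorhandler(name, handler):
    _error_handlers[name] = handler

def lookup_errorhandler(name):
    try:
        return _error_handlers[name]
    except KeyError:
        raise LookupError("unknown error handler name %r" % name)

def xmlreplace(encoding, text, pos, state):
    # Replace one character with an XML character reference.
    return ("&#%d;" % ord(text[pos]), pos + 1)

def encode_ascii(text, errors="strict"):
    """Toy ASCII encoder; the registry is consulted only on first error."""
    out = []
    pos = 0
    handler = None
    while pos < len(text):
        ch = text[pos]
        if ord(ch) < 128:
            out.append(ch)
            pos += 1
        elif errors == "strict":      # built-in names: no registry lookup
            raise UnicodeEncodeError("ascii", text, pos, pos + 1,
                                     "ordinal not in range(128)")
        elif errors == "replace":
            out.append("?")
            pos += 1
        elif errors == "ignore":
            pos += 1
        else:
            if handler is None:       # looked up once, on the first error
                handler = lookup_errorhandler(errors)
            replacement, pos = handler("ascii", text, pos, None)
            out.append(replacement)
    return "".join(out).encode("ascii")

register_errorhandler("xmlreplace", xmlreplace)
# "\xe4\xf6\xfc" is the a-umlaut, o-umlaut, u-umlaut string from above.
encoded = encode_ascii("\xe4\xf6\xfc", "xmlreplace")
```

With this arrangement the errors argument stays a plain string, and input that never hits an error pays nothing for the registry.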