Author malin
Recipients ezio.melotti, malin, vstinner
Date 2020-07-18.04:53:39
Message-id <1595048019.57.0.348575052072.issue41330@roundup.psfhosted.org>
In-reply-to
Content
CJK encode/decode functions only have three error-handler fast-paths:
    replace
    ignore
    strict  
See the code: [1][2]

If any other built-in error handler is used, the codec has to get the error-handler object and call it with a Unicode exception object (`UnicodeEncodeError` or `UnicodeDecodeError`) as its argument. See the code: [3]

However, the error-handler object is not cached. It has to be looked up from a dict every time an error occurs, which is very inefficient.
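
To make the slow path concrete, here is a rough Python-level sketch of what it amounts to. It is not the C implementation: `encode_with_handler` is a hypothetical helper, and it assumes a stateless codec such as `gbk` so that the input can be encoded piece by piece. The handler is resolved by name with `codecs.lookup_error` on every failure, then called with an exception object.

```python
import codecs


def encode_with_handler(text, encoding, errors):
    """Encode `text`, resolving the error handler by name on every failure.

    Hypothetical helper for illustration; assumes a stateless codec.
    """
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            # Try to encode everything that is left in one go.
            out += text[pos:].encode(encoding, "strict")
            break
        except UnicodeEncodeError as exc:
            # exc.start / exc.end are relative to the slice text[pos:].
            out += text[pos:pos + exc.start].encode(encoding, "strict")
            # Rebuild the exception so its positions refer to the whole string.
            err = UnicodeEncodeError(
                encoding, text, pos + exc.start, pos + exc.end, exc.reason
            )
            # The lookup is repeated for every unencodable character.
            handler = codecs.lookup_error(errors)
            replacement, newpos = handler(err)
            if isinstance(replacement, str):
                out += replacement.encode(encoding, "strict")
            else:
                out += replacement
            pos = newpos
    return bytes(out)
```

For a string with many unencodable characters, the lookup and the call happen once per character, which is the cost described above.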


Another possible optimization is to write fast-paths for the common error handlers. Python has these built-in error handlers:

    strict
    replace
    ignore
    backslashreplace
    xmlcharrefreplace
    namereplace
    surrogateescape
    surrogatepass (only for utf-8/utf-16/utf-32 family)
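
For reference, a small sketch of what some of these handlers produce when encoding with a CJK codec. The sample string and the choice of `gbk` are assumptions made for illustration; U+1F600 is used because it has no GBK mapping. `strict` is left out because it simply raises `UnicodeEncodeError`.

```python
# Sample input: one unencodable character between two ASCII letters.
TEXT = "a\U0001F600b"
HANDLERS = (
    "replace",
    "ignore",
    "backslashreplace",
    "xmlcharrefreplace",
    "namereplace",
)


def encode_all(text=TEXT, encoding="gbk"):
    """Return a mapping of handler name to the bytes it produces."""
    return {name: text.encode(encoding, name) for name in HANDLERS}


if __name__ == "__main__":
    for name, data in encode_all().items():
        print(f"{name:18} {data!r}")
    # replace            b'a?b'
    # ignore             b'ab'
    # backslashreplace   b'a\\U0001f600b'
    # xmlcharrefreplace  b'a&#128512;b'
    # namereplace        b'a\\N{GRINNING FACE}b'
```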

For example, `xmlcharrefreplace` is probably heavily used in Web applications. It could be implemented as a fast-path, so that there is no need to call the error-handler object every time, just like the `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap` [4].
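
The idea of such a fast-path, sketched in Python rather than C: the replacement is generated inline, without looking up or calling a handler object. `xmlcharref_inline` is a hypothetical name, and encoding one character at a time is only valid for a stateless codec such as `gbk`. A real fast-path would live in the C codec.

```python
def xmlcharref_inline(text, encoding="gbk"):
    """Encode `text`, writing &#NNN; directly for unencodable characters.

    Hypothetical illustration; assumes a stateless codec.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding, "strict")
        except UnicodeEncodeError:
            # Inline replacement: decimal code point as a character reference.
            out += b"&#%d;" % ord(ch)
    return bytes(out)
```

The result should match what the built-in `xmlcharrefreplace` handler produces, for example `"a\U0001F600b".encode("gbk", "xmlcharrefreplace")`.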

[1] encode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L192

[2] decode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L347

[3] `call_error_callback` function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L82

[4] `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap`:
https://github.com/python/cpython/blob/v3.9.0b4/Objects/unicodeobject.c#L8662