Issue 41330: Inefficient error-handle for CJK encodings

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/85502

classification

Title:	Inefficient error-handle for CJK encodings
Type:	performance	Stage:
Components:	Unicode	Versions:	Python 3.10

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	corona10, ezio.melotti, malin, methane, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2020-07-18 04:53 by malin, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
error_handers_fast_paths.txt	malin, 2020-07-18 07:37

Messages (11)
msg373871	Author: Ma Lin (malin)	Date: 2020-07-18 04:53
CJK encode/decode functions only have three error-handler fast-paths:

    replace
    ignore
    strict

See the code: [1][2]

If another built-in error-handler is used, the codec needs to get the error-handler object and call it with a Unicode exception argument. See the code: [3]

But the error-handler object is not cached; it needs to be looked up from a dict every time, which is very inefficient.

Another possible optimization is to write fast-paths for common error-handlers. Python has these built-in error-handlers:

    strict
    replace
    ignore
    backslashreplace
    xmlcharrefreplace
    namereplace
    surrogateescape
    surrogatepass (only for the utf-8/utf-16/utf-32 family)

For example, `xmlcharrefreplace` may be heavily used in Web applications. It could be implemented as a fast-path, so that there is no need to call the error-handler object every time, just like the `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap` [4].

[1] encode function: https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L192
[2] decode function: https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L347
[3] `call_error_callback` function: https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L82
[4] `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap`: https://github.com/python/cpython/blob/v3.9.0b4/Objects/unicodeobject.c#L8662
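
For readers unfamiliar with the handlers listed in this message, the following is a minimal sketch of how they behave from the Python side with a CJK codec. It only illustrates observable behaviour, not the C code paths; the sample string and the variable names are assumptions made for the illustration.

```python
import codecs

# "中文" is encodable in GBK; U+1F600 (an emoji) is not.
text = "中文 \U0001F600 ok"

handlers = ["ignore", "replace", "backslashreplace",
            "xmlcharrefreplace", "namereplace"]
results = {name: text.encode("gbk", name) for name in handlers}
for name, data in results.items():
    print(f"{name:18} {data!r}")

# "strict" raises instead of substituting anything.
try:
    text.encode("gbk", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# Handlers other than strict/ignore/replace are ordinary callables that
# are fetched by name from the codec registry and called with an
# exception object describing the failing range.
handler = codecs.lookup_error("xmlcharrefreplace")
exc = UnicodeEncodeError("gbk", text, 3, 4, "illegal multibyte sequence")
replacement, resume_at = handler(exc)
print(replacement, resume_at)
```

The last three statements mirror, in Python, what the message describes for the non-fast-path case: a lookup by name followed by a call with a Unicode exception argument.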
msg373878	Author: Serhiy Storchaka (serhiy.storchaka)	Date: 2020-07-18 06:50
I am not even sure it was worth adding a fast path for "xmlcharrefreplace".

"surrogateescape" and "surrogatepass" are the ones most likely used in performance-critical cases. It is also easy to add support for "ignore" and "replace". "strict" raises an exception in any case, and "backslashreplace", "xmlcharrefreplace" and "namereplace" are too complex and are used in cases where encoding time is not dominant (error reporting, debugging, formatting complex documents).
msg373881	Author: Ma Lin (malin)	Date: 2020-07-18 07:37
IMO "xmlcharrefreplace" is useful for Web applications. For example, if the page's charset is "gbk", this statement can generate the bytes content easily and safely:

    s.encode('gbk', 'xmlcharrefreplace')

Maybe some HTML-related frameworks escape characters this way, such as Sphinx [1].

The attached file `error_handers_fast_paths.txt` summarizes all current error-handler fast-paths.

[1] Sphinx uses 'xmlcharrefreplace' to escape:
https://github.com/sphinx-doc/sphinx/blob/e65021fb9b0286f373f01dc19a5777e5eed49576/sphinx/builders/html/__init__.py#L1029
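
A small sketch of why this statement is "safe" for an HTML page: characters that GBK cannot represent become numeric character references, which an HTML parser turns back into the original characters. The sample markup is an assumption, and `html.unescape` stands in for what a browser would do.

```python
import html

# The emoji is outside GBK, so it cannot be encoded directly.
s = "<p>你好 \U0001F600</p>"

# Unencodable characters are written as &#NNN; references.
body = s.encode("gbk", "xmlcharrefreplace")
print(body)

# Decoding the bytes and resolving the references recovers the text.
round_trip = html.unescape(body.decode("gbk"))
print(round_trip == s)
```

Note that this only escapes characters the target charset lacks; it does not escape markup characters such as `<` or `&`.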
msg373885	Author: Inada Naoki (methane)	Date: 2020-07-18 08:09
But how many new Python web applications use a CJK codec instead of UTF-8?
msg373889	Author: Ma Lin (malin)	Date: 2020-07-18 08:31
> But how many new Python web applications use a CJK codec instead of UTF-8?

A CJK character usually takes 2 bytes in CJK encodings, but 3 bytes in UTF-8. I tested a Chinese book:

    in GBK:   853,025 bytes
    in UTF-8: 1,267,523 bytes

For CJK content UTF-8 is wasteful, so CJK encodings may not be eliminated.
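
The size difference is easy to reproduce on a short string. This sketch uses an arbitrary four-character sample (an assumption, not the book from the message) and then computes the ratio implied by the two book figures quoted above.

```python
# Four common Chinese characters, all present in GBK.
cjk = "汉字编码"

gbk_bytes = cjk.encode("gbk")
utf8_bytes = cjk.encode("utf-8")

# 2 bytes per character in GBK, 3 bytes per character in UTF-8.
print(len(cjk), len(gbk_bytes), len(utf8_bytes))

# Ratio for the book measured in the message (UTF-8 size / GBK size).
# It is slightly below 1.5 because ASCII characters take 1 byte in both.
book_ratio = 1_267_523 / 853_025
print(round(book_ratio, 3))
```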
msg373890	Author: Serhiy Storchaka (serhiy.storchaka)	Date: 2020-07-18 08:39
In a Web application you first need to generate the data (this may involve some network requests, IO operations, and some data transformations), then format the page, then encode it, and finally send it to the client. I suppose that the encoding part is minor in comparison with the others.

Also, as Inada-san noted, UTF-8 is a more popular encoding in modern applications. It is also fast, so you may prefer UTF-8 if the performance of encoding is important to you.
msg374616	Author: Dong-hee Na (corona10)	Date: 2020-07-30 14:48
I am also +1 on Serhiy's opinion.

As I am Korean (I don't know the environment in Japan or China), I know that there still exist old Korean websites that use the EUC-KR encoding. But at least for modern Korean websites and applications from the 2010s on, most are built on UTF-8.
msg374656	Author: Ma Lin (malin)	Date: 2020-08-01 05:44
At least this bug should be fixed: the error-handler object is not cached; it needs to be looked up from a dict every time, which is very inefficient.

The code:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L81-L98

I will submit a PR at some point.
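
The caching idea can be sketched in Python with a toy encoder. This is not the C implementation and not the eventual patch: `encode_ascii`, its `cache_handler` flag, and the `"toy-ascii"` codec name are hypothetical, introduced only to count how often the handler is fetched from the registry.

```python
import codecs

def encode_ascii(text, errors, cache_handler):
    """Toy ASCII encoder that reports how many handler lookups it made."""
    out = bytearray()
    handler = None
    lookups = 0
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ord(ch) < 128:
            out.append(ord(ch))
            pos += 1
            continue
        # Uncached: fetch the handler from the registry on every error.
        # Cached: fetch it on the first error only and reuse it.
        if handler is None or not cache_handler:
            handler = codecs.lookup_error(errors)
            lookups += 1
        exc = UnicodeEncodeError("toy-ascii", text, pos, pos + 1, "not ASCII")
        replacement, pos = handler(exc)
        if isinstance(replacement, str):
            replacement = replacement.encode("ascii")
        out += replacement
    return bytes(out), lookups

print(encode_ascii("a\u00e9b\u00e9", "xmlcharrefreplace", cache_handler=False))
print(encode_ascii("a\u00e9b\u00e9", "xmlcharrefreplace", cache_handler=True))
```

Both calls produce the same bytes; only the number of registry lookups differs, which is the cost the message asks to remove.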
msg374768	Author: STINNER Victor (vstinner)	Date: 2020-08-03 23:18
Since the CJK codecs were implemented, unicodeobject.c has gained multiple optimizations:

* _PyUnicodeWriter for decoders: an API designed with efficiency and PEP 393 (compact strings) in mind
* _PyBytesWriter for encoders: in short, an API to overallocate a buffer
* the _Py_error_handler enum and the "_Py_error_handler _Py_GetErrorHandler(const char *errors)" function, to pass an error handler as an integer rather than a string

But rewriting the CJK codecs with these is a lot of effort, and I'm not sure that it's worth it.
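
The third item, passing the error handler as an integer rather than a string, can be illustrated with a Python analogue. Everything in this sketch is hypothetical (`ErrorHandler`, `get_error_handler`, `encode_toy`, the `"toy"` codec name); it only mimics the shape of the C enum and lookup function named above and is not CPython's code.

```python
from enum import IntEnum

class ErrorHandler(IntEnum):
    """Hypothetical stand-in for the C _Py_error_handler enum."""
    STRICT = 1
    IGNORE = 2
    REPLACE = 3
    XMLCHARREFREPLACE = 4
    OTHER = 5

_BY_NAME = {
    "strict": ErrorHandler.STRICT,
    "ignore": ErrorHandler.IGNORE,
    "replace": ErrorHandler.REPLACE,
    "xmlcharrefreplace": ErrorHandler.XMLCHARREFREPLACE,
}

def get_error_handler(errors):
    """Map a handler name to an integer once, before the encode loop."""
    if errors is None:
        return ErrorHandler.STRICT
    return _BY_NAME.get(errors, ErrorHandler.OTHER)

def encode_toy(text, errors=None):
    """Toy ASCII encoder that dispatches on the integer, not the name."""
    handler = get_error_handler(errors)
    out = bytearray()
    for pos, ch in enumerate(text):
        code = ord(ch)
        if code < 128:
            out.append(code)
        elif handler is ErrorHandler.IGNORE:
            continue
        elif handler is ErrorHandler.REPLACE:
            out += b"?"
        elif handler is ErrorHandler.XMLCHARREFREPLACE:
            out += b"&#%d;" % code
        elif handler is ErrorHandler.STRICT:
            raise UnicodeEncodeError("toy", text, pos, pos + 1, "not ASCII")
        else:
            # A real codec would fall back to the registered callable here.
            raise NotImplementedError("slow path not shown in this sketch")
    return bytes(out)

print(encode_toy("a\u00e9", "xmlcharrefreplace"))
```

The point of the design is that the string comparison happens once per call, and the per-character loop only compares integers.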
msg374775	Author: Ma Lin (malin)	Date: 2020-08-04 00:25
I'm working on issue41265. If nothing happens, I would also like to write a zstd module for the stdlib before the end of the year, but I dare not promise this.

If anyone wants to work on this issue, I would be very grateful.
msg374776	Author: STINNER Victor (vstinner)	Date: 2020-08-04 00:32
(off topic)

> If nothing happens, I would also like to write a zstd module for the stdlib before the end of the year, but I dare not promise this.

I suggest you publish it on PyPI. Once it is mature, you can propose it on python-ideas. Last time someone proposed a new compression algorithm for the stdlib, it was rejected, if I recall correctly. I forget which one was proposed. Maybe search for "compresslib" on python-ideas.

History
Date	User	Action	Args
2022-04-11 14:59:33	admin	set	github: 85502
2020-08-04 00:32:34	vstinner	set	messages: + msg374776
2020-08-04 00:25:00	malin	set	messages: + msg374775
2020-08-03 23:18:57	vstinner	set	messages: + msg374768
2020-08-01 05:44:18	malin	set	messages: + msg374656
2020-07-30 14:48:43	corona10	set	nosy: + corona10 messages: + msg374616
2020-07-18 08:39:08	serhiy.storchaka	set	messages: + msg373890
2020-07-18 08:31:13	malin	set	messages: + msg373889
2020-07-18 08:09:04	methane	set	nosy: + methane messages: + msg373885
2020-07-18 07:37:10	malin	set	files: + error_handers_fast_paths.txt messages: + msg373881
2020-07-18 06:50:14	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg373878
2020-07-18 04:53:39	malin	create