Issue 18059: Add multibyte encoding support to pyexpat

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62259

classification

Title:	Add multibyte encoding support to pyexpat
Type:	enhancement	Stage:	patch review
Components:	Extension Modules, XML	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, doerwalter, lemburg, ncoghlan, scoder, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-05-25 18:41 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
pyexpat_multibyte_encodings_4.patch	serhiy.storchaka, 2013-05-26 10:31		review
pyexpat_multibyte_encodings_5.patch	serhiy.storchaka, 2013-09-14 18:55		review

Messages (16)
msg189989 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-05-25 18:41
It is possible to add support for most multibyte encodings to pyexpat. There are several ways to do this:
1. Generate the maps with a special script and add the generated file to the repository. After adding or updating a multibyte encoding, this file should be regenerated.
2. Generate the maps on the fly. This takes more time on first use of an encoding, but allows support for any encoding that is compatible with expat.
msg189995 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2013-05-25 19:53
I guess GB18030 can't be supported at all?
msg190011 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-05-25 21:24
Here is a patch which implements the first way. Yes, it looks like the following encodings cannot be supported at all: euc-kr, gb18030, iso2022-kr, utf-7, cp037, cp424, cp500, cp864, cp875, cp1026, cp1140, utf_32, utf_32_be, utf_32_le.
msg190022 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2013-05-25 21:58
Then you should also remove the "Make it as simple as possible" comment :-/
msg190024 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-05-25 22:03
It is still simple enough.
msg190064 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-05-26 05:57
Patch updated. Fixed an error in the encodings generator and added an additional compatibility check for 8-bit encodings in PyUnknownEncodingHandler(). Feel free to bikeshed the encodings generator.
msg190070 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-05-26 10:31
Patch updated. Some more tests added and some more bugs fixed.
msg197686 - (view)	Author: Stefan Behnel (scoder) *	Date: 2013-09-14 05:26
I don't think I have my head deep enough in the encodings implementation to say that this is the correct/best way to do it, but the patch looks mostly reasonable to me and would be a helpful addition. I have two comments on the pyexpat_encoding_convert() function:

1) I can't see a safeguard against reading beyond the data buffer. What if s already points to the last byte and we are trying to read two or three bytes to decode them? I wouldn't be surprised to see that this kind of input can be crafted.

2) Creating a throw-away Unicode object through a named decoder looks like a huge overhead for decoding two bytes. It might be considered an optimisation to change that, but if you are really trying to parse a longer XML document with lots of Japanese text in it (i.e. if you actually need this feature), it will most likely end up being way too slow to make any real use of it.

I think that both points should be addressed before this gets added.
msg197708 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-09-14 13:19
1) Expat itself is responsible for this guard. It has all the necessary information and provides input of the required size to the custom converter.

2) Yes, this is a problem. I'm working on another approach, in which the full encoding table is built at the first request for the encoding and then cached. This makes decoding individual characters fast, but requires about 0.5 sec for initialization. Is such an approach more suitable?
msg197725 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-09-14 18:53
Here is a totally rewritten patch, which builds the decoding table at the first request for an encoding and saves it in a cache. Decoding should be very fast. Do you have large test XML files with multibyte encodings? Could you please measure the time of parsing these files and, for comparison, the time of parsing the same files encoded with utf-8 and utf-16?
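
The table-building idea in this message can be illustrated in Python rather than C. This is a minimal sketch, not the patch's code: the name `build_tables` is hypothetical, deriving the tables by encoding every BMP code point is an assumption about one workable method, and it is only meaningful for stateless codecs. The lead-byte map follows the convention of expat's `XML_Encoding.map`: a non-negative value is the code point of a single-byte character, `-n` marks the lead byte of an n-byte sequence, and `-1` marks a byte that cannot start a character.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def build_tables(encoding):
    """Return (lead_map, multibyte) for a codec; built once, then cached."""
    lead_map = [-1] * 256   # -1: byte cannot start a character
    multibyte = {}          # multibyte sequence -> code point
    for code_point in range(0x10000):  # BMP only, as expat requires
        char = chr(code_point)
        try:
            data = char.encode(encoding)
            # Keep only mappings that survive a round trip, so that
            # one-way "best fit" encodings do not pollute the tables.
            if data.decode(encoding) != char:
                continue
        except UnicodeError:
            continue  # surrogates and characters the codec lacks
        if len(data) == 1:
            lead_map[data[0]] = code_point
        elif len(data) <= 4:
            lead_map[data[0]] = -len(data)
            multibyte[data] = code_point
    return lead_map, multibyte
```

The first call for an encoding pays the full scan; later calls return the cached tables, which mirrors the "heavy start, fast decoding" trade-off discussed here.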
msg203839 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-11-22 19:09
If anybody is interested in support for multibyte encodings in the XML parser, it is time to make a review.
msg203898 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-22 22:03
I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, it's maybe not a bad thing. Do you know which applications use such codecs? pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs can be used with your patch? A whitelist of multibyte codecs may be less reliable. What do you think?
msg203906 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-11-22 22:53
> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them but I heard some of them are still widely used. This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs can be used with your patch?

All codecs which can be supported by expat.

"""
1. Every ASCII character that can appear in a well-formed XML document, other than the characters $@\^`{}~ must be represented by a single byte, and that byte must be the same byte that represents that character in ASCII.
2. No character may require more than 4 bytes to encode.
3. All characters encoded must have Unicode scalar values <= 0xFFFF, (i.e., characters that would be encoded by surrogates in UTF-16 are not allowed). Note that this restriction doesn't apply to the built-in support for UTF-8 and UTF-16.
4. No Unicode character may be encoded by more than one distinct sequence of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch implements this way. It hardcodes a list of supported encodings with minimal required tables. pyexpat_multibyte_encodings_5.patch supports any encoding which satisfies the expat criteria and builds all needed data at first access (tens of kilobytes). After a heavy start it works much faster than the previous patch.
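
The first two expat criteria quoted in this message can be probed from Python. The following is a rough sketch under stated assumptions, not the detection logic of either patch: the function name `violated_expat_criteria` is hypothetical, "ASCII characters that can appear in XML" is taken to mean tab, LF, CR and the printable range, and criteria 3 and 4 are not checked at all.

```python
# ASCII characters assumed to be able to appear in an XML document.
XML_ASCII = [0x09, 0x0A, 0x0D] + list(range(0x20, 0x7F))
# Characters that criterion 1 explicitly exempts.
EXEMPT = set("$@\\^`{}~")


def violated_expat_criteria(encoding):
    """Return which of criteria 1 and 2 a codec fails (3 and 4 unchecked)."""
    violated = set()
    # Criterion 1: non-exempt ASCII must encode to the same single byte.
    for code_point in XML_ASCII:
        char = chr(code_point)
        if char in EXEMPT:
            continue
        try:
            same = char.encode(encoding) == bytes([code_point])
        except UnicodeError:
            same = False
        if not same:
            violated.add(1)
            break
    # Criterion 2: no BMP character may need more than 4 bytes.
    for code_point in range(0x10000):
        try:
            length = len(chr(code_point).encode(encoding))
        except UnicodeError:
            continue  # character not in this codec
        if length > 4:
            violated.add(2)
            break
    return sorted(violated)
```

An empty result does not prove compatibility. gb18030, for instance, can encode characters beyond U+FFFF, which criterion 3 rules out and which this sketch does not test.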
msg203919 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-11-22 23:35
On 22.11.2013 23:03, STINNER Victor wrote:
> I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, it's maybe not a bad thing. Do you know which applications use such codecs?

I'm not sure what you mean with multibyte encodings. There's UTF-16 which is a popular 2-byte encoding and then there are a whole lot of variable length encodings such as UTF-8 and many of the Asian codecs in the stdlib. While you see those used a lot for text, I'm not sure whether the same is true for XML documents, where UTF-8 is the standard, but other encodings can be specified if needed.

Serhiy: Apart from this being a nice-to-have feature, where do you see the practical use?
msg290483 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-03-25 13:39
Marc-Andre, there are at least two issues about supporting East Asian encodings (issue13612 and issue15877). I think this means that these encodings are used in XML in the wild. The current support of encodings (8-bit + UTF-8 + UTF-16) is enough for my needs, but I have never dealt with East Asian languages. Currently the CodecInfo object has an optional flag _is_text_encoding. I think we can add more private attributes (flags and precomputed tables) for use with the expat parser. If they are not set (third-party encodings), the current autodetection code can be used as a fallback.
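
For readers unfamiliar with the flag mentioned here, the snippet below shows how the existing private attribute can be read, with a fallback for codecs that do not set it. It is only an illustration of the "private attribute plus fallback" pattern: `_is_text_encoding` is a private CPython detail, the helper name `is_text_encoding` is hypothetical, and no expat-specific attributes exist on `CodecInfo`.

```python
import codecs


def is_text_encoding(name):
    """Read the private CodecInfo flag; assume a text codec if it is absent."""
    info = codecs.lookup(name)
    return getattr(info, "_is_text_encoding", True)
```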
msg290576 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2017-03-27 10:33
This looks to me like a limited reimplementation of the codec machinery. Why not use incremental codecs as a preprocessor? Would this be too slow?
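
The preprocessing alternative raised in this message could look roughly like the sketch below. It is an assumption-laden illustration, not a proposal from the thread: the function name `parse_transcoded` is hypothetical, the source encoding is supplied by the caller instead of being sniffed from the XML declaration, and the parser is created with `encoding="utf-8"` so that the declared encoding is overridden.

```python
import codecs
from xml.parsers import expat


def parse_transcoded(chunks, source_encoding, start_element):
    """Transcode byte chunks to UTF-8 incrementally and feed them to expat."""
    decoder = codecs.getincrementaldecoder(source_encoding)()
    # An explicit encoding overrides the one declared in the document.
    parser = expat.ParserCreate(encoding="utf-8")
    parser.StartElementHandler = start_element
    for chunk in chunks:
        # The decoder buffers incomplete multibyte sequences between calls.
        text = decoder.decode(chunk)
        if text:
            parser.Parse(text.encode("utf-8"), False)
    # Flush the decoder; this raises if the input ends mid-character.
    tail = decoder.decode(b"", final=True)
    parser.Parse(tail.encode("utf-8"), True)
```

Whether this is fast enough is exactly the open question in the message; the sketch only shows that the approach needs no per-encoding tables.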

History
Date	User	Action	Args
2022-04-11 14:57:46	admin	set	github: 62259
2017-03-27 10:33:59	doerwalter	set	messages: + msg290576
2017-03-25 13:39:26	serhiy.storchaka	set	versions: + Python 3.7, - Python 3.4
2017-03-25 13:39:08	serhiy.storchaka	set	nosy: + ncoghlan messages: + msg290483
2013-11-22 23:35:57	lemburg	set	messages: + msg203919
2013-11-22 22:53:40	serhiy.storchaka	set	messages: + msg203906
2013-11-22 22:03:53	vstinner	set	nosy: + vstinner messages: + msg203898
2013-11-22 19:09:37	serhiy.storchaka	set	messages: + msg203839
2013-09-14 18:55:03	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings_5.patch
2013-09-14 18:54:31	serhiy.storchaka	set	files: - pyexpat_multibyte_encodings.patch
2013-09-14 18:53:54	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings.patch messages: + msg197725
2013-09-14 13:19:41	serhiy.storchaka	set	messages: + msg197708
2013-09-14 05:26:45	scoder	set	messages: + msg197686
2013-09-13 20:28:16	eli.bendersky	set	nosy: - eli.bendersky
2013-09-13 20:25:33	serhiy.storchaka	set	nosy: + scoder
2013-05-26 10:31:37	serhiy.storchaka	set	files: - pyexpat_multibyte_encodings_3.patch
2013-05-26 10:31:28	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings_4.patch messages: + msg190070
2013-05-26 07:11:51	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings_3.patch
2013-05-26 07:08:48	serhiy.storchaka	set	files: - pyexpat_multibyte_encodings_2.patch
2013-05-26 05:58:30	serhiy.storchaka	set	stage: patch review
2013-05-26 05:57:57	serhiy.storchaka	set	files: - expat_encodings.py
2013-05-26 05:57:43	serhiy.storchaka	set	files: - pyexpat_multibyte_encodings.patch
2013-05-26 05:57:14	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings_2.patch messages: + msg190064
2013-05-25 22:03:22	serhiy.storchaka	set	messages: + msg190024
2013-05-25 21:58:57	amaury.forgeotdarc	set	messages: + msg190022
2013-05-25 21:24:42	serhiy.storchaka	set	files: + pyexpat_multibyte_encodings.patch keywords: + patch messages: + msg190011
2013-05-25 19:53:05	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg189995
2013-05-25 18:41:28	serhiy.storchaka	create