classification
Title: Add multibyte encoding support to pyexpat
Type: enhancement Stage: patch review
Components: Extension Modules, XML Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, doerwalter, lemburg, ncoghlan, scoder, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2013-05-25 18:41 by serhiy.storchaka, last changed 2017-03-27 10:33 by doerwalter.

Files
File name Uploaded
pyexpat_multibyte_encodings_4.patch serhiy.storchaka, 2013-05-26 10:31
pyexpat_multibyte_encodings_5.patch serhiy.storchaka, 2013-09-14 18:55
Messages (16)
msg189989 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-25 18:41
It is possible to add support for most multibyte encodings to pyexpat.

There are several ways to do this:

1. Generate the maps with a special script and add the generated file to the repository. After adding or updating a multibyte encoding, this file should be regenerated.

2. Generate the maps on the fly. This takes more time on the first use of an encoding, but allows support of any encoding which is compatible with expat.
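
To illustrate the second option, here is a minimal sketch (not taken from any attached patch) of deriving, from a Python codec, the 256-entry byte map that expat's XML_Encoding structure expects. The helper names build_expat_map and _starts_two_byte_sequence are hypothetical, and the sketch only probes two-byte sequences, so lead bytes of three- or four-byte sequences are reported as invalid.

```python
def build_expat_map(encoding):
    """Return a 256-entry list in the style of expat's XML_Encoding.map.

    Entry >= 0: the byte alone encodes that code point.
    Entry -2:   the byte starts a two-byte sequence.
    Entry -1:   the byte is not valid on its own (or needs more than two
                bytes, which this simplified sketch does not probe).
    """
    table = []
    for lead in range(256):
        try:
            text = bytes([lead]).decode(encoding)
        except UnicodeDecodeError:
            text = ""
        if len(text) == 1:
            table.append(ord(text))
        elif _starts_two_byte_sequence(lead, encoding):
            table.append(-2)
        else:
            table.append(-1)
    return table


def _starts_two_byte_sequence(lead, encoding):
    """True if some trail byte completes a one-character sequence."""
    for trail in range(256):
        try:
            text = bytes([lead, trail]).decode(encoding)
        except UnicodeDecodeError:
            continue
        if len(text) == 1:
            return True
    return False
```

For example, build_expat_map("shift_jis") marks 0x82 as a lead byte, while 0x41 maps to itself.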
msg189995 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-05-25 19:53
I guess GB18030 can't be supported at all?
msg190011 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-25 21:24
Here is a patch which implements the first way.

Yes, it looks like the following encodings cannot be supported at all: euc-kr, gb18030, iso2022-kr, utf-7, cp037, cp424, cp500, cp864, cp875, cp1026, cp1140, utf_32, utf_32_be, utf_32_le.
msg190022 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-05-25 21:58
Then you should also remove the "Make it as simple as possible" comment :-/
msg190024 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-25 22:03
It is still simple enough.
msg190064 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-26 05:57
Patch updated. Fixed an error in the encodings generator and added an additional compatibility check for 8-bit encodings in PyUnknownEncodingHandler().

Feel free to bikeshed the encodings generator.
msg190070 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-26 10:31
Patch updated. Some more tests were added and some more bugs were fixed.
msg197686 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-09-14 05:26
I don't think I have my head deep enough in the encodings implementation to say that this is the correct/best way to do it, but the patch looks mostly reasonable to me and would be a helpful addition.

I have two comments on the pyexpat_encoding_convert() function:

1) I can't see a safeguard against reading beyond the data buffer. What if s already points to the last byte and we are trying to read two or three bytes to decode them? I wouldn't be surprised to see that this kind of input can be crafted.

2) Creating a throw-away Unicode object through a named decoder looks like a huge overhead for decoding two bytes. It might be considered an optimisation to change that, but if you are really trying to parse a longer XML document with lots of Japanese text in it (i.e. if you actually *need* this feature), it will most likely end up being way too slow to make any real use of it.

I think that both points should be addressed before this gets added.
msg197708 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-14 13:19
1) Expat itself is responsible for this guard. It has all the necessary information and provides input of the required size to the custom converter.

2) Yes, this is a problem. I'm working on another approach, in which the full encoding table is built on the first request for the encoding and then cached. It makes decoding individual characters fast, but requires about 0.5 sec for initialization. Is such an approach more suitable?
msg197725 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-14 18:53
Here is a totally rewritten patch, which builds the decoding table on the first request for an encoding and saves it in a cache. Decoding should be very fast.
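
As a rough Python-level illustration of the "build once, then cache" idea (the patch itself works in C, and this is not its code), the sketch below builds a lookup table of all two-byte sequences for a codec on first request and reuses it afterwards. The names two_byte_table and _decode_one are hypothetical, and the use of functools.lru_cache as the cache is an assumption of this sketch.

```python
import functools


def _decode_one(data, encoding):
    """Return the code point if data decodes to exactly one character."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    return ord(text) if len(text) == 1 else None


@functools.lru_cache(maxsize=None)
def two_byte_table(encoding):
    """Map each valid two-byte sequence to its code point.

    The table is built on the first call for an encoding and served
    from the cache on later calls.
    """
    table = {}
    for lead in range(0x80, 0x100):
        if _decode_one(bytes([lead]), encoding) is not None:
            continue  # a single-byte character, not a lead byte
        for trail in range(0x100):
            seq = bytes([lead, trail])
            code = _decode_one(seq, encoding)
            if code is not None:
                table[seq] = code
    return table
```

After the first call, decoding a character is a single dictionary lookup, for example two_byte_table("shift_jis")[b"\x82\xa0"].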

Do you have large test XML files in multibyte encodings? Could you please measure the time to parse these files and, for comparison, the time to parse the same files encoded in UTF-8 and UTF-16?
msg203839 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-22 19:09
If anybody is interested in support for multibyte encodings in the XML parser, now is the time to review the patch.
msg203898 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-11-22 22:03
I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, it's maybe not a bad thing. Do you know which applications use such codecs?

pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs can be used with your patch? A whitelist of multibyte codecs may be less reliable. What do you think?
msg203906 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-22 22:53
> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them but I heard some of them are still widely used.

This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs can be used with your patch?

All codecs which can be supported by expat.

"""
   1. Every ASCII character that can appear in a well-formed XML document,
      other than the characters

      $@\^`{}~

      must be represented by a single byte, and that byte must be the
      same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
      0xFFFF, (i.e., characters that would be encoded by surrogates in
      UTF-16 are  not allowed).  Note that this restriction doesn't
      apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
      sequence of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.
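
The first criterion can be checked mechanically. The following is a minimal sketch, not the detection code from the patch: it tests only rule 1 (ASCII characters keep their single-byte ASCII representation, apart from the exempt characters) and says nothing about rules 2 to 4. The function name satisfies_ascii_rule is hypothetical.

```python
# Characters that expat allows to differ from ASCII (rule 1 above).
EXEMPT = set("$@\\^`{}~")


def satisfies_ascii_rule(encoding):
    """Check rule 1 only: ASCII characters allowed in XML must
    round-trip as the same single byte in the given codec."""
    for code in (0x09, 0x0A, 0x0D, *range(0x20, 0x80)):
        char = chr(code)
        if char in EXEMPT:
            continue
        byte = bytes([code])
        try:
            if char.encode(encoding) != byte:
                return False
            if byte.decode(encoding) != char:
                return False
        except UnicodeError:
            return False
    return True
```

With this check, shift_jis and big5 pass, while EBCDIC-based codecs such as cp037 and fixed-width codecs such as utf_32 fail.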

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch implements this approach. It hardcodes a list of supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding which satisfies the expat criteria and builds all needed data on first access (tens of kilobytes). After a heavy start it works much faster than the previous patch.
msg203919 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-11-22 23:35
On 22.11.2013 23:03, STINNER Victor wrote:
> 
> I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, it's maybe not a bad thing. Do you know which applications use such codecs?

I'm not sure what you mean by multibyte encodings. There's UTF-16 which
is a popular 2-byte encoding and then there are a whole lot of variable
length encodings such as UTF-8 and many of the Asian codecs in the stdlib.

While you see those used a lot for text, I'm not sure whether the
same is true for XML documents, where UTF-8 is the standard,
but other encodings can be specified if needed.

Serhiy: Apart from this being a nice-to-have feature, where do you see
the practical use?
msg290483 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-25 13:39
Marc-Andre, there are at least two issues about supporting East Asian encodings (issue13612 and issue15877). I think this means that these encodings are used in XML in the wild. The current encoding support (8-bit + UTF-8 + UTF-16) is enough for my needs, but I have never dealt with East Asian languages.

Currently the CodecInfo object has an optional flag _is_text_encoding. I think we can add more private attributes (flags and precomputed tables) for use with the expat parser. If they are not set (third-party encodings), the current autodetection code can be used as a fallback.
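
For reference, the existing flag can be inspected as in the minimal sketch below. _is_text_encoding is a private CPython attribute, so the getattr() fallback for codecs that lack it is an assumption of this sketch, and the helper name is_text_encoding is hypothetical. No expat-specific attributes exist on CodecInfo today.

```python
import codecs


def is_text_encoding(name):
    """Report the private _is_text_encoding flag of a registered codec.

    Codecs without the attribute are treated as text encodings here.
    """
    info = codecs.lookup(name)
    return getattr(info, "_is_text_encoding", True)
```

For example, is_text_encoding("shift_jis") is true, while the bytes-to-bytes style codec "rot_13" reports false.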
msg290576 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2017-03-27 10:33
This looks to me like a limited reimplementation of the codec machinery. Why not use incremental codecs as a preprocessor? Would this be too slow?
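
A minimal sketch of the preprocessing idea, assuming the document's encoding is already known to the caller: an incremental decoder turns the byte chunks into str, which is then fed to the expat parser. The function name parse_transcoded is hypothetical, the sketch collects only character data, and it makes no claim about performance.

```python
import codecs
import xml.parsers.expat


def parse_transcoded(chunks, encoding):
    """Decode byte chunks incrementally and feed the text to expat.

    Returns the concatenated character data of the document.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parser = xml.parsers.expat.ParserCreate()
    pieces = []
    parser.CharacterDataHandler = pieces.append
    for chunk in chunks:
        # The decoder buffers incomplete multibyte sequences itself,
        # so a chunk may end in the middle of a character.
        text = decoder.decode(chunk)
        if text:
            parser.Parse(text, False)
    # Flush the decoder (raises on a truncated sequence) and finish.
    parser.Parse(decoder.decode(b"", final=True), True)
    return "".join(pieces)
```

Because the incremental decoder keeps state between calls, the chunk boundaries do not need to align with character boundaries.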
History
Date User Action Args
2017-03-27 10:33:59 doerwalter set messages: + msg290576
2017-03-25 13:39:26 serhiy.storchaka set versions: + Python 3.7, - Python 3.4
2017-03-25 13:39:08 serhiy.storchaka set nosy: + ncoghlan; messages: + msg290483
2013-11-22 23:35:57 lemburg set messages: + msg203919
2013-11-22 22:53:40 serhiy.storchaka set messages: + msg203906
2013-11-22 22:03:53 vstinner set nosy: + vstinner; messages: + msg203898
2013-11-22 19:09:37 serhiy.storchaka set messages: + msg203839
2013-09-14 18:55:03 serhiy.storchaka set files: + pyexpat_multibyte_encodings_5.patch
2013-09-14 18:54:31 serhiy.storchaka set files: - pyexpat_multibyte_encodings.patch
2013-09-14 18:53:54 serhiy.storchaka set files: + pyexpat_multibyte_encodings.patch; messages: + msg197725
2013-09-14 13:19:41 serhiy.storchaka set messages: + msg197708
2013-09-14 05:26:45 scoder set messages: + msg197686
2013-09-13 20:28:16 eli.bendersky set nosy: - eli.bendersky
2013-09-13 20:25:33 serhiy.storchaka set nosy: + scoder
2013-05-26 10:31:37 serhiy.storchaka set files: - pyexpat_multibyte_encodings_3.patch
2013-05-26 10:31:28 serhiy.storchaka set files: + pyexpat_multibyte_encodings_4.patch; messages: + msg190070
2013-05-26 07:11:51 serhiy.storchaka set files: + pyexpat_multibyte_encodings_3.patch
2013-05-26 07:08:48 serhiy.storchaka set files: - pyexpat_multibyte_encodings_2.patch
2013-05-26 05:58:30 serhiy.storchaka set stage: patch review
2013-05-26 05:57:57 serhiy.storchaka set files: - expat_encodings.py
2013-05-26 05:57:43 serhiy.storchaka set files: - pyexpat_multibyte_encodings.patch
2013-05-26 05:57:14 serhiy.storchaka set files: + pyexpat_multibyte_encodings_2.patch; messages: + msg190064
2013-05-25 22:03:22 serhiy.storchaka set messages: + msg190024
2013-05-25 21:58:57 amaury.forgeotdarc set messages: + msg190022
2013-05-25 21:24:42 serhiy.storchaka set files: + pyexpat_multibyte_encodings.patch; keywords: + patch; messages: + msg190011
2013-05-25 19:53:05 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg189995
2013-05-25 18:41:28 serhiy.storchaka create