msg189989 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-25 18:41 |
It is possible to add support for most multibyte encodings to pyexpat.
There are several ways to do this:
1. Generate the maps with a special script and add the generated file to the repository. After adding or updating a multibyte encoding, the file would have to be regenerated.
2. Generate the maps on the fly. This requires more time on first use of an encoding, but allows support for any encoding that is compatible with expat.
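For illustration, here is a rough sketch of what such a generated map could contain, built with nothing but the codecs module (the function name is made up; the real generator script may record quite different data):

    import codecs

    def lead_byte_lengths(encoding):
        # For each possible lead byte, record the lengths of the byte
        # sequences that start with it, by encoding every BMP character.
        encode = codecs.lookup(encoding).encode
        lengths = {}
        for cp in range(0x10000):  # expat only allows BMP characters
            try:
                seq = encode(chr(cp))[0]
            except UnicodeEncodeError:
                continue
            lengths.setdefault(seq[0], set()).add(len(seq))
        return lengths

    # Which byte values act as lead bytes of two-byte sequences in shift_jis?
    print(sorted(hex(b) for b, ls in lead_byte_lengths('shift_jis').items() if ls == {2}))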
|
msg189995 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2013-05-25 19:53 |
I guess GB18030 can't be supported at all?
|
msg190011 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-25 21:24 |
Here is a patch which implements the first way.
Yes, it looks like the following encodings cannot be supported at all: euc-kr, gb18030, iso2022-kr, utf-7, cp037, cp424, cp500, cp864, cp875, cp1026, cp1140, utf_32, utf_32_be, utf_32_le.
|
msg190022 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2013-05-25 21:58 |
Then you should also remove the "Make it as simple as possible" comment :-/
|
msg190024 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-25 22:03 |
It is still simple enough.
|
msg190064 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-26 05:57 |
Patch updated. Fixed an error in the encodings generator and added an additional compatibility check for 8-bit encodings in PyUnknownEncodingHandler().
Feel free to bikeshed the encodings generator.
|
msg190070 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-05-26 10:31 |
Patch updated. A few more tests were added and a few more bugs fixed.
|
msg197686 - (view) |
Author: Stefan Behnel (scoder) * |
Date: 2013-09-14 05:26 |
I don't think I have my head deep enough in the encodings implementation to say that this is the correct/best way to do it, but the patch looks mostly reasonable to me and would be a helpful addition.
I have two comments on the pyexpat_encoding_convert() function:
1) I can't see a safe-guard against reading beyond the data buffer. What if s already points to the last byte and we are trying to read two or three bytes to decode them? I wouldn't be surprised to see that this kind of input can be crafted.
2) Creating a throw-away Unicode object through a named decoder looks like a huge overhead for decoding two bytes. It might be considered an optimisation to change that, but if you are really trying to parse a longer XML document with lots of Japanese text in it (i.e. if you actually *need* this feature), it will most likely end up being way too slow to make any real use of it.
I think that both points should be addressed before this gets added.
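For what it's worth, here is a crude Python-level proxy for the overhead in point 2 (this is not the C code path the patch uses, just a way to get a feel for the cost of going through a named decoder for every couple of bytes versus a precomputed table):

    import codecs
    import timeit

    sample = '\u65e5'.encode('shift_jis')   # one two-byte character
    table = {sample: '\u65e5'}              # stand-in for a precomputed map

    n = 100_000
    via_codec = timeit.timeit(lambda: codecs.decode(sample, 'shift_jis'), number=n)
    via_table = timeit.timeit(lambda: table[sample], number=n)
    print('named decoder: %.3fs  dict lookup: %.3fs' % (via_codec, via_table))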
|
msg197708 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-09-14 13:19 |
1) Expat itself is responsible for this guard. It has all the necessary information and provides input of the required size to the custom converter.
2) Yes, this is a problem. I'm working on another approach, where the full encoding table is built on the first request for the encoding (and then cached). It makes decoding individual characters fast, but requires about 0.5 sec for initialization. Is such an approach more suitable?
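A minimal Python sketch of that approach, assuming the table simply maps every encodable byte sequence to its code point (the function name and the lru_cache-based caching are just for illustration; the patch's actual data structures may differ):

    import codecs
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def decoding_table(encoding):
        # Byte sequence -> code point for every BMP character the codec can encode.
        encode = codecs.lookup(encoding).encode
        table = {}
        for cp in range(0x10000):      # expat forbids characters above the BMP
            try:
                seq = encode(chr(cp))[0]
            except UnicodeEncodeError:
                continue
            if len(seq) <= 4:
                table[seq] = cp
        return table

    # The first call pays the initialization cost; later lookups are plain dict hits.
    table = decoding_table('cp932')
    print(len(table), hex(table['\u65e5'.encode('cp932')]))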
|
msg197725 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-09-14 18:53 |
Here is a totally rewritten patch, which builds the decoding table on the first request for an encoding and saves it in a cache. Decoding should be very fast.
Do you have large test XML files in multibyte encodings? Could you please measure the time of parsing these files, and for comparison the time of parsing the same files encoded as utf-8 and utf-16?
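In case it helps, a rough way to compare, using a synthetic document (on an unpatched interpreter only the utf-8 and utf-16 runs work; a multibyte encoding such as shift_jis could be added to the loop once the patch is applied):

    import time
    import xml.parsers.expat

    body = '<root>' + '<item>\u65e5\u672c\u8a9e text</item>' * 100_000 + '</root>'

    for enc in ('utf-8', 'utf-16'):
        data = ('<?xml version="1.0" encoding="%s"?>' % enc + body).encode(enc)
        parser = xml.parsers.expat.ParserCreate()
        start = time.perf_counter()
        parser.Parse(data, True)
        print('%-8s %.3fs' % (enc, time.perf_counter() - start))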
|
msg203839 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-11-22 19:09 |
If anybody is interested in support for multibyte encodings in the XML parser, it is time to review the patch.
|
msg203898 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2013-11-22 22:03 |
I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, maybe it's not a bad thing. Do you know which applications use such codecs?
pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs can be used with your patch? A whitelist of multibyte codecs may be less reliable. What do you think?
|
msg203906 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-11-22 22:53 |
> I'm not sure that multibyte encodings other than UTF-8 are used in the world.
I don't use any of them myself, but I have heard that some of them are still widely used.
This issue was prompted by issue13612. See also the related issue15877.
> pyexpat_encoding_create() looks like a heuristic. How many multibyte codecs can be used with your patch?
All codecs that can be supported by expat.
"""
1. Every ASCII character that can appear in a well-formed XML document,
other than the characters
$@\^`{}~
must be represented by a single byte, and that byte must be the
same byte that represents that character in ASCII.
2. No character may require more than 4 bytes to encode.
3. All characters encoded must have Unicode scalar values <=
0xFFFF, (i.e., characters that would be encoded by surrogates in
UTF-16 are not allowed). Note that this restriction doesn't
apply to the built-in support for UTF-8 and UTF-16.
4. No Unicode character may be encoded by more than one distinct
sequence of bytes.
"""
14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.
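A rough, partial check of these criteria against a Python codec (the decoding-side parts of points 3 and 4 are not covered; the function is only a sketch):

    import codecs

    # ASCII characters that must map to their own byte value (point 1);
    # $ @ \ ^ ` { } ~ are deliberately left out, as in the expat comment.
    REQUIRED_ASCII = (
        '\t\n\r !"#%&\'()*+,-./0123456789:;<=>?'
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_'
        'abcdefghijklmnopqrstuvwxyz|'
    )

    def looks_expat_compatible(encoding):
        encode = codecs.lookup(encoding).encode
        if any(encode(ch)[0] != ch.encode('ascii') for ch in REQUIRED_ASCII):
            return False                    # point 1 violated
        seen = set()
        for cp in range(0x10000):
            try:
                seq = encode(chr(cp))[0]
            except UnicodeEncodeError:
                continue
            if len(seq) > 4:                # point 2 violated
                return False
            if seq in seen:                 # decoding would be ambiguous
                return False
            seen.add(seq)
        return True

    # utf-7 fails point 1 ('+' does not encode to b'+'); big5 and cp932 pass.
    # gb18030 would pass this partial check but is still ruled out by point 3,
    # since it can encode characters above the BMP, which is not tested here.
    print([e for e in ('big5', 'cp932', 'utf-7') if looks_expat_compatible(e)])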
> A whitelist of multibyte codecs may be less reliable. What do you think?
pyexpat_multibyte_encodings_4.patch implements that way. It hardcodes a list of supported encodings with the minimal required tables.
pyexpat_multibyte_encodings_5.patch supports any encoding which satisfies expat's criteria and builds all the needed data on first access (tens of kilobytes). After the heavier start-up it works much faster than the previous patch.
|
msg203919 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2013-11-22 23:35 |
On 22.11.2013 23:03, STINNER Victor wrote:
>
> I'm not sure that multibyte encodings other than UTF-8 are used in the world. I'm not convinced that we should support them. If the changes are small, it's maybe not a bad thing. Do you know which applications use such codecs?
I'm not sure what you mean by multibyte encodings. There's UTF-16, which is a popular 2-byte encoding, and then there are a whole lot of variable-length encodings such as UTF-8 and many of the Asian codecs in the stdlib.
While you see those used a lot for text, I'm not sure whether the same is true for XML documents, where UTF-8 is the standard but other encodings can be specified if needed.
Serhiy: Apart from this being a nice-to-have feature, where do you see the practical use?
|
msg290483 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-03-25 13:39 |
Marc-Andre, there are at least two issues about supporting East Asian encodings (issue13612 and issue15877). I think this means that these encodings are used in XML in the wild. The current encoding support (8-bit + UTF-8 + UTF-16) is enough for my needs, but I have never dealt with East Asian languages.
Currently the CodecInfo object has an optional flag _is_text_encoding. I think we can add more private attributes (flags and precomputed tables) for use with the expat parser. If they are not set (e.g. for third-party encodings), the current autodetection code can be used as a fallback.
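A sketch of what the consumer side of that could look like (_expat_decoding_table is a made-up attribute name used only for this illustration; only _is_text_encoding exists on CodecInfo today):

    import codecs

    def expat_table_for(encoding):
        info = codecs.lookup(encoding)
        if not info._is_text_encoding:
            raise LookupError('%s is not a text encoding' % encoding)
        # A precomputed table shipped with the codec, if the codec provides one.
        table = getattr(info, '_expat_decoding_table', None)
        if table is not None:
            return table
        return None   # third-party codec: fall back to the current autodetection code

    print(expat_table_for('koi8-r'))   # None -> the autodetection path would run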
|
msg290576 - (view) |
Author: Walter Dörwald (doerwalter) * |
Date: 2017-03-27 10:33 |
This looks to me like a limited reimplementation of the codec machinery. Why not use incremental codecs as a preprocessor? Would this be too slow?
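A minimal sketch of that idea, assuming the source encoding is known up front and relying on the documented fact that the encoding passed to ParserCreate() overrides the one in the XML declaration:

    import codecs
    import xml.parsers.expat

    def parse_via_codec(data, encoding, chunk_size=4096):
        # Decode the input incrementally and feed expat UTF-8 instead.
        decoder = codecs.getincrementaldecoder(encoding)()
        parser = xml.parsers.expat.ParserCreate('utf-8')
        parser.CharacterDataHandler = lambda text: print(repr(text))
        for start in range(0, len(data), chunk_size):
            final = start + chunk_size >= len(data)
            text = decoder.decode(data[start:start + chunk_size], final)
            parser.Parse(text.encode('utf-8'), final)

    parse_via_codec('<r>\u65e5\u672c\u8a9e</r>'.encode('shift_jis'), 'shift_jis')

Whether the extra decode/re-encode pass is too slow compared with a table inside the parser is exactly the kind of thing the timing request in msg197725 would answer.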
|
Date | User | Action | Args
2022-04-11 14:57:46 | admin | set | github: 62259
2017-03-27 10:33:59 | doerwalter | set | messages: + msg290576
2017-03-25 13:39:26 | serhiy.storchaka | set | versions: + Python 3.7, - Python 3.4
2017-03-25 13:39:08 | serhiy.storchaka | set | nosy: + ncoghlan; messages: + msg290483
2013-11-22 23:35:57 | lemburg | set | messages: + msg203919
2013-11-22 22:53:40 | serhiy.storchaka | set | messages: + msg203906
2013-11-22 22:03:53 | vstinner | set | nosy: + vstinner; messages: + msg203898
2013-11-22 19:09:37 | serhiy.storchaka | set | messages: + msg203839
2013-09-14 18:55:03 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings_5.patch
2013-09-14 18:54:31 | serhiy.storchaka | set | files: - pyexpat_multibyte_encodings.patch
2013-09-14 18:53:54 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings.patch; messages: + msg197725
2013-09-14 13:19:41 | serhiy.storchaka | set | messages: + msg197708
2013-09-14 05:26:45 | scoder | set | messages: + msg197686
2013-09-13 20:28:16 | eli.bendersky | set | nosy: - eli.bendersky
2013-09-13 20:25:33 | serhiy.storchaka | set | nosy: + scoder
2013-05-26 10:31:37 | serhiy.storchaka | set | files: - pyexpat_multibyte_encodings_3.patch
2013-05-26 10:31:28 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings_4.patch; messages: + msg190070
2013-05-26 07:11:51 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings_3.patch
2013-05-26 07:08:48 | serhiy.storchaka | set | files: - pyexpat_multibyte_encodings_2.patch
2013-05-26 05:58:30 | serhiy.storchaka | set | stage: patch review
2013-05-26 05:57:57 | serhiy.storchaka | set | files: - expat_encodings.py
2013-05-26 05:57:43 | serhiy.storchaka | set | files: - pyexpat_multibyte_encodings.patch
2013-05-26 05:57:14 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings_2.patch; messages: + msg190064
2013-05-25 22:03:22 | serhiy.storchaka | set | messages: + msg190024
2013-05-25 21:58:57 | amaury.forgeotdarc | set | messages: + msg190022
2013-05-25 21:24:42 | serhiy.storchaka | set | files: + pyexpat_multibyte_encodings.patch; keywords: + patch; messages: + msg190011
2013-05-25 19:53:05 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg189995
2013-05-25 18:41:28 | serhiy.storchaka | create |