classification
Title: xml.etree.ElementTree says unknown encoding of a regular encoding
Type: behavior Stage: resolved
Components: Extension Modules, Unicode, XML Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, dongying, eli.bendersky, ezio.melotti, flox, python-dev, serhiy.storchaka, terry.reedy, vstinner
Priority: normal Keywords: patch

Created on 2011-12-16 09:04 by dongying, last changed 2013-08-04 13:11 by eli.bendersky. This issue is now closed.

Files
File name Uploaded Description Edit
test.py dongying, 2011-12-16 09:04 source code
expat_unknown_encoding_handler.patch serhiy.storchaka, 2013-05-21 03:31 review
expat_unknown_encoding_handler_2.patch serhiy.storchaka, 2013-05-22 13:52 review
expat_et_share_unknown_handler.1.patch eli.bendersky, 2013-05-23 12:48 review
expat_buffer_overflow-2.7.patch serhiy.storchaka, 2013-08-04 10:24 review
Messages (25)
msg149602 - (view) Author: Dongying Zhang (dongying) Date: 2011-12-16 09:04
I've been trying to parse xml string using python, codes following:

    #-*- coding: utf-8 -*-
    import xml.etree.ElementTree as xmlet
    s = '<?xml version="1.0" encoding="GBK"?><info></info>'
    xmlet.fromstring(s)

Then:
    $ python2.6 test.py
or:
    $ pypy test.py

Traceback message came out like this:

    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        xmlet.fromstring(s)
      File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
        parser.feed(text)
      File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
        self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

I've seen this in following cases:

    Python2.5.6 on Mac OS X Lion
    Python2.6.5 on Ubuntu 10.04
    pypy1.7.0 on Mac OS X Lion

But not in these cases:

    Python2.6.7 on Mac OS X Lion
    Python2.7.1 on Mac OS X Lion
msg149612 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-12-16 11:26
Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box).

In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long.
So buffers overflow, etc.  The function has a comment "supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.

Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence).  Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes.
msg149652 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-12-17 01:25
3.2, Win7 (a narrow build) it indeed works and returns
<Element 'info' at 0x3389128>
msg189660 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-20 13:17
In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately]
msg189671 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-05-20 15:35
encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long.
Exact behavior is not defined...
msg189685 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2013-05-20 19:53
3.3 shifted the wide-build problem to all builds ;-). I now get

  File "C:\Python\mypy\tem.py", line 4, in <module>
    xmlet.fromstring(s)
  File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30

I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to
LookupError: unknown encoding: xyz, so the lookup of 'GBK' succeeded.

I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.)

With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names.

I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug.
msg189694 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-20 21:18
PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding.
msg189720 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-21 03:31
I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury.
msg189785 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-05-21 20:59
The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support.
- please add a unit test
- remove usage of PyUnicodeObject and the "Stupid to access directly" comment, they are outdated.

Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment.
msg189796 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-22 08:01
For unit tests we first should fix issue16986.
msg189809 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-22 12:27
> For unit tests we first should fix issue16986.

I did another round of code review on issue 16986 now.
msg189810 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-22 12:29
Looked at Serhiy's patch here too: LGTM with a unit test :)
msg189818 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-22 13:52
Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests.
msg189845 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-23 04:02
Serhiy, would it make sense to share the code somewhere instead of duplicating it?
msg189847 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-23 04:48
May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments:

/* The code is duplicated as xx() in xxx.c.  Keep both functions synchronized. */
msg189865 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-23 12:48
How about this patch (not tested it too much - just as a proof of concept).

We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString.

ET can then simply call the handler from pyexpat.
msg189898 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-24 06:19
LGTM. But why not just inline expat_unknown_encoding_handler()?

For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for issue16986)?
msg189956 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-05-25 12:27
New changeset f7b47fb30169 by Eli Bendersky in branch '3.3':
Issue #13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/f7b47fb30169

New changeset 47e719b11c46 by Eli Bendersky in branch 'default':
Issue #13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/47e719b11c46
msg189957 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-25 12:32
A few notes:

1. If by C API version you mean PyExpat_CAPI_MAGIC, I'm not sure what difference that makes. It has never been updated and it's also being used only in _elementtree. Since the latter is statically compiled against pyexpat, I don't see a reason to touch it.

2. I wouldn't backport it to 2.7; again, not critical enough.

3. Should we open a separate issue for adding support to GBK and similar encodings "sometime in the future"? This issue kind-of got derailed to fix the bug.
msg189960 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-25 13:44
Tests were duplicated. There is no Misc/NEWS entry.

I think a buffer overflow is critical enough for backporting.
msg189961 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-05-25 14:18
Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers.
msg194061 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-08-01 12:48
Serhiy, do you want to backport the buffer overflow fix to 2.7?
msg194340 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-04 10:24
Here is a patch for 2.7.
msg194364 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-08-04 13:10
New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7':
Issue #13612: Fix a buffer overflow in case of a multi-byte encoding.
http://hg.python.org/cpython/rev/b3efc140d8a6
msg194365 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-08-04 13:11
Thanks, Serhiy.
History
Date User Action Args
2013-08-04 13:11:32eli.benderskysetstatus: open -> closed
resolution: fixed
messages: + msg194365

stage: patch review -> resolved
2013-08-04 13:10:41python-devsetmessages: + msg194364
2013-08-04 10:24:34serhiy.storchakasetfiles: + expat_buffer_overflow-2.7.patch

messages: + msg194340
2013-08-01 12:48:56eli.benderskysetmessages: + msg194061
2013-05-25 14:18:49eli.benderskysetmessages: + msg189961
2013-05-25 13:44:25serhiy.storchakasetmessages: + msg189960
2013-05-25 12:32:56eli.benderskysetmessages: + msg189957
2013-05-25 12:27:32python-devsetnosy: + python-dev
messages: + msg189956
2013-05-24 06:19:08serhiy.storchakasetmessages: + msg189898
2013-05-23 12:48:20eli.benderskysetfiles: + expat_et_share_unknown_handler.1.patch

messages: + msg189865
2013-05-23 04:48:59serhiy.storchakasetmessages: + msg189847
2013-05-23 04:02:53eli.benderskysetmessages: + msg189845
2013-05-22 13:57:33serhiy.storchakasetdependencies: - ElementTree incorrectly parses strings with declared encoding not UTF-8
2013-05-22 13:52:31serhiy.storchakasetfiles: + expat_unknown_encoding_handler_2.patch

messages: + msg189818
2013-05-22 12:29:22eli.benderskysetmessages: + msg189810
2013-05-22 12:27:27eli.benderskysetmessages: + msg189809
2013-05-22 08:01:25serhiy.storchakasetmessages: + msg189796
2013-05-22 07:59:11serhiy.storchakasetdependencies: + ElementTree incorrectly parses strings with declared encoding not UTF-8
2013-05-21 20:59:41amaury.forgeotdarcsetmessages: + msg189785
2013-05-21 03:31:10serhiy.storchakasetfiles: + expat_unknown_encoding_handler.patch

components: + Extension Modules, Unicode, XML, - Library (Lib)
versions: + Python 3.4
keywords: + patch
nosy: + ezio.melotti

messages: + msg189720
stage: patch review
2013-05-20 21:18:51serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg189694
2013-05-20 19:53:03terry.reedysetmessages: + msg189685
versions: - Python 3.2
2013-05-20 15:35:55amaury.forgeotdarcsetmessages: + msg189671
2013-05-20 13:17:22eli.benderskysetmessages: + msg189660
2012-07-21 15:27:22floxsetnosy: + eli.bendersky
2011-12-19 15:15:16pitrousetnosy: + vstinner, flox

versions: + Python 3.2, Python 3.3
2011-12-17 01:25:36terry.reedysetnosy: + terry.reedy

messages: + msg149652
versions: + Python 2.7, - Python 2.6, 3rd party
2011-12-16 11:26:44amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg149612
2011-12-16 09:05:08dongyingsettype: behavior
2011-12-16 09:04:34dongyingsetversions: + 3rd party
2011-12-16 09:04:10dongyingcreate