xml.etree.ElementTree says unknown encoding of a regular encoding #57821

dongying · 2011-12-16T09:04:11Z

BPO	13612
Nosy	@terryjreedy, @amauryfa, @vstinner, @ezio-melotti, @florentx, @serhiy-storchaka
Files	test.py: source code expat_unknown_encoding_handler.patch expat_unknown_encoding_handler_2.patch expat_et_share_unknown_handler.1.patch expat_buffer_overflow-2.7.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2013-08-04.13:11:32.579>
created_at = <Date 2011-12-16.09:04:10.905>
labels = ['extension-modules', 'expert-XML', 'type-bug', 'expert-unicode']
title = 'xml.etree.ElementTree says unknown encoding of a regular encoding'
updated_at = <Date 2013-08-04.13:11:32.578>
user = 'https://bugs.python.org/dongying'

bugs.python.org fields:

activity = <Date 2013-08-04.13:11:32.578>
actor = 'eli.bendersky'
assignee = 'none'
closed = True
closed_date = <Date 2013-08-04.13:11:32.579>
closer = 'eli.bendersky'
components = ['Extension Modules', 'Unicode', 'XML']
creation = <Date 2011-12-16.09:04:10.905>
creator = 'dongying'
dependencies = []
files = ['23974', '30328', '30342', '30347', '31150']
hgrepos = []
issue_num = 13612
keywords = ['patch']
message_count = 25.0
messages = ['149602', '149612', '149652', '189660', '189671', '189685', '189694', '189720', '189785', '189796', '189809', '189810', '189818', '189845', '189847', '189865', '189898', '189956', '189957', '189960', '189961', '194061', '194340', '194364', '194365']
nosy_count = 9.0
nosy_names = ['terry.reedy', 'amaury.forgeotdarc', 'vstinner', 'ezio.melotti', 'eli.bendersky', 'flox', 'dongying', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue13612'
versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

dongying · 2011-12-16T09:04:10Z

I've been trying to parse xml string using python, codes following:

    #-*- coding: utf-8 -*-
    import xml.etree.ElementTree as xmlet
    s = '<?xml version="1.0" encoding="GBK"?><info></info>'
    xmlet.fromstring(s)

Then:
$ python2.6 test.py
or:
$ pypy test.py

Traceback message came out like this:

    Traceback (most recent call last):
      File "test.py", line 4, in <module>
        xmlet.fromstring(s)
      File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
        parser.feed(text)
      File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
        self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

I've seen this in following cases:

Python2.5.6 on Mac OS X Lion
Python2.6.5 on Ubuntu 10.04
pypy1.7.0 on Mac OS X Lion

But not in these cases:

Python2.6.7 on Mac OS X Lion
Python2.7.1 on Mac OS X Lion

amauryfa · 2011-12-16T11:26:44Z

Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box).

In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long.
So buffers overflow, etc. The function has a comment "supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.

Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes.

terryjreedy · 2011-12-17T01:25:37Z

3.2, Win7 (a narrow build) it indeed works and returns
<Element 'info' at 0x3389128>

elibendersky · 2013-05-20T13:17:23Z

In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately]

amauryfa · 2013-05-20T15:35:56Z

encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long.
Exact behavior is not defined...

terryjreedy · 2013-05-20T19:53:03Z

3.3 shifted the wide-build problem to all builds ;-). I now get

File "C:\Python\mypy\tem.py", line 4, in <module>
xmlet.fromstring(s)
File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30

I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to
LookupError: unknown encoding: xyz, so the lookup of 'GBK' succeeded.

I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.)

With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names.

I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug.

serhiy-storchaka · 2013-05-20T21:18:51Z

PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding.

serhiy-storchaka · 2013-05-21T03:31:10Z

I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury.

amauryfa · 2013-05-21T20:59:41Z

The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support.

please add a unit test
remove usage of PyUnicodeObject and the "Stupid to access directly" comment, they are outdated.

Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment.

serhiy-storchaka · 2013-05-22T08:01:25Z

For unit tests we first should fix bpo-16986.

elibendersky · 2013-05-22T12:27:27Z

For unit tests we first should fix bpo-16986.

I did another round of code review on bpo-16986 now.

elibendersky · 2013-05-22T12:29:22Z

Looked at Serhiy's patch here too: LGTM with a unit test :)

serhiy-storchaka · 2013-05-22T13:52:30Z

Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests.

elibendersky · 2013-05-23T04:02:54Z

Serhiy, would it make sense to share the code somewhere instead of duplicating it?

serhiy-storchaka · 2013-05-23T04:48:59Z

May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments:

/* The code is duplicated as xx() in xxx.c. Keep both functions synchronized. */

elibendersky · 2013-05-23T12:48:20Z

How about this patch (not tested it too much - just as a proof of concept).

We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString.

ET can then simply call the handler from pyexpat.

serhiy-storchaka · 2013-05-24T06:19:08Z

LGTM. But why not just inline expat_unknown_encoding_handler()?

For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for bpo-16986)?

python-dev · 2013-05-25T12:27:33Z

New changeset f7b47fb30169 by Eli Bendersky in branch '3.3':
Issue bpo-13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/f7b47fb30169

New changeset 47e719b11c46 by Eli Bendersky in branch 'default':
Issue bpo-13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/47e719b11c46

elibendersky · 2013-05-25T12:32:56Z

A few notes:

If by C API version you mean PyExpat_CAPI_MAGIC, I'm not sure what difference that makes. It has never been updated and it's also being used only in _elementtree. Since the latter is statically compiled against pyexpat, I don't see a reason to touch it.
I wouldn't backport it to 2.7; again, not critical enough.
Should we open a separate issue for adding support to GBK and similar encodings "sometime in the future"? This issue kind-of got derailed to fix the bug.

serhiy-storchaka · 2013-05-25T13:44:25Z

Tests were duplicated. There is no Misc/NEWS entry.

I think a buffer overflow is critical enough for backporting.

elibendersky · 2013-05-25T14:18:50Z

Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers.

elibendersky · 2013-08-01T12:48:57Z

Serhiy, do you want to backport the buffer overflow fix to 2.7?

serhiy-storchaka · 2013-08-04T10:24:34Z

Here is a patch for 2.7.

python-dev · 2013-08-04T13:10:41Z

New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7':
Issue bpo-13612: Fix a buffer overflow in case of a multi-byte encoding.
http://hg.python.org/cpython/rev/b3efc140d8a6

elibendersky · 2013-08-04T13:11:33Z

Thanks, Serhiy.

dongying mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 16, 2011

serhiy-storchaka added extension-modules C modules in the Modules dir topic-unicode topic-XML and removed stdlib Python modules in the Lib dir labels May 21, 2013

elibendersky mannequin closed this as completed Aug 4, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

dongying mannequin commented Dec 16, 2011

dongying mannequin commented Dec 16, 2011

amauryfa commented Dec 16, 2011

terryjreedy commented Dec 17, 2011

elibendersky mannequin commented May 20, 2013

amauryfa commented May 20, 2013

terryjreedy commented May 20, 2013

serhiy-storchaka commented May 20, 2013

serhiy-storchaka commented May 21, 2013

amauryfa commented May 21, 2013

serhiy-storchaka commented May 22, 2013

elibendersky mannequin commented May 22, 2013

elibendersky mannequin commented May 22, 2013

serhiy-storchaka commented May 22, 2013

elibendersky mannequin commented May 23, 2013

serhiy-storchaka commented May 23, 2013

elibendersky mannequin commented May 23, 2013

serhiy-storchaka commented May 24, 2013

python-dev mannequin commented May 25, 2013

elibendersky mannequin commented May 25, 2013

serhiy-storchaka commented May 25, 2013

elibendersky mannequin commented May 25, 2013

elibendersky mannequin commented Aug 1, 2013

serhiy-storchaka commented Aug 4, 2013

python-dev mannequin commented Aug 4, 2013

elibendersky mannequin commented Aug 4, 2013

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

xml.etree.ElementTree says unknown encoding of a regular encoding #57821

Comments

dongying mannequin commented Dec 16, 2011

dongying mannequin commented Dec 16, 2011

amauryfa commented Dec 16, 2011

terryjreedy commented Dec 17, 2011

elibendersky mannequin commented May 20, 2013

amauryfa commented May 20, 2013

terryjreedy commented May 20, 2013

serhiy-storchaka commented May 20, 2013

serhiy-storchaka commented May 21, 2013

amauryfa commented May 21, 2013

serhiy-storchaka commented May 22, 2013

elibendersky mannequin commented May 22, 2013

elibendersky mannequin commented May 22, 2013

serhiy-storchaka commented May 22, 2013

elibendersky mannequin commented May 23, 2013

serhiy-storchaka commented May 23, 2013

elibendersky mannequin commented May 23, 2013

serhiy-storchaka commented May 24, 2013

python-dev mannequin commented May 25, 2013

elibendersky mannequin commented May 25, 2013

serhiy-storchaka commented May 25, 2013

elibendersky mannequin commented May 25, 2013

elibendersky mannequin commented Aug 1, 2013

serhiy-storchaka commented Aug 4, 2013

python-dev mannequin commented Aug 4, 2013

elibendersky mannequin commented Aug 4, 2013