New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xml.etree.ElementTree says unknown encoding of a regular encoding #57821
Comments
I've been trying to parse xml string using python, codes following: #-*- coding: utf-8 -*-
import xml.etree.ElementTree as xmlet
s = '<?xml version="1.0" encoding="GBK"?><info></info>'
xmlet.fromstring(s) Then: Traceback message came out like this: Traceback (most recent call last):
File "test.py", line 4, in <module>
xmlet.fromstring(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30 I've seen this in following cases:
But not in these cases:
|
Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box). In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long. Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes. |
3.2, Win7 (a narrow build) it indeed works and returns |
In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately] |
encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long.
Exact behavior is not defined... |
3.3 shifted the wide-build problem to all builds ;-). I now get File "C:\Python\mypy\tem.py", line 4, in <module> I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.) With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names. I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug. |
PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding. |
I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury. |
The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support.
Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment. |
For unit tests we first should fix bpo-16986. |
I did another round of code review on bpo-16986 now. |
Looked at Serhiy's patch here too: LGTM with a unit test :) |
Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests. |
Serhiy, would it make sense to share the code somewhere instead of duplicating it? |
May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments: /* The code is duplicated as xx() in xxx.c. Keep both functions synchronized. */ |
How about this patch (not tested it too much - just as a proof of concept). We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString. ET can then simply call the handler from pyexpat. |
LGTM. But why not just inline expat_unknown_encoding_handler()? For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for bpo-16986)? |
New changeset f7b47fb30169 by Eli Bendersky in branch '3.3': New changeset 47e719b11c46 by Eli Bendersky in branch 'default': |
A few notes:
|
Tests were duplicated. There is no Misc/NEWS entry. I think a buffer overflow is critical enough for backporting. |
Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers. |
Serhiy, do you want to backport the buffer overflow fix to 2.7? |
Here is a patch for 2.7. |
New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7': |
Thanks, Serhiy. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: