msg149602 - (view) |
Author: Dongying Zhang (dongying) |
Date: 2011-12-16 09:04 |
I've been trying to parse xml string using python, codes following:
#-*- coding: utf-8 -*-
import xml.etree.ElementTree as xmlet
s = '<?xml version="1.0" encoding="GBK"?><info></info>'
xmlet.fromstring(s)
Then:
$ python2.6 test.py
or:
$ pypy test.py
Traceback message came out like this:
Traceback (most recent call last):
File "test.py", line 4, in <module>
xmlet.fromstring(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML
parser.feed(text)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed
self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30
I've seen this in following cases:
Python2.5.6 on Mac OS X Lion
Python2.6.5 on Ubuntu 10.04
pypy1.7.0 on Mac OS X Lion
But not in these cases:
Python2.6.7 on Mac OS X Lion
Python2.7.1 on Mac OS X Lion
|
msg149612 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2011-12-16 11:26 |
Actually, this fails on 2.6 and 2.7 on wide unicode builds, and passes with narrow unicode builds (on my 64bit Linux box).
In pyexpat.c, PyUnknownEncodingHandler accesses 256 characters of a unicode buffer, without checking its length... which happens to be 192 chars long.
So buffers overflow, etc. The function has a comment "supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.
Supporting multibytes codecs won't be easy: pyexpat requires to fill an array which specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-bytes sequence, 0xef starts a 3-bytes sequence). Our codecs framwork does not provide this information, and some codecs (gb18030 for example) need the second char to determine whether it will need 4 bytes.
|
msg149652 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2011-12-17 01:25 |
3.2, Win7 (a narrow build) it indeed works and returns
<Element 'info' at 0x3389128>
|
msg189660 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-20 13:17 |
In 3.3+ there's no distinction between wide and narrow builds. Does anyone know how this should be affected? [I don't know much about unicode and encodings, unfortunately]
|
msg189671 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2013-05-20 15:35 |
encoding="GBK" causes a buffer overflow in PyUnknownEncodingHandler, because the result of PyUnicode_Decode() is only 192 characters long.
Exact behavior is not defined...
|
msg189685 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2013-05-20 19:53 |
3.3 shifted the wide-build problem to all builds ;-). I now get
File "C:\Python\mypy\tem.py", line 4, in <module>
xmlet.fromstring(s)
File "C:...33\lib\xml\etree\ElementTree.py", line 1356, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: unknown encoding: line 1, column 30
I do not understand the 'unknown encoding' bit. Replacing 'GBK' with a truly unknown encoding changes the last line to
LookupError: unknown encoding: xyz, so the lookup of 'GBK' succeeded.
I get the same two messages if I add a 'b' prefix to make s be bytes, which it logically should be (and was in 2.7). (I presume .fromstring 'encodes' unicode input to bytes with the ascii or latin-1 encoder and then decodes back to unicode according to the announced encoding.)
With s so prefixed, s.decode(encoding="GBK") works and returns the original unicode version of s, so Python does know "GBK". And it indeed is in the list of official IANA charset names.
I don't know unicode internals to understand Amaury's comment. However, it almost reads to me as if this is a unicode bug, not ET bug.
|
msg189694 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-20 21:18 |
PyUnknownEncodingHandler gets an encoding name and fills a special XML_Encoding structure which expat parser uses for decoding. Currently PyUnknownEncodingHandler works only with 8-bit encodings and I don't see an efficient method how extent it to handle general multibyte encoding.
|
msg189720 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-21 03:31 |
I can propose only raise a specialized exception instead of general xml.etree.ElementTree.ParseError. Here is a patch. It also fixes a buffer overread bug mentioned by Amaury.
|
msg189785 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2013-05-21 20:59 |
The patch goes in the right direction: consistently reject non-8bit encodings which the current implementation does not support.
- please add a unit test
- remove usage of PyUnicodeObject and the "Stupid to access directly" comment, they are outdated.
Real support for GBK is more work, http://wang.yuxuan.org/blog/?itemid=63 shows how to set a conversion callback. It will difficult to implement in the general case, see my older comment.
|
msg189796 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-22 08:01 |
For unit tests we first should fix issue16986.
|
msg189809 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-22 12:27 |
> For unit tests we first should fix issue16986.
I did another round of code review on issue 16986 now.
|
msg189810 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-22 12:29 |
Looked at Serhiy's patch here too: LGTM with a unit test :)
|
msg189818 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-22 13:52 |
Here is an updated patch. PyUnknownEncodingHandler() and expat_unknown_encoding_handler() are synchronized. Added tests.
|
msg189845 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-23 04:02 |
Serhiy, would it make sense to share the code somewhere instead of duplicating it?
|
msg189847 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-23 04:48 |
May be. But it needs more work for building. It is simpler just duplicate a code (the function is small enough) and add comments:
/* The code is duplicated as xx() in xxx.c. Keep both functions synchronized. */
|
msg189865 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-23 12:48 |
How about this patch (not tested it too much - just as a proof of concept).
We're pretty free in the C API exported by pyexpat through a capsule to _elementtree, so we can also add a default handler there. This API already has some general utilities like ErrorString.
ET can then simply call the handler from pyexpat.
|
msg189898 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-24 06:19 |
LGTM. But why not just inline expat_unknown_encoding_handler()?
For 2.7 we perhaps should add SetStartDoctypeDeclHandler and SetEncoding in the exported C API. Shouldn't we update the C API version (for this issue and for issue16986)?
|
msg189956 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-05-25 12:27 |
New changeset f7b47fb30169 by Eli Bendersky in branch '3.3':
Issue #13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/f7b47fb30169
New changeset 47e719b11c46 by Eli Bendersky in branch 'default':
Issue #13612: handle unknown encodings without a buffer overflow.
http://hg.python.org/cpython/rev/47e719b11c46
|
msg189957 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-25 12:32 |
A few notes:
1. If by C API version you mean PyExpat_CAPI_MAGIC, I'm not sure what difference that makes. It has never been updated and it's also being used only in _elementtree. Since the latter is statically compiled against pyexpat, I don't see a reason to touch it.
2. I wouldn't backport it to 2.7; again, not critical enough.
3. Should we open a separate issue for adding support to GBK and similar encodings "sometime in the future"? This issue kind-of got derailed to fix the bug.
|
msg189960 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-05-25 13:44 |
Tests were duplicated. There is no Misc/NEWS entry.
I think a buffer overflow is critical enough for backporting.
|
msg189961 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-05-25 14:18 |
Oh, I didn't notice that your patches had duplication in the tests. Fixed now. I'll wait to see what unfolds for the Misc/NEWS discussion on python-committers.
|
msg194061 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-08-01 12:48 |
Serhiy, do you want to backport the buffer overflow fix to 2.7?
|
msg194340 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-08-04 10:24 |
Here is a patch for 2.7.
|
msg194364 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-08-04 13:10 |
New changeset b3efc140d8a6 by Eli Bendersky in branch '2.7':
Issue #13612: Fix a buffer overflow in case of a multi-byte encoding.
http://hg.python.org/cpython/rev/b3efc140d8a6
|
msg194365 - (view) |
Author: Eli Bendersky (eli.bendersky) *  |
Date: 2013-08-04 13:11 |
Thanks, Serhiy.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:24 | admin | set | github: 57821 |
2013-08-04 13:11:32 | eli.bendersky | set | status: open -> closed resolution: fixed messages:
+ msg194365
stage: patch review -> resolved |
2013-08-04 13:10:41 | python-dev | set | messages:
+ msg194364 |
2013-08-04 10:24:34 | serhiy.storchaka | set | files:
+ expat_buffer_overflow-2.7.patch
messages:
+ msg194340 |
2013-08-01 12:48:56 | eli.bendersky | set | messages:
+ msg194061 |
2013-05-25 14:18:49 | eli.bendersky | set | messages:
+ msg189961 |
2013-05-25 13:44:25 | serhiy.storchaka | set | messages:
+ msg189960 |
2013-05-25 12:32:56 | eli.bendersky | set | messages:
+ msg189957 |
2013-05-25 12:27:32 | python-dev | set | nosy:
+ python-dev messages:
+ msg189956
|
2013-05-24 06:19:08 | serhiy.storchaka | set | messages:
+ msg189898 |
2013-05-23 12:48:20 | eli.bendersky | set | files:
+ expat_et_share_unknown_handler.1.patch
messages:
+ msg189865 |
2013-05-23 04:48:59 | serhiy.storchaka | set | messages:
+ msg189847 |
2013-05-23 04:02:53 | eli.bendersky | set | messages:
+ msg189845 |
2013-05-22 13:57:33 | serhiy.storchaka | set | dependencies:
- ElementTree incorrectly parses strings with declared encoding not UTF-8 |
2013-05-22 13:52:31 | serhiy.storchaka | set | files:
+ expat_unknown_encoding_handler_2.patch
messages:
+ msg189818 |
2013-05-22 12:29:22 | eli.bendersky | set | messages:
+ msg189810 |
2013-05-22 12:27:27 | eli.bendersky | set | messages:
+ msg189809 |
2013-05-22 08:01:25 | serhiy.storchaka | set | messages:
+ msg189796 |
2013-05-22 07:59:11 | serhiy.storchaka | set | dependencies:
+ ElementTree incorrectly parses strings with declared encoding not UTF-8 |
2013-05-21 20:59:41 | amaury.forgeotdarc | set | messages:
+ msg189785 |
2013-05-21 03:31:10 | serhiy.storchaka | set | files:
+ expat_unknown_encoding_handler.patch
components:
+ Extension Modules, Unicode, XML, - Library (Lib) versions:
+ Python 3.4 keywords:
+ patch nosy:
+ ezio.melotti
messages:
+ msg189720 stage: patch review |
2013-05-20 21:18:51 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg189694
|
2013-05-20 19:53:03 | terry.reedy | set | messages:
+ msg189685 versions:
- Python 3.2 |
2013-05-20 15:35:55 | amaury.forgeotdarc | set | messages:
+ msg189671 |
2013-05-20 13:17:22 | eli.bendersky | set | messages:
+ msg189660 |
2012-07-21 15:27:22 | flox | set | nosy:
+ eli.bendersky
|
2011-12-19 15:15:16 | pitrou | set | nosy:
+ vstinner, flox
versions:
+ Python 3.2, Python 3.3 |
2011-12-17 01:25:36 | terry.reedy | set | nosy:
+ terry.reedy
messages:
+ msg149652 versions:
+ Python 2.7, - Python 2.6, 3rd party |
2011-12-16 11:26:44 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg149612
|
2011-12-16 09:05:08 | dongying | set | type: behavior |
2011-12-16 09:04:34 | dongying | set | versions:
+ 3rd party |
2011-12-16 09:04:10 | dongying | create | |