Classification
Title: xml.etree.ElementTree fails to parse a document (regression)
Type: behavior
Stage: resolved
Components: XML
Versions: Python 3.7, Python 3.6

Process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: issue31170, Update to expat 2.2.4 (expat: utf8_toUtf8 cannot properly handle exhausting buffer)
Assigned To: serhiy.storchaka
Nosy List: Vyacheslav.Rafalskiy, haypo, serhiy.storchaka
Priority: critical
Keywords:

Created on 2017-08-29 17:45 by Vyacheslav.Rafalskiy, last changed 2017-08-30 05:15 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
bad_file.xml Vyacheslav.Rafalskiy, 2017-08-29 17:45
bad_file.xml Vyacheslav.Rafalskiy, 2017-08-29 17:50
Messages (3)
msg300996 - Author: Vyacheslav Rafalskiy (Vyacheslav.Rafalskiy) Date: 2017-08-29 17:45
In Python 3.5.4 and 3.6.2, on both Windows and Linux, parsing a manifestly well-formed XML file with:

xml.etree.ElementTree.parse('bad_file.xml')

raises:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1023: invalid continuation byte

Any other Python version I tried works fine, including 2.7.13, 3.5.2 ...
msg300997 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-08-29 18:58
Simpler reproducer:

>>> import xml.etree.ElementTree
>>> xml.etree.ElementTree.XML(b'<key attr="' + b'x'*1023 + b'\xc3\xa0&quot;"/>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1023: invalid continuation byte

This seems to be a regression in the Expat library.
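
The reproducer above can be wrapped in a small script to check whether a given interpreter is affected. This is only a sketch and not part of the original report; the helper names build_document and parses_cleanly are hypothetical. It places the two-byte UTF-8 sequence for U+00E0 directly after 1023 bytes of ASCII padding inside the attribute value, which matches the byte position reported in the traceback.

```python
import xml.etree.ElementTree as ET


def build_document(padding=1023):
    """Return the XML bytes from the reproducer, with adjustable padding."""
    # b'\xc3\xa0' is U+00E0 encoded as UTF-8; &quot; is an entity reference
    # that follows it inside the attribute value.
    return b'<key attr="' + b'x' * padding + b'\xc3\xa0&quot;"/>'


def parses_cleanly(padding=1023):
    """Return True if the document parses and the attribute round-trips."""
    try:
        element = ET.XML(build_document(padding))
    except UnicodeDecodeError:
        # The failure mode reported in this issue.
        return False
    return element.get("attr") == "x" * padding + "\u00e0" + '"'


if __name__ == "__main__":
    print("not affected" if parses_cleanly() else "affected")
```

Varying the padding argument is one way to probe whether the failure depends on where the multi-byte sequence falls; the report itself only establishes the 1023-byte case.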
msg301010 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-08-30 05:15
This is a duplicate of issue31170. Updating expat to 2.2.4 fixes this issue.
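
To see which Expat version a given interpreter is using, the version tuple exposed by xml.parsers.expat can be compared against 2.2.4. A minimal sketch follows; the names FIXED_IN and expat_has_fix are hypothetical. It compares version numbers only, so it cannot tell whether a distribution has backported the fix to an older Expat.

```python
import xml.parsers.expat as expat

# Expat release named in this issue as containing the fix.
FIXED_IN = (2, 2, 4)


def expat_has_fix(minimum=FIXED_IN):
    """Return True if the Expat in use is at least the given version."""
    # version_info is a (major, minor, micro) tuple of integers.
    return tuple(expat.version_info) >= tuple(minimum)


if __name__ == "__main__":
    # EXPAT_VERSION is a human-readable string such as "expat_2.2.4".
    print(expat.EXPAT_VERSION, "has fix:", expat_has_fix())
```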
History
Date User Action Args
2017-08-30 05:15:12 serhiy.storchaka set
    status: open -> closed
    superseder: Update to expat 2.2.4 (expat: utf8_toUtf8 cannot properly handle exhausting buffer)
    messages: + msg301010
    resolution: duplicate
    stage: resolved
2017-08-29 18:58:09 serhiy.storchaka set
    priority: normal -> critical
    nosy: + haypo
    messages: + msg300997
2017-08-29 18:12:16 serhiy.storchaka set
    assignee: serhiy.storchaka
    type: crash -> behavior
    versions: + Python 3.7, - Python 3.5
2017-08-29 17:58:31 r.david.murray set
    nosy: + serhiy.storchaka
2017-08-29 17:50:24 Vyacheslav.Rafalskiy set
    files: + bad_file.xml
2017-08-29 17:45:53 Vyacheslav.Rafalskiy create