classification
Title: Expat parser parses strings only when XML encoding is UTF-8
Type: behavior Stage: resolved
Components: Extension Modules, Unicode, XML Versions: Python 3.4, Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-01-31 10:01 by serhiy.storchaka, last changed 2013-05-22 18:17 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
pyexpat_parse_str.patch serhiy.storchaka, 2013-01-31 10:01
Messages (2)
msg181014 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-01-31 10:01
xmlparser.Parse() works with string data only if the XML encoding is UTF-8 (or ASCII). Examples:

>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-8'?><tag>\xb5</tag>")
1
>>> content
['µ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='iso8859'?><tag>\xb5</tag>")
1
>>> content
['Âµ']
>>> parser = xml.parsers.expat.ParserCreate()
>>> content = []
>>> parser.CharacterDataHandler = content.append
>>> parser.Parse("<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: encoding specified in XML declaration is incorrect: line 1, column 30

This affects all other modules which work with XML: xml.sax, xml.dom.minidom, xml.dom.pulldom, xml.etree.ElementTree.
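
As an illustration of how the same input reaches Expat through two of the higher-level modules, here is a minimal sketch. The sample document and the declared name 'iso-8859-1' are assumptions chosen for the example, not taken from the report. On an interpreter that includes the fix, both calls should return the original character; an affected version would be expected to misdecode it.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Illustrative document (an assumption for this sketch): a str whose
# XML declaration names a non-UTF-8 encoding.
doc = "<?xml version='1.0' encoding='iso-8859-1'?><tag>\xb5</tag>"

# Both modules pass the str on to the Expat parser underneath.
et_text = ET.fromstring(doc).text

dom_root = minidom.parseString(doc).documentElement
# Join the child text nodes in case the text arrives in several pieces.
dom_text = "".join(node.data for node in dom_root.childNodes)

print(ascii(et_text), ascii(dom_text))
```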

Here is a patch which fixes parsing string data with non-UTF-8 XML.
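
A rough sketch of the two input paths, for comparison. The helper name character_data is hypothetical. Passing bytes encoded with the declared encoding sidesteps the problem, since Expat then decodes the data itself. The str call is expected to succeed only on an interpreter that includes the fix; on an affected version it would raise ExpatError as shown above.

```python
import xml.parsers.expat

def character_data(data):
    """Parse data (str or bytes) and return the collected character data.

    Hypothetical helper used only for this illustration.
    """
    parser = xml.parsers.expat.ParserCreate()
    chunks = []
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)  # True marks this as the final chunk
    # Character data may be delivered in several pieces, so join them.
    return "".join(chunks)

doc = "<?xml version='1.0' encoding='utf-16'?><tag>\xb5</tag>"

# Bytes input: encode with the encoding that the declaration names.
from_bytes = character_data(doc.encode("utf-16"))

# Str input: relies on the fixed behavior described in this issue.
from_str = character_data(doc)

print(ascii(from_bytes), ascii(from_str))
```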
msg181347 - Author: Roundup Robot (python-dev) Date: 2013-02-04 16:32
New changeset 3cc2a2de36e3 by Serhiy Storchaka in branch '3.2':
Issue #17089: Expat parser now correctly works with string input not only when
http://hg.python.org/cpython/rev/3cc2a2de36e3

New changeset 6c27b0e09c43 by Serhiy Storchaka in branch '3.3':
Issue #17089: Expat parser now correctly works with string input not only when
http://hg.python.org/cpython/rev/6c27b0e09c43

New changeset c4e6e560e6f5 by Serhiy Storchaka in branch 'default':
Issue #17089: Expat parser now correctly works with string input not only when
http://hg.python.org/cpython/rev/c4e6e560e6f5
History
Date User Action Args
2013-05-22 18:17:25 serhiy.storchaka set versions: - Python 2.7
2013-02-25 15:41:05 serhiy.storchaka link issue16986 dependencies
2013-02-13 13:46:43 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2013-02-04 16:32:53 python-dev set nosy: + python-dev; messages: + msg181347
2013-01-31 10:05:47 serhiy.storchaka link issue10590 dependencies
2013-01-31 10:03:46 serhiy.storchaka link issue1483 dependencies
2013-01-31 10:02:57 serhiy.storchaka link issue2174 dependencies
2013-01-31 10:02:24 serhiy.storchaka link issue2175 dependencies
2013-01-31 10:01:19 serhiy.storchaka create