Classification
Title: xml.dom.minidom cannot parse ISO-2022-JP
Components: XML
Versions: Python 2.7
Process
Status: open
Nosy List: amaury.forgeotdarc, dcallagh
Priority: normal

Created on 2012-09-07 06:38 by dcallagh, last changed 2012-09-07 09:22 by amaury.forgeotdarc.

Messages (2)
msg169974 - Author: Dan Callaghan (dcallagh) Date: 2012-09-07 06:38
Python 2.7.3 (default, Jul 24 2012, 10:05:38) 
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom

Encoded as UTF-8, everything is fine:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document instance at 0x7f310d27dcf8>

but not ISO-2022-JP:

>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 48

lxml can handle it fine though:

>>> import lxml.etree
>>> lxml.etree.fromstring('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<Element x at 0x7f310d284960>
>>> _.text == c
True
msg169982 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2012-09-07 09:22
This is similar to issue13612: pyexpat does not support multibyte encodings.
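
Since the limitation is in pyexpat's handling of multibyte encodings, one possible workaround is to transcode the document to UTF-8 before handing it to minidom. The following is only a sketch, not something proposed in this issue: `parse_transcoded` and `_DECL_ENCODING` are hypothetical names, and the approach assumes the XML declaration itself is readable as ASCII bytes (true for ISO-2022-JP, not for UTF-16) and that Python ships a codec for the declared encoding.

```python
import re
import xml.dom.minidom

# Matches the encoding pseudo-attribute of a leading XML declaration.
# Assumption: the declaration is ASCII-readable in the raw bytes.
_DECL_ENCODING = re.compile(
    br'^(<\?xml[^>]*?encoding=)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2')


def parse_transcoded(data):
    """Parse XML bytes, transcoding to UTF-8 first if an encoding is declared.

    Hypothetical helper: it sidesteps pyexpat's encoding support by letting
    Python's own codecs do the decoding.
    """
    match = _DECL_ENCODING.match(data)
    if match is None:
        # No declared encoding: let the parser apply its defaults.
        return xml.dom.minidom.parseString(data)
    declared = match.group(3).decode('ascii')
    # Decode with Python's codec, then re-encode as UTF-8.
    utf8 = data.decode(declared).encode('utf-8')
    # Rewrite the declaration so it agrees with the new byte encoding.
    utf8 = _DECL_ENCODING.sub(br'\g<1>\g<2>UTF-8\g<2>', utf8, count=1)
    return xml.dom.minidom.parseString(utf8)


# Usage with the document from the report above.
c = u'\u65e5\u672c\u8a9e'
data = (u'<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c).encode('ISO-2022-JP')
doc = parse_transcoded(data)
text = doc.documentElement.firstChild.data
```

Here `text` should compare equal to `c`, mirroring the lxml check in the original report. The regular expression is deliberately narrow; a document with a byte order mark or a non-ASCII-compatible encoding would need proper encoding detection first.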
History
Date                 User                Action  Args
2012-09-07 09:22:32  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc; messages: + msg169982
2012-09-07 06:38:03  dcallagh            create