This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: xml.dom.minidom produces errors with certain unicode chars
Type: Stage:
Components: Unicode Versions: Python 2.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: effbot, lemburg, leogah, peerjanssen
Priority: normal Keywords:

Created on 2004-11-27 13:58 by peerjanssen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
unicodebug.zip peerjanssen, 2004-11-27 14:29 Test program and files for this bug
Messages (6)
msg23340 - (view) Author: Peer Janssen (peerjanssen) Date: 2004-11-27 13:58
(Note: I tried to file this before, but it didn't show
up in the list, so I am trying again.)

In an XML document generated by Trados Translators
Workbench (a TMX V 1.1 Translation Memory), the Unicode
characters START OF HEADING (U+0001, see
http://www.fileformat.info/info/unicode/char/0001/index.htm)
and SINGLE LOW-9 QUOTATION MARK (U+201A, see
http://www.fileformat.info/info/unicode/char/201a/index.htm)
produce errors when the document is parsed from a file with
"xml.dom.minidom".

The first one (0001) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 420, column 106

The second one (201A) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 624, column 2

With both characters deleted throughout the document,
parsing succeeds.

I don't see why these characters should be a problem,
especially the quotation mark.
msg23341 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-11-27 14:02
Please provide an example that lets us reproduce
the error.

Unassigning, since I'm not an expert on minidom.
msg23342 - (view) Author: Peer Janssen (peerjanssen) Date: 2004-11-27 14:27
Here is a zip file with a test program, domtree.py, and two
test files. I noticed that the first test file triggers its
bug only on my Windows box, but the second test file
produces an error on both my Windows and my Linux box.

The Windows Python version is:
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit
(Intel)] on win32
The Linux Python version is:
 Python 2.3.3. (#2, Feb 17, 2004, 11:45:40) [GCC 3.3.2
(Mandrake Linux 10.0 3.3.2-6mdk)] on linux2
msg23343 - (view) Author: Peer Janssen (peerjanssen) Date: 2004-11-27 14:29
The file.
msg23344 - (view) Author: Richard Brodie (leogah) Date: 2004-12-03 00:37
I don't think there are any bugs here: at least not Python ones.

U+0001 (SOH) isn't an allowed character in XML 1.0:
http://www.w3.org/International/questions/qa-controls
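
To illustrate the point about U+0001, here is a minimal sketch (not part of the original report) showing that expat rejects the character and that filtering the input against the XML 1.0 Char production lets it parse. The helper name strip_invalid_xml_chars and the sample string are assumptions made for illustration only.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Characters outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters that XML 1.0 does not allow."""
    return _INVALID_XML10.sub("", text)


bad = "<seg>a\x01b</seg>"  # contains U+0001 (START OF HEADING)

try:
    minidom.parseString(bad)
    rejected = False
except ExpatError:
    # expat reports the control character as a well-formedness error
    rejected = True

clean = strip_invalid_xml_chars(bad)
dom = minidom.parseString(clean)
text = dom.documentElement.firstChild.data
```

Stripping characters silently changes the data, so whether that is acceptable depends on what the generating tool meant by them.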

U+201A (SINGLE LOW-9 QUOTATION MARK) should be fine, except
that its \x1A byte is treated as EOF when the file is opened
in text mode on Windows; expat then chokes on all the
unclosed tags. Open the file in 'rb' mode.
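
A rough sketch of the binary-mode suggestion, assuming the document is stored as UTF-16 (the thread does not state the encoding of the test files). The function name parse_binary and the sample document are made up for illustration. In little-endian UTF-16, U+201A is stored as the bytes 0x1A 0x20, which is where the stray \x1A comes from.

```python
import os
import tempfile
from xml.dom import minidom


def parse_binary(path):
    """Hypothetical helper: parse an XML file opened in binary mode,
    so that expat sees the raw bytes and handles the encoding itself."""
    with open(path, "rb") as fh:
        return minidom.parse(fh)


# Sample document (an assumption, not the reporter's file) containing U+201A.
doc_text = '<?xml version="1.0" encoding="UTF-16"?><seg>\u201aquoted</seg>'
raw = b"\xff\xfe" + doc_text.encode("utf-16-le")  # BOM + little-endian UTF-16

fd, path = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw)
    dom = parse_binary(path)
    text = dom.documentElement.firstChild.data
finally:
    os.remove(path)
```

The Ctrl-Z truncation described above applies to text-mode reads on Windows under Python 2; reading the bytes unmodified avoids depending on the platform's text-mode behaviour at all.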

RB.
msg23345 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2004-12-03 11:29
Closing; see leogah's reply for background.
History
Date User Action Args
2022-04-11 14:56:08 admin set github: 41235
2004-11-27 13:58:10 peerjanssen create