classification
Title: xml.etree parser does not accept valid control characters
Type:
Stage:
Components: XML
Versions: Python 3.9

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Romuald, eli.bendersky, scoder, serhiy.storchaka
Priority: normal
Keywords:

Created on 2021-04-02 11:03 by Romuald, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg390050 - (view) Author: Romuald Brunet (Romuald) * Date: 2021-04-02 11:03
The Python XML parser (xml.etree) does not seem to allow control characters that are invalid in XML 1.0 but valid in XML 1.1 [1] [2].


Considering the following sample:


import xml.etree.ElementTree as ET

bad = '<?xml version="1.1"?><foo>bar &#x19; baz</foo>'
print(ET.fromstring(bad))


The parser raises the following error:
ParseError: reference to invalid character number: line 1, column 30



[1] https://www.w3.org/TR/xml11/Overview.html#charsets
[2] https://www.w3.org/TR/xml11/Overview.html#sec-xml11
msg390065 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-04-02 14:20
It is a known issue, see issue11804 and issue39512.

In short, the underlying library for XML parsing (expat) does not support XML 1.1 and has no plans to support it. It also seems that XML 1.1 is a dead standard, given that popular parsing libraries do not support it.

Where do you get your XML data from? What programs generated it?
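
As a rough illustration of the point above, the following sketch parses the sample from the first message with both version declarations. The helper name `try_parse` is made up for this example. Since expat applies XML 1.0 rules whatever version the document declares, both attempts are expected to fail with a ParseError.

```python
import xml.etree.ElementTree as ET


def try_parse(version):
    """Parse the sample document with the given version declaration.

    Returns the ParseError message, or None if parsing succeeded.
    (Hypothetical helper, only for this illustration.)
    """
    doc = f'<?xml version="{version}"?><foo>bar &#x19; baz</foo>'
    try:
        ET.fromstring(doc)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Compare the outcome for an XML 1.0 and an XML 1.1 declaration.
results = {version: try_parse(version) for version in ("1.0", "1.1")}
for version, error in results.items():
    print(f"version {version}: {error}")
```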
msg390066 - (view) Author: Romuald Brunet (Romuald) * Date: 2021-04-02 14:39
Thanks for the quick reply.

We're getting data from about a hundred different providers around the world, some of them not really keen on standards, so we already have some hacks to fix invalid XML. We'll add one to the list.

In that particular case, the XML was invalid anyway, since it was an XML 1.0 document and the character was sent as "binary" (\x19).
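
A minimal sketch of what such a hack could look like, assuming it is acceptable to silently drop the offending characters. This is not the code actually used by the reporter, and the names `strip_invalid_xml10`, `_INVALID_RAW` and `_CHAR_REF` are made up for this example. It removes raw characters outside the XML 1.0 Char production, as well as numeric character references that point to them, before handing the text to xml.etree.

```python
import re
import xml.etree.ElementTree as ET

# Raw characters that fall outside the XML 1.0 "Char" production.
_INVALID_RAW = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Numeric character references, hexadecimal (&#x19;) or decimal (&#25;).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_xml10_char(cp):
    """Return True if the code point is allowed by XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def _drop_bad_ref(match):
    """Keep a character reference only if it names a valid XML 1.0 char."""
    hex_digits, dec_digits = match.groups()
    cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return match.group(0) if _is_xml10_char(cp) else ""


def strip_invalid_xml10(text):
    """Drop raw characters and numeric references that XML 1.0 forbids."""
    text = _INVALID_RAW.sub("", text)
    return _CHAR_REF.sub(_drop_bad_ref, text)


# One raw \x19 and one &#x19; reference, both rejected by the parser as-is.
raw = "<foo>bar \x19 baz &#x19; qux</foo>"
root = ET.fromstring(strip_invalid_xml10(raw))
print(repr(root.text))
```

Note that this works on the decoded text and does not understand XML structure, so it would also remove something that merely looks like a forbidden reference inside a CDATA section or a comment. Dropping characters also changes the data, so replacing them with a placeholder may be preferable in some cases.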

History
Date                 User              Action  Args
2022-04-11 14:59:43  admin             set     github: 87869
2021-04-02 14:39:51  Romuald           set     messages: + msg390066
2021-04-02 14:20:12  serhiy.storchaka  set     messages: + msg390065
2021-04-02 12:33:24  xtreak            set     nosy: + scoder, eli.bendersky, serhiy.storchaka
2021-04-02 11:03:35  Romuald           create