Issue 18753: [c]ElementTree.fromstring fails to parse <value>]]></value>

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62953

classification

Title:	[c]ElementTree.fromstring fails to parse ]]>
Type:	behavior	Stage:	resolved
Components:	XML	Versions:	Python 2.7

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	eli.bendersky, kees, r.david.murray, scoder
Priority:	normal	Keywords:

Created on 2013-08-16 08:31 by kees, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg195315 - (view)	Author: Kees Bos (kees) *	Date: 2013-08-16 08:31
ElementTree.fromstring and cElementTree.fromstring fail on parsing "<value>]]></value>", but do parse "<value>]]&gt;</value>"

$ python
Python 2.7.3 (default, Apr 10 2013, 05:09:49)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree import cElementTree as ET
>>> ET.fromstring("<value>]]></value>").text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 124, in XML
cElementTree.ParseError: not well-formed (invalid token): line 1, column 9
>>> ET.fromstring("<value>]]&gt;</value>").text
']]>'
>>> from xml.etree import ElementTree as ET
>>> ET.fromstring("<value>]]></value>").text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 9
>>> ET.fromstring("<value>]]&gt;</value>").text
']]>'
>>>
msg195327 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-16 13:28
Why do you think this is a bug? (You may well be right; I'm not familiar with the intricacies of XML. But on its face the behavior looks reasonable.)
msg195359 - (view)	Author: Kees Bos (kees) *	Date: 2013-08-16 16:48
I'm not an expert, but from http://www.w3.org/TR/REC-xml/#NT-AttValue:

AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"

which I read as: any Reference character is valid, except & and <, which are used for escaping and closing the element.

The sequence <value>]]></value> also validates as well-formed at http://www.xmlvalidation.com/

The sequence <value>]></value> parses OK (so it's only with a double ] followed by >).

It's probably related to parsing <![CDATA[ ... ]]> (i.e. I guess that when the parser detects ]]> it assumes / requires the state of <![CDATA[ which is, of course, not true here).

The sequence <value><![CDATA[foo]]></value> is parsed correctly:

>>> ET.fromstring('<value><![CDATA[foo]]></value>').text
'foo'

BTW, lxml.etree.fromstring also fails, and so does http://www.w3schools.com/xml/xml_validator.asp

I'll ask around on the lxml mailing list what they think about this behavior.
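The three cases in this message can be checked side by side. The following is only a sketch for readers reproducing the report: it is written for Python 3 (the original session used Python 2.7), and the helper name try_parse is made up for illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(doc):
    """Return the root element's text, or the ParseError if parsing fails.

    Hypothetical helper, not part of ElementTree.
    """
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as err:
        return err


# A single "]" followed by ">" in element content.
single = try_parse("<value>]></value>")

# A real CDATA section, where "]]>" is the closing delimiter.
cdata = try_parse("<value><![CDATA[foo]]></value>")

# A literal "]]>" in element content, outside any CDATA section.
double = try_parse("<value>]]></value>")

for label, result in [("]>", single), ("CDATA", cdata), ("]]>", double)]:
    print(label, "->", repr(result))
```

Only the last case should come back as a ParseError; the other two return the element text.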
msg195399 - (view)	Author: Kees Bos (kees) *	Date: 2013-08-16 19:13
OK. I got clarification from the lxml list. It's not a bug, and it's specified in section 2.4 (http://www.w3.org/TR/REC-xml/#syntax):

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

Sorry for the confusion and taking your time for a bogus report.
msg195400 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-08-16 19:19
Not a problem, these things are often subtle. And now there is a record of it in the tracker if anyone else questions it in the future.
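
For anyone who lands here with the same question, here is a minimal sketch of the escaping that section 2.4 asks for. It is written for Python 3 and is not taken from the thread; the use of xml.sax.saxutils.escape and the variable names are choices made for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "]]>"

# escape() replaces &, < and > with entity references, so the
# CDATA-close sequence never appears literally in element content.
escaped_doc = "<value>" + escape(text) + "</value>"

# A numeric character reference for ">" is also accepted.
char_ref_doc = "<value>]]&#62;</value>"

print(ET.fromstring(escaped_doc).text)
print(ET.fromstring(char_ref_doc).text)

# When the document is built with ElementTree itself, serialization
# escapes ">" in text, so the output can be parsed back.
elem = ET.Element("value")
elem.text = text
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
```

Both parsed documents give back the original three characters, and the serialized element round-trips through ET.fromstring.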

History
Date	User	Action	Args
2022-04-11 14:57:49	admin	set	github: 62953
2013-08-16 19:19:10	r.david.murray	set	messages: + msg195400 stage: resolved
2013-08-16 19:13:34	kees	set	status: open -> closed resolution: not a bug
2013-08-16 19:13:14	kees	set	messages: + msg195399
2013-08-16 16:51:58	pitrou	set	nosy: + scoder, eli.bendersky
2013-08-16 16:48:27	kees	set	messages: + msg195359
2013-08-16 13:28:01	r.david.murray	set	nosy: + r.david.murray messages: + msg195327
2013-08-16 08:31:03	kees	create