
classification
Title: [c]ElementTree.fromstring fails to parse ]]>
Type: behavior
Stage: resolved
Components: XML
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eli.bendersky, kees, r.david.murray, scoder
Priority: normal
Keywords:

Created on 2013-08-16 08:31 by kees, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg195315 - Author: Kees Bos (kees) * Date: 2013-08-16 08:31
ElementTree.fromstring and cElementTree.fromstring fail to parse
"<value>]]></value>", but do parse "<value>]]&gt;</value>".

$ python
Python 2.7.3 (default, Apr 10 2013, 05:09:49) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree import cElementTree as ET
>>> ET.fromstring("<value>]]></value>").text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 124, in XML
cElementTree.ParseError: not well-formed (invalid token): line 1, column 9
>>> ET.fromstring("<value>]]&gt;</value>").text
']]>'
>>> from xml.etree import ElementTree as ET
>>> ET.fromstring("<value>]]></value>").text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 9
>>> ET.fromstring("<value>]]&gt;</value>").text
']]>'
>>>
msg195327 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-16 13:28
Why do you think this is a bug?  (You may well be right; I'm not familiar with the intricacies of XML. But on its face the behavior looks reasonable.)
msg195359 - Author: Kees Bos (kees) * Date: 2013-08-16 16:48
I'm not an expert, but from: http://www.w3.org/TR/REC-xml/#NT-AttValue

	AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"

which I read as: Any Reference character is valid, except & and <, which are used for escaping and closing the element.

The sequence <value>]]></value> also validates as well-formed at http://www.xmlvalidation.com/

The sequence <value>]></value> parses OK (so it only fails with a double ] followed by >).

It's probably related to parsing <![CDATA[ ... ]]> (i.e. I guess that when the parser detects ]]> it
assumes it is inside a <![CDATA[ section, which is, of course, not the case here).

The sequence <value><![CDATA[foo]]></value> is parsed correctly:
>>> ET.fromstring('<value><![CDATA[foo]]></value>').text
'foo'
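
A side note on the CDATA case above: a CDATA section ends at the first ]]>, so text that itself contains ]]> cannot go into a single section. A common workaround is to split the sequence across two adjacent sections. The sketch below assumes Python 3, and the helper name to_cdata is hypothetical (it is not part of ElementTree):

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    # Hypothetical helper: wherever "]]>" occurs, close the current CDATA
    # section after "]]" and open a new one that starts with ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


doc = "<value>" + to_cdata("foo]]>bar") + "</value>"

# The parser joins the two adjacent CDATA sections into one text node.
roundtrip = ET.fromstring(doc).text
print(roundtrip)
```

The parsed text should come back as the original string, with the two sections merged.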


BTW, lxml.etree.fromstring also fails, and so does http://www.w3schools.com/xml/xml_validator.asp

I'll ask on the lxml mailing list what they think about this behavior.
msg195399 - Author: Kees Bos (kees) * Date: 2013-08-16 19:13
OK. I got clarification from the lxml list. It's not a bug. It's specified in section 2.4 (http://www.w3.org/TR/REC-xml/#syntax):

The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
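
The escaping rule quoted above can be checked with a short sketch. It assumes Python 3 (the sessions in this report used 2.7, and encoding="unicode" is Python 3 only) and uses xml.sax.saxutils.escape from the standard library, which nobody in this thread mentioned; the variable names are illustrative only:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "]]>"

# escape() replaces &, < and > with entity references, so the literal
# "]]>" never appears in the element content.
doc = "<value>" + escape(raw) + "</value>"
parsed = ET.fromstring(doc).text

# ElementTree applies the same escaping when it serializes text itself.
elem = ET.Element("value")
elem.text = raw
serialized = ET.tostring(elem, encoding="unicode")

# The unescaped form is rejected, as reported in the first message.
try:
    ET.fromstring("<value>]]></value>")
    literal_rejected = False
except ET.ParseError:
    literal_rejected = True

print(doc, parsed, serialized, literal_rejected)
```

Building the document through the ElementTree API, instead of concatenating strings by hand, sidesteps the problem because the serializer escapes > in text for you.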


Sorry for the confusion and taking your time for a bogus report.
msg195400 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-16 19:19
Not a problem, these things are often subtle.  And now there is a record of it in the tracker if anyone else questions it in the future.
History
Date | User | Action | Args
2022-04-11 14:57:49 | admin | set | github: 62953
2013-08-16 19:19:10 | r.david.murray | set | messages: + msg195400; stage: resolved
2013-08-16 19:13:34 | kees | set | status: open -> closed; resolution: not a bug
2013-08-16 19:13:14 | kees | set | messages: + msg195399
2013-08-16 16:51:58 | pitrou | set | nosy: + scoder, eli.bendersky
2013-08-16 16:48:27 | kees | set | messages: + msg195359
2013-08-16 13:28:01 | r.david.murray | set | nosy: + r.david.murray; messages: + msg195327
2013-08-16 08:31:03 | kees | create |