Issue 24079: xml.etree.ElementTree.Element.text does not conform to the documentation

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68267

classification

Title:	xml.etree.ElementTree.Element.text does not conform to the documentation
Type:		Stage:	resolved
Components:	Documentation, XML	Versions:	Python 3.6, Python 3.4, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, eli.bendersky, jlaurens, martin.panter, ned.deily, python-dev, rbcollins, rhettinger, scoder
Priority:	normal	Keywords:	patch

Created on 2015-04-29 23:34 by jlaurens, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
etree-text.patch	martin.panter, 2015-05-30 01:34		review
etree-text.v2.patch	martin.panter, 2015-06-03 13:37		review

Messages (23)
msg242256 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-29 23:34
The documentation for xml.etree.ElementTree.Element.text reads "If the element is created from an XML file the attribute will contain any text found between the element tags."

import xml.etree.ElementTree as ET
root3 = ET.fromstring('<a><b/>TEXT</a>')
print(root3.text)

CURRENT OUTPUT
None

"TEXT" is between the element's tags but does not appear in the output.

BTW: this is well-formed XML and has nothing to do with tail.
msg242257 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-04-30 02:35
(This issue is a followup to your Issue24072.) Again, while the ElementTree documentation is certainly not nearly as complete as it should be, I don't think this is a documentation error per se. The key issue is: with which element is each text string associated? Perhaps this example will help:

>>> root4 = ET.fromstring('<a>ATEXT<b>BTEXT</b>BTAIL</a>')
>>> root4
<Element 'a' at 0x10224c228>
>>> root4.text
'ATEXT'
>>> root4.tail
>>> root4[0]
<Element 'b' at 0x1022ab278>
>>> root4[0].text
'BTEXT'
>>> root4[0].tail
'BTAIL'

As in your original example, any text following the element b is associated with b's tail attribute until a new tag is found, pushing or popping the tree stack. While the description of the "text" attribute does not explicitly state this, the "tail" attribute description immediately following it does. This is also explained in more detail in the ElementTree resources on effbot.org that are linked to from the Python Standard Library documentation. Nevertheless, it probably would be helpful to expand the documentation on this point if someone is willing to put together a documentation patch for review.

With regard to your comment about "well formed xml", I don't think there is anything in the documentation that implies (or should imply) that the distinction between the "text" attribute and the "tail" attribute has anything to do with whether it is well-formed XML. The tutorial for the third-party lxml package, which provides another implementation of ElementTree, goes into more detail about why, in general, both "text" and "tail" are necessary.

https://docs.python.org/3/library/xml.etree.elementtree.html#additional-resources
http://effbot.org/zone/element.htm#text-content
http://lxml.de/tutorial.html#elements-contain-text
msg242263 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-04-30 06:38
> this is well formed xml and has nothing to do with tail.

In fact, it does have something to do with tail. The 'TEXT' is captured as the tail of element b:

>>> root3 = ET.fromstring('<a><b/>TEXT</a>')
>>> root3[0].tail
'TEXT'
msg242264 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-04-30 07:04
I agree that the wording in the documentation isn't great:

"""
text

The text attribute can be used to hold additional data associated with the element. As the name implies this attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found between the element tags.

tail

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
"""

Special cases that no-one uses (sticking non-string objects into text/tail) are given too much space and the difference isn't explained as needed. Since the distinction between text and tail is a (great but) rather special feature of ElementTree, it needs to be given more room in the docs.

Proposal:

"""
text

The text attribute holds the immediate text content of the element. It contains any text found up to either the closing tag if the element has no children, or the next opening child tag within the element. For text following an element, see the `tail` attribute. To collect the entire text content of a subtree, see `tostring`. Applications may store arbitrary objects in this attribute.

tail

The tail attribute holds any text that directly follows the element. For example, in a document like ``<a>Text<b/>BTail<c/>CTail</a>``, the `text` attribute of the ``a`` element holds the string "Text", and the tail attributes of ``b`` and ``c`` hold the strings "BTail" and "CTail" respectively. Applications may store arbitrary objects in this attribute.
"""
msg242268 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 11:35
Since the text and tail notions seem tightly coupled, I would vote for a more detailed explanation in the text doc and a forward link in the tail documentation.

"""
text

The text attribute holds the text between the element's begin tag and the next tag or None. The tail attribute holds the text between the element's end tag and the next tag or None. For "<a><b>1<c>2<d/>3</c></b>4</a>" xml data, the a element has None for both text and tail attributes, the b element has text '1' and tail '4', the c element has text '2' and tail None, the d element has text None and tail '3'.

To collect the inner text of an element, see `tostring` with method 'text'.

Applications may store arbitrary objects in this attribute.

tail

The tail attribute holds the text between the element's end tag and the next tag or None. See `text` for more details.

Applications may store arbitrary objects in this attribute.
"""

It is very important to mention that the 'text' attribute does not always hold a string, contrary to what its name would suggest.

BTW, I was not aware of the tostring method with the 'text' argument. The fact is that the documentation reads "Returns an (optionally) encoded string containing the XML data." which is misleading because the text is not XML data in general. This also needs to be rephrased or simply removed.
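The text/tail values claimed for the example document above can be checked directly. The following is a minimal sketch using only the standard library; the dictionary name `observed` is an illustrative choice, not something from the thread.

```python
import xml.etree.ElementTree as ET

# The example document quoted in the proposed wording.
root = ET.fromstring("<a><b>1<c>2<d/>3</c></b>4</a>")

# Map each tag to its (text, tail) pair; root.iter() walks the
# tree in document order, starting with the root itself.
observed = {elem.tag: (elem.text, elem.tail) for elem in root.iter()}

for tag, (text, tail) in observed.items():
    print(f"{tag}: text={text!r} tail={tail!r}")
```

Each string in the document belongs to exactly one element: either as the text of the element whose start tag precedes it, or as the tail of the element whose end tag precedes it.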
msg242279 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 17:56
The tostring(..., method='text') is not suitable for the inner text because it adds the tail of the top element. A proper implementation would be

def innertext(elt):
    return (elt.text or '') +''.join(innertext(e)+e.tail for e in elt)

that can be included in the doc instead of the mention of the tostring trick
msg242280 - (view)	Author: Jérôme Laurens (jlaurens)	Date: 2015-04-30 18:03
Erratum

def innertext(elt):
    return (elt.text or '') +''.join(innertext(e)+(e.tail or '') for e in elt)
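To illustrate the difference being described, here is a small sketch that runs the corrected `innertext` helper from the erratum next to `tostring(..., method='text')` on the `b` element of the example document used earlier in the thread. The variable names `inner` and `with_tail` are illustrative only.

```python
import xml.etree.ElementTree as ET


def innertext(elt):
    # Own text first, then each child's inner text followed by that
    # child's tail; None is treated as an empty string.
    return (elt.text or '') + ''.join(
        innertext(e) + (e.tail or '') for e in elt
    )


root = ET.fromstring("<a><b>1<c>2<d/>3</c></b>4</a>")
b = root[0]

inner = innertext(b)
# tostring with method='text' also appends the tail of the element
# it is called on, here the '4' that follows </b>.
with_tail = ET.tostring(b, encoding="unicode", method="text")

print(repr(inner), repr(with_tail))
```

The two results differ only by the tail of the top element, which is the objection raised against recommending the `tostring` approach for inner text.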
msg243032 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-05-13 04:06
Another problem with tostring() is that it seems you have to call it with encoding="unicode". Perhaps it would be better to suggest code like "".join(element.itertext())? I would also improve on Jérôme’s version by making the None case more explicit. And perhaps both attributes can be defined together, rather than giving a half-hearted definition linking between them:

.. attribute:: text
.. attribute:: tail

   The text attribute holds any text between the element's begin tag and the next tag. The tail attribute holds any text between the element's end tag and the next tag. These attributes are set to ``None`` if there is no text. For example, in the XML data ``<a><b>1<c>2<d/>3</c></b>4</a>``, the a element has ``None`` for both text and tail attributes, the b element has text ``"1"`` and tail ``"4"``, the c element has text ``"2"`` and tail ``None``, the d element has text ``None`` and tail ``"3"``.

   To collect the inner text of an element, use :meth:`itertext`, for example ``"".join(element.itertext())``.

   Applications may store arbitrary objects in these attributes.
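A short sketch of the two points made here, assuming Python 3 and the same example document: `itertext()` yields only the text inside the element, and `tostring()` returns bytes unless `encoding="unicode"` is passed. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b>1<c>2<d/>3</c></b>4</a>")
b = root[0]

# itertext() yields b's text and the text/tail strings of its
# descendants in document order, but not b's own tail.
inner = "".join(b.itertext())

# Without an explicit encoding, tostring() returns bytes on Python 3.
as_bytes = ET.tostring(b, method="text")

# encoding="unicode" is needed to get a str back.
as_str = ET.tostring(b, method="text", encoding="unicode")

print(repr(inner), repr(as_bytes), repr(as_str))
```

Joining `itertext()` avoids both the encoding argument and the extra tail, which is why it reads as the simpler recommendation for documentation.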
msg244434 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-05-30 01:34
Here is a patch with my suggestion. Let me know what you think.
msg244445 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-05-30 05:35
IMHO less clear and less correct than the previous suggestions.
msg244446 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-05-30 05:40
Seems like a good idea to explain "text" and "tail" in one section, though. That makes "tail" easier to find for those who are not used to this kind of split (and that's basically everyone who needs to read the docs in the first place).
msg244744 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-06-03 13:37
Okay, here is a version with most of the wording reverted to Jérôme’s suggestion. I only left my itertext() example, and the grouping of text and tail together. If there are any more bits that are incorrect or unclear please identify them.
msg244869 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-06-05 15:08
Looks good to me.
msg247736 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 06:09
could we apply this patch, please?
msg247740 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-07-31 07:17
I note that the current wording for both "text" and "tail" is careful to allow for the most general use of the Element class, that is, that it may be used in non-XML contexts, for example:

"The text attribute can be used to hold additional data associated with the element. As the name implies this attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found between the element tags."

The proposed patch downplays that generality. How about modifying the original wording so that the description starts something like:

"These attributes can be used to hold additional [...] application-specific object. If the element is created from an XML file, the text attribute holds either the text between the element's start tag and its first child or end tag, or ``None`` and the tail attribute holds either the text [...]."
msg247741 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 07:35
> The proposed patch downplays that generality. That is completely intentional. Almost all readers of the documentation will first need to understand the difference between text and tail before they can go and think about any more advanced use cases that will almost certainly fail on their first serialisation attempts. The most important aim of the new phrasing is therefore to make that difference clear. Everything else is secondary, although still worth mentioning.
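As an illustration of the serialisation failure mentioned here, the sketch below stores a non-string object in `text` and then tries to serialise the element. It assumes CPython's standard `xml.etree.ElementTree`, where this is understood to raise `TypeError`; the tag name `item` and the value `42` are arbitrary.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("item")
elem.text = 42  # allowed on the object, but not a string

try:
    ET.tostring(elem)
    serialization_failed = False
except TypeError as exc:
    # The serialiser expects string content in text and tail.
    serialization_failed = True
    print("cannot serialise:", exc)

# Converting to str first makes the element serialisable again.
elem.text = str(elem.text)
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
```

Arbitrary objects therefore work only as long as the tree stays in memory; anything that is written out has to be converted to a string first.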
msg247744 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-07-31 13:34
I think Ned’s version is an acceptable solution (modulo some punctuation) to the original problem, although I do agree with Stefan that downplaying the generality would be even better. Perhaps we could add a qualifier, like “The text attribute [normally] holds . . .”
msg247745 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-07-31 14:02
Personally, I would prefer getting the improved version applied over bikeshedding for another couple of months. But maybe that's just me.
msg248481 - (view)	Author: Robert Collins (rbcollins) *	Date: 2015-08-12 22:34
So it is downplayed but it is still documented as being application usable. I'll give this another week for Ned to reply, then commit it in the absence of a reply: I think it's ok as is. I'd be ok with a tweaked version along the lines Ned proposed too: both ways are better than what's in tree today.
msg248752 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-08-18 02:17
New changeset d3cda8cf4d42 by Ned Deily in branch '2.7':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/d3cda8cf4d42

New changeset ad0491f85050 by Ned Deily in branch '3.4':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/ad0491f85050

New changeset 17ce3486fd8f by Ned Deily in branch '3.5':
Issue #24079: merge from 3.4
https://hg.python.org/cpython/rev/17ce3486fd8f

New changeset 3c94ece57c43 by Ned Deily in branch 'default':
Issue #24079: merge from 3.5
https://hg.python.org/cpython/rev/3c94ece57c43
msg248753 - (view)	Author: Ned Deily (ned.deily) *	Date: 2015-08-18 02:20
Thanks for all of your contributions on this. I've committed a version along the lines I suggested along with Martin's example.
msg248760 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-08-18 06:20
The "can store arbitrary objects" sentence is now duplicated, and still way too visible. I have to read three sentences until it tells me what I need to know.
msg248762 - (view)	Author: Stefan Behnel (scoder) *	Date: 2015-08-18 06:29
I think the first two sentences can simply be removed to fix this, without loss of readability or information.

History
Date	User	Action	Args
2022-04-11 14:58:16	admin	set	github: 68267
2015-08-18 06:29:21	scoder	set	messages: + msg248762
2015-08-18 06:20:17	scoder	set	messages: + msg248760
2015-08-18 02:20:53	ned.deily	set	status: open -> closed type: behavior -> messages: + msg248753 resolution: fixed stage: commit review -> resolved
2015-08-18 02:17:03	python-dev	set	nosy: + python-dev messages: + msg248752
2015-08-12 22:34:53	rbcollins	set	nosy: + rbcollins messages: + msg248481
2015-07-31 14:02:32	scoder	set	messages: + msg247745
2015-07-31 13:34:14	martin.panter	set	messages: + msg247744
2015-07-31 07:35:52	scoder	set	messages: + msg247741
2015-07-31 07:17:17	ned.deily	set	messages: + msg247740
2015-07-31 06:09:02	scoder	set	messages: + msg247736
2015-07-07 00:24:48	martin.panter	set	stage: patch review -> commit review
2015-06-05 15:08:48	scoder	set	messages: + msg244869
2015-06-03 13:37:14	martin.panter	set	files: + etree-text.v2.patch messages: + msg244744
2015-05-30 05:40:18	scoder	set	messages: + msg244446
2015-05-30 05:35:57	scoder	set	messages: + msg244445
2015-05-30 01:34:13	martin.panter	set	files: + etree-text.patch versions: + Python 3.6 messages: + msg244434 components: + XML keywords: + patch stage: needs patch -> patch review
2015-05-13 04:06:07	martin.panter	set	nosy: + martin.panter messages: + msg243032
2015-04-30 18:03:21	jlaurens	set	messages: + msg242280
2015-04-30 17:56:16	jlaurens	set	messages: + msg242279
2015-04-30 11:35:53	jlaurens	set	messages: + msg242268
2015-04-30 07:04:40	scoder	set	messages: + msg242264
2015-04-30 06:38:16	rhettinger	set	nosy: + rhettinger, scoder, eli.bendersky messages: + msg242263
2015-04-30 02:35:08	ned.deily	set	assignee: docs@python components: + Documentation, - XML versions: + Python 2.7, Python 3.5 nosy: + docs@python, ned.deily messages: + msg242257 stage: needs patch
2015-04-29 23:34:55	jlaurens	create