classification
Title: xml.etree.ElementTree.Element.text does not conform to the documentation
Type: Stage: resolved
Components: Documentation, XML Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, eli.bendersky, jlaurens, martin.panter, ned.deily, python-dev, rbcollins, rhettinger, scoder
Priority: normal Keywords: patch

Created on 2015-04-29 23:34 by jlaurens, last changed 2015-08-18 06:29 by scoder. This issue is now closed.

Files
File name Uploaded Description
etree-text.patch martin.panter, 2015-05-30 01:34
etree-text.v2.patch martin.panter, 2015-06-03 13:37
Messages (23)
msg242256 - (view) Author: Jérôme Laurens (jlaurens) Date: 2015-04-29 23:34
The documentation for xml.etree.ElementTree.Element.text reads "If the element is created from an XML file the attribute will contain any text found between the element tags."

import xml.etree.ElementTree as ET
root3 = ET.fromstring('<a><b/>TEXT</a>')
print(root3.text)

CURRENT OUTPUT

None

"TEXT" is between the elements tags but does not appear in the output

BTW : this is well formed xml and has nothing to do with tail.
msg242257 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-04-30 02:35
(This issue is a followup to your Issue24072.)  Again, while the ElementTree documentation is certainly not nearly as complete as it should be, I don't think this is a documentation error per se.  The key issue is: with which element is each text string associated?  Perhaps this example will help:

>>> root4 = ET.fromstring('<a>ATEXT<b>BTEXT</b>BTAIL</a>')
>>> root4
<Element 'a' at 0x10224c228>
>>> root4.text
'ATEXT'
>>> root4.tail
>>> root4[0]
<Element 'b' at 0x1022ab278>
>>> root4[0].text
'BTEXT'
>>> root4[0].tail
'BTAIL'

As in your original example, any text following the element b is associated with b's tail attribute until a new tag is found, pushing or popping the tree stack.  While the description of the "text" attribute does not explicitly state this, the "tail" attribute description immediately following it does.  This is also explained in more detail in the ElementTree resources on effbot.org that are linked to from the Python Standard Library documentation.  Nevertheless, it probably would be helpful to expand the documentation on this point if someone is willing to put together a documentation patch for review.

With regard to your comment about "well formed xml", I don't think there is anything in the documentation that implies (or should imply) that the distinction between the "text" attribute and the "tail" attribute has anything to do with whether it is well-formed XML.  The tutorial for the third-party lxml package, which provides another implementation of ElementTree, goes into more detail about why, in general, both "text" and "tail" are necessary.

https://docs.python.org/3/library/xml.etree.elementtree.html#additional-resources
http://effbot.org/zone/element.htm#text-content
http://lxml.de/tutorial.html#elements-contain-text
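
The association described above can be checked by walking the parsed tree and listing each element's text and tail. This is only a minimal sketch; the helper name `describe` is made up for illustration and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def describe(root):
    """Return (tag, text, tail) for every element, in document order."""
    # Element.iter() yields the element itself first, then its descendants.
    return [(elem.tag, elem.text, elem.tail) for elem in root.iter()]


root4 = ET.fromstring('<a>ATEXT<b>BTEXT</b>BTAIL</a>')
for tag, text, tail in describe(root4):
    print(tag, repr(text), repr(tail))
```

The string that follows `</b>` shows up as the tail of `b`, not as part of the text of `a`.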
msg242263 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-30 06:38
> this is well formed xml and has nothing to do with tail.

In fact, it does have something to do with tail.
The 'TEXT' is captured as the tail of element b:

>>> root3 = ET.fromstring('<a><b/>TEXT</a>')
>>> root3[0].tail
'TEXT'
msg242264 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-04-30 07:04
I agree that the wording in the documentation isn't great:

"""
text

    The text attribute can be used to hold additional data associated with the element. As the name implies this attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found between the element tags.

tail

    The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element’s end tag and before the next tag.
"""

Special cases that no-one uses (sticking non-string objects into text/tail) are given too much space and the difference isn't explained as needed.

Since the distinction between text and tail is a (great but) rather special feature of ElementTree, it needs to be given more room in the docs.

Proposal:

"""
text

    The text attribute holds the immediate text content of the element. It contains any text found up to either the closing tag if the element has no children, or the next opening child tag within the element. For text following an element, see the `tail` attribute. To collect the entire text content of a subtree, see `tostring`. Applications may store arbitrary objects in this attribute.

tail

    The tail attribute holds any text that directly follows the element. For example, in a document like ``<a>Text<b/>BTail<c/>CTail</a>``, the `text` attribute of the ``a`` element holds the string "Text", and the tail attributes of ``b`` and ``c`` hold the strings "BTail" and "CTail" respectively. Applications may store arbitrary objects in this attribute.
"""
msg242268 - (view) Author: Jérôme Laurens (jlaurens) Date: 2015-04-30 11:35
Since the text and tail notions seem tightly coupled, I would vote for a more detailed explanation in the text doc and a forward link in the tail documentation.


"""
text

    The text attribute holds the text between the element's begin tag and the next tag or None. The tail attribute holds the text between the element's end tag and the next tag or None. For "<a><b>1<c>2<d/>3</c></b>4</a>" xml data, the a element has None for both text and tail attributes, the b element has text '1' and tail '4', the c element has text '2' and tail None, the d element has text None and tail '3'.

To collect the inner text of an element, see `tostring` with method 'text'.

Applications may store arbitrary objects in this attribute.

tail

    The tail attribute holds the text between the element's end tag and the next tag or None. See `text` for more details.

Applications may store arbitrary objects in this attribute.
"""

It is very important to mention that the 'text' attribute does not always hold a string, contrary to what its name would suggest.

BTW, I was not aware of the tostring method with 'text' argument. The fact is that the documentation reads "Returns an (optionally) encoded string containing the XML data." which is misleading because the text is not xml data in general. This also needs to be rephrased or simply removed.
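
The nested example in the proposed wording above can be checked directly. This is a small sketch; `layout` is just an illustrative name for a tag-to-(text, tail) mapping.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>1<c>2<d/>3</c></b>4</a>')

# Map each tag to its (text, tail) pair; tags are unique in this document.
layout = {elem.tag: (elem.text, elem.tail) for elem in root.iter()}
for tag in 'abcd':
    print(tag, layout[tag])
```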
msg242279 - (view) Author: Jérôme Laurens (jlaurens) Date: 2015-04-30 17:56
The tostring(..., method='text') is not suitable for the inner text because it adds the tail of the top element.

A proper implementation would be

def innertext(elt):
    return (elt.text or '') +''.join(innertext(e)+e.tail for e in elt)

that can be included in the doc instead of the mention of the tostring trick
msg242280 - (view) Author: Jérôme Laurens (jlaurens) Date: 2015-04-30 18:03
Erratum

def innertext(elt):
    return (elt.text or '') + ''.join(innertext(e) + (e.tail or '') for e in elt)
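
The difference between the corrected `innertext` and `tostring(..., method='text')` can be seen on an element that has a tail of its own. A minimal sketch, assuming Python 3 for `encoding='unicode'`; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def innertext(elt):
    # Own text first, then each child's inner text followed by that child's tail.
    return (elt.text or '') + ''.join(
        innertext(e) + (e.tail or '') for e in elt)


root = ET.fromstring('<a><b>inner<c/>more</b>TAIL</a>')
b = root[0]

with_tail = ET.tostring(b, method='text', encoding='unicode')
without_tail = innertext(b)
print(repr(with_tail))     # includes the tail of b itself
print(repr(without_tail))  # stops at the end tag of b
```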
msg243032 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-13 04:06
Another problem with tostring() is that it seems you have to call it with encoding="unicode". Perhaps it would be better to suggest code like "".join(element.itertext())?

I would also improve on Jérôme’s version by making the None case more explicit. And perhaps both attributes can be defined together, rather than giving a half-hearted definition linking between them:

.. attribute:: text
.. attribute:: tail

   The *text* attribute holds any text between the element's begin tag and the next tag. The *tail* attribute holds any text between the element's end tag and the next tag. These attributes are set to ``None`` if there is no text. For example, in the XML data ``<a><b>1<c>2<d/>3</c></b>4</a>``, the *a* element has ``None`` for both *text* and *tail* attributes, the *b* element has *text* ``"1"`` and *tail* ``"4"``, the *c* element has *text* ``"2"`` and *tail* ``None``, the *d* element has *text* ``None`` and *tail* ``"3"``.
   
   To collect the inner text of an element, use :meth:`itertext`, for example ``"".join(element.itertext())``.
   
   Applications may store arbitrary objects in these attributes.
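
As a rough illustration of the two points above, the following sketch compares `itertext()` with `tostring()` on the `b` element of the same example document. It assumes Python 3, where `tostring()` returns bytes unless `encoding='unicode'` is passed.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>1<c>2<d/>3</c></b>4</a>')
b = root[0]

# itertext() yields the element's text and the text and tails of its
# descendants in document order, but not the tail of the element itself.
inner = ''.join(b.itertext())

# tostring() with method='text' also appends the element's own tail ('4').
as_bytes = ET.tostring(b, method='text')
as_str = ET.tostring(b, method='text', encoding='unicode')

print(repr(inner), repr(as_bytes), repr(as_str))
```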
msg244434 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-30 01:34
Here is a patch with my suggestion. Let me know what you think.
msg244445 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-05-30 05:35
IMHO less clear and less correct than the previous suggestions.
msg244446 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-05-30 05:40
Seems like a good idea to explain "text" and "tail" in one section, though. That makes "tail" easier to find for those who are not used to this kind of split (and that's basically everyone who needs to read the docs in the first place).
msg244744 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-06-03 13:37
Okay, here is a version with most of the wording reverted to Jérôme’s suggestion. I only left my itertext() example, and the grouping of text and tail together. If there are any more bits that are incorrect or unclear please identify them.
msg244869 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-06-05 15:08
Looks good to me.
msg247736 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-07-31 06:09
could we apply this patch, please?
msg247740 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-07-31 07:17
I note that the current wording for both "text" and "tail" is careful to allow for the most general use of the Element class, that is, that it may be used in non-XML contexts, for example:

"The text attribute can be used to hold additional data associated with the
element. As the name implies this attribute is usually a string but may be any
application-specific object. If the element is created from an XML file the
attribute will contain any text found between the element tags."

The proposed patch downplays that generality.  How about modifying the original wording so that the description starts something like:

"These attributes can be used to hold additional [...] application-specific object.  If the element is created from an XML file, the *text* attribute holds either the text between the element'sstart tag and its first child or end tag, or ``None``and the *tail* attribute holds either the text [...]."
msg247741 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-07-31 07:35
> The proposed patch downplays that generality.

That is completely intentional. Almost all readers of the documentation will first need to understand the difference between text and tail before they can go and think about any more advanced use cases that will almost certainly fail on their first serialisation attempts. The most important aim of the new phrasing is therefore to make that difference clear. Everything else is secondary, although still worth mentioning.
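
The serialisation failure mentioned above can be reproduced with a non-string value. This is a minimal sketch, assuming Python 3 and the standard-library serialiser; the element name and the value 42 are arbitrary.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('item')
elem.text = 42  # accepted: the attribute can hold any object

try:
    ET.tostring(elem)
except TypeError as exc:
    # The serialiser only knows how to write strings.
    error = exc
    print('serialisation failed:', exc)
else:
    error = None
```

Storing the object works, but the element can no longer be written out as XML until the value is converted to a string.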
msg247744 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-07-31 13:34
I think Ned’s version is an acceptable solution (modulo some punctuation) to the original problem, although I do agree with Stefan that downplaying the generality would be even better.

Perhaps we could add a qualifier, like “The *text* attribute [normally] holds . . .”
msg247745 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-07-31 14:02
Personally, I would prefer getting the improved version applied over bikeshedding for another couple of months. But maybe that's just me.
msg248481 - (view) Author: Robert Collins (rbcollins) * (Python committer) Date: 2015-08-12 22:34
So it is downplayed but it is still documented as being application usable.

I'll give this another week for Ned to reply, then commit it in the absence of a reply: I think it's ok as is. I'd be ok with a tweaked version along the lines Ned proposed too: both ways are better than what's in tree today.
msg248752 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-08-18 02:17
New changeset d3cda8cf4d42 by Ned Deily in branch '2.7':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/d3cda8cf4d42

New changeset ad0491f85050 by Ned Deily in branch '3.4':
Issue #24079: Improve description of the text and tail attributes for
https://hg.python.org/cpython/rev/ad0491f85050

New changeset 17ce3486fd8f by Ned Deily in branch '3.5':
Issue #24079: merge from 3.4
https://hg.python.org/cpython/rev/17ce3486fd8f

New changeset 3c94ece57c43 by Ned Deily in branch 'default':
Issue #24079: merge from 3.5
https://hg.python.org/cpython/rev/3c94ece57c43
msg248753 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-08-18 02:20
Thanks for all of your contributions on this.  I've committed a version along the lines I suggested along with Martin's example.
msg248760 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-08-18 06:20
The "can store arbitrary objects" sentence is now duplicated, and still way too visible. I have to read three sentences until it tells me what I need to know.
msg248762 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-08-18 06:29
I think the first two sentences can simply be removed to fix this, without loss of readability or information.
History
Date User Action Args
2015-08-18 06:29:21 scoder set messages: + msg248762
2015-08-18 06:20:17 scoder set messages: + msg248760
2015-08-18 02:20:53 ned.deily set status: open -> closed
type: behavior ->
messages: + msg248753
resolution: fixed
stage: commit review -> resolved
2015-08-18 02:17:03 python-dev set nosy: + python-dev
messages: + msg248752
2015-08-12 22:34:53 rbcollins set nosy: + rbcollins
messages: + msg248481
2015-07-31 14:02:32 scoder set messages: + msg247745
2015-07-31 13:34:14 martin.panter set messages: + msg247744
2015-07-31 07:35:52 scoder set messages: + msg247741
2015-07-31 07:17:17 ned.deily set messages: + msg247740
2015-07-31 06:09:02 scoder set messages: + msg247736
2015-07-07 00:24:48 martin.panter set stage: patch review -> commit review
2015-06-05 15:08:48 scoder set messages: + msg244869
2015-06-03 13:37:14 martin.panter set files: + etree-text.v2.patch
messages: + msg244744
2015-05-30 05:40:18 scoder set messages: + msg244446
2015-05-30 05:35:57 scoder set messages: + msg244445
2015-05-30 01:34:13 martin.panter set files: + etree-text.patch
versions: + Python 3.6
messages: + msg244434
components: + XML
keywords: + patch
stage: needs patch -> patch review
2015-05-13 04:06:07 martin.panter set nosy: + martin.panter
messages: + msg243032
2015-04-30 18:03:21 jlaurens set messages: + msg242280
2015-04-30 17:56:16 jlaurens set messages: + msg242279
2015-04-30 11:35:53 jlaurens set messages: + msg242268
2015-04-30 07:04:40 scoder set messages: + msg242264
2015-04-30 06:38:16 rhettinger set nosy: + rhettinger, scoder, eli.bendersky
messages: + msg242263
2015-04-30 02:35:08 ned.deily set assignee: docs@python
components: + Documentation, - XML
versions: + Python 2.7, Python 3.5
nosy: + docs@python, ned.deily
messages: + msg242257
stage: needs patch
2015-04-29 23:34:55 jlaurens create