
classification
Title: ElementTree.iterparse and Element.tail confusion
Type: behavior Stage: resolved
Components: XML Versions: Python 2.6
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: effbot, flox, jeroen.dirks
Priority: low Keywords:

Created on 2009-01-29 20:05 by jeroen.dirks, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg80783 - (view) Author: Jeroen Dirks (jeroen.dirks) Date: 2009-01-29 20:05
I am using cElementTree.iterparse in order to parse through a huge XML
document and filter out sections of interest.

The usage pattern is that I wait for an "end" event for an element of
interest and then, if it matches some criterion, I write it out using
cElementTree.tostring().

My code had a bug in it because the cElementTree.tostring method prints
the element including its tail. The element retrieved from the iterparse
iterator sometimes contains the tail by the time it emits the end event
but sometimes it does not.

In my document the tail just consisted of the newline '\n' character and
about 98% of the time it was attached to the element during its end event.

This is rather confusing behavior. 

Could ElementTree/cElementTree.iterparse be changed so that if you
respond to the end event for an element its tail is never set?
msg99400 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-16 12:42
"ET.tostring(elem)" works as documented.

Proposed workaround:

import copy

elem_copy = copy.copy(elem)
elem_copy.tail = ''
print ET.tostring(elem_copy)
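
For readers on Python 3, the same workaround might look roughly like the sketch below. It assumes the pure-stdlib xml.etree.ElementTree module imported as ET; the helper name tostring_without_tail and the sample document are made up for this illustration and are not part of the library or of the original report.

```python
import copy
import xml.etree.ElementTree as ET


def tostring_without_tail(elem):
    """Serialize elem without its tail (hypothetical helper, not an ET API)."""
    elem_copy = copy.copy(elem)   # shallow copy; child elements are shared
    elem_copy.tail = None         # drop the tail on the copy only
    return ET.tostring(elem_copy, encoding="unicode")


# Made-up sample: <a> is followed by a newline, which becomes a.tail.
root = ET.fromstring("<root><a>x</a>\n<b/></root>")
a = root.find("a")

with_tail = ET.tostring(a, encoding="unicode")   # includes the trailing newline
without_tail = tostring_without_tail(a)          # tail left out
```

Working on a copy leaves the original element, including its tail, untouched in the tree.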
msg100854 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2010-03-11 14:05
Footnote: "iterparse" does things this way mostly to keep the implementation simple and fast; due to buffering, the tree builder is usually ahead of the event generation by up to 16k.  See the note on this page:

http://effbot.org/zone/element-iterparse.htm

and the message it links to for more on this topic.
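
The effect of buffering can be illustrated with a small sketch. This is an assumption-laden demonstration, not how iterparse is written: it uses xml.etree.ElementTree.XMLPullParser (available in Python 3.4 and later, so not in the 2.6 version this issue was filed against) and feeds a made-up document in two hand-picked chunks, so that the chunk boundary falls right after the closing tag of the element of interest.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("end",))

# First chunk stops right after </a>; the tail text has not been seen yet.
parser.feed("<root><a>x</a>")
event, elem = list(parser.read_events())[0]
tail_at_end_event = elem.tail

# Second chunk carries the newline that follows </a>; once the parser
# reaches <b/>, the tree builder attaches that text to the same element.
parser.feed("\n<b/></root>")
tail_after_more_input = elem.tail
parser.close()
```

Whether the tail is visible when the "end" event is handled therefore depends on where the chunk boundary happens to fall, which matches the "sometimes it does, sometimes it does not" behaviour described in the report.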

Your case is a very common use case for "tostring", so it would probably have made sense to make "tostring" skip the tail on the element itself, at least if it's whitespace only.  Guess we could add an option...

But in your case, you can probably just nuke or normalize the "tail" attribute before writing the element out (i.e. set it to None or "\n").
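
A minimal sketch of that suggestion, written for Python 3, could look like the following. The function name iter_matching, its parameters, and the sample document are hypothetical and chosen only for this example; the only library calls used are ET.iterparse, ET.tostring, and Element.clear.

```python
import io
import xml.etree.ElementTree as ET


def iter_matching(source, tag, predicate):
    """Yield serialized elements named `tag` for which predicate(elem) is true.

    Hypothetical helper: the tail is normalized to a single newline before
    serializing, so the output does not depend on parser buffering.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        if predicate(elem):
            elem.tail = "\n"   # normalize; use None to drop the tail entirely
            yield ET.tostring(elem, encoding="unicode")
        elem.clear()           # release the element's content to limit memory use


# Made-up sample input with irregular whitespace between items.
doc = io.BytesIO(
    b"<root>\n<item id='1'>a</item>\n<item id='2'>b</item>\n\n"
    b"<item id='3'>c</item></root>"
)
selected = list(iter_matching(doc, "item", lambda e: e.get("id") != "2"))
```

Because the tail is overwritten just before serialization, it no longer matters whether the tree builder had already attached the original tail when the "end" event was delivered.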
History
Date                 User          Action  Args
2022-04-11 14:56:45  admin         set     github: 49350
2010-03-11 14:05:32  effbot        set     messages: + msg100854
2010-02-26 11:22:48  flox          set     status: pending -> closed
2010-02-16 12:42:36  flox          set     status: open -> pending; priority: low; nosy: + effbot, flox; messages: + msg99400; resolution: wont fix; stage: resolved
2009-01-29 20:05:21  jeroen.dirks  create