Issue 5100: ElementTree.iterparse and Element.tail confusion

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49350

classification

Title:	ElementTree.iterparse and Element.tail confusion
Type:	behavior	Stage:	resolved
Components:	XML	Versions:	Python 2.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	effbot, flox, jeroen.dirks
Priority:	low	Keywords:

Created on 2009-01-29 20:05 by jeroen.dirks, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg80783 - (view)	Author: Jeroen Dirks (jeroen.dirks)	Date: 2009-01-29 20:05
I am using cElementTree.iterparse to parse a huge XML document and filter out sections of interest. The usage pattern is that I wait for an "end" event for an element of interest and then, if it matches some criterion, I write it out using cElementTree.tostring(). My code had a bug in it because cElementTree.tostring() serializes the element including its tail. The element retrieved from the iterparse iterator sometimes contains the tail by the time the end event is emitted, but sometimes it does not. In my document the tail consisted of just the newline character '\n', and about 98% of the time it was attached to the element during its end event. This is rather confusing behavior. Could ElementTree/cElementTree.iterparse be changed so that if you respond to the end event for an element, its tail is never set?
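For readers unfamiliar with the tail attribute, the following is a minimal sketch of the serialization behavior described above. It assumes Python 3 and xml.etree.ElementTree rather than the cElementTree module used in the report, and it parses a small made-up document in one go, so the tail is always present. It does not reproduce the timing-dependent behavior of iterparse.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: each <item> is followed by a newline.
root = ET.fromstring("<root><item>a</item>\n<item>b</item>\n</root>")
first = root[0]

# The text between </item> and the next tag is stored on the element as its tail.
print(repr(first.tail))

# tostring() serializes the element together with that tail.
print(repr(ET.tostring(first, encoding="unicode")))
```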
msg99400 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-02-16 12:42
"ET.tostring(elem)" works as documented.

Proposed workaround:

    import copy
    elem_copy = copy.copy(elem)
    elem_copy.tail = ''
    print ET.tostring(elem_copy)
msg100854 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2010-03-11 14:05
Footnote: "iterparse" does things this way mostly to keep the implementation simple and fast; due to buffering, the tree builder is usually ahead of the event generation by up to 16k. See the note on this page: http://effbot.org/zone/element-iterparse.htm and the message it links to for more on this topic. Your case is a very common use case for "tostring", so it would probably have made sense to make "tostring" skip the tail on the element itself, at least if it's whitespace only. Guess we could add an option... But in your case, you can probably just nuke or normalize the "tail" attribute before writing the element out (i.e. set it to None or "\n").
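The two suggestions in this thread can be combined into a filter loop along the following lines. This is only a sketch under stated assumptions: it targets Python 3 with xml.etree.ElementTree, and the helper names serialize_without_tail and iter_matching, as well as the sample document, are made up for illustration. It clears the tail on a shallow copy so the parsed tree is left untouched. Setting elem.tail = None directly on the element before calling tostring() would be the in-place alternative.

```python
import copy
import io
import xml.etree.ElementTree as ET


def serialize_without_tail(elem):
    """Serialize elem without its tail, leaving elem itself untouched."""
    elem_copy = copy.copy(elem)  # shallow copy: reassigning tail does not affect elem
    elem_copy.tail = None        # drop whatever tail the parser has attached so far
    return ET.tostring(elem_copy, encoding="unicode")


def iter_matching(source, tag, predicate):
    """Yield the serialized form of each matching element once it is complete."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag and predicate(elem):
            yield serialize_without_tail(elem)


# Hypothetical sample document; each <item> is followed by a newline tail.
sample = io.BytesIO(
    b'<root>\n<item id="1">a</item>\n<item id="2">b</item>\n</root>'
)
matches = list(iter_matching(sample, "item", lambda e: e.get("id") == "2"))
print(matches)
```

Because the tail is removed before serialization, the output no longer depends on whether the parser had already attached it when the end event fired.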

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49350
2010-03-11 14:05:32	effbot	set	messages: + msg100854
2010-02-26 11:22:48	flox	set	status: pending -> closed
2010-02-16 12:42:36	flox	set	status: open -> pending priority: low nosy: + effbot, flox messages: + msg99400 resolution: wont fix stage: resolved
2009-01-29 20:05:21	jeroen.dirks	create