classification
Title: xml.dom.pulldom strange behavior
Type: behavior
Stage:
Components: XML
Versions: Python 2.5
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Myrosia.Dzikovska, amaury.forgeotdarc, docs@python, vojta.rylko
Priority: normal
Keywords:

Created on 2010-10-05 10:17 by vojta.rylko, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg117999 - Author: Vojtěch Rylko (vojta.rylko) Date: 2010-10-05 10:17
Hi,

I have a file with 10 000 records of the same item element (every record is identical):

$ head test.xml
<channel>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>

I run a simple program that prints the content of each section element:

$ python pulldom.py test.xml | head
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter

It seems to work fine:
$ python pulldom.py test.xml | wc -l
10000

But in two cases out of 10 000 it gives me just "Twi" instead of "Twitter":
$ python pulldom.py test.xml  | grep -v Twitter
Twi
Twi 


Why? This example program demonstrates a serious problem in my real application: xml.dom.pulldom is cutting off the content of some elements.

Thanks for any advice
Vojta Rylko

---------------------------
Python 2.5.4 (r254:67916, Feb 10 2009, 14:58:09)
[GCC 4.2.4] on linux2
---------------------------
pulldom.py:
---------------------------
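import sys
from xml.dom import pulldom

# Parse the file named on the command line as a stream of pulldom events.
file = open(sys.argv[1])
events = pulldom.parse(file)

for event, node in events:
    if event == pulldom.START_ELEMENT:
        if node.tagName == 'item':
            # Build the full subtree of this <item> element.
            events.expandNode(node)
            # Print the data of the first text node inside <section>.
            print(node.getElementsByTagName('section').item(0).firstChild.data)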
msg118002 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-05 11:16
Please read
http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#xml.etree.ElementTree.iterparse
At START_ELEMENT, the element is not guaranteed to be fully populated;
you should handle the END_ELEMENT event instead.

This should be documented for the pulldom module as well, though.
msg118004 - Author: Vojtěch Rylko (vojta.rylko) Date: 2010-10-05 11:38
The program below also splits two of the 10 000 elements across two lines. Is this acceptable behavior?

OUTPUT (affected part)
=============
<DOM Text node "u'Twitter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twi'">
<DOM Text node "u'tter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twitter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twitter'">


PROGRAM
=============
import sys
from xml.dom import pulldom
events = pulldom.parse(open(sys.argv[1]))
for event, node in events:
    if event == pulldom.CHARACTERS:
        print(node.data)
msg118006 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-05 12:41
Yes, SAX parsers may split CHARACTERS events. See also the discussion:
http://www.mail-archive.com/xml-sig@python.org/msg00234.html

Again, the END_ELEMENT event is guaranteed to return the complete node.
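
One way to cope with split CHARACTERS events, sketched here under assumptions rather than taken from the thread, is to collect the text pieces seen between the start and end of a section element and join them when its END_ELEMENT arrives. The function name iter_section_text, the sample document, and the small bufsize are illustrative; the bufsize is only an attempt to provoke splitting, and whether the parser actually splits at those points is not guaranteed.

```python
import io
from xml.dom import pulldom


def iter_section_text(stream, tag="section", bufsize=None):
    """Yield the complete text of each <tag> element.

    Hypothetical helper: it joins all CHARACTERS events seen inside the
    element, so a text run split by the parser is reassembled.
    Assumes <tag> elements are not nested inside each other.
    """
    events = pulldom.parse(stream, bufsize=bufsize)
    chunks = None  # None while outside <tag>, a list of pieces while inside
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            chunks = []
        elif event == pulldom.CHARACTERS and chunks is not None:
            chunks.append(node.data)
        elif event == pulldom.END_ELEMENT and node.tagName == tag:
            yield "".join(chunks)
            chunks = None


# Illustrative input modelled on test.xml from the report.
sample = (b"<channel>"
          b"<item><section>Twitter</section></item>"
          b"<item><section>Twitter</section></item>"
          b"</channel>")

# A tiny bufsize feeds the parser a few bytes at a time.
texts = list(iter_section_text(io.BytesIO(sample), bufsize=5))
print(texts)
```

Because the pieces are joined only at END_ELEMENT, the result does not depend on how many CHARACTERS events the parser chose to emit for one text run.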
msg140870 - Author: Myrosia Dzikovska (Myrosia.Dzikovska) Date: 2011-07-22 11:37
I have the same problem, and I tried the solution suggested here, namely expanding the node at END_ELEMENT. It does not work and raises the following exception:

Traceback (most recent call last):
  File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 163, in <module>
    main(sys.argv[1:])
  File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 130, in main
    events.expandNode(node)
  File "/usr/lib/python2.6/site-packages/_xmlplus/dom/pulldom.py", line 248, in expandNode
    parents[-1].appendChild(cur_node)
IndexError: list index out of range


The code fragment was:

import xml.dom.pulldom

events = xml.dom.pulldom.parse(outName)
for (event, node) in events:
    if (event == xml.dom.pulldom.END_ELEMENT) and (node.tagName == "message"):
        # This is the call that raises the IndexError shown above.
        events.expandNode(node)
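
The example in the pulldom documentation calls expandNode on a START_ELEMENT event, which plausibly explains why calling it at END_ELEMENT fails here. A minimal sketch of that usage follows, with normalize() added to merge adjacent text nodes in case the parser split a text run. The helper name expanded_elements, the sample document, and the small bufsize are assumptions made for illustration and do not come from the thread.

```python
import io
from xml.dom import pulldom


def expanded_elements(stream, tag, bufsize=None):
    """Yield each <tag> element with its subtree fully built.

    Hypothetical helper: expandNode is called at START_ELEMENT, and
    normalize() then merges adjacent Text nodes in the subtree.
    """
    events = pulldom.parse(stream, bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull the remaining events for this element into `node`.
            events.expandNode(node)
            # Join text nodes that the parser may have delivered in pieces.
            node.normalize()
            yield node


# Illustrative input; the element name "message" follows the fragment above.
sample = (b"<log>"
          b"<message>Twitter</message>"
          b"<message>Twitter</message>"
          b"</log>")

messages = [m.firstChild.data
            for m in expanded_elements(io.BytesIO(sample), "message", bufsize=5)]
print(messages)
```

After normalize(), firstChild.data holds the whole text of an element that contains only text, whether or not the parser split it.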
History
Date                 User                Action  Args
2022-04-11 14:57:07  admin               set     github: 54235
2011-07-22 11:37:23  Myrosia.Dzikovska   set     nosy: + Myrosia.Dzikovska; messages: + msg140870
2010-10-06 05:11:54  georg.brandl        set     status: open -> closed; resolution: works for me
2010-10-05 12:41:06  amaury.forgeotdarc  set     messages: + msg118006
2010-10-05 11:38:14  vojta.rylko         set     messages: + msg118004
2010-10-05 11:16:17  amaury.forgeotdarc  set     assignee: docs@python; messages: + msg118002; nosy: + amaury.forgeotdarc, docs@python
2010-10-05 10:17:32  vojta.rylko         create