Issue10026
Created on 2010-10-05 10:17 by vojta.rylko, last changed 2011-07-22 11:37 by Myrosia.Dzikovska. This issue is now closed.
| Messages (5) | |||
|---|---|---|---|
| msg117999 - (view) | Author: Vojtěch Rylko (vojta.rylko) | Date: 2010-10-05 10:17 | |
Hi,
I have a file with 10 000 records of the same element, item (always identical):
$ head test.xml
<channel>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
<item><section>Twitter</section></item>
I run a simple program that prints the content of the section element:
$ python pulldom.py test.xml | head
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
Twitter
It seems to work fine:
$ python pulldom.py test.xml | wc -l
10000
But in two cases out of 10 000 it gives me just "Twi" instead of "Twitter":
$ python pulldom.py test.xml | grep -v Twitter
Twi
Twi
Why? This example program demonstrates a serious problem in my real application: xml.dom.pulldom is cutting off the content of some elements.
Thanks for any advice
Vojta Rylko
---------------------------
Python 2.5.4 (r254:67916, Feb 10 2009, 14:58:09)
[GCC 4.2.4] on linux2
---------------------------
pulldom.py:
---------------------------
import sys
from xml.dom import pulldom

file = open(sys.argv[1])
events = pulldom.parse(file)
for event, node in events:
    if event == pulldom.START_ELEMENT:
        if node.tagName == 'item':
            events.expandNode(node)
            # Python 2 print statement; prints the first text node of <section>
            print node.getElementsByTagName('section').item(0).firstChild.data
|
|||
| msg118002 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2010-10-05 11:16 | |
Please read <http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#xml.etree.ElementTree.iterparse>. At START_ELEMENT, the element is not guaranteed to be fully populated; you should handle the END_ELEMENT event instead. This should be documented for the pulldom module as well, though. |
|||
| msg118004 - (view) | Author: Vojtěch Rylko (vojta.rylko) | Date: 2010-10-05 11:38 | |
The program below also splits two of the 10 000 elements across two rows. Is this acceptable behavior?
OUTPUT (faulty part)
=============
<DOM Text node "u'Twitter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twi'">
<DOM Text node "u'tter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twitter'">
<DOM Text node "u'\n'">
<DOM Text node "u'Twitter'">
PROGRAM
=============
# events comes from pulldom.parse(), as in the first program
for event, node in events:
    if event == pulldom.CHARACTERS:
        print node.data
|
|||
| msg118006 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2010-10-05 12:41 | |
Yes, SAX parsers may split CHARACTERS events. See also the discussion: <http://www.mail-archive.com/xml-sig@python.org/msg00234.html>. Again, the END_ELEMENT event is guaranteed to return the complete node. |
|||
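One way to cope with split CHARACTERS events is to collect the pieces between an element's START_ELEMENT and END_ELEMENT and join them once the element closes. The following is only a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The helper name element_texts and its parameters are made up for illustration, and it assumes the target element is never nested inside itself.

```python
import io
from xml.dom import pulldom


def element_texts(stream, tag, bufsize=None):
    """Yield the complete text of each <tag> element in the stream.

    Hypothetical helper: it joins the pieces delivered by separate
    CHARACTERS events, so a value split as "Twi" + "tter" comes out whole.
    Assumes <tag> elements are not nested inside each other.
    """
    parts = None  # None while we are outside a <tag> element
    for event, node in pulldom.parse(stream, bufsize=bufsize):
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            parts = []
        elif event == pulldom.CHARACTERS and parts is not None:
            parts.append(node.data)
        elif event == pulldom.END_ELEMENT and node.tagName == tag:
            yield "".join(parts)
            parts = None


if __name__ == "__main__":
    sample = b"<channel><item><section>Twitter</section></item></channel>"
    for text in element_texts(io.BytesIO(sample), "section"):
        print(text)
```

Because no DOM subtree is built, this variant keeps memory use flat, at the cost of tracking the "inside the element" state by hand.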
| msg140870 - (view) | Author: Myrosia Dzikovska (Myrosia.Dzikovska) | Date: 2011-07-22 11:37 | |
I have the same problem, and I tried the solution suggested here, namely expanding the node at END_ELEMENT. It does not work and raises the following exception:
Traceback (most recent call last):
  File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 163, in <module>
    main(sys.argv[1:])
  File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 130, in main
    events.expandNode(node)
  File "/usr/lib/python2.6/site-packages/_xmlplus/dom/pulldom.py", line 248, in expandNode
    parents[-1].appendChild(cur_node)
IndexError: list index out of range
The code fragment was:
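import xml.dom.pulldom

events = xml.dom.pulldom.parse(outName)  # outName is defined elsewhere in the script
for (event, node) in events:
    if (event == xml.dom.pulldom.END_ELEMENT) and (node.tagName == "message"):
        events.expandNode(node)  # raises the IndexError shown above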
|
|||
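The IndexError above is consistent with how expandNode appears to work in the standard-library xml.dom.pulldom: it reads ahead in the event stream until it reaches the end of the node it was given, so it is meant to be called at START_ELEMENT. At END_ELEMENT that end has already been consumed. This has not been checked against the _xmlplus variant named in the traceback. A possible alternative, sketched here in Python 3 syntax, is to expand at START_ELEMENT and then call normalize(), which merges adjacent Text nodes. The helper name section_texts is hypothetical, and the element names are taken from the first report.

```python
import io
from xml.dom import pulldom


def section_texts(stream, bufsize=None):
    """Return the text of the <section> child of every <item> element.

    Hypothetical helper for illustration; assumes each <item> contains
    a non-empty <section> element, as in the sample file above.
    """
    texts = []
    events = pulldom.parse(stream, bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            events.expandNode(node)  # pull in the whole <item> subtree
            node.normalize()         # merge adjacent Text nodes ("Twi" + "tter")
            section = node.getElementsByTagName("section")[0]
            texts.append(section.firstChild.data)
    return texts


if __name__ == "__main__":
    sample = (b"<channel>"
              + b"<item><section>Twitter</section></item>" * 3
              + b"</channel>")
    print(section_texts(io.BytesIO(sample)))
```

Without the normalize() call, firstChild may hold only the first fragment of the text whenever the parser delivers the character data in more than one piece.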
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2011-07-22 11:37:23 | Myrosia.Dzikovska | set | nosy: + Myrosia.Dzikovska messages: + msg140870 |
| 2010-10-06 05:11:54 | georg.brandl | set | status: open -> closed resolution: works for me |
| 2010-10-05 12:41:06 | amaury.forgeotdarc | set | messages: + msg118006 |
| 2010-10-05 11:38:14 | vojta.rylko | set | messages: + msg118004 |
| 2010-10-05 11:16:17 | amaury.forgeotdarc | set | assignee: docs@python messages: + msg118002 nosy: + amaury.forgeotdarc, docs@python |
| 2010-10-05 10:17:32 | vojta.rylko | create | |
