classification
Title: pulldom cannot handle xml file with large external entity properly
Type: resource usage Stage: needs patch
Components: XML Versions: Python 3.8, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, christian.heimes, hanselda, mvolz, scoder
Priority: normal Keywords:

Created on 2008-05-11 13:32 by hanselda, last changed 2022-04-11 14:56 by admin.

Messages (1)
msg66628 - Author: Luyang Han (hanselda) Date: 2008-05-11 13:32
When using the xml.dom.pulldom module to parse a large XML file, if all 
the information is stored in a single XML file, the module can process 
it in the following way without constructing the whole DOM:

import xml.dom.pulldom

# process() stands for the caller's own per-event handler
events = xml.dom.pulldom.parse('file.xml')
for (event, node) in events:
    process(event, node)

But if 'file.xml' contains some large external entities, for example:

<!ENTITY file_external SYSTEM "others.xml">
<body>&file_external;</body>

Then running the same Python snippet as above leads to enormous memory 
usage. I did not run a rigorous benchmark, but in one case a 3 MB 
external XML file consumed about 1 GB of memory. I suspect that in this 
case the whole DOM structure is being constructed.
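
A rough way to reproduce and measure this is sketched below. It is only an illustration under some assumptions: the helper name count_items_via_entity and the generated file contents are made up for the example, the entity uses an absolute system identifier so the script does not depend on the current directory, and on Python 3.7.1 and later external general entities have to be switched on explicitly through feature_external_ges, because the SAX parser no longer processes them by default.

```python
import tempfile
import tracemalloc
from pathlib import Path
from xml.dom import pulldom
from xml.sax import make_parser
from xml.sax.handler import feature_external_ges


def count_items_via_entity(n_items=1000):
    """Parse a file whose body is one external entity.

    Returns (number of <item> elements seen, peak traced memory in bytes).
    The helper name and the file contents are hypothetical.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # External parsed entity: a run of sibling <item> elements.
        external = Path(tmp) / "others.xml"
        external.write_text(
            "".join(f"<item>{i}</item>" for i in range(n_items)),
            encoding="utf-8",
        )

        # Main document that pulls the entity in through its DOCTYPE.
        main = Path(tmp) / "file.xml"
        main.write_text(
            '<?xml version="1.0"?>\n'
            "<!DOCTYPE body [<!ENTITY file_external "
            f'SYSTEM "{external.as_posix()}">]>\n'
            "<body>&file_external;</body>\n",
            encoding="utf-8",
        )

        # Needed on Python 3.7.1+, where external general entities are off.
        parser = make_parser()
        parser.setFeature(feature_external_ges, True)

        count = 0
        tracemalloc.start()
        with open(main, "rb") as stream:
            for event, node in pulldom.parse(stream, parser=parser):
                if event == pulldom.START_ELEMENT and node.tagName == "item":
                    count += 1
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return count, peak
```

Calling count_items_via_entity with increasing values of n_items and comparing the reported peak against the same content placed inline in a single file should show whether memory grows with the size of the external entity. The numbers will depend on the Python and expat versions in use.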
History
Date User Action Args
2022-04-11 14:56:34 admin set github: 47067
2019-05-28 08:47:13 Jeffrey.Kintscher set nosy: + Jeffrey.Kintscher
2019-05-16 01:09:38 cheryl.sabella set nosy: + scoder; versions: + Python 3.8, - Python 3.2, Python 3.3, Python 3.4
2014-03-07 12:46:54 mvolz set nosy: + mvolz
2013-03-30 22:51:28 pitrou set nosy: + christian.heimes
2012-11-09 13:23:17 ezio.melotti set stage: needs patch; versions: + Python 3.3, Python 3.4, - Python 3.1
2010-06-09 22:16:33 terry.reedy set versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.5
2008-05-11 13:32:17 hanselda create