classification
Title: Remove "lightweight" from minidom description
Type: performance
Stage:
Components: Documentation, XML
Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: docs@python, eli.bendersky, eric.araujo, ezio.melotti, fdrake, flox, loewis, orsenthil, pitrou, scoder, tshepang
Priority: normal
Keywords:

Created on 2011-03-02 19:25 by scoder, last changed 2012-02-09 04:10 by eli.bendersky.

Messages (28)
msg129914 - (view) Author: Stefan Behnel (scoder) Date: 2011-03-02 19:25
http://docs.python.org/library/xml.dom.minidom.html

presents MiniDOM as a "Lightweight DOM implementation". The word "lightweight" is easily misunderstood as meaning "efficient" or "memory friendly". MiniDOM is well known to be neither of the two.

The first paragraph then continues:

"""
xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller.
"""

Again, "smaller" can be misread as "low memory footprint", whereas it is actually supposed to refer to an incomplete DOM API implementation. And "simpler" is also clearly exaggerated when compared to the alternative ElementTree package.

I would like to see this changed and combined with a clear and visible comment that MiniDOM has a very high resource profile, e.g.

"""
19.7. xml.dom.minidom — Pure Python DOM implementation

xml.dom.minidom is a pure Python implementation of the Document Object Model interface, as known from other programming languages. It is intended to provide a smaller API than the full DOM.

Note, however, that MiniDOM has a very large memory footprint compared to other Python XML libraries. If you need a fast and memory friendly XML tree implementation with a vastly simpler API, use the xml.etree package instead.
"""
msg129918 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-03-02 21:49
-1. The description is factually correct - minidom *does* have a lower footprint than other Python DOM implementations (such as 4DOM).
msg129934 - (view) Author: Stefan Behnel (scoder) Date: 2011-03-03 07:02
Well, I'm not aware of many people who use 4DOM these days, and if that's what it's meant to refer to, maybe that should be made more obvious, because it currently is not at all. Even cDomlette uses only half of the memory according to

http://effbot.org/zone/celementtree.htm

When you say that the description is "factually correct", that does by no means imply that the average reader will understand how it's meant. My point is that almost everyone who reads this will draw the wrong conclusions.

Also, when you say "lower footprint", that does not yet make it "light weight" in any way. It still uses something like ten times as much memory as cElementTree or lxml in Python 2 (and likely much more than even that in Python 3), and still something like 4-5 times as much as plain Python ElementTree. That's a huge difference.

What about this phrasing then:

"""
MiniDOM has a smaller memory footprint than some of the other DOM compliant implementations for Python (such as 4DOM), but uses about 10x more memory than the faster and simpler xml.etree.cElementTree module.
"""
msg129936 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-03-03 08:16
> What about this phrasing then:
> 
> """ MiniDOM has a smaller memory footprint than some of the other DOM
> compliant implementations for Python (such as 4DOM), but uses about
> 10x more memory than the faster and simpler xml.etree.cElementTree
> module. """

But that's not a DOM implementation - so it would be comparing apples
and oranges.
msg129937 - (view) Author: Stefan Behnel (scoder) Date: 2011-03-03 08:31
It's the tree based API most Python users are parsing XML with, though. So I do not agree that it's comparing apples and oranges, not at all. It's comparing tree based XML libraries, only one of which is worth being called "light weight", and that's not the one that is currently carrying that name.

I think it's worth telling new users what they are committing to when they write code that uses MiniDOM. The documentation should allow them to understand that.
msg129939 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-03-03 08:36
> It's the tree based API most python users are parsing XML with,
> though. So I do not agree that it's comparing apples and oranges, not
> at all. It's comparing tree based XML libraries, only one of which is
> worth being called "light weight", and that's not the one that is
> currently carrying that name.

If that is a real concern, I'd rather reduce the memory footprint of
minidom than put actual performance figures into the documentation
that will likely outdate over time.

Notice that the documentation doesn't claim that it is a lightweight
XML library, only that it's a lightweight DOM implementation. SAX is,
of course, even lighter-weight.
msg129944 - (view) Author: Stefan Behnel (scoder) Date: 2011-03-03 09:29
> If that is a real concern, I'd rather reduce the memory footprint of
> minidom than put actual performance figures into the documentation
> that will likely outdate over time.

Personally, I do not think it's worth putting much work into MiniDOM. I'd rather deprecate it to prevent new code from being written for it, but that's just my personal opinion, and this is the wrong place to discuss that. Given the current performance characteristics, I wouldn't be surprised if there was quite some room for improvements left in the xml.dom package.

If you dislike the "10x", feel free to use "several times". I doubt that MiniDOM will ever get so much closer to cET and lxml to prove that phrasing wrong.


> Notice that the documentation doesn't claim that it is a lightweight
> XML library, only that it's a ligthweight DOM implementation.

I imagine that you are as aware as I am that this nuance is easy to miss, especially for a new user. From my experience, it is very common for users, especially those with a Java-ish background, to confuse the terms "DOM" and "XML tree API/library". Hence my push to change the documentation.


> SAX is, of course, even lighter-weight.

Not so much more light weight than cET's iterparse(), but that's getting OT here.

Stefan
msg129951 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-03 10:58
Agreed with Stefan's concern.
msg148512 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-28 18:54
Ok, so, what do we make of this? I proposed improvements to the wording in the documentation, which make it much clearer for users what they are buying into when they start using minidom. I still think that "factually correct" but clearly misleading documentation is not helpful and that it needs fixing. Here is an updated phrasing that I hope we can settle on:

"""
:mod:`xml.dom.minidom` --- Pure Python DOM implementation

[...]

:mod:`xml.dom.minidom` is a pure Python implementation of the Document Object Model interface, as known from other programming languages. It is intended to provide a smaller and simpler API than the full W3C DOM.

Note that MiniDOM has a several times larger memory footprint than :mod:`xml.etree.ElementTree`, the light-weight Python XML library in the standard library. If you do not need a (mostly) compliant W3C DOM implementation, but a fast and memory friendly XML tree implementation with an easy to learn API, use that instead.
"""
msg148558 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2011-11-29 13:05
Is memory footprint something important enough to put in the doc?  Ease of use is IMO more important, but then it becomes subjective.
msg148562 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-29 13:39
I find a factor of an order of magnitude worth mentioning, because it prevents certain kinds of usages.
msg148565 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-29 13:46
Usually we don't talk about performance in the doc, and in my personal experience I didn't notice any major difference between the different implementations (but then again I haven't used them much).
Talking about the other implementations and their advantages/disadvantages is fine, but things like "MiniDOM has a several times larger memory footprint" seem like FUD to me (see also http://docs.python.org/dev/documenting/style.html#affirmative-tone).
msg148566 - (view) Author: Fred L. Drake, Jr. (fdrake) (Python committer) Date: 2011-11-29 13:49
Removing "Lightweight" and changing the first paragraph to (something like)

:mod:`xml.dom.minidom` is an implementation of the Document Object Model
interface.  The API is slightly simpler than the full W3C DOM, but the
implementation has a significantly higher memory footprint than
:mod:`xml.etree.ElementTree`.

would be entirely reasonable.

(I don't think it's wrong to discuss relative memory footprints in comparison to other modules in the standard library.)
msg148570 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-29 14:12
I don't think "FUD" is a suitable term for the rather minidom-friendly wording in my last proposal. Seriously, minidom is widely known for being extremely slow and extremely memory hungry. And that is backed by basically any benchmark that has ever been done on the subject. If 4DOM, which Martin cites, is really worse in terms of performance (I never used it), it must truly be the only existing species of that kind.

Still, here's a cleaned up version of Fred's proposal that I could live with:

"""
:mod:`xml.dom.minidom` --- Pure Python DOM implementation

:mod:`xml.dom.minidom` is an implementation of the Document Object Model interface.  The API is (intentionally) slightly simpler than the full W3C DOM, but the implementation has a significantly higher memory footprint than the XML tree library in :mod:`xml.etree.ElementTree`.
"""
msg148572 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-29 14:14
> I don't think "FUD" is a suitable term for the rather minidom-friendly
> wording in my last proposal. Seriously, minidom is widely known for
> being extremely slow and extremely memory hungry. And that is backed
> by basically any benchmark that has ever been done on the subject.

If it's both slow and memory-hungry, perhaps use the more generic
"performance" instead of "memory footprint"?
msg148578 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-29 15:26
> Seriously, minidom is widely known for being extremely slow and 
> extremely memory hungry. And that is backed by basically any benchmark 
> that has ever been done on the subject.

Do you have any link?
My point is that if you say things like "significantly/several times higher memory footprint than X" you are basically scaring the users away from the module.  If for an average document it takes, say, 30-50MB of memory, it seems perfectly reasonable to me, even if ElementTree takes 3-5MB.  I would actually consider 100-200MB still ok too, unless I have to parse a lot of documents or I'm running low on memory for other reasons.
msg148579 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-29 15:33
> My point is that if you say thing like "significantly/several times
> higher memory footprint than X" you are basically scaring the users
> away from the module.

Only those users who know they'll be processing significantly large
documents.
I don't think "scaring away people" is a good enough reason *not* to
document performance characteristics. For example, we already mention
that string joining is faster than repeated concatenation; I haven't
heard anyone complain that it scared people away from string
concatenation. And while it's true that we shouldn't try to document
performance characteristics *too precisely*, it is still a good thing to
document the most outstanding facts (for example, C accelerator modules
are clearly superior in performance to pure Python modules; should we
shy away from documenting that, and instead present it as some kind of
neutral choice?).

And, of course, if minidom gets some serious performance attention, the
claims will have to be revisited. But given the amount of attention
minidom gets at all, it sounds rather implausible.

> If for an average documents it takes, say, 30-50MB of memory, it seems
> perfectly reasonable to me, even if ElementTree takes 3-5MB.  I would
> actually consider 100-200MB still ok too

Some use cases would not really like a 100-200MB memory consumption, or
even 50MB. Think a long-running daemon, for instance.
msg148584 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-29 16:26
Ezio Melotti, 29.11.2011 16:26:
>> Seriously, minidom is widely known for being extremely slow and
>> extremely memory hungry. And that is backed by basically any benchmark
>> that has ever been done on the subject.
>
> Do you have any link?

I just did a quick Google search for "python minidom benchmark" and found 
these:

http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-matters/page2.html

http://effbot.org/zone/celementtree.htm#benchmarks

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Note that all three authors risk being biased, but given how similar the 
results are, I tend to believe them.

Stefan
msg148585 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-29 16:45
> I just did a quick Google search for "python minidom benchmark" and found 
> these:
> 
> http://www.opensourcetutorials.com/tutorials/Server-Side-Coding/Python/xml-matters/page2.html
> 
> http://effbot.org/zone/celementtree.htm#benchmarks
> 
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
> 
> Note that all three authors risk being biased, but given how similar the 
> results are, I tend to believe them.

Thanks for the links. The performance gap looks significant enough to be
mentioned, at least generically.
msg148594 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-29 19:02
Given that the links were generally somewhat dated and used Py2.x instead of the post-PEP393 Py3.3, here is another little benchmark, comparing the parser performance of minidom to lxml.etree (latest), ElementTree and cElementTree (stdlib) in a recent Py3.3 build (e66b7c62eec0), everything properly optimised for my platform (Linux 64bit). I used os.fork() to start a new process after importing everything and reading the file a couple of times, and before parsing. The memory usage is measured inside of the forked child using the resource module's ru_maxrss value (reported in kilobytes on Linux), so it correlates with the growth of CPython's memory heap after parsing, thus giving an estimate of the maximum amount of memory used during parsing and tree building.
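The measurement approach described above can be sketched roughly as follows. This is not the script that produced the numbers in this message; the function name `measure_in_child` and the synthetic sample document are assumptions made for illustration, and the sketch only covers the two stdlib parsers. It relies on `os.fork()` and the `resource` module, so it runs on Unix-like systems only.

```python
import io
import os
import resource
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def measure_in_child(parse, data):
    """Parse `data` in a forked child; return (seconds, growth of ru_maxrss).

    `parse` is any callable accepting a binary file object (hypothetical helper,
    not part of any library). ru_maxrss is in kilobytes on Linux.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: measure, report through the pipe, then exit
        status = 1
        try:
            os.close(read_fd)
            before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            start = time.perf_counter()
            tree = parse(io.BytesIO(data))  # keep the tree alive while measuring
            elapsed = time.perf_counter() - start
            after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            os.write(write_fd, f"{elapsed} {after - before}".encode())
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code
    # parent: read the child's report and reap it
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        elapsed, growth = pipe.read().split()
    os.waitpid(pid, 0)
    return float(elapsed), int(growth)


if __name__ == "__main__":
    # Synthetic document standing in for the real test files.
    sample = b"<root>" + b"<item>some text</item>" * 50000 + b"</root>"
    for name, parse in [("ElementTree", ET.parse),
                        ("minidom", xml.dom.minidom.parse)]:
        seconds, growth = measure_in_child(parse, sample)
        print(f"{name}: {seconds:.3f} s, max RSS +{growth}")
```

Forking after the imports and after the input has been read means that the child's peak memory growth is dominated by parsing and tree building, which is the quantity the figures below report.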

Parsing hamlet.xml in English, 274KB:

Memory usage: 7284
xml.etree.ElementTree.parse done in 0.104 seconds
Memory usage: 14240 (+6956)
xml.etree.cElementTree.parse done in 0.022 seconds
Memory usage: 9736 (+2452)
lxml.etree.parse done in 0.014 seconds
Memory usage: 11028 (+3744)
minidom tree read in 0.152 seconds
Memory usage: 30360 (+23076)

Parsing the old testament in English (ot.xml, 3.4MB) into memory:

Memory usage: 20444
xml.etree.ElementTree.parse done in 0.385 seconds
Memory usage: 46088 (+25644)
xml.etree.cElementTree.parse done in 0.056 seconds
Memory usage: 32628 (+12184)
lxml.etree.parse done in 0.041 seconds
Memory usage: 37500 (+17056)
minidom tree read in 0.672 seconds
Memory usage: 110428 (+89984)

A 25MB XML file with Slavic Unicode text content:

Memory usage: 57368
xml.etree.ElementTree.parse done in 3.274 seconds
Memory usage: 223720 (+166352)
xml.etree.cElementTree.parse done in 0.459 seconds
Memory usage: 154012 (+96644)
lxml.etree.parse done in 0.454 seconds
Memory usage: 135720 (+78352)
minidom tree read in 6.193 seconds
Memory usage: 604860 (+547492)

And a contrived 4.5MB XML file with a lot more structure than data:

Memory usage: 13308
xml.etree.ElementTree.parse done in 4.178 seconds
Memory usage: 222088 (+208780)
xml.etree.cElementTree.parse done in 0.478 seconds
Memory usage: 103056 (+89748)
lxml.etree.parse done in 0.199 seconds
Memory usage: 101860 (+88552)
minidom tree read in 8.705 seconds
Memory usage: 810964 (+797656)

Things to note: The factor of 5-10 for the memory overhead compared to cET depends heavily on the data. Also, minidom is consistently slower by more than a factor of 10 compared to the fastest parser (apparently the one in libxml2/lxml.etree, both of which surely can't be said to provide fewer features than the DOM that minidom implements).
msg148598 - (view) Author: Stefan Behnel (scoder) Date: 2011-11-29 19:57
Hmm, looks like I messed up the last example. I accidentally left in the formatting whitespace, thus growing the file to 6.2 MB. Removing that, I get this for the (now really) 4.5 MB XML file with lots of structure and very little data:

Memory usage: 11600
xml.etree.ElementTree.parse done in 3.374 seconds
Memory usage: 203420 (+191820)
xml.etree.cElementTree.parse done in 0.192 seconds
Memory usage: 36444 (+24844)
lxml.etree.parse done in 0.131 seconds
Memory usage: 62648 (+51048)
minidom tree read in 5.935 seconds
Memory usage: 527684 (+516084)

It's actually surprising how much of a difference trailing whitespace content makes in minidom (from 2MB on disk to 300MB in memory???), most likely due to the usage of dedicated DOM text nodes in the tree.
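A small illustration of the structural difference mentioned above, assuming a tiny hand-written document: minidom represents each run of formatting whitespace between elements as its own Text node object, while ElementTree keeps the same whitespace as plain strings in the `.text` and `.tail` attributes of existing elements. This shows only how the trees are shaped; it does not by itself prove where the measured memory goes.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Illustrative document with indentation between the child elements.
DOC = "<root>\n  <a/>\n  <b/>\n</root>"

# minidom: whitespace between elements becomes separate Text nodes.
dom = xml.dom.minidom.parseString(DOC)
node_names = [child.nodeName for child in dom.documentElement.childNodes]
print(node_names)  # ['#text', 'a', '#text', 'b', '#text']

# ElementTree: only the two elements are children of the root;
# the whitespace is stored as strings on .text and .tail.
root = ET.fromstring(DOC)
print(len(root))            # 2
print(repr(root.text))      # '\n  '
print(repr(root[0].tail))   # '\n  '
```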

PS: I think the "XML/performance" tags on this bug would hint at a separate ticket. This is really meant as a documentation bug.
msg149604 - (view) Author: Stefan Behnel (scoder) Date: 2011-12-16 09:29
I started a mailing list thread on the same topic:

http://thread.gmane.org/gmane.comp.python.devel/127963

Especially see

http://thread.gmane.org/gmane.comp.python.devel/127963/focus=128162

where I extract a proposal from the discussion. Basically, there should be a note at the top of the xml.dom documentation as follows:

"""
[[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]]
"""

I think this should go on the xml.dom.minidom page as well as the xml.dom package page. Hand-wavingly, users who are new to the DOM are more likely to hit the package page first, whereas those who know it already will likely find the MiniDOM page directly.
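To make the "write less code" point in the proposed note concrete, here is a minimal sketch that extracts the same data with both tree APIs. The sample document and the function names `titles_minidom` and `titles_etree` are made up for illustration; only standard library calls are used.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<catalog>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</catalog>"""


def titles_minidom(text):
    """Map book id to title using the W3C DOM style API of minidom."""
    dom = xml.dom.minidom.parseString(text)
    result = {}
    for book in dom.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Element text lives in child Text nodes and is joined by hand.
        value = "".join(node.data for node in title.childNodes
                        if node.nodeType == node.TEXT_NODE)
        result[book.getAttribute("id")] = value
    return result


def titles_etree(text):
    """Same mapping using xml.etree.ElementTree."""
    root = ET.fromstring(text)
    return {book.get("id"): book.findtext("title")
            for book in root.iter("book")}


print(titles_minidom(DOC))
print(titles_etree(DOC))
```

Both functions return the same dictionary; the difference is in how much of the node structure the caller has to handle explicitly.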

Note that I'd still encourage the removal of the misleading word "lightweight" until it makes sense to put it back in a meaningful way. I therefore propose the following minimalistic changes to the first paragraph on the minidom page:

"""
xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also [+XXX: provide a] significantly smaller [+XXX: API].
"""

Additionally, the documentation on the xml.sax page would benefit from the following paragraph:

"""
[[Note: The xml.sax package provides an implementation of the SAX interface whose API is similar to that in other programming languages. Users who are unfamiliar with the SAX interface or who would like to write less code for efficient stream processing of XML files should consider using the iterparse() function in the xml.etree.ElementTree module instead.]]
"""
msg149611 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-12-16 11:09
> xml.dom.minidom is a [-XXX: light-weight] implementation of the Document Object Model interface.

This is ok.

> It is intended to be simpler than the full DOM and also
> [+XXX: provide a] significantly smaller [+XXX: API].

Doesn't "simpler" here refer to the API already?

Another option is to add somewhere a section like:
"If you have to work with XML, ElementTree is usually the best choice, because it has a simple API and it's efficient [or whatever].  xml.dom.minidom provides a subset of the W3C-DOM API, and xml.sax a SAX interface.", possibly expanding a bit on the differences and showing a minimal example with the 3 different implementations, and then link to it from the other modules' pages.
msg149634 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-12-16 17:59
> "If you have to work with XML, ElementTree is usually the best
> choice, because it has a simple API and it's efficient [or whatever].

I still object to such wording, for many reasons.
msg152836 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-08 03:42
IMHO this wording proposed by Stefan:

"""
[[Note: The xml.dom.minidom module provides an implementation of the W3C-DOM whose API is similar to that in other programming languages. Users who are unfamiliar with the W3C-DOM interface or who would like to write less code for processing XML files should consider using the xml.etree.ElementTree module instead.]]
"""

Sounds very reasonable. Perhaps something about a more Pythonic API can also be added there, in addition to "to write less code".

Any objections?
msg152862 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2012-02-08 14:33
On Wed, Feb 08, 2012 at 03:42:16AM +0000, Eli Bendersky wrote:
> Any objections?

None. The explanation sounds reasonable.
msg152866 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-08 15:11
+1 to the suggested wording.

-1 to talking about a more pythonic API.

(Want a nit?  s/W3C-DOM/W3C DOM/)
msg152924 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-02-09 04:10
Martin, do you find the wording I quoted (*without* the reference to a more Pythonic API) acceptable?
History
Date User Action Args
2012-02-09 04:10:30  eli.bendersky  set  messages: + msg152924
2012-02-08 15:11:37  eric.araujo  set  messages: + msg152866
2012-02-08 14:33:02  orsenthil  set  nosy: + orsenthil; messages: + msg152862
2012-02-08 10:30:08  tshepang  set  nosy: + tshepang
2012-02-08 03:42:15  eli.bendersky  set  nosy: + eli.bendersky; messages: + msg152836
2011-12-16 17:59:06  loewis  set  messages: + msg149634
2011-12-16 11:09:20  ezio.melotti  set  messages: + msg149611
2011-12-16 09:29:16  scoder  set  messages: + msg149604
2011-11-29 19:57:26  scoder  set  messages: + msg148598
2011-11-29 19:28:06  flox  set  components: + XML
2011-11-29 19:27:38  flox  set  nosy: + flox; type: performance
2011-11-29 19:02:15  scoder  set  messages: + msg148594
2011-11-29 16:45:17  pitrou  set  messages: + msg148585
2011-11-29 16:26:23  scoder  set  messages: + msg148584
2011-11-29 15:33:04  pitrou  set  messages: + msg148579
2011-11-29 15:26:35  ezio.melotti  set  messages: + msg148578
2011-11-29 14:14:31  pitrou  set  messages: + msg148572
2011-11-29 14:12:24  scoder  set  messages: + msg148570
2011-11-29 13:49:08  fdrake  set  nosy: + fdrake; messages: + msg148566
2011-11-29 13:46:53  ezio.melotti  set  nosy: + ezio.melotti; messages: + msg148565
2011-11-29 13:39:49  scoder  set  messages: + msg148562
2011-11-29 13:05:47  eric.araujo  set  nosy: + eric.araujo; messages: + msg148558
2011-11-28 18:54:18  scoder  set  messages: + msg148512
2011-03-03 10:58:02  pitrou  set  nosy: + pitrou; messages: + msg129951
2011-03-03 09:29:01  scoder  set  nosy: loewis, scoder, docs@python; messages: + msg129944
2011-03-03 08:36:34  loewis  set  nosy: loewis, scoder, docs@python; messages: + msg129939
2011-03-03 08:31:30  scoder  set  nosy: loewis, scoder, docs@python; messages: + msg129937
2011-03-03 08:16:05  loewis  set  nosy: loewis, scoder, docs@python; messages: + msg129936
2011-03-03 07:02:17  scoder  set  nosy: loewis, scoder, docs@python; messages: + msg129934
2011-03-02 21:49:45  loewis  set  nosy: + loewis; messages: + msg129918
2011-03-02 19:25:07  scoder  create