Author akuchling
Recipients akuchling
Date 2008-02-15.14:42:52
Message-id <1203086575.38.0.209227458245.issue2124@psf.upfronthosting.co.za>
In-reply-to
Content
The W3C posted an item at
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
describing how their DTDs are being fetched up to 130M times per day.  

The Python parsers are part of the problem, as noted by Paul Boddie on the
python-advocacy list:

There are two places which stand out:

xml/dom/xmlbuilder.py
xml/sax/saxutils.py

What gives them away as the cause of the described problem is that they
both fetch things which are given as "system identifiers" - the things you
get in the document type declaration at the top of an XML document which
look like a URL.
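
To make the term "system identifier" concrete, here is a minimal sketch that reads the identifiers out of a document type declaration without fetching anything. The sample document and the helper name doctype_identifiers are made up for illustration; StartDoctypeDeclHandler is the expat callback that reports the declaration.

```python
import xml.parsers.expat

# Hypothetical sample document; only the DOCTYPE matters here.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
"""


def doctype_identifiers(data):
    """Return the name, system id and public id from the DOCTYPE, if any."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["name"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id

    # A bare expat parser does not load the external DTD subset by default,
    # so this only reports the identifiers and makes no network request.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)
    return found


if __name__ == "__main__":
    print(doctype_identifiers(SAMPLE))
```

The system_id value is the URL-like string that a parser may try to download if it is configured to load external DTDs.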

If you then put some trace statements into the code and try to parse
something using, for example, the xml.sax API, it becomes evident that by
default the parser attempts to fetch lots of DTD-related resources, not
helped by the way that stuff like XHTML is now "modular" and thus employs
lots of separate files in the DTD. This is obvious because you get something
like this printed to the terminal:

saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod

Of course, the "best practice" with APIs like SAX is that you define your
own resolver or handler classes which don't go and fetch DTDs from the W3C
all the time, but this isn't the "out of the box" behaviour. Instead,
implementers have chosen the most convenient behaviour, which arguably
involves the least effort in telling people how to get hold of DTDs so that
they may validate their documents, but which isn't necessarily the "right
thing" in terms of network behaviour. Naturally, since defining specific
resolvers/handlers involves a lot of boilerplate (and you should try it in
Java!), a lot of developers just incur the penalty of having the default
behaviour, instead of considering the finer points of the various W3C
specifications (which is never really any fun).
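
As an illustration of the resolver approach described above, the following is a rough sketch of an xml.sax entity resolver that never touches the network. The class name OfflineResolver, the helper parse_offline and the sample document are assumptions made for this example, not part of the standard library or of any proposed fix. The resolver hands back an empty in-memory stream for every external entity and records which system identifiers were requested.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical sample document that points at the W3C-hosted XHTML 1.1 DTD.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>
"""


class OfflineResolver(EntityResolver):
    """Answer every external entity request with an empty local stream."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        source = InputSource(systemId)
        # Supplying a byte stream means the parser has nothing left to open,
        # so the system identifier is never fetched.
        source.setByteStream(io.BytesIO(b""))
        return source


def parse_offline(data):
    """Parse data with external entity loading on, but served locally."""
    resolver = OfflineResolver()
    parser = xml.sax.make_parser()
    # Turned on explicitly so that the resolver is actually consulted.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(data))
    return resolver.requested


if __name__ == "__main__":
    print(parse_offline(SAMPLE))
```

A real application would more likely return a locally cached copy of the DTD than an empty stream, since an empty DTD defines no entities. Note also that Python 3.7.1 and later leave feature_external_ges off by default, in which case the resolver is not consulted at all unless the feature is enabled as above.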

Anyway, I posted a comment saying much the same on the blog referenced at
the start of this thread, but we should be aware that this is the default
standard library behaviour, not rogue application developer behaviour.