classification
Title: xml.sax and xml.dom fetch DTDs by default
Type: resource usage
Stage:
Components: XML
Versions: Python 2.7, Python 2.6
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, mcepl, r.david.murray, rsandwick3
Priority: normal
Keywords:

Created on 2013-02-28 00:50 by rsandwick3, last changed 2018-03-05 00:46 by mcepl.

Messages (2)
msg183193 - Author: Raynard Sandwick (rsandwick3) Date: 2013-02-28 00:50
Note that URIs in the following are only meant as links when in parentheses; otherwise, they are identifiers and mostly will not yield useful results. I have only worked with xml.sax in Python 2.6 and 2.7, so I cannot speak to its current state in later versions.

The condition described in Python issue #2124 (http://bugs.python.org/issue2124) may yet be a defect, and is at the least a reasonably important enhancement, but apparently was not sufficiently specified, so I will attempt to clarify. As an aside, it is similar to a libxml2 issue on which I have also commented today (https://bugzilla.gnome.org/show_bug.cgi?id=162776), whose statement of issue actually contains what I would expect to be correct behavior if the toggling action were setting an option/feature rather than importing an additional module.

The most common case, and the reason w3c has been inundated with the described requests, is that every time any user anywhere uses xml.sax in its default form to parse an XHTML document containing a doctype declaration, a request is sent to www.w3.org for the contents of that DTD from the URI in its system identifier. This is not documented anywhere (which would be the primary reason to call this a defect), and is confusing because it has the effect of using the terms parser and validator (or "validating parser," whichever is the preferred name) interchangeably.
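To see which identifiers a parser would try to retrieve, without touching the network, one can install an entity resolver that records each request and hands back an empty in-memory stream. The following is only a sketch: `RecordingResolver`, `XHTML_DOC` and `recorded_requests` are names invented for illustration, and `feature_external_ges` is switched on explicitly because its default is not the same in every Python version.

```python
import io
import xml.sax
from xml.sax.handler import EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical sample document with an XHTML doctype declaration.
XHTML_DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>t</title></head><body><p>hello</p></body></html>'
)


class RecordingResolver(EntityResolver):
    """Record every (publicId, systemId) pair instead of fetching it."""

    def __init__(self):
        self.requests = []

    def resolveEntity(self, publicId, systemId):
        self.requests.append((publicId, systemId))
        # Returning a source that already has a byte stream means the
        # SAX machinery has nothing left to open, so no request is sent.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


def recorded_requests(document):
    """Parse `document` and return the external lookups it asked for."""
    parser = xml.sax.make_parser()
    # Assumption: enable external entity handling explicitly so the
    # result does not depend on the interpreter's default.
    parser.setFeature(feature_external_ges, True)
    resolver = RecordingResolver()
    parser.setEntityResolver(resolver)
    parser.feed(document)
    parser.close()
    return resolver.requests


requests = recorded_requests(XHTML_DOC)
print(requests)
```

With the stock `EntityResolver`, the system identifier in that list is what would be opened as a URL.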

The w3c is largely to blame, since their own definition document for XML (http://www.w3.org/TR/REC-xml/#sec-external-ent) defines the DTD as a "special kind of external entity," and then goes on to say that XML processors *MAY* use any combination of pubid+sysid to find an alternative method of resolving the reference, but otherwise *MUST* use the URI.

However, this is only necessary when *validating* XML. The DTD is a "mostly useless, but required" (http://en.wikipedia.org/wiki/Document_Type_Declaration) entity in HTML5, e.g., but is not required in XML generally. Even when present, the only time a processor should consult the DTD is during validation, not parsing. If the default parser revealed by xml.sax is a validator rather than just a parser, that should be communicated clearly to the user. When we discuss a CSV parser, we expect it to accept lines separated by some character, each with columns separated by commas. We do not expect it to verify that certain values are found in certain columns of the first line unless we specify that it should. In specifying that it should, we have asked for a validator rather than a parser. This issue is related to the XML analogue of that distinction.

The most valid and important complaint in the referenced blog post is: "don't fetch stuff unless you actually need it," which is what xml.sax users may be unwittingly doing if validation is the default behavior. Further, if xml.sax were actually *not* conducting validation by default, there is no reason whatever to retrieve the DTD, since any external entity references can remain unresolved in well-formed XML prior to validation.

Note that the features, http://xml.org/sax/features/external-general-entities, .../external-parameter-entities, and .../validation have no specified defaults (http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description). Making these enabled by default causes network-required side effects, which I would contend is improper: unless a user asks for network activity, none should occur. An implicit request for network activity, such as validation, should be fully and widely-visibly documented as a legitimate side effect.
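The three feature identifiers named above are available as constants in `xml.sax.handler`, so their state on whatever parser `xml.sax.make_parser()` returns can be checked rather than assumed. A minimal sketch follows; `feature_states` is a hypothetical helper, and the values it reports depend on the Python version and on which parser driver is installed.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException, SAXNotSupportedException
from xml.sax.handler import (
    feature_external_ges,
    feature_external_pes,
    feature_validation,
)


def feature_states(parser):
    """Return the on/off state of the features discussed in this issue.

    A value of None means the parser does not recognise or support
    querying that feature.
    """
    features = {
        "external-general-entities": feature_external_ges,
        "external-parameter-entities": feature_external_pes,
        "validation": feature_validation,
    }
    states = {}
    for label, uri in features.items():
        try:
            states[label] = bool(parser.getFeature(uri))
        except (SAXNotRecognizedException, SAXNotSupportedException):
            states[label] = None
    return states


states = feature_states(xml.sax.make_parser())
for label, state in sorted(states.items()):
    print(label, state)
```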

The set of primary use cases for the xml.sax parsers certainly includes validation, but users will often be unaware that it is the default, and more importantly be unaware that the parser will therefore request the DTD from its URI. While the feature, .../external-general-entities, partially solves the problem, it is not a full solution, because a well-formed XML document can contain external entities regardless of the location of DTD subsets. The w3c's description ("special kind of external entity") is important here - the DTD is special for a reason, and has its own tag/specifier as a result: resolving general external entities after intentionally omitting an external DTD subset is an acceptable use case, especially in a non-validating parser.

My proposal would be to enhance/fix xml.sax by doing the following:

1) allow toggling of external DTD subset loading via a feature such as http://apache.org/xml/features/nonvalidating/load-external-dtd (http://xerces.apache.org/xerces-j/features.html),
2) cause the feature, http://xml.org/sax/features/validation, to automatically enable the DTD loading feature as well, just as it does for the two currently implemented external entity features,
3) document the default behavior, specially noting that users can expect URIs to be resolved, across the network/internet if necessary, after either the DTD feature or the validation feature is toggled to the enabled state,

and in my opinion:

4) disable the DTD feature by default, so the xml.sax-uninitiated developer who arrives upon the module as a solution doesn't start testing/using it without realizing these requests will be sent.

Sufficient documentation could override #4, since there is a backward-compatibility issue, but I think the detriment to the w3c is enough reason to rethink it nevertheless. Catalogs are a nice solution as well when validation is needed, but when it is not needed, there is no reason to require the extra work of building a catalog (that can't be guaranteed to be writable in situ without sysadmin access) when it is essentially purposeless.
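Until something like item 4 is the default everywhere, a caller can approximate it bluntly by switching off `feature_external_ges` before parsing. As far as I can tell, in the expat-based reader this also suppresses the lookup of the external DTD subset, at the cost of disabling general external entities as well, which is exactly the coupling items 1 and 2 would remove. The sketch below uses hypothetical names (`make_offline_parser`, `FailingResolver`, `ElementNames`, `DOC`); the resolver raises so that any unexpected lookup is noticed immediately.

```python
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class FailingResolver(EntityResolver):
    """Raise instead of resolving, so stray lookups cannot go unnoticed."""

    def resolveEntity(self, publicId, systemId):
        raise RuntimeError("unexpected external lookup: %r" % (systemId,))


class ElementNames(ContentHandler):
    """Collect element names, just to show the document is still parsed."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def make_offline_parser(handler):
    """Build a parser that is not expected to consult external resources."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)
    parser.setEntityResolver(FailingResolver())
    parser.setContentHandler(handler)
    return parser


# Hypothetical document that still carries the XHTML doctype declaration.
DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    b'<html><body/></html>'
)

handler = ElementNames()
parser = make_offline_parser(handler)
parser.feed(DOC)
parser.close()
print(handler.names)
```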

I am continuing to search for the entry point for "<!" (and would appreciate any pointers) but have resorted to subclassing ExpatParser (which, again, kills the nice abstraction xml.sax touts) and omitting external entities with public identifiers starting with "-//W3C//DTD ". This is not a general solution, as DTDs are not guaranteed to come from w3c. A correct solution would apply the appropriate omissions to the full "<!DOCTYPE ...>" entity, either by leaving external subsets as unparsed and orphaned entities in the document, or (only as a secondary potential solution, since internal subsets could still be present and would thus become broken) by ignoring it completely. It might not even be reasonable to consider the latter, though when parsing only and not validating it would be a correctly-working result, so if the former is unachievable, the latter would be a decent improvement for that situation.
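For comparison, a similar filter can be written without subclassing ExpatParser, by going through the public `EntityResolver` hook instead. This is a different technique from the subclass described above, and only a sketch: `SkipW3CDTDResolver` and `W3C_DTD_PREFIX` are invented names, and it shares the limitation already noted, since it recognises only W3C public identifiers. Anything the skipped DTD would have declared (named character entities, for example) is then unavailable to the parser.

```python
import io
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource

# Prefix quoted in the message above; not a general way to detect DTDs.
W3C_DTD_PREFIX = "-//W3C//DTD "


class SkipW3CDTDResolver(EntityResolver):
    """Answer W3C DTD lookups with an empty stream instead of fetching."""

    def resolveEntity(self, publicId, systemId):
        if publicId is not None and publicId.startswith(W3C_DTD_PREFIX):
            # An InputSource that already carries a byte stream is read
            # as-is, so the system identifier is never opened.
            empty = InputSource(systemId)
            empty.setByteStream(io.BytesIO(b""))
            return empty
        # Everything else keeps the stock behaviour, which returns the
        # system identifier for the parser to open.
        return EntityResolver.resolveEntity(self, publicId, systemId)


resolver = SkipW3CDTDResolver()
skipped = resolver.resolveEntity(
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
)
passed_through = resolver.resolveEntity(None, "local.ent")
```

It would be installed with `parser.setEntityResolver(SkipW3CDTDResolver())` on a parser obtained from `xml.sax.make_parser()`.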

As a final note, in case it's helpful: my approach to a fix has been to examine ways to treat the DOCTYPE declaration itself, but another approach would be to have EntityResolver.resolveEntity receive a declaration type alongside the public and system identifiers, and thus the DOCTYPE declaration type could receive appropriate treatment within the current framework quite easily.
msg183200 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-02-28 04:11
I believe this is a subset of issue 17239, and it may be appropriate to close it as a duplicate.  I'll leave that up to Chris, though, since he knows what still needs to be specified/worked out.
History
Date User Action Args
2018-03-05 00:46:04 mcepl set nosy: + mcepl
2016-06-12 12:11:24 martin.panter link issue17239 dependencies
2016-06-12 11:25:40 christian.heimes set assignee: christian.heimes ->
2013-02-28 04:11:47 r.david.murray set assignee: christian.heimes; messages: + msg183200; nosy: + r.david.murray, christian.heimes
2013-02-28 00:50:37 rsandwick3 create