New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xml.sax and xml.dom fetch DTDs by default #46377
Comments
The W3C posted an item at The Python parsers are part of the problem, as There are two places which stand out: xml/dom/xmlbuilder.py What gives them away is the way as the cause of the described problem is If you then put some trace statements into the code and then try and parse saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd Of course, the "best practice" with APIs like SAX is that you define Anyway, I posted a comment saying much the same on the blog referenced |
Here's a simple test to demonstrate the problem: from xml.sax import make_parser
from xml.sax.saxutils import prepare_input_source
parser = make_parser()
inp = prepare_input_source('file:file.xhtml')
parser.parse(inp) file.xhtml contains: <?xml version="1.0" encoding="UTF-8"?> If you insert a debug print into saxutils.prepare_input_source, I don't see a good way to fix this without breaking backward If we had catalog support, we could ship the XHTML 1.1 DTDs and any |
On systems that support catalogs, the parsers should be changed to However, I see really no way how the library could avoid resolving the |
The blog page talked about 503 responses. What about issuing a warning Or what about in-memory caching of the DTDs? Sure, it wouldn't be as |
(Andrew, thanks for making a bug, and apologies for not reporting this Although an in-memory caching solution might seem to be sufficient, if I think that the nicest compatible solution would be to have some kind |
What if we just tried to make the remote accesses apparent to the user, We should also modify the docs to point this out; it's not likely to |
-1 on issuing a warning. I really cannot see much of a problem in this |
-1 on the systematic warnings too, but what I was talking about is a |
Martin, I agree that simply not resolving DTDs is an unreasonable In Python, a winning combo would be an arbitrary (and explicit) FS In that scenario, AMK could save latter W3C hits with: from xml.sax import make_parser
from dtdcache.sax.saxutils import prepare_input_source # <- dtdcache
parser = make_parser()
inp = prepare_input_source('file:file.xhtml', cache="/tmp/xmlcache") It might be interesting to have read-only, force-write and read-write Regarding the std-lib, I believe effective caching hooks for DTDs trump |
It's indeed possible to provide that as a third-party module; one would
Please take a look at catalogs - they are a read-only repository for
So then nothing needs to be done - the hooks have been in place since |
The solution of adding caching, If-Modified-Since, etc. is a good one, So, I propose we:
I'm willing to do the necessary writing. |
I may have lost track somewhere: what does have urllib* to do with this |
I mention urllib because, as noted earlier in the discussion, code that |
I just ran into this problem. I was very surprised to realize that I attempted to fix the issue by adding an EntityResolver that would Unfortunately, this proves to not be possible. The main docbook DTD publicId: -//OASIS//DTD DocBook V4.1//EN ...includes this second DTD: publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN The EntityResolver's resolveEntity() method is not, however, passed the This makes it impossible to properly implement a parser which caches |
I don't think this is actually the case. Did you try calling getSystemId |
On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote:
EntityResolver.resolveEntity() is called with the publicId and systemId as
|
Though it's inconvenient to do so, you can arrange to have the locator Note that you have to be careful with the InputStreams you return from |
Sure. But ContentHandler.setDocumentLocator receives it, and you are |
I don't think this is true, for several reasons. First, most people never notice that they are writing or using an Second, it is *difficult* to implement the non-network behavior. Third, there are several pitfalls on the way to a correct implementation So I think it would be a significant improvement if there were a simple,
Quoting part of the XML design goals isn't a strong argument for the |
On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote:
Where in the following sequence am I supposed to receive the document parser = xml.sax.make_parser()
parser.setEntityResolver(CachingEntityResolver())
doc = xml.dom.minidom.parse('file.xml', parser) The content handler is being created deep inside xml.dom. It does, in Or, as a more general question: How can I get a DOM tree that includes |
I just discovered another really fun wrinkle in this. Let's say I want to have my entity resolver return a reference to my source = xml.sax.InputSource()
source.setPublicId(publicId)
source.setSystemId(systemId)
source.setCharacterStream(file(path_to_local_copy))
return source This will appear to work. However, the parser will still silently fetch the DTD over the network! I'd never have noticed this if I hadn't used strace on my process and |
This is DOM parsing, not SAX parsing.
So break layers of abstraction, then. Or else, use dom.expatbuilder,
This tracker is really not the place to ask questions; use python-list |
On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote:
Is that really the answer? Read the source code to xml.dom.*, and write hacks based on what I find
That was a rhetorical question. The answer is, as best I can tell, "You can't do that." |
Does anybody know if users are still experiencing problems with this issue? |
Yes. |
..still an issue. |
And my position still remains the same: this is not a bug. Applications affected by this need to use the APIs that are in place precisely to deal with this issue. So I propose to close this report as invalid. |
Of course, you can do as you like. http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/ |
Well, the issue is clearly underspecified, and different people read different things into it. I take your citation of the W3C blog entry that you are asking that caching should be employed. I read the issue entirely different, namely that by default no attempt to download the DTD should be made, or that the DOM loaders should provide better customization in that matter, or that catalogs shall be used. Given that the issue was underspecified to begin with, I'm now closing it. Anybody who still has an issue here, please open a new issue and report your specific problem, preferably also proposing a solution. If you need to follow up to this message, please do so in private email (martin@v.loewis.de). |
Note that Python 3 provided a good opportunity for doing the minimal amount of |
I have opened issue bpo-17318 to try to specify the problem better. While I do think that catalogs are the correct fix for the validation use case (and thus would like to see something more out-of-the-box in that vein), the real trouble is that users are often unaware that they're sending requests to DTD URIs, so some combination of fixes in default behavior and/or documentation is definitely needed. The external_ges feature does help, in a way, but is poorly communicated to new users, and moreover does not respect the difference between external DTD subsets and external general entities (there's a reason "DOCTYPE" isn't spelled "ENTITY"). The default behavior is not well documented, and the constraining behavior of DTDs is frequently unnecessary. Either a user should have to explicitly enable validation, or it should be irrevocably obvious to a user that validation is the default behavior, and in both cases it should be blatantly documented that validation may cause network side effects. I think the input has been reasonable all around, and yet I find it rather insane that this issue didn't eventually at least result in a documentation fix, thanks to what looks like push-back for push-back's sake, though I will gladly admit the conclusion that it was underspecified is entirely valid. Anyway, further info in the new issue... |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: