xml.sax and xml.dom fetch DTDs by default #46377

Closed
akuchling opened this issue Feb 15, 2008 · 31 comments
Labels
performance Performance or resource usage topic-XML

Comments

@akuchling
Member

BPO 2124
Nosy @loewis, @akuchling, @devdanzin

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2012-01-13.21:29:50.177>
created_at = <Date 2008-02-15.14:42:54.826>
labels = ['expert-XML', 'performance']
title = 'xml.sax and xml.dom fetch DTDs by default'
updated_at = <Date 2013-02-28.01:32:10.423>
user = 'https://github.com/akuchling'

bugs.python.org fields:

activity = <Date 2013-02-28.01:32:10.423>
actor = 'rsandwick3'
assignee = 'none'
closed = True
closed_date = <Date 2012-01-13.21:29:50.177>
closer = 'loewis'
components = ['XML']
creation = <Date 2008-02-15.14:42:54.826>
creator = 'akuchling'
dependencies = []
files = []
hgrepos = []
issue_num = 2124
keywords = []
message_count = 31.0
messages = ['62430', '62431', '62446', '62452', '62475', '62478', '62483', '62486', '62512', '62515', '62787', '62790', '62975', '81080', '81091', '81102', '81103', '81104', '81106', '81108', '81111', '81114', '81122', '110894', '110899', '151191', '151195', '151196', '151197', '151204', '183197']
nosy_count = 10.0
nosy_names = ['loewis', 'akuchling', 'pboddie', 'exarkun', 'ajaksu2', 'vdupras', 'damien', 'BreamoreBoy', 'Brian.Visel', 'rsandwick3']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = None
status = 'closed'
superseder = None
type = 'resource usage'
url = 'https://bugs.python.org/issue2124'
versions = ['Python 3.1', 'Python 2.7', 'Python 3.2']

@akuchling
Member Author

The W3C posted an item at
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
describing how their DTDs are being fetched up to 130M times per day.

The Python parsers are part of the problem, as
noted by Paul Boddie on the python-advocacy list:

There are two places which stand out:

xml/dom/xmlbuilder.py
xml/sax/saxutils.py

What gives them away as the cause of the described problem is that they
are both fetching things given as "system identifiers" - the URL-like
references you get in the document type declaration at the top of an XML
document.

If you then put some trace statements into the code and try to parse
something using, for example, the xml.sax API, it becomes evident that by
default the parser attempts to fetch lots of DTD-related resources, not
helped by the fact that stuff like XHTML is now "modular" and thus employs
lots of separate files in the DTD. This is obvious because you get
something like this printed to the terminal:

saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod

Of course, the "best practice" with APIs like SAX is that you define your
own resolver or handler classes which don't go and fetch DTDs from the W3C
all the time, but this isn't the "out of the box" behaviour. Instead,
implementers have chosen the most convenient behaviour, which arguably
involves the least effort in telling people how to get hold of DTDs so
that they may validate their documents, but which isn't necessarily the
"right thing" in terms of network behaviour. Naturally, since defining
specific resolvers/handlers involves a lot of boilerplate (and you should
try it in Java!), a lot of developers just incur the penalty of the
default behaviour instead of considering the finer points of the various
W3C specifications (which is never really any fun).

Anyway, I posted a comment saying much the same on the blog referenced at
the start of this thread, but we should be aware that this is default
standard library behaviour, not rogue application developer behaviour.

@akuchling
Member Author

Here's a simple test to demonstrate the problem:

from xml.sax import make_parser
from xml.sax.saxutils import prepare_input_source
parser = make_parser()
inp = prepare_input_source('file:file.xhtml')
parser.parse(inp)

file.xhtml contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"\>
<html xmlns="http://www.w3.org/1999/xhtml" />

If you insert a debug print into saxutils.prepare_input_source, in the
branch which uses urllib.urlopen(), you get the above list of inputs
accessed: the XHTML 1.1 DTD, which is nicely modular and pulls in all
those other files.

I don't see a good way to fix this without breaking backward compatibility
to some degree. The external-general-entities feature defaults to 'on',
which enables this fetching; we could change the default to 'off', which
would save the parsing effort, but would also mean that entities like
&eacute; weren't defined.
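
For reference, a minimal sketch of opting out on the caller's side, using
the feature constant the stdlib already exposes (the trade-off described
above still applies: named character entities stop being resolved):

from xml.sax import make_parser
from xml.sax.handler import feature_external_ges

parser = make_parser()
# Stop the parser from fetching external general entities, i.e. the DTD
# and the modules it pulls in; &eacute; and friends will no longer be
# defined.
parser.setFeature(feature_external_ges, False)
parser.parse('file.xhtml')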

If we had catalog support, we could ship the XHTML 1.1 DTDs and any
other DTDs of wide usage, but we don't.

@akuchling akuchling added the performance Performance or resource usage label Feb 15, 2008
@loewis
Mannequin

loewis mannequin commented Feb 15, 2008

On systems that support catalogs, the parsers should be changed to
support public identifiers, using local copies of these DTDs.

However, I really see no way the library could avoid resolving the
DTDs altogether. The blog is WRONG in claiming that the system
identifier is not a downloadable URL but a mere identification (the
namespace is a mere identification, but our parsers would never try to
download anything). The parser needs the DTDs in case external entities
occur in the document (of course, the download could be delayed until the
first reference occurs).

@vdupras
Mannequin

vdupras mannequin commented Feb 16, 2008

The blog page talked about 503 responses. What about issuing a warning
on these responses? Maybe it would be enough to make developers aware of
the problem?

Or what about in-memory caching of the DTDs? Sure, it wouldn't be as
good as a catalog or anything, but it might help for the worst cases?

@pboddie
Mannequin

pboddie mannequin commented Feb 16, 2008

(Andrew, thanks for making a bug, and apologies for not reporting this
in a timely fashion.)

Although an in-memory caching solution might seem to be sufficient, if
one considers things like CGI programs, it's clear that such programs
aren't going to benefit from such a solution. It would be interesting to
know what widely deployed software does use the affected parsers,
though. A Google code search might be helpful.

I think that the nicest compatible solution would be to have some kind
of filesystem cache for the downloaded resources, but I don't recall any
standard library caching solution of this nature. Things like being able
to write to a known directory, perhaps using the temporary file APIs
which should work even as a "very unprivileged" user, would be useful
properties of such a solution.

@akuchling
Member Author

What if we just tried to make the remote accesses apparent to the user,
by making a warnings.warn() call in the default implementation that could
be deactivated by a setFeature() call? With a warning, code will continue
to run, but the user will at least be aware they're hitting a remote
resource and can think about it, even if they decide to suppress the
warning.

We should also modify the docs to point this out; it's not likely to
help very much, but it's still worth doing.

@loewis
Mannequin

loewis mannequin commented Feb 17, 2008

-1 on issuing a warning. I really cannot see much of a problem in this
entire issue. XML was designed to "be straightforwardly usable over the
Internet" (XML rec., section 1.1), and this issue is a direct
consequence of that design decision. You might just as well warn people
against using XML in the first place.

@vdupras
Mannequin

vdupras mannequin commented Feb 17, 2008

-1 on the systematic warnings too, but what I was talking about is a
warning that would say "The server you are trying to fetch your resource
from is refusing the connection. Don't cha think you misbehave?" only on
5xx and 4xx responses, not on every remote resource fetching.

@devdanzin
Mannequin

devdanzin mannequin commented Feb 17, 2008

Martin, I agree that simply not resolving DTDs is an unreasonable
request (and said so in the blog post). But IMHO there are lots of
possible optimizations, and the most valuable would be those that are
darn easy for newcomers to understand and use.

In Python, a winning combo would be an arbitrary (and explicit) FS
"dtdcache" that people could use with a simple drop-in import (from a
third-party module?). Perhaps the cache lives in a pickled dictionary
with IDs, checksums and DTDs. It could also be a sqlite DB, if updating
the dict becomes problematic.

In that scenario, AMK could save later W3C hits with:

from xml.sax import make_parser
from dtdcache.sax.saxutils import prepare_input_source # <- dtdcache
parser = make_parser()
inp = prepare_input_source('file:file.xhtml', cache="/tmp/xmlcache")

It might be interesting to have read-only, force-write and read-write
modes. Not sure how to map that onto EntityResolver and DTD consumers
(I'm no XML user myself).

Regarding the std-lib, I believe effective caching hooks for DTDs trump
implementing in-memory or sqlite/FS. IMNSHO, correct, accessible support
for catalogs shouldn't be the only change, as caching should give better
performance on both ends.

@loewis
Mannequin

loewis mannequin commented Feb 18, 2008

In Python, a winning combo would be an arbitrary (and explicit) FS
"dtdcache" that people could use with simple a drop-in import (from a
third-party module?).

It's indeed possible to provide that as a third-party module; one would
have to implement an EntityResolver, and applications would have to use
it. If there was a need for such a thing, somebody would have done it
years ago.

It might be interesting to have read-only, force-write and read-write
modes.

Please take a look at catalogs - they are a read-only repository for
external entities (provided those entities have a public identifier,
which the W3C DTDs all have).

Regarding the std-lib, I believe effective caching hooks for DTDs trump
implementing in-memory or sqlite/FS.

So then nothing needs to be done - the hooks have been in place since
Python 2.0.
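
For concreteness, the hook in question is the EntityResolver interface.
A naive caching resolver might look like the sketch below (cache layout
and names are illustrative only; as the discussion further down shows,
relative system identifiers complicate this considerably):

import os
import urllib
from xml.sax.saxutils import prepare_input_source

class CachingResolver:
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def resolveEntity(self, publicId, systemId):
        # Serve a previously downloaded copy if we have one;
        # otherwise fetch it once and keep it on disk.
        local = os.path.join(self.cache_dir,
                             urllib.quote(systemId, safe=''))
        if not os.path.exists(local):
            urllib.urlretrieve(systemId, local)
        return prepare_input_source(local)

An application would then call
parser.setEntityResolver(CachingResolver('/tmp/dtdcache')) before parsing.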

@akuchling
Member Author

The solution of adding caching, If-Modified-Since, etc. is a good one,
but I quail in fear at the prospect of expanding the saxutils resolver
into a fully caching HTTP agent that uses a cache across processes. We
should really be encouraging people to use more capable libraries such
as httplib2 (http://code.google.com/p/httplib2/), but this is slightly
at war with the batteries-included philosophy.

So, I propose we:

  • add warnings to the urllib, urllib2, and saxutils module docs that
    parsing can retrieve arbitrary resources over the network, and
    encourage the user to use a smarter library such as httplib2.
  • update the urllib2 HOWTO to mention this.

I'm willing to do the necessary writing.

@akuchling akuchling self-assigned this Feb 23, 2008
@loewis
Mannequin

loewis mannequin commented Feb 23, 2008

I may have lost track somewhere: what does urllib* have to do with this
issue?

@akuchling
Member Author

I mention urllib because, as noted earlier in the discussion, code that
fetches resources over HTTP should really be caching (looking at the
Expires header, sending an ETag and an If-Modified-Since header, etc.).
It's difficult for us to add these features to the urllib/urllib2 APIs,
so we should make users aware of other libraries that provide them, so
that users will use such a library instead of just writing to the
lower-level APIs provided with Python.

@damien
Mannequin

damien mannequin commented Feb 3, 2009

I just ran into this problem. I was very surprised to realize that
every time the code I was working on parsed a docbook file, it generated
several HTTP requests to oasis-open.org to fetch the docbook DTDs.

I attempted to fix the issue by adding an EntityResolver that would
cache fetched DTDs. (The documentation on how to do this is not, by the
way, very clear.)

Unfortunately, this proved not to be possible. The main docbook DTD
includes subsidiary DTDs using relative system identifiers. For
example, the main DTD at:

publicId: -//OASIS//DTD DocBook V4.1//EN
systemId: http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd

...includes this second DTD:

publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN
systemId: dbnotnx.mod

The EntityResolver's resolveEntity() method is not, however, passed the
base path to resolve the relative systemId from.

This makes it impossible to properly implement a parser which caches
fetched DTDs.

@loewis
Mannequin

loewis mannequin commented Feb 3, 2009

The EntityResolver's resolveEntity() method is not, however, passed the
base path to resolve the relative systemId from.

This makes it impossible to properly implement a parser which caches
fetched DTDs.

I don't think this is actually the case. Did you try calling getSystemId
on the locator?

@damien
Mannequin

damien mannequin commented Feb 3, 2009

On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote:

I don't think this is actually the case. Did you try calling getSystemId
on the locator?

EntityResolver.resolveEntity() is called with the publicId and systemId as
arguments. It does not receive a locator.

        - Damien

@exarkun
Mannequin

exarkun mannequin commented Feb 3, 2009

Though it's inconvenient to do so, you can arrange to have the locator
available from the entity resolver. The content handler's
setDocumentLocator method will be called early on with the locator
object. So you can give your entity resolver a reference to your
content handler and save a reference to the document locator in the
content handler. Then in the entity resolver's resolveEntity method you
can reach over into the content handler and grab the document locator to
call its getSystemId method.
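
A sketch of that arrangement (the names here are illustrative, not a
stdlib API; this resolver only resolves relative system ids against the
current document, but the same hook is where a cached local copy would
be returned):

from urlparse import urljoin
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import prepare_input_source

class LocatorAwareHandler(ContentHandler):
    # Keep the locator the parser hands us so the resolver can reach it.
    def setDocumentLocator(self, locator):
        self.locator = locator

class RelativeResolver:
    def __init__(self, handler):
        self.handler = handler

    def resolveEntity(self, publicId, systemId):
        # Resolve a relative systemId against the system id of the
        # document being parsed, as reported by the locator.
        base = self.handler.locator.getSystemId()
        return prepare_input_source(urljoin(base, systemId))

handler = LocatorAwareHandler()
parser = make_parser()
parser.setContentHandler(handler)
parser.setEntityResolver(RelativeResolver(handler))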

Note that you have to be careful with the InputSource objects you return
from resolveEntity. I wasn't aware of this before (and perhaps I've
misinterpreted what I observed), but I just noticed that if you return an
InputSource based on a file object, the file object's name will be used
as the document id! This is not at all what you want. InputSource has a
setSystemId method, but even if you call it before you call
setByteStream, the system id will be the name of the file object passed
to setByteStream. Perhaps calling these two methods in the opposite
order will fix this; I'm not sure, I haven't tried.

@loewis
Mannequin

loewis mannequin commented Feb 3, 2009

EntityResolver.resolveEntity() is called with the publicId and systemId as
arguments. It does not receive a locator.

Sure. But ContentHandler.setDocumentLocator receives it, and you are
supposed to store it for the entire parse, to always know what entity
is being processed if you want to.

@exarkun
Mannequin

exarkun mannequin commented Feb 3, 2009

It's indeed possible to provide that as a third-party module; one
would have to implement an EntityResolver, and applications would
have to use it. If there was a need for such a thing, somebody would
have done it years ago.

I don't think this is true, for several reasons.

First, most people never notice that they are writing or using an
application which has this behavior. This is because the behavior is
transparent in almost all cases, manifesting only as a slowdown. Often,
no one is paying close attention to whether a function takes 0.1s or
0.5s. So code gets written which fetches resources from the network by
accident. Similarly, users generally don't have any idea that this kind
of defect is possible, or they don't think it's unusual behavior. In
general, they're not equipped to understand why this is a bad thing. At
best, they may decide a program is slow and be upset, but out of the
myriad reasons a program might be slow, they have no particular reason
to settle on this one as the real cause.

Second, it is *difficult* to implement the non-network behavior.
Seriously, seriously difficult. The documentation for these APIs is
obscure and incomplete in places. It takes a long time to puzzle out
what it means and how to achieve the desired behavior. I wouldn't be
surprised if many people simply gave up and either switched to another
parser or decided they could live with the slowdown (perhaps not
realizing that it could be arbitrarily long and might add a network
dependency to a program which doesn't already have one).

Third, there are several pitfalls on the way to a correct implementation
of the non-network behavior which may lead a developer to decide they
have succeeded when they have actually failed. The most obvious is that
simply turning off the external-general-entities feature appears to
solve the problem but actually changes the parser's behavior so that it
will silently drop named character entities. This is quite surprising
behavior to anyone who hasn't spent a lot of time with the XML
specification.

So I think it would be a significant improvement if there were a simple,
documented way to switch from network retrieval to local retrieval from
a cache. I also think that the current default behavior is wrong. The
default should not be to go out to the network, even if there is a
well-behaved HTTP caching client involved. So the current behavior
should be deprecated. After a sufficient period of time, the local-only
behavior should be made the default. I don't see any problem with
making it easy to re-enable the old behavior, though.

-1 on issuing a warning. I really cannot see much of a problem in
this entire issue. XML was designed to "be straightforwardly usable
over the Internet" (XML rec., section 1.1), and this issue is a
direct consequence of that design decision. You might just as well
warn people against using XML in the first place.

Quoting part of the XML design goals isn't a strong argument for the
current behavior. Transparently requesting network resources in order
to process local data isn't a necessary consequence of the
"straightforwardly usable over the internet" goal. Allowing this
behavior to be explicitly enabled, but not enabled by default, easily
meets this goal. Straightforwardly supporting a local cache of DTDs is
even better, since it improves application performance and removes a
large number of security concerns. With the general disfavor of DTDs
(in favor of other validation techniques, such as RELAX NG) and the
general disfavor of named character entities (basically only XHTML uses
them), I find it extremely difficult to justify Python's current default
behavior.

@damien
Mannequin

damien mannequin commented Feb 3, 2009

On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote:

Sure. But ContentHandler.setDocumentLocator receives it, and you are
supposed to store it for the entire parse, to always know what entity
is being processed if you want to.

Where in the following sequence am I supposed to receive the document
locator?

parser = xml.sax.make_parser()
parser.setEntityResolver(CachingEntityResolver())
doc = xml.dom.minidom.parse('file.xml', parser)

The content handler is being created deep inside xml.dom. It does, in
fact, store the document locator, but not in any place that I can easily
access without breaking several layers of abstraction.

Or, as a more general question: How can I get a DOM tree that includes
external entities? If there's an easy way to do it, the documentation
does not make it clear at all.

@damien
Mannequin

damien mannequin commented Feb 3, 2009

I just discovered another really fun wrinkle in this.

Let's say I want to have my entity resolver return a reference to my
local copy of a DTD. I write:

    source = xml.sax.InputSource()
    source.setPublicId(publicId)
    source.setSystemId(systemId)
    source.setCharacterStream(file(path_to_local_copy))
    return source

This will appear to work.

However, the parser will still silently fetch the DTD over the network!
I needed to call source.setByteStream()--character streams are silently
ignored.
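
A version of the snippet that actually short-circuits the fetch, under
the same assumptions (path_to_local_copy is hypothetical; the file is
opened in binary mode so setByteStream receives bytes):

    source = xml.sax.InputSource()
    source.setPublicId(publicId)
    source.setSystemId(systemId)
    # setByteStream, not setCharacterStream: only the byte stream is
    # consulted, so this is what suppresses the network fetch.
    source.setByteStream(open(path_to_local_copy, 'rb'))
    return source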

I'd never have noticed this if I hadn't used strace on my process and
noticed a slew of recvfrom() calls that shouldn't have been there.

@loewis
Mannequin

loewis mannequin commented Feb 3, 2009

Where in the following sequence am I supposed to receive the document
locator?

parser = xml.sax.make_parser()
parser.setEntityResolver(CachingEntityResolver())
doc = xml.dom.minidom.parse('file.xml', parser)

This is DOM parsing, not SAX parsing.

The content handler is being created deep inside xml.dom. It does, in
fact, store the document locator, but not in any place that I can easily
access without breaking several layers of abstraction.

So break layers of abstraction, then. Or else, use dom.expatbuilder,
and ignore SAX/pulldom for DOM parsing.

Or, as a more general question: How can I get a DOM tree that includes
external entities? If there's an easy way to do it, the documentation
does not make it clear at all.

This tracker is really not the place to ask questions; use python-list
for that.

@damien
Mannequin

damien mannequin commented Feb 4, 2009

On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote:

This is DOM parsing, not SAX parsing.

  1. The title of this ticket begins with "xml.sax and xml.dom...".
  2. I am creating a SAX parser and passing it to xml.dom, which uses it.

So break layers of abstraction, then. Or else, use dom.expatbuilder,
and ignore SAX/pulldom for DOM parsing.

Is that really the answer?

Read the source code to xml.dom.*, and write hacks based on what I find
there? Note also that xml.dom.expatbuilder does not appear to be an
external API--there is no mention of it in the documentation for
xml.dom.*.

This tracker is really not the place to ask questions; use python-list
for that.

That was a rhetorical question.

The answer is, as best I can tell, "You can't do that."

@BreamoreBoy
Mannequin

BreamoreBoy mannequin commented Jul 20, 2010

Does anybody know if users are still experiencing problems with this issue?

@exarkun
Mannequin

exarkun mannequin commented Jul 20, 2010

Yes.

@akuchling akuchling removed their assignment Nov 12, 2010
@BrianVisel
Mannequin

BrianVisel mannequin commented Jan 13, 2012

..still an issue.

@loewis
Mannequin

loewis mannequin commented Jan 13, 2012

And my position still remains the same: this is not a bug. Applications affected by this need to use the APIs that are in place precisely to deal with this issue.

So I propose to close this report as invalid.

@BrianVisel
Mannequin

BrianVisel mannequin commented Jan 13, 2012

Of course, you can do as you like.

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/

@loewis
Mannequin

loewis mannequin commented Jan 13, 2012

Well, the issue is clearly underspecified, and different people read different things into it. I take your citation of the W3C blog entry to mean that you are asking for caching to be employed. I read the issue entirely differently: namely, that by default no attempt to download the DTD should be made, that the DOM loaders should provide better customization in that matter, or that catalogs should be used.

Given that the issue was underspecified to begin with, I'm now closing it. Anybody who still has an issue here, please open a new issue and report your specific problem, preferably also proposing a solution.

If you need to follow up to this message, please do so in private email (martin@v.loewis.de).

@loewis loewis mannequin closed this as completed Jan 13, 2012
@pboddie
Mannequin

pboddie mannequin commented Jan 13, 2012

Note that Python 3 provided a good opportunity for doing the minimal amount of
work here - just stop things from accessing remote DTDs - but I imagine that
even elementary standard library improvements of this kind weren't made (let
alone the more extensive standard library changes I advocated), so there's
going to be a backwards compatibility situation regardless of which Python
series is involved now, unfortunately.

@rsandwick3
Mannequin

rsandwick3 mannequin commented Feb 28, 2013

I have opened issue bpo-17318 to try to specify the problem better. While I do think that catalogs are the correct fix for the validation use case (and thus would like to see something more out-of-the-box in that vein), the real trouble is that users are often unaware that they're sending requests to DTD URIs, so some combination of fixes in default behavior and/or documentation is definitely needed.

The external_ges feature does help, in a way, but is poorly communicated to new users, and moreover does not respect the difference between external DTD subsets and external general entities (there's a reason "DOCTYPE" isn't spelled "ENTITY").

The default behavior is not well documented, and the constraining behavior of DTDs is frequently unnecessary. Either a user should have to explicitly enable validation, or it should be irrevocably obvious to a user that validation is the default behavior, and in both cases it should be blatantly documented that validation may cause network side effects. I think the input has been reasonable all around, and yet I find it rather insane that this issue didn't eventually at least result in a documentation fix, thanks to what looks like push-back for push-back's sake, though I will gladly admit the conclusion that it was underspecified is entirely valid.

Anyway, further info in the new issue...

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022