Author exarkun
Recipients ajaksu2, akuchling, damien, exarkun, loewis, pboddie, vdupras
Date 2009-02-03.22:02:11
Message-id <1233698537.63.0.989341435073.issue2124@psf.upfronthosting.co.za>
In-reply-to
Content
> It's indeed possible to provide that as a third-party module; one
> would have to implement an EntityResolver, and applications would
> have to use it. If there was a need for such a thing, somebody would
> have done it years ago.

I don't think this is true, for several reasons.

First, most people never notice that they are writing or using an
application which has this behavior.  This is because the behavior is
transparent in almost all cases, manifesting only as a slowdown.  Often,
no one is paying close attention to whether a function takes 0.1s or
0.5s.  So code gets written which fetches resources from the network by
accident.  Similarly, users generally don't have any idea that this kind
of defect is possible, or they don't think it's unusual behavior.  In
general, they're not equipped to understand why this is a bad thing.  At
best, they may decide a program is slow and be upset, but out of the
myriad reasons a program might be slow, they have no particular reason
to settle on this one as the real cause.

Second, it is *difficult* to implement the non-network behavior. 
Seriously, seriously difficult.  The documentation for these APIs is
obscure and incomplete in places.  It takes a long time to puzzle out
what it means and how to achieve the desired behavior.  I wouldn't be
surprised if many people simply gave up and either switched to another
parser or decided they could live with the slowdown (perhaps not
realizing that it could be arbitrarily long and might add a network
dependency to a program which doesn't already have one).

Third, there are several pitfalls on the way to a correct implementation
of the non-network behavior which may lead a developer to decide they
have succeeded when they have actually failed.  The most obvious is that
simply turning off the external-general-entities feature appears to
solve the problem but actually changes the parser's behavior so that it
will silently drop named character entities.  This is quite surprising
behavior to anyone who hasn't spent a lot of time with the XML
specification.
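
A minimal sketch of that pitfall, assuming Python 3's xml.sax with the
bundled expat parser.  The sample document and the Recorder class are
made up for illustration; only the xml.sax calls are real API.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical sample input: an XHTML DOCTYPE plus one named entity.
DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b"<html>a&nbsp;b</html>"
)


class Recorder(ContentHandler):
    """Collects character data and any entities the parser skipped."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped = []

    def characters(self, content):
        self.text.append(content)

    def skippedEntity(self, name):
        # Only a handler that overrides this method ever learns that
        # an entity reference was dropped.
        self.skipped.append(name)


def parse_without_external_entities(data):
    parser = xml.sax.make_parser()
    # No network access: the external DTD subset is never read.
    parser.setFeature(feature_external_ges, False)
    recorder = Recorder()
    parser.setContentHandler(recorder)
    parser.parse(io.BytesIO(data))
    return "".join(recorder.text), recorder.skipped


text, skipped = parse_without_external_entities(DOC)
print(repr(text), skipped)
```

No error is raised for the undefined entity.  A handler that does not
override skippedEntity sees only the surrounding text and has no
indication that anything is missing.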

So I think it would be a significant improvement if there were a simple,
documented way to switch from network retrieval to local retrieval from
a cache.  I also think that the current default behavior is wrong.  The
default should not be to go out to the network, even if there is a
well-behaved HTTP caching client involved.  So the current behavior
should be deprecated.  After a sufficient period of time, the local-only
behavior should be made the default.  I don't see any problem with
making it easy to re-enable the old behavior, though.
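
A rough sketch of what local retrieval could look like with the
existing EntityResolver hook, not a proposal for the final API.  The
LOCAL_DTDS table, its one-line stand-in DTD, and the LocalOnlyResolver
name are assumptions for illustration; a real cache would hold the full
DTD files.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical cache keyed by public identifier.  The value here is a
# one-entity stand-in, not the real XHTML DTD.
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": b'<!ENTITY nbsp "&#160;">',
}

# Hypothetical sample input referencing the cached DTD.
XHTML_DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    b"<html>a&nbsp;b</html>"
)


class LocalOnlyResolver(EntityResolver):
    """Answers every external entity request from LOCAL_DTDS."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setPublicId(publicId)
        # Always attach a byte stream so the parser never opens the
        # system identifier itself.  Unknown identifiers get an empty
        # stream instead of a network fetch.
        source.setByteStream(io.BytesIO(LOCAL_DTDS.get(publicId, b"")))
        return source


class TextCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtds(data):
    parser = xml.sax.make_parser()
    # The resolver is only consulted while this feature is enabled.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalOnlyResolver())
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return "".join(collector.chunks)


resolved_text = parse_with_local_dtds(XHTML_DOC)
print(repr(resolved_text))
```

Returning an empty stream for unknown identifiers is a design choice in
this sketch.  Raising an exception instead would make a missing cache
entry visible rather than silent.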

> -1 on issuing a warning. I really cannot see much of a problem in
> this entire issue. XML was designed to "be straightforwardly usable
> over the Internet" (XML rec., section 1.1), and this issue is a
> direct consequence of that design decision. You might just as well
> warn people against using XML in the first place.

Quoting part of the XML design goals isn't a strong argument for the
current behavior.  Transparently requesting network resources in order
to process local data isn't a necessary consequence of the
"straightforwardly usable over the Internet" goal.  Allowing this
behavior to be explicitly enabled, but not enabled by default, easily
meets this goal.  Straightforwardly supporting a local cache of DTDs is
even better, since it improves application performance and removes a
large number of security concerns.  With the general disfavor of DTDs
(in favor of other validation techniques, such as RELAX NG) and the
general disfavor of named character entities (basically only XHTML uses
them), I find it extremely difficult to justify Python's current default
behavior.
History
Date                 User     Action  Args
2009-02-03 22:02:17  exarkun  set     recipients: + exarkun, loewis, akuchling, pboddie, ajaksu2, vdupras, damien
2009-02-03 22:02:17  exarkun  set     messageid: <1233698537.63.0.989341435073.issue2124@psf.upfronthosting.co.za>
2009-02-03 22:02:14  exarkun  link    issue2124 messages
2009-02-03 22:02:11  exarkun  create