
Author akuchling
Recipients akuchling
Date 2008-02-15.15:15:01
Message-id <1203088508.92.0.545443917543.issue2124@psf.upfronthosting.co.za>
Content
Here's a simple test to demonstrate the problem:

from xml.sax import make_parser
from xml.sax.saxutils import prepare_input_source
parser = make_parser()
inp = prepare_input_source('file:file.xhtml')
parser.parse(inp)

file.xhtml contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />

If you insert a debug print into saxutils.prepare_input_source, in the
branch that uses urllib.urlopen(), you can see the above list of inputs
being accessed: the XHTML 1.1 DTD is nicely modular and pulls in all
those other files.
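
A similar trace can be had without patching saxutils and without touching
the network. This is only a rough sketch in Python 3, assuming the
expat-based parser that make_parser() normally returns; RecordingResolver
and record_requests are made-up names for illustration. The resolver logs
each entity the parser asks for and hands back an empty stream instead of
opening the URL:

```python
import io
from xml.sax import make_parser
from xml.sax.handler import EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />
"""


class RecordingResolver(EntityResolver):
    """Hypothetical helper: log each requested entity, fetch nothing."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append((publicId, systemId))
        # Hand back an empty stream so no file or URL is opened.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


def record_requests(data):
    resolver = RecordingResolver()
    parser = make_parser()
    # Set explicitly; the default is not the same in every Python version.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.parse(io.BytesIO(data))
    return resolver.requested


for public_id, system_id in record_requests(XHTML):
    print(public_id, system_id)
```

Because the DTD itself is never read here, only the top-level DTD request
shows up; the modules it would pull in are not requested.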

I don't see a good way to fix this without breaking backward
compatibility to some degree.  The external-general-entities feature
defaults to 'on', which enables this fetching; we could change the
default to 'off', which would save the parsing effort, but would also
mean that entities like &eacute; would not be defined.
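
In the meantime, a caller who doesn't need the DTD can switch the feature
off per parser. A minimal sketch in Python 3, again assuming the expat-based
parser; parse_without_external_entities is a hypothetical helper, and the
resolver is there only to show that it is never consulted:

```python
import io
from xml.sax import make_parser
from xml.sax.handler import EntityResolver, feature_external_ges

XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />
"""


def parse_without_external_entities(data):
    """Parse data with external general entities off; return resolver calls."""
    calls = []

    class Resolver(EntityResolver):
        def resolveEntity(self, publicId, systemId):
            # Only reached if the parser decides to load an external entity.
            calls.append(systemId)
            return systemId

    parser = make_parser()
    # The feature discussed above: False means the DTD is not fetched.
    parser.setFeature(feature_external_ges, False)
    parser.setEntityResolver(Resolver())
    parser.parse(io.BytesIO(data))
    return calls


print(parse_without_external_entities(XHTML))
```

The cost is the one described above: anything the DTD would have declared,
&eacute; included, is unknown to the parser.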

If we had catalog support, we could ship the XHTML 1.1 DTDs and any
other DTDs of wide usage, but we don't.
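
An application can approximate a catalog today with its own EntityResolver.
The following is a rough sketch in Python 3, not a proposal for the stdlib
API: LOCAL_DTDS, LocalCatalogResolver, TextCollector and
parse_with_local_dtds are hypothetical names, and the one-line DTD is a
stand-in for the real XHTML 1.1 files, which are not bundled here.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Stand-in for bundled DTDs, keyed by public ID (an assumption for this
# sketch; a real catalog would point at the full W3C files on disk).
LOCAL_DTDS = {
    "-//W3C//DTD XHTML 1.1//EN": b'<!ENTITY eacute "&#233;">\n',
}

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><p>caf&eacute;</p></html>
"""


class LocalCatalogResolver(EntityResolver):
    def resolveEntity(self, publicId, systemId):
        # Unknown public IDs get an empty stream, so nothing is fetched.
        data = LOCAL_DTDS.get(publicId, b"")
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(data))
        return source


class TextCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtds(data):
    collector = TextCollector()
    parser = make_parser()
    # Keep entity loading on, but route every request through the resolver.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalCatalogResolver())
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return "".join(collector.chunks)


print(parse_with_local_dtds(DOC))
```

This keeps &eacute; defined without any network access, at the price of
every application maintaining its own mapping.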