Message 62431 - Python tracker


Author	akuchling
Recipients	akuchling
Date	2008-02-15.15:15:01
Message-id	<1203088508.92.0.545443917543.issue2124@psf.upfronthosting.co.za>
In-reply-to

Content
Here's a simple test to demonstrate the problem:

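from xml.sax import make_parser
from xml.sax.saxutils import prepare_input_source

parser = make_parser()
# Wrap the file: URL in an InputSource the parser can read from.
inp = prepare_input_source('file:file.xhtml')
# Parsing also resolves the external DTD named in the DOCTYPE.
parser.parse(inp)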

file.xhtml contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />

If you insert a debug print into saxutils.prepare_input_source,
in the branch which uses urllib.urlopen(), you get the list of inputs
shown above: the parser fetches the XHTML 1.1 DTD, which is nicely
modular and pulls in all those other files.

I don't see a good way to fix this without breaking backward
compatibility to some degree.  The external-general-entities feature
defaults to 'on', which enables this fetching; we could change the
default to 'off', which would save the parsing effort, but would also
mean that entities like &eacute; would no longer be defined.
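
A minimal sketch of the 'off' option on a per-parser basis, assuming the
expat-based reader that make_parser() returns by default.  The names
RecordingResolver and parse_without_external_entities are hypothetical and
exist only for illustration; the resolver is there to show that nothing is
requested once the feature is disabled.  (Recent Python 3 releases already
default this feature to off, so the explicit call matters mainly on older
versions.)

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

XHTML_DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />
"""


class RecordingResolver(EntityResolver):
    """Hypothetical helper: records every system ID the parser asks for."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        # Hand back an empty stream so nothing is ever fetched over the network.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


def parse_without_external_entities(data):
    """Parse `data` with external general entities disabled.

    Returns the list of system IDs the parser tried to resolve.
    """
    parser = make_parser()
    # Must be set before parsing starts.
    parser.setFeature(feature_external_ges, False)
    resolver = RecordingResolver()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(data))
    return resolver.requested
```

With the feature off, the resolver should never be consulted, so the DTD
and its modules are not downloaded.  The cost is the one described above:
entities declared only in the external DTD are no longer available to the
document.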

If we had catalog support, we could ship the XHTML 1.1 DTDs and any
other DTDs of wide usage, but we don't.
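
Short of real catalog support, an application can approximate it with its
own EntityResolver.  The sketch below is only an illustration of that idea,
not a proposed fix: LocalCatalogResolver, TextCollector, LOCAL_CATALOG and
parse_with_local_catalog are hypothetical names, and the one-line DTD is a
stand-in for the real XHTML 1.1 DTD files that a catalog would ship.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical stand-in for a shipped DTD; a real catalog would map the
# public ID to the full XHTML 1.1 DTD stored locally.
LOCAL_CATALOG = {
    "-//W3C//DTD XHTML 1.1//EN": b'<!ENTITY eacute "&#233;">\n',
}

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">caf&eacute;</html>
"""


class LocalCatalogResolver(EntityResolver):
    """Serve external entities from an in-memory mapping keyed by public ID."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.served = []

    def resolveEntity(self, publicId, systemId):
        self.served.append(publicId)
        source = InputSource(systemId)
        # Unknown public IDs get an empty stream rather than a network fetch.
        source.setByteStream(io.BytesIO(self.catalog.get(publicId, b"")))
        return source


class TextCollector(ContentHandler):
    """Collect character data so entity expansion can be inspected."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_catalog(data, catalog=LOCAL_CATALOG):
    """Parse `data`, resolving external entities from `catalog` only."""
    parser = make_parser()
    # Keep external general entities on, but route them through the resolver.
    parser.setFeature(feature_external_ges, True)
    resolver = LocalCatalogResolver(catalog)
    collector = TextCollector()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return "".join(collector.chunks), resolver.served
```

Calling parse_with_local_catalog(DOC) returns the collected text together
with the public IDs that were looked up.  This keeps the feature enabled,
so entity declarations still reach the parser, while the lookup never
leaves the process.  It does push the burden of supplying the DTDs onto
each application, which is the gap that built-in catalog support would
close.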
