Issue 2174: xml.sax.xmlreader does not support the InputSource protocol

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46427

classification

Title:	xml.sax.xmlreader does not support the InputSource protocol
Type:	behavior	Stage:	resolved
Components:	Library (Lib), XML	Versions:	Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:	17089	Superseder:
Assigned To:	fdrake	Nosy List:	fdrake, serhiy.storchaka, ygale
Priority:	low	Keywords:

Created on 2008-02-24 13:52 by ygale, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg62900 - (view)	Author: Yitz Gale (ygale)	Date: 2008-02-24 13:52
In the documentation for xml.sax.xmlreader.InputSource objects (section 8.12.4 of the Library Reference) we find that users of InputSource objects should use the following sequence to get their input data:
1. If the InputSource has a character stream, use that.
2. Otherwise, if the InputSource has a byte stream, use that.
3. Otherwise, open a URI connection to the system ID.
The parse() method of IncrementalParser skips step 1.
In addition, we need to add a method getSourceEncoding() to the XMLReader interface; if non-null, it will indicate to the parser that the input is a byte stream in the given encoding. The documentation should indicate what the parser should do if the XML itself announces that its encoding is something else. I propose that the parser should be required to raise an error in that case. See also #1483.
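For readers following along, the three-step lookup order quoted in this message can be sketched roughly as below. This is only an illustration, not the code in the standard library: the helper name open_input and its (stream, is_text) return value are hypothetical, and using urllib.request.urlopen for step 3 assumes the system ID is a full URL.

```python
import io
from urllib.request import urlopen
from xml.sax.xmlreader import InputSource


def open_input(source):
    """Hypothetical helper: return (stream, is_text) for an InputSource.

    Follows the documented order: character stream first, then byte
    stream, then a connection opened from the system ID.
    """
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        # Step 1: already-decoded text takes precedence.
        return char_stream, True

    byte_stream = source.getByteStream()
    if byte_stream is not None:
        # Step 2: raw bytes; the parser must work out the encoding.
        return byte_stream, False

    # Step 3: assumption for this sketch, the system ID is a URL.
    return urlopen(source.getSystemId()), False
```

A caller would check is_text to decide whether the data still needs decoding; the complaint in this message is that parse() never reaches the equivalent of step 1.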
msg62904 - (view)	Author: Yitz Gale (ygale)	Date: 2008-02-24 14:09
See also: #1483 and #2175.
msg62907 - (view)	Author: Yitz Gale (ygale)	Date: 2008-02-24 14:18
Hmm. When getSourceEncoding() is None, there needs to be some way for the parser to distinguish between the cases where it is getting pre-decoded Unicode through a character stream, or where it is getting a byte stream with an unspecified encoding. In the latter case, it will have to look in the XML for an encoding declaration, or use UTF-8 by default. Note that expat can only handle the latter case.
msg62909 - (view)	Author: Yitz Gale (ygale)	Date: 2008-02-24 14:53
So I think there are two possibilities:
1. Use a special value for getSourceEncoding(), like "unicode", to indicate that this is a unicode character stream and not a byte stream.
2. Provide yet another method in the XMLReader interface: sourceIsCharacterStream(), returning a bool.
There is a more drastic option:
3. Since expat doesn't support this stuff anyway, and perhaps not too many people have written parsers that do support it, dumb down the InputSource interface. Specifically, deprecate setCharacterStream(), getCharacterStream(), setEncoding() and getEncoding(), none of which are used by expat. Parsers should read the XML from the byte stream and use that to determine the encoding. That may upset some implementors of XML libraries though. They would each have to go to some trouble to provide their own proprietary and possibly incompatible mechanisms for this, if they need it.
Perhaps a compromise fourth path would be to have subclasses of InputSource for the two cases of character stream and byte stream.
msg62940 - (view)	Author: Yitz Gale (ygale)	Date: 2008-02-24 21:16
A subclass of XMLReader would be needed, not of InputSource.
msg64644 - (view)	Author: Fred Drake (fdrake)	Date: 2008-03-28 18:42
It's certainly arguable that the current behavior is a bug, though I suspect it shouldn't be considered major since I've not seen any prior complaints about this. It should be easy to fix the bug you describe by taking the character stream and encoding it before feeding it to the XML parser; Expat can certainly be forced to take a known encoding, ignoring what's in the XML declaration. On the other hand, it's not at all clear that changing this is worthwhile. This API borrows quite literally from the Java SAX APIs; perhaps this separation of the character stream from the byte stream makes sense for some of the Java XML parsers, but I don't know that there are any Python parsers that benefit from that separation.
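The fix suggested in this message (encode the character stream, then force Expat to use that encoding regardless of the XML declaration) might look roughly like the sketch below. It is an illustration under stated assumptions, not the patch that was eventually applied: the function name parse_character_stream, the on_start callback and the 4096-character chunk size are all hypothetical, while expat.ParserCreate(encoding=...) and Parse() are the real pyexpat calls.

```python
import io
from xml.parsers import expat


def parse_character_stream(char_stream, on_start):
    """Hypothetical helper: feed already-decoded text to Expat.

    The text is re-encoded as UTF-8, and the parser is created with an
    explicit encoding so that it overrides whatever the XML declaration
    claims.
    """
    parser = expat.ParserCreate(encoding="utf-8")
    parser.StartElementHandler = on_start

    # Read the text in chunks until read() returns an empty string.
    for chunk in iter(lambda: char_stream.read(4096), ""):
        parser.Parse(chunk.encode("utf-8"), False)

    # Signal end of input so Expat can report incomplete documents.
    parser.Parse(b"", True)
```

For example, a document whose declaration says encoding="iso-8859-1" but which arrives as a Python str can be passed in via io.StringIO; the declaration is then ignored in favour of the forced UTF-8 encoding.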
msg239312 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-26 07:29
Issue2175 has a patch that covers all three issues: issue1483, issue2174 and issue2175. I am unsure which parts of the patch are worth applying to maintained releases.
msg239939 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-04-02 18:12
Fixed in issue2175 (in 3.5 only).
msg240171 - (view)	Author: Fred Drake (fdrake)	Date: 2015-04-06 19:18
Given that this has languished this long, patching historical releases seems pointless.

History
Date	User	Action	Args
2022-04-11 14:56:31	admin	set	github: 46427
2015-04-06 19:27:13	Arfrever	set	components: + XML
2015-04-06 19:26:18	Arfrever	set	stage: resolved resolution: fixed components: + Library (Lib), - Documentation, XML versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2
2015-04-06 19:18:26	fdrake	set	status: open -> closed messages: + msg240171
2015-04-02 18:12:48	serhiy.storchaka	set	messages: + msg239939
2015-03-26 07:29:04	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg239312
2013-01-31 10:02:57	serhiy.storchaka	set	dependencies: + Expat parser parses strings only when XML encoding is UTF-8
2010-06-09 21:59:34	terry.reedy	set	versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2008-03-28 18:42:40	fdrake	set	priority: normal -> low messages: + msg64644 components: - Library (Lib), Unicode
2008-03-20 02:52:31	jafo	set	priority: normal assignee: fdrake nosy: + fdrake
2008-02-24 21:16:40	ygale	set	messages: + msg62940
2008-02-24 14:53:29	ygale	set	messages: + msg62909
2008-02-24 14:18:28	ygale	set	messages: + msg62907
2008-02-24 14:09:57	ygale	set	messages: + msg62904 components: + Unicode
2008-02-24 13:52:31	ygale	create