Title: xml.sax.xmlreader does not support the InputSource protocol
Type: behavior
Stage: resolved
Components: Library (Lib), XML
Versions: Python 3.5
Status: closed
Resolution: fixed
Dependencies: 17089
Superseder:
Assigned To: fdrake
Nosy List: fdrake, serhiy.storchaka, ygale
Priority: low
Keywords:

Created on 2008-02-24 13:52 by ygale, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg62900 - Author: Yitz Gale (ygale) Date: 2008-02-24 13:52
In the documentation for xml.sax.xmlreader.InputSource objects
(section 8.12.4 of the Library Reference) we find that
users of InputSource objects should use the following
sequence to get their input data:

1. If the InputSource has a character stream, use that.
2. Otherwise, if the InputSource has a byte stream, use that.
3. Otherwise, open a URI connection to the system ID.

The parse() method of IncrementalParser skips step 1.
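
For illustration, the three-step lookup order can be sketched as below. This is a minimal sketch, not the library's implementation: the helper name open_input and the (stream, is_text) return value are hypothetical, and step 3 assumes the system ID is a URL that urlopen can open.

```python
import io
from urllib.request import urlopen
from xml.sax.xmlreader import InputSource


def open_input(source):
    """Return (stream, is_text) using the documented lookup order.

    Hypothetical helper for illustration only.
    """
    # Step 1: prefer the character stream (already-decoded text).
    char_stream = source.getCharacterStream()
    if char_stream is not None:
        return char_stream, True

    # Step 2: otherwise fall back to the byte stream.
    byte_stream = source.getByteStream()
    if byte_stream is not None:
        return byte_stream, False

    # Step 3: otherwise open a connection to the system ID.
    # Assumption: the system ID is a URL; a bare file path would need open().
    return urlopen(source.getSystemId()), False


# Usage: a source that has both streams yields the character stream.
src = InputSource()
src.setByteStream(io.BytesIO(b"<b/>"))
src.setCharacterStream(io.StringIO("<a/>"))
stream, is_text = open_input(src)
```

The is_text flag is one way to let a caller tell decoded text from raw bytes, which is the distinction discussed later in this thread.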

In addition, we need to add a method
getSourceEncoding() to the XMLReader interface;
if non-null, it will indicate to the parser that
the input is a byte stream in the given encoding.

The documentation should indicate what the parser
should do if the XML itself announces that its
encoding is something else. I propose that the parser should
be required to raise an error in that case.

See also #1483.
msg62904 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:09
See also: #1483 and #2175.
msg62907 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:18
Hmm. When getSourceEncoding() is None, there needs to be some
way for the parser to distinguish between the cases where it
is getting pre-decoded Unicode through a character stream,
or where it is getting a byte stream with an unspecified
encoding. In the latter case, it will have to look in the
XML for an encoding declaration, or use UTF-8 by default.

Note that expat can only handle the latter case.
msg62909 - Author: Yitz Gale (ygale) Date: 2008-02-24 14:53
So I think there are two possibilities:

1. Use a special value for getSourceEncoding(),
like "unicode", to indicate that this is a
unicode character stream and not a byte stream.

2. Provide yet another method in the XMLReader
interface: sourceIsCharacterStream(), returning
a bool.

There is a more drastic option:

3. Since expat doesn't support this stuff
anyway, and perhaps not too many people
have written parsers that do support it,
dumb down the InputSource interface.

Specifically, deprecate setCharacterStream(),
getCharacterStream(), setEncoding() and
getEncoding(), none of which are used by
expat. Parsers should read the XML from
the byte stream and use that to determine
the encoding.

That may upset some implementors of XML
libraries though. They would each have to go
to some trouble to provide their own
proprietary and possibly incompatible
mechanisms for this, if they need it.

Perhaps a compromise fourth path would
be to have subclasses of InputSource for
the two cases of character stream and
byte stream.
msg62940 - Author: Yitz Gale (ygale) Date: 2008-02-24 21:16
Subclasses of XMLReader would be needed, not of InputSource.
msg64644 - Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-28 18:42
It's certainly arguable that the current behavior is a bug, though I
suspect it shouldn't be considered major since I've not seen any prior
complaints about this.

It should be easy to fix the bug you describe by taking the character
stream and encoding it before feeding it to the XML parser; Expat can
certainly be forced to take a known encoding, ignoring what's in the XML.
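
A rough sketch of that approach, using pyexpat directly rather than the SAX driver: the text is encoded as UTF-8 and the parser is created with an explicit encoding, which overrides any encoding declared in the document. The function name parse_character_stream and its arguments are hypothetical, and this is not necessarily how the eventual fix was implemented.

```python
import io
from xml.parsers import expat


def parse_character_stream(char_stream, start_handler):
    """Feed already-decoded text to Expat as UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    # Passing encoding= makes Expat ignore the encoding named in the
    # XML declaration and decode the input as UTF-8 instead.
    parser = expat.ParserCreate(encoding="utf-8")
    parser.StartElementHandler = start_handler

    # Encode the whole character stream; a real reader would do this
    # in chunks instead of reading everything into memory.
    data = char_stream.read().encode("utf-8")
    parser.Parse(data, True)


# Usage: the declaration claims iso-8859-1, but the text is already
# decoded, so the declared encoding must not be applied a second time.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><r a="\u00e9"/>'
elements = []
parse_character_stream(io.StringIO(doc),
                       lambda name, attrs: elements.append((name, attrs)))
```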

On the other hand, it's not at all clear that changing this is
worthwhile.  This API borrows quite literally from the Java SAX APIs;
perhaps this separation of the character stream from the byte stream
makes sense for some of the Java XML parsers, but I don't know that
there are any Python parsers that benefit from that separation.
msg239312 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2015-03-26 07:29
Issue2175 has a patch that covers all three issues: issue1483, issue2174 and issue2175. I am unsure which parts of the patch are worth applying to maintained releases.
msg239939 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2015-04-02 18:12
Fixed in issue2175 (in 3.5 only).
msg240171 - Author: Fred Drake (fdrake) (Python committer) Date: 2015-04-06 19:18
Given that this has languished this long, patching historical releases seems pointless.

Date                 User              Action  Args
2022-04-11 14:56:31  admin             set     github: 46427
2015-04-06 19:27:13  Arfrever          set     components: + XML
2015-04-06 19:26:18  Arfrever          set     stage: resolved
                                               resolution: fixed
                                               components: + Library (Lib), - Documentation, XML
                                               versions: + Python 3.5, - Python 3.1, Python 2.7, Python 3.2
2015-04-06 19:18:26  fdrake            set     status: open -> closed
                                               messages: + msg240171
2015-04-02 18:12:48  serhiy.storchaka  set     messages: + msg239939
2015-03-26 07:29:04  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg239312
2013-01-31 10:02:57  serhiy.storchaka  set     dependencies: + Expat parser parses strings only when XML encoding is UTF-8
2010-06-09 21:59:34  terry.reedy       set     versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2008-03-28 18:42:40  fdrake            set     priority: normal -> low
                                               messages: + msg64644
                                               components: - Library (Lib), Unicode
2008-03-20 02:52:31  jafo              set     priority: normal
                                               assignee: fdrake
                                               nosy: + fdrake
2008-02-24 21:16:40  ygale             set     messages: + msg62940
2008-02-24 14:53:29  ygale             set     messages: + msg62909
2008-02-24 14:18:28  ygale             set     messages: + msg62907
2008-02-24 14:09:57  ygale             set     messages: + msg62904
                                               components: + Unicode
2008-02-24 13:52:31  ygale             create