classification
Title: xml.sax.xmlreader.XMLReader.getProperty (xml.sax.handler.property_xml_string) returns bytes
Type: behavior Stage: patch review
Components: XML Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jonathan.Gossage, amaury.forgeotdarc, christian.heimes, cms103, loewis, scoder, taleinat
Priority: normal Keywords: patch

Created on 2009-08-11 19:19 by cms103, last changed 2019-05-30 20:00 by cheryl.sabella.

Files
File name Uploaded Description
expatreader.py.patch cms103, 2009-08-12 21:06 Patch to return xml.sax.handler.property_xml_string as a string rather than bytes.
expatreader.py.patch2 cms103, 2009-08-12 21:07 Patch to return xml.sax.handler.property_xml_string as a string and to provide the Locator2 interface.
Pull Requests
URL Status Linked
PR 9715 closed Jonathan.Gossage, 2018-10-05 15:38
PR 10328 closed Jonathan.Gossage, 2018-11-05 02:36
Messages (7)
msg91482 - Author: Colin Stewart (cms103) Date: 2009-08-11 19:19
The documentation for the xml.sax.handler.property_xml_string SAX
property states that it should be "data type: String".  However, when
this value is retrieved in Python 3.1, a bytes object is returned instead.

This makes handling the returned value very difficult because there is
no method for retrieving the character set encoding that the XML was
originally encoded with.

This is currently blocking the port of SimpleTAL to Python 3 achieving
feature parity with Python 2.
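
A minimal sketch for reproducing the report, assuming the default expat-based reader returned by xml.sax.make_parser(). The XmlStringRecorder class and the record_property_types function are hypothetical names used for illustration; they are not part of the attached patches.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_xml_string


class XmlStringRecorder(ContentHandler):
    """Record the Python type of property_xml_string at each start tag.

    Hypothetical helper for illustration only.
    """

    def __init__(self, reader):
        super().__init__()
        self.reader = reader
        self.seen = []

    def startElement(self, name, attrs):
        # The property can only be read while a parse is in progress,
        # so it is queried from inside a handler callback.
        value = self.reader.getProperty(property_xml_string)
        self.seen.append((name, type(value).__name__))


def record_property_types(xml_bytes):
    """Parse xml_bytes and return (element name, type name) pairs."""
    reader = xml.sax.make_parser()
    handler = XmlStringRecorder(reader)
    reader.setContentHandler(handler)
    reader.parse(io.BytesIO(xml_bytes))
    return handler.seen


if __name__ == "__main__":
    for name, type_name in record_property_types(b"<root><child/></root>"):
        print(name, type_name)
```

On the affected versions the recorded type name is expected to be bytes rather than str, which is the mismatch with the documentation described above.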
msg91503 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-08-12 19:19
Would you like to contribute a patch?
msg91504 - Author: Colin Stewart (cms103) Date: 2009-08-12 21:06
I'm not familiar with the inner workings of the expat integration with
Python, so the attached patches need careful review.

The first patch (expatreader.py.patch) is the minimum to resolve this
issue.  The second patch (expatreader.py.patch2) also exposes the
version and encoding parameters via the Locator2 interface
(http://www.saxproject.org/apidoc/org/xml/sax/ext/Locator2.html), which
I'd recommend including.
msg91505 - Author: Colin Stewart (cms103) Date: 2009-08-12 21:07
Adding second patch.
msg110871 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-20 07:11
A unit test (or even a sample script) showing the desired feature is needed.
msg327700 - Author: Tal Einat (taleinat) * (Python committer) Date: 2018-10-14 09:13
See additional research and discussion in the comments of PR GH-9715.

Simply changing this to return a string rather than bytes would break backwards compatibility.

I certainly agree that this should have returned a string in the first place, especially since the Unicode decoding is otherwise completely abstracted away and the encoding used is not made available.

Our options:

1. Return a string starting with 3.8, document the change in What's New & fix the docs for older 3.x.
2. Continue returning bytes, and update the docs for all 3.x to state that this returns bytes and that there's no good way to know the proper encoding to use for decoding it.
3. As 2 above, but also expose the encoding used.

Since this property appears to be rarely used and option 3 requires significantly more effort than the others, I am against option 3.

Option 2 seems the safest, but I'd like to hear more from those more experienced with XML.
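
To illustrate why the missing encoding matters for options 2 and 3: a byte string has no single correct decoding unless the document encoding is known. The sketch below uses a made-up byte string standing in for a value returned by getProperty; decode_candidates is a hypothetical helper, not an existing API.

```python
def decode_candidates(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Made-up sample: the bytes of "<p>café</p>" as stored in an ISO-8859-1 document.
sample = "<p>caf\u00e9</p>".encode("iso-8859-1")

# Without the original encoding, the caller can only guess. A wrong guess
# either raises an error or silently produces different text.
candidates = decode_candidates(sample, ["utf-8", "iso-8859-1", "iso-8859-5"])

if __name__ == "__main__":
    for encoding, text in candidates.items():
        print(encoding, repr(text))
```

With these inputs, one guess fails outright and the two that succeed disagree with each other, so a caller cannot tell from the bytes alone which result is right.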
msg327708 - Author: Jonathan Gossage (Jonathan.Gossage) * Date: 2018-10-14 13:52
The other thing to consider, which also supports option 2, is that xml.parsers.expat provides an interface to the Expat parser that is easier to use and more complete than the SAX parser implementation. It is the implementation likely to be used by anyone needing a streaming parser.
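
For comparison, a minimal sketch of the same kind of streaming parse done directly with xml.parsers.expat. The collect_start_tags function is a hypothetical name for illustration. GetInputContext() is the documented pyexpat method for reading the input behind the current event; that the SAX property is backed by it is an assumption about the expat reader, not something stated in this thread.

```python
import xml.parsers.expat


def collect_start_tags(xml_bytes):
    """Return (name, attributes, line number, context type) per start tag."""
    parser = xml.parsers.expat.ParserCreate()
    events = []

    def on_start(name, attrs):
        # GetInputContext() is only meaningful while a handler is active.
        context = parser.GetInputContext()
        events.append(
            (name, attrs, parser.CurrentLineNumber, type(context).__name__)
        )

    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return events


if __name__ == "__main__":
    for event in collect_start_tags(b'<root a="1">\n<child/></root>'):
        print(event)
```

Here the handler has direct access to the parser object, so position information and the raw input context are available without going through the SAX property mechanism.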
History
Date User Action Args
2019-05-30 20:00:45 cheryl.sabella set nosy: + scoder
2018-11-05 02:36:07 Jonathan.Gossage set pull_requests: + pull_request9632
2018-10-14 13:52:28 Jonathan.Gossage set messages: + msg327708
2018-10-14 09:13:12 taleinat set nosy: + taleinat, Jonathan.Gossage; messages: + msg327700; versions: + Python 3.6, Python 3.7, Python 3.8, - Python 3.1
2018-10-05 15:38:40 Jonathan.Gossage set stage: test needed -> patch review; pull_requests: + pull_request9101
2018-10-04 14:49:19 zach.ware set nosy: + christian.heimes
2010-07-20 07:11:34 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg110871; stage: test needed
2009-08-12 21:07:27 cms103 set files: + expatreader.py.patch2; messages: + msg91505
2009-08-12 21:06:32 cms103 set files: + expatreader.py.patch; keywords: + patch; messages: + msg91504
2009-08-12 19:19:57 loewis set nosy: + loewis; messages: + msg91503
2009-08-11 19:19:52 cms103 create