classification
Title: sax.parser cannot get xml data from a subprocess pipe
Type: behavior Stage: resolved
Components: XML Versions: Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Avneesh.Chadha, JocelynJ, christian.heimes, python-dev, r.david.murray, serhiy.storchaka
Priority: normal Keywords: easy, patch

Created on 2014-11-21 23:54 by JocelynJ, last changed 2014-11-27 21:10 by JocelynJ. This issue is now closed.

Files
File name Uploaded Description Edit
test-22915.diff JocelynJ, 2014-11-22 22:52 test review
toto.py JocelynJ, 2014-11-22 22:57 testcase
sax_non_str_file_name.patch serhiy.storchaka, 2014-11-23 10:47 review
Messages (12)
msg231504 - (view) Author: Jocelyn (JocelynJ) * Date: 2014-11-21 23:54
With the attached code, and an xml file, I got the following error with python 3.4.2:

$ cat toto.xml
<?xml version='1.0' encoding='UTF-8'?>
<test></test>


$ python3.4 toto.py 
Traceback (most recent call last):
  File "toto.py", line 10, in <module>
    parse(proc.stdout, ContentHandler())
  File "/usr/lib/python3.4/xml/sax/__init__.py", line 33, in parse
    parser.parse(source)
  File "/usr/lib/python3.4/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python3.4/xml/sax/xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "/usr/lib/python3.4/xml/sax/expatreader.py", line 111, in prepareParser
    self._parser.SetBase(source.getSystemId())
TypeError: must be str, not int


The same test works with python2, and I would expect the code to work with python 3 too.


Thanks,
Jocelyn
msg231505 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-22 00:10
My guess is that this is similar to issue 21044.
msg231508 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-22 02:29
I've looked at the sax code, and this does indeed have the same root cause: in python2 a dummy string was used for the 'name' attribute of io objects that are based on file descriptors, whereas in python3 the name is the integer value of the file descriptor.  In test_sax we can see getSystemId being tested that it returns a filename (see test_expat_locator_withinfo).

The fix should be analogous to the issue 21044 fix: check that 'name' is a string and if not don't use it.  I'm marking this as easy; hopefully someone will want to tackle figuring out exactly where in the code it needs to be fixed and adding tests for it.
msg231537 - (view) Author: Jocelyn (JocelynJ) * Date: 2014-11-22 22:52
Here is a patch to add a test to test_sax.py.

I'm not sure on the fix. Is the following one a correct one ?

     def prepareParser(self, source):
         if source.getSystemId() is not None:
-            self._parser.SetBase(source.getSystemId())
+            self._parser.SetBase(str(source.getSystemId()))


Or should I remove the call to SetBase if getSystemId() is not a string ?
msg231538 - (view) Author: Jocelyn (JocelynJ) * Date: 2014-11-22 22:57
Forgot to attach the testcase when opening the bug.
msg231545 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-23 05:46
Has anyone investigated what exactly sax uses SystemId/SetBase *for*?  I think think we need that info in order to decide what to do, and I'm not familiar with sax.
msg231553 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-23 10:47
This bug should be fixed in other place. Here is a patch.
msg231568 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-11-23 16:27
Serhiy's patch looks correct to me.  Given that if the source doesn't have a name attribute it is simply not set in the existing code, this change should be "safe" (backward compatible).

Elsewhere the possibility was raised of converting the int to a string (<fdopen: N>), but that issue is a more global one and would apply at the io module level...and if implemented this fix would automatically take advantage of it.

So I think this should be committed.
msg231572 - (view) Author: Jocelyn (JocelynJ) * Date: 2014-11-23 17:35
The only explicit documentation I found on SystemId is from the java specification (it is my understanding that python sax implementation is adapted from Java one):
http://www.saxproject.org/apidoc/org/xml/sax/InputSource.html#setSystemId%28java.lang.String%29

The documentation says that "The system identifier is optional if there is a byte stream or a character stream".

So, I agree that Serhiy's patch looks correct.

Note that I'm not sure that my testcase with a subprocess is covered by Serhiy's tests, as these tests call parser() with a file object.
msg231777 - (view) Author: Roundup Robot (python-dev) Date: 2014-11-27 20:17
New changeset 27ae1a476ef7 by Serhiy Storchaka in branch '3.4':
Issue #22915: SAX parser now supports files opened with file descriptor or
https://hg.python.org/cpython/rev/27ae1a476ef7

New changeset ce9881eecfb4 by Serhiy Storchaka in branch 'default':
Issue #22915: SAX parser now supports files opened with file descriptor or
https://hg.python.org/cpython/rev/ce9881eecfb4
msg231780 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-27 20:53
Jocelyn, in both cases the argument of parse() is a file object with integer name. Tests use one of simplest way to create such object.
msg231782 - (view) Author: Jocelyn (JocelynJ) * Date: 2014-11-27 21:10
Ok, I understand your tests now.

Thanks for the fixes !
History
Date User Action Args
2014-11-27 21:10:16JocelynJsetmessages: + msg231782
2014-11-27 20:53:22serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg231780

stage: commit review -> resolved
2014-11-27 20:17:05python-devsetnosy: + python-dev
messages: + msg231777
2014-11-23 17:35:31JocelynJsetmessages: + msg231572
2014-11-23 16:27:17r.david.murraysetmessages: + msg231568
stage: patch review -> commit review
2014-11-23 10:47:15serhiy.storchakasetfiles: + sax_non_str_file_name.patch

nosy: + christian.heimes, serhiy.storchaka
messages: + msg231553

assignee: serhiy.storchaka
stage: needs patch -> patch review
2014-11-23 05:46:36r.david.murraysetmessages: + msg231545
2014-11-22 22:57:39JocelynJsetfiles: + toto.py

messages: + msg231538
2014-11-22 22:52:19JocelynJsetfiles: + test-22915.diff
keywords: + patch
messages: + msg231537
2014-11-22 09:33:53Avneesh.Chadhasetnosy: + Avneesh.Chadha
2014-11-22 08:02:54serhiy.storchakasetstage: needs patch
2014-11-22 02:29:52r.david.murraysetkeywords: + easy

messages: + msg231508
2014-11-22 00:11:13r.david.murraysetversions: + Python 3.5
2014-11-22 00:10:02r.david.murraysetnosy: + r.david.murray
messages: + msg231505
2014-11-21 23:54:11JocelynJcreate