classification
Title: Sax parser crashes if given unicode file name
Type: behavior
Stage: resolved
Components: XML
Versions: Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: John.Chandler, Sergey.Prokhorov, cgrohmann, christian.heimes, ezio.melotti, python-dev, ricli85, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2011-02-09 14:20 by ricli85, last changed 2013-02-02 10:19 by python-dev. This issue is now closed.

Files
File name                      Uploaded                              Description
sax_unicode_fn-2.7.patch       serhiy.storchaka, 2012-12-09 12:18
sax_unicode_fn-3.x.patch       serhiy.storchaka, 2013-01-13 11:48
sax_unicode_fn_alt-2.7.patch   serhiy.storchaka, 2013-01-14 11:00    Use the file system encoding only for file opening
Messages (10)
msg128212 - (view) Author: Rickard Lindberg (ricli85) Date: 2011-02-09 14:20
The error is the following:

    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse
        parser.parse(filename_or_stream)
      File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
        xmlreader.IncrementalParser.parse(self, source)
      File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse
        self.prepareParser(source)
      File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser
        self._parser.SetBase(source.getSystemId())
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)

The following bash script can be used to reproduce the error:

    #!/bin/sh

    cat > å.timeline <<EOF
    <?xml version="1.0" encoding="utf-8"?>
    <timeline>
      <version>0.13.0devb38ace0a572b+</version>
      <categories>
      </categories>
      <events>
        <event>
          <start>2011-02-01 00:00:00</start>
          <end>2011-02-03 08:46:00</end>
          <text>asdsd</text>
        </event>
      </events>
      <view>
        <displayed_period>
          <start>2011-01-24 16:38:11</start>
          <end>2011-02-23 16:38:11</end>
        </displayed_period>
        <hidden_categories>
        </hidden_categories>
      </view>
    </timeline>
    EOF

    python <<EOF
    # encoding: utf-8
    from xml.sax import parse
    from xml.sax.handler import ContentHandler
    parse(open(u"å.timeline", 'r'), ContentHandler())
    EOF

If I instead do this, it works fine:

    parse(u"å.timeline".encode("utf-8"), ContentHandler())

Also:

    >>> import sys
    >>> sys.getfilesystemencoding()
    'UTF-8'

I heard from another user that this was not a problem with Python 3.1.2.
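
The traceback in the report points at an implicit conversion of a unicode system id to a byte string with the ASCII codec. The following is a minimal sketch of that failure and of the explicit-encoding workaround. It is written for Python 3, where the conversion has to be spelled out, rather than for the Python 2.7 of the report, and it does not go through the SAX code path at all; the variable names are illustrative.

```python
# Illustrative only: shows the codec failure, not the SAX code path.
name = "\xe5.timeline"  # "å.timeline"

# Encoding with the ASCII codec is what fails in the traceback:
# "å" (U+00E5) is outside the ASCII range.
try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# The reporter's workaround: encode explicitly with UTF-8 first.
utf8_name = name.encode("utf-8")

print(ascii_error)
print(utf8_name)
```

The workaround succeeds because the parser then receives a byte string and no implicit encoding step is needed.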
msg142666 - (view) Author: John Chandler (John.Chandler) Date: 2011-08-22 04:15
Confirmed about not being an issue in Python 3. Just checked with Python 3.3.0a0 and the example works fine - no exception raised.
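
A rough Python 3 counterpart of the reproduction script, as a sketch only. It assumes a file system that accepts non-ASCII file names, writes a much smaller document than the one in the report, and uses a hypothetical `ElementCounter` handler to have something to check. It parses the file once by name and once through an open file object, as the original script did.

```python
import os
import tempfile
from xml.sax import parse
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def parse_non_ascii_name():
    """Parse a file named "å.timeline" by name and by open file object."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "\xe5.timeline")  # "å.timeline"
        with open(path, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="utf-8"?>\n'
                    "<timeline><events><event/></events></timeline>\n")

        by_name = ElementCounter()
        parse(path, by_name)  # file name passed as str

        by_stream = ElementCounter()
        with open(path, "rb") as stream:
            parse(stream, by_stream)  # open file object, as in the report

        return by_name.count, by_stream.count


print(parse_non_ascii_name())
```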
msg177211 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-09 12:18
However, Python 3 doesn't work with bytes filenames (I don't think this is a bug).

The proposed patch allows unicode filenames to be used in the SAX parser.
msg179866 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-13 11:48
Ported the tests for non-ASCII system ids to 3.x.

If no one objects I'll commit this next week.
msg179919 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-01-14 07:58
I don't think that the file system encoding is the correct answer here. AFAIR expat uses UTF-8 encoded strings. Python 3.x uses PyArg_ParseTupleAndKeywords() with "s" which converts PyUnicode to PyBytes with the utf-8 codec.
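
To make the distinction in this comment concrete, here is a small sketch comparing the bytes produced by UTF-8 with those produced by a non-UTF-8 encoding. Latin-1 is used purely as a stand-in for a file system encoding that is not UTF-8 (an assumption for illustration), and `is_valid_utf8` is a hypothetical helper.

```python
import sys

name = "\xe5.timeline"  # "å.timeline"

utf8_bytes = name.encode("utf-8")
# Stand-in for a file system encoding other than UTF-8 (assumption).
latin1_bytes = name.encode("latin-1")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(sys.getfilesystemencoding())
print(utf8_bytes, is_valid_utf8(utf8_bytes))
print(latin1_bytes, is_valid_utf8(latin1_bytes))
```

The same name yields different byte strings under the two encodings, and the Latin-1 form is not valid UTF-8, so a consumer that expects UTF-8 cannot decode it.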
msg179926 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-14 09:12
Yes, I had doubts about this too. I proceeded from the following considerations.

1. The system id is often used for file operations, and in that case you need to use the file system encoding. Unfortunately, Python 2 does not have the 'surrogateescape' error handler, which would make it possible to encode an arbitrary name and later restore and re-encode it for file operations.

2. Python 2, in contrast to Python 3, accepts bytes, and they may not be valid UTF-8.

We have to choose between compatibility with Python 2 and compatibility with Python 3. I chose the former, because it is more important for a bugfix.

Maybe I am wrong.
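
For reference, this is roughly what the 'surrogateescape' error handler mentioned in point 1 provides in Python 3. The sketch below only shows the round trip of a byte string that is not valid UTF-8; the sample bytes are an illustrative assumption.

```python
# Latin-1 bytes for "å.timeline"; not valid UTF-8.
raw = b"\xe5.timeline"

# With surrogateescape, the undecodable byte 0xE5 becomes the lone
# surrogate U+DCE5 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

Python 2 has no equivalent handler, which is why an arbitrary byte name cannot be carried through a unicode string and recovered there.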
msg179932 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-14 11:00
Here is an alternative patch. It doesn't encode the system id when it is set; instead, the system id attribute can be bytes or unicode, and encoding/decoding happens only when a file is opened.
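
The idea of the alternative patch can be sketched as follows. This is not the code from sax_unicode_fn_alt-2.7.patch: it is a Python 3 illustration, and `open_system_id` and its `fs_encoding` parameter are hypothetical names. The system id is kept exactly as given, and a text name is encoded with the file system encoding only at the moment the file is opened.

```python
import os
import sys
import tempfile


def open_system_id(system_id, fs_encoding=None):
    """Hypothetical helper: open the file named by a system id.

    The system id is stored unchanged (str or bytes). A str is encoded
    with the file system encoding only here, when the file is opened.
    """
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    if isinstance(system_id, str):
        path = system_id.encode(fs_encoding)
    else:
        path = system_id  # bytes are passed through untouched
    return open(path, "rb")
```

With this approach, code that only reads the system id back (for example to report it) sees the value the caller supplied, and the choice of encoding affects file access alone.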
msg181145 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-02-02 08:44
New changeset d3e7aea8a550 by Serhiy Storchaka in branch '2.7':
Issue #11159: SAX parser now supports unicode file names.
http://hg.python.org/cpython/rev/d3e7aea8a550

New changeset d2622ca8493a by Serhiy Storchaka in branch '3.2':
Issue #11159: Add tests for testing SAX parser support of non-ascii file names.
http://hg.python.org/cpython/rev/d2622ca8493a

New changeset b85ba45b9579 by Serhiy Storchaka in branch '3.3':
Issue #11159: Add tests for testing SAX parser support of non-ascii file names.
http://hg.python.org/cpython/rev/b85ba45b9579

New changeset 107a06f1a542 by Serhiy Storchaka in branch 'default':
Issue #11159: Add tests for testing SAX parser support of non-ascii file names.
http://hg.python.org/cpython/rev/107a06f1a542
msg181146 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-02 08:53
Fixed. Thank you for the report.
msg181157 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-02-02 10:19
New changeset 706218e0facb by Serhiy Storchaka in branch '2.7':
Fix tests for issue #11159.
http://hg.python.org/cpython/rev/706218e0facb

New changeset a7c074d9cbfb by Serhiy Storchaka in branch '3.2':
Fix tests for issue #11159.
http://hg.python.org/cpython/rev/a7c074d9cbfb

New changeset 2bf01f03ff40 by Serhiy Storchaka in branch '3.3':
Fix tests for issue #11159.
http://hg.python.org/cpython/rev/2bf01f03ff40

New changeset 4ab386b00aaf by Serhiy Storchaka in branch 'default':
Fix tests for issue #11159.
http://hg.python.org/cpython/rev/4ab386b00aaf
History
Date User Action Args
2013-02-02 10:19:59  python-dev  set  messages: + msg181157
2013-02-02 08:53:03  serhiy.storchaka  set  status: open -> closed; resolution: fixed; messages: + msg181146; stage: patch review -> resolved
2013-02-02 08:44:34  python-dev  set  nosy: + python-dev; messages: + msg181145
2013-01-14 11:00:23  serhiy.storchaka  set  files: + sax_unicode_fn_alt-2.7.patch; messages: + msg179932
2013-01-14 09:12:02  serhiy.storchaka  set  messages: + msg179926
2013-01-14 07:58:17  christian.heimes  set  nosy: + christian.heimes; messages: + msg179919
2013-01-14 02:17:58  ezio.melotti  set  nosy: + ezio.melotti
2013-01-13 11:48:30  serhiy.storchaka  set  files: + sax_unicode_fn-3.x.patch; messages: + msg179866
2013-01-13 11:42:37  serhiy.storchaka  set  files: + sax_unicode_fn-2.7.patch
2013-01-13 11:39:29  serhiy.storchaka  set  files: - sax_unicode_fn-2.7.patch
2013-01-11 09:32:31  Sergey.Prokhorov  set  nosy: + Sergey.Prokhorov
2012-12-29 22:00:15  serhiy.storchaka  set  assignee: serhiy.storchaka
2012-12-09 12:18:19  serhiy.storchaka  set  files: + sax_unicode_fn-2.7.patch; nosy: + serhiy.storchaka; messages: + msg177211; keywords: + patch; stage: patch review
2012-12-08 20:29:32  daniel.urban  set  type: crash -> behavior
2012-12-08 19:14:51  cgrohmann  set  nosy: + cgrohmann
2011-08-22 04:15:27  John.Chandler  set  nosy: + John.Chandler; messages: + msg142666
2011-02-09 14:20:01  ricli85  create