classification
Title: Encoding error with sax and codecs
Type: behavior Stage: patch review
Components: Library (Lib), XML Versions: Python 3.4, Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, haypo, larry, pitrou, python-dev, sconseil, serhiy.storchaka
Priority: release blocker Keywords: patch

Created on 2013-05-06 11:14 by sconseil, last changed 2013-05-12 21:19 by sconseil. This issue is now closed.

Files
File name Uploaded Description
report.txt sconseil, 2013-05-06 11:14 Minimal example to reproduce the issue
test_codecs.py haypo, 2013-05-06 21:51
XMLGenerator_codecs_stream.patch serhiy.storchaka, 2013-05-07 13:43
Messages (12)
msg188508 - (view) Author: Simon Conseil (sconseil) * Date: 2013-05-06 11:14
There is an encoding issue between codecs.open and sax (see attached file). The issue is reproducible on Python 3.3.1; it works fine on Python 3.3.0.
msg188587 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-06 20:31
Since this is a regression, setting (temporarily perhaps) as release blocker.
msg188599 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-06 21:48
It looks like a regression introduced by the fix for issue #1470548, changeset 66f92f76b2ce.
msg188600 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-06 21:51
Extracted test from report.txt. Test with Python 3.4:

$ ./python test_codecs.py 
Traceback (most recent call last):
  File "test_codecs.py", line 7, in <module>
    xml.startDocument()
  File "/home/haypo/prog/python/default/Lib/xml/sax/saxutils.py", line 148, in startDocument
    self._encoding)
  File "/home/haypo/prog/python/default/Lib/codecs.py", line 699, in write
    return self.writer.write(data)
  File "/home/haypo/prog/python/default/Lib/codecs.py", line 355, in write
    data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly

_gettextwriter() in xml.sax.saxutils does not recognize codecs classes. (See also PEP 400 :-).)
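
The detection problem can be illustrated with a short sketch. It assumes a codecs stream writer built with codecs.getwriter() over an in-memory io.BytesIO, used here as a stand-in for the file object returned by codecs.open; the variable names are illustrative only. Such an object accepts str in write(), yet it is not part of the io text hierarchy, so a check against io.TextIOBase alone will not identify it as a text stream:

```python
import codecs
import io

# In-memory stand-in for a file opened through the codecs module.
raw = io.BytesIO()
writer = codecs.getwriter("iso-8859-1")(raw)

# A codecs writer is not an io text stream, but it is a codecs.StreamWriter.
is_text_io = isinstance(writer, io.TextIOBase)
is_codecs_writer = isinstance(writer, codecs.StreamWriter)

# It still expects str and encodes it itself before writing to the raw stream.
writer.write("caf\xe9")
encoded = raw.getvalue()
```

Code that decides between "text" and "binary" purely on the io class hierarchy would therefore treat this writer as binary and hand it bytes, which matches the TypeError in the traceback above.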
msg188640 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 10:50
It does not work correctly on Python 3.3.0 either.

>>> with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
...     xml = XMLGenerator(f, encoding='iso-8859-1')
...     xml.startDocument()
...     xml.startElement('root', {'attr': u'\u20ac'})
...     xml.endElement('root')
...     xml.endDocument()
... 
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 141, in startElement
    self._write(' %s=%s' % (name, quoteattr(value)))
  File "/home/serhiy/py/cpython-3.3.0/Lib/xml/sax/saxutils.py", line 96, in _write
    self._out.write(text)
  File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 699, in write
    return self.writer.write(data)
  File "/home/serhiy/py/cpython-3.3.0/Lib/codecs.py", line 355, in write
    data, consumed = self.encode(object, self.errors)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 7: ordinal not in range(256)

And it shouldn't. On Python 2, XMLGenerator works only with binary files, and "works" with text files only because of the implicit str->unicode conversion. On Python 3, working with binary files was broken. Issue1470548 restored support for binary files (the only kind of file with which XMLGenerator can work correctly), but text files are still accepted for backward compatibility. The problem is that there is no trustworthy way to determine whether a file-like object is binary or text.

Accepting text streams in XMLGenerator should be deprecated in future versions.
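
For comparison, here is a minimal sketch of the binary-stream usage described above. It assumes an interpreter that includes the fix for issue #1470548, and uses io.BytesIO as a stand-in for a file opened with mode 'wb'. In that setup XMLGenerator does the encoding itself, and a character that the target encoding cannot represent is written as a numeric character reference instead of raising:

```python
import io
from xml.sax.saxutils import XMLGenerator

# Binary destination; XMLGenerator is responsible for encoding.
binary_out = io.BytesIO()  # stand-in for open(path, 'wb')

gen = XMLGenerator(binary_out, encoding="iso-8859-1")
gen.startDocument()
# U+20AC (euro sign) has no representation in ISO-8859-1.
gen.startElement("root", {"attr": "\u20ac"})
gen.endElement("root")
gen.endDocument()

result = binary_out.getvalue()
```

Here the encoding is specified in one place only, which avoids any disagreement between the stream and the XML declaration.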
msg188642 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-07 12:06
> Accepting text streams in XMLGenerator should be deprecated in future versions.

I agree that the following pattern is strange:

with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
   xml = XMLGenerator(f, encoding='iso-8859-1')

Why would I specify a codec twice? What happens if I specify two
different codecs?

with codecs.open('/tmp/test.txt', 'w', encoding='utf-8') as f:
   xml = XMLGenerator(f, encoding='iso-8859-1')

It may be simpler (and safer?) to reject text files. If you cannot
detect that f is a text file, just make it explicit in the
documentation that f must be a binary file.
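
The question about two different codecs can be explored with a small sketch. It assumes an interpreter in which XMLGenerator passes str straight through to a codecs stream writer (the behavior after the fix for this issue), and again uses io.BytesIO with codecs.getwriter() in place of codecs.open. Under those assumptions the stream's codec determines the bytes, while the encoding argument of XMLGenerator only determines what the XML declaration claims:

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator

raw = io.BytesIO()
stream = codecs.getwriter("utf-8")(raw)  # the stream encodes as UTF-8

# The generator is told a different encoding than the stream uses.
gen = XMLGenerator(stream, encoding="iso-8859-1")
gen.startDocument()
gen.startElement("root", {})
gen.characters("\xe9")  # e with acute accent
gen.endElement("root")
gen.endDocument()

payload = raw.getvalue()
```

The resulting document declares iso-8859-1 but contains UTF-8 bytes, so a conforming parser would misread the content. That is one argument for documenting that the output should be a binary file.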

msg188650 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:43
Here is a patch which adds explicit checks for codecs stream writers and adds tests for these cases. The tests are not entirely honest: they only check that XMLGenerator works with some specially prepared streams. XMLGenerator doesn't work with a stream that has an arbitrary encoding and error handler.
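
The limitation mentioned here can be sketched as follows. This is not the patch itself or its tests. It assumes an interpreter that includes this fix, and write_euro_attr is a hypothetical helper introduced only for illustration. With a codecs writer the error handler belongs to the stream, so whether an unencodable character fails or becomes a character reference depends on how the stream was created:

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator


def write_euro_attr(errors):
    """Hypothetical helper: write <root attr="EURO SIGN"> through a latin-1 codecs writer."""
    raw = io.BytesIO()
    stream = codecs.getwriter("iso-8859-1")(raw, errors=errors)
    gen = XMLGenerator(stream, encoding="iso-8859-1")
    gen.startDocument()
    gen.startElement("root", {"attr": "\u20ac"})
    gen.endElement("root")
    gen.endDocument()
    return raw.getvalue()


# With the default 'strict' handler the stream refuses the euro sign.
try:
    write_euro_attr("strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A stream prepared with 'xmlcharrefreplace' produces a character reference.
lenient = write_euro_attr("xmlcharrefreplace")
```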
msg188654 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:48
Of course, if this patch is committed, it may be worth applying it to 3.2 as well, since it has the same regression.
msg188657 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:57
Perhaps we should add a deprecation warning for codecs streams right in this patch?
msg189003 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-12 10:32
New changeset 1c01571ce0f4 by Georg Brandl in branch '3.2':
Issue #17915: Fix interoperability of xml.sax with file objects returned by
http://hg.python.org/cpython/rev/1c01571ce0f4
msg189009 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-05-12 10:45
Fixed in 3.2, 3.3 and default.
msg189063 - (view) Author: Simon Conseil (sconseil) * Date: 2013-05-12 21:19
Thanks, everybody!
History
Date User Action Args
2013-05-12 21:19:48 sconseil set messages: + msg189063
2013-05-12 10:45:59 georg.brandl set status: open -> closed; resolution: fixed; messages: + msg189009
2013-05-12 10:32:42 python-dev set nosy: + python-dev; messages: + msg189003
2013-05-07 13:57:03 serhiy.storchaka set messages: + msg188657
2013-05-07 13:48:21 serhiy.storchaka set stage: needs patch -> patch review; messages: + msg188654; components: + XML; versions: + Python 3.2
2013-05-07 13:43:48 serhiy.storchaka set files: + XMLGenerator_codecs_stream.patch; keywords: + patch; messages: + msg188650
2013-05-07 12:06:06 haypo set messages: + msg188642
2013-05-07 10:50:38 serhiy.storchaka set messages: + msg188640
2013-05-06 21:51:08 haypo set files: + test_codecs.py; messages: + msg188600
2013-05-06 21:48:19 haypo set messages: + msg188599
2013-05-06 20:31:35 pitrou set priority: normal -> release blocker; nosy: + larry, pitrou, georg.brandl; messages: + msg188587; stage: needs patch
2013-05-06 20:30:39 pitrou set nosy: + haypo, serhiy.storchaka; type: behavior; versions: + Python 3.4
2013-05-06 11:14:06 sconseil create