classification
Title: sys.stdout fails to use default encoding as advertised
Type: behavior
Stage:
Components: Documentation, Interpreter Core
Versions: Python 2.7, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: georg.brandl
Nosy List: cvrebert, denilsonsa, eric.araujo, ezio.melotti, georg.brandl, ggenellina, haypo, hongqn, lemburg, pitrou, sorin, srid, steven.daprano, zuo
Priority: high
Keywords: patch

Created on 2009-01-14 11:18 by steven.daprano, last changed 2010-09-08 10:51 by haypo. This issue is now closed.

Files
File name Uploaded Description
file_write-2.7-v3.patch haypo, 2010-08-14 14:38
Messages (12)
msg79849 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2009-01-14 11:18
Documentation for files states that when writing to a file, unicode 
strings are converted to byte strings using the encoding specified by 
file.encoding.
http://docs.python.org/library/stdtypes.html#file.encoding

sys.stdout is a file, but it does not behave as stated above:

>>> type(sys.stdout)
<type 'file'>
>>> sys.stdout.encoding
'UTF-8'
>>> u = u"\u554a"
>>> print u
啊
>>> sys.stdout.write(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u554a' in 
position 0: ordinal not in range(128)
>>> sys.stdout.write(u.encode('utf-8'))
啊
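
A minimal sketch of the two outcomes in the session above, using explicit encode() calls instead of sys.stdout so that it does not depend on the terminal. The helper name try_encode is hypothetical; 'ascii' is the codec named in the traceback and 'utf-8' is the value reported by sys.stdout.encoding.

```python
u = u"\u554a"

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# The codec named in the traceback cannot represent the character.
ascii_result = try_encode(u, "ascii")

# The codec advertised by sys.stdout.encoding can.
utf8_result = try_encode(u, "utf-8")
```

Here ascii_result is None while utf8_result holds the three UTF-8 bytes, which matches the failing and the working write() calls in the report.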
msg79852 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-14 14:31
It probably uses sys.getdefaultencoding() instead.
msg79887 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-15 04:53
> It probably uses sys.getdefaultencoding() instead.

That would be wrong too, according to the cited documentation.

file.encoding is a read-only attribute; it can be set in C code using 
PyFile_SetEncoding. Apart from its definition in fileobject.c, it is 
*only* used by Py_Initialize when sys.stdin/stdout/stderr are created. 
There are no tests, nor any other use of it anywhere. Apparently the 
attribute *is* checked when writing unicode objects, but it does not 
work.

I'm guessing now, but probably the original intent was to make file 
objects behave the way the wrapper returned by codecs.open does now -- 
later it was deemed impractical and forgotten. Now, the "declarative" 
meaning of file.encoding survives, but the "behavior" is broken.

I don't know what would be the right thing to do. The encoding used by 
stdin/stdout/stderr is valuable information so the attribute should 
remain. Fixing the behavior is like having a crippled 
StreamReaderWriter and I don't see the point. But StreamReaderWriter 
has an "encoding" attribute too, and it "works", so one cannot rely on 
having such an attribute to know whether the stream automatically encodes 
its data or not.
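
For comparison, a small sketch of a stream wrapper that does encode automatically. It uses codecs.getwriter around an in-memory byte buffer rather than codecs.open on a real file, which is a substitution made here to keep the example self-contained; the variable names are illustrative only.

```python
import codecs
import io

# An in-memory byte stream stands in for a byte-oriented file.
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the codec;
# instantiating it wraps the byte stream.
writer = codecs.getwriter("utf-8")(raw)

# Unicode text is encoded on its way through the wrapper.
writer.write(u"\u554a")
encoded = raw.getvalue()
```

The underlying buffer ends up with UTF-8 bytes, which is the behavior the file.encoding documentation appears to promise for plain file objects.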
msg87635 - (view) Author: Jan Kaliszewski (zuo) Date: 2009-05-12 14:44
The matter has been discussed (more than once...), IMO without a 
satisfactory conclusion -- see:

* http://bugs.python.org/issue612627 (the feature added)
* http://bugs.python.org/issue1214889 (another feature rejected)
* http://bugs.python.org/issue1099364 (problems reported)
* http://bugs.python.org/issue967986 (problems reported)
* http://mail.python.org/pipermail/python-list/2008-December/693601.html
* http://mail.python.org/pipermail/python-dev/2008-December/084362.html
* and probably in many other places...

Anyway, it's definitely a bug -- either in the language/implementation 
or in the documentation.
msg87637 - (view) Author: Jan Kaliszewski (zuo) Date: 2009-05-12 14:56
PS. The main problem is not a lack of feature but that inconsistency, 
and that's not documented if File type docs:

print >>my_file, my_unicode  # <- is encoded with my_file.encoding
my_file.write(my_unicode)  # <- is encoded with my_file.encoding

# and on the other hand:
print my_unicode -- works  # <- is encoded with my_file.encoding
sys.stdout.write(my_unicode)  # <- is encoded with what is returned by 
sys.getdefaultencoding()
msg87638 - (view) Author: Jan Kaliszewski (zuo) Date: 2009-05-12 15:01
s / if File / in File
s / -- works  # <- is encoded with my_file.encoding /  # <- is encoded 
with sys.stdout.encoding

(sorry, too little sleep)
msg113890 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-14 12:04
The attached patch fixes this old and annoying issue. The issue only concerns sys.std* files, because Python only sets the encoding and errors attributes for these files.
msg113891 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-14 12:13
Oh, I forgot to write that my patch also uses the errors attribute. I updated the patch to add tests on errors: file_write-2.7-v2.patch.
msg113892 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-14 12:19
Your patch threatens to break compatibility. I think it would be better to simply change the "encoding" and "errors" attributes of standard streams.
msg113900 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-14 14:26
> Your patch threatens to break compatibility

Yes it does. But I think that nobody relies on this bug. If your terminal uses something other than utf-8, you will see strange characters if you write anything other than ascii characters. I suppose that anybody facing this problem uses a workaround such as replacing the sys.stdout object, manually encoding each string with the right encoding, or something else.
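
A rough sketch of the "encode manually" workaround mentioned above, assuming a stream that accepts bytes and may carry encoding and errors attributes. The names write_text and FakeStdout are hypothetical and exist only for this illustration; FakeStdout stands in for sys.stdout so the example does not depend on a terminal.

```python
import io
import sys

def write_text(stream, text):
    """Encode text with the stream's declared encoding, then write the bytes."""
    # Fall back to the interpreter default when the stream declares nothing.
    encoding = getattr(stream, "encoding", None) or sys.getdefaultencoding()
    errors = getattr(stream, "errors", None) or "strict"
    data = text.encode(encoding, errors)
    stream.write(data)
    return data

class FakeStdout(io.BytesIO):
    """Byte stream that declares an encoding, as sys.stdout does."""
    encoding = "utf-8"
    errors = "strict"

demo = FakeStdout()
write_text(demo, u"\u554a")
written = demo.getvalue()
```

The point of the helper is that the caller, not the file object, looks up the declared encoding, which is what the workaround amounts to.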
msg113901 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-08-14 14:38
3rd version of the patch: accept character buffer objects without reencoding them. Also add tests on character buffer objects.
msg115857 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-09-08 10:51
I committed my patch (with a new test, iso-8859-1:replace) to 2.7: r84621. I will not backport to 2.6 because this branch now only accepts security fixes.
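
To illustrate how an encoding and an errors handler combine, here is a small sketch that reads the label "iso-8859-1:replace" as encoding:errors. That reading is an assumption, and it does not reproduce the committed test. The helper parse_label is a hypothetical name.

```python
def parse_label(label):
    """Split a label such as 'iso-8859-1:replace' into (encoding, errors)."""
    encoding, _, errors = label.partition(":")
    # Default to the strict handler when no handler is given.
    return encoding, errors or "strict"

encoding, errors = parse_label("iso-8859-1:replace")

# U+00E9 exists in ISO-8859-1; U+554A does not.
text = u"caf\u00e9 \u554a"

# With 'replace', the unencodable character becomes '?'.
replaced = text.encode(encoding, errors)

# With 'strict', the same text raises UnicodeEncodeError.
try:
    text.encode(encoding, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```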
History
Date User Action Args
2010-09-08 10:51:52  haypo  set  status: open -> closed; resolution: fixed; messages: + msg115857
2010-09-08 00:51:55  eric.araujo  set  nosy: + lemburg
2010-08-14 14:38:25  haypo  set  files: - file_write-2.7-v2.patch
2010-08-14 14:38:14  haypo  set  files: + file_write-2.7-v3.patch; messages: + msg113901
2010-08-14 14:26:35  haypo  set  messages: + msg113900
2010-08-14 12:19:05  pitrou  set  messages: + msg113892
2010-08-14 12:13:16  haypo  set  files: - file_write-2.7.patch
2010-08-14 12:13:08  haypo  set  files: + file_write-2.7-v2.patch; messages: + msg113891
2010-08-14 12:04:15  haypo  set  files: + file_write-2.7.patch; keywords: + patch; messages: + msg113890; versions: + Python 2.7
2010-08-14 05:25:29  eric.araujo  set  nosy: + haypo
2010-06-09 21:23:58  terry.reedy  set  versions: - Python 2.5, Python 2.4
2010-02-16 05:53:21  eric.araujo  set  nosy: + eric.araujo
2009-09-24 18:51:42  srid  set  nosy: + srid
2009-07-28 13:04:45  ezio.melotti  set  nosy: + ezio.melotti
2009-07-28 08:23:45  cvrebert  set  nosy: + cvrebert
2009-05-12 23:41:32  zuo  set  assignee: georg.brandl; components: + Documentation; nosy: + georg.brandl
2009-05-12 15:21:34  denilsonsa  set  nosy: + denilsonsa
2009-05-12 15:01:40  zuo  set  messages: + msg87638
2009-05-12 14:56:02  zuo  set  messages: + msg87637; versions: + Python 2.4
2009-05-12 14:44:44  zuo  set  nosy: + zuo; messages: + msg87635
2009-02-10 23:25:38  sorin  set  nosy: + sorin
2009-02-04 09:54:10  hongqn  set  nosy: + hongqn
2009-01-15 04:53:12  ggenellina  set  nosy: + ggenellina; messages: + msg79887
2009-01-14 14:32:00  pitrou  set  priority: high; type: behavior; messages: + msg79852; components: + Interpreter Core; nosy: + pitrou
2009-01-14 11:18:47  steven.daprano  create