This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Strange unicode behaviour
Type: Stage:
Components: None Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: georg.brandl, massysett, sgala
Priority: normal Keywords:

Created on 2007-02-25 11:10 by sgala, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg31339 - (view) Author: Santiago Gala (sgala) Date: 2007-02-25 11:10
I know that Python is very funny WRT Unicode processing, but this defies all my knowledge.

I use the es_ES.UTF-8 locale on Linux. The script:


python -c "print unicode('á %s' % 'éí','utf8') " works, i.e., it prints á éí on the next line.

However, if I redirect it to less or to a file, like

python -c "print unicode('á %s' % 'éí','utf8') " >test
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)


Why is the behaviour different when stdout is redirected? How can I get it to do "the right thing" in both cases?
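For reference, the traceback can be reproduced without any redirection at all; what fails is the ASCII codec rejecting a non-ASCII character (a minimal sketch, using u'' literal syntax for clarity):

```python
# Hedged sketch of what the traceback shows: the failure is the ASCII
# codec rejecting U+00E1 (á), not the redirection itself.
try:
    u'\xe1'.encode('ascii')        # á cannot be represented in ASCII
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```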
msg31340 - (view) Author: Santiago Gala (sgala) Date: 2007-02-25 11:17
Forgot to say that it happens consistently with 2.4.3, 2.5-svn, and svn trunk.

Also, some people ask for the repr of the strings (I guess to reproduce the issue if they can't read the characters). Those are printed in UTF-8:

$python -c "print repr('á %s')"
'\xc3\xa1 %s'
$ python -c "print repr('éi')"
'\xc3\xa9i'
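Those repr bytes are simply the UTF-8 encodings of the characters; a small sketch (using b'' / u'' literal syntax for clarity):

```python
# '\xc3\xa1' is the UTF-8 encoding of U+00E1 (á); '\xc3\xa9' of U+00E9 (é).
# Decoding the repr bytes recovers the original characters.
assert b'\xc3\xa1 %s'.decode('utf-8') == u'\xe1 %s'
assert b'\xc3\xa9i'.decode('utf-8') == u'\xe9i'
```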
msg31341 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-02-25 19:43
First of all: Python's Unicode handling is very consistent and straightforward, if you know the basics. Sadly, most people don't know the difference between Unicode and encoded strings.

What you're seeing is not a bug. It is due to the fact that when you print Unicode to the console and Python can correctly determine your terminal encoding, the Unicode string is automatically encoded in that encoding.

If you output to a file, Python does not know which encoding you want, so all Unicode strings are encoded using ASCII only.

Please direct further questions to the Python mailing list or newsgroup.

The basic rule when handling Unicode is: use Unicode everywhere inside the program, and byte strings for input and output.
So, your code is exactly the other way round: it takes a byte string, decodes it to unicode and *then* prints it.

You should do it the other way: use Unicode literals in your code, and when you write something to a file, *encode* them in utf-8.
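A minimal sketch of this boundary rule (the b'' / u'' literal syntax is used for clarity; the exact bytes are an assumption matching the es_ES example above):

```python
# Boundary rule: decode bytes once at the input boundary, work with
# Unicode inside, encode once at the output boundary.
raw = b'\xc3\xa1 \xc3\xa9\xc3\xad'       # UTF-8 bytes, e.g. read from a pipe
text = raw.decode('utf-8')               # -> u'á éí', Unicode inside
assert text == u'\xe1 \xe9\xed'
out = text.encode('utf-8')               # bytes again at the output boundary
assert out == raw
```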
msg31342 - (view) Author: Santiago Gala (sgala) Date: 2007-02-25 22:27
re: consistent, my experience is that Python's Unicode handling is consistently stupid, almost always doing the wrong thing. It reminds me of the defaults of WordPerfect, which were always exactly the opposite of what the user wanted 99% of the time. I hope Python 3000 comes fast and stops this real pain.

I love the language, but the way it handles Unicode causes hundreds of bugs.

>Python could correctly find out your terminal
>encoding, the Unicode string is automatically encoded in that encoding.
>
>If you output to a file, Python does not know which encoding you want to
>have, so all Unicode strings are converted to ascii only.

>>> sys.getfilesystemencoding()
'UTF-8'

so Python is really dumb if print does not know my filesystem encoding but knows my terminal encoding.

I thought breaking the principle of least surprise was not considered Pythonic, and now you tell me that having a program run on the console but raise an exception when redirected is intended. I would prefer an exception in both cases. Or, even better, using sys.getfilesystemencoding(), or allowing me to call setdefaultencoding().

>Please direct further questions to the Python mailing list or newsgroup.

I would if I didn't consider this behaviour a bug, and a serious one. 

>The basic rule when handling Unicode is: use Unicode everywhere inside the
>program, and byte strings for input and output.
>So, your code is exactly the other way round: it takes a byte string,
>decodes it to unicode and *then* prints it.
>
>You should do it the other way: use Unicode literals in your code, and
>when you write something to a file, *encode* them in utf-8.

Do you mean that I need to say print unicode(whatever).encode('utf8'), like:

>>> a = unicode('\xc3\xa1','utf8') # instead of 'á', easy to read and understand, even in files encoded as utf8. Assume this is a literal or input
...
>>> print unicode(a).encode('utf8') # because a could be a number, or a different object

every time, instead of "a='á'; print a"

Cool, I'm starting to really love it. Concise and Pythonic.

Are you seriously saying that there is no way to tell print to use a default encoding, and that it will magically try to find one and fail for everything that is not a terminal?


Are you seriously telling me that this is not a bug? Even worse, that it is "intended behaviour"? BTW, Jython acts differently about this, in all the versions I tried.

And with -S I am allowed to change the encoding, which is crippled in site.py for no known good reason.

python -S -c "import sys; sys.setdefaultencoding('utf8'); print unicode('\xc3\xa1','utf8')" >test
(works; test contains an accented a, as intended)


>use Unicode everywhere inside the
>program, and byte strings for input and output.

Have you ever considered that to use Unicode everywhere inside the program, one needs to decode literals (or input) to Unicode (the very step your next sentence complains about)?

>So, your code is exactly the other way round: it takes a byte string,
>decodes it to unicode and *then* prints it.

I have followed this principle in my programming for about six years, so I'm not a novice. I'm playing by the rules:
a) "decodes it to unicode" is the first step to get it into processing. This is just a test case, so the processing is zero.
b) I refuse to believe that the only way to ensure something is printed right is wrapping every item in unicode(var).encode('utf8') [The redundant unicode call is because the var could be a number, or a different object]
c) or making my code non-portable by patching site.py to get a real encoding instead of ascii.
msg31343 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-02-25 23:27
> >>> sys.getfilesystemencoding()
> 'UTF-8'
>
> so python is really dumb if print does not know my filesystemencoding, but
> knows my terminal encoding.

The filesystem encoding is the encoding of file names, not of file contents.

> I though breaking the least surprising behaviour was not considered
> pythonic, and now you tell me that having a program running on console but
> issuing an exception when redirected is intended. I would prefer an
> exception in both cases. Or, even better, using
> sys.getfilesystemencoding(), or allowing me to set defaultencoding()

I agree that using the terminal encoding is perhaps a bit too DWIMish, but you
can always get consistent results if you *do not write Unicode strings anywhere*.

> Do you mean that I need to say print unicode(whatever).encode('utf8'),
> like:
> 
> >>> a = unicode('\xc3\xa1','utf8') # instead of 'á', easy to read and
> understand, even in files encoded as utf8. Assume this is a literal or
> input

No. You can directly put Unicode literals in your files, with u'...'.
For that to work, you need to tell Python the encoding your file has,
using the coding cookie (see the docs).
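A minimal sketch of such a file (the literal shown is the u'á éí' string from the original report, written here with escapes):

```python
# -*- coding: utf-8 -*-
# With the PEP 263 coding cookie above, the parser decodes this source
# file as UTF-8, so the literal could be written directly as u'á éí'.
a = u'\xe1 \xe9\xed'
assert len(a) == 4                       # four characters ...
assert len(a.encode('utf-8')) == 7       # ... but seven UTF-8 bytes
```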

> ...
> >>> print unicode(a).encode('utf8') # because a could be a number, or a
> different object
> 
> every time, instead of "a='á'; print a"

> Cool, I'm starting to really love it. Concise and pythonic

> Are you seriously meaning that there is no way to tell print to use a
> default encoding, and it will magically try to find it and fail for
> everything not being a terminal?

This is not magic. "print" looks for an "encoding" attribute on the file
it is printing to. This is the terminal encoding for sys.stdout and None for
other files.
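A hedged illustration using the io module (a later API, not Python 2's print internals): a stream's declared encoding is what drives the implicit conversion of Unicode text to bytes.

```python
import io

# The wrapper's declared encoding determines how Unicode text written to
# it becomes bytes on the underlying binary stream.
stream = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')
assert stream.encoding == 'utf-8'
stream.write(u'\xe1')                    # Unicode text in ...
stream.flush()
assert stream.buffer.getvalue() == b'\xc3\xa1'   # ... UTF-8 bytes underneath
```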

> Are you seriously telling me that this is not a bug? Even worse, that it
> is "intended behaviour". BTW, jython acts differently about this, in all
> the versions I tried.

It is *not* a bug. This was implemented as a simplification for terminal output.

> And with -S I am allowed to change the encoding, which is crippled in site
> for no known good reason. 

> python -S -c "import sys; sys.setdefaultencoding('utf8'); print
> unicode('\xc3\xa1','utf8')" >test
> (works, test contains an accented a as intended

Because setdefaultencoding() affects *every* conversion from unicode to string
and from string to unicode, which can be very confusing if you have to handle
different encodings.


>>use Unicode everywhere inside the
>>program, and byte strings for input and output.

> Have you ever wondered that to use unicode everywhere inside the program,
> one needs to decode literals (or input) to unicode (the next sentence you
> complain about)?

Yes, you have to decode input (for files, you can do this automatically if you
use codecs.open() instead of the builtin open()). No, you don't have to decode
literals, since Unicode literals exist.

> I follow this principle in my programming since about 6 years ago, so I'm
> not a novice. I'm playing by the rules:
> a) "decodes it to unicode" is the first step to get it into processing.
> This is just a test case, so processing is zero.
> b) I refuse to believe that the only way to ensure something to be printed
> right is wrapping every item into unicode(var).encode('utf8') [The
> redundant unicode call is because the var could be a number, or a different
> object]

No, that is of course not the only way. An alternative is to use an encoded file,
as the codecs module offers.

If you e.g. set

sys.stdout = codecs.EncodedFile(sys.stdout, 'utf-8')

you can print Unicode strings to stdout, and they will automatically be converted
using utf-8. This is clear and explicit.
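For reference, a sketch of what EncodedFile actually does: it transcodes byte strings between two encodings (the latin-1/utf-8 pair here is an illustrative assumption, and io.BytesIO stands in for a real file):

```python
import codecs
import io

# EncodedFile(file, data_encoding, file_encoding) transcodes *byte*
# strings: input bytes are decoded with data_encoding, then re-encoded
# with file_encoding before being written to the underlying file.
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, 'latin-1', 'utf-8')
wrapped.write(b'\xe1')                   # latin-1 bytes for á in ...
assert raw.getvalue() == b'\xc3\xa1'     # ... UTF-8 bytes out
```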

> c) or making my code non portable by patching site.py to get a real
> encoding instead of ascii.

If you still cannot live without setdefaultencoding(), you can do reload(sys) to get
a sys module with this method.

Closing again.
msg31344 - (view) Author: Santiago Gala (sgala) Date: 2007-03-03 13:22
>This is not magic. "print" looks for an "encoding" attribute on the file
>it is printing to. This is the terminal encoding for sys.stdout and None
>for other files.

I'll correct you:

"print" looks for an "encoding" attribute on the file it is printing to. This is the terminal encoding for sys.stdout *if sys.stdout is a terminal* and None when sys.stdout is not a terminal.

After all, the bug reported is that *the same program* behaved different when used standalone than when piped to less:

$ python -c "import sys; print sys.stdout.encoding" 
UTF-8
$ python -c "import sys; print sys.stdout.encoding" | cat
None

If you say that this is intended and not a bug, that an external process alters the behavior of a Python program, I'll just leave this written down to warn other naive people like myself who think that an external program should not influence Python's behavior (given *the same environment*):

$ locale
LANG=es_ES.UTF-8
LC_CTYPE="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_PAPER="es_ES.UTF-8"
LC_NAME="es_ES.UTF-8"
LC_ADDRESS="es_ES.UTF-8"
LC_TELEPHONE="es_ES.UTF-8"
LC_MEASUREMENT="es_ES.UTF-8"
LC_IDENTIFICATION="es_ES.UTF-8"
LC_ALL=es_ES.UTF-8

But I take it as a design flaw, against all Pythonic principles, probably coming from the fact that a lot of Python developers/users are Windows people who don't care about stdout at all.

IMO, the behavior should be either:
- always use None for sys.stdout
- always use LC_CTYPE or LANG for sys.stdout

I prefer the second one: when I pipe stdout, after all, I expect it to honor my locale settings. Don't forget that the same person who types "|" after a call to python can type LC_ALL=blah before it, while they sometimes cannot modify the script because it is outside their permission set.

The implementation logic would be simpler too, I guess.

And it would be more consistent with Jython (which uses the second, "always LC_CTYPE", solution). Not sure about IronPython or PyPy.
msg31345 - (view) Author: Omari Norman (massysett) Date: 2007-06-27 11:29
The fix given by gbrandl, which is to use 

sys.stdout = codecs.EncodedFile(sys.stdout, 'utf-8')

does not work. EncodedFile expects to receive encoded strings, so if you try to use it with Unicode strings, you get errors. I could of course 

I tried

sys.stdout = codecs.open(sys.stdout, 'w', 'utf-8')

but that gives me "TypeError: coercing to Unicode: need string or buffer, file found."

Since this was (absurdly) closed as invalid, are there any good fixes that actually work?
msg31346 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-06-27 12:04
Yes, I'm sorry, my fix was bad; you should rather use

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
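A minimal sketch of why this works: codecs.getwriter returns a StreamWriter class, and wrapping a byte stream with it yields a file-like object that accepts Unicode and encodes it on write (io.BytesIO stands in for stdout here):

```python
import codecs
import io

# Wrapping a binary stream with the UTF-8 StreamWriter class gives a
# writer that accepts Unicode and stores the encoded bytes underneath.
raw = io.BytesIO()
writer = codecs.getwriter('utf-8')(raw)
writer.write(u'\xe1 \xe9\xed')           # print-style Unicode output ...
assert raw.getvalue() == b'\xc3\xa1 \xc3\xa9\xc3\xad'   # ... stored as UTF-8
```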
msg31347 - (view) Author: Santiago Gala (sgala) Date: 2007-06-28 12:59
Still:

$ python -c "import codecs,sys; sys.stdout = codecs.getwriter('utf-8')(sys.stdout); print sys.stdout.encoding" 
UTF-8
$ python -c "import codecs,sys; sys.stdout = codecs.getwriter('utf-8')(sys.stdout); print sys.stdout.encoding" | cat
None

but now, at least


$ python -c "import codecs,sys; sys.stdout = codecs.getwriter('utf-8')(sys.stdout); print unicode('á %s' % 'éí','utf8') "
á éí
$ python -c "import codecs,sys; sys.stdout = codecs.getwriter('utf-8')(sys.stdout); print unicode('á %s' % 'éí','utf8') " | cat
á éí

can be piped.

It still looks amazing to me that people are happy with this behavior.

Waiting anxiously to see how this is dealt with in the str/unicode unification coming in Python 3000...
Santiago :)
History
Date User Action Args
2022-04-11 14:56:22 admin set github: 44614
2007-02-25 11:10:53 sgala create