classification
Title: Confusing statement about unicode strings in tutorial introduction
Type: enhancement
Stage:
Components: Documentation
Versions: Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Daniel.U..Thibault, docs@python, georg.brandl, r.david.murray
Priority: normal
Keywords:

Created on 2014-02-19 16:24 by Daniel.U..Thibault, last changed 2014-03-20 23:14 by georg.brandl.

Messages (9)
msg211627 - Author: Daniel U. Thibault (Daniel.U..Thibault) Date: 2014-02-19 16:24
Near the end of 3.1.3 http://docs.python.org/2/tutorial/introduction.html#unicode-strings you can read:

"When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

This can be interpreted as stating that printing a Unicode string (using the print function or the shell's default print behaviour) results in ASCII printout.  It can likewise be interpreted as stating that any write of a Unicode string to a file converts the string to ASCII.  Experimentation shows this is not true.  Perhaps you meant something like this:

"When a Unicode string is converted with str() in order to be printed or written to a file, conversion takes place using this default encoding."

Grammatical comments: In the statement "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.", the ", or" puts the three elements of the enumeration on the same level (respectively "printed", "written to a file", and "converted with str()"). The confusion seems to arise because "with str()" was meant to apply to the list as a whole, not just its last element.
msg211635 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-19 18:16
It seems to me the statement is correct as written.  What experiments indicate otherwise?
msg211638 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-02-19 18:25
The only problem I can see is that "print" uses the console encoding.

For files and str(), the comment is correct for Python 2.
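
The two encodings being distinguished here can be inspected directly. The following is a minimal sketch, not part of the original report; the helper name `describe_encodings` is made up for illustration. It reports the interpreter's default encoding (the one the tutorial sentence refers to) next to the encoding attached to standard output (the one `print` uses when it is set). On Python 2.7 the default encoding is normally `'ascii'`; on Python 3 it is `'utf-8'`.

```python
import sys


def describe_encodings():
    """Return (default_encoding, stdout_encoding) for the running interpreter.

    describe_encodings is a hypothetical helper, not a standard function.
    """
    # Encoding used for implicit unicode -> str conversion in Python 2.
    default_encoding = sys.getdefaultencoding()
    # Encoding print uses for Unicode strings; it may be None when stdout
    # is redirected or replaced by an object that does not define one.
    stdout_encoding = getattr(sys.stdout, "encoding", None)
    return default_encoding, stdout_encoding


default_encoding, stdout_encoding = describe_encodings()
print("default encoding: %s" % default_encoding)
print("stdout encoding:  %s" % stdout_encoding)
```

When the two values differ, `print` and `str()` can behave differently for the same Unicode string, which is the discrepancy reported above.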
msg211721 - Author: Daniel U. Thibault (Daniel.U..Thibault) Date: 2014-02-20 12:32
"It seems to me the statement is correct as written.  What experiments indicate otherwise?"

Here's a simple one:

>>> print «1»

The guillemets are certainly not ASCII (U+00AB and U+00BB, well outside ASCII's 0x7F upper limit) but are rendered as guillemets.  (Guillemets are easy for me 'cause I use a French keyboard.)  I haven't actually checked yet what happens when writing to a file.  If Python is unable to write anything but ASCII to a file, it becomes nearly useless.
msg211726 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-02-20 14:22
Thanks, yes, Georg already pointed out the issue with print.  I suppose that this is something that changed at some point in Python 2's history but this bit of the docs was not updated.

Python can write anything to a file; you just have to tell it what encoding to use, either by explicitly encoding the unicode to binary before writing it to the file, or by using codecs.open and specifying an encoding for the file.  (This is all much easier in Python 3, where the unicode support is part of the core of the language.)
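
The two approaches named above can be sketched as follows. This is an illustrative example rather than code from the thread: the file names and the temporary directory are assumptions, and the guillemets are written as escapes so the snippet does not depend on the source file's encoding. It is written to run on both Python 2.7 and Python 3.

```python
import codecs
import os
import shutil
import tempfile

text = u"This is a \xabtest\xbb\n"  # \xab and \xbb are the guillemets

# Hypothetical scratch directory so the example leaves nothing behind.
workdir = tempfile.mkdtemp()

# Approach 1: encode explicitly, then write the bytes in binary mode.
path_bytes = os.path.join(workdir, "workfile_bytes")
with open(path_bytes, "wb") as f:
    f.write(text.encode("utf-8"))

# Approach 2: let codecs.open encode the Unicode string on the way out.
path_codecs = os.path.join(workdir, "workfile_codecs")
with codecs.open(path_codecs, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare what landed on disk.
with open(path_bytes, "rb") as f:
    raw_bytes = f.read()
with open(path_codecs, "rb") as f:
    raw_codecs = f.read()

shutil.rmtree(workdir)
print(raw_bytes == raw_codecs)
```

Either way the encoding is stated explicitly, so the ASCII default is never consulted.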
msg214217 - Author: Daniel U. Thibault (Daniel.U..Thibault) Date: 2014-03-20 12:56
"The default encoding is normally set to ASCII [...]. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."

>>> u"äöü"
u'\xe4\xf6\xfc'
   Printing a Unicode string uses ASCII encoding: false (the characters are not converted to their ASCII equivalents) (compare with str(), below)

>>> str(u"äöü")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
   Converting a Unicode string with str() uses ASCII encoding: true (if print (see above) behaved like str(), you'd get an error too)

>>> f = open('workfile', 'w')
>>> f.write('This is a «test»\n')
>>> f.close()
   Writing a Unicode string to a file uses ASCII encoding: false (examination of the file reveals UTF-8 characters (hex dump: 54 68 69 73 20 69 73 20 61 20 C2 AB 74 65 73 74 C2 BB 0A))
msg214223 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-20 13:12
re: file.  You forgot the 'u' in front of the string:

>>> f.write(u'This is a «test»\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 10: ordinal not in range(128)

So you were actually writing binary in your console encoding, which must have been UTF-8.  (This kind of confusion is the main reason Python 3 exists.)
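
The difference between the two writes can be reproduced without a file. The sketch below is illustrative only and assumes a UTF-8 console, as inferred above. It encodes the same Unicode string once as UTF-8, which yields the bytes the un-prefixed literal already contained, and once as ASCII, which is what Python 2's file.write() attempts for a Unicode string.

```python
text = u"This is a \xabtest\xbb\n"

# The bytes a UTF-8 terminal supplies for the un-prefixed literal.
utf8_bytes = text.encode("utf-8")
print(repr(utf8_bytes))

# The conversion Python 2 attempts when a Unicode string is written
# to a file opened with the builtin open(): encoding with ASCII.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    # exc.start is the index of the first character that cannot be encoded.
    print("ascii codec failed at position %d" % exc.start)
```

The failing position is the index of the opening guillemet, the same character the traceback above complains about.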
msg214268 - Author: Daniel U. Thibault (Daniel.U..Thibault) Date: 2014-03-20 20:00
>>> mystring="äöü"
>>> myustring=u"äöü"

>>> mystring
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> myustring
u'\xe4\xf6\xfc'

>>> str(mystring)
'\xc3\xa4\xc3\xb6\xc3\xbc'
>>> str(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

>>> f = open('workfile', 'w')
>>> f.write(mystring)
>>> f.close()
>>> f = open('workufile', 'w')
>>> f.write(myustring)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> f.close()

workfile contains C3 A4 C3 B6 C3 BC

So the Unicode string (myustring) does indeed try to convert to ASCII when written to a file, but not when just printed.

It seems really strange that non-Unicode strings (mystring) should actually be more flexible than Unicode strings...
msg214302 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-03-20 23:14
First, entering a string at the command prompt like this is not considered "printing"; it's invoking the repr().

Then, when you say flexible, you say it as if it's a good thing.  In this context "flexible" amounts to "easy to produce mojibake" and is not desirable.

For all these use cases, there are ways to do the right thing with Unicode strings in Python 2 (e.g. using io.open instead of builtin open).  But making these the builtin case was the big gain of Python 3.
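
As an illustration of the io.open route mentioned above, here is a minimal sketch that is not taken from the thread; the temporary file is an assumption made so the example cleans up after itself. It writes the "äöü" string from the earlier messages with an explicit encoding and reads it back, and runs on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

text = u"\xe4\xf6\xfc"  # the "äöü" string from the earlier messages

# Hypothetical scratch file.
handle, path = tempfile.mkstemp()
os.close(handle)

# io.open takes an encoding and accepts Unicode strings directly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding returns the original Unicode string.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# Reading in binary mode shows the UTF-8 bytes stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

os.remove(path)
print(round_tripped == text)
```

The bytes on disk are the same C3 A4 C3 B6 C3 BC sequence seen in workfile earlier, but here they come from a Unicode string with a declared encoding instead of from whatever the console happened to send.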
History
Date                 User                Action  Args
2014-03-20 23:14:58  georg.brandl        set     messages: + msg214302
2014-03-20 20:00:38  Daniel.U..Thibault  set     messages: + msg214268
2014-03-20 13:12:04  r.david.murray      set     messages: + msg214223
                                                 title: Confusing statement -> Confusing statement about unicode strings in tutorial introduction
2014-03-20 12:56:40  Daniel.U..Thibault  set     messages: + msg214217
2014-02-20 14:22:26  r.david.murray      set     messages: + msg211726
                                                 versions: + Python 2.7
2014-02-20 12:32:38  Daniel.U..Thibault  set     messages: + msg211721
2014-02-19 18:25:06  georg.brandl        set     nosy: + georg.brandl
                                                 messages: + msg211638
2014-02-19 18:16:41  r.david.murray      set     nosy: + r.david.murray
                                                 messages: + msg211635
2014-02-19 16:24:55  Daniel.U..Thibault  create