Issue 20686: Confusing statement about unicode strings in tutorial introduction


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64885

classification

Title:	Confusing statement about unicode strings in tutorial introduction
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 2.7

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Daniel.U..Thibault, docs@python, georg.brandl, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2014-02-19 16:24 by Daniel.U..Thibault, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (10)
msg211627 - (view)	Author: Daniel U. Thibault (Daniel.U..Thibault)	Date: 2014-02-19 16:24
Near the end of 3.1.3 http://docs.python.org/2/tutorial/introduction.html#unicode-strings you can read: "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding." This can be interpreted as stating that printing a Unicode string (using the print function or the shell's default print behaviour) results in ASCII printout. It can likewise be interpreted as stating that any write of a Unicode string to a file converts the string to ASCII. Experimentation shows this is not true. Perhaps you meant something like this: "When a Unicode string is converted with str() in order to be printed or written to a file, conversion takes place using this default encoding."
Grammatical comments: In the statement "When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding.", the ", or" puts the three elements of the enumeration on the same level (respectively "printed", "written to a file", and "converted with str()"). The confusion seems to arise because "with str()" was meant to apply to the list as a whole, not just its last element.
msg211635 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-02-19 18:16
It seems to me the statement is correct as written. What experiments indicate otherwise?
msg211638 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2014-02-19 18:25
The only problem I can see is that "print" uses the console encoding. For files and str(), the comment is correct for Python 2.
msg211721 - (view)	Author: Daniel U. Thibault (Daniel.U..Thibault)	Date: 2014-02-20 12:32
"It seems to me the statement is correct as written. What experiments indicate otherwise?"
Here's a simple one:
    >>> print «1»
The guillemets are certainly not ASCII (Unicode AB and BB, well outside ASCII's 7F upper limit) but are rendered as guillemets. (Guillemets are easy for me 'cause I use a French keyboard.) I haven't actually checked yet what happens when writing to a file. If Python is unable to write anything but ASCII to file, it becomes nearly useless.
msg211726 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-02-20 14:22
Thanks, yes, Georg already pointed out the issue with print. I suppose that this is something that changed at some point in Python 2's history but this bit of the docs was not updated. Python can write anything to a file; you just have to tell it what encoding to use, either by explicitly encoding the unicode to binary before writing it to the file, or by using codecs.open and specifying an encoding for the file. (This is all much easier in Python 3, where the unicode support is part of the core of the language.)
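The two approaches named above (encoding explicitly before a binary write, or letting codecs.open do the encoding) can be sketched roughly as follows. This is only an illustration, not code from the thread: the file names, the temporary directory, and the choice of UTF-8 are assumptions. It is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import shutil
import tempfile

# Sample text with guillemets (U+00AB, U+00BB); the content is illustrative.
text = u"This is a \u00abtest\u00bb\n"

# Hypothetical scratch location and file names, chosen for this sketch only.
tmpdir = tempfile.mkdtemp()
path_explicit = os.path.join(tmpdir, "explicit.txt")
path_codecs = os.path.join(tmpdir, "codecs.txt")

# Approach 1: encode the unicode string to bytes yourself, then write bytes.
with open(path_explicit, "wb") as f:
    f.write(text.encode("utf-8"))

# Approach 2: let codecs.open encode on the way out.
with codecs.open(path_codecs, "w", encoding="utf-8") as f:
    f.write(text)

# Read both files back as raw bytes to compare what landed on disk.
with open(path_explicit, "rb") as f:
    bytes_explicit = f.read()
with open(path_codecs, "rb") as f:
    bytes_codecs = f.read()

shutil.rmtree(tmpdir)
```

Under these assumptions both files should hold the same UTF-8 bytes; the point is that the encoding is chosen explicitly and does not come from the default ASCII codec.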
msg214217 - (view)	Author: Daniel U. Thibault (Daniel.U..Thibault)	Date: 2014-03-20 12:56
"The default encoding is normally set to ASCII [...]. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding."
    >>> u"äöü"
    u'\xe4\xf6\xfc'
Printing a Unicode string uses ASCII encoding: false (the characters are not converted to their ASCII equivalents) (compare with str(), below)
    >>> str(u"äöü")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
Converting a Unicode string with str() uses ASCII encoding: true (if print (see above) behaved like str(), you'd get an error too)
    >>> f = open('workfile', 'w')
    >>> f.write('This is a «test»\n')
    >>> f.close()
Writing a Unicode string to a file uses ASCII encoding: false (examination of the file reveals UTF-8 characters (hex dump: 54 68 69 73 20 69 73 20 61 20 C2 AB 74 65 73 74 C2 BB 0A))
msg214223 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-03-20 13:12
re: file. You forgot the 'u' in front of the string:
    >>> f.write(u'This is a «test»\n')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xab' in position 10: ordinal not in range(128)
So you were actually writing binary in your console encoding, which must have been utf-8. (This kind of confusion is the main reason Python 3 exists.)
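The distinction drawn here can be checked without touching a file. The sketch below is an illustration rather than the exact Python 2 session: it assumes the same sample text, tries the ASCII codec directly, and then prints the UTF-8 bytes as hex for comparison with the dump reported earlier in the thread.

```python
# -*- coding: utf-8 -*-
# The same sample text as in the session above, written with escapes.
text = u"This is a \u00abtest\u00bb\n"

# Encoding with the ASCII codec fails at the first guillemet.
try:
    text.encode("ascii")
    ascii_ok = True
    failed_at = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    failed_at = exc.start  # index of the first unencodable character

# Encoding with UTF-8 succeeds; format the bytes as an uppercase hex dump.
utf8_hex = " ".join("{:02X}".format(b) for b in bytearray(text.encode("utf-8")))

print(ascii_ok, failed_at)
print(utf8_hex)
```

The failure index corresponds to the "position 10" in the traceback, and the C2 AB and C2 BB pairs in the hex output are the UTF-8 encodings of the two guillemets.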
msg214268 - (view)	Author: Daniel U. Thibault (Daniel.U..Thibault)	Date: 2014-03-20 20:00
    >>> mystring="äöü"
    >>> myustring=u"äöü"
    >>> mystring
    '\xc3\xa4\xc3\xb6\xc3\xbc'
    >>> myustring
    u'\xe4\xf6\xfc'
    >>> str(mystring)
    '\xc3\xa4\xc3\xb6\xc3\xbc'
    >>> str(myustring)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
    >>> f = open('workfile', 'w')
    >>> f.write(mystring)
    >>> f.close()
    >>> f = open('workufile', 'w')
    >>> f.write(myustring)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
    >>> f.close()
workfile contains C3 A4 C3 B6 C3 BC
So the Unicode string (myustring) does indeed try to convert to ASCII when written to file. But not when just printed. It seems really strange that non-Unicode strings (mystring) should actually be more flexible than Unicode strings...
msg214302 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2014-03-20 23:14
First, entering a string at the command prompt like this is not considered "printing"; it's invoking the repr(). Then, when you say flexible, you say it as if it's a good thing. In this context "flexible" means as much as "easy to produce mojibake" and is not desirable. For all these use cases, there are ways to do the right thing with Unicode strings in Python 2 (e.g. using io.open instead of builtin open). But making these the builtin case was the big gain of Python 3.
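As a rough sketch of the io.open route mentioned above: the file name and the UTF-8 encoding below are assumptions made for illustration, and the temporary directory is only there to keep the example self-contained. The code should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

# The Unicode string from the earlier session (a-umlaut, o-umlaut, u-umlaut).
myustring = u"\u00e4\u00f6\u00fc"

# Hypothetical scratch path for this sketch.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "workufile")

# io.open takes an explicit encoding, so the text is encoded as UTF-8
# on write and the default ASCII codec is not involved.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(myustring)

# Raw bytes on disk, for comparison with the hex dump in the thread.
with io.open(path, "rb") as f:
    raw = f.read()

# Decoding on read gives back the original Unicode string.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

shutil.rmtree(tmpdir)
```

With these assumptions the file holds the bytes C3 A4 C3 B6 C3 BC, the same as the byte-string write in the earlier message, but here the encoding is stated in the code and does not depend on the console.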
msg370436 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-05-31 13:04
Python 2.7 is no longer supported.

History
Date	User	Action	Args
2022-04-11 14:57:58	admin	set	github: 64885
2020-05-31 13:04:50	serhiy.storchaka	set	status: open -> closed nosy: + serhiy.storchaka messages: + msg370436 resolution: out of date stage: resolved
2014-03-20 23:14:58	georg.brandl	set	messages: + msg214302
2014-03-20 20:00:38	Daniel.U..Thibault	set	messages: + msg214268
2014-03-20 13:12:04	r.david.murray	set	messages: + msg214223 title: Confusing statement -> Confusing statement about unicode strings in tutorial introduction
2014-03-20 12:56:40	Daniel.U..Thibault	set	messages: + msg214217
2014-02-20 14:22:26	r.david.murray	set	messages: + msg211726 versions: + Python 2.7
2014-02-20 12:32:38	Daniel.U..Thibault	set	messages: + msg211721
2014-02-19 18:25:06	georg.brandl	set	nosy: + georg.brandl messages: + msg211638
2014-02-19 18:16:41	r.david.murray	set	nosy: + r.david.murray messages: + msg211635
2014-02-19 16:24:55	Daniel.U..Thibault	create