classification
Title: Improve docs for string interpolation "%s" re Unicode strings
Type:
Stage: needs patch
Components: Documentation
Versions: Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Arfrever, cmcqueen1975, docs@python, eric.araujo, eric.smith, ezio.melotti
Priority: normal
Keywords:

Created on 2010-07-08 07:07 by cmcqueen1975, last changed 2014-06-02 18:19 by ezio.melotti.

Files
File name Uploaded Description
class_str_unicode_methods.py cmcqueen1975, 2011-01-10 05:26
Messages (7)
msg109516 - Author: Craig McQueen (cmcqueen1975) Date: 2010-07-08 07:07
I have just been trying to figure out how string interpolation works for "%s", when Unicode strings are involved. It seems it's a bit complicated, but the Python documentation doesn't really describe it. It just says %s "converts any Python object using str()".

Here is what I have found (I think), and it could be worth improving the documentation of this somehow.

Example 1:
    "%s" % test_object

From what I can tell, in this case:
1. test_object.__str__() is called.
2. If test_object.__str__() returns a string object, then that is substituted.
3. If test_object.__str__() returns a Unicode object (for some reason), then test_object.__unicode__() is called, then _that_ is substituted instead. The output string is turned into Unicode. This behaviour is surprising.

[Note that the call to test_object.__str__() is not the same as str(test_object), because the former can return a Unicode object without causing an error, while the latter, if it gets a Unicode object, will then try to encode('ascii') to a string, possibly generating a UnicodeEncodeError exception.]
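
The encoding step mentioned in the note can be written out explicitly. This is only a sketch of the conversion that str() is described as attempting in Python 2 when __str__() hands back a Unicode object; the helper name ascii_encode_error is made up for illustration, and the snippet calls encode('ascii') directly rather than going through str().

```python
def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by encoding text as ASCII, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc
    return None


# Pure-ASCII text encodes cleanly, so no error is returned.
print(ascii_encode_error(u'cafe'))

# A non-ASCII character cannot be encoded, so the error is returned.
print(ascii_encode_error(u'caf\xe9'))
```

Calling test_object.__str__() directly skips this step, which would explain why it can return a Unicode object without an error.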


Example 2:
    u"%s" % test_object

In this case:
1. test_object.__unicode__() is called, if it exists, and the result is substituted. The output string is Unicode.
2. If test_object.__unicode__() doesn't exist, then test_object.__str__() is called instead, converted to Unicode, and substituted. The output string is Unicode.


Example 3:
    "%s %s" % (u'unicode', test_object)

In this case:
1. The first substitution causes the output string to be Unicode.
2. It seems that (1) causes the second substitution to follow the same rules as Example 2. This is a little surprising.
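
One way to check the three examples above is a probe object that records which conversion methods get called. This is a rough sketch, not the attached test file: the names Probe and hooks_called are hypothetical, and the result depends on the interpreter. Python 2.7 is the version this report describes; Python 3 never calls __unicode__().

```python
class Probe(object):
    """Hypothetical test object that logs which conversion hooks run."""

    def __init__(self):
        self.calls = []

    def __str__(self):
        self.calls.append('__str__')
        return 'from __str__'

    def __unicode__(self):
        # Only consulted by Python 2; Python 3 never calls this method.
        self.calls.append('__unicode__')
        return u'from __unicode__'


def hooks_called(template, make_args):
    """Format once and return (result, list of hooks that were called)."""
    probe = Probe()
    result = template % make_args(probe)
    return result, probe.calls


examples = [
    ('Example 1', '%s', lambda p: (p,)),
    ('Example 2', u'%s', lambda p: (p,)),
    ('Example 3', '%s %s', lambda p: (u'unicode', p)),
]
for label, template, make_args in examples:
    result, calls = hooks_called(template, make_args)
    print('%s: %r via %s' % (label, result, calls))
```

Changing what Probe.__str__() returns, or deleting Probe.__unicode__(), should make it possible to try the other cases described here.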
msg109517 - Author: Craig McQueen (cmcqueen1975) Date: 2010-07-08 07:15
Another thing I discovered, for Example 1:
4. If test_object.__str__() returns a Unicode object (for some reason), and test_object.__unicode__() does not exist, then the Unicode value from the __str__() call is used as-is (no conversion to string, no encoding errors). This is also a little surprising [in this situation unicode(test_object) also returns the Unicode object returned by __str__() as-is, so I guess there's some consistency there].
msg124662 - Author: Éric Araujo (eric.araujo) (Python committer) Date: 2010-12-26 02:52
I’m not sure how much effort should be put into a patch here, considering that the horrible bytes/text confusion and implicit conversion should stop in Python 3, and %-formatting is mildly deprecated.  Ezio, what do you think?

Craig, could you attach your test_object class and test code?  I wonder if the mixed behavior is still present in 3.x.
msg124664 - Author: Craig McQueen (cmcqueen1975) Date: 2010-12-26 10:49
I should be able to attach my test code. But it is at my work, and I'm on holidays for 2 more weeks. Sorry 'bout that!

I do assume that Python 3 greatly simplifies this.
msg125880 - Author: Craig McQueen (cmcqueen1975) Date: 2011-01-10 05:26
I'm attaching a file that I used (in Python 2.x).

It's a little rough--I manually commented and uncommented various lines to see what would change under various circumstances. But at least you should be able to see what I was doing.
msg126688 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2011-01-21 03:47
Python 3 checks the return types of __bytes__ and __str__, raising an error if it's not bytes and str respectively:
>>> str(C())
TypeError: __str__ returned non-string (type bytes)
>>> bytes(C())
TypeError: __bytes__ returned non-bytes (type str)
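
The definition of C is not shown above. A minimal class that should trigger both errors under Python 3 might look like the following; the method bodies and the helper conversion_error are assumptions for illustration.

```python
class C:
    """Deliberately wrong: each hook returns the other string type."""

    def __str__(self):
        return b'bytes'   # should be str

    def __bytes__(self):
        return 'text'     # should be bytes


def conversion_error(convert):
    """Return the TypeError message raised by convert(C()), or None."""
    try:
        convert(C())
    except TypeError as exc:
        return str(exc)
    return None


print(conversion_error(str))
print(conversion_error(bytes))
```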

The Python 2 doc for unicode() says[0]:
"""
For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode.
"""

The doc for .__unicode__() says[1]:
"""
Called to implement unicode() built-in; should return a Unicode object. When this method is not defined, string conversion is attempted, and the result of string conversion is converted to Unicode using the system default encoding.
"""
This is consistent with the unicode() doc (but it doesn't mention that 'strict' is used).  It also says that the method *should* return unicode, but it can also return a str, which then gets coerced by unicode().
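
To make the 'strict' detail concrete, here is a small sketch of what strict decoding means for non-ASCII bytes. It passes 'ascii' explicitly, on the assumption that this is the default encoding of a typical Python 2 installation; the helper decode_ascii is made up for illustration.

```python
RAW = b'caf\xc3\xa9'  # UTF-8 bytes, not valid ASCII


def decode_ascii(data, errors):
    """Decode data as ASCII with the given error mode; None on failure."""
    try:
        return data.decode('ascii', errors)
    except UnicodeDecodeError:
        return None


# 'strict' refuses the non-ASCII bytes, so the helper returns None.
print(decode_ascii(RAW, 'strict') is None)

# 'replace' substitutes U+FFFD for each undecodable byte instead of failing.
print(decode_ascii(RAW, 'replace') == u'caf\ufffd\ufffd')
```

In 'strict' mode a str result containing non-ASCII bytes makes the conversion fail rather than silently losing data.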

The doc for .__str__() says[2]:
"""
Called by the str() built-in function and by the print statement to compute the “informal” string representation of an object. [...] The return value must be a string object.
"""
This is wrong because the return value can be unicode too (this changed at some point; it used to be true in older versions).

That said, some of the behaviors described by Craig (e.g. __str__ that returns unicode) are not documented, and documenting them might save some confusion. However, these "weird" behaviors are most likely errors, and the fact that there is no exception is just because Python 2 is not strict about str/unicode.

I think a better way to solve the problem is to document clearly how these methods should be used (i.e. if __unicode__ should be preferred over __str__, if it's necessary to implement both, what they should return, etc.).
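
As an illustration only, one convention such documentation could describe is to keep the text in __unicode__() and make __str__() encode it explicitly. Nothing in this thread settles on that convention; the class name Greeting, the PY2 flag, and the choice of UTF-8 are all assumptions.

```python
import sys

PY2 = sys.version_info[0] == 2


class Greeting(object):
    """Hypothetical class; the name and the UTF-8 choice are assumptions."""

    def __init__(self, text):
        self.text = text  # expected to be a text (Unicode) string

    def __unicode__(self):
        # Python 2 text conversion: return the Unicode value untouched.
        return self.text

    def __str__(self):
        if PY2:
            # Python 2 str is a byte string, so encode explicitly
            # instead of relying on the implicit ASCII conversion.
            return self.text.encode('utf-8')
        return self.text


greeting = Greeting(u'caf\xe9')
print(u'%s' % greeting == u'caf\xe9')
```

With this layout each method returns the type its name suggests, so the surprising cases described earlier in this issue should not come up.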

[0]: http://docs.python.org/library/functions.html#unicode
[1]: http://docs.python.org/reference/datamodel.html#object.__unicode__
[2]: http://docs.python.org/reference/datamodel.html#object.__str__
msg148563 - Author: Éric Araujo (eric.araujo) (Python committer) Date: 2011-11-29 13:41
More info on this thread: http://mail.python.org/pipermail/python-dev/2006-December/070237.html
History
Date User Action Args
2014-06-02 18:19:11 ezio.melotti set nosy: + eric.smith
2011-11-29 13:41:13 eric.araujo set messages: + msg148563
2011-04-22 00:21:16 Arfrever set nosy: + Arfrever
2011-01-21 03:47:30 ezio.melotti set nosy: ezio.melotti, eric.araujo, cmcqueen1975, docs@python; messages: + msg126688
2011-01-10 05:26:49 cmcqueen1975 set files: + class_str_unicode_methods.py; nosy: ezio.melotti, eric.araujo, cmcqueen1975, docs@python; messages: + msg125880
2010-12-26 10:49:14 cmcqueen1975 set nosy: ezio.melotti, eric.araujo, cmcqueen1975, docs@python; messages: + msg124664
2010-12-26 02:52:14 eric.araujo set nosy: + eric.araujo; messages: + msg124662
2010-07-08 07:15:45 cmcqueen1975 set messages: + msg109517
2010-07-08 07:09:09 ezio.melotti set nosy: + ezio.melotti; stage: needs patch
2010-07-08 07:07:08 cmcqueen1975 create