classification
Title: str(bytes_obj) should raise an error
Type:
Stage:
Components: Interpreter Core, Unicode
Versions: Python 3.6, Python 3.5

process
Status: closed
Resolution: rejected
Dependencies:
Superseder: py3k-pep3137: issue warnings / errors on str(bytes()) and similar operations (issue 1392)
Assigned To: gvanrossum
Nosy List: eryksun, ezio.melotti, gvanrossum, lemburg, pitrou, r.david.murray, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2015-04-22 13:23 by lemburg, last changed 2015-04-23 16:57 by gvanrossum. This issue is now closed.

Messages (8)
msg241800 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-22 13:23
In Python 2, the unicode() constructor does not accept bytes arguments, unless an encoding argument is given:

>>> unicode(u'abcäöü'.encode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

In Python 3, the str() constructor masks this programming error by returning the repr() of the bytes object:

>>> str('abcäöü'.encode('utf-8'))
"b'abc\\xc3\\xa4\\xc3\\xb6\\xc3\\xbc'"

I think it would be more helpful to raise an error, which would point the programmer to the encoding argument that is most probably missing.

Also note that you get a different output with encoding argument set:

>>> str('abcäöü'.encode('utf-8'), 'utf-8')
'abcäöü'

I know this is documented, but it is still not very helpful and can easily hide errors.
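For readers following along, here is a minimal sketch of the three behaviours discussed above, written for Python 3 run without -b/-bb. The helper name `to_text` is hypothetical and not part of the standard library; it only illustrates one way to make the decode step explicit so that a forgotten encoding cannot silently turn into a repr.

```python
text = "abcäöü"
data = text.encode("utf-8")

# One-argument form: returns the repr of the bytes object, not the decoded text.
as_repr = str(data)

# Two-argument form: decodes the bytes with the given encoding.
as_text = str(data, "utf-8")

# Equivalent explicit spelling of the decode step.
also_text = data.decode("utf-8")


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes explicitly, pass str through,
    and reject everything else instead of falling back to repr()."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

With such a helper, passing the wrong type fails loudly with a TypeError, which is roughly the behaviour requested in this issue, but opt-in rather than built into str().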
msg241802 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-04-22 13:39
bytes.__str__ can already raise either a warning (-b) 

    >>> str('abcäöü'.encode('utf-8'))
    __main__:1: BytesWarning: str() on a bytes instance
    "b'abc\\xc3\\xa4\\xc3\\xb6\\xc3\\xbc'"

or error (-bb), which applies equally to implicit conversion by print():

    >>> print('abcäöü'.encode('utf-8'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    BytesWarning: str() on a bytes instance
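
The effect of the two flags can also be checked from a script. The following is a rough sketch, assuming Python 3.7 or later (for `capture_output`) and that no PYTHONWARNINGS setting overrides the default filters; the helper `run_with_flags` and the `SNIPPET` constant are illustrative names, not an existing API.

```python
import subprocess
import sys

# The statement under test: an implicit-looking str() call on a bytes object.
SNIPPET = "print(str(b'abc'))"


def run_with_flags(*flags):
    """Run SNIPPET in a fresh interpreter started with the given flags."""
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
    )


default = run_with_flags()      # no flag: the repr is printed silently
warn = run_with_flags("-b")     # -b: a BytesWarning is reported on stderr
error = run_with_flags("-bb")   # -bb: the BytesWarning is raised as an error
```

Here `default.returncode` and `warn.returncode` are both 0, while `error.returncode` is non-zero because the interpreter exits with an uncaught BytesWarning.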
msg241803 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-22 13:43
In Python 2, the unicode() constructor accepts a bytes argument if it is decodable with sys.getdefaultencoding().

>>> unicode(b'abc')
u'abc'
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("utf-8")
>>> unicode(u'abcäöü'.encode('utf-8'))
u'abc\xe4\xf6\xfc'

In Python 3, the str() constructor does not accept bytes arguments if Python is run with the -bb option.
msg241804 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-22 13:52
str accepting bytes and returning the repr was a conscious design choice, as evidenced by the -bb option, and I'm sure there is code that is both unintentionally and *intentionally* using this, despite the warning.  Unless we want to discuss making the -bb behavior the default in a future version of python, this issue should be closed.
msg241808 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-04-22 14:48
On 22.04.2015 15:52, R. David Murray wrote:
> str accepting bytes and returning the repr was a conscious design choice, as evidenced by the -bb option, and I'm sure there is code that is both unintentionally and *intentionally* using this, despite the warning.  Unless we want to discuss making the -bb behavior the default in a future version of python, this issue should be closed.

I guess that would be helpful, yes.

Here's the original patch which introduced -b and -bb:

http://bugs.python.org/issue1392

This was Guido's answer back then:

"""
I'll look at the patches later, but we've gone over this before on the
list. str() of *any* object needs to return *something*. Yes, it's
unfortunate that this masks bugs in the transitional period, but it
really is the best thing in the long run. We had other exceptional
treatment for str vs. bytes (e.g. the comparison was raising TypeError
for a while) and we had to kill that too.
"""

I'm not sure what the "transitional period" refers to, though.
It's 8 years later now and it doesn't look like str(bytes_object) will
go away as a source of subtle bugs anytime soon :-)
msg241811 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-22 15:11
Yeah, that's why I run tests with -bb myself.  Except that there was a bug in -W/-bb handling that meant I wasn't really...and that bit me because there is at least one buildbot that really does, and it complained...

(Although in that case the 'bug' was really benign, since it was just optional debug print output for which the repr of the bytes was actually fine.)
msg241867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-23 16:34
> I'm not sure what the "transitional period" refers to, though.

The Python 2 -> Python 3 migration.

> It's 8 years later now and doesn't look like str(bytes_object) will
> go away as a source of subtle bugs anytime soon

str(bytes_object) is perfectly reasonable when logging stuff, for example.
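
As an illustration of the logging case, one way to keep the repr output while staying clean under -bb is to request the repr explicitly, since only str() on a bytes instance triggers BytesWarning. This is a sketch only; the function name `describe_payload` is hypothetical.

```python
def describe_payload(data):
    """Format a bytes payload for a log line using its repr explicitly.

    The !r conversion calls repr(data), so no str() call is made on the
    bytes object itself.
    """
    return f"received {data!r}"


message = describe_payload(b"abc\xc3\xa4")
```

The resulting message contains the same `b'...'` text that str(bytes_object) would produce, but the intent is visible in the code.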

Recommend closing.
msg241869 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2015-04-23 16:57
It would be unacceptable if print(b) were to raise an exception. The reason the transitional period is long is just that people are still porting Python 2 code.
History
Date User Action Args
2015-04-23 16:57:52 gvanrossum set status: pending -> closed; assignee: gvanrossum; messages: + msg241869
2015-04-23 16:34:30 pitrou set status: open -> pending; nosy: + pitrou, gvanrossum; messages: + msg241867; superseder: py3k-pep3137: issue warnings / errors on str(bytes()) and similar operations; resolution: rejected
2015-04-22 15:11:24 r.david.murray set messages: + msg241811
2015-04-22 14:48:38 lemburg set messages: + msg241808
2015-04-22 13:52:34 r.david.murray set nosy: + r.david.murray; messages: + msg241804
2015-04-22 13:43:36 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg241803
2015-04-22 13:39:03 eryksun set nosy: + eryksun; messages: + msg241802
2015-04-22 13:23:32 lemburg create