
classification
Title: 2to3 incorrectly converts two parameter unicode() constructor to str()
Type: behavior
Stage: resolved
Versions: Python 3.3, Python 3.4, Python 2.7
process
Status: closed
Resolution: wont fix
Nosy List: gregory.p.smith, serhiy.storchaka
Priority: normal

Created on 2013-10-04 00:57 by gregory.p.smith, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg198929 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-10-04 00:57
From a conversion through 2to3:

<       default_value=unicode("", "utf-8"),
---
>       default_value=str("", "utf-8"),

The Python 2 unicode constructor takes an optional second parameter which is the codec to use to convert when the first parameter is non-unicode.

2to3 should check the arguments of each unicode() call. If there is a second argument and the first is an explicit bytes literal (b"..."), it should rewrite the call as:
  default_value=b"whatever".decode(second_param)

If the first argument is valid UTF-8 and the second is "utf-8" (or one of its other spellings), the call should simply become the literal itself:
  default_value="thing passed to unicode() that was already utf-8"
msg198936 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-04 08:07
This is not a bug: str accepts the encoding argument in Python 3. And in contrast to the decode method, it works with arbitrary bytes-like objects (e.g. array.array).
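
To illustrate the point about bytes-like objects, here is a small sketch (not from the original thread; the sample bytes and variable names are made up for illustration). It assumes Python 3 and shows str(obj, encoding) decoding objects that have no decode() method of their own:

```python
import array

raw = b"caf\xc3\xa9"  # UTF-8 encoded bytes for "café" (illustrative sample)

# bytes and bytearray both have .decode(); str(obj, encoding) gives the same text.
from_bytes = str(raw, "utf-8")
from_bytearray = str(bytearray(raw), "utf-8")

# array.array and memoryview expose the buffer protocol but have no .decode(),
# so str(obj, encoding) is the direct way to decode them.
byte_array = array.array("B", raw)
from_array = str(byte_array, "utf-8")
from_memoryview = str(memoryview(raw), "utf-8")

array_has_decode = hasattr(byte_array, "decode")
```

All four results are the same text, which is why replacing unicode(x, codec) with str(x, codec) is safe whenever x is bytes-like.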
msg198966 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-10-04 21:08
Correct, my characterization above was wrong (I shouldn't write these up without the interpreter right in front of me). What is wrong with the conversion is:

unicode("", "utf-8") in Python 2.x should become either str(b"", "utf-8") or, better, just "" in Python 3.x. The better version is possible when the decoded value can be written as is in the encoding of the output 3.x source file, but that optimization is not critical.

In order for str() to take a second arg (the codec) the first cannot be a unicode string already:

>>> str("foo", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
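
The same check can be written as a script. This is only a sketch for Python 3; the helper name decode_error_message is hypothetical and not part of 2to3 or the standard library:

```python
def decode_error_message(value, encoding="utf-8"):
    """Return the TypeError message raised by str(value, encoding), or None.

    Hypothetical helper, written only to illustrate the behavior above.
    """
    try:
        str(value, encoding)
    except TypeError as exc:
        return str(exc)
    return None


# A text (str) first argument is rejected when an encoding is given ...
text_error = decode_error_message("foo")

# ... while a bytes first argument decodes without error.
bytes_error = decode_error_message(b"foo")
```

This is why the converted call str("", "utf-8") fails at runtime in Python 3 while str(b"", "utf-8") does not.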
msg198968 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-04 21:36
Just add the "b" prefix to the string literal argument of unicode() in Python 2.
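
A sketch of what that workaround looks like after conversion, assuming 2to3 only renames unicode to str and leaves the b"" literal untouched (the Python 2 line appears as a comment because it does not run on Python 3):

```python
# Python 2 source, with the "b" prefix added by hand before running 2to3:
#     default_value = unicode(b"", "utf-8")

# Assumed 2to3 output: unicode is renamed to str, the bytes literal is kept.
default_value = str(b"", "utf-8")

# The result is an ordinary empty text string.
is_empty_text = isinstance(default_value, str) and default_value == ""
```

The b"" literal is accepted by Python 2.6 and later, so the edited source stays valid on both sides of the conversion.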
History
Date                 User              Action  Args
2022-04-11 14:57:51  admin             set     github: 63358
2017-03-07 18:39:15  serhiy.storchaka  set     status: pending -> closed; resolution: wont fix; stage: needs patch -> resolved
2016-11-28 23:35:56  serhiy.storchaka  set     status: open -> pending
2013-10-04 21:36:24  serhiy.storchaka  set     messages: + msg198968
2013-10-04 21:08:39  gregory.p.smith   set     messages: + msg198966
2013-10-04 08:07:58  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg198936
2013-10-04 00:57:01  gregory.p.smith   create