classification
Title: Change 2to3 to replace 'basestring' with '(str,bytes)'
Type: behavior
Stage: resolved
Components: 2to3 (2.x to 3.x conversion tool)
Versions: Python 3.9
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, bkline, josh.r, terry.reedy, xtreak
Priority: normal
Keywords:

Created on 2019-09-01 21:40 by bkline, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (11)
msg350964 - Author: Bob Kline (bkline) * Date: 2019-09-01 21:40
We are attempting to convert a large Python 2 code base. Following the guidance of the official documentation (https://docs.python.org/2/library/functions.html#basestring) we created tests in many, many places that look like this:

if isinstance(value, basestring):
    if not isinstance(value, unicode):
        value = value.decode(encoding)
else:
    some other code

It seems that the 2to3 tool is unaware that replacing basestring with str in such cases will break the software.

Here's an example.

$ 2to3 repro.py
...
--- repro.py	(original)
+++ repro.py	(refactored)
@@ -1,8 +1,8 @@
 from frobnitz import transform

 def foo(value, encoding=None):
-    if isinstance(value, basestring):
-        if not isinstance(value, unicode):
+    if isinstance(value, str):
+        if not isinstance(value, str):
             value = value.decode(encoding or "utf-8")
         return value
     else:

Help me understand how this "fix" results in the correct behavior.
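
A minimal sketch of how the refactored function behaves on Python 3. The else branch is elided in the diff, so the return value used for it here is a hypothetical placeholder. Because bytes is not a subclass of str in Python 3, a bytes argument never reaches the decode call:

```python
def foo_refactored(value, encoding=None):
    # Body as emitted by 2to3 in the diff above.
    if isinstance(value, str):
        if not isinstance(value, str):  # can never be true on Python 3
            value = value.decode(encoding or "utf-8")
        return value
    else:
        # Hypothetical stand-in for the elided "else" branch.
        return "<else branch>"


# Text passes through unchanged.
text_result = foo_refactored("text")
# Encoded bytes are no longer decoded; they fall into the else branch.
bytes_result = foo_refactored(b"caf\xc3\xa9")
```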
msg350965 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-09-01 22:51
https://docs.python.org/3.0/whatsnew/3.0.html

> The builtin basestring abstract type was removed. Use str instead. The str and bytes types don’t have functionality enough in common to warrant a shared base class. The 2to3 tool (see below) replaces every occurrence of basestring with str.

For a longer explanation of this and other changes, you might find the link below useful. In Python 2, str is used to represent both text and bytes. Hence, to check for a string in Python 2, you test against basestring and then test against unicode to tell the two string types apart. In Python 3, all strings are Unicode, with str and bytes being two different types. Hence there is no basestring or unicode type in Python 3; both are unified into str.

https://portingguide.readthedocs.io/en/latest/strings.html

Hope this helps.
msg350967 - Author: Bob Kline (bkline) * Date: 2019-09-02 01:47
> Use str instead.

Sure. I understand the advantages of the new approach to strings. Which, by the way, weren't available when this project began. I don't disagree with anything you say in the context of writing new code. I was, however, surprised and dismayed to learn of the cavalier approach the upgrade tool has taken to silently breaking existing code which it is claiming to "fix."

Here's my favorite so far.

--- cdr.py      (original)
+++ cdr.py      (refactored)
@@ -36,15 +36,15 @@
 # ======================================================================
 from six import itervalues
 try:
-    basestring
+    str
     is_python3 = False
     base64encode = base64.encodestring
     base64decode = base64.decodestring
 except:
     base64encode = base64.encodebytes
     base64decode = base64.decodebytes
-    basestring = (str, bytes)
-    unicode = str
+    str = (str, bytes)
+    str = str
     is_python3 = True

We wrote this following the example of comparable techniques in http://python-future.org/compatible_idioms.html and similar guides to an upgrade path. Seems we're being punished for taking the trouble to make our code work with Python 2 and 3 during the transition period. :-(

It's hard to see how this conversion resulted in something better than what we already had.
msg351272 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-09-06 22:25
Bob, this issue tracker is for managing patches to the cpython repository, not for 'help me understand' requests. The latter belong on, for instance, python-list. Unless you have a specific proposal, other than leaving 'basestring' alone(1), that we could reject or accept and implement, please close this(2) as 'not a bug' or whatever.

(1) Rejected as leaving code broken for 3.x.
(2) Discussion elsewhere *might* result in a concrete suggestion appropriate for this or a new issue.
msg351273 - Author: Bob Kline (bkline) * Date: 2019-09-06 22:34
> Unless you have a specific proposal, ...

I _do_ have a specific proposal: replace `basestring` with `(str, bytes)`, which preserves the behavior of the original code. So, 

if isinstance(value, basestring)

becomes

if isinstance(value, (str, bytes))
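
As a sketch of what the proposed replacement would look like in a converted function running on Python 3 (the helper name to_text and the None fallback are illustrative assumptions, not taken from the code base under discussion):

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, or None if it is not string-like."""
    # The proposed replacement for basestring: accept both text and bytes.
    if isinstance(value, (str, bytes)):
        # After 2to3 renames unicode to str, this inner test singles out bytes.
        if not isinstance(value, str):
            value = value.decode(encoding)
        return value
    return None
```

With this form, a bytes value is still decoded, which matches what the Python 2 code did with an 8-bit str.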
msg351277 - Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-09-07 02:56
basestring in Python 2 means "thing that is logically text", because in Python 2, str can mean *either* logical text *or* binary data, and unicode is always logical text. str and unicode can kinda sorta interoperate on Python 2, so it can make sense to test for basestring if you're planning to use it as logical text; if you do 'foo' + u'bar', that's fine in Python 2. In Python 3, only str is logically text; b'foo' + 'bar' is completely illegal, so it doesn't make sense to convert it to recognize both bytes and str.
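
A short Python 3 sketch of the separation described above (variable names are illustrative): bytes and str are unrelated types, concatenating them raises TypeError, and an explicit decode is needed to combine them.

```python
text = "bar"
data = b"foo"

# Neither type is an instance of the other in Python 3.
bytes_is_str = isinstance(data, str)
str_is_bytes = isinstance(text, bytes)

# Mixing the two in one expression is rejected at run time.
try:
    data + text
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None

# Decoding first makes the operation legal.
combined = data.decode("ascii") + text
```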

Your problem is that you're using basestring incorrectly in Python 2, and it happens to work only because Python 2 did a bad job of separating text and binary data. Your original example code should actually have been written in Python 2 as:


if isinstance(value, bytes):  # bytes is an alias of str, and only str, on 2.7
    value = value.decode(encoding)
elif not isinstance(value, unicode):
    some other code

which 2to3 would convert correctly (changing unicode to str, and leaving everything else untouched) because you actually tested what you meant to test to control the actions taken:

1. If it was binary data (which you interpret all Py2 strs to be), then it is decoded to text (Py2 unicode/Py3 str)
2. If it wasn't binary data and it wasn't text, you did something else
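
A sketch of that snippet as it would look on Python 3, with unicode renamed to str. The function wrapper and the repr() call standing in for "some other code" are assumptions added only to make the example runnable:

```python
def normalize(value, encoding="utf-8"):
    if isinstance(value, bytes):
        # Binary data is decoded to text.
        value = value.decode(encoding)
    elif not isinstance(value, str):  # was: unicode
        # Hypothetical stand-in for "some other code".
        value = repr(value)
    return value
```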

Point is, the converter is doing the right thing. You misunderstood the logical meaning of basestring, and wrote code that depended on your misinterpretation, that's all.

Your try/except to try to detect Python 3-ness was doomed from the start; you referenced basestring, and 2to3 (reasonably) converts that to str, which breaks your logic. You wrote cross-version code that can't be 2to3-ed because it's *already* Python 3 code; Python 3 code should never be subjected to 2to3, because it'll do dumb things (e.g. change print(1, 2) to print((1, 2))); it's 2to3, not 2or3to3 after all.
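
To illustrate why the print rewrite mentioned above matters when the input is already Python 3 code, this sketch compares the output of the two calls on Python 3 (the helper name captured is illustrative):

```python
import contextlib
import io


def captured(func):
    """Run func and return whatever it wrote to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        func()
    return buf.getvalue()


# Two arguments are printed separated by a space.
before = captured(lambda: print(1, 2))
# A single tuple argument is printed using its repr.
after = captured(lambda: print((1, 2)))
```

The first call writes "1 2" and the second writes "(1, 2)", so the rewritten call changes what the program prints.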
msg351299 - Author: Bob Kline (bkline) * Date: 2019-09-07 10:33
OK, I give up. In parting I will point out that the official Python 2 documentation says "basestring() This abstract type is the superclass for str and unicode. It cannot be called or instantiated, but it can be used to test whether an object is an instance of str or unicode. isinstance(obj, basestring) is equivalent to isinstance(obj, (str, unicode))." That's exactly what the code we are converting (much of which was written years before Python 3 even existed) was doing. As for the idea that we weren't really "planning to use it as logical text" (ignoring the fact that _everyone_ used Python 2 str objects to represent logical text back in 2003, and ignoring the fact that the repro case given at the top of this report converts the 8-bit string value to Unicode -- why else would it do that except to use the value as "logical text"?) ... well, I don't know where to start. I'm done here. :->}
msg351304 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-07 16:24
Even at this late stage, we could really change 2to3's behavior here. Presumably many others are relying on the current behavior.
msg351305 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-07 16:24
meant to say "really couldn't"
msg351308 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-09-07 18:28
Replacing 'basestring' with 'str' is not a bug in the behavioral sense because it is intended and documented.
https://docs.python.org/3/library/2to3.html#2to3fixer-basestring

How the current behavior is correct: 2to3 converts syntactically valid 2.x code to syntactically valid 3.x code.  It cannot, however, guarantee semantic correctness.  A particular problem is that str is semantically ambiguous in 2.x, as it is used both for (encoded) text and binary data.  To resolve the ambiguity, 2.6 introduced 'bytes' as a synonym for 'str'.  2to3 assumes that 'bytes' means binary data, including text that will still be encoded in 3.x, while 'str' means text that is encoded bytes in 2.x but *will be unicode* in 3.x.  Hence it changes 'unicode' to unambiguous 'str' and 'basestring' == Union(unicode, str) to Union(str, str) == 'str'.

If you fool 2to3 by applying isinstance(value, basestring) to a value that will still be bytes at that point in 3.x, you get a semantic change.  Possible fixes:

1. Since you decode value after the check, do it before the check.

if isinstance(value, bytes):
    value = value.decode(encoding)
if not isinstance(value, unicode):
    some other code

2. Replace 'basestring' with '(unicode, basestring)'

In both cases, the 'unicode' to 'str' replacement should result in correct 3.x code.

3. Edit Lib/lib2to3/fixes/fix_basestring.py to replace with '(str, bytes)'.  This should be straightforward, but ask on python-list if you need help.

As for your second example, 2to3 is not meant for 2&3 code using exception tricks and six/future imports.  Turning 2&3 code into idiomatic 3-only code is a separate subject.

Since others have run and will run into the same issues, I intend to post a revised version of the explanation above, with fixes for a revised example, to python-list as "2to3, str, and basestring".  Any further discussion should go there.
msg351310 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-09-07 19:31
Replace 2. above with "2. Replace 'basestring' with '(unicode, bytes)'."
History
Date                 User               Action  Args
2022-04-11 14:59:19  admin              set     github: 82184
2019-09-07 19:31:47  terry.reedy        set     messages: + msg351310
2019-09-07 18:28:56  terry.reedy        set     resolution: rejected -> not a bug; messages: + msg351308
2019-09-07 16:24:54  benjamin.peterson  set     messages: + msg351305
2019-09-07 16:24:36  benjamin.peterson  set     status: open -> closed; resolution: rejected; messages: + msg351304; stage: resolved
2019-09-07 10:33:29  bkline             set     messages: + msg351299
2019-09-07 03:05:12  terry.reedy        set     nosy: + benjamin.peterson; title: Incorrect "fixing" of isinstance tests for basestring -> Change 2to3 to replace 'basestring' with '(str,bytes)'; versions: + Python 3.9, - Python 3.7
2019-09-07 02:56:40  josh.r             set     nosy: + josh.r; messages: + msg351277
2019-09-06 22:34:51  bkline             set     messages: + msg351273
2019-09-06 22:25:19  terry.reedy        set     nosy: + terry.reedy; messages: + msg351272
2019-09-02 01:47:10  bkline             set     messages: + msg350967
2019-09-01 22:51:06  xtreak             set     nosy: + xtreak; messages: + msg350965
2019-09-01 21:40:09  bkline             create