
classification
Title: bug or feature: unicode() and subclasses
Type:
Stage:
Components: Interpreter Core
Versions:
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: tim.peters
Nosy List: doerwalter, gvanrossum, tim.peters
Priority: normal
Keywords:

Created on 2001-09-09 15:41 by doerwalter, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (23)
msg6461 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2001-09-09 15:41
The unicode constructor returns the object passed in, 
when an instance of a subclass of unicode is passed in:
--
class U(unicode):
   pass

u1 = U(u"foo")
print type(u1)
u2 = unicode(u1)
print type(u2) 
--
this gives
--
<type '__main__.U'>
<type '__main__.U'>
--
instead of
--
<type '__main__.U'>
<type 'unicode'>
--
as it probably should be (The unicode constructor 
should construct unicode objects). With the current 
behaviour it is nearly impossible to construct a 
unicode object with the value of an instance of a 
unicode subclass, because most methods are optimized 
to return the original object if possible, e.g.
--
print type(unicode.__getslice__(u1, 0, 3))
print type(unicode.__getslice__(u1, 0, 2))
--
gives
--
<type '__main__.U'>
<type 'unicode'>
--
This should be made consistent, so that either a 
unicode object is always returned, or all methods use 
a "virtual constructor", i.e. create an object of the 
type passed in. This would simplify deriving classes 
from unicode, as far fewer methods would have to be 
overridden.

But first of all the constructor should be fixed, so 
that the argument is returned unmodified only when it 
is an instance of unicode and not of a unicode 
subclass.
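
The behaviour requested for the constructor can be checked with a small script. This is only a sketch: it assumes Python 3, where `str` stands in for the Python 2 `unicode` type used in the report, and the class name `U` is taken from the example above.

```python
# Assumption: Python 3, where str plays the role of Python 2's unicode.
class U(str):
    pass

u1 = U("foo")
u2 = str(u1)  # base-type constructor applied to a subclass instance

print(type(u1).__name__)
print(type(u2).__name__)
```

If the constructor behaves as requested, the second name printed is that of the base type rather than `U`.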
msg6462 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-09-10 14:48
Logged In: YES 
user_id=6380

Good catch! Other types also suffer from this, e.g. int.

added to my to-do list.

msg6463 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-10 20:45
Logged In: YES 
user_id=31435

Reassigned to me.
msg6464 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-10 20:57
Logged In: YES 
user_id=31435

Partially repaired (for int and long) in:

Include/intobject.h; new revision: 2.24
Include/longintrepr.h; new revision: 2.12
Include/longobject.h; new revision: 2.24
Lib/test/test_descr.py; new revision: 1.33
Objects/abstract.c; new revision: 2.75
Objects/longobject.c; new revision: 1.104
msg6465 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-10 21:29
Logged In: YES 
user_id=31435

float() also repaired, in

Include/floatobject.h; new revision: 2.20
Lib/test/test_descr.py; new revision: 1.34
Objects/abstract.c; new revision: 2.76
msg6466 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-10 23:39
Logged In: YES 
user_id=31435

tuple() repaired, in

Include/tupleobject.h; new revision: 2.27
Lib/test/test_descr.py; new revision: 1.36
Objects/abstract.c; new revision: 2.77
msg6467 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 01:43
Logged In: YES 
user_id=31435

str() repaired (yes, unicode is next <wink>), in

Include/stringobject.h; new revision: 2.31
Lib/test/test_descr.py; new revision: 1.37
Objects/object.c; new revision: 2.146
Objects/stringobject.c; new revision: 2.130
msg6468 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 03:09
Logged In: YES 
user_id=31435

unicode() repaired in

Include/unicodeobject.h; new revision: 2.33
Lib/test/test_descr.py; new revision: 1.39
Objects/unicodeobject.c; new revision: 2.111
msg6469 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2001-09-11 11:31
Logged In: YES 
user_id=89016

Thanks for the quick fix, but the second problem still 
remains:
---
class U(unicode):
   pass

u = U(u"foo")

print type(u[0:3])
print type(u[0:2])
---
This gives:
---
<type '__main__.U'>
<type 'unicode'>
---
I think this should be changed to either always return a 
unicode object, or to always return an instance of the real 
class passed in. (This should be done for all unicode 
methods that return a new unicode object). The second 
solution would simplify creating derived classes, because 
all the methods that return unicode objects would 
automatically return the derived type, so these methods 
wouldn't have to be overridden.
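
A minimal way to compare the two slices is sketched here. It assumes Python 3, with `str` standing in for `unicode`; the variable names `whole` and `part` are illustrative only.

```python
# Assumption: Python 3, where str plays the role of Python 2's unicode.
class U(str):
    pass

u = U("foo")
whole = u[0:3]  # slice covering the entire string
part = u[0:2]   # proper sub-slice

print(type(whole).__name__, type(part).__name__)
```

With the inconsistency resolved in favour of the first option, both slices have the base type.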
msg6470 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-09-11 12:01
Logged In: YES 
user_id=6380

You're asking for the impossible though. I don't think any
other OO language supports this automatically (although I
could be wrong). The problem is, what to do with a subclass
of unicode like this:

class U(unicode):
  def __init__(self, arg):
    self.orig = arg

How is U("foobar")[0:3] going to know what argument to pass
in to __init__? The base class simply can't know what
additional invariants the subclass imposes.
msg6471 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-09-11 12:04
Logged In: YES 
user_id=6380

Apologies. I missed half of what you were asking. It's
impossible for U(...)[0:2] to return a U instance, but I
agree that it should then at least *always* return a
unicode instance.

So this is still open. For Tim: the problem is that a slice
(or other) operation may decide to return the original
object unchanged; this should (probably?) only be done when
the original object is exactly a unicode instance. I'm
afraid that we'll have to systematically look through all
144 Unicode methods to see where they exhibit this behavior.
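
The objection to an automatic "virtual constructor" can be illustrated with a sketch. Everything here is hypothetical: `Tagged` is an invented subclass whose constructor needs an extra argument, and `virtual_slice` is an invented helper that rebuilds its result through `type(s)`. It assumes Python 3, so the subclass customises `__new__` on `str`.

```python
# Assumption: Python 3; Tagged and virtual_slice are hypothetical names.
class Tagged(str):
    """A subclass whose constructor requires an extra argument."""

    def __new__(cls, value, tag):
        self = super().__new__(cls, value)
        self.tag = tag
        return self


def virtual_slice(s, start, stop):
    """Rebuild the slice through the subclass, as a 'virtual constructor' would."""
    return type(s)(str(s)[start:stop])


t = Tagged("foobar", tag="x")
try:
    virtual_slice(t, 0, 3)
    failed = False
except TypeError:
    # The generic code cannot know what to pass for `tag`.
    failed = True

print(failed)
```

The base-type code has no way to supply `tag`, which is the point made above about invariants that only the subclass knows.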
msg6472 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2001-09-11 14:03
Logged In: YES 
user_id=89016

> You're asking for the impossible though.
> I don't think any other OO language supports
> this automatically (although I
> could be wrong). 

Python uses it, e.g. in Lib/UserString.py:
   def rstrip(self): return self.__class__(self.data.rstrip())

So if someone derives a new class X from UserString, 
calling X("y ").rstrip() returns an X object. The only 
assumption that UserString makes, is that the derived class 
has a constructor that can handle at least the same 
arguments as UserString.__init__.

This "virtual constructor" is used in several places:
grep -l "self.__class__(" `find -name '*.py' | grep -v Mac`
returns:
./dist/src/Lib/UserString.py
./dist/src/Lib/copy.py
./dist/src/Lib/MimeWriter.py
./dist/src/Lib/test/test_descr.py
./dist/src/Lib/xml/sax/xmlreader.py
./dist/src/Lib/UserList.py
./dist/src/Demo/pdist/rcvs.py
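
The UserString case can be reproduced as follows. This sketch assumes Python 3, where the class lives in `collections` rather than in `Lib/UserString.py`; `X` is the derived class named in the message.

```python
# Assumption: Python 3, where UserString is provided by the collections module.
from collections import UserString


class X(UserString):
    pass


result = X("y ").rstrip()
print(type(result).__name__, repr(result.data))
```

This works only because `X` accepts the same constructor arguments as `UserString`, which is the assumption stated above.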
msg6473 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-09-11 14:49
Logged In: YES 
user_id=6380

> Python uses it, e.g. in Lib/UserString.py:
[and other cases]

Yes, and I'm no longer comfortable with such code, for
exactly the reason I mentioned, unless it's an explicit and
intentional part of the class specification. :-(

Doing this consistently for all built-in types would cause
too much upheaval -- we'd have to change every single
built-in operation.

But the other interpretation stands:  unicode (and other)
operations should only optimize by returning "self" when
self is a strict instance of the type.
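
The rule can be modelled in pure Python. This is a hypothetical sketch, not the interpreter's C code: `full_slice` is an invented function, and Python 3 `str` stands in for `unicode`.

```python
# Assumption: Python 3; full_slice is a hypothetical model of the rule.
def full_slice(s):
    """Model of s[:] that returns self only for exact str instances."""
    if type(s) is str:
        # Exact base type: handing back the same object is safe.
        return s
    # Subclass instance: build a genuine base-type string with the same value.
    return str.__str__(s)


class U(str):
    pass


plain = "abc"
sub = U("abc")
from_plain = full_slice(plain)
from_sub = full_slice(sub)

print(from_plain is plain, type(from_sub).__name__)
```

The check is on the exact type rather than `isinstance`, since `isinstance` would also be true for subclass instances.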
msg6474 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 16:56
Logged In: YES 
user_id=31435

Trying to change Resolution to something sensible 
("Accepted" doesn't make sense).
msg6475 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 16:59
Logged In: YES 
user_id=31435

Oh well -- it's stuck at "Accepted".
msg6476 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 19:50
Logged In: YES 
user_id=31435

Here we go again.  For tuples, hunted down and disabled 
t[:], t*0 and t*1 optimizations when t is of a tuple 
subclass type:

Lib/test/test_descr.py; new revision: 1.41
Objects/tupleobject.c; new revision: 2.60

More later (this is time-consuming work).
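
The three tuple operations can be exercised like this. A sketch only, assuming Python 3; the subclass name `T` is illustrative.

```python
# Assumption: Python 3; T is an illustrative tuple subclass.
class T(tuple):
    pass


t = T((1, 2))
results = {"t[:]": t[:], "t*0": t * 0, "t*1": t * 1}

for name, value in results.items():
    print(name, type(value).__name__, value)
```

With the optimizations disabled for subclasses, each result is a plain tuple.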
msg6477 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 21:45
Logged In: YES 
user_id=31435

For I a subclass of int, disabled the

+I(whatever)
I(0) << whatever
I(0) >> whatever
I(whatever) << 0
I(whatever) >> 0

optimizations, in

Lib/test/test_descr.py; new revision: 1.42
Objects/intobject.c; new revision: 2.74
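
The listed int operations can be checked in the same way. This sketch assumes Python 3, where `int` covers both the Python 2 `int` and `long` types; the operand values are arbitrary.

```python
# Assumption: Python 3, where int subsumes Python 2's int and long.
class I(int):
    pass


results = [+I(5), I(0) << 3, I(0) >> 3, I(5) << 0, I(5) >> 0]

for value in results:
    print(type(value).__name__, value)
```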
msg6478 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 21:55
Logged In: YES 
user_id=31435

For F a subclass of float, disabled the

+F(whatever)

optimization, in

Lib/test/test_descr.py; new revision: 1.43
Objects/floatobject.c; new revision: 2.98
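
The float case reduces to a single check. A sketch, assuming Python 3; the value `1.5` is arbitrary.

```python
# Assumption: Python 3; F is the float subclass named in the message.
class F(float):
    pass


f = F(1.5)
pos = +f  # unary plus on a subclass instance

print(type(pos).__name__, pos)
```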
msg6479 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-11 22:32
Logged In: YES 
user_id=31435

A number of similar long optimizations were disabled for 
long subclasses, in

Lib/test/test_descr.py; new revision: 1.44
Objects/longobject.c; new revision: 1.105
msg6480 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-12 02:19
Logged In: YES 
user_id=31435

Lots of str optimizations inhibited ("the usual", + 
replace, translate, ljust, rjust, center, strip), in

Lib/test/test_descr.py; new revision: 1.45
Objects/stringobject.c; new revision: 2.131
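
The string methods named above can be probed with calls that leave the value unchanged, which is exactly where a "return self" shortcut would apply. This sketch assumes Python 3 `str`; the arguments are chosen only so that each call is a no-op.

```python
# Assumption: Python 3 str; every call below leaves the value unchanged.
class S(str):
    pass


s = S("abc")
results = {
    "replace": s.replace("x", "y"),  # nothing to replace
    "translate": s.translate({}),    # empty mapping
    "ljust": s.ljust(0),             # width no larger than the string
    "rjust": s.rjust(0),
    "center": s.center(0),
    "strip": s.strip(),              # no surrounding whitespace
}

for name, value in results.items():
    print(name, type(value).__name__)
```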
msg6481 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-12 03:04
Logged In: YES 
user_id=31435

And lots of unicode optimizations (on subclass instances) 
were disabled in

Lib/test/test_descr.py; new revision: 1.46
Objects/unicodeobject.c; new revision: 2.112
msg6482 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-12 08:11
Logged In: YES 
user_id=31435

Additional patches:

+ Repaired hash() applied to str and unicode subclass 
instances (was always returning 0, with baffling 
consequences for dict operations).

+ Ensured that interning an object of a str subclass 
interned a genuine string (w/ the same value) instead.

The complex type got overlooked in all this, so keeping 
this open until that's done too.
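
The hash repair matters because dict lookups depend on a subclass instance hashing like the plain string of the same value. A sketch, assuming Python 3 and a subclass that overrides neither `__hash__` nor `__eq__`:

```python
# Assumption: Python 3; S overrides neither __hash__ nor __eq__.
class S(str):
    pass


key = S("spam")
table = {"spam": 1}

same_hash = hash(key) == hash("spam")
found = table[key]  # lookup with a subclass instance as the key

print(same_hash, found)
```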
msg6483 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-09-12 19:14
Logged In: YES 
user_id=31435

Similar changes also completed for the complex type, and 
closing this bug report as Fixed again:

Include/complexobject.h; new revision: 2.9
Lib/test/test_descr.py; new revision: 1.49
Objects/complexobject.c; new revision: 2.45
Objects/floatobject.c; new revision: 2.99
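
The complex case follows the same pattern. A sketch, assuming Python 3; the subclass name `C` and the value are illustrative.

```python
# Assumption: Python 3; C is an illustrative complex subclass.
class C(complex):
    pass


c = C(1 + 2j)
pos = +c           # unary plus on a subclass instance
copy = complex(c)  # base-type constructor

print(type(pos).__name__, type(copy).__name__)
```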
History
Date                 User        Action  Args
2022-04-10 16:04:25  admin       set     github: 35142
2001-09-09 15:41:14  doerwalter  create