Message 99084 - Python tracker


Author	ldeller
Recipients	ldeller
Date	2010-02-09.03:06:05
Message-id	<1265684782.83.0.641089509966.issue7890@psf.upfronthosting.co.za>
In-reply-to

Content

The documentation for the hash() function says:
"Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0)"

This can be violated when comparing a unicode object with its str equivalent.  Here is an example:

C:\>c:\Python27\python -S
Python 2.7a3 (r27a3:78021, Feb  7 2010, 00:00:09) [MSC v.1500 32 bit (Intel)] on win32
>>> import sys; sys.setdefaultencoding('utf-8')
>>> unicodeobj = u'No\xebl'
>>> strobj = str(unicodeobj)
>>> unicodeobj == strobj
True
>>> hash(unicodeobj) == hash(strobj)
False

The last response should be True, not False: the two objects compare equal, so their hashes should be equal as well.
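To illustrate why the mismatch matters, here is a minimal sketch of what goes wrong in a dict when two objects compare equal but hash differently. It does not use the Python 2 str/unicode types; the classes `ByCodePoints` and `ByEncodedBytes` are hypothetical stand-ins that hash the same text by its code points and by its UTF-8 bytes respectively, and it is written in Python 3 syntax so that it runs on current interpreters.

```python
class ByCodePoints(object):
    """Hypothetical stand-in: hashes the text by its code points."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Equality looks only at the text, not at how it is hashed.
        return self.text == other.text

    def __hash__(self):
        return hash(tuple(ord(ch) for ch in self.text))


class ByEncodedBytes(ByCodePoints):
    """Hypothetical stand-in: hashes the same text by its UTF-8 bytes."""

    def __hash__(self):
        return hash(tuple(self.text.encode('utf-8')))


a = ByCodePoints(u'No\xebl')
b = ByEncodedBytes(u'No\xebl')

equal = (a == b)                    # the objects compare equal
same_hash = (hash(a) == hash(b))    # but the hashes are computed differently
found = (b in {a: 'value'})         # dict lookup checks the hash first

print(equal, same_hash, found)
```

With objects like these, a key stored under one form cannot be found through the other, even though the two compare equal.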

I tested this on Python 2.7a3/Windows, 2.6.4/Linux and 2.5.2/Linux.  The problem is not relevant to Python 3.0+.

Looking at unicodeobject.c:unicode_hash() and stringobject.c:string_hash(), I think that this problem would arise for "equal" objects strobj and unicodeobj when the unicode code points are not aligned with the encoded bytes, i.e. when:
    map(ord, unicodeobj) != map(ord, strobj)
This means that the problem never arises when sys.getdefaultencoding() is 'ascii' or 'iso8859-1'/'latin1'.
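The condition above can be tried for several encodings without changing the default encoding. This is only a sketch, written in Python 3 syntax; the helper name `code_points_match_bytes` is hypothetical, and it checks the code-point/byte alignment I describe above, not the hash functions themselves.

```python
def code_points_match_bytes(text, encoding):
    """Return True if the code points of `text` equal its encoded byte values.

    Returns None if `text` cannot be encoded with `encoding`; in that case
    there is no str equivalent to compare against.
    """
    try:
        encoded = text.encode(encoding)
    except UnicodeEncodeError:
        return None
    # bytearray yields integers on both Python 2 and Python 3.
    return [ord(ch) for ch in text] == list(bytearray(encoded))


sample = u'No\xebl'
results = {
    'utf-8': code_points_match_bytes(sample, 'utf-8'),
    'latin-1': code_points_match_bytes(sample, 'latin-1'),
    'ascii': code_points_match_bytes(sample, 'ascii'),
}
print(results)
```

For the sample string the sequences differ under 'utf-8' and match under 'latin-1', and the string cannot be encoded as 'ascii' at all, which is consistent with the problem only showing up for encodings other than those two.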
