Issue 7890: equal unicode/str objects can have unequal hash

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52138

classification

Title:	equal unicode/str objects can have unequal hash
Type:	behavior	Stage:
Components:	Interpreter Core	Versions:	Python 2.7, Python 2.6, Python 2.5

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	ldeller, lemburg, loewis
Priority:	normal	Keywords:

Created on 2010-02-09 03:06 by ldeller, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg99084 - (view)	Author: lplatypus (ldeller) *	Date: 2010-02-09 03:06
The documentation for the hash() function says: "Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0)"

This can be violated when comparing a unicode object with its str equivalent. Here is an example:

    C:\>c:\Python27\python -S
    Python 2.7a3 (r27a3:78021, Feb 7 2010, 00:00:09) [MSC v.1500 32 bit (Intel)] on win32
    >>> import sys; sys.setdefaultencoding('utf-8')
    >>> unicodeobj = u'No\xebl'
    >>> strobj = str(unicodeobj)
    >>> unicodeobj == strobj
    True
    >>> hash(unicodeobj) == hash(strobj)
    False

The last response should be True, not False.

I tested this on Python 2.7a3/windows, 2.6.4/linux, 2.5.2/linux. The problem is not relevant to Python 3.0+.

Looking at unicodeobject.c:unicode_hash() and stringobject.c:string_hash(), I think that this problem would arise for "equal" objects strobj and unicodeobj when the unicode code points are not aligned with the encoded bytes, i.e. when:

    map(ord, unicodeobj) != map(ord, strobj)

This means that the problem never arises when sys.getdefaultencoding() is 'ascii' or 'iso8859-1'/'latin1'.
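
The condition described above can be illustrated without a Python 2 interpreter. The following is a minimal sketch written for Python 3. The function py2_style_hash is a hypothetical, simplified model of a hash computed over integer code units (it is an assumption for illustration, not a copy of unicode_hash() or string_hash(), and the 32-bit mask is likewise assumed). It shows that the Latin-1 bytes of the example string are the same sequence as its code points, while the UTF-8 bytes are not, so any hash computed over those units can differ.

```python
def py2_style_hash(units):
    """Toy model (an assumption) of a multiplicative hash over integer code units."""
    units = list(units)
    if not units:
        return 0
    mask = 0xFFFFFFFF  # assumed: model a 32-bit build
    x = (units[0] << 7) & mask
    for unit in units:
        x = ((1000003 * x) ^ unit) & mask
    return x ^ len(units)


text = "No\xebl"

# The units that each hypothetical hash function would see.
code_points = [ord(ch) for ch in text]       # what a unicode hash iterates over
utf8_units = list(text.encode("utf-8"))      # str bytes under a 'utf-8' default
latin1_units = list(text.encode("latin-1"))  # str bytes under a 'latin-1' default

# Latin-1 bytes line up with the code points one to one; UTF-8 bytes do not.
same_units_latin1 = code_points == latin1_units
same_units_utf8 = code_points == utf8_units

# Identical unit sequences give identical hashes; different ones need not.
same_hash_latin1 = py2_style_hash(code_points) == py2_style_hash(latin1_units)
same_hash_utf8 = py2_style_hash(code_points) == py2_style_hash(utf8_units)

print(same_units_latin1, same_units_utf8, same_hash_latin1, same_hash_utf8)
```

For this particular string the sketch reports matching units and hashes for Latin-1 and mismatching ones for UTF-8, which is consistent with the observation that 'ascii' and 'latin1' defaults do not trigger the problem.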
msg99086 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-02-09 03:22
This is not a bug in Python, but in your code. sys.setdefaultencoding is only supported when setting the default encoding to either latin-1, or ascii, or 'undefined'. Setting it to any other value will have undesirable consequences like the one you report. Likewise, changing it after any Unicode objects have been created is not supported, either.
msg99102 - (view)	Author: lplatypus (ldeller) *	Date: 2010-02-09 10:38
Okay thanks, but in that case might I suggest that this limitation be mentioned in the documentation for sys.setdefaultencoding? It currently reads as if any available encoding is acceptable. Perhaps even a warning or exception should be produced when calling it wrongly?

Other places that may need review include:
- the programming FAQ on python.org which presents the option of calling setdefaultencoding('mbcs') on windows ( http://www.python.org/doc/faq/programming/#what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean )
- the comments in site.py which provoke changing the default encoding
- PEP100 which suggests enabling this code in site.py

BTW would patches ever be considered to fix issues such as this with using other encodings as default encodings, or is there some objection to the concept?
msg99104 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-09 11:14
lplatypus wrote:
>
> lplatypus <luke@deller.id.au> added the comment:
>
> Okay thanks, but in that case might I suggest that this limitation be mentioned in the documentation for sys.setdefaultencoding? It currently reads as if any available encoding is acceptable. Perhaps even a warning or exception should be produced when calling it wrongly?
>
> Other places that may need review include:
> - the programming FAQ on python.org which presents the option of calling setdefaultencoding('mbcs') on windows ( http://www.python.org/doc/faq/programming/#what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean )
> - the comments in site.py which provoke changing the default encoding
> - PEP100 which suggests enabling this code in site.py
>
> BTW would patches ever be considered to fix issues such as this with using other encodings as default encodings, or is there some objection to the concept?

No, Python 2.x's Unicode implementation only supports ASCII as the default encoding. In Python 3.x, UTF-8 is used as the default encoding.

Note that this limitation only affects cases where you mix string and Unicode objects used as keys in a dictionary. If you avoid this situation, there are no dictionary problems with using a different default encoding. However, you may run into other problems.
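
The dictionary effect mentioned above can be sketched in Python 3 with a stand-in class, since Python 3 never treats bytes and str as equal. Everything here is hypothetical: the class CodeUnitKey, its toy hash formula, and the variable names are assumptions made for illustration, not Python 2 internals. The point is only that two keys which compare equal but hash over different code units are not found under each other in a dict.

```python
class CodeUnitKey:
    """Hypothetical stand-in: equality compares text, hashing uses raw code units."""

    def __init__(self, units, text):
        self.units = tuple(units)  # the units a hash function would iterate over
        self.text = text           # what the equality comparison looks at

    def __eq__(self, other):
        return isinstance(other, CodeUnitKey) and self.text == other.text

    def __hash__(self):
        # Toy hash (an assumption): depends only on the units, not on the text.
        return sum(self.units) * 31 + len(self.units)


text = "No\xebl"
unicode_key = CodeUnitKey((ord(ch) for ch in text), text)  # models the unicode object
utf8_key = CodeUnitKey(text.encode("utf-8"), text)         # models str, 'utf-8' default
latin1_key = CodeUnitKey(text.encode("latin-1"), text)     # models str, 'latin-1' default

table = {unicode_key: "found"}

keys_equal = unicode_key == utf8_key        # equal by comparison
utf8_lookup = table.get(utf8_key)           # but the lookup misses
latin1_lookup = table.get(latin1_key)       # same units, so the lookup hits

print(keys_equal, utf8_lookup, latin1_lookup)
```

In this model the UTF-8 key compares equal to the stored key yet is not found, while the Latin-1 key is found, which matches the advice to avoid mixing string and Unicode objects as dictionary keys.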

History
Date	User	Action	Args
2022-04-11 14:56:57	admin	set	github: 52138
2010-02-09 11:14:52	lemburg	set	nosy: + lemburg messages: + msg99104
2010-02-09 10:38:11	ldeller	set	messages: + msg99102
2010-02-09 03:22:32	loewis	set	status: open -> closed nosy: + loewis messages: + msg99086 resolution: wont fix
2010-02-09 03:06:07	ldeller	create