classification
Title: equal unicode/str objects can have unequal hash
Type: behavior
Stage:
Components: Interpreter Core
Versions: Python 2.7, Python 2.6, Python 2.5

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: ldeller, lemburg, loewis
Priority: normal
Keywords:

Created on 2010-02-09 03:06 by ldeller, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg99084 - (view) Author: lplatypus (ldeller) * Date: 2010-02-09 03:06
The documentation for the hash() function says:
"Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0)"

This can be violated when comparing a unicode object with its str equivalent.  Here is an example:

C:\>c:\Python27\python -S
Python 2.7a3 (r27a3:78021, Feb  7 2010, 00:00:09) [MSC v.1500 32 bit (Intel)] on win32
>>> import sys; sys.setdefaultencoding('utf-8')
>>> unicodeobj = u'No\xebl'
>>> strobj = str(unicodeobj)
>>> unicodeobj == strobj
True
>>> hash(unicodeobj) == hash(strobj)
False

The last result should be True, not False.

I tested this on Python 2.7a3/windows, 2.6.4/linux, 2.5.2/linux.  The problem is not relevant to Python 3.0+.

Looking at unicodeobject.c:unicode_hash() and stringobject.c:string_hash(), I think that this problem would arise for "equal" objects strobj and unicodeobj when the unicode code points are not aligned with the encoded bytes, i.e. when:
    map(ord, unicodeobj) != map(ord, strobj)
This means that the problem never arises when sys.getdefaultencoding() is 'ascii' or 'iso8859-1'/'latin1'.
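
The alignment condition above can be checked in a few lines. This is a minimal Python 3 sketch, not code from the report: Python 3 has no sys.setdefaultencoding, so the encoding is passed explicitly, and the helper name `aligned` is hypothetical.

```python
def aligned(text, encoding):
    """Return True when every encoded byte equals the matching code point.

    Hypothetical helper: mirrors the check
    map(ord, unicodeobj) == map(ord, strobj) from the report.
    """
    code_points = [ord(c) for c in text]
    encoded_bytes = list(text.encode(encoding))  # bytes iterate as ints
    return encoded_bytes == code_points


text = "No\xebl"  # same characters as u'No\xebl' in the report

utf8_aligned = aligned(text, "utf-8")      # 'ë' becomes two bytes
latin1_aligned = aligned(text, "latin-1")  # 'ë' stays one byte, 0xEB
ascii_aligned = aligned("Noel", "ascii")   # pure ASCII text
```

For 'No\xebl' the UTF-8 form has five bytes against four code points, so the two sequences differ; with latin-1 they are identical, which matches the observation that the problem does not arise there.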
msg99086 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-02-09 03:22
This is not a bug in Python, but in your code. sys.setdefaultencoding is only supported when setting the default encoding to either latin-1, or ascii, or 'undefined'. Setting it to any other value will have undesirable consequences like the one you report. Likewise, changing it after any Unicode objects have been created is not supported, either.
msg99102 - (view) Author: lplatypus (ldeller) * Date: 2010-02-09 10:38
Okay thanks, but in that case might I suggest that this limitation be mentioned in the documentation for sys.setdefaultencoding?  It currently reads as if any available encoding is acceptable. Perhaps even a warning or exception should be produced when calling it wrongly?
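
One way the suggested exception could look is sketched below. This is only an illustration in Python 3, not an existing or proposed API: the function name `check_default_encoding` and the `SUPPORTED` set are hypothetical, and the set simply mirrors the three values named in msg99086.

```python
import codecs

# Assumption: the encodings msg99086 describes as supported, written
# as the normalized names that codecs.lookup() reports.
SUPPORTED = {"ascii", "iso8859-1", "undefined"}


def check_default_encoding(name):
    """Return the normalized codec name, or raise ValueError if unsupported.

    Hypothetical guard; it validates a name but changes no interpreter state.
    """
    normalized = codecs.lookup(name).name  # LookupError for unknown codecs
    if normalized not in SUPPORTED:
        raise ValueError(
            "unsupported default encoding %r (normalized: %r)" % (name, normalized)
        )
    return normalized


latin1_name = check_default_encoding("latin-1")
ascii_name = check_default_encoding("ASCII")
try:
    check_default_encoding("utf-8")
    utf8_rejected = False
except ValueError:
    utf8_rejected = True
```

Normalizing through codecs.lookup() means aliases such as 'latin1' and 'iso8859-1' are treated as the same encoding.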

Other places that may need review include:
- the programming FAQ on python.org which presents the option of calling setdefaultencoding('mbcs') on windows ( http://www.python.org/doc/faq/programming/#what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean )
- the comments in site.py which encourage changing the default encoding
- PEP100 which suggests enabling this code in site.py

BTW would patches ever be considered to fix issues such as this with using other encodings as default encodings, or is there some objection to the concept?
msg99104 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-09 11:14
No, Python 2.x's Unicode implementation only supports ASCII as the default
encoding. In Python 3.x, UTF-8 is used as the default encoding.

Note that this limitation only affects cases where you mix string
and Unicode objects used as keys in a dictionary. If you avoid
this situation, there are no dictionary problems with using a
different default encoding. However, you may run into other problems.
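
The dictionary problem can be reproduced without Python 2. The following is a rough Python 3 sketch under stated assumptions: the `Key` class is a hypothetical stand-in for a str/unicode pair that compares equal on its text but hashes its underlying units, and the sum-based hash is a deliberately simple toy, not the hash CPython uses.

```python
class Key:
    """Hypothetical stand-in for a Python 2 str or unicode object."""

    def __init__(self, text, units):
        self.text = text            # what equality compares
        self.units = tuple(units)   # what the hash is computed from

    def __eq__(self, other):
        return isinstance(other, Key) and self.text == other.text

    def __hash__(self):
        # Toy hash for illustration only: the sum of the stored units.
        return sum(self.units)


text = "No\xebl"
unicode_like = Key(text, (ord(c) for c in text))   # hashes code points
str_like = Key(text, text.encode("utf-8"))         # hashes UTF-8 bytes
latin1_like = Key(text, text.encode("latin-1"))    # bytes match code points

table = {unicode_like: "found"}

equal_keys = unicode_like == str_like
utf8_lookup = table.get(str_like, "missing")
latin1_lookup = table.get(latin1_like, "missing")
```

The two keys compare equal, yet the lookup with the UTF-8 based key misses because its hash differs, while the latin-1 based key finds the entry. This is the mixed-key situation the note above describes.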
History
Date                 User     Action  Args
2022-04-11 14:56:57  admin    set     github: 52138
2010-02-09 11:14:52  lemburg  set     nosy: + lemburg; messages: + msg99104
2010-02-09 10:38:11  ldeller  set     messages: + msg99102
2010-02-09 03:22:32  loewis   set     status: open -> closed; nosy: + loewis; messages: + msg99086; resolution: wont fix
2010-02-09 03:06:07  ldeller  create