Issue 2948: Unicode support for hashing algorithms


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/47197

classification

Title:	Unicode support for hashing algorithms
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 2.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	lemburg, loewis, rhettinger, vvro
Priority:	low	Keywords:

Created on 2008-05-23 03:22 by vvro, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg67214 - (view)	Author: Vasco Rodrigues (vvro)	Date: 2008-05-23 03:22
The hashing algorithms don't support Unicode. Any Unicode text given to them is first converted to ASCII and then hashed, but not all strings are convertible to ASCII. Now that Unicode is becoming the default encoding, especially on the web side of Python, where a lot of these hashing algorithms are used, there should be some kind of Unicode support in them. Example:
    from hashlib import md5
    md5(u'joão')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
msg67215 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2008-05-23 03:38
I don't think this is the right thing to do. The hash algorithms are defined in terms of bytes, but Unicode is abstracted from any byte-level encoding. It doesn't make sense to convert using an arbitrary encoding (such as UTF-8) because someone else might hash the same text using a different encoding. Marc, do you concur?
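As an illustration of the point above, here is a minimal sketch showing that the same text yields different digests under different encodings. It is written for Python 3 (not the 2.x series discussed in this issue), and the three encodings are arbitrary choices for the example.

```python
import hashlib

text = "joão"

# Hash functions consume bytes, so the text has to be encoded first.
# Each encoding produces a different byte sequence for the same text.
utf8_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
utf16_digest = hashlib.md5(text.encode("utf-16-le")).hexdigest()
latin1_digest = hashlib.md5(text.encode("latin-1")).hexdigest()

# Collect the digests in a set to count how many distinct values we got.
digests = {utf8_digest, utf16_digest, latin1_digest}
print(len(digests))  # one distinct digest per encoding
```

Two parties that hash "the same text" only agree if they also agree on the encoding.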
msg67216 - (view)	Author: Vasco Rodrigues (vvro)	Date: 2008-05-23 04:19
You could just check for unicode strings and do the encode inside the hash function. I understand the byte abstraction, but if you encode a unicode string containing only ASCII characters, you get the same bytes as in ASCII, so the result will be the same. So I have to do md5(u'joão'.encode("utf-8"))? Wasn't unicode becoming the default? If I do md5(u'john'), it works, and that's a unicode string. It should have told me: no unicode, then...
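The explicit-encode workaround mentioned above might look like the following sketch, written for Python 3. The helper name `md5_hex` and its UTF-8 default are assumptions made for illustration and are not part of hashlib.

```python
import hashlib


def md5_hex(text, encoding="utf-8"):
    """Hypothetical helper: encode explicitly, then hash the bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


# ASCII-only text encodes to the same bytes in ASCII and UTF-8,
# which is why the digest does not depend on the choice between them.
ascii_same = md5_hex("john", "ascii") == md5_hex("john", "utf-8")

# Non-ASCII text needs an encoding that can represent it.
joao_digest = md5_hex("joão")

# Python 3 refuses text outright instead of guessing an encoding.
try:
    hashlib.md5("john")
    str_accepted = True
except TypeError:
    str_accepted = False

print(ascii_same, str_accepted)
```

In Python 3 the behaviour asked for at the end of the message is what happens: passing text rather than bytes raises an error even for ASCII-only strings.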
msg67218 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2008-05-23 05:49
Only 2.6 should be marked. This is a feature request for an implicit conversion with a default encoding; it is not a bugfix. FWIW, here's a reference to an earlier discussion: http://mail.python.org/pipermail/python-list/2004-April/258630.html Also, it is unpersuasive that md5() works with u'john'. That is just an artifact of the 2.x series. In 3.0, there is a clearer distinction between bytes and text. Recommending that this be rejected and closed. Without a universally accepted (not just within Python) implicit encoding, there's no way to check that MD5 checksums match for the same unicode text.
msg67219 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-05-23 05:59
I'm rejecting this idea, for the reasons already given by others: the same string might have different hash values, depending on which encoding is chosen. Users will have to be explicit when hashing, just as they need to be explicit when they choose a hash algorithm (i.e. md5, sha1, or sha256 - they all do the same thing, but still produce different output). If you want a hash algorithm that abstracts from these details, use the builtin hash function:
    py> hash(u'joão')
    679553179
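A short sketch of the comparison above, written for Python 3: the three named algorithms accept the same bytes but produce digests of different sizes, while the builtin hash() takes text directly. One caveat that applies to Python 3 rather than to the 2.x session shown: str hashes are randomized per process by default, so hash() values are only comparable within a single run.

```python
import hashlib

data = "joão".encode("utf-8")

# Same input bytes, different algorithms, different digest sizes (in bytes).
sizes = {
    name: hashlib.new(name, data).digest_size
    for name in ("md5", "sha1", "sha256")
}
print(sizes)  # {'md5': 16, 'sha1': 20, 'sha256': 32}

# The builtin hash() works on text without any encoding step,
# but it is only guaranteed to be consistent inside one process.
stable_in_process = hash("joão") == hash("joão")
print(stable_in_process)  # True
```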
msg67222 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-05-23 08:32
On 2008-05-23 05:38, Raymond Hettinger wrote:
> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
>
> I don't think this is the right thing to do. The hash algorithms are
> defined in terms of bytes, but Unicode is abstracted from any byte
> level encoding. It doesn't make sense to convert using an arbitrary
> encoding (such as UTF-8) because someone else might hash the same text
> using a different encoding.
>
> Marc, do you concur?
Yes. While we could fix an encoding to use for converting Unicode to bytes, e.g. UTF-8, you clearly want hash functions to be portable across platforms, programming languages and implementations. Other languages or implementations might choose UTF-16 or some other encoding, so it's not clear which encoding to choose and there doesn't seem to be a standard for this either. -1 on the idea. Martin already closed and rejected the idea for me.
Thanks,
--
Marc-Andre Lemburg
eGenix.com
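The portability concern can be sketched with a published test vector, assuming Python 3. The MD5 test suite in RFC 1321 gives the digest of the three bytes "abc"; an implementation that encodes text as UTF-8 reproduces it, while one that picks UTF-16 does not. The helper name `md5_hex` is hypothetical.

```python
import hashlib

# MD5 test vector for the bytes b"abc" from the RFC 1321 test suite.
RFC1321_ABC = "900150983cd24fb0d6963f7d28e17f72"


def md5_hex(text, encoding):
    """Hypothetical helper: hash text under an explicitly named encoding."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


# UTF-8 encodes "abc" as the same three bytes the test vector uses.
matches_utf8 = md5_hex("abc", "utf-8") == RFC1321_ABC

# UTF-16 uses two bytes per character here, so the digest differs.
matches_utf16 = md5_hex("abc", "utf-16-le") == RFC1321_ABC

print(matches_utf8, matches_utf16)  # True False
```

Because the encoding is part of what gets hashed, it has to be agreed on by every party that compares digests, which is why it is left to the caller.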

History
Date	User	Action	Args
2022-04-11 14:56:34	admin	set	github: 47197
2008-05-23 08:32:50	lemburg	set	messages: + msg67222
2008-05-23 05:59:15	loewis	set	status: open -> closed nosy: + loewis resolution: not a bug messages: + msg67219
2008-05-23 05:50:34	rhettinger	set	priority: low messages: + msg67218 versions: - Python 2.5, Python 2.4
2008-05-23 04:45:22	vvro	set	versions: + Python 2.5, Python 2.4
2008-05-23 04:19:45	vvro	set	messages: + msg67216
2008-05-23 03:38:28	rhettinger	set	assignee: lemburg type: enhancement messages: + msg67215 nosy: + rhettinger, lemburg versions: + Python 2.6, - Python 2.5
2008-05-23 03:22:49	vvro	create